| William Bennett - Lead Data Engineer |
| [email protected] |
| Location: Dallas, Texas, USA |
| Relocation: |
| Visa: USC |
| Resume file: William Lead Data Engineer_1777313737966.docx |
|
William Bennett
Email: [email protected]

PROFESSIONAL SUMMARY
Lead Data Engineer with 12+ years of progressive experience, beginning in Python development and evolving into enterprise-scale data architecture and leadership roles across the healthcare, e-commerce, retail, and telecom domains in the US. Proven expertise in designing and optimizing large-scale batch and real-time data pipelines, building cloud-native data platforms, and enabling data-driven decision-making. Strong background in Python, SQL, distributed systems, and modern data stack technologies including Spark, Airflow, Kafka, and Snowflake. Experienced in leading cross-functional teams, implementing data governance, and integrating AI/ML workflows into production systems. Adept at translating business requirements into scalable data solutions while ensuring data quality, compliance (HIPAA), and performance optimization.

TECHNICAL SKILLS
Programming & Scripting: Python, SQL, Shell Scripting
Data Engineering & Processing: Pandas, NumPy, Apache Spark, PySpark, Hive, Azure Databricks, Azure Data Factory (ADF)
ETL & Orchestration: Airflow, Talend, Informatica, Cron Jobs
Databases & Warehousing: PostgreSQL, MySQL, Amazon Redshift, Snowflake, Azure Synapse Analytics
Big Data & Streaming: Apache Kafka, Hadoop Ecosystem (HDFS, MapReduce), Azure Event Hubs
Cloud Platforms & Storage: AWS (S3, EC2, Lambda, EMR, Glue, IAM, CloudWatch), Azure Blob Storage, Azure Data Lake Storage (ADLS Gen1/Gen2)
Data Modeling & BI: Dimensional Modeling, Star Schema, Snowflake Schema, Tableau, Power BI
Security & Governance: Azure Active Directory (AAD), RBAC, Azure Key Vault
DevOps & Tools: Docker, Git, CI/CD Pipelines
AI/ML & Advanced Tools: Scikit-learn, TensorFlow, Basic LLM Integrations

PROFESSIONAL EXPERIENCE

Lead Data Engineer / Data Architect
T-Mobile (Frisco, TX) | Feb 2020 - Present
Responsibilities:
- Led the end-to-end design and implementation of enterprise-scale, cloud-native data platforms capable of processing and analyzing billions of telecom events daily, leveraging distributed processing frameworks such as Apache Spark, real-time ingestion through Apache Kafka, and scalable warehousing solutions such as Snowflake, ensuring high availability, fault tolerance, and performance optimization across mission-critical systems.
- Architected and deployed real-time streaming data pipelines using Apache Kafka and Spark Structured Streaming to process high-volume call detail records (CDRs), network logs, and subscriber activity data, enabling near real-time monitoring, fraud detection, and network performance analytics for business-critical telecom operations (a simplified sketch follows this role).
- Designed and implemented a modern data stack architecture by integrating Apache Airflow for orchestration, dbt for transformation and modeling, and Snowflake for analytics, creating a scalable, modular, and maintainable ecosystem that significantly improved data accessibility and reduced time-to-insight for analytics teams.
- Built and optimized cloud-native data solutions on AWS, utilizing services such as S3 for data lake storage, Glue for ETL processing, Lambda for serverless transformations, and EMR for large-scale distributed workloads, ensuring seamless scalability, cost-efficiency, and operational resilience.
- Established comprehensive data governance frameworks that enforced data quality, lineage tracking, cataloging, and regulatory compliance, implementing best practices around metadata management and ensuring adherence to data security and privacy standards across the organization.
- Led and mentored a team of data engineers by providing technical direction, conducting in-depth code reviews, establishing engineering standards, and fostering a culture of continuous learning and innovation, resulting in improved team productivity and higher-quality data solutions.
- Implemented cloud cost optimization strategies by analyzing usage patterns, optimizing storage formats (Parquet/ORC), tuning compute workloads, and introducing auto-scaling mechanisms, leading to significant reductions in infrastructure costs without compromising performance.
- Developed advanced AI-powered analytics pipelines by integrating machine learning models built with Scikit-learn and TensorFlow into data workflows, enabling use cases such as predictive maintenance of telecom infrastructure, customer churn prediction, and intelligent anomaly detection.
- Enabled near real-time analytics capabilities by designing low-latency data processing architectures that allowed business teams to monitor network performance metrics, customer usage patterns, and service quality indicators with minimal delay.
- Designed and implemented scalable data lakehouse architectures that combined the flexibility of data lakes with the performance of data warehouses, ensuring efficient storage, governance, and querying of both structured and semi-structured telecom data.
- Collaborated extensively with cross-functional stakeholders including product managers, network engineers, data scientists, and business analysts to gather requirements, translate them into technical solutions, and deliver high-impact data products aligned with business objectives.
- Implemented advanced monitoring, logging, and alerting systems using AWS CloudWatch and custom dashboards to proactively identify pipeline failures, performance bottlenecks, and data inconsistencies, ensuring high reliability and uptime.
- Built and maintained CI/CD pipelines for data engineering workflows using modern DevOps practices, enabling automated testing, deployment, and version control of data pipelines, thereby improving release cycles and reducing manual errors.
- Integrated modern AI and data processing frameworks into existing data ecosystems to enhance intelligent data processing capabilities, including automated anomaly detection and pattern recognition within large telecom datasets.
- Led modernization efforts by migrating legacy Hadoop-based data pipelines and on-premise HDFS storage systems to cloud-native architectures on AWS and Snowflake, improving scalability, performance, and cost efficiency.
- Integrated historical datasets stored in Hadoop clusters into modern data lakehouse architectures, ensuring continuity of analytics and seamless access to legacy telecom data.
- Significantly improved query performance and cost efficiency in Snowflake by optimizing data partitioning, clustering keys, and query execution plans, enabling faster analytics and reporting for large-scale datasets.
- Designed and developed secure, scalable APIs for data access using Python-based frameworks, allowing downstream applications and business users to interact with curated datasets in a controlled and efficient manner.
- Ensured robust security practices by implementing encryption (at-rest and in-transit), role-based access control (RBAC), and fine-grained data access policies, safeguarding sensitive telecom and customer data.
- Led large-scale migration initiatives to modernize legacy on-premise data systems into cloud-based architectures, ensuring minimal downtime and a seamless transition while improving system scalability and maintainability.
- Built reusable, modular data engineering frameworks and libraries to standardize ingestion, transformation, and validation processes, reducing development effort and ensuring consistency across projects.
- Supported enterprise-wide analytics initiatives by delivering high-quality, reliable datasets that empowered business intelligence, customer analytics, and operational decision-making across multiple departments.
- Implemented enterprise-grade metadata management and data catalog solutions to improve data discoverability, lineage tracking, and governance across complex data ecosystems.
- Designed fault-tolerant systems with robust retry mechanisms, checkpointing, and failover strategies to ensure continuous data processing even in the face of infrastructure or system failures.
- Applied data mesh architectural principles to promote decentralized data ownership, enabling domain teams to manage and serve their own data products while maintaining centralized governance standards.
- Enabled self-service analytics by building curated data layers, semantic models, and user-friendly data access mechanisms, allowing business users to independently explore and analyze data without heavy engineering dependency.
- Drove continuous innovation by evaluating and adopting emerging technologies and tools in the data engineering and AI ecosystem, ensuring the organization remained at the forefront of modern data practices.
- Managed stakeholder expectations by effectively communicating technical solutions, timelines, and trade-offs, ensuring alignment between engineering efforts and business priorities.
- Played a strategic role in defining the long-term data platform roadmap, making key architectural decisions, and aligning data engineering initiatives with organizational growth and scalability goals.
- Ensured that all data engineering practices, tools, and architectures were aligned with business requirements, scalability needs, and industry best practices, delivering robust, future-proof data solutions.
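The streaming work in this role can be illustrated with a minimal PySpark Structured Streaming sketch. Everything specific here is assumed for illustration: the broker address, the cdr-events topic, the CDR schema fields, and the checkpoint path are hypothetical, and the job presumes the Spark-Kafka connector package is available on the cluster.

```python
# Illustrative sketch only: topic, schema, and path names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder.appName("cdr-streaming-sketch").getOrCreate()

# Hypothetical CDR schema for the JSON payload on the Kafka topic.
cdr_schema = StructType([
    StructField("subscriber_id", StringType()),
    StructField("cell_id", StringType()),
    StructField("call_duration_sec", DoubleType()),
    StructField("event_time", LongType()),  # epoch milliseconds
])

# Read the raw event stream from Kafka (broker and topic names are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "cdr-events")
       .load())

# Parse the JSON value and derive an event-time column for windowing.
cdrs = (raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), cdr_schema).alias("cdr"))
        .select("cdr.*")
        .withColumn("event_ts", (col("event_time") / 1000).cast("timestamp")))

# Aggregate per cell in 5-minute windows, the kind of near real-time
# network-performance rollup described in the bullets above.
per_cell = (cdrs
            .withWatermark("event_ts", "10 minutes")
            .groupBy(window(col("event_ts"), "5 minutes"), col("cell_id"))
            .agg({"call_duration_sec": "avg"}))

# Write results to a sink (console here; a warehouse or lakehouse table in practice).
query = (per_cell.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/cdr-sketch")
         .start())
query.awaitTermination()
```

In production such a job would typically write to Snowflake or a lakehouse table rather than the console sink shown here.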
Senior Data Engineer / Analyst
AT&T (Dallas, TX) | May 2018 - Jan 2020
Responsibilities:
- Architected and implemented highly scalable and fault-tolerant data pipelines using Apache Spark and PySpark to process massive volumes of retail transactional data, including point-of-sale systems, inventory feeds, and customer purchase histories, ensuring efficient distributed processing and enabling near real-time analytics for business-critical operations.
- Designed, built, and maintained centralized data lake architectures on AWS S3, enabling seamless ingestion and storage of both structured and semi-structured retail datasets, while implementing efficient data partitioning, lifecycle policies, and storage optimization strategies to improve performance and reduce costs.
- Implemented robust real-time data streaming pipelines using Apache Kafka to capture high-velocity retail events such as customer interactions, order placements, and inventory updates, allowing downstream systems to react quickly to changes in demand and improve operational responsiveness.
- Developed and orchestrated complex ETL workflows using Apache Airflow, designing multi-stage DAGs that handled dependencies, retries, and scheduling, thereby ensuring reliable and automated execution of large-scale data pipelines across multiple environments.
- Built and enhanced advanced data models based on dimensional modeling techniques, including star and snowflake schemas, to support a wide range of business intelligence and analytics use cases such as sales forecasting, customer segmentation, and inventory optimization.
- Optimized Spark-based data processing jobs by implementing advanced performance tuning techniques such as partitioning, bucketing, caching, and efficient memory management, significantly reducing job execution times and improving overall system efficiency (see the sketch after this role).
- Led the migration of legacy ETL workflows and on-premise data systems to modern, cloud-based architectures on AWS, carefully planning data migration strategies, ensuring minimal downtime, and enhancing system scalability and maintainability.
- Collaborated closely with data scientists to operationalize machine learning models by integrating them into production-grade data pipelines, enabling use cases such as demand forecasting, recommendation systems, and customer behavior analysis.
- Designed and implemented comprehensive data quality frameworks that included validation rules, anomaly detection, and data consistency checks, ensuring high data reliability and accuracy across all downstream analytics systems.
- Designed data ingestion frameworks to load structured and semi-structured retail data into Hadoop clusters, ensuring data consistency, fault tolerance, and scalability across distributed systems.
- Collaborated in transitioning legacy Hadoop MapReduce jobs to Apache Spark-based processing, improving processing speed, developer productivity, and overall pipeline efficiency.
- Built reusable and modular data ingestion frameworks using Python and Spark, standardizing data processing patterns and significantly reducing development effort for new data sources and pipelines.
- Improved scalability and system performance by leveraging distributed computing principles and optimizing resource allocation across Spark clusters, ensuring efficient handling of growing data volumes.
- Mentored junior data engineers by conducting code reviews, sharing best practices, and guiding them through complex data engineering challenges, fostering a collaborative and high-performing engineering culture.
- Integrated multiple third-party retail data sources, including supplier feeds and external market data, into the organization's data ecosystem, ensuring seamless data flow and consistency across systems.
- Implemented CI/CD pipelines for data engineering workflows using modern DevOps practices, enabling automated testing, deployment, and version control of data pipelines.
- Worked closely with business stakeholders and analysts to understand evolving data requirements and translated them into scalable and efficient technical solutions that aligned with business goals.
- Developed interactive dashboards and reports using business intelligence tools to provide insights into sales performance, customer behavior, and operational efficiency.
- Ensured compliance with data governance, security, and privacy standards by implementing role-based access control and data protection mechanisms across data platforms.
- Reduced data processing latency and improved throughput by continuously monitoring pipeline performance and applying optimization strategies across different stages of data processing.
- Enhanced monitoring and alerting mechanisms by integrating logging frameworks and cloud monitoring tools, enabling proactive detection and resolution of pipeline issues.
- Played a key role in defining and evolving the organization's data architecture, contributing to long-term scalability, flexibility, and alignment with modern data engineering best practices.
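The Spark performance-tuning bullet above can be made concrete with a small batch sketch. It is illustrative only: the S3 paths, column names, shuffle-partition count, and join strategy are assumptions, not details from the actual pipelines.

```python
# Illustrative sketch only: paths, column names, and tuning values are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("retail-batch-tuning-sketch")
         .config("spark.sql.shuffle.partitions", "400")  # sized to cluster and data volume
         .getOrCreate())

# Large fact data: point-of-sale transactions stored as Parquet on S3.
sales = spark.read.parquet("s3://example-bucket/pos/transactions/")

# Small dimension: product reference data.
products = spark.read.parquet("s3://example-bucket/dim/products/")

# Broadcast the small dimension to avoid shuffling the large fact table.
enriched = sales.join(broadcast(products), on="product_id", how="left")

# Cache the enriched frame because several aggregations reuse it downstream.
enriched.cache()

daily_sales = (enriched
               .groupBy("store_id", "sale_date")
               .sum("sale_amount"))

# Write partitioned by date so downstream queries can prune partitions.
(daily_sales.write
 .mode("overwrite")
 .partitionBy("sale_date")
 .parquet("s3://example-bucket/curated/daily_sales/"))
```

Broadcasting the small dimension and partitioning output by date are the kinds of choices the tuning bullets refer to; actual values depend on cluster size and data volume.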
Data Engineer / Analyst
Primary Arms, LLC (Houston, TX) | Oct 2016 - Dec 2018
Responsibilities:
- Designed scalable ETL pipelines to process large volumes of transactional e-commerce data, enabling real-time and batch analytics for customer behavior insights.
- Implemented Apache Airflow for workflow orchestration, replacing manual scheduling and significantly improving pipeline reliability and visibility (see the sketch after this role).
- Built data pipelines on AWS using S3 for storage and Redshift for warehousing, enabling efficient querying of large datasets.
- Developed complex SQL queries and transformations to support reporting dashboards used by business stakeholders.
- Created dimensional data models to support analytics use cases such as customer segmentation, sales trends, and inventory optimization.
- Integrated multiple data sources including APIs, relational databases, and clickstream logs into a unified data platform.
- Leveraged Spark (early adoption phase) for distributed data processing, improving performance for large-scale transformations.
- Automated data ingestion processes using Python scripts and Airflow DAGs to ensure timely availability of analytics data.
- Integrated Hadoop ecosystem tools with AWS infrastructure by transferring large datasets between on-premise HDFS clusters and Amazon S3, ensuring hybrid data architecture compatibility.
- Designed and implemented scalable data ingestion pipelines using the Hadoop Distributed File System (HDFS) to store and process large volumes of clickstream and transactional e-commerce data, enabling efficient batch analytics and long-term storage of raw datasets.
- Collaborated with product and marketing teams to deliver actionable insights through data-driven reporting.
- Built dashboards using Tableau to visualize key business metrics and trends.
- Ensured data quality by implementing validation checks and anomaly detection mechanisms within ETL pipelines.
- Optimized Redshift queries using distribution keys and sort keys to improve performance.
- Worked on A/B testing data pipelines to analyze user engagement and conversion metrics.
- Implemented logging and alerting mechanisms using AWS CloudWatch for pipeline monitoring.
- Assisted in migrating on-premise data systems to AWS cloud infrastructure.
- Maintained documentation for data pipelines, schemas, and workflows.
- Collaborated with data scientists to provide clean datasets for machine learning models.
- Improved pipeline performance through parallel processing and efficient resource utilization.
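The Airflow orchestration described in this role can be sketched as a minimal two-task DAG. All identifiers are hypothetical (the DAG id, bucket, table, and task names), the callables are placeholders rather than real extract/load logic, and the sketch assumes Airflow 2.x.

```python
# Illustrative sketch only: DAG id, bucket, table, and connection details are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    """Pull the day's order extract from the source system to S3 (placeholder)."""
    # A real pipeline would call an API or database here and upload the result to S3.
    print(f"extracting orders for {context['ds']}")


def load_to_redshift(**context):
    """Issue a COPY into the Redshift staging table (placeholder)."""
    print(f"loading s3://example-bucket/orders/{context['ds']}/ into staging.orders")


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="orders_daily_etl_sketch",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Simple two-stage dependency: extract first, then load.
    extract >> load
```

A real DAG would replace the print statements with S3 and Redshift operations and add data-quality checks between the extract and load stages.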
Python Developer / ETL Developer
HCA Healthcare (Dallas, TX) | Oct 2015 - Dec 2016
Responsibilities:
- Developed robust Python-based data processing scripts using Pandas and NumPy to transform raw patient and claims datasets into structured formats aligned with healthcare analytics requirements while ensuring compliance with HIPAA data handling standards.
- Built RESTful services using Flask to expose healthcare data endpoints for internal analytics teams, enabling seamless integration between clinical data systems and reporting tools.
- Designed and implemented ETL pipelines using Python and SQL to extract data from relational databases such as MySQL and PostgreSQL, transform it based on business logic, and load it into staging environments for downstream reporting.
- Automated ingestion of CSV and flat-file medical datasets using Python scripts scheduled via cron jobs, significantly reducing manual intervention and improving data availability timelines.
- Worked closely with healthcare analysts to understand claims processing workflows and translated them into scalable ETL jobs that ensured accurate data aggregation for insurance reporting.
- Optimized SQL queries involving large datasets by implementing indexing strategies and efficient joins, improving query performance for patient record lookups.
- Developed data validation scripts to ensure consistency and completeness of sensitive healthcare datasets before loading into reporting systems (see the sketch after this role).
- Integrated third-party healthcare APIs to extract patient data securely and transform it into normalized schemas for internal usage.
- Created logging and monitoring mechanisms within Python scripts to track ETL job performance and failures, enabling faster debugging and recovery.
- Collaborated with cross-functional teams to support data migration activities from legacy systems into modern relational databases.
- Assisted in designing database schemas for storing structured healthcare data, ensuring alignment with reporting and compliance requirements.
- Built reusable Python modules for common ETL operations, improving development efficiency across the team.
- Performed exploratory data analysis using Pandas to identify anomalies in patient and billing data, contributing to improved data quality.
- Supported deployment of ETL jobs in Linux environments and ensured smooth execution through shell scripting and environment configuration.
- Documented data workflows, ETL processes, and system dependencies to maintain transparency and ease of onboarding for new team members.
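The Pandas-based validation work in this role can be illustrated with a short, self-contained sketch. Column names, file paths, and validation rules are assumptions for illustration; real claims data would carry far more fields and HIPAA-driven handling requirements.

```python
# Illustrative sketch only: file paths, column names, and rules are hypothetical.
import pandas as pd

REQUIRED_COLUMNS = ["claim_id", "member_id", "service_date", "billed_amount"]


def load_claims(path: str) -> pd.DataFrame:
    """Read a raw claims extract and normalize basic types."""
    df = pd.read_csv(path, dtype={"claim_id": str, "member_id": str})
    df["service_date"] = pd.to_datetime(df["service_date"], errors="coerce")
    df["billed_amount"] = pd.to_numeric(df["billed_amount"], errors="coerce")
    return df


def validate_claims(df: pd.DataFrame) -> pd.DataFrame:
    """Apply completeness and consistency checks before loading to staging."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # Flag rows that fail basic validation rather than silently dropping them.
    df = df.copy()
    df["is_valid"] = (
        df["claim_id"].notna()
        & df["service_date"].notna()
        & df["billed_amount"].ge(0)
    )
    return df


if __name__ == "__main__":
    claims = load_claims("claims_extract.csv")  # placeholder file name
    checked = validate_claims(claims)
    print(f"{int(checked['is_valid'].sum())} of {len(checked)} rows passed validation")
    # Valid rows would be written to a staging table; invalid rows routed for review.
```

Flagging invalid rows instead of dropping them mirrors the validate-before-load pattern described in the bullets above.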