
Lahari A
Sr. Data Engineer
[email protected] | +1 470-369-5278
Location: Dallas, Texas, USA | Relocation: Yes | Visa: GC

Over 11 years of expertise in IT, specializing in cloud services, big data analytics, and data engineering. Proficient in using GCP, AWS, and Hadoop ecosystems for designing and implementing scalable data solutions. Skilled in programming languages including Python, Scala, and SQL, with a strong background in ETL processes, data modeling, and data visualization tools.
Data Engineer | GCP | AWS | Big Data Analytics | Python | ETL | Spark | MSSQL | Tableau
PROFESSIONAL SUMMARY
Expertise in designing and implementing serverless data processing workflows using GCP services, optimizing data pipelines for efficiency and cost-effectiveness.
Proven design, development, and deployment skills for projects using Google Cloud Platform tools including BigQuery, Dataflow, Dataproc, Google Cloud Storage, Cloud Composer, and Looker.
Experience in designing and implementing data solutions across AWS and GCP cloud platforms, showcasing versatility in multi-cloud environments.
Worked with GCP services to orchestrate ETL workflows, ensuring data is ingested, transformed, and loaded into data warehouses.
Hands-on experience with Spark Streaming to receive real-time data from Kafka (see the sketch following this summary).
Significant working expertise with the Cloudera distribution of Hadoop.
Designed, implemented, and maintained data pipelines that ingested high-velocity, real-time data streams from various sources into Google Pub/Sub.
Experience in cloud provisioning tools such as Terraform and CloudFormation.
Good experience with Amazon Web Services including S3, IAM, EC2, EMR, Kinesis, VPC, DynamoDB, Redshift, Amazon RDS, Lambda, Athena, Glue, DMS, QuickSight, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, and other services in the AWS family.
Knowledge of NoSQL databases such as HBase, MongoDB, Cassandra.
Proficient in various RDBMS including MySQL, PostgreSQL, Oracle, and Microsoft SQL Server, bringing expertise in database design and management to data engineering projects.
Expertise in Hadoop and its ecosystem tools such as HDFS, MapReduce, Hive, Pig, HBase, Kafka, YARN, Oozie, Sqoop, Impala, and Spark.
Proven ability to execute end-to-end data processing tasks for data analysis utilizing MapReduce, Spark, and Hive.
Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.
Experience with Snowflake Multi-Cluster Warehouses and Virtual Warehouses.
Experienced in processing large datasets of different forms including structured, semi-structured and unstructured data.
Hands-on experience in programming using Python, Scala, Java and SQL.
Experienced in Python data manipulation for loading and extraction, as well as with Python libraries such as Matplotlib, NumPy, SciPy, and pandas for data analysis.
Automated deployment pipelines using Jira, Git, and Jenkins, reducing deployment time and manual errors.
Worked on Dimensional and Relational Data Modelling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modelling.
Significant expertise with Spark Streaming, Spark SQL, and other Spark features including accumulators, broadcast variables, various levels of caching, and spark job optimization strategies.
Experience with DBT to prepare data for visualization and reporting, as well as familiarity with Snowflake's interaction with other data tools including Looker, Tableau, and Power BI.
Experience in handling various file formats such as Parquet, Avro, ORC, JSON, XML, CSV, Text, ASCII.
Adaptable to both Waterfall and Agile/Scrum methodologies, working effectively within traditional and iterative project frameworks for successful SDLC execution.
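A minimal sketch (Python) of the Kafka-to-Spark streaming pattern referenced above, using Spark Structured Streaming; the broker address, topic name, and output paths below are hypothetical placeholders, not details from this resume.

# Minimal Structured Streaming consumer for a Kafka topic (placeholder names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Read the raw Kafka stream; the value arrives as bytes and is cast to string.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "events-topic")                # placeholder topic
          .load()
          .select(col("value").cast("string").alias("payload")))

# Persist the stream as Parquet; a checkpoint location is required for streaming sinks.
query = (events.writeStream
         .format("parquet")
         .option("path", "/tmp/events/")              # placeholder sink path
         .option("checkpointLocation", "/tmp/ckpt/")
         .start())

query.awaitTermination()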
TECHNICAL SKILLS:

Category: Technologies and Tools
Cloud Environments: AWS S3, AWS Glue, Kinesis, EMR, Step Functions, Lambda Functions, IAM.
GCP: GCS, Dataproc, BigQuery, Pub/Sub, Dataflow, Cloud Composer.
Azure: HDInsight.
Big Data Technologies: Spark, Hadoop, Apache Kafka, Apache Cassandra, Apache HBase, Apache Hive, Apache Pig, Apache Storm, Apache Airflow, NiFi, Snowflake, Sqoop, HDFS, MapReduce, Flume, Oozie, Impala, YARN.
Languages: Python 2.7/3.6/3.8/3.10/3.12, SQL, PySpark, Scala, Unix Shell Scripting, Java, PL/SQL, Perl, T-SQL.
Relational Databases & Data Warehouses: Oracle 11g/10g, MySQL, Snowflake, SQL Server 2016/2014/2012/2008, Redshift, DB2, BigQuery, Teradata 15/14.
Data Visualization Tools: Power BI, Tableau, QlikView.
NoSQL Databases: Cassandra, MongoDB, DynamoDB, HBase.
Version Control Tools: GitHub, GitLab, Bitbucket.
Operating Systems: Ubuntu, Windows, Linux.
Messaging Systems: Kafka, SNS, Kinesis.
Methodologies: Agile/Scrum and traditional Waterfall.
Data Integration Tools: Informatica, Talend, Matillion.
Data Transformation and Modelling: DBT, Alteryx.

PROFESSIONAL EXPERIENCE
CENTENE CORP, St. Louis, Missouri May 2023 - Present
Role: Senior AWS Data Engineer
Key Responsibilities:

Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.
Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines.
Performed data extraction, transformation, loading, and integration in the data warehouse using Python and DBT.
Designed and developed ETL processes for data migration from external sources into AWS Redshift, integrating data from various sources and formats to support analytics and reporting needs.
Created and maintained data dictionaries, data model diagrams, design documents, architecture diagrams, deployment diagrams, and run books.
Performed administration tasks including instantiation, upgrades, configuration, maintenance, and security for AWS Redshift using tools such as the Amazon Redshift Console, AWS Management Console, and AWS CLI.
Successfully migrated all the environments from Dimensional Data Cloud Provider to AWS Cloud.
Designed, developed, and maintained AWS Redshift data models, SQL queries, procedures, functions, and transformations using DBT (Data Build Tool).
Leveraged subject matter expertise to design the data model and data dictionary.
Translated user requirements into modeling designs in AWS Redshift.
Produced technical designs for ETL mappings, BI dashboards, and AWS Redshift services.
Wrote Python scripts using libraries such as pandas, boto3, and pyodbc to automate data extraction, transformation, and loading (ETL) processes; stored data in S3 buckets and loaded it into AWS Redshift (see the sketch at the end of this section).
Used a centralized repository for version control to check in and check out jobs, maintaining version history and supporting multiuser development.
Designed the process to submit code into Git using SourceTree at daily intervals, reviewed code, and created/approved/merged pull requests; parsed/created/deployed builds using Jenkins and managed task workflow using Jira.
Environment: Apache Airflow, Python 3.12, DBT (Data Build Tool), AWS Redshift, Amazon Redshift Console, AWS Management Console, AWS CLI, S3, pandas, boto3, pyodbc, GIT, SourceTree, Jenkins, Jira, BI Dashboards, SQL, and AWS Cloud.
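A minimal sketch (Python) of the pandas/boto3/pyodbc ETL pattern described above: extract from a source database, stage the data in S3, and issue a Redshift COPY through the Redshift Data API. The DSN, bucket, table, IAM role, and cluster identifiers are hypothetical placeholders.

# Extract with pandas/pyodbc, stage in S3 with boto3, then COPY into Redshift.
import io
import boto3
import pandas as pd
import pyodbc

SRC_DSN = "DSN=source_db"                                   # placeholder ODBC DSN
BUCKET, KEY = "etl-staging-bucket", "extracts/orders.csv"   # placeholder S3 location

# Extract from the source system into a DataFrame.
with pyodbc.connect(SRC_DSN) as conn:
    df = pd.read_sql("SELECT * FROM orders", conn)          # placeholder query

# Stage the extract in S3 as CSV.
buf = io.StringIO()
df.to_csv(buf, index=False)
boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue())

# Load into Redshift via COPY, submitted through the Redshift Data API.
copy_sql = (
    f"COPY analytics.orders FROM 's3://{BUCKET}/{KEY}' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "  # placeholder role
    "FORMAT AS CSV IGNOREHEADER 1;"
)
boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster",                   # placeholder cluster
    Database="dev",
    DbUser="etl_user",
    Sql=copy_sql,
)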
ARVEST BANK, Bentonville, Arkansas Feb 2021 - May 2023
Role: GCP Data Engineer
Key Responsibilities:

Utilized Dataflow and Cloud Dataproc on GCP to compose MapReduce jobs in Java, Pig, and Python, facilitating efficient data processing and analysis.
Experience in setting up CI/CD pipelines with Cloud Run, automating the deployment process for efficient and rapid updates of applications.
Used Python to transfer the data from on-premises clusters to Google Cloud Platform.
Stored both raw and processed data in Google Cloud Storage (GCS), organizing a structured data lake to ensure accessibility and usability across various use cases.
Implemented data quality checks using Spark Streaming, flagging records as passable or bad.
Utilized Apache Kafka to manage real-time data streams for GCP services, integrating with Google Cloud Storage and BigQuery, and enabling efficient data processing and analytics with Spark and Dataflow.
Integrated GraphQL with existing data sources and services, including RESTful APIs, databases, and microservices, to orchestrate data access and composition.
Keen interest in keeping up with new technologies added to the Google Cloud Platform (GCP) stack.
Transferred data into GCP's storage and processing solutions like Cloud Storage and BigQuery, employing tools such as Cloud Dataflow for importing and exporting data.
Collaborated with cross-functional teams by sharing DBT code and documentation, ensuring alignment on data pipeline requirements and objectives.
Implemented robust security measures within GCP, including encryption, IAM policies, and audit logging, safeguarding sensitive financial data across Google Cloud Storage and BigQuery.
Worked on Databricks to write Scripts in PySpark, Python, and SQL & integrate Databricks with GCP.
Structured HBase tables within GCP to accommodate diverse data types originating from UNIX, NoSQL sources, and various portfolios, optimizing for structured, semi-structured, and unstructured data handling.
Involved in implementing error handling mechanisms and optimizing Python scripts for improved performance and efficiency in data processing workflows within GCP.
Familiarity with monitoring and logging tools within Cloud Run, ensuring visibility into application performance, error tracking, and resource utilization.
Implemented Neo4j to analyze interconnected data relationships, enhancing data insights beyond traditional relational databases.
Executed Hive queries within GCP, enabling market analysts to identify emerging trends by comparing new data with Enterprise Data Warehouse (EDW) reference tables and historical statistics.
Collaborated with BI teams to generate and distribute reports using Looker, providing valuable insights into business performance and trends.
Developed serverless functions on GCP using Python, utilizing Cloud Functions to execute small units of code triggered by various events for streamlined data processing (see the sketch at the end of this section).
Automated data loading into GCP's Hadoop Distributed File System (HDFS) using Oozie, optimizing with Pig for preprocessing data, allowing for quick reviews and competitive advantages.
Utilized Google Cloud Dataprep and Dataproc for transforming and transferring large data volumes within and across GCP services like Cloud Storage and Google Cloud Bigtable.
Generated reports for the Business Intelligence (BI) team by exporting data into GCP's storage solutions using tools like Cloud Dataflow and BigQuery.
Developed Java-based MapReduce jobs on GCP for data cleaning and preprocessing, ensuring efficient data transformation within Google Cloud's ecosystem.
Leveraged GCP services such as Google Compute Engine (GCE) and Google Cloud Storage for enhanced system scalability and reliability.
Simulated production environments on GCP, configuring YARN for resource management, user access controls, and secure Oozie workflows, ensuring a mirrored secured environment for testing and development purposes.
Environment: Dataflow, Cloud Dataproc, BigQuery, Google Cloud Storage, Compute Engine, Cloud Functions, Cloud Bigtable, Apache Spark, Apache Kafka, Databricks, Hive, Pig, PySpark, GCS, HBase, Cloud Dataflow, Cloud Pub/Sub, Data Fusion, IAM, Cloud Audit Logs, VPC, Python, Java, SQL, Oozie, Airflow, Spark Streaming, Looker.
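A minimal sketch (Python) of the event-driven Cloud Functions pattern described above: a function triggered by a GCS object-finalize event that loads the new file into BigQuery. The dataset and table names are hypothetical placeholders.

# Background Cloud Function (1st gen) triggered by google.storage.object.finalize.
from google.cloud import bigquery

def load_to_bigquery(event, context):
    """Loads the newly finalized GCS object into a BigQuery table."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                     # infer the schema for this sketch
        write_disposition="WRITE_APPEND",
    )

    # Kick off the load job and wait for completion.
    load_job = client.load_table_from_uri(
        uri, "analytics_dataset.raw_events", job_config=job_config)  # placeholder table
    load_job.result()
    print(f"Loaded {uri} into analytics_dataset.raw_events")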


FM Global Group, Johnston, Rhode Island March 2018 - Jan 2021
Role: Sr. AWS Data Engineer
Key Responsibilities:

Designed and set up an enterprise data lake to support various use cases, including storage, processing, analytics, and reporting of voluminous, rapidly changing data, using various AWS services.
Installed and configured Apache Airflow against AWS S3 buckets and created DAGs to run Airflow workflows.
Developed PySpark scripts utilizing Spark SQL and RDDs for data analysis, storing results back into S3.
Created multiple scripts to automate ETL/ELT processes from multiple sources using PySpark.
Extracted data from multiple source systems S3, Redshift, RDS and created multiple tables/databases in Glue Catalog by creating Glue Crawlers.
Created multiple Glue ETL jobs in Glue Studio and then processed the data by using different transformations and then loaded it into S3, Redshift and RDS.
Developed error handling mechanisms to detect and resolve inconsistencies or discrepancies during HL7 to FHIR mapping.
Created multiple recipes in Glue DataBrew and used them in various Glue ETL jobs.
Orchestrated and migrated CI/CD processes using CloudFormation, Terraform, and Packer templates, and containerized the infrastructure using Docker, set up across OpenShift, AWS, and VPCs.
Designed and developed ETL processes in AWS Glue to migrate data from external sources like S3 and Parquet/text files into AWS Redshift.
Skilled in crafting and optimizing complex search queries using Elasticsearch Query DSL to achieve fast and relevant search results.
Used the AWS Glue Catalog with crawlers to get the data from S3 and performed SQL query operations using AWS Athena.
Leveraged Cypher queries to traverse complex graph structures, identifying patterns and relationships crucial for strategic decision-making.
Wrote PySpark jobs in AWS Glue to merge data from multiple tables and utilized crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
Used AWS Glue for transformations and AWS Lambda to automate the process.
Created monitors, alarms, notifications, and logs for Lambda functions and Glue jobs using CloudWatch.
Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, and S3.
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Familiarity with managing Elasticsearch clusters in a distributed environment, including cluster provisioning, scaling, monitoring, and troubleshooting.
Extensively used Athena to run queries on processed data from Glue ETL jobs, then used QuickSight to generate reports for business intelligence.
Managed multiple Kubernetes clusters in a production environment.
Managed containers using Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.
Used DMS to migrate tables from homogeneous and heterogeneous DBs from on-premises to AWS Cloud.
Created Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics to capture and process streaming data, then output into S3, DynamoDB, and Redshift for storage and analysis.
Created Lambda functions to run AWS Glue jobs based on AWS S3 events (see the sketch at the end of this section).
Environment: S3, Redshift, RDS, Glue, Glue Studio, Glue DataBrew, Glue Catalog, Lambda, CloudFormation, Terraform, EMR, Athena, QuickSight, DMS, CloudWatch, VPC, Apache Airflow, Docker, OpenShift, Kubernetes, Elasticsearch, PySpark, SQL, HL7 to FHIR, Parquet/Text Files, Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Packer, AWS Data Lake.
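A minimal sketch (Python) of the Lambda-triggers-Glue pattern described above: a handler invoked by an S3 object event that starts a Glue job with the object location as an argument. The Glue job name and argument key are hypothetical placeholders.

# Lambda handler that starts a Glue job when an S3 object lands.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put/create events carry the bucket and key of the new object.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Pass the object location to the Glue job as a job argument.
    response = glue.start_job_run(
        JobName="curate-landing-data",                       # placeholder Glue job
        Arguments={"--source_path": f"s3://{bucket}/{key}"},
    )
    return {"JobRunId": response["JobRunId"]}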

STATE OF TEXAS, TX Aug 2015 - Feb 2018
Role: Big Data Engineer
Key Responsibilities:
Implemented solutions utilizing Advanced AWS Components: EMR, EC2, etc. integrated with Big Data/Hadoop Distribution Frameworks: Hadoop YARN, MapReduce, Spark, Hive, etc.
Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark on YARN.
Extensive experience in designing, developing, and optimizing ETL processes using Informatica PowerCenter. Proficient in creating complex mappings, transformations, and workflows.
Developed Spark applications and utilized Hadoop components to perform complex data transformations and analysis, integrating data from HDFS and AWS S3 for optimized processing and storage.
Deployed Apache Kafka to stream real-time data into Hadoop and Hive environments, supporting data processing and integration tasks and optimizing the flow of data between various Hadoop ecosystem components.
Handled data integrity checks using Hive queries, Hadoop, and Spark.
Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS and converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations; imported data from different sources into Spark RDDs for processing, developed custom aggregate functions using Spark SQL, and performed interactive querying (see the sketch at the end of this section).
Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and used Sqoop for importing and exporting data between RDBMS and HDFS.
Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements and involved in managing S3 data layers and databases including Redshift and Postgres.
Processed web server logs by developing multi-hop Flume agents using Avro sinks, loaded the data into MongoDB for further analysis, and worked on MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
Developed a Python Script to load the CSV files into the S3 buckets and created AWS S3 buckets, performed folder management in each bucket, managed logs and objects within each bucket.
Worked with file formats such as JSON, Avro, and Parquet and compression techniques such as Snappy; developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
Developed shell scripts for dynamic partitions adding to hive stage table, verifying JSON schema change of source files, and verifying duplicate files in source location.
Worked with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).
Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive structured and unstructured data.
Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis, and imported and cleansed high-volume data from sources such as DB2, Oracle, and flat files into SQL Server.
Implemented ad-hoc analysis solutions using HDInsight.
Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.
Knowledge in performance troubleshooting and tuning Hadoop Clusters.
Extensively used Stash/Bitbucket for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
Environment: EMR, EC2, S3, Glue, Athena, Redshift, Hadoop (YARN, MapReduce, HDFS), Spark (Spark Context, Spark SQL, Data Frame, Spark Streaming, RDD), Hive, Sqoop, Flume, MongoDB, Python, Scala, Shell scripting, Airflow, JSON, AVRO, Parquet, Snappy compression, Oracle, SQL Server, Netezza, DB2, HDInsight, Snowflake, Stash, Git-Bucket.
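A minimal sketch (Python/PySpark) of the Spark validation-and-aggregation pattern described above: read raw CSV data from S3, filter out invalid records, aggregate per learner, and persist the result as Parquet. The bucket paths and column names are hypothetical placeholders.

# Read from S3, validate, aggregate, and write curated Parquet output.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-validate-aggregate").getOrCreate()

raw = (spark.read
       .option("header", True)
       .csv("s3a://raw-bucket/learner-events/"))             # placeholder input path

# Basic validation: require a non-null learner_id and a positive score.
valid = raw.filter(F.col("learner_id").isNotNull() & (F.col("score") > 0))

# Custom aggregation per learner.
summary = (valid.groupBy("learner_id")
           .agg(F.count("*").alias("event_count"),
                F.avg("score").alias("avg_score")))

summary.write.mode("overwrite").parquet("s3a://curated-bucket/learner-summary/")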

CHEWY, Dallas, Texas Jan 2013 - July 2015
Role: Data Engineer
Key Responsibilities:

Analyzed client requirements, collaborated with the team to develop a detailed design, and worked closely with the client team to confirm and implement design changes.
Involved in creating multiple Hive tables, loading them with data, and writing Hive queries that run internally in MapReduce.
Converted Hive queries to Spark transformations and utilized Spark RDDs for enhanced data processing and analytics, integrating data from Hadoop and RDBMS to streamline data workflows.
Configured and scheduled SSIS jobs to perform routine data import, transformation, and cleansing operations, improving overall data management and operational workflows.
Converted SQL queries to Spark transformations using Spark RDDs and Scala.
Worked on importing real-time data into Hadoop using Kafka and implemented Oozie jobs.
Collected log data from web servers and exported to HDFS for further processing and analysis.
Involved in defining job flows, management, and log files reviews.
Created HBase tables and columns to store the user event data.
Involved in Database migrations from Traditional Data Warehouses to Spark Clusters.
Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
Used Hive and Impala to query the data in HBase.
Devised PL/SQL Stored Procedures, Functions, Triggers, Views and Packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
Configured and deployed Oozie workflow on Hadoop cluster to run Spark, Pig jobs simultaneously.
Implemented dynamic partitioning and bucketing in Hive for efficient data access (see the sketch at the end of this section).
Worked with Sqoop for importing and exporting data between HDFS and RDBMS in both directions.
Created ETL Pipeline using Spark and Hive for ingesting data from multiple sources.
Worked on improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and YARN.
Successfully led the migration of data from legacy systems to Snowflake, ensuring a smooth transfer of substantial amounts of data.
Performed ETL on data from various file formats such as JSON, Parquet and Database.
Developed SSIS packages to facilitate seamless data integration between various systems and platforms.
Participated in data analysis and validation, communicating with business analysts and customers, and resolving issues as part of production support.
Developed data processing workflows using Hadoop ecosystem tools like Map-Reduce, Apache Spark to analyze and transform data.
Performed query analysis, performance tuning, troubleshooting and optimization of T-SQL queries using SQL profiler, query execution plan and performance monitor.
Environment: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Apache Kafka, Apache Spark (RDD, Spark-SQL, DataFrame, Spark Streaming), Scala, Impala, SQL Server (SSIS, T-SQL, PL/SQL), Snowflake, JSON, Parquet, RDBMS, Python, Linux, and SQL Profiler.
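A minimal sketch (Python/Spark SQL) of the Hive dynamic-partition load pattern described above, run through a Hive-enabled Spark session. The database, table, and column names are hypothetical placeholders.

# Enable Hive dynamic partitioning and load a partitioned table from staging.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-dynamic-partition-load")
         .enableHiveSupport()
         .getOrCreate())

# Allow Hive to derive partitions from the data itself.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Partitioned target table (event_date is the partition column).
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.user_events (
        user_id STRING,
        event_type STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
""")

# Load from a staging table; partitions are created from the event_date values.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.user_events PARTITION (event_date)
    SELECT user_id, event_type, event_date
    FROM staging.user_events_raw
""")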