Sai Arundeep - Data Engineer
[email protected]
Location: Dallas, Texas, USA
Relocation:
Visa: H1B |
Sai Aetukuri
[email protected] | +1 980-355-9669

PROFESSIONAL SUMMARY:
- Around 8 years of experience in the software industry, involved in project design, development, deployment, and maintenance.
- Experienced in working with big data technologies such as Spark SQL and PySpark.
- Experienced with Hadoop ecosystem tools, including validating data in Hive with HQL queries; in-depth understanding of MapReduce concepts.
- Led end-to-end migration projects from legacy systems and on-premises data warehouses to Snowflake on AWS/Azure, ensuring minimal downtime and data integrity.
- Streamlined Qlik data integration processes by collaborating with cross-functional teams, ensuring the alignment of data strategy with business goals.
- Expertise in creating complex DAGs (Directed Acyclic Graphs) to manage data pipeline orchestration at scale; leveraged Airflow's Amazon provider libraries to create EMR (Elastic MapReduce) and AWS Glue clusters on the fly for on-demand use cases.
- Experience with public cloud services such as AWS (Amazon Web Services); created and managed IAM (Identity and Access Management) roles and access policies at the enterprise level to control data access within the organization, adhering to enterprise data governance rules.
- Experience in creating and managing AWS S3 (Simple Storage Service) buckets, writing bucket access policies to control data access and managing data retention timelines and S3 lifecycle policies. Created S3 data replication roles to maintain a backup failover region (us-west-2) as a fault-tolerant system, and created S3 event triggers to automate file processing leveraging AWS SNS (Simple Notification Service).
- Experience working with container services such as AWS ECS (Elastic Container Service), along with AWS Redshift, to create and host the Airflow webserver for ETL (Extract, Transform & Load) pipeline orchestration.
- Experience implementing CI/CD pipelines in AWS using Terraform to automate the provisioning and deployment of cloud infrastructure, significantly reducing manual configuration and deployment times.
- Automated data migration workflows using tools like Apache Airflow, AWS Glue, and Azure Data Factory, streamlining the transition to Snowflake.
- Developed scalable solutions to ingest high-velocity streaming data from Azure EventHub into Snowflake, ensuring low-latency data availability for analytics and reporting.
- Leveraged Azure Data Factory to orchestrate data flows between EventHub and Snowflake, optimizing ETL processes to handle diverse data formats and high throughput.
- Collaborated with data engineering and analytics teams to enhance the integration of Qlik and Databricks, continuously refining the architecture to support evolving business requirements.
- Implemented best practices for data management and security within the Qlik and Databricks ecosystems, ensuring long-term scalability and robustness of the solutions.
- Configured and managed Snowflake external tables and stages to facilitate efficient data loading from Azure EventHub streams, enhancing data accessibility and query performance.
- Experience working with the Microsoft suite of data services, including Azure Data Factory, Azure Databricks, and Synapse Analytics.
- Experience working with Azure Functions for creating and writing data into storage containers using service principals and linked services.
- Expertise in working with Python libraries such as PySpark to create Spark DataFrames and consume raw data received in various file formats such as CSV, TSV, Parquet, ORC, JSON, and Avro.
- Expertise in working with the Pandas and NumPy libraries to perform complex data transformations and interconversion between Pandas/NumPy data structures and PySpark DataFrames (a brief sketch follows this summary).
- Led data migration projects to move large datasets from on-premises databases and other cloud platforms to Snowflake, ensuring data integrity and minimal downtime.
- Developed automated data quality checks and validation rules within Snowflake to ensure accuracy, completeness, and consistency of ingested data.
- Configured and managed Snowflake tasks to automate and schedule data load operations, ensuring timely data availability for analytics.
- Expert in developing and implementing comprehensive data governance frameworks and strategies within Databricks, including role-based access control, data lineage tracking, and automated data quality checks, while optimizing data storage, processing, and compliance to support advanced analytics and business goals.
- Utilized Snowflake's zero-copy cloning feature to create quick, efficient copies of databases and tables for testing and development purposes without additional storage costs.
- Involved in data lake formation within the AWS cloud environment and isolated data access and control to IAM roles based on access policies.
- Experience working with Databricks notebooks; leveraged the boto3 library to access AWS resources such as S3 object storage with partitioned data and performed data analysis.
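The following is a minimal, illustrative PySpark sketch of the raw-file consumption and Pandas interconversion pattern described in the summary above; the bucket paths, sample size, and the load_date partition column are hypothetical placeholders, not values from any project listed here.

    # Illustrative only; paths and the load_date column are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("raw-ingest-sketch").getOrCreate()

    # Consume raw CSV files dropped into an S3 object-key location into a Spark DataFrame.
    raw_df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("s3://example-raw-bucket/incoming/")          # placeholder path
        .withColumn("load_date", F.current_date())         # partition column for versioned loads
    )

    # Interconvert with Pandas for ad-hoc profiling, then back to a Spark DataFrame.
    sample_pdf = raw_df.limit(10_000).toPandas()
    sample_df = spark.createDataFrame(sample_pdf)

    # Persist the consolidated data as partitioned Parquet for downstream consumers.
    (raw_df.write
        .mode("append")
        .partitionBy("load_date")
        .parquet("s3://example-curated-bucket/consolidated/"))   # placeholder path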
TECHNICAL SKILLS:
Programming Languages: SQL, PL/SQL, PySpark, Python
BI/Analytics Tools: Tableau, Power BI, Excel, Hadoop (HDFS), Spark
Cloud Services: Azure Databricks, Azure Data Factory, Azure Synapse, Amazon EMR, Amazon S3, Amazon CloudWatch, AWS Lambda, AWS SQS, Google BigQuery, Dataproc
Databases/Storage: Oracle SQL, MS SQL Server, DBFS, EC2 & Azure Data Lake Storage
Skilled Areas: PySpark, Spark SQL, ETL (Extract, Transform & Load)

PROFESSIONAL EXPERIENCE:

Data Engineer | Capital One, McLean, VA | Apr 2023 - Present
Responsibilities:
- Worked on identifying the business requirements to design an aggregate data model for a complex data ingestion use case, combining multiple streams of brokerage data into a single stream.
- Executed complex system test scripts for regression, stress/load, performance, integration, and functional testing using SQL and automated testing frameworks for Snowflake.
- Built a micro-batch streaming data pipeline using AWS SQS and AWS Lambda to perform prospect marketing campaign journey analysis, connecting to the SFMC (Salesforce Marketing Cloud) API and performing targeted email re-marketing.
- Migrated all EMR Spark jobs to serverless, either to AWS Lambda or to AWS Glue; implemented Glue crawlers on AWS S3 to automatically trigger Glue jobs.
- Involved in data modeling, built data aggregation models, and performed feature engineering to derive attributes serving analytical and reporting use cases.
- Crafted comprehensive ERDs to model and document the database structure, ensuring clear representation of tables, relationships, and constraints within Snowflake.
- Set up and managed Snowpipe for continuous data ingestion from various sources, including automated data loading from Amazon S3, enabling near real-time data processing and availability.
- Designed and developed the DAG (Directed Acyclic Graph) orchestration model based on the architecture capabilities within the organization to streamline data ingestion in Apache Airflow; extensively used Airflow libraries such as airflow.providers.amazon.aws for AWS EMR (a condensed sketch follows this role).
- Created AWS S3 buckets to allow a file-drop mechanism for raw data files into specified object-key locations, and configured bucket access policies to control which human and machine IAM roles can access the raw data inside the S3 buckets.
- Developed PySpark code to consume raw data received in CSV format from several business streams into S3 buckets, leveraging the boto3 library and the machine IAM role attached to the EMR (Elastic MapReduce) and Glue clusters.
- Implemented the Spark DataFrame API and Spark SQL extensively to perform basic and complex data transformations to consolidate the raw data into a common aggregate data model (schema).
- Leveraged the boto3 client to access the outbound S3 bucket data location, partitioning and versioning every new data load to maintain a historical record of data processing events in the enterprise data lake; further enabled a connection to the Snowflake data warehouse, leveraging the Spark Snowflake JDBC driver library to write the processed data into an aggregate relational schema for all enterprise operational use cases.
- Proficient in Snowflake modeling (roles, databases, schemas) and SQL performance tuning.
- Extensively worked on data pipeline design, development, processing, and deployment, leveraging AWS resources (S3, Redshift, ECS, EMR, and Glue) to orchestrate and process data ingestion streams.
- Developed CI/CD using Infrastructure-as-Code (IaC) templates with Terraform for automating AWS resource provisioning, including S3, Glue, and Lambda, ensuring scalable and repeatable infrastructure deployments.
- Integrated Terraform with AWS CodePipeline and AWS CodeBuild to create fully automated deployment pipelines, enabling continuous delivery of infrastructure and application updates.
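Below is a condensed, illustrative sketch of the on-demand EMR pattern referenced in this role, using the airflow.providers.amazon.aws operators to create a cluster, run a Spark step, and terminate the cluster; the cluster configuration, step script path, and DAG settings are assumptions for illustration only, not project values.

    # Illustrative Airflow 2.x DAG; cluster settings and the step script are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrAddStepsOperator,
        EmrTerminateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    JOB_FLOW_OVERRIDES = {
        "Name": "on-demand-ingest-cluster",          # placeholder cluster name
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

    SPARK_STEP = [{
        "Name": "consolidate-raw-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-artifacts/jobs/consolidate.py"],  # placeholder script
        },
    }]

    with DAG(
        dag_id="emr_on_demand_ingest",
        start_date=datetime(2024, 1, 1),
        schedule=None,            # triggered on demand in this sketch
        catchup=False,
    ) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES
        )
        add_step = EmrAddStepsOperator(
            task_id="add_step", job_flow_id=create_cluster.output, steps=SPARK_STEP
        )
        wait_for_step = EmrStepSensor(
            task_id="wait_for_step",
            job_flow_id=create_cluster.output,
            step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
        )
        terminate_cluster = EmrTerminateJobFlowOperator(
            task_id="terminate_cluster",
            job_flow_id=create_cluster.output,
            trigger_rule="all_done",   # always clean up the cluster
        )

        create_cluster >> add_step >> wait_for_step >> terminate_cluster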
Data Engineer | Optum, Minneapolis, MN | May 2022 - Apr 2023
ODBC3 Migration: The ODBC3 Migration project migrates existing data products and data pipelines from on-prem (IDW, SAS) to the Azure cloud environment (Databricks, Data Factory, SQL warehouse) for Finance and Accounting, including E&I, C&S, M&R, and Workers' Compensation.
- Worked on identifying the business requirements to design an aggregate data model for a complex data ingestion use case, combining multiple data streams from on-prem and consolidating the data into a single stream.
- Performed data analysis and designed the architecture for all the pipelines responsible for billing to UHC.
- Implemented Unity Catalog for data governance of the data assets, as well as a data strategy for cost optimization and automation.
- Created detailed test plans in collaboration with business areas, supporting integration test scenarios in Azure and ensuring smooth project execution.
- Integrated Qlik with Azure Databricks to load data from SQL Server into a landing zone, facilitating seamless data ingestion for processing in a medallion architecture (bronze, silver, gold layers).
- Automated data replication from SQL Server to Azure using Qlik Replicate, ensuring efficient and reliable data loading into the landing zone for subsequent transformation in Databricks.
- Built and maintained fully automated data pipelines from SQL Server through Qlik Replicate to Azure Databricks, ensuring efficient data ingestion, transformation, and aggregation.
- Developed and managed Delta tables for efficient, ACID-compliant storage and fast query performance across large datasets.
- Leveraged Delta Live Tables (DLT) to build automated, reliable ETL pipelines with real-time data processing and quality monitoring (sketched after this role).
- Streamlined event-driven data capture and processing from Azure EventHub and Kafka, utilizing Azure Data Factory to enhance real-time analytics capabilities.
- Implemented incremental data processing strategies using Delta Lake's capabilities, significantly reducing processing time and improving data freshness.
- Managed data governance and security with Unity Catalog, ensuring consistent data access control and policy enforcement across the organization.
- Enabled seamless and secure data sharing across platforms and stakeholders using Delta Sharing, enhancing collaboration and data monetization opportunities.
- Implemented a data fabric architecture on Azure to unify data from various sources, enabling a centralized data access layer.
- Designed and developed the data pipeline orchestration model based on the architecture capabilities within the organization to streamline data ingestion in Azure Data Factory.
- Integrated Azure Data Factory with data fabric tools and automated the creation and management of data pipelines, so that data can be ingested, processed, and moved across different systems without manual intervention, ensuring a more efficient and error-free process.
- Designed and executed complex SQL and ASQL queries to extract and manipulate data across various databases, improving data retrieval efficiency by 30%.
- Successfully migrated data transformation processes from SAS to PySpark, leveraging the power of distributed computing for enhanced performance and scalability.
- Utilized PySpark's DataFrame API and SQL functionalities to manipulate, cleanse, and preprocess large datasets efficiently.
- Led the migration effort to transition SAS programs to PySpark, ensuring data integrity, accuracy, and minimal disruption to existing workflows.
- Migrated complex analytic queries, stored procedures, and DDL, DML, and DCL into Python and Spark on Databricks using Azure Data Factory.
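The following sketch illustrates how a bronze/silver/gold Delta Live Tables flow of the kind described above might look; it assumes execution inside a Databricks DLT pipeline (where spark is provided by the runtime), and the landing-zone path, table names, columns, and expectations are hypothetical.

    # Illustrative DLT pipeline code; runs only inside a Databricks Delta Live Tables pipeline.
    import dlt
    from pyspark.sql import functions as F

    LANDING_PATH = "abfss://landing@exampleaccount.dfs.core.windows.net/claims/"  # placeholder path

    @dlt.table(comment="Bronze: raw files replicated into the landing zone (e.g., via Qlik Replicate).")
    def claims_bronze():
        return (
            spark.readStream.format("cloudFiles")          # Auto Loader incremental ingestion
            .option("cloudFiles.format", "parquet")
            .load(LANDING_PATH)
            .withColumn("_ingested_at", F.current_timestamp())
        )

    @dlt.table(comment="Silver: cleansed, deduplicated claims with a basic quality check.")
    @dlt.expect_or_drop("valid_claim_id", "claim_id IS NOT NULL")   # placeholder expectation
    def claims_silver():
        return (
            dlt.read_stream("claims_bronze")
            .dropDuplicates(["claim_id"])
            .select("claim_id", "member_id", "claim_amount", "_ingested_at")  # placeholder columns
        )

    @dlt.table(comment="Gold: aggregated view for billing and reporting.")
    def claims_gold():
        return (
            dlt.read("claims_silver")
            .groupBy("member_id")
            .agg(F.sum("claim_amount").alias("total_claim_amount"))
        )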
Data Engineer | InterContinental Hotels Group, Atlanta, GA | Jan 2022 - May 2022
Revenue Management (DSF): The Demand Sensing Forecast serves the sole purpose of analyzing the forecasts and actuals of each hotel unit across the world and forecasting future metrics on bookings and cancellations.
- Partnered with IHG Global Revenue Management to create comprehensive documentation of IHG data elements, including data platforms, data/workflows, issues, and stakeholders, to provide leadership with the information necessary to implement data governance initiatives across IHG.
- Designed and developed a data pipeline to pull, aggregate, and join external weather and events data into Google BigQuery using Google Dataproc.
- Worked on ETL using GCP Pub/Sub for streaming real-time data updates to mobile and web applications.
- Conducted an analysis of the events data to derive a potential valuation of over $90M to justify an investment in the events data; processed data was then passed from Python Pandas and NumPy to Google BigQuery, upon which dashboards were built in Tableau.
- Automated the process of reading the key drivers from Google Cloud Storage and writing the data into Google BigQuery using Dataproc.
- Transformed, cleansed, and backfilled data and created models in BigQuery for the business use case to create reports for each hotel.
- Transformed data from different formats such as XML, JSON, and DSV into Parquet using PySpark (Python); wrote shell scripts to automate pipelines and error handling (see the sketch after this role).
- Scheduled jobs in Google Cloud Composer/Airflow/Tidal.
- Implemented a one-time data migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL.
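Below is an illustrative PySpark sketch of the format-conversion and BigQuery-load pattern described in this role, assuming a Dataproc cluster with the spark-bigquery connector available; the Cloud Storage paths, DSV delimiter, dataset/table, and temporary bucket are placeholders.

    # Illustrative only; paths, delimiter, and BigQuery targets are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dsf-format-conversion").getOrCreate()

    # Read semi-structured JSON and delimiter-separated (DSV) drops from Cloud Storage.
    events_df = spark.read.json("gs://example-raw/events/*.json")        # placeholder path
    weather_df = (
        spark.read
        .option("header", "true")
        .option("delimiter", "|")                                        # assumed DSV delimiter
        .csv("gs://example-raw/weather/*.dsv")
    )

    # Standardize both feeds to Parquet for cheaper downstream reads.
    events_df.write.mode("overwrite").parquet("gs://example-curated/events/")
    weather_df.write.mode("overwrite").parquet("gs://example-curated/weather/")

    # Load the curated events into BigQuery via the spark-bigquery connector.
    (events_df.write
        .format("bigquery")
        .option("table", "example_dataset.events")                       # placeholder table
        .option("temporaryGcsBucket", "example-temp-bucket")             # placeholder bucket
        .mode("append")
        .save())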
Data Engineer | Johnson & Johnson, New Jersey, USA | Oct 2021 - Apr 2022
OTIF-D (On Time In Full Delivery Ad Hoc Dashboard): The ad hoc dashboard project provides data from multiple data sources as a single source, transforming unstructured and semi-structured data into GCP.
- Worked on identifying the business requirements to design an aggregate data model for a complex data ingestion use case, combining multiple streams of supply chain data into a single stream.
- Designed and developed the data pipeline orchestration model based on the architecture capabilities within the organization to streamline data ingestion in Azure Data Factory; extensively used Data Factory functions for the Hadoop network and Azure Databricks.
- Developed data pipelines on Azure Data Factory to load data from the EDL to Azure Synapse within J&J's Azure subscription.
- Built an ingestion framework, reused across multiple data sources within data movement, to process external files from Azure Data Lake Store into Azure Databricks tables, perform a schema validation check, and push the appended tables into the reporting layer.
- Built scheduled triggers on Azure Data Factory by creating an account-level trigger on the storage account to execute a notebook in Databricks upon an event on the storage container.
- Collaborated with cross-functional teams to design and validate data architectures that support real-time event processing from Azure EventHub into Snowflake, aligning with business requirements.
- Further enabled connections to the Synapse data warehouse and Tableau, leveraging the Spark Simba ODBC driver library to write the processed data into an aggregate relational schema for all enterprise operational use cases.

Data Engineer | Optum, Minneapolis, MN, USA | Jan 2021 - Oct 2021
MDFT: The Manufacturing Discount Forecasting tool is the second-largest business process of Optum, migrated from IBM Netezza to the Microsoft Azure cloud.
- Developed the Manufacturing Discount tool by building the logic on Azure Databricks in Spark SQL and PySpark and accessing the data from the Integrated Data Warehouse (IDW) within Optum's Azure subscription.
- Built an automated input-file consumption tool, reusable across multiple business processes within Optum, to process files from Azure Data Lake Store into Azure Databricks tables, perform a schema validation check, and push the golden copy into Azure Synapse (a simplified sketch of the validation step follows these responsibilities).
- Collaborated with cross-functional teams to develop work plans, task sequencing, and time schedules for various data engineering projects.
- Worked on scheduling the consumption tool in Azure Data Factory by creating an account-level trigger on the storage account to trigger the notebook in Databricks.
- Developed multiple resource groups and provided a single private endpoint as a gateway to access PI & PHI data across multiple applications.
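The following is a simplified sketch of the kind of schema-validation check performed by the file-consumption tooling described above before a golden copy is promoted; the expected schema, landing path, and target Delta table are hypothetical and assume a Databricks environment.

    # Illustrative only; columns, types, path, and table name are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

    spark = SparkSession.builder.appName("consumption-schema-check").getOrCreate()

    EXPECTED_SCHEMA = StructType([
        StructField("claim_id", StringType(), True),
        StructField("discount_amount", DoubleType(), True),
        StructField("effective_date", DateType(), True),
    ])

    incoming_df = (
        spark.read
        .option("header", "true")
        .csv("abfss://landing@exampleaccount.dfs.core.windows.net/mdft/incoming/")  # placeholder
    )

    # Fail fast if the file does not match the expected contract.
    expected_cols = [f.name for f in EXPECTED_SCHEMA.fields]
    missing = set(expected_cols) - set(incoming_df.columns)
    if missing:
        raise ValueError(f"Input file is missing expected columns: {sorted(missing)}")

    # Cast to the contracted types and keep only contracted columns before promoting the golden copy.
    validated_df = incoming_df.select(
        *[incoming_df[f.name].cast(f.dataType).alias(f.name) for f in EXPECTED_SCHEMA.fields]
    )

    # Append to a Delta table that downstream jobs push into Synapse (Databricks assumption).
    validated_df.write.format("delta").mode("append").saveAsTable("mdft.validated_input")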
Pharmacy Guarantee: Pharmacy Guarantee is a SAS-based process used to classify claims by pharmacies and the associated pharmacy contract guarantee; the idea is to leverage the base claims data product and combine the effort with the client guarantee.
- Worked on building an automated consumption tool to load external file data into Azure Delta tables using Azure Databricks, orchestrated on Azure Data Factory.
- Migrated developed DATA blocks and PROC SQL blocks from SAS to Databricks to support the existing application.
- Developed Terraform templates to build the necessary resources on Azure to support the application.
- Worked on getting enterprise-level security approval to access PI & PHI data from the enhanced layer by justifying the security recommendations and firewalls for the application.

Application Development Analyst | Novartis Pharmaceuticals, East Hanover, NJ | May 2017 - Jan 2020
IRMA: Integrated Relation Marketing Architecture is an enterprise-wide marketing platform across the business operations of Novartis, providing data analysis and reporting to various verticals.
- Generated server-side PL/SQL scripts for data manipulation and validation, and materialized views for remote instances; also created tables, views, and queries for new enhancements in the application according to business requirements.
- Provided support for change management for both healthcare professional data and consumer data in the Novartis account.
- Integrated data from third-party vendors such as SFDC, McKesson, Amgen, Epsilon, and IQVIA.
- Extracted data from Azure Delta Lake to an on-prem SQL Server to support a reporting dashboard.
- Conducted data blending and data preparation using SQL for Tableau consumption, and published data sources to Tableau Server.

EDUCATIONAL QUALIFICATIONS:
Master's in Computer Science, University of Central Missouri
B.E. in Electrical Engineering, Osmania University