Chandrakanth - Data Engineer
[email protected]
Location: Dallas, Texas, USA
Relocation: YES
Visa: H1B
Professional Summary

Data engineering background with 9+ years of experience in Big Data technologies, data pipelines, SQL/NoSQL, cloud-based RDS, distributed databases, serverless architecture, data mining, web scraping, and cloud technologies such as AWS EMR, Redshift, Lambda, Step Functions, and CloudWatch.
Hands-on experience with AWS Cloud services (EKS, EMR, VPC, EC2, S3, RDS, Athena, Glue, Redshift, Data Pipeline, Step Functions, WorkSpaces, Lambda, Kinesis, SNS, SQS) to run development and production jobs.
Over 4 years of experience in the Apache Hadoop ecosystem (Apache Spark, HBase, Hive), building data pipelines and migrating data to and from the cloud with domain expertise.
Experience working with Azure Data Factory (ADF), Azure Data Lake (ADL), Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, Azure Synapse, Azure SQL Data Warehouse, and Azure Blob Storage.
Vast experience in designing, creating, testing, and maintaining the complete data management lifecycle from data ingestion and curation through data provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DSL, Streaming), working with file formats such as Parquet and JSON, and tuning Spark application performance from various aspects.
Experience using the Spark SQL Scala and Python interfaces, including automatically converting RDDs of case classes to schema RDDs (DataFrames).
In-depth knowledge of PySpark/Scala and experience in building Spark applications.
Strong expertise in building Power BI reports on Azure Analysis Services for better performance.
Experienced in moving data between GCP and Azure Cloud using Azure Data Factory (ADF).
Experience in migrating SQL databases to Azure Data Lake, Azure SQL Data Warehouse, and Azure Synapse Analytics.
Experience in using SSIS tools such as the Import and Export Wizard, package installation, and the SSIS Package Designer.
Good understanding of Data Modeling (dimensional and relational) concepts such as star-schema modeling, snowflake schema modeling, and fact and dimension tables.
Expertise in handling ETL and ELT processes using tools like Apache NiFi and Matillion ETL. Involved in hands-on development and configuration of data processing using PySpark on AWS Glue using PyCharm, Glue Studio, and Jupyter Notebook.
Implemented relationships in a graph database and used nodes to store data entities.
Performed ETL for data scientists to migrate data from the enterprise cluster.
Used Git extensively and Jenkins for CI/CD.
Developed event-based ETL loads using S3, Lambda, AWS Glue, and Redshift for the secondary manufacturing module (a minimal sketch follows).
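A minimal sketch of the event-based load pattern above, assuming a Lambda function subscribed to S3 ObjectCreated events; the Glue job name "manufacturing_etl" and the argument names are hypothetical placeholders rather than the actual production configuration.

    import json
    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        # Triggered by an S3 ObjectCreated notification; starts a Glue job per new object.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # "manufacturing_etl" is a hypothetical Glue job name used for illustration.
            run = glue.start_job_run(
                JobName="manufacturing_etl",
                Arguments={"--source_path": f"s3://{bucket}/{key}"},
            )
            print(f"Started Glue job run {run['JobRunId']} for s3://{bucket}/{key}")
        return {"statusCode": 200, "body": json.dumps("ok")}

Downstream, the Glue job would typically transform the new file and load the result into Redshift, matching the S3 / Lambda / Glue / Redshift flow described above.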

Skills


Hadoop Ecosystem: HDFS, SQL, YARN, Pig Latin, MapReduce, Hive, Sqoop, Spark, ZooKeeper, Oozie, Kafka, Storm, Flume
Programming Languages: Python, PySpark, Spark with Scala, JavaScript, Shell Scripting
Big Data Platforms: Hortonworks, Cloudera
Cloud Platforms: Google Cloud Platform (GCP) - BigQuery, Dataproc, Pub/Sub, Cloud Dataflow, GKE; Amazon Web Services (AWS) - EKS, EC2, S3, EMR, Kinesis, Athena, Glue, Step Functions, Lambda, Redshift, SNS, SQS, EBS, VPC, IAM; Microsoft Azure - HDInsight, Azure Data Lake Storage; Snowflake
Operating Systems: Linux, Windows, UNIX
Databases: Netezza, MySQL, UDB, HBase, MongoDB, Cassandra, Snowflake, SSIS
Development Methods: Agile/Scrum, Waterfall
IDEs: PyCharm, IntelliJ, Ambari, Jupyter Notebook
Data Visualization: Tableau, Power BI, BO Reports, Splunk


Professional Experience
Sr. Data Engineer | McKesson, Irving, TX | October 2022 to Present

Designed and maintained SQL databases to store and manage COVID-19 data efficiently.
Designed and implemented scalable data pipelines using technologies such as Apache Spark and Apache Beam, resulting in a 30% improvement in data processing efficiency.
Led a team in building a real-time streaming data ingestion system using Kafka, enabling the organization to react to critical business events within seconds (see the sketch at the end of this section).
Implemented a data lake architecture on Azure, leveraging services such as S3 and Azure Data Lake Storage to centralize and democratize data access across the organization.
Developed robust ETL processes using tools like Apache Airflow, ensuring data consistency and reliability while handling petabytes of data.
Spearheaded the adoption of CI/CD practices in the data engineering team, resulting in a 50% reduction in deployment time and a more agile development process.
Implemented version control and change management processes for Tableau and Power BI artifacts using tools like Azure DevOps, ensuring consistency and traceability across development environments.
Implemented data governance policies and procedures to ensure compliance with regulatory requirements such as GDPR, enhancing data security and privacy practices within the organization.
Mentored junior data engineers on best practices in data engineering, fostering a culture of continuous learning and professional development within the team.
Collaborated with cross-functional teams including data scientists and business analysts to understand data requirements and deliver tailored solutions that meet business objectives.
Conducted performance tuning and optimization of big data platforms such as Databricks, resulting in significant cost savings and improved resource utilization.
Designed and implemented data quality monitoring frameworks to proactively identify and resolve data anomalies, ensuring high data accuracy and reliability.
Automated report generation and distribution processes using Tableau Server and Power BI Service, reducing manual effort and increasing operational efficiency within the organization.
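A minimal sketch of the Kafka-based streaming ingestion referenced above, written as a PySpark Structured Streaming job; the broker address, topic name, event schema, and storage paths are hypothetical placeholders, and the job assumes the Spark-Kafka integration package is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("event_stream_ingest").getOrCreate()

    # Hypothetical event schema used only for illustration.
    schema = (StructType()
              .add("event_id", StringType())
              .add("facility", StringType())
              .add("event_time", TimestampType()))

    # Read the raw Kafka stream; broker and topic names are placeholders.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "business-events")
           .option("startingOffsets", "latest")
           .load())

    # Parse the JSON payload and land it in the data lake as Parquet.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("e"))
              .select("e.*"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "abfss://lake@storageacct.dfs.core.windows.net/raw/events/")
             .option("checkpointLocation", "abfss://lake@storageacct.dfs.core.windows.net/chk/events/")
             .trigger(processingTime="30 seconds")
             .start())

    query.awaitTermination()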


Azure Big Data Engineer | Walmart, Dallas, TX | March 2022 to October 2022

Developed data lakes using Azure Data Lake and Blob Storage; performed end-to-end implementation of Azure Data Factory pipelines.
Worked on migration of data from on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
Worked on creating tabular models on Azure Analysis Services to meet business reporting requirements.
Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
Prepared the complete data mapping for all migrated jobs using SSIS.
Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
Converted SSIS stored procedures to Scala to load data into some of the fact and dimension tables.
Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer patterns.
Worked on designing and developing a real-time Tax Computation Engine using Oracle, StreamSets, NiFi, Spark Structured Streaming, and MemSQL.
Developed a pipeline using Scala and Kafka to load data from a server to Hive, with automatic ingestion and quality audits of the data into the RAW layer of the data lake.
Used ETL (SSIS) to develop jobs for extracting, cleaning, transforming, and loading data into the data warehouse.
Implemented PySpark, utilizing DataFrames and the Spark SQL API for faster data processing (a minimal PySpark sketch follows this section).
Worked on PySpark data sources and DataFrames, and on Spark SQL and Streaming using Scala.
Experience in developing Spark applications using Scala and sbt.
Performed a POC to compare the time taken for Change Data Capture (CDC) of Oracle data across Striim, StreamSets, and Dbvisit.
Expertise in using different file formats such as text files, CSV, Parquet, and JSON.
Experience writing custom compute functions using Spark SQL and performing interactive querying.
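A minimal PySpark sketch of the DataFrame/Spark SQL pattern mentioned above, reading a curated Hive table and writing an aggregated fact table; the table names, columns, and grain are hypothetical placeholders rather than the actual data model.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sum as sum_

    spark = (SparkSession.builder
             .appName("fact_daily_sales_load")
             .enableHiveSupport()
             .getOrCreate())

    # "staging.sales" is a hypothetical source table name.
    sales = spark.table("staging.sales").filter(col("order_status") == "COMPLETE")

    # Aggregate to the grain of the fact table (store x day).
    fact = (sales.groupBy("store_id", "order_date")
            .agg(sum_("order_amount").alias("total_sales"),
                 sum_("order_qty").alias("total_units")))

    # Write the fact table partitioned by date for downstream reporting loads.
    (fact.write
     .mode("overwrite")
     .partitionBy("order_date")
     .saveAsTable("dw.fact_daily_sales"))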

Data Engineer | Drug Plastic, Boyertown, PA | September 2021 to March 2022

Responsibilities:
Migrated existing data from SQL Server and Teradata to Hadoop and performed ETL operations.
Designed and implemented Sqoop incremental imports from relational databases to HDFS (S3).
Worked with Avro and Parquet file formats and used various compression techniques to leverage the storage in HDFS.
Created data pipelines to load transformed data into Redshift from data lake.
Wrote extensive Spark/Scala programs using DataFrames, Datasets, and RDDs to transform transactional database data and load it into Redshift tables.
Used Apache Airflow for scheduling and orchestrating the data pipelines.
Implemented best design practices for AWS Redshift query optimization by choosing the distribution style of fact tables.
Designed and developed ETL processes in AWS Glue to migrate data from external sources.
Used AWS Glue Elastic Views to combine and continuously replicate data across multiple data stores.
Used Athena to analyze unstructured, semi-structured, and structured data stored in Amazon S3 (a query sketch follows this section).
Implemented Delta Lake features to overcome the challenges of backfill and re-ingestion into the data lake.
Used a serverless computing platform (AWS Lambda) for running Spark jobs.
Developed serverless workflows using the AWS Step Functions service and automated the workflows using Amazon CloudWatch.
Used the Neo4j Graph Data Science (GDS) library, which provides efficiently implemented, parallel versions of common graph algorithms for Neo4j, exposed as Cypher procedures.
Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
Neo4j and Cypher can be extended with user-defined procedures and functions.
Designed the data warehouse per business requirements and implemented fact and dimension tables using a star schema.
Used Memgraph and Neo4j to help choose the best graph analytics platform.
Optimized the existing ETL pipelines by tuning existing SQL queries and data partitioning techniques.
Verified the data flow using ETL tools like KNIME.
Implemented CI/CD for AWS Glue using AWS CodeBuild as part of the AWS cloud migration.
Synchronized structured and unstructured data using Hive to support business use cases.
Used the AWS Glue Data Catalog with crawlers to catalog data from S3 and perform Hive operations.
Collaborated with product and engineering teams across multiple project areas to build innovative data solutions that improve features and deliver results in a data-driven way.
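A minimal sketch of querying S3 data through Athena from Python with Boto3, as referenced above; the database, table, query, and results bucket are hypothetical placeholders.

    import time
    import boto3

    athena = boto3.client("athena")

    # Database, table, and output location are illustrative placeholders.
    response = athena.start_query_execution(
        QueryString="""
            SELECT order_date, COUNT(*) AS orders
            FROM raw_orders
            WHERE year = '2021'
            GROUP BY order_date
        """,
        QueryExecutionContext={"Database": "datalake_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/queries/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query finishes; results can then be read from the output
    # location or fetched with get_query_results.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    print(f"Query {query_id} finished with state {state}")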

Data Engineer | Mr Cooper, Irving, TX | January 2020 to August 2021

Responsibilities:
Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow (a minimal DAG sketch follows this section).
Built an Amazon ECS cluster, integrated data from all sources, and created a common schema; used Spark to process the data and apply business logic to generate monthly and quarterly reports.
Pulled data from SQL Server, an Amazon S3 bucket, and internal SFTP and loaded it into the IMS S3 bucket.
Used Python Boto3 to configure AWS services such as Glue, EC2, and S3.
Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
Developed Oozie actions such as Hive, shell, and Java actions to submit and schedule applications to run in the Hadoop cluster.
Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.
Worked in the Azure environment for development and deployment of custom Hadoop applications.
Demonstrated QlikView to data analysts for creating custom reports, charts, and bookmarks.
Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.
After data is deployed from Oracle to HDFS, a dedicated job imports the curated clinical data from HDFS; on top of that, an aggregation layer is created based on specialization and age group.
Used Spark SQL to perform aggregations and created a data model for structuring and storing the data efficiently by partitioning and bucketing in Hive.
Worked on developing and predicting trends for business intelligence.
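A minimal Airflow DAG sketch for the S3-to-Snowflake loading mentioned above, assuming Airflow 2.x with the Snowflake provider package installed; the connection ID, external stage, table name, and schedule are hypothetical placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    # Connection ID, stage, and table names below are illustrative only.
    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        copy_orders = SnowflakeOperator(
            task_id="copy_orders_from_s3",
            snowflake_conn_id="snowflake_default",
            sql="""
                COPY INTO analytics.orders
                FROM @analytics.s3_orders_stage
                FILE_FORMAT = (TYPE = PARQUET)
                MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
            """,
        )

Here the COPY INTO statement pulls files from an external stage pointing at the S3 bucket, so the load logic stays in Snowflake while Airflow handles scheduling and retries.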

Data Engineer | Fifth Third Bank, Evansville, IN | November 2018 to December 2019

Responsibilities:
Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and unstructured batch and real-time streaming data using Python.
Worked on building data warehouse structures and creating fact, dimension, and aggregate tables through dimensional modeling with star and snowflake schemas.
Applied transformations to data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
Built end-to-end ETL models to sort vast amounts of customer feedback and derive actionable insights and tangible business solutions.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing the results to data scientists for further analysis (a minimal cleansing sketch follows this section).
Prepared scripts to automate the ingestion process using PySpark and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
Responsible for monitoring the sentiment prediction model for customer reviews and ensuring a high-performance ETL process.
Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).
Involved in performance tuning and optimization of long-running Spark jobs and queries (Hive/SQL).
Developed, using object-oriented methodology, a dashboard to monitor all network access points and network performance metrics using Django, Python, MongoDB, and JSON.
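A minimal PySpark sketch of the validation and cleansing pattern described above; the input/output paths, column names, and rules are hypothetical placeholders rather than the actual pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lower, to_date, trim

    spark = SparkSession.builder.appName("feedback_cleansing").getOrCreate()

    # Paths and columns are illustrative placeholders.
    raw = spark.read.json("s3://example-raw/customer_feedback/")

    cleansed = (raw
                # Drop records missing the keys required downstream.
                .dropna(subset=["customer_id", "feedback_text"])
                # Remove exact duplicates produced by replayed source extracts.
                .dropDuplicates(["customer_id", "feedback_id"])
                # Normalize text and date fields.
                .withColumn("feedback_text", trim(col("feedback_text")))
                .withColumn("channel", lower(trim(col("channel"))))
                .withColumn("feedback_date", to_date(col("feedback_date"), "yyyy-MM-dd")))

    # Simple validation rule: rows with an unparseable date go to a quarantine path.
    valid = cleansed.filter(col("feedback_date").isNotNull())
    invalid = cleansed.filter(col("feedback_date").isNull())

    valid.write.mode("overwrite").parquet("s3://example-curated/customer_feedback/")
    invalid.write.mode("overwrite").parquet("s3://example-quarantine/customer_feedback/")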


Data Engineer | Careator Technologies Pvt Ltd, Hyderabad, India | May 2014 to July 2018

Responsibilities:
Extensive knowledge and hands-on experience in architecting and designing data warehouses/databases, data modeling, and building SQL objects such as tables, views, user-defined/table-valued functions, stored procedures, triggers, and indexes.
Created HBase tables from Hive and wrote HiveQL statements to access HBase table data.
Developed complex Hive scripts for processing the data and created dynamic partitions and bucketing in Hive to improve query performance (a minimal PySpark sketch follows this section).
Involved in migrating Spark Jobs from Qubole to Databricks.
Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs, and other tools and languages in the Hadoop ecosystem.
Designed, developed, and deployed data pipelines for moving data across various systems.
Developed solutions for import/export of data from Teradata and Oracle to HDFS and S3, and from S3 to Snowflake.
Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory errors, heap space errors, and schema compatibility problems.
Imported and exported data using Sqoop between HDFS and relational databases (Oracle and Netezza).
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and RDDs.
Extensively involved in installation and configuration of the Cloudera Hadoop distribution.
Created an automated loan leads and opportunities match-back model used to analyze loan performance and convert more business leads.
Ingested forecasted budget history into the data warehouse.
Worked on PySpark APIs for data transformations.
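A minimal PySpark sketch of the partitioning and bucketing approach mentioned above, writing a Hive table partitioned by date and bucketed on a join key; the table names, columns, and bucket count are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("orders_partitioned_load")
             .enableHiveSupport()
             .getOrCreate())

    # "staging.orders" and "warehouse.orders_bucketed" are placeholder table names.
    orders = spark.table("staging.orders")

    # Partition by date and bucket by customer_id so partition pruning and
    # bucket-aware joins can skip most of the data at query time.
    (orders.write
     .mode("overwrite")
     .partitionBy("order_date")
     .bucketBy(16, "customer_id")
     .sortBy("customer_id")
     .saveAsTable("warehouse.orders_bucketed"))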

Education
Bachelor's in Computer Science and Engineering, Lovely Professional University
Master's in Business Analytics, University of North Texas