Pavan - Data Engineer
[email protected]
Location: Chicago, Illinois, USA
Relocation: Yes
Visa: H1B
PROFESSIONAL SUMMARY
Dynamic and motivated IT professional with over 8 years of experience as a Big Data Engineer,
with expertise in designing data-intensive applications using cloud data engineering, data
warehousing, the Hadoop ecosystem, big data analytics, data visualization, reporting, and data
quality solutions.
Hands-on experience across the Hadoop ecosystem, with extensive experience in Big Data
technologies such as HDFS, MapReduce, YARN, Apache Cassandra, NoSQL, Spark, Python,
Scala, Sqoop, HBase, Hive, Oozie, Impala, Pig, Zookeeper, and Flume.
Built real-time data pipelines by developing Kafka producers and Spark Streaming applications
for consumption. Utilized Flume to analyze log files and write them into HDFS.
Experienced in improving the performance and optimizing existing Hadoop algorithms with Spark,
using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked
extensively with PySpark.
Developed a framework for converting existing PowerCenter mappings into PySpark (Python and
Spark) jobs.
Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine
for managing and scheduling Hadoop jobs.
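
A minimal Airflow DAG sketch illustrating how such Hadoop jobs can be chained and scheduled (the DAG name, schedule, commands, and paths are hypothetical, not taken from any specific project):

    # Chains a Sqoop import, a Hive load, and a Spark aggregation as daily tasks.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_hadoop_pipeline",      # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",       # run daily at 02:00
        catchup=False,
    ) as dag:
        sqoop_import = BashOperator(
            task_id="sqoop_import",
            bash_command="sqoop import --connect jdbc:mysql://db-host/sales "
                         "--table orders --target-dir /raw/orders",
        )
        hive_load = BashOperator(
            task_id="hive_load",
            bash_command="hive -f /jobs/load_orders.hql",
        )
        spark_aggregate = BashOperator(
            task_id="spark_aggregate",
            bash_command="spark-submit /jobs/aggregate_orders.py",
        )

        # Task dependencies: import, then load, then aggregate.
        sqoop_import >> hive_load >> spark_aggregate
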
Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for
small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
Hands-on experience with Amazon EC2, S3, RDS (Aurora), IAM, CloudWatch, SNS, Athena, Glue,
Kinesis, Lambda, EMR, Redshift, DynamoDB, and other services of the AWS family, as well as
Microsoft Azure.
Proven expertise in deploying major software solutions for various high-end clients, meeting
business requirements such as big data processing, ingestion, analytics, and cloud migration
from on-prem to the AWS Cloud.
Experience working with AWS data stores such as ElastiCache (Memcached and Redis) and NoSQL
databases (HBase, Cassandra, and MongoDB), including database performance tuning and data
modeling.
Established connectivity from Azure to an on-premises data center using Azure ExpressRoute for
single and multi-subscription setups.
Created Azure SQL database and performed monitoring and restoring of Azure SQL database.
Performed migration of Microsoft SQL server to Azure SQL database.
Experienced in data modeling and data analysis, including dimensional and relational data
modeling, Star Schema/Snowflake modeling, fact and dimension tables, and physical and logical
data modeling.
Expertise in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas
such as star and snowflake schemas used in relational, dimensional, and multidimensional
modeling.

Experience with partitioning and bucketing concepts in Hive; designed both managed and
external tables in Hive to optimize performance. Experience with different file formats such as
Avro, Parquet, ORC, JSON, and XML, and compression codecs such as Snappy and bzip2.

TECHNICAL SKILLS
Cloud Technologies: Azure (Data Factory V2, Data Lake, Databricks, Blob Storage, Data Box), Amazon EC2, IAM, Amazon S3, Amazon RDS, Elastic Load Balancing, AWS Lambda, Amazon EMR, AWS Glue, Amazon Kinesis.
Automation Tools: Azure Logic Apps, Crontab, Terraform.
Big Data: Hadoop, MapReduce, HDFS, Hive, Impala, Spark, Sqoop, HBase, Flume, Kafka, Oozie, Zookeeper, NiFi.
Code Repository Tools: Git, GitHub, Bitbucket.
Databases: MySQL, SQL Server Management Studio 18, MS Access, MySQL Workbench, Oracle Database 11g Release 1, Amazon Redshift, Azure SQL, Azure Cosmos DB, Snowflake.
End-User Analytics: Power BI, Tableau, Looker, QlikView.
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB.
Languages: Python, SQL, PostgreSQL, PySpark, PL/SQL, UNIX Shell Script, Perl, Java, C, C++.
ETL: Azure Data Factory, Snowflake, AWS Glue.
Operating Systems: Windows 10/7/XP/2000/NT/98/95, UNIX, Linux, DOS.
PROFESSIONAL EXPERIENCE
Navistar - Lisle, IL Nov 2022 - Current
Sr. Big Data Engineer
Designed and set up an Enterprise Data Lake to support various use cases including analytics,
processing, storage, and reporting of voluminous, rapidly changing data.
Used Data Integration to manage data with speed and scalability using the Apache Spark Engine
and AWS Databricks.
Used a SQL-based approach to create notebooks and the DHF_UI in DHF 2.1.
Converted code from Scala to PySpark in DHF (Data Harmonization Framework) and migrated the
code and DHF_UI from DHF 1.0 to DHF 2.1.
Extracted structured data from multiple relational data sources as Data Frames in Spark SQL on
Databricks.
Responsible for loading data from the internal server and the Snowflake data warehouse into S3
buckets.
Performed the migration of large data sets to Databricks (Spark), created and administered
clusters, loaded data, configured data pipelines, and loaded data from Oracle to Databricks.
Created Databricks notebooks to streamline and curate the data from various business use cases.
Triggered and monitored harmonization and curation jobs in the production environment; also
scheduled jobs using DHF jobs and ESP jobs.

Raised Change Requests and SNOW requests in ServiceNow to deploy changes to production.
Guided the development of a team working on PySpark (Python and Spark) jobs.
Used the Snowflake cloud data warehouse and an AWS S3 bucket to integrate data from multiple
sources, including loading nested JSON-formatted data into Snowflake tables.
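
A minimal sketch of loading nested JSON into a Snowflake VARIANT column with the Python connector (the account, stage, table, and credential values are hypothetical):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # hypothetical account identifier
        user="etl_user",
        password="***",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    cur = conn.cursor()

    # Land each raw JSON document in a single VARIANT column; flatten downstream.
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
    cur.execute("""
        COPY INTO raw_events
        FROM @s3_events_stage            -- hypothetical external stage over the S3 bucket
        FILE_FORMAT = (TYPE = 'JSON')
    """)
    cur.close()
    conn.close()
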
Provisioned AWS Lambda functions and EC2 instances in the AWS environment, implemented
security groups, and administered Amazon VPCs.
Designed and developed a security framework to provide fine-grained access to objects in AWS
S3 using AWS Lambda and DynamoDB.
Implemented Lambda to configure the DynamoDB auto-scaling feature and implemented a data
access layer to access DynamoDB data.
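
An illustrative Lambda handler sketch for this kind of fine-grained S3 access (the table, bucket, and key names are hypothetical): it looks up a caller's entitlement in DynamoDB and, if allowed, returns a short-lived presigned URL for the requested S3 object.

    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")
    PERMISSIONS_TABLE = dynamodb.Table("object_permissions")  # hypothetical table

    def lambda_handler(event, context):
        user_id = event["user_id"]
        object_key = event["object_key"]

        # Entitlement check against DynamoDB (assumes a user_id/object_key key schema).
        item = PERMISSIONS_TABLE.get_item(
            Key={"user_id": user_id, "object_key": object_key}
        )
        if "Item" not in item:
            return {"statusCode": 403, "body": "access denied"}

        # Grant time-limited access instead of exposing the bucket directly.
        url = s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "secured-data-lake", "Key": object_key},  # hypothetical bucket
            ExpiresIn=900,  # 15-minute URL
        )
        return {"statusCode": 200, "body": url}
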
Developed Spark Applications for various business logic using Python.
Extracted, Transformed, and Loaded (ETL) data from disparate sources to Azure Data Storage
services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake
Analytics.
Worked with different file types such as CSV, JSON, flat, and Parquet files to load data from
source to raw tables.
Implemented Triggers to schedule pipelines.
Designed and developed Power BI graphical and visualization solutions with business
requirement documents and plans for creating interactive dashboards.
Created Build and Release for multiple projects (modules) in the production environment using
Visual Studio Team Services (VSTS).
Knowledge of StreamSets pipelines used for ingesting data from the Oracle source into the raw
layer.
Used Terraform scripts to automate provisioning of instances that had previously been launched manually.
Developed environments of different applications on AWS by provisioning on EC2 instances using
Docker, Bash, and Terraform.
Environment: Snowflake, Scala, PySpark, Python, SQL, AWS S3, Streamsets, Kafka 1.1.0, Sqoop, Spark
2.0, ETL, Power BI, Import and Export Data wizard, Terraform, Visual Studio Team Services.
Statefarm - Richardson, TX Jan 2022 - Oct 2022
Sr. Azure/Snowflake Python Data Engineer
Analyzed, developed, and built modern data solutions with Azure PaaS services to enable data
visualization. Assessed the application's current production state and the impact of new
installations on existing business processes.
Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse
Analytics (DW) & Azure SQL DB).
Extracted, transformed, and loaded data from source systems to Azure Data Storage services using
a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL,
Azure DW) and processed the data in Azure Databricks.

Created pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines to
extract, transform, and load data from sources such as Azure SQL, Blob Storage, Azure SQL
Data Warehouse, and a write-back tool.
Used Azure ML to build, test, and deploy predictive analytics solutions based on data.
Developed Spark applications with Azure Data Factory and Spark-SQL for data extraction,
transformation, and aggregation from different file formats to analyze and transform the data to
uncover insights into customer usage patterns.
Applied technical knowledge to architect solutions that meet business and IT needs, created
roadmaps, and ensured the long-term technical viability of new deployments, infusing key analytics
and AI technologies where appropriate (e.g., Azure Machine Learning, Machine Learning Server,
Bot Framework, Azure Cognitive Services, Azure Databricks, etc.).
Managed relational database service in which Azure SQL handles reliability, scaling, and
maintenance.
Integrated data storage solutions with Spark, particularly with Azure Data Lake storage and Blob
storage.
Configured stream analytics and event hubs and worked to manage IoT solutions with Azure.
Successfully completed a proof of concept for Azure implementation, with the larger goal of
migrating on-premises servers and data to the cloud.
Responsible for estimating cluster size, monitoring, and troubleshooting the Spark Databricks
cluster.
Experienced in tuning Spark applications for the proper batch interval, parallelism level, and
memory usage.
Extensively involved in analysis, design, and modeling; worked on Snowflake schema, data
modeling and elements, source-to-target mappings, interface matrices, and design elements.
Wrote UDFs in Scala and PySpark to meet specific business requirements.
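
A minimal PySpark UDF sketch (the column names, normalization rule, and path are hypothetical): it standardizes a policy-number column before downstream joins.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf_example").getOrCreate()

    @udf(returnType=StringType())
    def normalize_policy_no(raw):
        # Strip separators and upper-case, e.g. "ab-12 34" -> "AB1234".
        return None if raw is None else raw.replace("-", "").replace(" ", "").upper()

    df = spark.read.parquet("/data/policies")  # hypothetical path
    df = df.withColumn("policy_no", normalize_policy_no(col("policy_no_raw")))
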
Analyzed large data sets using Hive queries for Structured data, and unstructured and semi-
structured data.
Worked with structured data in Hive to improve performance using advanced techniques such as
bucketing, partitioning, and optimizing self-joins.
Wrote and used complex data types for storing and retrieving data using HQL in Hive.
Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the
data using the SQL Activity.
Used the Snowflake cloud data warehouse for integrating data from multiple source systems,
including nested JSON-formatted data, into Snowflake tables.
Hands-on experience in developing SQL Scripts for automation purposes.
Environment: Azure Data Factory(V2), Snowflake, Azure Databricks, Azure SQL, Azure Data Lake,
Azure Blob Storage, Hive, Azure ML, Scala, PySpark.
ICICI Bank - Hyderabad, India Feb 2021 - Jun 2021
Sr. Data Engineer
Performed multiple MapReduce jobs in Hive for data cleaning and pre-processing; loaded data
from Teradata tables into Hive tables.

Imported and exported data between HDFS and RDBMS using Sqoop and migrated data according
to client requirements.
Used Flume to collect, aggregate, and store web log data from different sources such as web
servers, and pushed it to HDFS.
Developed Big Data solutions focused on pattern matching and predictive modeling.
Involved in Agile methodologies, Scrum meetings, and Sprint planning.
Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode
recovery, capacity planning, and slot configuration.
Managed resources of the Hadoop cluster, including adding/removing cluster nodes for
maintenance and capacity needs.
Involved in loading data from UNIX file system to HDFS.
Partitioned the fact tables and materialized views to enhance the performance.
Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
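
An illustrative HiveQL sketch of such a partitioned and bucketed layout (the table, columns, and HiveServer2 host are hypothetical), submitted here from Python via PyHive; the same DDL can be run directly in the Hive CLI or beeline:

    from pyhive import hive

    conn = hive.Connection(host="hive-server.example.com", port=10000)  # hypothetical host
    cursor = conn.cursor()

    # External table partitioned by load date and bucketed on customer_id, stored as ORC,
    # so partition pruning and bucketed joins reduce the data scanned per query.
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
            customer_id BIGINT,
            amount      DOUBLE,
            event_ts    TIMESTAMP
        )
        PARTITIONED BY (load_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
        LOCATION '/data/warehouse/sales_events'
    """)
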
Involved in integrating hive queries into the Spark environment using Spark SQL.
Used Hive to analyze the partitioned and bucketed data to compute various metrics for
reporting.
Improved performance of the tables through load testing using the Cassandra stress tool.
Involved with the admin team to set up, configure, troubleshoot, and scale the hardware on a
Cassandra cluster.
Created data models for customers' data using Cassandra Query Language (CQL).
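
A minimal sketch of a query-driven CQL model created with the Python driver (the keyspace, table, and contact point are hypothetical): partitioning by customer_id and clustering by event time keeps a customer's recent activity in a single partition read.

    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.11"])          # hypothetical contact point
    session = cluster.connect("customer_ks")  # hypothetical keyspace

    session.execute("""
        CREATE TABLE IF NOT EXISTS customer_activity (
            customer_id text,
            event_time  timestamp,
            event_type  text,
            details     text,
            PRIMARY KEY ((customer_id), event_time)
        ) WITH CLUSTERING ORDER BY (event_time DESC)
    """)
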
Developed and ran Map-Reduce Jobs on YARN and Hadoop clusters to produce daily and
monthly reports as per user's needs.
Experienced in connecting Avro Sink ports directly to Spark Streaming for analysis of weblogs.
Addressed performance tuning of Hadoop ETL processes against very large data sets and worked
directly with statisticians on implementing solutions involving predictive analytics.
Performed Linux operations on the HDFS server for data lookups, job changes when commits were
disabled, and rescheduling of data storage jobs.
Created data processing pipelines for data transformation and analysis by developing spark jobs
in Scala.
Testing and validating database tables in relational databases with SQL queries, as well as
performing Data Validation and Data Integration. Worked on visualizing the aggregated datasets
in Tableau.
Migrated code to version control using Git commands for future use and to ensure a smooth
development workflow.
Environment: Hadoop, Spark, MapReduce, Hive, HDFS, YARN, MobaXterm, Linux, Cassandra, NoSQL
databases, Python, Spark SQL, Tableau, Flume, Spark Streaming.
Accenture - Bangalore Jul 2018 - Jan 2021
Sr. AWS/ Data Engineer
Developed Apache Presto and Apache Drill setups in an AWS EMR (Elastic MapReduce) cluster to
combine multiple databases such as MySQL and Hive; this enabled comparing the results of
operations like joins and inserts across various data sources through a single platform.

Wrote AWS Lambda functions in Scala with cross-functional dependencies that generated custom
libraries for delivering the Lambda functions in the cloud.
Wrote to the Glue metadata catalog so the improved data could be queried from Athena, resulting
in a serverless querying environment.
Created PySpark DataFrames to bring data from DB2 to Amazon S3.
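
A minimal PySpark sketch of this kind of DB2-to-S3 pull (the connection details, table, and bucket are hypothetical): read the table over JDBC and land it in S3 as Parquet.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("db2_to_s3").getOrCreate()

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:db2://db2-host:50000/SALESDB")   # hypothetical connection
        .option("driver", "com.ibm.db2.jcc.DB2Driver")        # DB2 JDBC driver on the classpath
        .option("dbtable", "SALES.ORDERS")                    # hypothetical source table
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    # Write to the raw zone in S3 as Parquet (hypothetical bucket/prefix).
    orders.write.mode("overwrite").parquet("s3a://raw-zone/sales/orders/")
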
Worked on Kafka backup indexing, minimized logs with a Log4j appender, and pointed Ambari
server logs to NAS storage.
Used the Curator API on Elasticsearch to back up and restore data.
Created AWS RDS (Relational Database Service) instances to serve as the Hive metastore and
combined the metadata of 20 EMR clusters into a single RDS instance, which avoids metadata loss
even when an EMR cluster is terminated.
Built a Full-Service Catalog System that has an entire workflow using Elasticsearch, Kinesis, and
CloudWatch.
Leveraged cloud-provider services when migrating on-prem MySQL clusters to AWS RDS MySQL,
provisioned multiple AWS AD forests with AD-integrated DNS, and utilized Amazon ElastiCache for
Redis.
Used AWS CodeCommit repositories to store programming logic and scripts and make them
available to new clusters.
Spun up EMR clusters of 30 to 50 memory-optimized nodes (such as R2, R4, X1, and X1e instances)
with the autoscaling feature.
With Hive being the primary query engine on EMR, created external table schemas for the data
being processed.
Mounted a local directory path to Amazon S3 using s3fs-fuse so that KMS encryption was enabled
on the data reflected in the S3 buckets.
Designed and implemented ETL pipelines on S3 parquet files on data lake using AWS Glue.
Migrated the data from Amazon Redshift data warehouse to Snowflake.
Used the AWS Glue catalog with a crawler to get data from S3 and perform SQL query operations,
and used a JSON schema to define table and column mappings from S3 data to Redshift.
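
An illustrative AWS Glue job sketch for this catalog-to-Redshift flow (the catalog database, table, Glue connection, and staging bucket names are hypothetical): it reads the crawled S3 table from the Glue Data Catalog and writes it to Redshift through a catalog connection.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table the crawler registered in the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_catalog_db",        # hypothetical catalog database
        table_name="s3_events",           # hypothetical crawled table
    )

    # Load into Redshift via a Glue catalog connection, staging through S3 for COPY.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection="redshift-conn",                   # hypothetical connection
        connection_options={"dbtable": "public.events", "database": "analytics"},
        redshift_tmp_dir="s3://glue-temp-bucket/redshift/",    # hypothetical staging dir
    )

    job.commit()
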
Applied auto-scaling techniques to scale instances in and out based on memory utilization over
time, which helped reduce the number of instances when the cluster was not actively in use. This
was applied while considering Hive's replication factor of 2, leaving a minimum of 5 instances
running.
Environment: Amazon Web Services, Elastic Map Reduce cluster, EC2s, CloudFormation, Amazon S3,
Amazon Redshift, Hive, Scala, PySpark, Snowflake, Shell Scripting, Tableau, Kafka.
Menlo Technologies - Hyderabad Apr 2016 - Jun 2018
Big Data Engineer
Imported real-time weblogs using Kafka as a messaging system and ingested the data into Spark
Streaming.
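
A minimal Spark Structured Streaming sketch of such a weblog ingest (the brokers, topic, and paths are hypothetical; the spark-sql-kafka package is assumed to be on the classpath): it consumes events from Kafka and appends them to HDFS as Parquet.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("weblog_stream").getOrCreate()

    weblogs = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
        .option("subscribe", "weblogs")                                  # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
        .select(col("value").cast("string").alias("raw_log"), col("timestamp"))
    )

    query = (
        weblogs.writeStream.format("parquet")
        .option("path", "hdfs:///data/streaming/weblogs/")         # hypothetical sink path
        .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
        .start()
    )
    query.awaitTermination()
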
Implemented data quality checks using Spark Streaming and arranged bad and passable flags on
the data.
Developed business logic using Kafka and Spark Streaming and implemented business
transformations.

Supported Continuous storage in AWS using Elastic Block Storage, S3, and Glacier. Created
Volumes and configured Snapshots for EC2 instances.
Developed Spark code using Scala and Spark SQL for faster processing and testing.
Worked on loading CSV/TXT/Avro/Parquet files using Scala in the Spark framework, processed the
data by creating Spark DataFrames and RDDs, and saved the files in Parquet format in HDFS to
load into the fact table using the ORC reader.
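
A PySpark equivalent sketch of this multi-format load (the paths and columns are hypothetical; reading Avro assumes the external spark-avro package, and unionByName with allowMissingColumns requires Spark 3.1+):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi_format_load").getOrCreate()

    csv_df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("hdfs:///landing/orders_csv/")          # hypothetical landing paths
    )
    avro_df = spark.read.format("avro").load("hdfs:///landing/orders_avro/")

    # Align the two sources on column names and persist as Parquet for the fact load.
    combined = csv_df.unionByName(avro_df, allowMissingColumns=True)
    combined.write.mode("append").parquet("hdfs:///curated/orders/")
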
Involved in data loading using PL/SQL and SQL*Loader calling UNIX scripts to download and
manipulate files.
Involved in creating data models for customer data using Cassandra Query Language. Performed
benchmarking of the NoSQL databases Cassandra and HBase.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop
using Spark Context, Spark SQL, DataFrames, and pair RDDs.
Processed the schema-oriented and non-schema-oriented data using Scala and Spark.
Configured Spark streaming to get ongoing information from Kafka and store the stream
information to HDFS.
Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service
for messaging systems by maintaining feeds.
Involved in loading data from REST endpoints into Kafka producers and transferring the data
to Kafka brokers.
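
A minimal sketch of such a REST-to-Kafka feed using kafka-python (the endpoint, topic, and brokers are hypothetical):

    import json

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],                       # hypothetical brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Pull a batch of records from the REST endpoint and publish each to the raw topic.
    response = requests.get("https://api.example.com/v1/events", timeout=30)  # hypothetical endpoint
    for record in response.json():
        producer.send("events_raw", value=record)   # hypothetical topic

    producer.flush()
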
Ran many performance tests using the cassandra-stress tool to measure and improve the read
and write performance of the cluster.
Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data
aggregation and queries, and wrote data back into the OLTP system through Sqoop.
Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
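
An illustrative sketch of flattening nested JSON with Spark DataFrames (the document layout, field names, and paths are hypothetical): nested structs are promoted to columns and arrays exploded before writing a flat delimited file.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten_json").getOrCreate()

    docs = spark.read.json("hdfs:///landing/customer_docs/")     # hypothetical input path

    flat = (
        docs.select(
            col("customer.id").alias("customer_id"),             # hypothetical nested fields
            col("customer.name").alias("customer_name"),
            explode(col("orders")).alias("order"),               # one row per order
        )
        .select("customer_id", "customer_name",
                col("order.order_id"), col("order.amount"))
    )

    flat.write.option("header", "true").csv("hdfs:///curated/customer_orders_flat/")
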
Environment: Spark (RDDs, DataFrames, UDFs), Kafka, various file formats, Scala, AWS S3, Oracle
SQL, Cassandra, Hive.