INDU JAVVAJI
Sr. Data Engineer / AWS / Azure

Plano, Texas, USA | Ph: 913-717-6048 | Email: indujavvaji09@gmail.com

PROFESSIONAL SUMMARY:

Experienced AWS Data Engineer with 10+ years in designing, developing, and optimizing data pipelines, data lakes, and analytics solutions using AWS services. Expertise in big data processing, data modeling, performance tuning, and real-time streaming. Proven ability to architect scalable, fault-tolerant, and cost-efficient solutions for handling structured, semi-structured, and unstructured data.

Good experience with distributed systems architecture and parallel processing; in-depth understanding of the MapReduce framework and the Spark execution framework.
Experience in setting up, designing, and building Hadoop-based applications and processes.
Experienced Data Center Engineer with expertise in Linux Administration, Data Center Operations, and LSF Grid support.
Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala and Python.
Experience writing AWS Lambda functions in Python that invoke scripts to perform transformations and analytics on large data sets in EMR clusters (an illustrative sketch follows this summary).
Expertise in writing end to end Data Processing Jobs to analyze data using MapReduce, Spark and Hive.
Good knowledge of the Spark Architecture, which includes Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors, and Tasks.
Experience programming in Python, Scala, Java, SQL, and shell scripting.
Experience automating infrastructure for Kafka clusters, using Terraform to create multiple EC2 instances for the different components in the cluster.
Hands-on practical exposure to Scala, Spark, ETL, DynamoDB, Redshift, Kinesis, Lambda, Glue, and Snowflake.
Hands-on experience integrating Flume with Kafka, using Flume as both a producer and a consumer; also used Kafka for activity tracking and log aggregation.
Hands-on experience with Azure analytics services: Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure SQL DB, Azure Data Factory (ADF), Azure Databricks (ADB), Azure Cloud, Azure DevOps, Azure Synapse Analytics, Azure Cosmos DB (NoSQL), etc.; built Azure environments by deploying Azure IaaS virtual machines (VMs) and cloud services (PaaS).
Experience migrating SQL databases to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse.
Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
Demonstrated understanding of the Fact/Dimension data warehouse design model, including star and snowflake design methods.
Expertise in creating Spark applications that use Spark SQL to extract, transform, and aggregate data from various file formats to gain insights into customer usage patterns.
Proficient understanding of big data processing technologies, particularly Hive, Spark, and SQOOP.
Migrated an existing on-premises application to AWS; used EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
Extensive experience using Apache Sqoop to import and export data between RDBMS systems and the Hadoop ecosystem. Substantial experience with big data infrastructure tools such as Python and Redshift; also proficient in Scala, Spark, and Spark Streaming.
Proven ability to create Hive tables utilizing a variety of file formats, including CSV, Avro, Sequence, ORC, and Parquet.
Implemented a general, highly available ETL framework using Spark to bring relevant data from diverse sources into Hadoop and Cassandra. Involved in implementing and integrating NoSQL databases such as HBase and Cassandra.
Experience with data modeling concepts such as star and snowflake schemas.
Proficient in Unix/Linux command-line operations, server management, troubleshooting, and optimizing high-performance computing (HPC) clusters.
Skilled in data center discovery, migration, and automation using shell scripting and Linux tools.
Take strong initiative in analytical thinking and creative problem-solving for any given challenge within the system.
Effective interpersonal skills, great communication abilities, capacity to work alone as well as in teams, rigorous work ethic, and high degree of motivation.
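
Illustrative sketch for the AWS Lambda / EMR bullet above: a minimal Python Lambda handler that submits a Spark step to a running EMR cluster when a new object lands in S3. This is a hedged example only; the cluster ID, script path, and bucket names are hypothetical placeholders, not details from any engagement listed below.

import boto3

# Hypothetical sketch: an S3-triggered Lambda handler that submits a
# spark-submit step to an existing EMR cluster. All identifiers are placeholders.
emr = boto3.client("emr")

CLUSTER_ID = "j-XXXXXXXXXXXXX"                          # placeholder EMR cluster id
SCRIPT_URI = "s3://example-bucket/jobs/transform.py"    # placeholder PySpark script

def lambda_handler(event, context):
    # Read the S3 object that triggered the event (S3 -> Lambda notification).
    record = event["Records"][0]["s3"]
    source_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Submit a spark-submit step to the running EMR cluster.
    response = emr.add_job_flow_steps(
        JobFlowId=CLUSTER_ID,
        Steps=[{
            "Name": f"transform {source_path}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         SCRIPT_URI, "--input", source_path],
            },
        }],
    )
    return {"stepIds": response["StepIds"]}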

Key Competencies:
Data Center Discovery and Migration
Unix/Linux Administration (Red Hat, Ubuntu, CentOS)
High-Performance Computing (HPC)
LSF Grid Support & Cluster Management
Data Center Operations & Server Maintenance
Shell Scripting for Automation
System Monitoring & Performance Tuning


CERTIFICATIONS:
Azure Data Engineer Associate (DP-203)
Microsoft Certified: Azure Solutions Architect Expert

TECHNICAL SKILLS:
Cloud Platforms: Microsoft Azure, Amazon Web Services (AWS), GCP
Monitoring & Performance: Nagios, Prometheus, Grafana
HPC & Grid Computing: LSF Grid, Slurm, Kubernetes
Big Data Technologies: HDFS, MapReduce, Pig, Hive, HBase, Sqoop, Spark, PySpark, Kafka, Flume, Apache Cassandra
Hadoop Ecosystems: MapR, Cloudera, AWS EMR, Hortonworks
ETL Tools: Azure Data Factory, SSIS (SQL Server Integration Services), Informatica, IBM DataStage
Databases: MySQL, PostgreSQL, SQL Server, Oracle, DB2, Cassandra
Languages: Python, Java, Scala, C, C#, R, JavaScript, PHP
Visualization Tools: Power BI, Tableau
Data Modeling: ER diagrams, Dimensional data modeling, Star and Snowflake Schema
Operating Systems: Red Hat Enterprise Linux (RHEL), Ubuntu, CentOS, Unix
Scripting: Shell scripting (Linux/Unix), Python scripting, Bash
Version Control: Git, GitHub
Other Tools and Technologies: Microsoft Visual Studio, Jupyter Notebook, Anaconda, PyCharm, Apache Airflow, Docker, Kubernetes

WORK HISTORY:

CLIENT: CAPITAL GROUP | Sr. Data Engineer | DEC 2022 TO PRESENT
RESPONSIBILITIES:
Used Python to implement simple and complex Spark jobs for data analysis across various data formats.
Managed Linux-based servers in a data center environment, ensuring high availability and system uptime.
Provided LSF Grid Support by monitoring job scheduling, load balancing, and cluster resource management.
Automated Linux system maintenance using Bash and Ansible, reducing manual intervention by 40%.
Troubleshot server performance issues, kernel tuning, and OS-level optimizations.
Configured and optimized LSF Grid clusters to enhance parallel job execution.
Created external and normal tables and views in Snowflake database.
Responsible for delivering datasets from Snowflake to the One Lake data warehouse; built a CI/CD pipeline using Jenkins and AWS Lambda, and imported data from DynamoDB to Redshift in batches using AWS Batch with the TWS scheduler.
Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
Experience developing microservices with Spring Boot using Java and Scala.
Developed code in Spark SQL for implementing Business logic with Python as programming language.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis.
Applied Python Spark scripts to categorize data according to various types of records, and assisted with Spark cluster monitoring.
Installed and configured Apache Airflow for the AWS S3 bucket and created DAGs to run in Airflow.
Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
Worked on the code transfer of a quality monitoring application from AWS EC2 to AWS Lambda, as well as the construction of logical datasets to administer quality monitoring on snowflake warehouses.
Developed upgrade and downgrade scripts to move data from tables into Spark-Redis for quick access by a large client base without sacrificing performance.
In accordance with project proposals, coordinated with end users to build and deploy analytics solutions for Python-based user-based recommendations.
Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to Snowflake (see the sketch after this list).
Worked with Python and Scala to transform Hive/SQL queries into Spark (RDDs, Data frames, and Datasets).
Expertise in using the Scala programming language to build microservices.
Expertise utilizing Spark SQL to manage Hive queries in an integrated Spark environment.
Participated in daily scrum sessions and the story-driven agile development process.
Creating data frames and datasets using Spark and Spark Streaming, then performing transformations and actions.
Experience with Kafka publish-subscribe messaging used as a distributed commit log.
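
Illustrative sketch for the Kafka-to-Snowflake streaming bullet above: a minimal PySpark Structured Streaming job that reads JSON events from a Kafka topic and appends each micro-batch to Snowflake via foreachBatch. It assumes the Kafka and spark-snowflake connector packages are on the classpath; brokers, topic, schema, table, and connection options are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-snowflake").getOrCreate()

# Placeholder event schema.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

# Assumed Spark-Snowflake connector options; all values are placeholders.
sf_options = {
    "sfURL": "example.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "RAW",
    "sfWarehouse": "ETL_WH",
}

# Read the Kafka topic as a stream and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def write_batch(batch_df, batch_id):
    # Append each micro-batch to a Snowflake table using the connector.
    (batch_df.write.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "EVENTS")                    # placeholder table
        .mode("append")
        .save())

query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()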

Environment: Hadoop, Scala, Spark, Hive, Teradata, Tableau, Linux, Python, Java, Kafka, AWS S3 Buckets, AWS Glue, NIFI, Postgres, Snowflake, AWS EC2, Oracle PL/SQL, Flink, Development toolkit (JIRA, Bitbucket/Git, Service now etc.,)

CLIENT: HCSC INSURANCE, TX, USA. | Sr. Data Engineer | MARCH 2020 TO NOV 2022
RESPONSIBILITIES:
Extensively worked with Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Created pipelines in Azure Data Factory (ADF) using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob storage, and Azure SQL DW, including write-back.
Created Application Interface Document for the downstream to create new interface to transfer and receive the files through Azure Data Share.
Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.
Experience in building and architecting multiple Data pipelines, end to end ETL and ELT process for Data ingestion and transformation in GCP.
Worked on PowerShell scripts to automate creation of Azure resources such as resource groups, web applications, Azure Storage blobs and tables, and firewall rules.
Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats to uncover insights into customer usage patterns (see the sketch after this list).
Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
Created and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
Created data pipeline for different events in Azure Blob storage into Hive external tables. Used various Hive optimization techniques like partitioning, bucketing and Map join.
Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also Worked with Cosmos DB (SQL API and Mongo API).
Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
Developed automated job flows in Oozie, run daily and on demand, which execute MapReduce jobs internally.
Extracted Tables and exported data from Teradata through Sqoop and placed in Cassandra.
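
Illustrative sketch for the PySpark / Spark SQL bullet above: a minimal Databricks-style job that reads Parquet and CSV sources, joins them, and summarizes usage by customer segment. Paths, column names, and tables are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Multiple input formats: raw usage events as Parquet, customer reference data as CSV.
usage = spark.read.parquet("/mnt/datalake/raw/usage_events")            # placeholder path
customers = (spark.read.option("header", True)
             .csv("/mnt/datalake/reference/customers.csv"))             # placeholder path

usage.createOrReplaceTempView("usage_events")
customers.createOrReplaceTempView("customers")

# Spark SQL transformation: daily usage per customer segment.
daily_usage = spark.sql("""
    SELECT c.segment,
           to_date(u.event_ts)      AS usage_date,
           COUNT(*)                 AS events,
           SUM(u.bytes_transferred) AS total_bytes
    FROM usage_events u
    JOIN customers c ON u.customer_id = c.customer_id
    GROUP BY c.segment, to_date(u.event_ts)
""")

# Write the aggregate back to the lake, partitioned by date for downstream reads.
(daily_usage.write.mode("overwrite")
    .partitionBy("usage_date")
    .parquet("/mnt/datalake/curated/daily_usage"))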

Environment: Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Azure SQL, GCP, Snowflake, MongoDB, Cassandra, Teradata, Ambari, Flume, Tableau, PowerBI, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, PySpark, Airflow, Hive, Sqoop, HBase, Oozie.

CLIENT: COX COMMUNICATIONS, ATLANTA, GA | SR. DATA ENGINEER | DEC 2017 TO FEB 2020
RESPONSIBILITIES:
Participated in all stages of the SDLC, including requirement analysis, design, coding, testing, and production, for a big data project.
Extensively used Sqoop to import/export data between RDBMS and Hive tables, creating Sqoop jobs that track the last saved value to perform incremental imports.
Participated in implementing the data preparation solution, covering the processing of user stories and data transformation.
Contributed to the Hadoop migration of current SQL data and reporting streams.
Set up an HBase table and a shell script to automate the ingestion procedure.
Built external Hive tables on top of HBase for use in creating feeds.
Scheduled automated run for Talend Open Studio's production ETL data pipelines for Big Data.
Worked on moving an existing feed from Hive to Spark; the existing HQL was changed to run using Spark SQL and HiveContext to decrease feed latency (see the sketch after this list).
Investigating ways to use Spark to enhance the functionality and optimization of the current Hadoop algorithms using Spark Context, Spark-SQL, Data Frames, and Pair RDDs.
Developed programs in Python, Java, C#, Spark SQL, Hive, and Spark SQL to organize incoming data and create data pipelines that produce actionable insights.
Converted the distributed collection of data with named columns using Scala's Data frame API.
Created Spark and Hive jobs to transform and summarize data.
Employed Spark for interactive queries, streaming data processing, and HBase integration with HBase database for large volumes of data.
Wrote Python MapReduce code to remove security-sensitive information from the data.
Set up Splunk forwarders, worked on monitoring logs using Splunk, and created Splunk dashboards.
Working knowledge of using Spark SQL to process data from Hive tables.
Worked on Airflow job triggers written in Scala and Hive Query Language, making it easier to read, backfill, and write data from Hive tables to HDFS locations for a specific time frame.
Worked with Spark RDD, Scala, and Python to translate Hive/SQL queries into Spark transformations.
Developed Pig Latin scripts to carry out transformations (ETL) in accordance with the use case specification.
Created dispatcher jobs to send data into Teradata target tables using Sqoop export.
Developed HQL scripts to carry out data validation after transformations are completed in accordance with the use case.
Applied Snappy compression to HBase tables to reclaim space in the cluster.
Added a SQL layer on top of HBase, using the salting feature, to achieve optimal read and write performance.
Developed shell scripts that can be scheduled and called from a scheduler to automate the procedure.
Wrote Hive scripts to split the data and load the historical data.
Worked closely with the on-site and offshore teams.
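
Illustrative sketch for the Hive-to-Spark migration bullet above: running an existing HQL feed through Spark SQL with Hive support enabled. The database, table, and query are hypothetical stand-ins for the original feed.

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL read Hive metastore tables directly, which is
# how an existing HQL feed can be re-run on Spark to reduce feed latency.
spark = (SparkSession.builder
         .appName("hive-feed-on-spark")
         .enableHiveSupport()
         .getOrCreate())

run_date = "2019-06-01"   # placeholder partition date

# The original HQL, executed unchanged via Spark SQL (placeholder query).
feed = spark.sql(f"""
    SELECT account_id,
           event_date,
           SUM(usage_minutes) AS total_minutes
    FROM telemetry.daily_usage
    WHERE event_date = '{run_date}'
    GROUP BY account_id, event_date
""")

# Persist the feed output as an ORC table for downstream consumers.
feed.write.mode("overwrite").format("orc").saveAsTable("curated.daily_usage_feed")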

Environment: Hive, Hadoop, Spark, Scala, Pyspark, Python, Sqoop, Kafka, AWS EMR, S3 Buckets, Oracle.

CLIENT: PNC FINANCIAL, PA. | DATA ENGINEER | NOV 2013 TO OCT 2015
RESPONSIBILITIES:
Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms.
Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
Troubleshoot and resolve data processing issues and proactively engaged in data modelling discussions.
Worked on RDD Architecture and implementing spark operations on RDD and optimizing transformations and actions in Spark.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, ADF, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Using Azure Data Factory, created data pipelines and data flows and triggered the pipelines.
Written programs in Spark using Python, PySpark and Pandas packages for performance tuning, optimization, and data quality validations.
Worked on developing Kafka producers and Kafka consumers for streaming millions of events per second (see the sketch after this list).
Experience with deployment automation and containerization (Docker, Kubernetes).
Worked on Tableau to build customized interactive reports, worksheets, and dashboards.
Developed Spark Programs using Scala and Java API's and performed transformations and actions on RDD's.
Expertise in building PySpark, Spark Java and Scala applications for interactive analysis, batch processing, and stream processing.
Experience working with SparkSQL and creating RDD's using PySpark. Extensive experience working with ETL of large datasets using PySpark in Spark on HDFS.
Designed and developed data warehouse and business intelligence architecture. Designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
Used all major ETL transformations to load the tables through Informatica mappings.
Implemented Cursors into the stored procedures for frequent commits to avoid large data loading failures.
Achieved organizational goals by developing, integrating, and testing software applications for several clients.
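
Illustrative sketch for the Kafka producer/consumer bullet above, using the kafka-python client: one producer that publishes JSON events and one consumer group that reads them back. Broker address, topic name, and payload fields are hypothetical placeholders.

import json
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["broker:9092"]          # placeholder broker list
TOPIC = "transactions"             # placeholder topic

# Producer: serialize events as JSON and publish them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"txn_id": "t-001", "amount": 42.50})
producer.flush()

# Consumer: read events from the topic as part of a consumer group.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="fraud-scoring",                              # placeholder group
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    event = message.value
    # Placeholder processing step; a real pipeline would score or route the event.
    print(event["txn_id"], event["amount"])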

Environment: HDFS, Python, SQL, MapReduce, Spark, Kafka, Hive, Yarn, Zookeeper, informatica, Sqoop, PowerBI, Azure, GitHub, Shell Scripting, RDBMS, ETL, PySpark, Hadoop.

EDUCATION DETAILS:
Bachelor's in Computer Science, 2013