
KUSHWANTH
GCP Data Engineer
Email: [email protected]
Location: Memphis, Tennessee, USA
Relocation: Yes
Visa: GC
Professional Summary:

Over 10 years of professional experience in software development, with comprehensive experience developing Big Data applications using Apache Spark, Kafka, Apache Hadoop ecosystem components, Spark Streaming, AWS, Google Cloud infrastructure, and Java application development.
Designed and developed many large-scale, batch & real-time big data applications that use Scala, Java, Python, Spark and other Hadoop ecosystem components.
Worked extensively on AWS Cloud services such as S3, EC2, EMR, Athena, Redshift, RDS and DynamoDB for developing data pipelines and data analysis.
Worked extensively on GCP Cloud services such as GCS (Google Cloud Storage), Compute Engine, Dataproc, Cloud SQL, BigQuery and Bigtable for building data lakes on the Google Cloud Platform.
Hands on experience in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools like HDFS, MapReduce, Yarn, Spark, Sqoop, Hive, Pig, Flume, Kafka, Impala, Oozie and HBase.
Hands on experience in programming using Java, Python, Scala and SQL.
Sound knowledge of architecture of Distributed Systems and parallel processing frameworks.
Designed and implemented end-to-end data pipelines to extract, cleanse, process and analyze huge amounts of behavioral data and log data.
Good experience working with GCP Cloud services like GCS, Dataproc, BigQuery, Cloud SQL, Composer, etc.
Worked extensively on fine-tuning Spark applications to improve performance and troubleshooting failures in Spark applications.
Strong experience in using Spark Streaming, Spark SQL and other Spark components such as accumulators, broadcast variables, different levels of caching and optimization techniques for Spark jobs (a broadcast-join example is sketched after this list).
Experienced in using distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, Elasticsearch), Hadoop, Scala, Python and Spark, and in the effective use of MapReduce, SQL and Cassandra to solve big data problems.
Strong experience writing custom UDFs in Spark and Hive.
Strong experience in creating partitioned Hive tables, bucketing and other standard optimization practices in Hive.
Expertise in Extraction, Transformation, and Loading (ETL) processes, including UNIX shell scripting, SQL, PL/SQL, and SQL*Loader.
Proficient in importing/exporting data from RDBMS to HDFS using Sqoop.
Used Hive extensively to perform various data analytics required by business teams.
Good understanding of Data Modeling (Dimensional and Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension Tables.
Solid experience working with various data formats such as Parquet, ORC, Avro and JSON.
Experience in large-scale SDLC (Software Development Life Cycle), including Requirements Analysis, Project Planning, System and Database Design, and Object-Oriented Analysis and Design, utilizing Agile Scrum and Waterfall methodologies.
Proficient in deploying applications using the Maven build tool, Jenkins, and Docker for Continuous Integration/Continuous Deployment (CI/CD).
Experience working with version control systems/source code repository tools like Git and GitHub, including maintaining and troubleshooting the CM tool in a Windows environment.
Extensive experience in testing applications using JUnit, Mockito, and logging frameworks like Log4j.
Experience automating end-to-end data pipelines with strong resilience and recoverability.
Worked on Spark Streaming and Spark Structured Streaming, including Kafka, for real-time data processing.
Experience utilizing tools for microservices architecture applications using Spring Boot, Spring Cloud Config, MySQL and RESTful web services.
Good experience using various file formats like Parquet, ORC, Avro, JSON, CSV, etc.
Good working experience in utilizing various compression and serialization techniques such as Snappy, LZO and Avro.
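A minimal PySpark sketch of the broadcast-join and caching pattern referenced above; the bucket paths, dataset names, and columns (events, dim_accounts, account_id) are hypothetical placeholders for illustration only.

```python
# Minimal sketch: broadcast join of a small dimension against a large fact
# dataset, with caching because the enriched frame is reused downstream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

events = spark.read.parquet("gs://my-bucket/events/")          # large fact data
accounts = spark.read.parquet("gs://my-bucket/dim_accounts/")  # small dimension

# Broadcasting the small dimension avoids a full shuffle join.
enriched = events.join(F.broadcast(accounts), on="account_id", how="left").cache()

daily_counts = (enriched
                .groupBy("event_date", "account_type")
                .agg(F.count("*").alias("event_count")))
daily_counts.write.mode("overwrite").parquet("gs://my-bucket/output/daily_counts/")
```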


Technical Skills:

Hadoop/Big Data: Spark, Kafka, Hive, HBase, Pig, HDFS, MapReduce, Sqoop, Oozie, Tez, Impala, Ambari, YARN
AWS Components: EC2, EMR, S3, RDS, CloudWatch, Athena, Redshift, DynamoDB, Lambda
GCP Components: Dataproc, BigQuery, Compute Engine, Cloud Storage, Bigtable
Programming Languages: Scala, Python, Java
Operating Systems: Linux, Windows, CentOS, Ubuntu, RHEL, Unix
SQL Databases: MySQL, Oracle, MS SQL Server, Teradata
NoSQL Databases: HBase, DynamoDB, Bigtable
Web Technologies: Spring, Hibernate, Spring Boot
Tools: IntelliJ, Eclipse
Scripting Languages: Python, Shell

Professional Experience:

Client: CITI BANK, Dallas, Texas April 2023 to Present
Role: Sr. GCP Data Engineer
Responsibilities:

Orchestrated the ingestion of substantial volumes of marketing data and Adobe Analytics Clickstream data across diverse channels directly into a GCS-backed data lake.
Managed the ingestion of user profile information and account details from the internal data warehouse into the GCS data lake.
Engineered resilient, reusable, and scalable data-driven solutions and data pipelines utilizing Dataflow frameworks. These automated the ingestion, processing, and delivery of both structured and semi-structured batch and real-time streaming data.
Implemented Dataproc clusters to support Spark framework functionality.
Developed Spark applications for tasks such as data cleansing, event enrichment, data aggregation, de-normalization, and data preparation essential for machine learning and reporting teams.
Utilized PySpark for ingesting large volumes of customer profile data and clickstream data from diverse sources.
Employed Kafka and Google Pub/Sub to collect data into GCS and processed it through Dataproc Clusters.
Conducted performance tuning and troubleshooting of MapReduce jobs, analyzing Hadoop log files, and working on Dataproc.
Worked on enhancing the fault tolerance of Spark applications.
Constructed Data Flow pipelines for ingesting data from source systems and applied Scala and PySpark transformations using Dataproc clusters.
Fine-tuned Spark applications to improve overall processing time for the pipelines.
Developed Kafka producers to stream data from external REST APIs to Kafka topics.
Created Spark Streaming applications to consume data from Kafka topics and wrote the processed streams to BigQuery (a minimal streaming sketch follows this list).
Proficient in handling large datasets using Spark's in-memory capabilities, employing broadcast variables, effective joins, transformations, and other features.
Developed Spark applications utilizing DataFrames and Spark-SQL for data extraction, transformation, and aggregation from various file formats.
Established a data lake in Google Cloud Platform (GCP) to empower business teams to perform data analysis in BigQuery.
Collected logs from physical machines and the OpenStack controller, integrating them into HDFS using Flume.
Built HBase tables to load massive amounts of structured, semi-structured, and unstructured data from UNIX, NoSQL, and various portfolios.
Used StreamSets Data Collector to create ETL pipelines for pulling data from RDBMS systems into HDFS.
Extensive experience with Google Dataproc clusters in GCP and working with GCS for storing both raw data and processed datasets.
Contributed to creating data warehouse infrastructure using BigQuery.
Worked on pipelines to push processed data into the BigQuery data warehouse.
Involved in creating external Hive tables, loading and analyzing data using Hive scripts.
Implemented Partitioning, Dynamic Partitions, and Bucketing in Hive.
Conducted troubleshooting of Dataproc clusters and optimized BigQuery performance.
Proficient in continuous integration of applications using Jenkins.
Utilized reporting tools like Tableau to connect with BigQuery for generating daily data reports.
Designed and documented operational problems following standards and procedures using JIRA.
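A minimal sketch of the Kafka-to-BigQuery streaming pattern described above, assuming the Kafka source and the spark-bigquery connector are available on the Dataproc cluster; the broker address, topic, schema, staging bucket, and table names are hypothetical placeholders.

```python
# Minimal sketch: Spark Structured Streaming job consuming a Kafka topic and
# writing parsed records to BigQuery through the spark-bigquery connector.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-bigquery").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "clickstream-events")
       .load())

parsed = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# The connector stages each micro-batch in GCS before loading it into BigQuery.
query = (parsed.writeStream
         .format("bigquery")
         .option("table", "my-project.analytics.clickstream_events")
         .option("temporaryGcsBucket", "my-staging-bucket")
         .option("checkpointLocation", "gs://my-staging-bucket/checkpoints/clickstream/")
         .outputMode("append")
         .start())
query.awaitTermination()
```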

Environment: Google Dataproc, Spark, PySpark, Hive, GCP, GCS, BigQuery, Cloud Composer, Cloud Functions, Kafka, Bigtable, Unix.

Client: Molina HealthCare, Bothell, WA August 2020 to March 2023
Role: GCP Data Engineer / Big Data
Responsibilities:

Managed infrastructure across multiple projects within the organization on the Google Cloud Platform using Terraform, adhering to the principles of Infrastructure as Code (IaC).
Enhanced the performance of existing BigQuery and Tableau reporting solutions through optimization techniques such as partitioning key columns and thorough testing under different scenarios.
Developed Extract, Load, Transform (ELT) processes utilizing various data sources, including Ab Initio, Google Sheets in GCP, and computing resources like Dataprep, Dataproc (PySpark), and BigQuery.
Successfully migrated an Oracle SQL ETL process to run on the Google Cloud Platform, leveraging Cloud Dataproc, BigQuery, and Cloud Pub/Sub for orchestrating Airflow jobs.
Proficiently utilized Presto, Hive, Spark SQL, and BigQuery alongside Python client libraries to craft efficient and interoperable programs for analytics platforms.
Extensive hands-on experience with Google Cloud Platform's big data-related services.
Deployed Apache Airflow within a GCP Composer environment to construct data pipelines, utilizing various Airflow operators such as BashOperator, Hadoop operators, PythonOperator, and BranchPythonOperator (a minimal DAG is sketched after this list).
Developed innovative techniques for orchestrating Airflow pipelines and employed environment variables for project-level definition and password encryption.
Demonstrated competency in Kubernetes within GCP, focusing on devising monitoring solutions with Stackdriver's log router and designing reports in Data Studio.
Acted as an integrator, facilitating collaboration between data architects, data scientists, and other data consumers.
Translated SAS code into Python and Spark-based jobs for execution in Cloud Dataproc and BigQuery on the Google Cloud Platform.
Facilitated data transfer between BigQuery and Azure Data Warehouse through Azure Data Factory (ADF) and devised complex DAX language expressions for memory optimization in reporting cubes within Azure Analysis Services (AAS).
Utilized Cloud Pub/Sub and Cloud Functions for specific use cases, including workflow triggers upon incoming messages.
Crafted data pipelines using Cloud Composer to orchestrate processes and employed Cloud Dataflow to build scalable machine learning algorithms, while also migrating existing Cloud Dataprep jobs to BigQuery.
Participated in the creation of Hive tables, data loading, and the authoring of Hive queries, which were executed via MapReduce.
Implemented advanced Hive features such as Partitioning, Dynamic Partitions, and Buckets to optimize data storage and retrieval.
Developed code for importing and exporting data to and from HDFS and Hive using Apache Sqoop.
Demonstrated expertise in Hive SQL, Presto SQL, and Spark SQL to perform ETL tasks, choosing the most suitable technology for each specific job.
Authored Hive SQL scripts for creating sophisticated tables with high-performance attributes like partitioning, clustering, and skewing.
Engaged in the transformation and analysis of extensive structured and semi-structured datasets through the execution of Hive queries.
Collaborated with the Data Science team to implement advanced analytical models within the Hadoop cluster, utilizing large datasets.
Leveraged Power BI and SSRS to create dynamic reports, dashboards, and interactive functionalities for web clients and mobile apps.
Developed SAS scripts for Hadoop, providing data for downstream SAS teams, particularly for SAS Visual Analytics, an in-memory reporting engine.
Monitored data engines to define data requirements and access data from both relational and non-relational databases, including Cassandra and HDFS.
Created complex SQL queries and established JDBC connectivity to retrieve data for presales and secondary sales estimations.
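A minimal Cloud Composer (Airflow 2.x, EmptyOperator assumes 2.3+) DAG sketch using the operator types mentioned above; the DAG id, task names, Dataproc job path, and branching logic are hypothetical placeholders.

```python
# Minimal sketch of an Airflow DAG with branch, bash, and python tasks.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def _check_source_freshness(**context):
    # Placeholder branching logic: run the job whenever an execution date exists.
    return "run_dataproc_job" if context["ds"] else "skip_run"


def _publish_metrics(**context):
    print(f"Pipeline finished for {context['ds']}")


with DAG(
    dag_id="daily_claims_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    check_source = BranchPythonOperator(
        task_id="check_source", python_callable=_check_source_freshness
    )
    run_dataproc_job = BashOperator(
        task_id="run_dataproc_job",
        bash_command=(
            "gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/transform.py "
            "--cluster=etl-cluster --region=us-central1"
        ),
    )
    skip_run = EmptyOperator(task_id="skip_run")
    publish_metrics = PythonOperator(
        task_id="publish_metrics",
        python_callable=_publish_metrics,
        trigger_rule="none_failed",  # run after whichever branch was taken
    )

    check_source >> [run_dataproc_job, skip_run]
    run_dataproc_job >> publish_metrics
    skip_run >> publish_metrics
```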

Environment: GCP, PySpark, SAS, Hive, Sqoop, GCP Dataproc, BigQuery, Hadoop, GCS, Python, Snowflake, DynamoDB, Oracle Database, Power BI, SDKs, Dataflow, Glacier, EC2, EMR Cluster, SQL Database, Databricks.

Client: Nationwide, Columbus, OH June 2017 to July 2020
Role: AWS Data Engineer
Responsibilities:

Led the establishment of a centralized Data Lake on the AWS Cloud, leveraging key services such as S3, EMR, Redshift, and Athena.
Executed the migration of datasets and ETL workloads from On-prem to AWS Cloud services.
Developed a suite of Spark Applications and Hive scripts to generate diverse analytical datasets essential for digital marketing teams.
Spearheaded the construction and automation of data ingestion pipelines, seamlessly transferring terabytes of data from existing data warehouses to the cloud.
Implemented an Apache Spark data processing project capable of handling data from various RDBMS and Streaming sources.
Extensively fine-tuned Python, Scala, and Spark applications, providing production support for multiple pipelines running in a production environment.
Imported Metadata into Hive and seamlessly migrated existing tables and applications to operate on Hive within the AWS cloud.
Created interactive shell and Python scripts to schedule data cleansing and loading processes.
Conducted data validation on ingested data using MapReduce, employing a custom model to filter and cleanse invalid data.
Demonstrated expertise in optimizing EC2 performance for data-intensive workloads, including tuning networking, storage, and compute resources.
Established secure Snowflake connections through private links from AWS EC2 and AWS EMR for data transfers between applications and databases.
Built data pipelines using AWS Lambda to process and transform large volumes of data in real time (a minimal handler sketch follows this list).
Configured AWS RDS/Redshift to integrate with the Hadoop Ecosystem on AWS infrastructure.
Collaborated closely with business and data science teams, ensuring accurate translation of all requirements into data pipelines.
Developed automated deployment pipelines for AWS Lambda functions using tools like AWS CloudFormation.
Covered the full spectrum of data engineering pipelines, including data ingestion, transformations, and data analysis/consumption.
Applied experience in monitoring and debugging AWS Lambda functions using tools like AWS CloudWatch Logs and AWS X-Ray.
Automated infrastructure setup, launching, and termination of EMR clusters.
Created Hive external tables on datasets loaded in S3 buckets, devising various Hive scripts to generate aggregated datasets for downstream analysis.
Constructed a real-time streaming pipeline using Kafka, Spark Streaming, and Redshift.
Developed Kafka producers using the Kafka Java Producer API to connect to external Rest livestream applications, producing messages to Kafka topics.
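A minimal sketch of an S3-triggered AWS Lambda handler of the kind described above; the bucket layout (raw/ to processed/), the mandatory field, and the record format are hypothetical placeholders.

```python
# Minimal sketch: Lambda handler that reacts to S3 object-created events,
# filters out records missing a mandatory field, and writes the valid ones
# back to a "processed/" prefix.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=bucket, Key=key)
        rows = [json.loads(line) for line in obj["Body"].iter_lines() if line]

        # Placeholder validation: keep only records that carry an order_id.
        valid = [r for r in rows if r.get("order_id")]

        out_key = key.replace("raw/", "processed/", 1)
        s3.put_object(
            Bucket=bucket,
            Key=out_key,
            Body="\n".join(json.dumps(r) for r in valid).encode("utf-8"),
        )
    return {"processed_objects": len(event.get("Records", []))}
```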
Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Unix, Scala, Python, Java, Hive, Kafka, NiFi.

Client: Paytm Payments Bank, Noida, India September 2014 to April 2017
Role: Azure Data Engineer / Python Developer
Responsibilities:

Orchestrated the smooth migration of data from legacy database systems to Azure databases.
Collaborated with external team members and stakeholders to assess the implications of their changes, ensuring seamless project releases and minimizing integration issues in the Explore.MS application.
Conducted a comprehensive analysis, design, and implementation of modern data solutions using Azure PaaS services to support data visualization. This required a deep understanding of the current production state and its impact on existing business processes.
Coordinated with external teams and stakeholders, ensuring thorough understanding and comfortable integration of changes to prevent issues in the VL-In-Box application.
Reviewed the test plan and test cases for the VL-In-Box application during System Integration and User Acceptance testing phases.
Proficiently executed Extract, Transform, and Load (ETL) processes, extracting data from source systems and storing it in Azure Data Storage services. Leveraged Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics. Data was ingested into Azure Services, including Azure Data Lake, Azure Storage, Azure SQL, and Azure Data Warehouse, with further processing in Azure Databricks.
Designed and implemented migration strategies for traditional systems in Azure, employing approaches like Lift and Shift and Azure Migrate, alongside third-party tools.
Used Azure Synapse to manage processing workloads and deliver data for business intelligence and predictive analytics needs.
Demonstrated experience in data warehouse and business intelligence project implementation using Azure Data Factory.
Collaborated with Business Analysts, Users, and Subject Matter Experts (SMEs) to elaborate on requirements and ensure successful implementation.
Conceptualized and implemented end-to-end data solutions covering storage, integration, processing, and visualization within the Azure ecosystem.
Developed Azure Data Factory (ADF) pipelines, incorporating Linked Services, Datasets, and Pipelines for data extraction, transformation, and loading from diverse sources like Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tools.
Estimated cluster sizes, monitored, and troubleshot Spark Databricks clusters.
Applied performance tuning techniques to Spark Applications, optimizing factors such as Batch Interval time, Parallelism levels, and memory utilization.
Executed ETL processes using Azure Databricks, migrating on-premises Oracle ETL to Azure Synapse Analytics.
Developed custom User-Defined Functions (UDFs) in Scala and PySpark to meet specific business requirements (a PySpark example is sketched after this list).
Authored JSON scripts to deploy pipelines in Azure Data Factory (ADF), enabling data processing via SQL Activity.
Created Build and Release processes for multiple projects in the production environment using Visual Studio Team Services (VSTS).
Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
Proposed cost-efficient architectures within Azure, offering recommendations to right-size data infrastructure.
Established and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
Developed conceptual solutions and proof-of-concepts to validate the feasibility of proposed solutions.
Implemented Copy activities and custom Azure Data Factory Pipeline Activities to enhance data processing workflows.
Created Requirements Documentation for various projects.
Managed and administered Hadoop clusters for optimal performance and resource utilization.
Implemented security measures such as Kerberos authentication for Hadoop clusters.
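A minimal PySpark UDF sketch of the kind described above; the masking rule, column names, and sample data are hypothetical placeholders.

```python
# Minimal sketch: registering and applying a PySpark UDF that masks all but
# the last four characters of an account number.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()


@F.udf(returnType=StringType())
def mask_account(account_number):
    # Keep only the last four digits, as a data-protection rule might require.
    if account_number is None:
        return None
    return "*" * max(len(account_number) - 4, 0) + account_number[-4:]


df = spark.createDataFrame(
    [("1234567890", "savings"), ("9876543210", "current")],
    ["account_number", "account_type"],
)
df.withColumn("masked_account", mask_account("account_number")).show()
```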

Environment: Azure, Azure SQL, Blob storage, Azure SQL Data Warehouse, Azure Databricks, PySpark, Oracle, Azure Data Factory (ADF), T-SQL, Spark SQL.




Client: Couth Infotech Pvt. Ltd, Hyderabad, India May 2013 to August 2016
Role: Data Analyst
Responsibilities:


Engaged in a project involving machine learning, big data, data visualization, and Python development, with proficiency in Unix and SQL.
Conducted exploratory data analysis using NumPy, Matplotlib, and pandas.
Possessed expertise in quantitative analysis, data mining, and presenting data to discern trends and insights beyond numerical values.
Utilized Python libraries such as Pandas, NumPy, SciPy, and Matplotlib for data analysis.
Generated intricate SQL queries and scripts for extracting and aggregating data, ensuring accuracy in line with business requirements. Also, adept in gathering and translating business requirements into clear specifications and queries.
Produced high-level analysis reports using Excel, providing feedback on data quality, including identification of billing patterns and outliers.
Identified and documented data quality limitations that could impact the work of internal and external data analysts.
Implemented standard SQL queries for data validation and created analytical reports in Excel, including Pivot tables and Charts. Developed functional requirements using data modeling and ETL tools.
Extracted data from various sources like CSV files, Excel, HTML pages, and SQL, performing data analysis and writing to different destinations such as CSV files, Excel, or databases.
Utilized the pandas API for analyzing time series and created a regression test framework for new code (see the sketch after this list).
Developed and managed business logic through backend Python code.
Worked on Django REST framework, integrating new and existing API endpoints.
Demonstrated extensive knowledge in loading data into charts using Python code.
Leveraged Highcharts to pass data and create interactive JavaScript charts for web applications.
Demonstrated proficiency in using Python libraries like OS, Pickle, NumPy, and SciPy.
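A minimal pandas time-series sketch of the kind of analysis described above; the file name, column names, and outlier rule are hypothetical placeholders.

```python
# Minimal sketch: load daily billing records, resample to monthly totals,
# and flag outliers relative to a rolling mean.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("billing.csv", parse_dates=["billing_date"])
df = df.set_index("billing_date").sort_index()

monthly = df["amount"].resample("M").sum()
rolling_mean = monthly.rolling(window=6, min_periods=3).mean()
rolling_std = monthly.rolling(window=6, min_periods=3).std()

# Flag months more than three standard deviations from the rolling mean.
outliers = monthly[(monthly - rolling_mean).abs() > 3 * rolling_std]
print(outliers)

monthly.plot(title="Monthly billing totals")
plt.show()
```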

Environment: Python, HTML5, CSS3, Django, SQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Python libraries (NumPy), Bitbucket.

Education Details:
Gitam University, Hyderabad (Computer Science Engineering), April 2010 to April 2014