
Sushmith Kumar - GCP Data Engineer
[email protected]
Location: Remote, USA
Relocation:
Visa: Green Card
Professional Summary:

10+ years of experience as a Lead GCP Data Engineer, with demonstrated expertise in building and deploying data pipelines using open-source big data technologies such as Apache Spark, Hive, Hadoop, Python, and PySpark.
Hands-on experience on Google Cloud Platform (GCP) across its big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (Airflow as a service).
Hands-on experience working with GCP services such as BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc, and Operations Suite (formerly Stackdriver).
Hands-on experience in designing and implementing data engineering pipelines and analyzing data using the AWS stack, including EMR, Glue, EC2, Lambda, Athena, and Redshift, along with Sqoop and Hive.
Hands-on experience in programming using Python, Scala, Java, and SQL.
Experience in query optimization and performance tuning of stored procedures and functions.
Hands-on experience in developing Spark applications using PySpark DataFrames, RDDs, and Spark SQL.
Expertise in Creating, Debugging, Scheduling, and Monitoring jobs using Composer Airflow.
Expert in working with Cloud Pub/Sub to replicate data in real time from source systems to GCP BigQuery.
Solid working knowledge of GCP services including Cloud Storage, Dataproc, Dataflow, BigQuery, Cloud Composer, and Cloud Pub/Sub.
Good knowledge of GCP service accounts, billing projects, authorized views, datasets, GCS buckets, and gsutil commands.
Experienced in building and deploying Spark applications on Hortonworks Data Platform and AWS EMR.
Experienced in working with AWS services such as EMR, S3, EC2, IAM, Lambda, CloudFormation, and CloudWatch.
Responsible for the implementation, design, architecture, and support of cloud-based solutions across multiple platforms.
Managed infrastructure, ensuring system availability, performance, capacity, and continuity through proper response to incidents, events, and problems.
Extensively worked on Spark with Scala on the cluster for computational analytics.
Created and maintained highly scalable and fault-tolerant multi-tier AWS and Azure environments spanning multiple availability zones using Terraform and CloudFormation.
Wrote Terraform scripts from scratch to build Dev, Staging, Prod, and DR environments.
Involved in the design and deployment of a multitude of cloud services on the AWS stack, such as EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM, with a focus on high availability, fault tolerance, and auto-scaling.
Keen on keeping up with the newer technologies that the Google Cloud Platform adds to its stack.
Worked on structured and semi-structured data storage formats such as Parquet, ORC, CSV, and JSON.
Proficient in data warehousing techniques for data cleansing, Slowly Changing Dimensions (SCDs), and Change Data Capture (CDC).
Experience integrating various data sources, including databases such as Teradata, Oracle, and SQL Server, as well as text files.
Developed complex Teradata SQL code in BTEQ script using OLAP and Aggregate functions.
Excellent knowledge and experience in documents like BSD, TSD, and mapping documents.
Extensively involved in identifying performance bottlenecks in targets, sources, and transformations and successfully tuning them for maximum performance using best practices.
Experience in documenting Design specs, Unit test plans, and deployment plans.
Extensive knowledge of Teradata SQL Assistant. Developed BTEQ scripts to load data from the Teradata staging area to the data warehouse, and from the data warehouse to data marts for specific reporting requirements. Tuned existing BTEQ scripts to enhance performance.
Experience in building and architecting multiple Data pipelines, end-to-end ETL, and ELT processes for Data ingestion and transformation in GCP and coordinating tasks among the team.
Developed automation scripts using AWS Python Boto3 SDK.
Experienced in working with Snowflake cloud data warehouse and Snowflake Data Modeling.
Built ELT workflows using Python and Snowflake COPY utilities to load data into Snowflake (a brief sketch follows this summary).
Experienced in working with RDBMS databases such as Oracle and SQL Server.
Developed complex SQL queries and performance tuning of SQL queries.
Experienced in working with CI/CD tools such as Jenkins and Bamboo.
Experienced in working with source code management tools such as Git and Bitbucket.
Worked on application monitoring tools such as Splunk, Elasticsearch, Logstash, and Kibana for application logging and monitoring.
Strong hands-on experience with Teradata utilities such as BTEQ, FLOAD, MLOAD, and TPUMP, and as an analyst using SQL Server and Teradata.
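
A minimal sketch of the Snowflake ELT load mentioned above, using the Snowflake Python connector. Connection parameters, the stage, and the table names (LANDING_STAGE, ORDERS, RAW_DB) are illustrative placeholders, not the actual project objects.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="etl_user",           # placeholder
    password="***",            # use a secrets manager in practice
    warehouse="LOAD_WH",
    database="RAW_DB",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Upload local CSV files to an internal stage, then bulk-load them with COPY INTO.
    cur.execute("PUT file:///data/orders_*.csv @LANDING_STAGE AUTO_COMPRESS=TRUE")
    cur.execute("""
        COPY INTO ORDERS
        FROM @LANDING_STAGE
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
        ON_ERROR = 'ABORT_STATEMENT'
    """)
finally:
    conn.close()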

Technical Skills:

Google Cloud Platform: BigQuery, Cloud Dataproc, GCS Bucket, Cloud Functions, Apache Beam, Cloud Shell, gsutil, bq command line, Cloud Dataflow, Cloud Composer.
Hadoop Core Services: HDFS, MapReduce, Spark, YARN.
Hadoop Distributions: Cloudera, Hortonworks, Apache Hadoop.
On-Premises: SAS, DB2, Teradata, Netezza.
Databases: HBase, Spark-Redis, Cassandra, Oracle, MySQL, PostgreSQL, Teradata.
Data Services: Hive, Pig, Impala, Sqoop, Flume, Kafka.
Scheduling Tools: Zookeeper, Oozie.
Monitoring Tools: Cloudera Manager.
Cloud Computing Tools: AWS, Azure, GCP.
Programming Languages: Python, Scala, SQL, PL/SQL, Pig Latin, HiveQL, Shell Scripting.
Operating Systems: UNIX, Windows, Linux.
Build Tools: Jenkins, Maven, Ant.
ETL Tools: IBM DataStage, Robot Scheduler.
Development Tools: Eclipse, NetBeans, Microsoft SQL Studio, Toad.


Professional Experience:

Client: USAA (HCL) (San Antonio, Texas) Sep 2023 - Present
Senior GCP Data Engineer

Technical Stack: GCP, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Informatica PowerCenter, Scala, SFTP, Talend for Big Data, Power BI, Airflow, Hadoop, Hive, Teradata, SAS, Spark, Python, Java, SQL Server.

Experience in building multiple Data pipelines, end-to-end ETL, and ELT processes for Data ingestion and transformation in GCP and coordinating tasks among the team.
Set up GCP firewall rules to control ingress and egress traffic to and from VM instances based on the specified configuration, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
Developed and deployed Spark and Scala code on the Hadoop cluster running on GCP.
Designed and implemented the various layers of the data lake and designed the star schema in BigQuery.
Used Cloud Functions with Python to load arriving CSV files from the GCS bucket into BigQuery (a brief sketch follows this section).
Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
Developed Spark applications using Scala and Java and implemented an Apache Spark data processing project to handle data from RDBMS sources.
Used Scala components to implement the credit line policy based on conditions applied to Spark DataFrames.
Designed and created datasets in Big Query to facilitate efficient querying and analysis.
Worked on Secure File Transfer Protocol (SFTP) to transfer data between data sources.
Involved in migrating the on-prem Hadoop system to GCP (Google Cloud Platform).
Migrated previously written cron jobs to Airflow/Composer in GCP.
Support existing GCP Data Management implementations.
Created GCP Big Query authorized views for row-level security or exposing the data to other teams.
Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
Designed Pipelines with Apache Beam, KubeFlow, and Dataflow and orchestrated jobs into GCP.
Developed and demonstrated a POC to migrate on-prem workloads to Google Cloud Platform using GCS, BigQuery, Cloud SQL, and Cloud Dataproc.
Documented the inventory of modules, infrastructure, storage, and components of the existing On-Prem data warehouse for analysis and identifying the suitable technologies/strategies required for Google Cloud Migration.
Designed, developed, and implemented performant ETL pipelines using Apache Spark's Python API (PySpark).
Worked on Tableau to connect with databases for data integrations and to ensure data security and governance.
Built data pipelines in Airflow on GCP for ETL-related jobs using a range of Airflow operators, both legacy and newer ones.
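
A minimal sketch of the CSV-loading Cloud Function referenced above, assuming a first-generation background function triggered by GCS object-finalize events. The dataset and table names are illustrative placeholders.

from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered when an object is finalized in the GCS bucket."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = f"gs://{bucket}/{name}"
    # "analytics_ds.raw_events" is a placeholder destination table.
    load_job = client.load_table_from_uri(uri, "analytics_ds.raw_events", job_config=job_config)
    load_job.result()  # wait for completion so failures surface in the function logs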


Client: Cigna HealthCare (Charlotte, NC) March 2022 - Aug 2023
Senior GCP Data Engineer

Technical Stack: GCP Console, Cloud Storage, BigQuery, Dataflow, Apache Beam, SQL, Python, Java, SAS, IBM DataStage, ETL, Apache Airflow (version 2.3.5), Robot Scheduler, Ansible, PuTTY, Git Bash, GitHub, Shell Scripting, Scala, Teradata, Netezza, WinSCP, DBeaver (version 23.2.1)

Led the end-to-end migration of on-premises data solutions to GCP, overseeing data extraction, transformation, and loading (ETL) processes for transition.
Worked with product teams to create various store-level metrics and supporting data pipelines written on GCP's big data stack.
Worked on ingesting data from the source systems (SAS, Teradata, Netezza) into Google Cloud Storage (GCS bucket) Raw, Stage, and Distribution layer tables, and exposed the data through user-accessible views (Base view, Semantic view, and View).
Used the Google Cloud Dataflow service, a serverless unified stream and batch processing service, for the data ingestion framework, and ingested data files in .PSV format arriving in Cloud Storage in the data lake.
Worked on Cloud composer for data ingestion process and Robot jobs to load the files into the Data lake folder in GCS.
Triggered the ingestion jobs from Airflow Scheduler which invokes the data ingestion pipeline (written in Java) which ingests the data from Source files (.PSV) and loads it into BigQuery layers.
Validated the schema based on the number of columns and data types (INT64, STRING, TIMESTAMP, DATE, etc.).
Worked on developing YAML code to remove any duplicate data present in the input tables by executing (Stage) Procedure.
Loaded Staging tables into Distribution Tables using various Transformation Rules and performed data quality Rule Validation in DQAudit Task.
Wrote transformation logic in .SQL files specified in the DAG, after which the data is loaded into the distribution tables.
Normalized the data according to business needs, performing data cleansing, data type changes, and various other transformations using Spark, Scala, and GCP Dataproc.
Deployed code from local to the dev/test/stage GCP environments using Git, Git Bash, and the command prompt.
Worked on partitioning and clustering high-volume tables on fields in Big Query to make queries more efficient.
Worked on implementing scalable infrastructure and platform for large amounts of data ingestion, aggregation, integration, and analytics in Hadoop using Spark and Hive.
Built a system for analyzing the column names from all tables and identifying personal information columns of data across on-premises Databases (data migration) to GCP.
Worked on writing Terraform scripts from scratch for building Dev, Staging, Prod, and DR environments.
Used Kafka HDFS Connector to export data from Kafka topic to HDFS files in a variety of formats and integrated with Apache Hive to make data immediately available for SQL querying.
Utilized the GCP Console to monitor and manage resource usage, permissions, and access controls.
Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
Wrote SQL queries to extract data from source systems as part of the Extract, Transform, Load (ETL) process.
Uploaded and downloaded data to and from Cloud Storage using the command-line tools, and client libraries.
Developed PySpark scripts to handle the migration of large volumes of data, ensuring minimal downtime and optimal performance, and analyzed existing SQL scripts to design solutions implemented in PySpark.
Worked on querying data using Spark SQL on top of PySpark jobs to perform data cleansing and validation, applied transformations, and executed the programs using the Python API.
Worked on IBM DataStage to extract the files from Source and used Robot Scheduler to place them in GCS location.
Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
Developed T-SQL queries, stored procedures, and user-defined functions, and used built-in functions.
Wrote Python DAGs in Airflow to orchestrate end-to-end data pipelines for multiple applications.
Used windowing functions to order data and remove duplicates in source data before loading to the data mart for better performance (a brief sketch follows this section).
Used SFTP to generate detailed logs for file transfers, user activities, and file access.
Developed Power Pivot/SSRS (SQL Server Reporting Services) Reports and added logos, pie charts, and bar graphs for display purposes as per business needs.
Worked on Data serialization formats for converting complex objects into sequence bits by using Parquet, Avro, JSON, and CSV formats.
Developed internal dashboards for the team using Power BI tools for tracking daily tasks.
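
A minimal PySpark sketch of the window-function de-duplication pattern referenced above. The paths and column names (customer_id, updated_at) are illustrative placeholders.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup_before_datamart").getOrCreate()

# Placeholder source path in Cloud Storage.
src = spark.read.parquet("gs://example-bucket/stage/customers/")

# Keep only the most recent record per business key.
w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())

deduped = (
    src.withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)   # latest row per customer_id
       .drop("rn")
)

deduped.write.mode("overwrite").parquet("gs://example-bucket/datamart/customers/")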

Client: First Republic Bank (New Jersey) Nov 2019 - Feb 2022
Senior Data Engineer

Technical Stack: GCP, Airflow, Big Query, Hadoop, Hive, Tableau, Zookeeper, Sqoop, Spark, Control-M, Python, Bamboo, SQL, Bit Bucket, Linux.

Worked on implementing scalable infrastructure and platform for large amounts of data ingestion, aggregation, integration, and analytics in Hadoop using Spark and Hive.
Modified existing dimension data model by adding required dimensions and facts as per business process.
Involved in migrating the on-prem Hadoop system to GCP (Google Cloud Platform).
Worked on developing streamlined workflows using high-performance API services dealing with large amounts of structured and unstructured data.
Developed Spark jobs in Python to perform data transformation, creating Data Frames and Spark SQL.
Migrated previously written cron jobs to Airflow/Composer in GCP.
Built Scalding jobs to migrate revenue data from BigQuery to Manhattan and HDFS, and used a cloud replicator to run the BQMH jobs on a GCP Hadoop cluster and replicate the data to on-prem HDFS.
Developed Spark applications using Spark libraries to perform ETL transformations, eliminating the need for ETL tools.
Developed the end-to-end data pipeline in Spark using Python to ingest, transform, and analyze data.
Designed and created datasets in Big Query to facilitate efficient querying and analysis.
Integrated Tableau with ETL platforms to streamline data workflows.
Created Hive tables using HiveQL, then loaded the data into Hive tables and analyzed the data by developing Hive queries.
Worked on GCP to migrate data from an Oracle database to GCP.
Experience working with product teams to create various store-level metrics and supporting data pipelines written on GCP's big data stack.
Experience in GCP Dataproc, Dataflow, Pub/Sub, GCS, Cloud Functions, BigQuery, Stackdriver, Cloud Logging, IAM, and Data Studio for reporting.
Created and executed Unit test cases to validate transformations and process functions are working as expected.
Wrote Python DAGs in Airflow to orchestrate end-to-end data pipelines for multiple applications (a brief DAG sketch follows this section).
Involved in setting up the Apache Airflow service in GCP.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Wrote shell scripts to automate application deployments.
Implemented solutions to switch schemas based on the dates so that the transformation would be automated.
Developed custom functions and UDFs in Python to incorporate methods and functionality of Spark.
Implemented Spark RDD transformations to map business analysis and apply actions on top of transformations.
Worked on Data serialization formats for converting complex objects into sequence bits by using Parquet, Avro, JSON, and CSV formats.
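
A minimal sketch of one such Airflow DAG, assuming an extract-transform-load split into Python callables. The task IDs, schedule, and callables are illustrative placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull data from the source system

def transform():
    pass  # apply business transformations

def load():
    pass  # load results into BigQuery / the data mart

with DAG(
    dag_id="end_to_end_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the steps strictly in order.
    t_extract >> t_transform >> t_load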

Client: Ascena Retail Group (Scottsdale, AZ) Aug 2017 - Oct 2019
Data Engineer

Technical Stack: Microsoft Azure Cloud, Apache Flume, HDFS, Sqoop, Hive, HBase, Databricks, Pig, Jenkins, Oozie, Power BI, Spark, Scala, Hadoop, DBT, SQL, Oracle, UNIX, Talend Open Studio, Informatica.

Integrated Azure Data Factory to seamlessly warehouse data from diverse sources, including on-premises systems like MySQL and Cassandra, as well as cloud sources like Blob storage and Azure SQL DB. Applied transformations and loaded data into Azure Synapse for efficient data processing.
Managed and configured resources across the cluster using Azure Kubernetes Service, monitoring the Spark cluster via Log Analytics and Ambari Web UI. Successfully transitioned log storage from Cassandra to Azure SQL Data Warehouse, resulting in enhanced query performance.
Developed data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB using SQL API and Mongo API.
Leveraged Azure Logic Apps to build automated workflows, schedule batch jobs, integrate apps, ADF pipelines, and other services like HTTP requests and email triggers.
Extensive experience with Azure Data Factory, including data transformations, Integration Runtimes, Azure Key Vaults, triggers, and migrating data factory pipelines to higher environments using ARM Templates.
Created pipelines to load data from Azure Data Lake into Staging SQLDB and subsequently into Azure SQL DB.
Successfully migrated large datasets to Databricks (Spark), encompassing cluster administration, data loading, and configuration of data pipelines from ADLS Gen2 to Databricks using ADF pipelines.
Developed Databricks notebooks to streamline and curate data for various business use cases, including the mounting of blob storage on Databricks.
Managed the scheduling and monitoring of Hive jobs daily, leveraging the Control-M tool.
Extensive work with Hive tables, partitions, and buckets to facilitate efficient analysis of large volumes of data.
Scheduled Hive queries daily using an Oozie coordinator and developed Oozie workflows.
Configured the Pig, Hive, Sqoop, and HBase ecosystems on Hadoop to support the development of Pig Latin scripts for data processing and Hive queries for loading data into Hive tables.
Contributed to the ingestion of source data into the Hadoop data lake from multiple databases using the Sqoop tool.
Created ETL pipelines using Python and PySpark to load data into Hive tables within Databricks (a brief sketch follows this section).
Implemented an HBase (NoSQL) database as part of the project's database design.
Proficient in working with the Informatica PowerCenter Designer, Workflow Manager, Repository Manager, and Admin Console tools for efficient data integration and management.
Leveraged Talend Open Studio and the Talend Enterprise platform for efficient data management, developing complex Talend ETL jobs for seamless data migration from flat files to databases.
Utilized Apache Flume to enable smooth data flow from Guidewire Billing Center to the HDFS destination.
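
A minimal PySpark sketch of the Hive-table load pattern referenced above, as it might run in a Databricks notebook. The mount point, column names, and table name are illustrative placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate_sales").enableHiveSupport().getOrCreate()

# Read raw CSV files from a placeholder mount point.
raw = (spark.read
       .option("header", True)
       .csv("dbfs:/mnt/raw/sales/"))

# Light curation: typed columns and basic quality filtering.
curated = (raw
           .withColumn("sale_date", F.to_date("sale_date", "yyyy-MM-dd"))
           .withColumn("amount", F.col("amount").cast("double"))
           .dropna(subset=["order_id"]))

# Persist as a managed Hive table, partitioned by sale date.
(curated.write
        .mode("overwrite")
        .partitionBy("sale_date")
        .saveAsTable("curated_db.sales"))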

Client: Airtel Cellular Limited (INDIA) June 2013 - July 2017
Associate Engineer / Data Engineer II

Technical Stack: HDFS, Hive, MapReduce, Pig, Spark, Kafka, Sqoop, Scala, Oozie, Maven, GitHub, Java, Python, MySQL, Linux, AWS.

Leveraged Talend Open Studio and the Talend Enterprise platform for efficient data management, developing complex Talend ETL jobs for seamless data migration from flat files to databases.
Developed and performed Sqoop import from Oracle to load the data into HDFS.
Created partitions and buckets based on State for further processing using bucket-based Hive joins.
Created Hive tables to store the processed results in a tabular format.
Loaded and transformed data using scripting languages and tools (e.g., Python, Linux shell, Sqoop).
Led ETL efforts to integrate, transform, and map data from multiple sources using Python.
Imported data sets with data ingestion tools like Sqoop, Kafka, and Flume.
Extensively worked with moving data across cloud architectures including Redshift, Hive, and S3 buckets.
Developed data pipeline using Flume, Sqoop, Hive, and Spark to ingest subscriber data, provider data, and claims into HDFS for analysis.
Working knowledge of Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming.
Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS (a brief sketch follows this section).
Confident and well-experienced as a SQL Server database administrator.
Deployed the packages from Solution Explorer to the SSIS catalog in SQL Server Management Studio.
Used a constraints table in the DW while loading data to catch errors such as data type violations, NULL constraint violations, foreign key violations, and duplicate data.
Generated reports to maintain zero percent errors in all the data warehouse tables.
Developed SSIS Packages for migrating data from the Staging Area of SQL Server 2008 to SQL Server 2012.
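
A minimal sketch of the near-real-time S3-to-HDFS flow referenced above, shown here with PySpark Structured Streaming rather than the original DStream API. The bucket, paths, and schema are illustrative placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("learner_model_stream").getOrCreate()

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("score", DoubleType()),
    StructField("event_time", TimestampType()),
])

# New files landing in the S3 prefix are picked up as a stream.
events = (spark.readStream
          .schema(schema)
          .option("header", True)
          .csv("s3a://example-bucket/incoming/"))

cleaned = (events
           .filter(F.col("learner_id").isNotNull())
           .withColumn("ingest_date", F.to_date("event_time")))

# Persist the transformed stream to HDFS as Parquet with checkpointing.
query = (cleaned.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/learner_model/")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_model/")
         .outputMode("append")
         .start())

query.awaitTermination()
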
Qualifications: Relevant technical tertiary qualification and relevant professional experience.