Sujith - Data Engineer
[email protected]
Location: Atlanta, Georgia, USA
Relocation: Open
Visa: H1B
PROFESSIONAL SUMMARY:
Data Engineer with 9 years of IT industry experience, specializing in Big Data Technologies and ETL processing.
Expert in designing and querying NoSQL databases like HBase, Cassandra, and MongoDB, ensuring efficient data storage and retrieval.
Proficient in performing aggregations using Hive Query Language (HQL) to extract meaningful insights from large datasets.
Skilled in handling structured, unstructured, semi-structured, and live stream data, enhancing data processing capabilities.
Experienced in converting MapReduce applications to Spark jobs, optimizing performance and scalability.
Proficient in data ingestion strategies into HDFS, Hive, and Sqoop from diverse sources, ensuring seamless data flow.
Well-versed in Microsoft Azure platform ETL processing tools, including Azure Data Factory and Data Lakes, ensuring efficient data transformation.
Proficient in the Scala programming language for developing and optimizing Apache Spark applications, leveraging its functional programming features to enhance code maintainability and performance.
Experienced with the Scala API for Apache Spark, contributing to the development and optimization of Spark jobs for efficient data processing, transformation, and analysis in both batch and real-time scenarios.
Proficient in utilizing Microsoft Azure services including Azure Data Factory for seamless ETL orchestration, and Azure Data Lakes for efficient data storage and retrieval.
Experienced in migrating data and ETL workloads from on-premises systems to Azure Cloud, leveraging Python and Bash scripting for smooth transitions.
Skilled in building real-time and batch data pipelines using Google Cloud Platform (GCP) services such as Cloud Storage, BigQuery, Dataflow, and Pub/Sub, ensuring efficient data processing and analysis.
Hands-on experience with GCP Stackdriver for effective monitoring and management of cloud resources, ensuring reliable and scalable data processing.
Strong understanding of RDD operations in Apache Spark, enabling effective data manipulation and analysis.
Experienced in developing Unix Scripts for implementing Cron jobs, streamlining Hadoop job execution.
Sound knowledge of the Spark framework and scripting using PySpark for both batch and real-time data processing (a brief batch example is sketched after this summary).
Proven track record of extending Hive and Pig functionalities by developing user-defined functions, enhancing data processing capabilities.
Proficient with GCP and Azure Databricks, leveraging cloud services for seamless data processing and storage.
Highly motivated team player with the ability to work independently, adapt quickly to emerging technologies, and drive impactful results.
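
A minimal sketch, assuming a hypothetical Hive table named sales with region and amount columns, of an HQL-style aggregation expressed as a PySpark batch job:

# Minimal PySpark sketch; the table and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hql-style-aggregation")
         .enableHiveSupport()        # read Hive tables through the metastore
         .getOrCreate())

# Equivalent of: SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region
totals = (spark.table("sales")
               .groupBy("region")
               .agg(F.sum("amount").alias("total_amount")))

totals.write.mode("overwrite").saveAsTable("sales_by_region")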


TECHNICAL SKILLS:
Big Data Frameworks: Hadoop, Spark, Hive, Kafka, HBase, Pig, MapReduce, Flume, Oozie, Zookeeper, HCatalog, Sqoop, Impala, NiFi
Big Data Distributions: Hortonworks, Amazon EMR, Cloudera
Programming Languages: Python, Java, Scala, Unix Shell Scripting
Operating Systems: Linux, Windows, Mac OS X
Databases: Oracle, MongoDB, MySQL, DB2, MS SQL Server, PostgreSQL, Cassandra
Cloud Services: Microsoft Azure, GCP
Development Cycles: Waterfall


Client: AMEX, Phoenix
Role: Sr. Data Engineer    January 2023 – Present
Responsibilities:
Utilized GCP services including Cloud Storage, BigQuery, Cloud Composer, Dataflow, Dataproc, and Cloud Functions.
Built batch and streaming jobs using GCP services such as BigQuery, Pub/Sub, Dataproc, Dataflow, Cloud Run, Compute Engine, and Cloud Composer.
Worked with GCP Stackdriver for tracing, profiling, logging, error reporting, and monitoring.
Worked on building a centralized data lake on GCP utilizing Cloud Storage, Cloud Functions, and BigQuery.
Migrated datasets and ETL workloads from on-premises systems to GCP Cloud services using Python.
Extensive experience in utilizing ETL processes for designing and building large-scale data pipelines using Python and Bash scripting.
Migrated data from the local Teradata data warehouse to GCP data lakes.
Built Spark Applications and Hive scripts for generating analytical datasets for digital marketing teams.
Implemented advanced optimization techniques in Spark applications and Hive scripts, resulting in a significant reduction in data processing time and enhancing overall system efficiency for generating analytical datasets.
Worked on fine-tuning Spark applications and providing production support for various pipelines.
Built real-time and batch pipelines using GCP services.
Spearheaded the seamless migration of datasets and ETL workloads from on-premises systems to GCP Cloud, showcasing expertise in Python and Bash scripting to ensure compatibility and data integrity across different platforms.
Developed and optimized Python-based ETL pipelines in both legacy and distributed environments.
Built PySpark pipelines using Spark DataFrame operations to load data to the EDL, with Dataproc for job execution and GCP Cloud Storage as the storage layer (see the sketch at the end of this section).
Collaborated with product teams to create store-level metrics and support data pipelines in GCP's big data stack.
Involved in creating a data lake in Google Cloud Platform (GCP) for enabling business teams to perform data analysis in BigQuery.
Experience with Google Cloud components, Google Container Builders, and GCP client libraries.
Worked on Google Cloud Platform services like Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
Collaborated with cross-functional teams to design and implement data governance strategies, ensuring data accuracy, compliance, and security across pipelines and services.
Environment: GCP (Cloud Storage, BigQuery, Cloud Composer, Dataflow, Dataproc, Cloud Functions, Pub/Sub, Cloud SQL, Compute Engine, Cloud Load Balancing), Spark, Hive, Python.
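
A minimal sketch of a Dataproc-style PySpark job with Cloud Storage as the storage layer, as referenced above; the bucket paths, table, and column names are hypothetical placeholders:

# Illustrative PySpark job of the kind submitted to Dataproc; all paths and
# column names below are hypothetical, not the actual production pipeline.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gcs-curation-job").getOrCreate()

# Read raw files landed in a Cloud Storage bucket
raw = spark.read.parquet("gs://example-raw-zone/transactions/")

# Basic cleansing and derivation before publishing to the data lake
curated = (raw.dropDuplicates(["transaction_id"])
              .withColumn("load_date", F.current_date()))

# Write curated output back to Cloud Storage, partitioned for downstream BigQuery loads
(curated.write
        .mode("overwrite")
        .partitionBy("load_date")
        .parquet("gs://example-curated-zone/transactions/"))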

Client: UPS, New Jersey
Role: Sr. Data Engineer    October 2021 – January 2023
Responsibilities:
Managed ETL tasks with Airflow/Composer, efficiently handling data flows and dependencies between processes.
Employed Sqoop to ingest raw data into Cloud Storage via Cloud Dataproc clusters.
Demonstrated expertise in GCP services such as Dataproc, GCS, Cloud Functions, and BigQuery.
Established comprehensive monitoring protocols for BigQuery, Dataproc, and Cloud Dataflow jobs using Stackdriver. Proactively identified and addressed issues, minimizing downtime and ensuring the reliability of critical data processing workflows.
Played a key role in migrating on-premises Hadoop jobs to GCP environments, significantly enhancing scalability and performance. Employed best practices to optimize resource utilization and streamline data processing workflows.
Designed, developed, and executed Python-based ETL pipelines, optimizing data processing efficiency.
Successfully implemented complex ETL workflows using Airflow/Composer, ensuring efficient data processing and reliability (a minimal DAG sketch follows this section).
Streamlined dependencies and managed workflows effectively, resulting in a 20% improvement in job execution time.
Spearheaded successful data streaming projects using Dataflow templates and Cloud Pub/Sub, enabling real-time data processing.
Achieved a 30% reduction in data processing latency, enhancing the timeliness of analytical insights for business teams.
Utilized Spark DataFrames in conjunction with PySpark for data loading, transformation, and analysis.
Ensured reliable and efficient data streaming using Dataflow templates and Cloud Pub/Sub service.
Monitored BigQuery, Dataproc, and Cloud Dataflow jobs through Stackdriver, promptly addressing issues.
Created and maintained alerting policies for Cloud Composer, scheduled queries, and resource consumption.
Developed scripts to extract and process BigQuery data into pandas or Spark DataFrames for advanced ETL.
Played a pivotal role in migrating on-prem Hadoop jobs to GCP environments, enhancing scalability and performance.
Designed and executed data marts using Hive and PySpark, enabling downstream teams to access analytical insights.
Revamped existing Oracle and SQL Server scripts into PySpark-based code, optimizing job execution and scheduling.
Environment: HDFS, Hive, Spark, Kubernetes, Terraform, Ansible, Docker, Tableau Server, Python, MySQL, MongoDB, Oozie, Azure Databricks, PySpark, Airflow.
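
A minimal Airflow/Cloud Composer DAG sketch of the extract-then-transform orchestration described above; the dag_id, schedule, and task bodies are hypothetical placeholders:

# Minimal Airflow 2.x DAG sketch; task logic is stubbed out as placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # placeholder: pull the day's raw files (e.g., from Cloud Storage) for processing
    pass


def transform():
    # placeholder: apply business rules and load curated tables
    pass


with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # extract must finish before transform starts
    extract_task >> transform_task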

Client: Molina Healthcare, California
Role: Data Engineer    February 2018 – October 2021
Responsibilities:
Worked on Talend integrations to ingest data from multiple sources into the data lake.
Utilized Azure Data Lake Storage Gen2 for efficient storage and retrieval of diverse data formats, including Excel files.
Developed real-time applications using Kafka Streams and Spark Streaming, ensuring timely data processing (see the streaming sketch following this section).
Managed data movement involving Hadoop and NoSQL databases like HBase and Cassandra.
Led the migration of processes and data from on-premises SQL Server to Azure Data Lake.
Collaborated in the development of microservices using Spring Boot, enhancing system scalability.
Conducted data validation between raw source files and BigQuery tables through Apache Beam on Cloud Dataflow.
Streamlined file collection by moving data from various vendor SFTP locations into GCP, including API data retrieval.
Automated the data validation process between raw source files and BigQuery tables through Apache Beam on Cloud Dataflow. This initiative ensured consistent data quality and accuracy across diverse datasets.
Implemented robust security measures by creating and managing BigQuery authorized views, ensuring secure data access and sharing practices. Strengthened data governance and compliance across teams.
Analyzed snowflake datasets for performance optimization, enhancing data processing efficiency.
Created BigQuery authorized views, ensuring secure data access and sharing across teams.
Migrated data from HDFS to Azure HDInsight and Azure Databricks, optimizing data storage and processing.
Demonstrated proficiency in containerization technologies like Docker and Kubernetes, enhancing application scalability.
Developed Jenkins pipelines to streamline continuous integration and deployment processes.
Collaborated with cross-functional teams to gather requirements and distill insights, driving data-driven decisions.
Environment: Hadoop, Kafka, Spark, Sqoop, Hive, HBase, Cassandra, Azure Data Lake Storage Gen2, Spring Boot, Azure HDInsight, Azure Databricks, Docker, Kubernetes, Grafana, Jenkins, SQL, Python, Shell, Microservices.
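
A minimal Spark Structured Streaming sketch of the Kafka-based real-time processing described above; the broker address, topic name, and output paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath:

# Hedged sketch of consuming a Kafka topic with Spark Structured Streaming;
# broker, topic, and paths below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

# Subscribe to a Kafka topic
events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker-1:9092")
               .option("subscribe", "claims-events")
               .load())

# Kafka delivers key/value as binary; cast the payload to string for parsing
parsed = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

# Write the stream out in micro-batches with checkpointing for fault tolerance
query = (parsed.writeStream
               .format("parquet")
               .option("path", "/data/landing/claims-events/")
               .option("checkpointLocation", "/data/checkpoints/claims-events/")
               .outputMode("append")
               .start())

query.awaitTermination()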

Client: Genpact Inc, India
Role: Data Engineer    November 2014 – January 2018
Responsibilities:
Collaborated extensively with clients throughout the project lifecycle, from requirements elicitation to deployment, ensuring a comprehensive understanding of their business needs.
Proactively assumed different job roles in a startup environment, ranging from Backend and Frontend Developer to Data Analyst and Tech Lead.
Successfully reduced the product delivery time to customers by approximately 15%, contributing to improved customer service.
Automated and scheduled the company's weekly metrics report using Python and MS Excel (a brief automation sketch follows this section).
Took complete ownership of various presentations and communications with clients, showcasing leadership and effective communication skills.
Designed and developed multiple E-commerce websites using various web technologies within an MVC architecture.
Wrote PL/SQL queries, stored procedures, and triggers for back-end database operations.
Dealt with information retrieval from databases and reported required results using SQL and advanced Excel.
Developed forecasting models using MS Excel to predict operational performance on a weekly and monthly basis.
Environment: MySQL, WordPress, PhpMyAdmin, Python, MS Excel, WAMP server
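
A minimal sketch of the Python and Excel report automation described above; the input file, column names, and metrics are hypothetical placeholders:

# Illustrative weekly-metrics automation with pandas; file and column names
# are hypothetical. Writing .xlsx output requires openpyxl to be installed.
import pandas as pd

# Load the week's raw order data exported from the operational database
orders = pd.read_csv("weekly_orders.csv", parse_dates=["order_date"])

# Roll up simple delivery metrics per day
metrics = (orders.groupby(orders["order_date"].dt.date)
                 .agg(orders_count=("order_id", "count"),
                      avg_delivery_days=("delivery_days", "mean")))

# Write the summary to an Excel workbook for distribution
metrics.to_excel("weekly_metrics_report.xlsx", sheet_name="Summary")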