Sathish - Data Engineer
[email protected]
Location: Detroit, Michigan, USA
Relocation: Yes
Visa: H1B
SATHISH MATTAPALLI
+1 (313) 241-0956 | USA | Senior Data Engineer | [email protected]

PROFESSIONAL SUMMARY
- 9 years of experience as a Data Engineer with highly proficient knowledge of data analysis, design, development, implementation, and testing of data warehousing applications, covering data modeling, data engineering, data extraction, data transformation, and data loading.
- Experienced in Big Data work on Hadoop, Spark, PySpark, Hive, HDFS, and other NoSQL platforms.
- Experience transferring data from AWS S3 to AWS Redshift using Informatica.
- Hands-on experience with Amazon Web Services, provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, Glue, and RDS.
- Experienced in managing and deploying cloud infrastructure on Google Cloud Platform (GCP), utilizing services such as BigQuery, Cloud Storage, and Dataflow to handle and process data at scale.
- Proficient in creating and maintaining data pipelines using GCP tools like Dataflow and Pub/Sub, ensuring efficient ingestion and processing of large datasets.
- Experience leveraging Databricks for scalable data engineering, including ETL pipelines, data lakes, and real-time data processing.
- Efficient in all phases of the development lifecycle, including data cleansing, data conversion, data profiling, data mapping, performance tuning, and system testing.
- Experienced in designing and implementing end-to-end ETL pipelines using Airflow and Terraform, automating complex data workflows to improve efficiency and reliability.
- Experience designing and implementing microservices using Spring Boot and RESTful APIs.
- Good knowledge of SQL queries and of creating database objects such as stored procedures, triggers, packages, and functions in SQL and PL/SQL to implement business logic.
- Good understanding of and exposure to Python and Bash.
- Extensively worked on Spark with Scala on clusters for analytics, installed on top of Hadoop, and built advanced analytical applications combining Spark with Hive and SQL/Oracle.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Hands-on experience handling Hive tables using Spark SQL.
- Developed an enterprise-wide PySpark application to load and process transactional data into the Cassandra NoSQL database.
- Created custom columns based on the use case while ingesting data into the Hadoop data lake using PySpark.
- Implemented Apache Kafka for real-time data streaming and integration, enabling efficient data pipelines and enhancing data flow across distributed systems.
- Expert in using JIRA with Jenkins and GitHub for real-time bug tracking and issue management.
- Implemented regression models using PySpark MLlib.
- Excellent working experience with Scrum/Agile and Waterfall project execution methodologies.
- Extensive experience working with business users/SMEs as well as senior management.
- Experience with the Big Data Hadoop ecosystem for ingestion, storage, querying, processing, and analysis of big data.
- Experience developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
- Experienced in technical consulting and end-to-end delivery covering architecture, data modeling, data governance, and solution design, development, and implementation.
- Strong experience with and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Experienced in creating interactive, insightful dashboards and reports using Power BI and Tableau, enhancing decision-making across business functions.
- Strong experience using MS Excel and MS Access to load and analyze data based on business needs.
- Good experience in data analysis; proficient in gathering business requirements and handling requirements management.

TECHNICAL SKILLS
Big Data & Hadoop Ecosystem: MapReduce, Spark, PySpark, HBase, HDFS, Hive, Kafka, Hue, Cloudera Manager, Hadoop, Flink
NoSQL Databases: MongoDB, Cassandra
Databases: Microsoft SQL Server, MySQL Server, Oracle
Cloud Platforms: AWS (S3, IAM, EC2, Redshift, Glue), GCP (Cloud Storage, BigQuery, Pub/Sub, Dataflow)
BI Tools: Tableau 10, SSRS, Looker, Power BI
Programming Languages: SQL, PL/SQL, Python, Scala, Java
Operating Systems: Microsoft Windows, UNIX, Linux
Methodologies: Agile, JIRA, System Development Life Cycle (SDLC), Waterfall
Others: Airflow, Terraform, Docker, Jenkins, Kubernetes

WORK EXPERIENCE

American Family Insurance, Madison, WI | April 2023 – Present
Senior Data Engineer
Responsibilities:
- Assisted in regulatory data tasks including system design, data querying, cleaning, manipulation, and reporting.
- Performed data analysis and developed analytic solutions, identifying correlations and trends.
- Extracted, transformed, and loaded data to generate CSV files using Python and SQL queries.
- Built end-to-end ETL pipelines using Airflow and Terraform, automating data workflows.
- Constructed ETL pipelines using DBT to transform and load data from diverse sources into the data warehouse.
- Worked on Snowflake schemas and data warehousing, processing batch and streaming data load pipelines with Snowpipe and Matillion from AWS S3 data lakes.
- Conducted data analysis using SQL, PL/SQL, Python, Databricks, Teradata SQL Assistant, SQL Server Management Studio, and SAS.
- Configured data loads from S3 to Redshift using AWS Data Pipeline.
- Used Databricks to build and manage ETL pipelines, data lakes, and real-time data processing, significantly improving data handling and processing efficiency.
- Expertise in core Java concepts, including object-oriented programming, exception handling, and collections; skilled in developing web applications using Java EE, Servlets, JSP, and frameworks such as Spring MVC and Struts.
- Implemented data solutions on AWS, including use of the Data Platform (CDP) program, to enhance data storage, processing, and analytics capabilities.
- Enhanced data processing and reporting capabilities by implementing scalable solutions in Databricks, optimizing data flows and performance.
- Worked with AWS and GCP cloud services including GCP Cloud Storage, Dataproc, Dataflow, BigQuery, EMR, S3, Glacier, and EC2 with EMR clusters.
- Developed PySpark frameworks to transfer data from DB2 to Amazon S3 (see the sketch below).
- Created Spark jobs in Scala for real-time analytics and used Spark SQL for querying; managed shard sets and analyzed data distribution.
- Identified areas of improvement in ETL logic, analyzing large datasets in PostgreSQL and Oracle databases.
- Developed data visualizations using Tableau; knowledgeable in numerical optimization, anomaly detection, A/B testing, statistics, and big data techniques including Hadoop, MapReduce, NoSQL, Pig/Hive, Spark, MLlib, Scala, NumPy, SciPy, Pandas, and scikit-learn.
- Developed and configured test environments using Docker containers and Kubernetes.
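For illustration, a minimal sketch of the kind of PySpark DB2-to-S3 transfer framework described above. The JDBC URL, credentials, table name, and S3 bucket are hypothetical placeholders; the actual framework would add configuration management and error handling.

```python
from pyspark.sql import SparkSession

# Minimal illustrative PySpark job: pull a DB2 table over JDBC and land it in S3 as Parquet.
# All connection details below are hypothetical placeholders.
spark = SparkSession.builder.appName("db2-to-s3-transfer").getOrCreate()

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")   # hypothetical DB2 host/database
    .option("dbtable", "SALES.TRANSACTIONS")               # hypothetical source table
    .option("user", "db2_user")
    .option("password", "db2_password")
    .option("driver", "com.ibm.db2.jcc.DB2Driver")         # DB2 JDBC driver must be on the classpath
    .load()
)

# Light cleanup before landing the data in the lake.
cleaned_df = source_df.dropDuplicates()

# Write Parquet to an S3 prefix (bucket name is a placeholder).
cleaned_df.write.mode("overwrite").parquet("s3a://example-data-lake/raw/transactions/")

spark.stop()
```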
- Optimized SQL queries and DBT models to enhance performance and reduce processing time.
- Used AWS Glue for data transformation, validation, and cleansing.
- Extensively used Apache Spark DataFrames, Spark SQL, and Spark MLlib; designed POCs using Scala, Spark SQL, and MLlib libraries.
- Worked in Agile (Scrum), participated in daily scrum meetings, and was actively involved in sprint planning and product backlog creation.
- Developed dashboards and reports for marketing, sales, finance, product, customer success, and engineering using Power BI and Tableau.
- Collaborated with the data team to ensure consistent data validation and a single source of truth for company metrics.
- Designed and implemented data pipelines to collect, process, and analyze large-scale datasets from Teradata, Oracle, SQL, DB2, and various file formats.
- Used Python, Pandas, NumPy, and PySpark to design and maintain data pipelines, automate data workflows, and ensure the efficiency and scalability of the data infrastructure.
- Worked closely with internal stakeholders to enable data-informed decision-making through analytics and insights.
- Documented best practices and participated in knowledge transfer sessions on SQL query optimization and reusability.
- Supported the implementation of strategic technology data architecture to enhance data availability, accessibility, quality, and reliability.
- Recommended optimal solutions for data storage and transfer, building dashboards in Tableau and Power BI.
- Monitored pipeline performance, troubleshot issues, optimized processes, and set up alerts for quick issue resolution.
- Used Scrum to develop and deliver Spark jobs in Python, adapting to new business needs every 15 days.
- Developed PySpark tasks for data processing, including reading from external sources, merging, enriching, and loading data into target destinations.
- Designed and created efficient Hive tables with static and dynamic partitions to meet specific requirements.
- Managed data import from various sources, performed transformations using Spark, and stored results in Hive and S3 buckets.
Environment: Python, ETL, PySpark, Spark, Hadoop, Scala, Hive, Pig, AWS, S3, EC2, EMR, Redshift, Glue, GCP, Java, Cloud Storage, BigQuery, Pub/Sub, Dataflow, Pandas, NumPy, DB2, SQL, Snowflake, PL/SQL, MySQL, Tableau, Power BI, Databricks, Docker, Kubernetes, Airflow, Terraform, Agile, SDLC.

Capgemini, India | May 2021 – Aug 2022
Senior Data Engineer
Responsibilities:
- Deployed and managed cloud infrastructure components on Google Cloud Platform (GCP), utilizing services such as BigQuery, Cloud Storage, and Dataflow to handle and process data at scale.
- Created and maintained data pipelines using GCP tools such as Dataflow and Pub/Sub, ensuring efficient ingestion and processing of large datasets (a sketch follows below).
- Developed data processing solutions with Python and PySpark, leveraging GCP's Dataflow for both batch and real-time processing tasks.
- Built and managed data lakes on GCP with Cloud Storage and BigQuery, supporting a wide range of analytical and operational requirements.
- In-depth knowledge of Java concurrency, including the use of threads, synchronization, and concurrent collections.
- Worked closely with data scientists and analysts to integrate data pipelines with GCP's analytics tools, supporting their analytical needs.
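As a hedged illustration of the Dataflow/Pub/Sub pipelines mentioned above, the sketch below shows a minimal Apache Beam (Python SDK) streaming pipeline that reads JSON events from Pub/Sub and appends them to BigQuery when launched on the Dataflow runner. The project, topic, table, and schema names are assumptions, not actual production values.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runner, project, and region flags would normally be supplied on the command line
# when submitting to Dataflow; only streaming mode is set explicitly here.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/events")   # hypothetical topic
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="example-project:analytics.events",          # hypothetical table
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```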
- Implemented and optimized data warehousing solutions using BigQuery, including schema design, partitioning, and performance tuning.
- Designed, implemented, and maintained data ingestion pipelines to bring data from various sources into the Hadoop ecosystem.
- Developed data processing and analytics solutions using Python and PySpark.
- Utilized Hadoop ecosystem tools such as MapReduce, Apache Spark, Apache Hive, Apache Pig, Apache Flink, and Apache Beam to process large volumes of data efficiently.
- Managed distributed storage systems such as the Hadoop Distributed File System (HDFS) for storing structured and unstructured data.
- Leveraged Apache Beam and Apache Spark to build distributed data processing pipelines.
- Designed and implemented database schemas, tables, indexes, and constraints in PostgreSQL based on application requirements and data modeling best practices.
- Integrated Pub/Sub and Kafka messaging systems for real-time data ingestion and stream processing.
- Designed scalable and reliable workflow orchestration architectures to meet business requirements and data processing needs.
- Implemented IAM policies and configured cloud networking infrastructure for security and optimization.
- Designed efficient data processing workflows using Spark RDDs (Resilient Distributed Datasets), DataFrames, and Datasets APIs.
- Familiar with deploying Java applications on cloud platforms such as AWS, Google Cloud, and Azure.
- Developed DynamoDB tables, global secondary indexes (GSI), local secondary indexes (LSI), and streams to support various use cases and access patterns.
- Integrated Kafka with other data processing systems, messaging queues, and streaming frameworks using Kafka Connect connectors.
- Gathered functional requirements and converted them into technical specifications.
- Optimized Cassandra query performance by designing appropriate data models, partitioning strategies, and indexing schemes.
- Advanced SQL skills including complex joins, stored procedures, cloning, views, and materialized views in Snowflake.
- Experienced with Snowflake cloud architecture, SnowSQL, and Snowpipe for continuous data ingestion.
- Developed ETL pipelines to ingest data from various sources into Hadoop clusters and Hive tables.
- Enhanced data visualization and reporting by creating user-friendly dashboards in Power BI and Tableau, providing actionable insights across departments.
Environment: GCP, Python, ETL, PySpark, MapReduce, Hadoop, Spark, Hive, Flink, Java, AWS, IAM, Cassandra, Kafka, DynamoDB, HDFS, SQL, Cloud Storage, BigQuery, Pub/Sub, Dataflow.

Tata AutoComp Systems, India | Jan 2018 – April 2021
Data Engineer
Responsibilities:
- As a Data Engineer, was responsible for building a data lake as a cloud-based solution in AWS using Apache Spark and Hadoop.
- Involved in Agile methodologies, daily Scrum meetings, and sprint planning.
- Installed and configured Hadoop; responsible for maintaining the cluster and managing and reviewing Hadoop log files.
- Used AWS Cloud and on-premises environments with infrastructure provisioning and configuration.
- Developed views on Redshift to load data from and to an AWS S3 bucket, and handled code migration to production.
- Implemented Spark on AWS EMR for faster data processing using PySpark, the DataFrame API, and Spark SQL (see the sketch below).
- Worked on AWS Athena to import structured data from an AWS S3 bucket into multiple systems, including Redshift, to generate reports.
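As a hedged sketch of the EMR PySpark processing noted above: a small job that reads raw CSV files from S3 into a DataFrame, aggregates them with Spark SQL, and writes Parquet back to S3 for Athena/Redshift reporting. The bucket names and columns are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Illustrative EMR-style PySpark job; bucket names and columns are placeholders.
spark = SparkSession.builder.appName("emr-orders-aggregation").getOrCreate()

# Load raw CSV files landed in the S3 data lake.
raw_df = (
    spark.read.option("header", True).option("inferSchema", True)
    .csv("s3a://example-raw-bucket/orders/")
)
raw_df.createOrReplaceTempView("orders")

# Aggregate with the Spark SQL API.
daily_totals = spark.sql("""
    SELECT order_date, region, COUNT(*) AS order_count, SUM(order_amount) AS total_amount
    FROM orders
    GROUP BY order_date, region
""")

# Persist the result as Parquet for downstream reporting (Athena / Redshift Spectrum).
daily_totals.write.mode("overwrite").parquet("s3a://example-curated-bucket/orders_daily/")

spark.stop()
```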
- Installed, configured, and managed Hadoop clusters, including monitoring and reviewing Hadoop log files to ensure smooth operations and troubleshoot issues.
- Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop.
- Wrote complex Hive queries to extract data from heterogeneous sources (data lake) and persist it into HDFS.
- Utilized Hive to analyze data ingested into HBase via Hive-HBase integration, computing various metrics and generating insights for dashboards.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Developed the code for importing and exporting data into HDFS and Hive using Sqoop.
- Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Developed a reconciliation process to ensure the Elasticsearch index document count matched source records.
- Developed Spark code using Scala and Spark SQL/Streaming for faster data processing.
- Implemented Sqoop to move data from Oracle to Hadoop and load it back in Parquet format.
- Developed incremental and full-load Python processes to ingest data into Elasticsearch from an Oracle database.
- Created Hive external tables to stage data and then moved the data from staging to main tables.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Loaded data through HBase into Spark RDDs and implemented in-memory computation to generate the output response.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
Environment: Hadoop, Spark, Hive, Sqoop, AWS, HBase, Kafka, Python, HDFS, Elasticsearch, and Agile methodology.

Tech Mahindra, India | Feb 2015 – Dec 2017
Junior Data Engineer
Responsibilities:
- Analyzed web log data using HiveQL to derive insights.
- Developed Hive queries to perform trend analysis of user behavior across online modules.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS.
- Designed and developed Kafka- and Storm-based data pipelines accommodating high throughput.
- Utilized Kafka producer APIs for message production.
- Set up HBase to work with Hadoop HDFS.
- Explored Spark and Scala for transitioning from Hadoop/MapReduce to Spark.
- Performed benchmarking of NoSQL databases such as Cassandra and HBase.
- Involved in data pipelines integrated with Amazon Web Services (AWS) EMR, S3, and RDS.
- Devised simple and complex SQL scripts to check and validate data flow in various applications.
- Devised PL/SQL stored procedures, functions, triggers, views, and packages, utilizing indexing, aggregation, and materialized views to optimize query performance.
- Worked with various software development methodologies, including Waterfall and Agile (JIRA).
- Performed data analysis, migration, cleansing, transformation, integration, import, and export using Python.
- Developed logistic regression models in Python to predict subscription response rate based on customer variables such as past transactions, responses to prior mailings, promotions, demographics, interests, and hobbies (see the sketch below).
- Created Power BI dashboards and reports for data visualization, reporting, and analysis, and presented them to the business.
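A minimal sketch of the kind of Python logistic regression referenced above, using scikit-learn; the CSV file, feature columns, and target column are hypothetical placeholders standing in for the actual customer attributes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical customer-level dataset: past transactions, prior-mailing response,
# demographics, etc., with a binary subscription-response target.
data = pd.read_csv("customer_history.csv")
features = ["num_past_transactions", "responded_last_mailing", "age", "tenure_months"]
X = data[features]
y = data["subscribed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate how well the predicted probabilities rank likely responders.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out ROC AUC: {auc:.3f}")
```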
- Created data connections and published them on Tableau Server for use with operational and monitoring dashboards.
- Knowledge of the Tableau administration tool for configuration, adding users, managing licenses and data connections, scheduling tasks, and embedding views by integrating with other platforms.
- Wrote stored procedures and complex SQL queries for backend database operations.
- Used GitHub as the version control system.
- Created tracking sheets for tasks and generated timely progress reports.
Environment: HBase, Hive, SQL, PL/SQL, AWS, Big Data, Hadoop, Spark, Scala, Python, Kafka, HDFS, ETL, Tableau, and Power BI.

EDUCATION
University of Michigan - Dearborn, MI | Aug 2023
Master of Science in Data Science
Jawaharlal Nehru Technological University, Hyderabad, India | May 2015
Bachelor of Technology