Sainath - Data Engineer |
[email protected] |
Location: New York, New York, USA |
Relocation: Open |
Visa: Green Card |
Sainath Reddy
Sr. Data Engineer | Ph: +1 (234) 219 2017

Professional Summary:
- 9+ years of solid experience building end-to-end pipelines with PySpark, Python, and AWS services, with a thorough understanding of distributed systems architecture and parallel processing frameworks.
- Experience writing complex SQL queries and creating reports and dashboards.
- Experience implementing and deploying workloads on Azure VMs.
- Designed and implemented an Enterprise Data Lake to support a variety of use cases, including analytics, processing, storing, and reporting on large volumes of rapidly changing data.
- Working knowledge of the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Pig, Hive, Sqoop, Oozie, and other big data technologies.
- In-depth understanding of Snowflake databases, schemas, and table structures.
- Experience in migration assessment with tools such as Azure Migrate and Cloudamize.
- Created data pipelines in Airflow on GCP for ETL jobs using various Airflow operators.
- Working knowledge of GCP Dataproc, Dataflow, GCS, Cloud Functions, and BigQuery.
- Strong experience with Airflow and automating daily data-import tasks.
- Experience developing Spark applications in Databricks using Spark SQL, PySpark, and Delta Lake for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming data to uncover insights into customer usage patterns.
- Expertise in migrating on-premises ETL to GCP using cloud-native tools such as Cloud Composer, BigQuery, Cloud Dataproc, and Google Cloud Storage.
- Worked with Spark, Spark Streaming, and the core Spark API to evaluate Spark features for building data pipelines.
- Proficient with Cloudera Manager for managing Hadoop clusters and services.
- Developed Python scripts for parsing XML documents and loading data into databases.
- Worked with various scripting technologies such as Python and UNIX shell scripts.
- Extensive knowledge of Snowflake Clone and Time Travel.
- Utilized Hive, Spark SQL, and PySpark to load and transform data.
- Keenly interested in learning the newer technology stacks that Google Cloud Platform (GCP) adds.
- Worked with MapReduce programs in Apache Hadoop to process Big Data.
- Good knowledge of Azure Active Directory; experience integrating Azure AD with Windows-based AD and integrating applications with Azure AD.
- Created a Python Kafka consumer API for consuming data from Kafka topics (see the sketch below); used Kafka to consume Extensible Markup Language (XML) messages and Spark Streaming to process them and capture User Interface (UI) updates.
- Extensive experience with IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
- Working knowledge of various data sources such as flat files, XML, JSON, CSV, Avro, and Parquet files, and databases.
- Hands-on experience managing Azure services and subscriptions through the Azure portal and managing Azure resources with Azure Resource Manager.
- Working knowledge of database design, entity relationships, and database analysis, as well as SQL programming and PL/SQL stored procedures, packages, and triggers in Oracle.
- Excellent working knowledge of the Scrum/Agile framework and Waterfall project management methodologies.
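A minimal sketch of the kind of Python Kafka consumer mentioned above, assuming the kafka-python client; the topic, broker address, and group id are hypothetical placeholders, not the actual project values.

    # Minimal Kafka consumer sketch (assumes the kafka-python package;
    # topic, broker, and group names are hypothetical placeholders).
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "orders-events",                       # hypothetical topic
        bootstrap_servers=["localhost:9092"],  # hypothetical broker
        group_id="data-eng-consumers",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: raw.decode("utf-8"),
    )

    for message in consumer:
        record = message.value
        # Downstream processing would go here (e.g., parse the XML/JSON
        # payload and load it into a database or hand it to Spark Streaming).
        print(message.topic, message.partition, message.offset, record[:80])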
Technical Skills:
- Databases: Oracle, MySQL, Hive, SQL Server, PostgreSQL
- Big Data Technologies: HDFS, Hive, Talend, PySpark, NiFi, MapReduce, Pig, YARN, Sqoop, Oozie, Zookeeper
- Programming Languages: Python, Java (Microservices, Spring Boot), SQL, R, PL/SQL, Scala, JSON, XML, HL7
- Cloud Computing: AWS (S3, EMR, Lambda, EC2, Redshift), GCP (BigQuery, Dataproc, Dataflow, gcloud, Cloud Composer, Cloud Monitoring, GCS, Firebase), Azure
- Techniques: Data Mining, Clustering, Data Visualization, Data Analytics
- NoSQL Databases: HBase, Cassandra
- Container Platforms: Docker, Kubernetes, Jenkins
- Visualization Tools: Power BI, Tableau, Cognos

Professional Experience:

Client: Capital One, Richmond, VA                                                  Jan 2022 - Present
Sr. Data Engineer (GCP)
Responsibilities:
- Created Spark applications in Python and implemented an Apache Spark data processing project that handled data from various RDBMS and streaming sources.
- Responsible for creating scalable distributed data solutions with Hadoop.
- Implemented the Spring Circuit Breaker pattern and integrated the Hystrix dashboard to monitor Spring microservices.
- Created data pipelines in GCP using Airflow for ETL jobs with various Airflow operators.
- Built data pipelines using Apache Airflow in the GCP Cloud Composer environment, using operators such as bash operators, Hadoop operators, Python callables, and branching operators.
- Used various services to migrate an existing on-premises application to AWS.
- Developed physician order-entry forms using the Java Swing API.
- Maintained a Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.
- Developed big data pipelines for new subject areas coming into existing systems using SQL, HiveQL, UNIX, and other Hadoop tools.
- Configured private and public-facing Azure load balancers.
- Stored data in Hive for analysis to meet business-logic specifications.
- Used Spring Boot on the back end to develop applications with minimal configuration.
- In GCP, used the Cloud Shell SDK to configure the Dataproc, Storage, and BigQuery services.
- Used the GCP environment for event-based triggering, cloud monitoring, and alerting.
- Used Python and Cloud Functions to load on-arrival CSV files from a GCS bucket into BigQuery.
- Involved in the migration of TIBCO SOAP services to Java REST/JSON APIs.
- Created a NiFi dataflow to consume data from Kafka, transform the data, store it in HDFS, and expose a port to run a Spark streaming job.
- Gained experience with Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions.
- Created a Python Kafka consumer API for consuming data from Kafka topics; used Kafka to consume XML-formatted messages and Spark Streaming to process them and capture User Interface (UI) updates.
- Configured Azure Blob Storage and Azure file servers based on requirements.
- Created a pre-processing job that uses Spark DataFrames to flatten JSON documents into a flat file.
- Created a GCP Cloud Composer DAG to load data from on-premises CSV files into BigQuery tables; the DAG is scheduled to load incrementally (see the sketch below).
- Worked with Spark to improve performance and optimize existing Hadoop algorithms.
- Set up Snowpipe to pull data from Google Cloud Storage buckets into Snowflake tables.
- De-normalized data as part of the Netezza transformation and loaded it into NoSQL databases and MySQL.
- Involved in delivering a software evaluation matrix for all competitive vendors for the Big Data Hadoop enterprise solution.
- Designed and developed microservices using Spring Boot.
- Created ETL programs in Netezza (using nzload and nzsql) to load data into the data warehouse.
- Supported the existing GCP data management implementation, managing physical and logical data structures as well as metadata.
- Implemented a microservices-based cloud architecture on the AWS platform.
- Solid knowledge of Cassandra architecture, replication strategy, gossip, snitches, and so on.
- Applied HiveQL to partitioned and bucketed data and ran Hive queries on Parquet tables.
- Used Apache Kafka to collect web log data from multiple servers and make it available to downstream systems for data analysis and engineering roles.
- Azure storage planning: migrated Blob Storage for document and media files, Table Storage for reliable messaging for workflow processing, and File Storage for shared file data.
- Contributed to the implementation of Kafka security and its performance enhancement.
Environment: Spark, Spark Streaming, Spark SQL, GCP, AWS, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Jenkins, Eclipse, Azure, Oracle, Git, Oozie, Tableau, SOAP, Agile methodologies.
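A minimal sketch of the Cloud Composer DAG pattern described above (incremental loads of arriving CSV files from GCS into a BigQuery table), assuming Airflow 2.x and the google-cloud-bigquery client; the bucket, project, dataset, and table names are hypothetical placeholders.

    # Minimal Cloud Composer / Airflow DAG sketch: append newly arrived CSV
    # files from a GCS prefix into a BigQuery table. All resource names are
    # hypothetical placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from google.cloud import bigquery


    def load_csv_to_bigquery(ds, **_):
        """Append the execution date's CSV drop to the target BigQuery table."""
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        uri = f"gs://example-landing-bucket/orders/{ds}/*.csv"  # hypothetical prefix
        load_job = client.load_table_from_uri(
            uri, "example_project.analytics.orders", job_config=job_config
        )
        load_job.result()  # block until the load job finishes


    with DAG(
        dag_id="gcs_csv_to_bigquery_incremental",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",  # incremental daily loads
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="load_csv_to_bigquery",
            python_callable=load_csv_to_bigquery,
        )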
Client: ConcertAI, Cambridge, MA                                                   Mar 2019 - Dec 2021
Sr. Data Engineer (GCP)
Responsibilities:
- Implemented the installation and configuration of a multi-node cloud cluster using Amazon Web Services (AWS) on EC2.
- Handled AWS management tools such as CloudWatch and CloudTrail; log files were saved to AWS S3, and versioning was enabled on S3 buckets containing highly sensitive data.
- Created Spark applications extensively using Spark DataFrames and the Spark SQL API, and used the Spark Scala API to implement batch processing of jobs.
- Used the Azure portal to manage virtual networks.
- Contributed to the development of a common framework to be used across all big data applications.
- Implemented Spring Boot microservices to process messages into the Kafka cluster setup.
- Used a microservices architecture with Spring Boot based services interacting through a combination of REST and Apache Kafka message brokers.
- Designed and implemented GCP-driven data solutions for the enterprise data warehouse and data lakes.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Involved in facilitating data for Tableau dashboards based on Hive tables used by business users.
- Developed real-time data movement using Spark Structured Streaming and Kafka (see the sketch below).
- Implemented real-time Tableau refresh using AWS Lambda functions.
- Developed an interface using Java for automated creation of EMS components.
- Worked on Servlets, JSP, JavaBeans, RMI, JDBC, and a common utilities e-mail service framework.
- Worked on migrating REST APIs from AWS (Lambda, API Gateway) to a microservices architecture using Docker and Kubernetes (GCP GKE), deploying scalable REST APIs with Flask and Gunicorn.
- Stored data files in Google Cloud Storage buckets on a daily basis.
- Used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Experience integrating an on-premises data center with an Azure data center.
- Deployed Spark jobs and ran them on GCP Dataproc clusters.
- Extensive experience with Cloudera Hadoop; hands-on experience with Hadoop on the GCP and AWS stacks.
- Knowledge of Spark APIs, AWS Glue, and the AWS CI/CD pipeline.
- Used GitLab CI/CD to deploy the majority of the applications and data pipelines; good exposure to Jenkins and Terraform as well.
- Extensive experience with Core Java 8, Spring Boot, Spring, Hibernate, web services, Kubernetes, Swagger, and Docker, integrating databases such as MongoDB and MySQL with web pages built in HTML, PHP, and CSS to update, insert, delete, and retrieve data using simple ad-hoc queries.
- Created an AWS CloudFormation script to create an environment.
- Used Hive scripts to create Hive tables, load data, and analyze it.
- Used Spring Boot, which is radically faster for building cloud microservices and developing Spring-based applications with very little configuration.
- Exposure to PowerShell scripting to automate the management of many services in Azure.
- Implemented Azure AD, configured SSO and multi-factor authentication, configured SSO from Windows 10 computers joined to Azure AD, and implemented and managed AD synchronization.
- Using Glue and EMR, transferred a sizable amount of data from AWS S3 buckets to AWS Redshift; used EMR, Glue, and Spark to analyze large and critical datasets.
- Worked on dimensional and relational data modeling with star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling with Erwin.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
- Created predictive analytics reports with Python and Tableau, including visualization of model performance and prediction results.
Environment: Python, PySpark, Spring Boot, Spark SQL, Flask, HDFS, Hive, GitHub, Oozie, Scala, HQL, Jenkins, SQL, AWS Cloud, Azure, GCP, S3, EC2, EMR, Redshift, Athena, Glue, Airflow.
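A minimal sketch of the real-time data movement pattern described above, using Spark Structured Streaming to read from Kafka; the broker address, topic, schema, and output paths are hypothetical placeholders.

    # Minimal Spark Structured Streaming sketch: read JSON events from Kafka,
    # parse them, and append them to a Parquet sink. All names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka_stream_demo").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")  # hypothetical broker
        .option("subscribe", "clinical-events")               # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "gs://example-bucket/streams/clinical_events/")            # hypothetical sink
        .option("checkpointLocation", "gs://example-bucket/checkpoints/clinical_events/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()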
Client: CVS, Plano, TX                                                             Feb 2017 - Mar 2019
Data Engineer
Responsibilities:
- Gathered data and business requirements from end users and management.
- Designed and built data migration solutions from the data warehouse to the Atlas Data Lake (Big Data).
- Developed Azure ARM templates and deployed them using VSTS to provision infrastructure in Azure.
- Analyzed massive amounts of data; created simple and complex Hive and SQL scripts to validate data flow in a variety of applications; performed validation of reports.
- Used HUB to validate data profiling and data lineage.
- Implemented Java design patterns such as the factory, MVC, and singleton patterns throughout the application.
- Built ETL pipelines and processing using big data tooling to create feeds and load them into the data lake.
- Deployed Spring Boot based microservices as Docker containers on Amazon EC2 using the AWS admin console.
- Performed automated dynamic scans for Java and .NET applications using IBM AppScan.
- Used AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training ML models, and deploying them for prediction (see the sketch below).
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with Amazon SageMaker tasks.
- Created PL/SQL stored procedures, functions, triggers, views, and packages; used indexing, aggregation, and materialized views to improve query performance.
- Used Tableau, Power BI, and Cognos to create reports for data validation.
- Participated in creating Tableau dashboards utilizing the Show Me functionality, stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, and so on; used Tableau Desktop and Tableau Server to create dashboards and stories as needed.
- Experience designing and implementing Azure Backup and recovery.
- Implemented PowerShell scripts in Azure Automation to fix known issues surfaced by OMS.
- Performed statistical analysis with SQL, Python, R, and Excel; worked extensively with Excel VBA macros and Microsoft Access forms.
- Imported, cleaned, filtered, and analyzed data using tools such as SQL, Hive, and Pig.
- Extracted, transformed, and loaded source data from transaction systems using Python and SAS, generating reports, insights, and key conclusions.
- Installed, configured, administered, and monitored Azure IaaS and Azure AD.
- Analyzed and recommended changes to improve data consistency and efficiency.
- Designed and developed data mapping procedures for ETL data extraction, analysis, and loading using R.
- Managed multiple subscriptions in Azure using PowerShell and the Azure portal; implemented Pester to validate Azure resources after deployment.
- Effectively communicated project plans, status, risks, and metrics to the project team; planned test strategies in accordance with project scope.
- Ingested data with Sqoop and Flume from the Oracle database; responsible for broad data ingestion via Sqoop and HDFS commands.
- Compiled partitioned data in various storage formats such as text, JSON, and Parquet.
- Participated in loading data from the Linux file system to HDFS.
- Collaborated with AWS on storage and handling of terabytes of data for customer BI reporting tools.
- Real-world experience with dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Used Apache Airflow to author, schedule, and monitor data pipelines.
- Processed multiple sources across domains and created a big data refinery layer.
- Automated VPN setup in Azure Virtual Network.
- Experience with Confluence and Jira, as well as data visualization tools such as Matplotlib and the Seaborn library.
- Previous experience implementing a machine learning back-end pipeline using Pandas and NumPy.
Environment: AWS, Python, PySpark, Spring Boot, Tableau, R, Pig, SQL, NumPy, Azure, Linux, HDFS, JSON, ETL, Snowflake, Power BI, Hive, Sqoop, HUB.
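A minimal boto3 sketch of the Step Functions orchestration pattern mentioned above (publish a dataset to S3, then start the state machine that wraps the SageMaker training and deployment steps); the bucket, key, and state machine ARN are hypothetical placeholders.

    # Minimal boto3 sketch: publish a training dataset to S3 and start an AWS
    # Step Functions execution that orchestrates downstream SageMaker tasks.
    # Bucket, key, and state-machine ARN are hypothetical placeholders.
    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    sfn = boto3.client("stepfunctions")

    # 1. Publish the prepared dataset to S3.
    s3.upload_file(
        Filename="train.csv",
        Bucket="example-ml-datasets",
        Key="claims-model/train/train.csv",
    )

    # 2. Kick off the state machine that trains and deploys the model.
    run_name = datetime.now(timezone.utc).strftime("claims-model-%Y%m%d-%H%M%S")
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:claims-model-pipeline",
        name=run_name,
        input=json.dumps({"train_data": "s3://example-ml-datasets/claims-model/train/train.csv"}),
    )
    print("Started execution:", response["executionArn"])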
Client: Tata Consultancy Services, India                                           May 2015 - Dec 2016
Data Engineer
Responsibilities:
- Responsible for requirements analysis, application design, coding, testing, maintenance, and support.
- Created stored procedures, functions, database triggers, packages, and SQL scripts based on requirements.
- As a consultant, responsible for establishing the Big Data Hadoop platform as an extensive solution platform.
- Developed enterprise Java beans such as session and entity beans for both application modules.
- Created complex SQL queries using views, subqueries, and correlated subqueries.
- Completed architecture and implementation assessments of a number of AWS services, including Amazon EMR, Redshift, and S3.
- Created an Oozie workflow to automate the tasks of loading data into HDFS and pre-processing with Pig and HiveQL.
- Worked on full-stack big data technologies; created Pig UDFs to pre-process data for analysis.
- Worked on creating Hive tables, loading data, and writing Hive queries that run internally in a MapReduce fashion.
- Worked on ZooKeeper cluster coordination services.
- Built Oozie workflows and automated processes in the Cloudera environment.
- Developed a Kafka consumer API in Python for consuming data from topics.
- Used Sqoop to export the analyzed data to relational databases for visualization and report generation by the BI team.
- Worked on migrating SQL scripts from Redshift and Athena.
- Experience building CI/CD pipelines for testing and production environments using Terraform.
- Proficient with container systems such as Docker and container orchestration such as EC2 Container Service and Kubernetes; worked with Terraform.
- Contributed to the PySpark API in order to process larger data sets in the cluster.
- Analyzed data using Hive queries and presented it using Tableau dashboards.
- Wrote scripts to automate the scheduling of Hive, Spark SQL, Pig, and Sqoop jobs.
- Created new tables, indexes, synonyms, and sequences as needed to meet new requirements.
- Used joins, subqueries, and correlated subqueries to implement complex SQL.
- Developed shell scripts to invoke SQL scripts and scheduled them using crontab.
- Created unit test cases from functional requirements.
- Developed automated regression scripts to validate the ETL process between multiple databases such as AWS Redshift, MongoDB, T-SQL, and SQL Server using Python.
Environment: Hadoop, Python, PySpark, HDFS, Java, MapReduce, Hive, Sqoop, Spark SQL, HQL, Oozie, Git, Oracle, Pig, Cloudera, UNIX, Agile, T-SQL, AWS Redshift, MongoDB.

Client: Aurobindo Pharma, India                                                    June 2014 - May 2015
Hadoop SQL Developer
Responsibilities:
- Installed and configured SQL Server 2005; worked on the development and optimization of a new loans database in SQL Server 2005.
- Responsible for building scalable distributed data solutions using Hadoop.
- Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
- Developed and designed ETL jobs to load data from multiple source systems into the target schema in a Teradata database.
- Worked on data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive (see the sketch below).
- Created and deployed SSIS 2005 packages and built reports using SSRS 2005.
- Worked extensively on enrichment/ETL in real-time streaming jobs using Spark Streaming and Spark SQL, loading results into HBase.
- Developed and tested Spark code in Scala for Spark Streaming/Spark SQL for faster data processing.
- Developed microservices (REST APIs) using Java and Spring Boot to support the Citi NGA cloud framework and deployed the microservices in the dev space of Pivotal Cloud Foundry.
- Developed entire front-end and back-end modules using Python on the Django web framework.
- Configured database maintenance plans for backup and database optimization; created users, user groups, and access permissions.
- Wrote and executed various MySQL database queries from Python using the MySQL Connector and MySQLdb packages.
- Used Performance Monitor, SQL Profiler, and DBCC to tune performance.
- Worked on various projects automating the ETL process from different data sources to SQL Server using SQL Server Agent.
- Created constraints, indexes, and views, as well as stored procedures and triggers.
- Collaborated with developers to fine-tune queries, run scripts, and migrate databases.
- Developed simple to complex MapReduce jobs using Hive and Pig; optimized MapReduce jobs to use HDFS efficiently through various compression mechanisms.
- Created backup and recovery procedures for databases on production, test, and development servers; assisted in the resolution of production issues.
Environment: Hadoop, Hive, MapReduce, Windows 2000 Advanced Server/Server 2003, MS SQL Server 2005 and 2000, T-SQL, ETL, SQL.
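A minimal PySpark sketch of the HDFS extraction, aggregation, and load-to-Hive pattern described in the Aurobindo Pharma role above; the HDFS path, column names, and Hive table name are hypothetical placeholders.

    # Minimal PySpark sketch: read raw CSV data from HDFS, aggregate it,
    # and persist the result as a Hive table. Paths and names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.appName("hdfs_aggregation_to_hive")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Extract: raw batch records landed on HDFS.
    raw = spark.read.csv("hdfs:///data/raw/batch_records/", header=True, inferSchema=True)

    # Transform: daily aggregation per product line.
    daily_summary = (
        raw.groupBy("product_line", "batch_date")
        .agg(
            F.count("*").alias("batch_count"),
            F.avg("yield_pct").alias("avg_yield_pct"),
        )
    )

    # Load: store the aggregate in Hive for downstream analysis and reporting.
    daily_summary.write.mode("overwrite").saveAsTable("analytics.daily_batch_summary")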