Nirnay Reddy - Data Engineer
[email protected]
Location: Frisco, Texas, USA
Relocation: Yes
Visa: H1B
Professional Summary
Data engineering professional with 9+ years of experience and strong analytical and intuitive skills in delivering successful Big Data and cloud projects. Able to support cross-functional teams in the design, development, testing, and delivery of cutting-edge solutions.

Technical Summary
Cloud Components: AWS (S3, EC2, Glue Data Catalog, Athena, EMR, RDS, Redshift), Azure, Azure ADF, Snowflake, StreamSets, DBT, Airflow, Prefect, Azure Synapse
Hadoop Ecosystem Development: HDFS, Hive, MapReduce, Pig, Oozie, Flume, Impala, Spark, Kafka, Sqoop
Database Systems: MySQL, Oracle, SQL Server 2005/2008, Microsoft Access, DB2, HBase, MongoDB
Languages: C, Python, Java, Scala, SQL, HiveQL, Pig Latin, UNIX shell scripting, COBOL, CICS, Spark, PySpark
Tools and Utilities: MicroStrategy, Qlik Sense, Tableau, Eclipse, Mainframes, Alteryx, Librarian, Change Controls, checklists, GGS, DFSORT, ICETOOL, Xpeditor, Macros, JSFiddle, Bootstrap, Opera Mobile Emulator, JIRA, Scrum, Agile methodologies, Git, Maven

Professional Experience

Data Engineer | Ford | July 2021 - Present
Technologies used: Hive, SQL, Python, Scala, Unix, shell scripting, Bitbucket, Spark, Kafka, HBase, HDFS, Git, Jenkins, MySQL (IDE: DataGrip), Airflow, Azure
Specific Duties, Activities, and Responsibilities:
- Created and designed the front-end user interface for third-party external Ford vendors.
- Worked in the Corrective Action Management team to ensure ordered vehicles take the most efficient path to delivery to the end client.
- Enabled user functionality and the process pipeline for vendors to create invoices.
- Maintained and enhanced the company's multi-branded websites, such as Moving Vehicle - Corrective Action Management.
- Implemented machine learning models to analyze historical delivery data, identifying patterns and optimizing future delivery routes for autonomous vehicles.
- Utilized SQL and Spark to construct databases, identifying weaknesses in client code and proposing improvements.
- Experienced with version control systems such as CVS, Git, and GitHub, and with deployment using Heroku.
- Proven track record of migrating data warehouses and databases onto Hadoop/NoSQL platforms.
- Conducted complex data analysis in collaboration with senior data scientists and compiled detailed reports, contributing valuable insights to decision-making.
- Employed PyTorch to implement machine learning models, enhancing predictive analytics capabilities within the data analysis framework.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); see the sketch below.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed the data in Azure Databricks.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL Activity.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back in the reverse direction.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Utilized Azure Service Fabric's container hosting, cluster resource management, and workload orchestration capabilities.
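A minimal PySpark sketch of the kind of Spark SQL load described above (source-system data landed in Azure Data Lake Storage and processed in Databricks); the storage account, container paths, and column names are hypothetical placeholders, not actual project assets.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adls-invoice-etl").getOrCreate()

    # Hypothetical ADLS Gen2 paths; real account and container names differ.
    raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/vendor_invoices/"
    curated_path = "abfss://curated@examplestorage.dfs.core.windows.net/vendor_invoices/"

    # Register the raw feed and apply a Spark SQL transformation.
    spark.read.json(raw_path).createOrReplaceTempView("raw_invoices")
    curated = spark.sql("""
        SELECT vendor_id,
               invoice_id,
               CAST(invoice_amount AS DECIMAL(18,2)) AS invoice_amount,
               TO_DATE(invoice_date)                 AS invoice_date
        FROM raw_invoices
        WHERE invoice_status = 'SUBMITTED'
    """)

    # Write the curated layer partitioned by date for downstream reporting.
    curated.write.mode("overwrite").partitionBy("invoice_date").parquet(curated_path)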
- Built stateless and stateful microservices to power complex, low-latency, data-intensive scenarios and scaled them into or across the cloud with Azure Service Fabric.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Comprehensive understanding of Apache Spark job execution components, coupled with practical experience working with NoSQL databases such as HBase and MongoDB.
- Managed Databricks clusters, configuring cluster settings, instance types, and autoscaling policies to optimize performance and resource utilization for data processing workloads.
- Developed data engineering pipelines on Databricks using Apache Spark, PySpark, and SQL, ingesting, transforming, and aggregating large-scale datasets for analytics and reporting.
- Conducted performance tuning and optimization of Spark jobs on Databricks, analyzing execution plans, optimizing data partitioning, and leveraging caching mechanisms to improve job throughput and reduce execution times.
- Utilized Databricks Delta to build scalable and reliable data lakes, managing ACID transactions, schema enforcement, and time-travel capabilities for data versioning and rollbacks.
- Implemented machine learning workflows on Databricks using MLflow and Databricks ML libraries, building and training predictive models for classification, regression, and clustering tasks.
- Designed and implemented integration workflows using Azure Logic Apps to automate business processes, orchestrate data flows, and integrate disparate systems and services across the Azure ecosystem.
- Configured connectors within Azure Logic Apps to connect to various Azure services, SaaS applications, on-premises systems, and third-party APIs, enabling seamless data exchange and communication between endpoints.
- Developed trigger-based automation workflows in Azure Logic Apps, using triggers such as HTTP requests, timers, file-system events, and message queues to initiate workflow executions based on predefined conditions.
- Integrated Amazon Redshift with ETL (extract, transform, load) processes, ensuring seamless data movement and transformation for analytics and reporting.
- Developed and optimized database schemas in Amazon Redshift, maximizing query performance and minimizing data redundancy.
- Developed a streaming pipeline to consume data from Kafka and ingest it into HDFS in near real time, improving the responsiveness of data processing systems.
- Integrated the AWS Glue Data Catalog with AWS Glue ETL jobs, streamlining data processing and ensuring consistency between the catalog and the underlying data.
- Trained and fine-tuned models using PyTorch's optimization techniques, including gradient descent variants, learning rate scheduling, and weight regularization.
- Continuously improved processes by converting HiveQL queries into Spark transformations, employing Spark RDDs, Spark SQL, and Python to enhance data processing efficiency.
- Hands-on experience setting up workflows with the Apache Airflow and Oozie workflow engines for managing and scheduling Hadoop jobs.
- Installed and configured Apache Airflow for the S3 bucket and the Snowflake data warehouse, and created DAGs to run the Airflow workloads.
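As an illustration of the Airflow setup mentioned in the last point, below is a minimal DAG sketch that copies staged S3 data into Snowflake on a daily schedule; the connection id, schema, stage, and table names are hypothetical, and the Airflow Snowflake provider package is assumed to be installed.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    # Hypothetical connection id and Snowflake object names, for illustration only.
    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load_orders = SnowflakeOperator(
            task_id="copy_orders_from_s3_stage",
            snowflake_conn_id="snowflake_default",
            sql="""
                COPY INTO analytics.orders
                FROM @analytics.s3_orders_stage
                FILE_FORMAT = (TYPE = PARQUET)
                MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
            """,
        )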
- Experienced in importing and exporting data between databases such as MySQL and other RDBMSs and HDFS/HBase using Sqoop.

Azure Data Engineer | Morgan Stanley | July 2019 - June 2023
Technologies used: AWS, Redshift, DynamoDB, Athena, Hadoop, HDFS, Hive, Spark, SQL, Python, Tableau, Unix, shell scripting
Specific Duties, Activities, and Responsibilities:
- Successfully executed extract, transform, load (ETL) functions on diverse data sources, including Mainframe, PADT, Scoring-Pairing, and vesting, landing the data in the Hadoop Distributed File System (HDFS).
- Leveraging the Data Ingestion Accelerator framework, conducted various types of ingestion of Stock Plan, Finwell, and Retirement data into HDFS, incorporating Type-2 Slowly Changing Dimensions (SCD).
- Enriched the ingestion process by using PySpark to import enriched data into the Hadoop cluster, implementing business rules on unstructured data in the HDFS data curation layer.
- Strategically targeted Individual Retirement Account clients, clients with positive stock balances, and clients experiencing a vesting event through Adobe targeted messaging, displaying a global banner on Morgan Stanley's online webpage and mobile application and contributing to effective client engagement.
- Played a pivotal role in preparing data for prescriptive and predictive modeling to optimize processes, working alongside senior data engineers to design ETL processes that consolidate data in a single repository.
- Scheduled daily and monthly batch jobs exporting files into Salesforce Marketing Cloud, improving user satisfaction by ~10%.
- Conducted in-depth analysis of client and Financial Advisor data, resulting in an enhanced clientele portfolio and improved client satisfaction.
- Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Experience building ETL data pipelines in Azure Databricks leveraging PySpark and Spark SQL.
- Experience building orchestration in Azure Data Factory for scheduling purposes.
- Experience working with the Azure Logic Apps integration tool.
- Worked on NoSQL databases including HBase and Cassandra.
- Expertise working with databases such as Azure SQL DB and Azure SQL DW.
- Hands-on experience with Azure analytics services: Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure Data Factory (ADF), Azure Databricks (ADB), etc.
- Orchestrated data integration pipelines in ADF using activities such as Get Metadata, Lookup, ForEach, Wait, Execute Pipeline, Set Variable, Filter, and Until.
- Designed and developed a new solution to process NRT data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues.
- Created linked services to connect external resources to ADF.
- Used Azure DevOps and Jenkins pipelines to build and deploy resources (code and infrastructure) in Azure.
- Developed real-time stream processing applications on Databricks using Structured Streaming, ingesting and processing streaming data from sources such as Kafka, Kinesis, and Event Hubs for real-time analytics (see the sketch below).
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services, including Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB.
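A minimal PySpark Structured Streaming sketch in the spirit of the Databricks streaming work described above; the Kafka brokers, topic, schema, and output paths are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("nrt-client-events").getOrCreate()

    # Hypothetical event schema for the incoming Kafka messages.
    event_schema = StructType([
        StructField("account_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read the raw stream from Kafka (hypothetical brokers and topic).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "client-events")
           .option("startingOffsets", "latest")
           .load())

    # Parse the JSON payload and append it to a Delta table for analytics.
    events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    (events.writeStream
     .format("delta")  # Delta on Databricks; swap for "parquet" on plain HDFS
     .option("checkpointLocation", "/mnt/checkpoints/client-events")
     .outputMode("append")
     .start("/mnt/curated/client_events"))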
- Leveraged Databricks collaborative notebooks for interactive data exploration, analysis, and visualization, working with team members to share insights and findings in a collaborative environment.
- Supported data science workflows on Databricks, assisting data scientists with data preparation, feature engineering, model training, and evaluation tasks using Python.
- Created interactive dashboards and visualizations in Databricks using libraries such as Matplotlib, Seaborn, and Plotly, presenting insights and trends derived from data analysis and machine learning models.
- Implemented data transformation and mapping logic within Azure Logic Apps using built-in functions, expressions, and mapping components, ensuring compatibility and consistency of data exchanged between systems.
- Implemented error-handling mechanisms and retry policies within Azure Logic Apps to handle exceptions, retries, and fault tolerance in integration workflows, ensuring robustness and reliability of automated processes.
- Integrated cloud-based and on-premises systems using Azure Logic Apps, leveraging integration runtime components such as Azure Data Gateway and Azure Service Bus to bridge connectivity gaps between environments.
- Worked on outlier detection with data visualizations using boxplots and on feature engineering using k-NN distances with the Pandas and NumPy packages in Python.
- Used Python libraries such as Pandas, NumPy, Seaborn, SciPy, Matplotlib, and scikit-learn to develop machine learning algorithms.
- Developed a Spark API to import data into HDFS from Teradata and created Hive tables (see the sketch below).
- Worked on different layers within the data lake, including the data ingestion, data curation, and data distribution layers.
- Used Tableau as the primary advanced analytics tool for reporting and creating dashboards.
- Worked with the data management team during data integration, focusing on data quality and data profiling to integrate, transform, and provision accurate data.
- Implemented cost-effective analytics solutions with AWS Athena by leveraging its serverless architecture, pay-as-you-go pricing model, and auto-scaling capabilities, minimizing infrastructure costs while maximizing analytical capability.
- Integrated AWS DynamoDB with AWS Lambda to implement event-driven architectures, automating data processing and triggering business logic based on changes in DynamoDB tables.
- Contributed to data migration from Oracle and MySQL to Hadoop, building advanced data analytics for better performance and enabling notification for Data-in-Motion.
- Designed and implemented real-time Big Data processing, enabling analytics and event detection and leading to a substantial increase in data quality (~12.5%).
- Utilized Python scripting and worked closely with application teams to install and manage operating system and Hadoop updates, patches, and version upgrades as required.
- Worked with Spark for performance optimization and real-time analytics on Hadoop, utilizing Spark Context, Spark SQL, DataFrames, Pair RDDs, and YARN.
- Responsible for building scalable distributed data solutions using Hadoop.
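A minimal sketch of the Teradata-to-Hive load noted above, using a Spark JDBC read and a Hive-managed Parquet table; the JDBC URL, credentials, and table names are hypothetical, and the Teradata JDBC driver is assumed to be on the cluster classpath.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("teradata-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical Teradata connection details and source table.
    accounts = (spark.read.format("jdbc")
                .option("url", "jdbc:teradata://td-host/DATABASE=finance")
                .option("driver", "com.teradata.jdbc.TeraDriver")
                .option("dbtable", "finance.accounts")
                .option("user", "etl_user")
                .option("password", "********")
                .load())

    # Land the data in HDFS as a Hive table for downstream analytics.
    (accounts.write
     .mode("overwrite")
     .format("parquet")
     .saveAsTable("curated.accounts"))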
- Experience in scripting for automation and monitoring using Python.

Data Engineer | CVS Health | June 2017 - July 2019
Technologies used: Bitbucket, SQL, Python, PySpark, Unix, shell scripting, GCP, BigQuery, Bigtable, Kafka, Hive
Specific Duties, Activities, and Responsibilities:
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Worked on analyzing Hadoop clusters and various big data analytics tools, including Pig, the HBase database, and Sqoop.
- Worked with the data management team during data integration, focusing on data quality, data profiling, and data provisioning.
- Designed and implemented real-time Big Data processing to enable real-time analytics, event detection, and notification for Data-in-Motion.
- Involved in importing data from web logs and app logs using Flume.
- Involved in MapReduce and Hive optimization.
- Involved in importing data from Oracle to HDFS using Sqoop.
- Involved in writing MapReduce programs and Hive queries to load and process data in the Hadoop file system.
- Involved in creating Hive tables, loading them with data, and extensively writing Hive queries.
- Experienced in using Kafka as a data pipeline between JMS (producer) and a Spark Streaming application (consumer).
- Involved in the development of a Spark Streaming application for one of the data sources using Scala and Spark, applying the required transformations.
- Developed a script in Scala to read all the Parquet tables in a database and parse them as JSON files, and another script to parse them as structured tables in Hive (see the sketch below).
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive; loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Responsible for building scalable distributed data solutions using Hadoop.
- Experience in scripting for automation and monitoring using Python.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Created a Google Cloud Storage bucket to hold data inputs, data outputs, and logs.
- Experience implementing Spark applications on GCP Dataproc.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop, using Spark Context, Spark SQL, DataFrames, Pair RDDs, and YARN.
- Manipulated data according to business requirements using Bigtable.
- Increased scalability and decreased query latency by utilizing Bigtable.
- Experience migrating data from HBase to Bigtable.
- Performed different kinds of ingestion of Stock Plan, Finwell, and Retirement data into HDFS by leveraging the existing Data Ingestion Accelerator framework, which used Type-2 SCD; enriched the ingestion process by importing the data through PySpark into the Hadoop cluster.
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
- Migrated the required data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.
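The Parquet-to-JSON conversion described above was implemented in Scala; the sketch below shows an equivalent approach in PySpark for readability, with a hypothetical database name and output path.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("parquet-tables-to-json")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical Hive database and HDFS export location.
    db_name = "staging"
    out_root = "/data/exports/json"

    # Export every table in the database as JSON files.
    for row in spark.sql(f"SHOW TABLES IN {db_name}").collect():
        table = row["tableName"]
        spark.table(f"{db_name}.{table}") \
             .write.mode("overwrite") \
             .json(f"{out_root}/{table}")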
Data Engineer | Atria Convergence Technologies | May 2014 - July 2016
Technologies used: Alteryx, Tableau, SQL, ETL processes, UNIX, PowerCenter, KPIs
Specific Duties, Activities, and Responsibilities:
- Automated workflows using Alteryx to send emails to stakeholders.
- Created rich Tableau dashboards and prepared user stories to build compelling dashboards that deliver actionable insights.
- Wrote complex SQL queries to validate data against different kinds of reports.
- Worked with Excel pivot tables.
- Designed, developed, and modified stored procedures, packages, and functions for derived columns in table transformations.
- Worked on user-defined data types, nested tables, and collections.
- Expertise in regular expressions, hierarchical SQL functions, and SQL modeling.
- Performed data management projects and fulfilled ad hoc requests according to user specifications using data management software and tools such as Excel and SQL.
- Extensively worked on SQL, creating joins among huge tables to map columns across multiple tables.
- Involved in extensive data validation by writing several complex SQL queries; involved in back-end testing and worked on data quality issues.
- Performed database design and development, including tables, primary and foreign keys, indexes, and stored procedures.
- Generated SQL and PL/SQL scripts to install, create, and drop database objects, including tables, views, primary keys, indexes, constraints, packages, sequences, grants, and synonyms.
- Maintained a series of UNIX shell scripts to manage batch order processing.
- Monitored database performance and identified bottlenecks hindering database performance.
- Tuned SQL statements using hints for maximum efficiency and performance, created and maintained/modified PL/SQL packages, mentored others in the creation of complex SQL statements, performed data modeling, and created, maintained, and modified complex database triggers and data migration scripts.
- Implemented database changes according to new requirements (new tables in the existing database and new fields in existing tables).
- Constructed and implemented multiple-table links requiring complex join statements, including outer joins and self joins.
- Developed shell scripts for job automation and daily backups.
- Created, debugged, and modified stored procedures, triggers, tables, and views.

Education:
Bachelor's in Computer Science and Technology, SRM University, Chennai, India
Master's in Information Technology Management, Lindenwood University, MO, USA

Certification:
AWS Certified Developer - Associate (Amazon Web Services) - Issued Sep 2022