
Name: Harika Pasupuleti
Role: Senior Data Engineer/Big Data Engineer
Email: [email protected]
Mobile: +1 (469) 907 4576
Location: Indianapolis, Indiana, USA
Relocation: Yes
Visa: GC
PROFESSIONAL SUMMARY:
Over 10 years of experience as a Sr. Data Engineer, Hadoop Developer, and Big Data Engineer, covering big data and Hadoop technologies, Spark, Scala, Python, machine learning algorithms, deployment, and data pipeline design, development, and implementation.
Directed a team of 8 data engineers in a high-impact data engineering project, ensuring successful delivery from inception to completion.
Strong experience in Big Data Analytics using Hadoop HDFS, Hadoop YARN, Hadoop MapReduce, Hadoop Hive, Hadoop Impala, Hadoop Pig, Hadoop Sqoop, Hadoop HBase, Spark, Spark SQL, Spark Streaming, Apache Flume, Hadoop Oozie, Zookeeper, and Hue.
Good exposure working with Hadoop distributions such as Cloudera and Hortonworks, as well as the Databricks platform with Delta Lake.
Extensive knowledge in writing Hadoop jobs for data analysis per business requirements using Hadoop Hive; wrote HiveQL queries for data extraction and join operations, developed custom UDFs as required, and optimized Hive queries.
Designed and implemented real-time data integration solutions using technologies like change data capture (CDC), event streaming, and real-time data processing frameworks.
Good working experience on Spark (Spark Streaming, Spark SQL) with Scala and Apache Kafka. Worked on reading multiple data formats on Hadoop HDFS using Scala.
Contributed significantly to the architecture and implementation of multi-tier applications, leveraging an extensive use of AWS services including AWS EC2, Route53, AWS S3, AWS Lambda, AWS CloudWatch, AWS RDS, AWS DynamoDB, AWS SNS, AWS SQS, and IAM.
Demonstrated strong knowledge and experience working with the Snowflake platform for data warehousing and analytics.
Designed and implemented data lake architectures on AWS S3, leveraging partitioning and columnar formats such as Avro, Parquet, and ORC to optimize query performance and minimize storage costs.
Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Good understanding of data modeling (dimensional and relational) concepts such as Star Schema modeling, Snowflake Schema modeling, and Fact and Dimension tables.
Expertise in real-time data processing technologies such as Apache Storm, Apache Kafka, Apache Flink, Spark Streaming, and Apache Flume to enable real-time analytics and integration with diverse data sources.
Developed automated data pipelines within Data Vault architectures using tools like Apache Airflow or custom scripts, reducing manual intervention and improving data processing efficiency.
Skilled in using ETL/ELT tools like AWS Glue, Ab Initio, Talend, Informatica PowerCenter, Apache Flink, Apache Kafka, Apache NiFi, and Apache Airflow.
Good knowledge in OLAP, OLTP, Business Intelligence, and Data Warehousing concepts with emphasis on ETL and Business Reporting needs.
Developed and optimized complex ETL pipelines using Apache NiFi and Apache Airflow, ensuring efficient and reliable data ingestion, transformation, and loading processes (a brief Airflow DAG sketch follows this summary).
Implemented data quality and validation processes using Python and SQL to ensure the accuracy and consistency of data across various sources and destinations.
Hands-on expertise in using cloud-based data warehousing solutions such as Hadoop Hive, Teradata, Azure Synapse Analytics, and Snowflake to manage and scale large datasets effectively.
Deep knowledge of scripting languages including Python, Bash, JavaScript, PowerShell, Ruby, Perl, Go, and Groovy, along with data/configuration formats such as JSON and YAML.
Proficient in designing and developing interactive dashboards and reports using Power BI Desktop, Power BI Service, and Power BI Embedded.
Expertly orchestrated applications on Azure, utilizing services such as Azure Kubernetes Service (AKS), Azure Functions, and Azure App Services.
Skilled in using issue tracking and project management tools like Azure DevOps, JIRA, and ServiceNow to track and manage issues related to the SDLC, ensuring that bugs are identified, reported, tracked, and resolved promptly and effectively.
Strong expertise in CI/CD (Continuous Integration/Continuous Deployment) practices and tools, including Jenkins, Maven, and Kubernetes.
Designed and maintained scalable data architectures, ensuring high availability and performance of data systems using technologies such as Hadoop HDFS, Hadoop HBase, and Elasticsearch.
Experienced in developing real-time data processing solutions using PySpark, ensuring efficient data processing and analytics.
Proficient in working with PostgreSQL, MySQL, Microsoft SQL Server, and Oracle databases for data storage and management.
Implemented NoSQL solutions using Cassandra and MongoDB for handling large volumes of unstructured data.
Utilized Docker and Kubernetes for containerization and orchestration of data engineering applications, ensuring seamless deployment and scalability.
Worked extensively with AWS Redshift, Azure Synapse Analytics, and GCP BigQuery for cloud-based data warehousing and analytics.
Expert in data serialization formats such as Avro, Parquet, and ORC for efficient data storage and retrieval.
Implemented data encryption and security measures to ensure compliance with GDPR and CCPA regulations.
Skilled in using data governance and data quality tools to maintain data integrity and compliance across the organization.
Developed and maintained data pipelines using Luigi and Dagster for robust and scalable data workflows.
Utilized Terraform for infrastructure as code (IaC) to automate the provisioning and management of cloud resources.
Experience in data migration projects involving data extraction, transformation, and loading (ETL) across different platforms and environments.
Strong knowledge of agile and scrum methodologies for effective project management and delivery of data engineering projects.
Proficient in using GIT and SVN for version control and collaborative development.
Experienced in developing and deploying data applications on Google Cloud Platform (GCP) using services like GCP Cloud Dataflow, Google Cloud Storage, and Google Cloud Pub/Sub.
Implemented data deduplication, cleansing, and profiling techniques to maintain high-quality, accurate data for analytics and reporting.
Developed ETL processes using SSIS (SQL Server Integration Services) for efficient data transformation and loading. Utilized Tableau for creating interactive and insightful data visualizations. Implemented Azure Stream Analytics for real-time data processing and analytics.
Skilled in designing and managing Azure Cosmos DB for globally distributed, multi-model databases. Proficient in using GCP Cloud Pub/Sub for building robust data pipelines.
Implemented Azure Functions for serverless computing and event-driven data processing. Skilled in using AWS Athena for interactive query services and analysis.
Developed data pipeline optimization techniques to enhance performance and efficiency. Utilized data serialization formats like JSON and XML for data interchange and storage.
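
A minimal sketch of the Airflow orchestration pattern referenced in the summary above, assuming Apache Airflow 2.x; the DAG id, schedule, and task callables are illustrative placeholders rather than pipelines from any specific engagement.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw records from a source system.
    print("extracting")


def transform():
    # Placeholder: cleanse and reshape the extracted records.
    print("transforming")


def load():
    # Placeholder: write curated records to the warehouse.
    print("loading")


with DAG(
    dag_id="example_etl",              # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Ingest -> transform -> load, executed in sequence.
    t_extract >> t_transform >> t_load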

EDUCATION:
Bachelor of Technology in Computer Science and Engineering from Bharath University, Chennai 2013

SKILLS:
Big Data Ecosystem
Hadoop: HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Oozie, Zookeeper, Hue; Spark: Spark SQL, Spark Streaming, PySpark; Apache Flume, Apache Storm, Apache Kafka, Apache Flink

Cloud Platforms
AWS: AWS EC2, AWS Route53, AWS S3, AWS Lambda, AWS CloudWatch, AWS RDS, AWS DynamoDB, AWS SNS, AWS SQS, AWS IAM, AWS Redshift, AWS Athena, AWS EMR, AWS Glue; Azure: Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, Azure SQL Data Warehouse, Azure Kubernetes Service (AKS), Azure Functions, Azure App Services, Azure Synapse Analytics, Azure Stream Analytics, Azure Cosmos DB; GCP: GCP Cloud Dataflow, GCP Cloud Storage, GCP Cloud Pub/Sub, GCP BigQuery

ETL
AWS Glue, Ab Initio, Talend, Informatica PowerCenter, Apache Flink, Apache Kafka, Apache NiFi, Apache Airflow, SSIS, Luigi, Dagster
Data Warehousing
Snowflake, Teradata, Hadoop Hive, Azure Synapse Analytics, AWS Redshift

Databases
PostgreSQL, MySQL, Microsoft SQL Server, Oracle, Cassandra, MongoDB
Programming/Scripting
Python, Scala, Bash, JavaScript, PowerShell, Ruby, Perl, Go, JSON, YAML, Groovy
Methodologies
Agile and Scrum methodologies
Data Governance & Security
Data Encryption, GDPR, CCPA Compliance, Data Governance, Data Quality, Data Profiling, Data Deduplication
DevOps
CI/CD (Jenkins, Maven), Docker, Kubernetes, Azure DevOps, JIRA, ServiceNow

PROFESSIONAL EXPERIENCE:

AMD, Austin, TX
Senior Data Engineer/Cloud Data Engineer Jun 2021 to Present

Responsibilities:
Designed and implemented data pipelines on Azure using tools like Azure Data Factory for ETL processes, leveraging Azure Data Lake Storage and Azure SQL Database for scalable and cost-effective data solutions.
Designed and implemented distributed data processing workflows using Apache Hadoop (HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie), Apache Spark (Core, SQL, Streaming, PySpark), and Apache Kafka to handle large-scale data sets.
Optimized Spark programs for performance by leveraging efficient coding practices, indexing, and appropriate use of Spark options, ensuring high performance and scalability.
Automated and scheduled Databricks jobs using Azure Data Factory and integrated Delta Lake tables to ensure seamless data pipeline orchestration, data consistency, and incremental updates.
Implemented data lake solutions using Azure Data Lake Storage and PySpark, adhering to Medallion architecture best practices to ensure data quality and governance. Configured Change Data Capture (CDC) to track and process changes in real time, updating Delta Lake tables accordingly (see the merge sketch at the end of this section).
Handled streaming data using Spark Streaming and Spark Structured Streaming APIs to enable real-time analytics and processing.
Performed data analysis, data migration, data cleansing, transformation, integration, data import, and data export using Python, SQL, and Azure Data Factory.
Implemented Snowflake data warehouses and data lakes, configuring performance settings and access controls, and enhanced query performance by 35% through efficient indexing and partitioning strategies.
Hands-on experience in Azure Cloud Services (PaaS & IaaS), including Azure Synapse Analytics, SQL Azure, Azure Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake. Worked on creating tabular models on Azure Analysis Services and building data warehouses in Azure Synapse Analytics for scalable storage and high-performance analytics.
Developed and managed ETL processes in Snowflake and SSIS (SQL Server Integration Services), ensuring efficient data transformation, loading, and integration from various data sources.
Designed and implemented data integration pipelines in Azure Synapse, combining data from various sources such as Azure Blob Storage, Azure SQL Database, and external sources.
Orchestrated complex infrastructure deployments using Terraform, including multi-tier applications, networking configurations, and security policies, integrating monitoring and logging tools like Prometheus, Grafana, and ELK Stack.
Developed Spark DataFrame optimization techniques, such as predicate pushdown, column pruning, and vectorized execution, improving query performance and resource utilization, resulting in a 20% cost reduction.
Conducted ad-hoc data analysis and exploratory data analysis (EDA) in Azure Synapse Studio to discover trends, patterns, and anomalies in large datasets.
Integrated APIs and web services to extract data from external systems and applications, leveraging RESTful APIs, SOAP APIs, and other integration protocols.
Implemented data transformations and manipulations within T-SQL scripts for ETL processes, and handled real-time analytics using Azure Stream Analytics.
Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP systems, and Fact Tables and Dimension Tables, employing tools like Erwin.
Developed data deduplication and cleansing processes using Python and Azure Data Factory, maintaining data integrity and performing data validation.
Implemented Delta Lake for data versioning, ACID transactions, and schema enforcement, ensuring data quality and reliability.
Utilized Apache Airflow to track data lineage, providing transparency and traceability for data transformations and workflows.
Built and optimized data warehouses in Azure Synapse Analytics for scalable storage and high-performance analytics. Implemented data processing workflows and pipelines using Databricks, Azure Data Factory, and Azure Data Lake Storage (ADLS), ensuring efficient data ingestion, transformation, and storage.
Implemented data quality checks and monitoring mechanisms within Databricks and PySpark pipelines to detect and address issues proactively.
Designed and deployed APIs using Azure API Management to expose and manage data services securely.
Utilized Docker and Kubernetes for containerized deployment of Apache Airflow, enhancing scalability and isolation. Orchestrated continuous integration and deployment (CI/CD) pipelines with Jenkins for streamlined development workflows.
Developed Key Performance Indicator (KPI) dashboards using Power BI and other advanced visualization tools to provide actionable insights. Practiced agile development methodologies such as Scrum, XP, and Kanban, participating in sprint planning, daily stand-ups, and retrospectives to deliver iterative and incremental software solutions. Integrated KPI dashboards with data sources to enable monitoring and analysis of critical business metrics.
Implemented data governance and security measures using Azure Data Lake, ensuring compliance with GDPR and CCPA. Designed and optimized data serialization formats using Avro, Parquet, ORC, JSON, and XML to improve data processing and storage efficiency.
Utilized GIT and SVN for version control and collaborative development in data engineering projects.
Developed data profiling and validation scripts using Python to ensure data quality and accuracy.
Implemented NoSQL databases like MongoDB and HBase to handle large volumes of unstructured data efficiently.
Designed data ingestion and integration pipelines using Azure Data Factory to streamline data flow from various sources.
Created OLAP cubes and fact tables for data warehousing solutions to support business intelligence requirements. Integrated data processing and transformation pipelines using Azure Functions for serverless computing capabilities.
Utilized Java for developing data processing applications and integrating with the Hadoop ecosystem and Apache Spark. Implemented data storage solutions using HDFS (Hadoop Distributed File System) for scalable and reliable data storage.
Developed and managed data processing workflows using Hadoop MapReduce for large-scale data processing tasks. Configured and managed Hadoop YARN for resource management and job scheduling in a Hadoop cluster.
Utilized Hadoop Hive for data warehousing solutions and running SQL-like queries on large datasets stored in Hadoop. Handled data transformation tasks using Hadoop Pig for processing and analyzing large data sets in a Hadoop environment.
Utilized Hadoop Sqoop for transferring data between Hadoop and relational databases like MySQL and Oracle.
Automated data workflows using Hadoop Oozie for scheduling and managing Hadoop jobs.
Leveraged Spark Core for the foundation of Apache Spark applications to ensure high performance and scalability.
Managed data storage and retrieval tasks using Microsoft SQL Server and Oracle databases for various data engineering projects. Utilized Apache Flink for real-time data processing and stream analytics in large-scale data environments.
Designed and managed globally distributed, multi-model databases using Azure Cosmos DB for scalable data solutions.
Integrated and documented data engineering processes using Confluence for effective team collaboration and knowledge sharing.
Implemented data encryption and security measures to ensure compliance with GDPR and CCPA regulations.
Developed and managed ETL processes using SSIS (SQL Server Integration Services) for efficient data transformation and loading tasks.
Utilized Ansible for configuring Azure Virtual Machines and managing infrastructure as code.
Worked with PowerShell and UNIX scripts for file transfer, emailing, and other file-related tasks.
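
A minimal sketch of the CDC-to-Delta-Lake merge pattern described above, assuming a Databricks/PySpark environment with the delta-spark package; the storage paths, key column (customer_id), and change-operation column (op) are hypothetical placeholders, not project specifics.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-delta-merge").getOrCreate()

# Batch of captured changes landed by the CDC feed (path is illustrative).
changes = spark.read.format("parquet").load("/mnt/raw/customer_changes")

# Target Delta table in the curated layer (path is illustrative).
target = DeltaTable.forPath(spark, "/mnt/silver/customers")

(
    target.alias("t")
    .merge(changes.alias("c"), "t.customer_id = c.customer_id")
    .whenMatchedDelete(condition="c.op = 'D'")        # rows deleted at the source
    .whenMatchedUpdateAll(condition="c.op = 'U'")     # updates to existing rows
    .whenNotMatchedInsertAll(condition="c.op = 'I'")  # newly created rows
    .execute()
)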

Environment: Data Engineer, Azure Data Factory, Azure Data Lake Storage, Azure SQL Database, Apache Hadoop, HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie, Apache Spark, Spark Core, Spark SQL, Spark Streaming, PySpark, Apache Kafka, Databricks, Delta Lake, Change Data Capture (CDC), Spark Structured Streaming APIs, Python, SQL, Snowflake, Azure Synapse Analytics, SQL Azure, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, Azure Blob Storage, Azure API Management, Terraform, Prometheus, Grafana, ELK Stack, Azure Stream Analytics, T-SQL, Erwin, Apache Airflow, Docker, Kubernetes, Jenkins, Power BI, Avro, Parquet, ORC, JSON, XML, GIT, SVN, MongoDB, HBase, Azure Functions, OLAP cubes, fact tables, OLTP, Java, Apache Flink, Microsoft SQL Server, Confluence, Ansible, PowerShell, UNIX scripts

CG Infinity, Plano, TX
Data Engineer/ AWS Data Engineer Dec 2018 to May 2021

Responsibilities:
Implemented Spark streaming jobs to continuously retrieve data from Apache Kafka, subsequently storing the streaming information into Hadoop HDFS.
Used the Spark SQL Scala and Python interfaces, optimizing the conversion of RDDs of case classes to schema RDDs for improved performance and ease of data manipulation.
Ingested data from diverse sources such as Hadoop HDFS/HBase into Spark RDD, leveraging PySpark for computational tasks and generating output responses.
Implemented data ingestion pipelines using AWS Glue and Apache NiFi for efficient and scalable data integration from diverse sources into AWS S3. Designed and optimized data transformation workflows using Apache Spark and AWS EMR, focusing on data cleansing, aggregation, and enrichment.
Developed real-time data processing solutions using Apache Flink and Apache Kafka, ensuring low-latency data processing and integration. Created and managed data lakes on AWS S3 using AWS Lake Formation, ensuring proper data governance and access control policies.
Implemented ETL pipelines using AWS Glue, integrating with AWS Redshift for data warehousing and analytical processing. Designed and implemented data partitioning and bucketing strategies in Snowflake to improve query performance and data retrieval efficiency.
Developed scalable and reliable data pipelines using Databricks Delta Lake and Apache Airflow for orchestrating complex workflows. Utilized AWS SageMaker for building, training, and deploying machine learning models, integrating them into data pipelines for predictive analytics.
Automated data extraction, transformation, and loading (ETL) processes using AWS Step Functions and AWS Lambda for serverless data processing. Integrated third-party tools and APIs with Snowflake for seamless data integration and synchronization.
Integrated Redis with data processing pipelines and streaming frameworks (e.g., Apache Kafka, Apache Flink) for real-time data ingestion and processing. Automated data delivery workflows using tools like Apache Airflow and Databricks jobs to ensure timely and consistent data distribution.
Conducted in-depth assessments of current data lake and Databricks configurations to identify bottlenecks and inefficiencies. Implemented Terraform scripts to optimize infrastructure costs by rightsizing resources, leveraging spot instances, and implementing auto-scaling policies.
Developed Spark scripts and UDFs, employing both Spark DSL and Spark SQL queries for tasks like data aggregation, querying, and writing data back into RDBMS through Hadoop Sqoop. Integrated APIs with backend services and data sources for seamless data access and interaction.
Utilized partitioning and bucketing strategies to enhance data retrieval efficiency and reduce I/O operations. Constructed multiple Hadoop MapReduce Jobs using the Java API, along with Hadoop Pig, for data extraction, transformation, and aggregation from various file formats, including Parquet, Avro, XML, JSON, CSV, ORC, and others.
Worked with binary and textual data formats in Spark, such as CSV, JSON, and XML, and their serialization and deserialization using Spark DataFrames and RDDs. Configured and managed Apache Airflow pools to control the concurrency of specific tasks, optimizing resource allocation and preventing overloading.
Managed and optimized AWS Redshift clusters for parallel data processing, query optimization, and workload management. Designed and created dashboards and reports using tools like Tableau and Power BI, aiding patients in comprehending their migraine data and model-generated insights.
Validated the performance of machine learning models using appropriate evaluation metrics. Used Terraform for provisioning AWS resources and AWS CloudFormation for defining and deploying infrastructure.
Ensured the accuracy and reliability of the model through continuous testing and refinement. Created Hadoop Oozie Workflow to automate data loading into the Hadoop Distributed File System and Hadoop Pig to pre-process the data and to apply complex transformation.
Integrated Terraform with various providers to manage heterogeneous cloud environments. Improved data processing speed by 30% through optimized ETL pipelines in Snowflake.
Designed and managed complex task dependencies within Apache Airflow DAGs to ensure tasks execute in the correct sequence and handle data correctly. Designed and executed data pipelines for batch and real-time data processing using AWS Redshift and Spark pools.
Optimized data integration workflows and ETL processes for performance, scalability, and reliability, leveraging parallel processing techniques and distributed computing architectures. Utilized Snowflake features such as clustering, partitioning, and materialized views to optimize query performance.
Transformed the data using AWS Glue dynamic frames with PySpark; cataloged the transformed data using Crawlers and scheduled the job and crawler using the Glue workflow feature (see the sketch at the end of this section). Automated data pipeline orchestration and scheduling using AWS Step Functions and AWS Glue, reducing manual intervention and improving operational efficiency.
Developed dashboards and visualizations to help business users analyze data as well as provide data insight to upper management with a focus on AWS services like QuickSight. Integrated Jenkins pipelines with Kubernetes and Docker to automate the deployment of data pipelines and data-driven applications to Kubernetes clusters.
Integrated Terraform with CI/CD pipelines (e.g., Jenkins, GitLab CI/CD, GitHub Actions) for automated infrastructure deployment and updates. Implemented SQL-based data processing and data integration tasks across various systems, ensuring seamless data flow and consistency.
Utilized NoSQL databases like Cassandra for efficient storage and retrieval of unstructured data. Implemented data governance and security measures to ensure compliance with GDPR and CCPA regulations, utilizing AWS S3 for secure and scalable data storage.
Developed data deduplication and cleansing processes using Python and AWS Glue to maintain data integrity and accuracy. Ensured data profiling and validation using tools like Informatica to guarantee data quality and reliability.
Integrated Confluence for effective team collaboration and knowledge sharing across data engineering processes. Implemented data serialization formats such as Avro, Parquet, ORC, JSON, and XML to improve data processing and storage efficiency.
Configured and managed SVN for version control in collaborative data engineering projects. Developed real-time data processing workflows using AWS Lambda and AWS EMR, ensuring timely and efficient data analysis.
Utilized AWS Athena for interactive query services and analysis on large datasets stored in AWS S3. Implemented data modeling techniques such as Star Schema and Snowflake Schema to support OLAP and business intelligence requirements.
Employed fact tables and dimension tables for effective data warehousing and analytics. Utilized CI/CD pipelines with Jenkins and GitHub Actions for automated deployment and updates.
Implemented Scrum and Agile methodologies for efficient project management and delivery of data engineering projects. Configured JIRA for tracking and managing issues related to the SDLC.
Developed scripts using GIT for version control and collaboration in data engineering projects. Implemented data extraction, transformation, and loading (ETL) processes for efficient data integration and processing workflows.
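
A minimal sketch of the AWS Glue dynamic-frame transformation flow referenced above, assuming the standard Glue PySpark job runtime; the Glue database, table name, column mappings, and S3 path are hypothetical placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table registered by a Glue crawler (database/table are illustrative).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders")

# Rename and cast columns before writing to the curated zone.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_ts", "string", "order_ts", "timestamp"),
        ("amount", "double", "amount", "double"),
    ])

# Write Parquet back to S3; a separate crawler can then catalog this path.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-curated/orders/"},
    format="parquet")

job.commit()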

Environment: Data Engineer, PySpark, Apache Kafka, Hadoop HDFS, Spark Streaming, Spark SQL, Scala, Python, Spark RDD, AWS Glue, Apache NiFi, AWS S3, Apache Spark, AWS EMR, Apache Flink, AWS Lake Formation, AWS Redshift, Snowflake, Databricks Delta Lake, Apache Airflow, AWS SageMaker, AWS Step Functions, AWS Lambda, Redis, Databricks jobs, Terraform, Spark DSL, Hadoop Sqoop, APIs, Spark DataFrames, Apache Airflow pools, Tableau, Power BI, AWS CloudFormation, Hadoop Oozie, Hadoop Pig, AWS QuickSight, Jenkins, Kubernetes, Docker, GitLab CI/CD, GitHub Actions, Cassandra, GDPR, CCPA, Informatica, Confluence, Avro, Parquet, ORC, JSON, XML, SVN, AWS Athena, Star Schema, Snowflake Schema, Fact tables, Dimension tables, Agile methodologies, Scrum, JIRA, GIT

Fiserv, Dallas, TX
Big Data Engineer Oct 2015 to Nov 2018

Responsibilities:
Led an end-to-end ETL (Extract, Transform, Load) solution, using scheduled Apache NiFi jobs to ingest streaming data into the Apache Kafka messaging system and handling data extraction, transformation, and loading from diverse sources.
Involved in creating Apache NiFi flows to deliver processed data to downstream systems, ensuring reliable data integration and ingestion.
Created and managed Hadoop Hive tables, including managed, external, and partitioned tables, ensuring proper Data Governance.
Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg, embedPy, NumPy, and Beautiful Soup for data processing and analysis.
Conducted performance tuning and optimization of Spark jobs in Databricks, leveraging caching, partitioning, and efficient data formats like Parquet and ORC.
Converted Hadoop Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python.
Developed Spark scripts for Data Analysis in both Python and Scala, utilizing Spark Core and Spark SQL for various transformations.
Developed a reusable framework to automate ETL from RDBMS systems to the Data Lake using Spark Data Sources and Hadoop Hive data objects.
Utilized Azure Synapse Data Catalog for metadata management, data lineage tracking, and Data Discovery to enhance Data Governance and collaboration.
Integrated Azure Synapse Analytics with Azure Data Lake Storage Gen2, enabling seamless Data Storage, management, and analytics across structured and unstructured data.
Developed complex Data Transformation logic using T-SQL, Spark SQL, and Azure Synapse Data Flows to cleanse, enrich, and aggregate data for analytics and reporting.
Built on-premises data pipelines using Apache Kafka and Spark for real-time data analysis, leveraging Spark Streaming for low-latency processing (a streaming ingestion sketch follows this section).
Built and deployed Machine Learning models using Spark MLlib and Spark ML in Databricks for predictive analytics and insights generation.
Utilized advanced MongoDB features such as indexes, aggregation framework, and Data Validation to enforce schema rules and improve query performance.
Utilized Oracle's performance tuning tools and techniques to monitor and enhance Database Performance, addressing issues like slow queries and inefficient indexing.
Integrated Apache Spark for parallel processing, resulting in a 10% enhancement in ETL performance and reduction in resource utilization using batch processing jobs, leveraging Spark SQL and Spark DataFrames.
Processed web URL data using Scala and converted it to Spark DataFrames for further transformations including flattening, joins, and aggregations. Queried data using Spark SQL on top of the Spark engine for faster datasets processing.
Integrated external data sources with Azure Synapse Analytics, including on-premises data sources, cloud data platforms, and third-party APIs, for comprehensive Data Analysis.
Implemented and maintained continuous integration/continuous deployment (CI/CD) pipelines using GitLab, automating testing, and deployment processes to enhance development efficiency and product reliability.
Implemented real-time alerting systems using Apache Kafka and Spark to notify relevant stakeholders of critical incidents.
Developed interactive dashboards and reports using Tableau for business users and executives. Created custom visualizations to monitor inventory levels, sales trends, and customer feedback.
Integrated Azure Synapse Analytics with Azure Data Factory to create seamless data workflows for ETL processes, ensuring efficient data ingestion, transformation, and loading.
Utilized Azure Cosmos DB for efficient NoSQL data storage and real-time data retrieval, enhancing Data Processing capabilities.
Managed and deployed microservices using Docker and Kubernetes for scalable and efficient application management, ensuring high availability and performance.
Employed Terraform for infrastructure as code, automating the provisioning of resources in the cloud environment, ensuring consistency and scalability.
Ensured Data Security and compliance with GDPR and CCPA regulations, implementing robust Data Encryption methods and Data Governance practices.
Configured and managed version control using GIT and SVN for collaborative development, ensuring code integrity and facilitating teamwork.
Utilized Azure Stream Analytics for real-time data processing and analytics, integrating with various data sources for comprehensive insights.
Developed Java-based data processing applications for robust and scalable data workflows.
Implemented and managed data storage using Hadoop Distributed File System (HDFS) for reliable and efficient data access.
Created Hadoop MapReduce jobs for large-scale data processing and analysis tasks.
Utilized Hadoop YARN for efficient resource management and job scheduling in a distributed environment.
Processed and analyzed large datasets using Hadoop Pig for ETL operations. Integrated and managed Hadoop HBase for scalable and real-time data storage and retrieval.
Employed Hadoop Sqoop for seamless data transfer between Hadoop and relational databases. Created and managed workflows using Hadoop Oozie for automated data processing pipelines.
Implemented Spark Streaming for real-time data processing and analytics. Managed PostgreSQL databases for structured data storage and retrieval.
Leveraged Dagster for orchestrating complex data workflows and ensuring data quality. Utilized Apache Flink for real-time stream processing and data analytics.
Designed data models using Star Schema for efficient data warehousing and querying. Developed Snowflake Schema models to enhance data organization and query performance.
Employed OLAP techniques for multidimensional data analysis and reporting. Created and managed Fact Tables and Dimension Tables for effective data warehousing.
Utilized Avro for data serialization and deserialization in data processing workflows. Processed JSON data formats for efficient data exchange and storage.
Managed XML data formats for structured data representation and integration.
Employed Azure DevOps for continuous integration and continuous deployment of data engineering projects. Configured and managed JIRA for effective project management and issue tracking.
Implemented Jenkins for automated build and deployment pipelines in data workflows. Practiced Scrum and Agile methodologies for effective project management and delivery.
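
A minimal sketch of the Kafka-to-Spark streaming ingestion described above, using Spark Structured Streaming and assuming the spark-sql-kafka connector is on the classpath; the broker address, topic name, record schema, and HDFS paths are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Schema of the JSON payload carried in the Kafka message value (illustrative).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Subscribe to the topic (broker and topic names are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "transactions")
       .load())

# Parse the message value from bytes to typed columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Land parsed records on HDFS as Parquet with checkpointing for exactly-once sinks.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/transactions")
         .option("checkpointLocation", "hdfs:///checkpoints/transactions")
         .outputMode("append")
         .start())

query.awaitTermination()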

Environment: Data Engineer, Apache NiFi, Apache Kafka, Hadoop HDFS, Data Governance, Python libraries, Databricks, Parquet, ORC, Spark DataFrames, Scala, Python, Spark SQL, RDBMS, Data Lake, Azure Synapse Data Catalog, Azure Synapse Analytics, Azure Data Lake Storage Gen2, T-SQL, Azure Synapse Data Flows, Spark Streaming, Spark MLlib, Spark ML, MongoDB, Oracle, Spark Core, Spark DSL, APIs, Spark RDDs, Apache Airflow, Tableau, Power BI, Terraform, Hadoop Oozie, Hadoop Pig, Jenkins, Kubernetes, Docker, GitLab CI/CD, GitHub Actions, Cassandra, GDPR, CCPA, Informatica, Confluence, Avro, JSON, XML, SVN, Star Schema, Snowflake Schema, Fact tables, Dimension tables, Agile methodologies, Scrum, JIRA, GIT, Azure Cosmos DB, Azure Stream Analytics, Java, Hadoop MapReduce, Hadoop YARN, Hadoop HBase, Hadoop Sqoop, PostgreSQL, Dagster, Apache Flink, Azure Functions, Azure DevOps

Infosys, Bangalore, India
Hadoop Data Engineer Feb 2013 to Dec 2014

Responsibilities:
Worked on development of data ingestion pipelines using ETL tools such as Talend and bash scripting, along with big data technologies including Hadoop Hive, Impala, and Spark.
Utilized Talend for Big Data Integration, incorporating Spark and Hadoop technologies.
Set up and worked with Kerberos authentication principals to establish secure network communication on the cluster, and tested Hadoop HDFS, Hadoop Pig, and Hadoop MapReduce access to the cluster for new users.
Performed Data Validation and Data Cleansing of staged input records before loading them into the Data Warehouse.
Supported Data Quality management by implementing proper data quality checks in Data Pipelines.
Involved in development using the Cloudera distribution.
Installed Hadoop Oozie workflow engine to run multiple Hadoop Hive jobs.
Implemented Proof of Concept on Hadoop stack and different big data analytic tools, including Data Migration from different databases.
Built machine learning models to showcase big data capabilities using PySpark and Spark MLlib (a brief pipeline sketch follows this section).
Worked with multiple storage formats (Avro, Parquet) and databases (Impala, Kudu).
Designed and built Big Data ingestion and query platforms with Spark, Hadoop, Hadoop Oozie, Hadoop Sqoop, Presto, EMR, AWS S3, AWS EC2, AWS CloudFormation, AWS IAM, and Control-M.
Involved in working with messaging systems using message brokers such as Apache Kafka.
Involved in the development of Agile, iterative, and proven Data Modeling patterns that provide flexibility, including Star Schema and Snowflake Schema designs. Worked with utilities like TDCH to load data from Teradata into Hadoop.
Troubleshot user analysis bugs using JIRA and IRIS Ticket systems.
Worked with SCRUM team in delivering agreed user stories on time for every sprint. Used Jenkins for CI/CD, Docker as a container tool, and GIT as a version control tool.
Implemented UNIX scripts to define the use case workflow, process the data files, and automate the jobs.
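
A minimal sketch of the kind of PySpark/Spark MLlib proof-of-concept model referenced above; the feature columns, label column, and input path are hypothetical placeholders, not details of the actual engagement.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("mllib-poc").getOrCreate()

# Hypothetical feature table prepared by the ingestion pipeline.
df = spark.read.parquet("hdfs:///data/poc/features")

# Index the label and assemble numeric features into a single vector column.
label_idx = StringIndexer(inputCol="label", outputCol="label_idx")
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label_idx")

pipeline = Pipeline(stages=[label_idx, assembler, lr])
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Evaluate on the held-out split.
auc = BinaryClassificationEvaluator(labelCol="label_idx").evaluate(model.transform(test))
print(f"Validation AUC: {auc:.3f}")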

Environment: Data Engineer, Talend, Bash scripting, Hadoop Hive, Impala, Spark, Kerberos, Hadoop HDFS, Hadoop Pig, Hadoop MapReduce, Data Validation, Data Cleansing, Data Warehouse, Data Quality, Cloudera, Hadoop Oozie, Data Migration, PySpark, Spark MLlib, Avro, Parquet, Kudu, Presto, EMR, AWS S3, AWS EC2, AWS CloudFormation, AWS IAM, Control-M, Apache Kafka, Star Schema, Snowflake Schema, TDCH, JIRA, IRIS Ticket systems, SCRUM, Jenkins, Docker, GIT, UNIX scripts