Harika - Data Engineer
[email protected]
Location: Indianapolis, Indiana, USA
Relocation: Yes
Visa: GC
Name: Harika Pasupuleti
Email: [email protected]
Mobile: +1 469-902-9038

PROFESSIONAL SUMMARY:
- Over 10 years of experience as a Sr. Data Engineer and Hadoop developer spanning big data and Hadoop technologies, Spark, Scala, Python, machine learning algorithms, deployment, and data pipeline design, development, and implementation.
- Strong experience in big data analytics using HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
- Good exposure working with Hadoop distributions such as Cloudera, Hortonworks, and Databricks.
- Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive; worked on HiveQL queries for data extraction and join operations, wrote custom UDFs as required, and have good experience optimizing Hive queries.
- Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka; worked on reading multiple data formats from HDFS using Scala.
- Contributed significantly to the architecture and implementation of multi-tier applications, leveraging AWS services including EC2, Route 53, S3, Lambda, CloudWatch, RDS, DynamoDB, SNS, SQS, and IAM; specialized in optimizing for high availability, fault tolerance, and seamless auto-scaling through the strategic use of AWS CloudFormation.
- Designed and implemented data lake architectures on Amazon S3, leveraging partitioning and columnar formats such as Parquet to optimize query performance and minimize storage costs.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of data modeling concepts (dimensional and relational), including star-schema and snowflake-schema modeling and fact and dimension tables.
- Expertise in real-time data processing technologies such as Apache Storm, Kafka, Flink, Spark Streaming, and Flume to enable real-time analytics and integration with diverse data sources.
- Knowledgeable about testing data extraction and loading across database systems such as Oracle, SQL Server, PostgreSQL, Azure SQL, Hive, and MySQL.
- Skilled in using ETL/ELT tools such as AWS Glue, Ab Initio, Talend, Informatica PowerCenter, Apache Flink, Kafka, Apache NiFi, and Apache Airflow.
- Good knowledge of OLAP, OLTP, business intelligence, and data warehousing concepts, with emphasis on ETL and business reporting needs.
- Experience writing HTTP RESTful web services and SOAP APIs in Golang.
- Experience with Apache Iceberg, Avro, Impala, Parquet, and Griffin, leveraging these Apache technologies for efficient data storage, processing, and quality assurance, enabling seamless data management and analytics workflows.
- Hands-on expertise with cloud-based data warehousing solutions such as Hive, Teradata, Azure Synapse Analytics, and Snowflake to manage and scale large datasets effectively.
- Deep knowledge of scripting languages and formats including Python, Bash, JavaScript, PowerShell, Ruby, Perl, Go, JSON, YAML, and Groovy.
- Proficient in designing and developing interactive dashboards and reports using Power BI Desktop, Power BI Service, and Power BI Embedded.
- Skilled in building efficient data models in Power BI, using Power Query Editor to transform, clean, and aggregate data from various sources.
- Skilled in using issue tracking and project management tools such as Azure DevOps, Jira, and ServiceNow to track and manage issues throughout the SDLC, ensuring that bugs are identified, reported, tracked, and resolved promptly and effectively.
- Strong expertise in CI/CD (continuous integration/continuous deployment) practices and tools, including Jenkins, Maven, and Kubernetes.
- Strong knowledge of Agile and Waterfall methodologies and proficiency in the Git workflow.

EDUCATION:
Bachelor of Technology in Computer Science and Engineering, Bharath University, Chennai, India, 2013

SKILLS:
Big Data Ecosystem: Hadoop, Apache Spark, Spark Core, Spark SQL, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Apache Kafka, Zookeeper, YARN, and Spark MLlib.
Cloud Platforms: AWS, Azure, and Google Cloud.
ETL: ADF, Databricks, Tableau, Talend, Informatica PowerCenter, Glue, Apache Flink, Apache Kafka, Apache NiFi, and Apache Airflow.
Data Warehousing: Hive, Teradata, Snowflake, Amazon Redshift, and Azure Synapse Analytics.
Databases: MS SQL Server, MySQL, PL/SQL, Oracle, PostgreSQL, SQLite, Teradata, IBM Db2, Hive, MongoDB, DynamoDB, Cosmos DB, HBase, Redis, Neo4j, and Cassandra.
Programming/Scripting Languages: Python, PySpark, Spark SQL, Scala, Java, C, C#.NET, SQL, T-SQL, U-SQL, PL/SQL, Bash, YAML, JSON, PowerShell, Perl, Go, Ruby, Pig Latin, and HiveQL.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile.
DevOps: CI/CD, Jenkins, AWS ELK, Kubernetes, Splunk.
Visualization/Reporting: Tableau, ggplot2, matplotlib, QuickSight, SSRS, and Power BI.
Version Control: Git.
Operating Systems: Windows, Unix.

PROFESSIONAL EXPERIENCE:

AMD, Austin, TX
Senior Data Engineer, Jun 2021 to Present
Responsibilities:
- Designed and implemented data pipelines on Azure using tools such as Azure Data Factory for ETL processes.
- Leveraged Azure cloud services, such as Azure Data Lake Storage and Azure SQL Database, for scalable and cost-effective data solutions.
- Designed and implemented data lakes using Azure Data Lake Storage for scalable and cost-effective storage of structured and unstructured data.
- Handled streaming data using both the Spark Streaming and Spark Structured Streaming APIs.
- Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the data lake using Spark data sources and Hive data objects.
- Performed data analysis, data migration, data cleansing, transformation, integration, data import, and data export with Python.
- Hands-on experience with Azure cloud services (PaaS and IaaS): Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitor, Key Vault, and Azure Data Lake.
- Created tabular models on Azure Analysis Services to meet business reporting requirements.
- Have good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
- Extracted, transformed, and loaded data from source systems to Azure Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL DB, Azure SQL DW) and processed the data in Azure Databricks (see the illustrative sketch below).
- Applied expertise in NoSQL databases such as MongoDB and Cassandra for handling unstructured and semi-structured data.
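For illustration only, a minimal PySpark sketch of the ingestion pattern referenced above: reading raw files from Azure Data Lake Storage into Databricks and writing them out as a Delta table. The storage account, container, paths, and table names below are hypothetical.

    # Hypothetical ADLS-to-Delta ingestion step in Azure Databricks.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("adls_ingestion").getOrCreate()

    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("abfss://raw@examplestorage.dfs.core.windows.net/sales/"))  # hypothetical path

    cleaned = raw.dropDuplicates().withColumn("load_date", F.current_date())

    (cleaned.write
     .format("delta")                # Delta Lake provides ACID transactions and schema enforcement
     .mode("append")
     .saveAsTable("curated.sales"))  # hypothetical schema.table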
- Applied Spark DataFrame optimization techniques such as predicate pushdown, column pruning, and vectorized execution, and assessed their impact on query performance and resource utilization, which helped reduce project cost by about 20%.
- Utilized the Kafka REST API to facilitate seamless communication and data streaming within distributed systems.
- Implemented data processing workflows using Apache Airflow, enhancing automation and scheduling capabilities.
- Implemented data transformations and manipulations within T-SQL scripts for ETL processes.
- Worked on dimensional and relational data modeling using star and snowflake schemas for OLTP/OLAP systems, including conceptual, logical, and physical data modeling with Erwin.
- Worked with PowerShell and UNIX scripts for file transfer, emailing, and other file-related tasks; concurrently used Ansible to configure Azure Virtual Machines.
- Implemented Delta Lake for data versioning, ACID transactions, and schema enforcement, ensuring data quality and reliability.
- Implemented container orchestration using Kubernetes for managing and scaling containerized applications.
- Orchestrated continuous integration and deployment (CI/CD) pipelines with Jenkins for streamlined development workflows.
- Utilized Docker for containerization, ensuring consistent deployment across different environments.
- Used ETL testing tools and frameworks such as QuerySurge, Talend Data Quality, and Informatica Data Validation Option.
- Developed key performance indicator (KPI) dashboards using advanced visualization tools to provide actionable insights.
- Integrated KPI dashboards with data sources to enable monitoring and analysis of critical business metrics.
- Collaborated with cross-functional teams in an Agile environment, participating in sprint planning and daily stand-ups.
Environment: Azure Cloud, Hadoop, HDFS, Hive, SQL Server, MongoDB, SSIS, Informatica, Python, PySpark, SQL, Spark SQL, U-SQL, T-SQL, Apache Kafka, Airflow, Talend, Kubernetes, Jenkins, Power BI, and Azure DevOps.

CG Infinity, Plano, TX
Data Engineer, Dec 2018 to May 2021
Responsibilities:
- Implemented Spark streaming jobs to continuously retrieve data from Kafka and store the streamed data in HDFS (see the illustrative sketch below).
- Used Spark SQL through the Scala and Python interfaces, optimizing the conversion of RDD case classes to schema RDDs for improved performance and ease of data manipulation.
- Ingested data from diverse sources such as HDFS and HBase into Spark RDDs, leveraging PySpark for computational tasks and generating output responses.
- Contributed to the development of ETL processes, utilizing Data Stage Open Studio, to load data from various sources into HDFS via Flume and Sqoop; executed structural modifications using MapReduce.
- Created data pipelines employing Sqoop, Pig, and Hive to ingest customer data into HDFS, facilitating subsequent data analytics.
- Utilized Talend for big data integration, incorporating Spark and Hadoop technologies.
- Developed Spark scripts and UDFs, employing both the Spark DSL and Spark SQL queries for tasks such as data aggregation, querying, and writing data back into RDBMSs through Sqoop.
- Constructed multiple MapReduce jobs using the Java API, along with Pig, for data extraction, transformation, and aggregation from various file formats, including Parquet, Avro, XML, JSON, CSV, ORC, and others.
- Worked with binary and textual data formats in Spark, such as CSV, JSON, and XML, and handled their serialization and deserialization using Spark DataFrames and RDDs.
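For illustration only, a minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern described above. The broker address, topic name, and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

    # Hypothetical Kafka-to-HDFS streaming job.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
              .option("subscribe", "events")                       # hypothetical topic
              .load()
              .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events/")               # hypothetical sink path
             .option("checkpointLocation", "hdfs:///checkpoints/events/")
             .start())

    query.awaitTermination()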
- Designed and created dashboards and reports using tools such as Tableau and Power BI, helping patients understand their migraine data and model-generated insights.
- Validated the performance of machine learning models using appropriate evaluation metrics.
- Used Terraform for provisioning Azure resources and ARM templates for defining and deploying infrastructure.
- Ensured the accuracy and reliability of the model through continuous testing and refinement.
- Created Oozie workflows to automate data loading into HDFS and used Pig to pre-process the data and apply complex transformations.
- Transformed data using AWS Glue DynamicFrames with PySpark; cataloged the transformed data using crawlers and scheduled the job and crawler using the Glue workflow feature.
- Used Jenkins for CI/CD, Docker as a container tool, and Git as a version control tool.
- Developed dashboards and visualizations to help business users analyze data and provide insights to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.
- Integrated Jenkins pipelines with Kubernetes and Docker to automate the deployment of data pipelines and data-driven applications to Kubernetes clusters.
- Used Git to manage and track source code changes and to collaborate on codebases.
- Used CA Agile Rally to create features and use cases, track bugs, add test cases from Red Hat Studio via Jenkins, and keep track of the project.
Environment: Python, PySpark, Spark SQL, .NET, AWS EMR, S3, RDS, Flume, Sqoop, ETL, Lambda, Apache Spark, HBase, Apache Kafka, Hive, MapReduce, Apache Pig, Scala, shell scripting, Docker, Kubernetes, SSRS, and Power BI.

Fiserv, Dallas, TX
Big Data Engineer, Oct 2015 to Nov 2018
Responsibilities:
- Led end-to-end ETL solutions using Apache NiFi, scheduling jobs to seamlessly ingest streaming data into the Apache Kafka messaging system, with a focus on data extraction, transformation, and loading from diverse sources.
- Created NiFi flows to process data and deliver it to downstream systems.
- Created and managed Hive tables, including managed, external, and partitioned tables.
- Created Lambda functions in AWS to run EC2 containers.
- Managed end-to-end data workflows with AWS Glue and orchestrated ETL workflows using Apache Airflow.
- Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, psycopg, embedPy, NumPy, and Beautiful Soup.
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python.
- Developed Spark scripts for data analysis in both Python and Scala.
- Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
- Built on-premise data pipelines using Kafka and Spark for real-time data analysis.
- Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
- Integrated Apache Spark for parallel processing in batch jobs leveraging Spark SQL and DataFrames, resulting in a 10% improvement in ETL performance and a reduction in resource utilization.
- Processed web URL data using Scala and converted it to DataFrames for further transformations, including flattening, joins, and aggregations.
- Queried data using Spark SQL on top of the Spark engine for faster dataset processing (see the illustrative sketch below).
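For illustration only, a minimal PySpark sketch of querying data with Spark SQL as described above. The S3 paths, view name, and columns are hypothetical.

    # Hypothetical Spark SQL query over web URL data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark_sql_example").getOrCreate()

    urls = spark.read.json("s3a://example-bucket/web-urls/")      # hypothetical source
    urls.createOrReplaceTempView("web_urls")

    top_domains = spark.sql("""
        SELECT domain, COUNT(*) AS hits
        FROM web_urls
        GROUP BY domain
        ORDER BY hits DESC
    """)

    top_domains.write.mode("overwrite").parquet("s3a://example-bucket/curated/top_domains/")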
- Implemented machine learning algorithms, such as decision trees, clustering, and regression, to derive personalized insights from patient data, and designed intuitive dashboards that provide actionable insights and facilitate data-driven decision-making.
- Assisted the QA team during testing and defect fixes.
- Implemented real-time alerting systems using Kafka and Spark to notify relevant stakeholders of critical incidents.
- Developed interactive dashboards and reports using Power BI for business users and executives.
- Created custom visualizations to monitor inventory levels, sales trends, and customer feedback.
Environment: Hadoop, Apache Spark, NiFi, Python (scikit-learn, NumPy, pandas, PySpark, pytest, PyMongo), Spark SQL, Oracle, SQL Server, MongoDB, Kafka, HDFS, Hive, RESTful APIs, AWS, and Power BI.

Infosys, Bangalore, India
Hadoop Data Engineer, Feb 2013 to Dec 2014
Responsibilities:
- Worked on the development of data ingestion pipelines using the Talend ETL tool and Bash scripting with big data technologies including, but not limited to, Hive, Impala, and Spark.
- Set up and worked with Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Pig, and MapReduce access for new users.
- Performed data validation and cleansing of staged input records before loading into the data warehouse.
- Supported data quality management by implementing appropriate data quality checks in data pipelines.
- Involved in development on the Cloudera distribution.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Implemented a proof of concept on the Hadoop stack and different big data analytics tools, including migration from different databases.
- Built machine learning models to showcase big data capabilities using PySpark and MLlib (see the illustrative sketch at the end of this section).
- Worked with multiple storage formats (Avro, Parquet) and databases (Impala, Kudu).
- Designed and built big data ingestion and query platforms with Spark, Hadoop, Oozie, Sqoop, Presto, Amazon EMR, Amazon S3, EC2, AWS CloudFormation, AWS IAM, and Control-M.
- Worked with messaging systems using message brokers such as RabbitMQ.
- Involved in the development of Agile, iterative, and proven data modeling patterns that provide flexibility.
- Worked with utilities such as TDCH to load data from Teradata into Hadoop.
- Scheduled jobs using crontab.
- Troubleshot users' analysis bugs (JIRA and IRIS tickets).
- Worked with the Scrum team to deliver agreed user stories on time every sprint.
- Implemented UNIX scripts to define the use-case workflow, process data files, and automate jobs.
Environment: Spark, AWS, Hadoop, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, shell scripting, UNIX, Jenkins, Eclipse, SVN, Oozie, Talend, Agile methodology.
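For illustration only, a minimal PySpark MLlib sketch of the kind of model training mentioned in the Infosys role. The dataset path, feature columns, and label column are hypothetical.

    # Hypothetical MLlib pipeline: assemble features and fit a logistic regression model.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib_example").getOrCreate()

    df = spark.read.parquet("hdfs:///data/training/")             # hypothetical dataset
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"],     # hypothetical feature columns
                                outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[assembler, lr]).fit(df)
    model.transform(df).select("label", "prediction").show(5)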