Bhargavi Kondamuri
Data Engineer
Location: Portland, OR | Relocation: Yes | Visa status: H4EAD
Contact: +1 571 751 1099 | Email: [email protected]
LinkedIn: https://www.linkedin.com/in/kondamuri-bhargavi-0a2b6b139/

PROFESSIONAL SUMMARY:
- Highly qualified professional with over 8 years of experience interpreting, processing, and analyzing data to drive successful business solutions, building big data applications and data pipelines, creating data lakes to manage structured and semi-structured data, and implementing workflows.
- Extensive experience with big data technologies such as HDFS, MapReduce, PySpark, YARN, Kafka, Hive, Sqoop, Snowflake, and Redshift.
- Expertise in debugging and tuning failed and long-running Spark applications using optimization techniques such as executor tuning, memory management, garbage collection, serialization, broadcast variables, and persistence, ensuring optimal application performance.
- Expertise in writing DDL and DML scripts for analytics applications in MySQL, Hive, Redshift, and Snowflake.
- Experienced in Python development for various ETL applications, as well as Python libraries such as Pandas for data analysis.
- Expertise in AWS cloud services such as EMR, S3, Redshift, Lambda, Data Pipeline, and Athena for big data development.
- Expertise in Azure cloud services such as Blob Storage, Databricks, Synapse, Data Factory, Data Pipeline, Event Hub, and HDInsight.
- Expertise in Hive optimization techniques such as partitioning, bucketing, vectorization, map-side joins, bucket-map joins, skew joins, and index creation (illustrated in the sketch following this summary).
- Expertise in ETL processes using Python, RDS, and data lakes to extract, transform, and load large volumes of data on AWS for data warehousing and data migration.
- Experience with batch processing and operational data sources, and with migrating data from traditional databases to Hadoop and NoSQL databases.
- Experienced with file formats such as Parquet, ORC, CSV, Text, Sequence, XML, JSON, and Avro.
- Expertise in Python scripting and Bash scripting.
- In-depth understanding of Hadoop architecture and its components, including Resource Manager, Node Manager, Application Master, Name Node, and Data Node.
- Excellent technical and analytical skills with a clear understanding of the design goals of ER modeling for OLTP and dimensional modeling for OLAP.
- Experience orchestrating workflows using Airflow.
- Good knowledge of building and maintaining highly scalable and fault-tolerant infrastructure in the AWS environment spanning multiple availability zones.
- Passionate about gleaning insightful information from massive datasets and developing a culture of sound, data-driven decision-making.
- A good team player who likes to take initiative and seek out new challenges.
- Excellent communication skills; able to work in a fast-paced, multitasking environment both independently and in a collaborative team; a self-motivated, enthusiastic learner.
- Involved in all phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
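For illustration, the following is a minimal PySpark sketch of the Hive partitioning and map-side (broadcast) join techniques mentioned in the summary; the database, table, and column names are hypothetical placeholders rather than details from any specific engagement.

# Illustrative sketch only: hypothetical database, table, and column names.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-optimization-sketch")   # placeholder app name
         .enableHiveSupport()                   # assumes a Hive metastore is available
         .getOrCreate())

# Create a partitioned, ORC-backed Hive table (partition column: order_date).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_opt (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Favor map-side (broadcast) joins for small dimension tables (~50 MB threshold).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# Dynamic-partition insert from a staging table; the partition column goes last.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales.orders_opt PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales.orders_staging
""")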
CERTIFICATION:
- Databricks Apache Spark Associate Developer Certification

TECHNICAL SKILLS:
Big Data Technologies: HDFS, MapReduce, Sqoop, Hive, Spark, Kafka
Languages: Python, SQL, HiveQL, Shell Scripting, Unix, Java
AWS Services: EC2, EMR, Redshift, RDS, IAM, S3, AWS Lambda
Azure Services: Blob, Databricks, Synapse, Data Factory, Data Pipeline, Event Hub, HDInsight
Databases: MySQL, Hive, Redshift, DynamoDB, Snowflake
Other Tools: JIRA, GitHub, Jenkins

PROFESSIONAL EXPERIENCE:

Client: Capital One, Richmond, VA | July 2021 - Till Date
Role: Data Engineer
Project Description: In this project, I played a pivotal role in enhancing the capabilities of the existing data management and analytics platform by seamlessly integrating various technologies and implementing robust solutions. The project encompassed a wide range of tasks, including data profiling, ETL pipeline development, data cleansing, data governance, real-time analytics, and deployment automation.
Responsibilities:
- Worked closely with the MDM team to identify the data requirements for their landing tables and created Mappings, Trust and Validation rules, Match Paths, Match Columns, Match rules, Merge properties, and Batch Groups as part of Informatica MDM.
- Designed and developed ETL pipelines using PySpark and Hive to process large datasets in a distributed environment.
- Developed and maintained data processing and cleansing scripts using Python and PySpark.
- Wrote complex SQL and PL/SQL testing scripts for backend testing of the data warehouse application.
- Expert in writing complex SQL/PL/SQL scripts for querying Teradata and Oracle.
- Performed data extraction, scaling, transformation, data modeling, and visualization using Python, SnowSQL, and HSQL based on requirements.
- Developed and maintained PySpark-based data processing pipelines for batch and streaming data using Kafka, Spark Streaming, and Spark SQL.
- Designed and developed ETL integration patterns using Python on Spark.
- Imported and exported data between Teradata and HDFS.
- Migrated data from existing Teradata systems to HDFS and built datasets on top of it.
- Utilized PySpark to develop and execute big data analytics and machine learning applications; executed machine learning use cases with Spark and MLlib.
- Used Bash shell scripts to automate system maintenance tasks.
- Implemented consolidation of the CDW module using PySpark, Databricks, and AWS.
- Developed Spark/Scala scripts and UDFs using both DataFrames and RDDs in Spark for aggregations, queries, and writing data back into Delta Lake tables.
- Applied data governance to ensure integrity and data quality.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that receives data from Event Hub in near real time and persists it to Delta Lake (see the sketch following this section).
- Developed a POC on real-time streaming data received from Event Hub; processed the data using PySpark and stored it in Delta Lake using Databricks.
- Created configuration schemas and tables in Azure Synapse for dynamically running pipelines.
- Created advanced SQL scripts in PL/SQL Developer to facilitate data inflow and outflow in Oracle.
- Wrote advanced SQL for joining multiple tables, sorting data, and creating SQL views in Snowflake.
- Used the Python subprocess module to execute UNIX shell commands.
- Automated and scheduled Sqoop jobs using UNIX shell scripts.
- Developed UNIX shell scripts for creating reports from Hive data.
- Deployed cloud services, including Jenkins and Nexus, on Docker using Terraform.
- Developed scalable solutions using NoSQL databases, including HBase and Cosmos DB.
- Worked on Hive data warehouse modeling to interface with BI tools such as Tableau.
- Wrote custom Python regex functions to mask non-public information and plastic card data.
Environment: Python, Hadoop, Spark, Spark SQL, Hive, MySQL, HDFS, Shell Scripting, AWS S3, EMR, Databricks, Redshift, Snowflake
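For illustration, the following is a minimal sketch of the near-real-time streaming path described above: PySpark Structured Streaming reading events from a Kafka-compatible endpoint (Azure Event Hub can be consumed this way) and appending them to a Delta Lake table. The broker address, topic, schema, and storage paths are hypothetical placeholders.

# Illustrative sketch only: assumes the spark-sql-kafka and Delta Lake packages
# are available (e.g., on Databricks); topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("learner-events-stream").getOrCreate()

event_schema = (StructType()
                .add("learner_id", StringType())
                .add("event_type", StringType())
                .add("event_ts", TimestampType()))

# Read the raw event stream from a Kafka-compatible endpoint.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
       .option("subscribe", "learner-events")              # placeholder topic
       .load())

# Parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Append the parsed events to a Delta Lake table with checkpointing.
(events.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/mnt/checkpoints/learner_events")  # placeholder
 .start("/mnt/delta/learner_events"))                              # placeholder path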
Client: BCBS, Jacksonville, FL | Dec 2019 - Jun 2021
Role: Data Engineer
Project Description: In this project, I took a lead role in implementing robust pipelines and analytical workloads using cutting-edge big data technologies. The project spanned various facets, from building a comprehensive Data Catalogue Manager tool to deploying serverless ETL pipelines on AWS Lambda.
Responsibilities:
- Implemented pipelines and analytical workloads using big data technologies such as Hadoop, Spark, Hive, and HDFS.
- Played a key role in building the Data Catalogue Manager tool to identify personal-information and non-personal-information columns in data stored in Vertica, on-premises databases (USAS), and Hive.
- Imported data from AWS S3 into HDFS and performed transformations and actions using Spark to produce the desired output.
- Implemented serverless pipelines using AWS Lambda to export data from Vertica and Hive into Redshift.
- Designed and built data lake storage and the ETL process for device-insights streaming data using Kafka and Spark Streaming.
- Imported data from S3 Glacier into Hive using Spark on EMR clusters.
- Loaded clickstream data from AWS S3 into Hive using crontab and shell scripting.
- Worked closely with the Kafka admin team to set up the Kafka cluster, and implemented Kafka producer and consumer applications on the cluster with the help of Zookeeper.
- Built a pipeline using Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
- Tuned long-running Spark applications using optimization techniques such as executor tuning, memory management, garbage collection, serialization, broadcast variables, and persistence.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Developed Spark applications to stream data from Kafka topics to HDFS, integrated with Apache Hive to make data immediately available for HQL querying.
- Developed DDL and DML scripts for creating tables and views, loading data, and developing analytics applications in Hive and Redshift.
- Implemented Python scripts for backend database connectivity and data imports.
- Built a serverless ETL in AWS Lambda to process new files in the S3 bucket so they are cataloged immediately (see the sketch following this section).
- Built a Python module to access Jira, create issues for all the DB owners, and notify them every 7 days if an issue is not closed.
- Used AWS SQS to send the processed data to the next teams for further processing.
- Deployed AWS Lambda functions to sync data from MySQL to the client portal.
- Involved in all phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
Environment: Hadoop, Spark, Spark SQL, Hive, HQL, MySQL, HDFS, Shell Scripting, Apache Kafka, Python, AWS Lambda, AWS EC2, EMR, Snowflake, JIRA
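For illustration, the following is a minimal sketch of an S3-triggered serverless cataloging step like the one described above; the S3-to-Lambda event wiring is standard, but the DynamoDB catalog table and all names are assumptions made for this example.

# Illustrative sketch only: bucket, table, and field names are hypothetical,
# and the DynamoDB catalog target is an assumption made for this example.
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")
catalog_table = boto3.resource("dynamodb").Table("data-catalog")  # placeholder table


def lambda_handler(event, context):
    """Record basic metadata for every new object landing in the S3 bucket."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        head = s3.head_object(Bucket=bucket, Key=key)
        catalog_table.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": head["ContentLength"],
            "last_modified": head["LastModified"].isoformat(),
        })
    return {"statusCode": 200, "body": json.dumps("cataloged")}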
Client: Bed Bath & Beyond, Northern NJ | Mar 2018 - Nov 2019
Role: Data Engineer
Project Description: In this project, I demonstrated a comprehensive skill set in designing, deploying, and optimizing Hadoop clusters and various big data analytic tools. The project covered a spectrum of technologies, from traditional Hadoop components to advanced tools like Apache Spark, Kafka, and Delta Lake.
Responsibilities:
- Designed and deployed the Hadoop cluster and various big data analytic tools, including Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark, and Impala.
- Imported weblogs and unstructured data using Apache Flume and stored them on HDFS.
- Loaded CDRs from relational databases using Sqoop, and from other sources into the Hadoop cluster via Flume.
- Developed business logic in a Flume interceptor in Java.
- Implemented quality checks and transformations using the Flume interceptor.
- Worked on ad hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
- Managed the MongoDB environment from availability, performance, and scalability perspectives.
- Developed multiple MapReduce jobs in Java for data processing.
- Developed workflows using custom MapReduce, Pig, Hive, and Sqoop.
- Implemented real-time data ingestion using Kafka.
- Developed multiple Kafka producers and consumers per the software requirement specifications (see the sketch following this section).
- Documented the requirements, including the available code to be implemented using Spark, Hive, and HDFS.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Optimized HiveQL and Pig scripts by using Spark as the execution engine.
- Worked with and learned a great deal from Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.
- Provisioned and managed multi-tenant Hadoop clusters in a public cloud environment (AWS) and on private cloud infrastructure (the OpenStack cloud platform).
- Worked on POCs for Apache Spark, which provides a fast and general engine for large-scale data processing integrated with the functional programming language Scala.
- Performance-tuned a Cassandra cluster to optimize it for writes and reads.
- Performed data modeling in Cassandra, connected to Cassandra from Spark, and saved summarized DataFrames to Cassandra.
- Designed and executed time-driven and data-driven Oozie workflows.
- Implemented consolidation of the CDW module using PySpark, Databricks, and AWS.
- Developed Spark/Scala scripts and UDFs using both DataFrames and RDDs in Spark for aggregations, queries, and writing data back into Delta Lake tables.
- Applied data governance to ensure integrity and data quality.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that receives data from Event Hub in near real time and persists it to Delta Lake.
- Developed a POC on real-time streaming data received from Event Hub; processed the data using PySpark and stored it in Delta Lake using Databricks.
Environment: Hadoop, HDFS, Hive, YARN, Sqoop, Oozie, Python, AWS, Shell Scripting, Spark, Spark SQL
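For illustration, the following is a minimal Kafka producer and consumer pair in Python (using the kafka-python client) in the spirit of the ingestion work above; the broker address, topic name, and message fields are hypothetical placeholders.

# Illustrative sketch only: placeholder broker, topic, and message fields.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["broker:9092"]        # placeholder broker list
TOPIC = "cdr-events"             # placeholder topic

# Producer: publish JSON-encoded records to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"msisdn": "1234567890", "duration_sec": 42})
producer.flush()

# Consumer: read records back from the beginning of the topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,   # stop iterating if no messages arrive
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)          # downstream logic would land this on HDFS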
Client: SYNECHRON, India | Nov 2014 - Jul 2017
Role: Data Engineer
Project Description: In this project, I showcased a robust skill set encompassing Apache Airflow, Python, Spark, Snowflake, Jenkins, and GitHub. The project revolved around the design and implementation of continuous integration and continuous deployment (CI/CD) pipelines, leveraging Apache Airflow for data pipeline automation and employing various tools for efficient data processing.
Responsibilities:
- Worked extensively with Apache Airflow, Python, Spark, Snowflake, Jenkins, and GitHub.
- Designed and implemented CI/CD pipelines, achieving end-to-end automation of deployment activities via Jenkins.
- Created data pipelines using Apache Airflow and automated the process (see the sketch following this section).
- Created Autosys jobs to trigger Airflow DAGs per the schedule and to meet upstream dependencies.
- Analyzed data by running SQL queries on top of Snowflake and Athena.
- Implemented business logic by writing UDFs and scripts in Python, and used various UDFs from other sources.
- Loaded and transformed large datasets from sources such as SAP HANA and Teradata using Python and Spark, and loaded the data into Snowflake and S3.
- Created external tables on S3 using Snowflake and Athena to validate the data.
- Configured and managed Jenkins in various environments such as DEV, QA, and PROD.
- Created various branches in Git, merged from the development branch to the release branch, and created tags for releases.
- Followed Agile software development methodologies.
- Participated in Program Increment (PI) planning sessions with the team's Product Owner, Scrum Master, and other product analysts to understand the business vision and create plans and objectives for the upcoming PI.
- Worked with the team during the PI on the backlog of work for each iteration, delivering value in 2-week Agile sprints.
- Stayed closely aligned with the Product Owner and global business teams throughout the iteration.
- Explored, suggested, and implemented new ideas to inform the technical direction of the team.
- Actively participated in technical review meetings to understand business requirements, strategy, and delivery constraints.
- Determined test effort for test planning and execution.
- Developed test plans, test models, and test scripts, including conditions and expected results.
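For illustration, the following is a minimal Airflow DAG sketch of the kind of pipeline described above: a Spark extract step followed by a Snowflake load, scheduled daily. The DAG id, schedule, and task bodies are hypothetical placeholders; as noted above, the real DAGs were triggered by Autosys jobs to meet upstream dependencies.

# Illustrative sketch only (assumes Airflow 2.x): placeholder DAG id, schedule,
# and task bodies standing in for the actual Spark and Snowflake steps.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_with_spark(**context):
    """Placeholder: run the Spark extraction for the logical run date."""
    print(f"extracting for {context['ds']}")


def load_to_snowflake(**context):
    """Placeholder: copy the extracted files from S3 into Snowflake."""
    print(f"loading for {context['ds']}")


with DAG(
    dag_id="daily_sap_to_snowflake",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_with_spark", python_callable=extract_with_spark)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)

    extract >> load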
EDUCATION:
Bachelor's in Computer Science from Jawaharlal Nehru Technological University, 2014.