Lisha - Big Data Engineer
Phone: (609) 905-0745
Email: [email protected]
Location: Dallas, Texas, USA
Relocation: Yes
Visa: H4
Big Data Engineer (609)-905-0745 [email protected] Dallas, TX Yes H4 Professional Summary: Having 8+ years as a Big Data Engineer with experience in all phases of Software Application requirement analysis, design, development, and maintenance of Big Data applications. Strong experience in end-to-end data engineering including data ingestion, data cleansing, data transformations, data validations/auditing and feature engineering. Strong experience in programming languages like Java, Scala, and Python. Strong experience working with Hadoop ecosystem components like HDFS, Map Reduce, Spark, HBase, Oozie, Hive, Sqoop, Pig, Flume and Kafka Good hands-on experience working with various Hadoop distributions mainly Cloudera (CDH), Hortonworks (HDP) and Amazon EMR. Good understanding of Distributed Systems architecture and design principles behind Parallel Computing. Expertise in developing production ready Spark applications utilizing Spark-Core, Data frames, Spark-SQL, Spark-ML and Spark-Streaming API's. Experience on migrating on Premises ETL process to Cloud. Experience in Data Warehousing applications, responsible for the Extraction, Transformation and Loading (ETL) of data from multiple sources into Data Warehouse Strong experience troubleshooting failures in spark applications and fine-tuning spark applications and hive queries for better performance. Worked extensively on building real time data pipelines using Kafka for streaming data ingestion and Spark Streaming for real time consumption and processing. Worked extensively on Hive for building complex data analytical applications. Strong experience writing complex map-reduce jobs including development of custom Input Formats and custom Record Readers. Experience in managing the Hadoop infrastructure with Cloudera Manager. Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice-versa. Great hands-on experience with PySpark for using Spark libraries by using python scripting for data analysis. Good experience working with AWS Cloud services like S3, EMR, Lambda functions, Redshift, Athena, Glue etc., Good experience working with Azure Cloud services like ADLS, Azure Databricks, Azure functions, Azure SQL warehouse, Azure Synapse Analytics, Azure Data factory etc., Solid experience in working with csv, text, Avro, parquet, orc, Json formats of data. Working experience on Core java technology, which includes efficient use of Collections framework, Multithreading, I/O & JDBC, Collections, localization, ability to develop new API for different projects. Experience in building, deploying, and integrating applications in Application Servers with ANT, Maven and Gradle. Proficient in design and development of various dashboards, reports utilizing Tableau Visualizations such as bar graphs, scatter plots, pie-charts, maps, funnel charts, lollypop charts, donuts, bubbles, etc. making use of actions and other local and global filters according to the end user requirement. Expertise in all phases of System Development Life Cycle Process (SDLC), Agile Software Development, Scrum Methodology and Test-Driven Development. Experience in using Version Control tools like Git, SVN. Experience in web application design using open source MVC, Spring and Spring Boot Frameworks. Adequate knowledge and working experience in Agile and Waterfall Methodologies. Defining user stories and driving the agile board in JIRA during project execution, participate in sprint demo and retrospective. 
- Good interpersonal and communication skills, strong problem-solving skills, able to explore and adopt new technologies with ease, and a good team player.

Technical Skills:
Big Data: Spark, Hive, MapReduce, YARN, Kafka, Sqoop, HDFS
NoSQL: HBase, DynamoDB
Languages: Python, Scala, Java
Cloud: Microsoft Azure (ADF, ADLS, Synapse, Databricks), AWS (EC2, Redshift, S3, RDS, EMR, Athena, Lambda functions)
Scripting: Shell scripting, SQL
Tools & Utilities: Jenkins, Git, Jira
Methodologies: Agile, Waterfall

Professional Experience:

Client: AT&T, Dallas, TX
Period: Aug 2021 to Present
Role: BIG DATA ENGINEER
Roles & Responsibilities:
- Responsible for ingesting large volumes of user profile and accounting data into the Azure analytics data store.
- Developed Spark applications utilizing Spark JDBC reads to perform incremental ingestion of data from on-prem databases onto cloud storage (see the sketch after this list).
- Developed Spark applications performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
- Worked on troubleshooting Spark applications to make them more error tolerant.
- Worked on fine-tuning Spark applications to improve the overall processing time of the pipelines.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to Synapse SQL.
- Provided guidance to the development team working on PySpark as the ETL platform.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
- Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other capabilities.
- Worked with Azure Databricks runtimes and utilized the Databricks API to automate launching and terminating clusters.
- Analyzed data from different sources using the Hadoop big data stack, implementing Azure Data Factory, Azure Data Lake, Azure Data Lake Analytics, HDInsight, Hive and Sqoop.
- Migrated on-prem ETLs from MS SQL Server to the Azure cloud using Azure Data Factory and Databricks.
- Implemented an ETL framework using Spark with Python and loaded standardized data into Hive and HBase tables.
- Built the infrastructure required for optimal extraction, transformation, and loading (ETL) of data from a wide variety of data sources such as Salesforce, SQL Server, Oracle and SAP using Azure, Spark, Python, Hive, Kafka and other big data technologies.
- Involved in creating Hive tables and loading and analyzing data using Hive scripts.
- Implemented partitioning, dynamic partitions and bucketing in Hive.
- Automated and validated data pipelines using Apache Airflow.
- Good experience with continuous integration of applications using Jenkins.
- Designed and built various Tableau reports and dashboards with grouping, sorting, filtering, ranking, and ordering of data.
- Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Designed and documented operational problems following standards and procedures using JIRA.
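Below is a minimal PySpark sketch of the kind of incremental Spark JDBC ingestion from an on-prem database into cloud storage described above; the JDBC URL, source table, watermark column, credentials and ADLS path are illustrative assumptions rather than the actual project configuration.

```python
# Minimal sketch: incremental JDBC read from an on-prem database into cloud
# storage. All connection details, table names and paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-jdbc-ingest").getOrCreate()

jdbc_url = "jdbc:sqlserver://onprem-db:1433;databaseName=profiles"            # placeholder
target_path = "abfss://raw@examplestorage.dfs.core.windows.net/user_profiles/"  # placeholder

# Determine the high-water mark from data already landed (None on the first run).
try:
    last_ts = spark.read.parquet(target_path).agg(F.max("updated_at")).first()[0]
except Exception:
    last_ts = None

predicate = f"updated_at > '{last_ts}'" if last_ts else "1 = 1"
source_query = f"(SELECT * FROM dbo.user_profiles WHERE {predicate}) AS src"

incremental_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", source_query)
    .option("user", "svc_ingest")        # placeholder credentials
    .option("password", "********")
    .option("fetchsize", "10000")
    .load()
)

# Append only the new or changed rows to the raw zone.
incremental_df.write.mode("append").parquet(target_path)
```

The watermark-based predicate keeps each run limited to rows changed since the last load, which is one common way to implement the incremental reads mentioned above.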
Environment: Azure Cloud, Databricks, Spark, Hive, Synapse SQL, Sqoop, Kafka, Azure Functions, ADF

Client: COMCAST, Chicago, IL
Period: Jan 2020 to July 2021
Role: BIG DATA ENGINEER
Roles & Responsibilities:
- Ingested log data and on-prem data warehouse data on a daily basis using different ingestion processes.
- Developed various Spark applications to perform enrichments of user behavioral data (clickstream data) merged with user profile data.
- Involved in data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for downstream machine learning and reporting teams.
- Troubleshot Spark applications for improved error tolerance.
- Fine-tuned Spark applications/jobs to improve the efficiency and overall processing time of the pipelines.
- Created a Kafka producer API to send live-stream data into various Kafka topics.
- Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into HBase.
- Utilized Spark in-memory capabilities to handle large datasets.
- Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing.
- Hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Designed PySpark scripts and Airflow DAGs to transform and load data from Hive tables stored in AWS S3.
- Implemented a proof of concept deploying this product on an AWS S3 bucket and Snowflake.
- Experienced in working with EMR clusters and S3 in the AWS cloud.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing back into S3 buckets.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs on ingested data.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Created scripts in Python to read CSV, JSON and Parquet files from S3 buckets and load the data into AWS S3, DynamoDB and Snowflake.
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests through Amazon API Gateway.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake on AWS S3.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
- Created Hive tables and loaded and analyzed data using Hive scripts.
- Implemented partitioning, dynamic partitions and bucketing in Hive.
- Deployed code to EMR via CI/CD using Jenkins.
- Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Working experience in Agile methodology.

Environment: AWS EMR, Spark, Hive, S3, Lambda, Snowflake, Scala, HDFS, Sqoop, Kafka, Oozie, HBase, Python, MapReduce

Client: COGNIZANT, INDIA
Period: June 2016 to Oct 2019
Role: BIG DATA DEVELOPER
Roles & Responsibilities:
- Worked on building a centralized data lake on the AWS cloud utilizing primary services such as S3, EMR, Redshift and Athena.
- Worked on migrating datasets and ETL workloads from on-prem to AWS cloud services.
- Built a series of Spark applications and Hive scripts to produce various analytical datasets needed by digital marketing teams.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries (see the sketch below).
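A minimal PySpark sketch of the Spark SQL extract-and-query pattern referenced above; the S3 paths, view name and column names are illustrative assumptions, not the actual project datasets.

```python
# Minimal sketch: load raw data from S3, query it with Spark SQL, and write an
# aggregated dataset back to S3. Paths, columns and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("marketing-datasets").getOrCreate()

# Expose a raw dataset landed in S3 to Spark SQL as a temporary view.
clicks = spark.read.parquet("s3://example-datalake/raw/clickstream/")   # placeholder path
clicks.createOrReplaceTempView("clickstream")

# Aggregate with plain SQL for downstream analytical consumers.
daily_summary = spark.sql("""
    SELECT event_date,
           campaign_id,
           COUNT(*)                AS events,
           COUNT(DISTINCT user_id) AS unique_users
    FROM clickstream
    GROUP BY event_date, campaign_id
""")

(daily_summary.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-datalake/curated/daily_campaign_summary/"))  # placeholder path
```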
- Used the managed Spark platform Databricks on AWS to quickly create clusters on demand and process large amounts of data using PySpark.
- Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to the cloud.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Worked extensively on fine-tuning Spark applications and providing production support for various pipelines running in production.
- Designed, developed, and implemented ETL pipelines using the Python API (PySpark) of Apache Spark on AWS EMR.
- Worked closely with business teams and data science teams to ensure all requirements were translated accurately into our data pipelines.
- Worked on the full spectrum of data engineering pipelines: data ingestion, data transformations and data analysis/consumption.
- Supported continuous storage in AWS using Elastic Block Storage, S3 and Glacier; created volumes and configured snapshots for EC2 instances.
- Worked on automating infrastructure setup and the launching and termination of EMR clusters.
- Created Hive external tables on top of datasets loaded into S3 buckets and created various Hive scripts to produce a series of aggregated datasets for downstream analysis.
- Developed automated regression scripts in Python to validate ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
- Built a real-time streaming pipeline utilizing Kafka, Spark Streaming and Redshift.
- Worked on creating Kafka producers using the Kafka Java producer API to connect to an external REST live-stream application and produce messages to Kafka topics.
- Automated complex workflows using the Airflow workflow handler.

Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka

Client: HDFC, INDIA
Period: Jan 2015 to May 2016
Role: HADOOP DEVELOPER
Roles & Responsibilities:
- Developed Hive ETL logic for data cleansing and transformation of data coming from RDBMS sources.
- Implemented complex data types in Hive and used multiple data formats such as ORC and Parquet.
- Worked on different parts of data lake implementation and maintenance for ETL processing.
- Developed Spark Streaming applications using Scala and Python for processing data from Kafka.
- Implemented various optimization techniques in Spark Streaming applications with Python.
- Imported batch data using Sqoop to load data from MySQL to HDFS at regular intervals.
- Extracted data from various APIs and performed data cleansing and processing using Java and Scala.
- Converted Hive queries into Spark SQL integrated with the Spark environment for optimized runs.
- Developed data migration pipelines from the on-prem HDFS cluster to Azure HDInsight.
- Developed complex queries and ETL processes in Jupyter notebooks using Databricks Spark.
- Developed different microservice modules to collect application statistics for visualization.
- Worked on Docker and Kubernetes to containerize and deploy applications.
- Implemented NiFi pipelines to export data from HDFS to cloud locations such as AWS and Azure.
- Worked on implementing various Airflow automations for building integrations between clusters (see the sketch after this section).

Environment: Hive, Sqoop, Linux, Cloudera, Scala, Kafka, HBase, Avro, Spark, ZooKeeper, MySQL, Databricks, Python, Airflow
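A minimal Airflow DAG sketch illustrating the kind of scheduled Sqoop-plus-Spark automation described in the section above; the DAG id, schedule, connection string, table, credentials and script path are illustrative assumptions rather than the actual project setup.

```python
# Minimal sketch: an Airflow DAG that runs a scheduled Sqoop import from MySQL
# into HDFS, then a Spark transformation job. Names and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_mysql_ingest_and_transform",   # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    sqoop_import = BashOperator(
        task_id="sqoop_import",
        bash_command=(
            "sqoop import --connect jdbc:mysql://mysql-host/sales "        # placeholder connection
            "--username etl_user --password-file /user/etl/.mysql.pwd "
            "--table transactions --target-dir /data/raw/transactions "
            "--delete-target-dir -m 4"
        ),
    )

    spark_transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_transactions.py",   # placeholder path
    )

    sqoop_import >> spark_transform
```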