Arpit (Monu) Jain - Senior Azure Data Engineer |
[email protected] |
Location: NYC, New York, USA |
Relocation: NY, MI / Remote |
Visa: TN |
Phone: 2018446579

Skills
- Big Data Developer / Data Engineer
- Hive
- Data Analyst
- Python Developer
- GCP Data Engineer
- Scala Developer
- Azure Data Architect / Azure Data Engineer
- AWS Data Engineer

Summary (Spark/Hive/HDFS)
Senior Big Data developer (Data Engineer) with over 12 years of experience and an outstanding record in Spark and open-source technologies.
- Deep understanding of the different phases of data projects: source, ingestion, profiling, cleaning, storage, processing, warehousing, reporting, and visualization.
- Built a messaging system using Kafka for sending ACK messages.
- Used Spark 1.6, Spark 2.0, and PySpark with Scala/Python and Java.
- Extensively used Hive and HDFS storage concepts for data storage and analytics.
- Used Impala / Hive / Spark SQL for data extraction and profiling.
- Chose file formats and compression based on the nature of each dataset.
- Used Sqoop (MapReduce-based) for data ingestion to HDFS and export to RDBMS.

Cloud Summary
- Expertise with Microsoft Azure for building data pipelines, integrating with on-premises data, and computation using Azure Databricks.
- Working closely with GCP: handling the BigQuery reader with PySpark code and optimizing queries; Dataproc cluster creation for client- and cluster-mode execution.
- Microsoft Certified Azure Data Engineer (Jan 2021).
- Delivered multiple projects building enterprise data lakes over HDP and Azure.
- AWS EMR and Glue catalog used for writing PySpark pipelines to process data in different file formats; AWS S3 used for data storage and EC2 instances as the file landing zone.
- Azure Data Factory V2, Azure Storage, Azure Data Lake Gen2, and Azure Databricks used for creating data pipelines, storage, and computation.
- Experience with AWS Lambda, AWS EMR, AWS Redshift, and S3 for data storage, computation, and analytics.
- Snowflake SnowPro Core certified (July 2021).
- Currently exploring GCP (Google Cloud Platform): migrating code and data from Netezza/SQL Server to BigQuery, and migrating computation from DataStage to Dataproc clusters and PySpark.

Data Reporting / Interface
- Used Datameer for data profiling and viewing Parquet data.
- Used DataStage for data transformation and for building pipelines from source RDBMS to the HDFS layer.
- Used Dremio + Tableau for connecting to HDFS/Hive and visualizing reports in Tableau; Dremio acts as a query-optimizer interface.

Programming Languages (Scala/Python/Java)
- Expertise in PySpark with the Pandas library for data transformation and analysis.
- Played the data engineer role in multiple projects, using Spark + Scala for data pipelines.
- Also used Spark 1.6 + core Java for data transformation, including XML parsing with Java libraries.

Soft Skills
- Recognized for inspiring management team members to excel and encouraging creative work environments.
- Proven success in leadership, operational excellence, and organizational development, with a keen understanding of data-driven business.
- Act as a converter: take business requirements, turn them into technical designs, and then into code.

Tools
- Maven / GitHub / Jenkins used for code build, repository management, and deployment.
- Hands-on with Eclipse, Jupyter Notebook, PyCharm, and VS Code IDEs for development.
- SQL Assistant used for running ad hoc Hive queries for data profiling and viewing.

Work History

2022-01 - Current  Azure Data Engineer / Lead
CTS / Air CA, NYC, NY / Toronto
- Analyzing existing DataStage pipelines and migrating them to PySpark executed in Azure Databricks.
- Writing common routines in PySpark to build a robust ETL tool in ADF/Databricks.
- Migrated Netezza data to Azure Synapse; Azure Synapse used as the data warehouse and a Dataproc cluster as the computation engine.
- Developing stored procedures in SQL Server and calling them from PySpark code for complex data transformations.
- Writing custom functions to replicate DataStage functionality in PySpark.
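Replicating DataStage functionality in PySpark usually means re-implementing transformer functions as plain Python and registering them as UDFs. A hedged sketch, assuming the DataStage Convert(from, to, expr) function as the example; the character mappings shown are made up:

```python
# Illustrative sketch: replicating a DataStage-style Convert(from, to, expr)
# transformer function in plain Python; the character mappings are assumptions.
def datastage_convert(from_chars: str, to_chars: str, value: str) -> str:
    """Replace each character in from_chars with the character at the same
    position in to_chars; characters with no counterpart are dropped,
    mirroring DataStage's Convert behaviour."""
    table = {ord(f): (to_chars[i] if i < len(to_chars) else None)
             for i, f in enumerate(from_chars)}
    return value.translate(table)

# In a real pipeline this logic would be registered as a PySpark UDF, e.g.:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# convert_udf = udf(lambda v: datastage_convert("|;", ",,", v), StringType())
```

Keeping the logic in a plain function makes it testable outside Spark before wrapping it as a UDF.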
- Developed complex SQL queries to execute in Synapse.

2021-03 - 2021-12  Data Engineer / Lead
Cognizant / Huntington Bank, Walgreens, NYC, NY, USA
- Worked on code migration from SSIS to PySpark on Hortonworks; technology conversion from T-SQL to Snowflake.
- Actively participated in PySpark code design to meet client requirements.
- Performance-tuned Spark and Snowflake code to match the performance of the existing SSIS and SQL solutions.
- Performed deep analysis of the data to understand it and implement it in code.
- Worked on code conversion from DataStage to PySpark jobs in Azure Databricks; Azure Data Lake used for storage and Databricks for computation.

2020-12 - 2021-03  Data Engineer
Tech Mahindra, ArcelorMittal, Montreal, CA
- Built a data ingestion tool using Azure Event Hubs and Databricks.
- ADLS Gen2 and Azure Synapse used for storage and data warehousing.
- Loaded streaming data into the warehouse in a fact/dimension schema using PySpark code.
- Azure Storage used for control tables and log data.
- PySpark used for data cleaning and transformation.

2020-01 - 2020-12  Data Engineer
Tech Mahindra, Rogers, Toronto, CA
- Built a data ingestion tool using Azure Data Factory; Azure Databricks notebooks used for data computation and transformation.
- Azure Storage used for control tables and log data.
- Migrated large volumes of data from on-premises HDP to Azure Data Lake; implemented data migration from HDP 2.6 to the Azure platform.
- Built an optimal storage design to achieve the best storage cost and computing performance.
- Used Spark Datasets, Scala, shell scripting, and Hive extensively for this project.
- Built Data Factory pipelines for migration from Oracle/HDP to Azure.
- GitHub, Maven, Jenkins, Eclipse, and PyCharm used for build, code management, and deployment.

2018-01 - 2019-12  Data Engineer
Tech Mahindra, RBC (Finance/IT), Toronto, CA
- Delivered a Spark SQL based solution for operations over SWIFT messages.
- Impala used on the Cloudera platform for data cleaning, data profiling, and extraction of data for the business.
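Data profiling of the kind described above with Impala/Spark SQL typically reduces to row counts, distinct counts, and null counts per column. A minimal self-contained sketch, using stdlib sqlite3 in place of Impala; the table and column names are made up:

```python
import sqlite3

def profile_column(conn, table: str, column: str) -> dict:
    """Basic profiling stats for one column: total rows, distinct non-null
    values, and null count -- the same shape of query used in Impala/Spark SQL."""
    row = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    return {"rows": row[0], "distinct": row[1], "nulls": row[2]}

# Demo with an in-memory table standing in for a Hive/Impala table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (account TEXT, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [("A1", 10.0), ("A1", 5.5), ("A2", None)])
stats = profile_column(conn, "txns", "amount")
```

Note that COUNT(DISTINCT col) ignores NULLs, so the null count has to be computed separately with the CASE expression.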
- Delivered Python + PySpark code that automated Excel work the team had previously done manually with Excel tools.
- Built a self-serve data layer for the data analytics and data science teams using Dremio / Presto.
- Delivered a presentation on different file formats and their performance comparison in big data.
- Provided an AWS cloud solution for the client's big data migration from on-premises to S3; all computation moved to AWS EMR and the data warehouse created in AWS Redshift.
- Datameer used for profiling, querying, and viewing data.

2016-07 - 2017-12  Big Data Developer
Tech Mahindra, RBC Insurance, Toronto, CA
- Completed a data migration project from scratch to delivery and served as technical lead for the data lake project.
- Used shell scripting for scheduling, Spark + Scala for data transformation and migration, and Hive for viewing and analysis.
- Migrated approximately 60+ source systems into 3 downstream systems and stored all data in the data lake for future reporting and analysis.
- Managed approximately 15+ Spark developers on this project and was also responsible for integration.
- Sqoop jobs (MapReduce-based) used for importing/exporting data between Oracle and HDFS.
- Built a messaging system using Kafka for sending ACK messages to downstream applications.

2014-10 - 2016-06  Big Data Developer
Tech Mahindra, Client: NCB Jeddah, Mumbai, India
- Served as data lake data engineer for National Commercial Bank.
- Responsible for choosing storage formats and writing ad hoc queries and Hive routines for analysis.
- Data stored in the data lake used to generate daily/monthly/yearly extracts for vendors.
- Wrote Sqoop jobs, using HDFS as the storage layer, and used Hive queries for data analysis.
- Hortonworks distribution used for Hadoop.

2010-03 - 2014-10  Developer
Tech Mahindra, Mumbai, India
- Served as a developer on a banking product from the start of my career in the IT industry.
- Part of the SBI production troubleshooting team for 3 years.
- Used COBOL, PL/SQL, and shell scripting to meet different banking clients' requirements.
- Implemented SWIFT payment systems across multiple banking projects.