Mirza Shahzain Umer Baig
Senior Data Engineer / Big Data Engineer
Phone: +1 469 638 9275 Ext 284
Email: [email protected]
Location: Dallas, TX
Relocation: Yes
Visa: GC

PROFESSIONAL SUMMARY

10+ years of professional IT experience in Big Data using the Hadoop framework, covering analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies as well as Java/J2EE technologies on AWS and Azure.
Knowledge of Hadoop ecosystem components such as Sqoop, Spark, Hive, HDFS, Pig and Kafka.
Experience architecting, designing, installing, configuring, and managing Apache Hadoop clusters on the MapR, Hortonworks, and Cloudera Hadoop distributions.
Experience building Scala and Python applications with Spark on Hadoop.
Solid grasp of Hadoop architecture and hands-on expertise with Hadoop components such as Resource Manager, Node Manager, Name Node, Data Node and Map Reduce concepts and HDFS Framework.
Knowledge of data migration, data profiling, data ingestion, data cleaning, transformation, data import, and data export using numerous ETL technologies such as Informatica PowerCenter and SSIS.
Working knowledge of Spark RDDs, the DataFrame API, the Dataset API, and the Data Source API.
Worked on Spark SQL and Spark Streaming.
Experience with Python and JavaScript, along with higher-level analytics tools such as Pandas, NumPy, SciPy, IPython, R, MATLAB, and SPSS.
Experience with exporting and importing data using Sqoop from HDFS to Relational Database systems and vice versa, including loading into partitioned Hive tables.
Knowledge of monitoring log files, databases, online services, and other monitoring endpoints using Splunk.
Worked on data migration from Teradata to the AWS Snowflake environment using Python and BI tools like Alteryx.
Hands-on experience with the AWS Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
Experience with partitioning and bucketing concepts in Hive, as well as designing managed and external tables in Hive to maximize efficiency.
Developed Spark code using Python and Spark SQL/Streaming to speed up data processing.
Created RDDs (Resilient Distributed Datasets) and implemented Spark Streaming jobs using PySpark and the Scala spark-shell.
Extensive experience utilizing Kafka, Flume, and Apache Spark to build real-time data streaming systems.
Solid understanding of using Apache NiFi to automate data transfer between various Hadoop Systems.
Solid knowledge of how to use Apache Kafka to manage messaging systems.
Expertise in creating reports and dashboards in Tableau, as well as knowledge of data mining and data warehousing using ETL and BI tools.
Used the data frame functionality of many querying engines, including Spark and Flink, to read and write data in Apache Iceberg.
Solid knowledge and comprehension of NoSQL databases like HBase and Cassandra.


Strong knowledge of Amazon Web Services (AWS), including EC2 for computation, S3 for storage, EMR, Lambda, RedShift, and DynamoDB.
Strong knowledge and comprehension of Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Azure Data Factory, and Logic Apps, as well as Google Cloud Platform (GCP).
Proficient with Tableau and Microsoft Power BI reports, dashboards and end-user publishing for executive-level business decisions.
Knowledge of SOA, graph databases, CI/CD pipelines, monitoring, and alerting.
Knowledge of several operating systems, such as UNIX, Linux, Solaris, and Microsoft Windows.
Practical knowledge in the development of enterprise applications using Java, J2EE, Spring, Hibernate, JSF, JMS, XML, EJB, JSP, Servlets, JSON, JNDI, HTML, CSS, and JavaScript, as well as SQL and PL/SQL.
Experience with SCRUM and Agile techniques in the Software Development Lifecycle (SDLC).


EDUCATION: Bachelor's in Computer Science Engineering from CBIT University, India - 2012

TECHNICAL SKILLS:
Hadoop Components / Big Data: Sqoop, Impala, Zookeeper, Flume, Kafka, Yarn, Cloudera Manager, Kerberos, Spark, Airflow, HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Snowflake
IDE Tools: Eclipse, IntelliJ, and PyCharm
Programming Languages: Python, SQL, Scala, Java, R, C#
ETL Tools: Informatica PowerCenter, IBM DataStage, SAP Data Services, Talend, SSIS (Microsoft SQL Server Integration Services), Oracle Data Integrator (ODI)
Cloud-Based ETL: Azure Data Factory, AWS Glue, Google Cloud Platform (GCP), Hevo, SAP Data Services Cloud, Informatica Cloud, Talend Cloud
DevOps: CI/CD, Jenkins, Maven, Docker, Kubernetes, Splunk, Git Stash
Data Warehouse: Apache Iceberg, Apache Avro, Apache Impala, Apache Griffin
Logging Tools: Apache Solr, Splunk, Elasticsearch
RDBMS: Azure SQL, Oracle SQL, PostgreSQL, Teradata, MPP Database, SQL Server
Messaging Technologies: ActiveMQ, Kafka Cluster
Data Visualization / Analytics / BI: Power BI, Tableau, Sigma, Azure Data Analytics
Data Lake Technologies: Azure Data Lake, Snowflake, Databricks, Data Lake House
Orchestration: Apache Airflow, Dataflow, Control-M


PROFESSIONAL EXPERIENCE

Client: HSF Affiliates - Irvine, CA May 2021 to Present
Role: Sr. Data Engineer

Responsibilities:

Create new data analytics applications and enhance existing ones on a Python development platform built on top of AWS services.
Developed reliable and scalable ETL solutions that combine complex and large amounts of data from various platforms.
Automated operations using Amazon Web Services (AWS), including EC2, S3, CloudFront, DynamoDB, Lambda, Elastic File System, RDS, VPC, Direct Connect, Route 53, CloudWatch, CloudTrail, CloudFormation, and IAM.
Designed and deployed multi-tier applications on AWS CloudFormation using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), with a focus on high availability, fault tolerance, and auto-scaling.
Supported persistent storage using AWS Elastic Block Store, S3, and Glacier; established snapshots and generated volumes for EC2 instances.
Ingested data from RESTful APIs, databases, and CSV files.
Migrated data pipelines from Cloudera Hadoop clusters to AWS EMR clusters.
Experience in designing and deploying Hadoop clusters as well as additional Big Data analytical tools like Spark, Kafka, Sqoop, Pig, and Hive on the Cloudera distribution.
Created an integration between Apache Kafka and Spark Streaming to consume data from outside REST APIs and carry out customized operations.
Extracted real-time feeds using Kafka and Spark Streaming, transformed them into RDDs, analyzed the data in DataFrames, and saved it in Parquet format in HDFS (a minimal sketch follows this list).
Built several Kafka producers and consumers entirely from scratch to meet the requirements.
Experience tuning Spark job performance.
Optimized the speed of Spark operations by making the most of the cluster environment and using caching.
Developed Spark scripts using Scala shell commands based on requirements.
Created Oozie workflows for scheduling numerous Hive and Pig jobs.
Executed Hadoop streaming jobs that processed gigabytes of text data, using several file formats including Text, SequenceFile, Avro, ORC, and Parquet.
Used Amazon EMR for Big Data processing in a Hadoop Cluster of virtual machines on Amazon's EC2 and S3 services.
Utilized AWS SageMaker to build, train, and deploy machine learning models.
Helped with task execution on a Kubernetes cluster and the generation of Docker images.
Implemented generalized solution models in AWS SageMaker.
Experience with Spark's core APIs and data processing on an EMR cluster.
Developed and implemented AWS Lambda functions to build a serverless data pipeline that feeds the Glue Data Catalog and can be queried by Athena for ETL migration services.
Set up rules for S3 buckets and used S3 and Glacier for backup and storage on AWS.
Managed IAM users by creating new users, granting them restricted access according to their requirements, and assigning roles and policies.
Participated in Snowflake testing to find the best way to use cloud resources and used Snowflake's time travel feature to retrieve old data.
Assisted with the execution of merge scripts to handle upserts and the construction of Delta Lake tables.
Implemented Snowflake scripts for automatically refreshing external tables.
Used Python, Spark, and Spark Streaming to create an analytical component.
Act as the team's technical point of contact for all customer inquiries regarding AWS.
Transformed Hive/SQL queries into Spark transformations by using Scala and Spark RDDs.
Worked on handling gigabytes of XML- and JSON-formatted data using Hadoop streaming.
Used the Spark API as the execution engine for Hive data analysis instead of MapReduce on YARN.
Configured the YARN Capacity Scheduler in several configurations and fine-tuned settings according to the workloads of each application.
Set up a Continuous Integration system with Jenkins, Maven, and GIT to execute automated test suites on a regular basis.
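For illustration only, a minimal PySpark sketch of the Kafka and Spark Streaming ingestion described above (real-time feed in, Parquet out to HDFS). It assumes Spark Structured Streaming with the spark-sql-kafka connector available at submit time; the broker, topic, schema, and paths are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical example; broker, topic, schema, and paths are placeholders.
spark = (SparkSession.builder
         .appName("kafka-feed-to-parquet")
         .getOrCreate())

# Schema of the JSON events arriving on the Kafka topic (assumed).
event_schema = StructType([
    StructField("listing_id", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the real-time feed from Kafka as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "listing-events")
       .option("startingOffsets", "latest")
       .load())

# Parse the Kafka value payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Persist the parsed feed to HDFS in Parquet format with checkpointing.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/listings/parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/listings")
         .outputMode("append")
         .start())

query.awaitTermination()

The same pattern extends to the custom Kafka consumers mentioned above by swapping the parsing step for whatever transformation a given feed requires.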

Tools & Environment: Hadoop, HDFS, Hive, Spark, Cloudera, Snowflake, AWS EC2, AWS S3, AWS EMR, Sqoop, Kafka, Yarn, Shell Scripting, Scala, Pig, Cassandra, Oozie, Agile methods, MySQL

Client: Fifth Third - Cincinnati, Ohio Mar 2020 to May 2021
Role: Data Engineer

Responsibilities:
Hands-on expertise with Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
Have good experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).
Used Azure Data Factory, T-SQL, Spark SQL, and U-SQL from Azure Data Lake Analytics to extract, transform, and load data from source systems to Azure Storage services.
Processed data in Azure Databricks once it had been ingested into one or more Azure services (such as Azure Data Lake, Azure Storage, Azure SQL DB, and Azure SQL DW).
Have experience working with the Snowflake data warehouse.
Moved data from Azure Blob Storage to the Snowflake database.
Developed Spark code using Scala and Spark SQL/Streaming to speed up data processing.
Created a new Spark REPL application to manage comparable datasets.
Used Hadoop scripts to manipulate and load data from the Hadoop File System (HDFS).
Ran Hive test queries on HDFS and local sample files.
Split streaming data into batches and fed them into the Spark engine for batch processing using Spark Streaming.
Used PySpark and Spark to implement data quality checks, data transformation, and data validation processes (a minimal sketch follows this list).
Developed and implemented Data Fusion and Dataflow pipelines using Beam programming in Java, ensuring efficient data processing and transformation.
Created custom templates and streaming pipelines to meet specific project requirements, resulting in improved data integration and reduced development time.
Identified and resolved issues in data pipelines, performed root cause analysis, and implemented effective solutions, ensuring data quality and accuracy.
Utilized Cloud Dataproc to process and analyze large-scale data sets, optimizing performance and scalability.
Leveraged GCP services such as Cloud Composer, Cloud Storage, BigQuery, pub/sub, and IAM to design and implement end-to-end data engineering solutions.
Analyzed the Hadoop cluster and other Big Data analytical tools, such as Pig, Hive, HBase, Spark, and Sqoop.
Used Sqoop to export data from HDFS to an RDBMS for business intelligence, visualization, and the generation of user reports.
Implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources and developed Spark applications using Scala and Python.
Used Scala and Java APIs to build Spark programs and to perform transformations and operations on RDDs.
Involved in transforming Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
Created an ETL process using HBase, Hive, Scala, and Spark.
Implemented several performance improvement methods, such as building partitions and buckets in HiveQL, after analyzing user request patterns.
Used Scala case classes to give each column a name.
Helped with the loading of large datasets (structured, semi-structured, and unstructured) into HDFS.
Developed Spark SQL to load tables into HDFS for select queries to be executed.
Quickly built, trained, and deployed the machine learning models using Azure ML.
Developed an analytical component using Scala, Spark, and Spark Streaming.
Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
Performed validation and verification of software across all testing phases, including Functional Testing, System Integration Testing, End-to-End Testing, Regression Testing, Sanity Testing, User Acceptance Testing, Smoke Testing, Disaster Recovery Testing, Production Acceptance Testing, and Pre-prod Testing.
Have good experience in logging defects in Jira and Azure DevOps.
Analyzed Data Profiling Results and Performed Various Transformations.
Hands-on experience creating reference tables using the Informatica Analyst and Informatica Developer tools.
Wrote Python scripts to parse JSON documents and load the data into the database.
Generated various graphical capacity planning reports using Python packages such as NumPy and Matplotlib.
Hands-on experience with Snowflake utilities, SnowSQL, Snowpipe, and Big Data modeling techniques using Python.
Built ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, and wrote SQL queries against Snowflake.
Used Python APIs to extract daily data from multiple vendors.
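A minimal sketch, for illustration, of the kind of PySpark data quality and validation checks referenced in this list; the storage paths, column names, and thresholds are assumptions, not the client's actual rules.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example; the input path, columns, and rules are placeholders.
spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("abfss://curated@storageacct.dfs.core.windows.net/transactions")

# Rule 1: required columns must not be null.
null_counts = df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c)
    for c in ("account_id", "txn_amount", "txn_date")
]).first().asDict()

# Rule 2: the primary key must be unique.
dup_count = (df.groupBy("txn_id").count()
             .filter(F.col("count") > 1)
             .count())

# Rule 3: amounts must fall inside an expected range.
out_of_range = df.filter((F.col("txn_amount") < 0) | (F.col("txn_amount") > 1_000_000)).count()

failures = {k: v for k, v in null_counts.items() if v}
if failures or dup_count or out_of_range:
    raise ValueError(f"DQ checks failed: nulls={failures}, dupes={dup_count}, out_of_range={out_of_range}")

# Only rows that pass the rules move on to the transformation step.
clean = df.dropDuplicates(["txn_id"]).filter(F.col("txn_amount").between(0, 1_000_000))
clean.write.mode("overwrite").parquet("abfss://validated@storageacct.dfs.core.windows.net/transactions")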
Tools & Environment: Azure, ADF, Azure Databricks, Snowflake, Scala, PySpark, Azure DevOps, Hadoop, Hive, Oozie, Java, Linux, Oracle 11g, MySQL, IDQ Informatica Tool 10.0, IDQ Informatica Developer Tool 9.6.1 HF3.

Client: Primero Edge - Houston, TX Apr 2018 to Mar 2020
Role: Big Data Engineer

Responsibilities:
Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Responsible for estimating the cluster size, monitoring, and troubleshooting the Spark Databricks cluster.
Using Sqoop to import and export data from Oracle and PostgreSQL into HDFS to use it for the analysis.
Migrated Existing MapReduce programs to Spark Models using Python.
Migrated data from the data lake (Hive) into an S3 bucket.
Performed data validation between the data present in the data lake and the S3 bucket.
Used the Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
Designed batch processing jobs using Apache Spark to increase speeds by ten-fold compared to that of MR jobs.
Implemented AWS EMR Spark for faster processing of data using PySpark and DataFrames.
Optimize existing data pipelines and maintain all domain-related data pipelines.
Efficiently managed project workflows and issue tracking using Azure DevOps, ensuring seamless collaboration and project management.
Used Kafka for real time data ingestion.
Created different topics for reading the data in Kafka.
Read data from different Kafka topics and ran Spark Structured Streaming jobs on an AWS EMR cluster.
Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.
Created database objects like Stored Procedures, UDFs, Triggers, Indexes, and Views using T-SQL in both OLTP and the relational data warehouse in support of ETL.
Developed complex ETL Packages using SQL Server 2008 Integration Services to load data from various sources like Oracle/SQL Server/DB2 to Staging Database and then to Data Warehouse.
Created report models from cubes as well as the relational data warehouse to create ad hoc reports and chart reports.
Migrated an existing on-premises application to AWS.
Developed Pig Latin scripts to extract the data from the web server output files and to load it into HDFS.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
Created many Spark UDFs and UDAFs in Hive for functions that did not exist in Hive and Spark SQL.
Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.
Used the Python API to develop Kafka producers and consumers for writing Avro schemas.
Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
Developed ETL workflows using Azure Databricks to extract, transform, and load data from various sources into Snowflake.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins (a minimal sketch follows this list).
Good knowledge of Spark platform parameters such as memory, cores, and executors.
Provided concurrent access to Hive tables with shared and exclusive locking using the ZooKeeper implementation in the cluster.
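For illustration, a short PySpark sketch combining the optimizations named above: a broadcast (map-side) join for a small dimension table and a partitioned, bucketed Hive table written with Spark. Database, table, and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Hypothetical example; table, column, and database names are placeholders.
spark = (SparkSession.builder
         .appName("hive-partition-bucket-demo")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("staging.orders")
small_dim = spark.table("staging.store_dim")   # small lookup table

# Map-side join: broadcast the small dimension so it needs no shuffle.
enriched = orders.join(broadcast(small_dim), "store_id")

# Write a partitioned and bucketed Hive table so downstream queries
# prune partitions by order_date and avoid shuffles on customer_id joins.
(enriched.write
 .mode("overwrite")
 .partitionBy("order_date")
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .format("parquet")
 .saveAsTable("analytics.orders_enriched"))

Bucketing requires saveAsTable, which is why the write goes to a managed Hive table rather than a plain path.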

Tools & Environment: Linux, Apache Hadoop Framework, HDFS, YARN, HIVE, HBASE, AWS (S3, EMR), Spark, SQOOP, MS SQL Server 2014, Teradata, ETL, Tableau (Desktop 9.x/Server 9.x), Python 3.x(Scikit-Learn/Scipy/Numpy/Pandas), AWS Redshift, Spark (Pyspark, MLlib, Spark SQL).


Client: Lucky Truck - Houston, TX Sep 2015 to Apr 2018
Role: Hadoop Big Data Developer

Responsibilities:
Worked on creating the transformation of data from the customer's relational database to the data warehouse and assisted the client, a utility business, in gathering reporting requirements.
Implemented data ingestion and processing solutions utilizing Hadoop, Map Reduce Frameworks, HBase, and Hive for Data-at-Rest processing.
Efficiently transferred data between databases and HDFS using Sqoop and streamed log data using Flume.
Managed full SDLC of AWS Hadoop cluster based on client's business needs.
Loaded and transformed structured, semi-structured, and unstructured data into HDFS from relational databases using Sqoop imports.
Responsible for importing log files into HDFS using Flume from various sources.
Utilized HiveQL for data analysis, generating payer reports, and transmitting payment summaries.
Imported large sets of structured data using Sqoop import, processed with Spark, and stored in HDFS in CSV format.
Designed and developed Kafka and Storm-based data pipelines, accommodating high throughput.
Led the architecture, development, and engineering of Big Data solutions.
Proficient in installing, deploying, configuring, and monitoring Hadoop clusters.
Employed Data Frame API in Python for distributed data analysis with named columns.
Conducted data profiling and transformation using Hive, Python, and Java.
Ensured smooth operation of Hadoop clusters in production.
Developed predictive analytics using Apache Spark and Python APIs.
Utilized Hive and User Defined Functions (UDF) for big data analysis.
Created Hive external tables, loaded data, and performed queries using HQL (a minimal sketch follows this list).
Implemented Spark Graph application for guest behavior analysis.
Enhanced traditional data warehouse, updated data models, and performed data analytics using Tableau.
Migrated data from RDBMS to Hadoop using Sqoop.
Automated Hive scripts using Shell, Perl, and Python scripts for control flow.
Prototyped Big Data analysis using Spark, RDD, Data Frames, and various file formats.
Developed Hive SQL scripts for transformation and loading data across zones.
Created workflow and Coordinator jobs for Hive jobs using Oozie scheduler.
Orchestrated Sqoop and Hive jobs with Oozie for timely data extraction.
Exported results to Tableau for testing, connecting to Hive tables using Hive ODBC connector.
Utilized Sqoop to export data from HDFS to RDBMS for BI, visualization, and reporting.
Managed and led development efforts across diverse internal and overseas teams.
Created data pipelines for extracting and converting data into products that help enterprises achieve their objectives.
Developed in Hadoop and produced fast web services for data tracking.
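A minimal sketch, for illustration, of creating and querying a Hive external table through Spark SQL, in the spirit of the HiveQL payer reporting described above; the database, columns, partition value, and HDFS location are hypothetical.

from pyspark.sql import SparkSession

# Hypothetical example; database, table, and HDFS locations are placeholders.
spark = (SparkSession.builder
         .appName("hive-external-table-demo")
         .enableHiveSupport()
         .getOrCreate())

# External table over raw payment files already landed in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS billing.payments_raw (
        payer_id   STRING,
        amount     DOUBLE,
        paid_at    TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/raw/payments'
""")

# Register a newly landed partition, then run an HQL payer summary.
spark.sql("ALTER TABLE billing.payments_raw ADD IF NOT EXISTS PARTITION (load_date='2017-06-01')")

summary = spark.sql("""
    SELECT payer_id, SUM(amount) AS total_paid
    FROM billing.payments_raw
    WHERE load_date = '2017-06-01'
    GROUP BY payer_id
""")
summary.show()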

Tools & Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Java 8, Eclipse, Oracle 10g, PL/SQL, MongoDB, Toad

Client: West Agile Labs - Hyderabad, India Nov 2012 to Jan 2015
Role: Hadoop Admin Developer

Responsibilities:

Led end-to-end setup of Hadoop cluster, encompassing installation, configuration, and monitoring.
Automated Hadoop cluster setup and implemented Kerberos security with Hortonworks.
Managed cluster maintenance, including data node commissioning and decommissioning, monitoring, troubleshooting, backups, and log file reviews.
Oversaw system and service monitoring, Hadoop deployment architecture, configuration management, backup, and disaster recovery.
Installed and configured various Hadoop Ecosystem components and daemons.
Set up and configured Hive, HBase, and Sqoop on the Hadoop cluster.
Managed property files like core-site.xml, hdfs-site.xml, mapred-site.xml based on job requirements.
Orchestrated data loading from UNIX file systems to HDFS, and data import/export using Sqoop.
Implemented ETL pipelines with Apache Hive for data extraction and ingestion into Hadoop Data Lake.
Utilized Hadoop log files for administration and troubleshooting.
Extracted meaningful data from various sources and generated Python Pandas reports for analysis (see the sketch after this list).
Developed Python code with version control tools like GitHub and SVN on vagrant machines.
Conducted data analysis, feature selection, and feature extraction using Apache Spark's streaming libraries in Python.
Analyzed system failures, identified root causes, and documented processes and procedures for future reference.
Collaborated with systems engineering to plan and deploy new Hadoop environments and expand existing clusters.
Installed and configured Kerberos for user authentication and Hadoop daemons.
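For illustration, a small Pandas sketch of the kind of report generation mentioned above, summarizing exported HDFS usage metrics for cluster administration; the input file and its columns are assumptions.

import pandas as pd

# Hypothetical example; the exported metrics file and its columns are placeholders.
# Daily HDFS usage figures exported from the cluster (e.g. via a scheduled shell job).
usage = pd.read_csv("hdfs_usage_daily.csv", parse_dates=["date"])
# Columns assumed: date, datanode, used_gb, capacity_gb.

# Per-node utilization and a cluster-level daily roll-up for the admin report.
usage["pct_used"] = 100 * usage["used_gb"] / usage["capacity_gb"]
daily = (usage.groupby("date", as_index=False)
         .agg(used_gb=("used_gb", "sum"),
              capacity_gb=("capacity_gb", "sum")))
daily["pct_used"] = 100 * daily["used_gb"] / daily["capacity_gb"]

# Flag nodes running hot so commissioning/decommissioning can be planned.
hot_nodes = usage[usage["pct_used"] > 80].sort_values("pct_used", ascending=False)

daily.to_csv("cluster_capacity_report.csv", index=False)
hot_nodes.to_csv("hot_datanodes.csv", index=False)
print(daily.tail())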

Tools & Environment: Hortonworks, Hadoop, HDFS, Hive, Sqoop, Flume, Storm, Unix, Cloudera Manager, Zookeeper, HBase, Java 6, Apache, SQL, ETL