Uzair Mohammed - Sr. Data Engineer |
[email protected] |
Location: Austin, Texas, USA |
Relocation: yes |
Visa: GC |
Uzair Mohammed
Sr. Azure Data Engineer | 8+ Years

Professional Summary
- Overall, 9+ years of experience with emphasis on analytics, design, development, implementation, testing, and deployment of software applications.
- Good experience in Big Data and the Hadoop ecosystem.
- Experience in processing large sets of structured, semi-structured, and unstructured data and supporting Big Data applications.
- Hands-on experience with Hadoop ecosystem components such as HDFS (storage), Sqoop, Pig, Hive, HBase, Oozie, Zookeeper, and Spark for data storage and analysis.
- Experience in transferring data between RDBMSs such as MySQL, Oracle, Teradata, and DB2 and Hadoop using Sqoop.
- Worked on NiFi to automate data movement between different Hadoop systems.
- Experience in writing Hive scripts and extending their functionality using User-Defined Functions (UDFs).
- Worked directly with stakeholders to understand their various business needs.
- Experience in organizing data layouts using partitioning and bucketing in Hive.
- Experience and knowledge in NoSQL databases such as MongoDB, HBase, and Cassandra.
- Performance tuning in Hive and Impala using methods including, but not limited to, dynamic partitioning, bucketing, indexing, file compression, and cost-based optimization.
- Hands-on experience handling different file formats such as JSON, Avro, ORC, and Parquet (see the sketch following this summary).
- Hands-on experience with Spark using SQL, Python, and Scala.
- Knowledge of DevOps tools and techniques such as Jenkins and Docker.
- Evaluated Hortonworks NiFi (HDF 2.0) and recommended a solution to ingest data from multiple data sources into HDFS and Hive using NiFi.
- Hands-on experience with Spark architecture and its integrations, including Spark SQL and the DataFrame and Dataset APIs.
- Hands-on experience with Amazon Web Services (AWS) cloud services such as EC2, EMR, Redshift, S3, and RDS.
- Good experience in data visualization tools such as Tableau.
- Experience with Tableau administration tasks and Tableau Server maintenance, backups, cleanups, and licensing.
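A minimal, illustrative PySpark sketch of the file-format handling and partitioned writes referenced in the summary; the paths and column names are hypothetical.

    # Illustrative sketch only; paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("file-format-demo").enableHiveSupport().getOrCreate()

    # Read semi-structured JSON landed in a raw zone
    raw_df = spark.read.json("/data/raw/events/")

    # Light cleanup and a derived partition column
    events_df = (
        raw_df
        .dropDuplicates(["event_id"])                      # hypothetical key column
        .withColumn("event_date", F.to_date("event_ts"))   # hypothetical timestamp column
    )

    # Persist as partitioned Parquet for efficient downstream reads
    (
        events_df.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("/data/curated/events_parquet/")
    )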
Professional Experience

Azure Data Engineer | Cisco, Austin, TX | November 2021 to Present
Responsibilities:
- Worked directly with stakeholders to gather requirements and automated monthly and daily reports and pipelines using PySpark on Azure Databricks.
- Created various workflows in Informatica to extract data from multiple sources and load it into targets for Tableau dashboards.
- Developed and maintained ETL solutions using Informatica PowerCenter, ensuring smooth data extraction, transformation, and loading processes for various projects.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Expertise in data manipulation and analysis using pandas.
- Implemented Delta Lake on Databricks to ensure data reliability, ACID compliance, and straightforward version control for large-scale data transformations (see the sketch after this role).
- Developed end-to-end data integration solutions using Azure Data Factory, enabling seamless data movement and transformation across various data sources and destinations.
- Developed custom dbt macros and Jinja templates to address complex data transformation requirements and improve the efficiency of data modeling.
- Employed Apache Airflow to manage the scheduling and execution of data processing tasks across distributed clusters, optimizing resource utilization.
- Used Jenkins for CI/CD and configured Jenkins webhooks to automatically trigger jobs for new merges.
- Implemented end-to-end data solutions on Azure, integrating with Red Hat technologies such as OpenShift and Ansible for containerization and orchestration.
- Migrated data from on-premises SQL Server to cloud databases such as Azure Synapse Analytics (DW) and Azure SQL DB.
- Created tabular models on Azure Analysis Services to meet business reporting requirements.
- Worked with Azure Blob and Data Lake Storage, loading data into Azure Synapse Analytics (DW).
- Integrated Secrets Manager with applications to securely retrieve secrets without hardcoding them in code.
- Integrated ADLS as a data storage solution within Azure, implementing data lakes and leveraging its capabilities for big data processing and analytics.
- Strong scripting skills in Unix shell languages (bash) for automation of routine tasks and system processes.
- Implemented a disaster recovery strategy for the Kubernetes cluster using Velero and Restic.
- Performed data quality analyses and applied business rules throughout the data extraction and loading process.
- Experience with container runtimes such as Docker.
- Utilized Apache NiFi as a data quality tool and framework for data validation and quality assurance (QA).
- Logged defects in Jira for issue tracking and resolution.
- Developed a tool to audit Secrets Manager usage and identify potential security risks.
- Proactively identified opportunities for process improvement and implemented data-driven solutions.
- Utilized AWS Server Migration Service (SMS) to automate the migration process, minimizing downtime and ensuring data integrity.
- Developed and implemented migration plans, including detailed documentation and risk mitigation strategies, to ensure a smooth and successful migration.
- Experience with containerization platforms such as Docker, with ECR for image repositories.
- Set up monitoring systems to track data quality metrics and generate alerts when data discrepancies exceeded predefined thresholds.
- Migrated Tableau visuals and reports to Power BI, ensuring a seamless transition and maintaining data integrity.
- Utilized Power BI capabilities to enhance visualization and reporting functionality.
- Enhanced application security by implementing AWS security groups, IAM roles, and network access control lists (ACLs).
- Conducted in-depth data analysis and data discovery to identify patterns, trends, and insights.
- Part of the ETL development team whose main goal is to supply good data to business users.
Environment: Azure, Databricks, Data Lake, MySQL, Azure SQL, MongoDB, Teradata, Azure AD, Git, Blob Storage, DBX, Data Factory, Python, Scala, Hadoop (HDFS, MapReduce, YARN), Spark, PySpark, Kubernetes, Ansible, Airflow, Hive, Docker, Sqoop, HBase, Oozie, Tableau, Power BI.
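A minimal, illustrative sketch of the Delta Lake upsert pattern used on Databricks in this role; the table and column names are hypothetical.

    # Illustrative sketch only; table and column names are hypothetical.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-upsert-demo").getOrCreate()

    # Incremental batch of source records (hypothetical staging table)
    updates_df = spark.table("staging.customer_updates")

    # Target Delta table with ACID guarantees
    target = DeltaTable.forName(spark, "curated.customers")

    # Upsert: update matching rows, insert new ones
    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    # Time travel (on Databricks) makes it easy to inspect or roll back prior versions
    previous = spark.sql("SELECT * FROM curated.customers VERSION AS OF 0")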
Sr. Data Engineer | PPL (Public Partnership Limited), Boston, MA | December 2019 to November 2021
Responsibilities:
- Part of the Data Engineering team that provides data to different data science teams.
- Worked with stakeholders directly for requirements gathering and automated the execution of monthly and daily reports and pipelines using AWS Glue and Spark.
- Created ETL pipelines using AWS Glue and Apache Spark.
- Utilized AWS Data Pipeline and AWS Glue with Jenkins to automate jobs and pipelines.
- Managed CI/CD pipelines via Jenkins, enabling automation, speeding up development and testing, and boosting quality.
- Conducted performance tuning of dbt models and SQL queries, optimizing pipeline efficiency and reducing query execution times.
- Developed PySpark and Scala code for Athena jobs to perform complex data transformations and analysis.
- Developed SQL scripts in the Amazon Redshift data warehouse for business analysis and reporting.
- Designed and managed AWS CloudFormation templates to automate infrastructure provisioning and orchestration.
- Developed a centralized secret management system using AWS Secrets Manager, replacing manual processes and reducing the risk of credential exposure.
- Built Athena views and procedures to provide users with easy access to data.
- Integrated Secrets Manager with a data pipeline to automate secret rotation and improve security compliance.
- Experience in using NumPy to optimize code performance.
- Worked on dbt to connect to Amazon Redshift and create models for data transformation.
- Experience in using pandas to work with large and complex datasets.
- Converted SQL Server stored procedures to Amazon Redshift PostgreSQL and integrated them into a Python pandas framework.
- Leveraged AWS CloudWatch for monitoring and maintaining the health and performance of AWS resources, ensuring system reliability.
- Optimized query performance and reduced query run time by implementing Amazon Redshift's query tuning features and optimizing SQL queries.
- Implemented security and access control measures using Amazon Redshift's role-based access control features to ensure data privacy and compliance.
- Designed and maintained data storage solutions using AWS DynamoDB, optimizing data access and retrieval for efficient data processing.
- Developed mappings, sessions, and workflows using Informatica to extract, validate, and transform data according to business rules.
- Created serverless functions using AWS Lambda to execute data processing tasks, improving system scalability and reducing operational overhead.
- Developed error-handling mechanisms within Airflow DAGs to identify and recover from failures, minimizing data pipeline downtime (see the sketch after this role).
- Created and managed secrets for database credentials, API keys, and other sensitive information.
- Implemented test automation frameworks using Python to streamline the testing process, reducing manual effort and enhancing overall system stability.
- Worked in hybrid cloud environments, integrating on-premises data platforms with AWS and ensuring seamless data flow and consistency across environments.
- Utilized AWS Data Pipeline to configure data loads from S3 into Amazon Redshift.
- Communicated data analytics findings for Digital Manufacturing, Audit Data, Distribution Centers analysis, etc.
- Utilized Unix and Linux environments for data processing and system integration tasks.
- Good understanding of data ingestion; used Airflow operators for data orchestration along with related Python libraries.
- Worked on Python API calls and landed data in S3 from external sources.
- Analyzed machine data and created visualizations in Excel to present findings as a narrative for business users and product owners.
- Used Tableau and Excel for visualization charts and regularly communicated findings to product owners.
- Proficient in Tivoli Workload Scheduler (TWS) for managing job scheduling and automating ETL workflows, minimizing manual intervention and improving operational efficiency.
- Worked on Tableau visualization charts and daily status dashboards.
- Worked in an Agile environment, participated in design reviews and end-to-end UATs, and assisted QA in automating test cases.
Environment: AWS, Glue, Amazon S3, Redshift, DynamoDB, Spark SQL, Spark DataFrame API, Airflow, Athena, EMR, EC2, Tableau, Jenkins, Git.
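A minimal, illustrative Airflow sketch of the retry and failure-handling pattern applied to the pipelines in this role; the DAG ID, task names, and callables are hypothetical.

    # Illustrative sketch only; DAG ID, task names, and callables are hypothetical.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def notify_on_failure(context):
        # Placeholder: in practice this could page on-call or post to a chat channel.
        print(f"Task {context['task_instance'].task_id} failed; investigate upstream data.")


    def load_to_redshift(**_):
        # Placeholder for the actual extract/load logic.
        print("Loading daily extract into Redshift...")


    default_args = {
        "owner": "data-engineering",
        "retries": 3,                                # retry transient failures
        "retry_delay": timedelta(minutes=10),
        "on_failure_callback": notify_on_failure,    # alert after retries are exhausted
    }

    with DAG(
        dag_id="daily_redshift_load",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)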
Sr. Data Engineer | Bank of America, Maryland | January 2018 to December 2019
Responsibilities:
- Evaluated the suitability of Hadoop and its ecosystem for the project by implementing and validating various proof-of-concept (POC) applications, which were eventually adopted as part of the Big Data initiative.
- Estimated the software and hardware requirements for the NameNode and DataNodes in the cluster.
- Migrated existing databases from on-premises systems to AWS Redshift using various AWS services.
- Developed PySpark code for AWS Glue jobs and for EMR.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Hands-on experience with Scala multi-threaded applications; implemented design patterns in Scala and used the Akka concurrency approach for processing PDL files.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Developed scripts for AWS orchestration.
- Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch (see the sketch after this role).
- Used IAM to create new accounts, roles, groups, and policies, and developed critical modules such as generating Amazon Resource Names and integration points with S3, DynamoDB, RDS, Lambda, and SQS queues.
- Created a microservice environment on the cloud by deploying services as Docker containers.
- Implemented Amazon API Gateway to manage and serve as the entry point for all APIs.
- Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Knowledge of setting up a Python REST API framework using Django and experience working with Python ORM libraries, including Django ORM and SQLAlchemy.
- Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
- Developed ETL jobs using Spark/Scala to migrate data from Oracle to new Cassandra tables.
- Used Spark/Scala (RDDs, DataFrames, Spark SQL) and the Spark-Cassandra Connector APIs for tasks such as data migration and business report generation.
- Created partitions and buckets based on state to enable bucket-based Hive joins for further processing.
- Created and configured a Spark cluster and several Big Data analytical tools, such as Spark, Kafka streaming, AWS, and HBase, using the Cloudera distribution.
- Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
- Built Docker images using Dockerfiles, using multi-stage Docker images to reduce image size.
- Created an e-mail notification service that alerts the requesting team upon job completion.
- Implemented security to meet PCI requirements, using VPC public/private subnets, security groups, NACLs, IAM roles and policies, VPN, WAF, Trusted Advisor, CloudTrail, etc., to pass penetration testing against the infrastructure.
- Provisioned a Kubernetes cluster using Rancher (Rancher Kubernetes Engine).
- Defined job workflows according to their dependencies in Oozie.
- Played a key role in productionizing the application after testing by BI analysts.
- Designed, developed, and managed Power BI, Tableau, and QlikView dashboards, reports, and storytelling.
Environment: AWS, Hadoop, MapReduce, Hive, HDFS, Sqoop 1.4.4, Oozie 4.2, Scala, Java, Spark 2.3, Spark RDD, NumPy, Pandas, Linux, Zookeeper, Glue, Docker, DataFrame API, Amazon S3, Redshift, Oracle, DynamoDB, Airflow, EC2, Kubernetes, Athena, EMR, PySpark, Jenkins, Git.
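A minimal, illustrative boto3 sketch of the CloudWatch alarm approach used for monitoring Lambda functions in this role; the function name, alarm name, and SNS topic ARN are hypothetical.

    # Illustrative sketch only; the Lambda function name, alarm name, and SNS topic are hypothetical.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Alarm when the hypothetical ingestion Lambda reports any errors over a 5-minute window
    cloudwatch.put_metric_alarm(
        AlarmName="daily-ingest-lambda-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "daily-ingest"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-eng-alerts"],
    )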
Big Data Engineer | PrimeWing Technologies, Hyderabad, TS | Nov 2016 to Dec 2017
Responsibilities:
- Worked on building a Hadoop cluster in the AWS Cloud on multiple EC2 instances.
- Used Amazon Simple Storage Service (S3) for storing data and making it accessible to the Hadoop cluster.
- Migrated the needed data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.
- Worked on YAML for configuration of different rules according to the Tista methodology.
- Wrote new rules and modified existing rules using SQL and Python.
- Worked with Spark using PySpark and Spark SQL to analyze data issues and cleanse the data.
- Used Git as a repository for the application project folders, JIRA for trouble tickets, and Confluence for the knowledge base.
- Ingested transactional data from Oracle into HDFS using Sqoop.
- Developed PL/SQL programs to create and optimize complex stored procedures, triggers, and functions.
- Implemented database solutions using PL/SQL programming that adhere to performance and security standards, contributing to efficient data processing.
- Performed data profiling and quality validation using transient and staging tables in Hive; once all actions were completed, the data was loaded into the staging tables.
- Developed custom Apache Spark programs for data validation to filter out unwanted data and cleanse the data.
- Ingested traditional RDBMS data into HDFS from the existing SQL Server using Sqoop.
- Started using Apache NiFi to copy data from the local file system to HDFS.
- Integrated Apache Kafka for data ingestion.
- Involved in the exploration of new technologies such as AWS, Azure, Apache Flink, and Apache NiFi to increase business value.
Environment: Hive, Sqoop, Oracle, Cloudera, YAML, Pig, Spark, Zookeeper, SQL Server, PL/SQL, AWS.

Hadoop Developer | FlixBox Technologies, Hyderabad, TS | July 2014 to Oct 2016
Responsibilities:
- Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS through processing and analyzing the data in HDFS.
- Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
- Developed a process for batch ingestion of CSV files and Sqoop loads from different sources, generating views on the data using Hive and Impala.
- Implemented partitioning, dynamic partitions, and buckets in Hive (see the sketch below).
- Used Amazon Simple Storage Service (S3) for storing data and making it accessible to the Hadoop cluster.
- Migrated the needed data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.
- Performed performance optimizations on Hive; diagnosed and resolved performance issues.
- Involved in complete software development life cycle management.
- Coded interfaces for web services; the application was developed using Spring MVC Web Flow modules.
- Worked on UAT testing, developed test strategies and test plans, and reviewed QA test plans for appropriate test coverage.
- Performed functional, integration, system, and validation testing.
Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Zookeeper, Impala, Cloudera, Oracle, SQL Server, Flume, Oozie, Scala, Spark, Sqoop, PySpark.
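A minimal, illustrative PySpark sketch of the data-cleansing and partitioned, bucketed Hive staging-table loading pattern described in the two roles above; the paths, table name, and columns are hypothetical.

    # Illustrative sketch only; paths, table names, and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("hive-staging-load")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Batch-ingested CSV files landed by the upstream process
    txns = spark.read.option("header", True).csv("/landing/transactions/")

    # Basic validation and cleansing: drop records missing keys, remove duplicates
    clean = (
        txns
        .filter(F.col("txn_id").isNotNull() & F.col("txn_date").isNotNull())
        .dropDuplicates(["txn_id"])
    )

    # Load into a partitioned, bucketed Hive staging table for efficient downstream joins
    (
        clean.write
        .mode("overwrite")
        .partitionBy("txn_date")
        .bucketBy(8, "txn_id")
        .sortBy("txn_id")
        .saveAsTable("staging.transactions")
    )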
EDUCATION
Bachelor of Engineering, Osmania University, India | May 2014