Vishal - Data Engineer
Email: [email protected]
Phone: +1 703-745-8917
Location: Newington, Virginia, USA
Relocation: Open
Visa: OPT EAD
PROFESSIONAL SUMMARY
- Data Engineer with around 6 years of experience in all phases of analysis, design, development, implementation, and support of data warehousing applications using Big Data technologies such as HDFS, Spark, MapReduce, and Hive, with hands-on experience on the Azure and AWS cloud platforms.
- Strong experience working with Azure Cloud and its components, including Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Stream Analytics, Logic Apps, HDInsight, Function Apps, and Azure DevOps.
- Microsoft Certified: Azure Data Engineer Associate (DP-203) and Azure Fundamentals (AZ-900).
- Hands-on experience with Amazon EC2, S3, RDS, Redshift, IAM, Step Functions, CloudWatch, SNS, SQS, Athena, Glue, Kinesis, Lambda, EMR, DynamoDB, and other AWS services.
- Solid understanding of Spark architecture, covering Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Strong experience using PySpark for data-file transformations and data quality checks.
- Proficient in Snowflake cloud technology, including in-depth knowledge of multi-cluster sizing and credit usage.
- Firm understanding of Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka, and Oozie.
- Skilled in ETL/ELT development, data modeling, data architecture design, data governance, and data security.
- Hands-on experience with SQL, Python, and other data-processing languages and tools.
- Good experience building real-time streaming pipelines using Kafka, Azure Event Hubs, and Spark Streaming.
- Worked with many on-premises data sources, including Oracle, SQL Server, MongoDB, SFTP, and SAP HANA.
- Utilized Databricks notebooks and Spark SQL for exploratory data analysis, feature engineering, and data visualization, enabling data-driven insights and decision-making.
- Strong experience with Talend, Informatica, Alteryx, SSIS, Apache Airflow, and DataStage ETL tools.
- Excellent knowledge of Slowly Changing Dimensions (SCD Types 1, 2, and 3), Change Data Capture (CDC), dimensional data modeling, star/snowflake schemas, data marts, OLTP, OLAP, fact and dimension tables, and both physical and logical data modeling (illustrative SCD sketch at the end of this summary).
- Experienced in SQL and PL/SQL programming, including stored procedures, packages, functions, triggers, views, materialized views, and indexing strategies.
- Good knowledge of building RESTful APIs using Spring Boot, Spring MVC, and Spring Data JPA against both SQL and NoSQL databases, facilitating seamless communication between frontend and backend systems.
- Experience optimizing query performance in Hive using bucketing and partitioning techniques.
- Hands-on experience writing shell scripts on Linux.
- Excellent understanding of NoSQL databases such as HBase, Cassandra, and MongoDB.
- Worked with version control tools such as Bitbucket, Git, and SVN.
- Designed and developed ETL pipelines into and out of Snowflake using SnowSQL and Snowpipe.
- Skilled in administering and optimizing Hadoop clusters, ensuring efficient data ingestion and processing workflows.
- Proposed, developed, and contributed to implementing data quality improvement processes.
- Good knowledge of Generative AI tools, including ChatGPT and Gemini.
- Used RDD transformations and Spark SQL extensively.
- Leveraged machine learning algorithms (linear regression, decision trees) for data analysis, model development, and predictive tasks.
- Experienced in requirement analysis, application development, application migration, and maintenance across the Software Development Life Cycle (SDLC) using Python and Java technologies.
- Experience implementing and supporting data lakes, data warehouses, and data applications on AWS for large enterprises.
- Good knowledge of Microsoft Fabric and experience implementing data engineering solutions with it.
- Experience with data visualization tools Power BI and Tableau.
- Hands-on experience resolving production issues in a 24/7 environment.
- Outstanding communication and interpersonal skills, quick learner with strong analytical reasoning, and able to adapt readily to new technologies and tools.
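Illustrative sketch (not project code) for the SCD bullet above: a minimal PySpark approach to SCD Type 2, assuming a hypothetical customer dimension keyed on customer_id with eff_date/end_date/is_current tracking columns.

# Minimal SCD Type 2 sketch in PySpark; dimension, keys, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

# Hypothetical current dimension and an incoming batch of changes.
dim = spark.createDataFrame(
    [(1, "VA", "2023-01-01", None, True)],
    "customer_id INT, state STRING, eff_date STRING, end_date STRING, is_current BOOLEAN",
)
incoming = spark.createDataFrame([(1, "TX"), (2, "CA")], "customer_id INT, state STRING")

today = F.current_date().cast("string")
current = dim.filter("is_current")

# Keys whose tracked attribute changed, plus brand-new keys.
changed = (current.alias("d").join(incoming.alias("i"), "customer_id")
           .filter(F.col("d.state") != F.col("i.state")).select("customer_id"))
new_keys = incoming.join(current, "customer_id", "left_anti").select("customer_id")
affected = changed.unionByName(new_keys)

# Close out the current version of changed keys, keep everything else as-is,
# and append a new current row for every changed or new key.
to_expire = current.join(changed, "customer_id", "left_semi")
expired = to_expire.withColumn("end_date", today).withColumn("is_current", F.lit(False))
untouched = dim.exceptAll(to_expire)
inserts = (incoming.join(affected, "customer_id", "left_semi")
           .withColumn("eff_date", today)
           .withColumn("end_date", F.lit(None).cast("string"))
           .withColumn("is_current", F.lit(True)))

dim_v2 = untouched.unionByName(expired).unionByName(inserts)
dim_v2.show()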
CERTIFICATIONS
- Microsoft Certified: Azure Fundamentals (AZ-900)
- Microsoft Certified: Azure Data Engineer Associate (DP-203)
- Machine Learning - Stanford University

TECHNICAL SKILLS
Programming Languages: Python (Pandas, NumPy, PySpark), SQL, Scala, Java, HiveQL
Azure Services: Azure Data Factory, Azure Databricks, Logic Apps, Function Apps, Azure Synapse Analytics, Azure Stream Analytics, Azure DevOps, Azure Event Hubs, Cosmos DB
AWS Services: EC2, S3, EMR, Redshift, CloudFormation, Aurora, VPC, Glue, Kinesis, Lambda, QuickSight, Glacier, Route 53, EKS, CloudWatch, CloudFront
GCP Services: Compute Engine (GCE), Cloud Storage, BigQuery, Cloud SQL, Cloud Functions, Cloud Dataflow, Cloud Bigtable, Google Kubernetes Engine, BigQuery BI Engine
Hadoop Ecosystem: HDFS, SQL, YARN, Pig Latin, MapReduce, Hive, Sqoop, Spark, Storm, Zookeeper, Oozie, Kafka, Flume
SQL Databases: Oracle DB, Microsoft SQL Server, IBM DB2, PostgreSQL, Teradata, Azure SQL Database, Amazon RDS, GCP Cloud SQL, GCP Cloud Spanner, Snowflake
NoSQL Databases: MongoDB, Cassandra, Amazon DynamoDB, HBase
Development Methods: Agile/Scrum, Waterfall
IDEs: PyCharm, IntelliJ, Visual Studio, Eclipse
Data Visualization: Power BI, BO Reports, Splunk, Tableau
Operating Systems: Linux, Windows, UNIX
Build Tools: Jenkins, Maven, SQL Loader, Oozie
Containerization: Docker & Docker Hub, Kubernetes, OpenShift
ETL Tools: Airflow, Talend, Informatica, SSIS, Airbyte, Alteryx, StreamSets

PROFESSIONAL EXPERIENCE

Freddie Mac, McLean, VA | Feb 2023 - Present
Role: Sr. Azure Data Engineer
Description: Freddie Mac is a financial services company that provides liquidity, stability, and affordability to the U.S. housing market by supplying funds to lenders so they can issue new mortgage loans. This project built a comprehensive customer intelligence platform for the client to unlock deep customer understanding. By leveraging big data tools and machine learning techniques, we gained valuable insights into customer behavior, preferences, and lifetime value.
Responsibilities:
- Implemented large Lambda architectures using Azure services such as Azure Data Lake, Azure Data Factory, HDInsight, Azure Monitoring, Key Vault, Event Hubs, and Azure SQL Server.
- Involved in all SDLC phases: development, deployment, testing, documentation, implementation, and maintenance of application software.
- Developed and managed end-to-end ETL data pipelines in Azure Data Factory and Alteryx, handling large datasets.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats to analyze and transform the data.
- Developed machine learning models using Python and scikit-learn for predictions.
- Developed and deployed a real-time anomaly detection system using Azure Machine Learning and Azure Stream Analytics.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS using Scala and Python (see the streaming sketch at the end of this section).
- Utilized Azure Functions and other serverless solutions for real-time data-processing tasks.
- Designed and implemented a solution for processing Near Real-Time (NRT) data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues.
- Performed performance tuning on existing processes.
- Utilized Terraform for infrastructure provisioning and automation, enabling seamless deployment and management of cloud resources.
- Designed and developed SSIS packages to pull data from sources such as Excel, flat files, SQL Server, and DB2 into destinations such as Azure Synapse Analytics and Azure SQL Database.
- Transported and processed real-time stream data using Kafka.
- Built interactive dashboards in Tableau and other BI tools.
- Developed PL/SQL stored procedures, functions, triggers, and packages to streamline application logic, and optimized query performance through indexing, aggregation, and materialized views.
- Actively participated in migrating legacy servers to Snowflake.
- Leveraged Azure DevOps for continuous integration and deployment (CI/CD) of data pipelines and applications, streamlining development and deployment processes.
- Collaborated with developers and testers in an Agile Scrum framework to communicate, refine, and validate requirements and solutions.
Environment: Azure Databricks, Azure Data Factory, Spark Streaming, Terraform, Azure DevOps, YAML, Spark, Hive, SQL, Python, Scala, PySpark, Git, Kafka, Snowflake, SSIS, Alteryx, Power BI, Azure Synapse Analytics, SQL Server, Blob Storage, Data Lake, Azure Monitoring, Azure Functions, Tableau.
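Illustrative sketch for the Spark Streaming bullet above: a minimal PySpark Structured Streaming job that reads JSON events from Kafka and appends Parquet files to HDFS. The broker, topic, schema, and paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

# Minimal Kafka-to-HDFS streaming sketch in PySpark; names and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_to_hdfs_sketch").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("customer_id", StringType())
          .add("amount", DoubleType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "customer-events")             # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Parse the Kafka value payload as JSON and tag each record with an ingestion timestamp.
events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*")
             .withColumn("ingest_ts", F.current_timestamp()))

query = (events.writeStream.format("parquet")
         .option("path", "hdfs:///data/customer_events")            # placeholder sink path
         .option("checkpointLocation", "hdfs:///chk/customer_events")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()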
Tredence Analytics (Kimberly-Clark), Irving, TX | Nov 2020 - Jul 2022
Role: Azure Data Engineer
Project: Revenue Growth Management (RGM)
Description: Kimberly-Clark, a leading global manufacturer of essential consumer products, faced a challenging economic landscape created by the COVID-19 pandemic. To navigate it, CPG companies need to optimize their revenue growth management (RGM) strategies: the discipline of driving sustainable, profitable growth from the consumer base through strategies around assortment, promotions, trade management, and pricing. RGM uses real-time data and analytics to improve decision-making and unlock revenue potential. Kimberly-Clark leveraged Snowflake to execute an end-to-end RGM strategy, from setting up a unified data model to driving optimized action alongside retail partners.
Responsibilities:
- Led cross-functional development and deployment of a data analytics strategy for the client, integrating cloud technologies and reducing data ingestion time by 70%.
- Designed end-to-end scalable architectures to solve business problems using Azure components such as HDInsight, Data Factory, Data Lake, Azure DevOps, Databricks, Logic Apps, and Event Hubs.
- Aggregated data from diverse sources, including SFTP, SAP HANA, Blob Storage, and SQL Server, to construct streamlined ETL pipelines in Azure Data Factory.
- Automated the entire data pipeline with triggers, enabling scheduled data updates and monitoring for automation failures via alerts, resulting in an 85% reduction in operational workload.
- Employed Azure Databricks to perform advanced data analytics, including complex data transformations, aggregations, and statistical analysis using PySpark.
- Implemented data quality checks and monitoring processes to ensure data integrity throughout the pipeline (see the data quality sketch at the end of this section).
- Spearheaded a proof of concept (POC) in Databricks using PySpark to design a solution for ingesting files of more than 100 GB in under 15 minutes using parallel processing.
- Streamlined Snowflake schemas by employing stored procedures and queries to develop and maintain a Unified Data Model, resulting in a 50% reduction in execution time.
- Designed a stored procedure using SQL and JavaScript in Snowflake to perform Change Data Capture (CDC) for automating daily jobs, saving the team roughly 30 hours of manual effort.
- Developed an automated data pipeline using Azure Logic Apps that triggers on the arrival of new files on an SFTP server and sends error notifications via email when pipeline executions fail.
- Analyzed the functional mappings and candlestick charts of goods for different markets.
- Developed Power BI dashboards covering data from three market regions: APAC, EMEA, and LATAM.
- Collaborated with senior management to deliver optimal solutions to stakeholders and implemented enhanced data quality frameworks.
- Performed unit testing and prepared technical design documents for the pipelines.
- Deployed pipelines and related artifacts from lower environments to production using Azure DevOps.
- Worked on the Harmonization team, creating stored procedures that push staging tables into the final UDM table for reporting after mapping each attribute.
- Assisted with column mappings in Collibra for data governance.
- Worked on an internal Snowflake project to analyze metrics such as account usage, query execution time, average elapsed time, and query credit usage.
- Worked with RDDs and DataFrames (Spark SQL) using PySpark to analyze and process data.
- Implemented auditing and error logging in all stored procedures.
- Involved in end-to-end testing and helped fix SIT and UAT defects.
- Received a "Pat on the Back" recognition for problem-solving skills and timely project delivery.
Environment: Azure Data Factory, Azure Databricks, Logic Apps, ADLS, Snowflake, Power BI, SQL Server, Blob Storage, SAP HANA, SFTP server, Nielsen data, Azure DevOps, Python, SQL, PySpark, Excel, HDInsight, Key Vault, Event Hubs, Collibra, Data Pipeline, Terraform.
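Illustrative sketch for the data quality bullet above: a minimal rule-based PySpark check (nulls, negative values, duplicates) that fails the run when any rule is violated. The table, columns, and rules are hypothetical placeholders.

# Minimal PySpark data quality check sketch; data and rules are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_sketch").getOrCreate()

df = spark.createDataFrame(
    [("SKU1", "APAC", 10.0), ("SKU2", None, -5.0), ("SKU1", "APAC", 10.0)],
    "sku STRING, region STRING, net_sales DOUBLE",
)

total = df.count()
checks = {
    "null_region": df.filter(F.col("region").isNull()).count(),
    "negative_sales": df.filter(F.col("net_sales") < 0).count(),
    "duplicate_rows": total - df.dropDuplicates().count(),
}

# Fail the pipeline run (or raise an alert) when any rule is violated.
failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    raise ValueError(f"Data quality checks failed on {total} rows: {failed}")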
Smart IMS, Hyderabad, India | Mar 2019 - Nov 2020
Role: AWS Data Engineer
Project: Relevance 360
Description: This project modernized Elevance Health's data platform on Amazon Web Services (AWS) to enhance scalability, efficiency, and real-time data-processing capabilities. By leveraging advanced AWS services, the project enabled improved data-driven decision making for Elevance Health.
Responsibilities:
- Crafted highly scalable and resilient cloud architectures that address customer business problems and accelerate the adoption of AWS services.
- Built application and database servers using AWS EC2, created AMIs, and used RDS for PostgreSQL.
- Designed project workflows and pipelines using Jenkins as a CI tool.
- Used Terraform to express infrastructure as code when building EC2, Lambda, RDS, and EMR resources.
- Implemented a PySpark data quality framework, ensuring data integrity through schema validation and profiling for downstream analytics.
- Implemented Snowpipe for continuous ingestion of structured and semi-structured data from files and web interfaces into Snowflake.
- Performed ETL processing tasks, including data extraction, transformation, mapping, and loading, to ensure clean, analysis-ready data.
- Leveraged Pandas for time-series data manipulation and retrieval; converted timestamp data into both time-series and tabular formats, facilitating efficient data exploration and analysis.
- Leveraged Spark on EMR to scale up enterprise data processing across our AWS data lake.
- Managed EC2 resources for long-running Spark jobs, configuring optimal parallelism and executor memory to enhance data caching.
- Established analytical warehouses in Snowflake, enabling efficient data exploration by querying staged files with reference to metadata columns.
- Used Spark Structured Streaming to consume real-time data, build feature calculations from sources such as the data lake and Snowflake, and produce the results back to Kafka.
- Worked extensively with Avro and Parquet files and converted data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
- Automated CSV data loading into S3 buckets using a Python script; managed bucket creation, folder structure, logs, and object lifecycle within S3 (see the S3 upload sketch at the end of this section).
- Defined Hive DDL on Parquet and Avro data files residing in HDFS and S3 buckets.
- Extensive experience with AWS S3 bucket management and data transfer to and from HDFS.
- Loaded data from sources such as AWS S3 and local file systems into Spark RDDs.
Environment: AWS, Python, Scala, EMR, Spark Streaming, Databricks, data lake, Kafka, RDS, Spark, Linux, shell scripting, GitHub, Jira, Jenkins.
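Illustrative sketch for the S3 automation bullet above: a minimal boto3 script that uploads every CSV in a local folder to an S3 bucket. The bucket name, prefix, and local path are hypothetical placeholders.

# Minimal CSV-to-S3 upload sketch with boto3; bucket and paths are placeholders.
import logging
from pathlib import Path

import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("csv_to_s3")

def upload_csv_folder(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload every CSV under local_dir to s3://bucket/prefix/, preserving file names."""
    s3 = boto3.client("s3")
    for path in Path(local_dir).glob("*.csv"):
        key = f"{prefix}/{path.name}"
        try:
            s3.upload_file(str(path), bucket, key)
            log.info("Uploaded %s to s3://%s/%s", path, bucket, key)
        except ClientError as err:
            log.error("Failed to upload %s: %s", path, err)

if __name__ == "__main__":
    upload_csv_folder("/data/exports", "relevance360-landing-zone", "daily/csv")  # placeholders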
Eidiko Systems Integrators, Hyderabad, India | Jun 2018 - Feb 2019
Role: Data Scientist
Project: Hand Reco
Description: This project aimed to develop a computer vision system capable of detecting and recognizing American Sign Language (ASL) gestures from video input. The system was trained on a dataset of labeled ASL videos and uses deep learning models (TensorFlow/Keras) to identify specific signs and potentially translate them into text and spoken language.
Responsibilities:
- Developed a Python application for real-time ASL detection using libraries such as OpenCV and MediaPipe (see the hand-landmark sketch at the end of this section).
- Leveraged publicly available image datasets to gather labeled images for the chosen classification task.
- Implemented video pre-processing techniques using OpenCV to ensure consistency in the data, including frame resizing, normalization, and background subtraction for illumination invariance.
- Implemented a Python script to capture ASL gestures from a webcam and translate them into text.
- Utilized computer vision techniques for hand detection and pose estimation, extracting keypoint features such as fingertips and palm center to capture the spatial configuration of the hand for sign recognition.
- Investigated deep learning architectures, such as Convolutional Neural Networks (CNNs) implemented in TensorFlow and PyTorch, to identify the most suitable model for ASL gesture classification.
- Employed evaluation metrics such as accuracy, precision, recall, and F1-score to assess the performance of the trained model on a held-out test set, and analyzed confusion matrices with Seaborn to identify signs with higher misclassification rates.
Environment: Visual Studio Code, Python (NumPy, TensorFlow, OpenCV, Keras, MediaPipe, math).
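Illustrative sketch for the real-time ASL detection bullet above: a minimal OpenCV + MediaPipe loop that captures webcam frames, extracts the 21 hand landmarks, and draws them. Feeding the landmarks to a trained classifier is left as a comment; it is a simplified illustration, not the project's code.

# Minimal webcam hand-landmark sketch with OpenCV and MediaPipe.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures BGR frames.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = hands.process(rgb)
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                mp_draw.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
                # The 21 (x, y, z) landmarks could be flattened into a feature
                # vector and passed to a trained sign classifier here.
        cv2.imshow("hand landmarks", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()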
EDUCATION DETAILS
Master's degree in Data Science.
Bachelor's degree in Computer Science & Engineering.