Prashanth Kema - Sr. Data Engineer |
[email protected] |
Location: Dallas, Texas, USA |
Relocation: yes |
Visa: H1B |
Name: Prashanth K
Phone: +1-469-319-0089 | Email: [email protected]

SUMMARY
- Around 8 years of IT experience on the Azure cloud.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, as well as controlling and providing database access and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Expertise in MLOps methodologies, including model deployment, infrastructure management, and automation, ensuring seamless integration of data science solutions into business operations.
- Proven ability to collaborate with cross-functional teams to optimize model performance, enhance scalability, and drive business outcomes.
- Skilled in leveraging cloud technologies and CI/CD pipelines to streamline development processes and ensure continuous improvement.
- Strong background in data analysis, statistical modeling, and deep learning algorithms; seeking to leverage technical expertise and leadership skills to contribute to innovative projects in a dynamic organization.
- Develop Spark applications with PySpark and Spark-SQL in Databricks to extract, transform, and aggregate data from various file formats, and analyze the results to surface consumer usage trends (see the sketch after this summary).
- Strong knowledge of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Experience with MS SQL Server Integration Services (SSIS), T-SQL, stored procedures, and triggers.
- Strong knowledge of Big Data Hadoop and YARN architecture, including Job Tracker, Task Tracker, NameNode, DataNode, Resource/Cluster Manager, and Kafka (distributed stream processing).
- Created, monitored, and restored Azure SQL databases; migrated Microsoft SQL Server to Azure SQL Database.
- Implemented data replication and synchronization strategies using tools like AWS DMS (Database Migration Service), Google Cloud Dataflow, and Azure Data Factory to ensure data consistency across distributed environments.
- Designed and implemented scalable data pipelines on cloud-based SaaS platforms such as AWS Glue, Google Dataflow, and Azure Data Factory to ingest, process, and transform large volumes of data.
- Proficient in implementing and configuring cloud-based data management solutions, including DQLabs, to streamline data ingestion, transformation, and analysis processes.
- Proficient in data manipulation and analysis using the Python pandas library.
- Proficient in database design and development for business intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS packages, SQL Server Analysis Services (SSAS), DAX, OLAP cubes, star schema, and snowflake schema.
- Strong experience migrating from different databases to Snowflake.
- Collaborate with domain experts, engineers, and other data scientists to design, deploy, and upgrade current systems, and participate in design sessions to create data models and advise on optimal data architecture techniques.
- Experience with Snowflake multi-cluster warehouses and knowledge of Snowflake cloud technologies.
- Implemented CI/CD pipelines using tools such as Jenkins, GitLab CI/CD, and Azure DevOps to automate the testing, building, and deployment of data pipelines and workflows.
- Experience using the Snowflake cloud data warehouse and AWS S3 buckets for data integration from numerous source systems, including importing nested JSON-structured data into Snowflake tables.
- Proficient in implementing Apache Kafka clusters and configuring Kafka brokers, topics, partitions, and replication to ensure high availability, fault tolerance, and scalability of real-time data processing pipelines.
- Professional understanding of AWS Redshift.
- Experience building Snowpipe pipelines and using Snowflake Clone and Time Travel.
- Experience with different data ingestion strategies for Hadoop.
- Contribute to the creation, refinement, and maintenance of Snowflake database applications.
- Proficient in building, implementing, and improving ETL/ELT procedures for extracting data from multiple sources, and experienced in performing complex data transformations with SQL queries, Python, and ETL/ELT tools.
- Experience working with the Informatica IICS tool, using it effectively for data integration and data migration from multiple source systems into Azure SQL Data Warehouse.
- Experienced in the Informatica ETL Developer role on Data Warehouse projects, including Enterprise Data Warehouse, OLAP, and dimensional data modeling.
- Developed automated SRE solutions for capacity planning, performance optimization, and resource scaling to accommodate growing data volumes and user demands.
- Experience with Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion, and relational data ingestion.
- Worked in a combination of DevOps roles, including Azure architecture/system engineering, network operations, and data engineering.
- Experienced in developing complex PL/SQL queries to perform data retrieval, transformation, aggregation, and analysis, ensuring efficient and scalable database operations.
- Experienced in database design and modeling using ANSI SQL standards, including entity-relationship modeling.
- In-depth understanding of Snowflake database, schema, and table structures; define virtual warehouse sizes in Snowflake for various workload types.
- Worked with a cloud architect to set up the environment.
- Write SQL queries in SnowSQL and build transformation logic with Snowpipe.
- Hands-on experience with Snowflake tools (SnowSQL, Snowpipe) and Big Data modeling approaches in Python and Java.
- Built ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.
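A minimal PySpark sketch of the extract/transform/aggregate pattern described in the summary above, assuming hypothetical input paths, column names, and a simple daily usage aggregate; it is an illustration, not the production job.

    # Sketch only: paths, columns, and the "usage" dataset are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("usage-trends").getOrCreate()

    # Extract: read the same logical data from different file formats.
    json_df = spark.read.json("/data/raw/usage_json/")           # hypothetical path
    parquet_df = spark.read.parquet("/data/raw/usage_parquet/")  # hypothetical path
    usage = json_df.unionByName(parquet_df, allowMissingColumns=True)

    # Transform/aggregate with Spark SQL: daily usage per customer segment.
    usage.createOrReplaceTempView("usage")
    trends = spark.sql("""
        SELECT segment,
               to_date(event_ts)  AS event_date,
               COUNT(*)           AS events,
               SUM(duration_sec)  AS total_duration_sec
        FROM usage
        GROUP BY segment, to_date(event_ts)
    """)

    # Load: write the aggregate back out for downstream reporting.
    trends.write.mode("overwrite").parquet("/data/curated/usage_trends/")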
TECHNICAL SKILLS
Hadoop/Big Data Technologies: Hadoop 2.6.5, HDFS, MapReduce, HBase 1.4, Apache Pig, Hive 2.3, Sqoop 1.4, Apache Impala 2.1, Oozie 4.3, YARN, NiFi, Apache Flume 1.8, Kafka 1.1, Zookeeper, Flink, GCP Cloud Storage, GCP MySQL, Cloud Spanner, creating Hadoop and Spark clusters in GCP, BigQuery, OpenRefine
Cloud Platforms: Amazon AWS, EC2, S3, SaaS, Aurora, MS Azure, Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake, Data Factory, Snowflake, DQLabs, Azure Databricks, Azure Blob Storage, Azure IaaS, Azure SQL, Log Analytics Workspace, RBAC, Azure VM Scale Sets, Azure Backup, Azure Functions, Azure Synapse Analytics, Azure Stream Analytics and Event Hubs, Key Vault, Azure AD, Azure Disaster Recovery, Azure Load Balancer, Azure Network Interface, Azure Virtual Network, Azure Virtual Network Peering
Hadoop Distributions: Cloudera, Hortonworks, MapR
Deployment: Model deployment and monitoring in the cloud (MLOps)
Programming Languages: Python, SQL, PL/SQL, ANSI SQL, Shell scripting, Terraform, Spark
Databases: Oracle 12c/11g, SQL, PostgreSQL, DBT
Operating Systems: Linux, Unix, Windows 10/8/7
NoSQL Databases: HBase 1.4, Cassandra 3.11, MongoDB
Web/Application Servers: Apache Tomcat 9.0.7, WSDL
SDLC Methodologies: Agile, Waterfall
Version Control: Git, SVN, CVS
IDEs and Tools: Eclipse 4.7, NetBeans 8.2, IntelliJ, Maven
Testing: Writing manual test cases for User Acceptance Testing (UAT)
Data Visualization Tools: Tableau, Power BI, Informatica

PROFESSIONAL EXPERIENCE

Data Engineer | IBM, Dallas, TX | Jan 2022 - Present
Responsibilities:
- Analyze, create, and build modern data solutions with Azure PaaS services for data visualization; understand the present production state of the application and assess the impact of new installations on existing business processes.
- Utilized DQLabs to automate data quality checks, ensuring accuracy, consistency, and completeness of datasets.
- Integrated version control systems like Git with CI/CD pipelines to track changes to code, configurations, and infrastructure as part of the software development lifecycle.
- Designed and developed custom data connectors and integrations to ingest data from various SaaS applications, CRM systems, and third-party APIs into centralized data repositories.
- Experience handling large datasets efficiently with Python pandas DataFrames.
- Extract, transform, and load data from source systems to Azure data storage services using Azure Data Factory, T-SQL, Spark SQL, and U-SQL Azure Data Lake Analytics; data is ingested into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure Data Warehouse) and processed in Azure Databricks.
- Developed custom workflows and pipelines using DQLabs automation features to accelerate ETL (Extract, Transform, Load) processes.
- Orchestrated the deployment of data pipeline changes across development, staging, and production environments using CI/CD automation, reducing manual errors and improving deployment consistency.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from many sources, including Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
- Mentored and trained team members on DevOps principles, tools, and practices, fostering a culture of collaboration and automation.
- Integrated DQLabs with various data sources and destinations, such as databases, data lakes, and cloud storage solutions, for seamless data movement.
- Experienced in implementing data transformation pipelines using Python scripts, ensuring data quality and integrity throughout the process.
- Experienced in AWS; worked with CI/CD pipelines triggered through Jenkins.
- Skilled in designing and developing ETL workflows with Python and related frameworks like Apache Airflow, facilitating seamless data integration across heterogeneous sources and destinations.
- Worked on SnowSQL and Snowpipe; used COPY to bulk-load data and created data sharing across two Snowflake accounts (see the sketch after this section).
- Skilled in defining data models and relationships in DBT to create a structured data warehouse.
- Redesigned views in Snowflake to improve performance, and unit tested the data between Redshift and Snowflake.
- Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape, and created reports in Looker based on Snowflake connections.
- Experience working with AWS, Azure, and Google data services.
- Proficient in working with relational databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra) using Python libraries like SQLAlchemy and PyMongo for data retrieval, manipulation, and storage.
- Built and optimized data models and datasets within SaaS analytics platforms to support self-service reporting and dashboarding for business users across the organization.
- Experienced in using version control systems such as Git for managing code repositories, including DBT projects, and collaborating with team members on data engineering work.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from numerous file formats; responsible for sizing, monitoring, and debugging the Spark Databricks cluster.
- Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Proficient in monitoring Kafka clusters using tools like Kafka Manager, Confluent Control Center, and Prometheus, and optimizing cluster performance by fine-tuning configurations, adjusting resource allocations, and implementing partitioning strategies.
- Experienced in ensuring data reliability and fault tolerance in Kafka clusters through techniques such as data replication, partitioning, message retention policies, and proper backup and disaster recovery mechanisms.
- Proficient in performing data manipulation and analysis tasks using ANSI SQL, including insert, delete, update, and retrieval operations, as well as aggregation, filtering, and sorting to derive meaningful insights from large datasets; hands-on experience writing SQL scripts for automation.
- Consulted on Snowflake data platform solution architecture, design, development, and deployment aimed at promoting a data-driven culture throughout the company.
- Experience integrating Python pandas with other libraries such as NumPy, Matplotlib, and scikit-learn for comprehensive data analysis and visualization.
- Created stored procedures and views in Snowflake and loaded dimensions and facts using Talend; designed, developed, tested, implemented, and supported data warehousing ETL with Talend.
- Excellent grasp of RDBMS issues and the ability to write complex SQL and PL/SQL.
Environment: Snowflake, Redshift, DBT, DevOps, DQLabs, Python pandas, SaaS, SQL Server, AWS, Flink, Azure, Talend, Snowpipe, Azure PaaS services, CI/CD, Azure data storage services (Azure Data Factory, T-SQL, Spark SQL, U-SQL), and SnowSQL.
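A minimal sketch of the kind of bulk COPY load referenced in this role, using the snowflake-connector-python driver; the connection settings, virtual warehouse, database, stage, and table names are hypothetical placeholders.

    # Sketch only: all object names and credentials are hypothetical.
    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="LOAD_WH",    # hypothetical virtual warehouse
        database="ANALYTICS",   # hypothetical database
        schema="STAGING",       # hypothetical schema
    )

    try:
        cur = conn.cursor()
        # Bulk-load JSON files landed in an external (S3-backed) stage into a
        # table with a VARIANT column; flattening happens in downstream views.
        cur.execute("""
            COPY INTO STAGING.RAW_EVENTS
            FROM @RAW_EVENTS_STAGE
            FILE_FORMAT = (TYPE = JSON)
            ON_ERROR = 'ABORT_STATEMENT'
        """)
        print(cur.fetchall())  # per-file load results returned by COPY
    finally:
        conn.close()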
Data Engineer | Kroger, Cincinnati, Ohio | May 2020 - Jan 2022
Responsibilities:
- Design and implement database solutions using Azure SQL Data Warehouse and Azure SQL.
- Involved in requirements gathering, analysis, design, development, change management, and deployment.
- Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
- Used Apache Spark with Python to develop and run Big Data analytics and machine learning applications, and executed machine learning use cases under Spark ML and MLlib.
- Extracted data from heterogeneous sources and applied complex business logic to normalize raw network data so BI teams could detect anomalies.
- Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize raw data; developed a common Flink module for serializing and deserializing Avro data by applying a schema.
- Developed a Spark Streaming pipeline to micro-batch real-time data, detect anomalies by applying business logic, and write the anomalies to an HBase table (see the sketch after this section).
- Implemented a layered architecture for Hadoop to modularize the design, developed framework scripts to enable rapid development, designed reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs, and standardized error handling, logging, and metadata management processes.
- Design and deploy medium- to large-scale BI solutions on Azure leveraging Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL databases).
- Collaborated with data scientists and analysts to integrate model training, evaluation, and deployment processes into CI/CD pipelines, enabling automated model deployment and inference.
- Developed and maintained documentation and training materials to educate users and stakeholders on best practices for data management, analysis, and visualization within SaaS solutions.
- Leveraged DQLabs' machine learning capabilities to automate anomaly detection and data classification tasks, improving data governance and compliance.
- Design and implement migration methods for legacy systems to Azure (lift and shift, Azure Migrate, and other third-party solutions).
- Engage with business customers to gather requirements, build visualizations, and train them on how to use self-service BI technologies.
- Implemented data masking and anonymization techniques within DQLabs to protect sensitive information and ensure regulatory compliance.
- Contributed to the evaluation and selection of new automation tools and technologies to enhance the efficiency and scalability of data engineering workflows.
- Conducted regular performance tuning and optimization of CI/CD pipelines and infrastructure to improve build and deployment times.
- Implemented data governance and access controls within SaaS platforms to ensure data security, privacy, and compliance with regulatory requirements such as GDPR and HIPAA.
- Skilled in integrating MongoDB with ETL pipelines using tools like Apache Spark, Apache Kafka, and Talend to ingest, transform, and load data from various sources into MongoDB collections for analysis and reporting.
- Experience working with multiple Hadoop distributions, including Cloudera, Hortonworks, and MapR.
- Collaborated with the cloud architect to set up the environment, working with Oracle databases, Redshift, and Snowflake.
- Design, set up, and administer Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Construct complex distributed systems that handle large amounts of data, collect metrics, build data pipelines, and perform analytics.
Environment: Snowflake, Big Data, Azure, DQLabs, CI/CD, AWS, SQL, SaaS, Power BI, NoSQL, Azure SQL, Azure SQL Data Warehouse, Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, HDInsight, Databricks, and Azure Analysis Services.
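A minimal PySpark Structured Streaming sketch of the Kafka anomaly pipeline described in this role; the broker address, topic, JSON schema, and threshold rule are hypothetical, and the HBase sink is replaced with a placeholder foreachBatch writer so the example stays self-contained.

    # Sketch only: requires the spark-sql-kafka-0-10 package on the Spark classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("network-anomalies").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("metric", StringType()),
        StructField("value", DoubleType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
           .option("subscribe", "network-metrics")             # hypothetical topic
           .load())

    events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
                 .select("e.*"))

    # Hypothetical business rule: flag readings above a fixed threshold as anomalies.
    anomalies = events.filter(F.col("value") > 1000.0)

    def write_batch(batch_df, batch_id):
        # In the real pipeline each micro-batch was written to an HBase table;
        # a Parquet path stands in here as a placeholder sink.
        batch_df.write.mode("append").parquet("/data/anomalies/")

    query = (anomalies.writeStream
             .option("checkpointLocation", "/tmp/checkpoints/network-anomalies")
             .foreachBatch(write_batch)
             .start())
    query.awaitTermination()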
Data Engineer | Kaiser Permanente, Oakland, CA | May 2018 - May 2020
Responsibilities:
- Worked on AWS Data Pipeline to set up data loads from S3 into Redshift; using AWS Redshift, extracted, converted, and loaded data from a variety of source systems and destinations.
- Implemented Apache Airflow for creating, scheduling, and monitoring data pipelines, and developed many Directed Acyclic Graphs (DAGs) to automate ETL workflows.
- Extracted, transformed, loaded, and integrated data in data warehouses, operational data stores, and master data management systems.
- Solid grasp of AWS components, including EC2 and S3.
- Performed data migrations to Azure using Azure Data Factory.
- Conducted performance tuning and optimization of data queries and reports within SaaS analytics platforms to improve query response times and enhance user experience.
- Conducted training sessions and provided documentation to educate team members on best practices for using DQLabs and other automation tools effectively, and participated in user forums and communities to stay current on their latest features.
- Responsible for data services and data transport infrastructure.
- Experienced in ETL concepts, building ETL systems, and data modeling; worked on architecting the ETL transformation layers and building Spark tasks to do the processing.
- Experience working with the Informatica IICS tool, using it effectively for data integration and data migration from multiple source systems into Azure SQL Data Warehouse.
- Experienced in the Informatica ETL Developer role on Data Warehouse projects, including Enterprise Data Warehouse, OLAP, and dimensional data modeling.
- Developed ETL programs using Informatica to implement business requirements and created shell scripts to fine-tune the ETL flow of Informatica workflows.
- Used cloud and GPU computing technologies such as AWS and Azure to automate machine learning and analytics processes.
- Experienced in data modeling and schema design in MongoDB, leveraging document-oriented data structures to design efficient and scalable database schemas that meet application requirements.
- Experience with fact/dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).
- Created PL/SQL stored procedures, functions, triggers, views, and packages.
- Designed workflows with many sessions using decision, assignment, event-wait, and event-raise tasks, and used the Informatica scheduler to schedule jobs.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL Azure Data Lake Analytics; ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Environment: Azure, AWS, Azure Data Lake, CI/CD, Azure Storage, Azure SQL, Azure DW, ETL/ELT, JIRA, AWS EC2, AWS S3, Redshift, Snowflake, T-SQL, Spark SQL, U-SQL Azure Data Lake Analytics, and Informatica.

Data Engineer | Rackspace Technology, Hyderabad, India | May 2016 - April 2018
Responsibilities:
- Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience moving data between GCP and Azure using Azure Data Factory.
- Experience building Power BI reports on Azure Analysis Services for better performance.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
- Wrote scripts in Hive SQL to create complex tables with high-performance features such as partitioning, clustering, and skewing.
- Downloaded BigQuery data into Python pandas or Spark DataFrames for advanced ETL capabilities.
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
- Created a POC for using ML models and Cloud ML for table quality analysis in the batch process.
- Knowledge of Cloud Dataflow and Apache Beam.
- Created BigQuery authorized views for row-level security and for exposing data to other teams.
- Hands-on experience with the MongoDB Connector for Apache Spark for seamless integration of MongoDB with Spark-based data processing workflows.
- Experienced in optimizing MongoDB queries and operations to improve performance; proficient in using MongoDB's explain() method and performance monitoring tools like the MongoDB Profiler to identify and address performance bottlenecks.
- Expertise in designing and deploying Hadoop clusters and various Big Data analytics tools, including Pig, Hive, Sqoop, and Apache Spark, with the Cloudera distribution.
- Hands-on experience building data pipelines in Python, PySpark, HiveQL, and Presto, and carrying out data transformation and cleansing using SQL queries, Python, and PySpark.
- Expert knowledge of Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Implemented and managed ETL solutions and automated operational processes.
- Developed stored procedures in MS SQL to fetch data from different servers using FTP and processed these files to update the tables.
Environment: BigQuery, Hive, Ad Hoc, ETL/ELT, GCP Dataproc, GCS, Cloud Functions, Azure Analysis Services, ODBC, Sqoop, Apache Spark, Cloudera, SQL, Tableau, FTP, HDFS, and Azure Data Factory.

Education: Bachelor's degree in Computer Science and Engineering from JNTU, 2016.