Keerthi Pothagani - Sr. Data Engineer
Email: [email protected]
Location: Milwaukee, Wisconsin, USA
Relocation: Yes
Visa: H1B
Keerthi Pothagani
Sr. Data Engineer | Email: [email protected] | Phone: (707) 886-6158

Professional Summary:
- Over 9 years of IT experience as a Data Engineer, designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering (AWS, Azure, GCP), data visualization, data warehousing, reporting, and data quality solutions.
- Hands-on experience on Google Cloud Platform (GCP) across its big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (Airflow as a service).
- Converted PL/SQL code to BigQuery/Python pipelines as well as Azure Databricks and PySpark on Dataproc.
- Wrote AWS Lambda functions in Python that invoke scripts to perform transformations and analytics on large data sets in EMR clusters.
- Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
- Created a Lambda deployment function and configured it to receive events from an S3 bucket (see the illustrative sketch below).
- Knowledge of the Guidewire integration framework and integration methods such as web service APIs, plugins, messaging code, Guidewire XML models, and templates.
- Hands-on expertise with the data engineering stack, including Python, SQL, and R; worked with databases such as Oracle, MySQL, SQL Server, Snowflake, MongoDB, and Cassandra, and wrote ETLs.
- Working knowledge of Data Build Tool (dbt) with Snowflake; used dbt to run schema tests, referential integrity tests, and custom tests to ensure data quality.
- Extensive experience with the Hadoop ecosystem, including solid knowledge of big data technologies such as HDFS, Spark, YARN, Kafka, MapReduce, Apache Cassandra, HBase, Zookeeper, Hive, Oozie, Impala, Pig, and Flume.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Dataproc, and Stackdriver.
- Architected and implemented ETL and data movement solutions using Azure Data Factory and SSIS.
- Good knowledge of web services using gRPC and GraphQL protocols.
- Hadoop ecosystem experience in ingestion, storage, querying, processing, and analysis of big data; developed MapReduce programs using Apache Hadoop for working with big data.
- Experience working with the Azure Logic Apps integration tool; extracted, transformed, and loaded data from source systems to Azure data storage services using Azure Data Factory and HDInsight.
- Built a trigger-based mechanism using Azure Logic Apps and Functions to reduce the cost of resources such as Web Jobs and Data Factories.
- Experienced in implementing Lakehouse architecture on Azure using Azure Data Lake, Delta Lake, Delta tables, and Databricks.
- Expertise in developing applications using PowerApps (canvas and model-driven apps), Common Data Service (CDS), SQL, Forms, SharePoint Online, Dynamics 365 CRM, Azure, C#, ASP.NET, and web services.
- Experience developing PowerApps applications using CDS, SQL, Flow, Excel, and SharePoint; expertise in using DAX to generate dashboards and reports and integrating them with PowerApps.
- Hands-on experience in Azure cloud services (PaaS and IaaS): Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
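Illustrative sketch of the kind of S3-triggered Lambda handler referenced above. The bucket layout, table name, and record shape are hypothetical, shown only to indicate the pattern:

    import json
    import urllib.parse
    from decimal import Decimal

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    TABLE = dynamodb.Table("claims_events")  # hypothetical target table

    def lambda_handler(event, context):
        """Triggered by an S3 ObjectCreated event; loads each JSON line into DynamoDB."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            for line in body.decode("utf-8").splitlines():
                # DynamoDB rejects Python floats, so parse numbers as Decimal.
                item = json.loads(line, parse_float=Decimal)
                TABLE.put_item(Item=item)

        return {"statusCode": 200, "objects_processed": len(event["Records"])}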
- Very good experience working in the Azure cloud: Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight big data technologies (Hadoop and Apache Spark), and Databricks.
- Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer, with a keen interest in the newer technology stack that GCP adds.
- Built ETL processes to load data from multiple data sources into HDFS, performed structural modifications using MapReduce and Hive, and analyzed data using visualization and reporting tools.
- Used AWS AppSync (GraphQL) for web API creation and data synchronization into Aurora PostgreSQL and DynamoDB.
- Exposure to AI and deep learning platforms such as TensorFlow, Keras, AWS ML, and Azure ML Studio.
- Designed and developed a decision tree application using the Neo4j graph database to model the nodes and relationships for each decision.
- Databricks job configuration and refactoring of ETL Databricks notebooks; hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Knowledge of SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively on PySpark to increase the efficiency and optimization of existing Hadoop approaches (see the PySpark sketch below).
- Proficiency in multiple databases: MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server.
- Developed different pipelines in StreamSets according to the requirements of the business owners; architected and built pipeline solutions to integrate data from multiple heterogeneous systems using StreamSets data collectors and Azure.
- Worked on NoSQL databases such as MongoDB and DocumentDB and graph databases such as Neo4j.
- Responsible for developing the EEIM application as an Apache Maven project and committing the code to Git.
- Worked with Jira, Bitbucket, source control systems such as Git and SVN, and development tools such as Jenkins and Artifactory.
- Designed a one-time load strategy for moving large databases to Azure SQL DWH.
- Designed and developed logical and physical data models using concepts such as star schema, snowflake schema, and slowly changing dimensions.
- Expertise in using Airflow and Oozie to create, debug, schedule, and monitor ETL jobs.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Expert knowledge of analytics with big data and deployment tools such as MLlib, Databricks, AWS SageMaker, Docker, and TensorFlow Serving.
- Architected and designed serverless application CI/CD using the AWS Serverless Application Model (SAM), providing a streamlined developer experience for delivering small serverless applications (a Lambda-based platform composed of a pipeline and a runtime) to solve business problems.
- Developed Python scripts to take backups of EBS volumes using AWS Lambda and CloudWatch.
- Experienced in building Snowpipe, with in-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.
- Hands-on experience with Snowflake utilities, SnowSQL, Snowpipe, and big data modeling techniques using Python; set up storage integrations and external stages for data movement to and from S3/Blob and Snowflake.
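A minimal PySpark sketch of the kind of batch transformation described above. Paths and column names are hypothetical and used only for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("claims-daily-batch").getOrCreate()

    # Hypothetical S3 locations for the raw and curated zones.
    raw = spark.read.json("s3://raw-bucket/claims/2024-01-01/")

    cleaned = (
        raw.dropDuplicates(["claim_id"])
           .withColumn("claim_amount", F.col("claim_amount").cast("double"))
           .withColumn("load_date", F.current_date())
           .filter(F.col("claim_amount") > 0)
    )

    # Write partitioned Parquet back to the curated zone.
    cleaned.write.mode("overwrite").partitionBy("load_date").parquet(
        "s3://curated-bucket/claims/"
    )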
- Extensively used data loading mechanisms for semi-structured (JSON, Parquet) and structured formats (CSV, flat files) using COPY and Snowpipe (SQS); developed streams and tasks to implement CDC in Snowflake.
- Good understanding of fact/dimension data warehouse design, including star and snowflake design methods.
- Worked on Guidewire PolicyCenter 8.0, BillingCenter 7.0, and ClaimCenter 8.x/7.
- Strong experience working with Informatica ETL, including Informatica PowerCenter Designer, Workflow Manager, Workflow Monitor, Informatica Server, and Repository Manager.
- Expertise in data masking, data subsetting, synthetic test data generation, and data archiving using the Informatica TDM/ILM suite; experience in data de-identification, data profiling, data masking, data domains, and data quality.
- Good understanding of Spark architecture with Databricks and Structured Streaming; set up AWS and Microsoft Azure with Databricks, Databricks workspaces for business analytics, cluster management in Databricks, and management of the machine learning lifecycle.
- Can work in both GCP and Azure clouds in parallel.
- Created Redshift clusters on AWS for quick accessibility for reporting needs.
- Designed and deployed a Spark cluster and different big data analytic tools, including Spark, Kafka streaming, AWS, and HBase, with Cloudera Distribution.
- Experience writing SAM templates to deploy serverless applications on the AWS cloud.
- Proven experience delivering software development solutions for a wide range of high-end clients, including big data processing, ingestion, analytics, and cloud migration from on-premises to AWS.
- Experience in domains such as IT, supply chain, retail, and healthcare, with in-depth knowledge of industry workings and challenges.
- Proven track record of adaptability, creativity, and innovation, along with very strong technical and managerial skills, while successfully leading teams to strict project deadlines.
- Set up dynamic inputs using Alteryx for ETL processes to bring in data from multiple sources.

Technical Skills:
Languages: Python (spaCy, Pandas, NumPy, scikit-learn, etc.), R (dplyr, car, zoo, etc.), SQL, JavaScript, C#, Scala, HQL, XPath, PySpark, PL/SQL, shell script, Perl script.
Databases: Oracle, MySQL, SQL Server, MongoDB, dbt, Snowflake, Cassandra, NoSQL, Alteryx, AWS Redshift, AWS Athena, DynamoDB, PostgreSQL, Teradata, Cosmos DB.
Database Modelling: Dimensional modelling, ER modelling, star schema modelling, snowflake modelling.
Operating Systems: Windows, Ubuntu Linux, macOS.
Cloud: AWS (EC2), Azure, GCP, Dataproc, Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, Azure Data Lake, Data Factory.
Data Warehousing/BI: Informatica PowerCenter, PowerExchange, IDQ, Ambari views, consumption framework.
Big Data Technologies: Apache Hadoop, StreamSets, Spark, Spark SQL, MLlib, Databricks, Synapse, PowerApps, AWS SageMaker, TensorFlow Serving, HDFS, MapReduce, MRUnit, YARN, Hive, Pig, HBase, Impala, Zookeeper, Sqoop, Oozie, Apache Cassandra, Scala, Flume, NiFi, Kafka, Mahout.
Visualization Tools: Tableau, Power BI, Excel, VBA, etc.
Development Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman, Jupyter Notebook, ND4J, scikit-learn, Shogun, MLlib, H2O, Cloudera Oryx, GoLearn, Apache Singa.
ETL Tools: MS SQL Server Analysis Services (MSOLAP, SSAS), Integration Services (SSIS), Reporting Services (SSRS), PerformancePoint Server (PPS), Oracle 9i OLAP, MS Office Web Components (OWC11), Informatica, Sqoop, TDCH, manual, etc.
Version Control: TortoiseHg, Microsoft TFS, SVN, Git, GitHub, MLoad, FastExport.

Professional Experience:

Amazon, Seattle, WA | March 2023 - Present
Role: Sr. Data Engineer
Responsibilities:
- Implemented solutions utilizing advanced AWS components (EMR, EC2, etc.) integrated with big data/Hadoop distribution frameworks such as Hadoop YARN, MapReduce, Spark, and Hive.
- Automated and built machine learning model pipelines using frameworks such as scikit-learn on user identity security data to detect fraud, and deployed them using Kubeflow.
- Used AWS Athena extensively to ingest structured data from S3 into multiple systems, including Redshift, and to generate reports.
- Developed an exploratory data analysis approach with the team lead to verify the initial hypotheses associated with potential AI/ML use cases.
- Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definitions of key business elements from Aurora.
- Created AWS Lambda functions using Python for deployment management in AWS; designed, investigated, and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.
- Created different AWS Lambda functions and API Gateways to submit data via an API Gateway that is accessible via a Lambda function.
- Career interests and future aspirations include, but are not limited to, ML, AI, RPA, and automation-everywhere initiatives.
- Developed Python AWS serverless Lambdas with concurrency and multi-threading to make processing faster and to execute callables asynchronously.
- Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
- Configured an EMR cluster for data ingestion and used dbt (data build tool) to transform the data in Redshift; wrote, tested, and debugged SQL transformations using dbt (see the Airflow + dbt sketch below).
- Facilitated training sessions to demo dbt for various teams and sent weekly communications on different data engineering topics.
- Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
- Experience in GCP Dataproc, Dataflow, Pub/Sub, GCS, Cloud Functions, BigQuery, Stackdriver, Cloud Logging, IAM, and Data Studio for reporting.
- Developed ELT processes from Ab Initio files and Google Sheets in GCP, with compute in Dataprep, Dataproc (PySpark), and BigQuery.
- Migrated an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Airflow jobs.
- Converted SAS code to Python/Spark-based jobs in Cloud Dataproc/BigQuery on GCP.
- Experience building isomorphic applications using React.js and Redux with GraphQL on the server side.
- Experience developing, supporting, and maintaining ETL (extract, transform, and load) processes using Informatica PowerCenter, and data subset solutions using Informatica persistent data masking.
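A minimal Airflow sketch of the kind of dbt-on-Redshift pipeline referenced above. The DAG id, schedule, and project path are hypothetical, shown only to indicate the pattern:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical dbt project location; adjust to the real deployment layout.
    DBT_DIR = "/opt/airflow/dbt/claims_project"

    with DAG(
        dag_id="dbt_redshift_daily",
        start_date=datetime(2023, 3, 1),
        schedule_interval="0 6 * * *",   # daily at 06:00 UTC
        catchup=False,
    ) as dag:

        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"cd {DBT_DIR} && dbt run --profiles-dir .",
        )

        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command=f"cd {DBT_DIR} && dbt test --profiles-dir .",
        )

        dbt_run >> dbt_test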
- Experience developing Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows big data resources.
- Developed applications in PowerApps using Common Data Service (CDS), Flow, Excel, Forms, and SharePoint Online; created custom forms using PowerApps forms and standalone apps, and developed a ticketing management system based on organizational requirements using PowerApps.
- Analyzed data from different sources using the big data solution Hadoop by implementing Azure Data Factory, Azure Data Lake, Azure Synapse, Azure Data Lake Analytics, HDInsight, Hive, and Sqoop.
- Data stack typically includes AWS, Snowflake, DynamoDB, S3, RDS, AI/ML data exploration, RPA correlations and causations, Spark SQL, SQL, data modeling, Tableau, and Excel.
- Worked with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (SQL DW).
- Extensively used Azure big data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, and Azure Data Factory, and created POCs for moving data from flat files and SQL Server using U-SQL jobs.
- Experience developing very complex mappings, reusable transformations, sessions, and workflows using the Informatica ETL tool to extract data from various sources and load it into targets.
- Implemented data profiling and data masking methodologies using PL/SQL scripting, the CA TDM (Computer Associates Test Data Management) process, and grid tools (Fast Data Masker, GT Data Maker, Javelin).
- Designed and developed ELT jobs using dbt to achieve the best performance; implemented SCD Type 2 using Python and dbt on the claims system, plus incremental loads on the claims subject area (see the SCD Type 2 sketch below); handled design, development, debugging, and unit testing of the dbt jobs.
- Expertise in all basic IICS transformations and tasks, including Hierarchy Builder, Hierarchy Parser, data masking, replication tasks, the REST V2 connector, and the web services transformation.
- Used gRPC and GraphQL as data gateways.
- Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
- Created AWS RDS (Relational Database Service) instances to act as the Hive metastore and combined EMR clusters' metadata into a single RDS, which avoids data loss even when terminating the EMR cluster.
- Involved in code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
- Experienced with event-driven and scheduled AWS Lambda functions that trigger various AWS resources.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
- Worked on creating, developing, and providing production support for AAS models on retail sales, POS, and corporate forecast data by developing and deploying Spark pipelines, ensuring continuous delivery of data from cross-functional teams such as omni-channel, demand-supply, and global logistics, and delivering enterprise cleansed datasets to BI engineering and data science teams.
- Good experience with the Guidewire PolicyCenter API.
- Loaded data into Spark RDDs and performed in-memory data computation to generate the output response.
- Created ETL jobs in AWS Glue to load vendor data from different sources, with transformations involving data cleaning, data imputation, and data mapping, and stored the results in S3 buckets; the stored data was later queried using AWS Athena.
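A simplified illustration of the SCD Type 2 pattern mentioned above, written in plain pandas rather than dbt. Key and column names are hypothetical; a production version would add surrogate keys and run inside the warehouse (for example as a dbt snapshot or a Delta MERGE):

    import pandas as pd

    HIGH_DATE = pd.Timestamp("9999-12-31")

    def apply_scd2(dim, incoming, key, tracked, load_date):
        """Expire changed rows in the dimension and append new current versions."""
        current = dim[dim["is_current"]]
        merged = current.merge(incoming, on=key, suffixes=("", "_new"))

        # Keys whose tracked attributes changed in the incoming feed.
        changed_mask = pd.Series(False, index=merged.index)
        for col in tracked:
            changed_mask |= merged[col] != merged[f"{col}_new"]
        changed = set(merged.loc[changed_mask, key])

        # Close out the superseded versions.
        expire = dim[key].isin(changed) & dim["is_current"]
        dim.loc[expire, "end_date"] = load_date
        dim.loc[expire, "is_current"] = False

        # Insert fresh versions for changed keys plus brand-new keys.
        new_keys = changed | (set(incoming[key]) - set(current[key]))
        new_rows = incoming[incoming[key].isin(new_keys)].copy()
        new_rows["start_date"] = load_date
        new_rows["end_date"] = HIGH_DATE
        new_rows["is_current"] = True

        return pd.concat([dim, new_rows], ignore_index=True)

    # Example call with hypothetical names:
    # dim = apply_scd2(dim, daily_extract, key="claim_id",
    #                  tracked=["status", "adjuster"],
    #                  load_date=pd.Timestamp.today().normalize())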
- Designed and developed ETL processes using Informatica 10.4 to load data from a wide range of sources such as Oracle, flat files, Salesforce, and the AWS cloud.
- Worked on all IICS transformations and tasks, including Hierarchy Builder, Hierarchy Parser, data masking, replication tasks, the REST V2 connector, and the web services transformation.
- Experience working with GraphQL queries and using the Apollo GraphQL library.
- Intensively used Python, JSON, and Groovy scripting to deploy StreamSets pipelines to the server; responsible for sending quality data through a secure channel to downstream systems using role-based access control and StreamSets.
- Created and modified several database objects such as tables, views, indexes, constraints, stored procedures, packages, functions, and triggers using SQL and PL/SQL.
- Hands-on experience building Azure notebooks and dbutils functions using Visual Studio Code and creating deployments using Git.
- Experience working with version control tools such as SVN and Git, revision control systems such as GitHub, and JIRA to track issues.
- Wrote Python scripts to parse XML documents and load the data into a database (see the parsing sketch below).
- Extracted and uploaded data into AWS S3 buckets using the Informatica AWS plugin.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Queried both managed and external tables created by Hive using Impala; monitored and controlled local disk storage and log files using Amazon CloudWatch.
- Played a key role in dynamic partitioning and bucketing of the data stored in Hive metadata.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Storage, and BigQuery.
- Created Azure Data Factory (ADF) pipelines using Azure Blob storage.
- Moved ETL jobs previously written in MySQL and Oracle to on-prem Hadoop initially, then performed lift-and-shift of the ETL jobs from on-prem Hadoop to Cloud Dataproc.
- Scheduled and automated business processes and workflows using Azure Logic Apps; built a trigger-based mechanism to reduce the cost of resources such as Web Jobs and Data Factories using Azure Logic Apps and Functions.
- Optimized Delta tables using the OPTIMIZE command and Z-Order clustering; implemented SCD Type 1 and SCD Type 2 logic for Delta tables using Azure Databricks.
- Performed ETL on data from different source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Implemented Neo4j to integrate a graph database with a relational database and to efficiently store, handle, and query highly connected elements in the data model.
- Worked on Azure PaaS components such as Azure Data Factory, Databricks, Azure Logic Apps, Application Insights, Azure Data Lake, Azure Data Lake Analytics, virtual machines, geo-replication, and App Services.
- Involved in extracting large volumes of data and analyzing complex business logic to drive business-oriented insights and recommending/proposing new solutions to the business in Excel reports.
- Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.
- Encoded and decoded JSON objects using PySpark to create and modify data frames in Apache Spark.
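A small sketch of the kind of XML-to-database loading script mentioned above, using only the Python standard library. The XML layout and table are hypothetical:

    import sqlite3
    import xml.etree.ElementTree as ET

    # Hypothetical input file and schema used only for illustration.
    tree = ET.parse("policies.xml")
    root = tree.getroot()

    rows = []
    for policy in root.findall("policy"):
        rows.append((
            policy.get("id"),
            policy.findtext("holder_name"),
            float(policy.findtext("premium", default="0")),
        ))

    conn = sqlite3.connect("staging.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS policies (id TEXT, holder_name TEXT, premium REAL)"
    )
    conn.executemany("INSERT INTO policies VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()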
- Expertise in Microsoft SQL Server and Azure PaaS components (Azure Data Factory, Databricks, Azure Logic Apps, Azure Data Lake Analytics using U-SQL, Azure App Service).
- Created builds and releases for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
- Developed ETL jobs to automate real-time data retrieval from Salesforce.com and suggested the best methods for data replication from Salesforce.com.
- Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and built various graphs for business decision-making using the Python matplotlib library.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed.
- Designed, developed, and managed Power BI, Tableau, QlikView, and Qlik Sense apps, including dashboards, reports, and storytelling; created a new 13-page Power BI dashboard to the design spec in two weeks, beating a tight timeline.
- Deployed automation to production that updates the company holiday schedule based on the company's holiday policy, which needs to be updated yearly.
- Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data into the data warehouse; the source systems were tables from Guidewire ClaimCenter and the policy/insurance management system.
- Loaded data into Snowflake tables from an internal stage using SnowSQL (see the loading sketch below); prepared the data warehouse using star/snowflake schema concepts in Snowflake with SnowSQL.
- Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups, and bins, and published them on the server.
- Designed and implemented ETL pipelines on S3 Parquet files in the data lake using AWS Glue.
- Designed AWS architecture, cloud migration, DynamoDB, and event processing using Lambda functions.
- Experience managing and securing custom AMIs and AWS account access using IAM; managed storage in AWS using Elastic Block Store and S3, created volumes, and configured snapshots.
- Experience configuring AWS S3 and its lifecycle policies, and backing up and archiving files in Amazon Glacier.
- Experience creating and maintaining databases in AWS using RDS.
- Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, used for data transformation, validation, and cleansing.
- Experience building and managing Hadoop EMR clusters on AWS; used AWS Elastic Beanstalk for deploying and scaling web applications and services developed with Java; developed scripts for AWS orchestration.
- Designed the tool API and MapReduce job workflow using AWS EMR and S3.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets its data from Kinesis in near real time.
- Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Used the AWS Glue catalog with crawlers to get the data from S3 and perform SQL query operations, and JSON schemas to define table and column mappings from S3 data to Redshift.
- Worked on EMR security configurations to store self-signed certificates as well as the KMS keys created in them; this makes it easy to spin up a cluster without modifying permissions after the call.
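A minimal sketch of loading staged files into a Snowflake table with the Python connector, corresponding to the stage/COPY work described above. The stage, table, warehouse, and credential names are hypothetical:

    import os

    import snowflake.connector  # pip install snowflake-connector-python

    # Credentials pulled from the environment; names here are illustrative.
    conn = snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )

    try:
        cur = conn.cursor()
        # Load Parquet files from an internal stage into a staging table.
        cur.execute("""
            COPY INTO STG_CLAIMS
            FROM @CLAIMS_STAGE/daily/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
        print(cur.fetchall())  # per-file load results
    finally:
        conn.close()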
- Worked with Cloudera 5.12.x and its different components; installed and set up a multi-node Cloudera cluster on the AWS cloud.
- Created Redshift clusters on AWS for quick accessibility for reporting needs.
- Designed and deployed a Spark cluster and different big data analytic tools, including Spark, Kafka streaming, AWS, and HBase, with Cloudera Distribution.
- Involved in importing real-time data using Kafka and implemented Oozie jobs for daily imports.
- In the Tableau development environment, supported customer service by designing ETL jobs and dashboards utilizing data from Redshift.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Applied partitioning and bucketing concepts in the Apache Hive database, which improves retrieval speed when someone performs a query.
Environment: Spark RDD, AWS Glue, Apache Kafka, Amazon S3, SQL, StreamSets, Spark, GCP, Dataproc, AWS cloud, ETL, Git, GraphQL, NumPy, SciPy, pandas, scikit-learn, Seaborn, NLTK, Spark 1.6/2.0 (PySpark, MLlib), EMR, EC2, Amazon RDS, data lake, Python, Cloudera stack, HBase, Hive, Impala, Pig, NiFi, Spark Streaming, Elasticsearch, Logstash, Kibana, JAX-RS, Spring, Hibernate, Apache Airflow, Guidewire, Oozie, RESTful API, JSON, JAXB, XML, WSDL, MySQL, Cassandra, MongoDB, HDFS, ELK/Splunk, Athena, Azure, Tableau, Redshift, Scala, Snowflake, Java, Jenkins, SnowSQL.

Papa John's Intl, Louisville, KY | July 2022 - February 2023
Role: Data Engineer
Responsibilities:
- Designed and set up an enterprise data lake to support various use cases, including storage, processing, analytics, and reporting of voluminous, rapidly changing data, using various AWS services.
- Used various AWS services, including S3, EC2, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS, and Kinesis.
- Extracted data from multiple source systems (S3, Redshift, and RDS) and created multiple tables and databases by creating Glue crawlers.
- Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and built various graphs for business decision-making using the Python matplotlib library.
- Implemented solutions utilizing advanced AWS components (EMR, EC2, etc.) integrated with big data/Hadoop distribution frameworks such as Hadoop YARN, MapReduce, Spark, and Hive.
- Good knowledge of Amazon AWS concepts such as EMR and EC2 web services, which provide fast and efficient processing of big data.
- Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof-of-concept (POC) applications to eventually adopt them and benefit from the big data Hadoop initiative.
- Used AWS Athena extensively to ingest structured data from S3 into multiple systems, including Redshift, and to generate reports.
- Exposure to IAM roles in GCP; used Sqoop import/export to ingest raw data into Cloud Storage by spinning up a Cloud Dataproc cluster.
- Experience in GCP Dataproc, GCS, Cloud Functions, Cloud SQL, and BigQuery; monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across the different environments.
- Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Analyzed data from different sources using the big data solution Hadoop by implementing Azure Data Factory, Azure Data Lake, Azure Synapse, Azure Data Lake Analytics, HDInsight, Hive, and Sqoop.
- Worked with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (SQL DW).
- Worked on migrating data from an on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, Azure Synapse, Azure SQL Data Warehouse, the write-back tool, and backwards.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Synapse, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Designed and developed a decision tree application using the Neo4j graph database to model the nodes and relationships for each decision.
- Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
- Created AWS RDS (Relational Database Service) instances to act as the Hive metastore and combined EMR clusters' metadata into a single RDS, which avoids data loss even when terminating the EMR cluster.
- Involved in code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
- Experience with the Workers' Compensation and Personal Insurance modules in Guidewire PolicyCenter; worked on Guidewire ClaimCenter, PolicyCenter, and BillingCenter, including PolicyCenter integration and configuration user stories.
- Loaded data into Spark RDDs and performed in-memory data computation to generate the output response.
- Extracted and uploaded data into AWS S3 buckets using the Informatica AWS plugin.
- Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Queried both managed and external tables created by Hive using Impala; monitored and controlled local disk storage and log files using Amazon CloudWatch.
- Played a key role in dynamic partitioning and bucketing of the data stored in Hive metadata.
- Involved in extracting large volumes of data and analyzing complex business logic to derive business-oriented insights and recommend/propose new solutions to the business in Excel reports.
- Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.
- Encoded and decoded JSON objects using PySpark to create and modify data frames in Apache Spark.
- Created builds and releases for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
- Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed.
- Used Informatica PowerCenter for data extraction, transformation, and loading (ETL) in the data warehouse.
- Built extract/load/transform (ELT) processes in the Snowflake data factory using dbt to manage and store data from internal and external sources; developed ELT data pipelines to migrate applications using the dbt and Snowflake framework.
- Created Git repositories and specified branching strategies that best fit the needs of the client.
- Prepared the data warehouse using star/snowflake schema concepts in Snowflake using SnowSQL.
- Built and tested the ETL process using Informatica and Python, loading data into Oracle Exadata.
- Developed Spark APIs to import data into HDFS from MySQL, SQL Server, and Oracle and created Hive tables (see the JDBC sketch below).
- Developed Sqoop jobs to import data in Avro file format from the Oracle database and created Hive tables on top of it.
- Designed and implemented ETL pipelines on S3 Parquet files in the data lake using AWS Glue.
- Designed AWS architecture, cloud migration, DynamoDB, and event processing using Lambda functions.
- Experience managing and securing custom AMIs and AWS account access using IAM; managed storage in AWS using Elastic Block Store and S3, created volumes, and configured snapshots.
- Experience configuring AWS S3 and its lifecycle policies, and backing up and archiving files in Amazon Glacier.
- Experience creating and maintaining databases in AWS using RDS; experience building and managing Hadoop EMR clusters on AWS.
- Used AWS Elastic Beanstalk for deploying and scaling web applications and services developed with Java.
- Designed the tool API and MapReduce job workflow using AWS EMR and S3.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets its data from Kinesis in near real time.
- Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Used the AWS Glue catalog with crawlers to get the data from S3 and perform SQL query operations, and JSON schemas to define table and column mappings from S3 data to Redshift.
- Involved in importing real-time data using Kafka and implemented Oozie jobs for daily imports.
- In the Tableau development environment, supported customer service by designing ETL jobs and dashboards utilizing data from Redshift.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
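A brief PySpark sketch of the JDBC import into Hive tables described above. Connection details are hypothetical, and the appropriate JDBC driver jar must be on the Spark classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mysql-to-hive-import")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Hypothetical connection settings; supply real host/credentials via config.
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://db-host:3306/sales")
        .option("dbtable", "orders")
        .option("user", "etl_user")
        .option("password", "********")
        .option("fetchsize", "10000")
        .load()
    )

    # Persist as a Hive table for downstream Hive/Impala queries.
    orders.write.mode("overwrite").saveAsTable("staging.orders")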
- Applied partitioning and bucketing concepts in the Apache Hive database, which improves retrieval speed when someone performs a query.
Environment: Spark RDD, AWS Glue, Apache Kafka, Amazon S3, SQL, Spark, AWS cloud, ETL, NumPy, SciPy, pandas, scikit-learn, Seaborn, NLTK, Spark 1.6/2.0 (PySpark, MLlib), EMR, EC2, Amazon RDS, data lake, Python, Cloudera stack, HBase, GCP, Hive, Impala, Pig, NiFi, Spark Streaming, Elasticsearch, Logstash, Kibana, JAX-RS, Spring, Hibernate, Apache Airflow, Guidewire, Oozie, RESTful API, JSON, JAXB, XML, WSDL, MySQL, Cassandra, MongoDB, HDFS, ELK/Splunk, Athena, Azure, Tableau, Redshift, Scala, Snowflake, Git, Jenkins, SnowSQL, JIRA, Alteryx.

Global Atlantic Financial Group, Indianapolis, IN | January 2021 - June 2022
Role: AWS Data Engineer
Responsibilities:
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Spark on YARN.
- Involved in file movements between HDFS and AWS S3; extensively worked with S3 buckets in AWS and converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations; imported data from different sources into Spark RDDs for processing, developed custom aggregate functions using Spark SQL, and performed interactive querying.
- Worked on data pipeline creation to convert incoming data to a common format, prepare data for analysis and visualization, migrate between databases, share data processing logic across web apps, batch jobs, and APIs, and consume large XML, CSV, and fixed-width files; created data pipelines in Kafka to replace batch jobs with real-time data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; used Sqoop for importing and exporting data between RDBMS and HDFS.
- Collected data using Spark Streaming from AWS S3 buckets in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Created an AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) per data retention requirements; involved in managing S3 data layers and databases including Redshift and Postgres.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded the data into MongoDB for further analysis; worked on MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
- Used Airflow to schedule and run data pipelines with the flexible Python operators and framework, and implemented pipelines that allow users to streamline various business processes.
- Developed a Python script to load CSV files into S3 buckets (see the sketch below); created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Worked with different file formats such as JSON, Avro, and Parquet and compression techniques such as Snappy; developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
- Developed shell scripts for adding dynamic partitions to Hive stage tables, verifying JSON schema changes in source files, and verifying duplicate files in the source location.
- Worked on importing metadata into Hive using Python and migrated existing tables and applications to work on the AWS cloud (S3).
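A minimal boto3 sketch of the CSV-to-S3 loading script mentioned above. The bucket, prefix, and local directory are hypothetical:

    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket/prefix; the local directory holds daily extracts.
    BUCKET = "ga-data-landing"
    PREFIX = "daily_extracts"

    def upload_csvs(local_dir):
        """Upload every CSV in local_dir to s3://BUCKET/PREFIX/<filename>."""
        for csv_file in Path(local_dir).glob("*.csv"):
            key = f"{PREFIX}/{csv_file.name}"
            s3.upload_file(str(csv_file), BUCKET, key)
            print(f"uploaded {csv_file} -> s3://{BUCKET}/{key}")

    if __name__ == "__main__":
        upload_csvs("/data/exports")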
- Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive structured and unstructured data.
- Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis; imported and cleansed high-volume data from various sources such as DB2, Oracle, and flat files onto SQL Server.
- Managed containers using Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.
- Worked extensively with importing metadata into Hive, migrated existing tables and applications to work on Hive and the AWS cloud, and made the data available in Athena and Snowflake.
- Extensively used Stash/Bitbucket for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
Environment: Spark, AWS, EC2, EMR, Hive, SQL Workbench, Tableau, Kibana, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Hadoop (Cloudera stack), Informatica, Jenkins, Docker, Hue, Netezza, Kafka, HBase, HDFS, Pig, Oracle, ETL, AWS S3, AWS Glue, Git, Grafana.

Highmark Health, Pittsburgh, PA | April 2020 - December 2020
Role: Data Analyst
Responsibilities:
- Involved in writing complex data queries using advanced SQL and database concepts; generated surveys and different reports.
- Developed SQL programs for quality checks and macros for standard reports and validations using various KPI-oriented analytical skills.
- Experienced in analyzing enrollment data (electronic medical systems and electronic data interfaces).
- Created and executed claims processing procedures that were organized and resource efficient.
- Efficiently and independently performed healthcare claim (Medicaid and Medicare) analysis and created reports covering federal reporting, financial arrangements, incentive agreements, therapeutic value, advantage structure, benefit design, healthcare reform, and health systems.
- Prepared summary statistics (mean, median, mode, standard deviation, minimum, maximum, sum, etc.) of quantitative variables within each data set.
- Hands-on experience with AWS components (EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, DynamoDB) to ensure a secure zone for the organization in the AWS public cloud.
- Developed Python AWS serverless Lambdas with concurrency and multi-threading to make processing faster and to execute callables asynchronously.
- Used HL7 to integrate, store, and share data electronically on a daily basis.
- Conducted hands-on data transfers, such as coding patient identification algorithms and cost and utilization outcomes; produced code that is logically organized and well documented.
- Created data analysis and business analysis reports in various formats (RTF, PDF, HTML, etc.).
- Performed data collection, data cleaning, data visualization, and feature engineering using Python libraries such as pandas, NumPy, matplotlib, and seaborn.
- Applied the elbow method to select the optimum number of clusters for the K-means algorithm (see the sketch below).
- Optimized SQL queries for transforming raw data into MySQL with Informatica to prepare structured data for machine learning.
- Used Tableau for data visualization and interactive statistical analysis; worked with business analysts to understand the user requirements, layout, and look of the interactive dashboard.
Environment: AWS, Python, pandas, NumPy, SciPy, Apache, ggplot, Plotly, matplotlib, Tableau, Bash, shell, SQL Server, MongoDB.
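An illustrative sketch of the elbow-method cluster selection mentioned above, using scikit-learn with synthetic stand-in data (the real feature matrix came from the prepared claims data):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic stand-in for the prepared feature matrix.
    X, _ = make_blobs(n_samples=500, centers=4, n_features=6, random_state=42)

    inertias = {}
    for k in range(1, 11):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias[k] = model.inertia_

    # The "elbow" is where adding clusters stops reducing inertia sharply.
    for k, inertia in inertias.items():
        print(f"k={k:2d}  inertia={inertia:,.1f}")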
ADP, Hyderabad, Telangana, India | January 2017 - March 2020
Role: Data Engineer
Responsibilities:
- Followed the Agile methodology (Scrum) to meet client expectations and timelines with quality deliverables.
- Created Spark jobs by writing RDDs in Python and created data frames in Spark SQL to perform data analysis, storing the results in Azure Data Lake.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the streaming data in HDFS using Scala (see the Structured Streaming sketch below).
- Developed Spark applications using Kafka and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Designed the Airflow scheduler for persistent service in an Airflow production environment.
- Created various data pipelines using Spark, Scala, and Spark SQL for faster processing of data; designed batch processing jobs using Apache Spark to increase speed compared to MapReduce jobs.
- Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.
- Developed a data pipeline using Flume to ingest data and customer histories into HDFS for analysis.
- Executed Spark SQL operations on JSON, transformed the data into a tabular structure using data frames, and stored and wrote the data to Hive and HDFS.
- Worked with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HQL queries.
- Experience working on Guidewire ClaimCenter, PolicyCenter, and BillingCenter.
- Created Hive tables per requirements as internal or external tables, defined with appropriate static or dynamic partitions and bucketing for efficiency.
- Used Hive as an ETL tool for event joins, filters, transformations, and pre-aggregations.
- Involved in moving all log files generated from various sources to HDFS for further processing through Kafka.
- Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them, and storing the results.
- Used the Spark SQL Scala interface that automatically converts RDDs of case classes to schema RDDs.
- Extracted source data from sequential files, XML files, and CSV files, then transformed and loaded it into the target data warehouse.
- Solid understanding of NoSQL databases (MongoDB and Cassandra).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Scala; extracted large datasets from Cassandra and Oracle servers into HDFS and vice versa using Sqoop.
- Involved in migrating the platform from Cloudera to the EMR platform.
- Developed analytical components using Scala, Spark, and Spark Streaming.
- Used hooks for writing the low-level code that hits an API or uses special libraries, serving as the building blocks that operators are built out of.
- Worked on developing ETL processes to load data from multiple data sources to HDFS using Flume and performed structural modifications using Hive.
- Provided technical solutions on MS Azure HDInsight, Hive, HBase, MongoDB, Telerik, Power BI, Spotfire, Tableau, and Azure SQL Data Warehouse, data migration techniques using BCP and Azure Data Factory, and fraud prediction using Azure Machine Learning.
Environment: Hadoop, Hive, Kafka, Snowflake, Spark, Scala, HBase, Cassandra, JSON, XML, UNIX shell scripting, Cloudera, MapReduce, Power BI, ETL, MySQL, NoSQL.
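A sketch of the Kafka-to-HDFS streaming ingestion described above, shown in PySpark (Structured Streaming) rather than Scala for consistency with the other sketches. Broker, topic, and paths are hypothetical, and the spark-sql-kafka connector package must be on the Spark classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Hypothetical broker, topic, and HDFS paths.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "payroll_events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers binary key/value columns; decode the payload to a string.
    decoded = events.select(
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp"),
    )

    query = (
        decoded.writeStream.format("parquet")
        .option("path", "hdfs:///data/payroll/events/")
        .option("checkpointLocation", "hdfs:///checkpoints/payroll_events/")
        .outputMode("append")
        .trigger(processingTime="1 minute")
        .start()
    )

    query.awaitTermination()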
Intuit, Hyderabad, Telangana, India | April 2015 - December 2016
Role: Data Engineer
Responsibilities:
- Collaborated with business users, product owners, and developers to contribute to analyzing functional requirements.
- Implemented Spark SQL queries that combine Hive queries with Python programmatic data manipulations supported by RDDs and data frames.
- Used Kafka Streams to configure Spark Streaming to get information and store it in HDFS; extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data as data frames, and saved the data in HDFS.
- Developed Spark scripts and UDFs using Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Installed and configured Pig, wrote Pig Latin scripts, and wrote MapReduce jobs using Pig Latin.
- In Airflow, used a DAG (Directed Acyclic Graph) as a collection of all the tasks to run, organized in a way that reflects their relationships and dependencies.
- Worked on analyzing Hadoop clusters using different big data analytic tools, including the HBase database and Sqoop.
- Worked on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
- Created and inserted data into Hive tables, dynamically inserting data using partitioning and bucketing for EDW tables and historical metrics (see the dynamic partitioning sketch below).
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations, and more during the ingestion process.
- Created ETL packages with different data sources (SQL Server, Oracle, flat files, Excel, DB2, and Teradata) and loaded the data into target tables by performing various SSIS transformations.
- Managed storage in AWS using Elastic Block Store and S3, created volumes, and configured snapshots; experience configuring AWS S3 and its lifecycle policies and backing up and archiving files in Amazon Glacier.
- Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
- Created partitions and bucketing across the state in Hive to handle structured data using Elasticsearch.
- Performed Sqooping for various file transfers through HBase tables, with data processing into several NoSQL databases: Cassandra and MongoDB.
Environment: Hadoop, MapReduce, HDFS, Hive, Python, Kafka, HBase, Sqoop, NoSQL, Spark 1.9, PL/SQL, Oracle, Cassandra, MongoDB, ETL, MySQL, SQL, Tableau, AWS, data visualization.

CSC, Hyderabad, Telangana, India | August 2014 - March 2015
Role: Data Analyst
Responsibilities:
- Created and inserted data into Hive tables, dynamically inserting data using partitioning and bucketing for EDW tables and historical metrics.
- Gathered requirements, analyzed them, and wrote the layout documents.
- Involved in complete Agile requirement analysis, development, system testing, and integration testing.
- Built various graphs for business decision-making using packages such as NumPy, pandas, Matplotlib, SciPy, and ggplot2 for numerical analysis.
- Involved in data mining, transformation, and loading from the source structures to the target system.
- Worked with various integrated development environments (IDEs) such as Visual Studio Code and PyCharm.
- Created data models for AWS Redshift and Hive from dimensional data models.
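An illustrative sketch of the dynamic-partition Hive inserts described in the bullets above, expressed through Spark SQL with Hive support. The database, table, and columns are hypothetical; bucketing is analogous via CLUSTERED BY in the Hive DDL:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("edw-dynamic-partition-load")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Allow fully dynamic partition values on insert.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Hypothetical EDW fact table partitioned by load month.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS edw.sales_fact (
            order_id     BIGINT,
            customer_id  BIGINT,
            amount       DOUBLE
        )
        PARTITIONED BY (load_month STRING)
        STORED AS ORC
    """)

    # Dynamic partitioning: load_month comes from the data, not a literal.
    spark.sql("""
        INSERT INTO TABLE edw.sales_fact PARTITION (load_month)
        SELECT order_id, customer_id, amount,
               date_format(order_ts, 'yyyy-MM') AS load_month
        FROM staging.sales_raw
    """)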
- Documented data mapping and transformation techniques within the functional design documents and enterprise requirements.
- Performed data profiling and data quality checks.
- Worked on exporting reports in multiple formats, including MS Word, Excel, CSV, and PDF.
Environment: Agile, NumPy, pandas, Matplotlib, SciPy, ggplot2, AWS Redshift, Hive.