Srihari - GCP Data Engineer
[email protected]
Location: Philadelphia, Pennsylvania, USA
Relocation:
Visa: H1B
SRIHARI P
SENIOR DATA ENGINEER | [email protected] | Mobile: +1 205-510-7438

PROFESSIONAL SUMMARY:
- 14+ years of experience in IT spanning application development, big data analytics, functional and automation testing, deployment, and the software development life cycle.
- 7+ years of experience in data engineering: architecting & developing data pipelines, data analysis, ETL development, and project management.
- Programming languages: Python, Scala, Java, C++, shell scripting.
- AWS: experienced in deploying applications with EMR, Glue, SNS, SQS, API Gateway, S3, CloudWatch, EC2, Lambda, Step Functions, Redshift, Athena and DynamoDB.
- GCP: experienced in deploying applications on Dataproc, Dataflow, Data Fusion, Cloud Functions, Composer and BigQuery.
- Databricks: experienced in developing ETL solutions with PySpark and Spark SQL, building Delta Lake and Delta Live Tables, and managing Databricks notebooks with Python & Spark SQL.
- Databases: Oracle, MySQL, SQL Server, Teradata, Hive, Impala, Postgres, DynamoDB & MongoDB.
- Development of data pipelines with AWS Glue, EMR, Databricks, GCP Dataproc and Palantir Foundry.
- Experienced with efficient cloud data warehousing solutions on BigQuery, Redshift & Snowflake.
- Good experience with Python functional and object-oriented programming; hands-on with regex, pandas, NumPy, SciPy, PyArrow, PyTorch, Matplotlib, Plotly & geospatial libraries.
- Designed Python APIs to retrieve, analyze and structure data from NoSQL platforms such as HBase, MongoDB and DynamoDB.
- Experience in dimensional data modeling with Star & Snowflake schemas and slowly changing dimensions (SCD).
- Experienced in application development for the Telecom, Medical (EMR), Healthcare (EHR), Insurance & Financial domains.
- Experience with distributed systems, large-scale non-relational data stores, RDBMS, Hadoop MapReduce systems, data modeling, data management, and multi-terabyte warehouses and data marts.
- Designed and implemented data-migration plans to transfer data to HDFS, S3, DynamoDB and BigQuery using PySpark, Python and Sqoop.
- Developed QA frameworks to test ETL data pipelines, functional and regression suites, and dashboards.
- Deployment: experience with CI/CD tools such as Jenkins, Bamboo, Bitbucket, Maven, Ant, Git and CVS.
- Experience with machine learning and AI techniques: regression, random forest, KNN, and the Levenshtein distance algorithm.
TECHNICAL SKILLS
Big Data Technologies: Spark, PySpark, Pandas, Hive, Impala, Sqoop, Flume, Oozie, HBase, HDFS, MongoDB, Snowflake, Databricks, Apache Beam, Spark Streaming, Kafka, Airflow
AWS: EMR, Glue, S3, EC2, Lambda, Athena, Step Functions, API Gateway, SNS, SQS, Redshift, DynamoDB, CloudWatch
GCP: Dataproc, Dataflow, Data Fusion, BigQuery, Composer, Cloud Functions, Cloud SQL, Looker, Cloud Storage, VM images
Databricks: Delta Live Tables, Notebooks, Spark SQL, PySpark, Pytest
DevOps: Terraform, CloudFormation, Chef, Jenkins
Languages: Python, Scala, Java, C++, SQL, shell scripting
Databases: MySQL, Oracle PL/SQL, Teradata, DynamoDB, Postgres, SQL Server, MongoDB, HBase
Palantir Foundry: Code Workbook, Slate, Contour, Data Pipeline, Ontology, Monocle
Machine Learning: AVM, NLP, KNN, Levenshtein, power iteration clustering
Web Technologies: REST, SOAP/XML, JSON, HTML, CSS, JavaScript, WSDL
Operating Systems: Linux, Unix and Windows
IDEs: Conda, Jupyter, IntelliJ, PyCharm, Eclipse, Source Insight
DB Tools: SQL Developer, SQuirreL
Tracking Tools: Redmine, JIRA
CI/CD Tools: Git, Jenkins, Confluence, Bitbucket, Bamboo

EDUCATION
Bachelor of Technology, JNTU College of Engineering (Autonomous), Kakinada, INDIA

PROFESSIONAL EXPERIENCE

1. Senior Data Engineer
Client: Pacific Gas & Electric (PG&E) | July 2022 - Till date
Location: Sunnyvale, CA (Remote)
Project Description: Application design, data analysis, migration, development and deployment for the public safety power shutoff (PSPS) implementation and message broadcast.
Roles & Responsibilities:
- Development of ETL data pipelines with GCP Dataproc, BigQuery & Dataflow.
- Development of PySpark applications to ingest, transform and load data using Python object-oriented programming & shell scripting.
- Design & development of data pipelines with GCP Dataproc, Dataflow, Cloud Functions, API Gateway & Pub/Sub.
- Development of the PSPS Situational Intelligence Platform (PSIP) for secure, data-driven operations and decision making during PSPS activations and events.
- Development of data pipelines for message broadcasting, device failure notifications and rotating outages.
- Automating the ALUT (address lookup file upload), data sharing and download backup processes.
- Application development with Code Workbook, Slate and Ontology in Palantir Foundry.
- Development of Dataflow pipelines using Apache Beam to migrate data from various sources to BigQuery with Data Fusion (see the sketch after this project).
- Development of a Java application to load incremental and snapshot data.
- Creating Jupyter notebooks using PySpark, Python and SQL, and automating notebooks using jobs.
- Integration of UI applications with APIs by usage type & category using API Gateway & Cloud Functions.
- Development of Python APIs to dump the processor's array structures at the failure point for debugging.
- Development of ETL pipelines in the data warehouse and customer reports using advanced SQL in Snowflake.
- Design and development of ETL jobs with GCP Dataflow to migrate data from external sources to BigQuery.
- Automating the backup jobs on a monthly/daily basis with Cloud Scheduler & Cloud Functions.
- Design and development of Palantir Contour dashboards and data analysis.
- Development of a Kafka consumer to fetch data from Kafka topics.
- Deployment of the application to the Dev, Stage & Prod environments.
- Responsible for estimating cluster size, monitoring and troubleshooting the Spark Dataproc & BigQuery clusters.
- QA analysis of Ontology objects, datasets and Contour dashboards.
- Developed scripts to create Delta table DDL and analyze tables from PySpark jobs.
- Automation of data validation through Great Expectations and stub functions.
Environment: PySpark, Python, REST, Pandas, JQ, GCP, BigQuery, Dataproc, Dataflow, Apache Beam, Data Fusion, Cloud Functions, Cloud Composer, Looker, Kafka, Palantir Foundry, Databricks, Airflow, Terraform, Kubernetes, shell scripting, Linux.
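Illustrative sketch (not project code): a minimal Apache Beam (Python SDK) streaming pipeline of the Pub/Sub-to-BigQuery pattern referenced above; the project ID, topic, table and field names are hypothetical placeholders.

# Hypothetical sketch of a streaming Dataflow pipeline: read JSON device events
# from a Pub/Sub topic, flatten them, and append them to a BigQuery table.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub message payload into a flat row for BigQuery."""
    record = json.loads(message.decode("utf-8"))
    return {
        "device_id": record.get("device_id"),
        "event_type": record.get("event_type"),
        "event_ts": record.get("event_ts"),
    }


def run() -> None:
    options = PipelineOptions(streaming=True, save_main_session=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/device-events")
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="example-project:ops.device_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()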
2. Senior Data Engineer
Client: Walmart | May 2021 - June 2022
Location: Bentonville, AR (Remote)
Project Description: Assortment & Optimization: store optimization, data analysis, data migration, and application design, architecture, development and deployment of the optimization engine.
Roles & Responsibilities:
- Design & development of PySpark applications for data extraction, transformation and aggregation from multiple sources.
- Application design for integration with REST APIs, the Merchant UI and custom Python libraries.
- Development of the data ingestion pipeline and integration with PySpark, BigQuery and Apache Beam.
- Processing and loading data from a Google Pub/Sub topic to BigQuery using Cloud Dataflow with Python.
- Machine learning algorithm development with pandas, NumPy, Matplotlib and scikit-learn in Python.
- Loading incoming CSV files from a GCS bucket into BigQuery using Cloud Functions with Python.
- Translating business and data requirements into logical data models in support of enterprise data models, OLAP, OLTP and analytical systems.
- Imported real-time weblogs using Kafka as a messaging system and integrated it with Spark Streaming.
- Design & development of Kafka producer & consumer applications, and cluster setup with ZooKeeper.
- Data integration, unifying batch and streaming data into the data lake.
- Development of the data warehouse model in Snowflake and built logical & physical data models for Snowflake.
- Connection establishment & daily data ingestion from Snowflake to Databricks.
- Design & development of high-performance ETL pipelines with Databricks Delta Live Tables.
- Creating reports in Looker based on Snowflake connections.
- Visualization of store layout with adjacency, left/right, opposite & perpendicular mapping, and mapping to UI dashboards.
- Infrastructure design, deployment and configuration management on GCP.
- Automation of ETL processes to wrangle the data and post recommendation data store by store.
- QA automation of store data analysis by department and validation.
- Performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism and memory tuning.
- Automation framework for QA analysis of visualizations and store layout.
- Built horizontally and vertically scalable distributed data solutions with Python multiprocessing, REST APIs and GCP VMs.
Environment: Python, PySpark, BigQuery, REST, Pandas, JQ, Snowflake, Databricks, GCP, Dataproc, Apache Beam, Cloud Functions, Data Fusion, Cloud SQL, Looker, VM, Kafka, Spark Streaming, Airflow, Terraform, TrainMe, SQL, shell scripting, Linux.

3. Senior Data Engineer
Client: Dun & Bradstreet | July 2020 - Apr 2021
Location: Short Hills, NJ (Remote)
Project Description: Data analysis, application design, development, testing and deployment for healthcare insurance claims for workers' compensation. Development of data pipelines for healthcare insurance claims data.
Roles & Responsibilities:
- Development of Spark applications to extract data from multiple sources with Python, Scala & shell scripting.
- Development of Spark jobs to classify the data by industry type & category.
- Development of ETL Spark jobs to detect healthcare insurance fraud & healthcare provider risk.
- Automating the backup jobs daily with GCP Stackdriver & Cloud Functions.
- Data analysis of healthcare data, enhancing data accuracy and operational efficiency.
- Designed the ETL data pipeline to transform data from multiple sources into master tables.
- Development of Python APIs to dump the processor's array structures at the failure point for debugging.
- Build deployment on the GCP ecosystem with Dataproc, Cloud Storage, Stackdriver, Cloud Functions and Cloud Composer.
- Transformation of Hive SQL queries into Spark jobs with PySpark & Spark SQL.
- Development of Oozie workflows for scheduling and orchestration.
- Data migration from Cloudera to GCP and SAS application migration to GCP.
- Design & build of a large-scale data warehouse on BigQuery, with performance tuning & normalization jobs.
- Data ingestion with Kafka and data integration into the data lake and warehouse.
- Deployment of the application to the Dev, Stage & Prod environments.
- Creating Databricks workflows using PySpark, Python and SQL, and automating notebooks using jobs.
- Responsible for estimating cluster size, monitoring and troubleshooting the Spark Databricks cluster.
- Developed scripts to create Hive table DDL and analyze tables from PySpark jobs.
- QA verification of the ETL data pipeline and Tableau dashboards.
- Development of a Tableau dashboard with weekly views of the data by category.
Environment: Spark, Python, Scala, Hive, HBase, Teradata, Tableau, Databricks, GCP Dataproc, Cloud Functions, BigQuery, Cloud Composer, Kafka, Cloud SQL, API Gateway, Oozie, shell scripting, Linux.

4. Senior Data Engineer
Client: Vanguard, Malvern, PA | July 2019 - June 2020
Project Description: ETL data pipeline design, development and deployment for exchange-traded funds (ETFs). Data analysis on ETF data including fund flows and assets under management (AUM) by different categories.
Roles & Responsibilities:
- Designed the PySpark application to extract data from multiple vendors through REST APIs.
- Spark application development with Python, Java & shell scripting.
- Designed the basic ETL to transform data from source to master tables.
- Implemented the business rules for currency conversions, fund coverage and daily price (see the sketch after this project).
- Design and development of performant ETL pipelines using Python, PySpark, AWS EMR & Lambda.
- Data structuring and building the data pipeline for data ingestion and transformation into Hive tables.
- Data modeling: defining schemas, removing duplicates and filling missing observations.
- Build deployment on the AWS ecosystem with EMR, S3, EC2, Service Catalog & Glue Catalog.
- Automation & orchestration of Spark jobs with Oozie and scheduling.
- Development of application repositories for AWS IAM roles to AWS services for the Dev, Test & Prod regions.
- Developed scripts to create Hive table DDL and analyze tables from PySpark jobs.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Data analysis by fund category, type, region and fund holders after currency conversion.
- Development of Tableau views for fund flow and AUM by month, 12-month average, YTD & previous-year YTD, and dashboard creation.
- Development of Tableau views for ETFs by client view, product view & month-over-month differences.
- Data migration from the Hive warehouse to the Glue catalog and maintaining the required DDL.
Environment: Spark, Scala, Python, Pandas, Hive, Tableau, Postgres, HBase, Oozie, AWS Glue, EMR, Lambda, Redshift, Control-M, shell scripting, Linux.
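Illustrative sketch (not client code): a PySpark transform of the kind of currency-conversion and de-duplication rule described above; the table, column and key names are assumed for the example.

# Hypothetical sketch: normalize daily fund flows to USD via an FX-rate join and
# keep only the latest record per (fund_id, flow_date) before loading the master table.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("etf-fund-flows").enableHiveSupport().getOrCreate()

flows = spark.table("staging.fund_flows")   # fund_id, flow_date, currency, amount, load_ts
fx = spark.table("reference.fx_rates")      # currency, rate_date, usd_rate

converted = (
    flows.alias("f")
    .join(
        fx.alias("x"),
        (F.col("f.currency") == F.col("x.currency"))
        & (F.col("f.flow_date") == F.col("x.rate_date")),
        "left",
    )
    .withColumn("amount_usd", F.col("f.amount") * F.coalesce(F.col("x.usd_rate"), F.lit(1.0)))
    .select("f.fund_id", "f.flow_date", "f.currency", "f.amount", "amount_usd", "f.load_ts")
)

# Duplicate vendor feeds: keep the most recently loaded row per business key.
latest = Window.partitionBy("fund_id", "flow_date").orderBy(F.col("load_ts").desc())
deduped = (
    converted.withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .drop("rn")
)

deduped.write.mode("overwrite").saveAsTable("master.fund_flows_usd")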
5. Big Data Engineer
Client: IQVIA, Plymouth Meeting, PA | July 2018 - June 2019
Project Description: Application design, development, testing and deployment of a big data solution for customer experience. The customer transactional medical data includes patient visits, medications, prescriptions, problems, orders and observations.
Roles & Responsibilities:
- Designed the Spark application to extract data from XML sources and transform OMOP tables into CDD tables.
- Implemented the business rules to read the patient data and inferred new business rules to derive specific data.
- Designed business logic for cleaning & structuring incoming customer data in HDFS Parquet files and transforming it into Impala tables.
- Developed scripts to create Impala table DDL and analyze tables from Spark jobs.
- Developed Spark Scala jobs and Java APIs, and performed transformations and actions on RDDs.
- Designed the patient EMR/EHR object module with Java object-oriented programming.
- Analysis of healthcare (EHR) data, identifying patterns to surface early warning signs, avoid care-plan errors and prevent mass diseases.
- Developed a pipeline to ingest structured & unstructured data and transform it into EMR data per HIPAA, HL7 and FHIR standards.
- Development of Oozie workflows for daily incremental loads from Teradata to Hive tables.
- Data ingestion from RDBMS to HDFS with Sqoop, OraOop & Spark JDBC applications.
- Writing Hive queries and Hive UDFs to aggregate the patient orders data.
- Transformed Hive queries into Spark transformations using Spark/Scala.
- Developed the Scala/Java application to fetch patient data from XML semi-structured data.
- Automated the daily process of archiving data tar files from the CLEO server to the PHI server and then to local HDFS.
- Responsible for configuring, deploying and monitoring Spark jobs on the YARN cluster.
- Manual testing of CDD tables against the legacy data, the given rules and the XML data.
- Data modeling, table creation in Impala and writing Hive queries using the Spark SQL context.
- Created the BDF with PySpark for parsing German hospital data, with machine learning/NLP algorithms (Levenshtein distance, power iteration clustering) to derive the MFG code for drugs & manufacturer names.
- Automated the Spark jobs for data ingestion, data cleansing and creating stage tables.
- Deployed the application on Jenkins with Maven and shell scripts.
- Manual testing of Impala/Hive tables against the weekly & monthly reference tables and against CSV data.
- Testing the Spark jobs on Cloudera Manager & Kibana and debugging root causes.
Environment: Spark, Scala, Java, Python, Medical/Healthcare, Cloudera, Hive, HBase, Impala, Oozie, Oracle, Teradata, AWS, Lambda, Redshift, shell scripting

6. Data Engineer
Client: State Street, Boston, MA | June 2017 - July 2018
Project Description: Application design, development and optimization to track customer experience. The customer transactional data includes customer demographics and key characteristics, products held, credit-card statements, transaction and point-of-sale data, online and mobile transfers and payments, and credit-bureau data.
Roles & Responsibilities:
- Design and development of the Spark application for customer experience tracking.
- Performed clustering of customers for customer segmentation.
- Developed business logic to design the format for structuring incoming customer data initially stored in DynamoDB, and its transfer to HDFS.
- Responsible for setting up and monitoring a distributed analytics Spark cluster on AWS.
- Developed and designed a system to collect data from multiple sources and process it using Spark Streaming.
- Text mining of large customer datasets to identify the issues customers have with the products.
- Developed scripts to analyze consumer experience data (provided as ratings), which helps rank vendors by the quality of services offered, grouped by service field.
- Developed scripts to perform sentiment analysis of consumer experience data.
- Developed Spark/Scala jobs to create business tables in Hive, querying Hive tables using Spark HiveContext, and testing them.
- Designed custom PySpark APIs and Python object-oriented methods to analyze data, understand the correlation between customer and claims data, and understand the risk associated with various profiles.
- Developed clickstream and touch analysis routines to provide actionable feedback to improve the website/app design, improving product visibility and customer service.
- Proficient in managing and reviewing Hadoop log files and finding the root cause of errors.
- Owing to the success of Spark, replicated and optimized algorithms in Python for faster performance.
Environment: Spark, Scala, Python, Pandas, OOPS, AWS EMR, Lambda, DynamoDB, MongoDB, statistical analysis

7. Big Data Developer, Network Data Analysis & Optimization
IPACCESS LTD, Pune, INDIA | November 2015 - February 2017
Client: British Telecom, UK
Project Description: Network analysis to track customer experience and analyze customer clickstreams to understand their preferences and propensity to buy. Communications service providers can optimize quality of service and routing by analyzing network traffic in real time, enabling them to respond to fluctuations in traffic and reallocate bandwidth as needed.
Roles & Responsibilities:
- User data convergence based on user log data and backhaul network data.
- Design, implementation and deployment of the Hadoop cluster.
- Analysis of user clickstream data for specific products to serve targeted promotional offers.
- Providing solutions based on customer preferences and user behavior by analyzing customer data usage, call logs and customer social media data, and making relevant recommendations to customers.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming for real-time processing (see the sketch after this project).
- Used ZooKeeper to coordinate and run different cluster services.
- User segmentation based on region and usage patterns.
- Worked on Sqoop to load data into HDFS from relational database management systems.
- Loading data into the Hadoop Distributed File System (HDFS) with the help of Kafka and REST APIs.
- Network traffic analysis and routing to control traffic fluctuations and bandwidth allocation.
- Writing Spark applications to transform and load the data.
- Wrote stub functions to test application functionality.
- Data visualization and exporting to the infrastructure and planning team.
- Experienced in machine learning and statistical analysis with Python scikit-learn.
- Analyzing network data with Apache Spark to understand network congestion.
- Loading user data from RDBMS into Spark SQL and analysis.
- Configured Jenkins nodes, upstream/downstream jobs, and deployments with Ant and shell scripts.
- Automation and unit testing of user clickstreams & traffic.
Environment: Spark, PySpark, Python, Scala, Java, C++, Kafka, Pandas, GIS (geospatial), HDFS, Hive, Linux
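Illustrative sketch (not project code): the Kafka-to-Spark streaming path described above, shown here with the Structured Streaming API; broker addresses, the topic name, the log schema and the console sink are assumptions for the example.

# Hypothetical sketch: consume server-log events from Kafka and aggregate traffic
# per cell in one-minute windows. Requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

spark = SparkSession.builder.appName("network-traffic-monitor").getOrCreate()

log_schema = StructType([
    StructField("cell_id", StringType()),
    StructField("bytes_up", LongType()),
    StructField("bytes_down", LongType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "server-logs")
    .load()
)

# Parse the Kafka message value (JSON) into typed columns.
events = raw.select(F.from_json(F.col("value").cast("string"), log_schema).alias("e")).select("e.*")

traffic = (
    events.withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"), "cell_id")
    .agg(F.sum("bytes_up").alias("bytes_up"), F.sum("bytes_down").alias("bytes_down"))
)

query = (
    traffic.writeStream.outputMode("update")
    .format("console")           # a real job would write to HDFS/Hive instead
    .option("truncate", "false")
    .start()
)
query.awaitTermination()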
8. Application Developer / Data Analyst, Presence Collector
IPACCESS LTD, Pune, INDIA | December 2014 - October 2015
Client: AT&T, Texas
Project Description: The Presence Collector is a system that collects CPE data based on location updates and provides customer data to the service provider for commercial use. The project involved migrating the data and analyzing it to extract pertinent insights, analyzing customer-level information to determine the effectiveness of CPE location, and recommending consumer policies to provide a better customer experience.
Roles & Responsibilities:
- Extracted data from various servers and developed APIs to extract new customer data stored directly in databases.
- Developing analytical databases from complex UE location data.
- Designed a standalone application with Python object-oriented concepts and sockets to communicate with multiple UEs through the femtocell, and to commission and configure the femtocell.
- Automation and testing of user data call flows.
- Developed modules to read cellular data from backhaul networks and analyze it to understand network loads and service consumption per zone.
- Deriving qualitative and quantitative data and storing it in MySQL tables.
- Carrying out specific data processing and statistical techniques.
- Collecting, collating and carrying out complex data analysis in support of management & customer requests.
- Wrote REST APIs in Java 7 to support internationalization, and apps to visualize and set portfolio performance targets.
- Loading and extracting data from the database; wrote queries and scripts.
- Monitoring the automated loading processes.
- Analyzing CDRs for troubled patterns and service margins, and improving call quality.
- Extracting and loading data from traditional databases, and processing & analyzing CPE data.
- Analyzing data in different formats, drawing conclusions & developing recommendations.
- Statistical reporting of findings and data visualization.
Environment: Python, C++, Java, OOPS, Pandas, NumPy, MySQL, Linux, Matplotlib, Plotly

9. LTE/4G Femtocell Stack Developer
IPACCESS LTD, Pune, INDIA | October 2013 - November 2014
Client: AT&T, Texas
Project Description: Long-Term Evolution (LTE) is a standard for high-speed wireless communication for mobile devices and data terminals, based on the GSM/EDGE and UMTS/HSPA technologies. It increases capacity and speed by using a different radio interface together with core network improvements. LTE is the upgrade path for carriers with both GSM/UMTS and CDMA2000 networks; the different LTE frequencies and bands used in different countries mean that only multi-band phones can use LTE in all countries where it is supported.
Roles & Responsibilities:
- Involved in the design and development of the LTE 4G femtocell access point stack.
- Development of the L3 stack to support UE attach, detach, multi-RAB, radio bearer, X2 handover, simultaneous PS calls and SMS services.
- Development of idle-mode procedures, PS data packet classification and header suppression modules on the L3 stack.
- Developed the interfaces between the RRC, PDCP & RLC layers.
- Supported the PHY team to increase system performance and call rates.
- Integration of new deployment branches into the main stack and testing.
- Developed fault management to monitor UE CS & PS calls.
- Wireshark log analysis, development of Lua scripts to analyze Wireshark packets, bug fixing and logging defects.
- Development of TTCN scripts to test stack functionality interactively.
- Integration and unit testing.
Environment: C, C++, Linux, MySQL, TTCN, Lua, Wireshark, GDB

10. 3G/UMTS Stack Developer
IPACCESS LTD, Pune, INDIA | May 2011 - September 2013
Client: British Telecom, UK
Project Description: UMTS, the Universal Mobile Telecommunications System, is a 3GPP protocol that uses WCDMA to carry radio transmissions. A femtocell/picocell is a small cellular base station, typically designed for use in a home or small business, that connects to the service provider's network via broadband. IPAccess's femtocell solution is called the Femto Access Point and its picocell solution is called the Nano Access Point. The project involved developing the L2/L3 stack, porting, and systems support (analysis, problem determination, troubleshooting of applications, and integration of enhancements).
Roles & Responsibilities:
- End-to-end design of application components using Java Collections, providing concurrent database access using multithreading.
- Development of HSDPA/HSUPA user-plane and measurement-control features.
- Development of L2 interfaces with the L3 control plane.
- Development of information elements (IEs) for the RRC protocol.
- Gathered specifications and requirements to enhance the stack according to 3GPP specifications.
- Maintaining code modularity & refactoring.
- Developed the functional test framework for various features based on the business requirements.
- Performance-tuned the application to prevent memory leaks and to boost its performance and reliability.
- Prepared status summary reports with details of executed, passed and failed test cases.
- Developed statistical analysis of the code by tracking code coverage.
- Developed and maintained various unit tests on the applications.
Environment: C, C++, Linux, SQL, statistical analysis

11. IPAM Developer
Tata Elxsi Ltd, Bangalore, INDIA | December 2009 - April 2011
Client: Infoblox, CA
Project Description: The Infoblox appliance provides core network services including IPAM, DNS, DHCP and vDiscovery (VMware discovery). The appliance consists of the NIOS and vNIOS software. vSphere Discovery is a feature that builds on the existing Network Discovery feature. During this project we developed test cases for the Infoblox appliance, covering the DNS, DHCP and vDiscovery features.
Roles & Responsibilities:
- Developed an interface to IPAM (IP address manager) covering the application's current detailed functionality and interfaces; built business knowledge of the various functions of the application that require functional testing.
- Implemented the IP address manager dashboard to display current network status, IP addresses, network discovery status & services.
- Designed different scenarios in the DNS & DHCP controller and set up performance monitors to help identify application issues such as database locks, bottlenecks, etc. (see the sketch after this project).
- Preparation/maintenance of test environments.
- Tested different versions of the application before going live to production.
- Executed functional test cases manually and performed bug tracking using Test Director.
- Confirmed installation standards and methodologies; documented and communicated test results on a daily basis.
- Reported regular status and escalated issues to the manager.
- Attended weekly team meetings with team leads and the manager to discuss issues and project status.
Environment: Python, C, XML, shell script, Ubuntu Linux
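Illustrative sketch (not project code): the style of lightweight functional check used for the DNS scenarios mentioned above; the hostname and expected address are placeholders.

# Hypothetical sketch: resolve an A record through the configured resolver and
# compare the answer against the expected address for a DNS functional test.
import socket


def check_a_record(hostname: str, expected_ip: str) -> bool:
    """Return True if the resolver answers with the expected address."""
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return expected_ip in addresses


if __name__ == "__main__":
    result = check_a_record("host1.lab.example.com", "10.0.0.21")
    print("PASS" if result else "FAIL")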
12. WiMAX Stack Developer
Tata Elxsi Ltd, Chennai, INDIA | August 2006 - November 2009
Client: VIASAT, CA
Project Description: The Mobile WiMAX system profiles for Wave 1 & Wave 2 of the WiMAX protocol standard IEEE 802.16e/d support data & voice services, along with standard 4G features such as MIMO, MTC, IPv6 & HARQ.
Roles & Responsibilities:
- Involved in the complete software development life cycle (SDLC) of the application, from requirement analysis to testing.
- Development of Wave 2 features such as IPv6, MTC and HARQ for the base station MAC layer.
- Requirement gathering per the IEEE specification.
- Development of the service flow and payload header suppression modules.
- Porting the application to different platforms such as Cavium and Fedora Core 5.
- BS MAC stack integration with the physical layer.
- Functional test case development & test framework enhancement.
- Bug reporting and fixing.
Environment: C, XML, Linux, Fedora Core 5, Cavium, SQL, shell script