Sri Varshini K - Data Engineer |
[email protected] |
Location: USA |
Relocation: Yes |
Visa: GC |
Sri Varshini
Data Engineer
[email protected] | +1 469-830-8717

Professional Summary:
9+ years of experience in systems analysis, development, deployment, and management of big data applications, Java, and data warehousing, spanning the Hadoop ecosystem, AWS cloud data engineering, data visualization, reporting, and data quality solutions.
Good experience with Amazon Web Services such as S3, IAM, EC2, EMR, Kinesis, VPC, DynamoDB, Redshift, Amazon RDS, Lambda, Athena, Glue, DMS, QuickSight, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, and other services of the AWS family.
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure Blob, Azure Data Lake Storage Gen2, and Azure SQL Data Warehouse, including controlling and granting database access.
Hands-on experience with S3, EC2, RDS, EMR, Redshift, SQS, Glue, and other AWS services.
Developed an end-to-end scalable architecture to solve business problems with Azure components such as Data Lake, Key Vault, HDInsight, Azure Monitoring, Azure Synapse, Function App, Data Factory, and Event Hubs.
Experience in Google Cloud Platform (GCP): BigQuery, Cloud Composer, Airflow, Cloud SQL, Cloud Storage, Cloud Functions, and Dataflow.
Experience in writing queries and creating tables/views/partitions in GCP BigQuery.
Strong experience on a migration project from on-premises ETL to Google Cloud Platform (GCP) BigQuery.
Experienced in understanding distributed file systems.
Development of ETL processes using PostgreSQL, Informatica, and Unix scripts/jobs.
In-depth understanding of Snowflake multi-cluster sizing and credit usage.
Played a key role in migrating Teradata objects into the Snowflake environment.
Experience with Snowflake multi-cluster warehouses and Snowflake virtual warehouses.
Experience with PySpark and Azure Data Factory in creating, developing, and deploying high-performance ETL pipelines.
Experience in managing, architecting/designing, modeling, developing, and testing database/data warehouse applications.
Expertise in writing Spark RDD transformations, actions, and DataFrames for the given input.
Expert in developing SSIS/DTS packages to extract, transform, and load (ETL) data into data warehouses/data marts from heterogeneous sources.
Good understanding of Spark Core, Spark SQL, Kinesis, and Kafka.
Improved the performance of SSIS packages by implementing parallel execution, removing unnecessary sorting, and using optimized queries and stored procedures.
Experience in design and development with Tableau.
Hands-on experience with Spark Streaming to receive real-time data using Kafka.
Strong knowledge of data warehousing implementation concepts in Redshift; completed a POC with Matillion and Redshift for DW implementation.
Good understanding of Spark architecture including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver node, worker nodes, stages, executors, and tasks.
Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.
Set up full CI/CD pipelines so that each commit a developer makes goes through the standard software lifecycle and is tested well enough before it can make it to production.
Experience in cloud provisioning tools such as Terraform and CloudFormation.
Knowledge of NoSQL databases such as HBase, MongoDB, and Cassandra.
Imported data from different sources like HDFS/HBase into Spark RDDs.
Experienced with serverless AWS services like Lambda, Step Functions, and Glue, and with other AWS services like CloudWatch and Athena.
Skilled in streaming data using Apache Spark and migrating data from Oracle to Hadoop HDFS using Sqoop.
Good understanding of building consumption frameworks on Hadoop, AWS, Azure, or GCP (RESTful services, self-service BI, and analytics).
Experienced in processing large datasets of different forms, including structured, semi-structured, and unstructured data.
As an Azure data engineer, oversaw implementing the ETL process for loading data from various sources into Databricks tables and Azure Synapse tables.
Knowledge of Azure cloud services (PaaS & IaaS), Storage, Data Factory, Data Lake (ADLA & ADLS), Logic Apps, Azure Monitoring, Active Directory, Synapse, Key Vault, and Azure PostgreSQL.
Expertise in the usage of Hadoop and its ecosystem commands.
Expertise in designing tables in Hive, PostgreSQL, and MySQL using Sqoop, and in processing data such as importing and exporting databases to HDFS.
Expertise in designing and developing data marts.
Hands-on experience in setting up workflows using Airflow and Oozie for managing and scheduling Hadoop jobs.
Experienced in handling various file formats such as Avro, Parquet, ASCII, XML, and JSON, and worked on data formats such as CSV, JSON, Parquet, ORC, Text, and Avro files.

Education:
Bachelor's in Computer Science, Malla Reddy Engineering College, Hyderabad, India

Technical Skills:
Programming Languages: Python, SQL, PL/SQL, PostgreSQL, Shell scripts, Java, Scala, Unix
Scripting Languages: JavaScript, Python, Shell Script
Web Servers: Apache Tomcat 4.1/5.0
Big Data Tools: Hadoop, Apache Spark, MapReduce, Flink, PySpark, Hive, YARN, Kafka, Flume, Oozie, Airflow, Zookeeper, Sqoop, HBase
Cloud Services: Amazon Web Services (AWS) - AWS Glue, S3, Redshift, EC2, EMR, DynamoDB, Data Lake, AWS Lambda, CloudWatch; Azure - HDInsight, Azure SQL Data Warehouse; GCP - BigQuery, Cloud Composer, Airflow, Cloud SQL, Cloud Storage, Cloud Functions, Dataflow
ETL/Data Warehouse Tools: Informatica, Talend, DataStage, Power BI, Tableau
Version Control & Containerization Tools: SVN, Git, Bitbucket, Docker, Jenkins, CVS, CodeCommit, GitHub, Apache Log4j, TOAD, ANT, Maven, JUnit, JMock, Mockito, REST HTTP Client, JMeter, Cucumber, Aginity
Databases: Oracle, MySQL, MongoDB, DB2
Operating Systems: Ubuntu, Windows, Mac OS
Methodologies: Agile/Scrum, Traditional Waterfall

Professional Experience

Humana, TX | Mar 2023 - Present
Role: GCP Data Engineer
Responsibilities:
Working in an Agile/Scrum environment, providing support on implemented projects.
Using Python to transfer data from on-premises clusters to Google Cloud Platform (GCP).
Explores and identifies potential solutions for business problems within the domain, leveraging analytics, big data analytics, and automation techniques.
Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
Worked on a POC to set up a cloud data warehouse on BigQuery (GCP).
Loaded data into Spark DataFrames and used Spark SQL and native core Scala to explore data insights (a minimal PySpark sketch follows below).
Design and implement ETL and data movement solutions using Azure Data Factory and SSIS, including on-premises data migration (Oracle/SQL Server/DB2) to Azure Data Lake Store via Azure Data Factory.
Worked on GCS (Google Cloud Storage), BigQuery, Dataflow, Dataproc, and other key frameworks.
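Below is a minimal, illustrative PySpark sketch of the kind of multi-format ingestion and Spark SQL exploration referenced above; the file paths, column names, and view name are hypothetical placeholders rather than actual project code.

# Illustrative sketch only: paths and columns are hypothetical, not project code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format_exploration").getOrCreate()

# Read a few of the common formats (CSV, JSON, Parquet) into DataFrames
csv_df = spark.read.option("header", "true").csv("hdfs:///data/raw/events_csv/")
json_df = spark.read.json("hdfs:///data/raw/events_json/")
parquet_df = spark.read.parquet("hdfs:///data/curated/events_parquet/")

# Register a temporary view and explore the data with Spark SQL
parquet_df.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_type
    ORDER BY event_count DESC
""").show()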
Worked on NoSQL databases like MongoDB and DocumentDB, and graph databases like Neo4j.
Worked on implementing data quality checks using Spark Streaming and assigned pass/fail flags to the data.
Using SonarQube for continuous inspection of code quality and to perform automatic reviews of code to detect bugs.
Managing AWS infrastructure and automation with the CLI and API.
Creating AWS Lambda functions using Python for deployment management in AWS; designed, investigated, and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.
Creating different AWS Lambda functions and API Gateways to submit data via an API Gateway that is accessible via a Lambda function.
Assists in the creation of business cases and recommendations while taking ownership of project activities and tasks assigned by others.
Actively supports and implements process updates and changes to resolve business issues.
Led the team in developing real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
Designed and created a logical data warehouse with Matillion and Redshift models to support strategic decisions.
Experience in working with databases like MongoDB, PostgreSQL, and Cassandra.
Very keen on learning the newer technology stack that Google Cloud Platform (GCP) adds.
Strong MySQL and MongoDB administration skills on Unix, Linux, and Windows.
Design and develop ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/Text files) into AWS Redshift (an illustrative Glue job sketch follows below).
Automate Datadog dashboards with the stack through Terraform scripts.
Developed file-cleaning utilities using Python libraries.
Utilized Python libraries like Boto3 and NumPy for AWS.
Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.
Data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
Created external tables with partitions using Hive, AWS Athena, and Redshift.
Developed the PySpark code for AWS Glue jobs and for EMR.
Good understanding of other AWS services like S3, EC2, IAM, and RDS.
Experience with orchestration and data pipelines using AWS Step Functions, Data Pipeline, and Glue.
Experience in upgrading different databases and also migrating data among multiple databases.
Experience in analyzing and visualizing data along with data modeling.
Experience in managing a large shared MongoDB cluster.
Developing data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
Built pipelines in ADF using datasets/linked services/pipelines to extract, load, and transform data from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, write-back tool, and backwards.
Involved in a Teradata application to Google Cloud Platform (BigQuery) migration POC.
Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.
Efficiently managed and processed massive volumes of data, utilizing Spark for processing and calculating key metrics.
Actively engaged in all phases of pipeline development, from conceptual design to implementation, and successfully deployed it into production.
Developed stored procedure, lookup, execute pipeline, data flow, copy data, and Azure Function activities in ADF.
Designed and implemented multiple dashboards for internal metrics using Azure Synapse - PowerPivot & Power Query tools.
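As a rough illustration of the AWS Glue PySpark ETL pattern described above (an S3 source crawled into the Glue Data Catalog, a light column mapping, and a curated output), here is a minimal job skeleton. The database name campaign_db, table campaign_raw, column mappings, and output S3 path are hypothetical placeholders; the actual jobs and Redshift targets are not reproduced here, and this sketch writes curated Parquet to S3 rather than directly to Redshift.

# Minimal AWS Glue job sketch (PySpark); all names and paths are hypothetical.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled S3 data through the Glue Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="campaign_db", table_name="campaign_raw"
)

# Light transformation: keep and type the columns needed downstream
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("campaign_id", "string", "campaign_id", "string"),
        ("spend", "double", "spend", "double"),
    ],
)

# Write curated output as Parquet to S3 for downstream loading
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/campaign_curated/"},
    format="parquet",
)
job.commit()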
Configured Spark Streaming to receive real-time data from Kafka and used backpressure to control message queuing in the topic.
Used Data Factory to develop pipelines and performed batch processing using Azure Batch.
Designed and developed pipelines to move data from Azure Blob Storage/file shares to PostgreSQL and SQL Data Warehouse.
Developed Spark applications in Databricks using PySpark and Spark SQL to perform transformations and aggregations on source data before loading it into Azure Synapse Analytics for reporting.
Design and support ETL programs, create the integration testing strategy for ETL jobs, and participate in integration testing in development/test environments.
Involved in building the ETL architecture and source-to-target mappings to load data into the data warehouse.
Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to transform and serialize raw data.
Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
Worked on GCP for data migration from an Oracle database to GCP.
Creating Spark clusters and configuring high-concurrency clusters using GCP to speed up the preparation of high-quality data.
Worked on Databricks to write scripts in PySpark, Python, and PostgreSQL and to integrate Databricks with GCP.
Developed Spark code using Scala and Spark SQL/Streaming for faster data processing.
Worked on the Product Key Life Cycle analysis project and published it to the Power BI Service.
Designed both 3NF data models for OLTP systems and dimensional data models using Snowflake schemas.
Contributes to the data strategy by understanding, articulating, and applying its principles to routine business problems, which are typically confined to a single function.
Implemented row-level security in Power BI to manage data privacy across user levels.
Works with data transformation and integration, extracting data from specified databases.
Designed and developed pipelines to move data from Azure Blob Storage/file shares to SQL Data Warehouse.
Creates relevant data pipelines and transforms data to suit the problem at hand using suitable techniques.
Stays updated on current trends in data science and analytics.
Loaded the processed data into the final centralized DWH for further Power BI reporting.
Environment: AWS Glue, S3, Graph DB, IAM, EC2, RDS, Flink, Redshift, Data Warehouse, GCP, Lambda, Boto3, Terraform, DynamoDB, Apache Spark, Kinesis, Athena, Hive, Sqoop, Python.

AmerisourceBergen, Carrollton, TX | Jun 2021 - Mar 2023
Role: Data Engineer
Responsibilities:
Working with product teams to create various store-level metrics and supporting data pipelines written in GCP's big data stack.
Normalized the data according to business needs, including data cleansing, modifying data types, and various transformations using Spark, Scala, and GCP Dataproc.
Worked on partitioning and clustering high-volume tables on fields in BigQuery to make queries more efficient (an illustrative sketch follows below).
Worked on implementing scalable infrastructure and a platform for large amounts of data ingestion, aggregation, integration, and analytics in Hadoop using Spark and Hive.
Experience with creating scripts for data modeling and data import and export.
Extensive experience in deploying, managing, and developing MongoDB clusters.
Experience in creating JavaScript for DML operations with MongoDB.
Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL.
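The BigQuery partitioning and clustering mentioned above can be sketched with the google-cloud-bigquery Python client as follows; the project, dataset, table, schema fields, and clustering column are hypothetical placeholders, not the production definitions.

# Illustrative sketch: create a date-partitioned, clustered BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project id

table = bigquery.Table(
    "example-project.analytics.store_metrics",  # hypothetical dataset.table
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("store_id", "STRING"),
        bigquery.SchemaField("sales", "FLOAT"),
    ],
)
# Partition by day on event_date and cluster on store_id so that
# date-bounded, per-store queries scan less data.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
table.clustering_fields = ["store_id"]

client.create_table(table)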
Architect and implement medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
Design and implement migration strategies for traditional systems on Azure (lift and shift/Azure Migrate and other third-party tools).
Creation, configuration, and monitoring of shard sets; analysis of the data to be sharded and selection of a shard key to distribute data evenly.
Architecture and capacity planning for MongoDB clusters.
Implemented scripts for MongoDB import, export, dump, and restore.
Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
Created multiple databases with sharded collections, choosing shard keys based on the requirements.
Experience in managing the MongoDB environment from availability, performance, and scalability perspectives.
Designed and developed data warehouse models using the Snowflake schema.
Uploaded and downloaded data to and from Cloud Storage using the command-line tools and client libraries.
Can work in both GCP and Azure clouds in parallel, coherently.
Developed PySpark scripts to handle the migration of large volumes of data, ensuring minimal downtime and optimal performance; analyzed the SQL scripts and designed solutions to implement them using PySpark.
Worked on querying data using Spark SQL on top of PySpark engine jobs to perform data cleansing and validation, applied transformations, and executed the programs using the Python API.
Process and load bounded and unbounded data from a Google Pub/Sub topic to BigQuery using Cloud Dataflow with Python.
Developed a common Flink module for serializing and deserializing Avro data by applying a schema.
Indexed processed data and created dashboards and alerts in Splunk to be utilized and acted on by support teams.
Implemented a layered architecture for Hadoop to modularize the design; developed framework scripts to enable quick development.
Designed reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs.
Standardized error handling, logging, and metadata management processes.
Worked on partitions of Pub/Sub messages and setting up replication factors.
Developed T-SQL queries, stored procedures, user-defined functions, and built-in functions.
Migrated previously written cron jobs to Airflow/Composer in GCP and wrote Python DAGs in Airflow to orchestrate end-to-end data pipelines for multiple applications (a minimal DAG sketch follows below).
Used windowing functions to order data and remove duplicates in source data before loading to the data mart for better performance.
Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and generating reports.
Created Hive tables using HiveQL, loaded the data into the Hive tables, and analyzed the data by developing Hive queries.
Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract data, and Zookeeper to provide coordination services to the cluster.
Used SFTP to generate detailed logs for file transfers, user activities, and file access.
Worked on NoSQL databases such as HBase and integrated them with PySpark for processing and persisting real-time streaming data.
Used Talend to load transformed data into BigQuery and implemented data quality checks and data governance rules.
Developed Power Pivot/SSRS (SQL Server Reporting Services) reports and added logos, pie charts, and bar graphs for display purposes as per business needs.
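A minimal Airflow/Cloud Composer DAG sketch of the cron-to-Airflow orchestration pattern described above; the DAG id, schedule, and task callables are hypothetical placeholders rather than the actual pipelines.

# Illustrative Airflow 2.x DAG sketch; names and schedule are hypothetical.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder: pull source data (e.g. from Cloud SQL or GCS)
    pass

def load_to_bigquery():
    # Placeholder: load transformed data into a BigQuery dataset
    pass

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_store_metrics",        # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",       # replaces the former cron entry
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

    # Simple linear dependency: extract, then load
    extract_task >> load_task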
Worked on data serialization formats for converting complex objects into sequences of bits using Parquet, Avro, JSON, and CSV formats.
Developed internal dashboards for the team using Power BI tools for tracking daily tasks.
Environment: GCP Console, Cloud Storage, BigQuery, Dataproc, Spark, MongoDB, Data Warehouse, Hadoop, Hive, Scala, Cloud SQL, Shell Scripting, SQL Server 2016/2012, T-SQL, SSIS, Visual Studio, Power BI, PowerShell, Oracle, Teradata, Airflow, Git, Docker.

7-Eleven, TX | Jan 2021 - Jun 2021
Role: Data Engineer
Responsibilities:
Performed Spark streaming and batch processing using Scala.
Used Hive in Spark for data cleansing and transformation.
Used Scala and Kafka to create data pipelines for structuring, processing, and transforming the given data.
Responsible for building scalable distributed data solutions using an EMR cluster environment with Amazon EMR 5.6.1.
Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
Experience in data integration and modeling.
Implemented performance testing using Apache JMeter and created a dashboard using Grafana to view the results.
Participated in creating state-of-the-art data and analytics-driven solutions, developing and deploying cutting-edge scalable algorithms, and working across GE to drive business analytics to a new level of predictive analytics while leveraging big data tools and technologies.
Develop real-time data feeds and microservices leveraging AWS Kinesis, Lambda, Kafka, Spark Streaming, etc., to enhance useful analytic opportunities and influence customer content and experiences.
Identify and utilize existing tools and algorithms from multiple sources to enhance confidence in the assessment of various targets.
Worked on ETL testing and used the SSIS Tester automated tool for unit and integration testing.
Designed and created an SSIS/ETL framework from the ground up.
Used the AWS Glue catalog with a crawler to get the data from S3 and performed SQL query operations.
Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
Provisioned clusters on AWS using Docker, Ansible, and Terraform.
Strong knowledge of various data warehousing methodologies and data modeling concepts, with hands-on modeling experience.
Heavily involved in testing Snowflake to understand the best possible way to use the cloud resources.
Performed efficient load and transform Spark code using Python and Spark SQL.
To meet specific business requirements, wrote UDFs in Scala and PySpark.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
Managed huge volumes of structured, semi-structured, and unstructured data.
Used Oozie to create big data workflows for ingesting data from various sources into Hadoop.
Developed Spark jobs using Scala for better data processing and used Spark SQL for querying.
Environment: HDFS, Spark, Scala, PySpark, ADF, Kafka, AWS, Pig, SBT, Sqoop, Maven, Zookeeper.

USAA, TX | Mar 2019 - Dec 2020
Role: Data Engineer
Responsibilities:
Designed and built a custom and generic ETL framework Spark application using Scala.
Handled data transformations based on the requirements.
Configured Spark jobs for weekly and monthly executions using Amazon Data Pipeline.
Handled application log data by creating custom loggers.
Created an error-reprocessing framework to handle errors during subsequent loads.
Queried Cassandra tables using Zeppelin.
Executed queries using Spark SQL for complex joins and data validation (an illustrative sketch follows below).
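An illustrative PySpark/Spark SQL sketch of the join-plus-validation pattern mentioned in the bullet above; the table names, columns, sample rows, and validation rule are hypothetical placeholders.

# Self-contained Spark SQL sketch: join two views and flag unmatched records.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join_validation").getOrCreate()

# Hypothetical sample data standing in for real source tables
claims = spark.createDataFrame(
    [("C1", "M100", 250.0), ("C2", "M999", 75.5)],
    ["claim_id", "member_id", "amount"],
)
members = spark.createDataFrame(
    [("M100", "TX")],
    ["member_id", "state"],
)
claims.createOrReplaceTempView("claims")
members.createOrReplaceTempView("members")

# Left join and flag claims whose member_id has no match in the reference view
spark.sql("""
    SELECT c.claim_id,
           c.member_id,
           c.amount,
           CASE WHEN m.member_id IS NULL THEN 'FAIL' ELSE 'PASS' END AS member_check
    FROM claims c
    LEFT JOIN members m
      ON c.member_id = m.member_id
""").show()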
Handled complex transformation logic.
Developed complex transformations and mapplets using Informatica to extract, transform, and load data into data marts, the Enterprise Data Warehouse (EDW), and the Operational Data Store (ODS).
Created an SSIS package to get the dynamic source file name using a ForEach Loop container.
Used the Lookup, Merge, Data Conversion, Sort, and other data flow transformations in SSIS.
Created independent components for AWS S3 connections and extracted data into Redshift.
Involved in writing Scala scripts to extract data from Cassandra Operational Data Store tables for comparison with legacy system data.
Worked on a data ingestion file validation component covering threshold levels, last-modified timestamps, and checksums.
Environment: Spark, Scala, AWS, DBeaver, Zeppelin, S3, Cassandra, Alteryx 11, Workspace, Shell scripting.

Nevonex Solutions, India | Jan 2014 - Dec 2018
Role: Data Engineer
Responsibilities:
Analyzed the requirements provided by the client and developed a detailed design with the team.
Worked with the client team to confirm the design and modified it based on the changes mentioned.
Involved in extracting and exporting data from DB2 into AWS for analysis, visualization, and report generation.
Created HBase tables and columns to store the user event data.
Used Hive and Impala to query the data in HBase.
Developed and implemented core API services using Scala and Spark.
Managed querying the data frames using Spark SQL.
Used Spark data frames to migrate data from AWS to MySQL.
Built a continuous ETL pipeline using Kafka, Spark Streaming, and HDFS.
Performed ETL on data from various file formats (JSON, Parquet, and database).
Performed complex data transformations using Scala in Spark.
Converted SQL queries to Spark transformations using Spark RDDs and Scala.
Worked on importing real-time data to Hadoop using Kafka and implemented Oozie jobs.
Collected log data from web servers and exported it to HDFS.
Involved in defining job flows, management, and log file reviews.
Installed Oozie workflows to run Spark and Pig jobs simultaneously.
Created Hive tables to store the data in table format.
Environment: Spark, Scala, HDFS, SQL, Oozie, Sqoop, Zookeeper, MySQL, HBase.

Keywords: continuous integration, continuous deployment, business intelligence, S3, database, information technology, Golang, procedural language, Texas |