
Priyadarshini Thota - Data Engineer
Phone: (610) 787-7015 | Email: [email protected]
Location: Dallas, Texas, USA | Relocation: Yes | Visa: H1B
-----------------------------------------------------------------------------------------------------------------------------
SUMMARY
Around 8+ years of strong experience across the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
Strong interpersonal skills and the ability to work independently or within a group; quick to learn and adaptable to new working environments.
Hands-on experience with the Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (Airflow as a service).
Solid experience implementing large-scale data warehousing programs and end-to-end data integration solutions on Snowflake Cloud, AWS Redshift, Informatica Intelligent Cloud Services (IICS - CDI), and Informatica PowerCenter, integrated with multiple relational databases (MySQL, Teradata, Oracle, Sybase, SQL Server). Extensive knowledge of writing Hadoop jobs for data analysis per business requirements.
Expertise with major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
Developed and deployed outcomes using Spark and Scala code on Hadoop clusters running on GCP.
Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables. Experienced with various Hadoop distributions (Cloudera, Hortonworks, MapR, and Amazon EMR), fully implementing and leveraging new features.
Developed custom Kafka producers and consumers for publishing to and subscribing from Kafka topics. Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; read multiple data formats on HDFS using Scala. Experience writing complex SQL queries and creating reports and dashboards.
Experience building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP, while coordinating tasks across the team.
Experienced with Teradata utilities (FastLoad, MultiLoad, BTEQ scripting, FastExport, SQL Assistant) and tuning Teradata queries using explain plans. Worked on dimensional data modeling with Star and Snowflake schemas and Slowly Changing Dimensions (SCD). Expertise with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
Designed, developed, and deployed data lakes, data marts, and data warehouses on AWS using S3, RDS, Lambda, Glue, EMR, Step Functions, CloudWatch Events, SNS, Redshift, and IAM, provisioned with Terraform, and applied warehousing best practices on AWS Redshift.
Experience using the Stackdriver service and Dataproc clusters in GCP to access logs for debugging.
Designed, developed, and deployed data lakes, data marts, and data warehouses on Azure using ADLS Gen2, Blob Storage, Azure Data Factory, Databricks, Azure Synapse, Key Vault, and Event Hubs.
Proficient with the UNIX command-line interface and experienced with ETL tools such as Informatica.
Strong experience using PySpark, HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, and HBase.
Built data pipelines in Airflow on GCP for ETL jobs using a mix of older and newer Airflow operators (a sketch follows this summary). Experience developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.
Experience moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop. Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
Involved in creating Hive tables, loading them with data, and writing ad-hoc Hive queries that run internally on MapReduce and Tez. Replaced existing MR jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing. Experience developing Kafka producers and consumers for streaming millions of events per second, and integrated Kafka with Spark Streaming for real-time data processing.
Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server databases.
Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery, as well as Azure Data Factory and Databricks.
Experience in building efficient pipelines for moving data between GCP and Azure using Azure Data Factory.
Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of the big data ecosystem, including Hadoop 2.0 (HDFS, Hive, Pig, Impala) and Spark (Spark SQL, Spark MLlib, Spark Streaming).
Experienced in writing complex SQL, including stored procedures, triggers, joins, and subqueries.
Interpret and solve business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
Expertise in configuring the monitoring and alerting tools according to the requirement like AWS CloudWatch.
Built and supported large-scale Hadoop environments, including design, configuration, installation, performance tuning, and monitoring. Experienced in dimensional modeling (Star and Snowflake schemas), transactional modeling, and Slowly Changing Dimensions (SCD).
Developed multi-cloud strategies to make better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings).
Extensive experience loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala) and NoSQL databases such as MongoDB, HBase, and Cassandra.
Skilled in data parsing, manipulation, and preparation, including describing data contents. Experienced with Hadoop ecosystem and big data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
Strong experience in the analysis, design, development, testing, and implementation of business intelligence solutions using data warehouse/data mart design, ETL, BI, and client/server applications, including writing ETL scripts with regular expressions and tools such as Informatica, Pentaho, and SyncSort.
Expert in designing Server jobs using various types of stages like Sequential file, ODBC, Hashed file, Aggregator, Transformer, Sort, Link Partitioner and Link Collector.
Proficient in big data practices and technologies such as HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Spark, and Kafka. Developed web-based applications using Python, Django, Qt, C++, XML, CSS3, HTML5, DHTML, JavaScript, and jQuery. Extensive use of the Cloud Shell SDK in GCP to configure and deploy the Dataproc, Cloud Storage, and BigQuery services.
Experience building Power BI reports on Azure Analysis Services for better performance compared to DirectQuery against GCP BigQuery.
Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards with tools like Tableau. Experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, PivotTables, and OLAP reporting.
Experienced with JSON-based RESTful web services and XML/QML-based SOAP web services; worked on various applications using Python IDEs such as Sublime Text and PyCharm.
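
The summary above mentions building Airflow pipelines on GCP around Dataproc and BigQuery; below is a minimal illustrative sketch of that kind of DAG, assuming Cloud Composer with the Google provider package installed. The project, region, bucket, cluster, topic of the job, and table names are hypothetical placeholders, not the actual project code.

# Illustrative Airflow DAG (hypothetical names and paths), assuming Cloud Composer.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

PROJECT_ID = "my-gcp-project"   # hypothetical
REGION = "us-central1"          # hypothetical

with DAG(
    dag_id="daily_claims_etl",              # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run a PySpark transformation on an existing Dataproc cluster.
    transform = DataprocSubmitJobOperator(
        task_id="spark_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "etl-cluster"},   # hypothetical cluster
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
        },
    )

    # Load the transformed Parquet output from GCS into BigQuery.
    load = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="my-bucket",                                  # hypothetical bucket
        source_objects=["curated/claims/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table=f"{PROJECT_ID}.analytics.claims",
        write_disposition="WRITE_TRUNCATE",
    )

    transform >> load

The same pattern extends to the "older" operators mentioned above (for example BashOperator-based jobs) by swapping the task definitions while keeping the DAG scheduling and dependencies unchanged.
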
EDUCATION
Bachelor's degree
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, Yarn, Apache Spark, Mahout, Sparklib
Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata.
Programming: Python, PySpark, Scala, Shell script, Perl script, SQL
Cloud Technologies: AWS, Microsoft Azure
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman
Versioning tools: SVN, Git, GitHub
Operating Systems: Windows 7/8/XP/2008/2012, Ubuntu Linux, MacOS
Network Security: Kerberos
Database Modeling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
Monitoring Tool: Control-M, Oozie
PROFESSIONAL EXPERIENCE
Capital One Financial Corporation, McLean, Virginia Sep 2021 - Present
Sr. Data Engineer
Responsibilities:-
Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
Developed a PySpark data ingestion framework to ingest source claims data into Hive tables, performing data cleansing, aggregations, and de-dup logic to identify the updated and latest records (see the sketch at the end of this section).
Integrated big data Spark jobs with EMR and Glue to create ETL jobs processing around 450 GB of data daily.
Involved in creating an end-to-end data pipeline in a distributed environment using big data tools, the Spark framework, and Tableau for data visualization. Created Terraform modules and resources to deploy AWS services.
Worked on developing CloudFormation templates (CFTs) for migrating infrastructure from lower environments to higher environments.
Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency.
Involved in migrating an on-premises Hadoop system to GCP (Google Cloud Platform).
Wrote scripts in Hive SQL and Presto SQL, using Python plugins for both Spark and Presto, to create complex tables with performance features such as partitioning, clustering, and skew handling.
Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.
Created a Python topology script to generate the CloudFormation template for provisioning the EMR cluster in AWS. Keen on keeping up with the newer technology stack that Google Cloud Platform (GCP) adds.
Designed, developed, and deployed data lakes, data marts, and a data warehouse using AWS Redshift and Terraform. Worked on processing batch and real-time data using Spark with Scala.
Designed, developed, and deployed ETL pipelines using services such as Lambda, Glue, EMR, Step Functions, CloudWatch Events, SNS, Redshift, S3, and IAM. Migrated previously written cron jobs to Airflow/Composer in GCP.
Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple data file formats to uncover insights into the customer usage patterns.
Designed and developed ETL pipelines and dashboards using Step Function, Lambda, Glue and Quick Sight.
Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning. Wrote Spark programs in Scala for data quality checks.
Ingested data into the Cargill data lake from different sources and performed transformations in the data lake with Spark/Scala per business requirements. Developed multiple ETL pipelines to deliver data to stakeholders.
Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively for design and development work.
Worked on building a centralized data lake on AWS Cloud utilizing primary services like S3, EMR, Redshift, and Athena. Worked extensively on fine-tuning Spark applications and optimizing SQL queries.
Built data ingestion pipelines and moved terabytes of data from existing data warehouses to the cloud, scheduled through EMR, S3, and Spark. Involved in setting up the Apache Airflow service in GCP.
Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
Developed PySpark-based pipelines using Spark DataFrame operations to load data to the EDL, using EMR for job execution and S3 as the storage layer. Developed UDFs to standardize the entire dataset.
Created a full spectrum of data engineering pipelines: data ingestion, data transformations, and data consumption.
Worked closely with business and data science teams and ensured all requirements were translated accurately into our data pipelines. Developed CI/CD pipelines and built the required Docker images for the pipelines.
Developed an ETL application using Spark, Scala, and Java on EMR to process/transform files and loaded them into S3. Queried and ran analysis over processed Analytics data using Athena.
Able to work in parallel across both the GCP and Azure clouds coherently.
Improved the performance of the pipelines further using Apache Spark and Scala with batch and stream processing of the data based on the requirement. Used Spark and Kafka for building batch and streaming pipelines.
Automated the data flow and data validations on input and output data to simplify testing, using shell scripting and SQL. Built POCs using Scala, Spark SQL, and the MLlib libraries.
Used AWS services such as Lambda, Glue, EMR, EC2, and EKS for data processing. Developed data marts, data lakes, and a data warehouse, migrating an entire Oracle database to BigQuery and using Power BI for reporting.
Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Experience building Power BI reports on Azure Analysis Services for better performance.
Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts of enterprise data from BigQuery. Experience moving data between GCP and Azure using Azure Data Factory.
Worked on end-to-end development of the ingestion framework using Glue, IAM, CloudFormation, Athena, and REST APIs. Worked on the core masking logic, creating the mask utility using SHA-2 (see the sketch at the end of this section).
Used Redshift to store the hashed and unhashed values for each PII attribute and to map each user to their email ID or oxygen ID, which is unique per user.
Converted the existing datasets into GDPR-compliant datasets and pushed them to production.
Worked on the successful rollout of the project to users in production (PRD) so that the company is fully GDPR compliant. Evaluated and implemented a next-generation AWS serverless architecture.
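
As referenced in the bullets above on the PySpark ingestion framework and the SHA-2 mask utility, here is a minimal PySpark sketch of that general pattern: keep the latest record per key, then mask a PII column with SHA-256. The paths, key, column, and table names are hypothetical and only illustrate the approach, not the actual Capital One code.

# Minimal PySpark sketch: de-dup to the latest record per key, then SHA-2 masking.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("claims_ingestion_sketch").getOrCreate()

raw = spark.read.parquet("s3://my-bucket/raw/claims/")   # hypothetical input path

# Keep only the latest version of each claim based on the update timestamp.
w = Window.partitionBy("claim_id").orderBy(F.col("updated_at").desc())
latest = (
    raw.withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)
       .drop("rn")
)

# Mask a PII attribute with SHA-256 while retaining the hash for lookups.
masked = latest.withColumn("email_hash", F.sha2(F.col("email"), 256)).drop("email")

# Write into a partitioned Hive table for downstream consumption.
(masked.write.mode("overwrite")
       .format("parquet")
       .partitionBy("ingest_date")
       .saveAsTable("claims_db.claims_curated"))
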

Environment: S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Scala, Apache Pig, Java, SSRS, Tableau.
U.S. Bank, Minneapolis, Minnesota Dec 2019 - Aug 2021
Data Engineer
Responsibilities:-
Worked on multiple modules: HCM global integration with different regions and the ONE CRM Salesforce Cloud.
Analyzed and developed data integration templates to extract, cleanse, transform, integrate, and load data to data marts for user consumption. Reviewed code against standards and checklists.
Took a DevOps role converting existing AWS infrastructure to a serverless architecture (Lambda, Kinesis) deployed via CloudFormation. Involved in gathering and analyzing requirements and preparing business requirement documents.
In-depth knowledge of Hadoop architecture and its components like YARN, HDFS, Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker, and Map Reduce programming paradigm.
Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
Profound experience in performing Data Ingestion, Data Processing (Transformations, enrichment, and aggregations). Strong Knowledge of the Architecture of Distributed systems and parallel processing, In-depth understanding of MapReduce programming paradigm and Spark execution framework.
Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked explicitly with PySpark and Scala (see the sketch at the end of this section). Experience using Tableau to create dashboards and quality storytelling.
Handled ingestion of data from different sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the results into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables. Experience importing streaming data into HDFS using Flume sources and sinks and transforming the data with Flume interceptors.
Create High & Low-level design documents for the various modules. Review the design to ensure adherence to standards, templates, and corporate guidelines. Validate design specifications against the results from proof of concept and technical considerations. Deployed the application using Docker
Coordinated with the application support team and helped them understand the business and the components required for the integration, extraction, transformation, and load of data.
Performed analysis of the existing source systems, understood the Informatica/ETL/SQL/Unix-based applications, and provided the services required for developing and maintaining those applications.
Create a Deployment document for the developed code and provide support during the code migration phase.
Create an Initial Unit Test Plan to demonstrate that the software, scripts, and databases developed conform to the Design Document. Provides support during the integration testing and User Acceptance phase of the project. Also, provide hyper-care support post-deployment.
Performed analysis of the existing source systems, understood the Informatica/Teradata-based applications, and provided the services required for developing and maintaining those applications.
Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Functions, Cloud DNS, Cloud Storage, and Cloud Deployment Manager, applying the SaaS, PaaS, and IaaS concepts of cloud computing and their implementation on GCP.
Strong working experience with Python libraries such as NumPy and SciPy for mathematical calculations.
Experienced in writing and optimizing queries in Oracle, SQL Server, MongoDB, PostgreSQL, and Teradata.
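
As a companion to the Spark DataFrame performance work mentioned above, the following is a small hypothetical PySpark sketch of that kind of tuning (a broadcast join plus caching a reused DataFrame); the paths and column names are illustrative assumptions, not the project's actual code.

# Small PySpark tuning sketch: broadcast a small dimension table and cache reuse.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning_sketch").getOrCreate()

transactions = spark.read.parquet("hdfs:///data/transactions/")   # large fact data (hypothetical)
branches = spark.read.parquet("hdfs:///data/branches/")           # small dimension (hypothetical)

# Broadcast the small dimension table to avoid shuffling the large fact table.
enriched = transactions.join(F.broadcast(branches), on="branch_id", how="left")

# Cache the enriched DataFrame because it feeds several aggregations downstream.
enriched.cache()

daily_totals = enriched.groupBy("branch_region", "txn_date").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("txn_count"),
)
daily_totals.write.mode("overwrite").parquet("hdfs:///curated/daily_totals/")
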

Environment: Informatica 10.1.1, Oracle, SQL Server, Unix, flat files, Autosys, web services, HCM Oracle Fusion, SoapUI, Salesforce Cloud, Oracle MDM, ESB.
Cyient, India Feb 2017 - May 2019
Hadoop Developer
Responsibilities:-
Developed Spark Applications to implement various data cleansing/validation and processing activity of large-scale datasets ingested from traditional data warehouse systems. Used Jenkins for Continuous integration.
Developed custom Kafka producers to write the streaming messages from external Rest applications to Kafka topics.
Developed Spark streaming applications to consume streaming JSON messages from Kafka topics (see the sketch at the end of this section).
Worked with the Spark for improving performance and optimization of the existing transformations.
Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
Used Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that receives data from Kafka in near real time and persists it to Snowflake.
Worked and learned a great deal from AWS Cloud services like EMR, S3, RDS, Redshift, Athena, and Glue.
Created end-to-end Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities on user behavioral data. Worked with the Log4j framework for logging debug, info, and error data.
Developed a custom input adaptor utilizing the HDFS FileSystem API to ingest clickstream log files from an FTP server to HDFS. Developed an end-to-end data pipeline using the FTP adaptor, Spark, Hive, and Impala.
Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
Handled importing other enterprise data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HBase tables.
Exported the analyzed data to relational databases using Sqoop for further visualization and report generation by the BI team. Used Jira for bug tracking and Git to check in and check out code changes.
Migrated existing on-premises data pipelines and worked on automating the provisioning of EMR clusters.
Expertise in Hive optimization techniques such as partitioning and bucketing across different data formats.
Used HiveQL to analyze the partitioned and bucketed data and executed Hive queries on Parquet tables to perform data analysis meeting the business specification logic. Worked with both batch and real-time streaming data sources. Developed data transformation jobs using Spark DataFrames to flatten JSON documents to CSV.
Collected JSON data from an HTTP source and developed Spark APIs that perform inserts and updates in Hive tables. Experience using Avro, Parquet, ORC, and JSON file formats; developed UDFs in Hive.
Generated various kinds of reports using Tableau based on client specification.
Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.
Worked with the Scrum team to deliver agreed user stories on time for every sprint.
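
The streaming bullets above (consuming JSON messages from Kafka and flattening JSON documents with Spark DataFrames) can be illustrated with the following minimal Structured Streaming sketch. The broker address, topic, schema, and output paths are hypothetical; the original applications here were written in Scala, and PySpark is used only for consistency with the other sketches.

# Minimal PySpark Structured Streaming sketch: Kafka JSON in, flattened CSV out.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_stream_sketch").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("payload", StructType([
        StructField("action", StringType()),
        StructField("value", DoubleType()),
    ])),
])

# Read the raw Kafka stream; the message body arrives in the binary `value` column.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "events")                       # hypothetical topic
          .load())

# Parse the JSON and flatten the nested payload into top-level columns.
flat = (stream.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.event_id", "e.user_id",
                F.col("e.payload.action").alias("action"),
                F.col("e.payload.value").alias("value")))

# Append the flattened records as CSV files, with checkpointing for recovery.
query = (flat.writeStream.format("csv")
         .option("path", "hdfs:///curated/events_csv/")
         .option("checkpointLocation", "hdfs:///checkpoints/events_csv/")
         .outputMode("append")
         .start())
query.awaitTermination()
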

Environment: Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, Flume, Apache Oozie, Zookeeper, ETL, UDF, MapReduce, Snowflake, Apache Pig, Python, Java, SSRS.
Evergent Technologies India Limited, India Aug 2014 - Jan 2017
Data Analyst
Responsibilities:-
Collaborated with my manager to gather the information needed for data analysis and databases, and analyzed the raw data. Developed content involving data manipulation, visualization, machine learning, and SQL.
Designed and implemented predictive models using Natural Language Processing techniques and machine learning algorithms such as linear, logistic, and multivariate regression, random forests, k-means clustering, KNN, and PCA for data analysis. Created JCL scripts to direct the mainframe OS in scheduling batch jobs.
Involved in all aspects like data collection, data cleaning, developing models, visualization.
Maintained large data sets, combining data from various sources such as Excel and SQL queries.
Wrote SQL scripts to select data from the servers, modified the data as needed with Python pandas, and stored it back to different database servers. Published customized reports and dashboards with report scheduling on Tableau Server, and ran Teradata SQL queries using Teradata SQL Assistant.
Created action filters, parameters, and calculated sets for dashboards and worksheets in Tableau.
Performed data cleaning, exploratory analysis, and feature engineering using R. Performed data visualization with Tableau, presented the findings, and improved customer satisfaction. Programmed in Python using packages such as NumPy, pandas, and SciPy. Applied statistical methods such as the chi-square test, hypothesis testing, t-tests, ANOVA, correlation testing, and descriptive statistics (see the sketch at the end of this section).
Implemented classification using supervised algorithms like Decision trees, KNN, Logistic Regression, Naive Bayes.
Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
Understanding and analyzing the data using appropriate statistical models to generate insights.
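
As an illustration of the SQL-to-pandas-to-statistics workflow described above, here is a small hypothetical sketch using pandas and SciPy; the connection string, table, and column names are invented for the example and do not reflect the actual Evergent data.

# Hypothetical pandas + SciPy sketch: pull data with SQL, run a t-test and chi-square test.
import pandas as pd
from scipy import stats
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/analytics")  # hypothetical connection
df = pd.read_sql("SELECT plan_type, churned, monthly_spend FROM subscriptions", engine)

# Two-sample t-test: does monthly spend differ between churned and retained users?
churned = df.loc[df["churned"] == 1, "monthly_spend"]
retained = df.loc[df["churned"] == 0, "monthly_spend"]
t_stat, t_p = stats.ttest_ind(churned, retained, equal_var=False)

# Chi-square test of independence between plan type and churn.
contingency = pd.crosstab(df["plan_type"], df["churned"])
chi2, chi_p, dof, _ = stats.chi2_contingency(contingency)

print(f"t-test: t={t_stat:.2f}, p={t_p:.4f}")
print(f"chi-square: chi2={chi2:.2f}, p={chi_p:.4f} (dof={dof})")
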