Deeksha - Senior Lead Data Engineer
[email protected]
Location: Dallas, Texas, USA
Relocation: Yes
Visa: H1B
Deeksha
[email protected] | 817-678-0872 | Corp to Corp only

Professional Summary
- Qualified Data Engineer/Data Analyst with around 10 years of experience in data engineering and analytics, including data mining and statistical analysis. Experienced in the network and healthcare industries.
- Proficient in data mining techniques such as association, classification, outlier detection, and clustering.
- Understanding and experience of data vault and data warehouse building and maintenance.
- Experienced in data ingestion/ETL using Apache NiFi.
- Proficient in statistical methods such as regression models, hypothesis testing, A/B testing, experiment design, ANOVA, confidence intervals, principal component analysis, and dimensionality reduction.
- Expert in R and Python scripting; worked with statistical functions in NumPy, visualization using Matplotlib/Seaborn, and Pandas for organizing data.
- Experience using various packages in R and Python, such as ggplot2, caret, dplyr, Beautiful Soup, and Rpy2.
- Proficient in Tableau 9.x/10.x and R Shiny data visualization tools to analyze large datasets and create visually powerful, actionable interactive reports and dashboards.
- Experienced in data loads, extracts, statistical analysis, modeling, and data munging.
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMSs such as SQL Server 2008 and NoSQL databases such as MongoDB 3.2.
- Experienced in writing complex SQL objects such as stored procedures, triggers, joins, and subqueries.
- Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
- Excellent understanding of Agile and Scrum development methodology; used version control tools such as Git 2.x.
- Experience implementing data analysis with various analytic tools such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, caret, dplyr), and Excel.
- Involved in the data science project life cycle and actively involved in all phases, including data extraction, data cleaning, statistical modeling, and data visualization, with large sets of structured and unstructured data.
- Collaborated with engineers to deploy successful models and algorithms into production environments.
- Set up storage and data analysis tools on Amazon Web Services, Azure, and GCP cloud computing infrastructure.
- Flexible with Unix/Linux and Windows environments and operating systems.
- Excellent communication skills. Successful in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
- Extensive experience in data management for AI solutions, working alongside data scientists.

Technical Skills
Databases: Oracle 12c, MySQL, SQLite, NoSQL, RDBMS, SQL Server 2014, MongoDB 3.2, Teradata
Programming Languages: R, Python, SQL, Scala, MATLAB, Groovy, Jython
Data Visualization: QlikView, Tableau 9.4/9.2, ggplot2 (R), Power BI, Kibana, Matplotlib
Big Data Framework: HDFS, MapReduce, Hive, Amazon EC2, S3 and Redshift, Spark, Apache NiFi, ADF, Blob Storage, Data Lake, Log Analytics, Informatica, Talend, DataStage, ELK Stack
Technologies/Tools: PyTorch, NumPy, Jupyter, Spyder, RStudio, OpenCV, Jenkins, Kubernetes
Operating Systems: Windows, Linux, Unix
Environment: AWS, Azure, GCP
Work Experience

Charter Communications, Plano, TX | Sep 2022 - Present
Role: Senior Lead Data Engineer
Responsibilities:
- Responsible for transforming the company's data into meaningful and useful information for business purposes.
- Understand business processes and perform analytics to build reports for various departments.
- Define requirements for the internal developers and software development in a structured manner and communicate the solutions to management.
- Understand the company's and customers' different data sources and apply business analytics to overcome challenges.
- Communicate project expectations and status to team members and stakeholders in a clear, concise, and timely fashion.
- Meet with stakeholders to elicit requirements for project requests; gather and write business requirements from business users and stakeholders.
- Write SQL queries such as stored procedures, views, and CTEs to pull datasets for reporting.
- Extract data from several sources such as Excel and SQL Server, transform and validate that data, and load it into the target database for reporting purposes.
- Develop ETL (extract, transform, and load) jobs to facilitate data retrieval using SSIS.
- Worked on MongoDB concepts such as locking, transactions, indexes, sharding, replication, and schema design; created documents in the Mongo database and applied write-concern acknowledgement levels on MongoDB write operations to avoid rollbacks.
- Create and maintain database models.
- Created SSRS / Power BI paginated reports and dashboards, implementing functionality such as drill-through, drill-down list reports, and sub-reports for the business using Power BI Report Builder.
- Created Power BI reports end to end; wrote DAX queries such as CALCULATE and FILTER to manipulate datasets for reporting purposes.
- Published reports and created dashboards in the Power BI service; used different types of transformations in the Power BI Query Editor and scheduled automatic refreshes in the Power BI service.
- Involved in the creation and maintenance of SSAS cubes, perspectives, aggregations, and named sets for reporting.
- Developed a Flask API integrated into the ETL flow for data transformation.
- Experience creating dynamic, variable-driven code that can be reused across different customers.
- Experience adding programmatic validation checks into the code to fail or throw warnings when checks are out of balance during a data transformation or load process.
- Utilized SQL and Python to extract, clean, and transform data from various sources for analysis.
- Documented all database objects, procedures, views, functions, and packages for future reference.
- Used Snowflake SQL and PostgreSQL to develop datasets for the back end of reports and products.
- Designed and implemented ETL pipelines to load API and Kafka data in JSON format into the Snowflake database using Python (see the sketch after this section).
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area; worked on AWS Data Pipeline to configure data loads from S3 into Snowflake.
- Collaborated with multiple departments to develop products and perform analytics, such as working with R&D, sales, and customer service to develop an attrition estimation tool.
- Created API monitoring metrics dashboards using Tableau and Power BI.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from data lake AWS S3 buckets.
- Built data pipelines using AWS Glue and PostgreSQL; utilized data stored in Redshift to build BI visuals for the business.
- Transformed business problems into big data solutions, defined big data strategy and roadmap, and designed the business requirement collection approach based on the project scope and SDLC methodology.
- Develop and maintain data models to support data analysis and reporting using Databricks.
- Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling, and all cleaning and conforming tasks.
- Managed AWS EC2 instances utilizing Auto Scaling, Elastic Load Balancing, and Glacier for QA and UAT environments.
- Extensively used DB links between AWS Aurora and Redshift for handling cross-database queries.
- Worked in Amazon Web Services (AWS) using EC2 for compute and S3 for storage; responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
- Designed, developed, and implemented ETL pipelines using the Python API (PySpark) and the Spark SQL API on AWS EMR.
- Experience working with AWS services such as Lambda, S3, EC2, SNS, and SQS.
- Worked on Spark using Python and Spark SQL for faster testing and processing of data.
- Configured AWS from scratch to set up various EC2 instances for web and application servers.
- Proficient in data visualization using Tableau, including maps, density maps, tree maps, heat maps, Pareto charts, bubble charts, bullet charts, pie charts, bar charts, and line charts.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract and transform data; scheduled jobs in Flows and ADF pipelines, migrating data from different sources to the destination with ADF.
Environment: AWS, S3, EC2, SNS, Snowflake, Python, Tableau, Power BI, PySpark, PostgreSQL
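A minimal sketch of the kind of Python-driven S3-to-Snowflake JSON load described above, using the snowflake-connector-python client; the account, stage, and table names are hypothetical placeholders, not actual project objects.

```python
# Sketch only: load staged JSON files from an S3 external stage into a
# single-VARIANT-column Snowflake table. All names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical account identifier
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)

copy_stmt = """
    COPY INTO STAGING.EVENTS_RAW          -- table with one VARIANT column
    FROM @S3_EVENTS_STAGE                 -- external stage on the S3 bucket
    FILE_FORMAT = (TYPE = 'JSON')
    ON_ERROR = 'CONTINUE'
"""

cur = conn.cursor()
try:
    cur.execute(copy_stmt)
    for row in cur:                       # one status row per loaded file
        print(row)
finally:
    cur.close()
    conn.close()
```

Snowpipe automates the same COPY INTO statement on new-file notifications, which is how the continuous S3 loads mentioned above would typically be wired up.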
AT&T, Plano, TX | Sep 2019 - Sep 2022
Role: Lead Data Engineer
Description: Built an end-to-end real-time data streaming and monitoring platform. Developed microservices for anomaly detection using seasonality. Developed a decision tree classification model.
Responsibilities:
- Collaborated with product management and engineering departments to understand company needs and devise possible solutions.
- Implemented data analysis on large application performance monitoring datasets to recognize patterns.
- Data extraction experience from Cassandra, Oracle DB, Teradata, Vertica, and MySQL for data analysis and modeling.
- Built an end-to-end real-time data streaming and monitoring platform using Apache NiFi; implemented ETL using Apache NiFi.
- Built ML ensemble models to perform data transformation.
- Built CDP deployment platforms using AT&T's CDP ECO platform.
- Experienced in creating indexes, index patterns, and dashboards in Kibana.
- Supported alert monitoring of AT&T's fraud applications.
- Experienced with DevOps tools Jenkins and Kubernetes, and with Bitbucket version control.
- Researched and developed statistical learning models for data analysis.
- Developed a seasonality-based anomaly detection microservice in Python and integrated it with the real-time data ingestion (see the sketch after this section).
- Built and deployed WECO, an R-based package microservice for anomaly detection.
- Coordinated with various technical/functional teams to implement models and monitor results.
- Supported integration testing for the AT&T CMLP platform.
- Created dashboards and interactive visual reports using Power BI and Tableau; used advanced calculations on the datasets, published reports via the Power BI service, managed them through distribution of apps, and monitored usage, data refreshes, and security access.
- Designed cloud-based solutions in Azure by creating Azure SQL databases, setting up elastic pool jobs, and designing tabular models in Azure Analysis Services.
- Experienced in developing and implementing data pipelines using Azure Data Factory for ingestion, transformation, and storage of big data.
- Proficient in designing and developing data models using Azure Data Lake Store and Azure Data Lake Analytics.
- Skilled in using Azure Stream Analytics for real-time data processing and analysis.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of ADF, Palantir Foundry, and ADLS.
- Worked on the Palantir Foundry tool to create contract entity models and design patterns for the automation of development and production environments.
- Experienced in working with Azure Cosmos DB for NoSQL database management and data analytics.
- Experienced in using Azure Databricks for collaborative data engineering with Spark and machine learning libraries.
- Knowledge of Azure Synapse Analytics for integrating and analyzing big data with Power BI and other visualization tools.
- Used Azure Log Analytics to automate workflows on VMs for data processing and loading to Azure Blob storage.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, including the write-back tool, and back.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory (ADF), T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Configured Azure Automation DSC configuration management to assign permissions through RBAC, assign nodes to the proper automation account and DSC configurations, and get alerted on any changes made to nodes and their configuration.
- Created Azure Blob and Data Lake storage and loaded data into Azure SQL Synapse Analytics (DW).
- Automated data processing tasks using scripting languages such as Python and Bash.
- Experienced in using the ELK stack for application monitoring.
Environment: Apache NiFi, Azure Data Factory, Python, R, Azure Data Lake, Blob Storage, Log Analytics, Azure Functions, Apache Kafka, Cosmos DB, Power BI, Tableau, Palantir Foundry
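A minimal sketch of a seasonality-based anomaly check of the kind the microservice above performs, assuming an hourly metric scored against a simple hour-of-week baseline; the column names, threshold, and demo data are illustrative, not production code.

```python
# Sketch only: flag metric values that deviate from the historical mean for
# the same hour of week. Thresholds and demo data are illustrative.
import pandas as pd

def flag_anomalies(series: pd.Series, z_thresh: float = 3.0) -> pd.DataFrame:
    df = series.to_frame("value")
    df["hour_of_week"] = df.index.dayofweek * 24 + df.index.hour
    grouped = df.groupby("hour_of_week")["value"]
    df["baseline"] = grouped.transform("mean")   # seasonal baseline per slot
    df["spread"] = grouped.transform("std")
    df["zscore"] = (df["value"] - df["baseline"]) / df["spread"]
    df["anomaly"] = df["zscore"].abs() > z_thresh
    return df

if __name__ == "__main__":
    idx = pd.date_range("2021-01-01", periods=24 * 28, freq="H")
    metric = pd.Series(100.0, index=idx)
    metric.iloc[-1] = 500.0                      # inject a spike to inspect
    print(flag_anomalies(metric).tail(3)[["value", "zscore", "anomaly"]])
```

In the real service the baseline would come from a trailing window of prior weeks and the scores would be pushed back onto the streaming pipeline; this sketch only shows the scoring step.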
TransUnion, Chicago, IL | Aug 2018 - Sep 2019
Role: Data Engineer
Responsibilities:
- Automated data processing tasks using scripting languages such as Python and Bash.
- Configured AWS IAM groups and users for improved login authentication.
- Worked with various AWS cloud services, including EC2, S3, EMR, Redshift, Lambda, and Glue.
- Implemented and maintained a Hadoop cluster on AWS EMR.
- Loaded data into S3 buckets using AWS Glue and PySpark.
- Utilized AWS Glue for data cataloging, ETL processing, and data preparation tasks, enabling seamless integration of diverse data sources and efficient transformation of large-scale datasets.
- Leveraged Glue's built-in connectors and transformations to handle complex data structures, such as nested data and JSON formats, facilitating effective data processing and integration for analytics and reporting purposes.
- Proficient in AWS Lambda for developing scalable and cost-effective applications.
- Implemented AWS Step Functions to orchestrate complex workflows in a serverless architecture.
- Utilized Amazon Redshift for high-performance data warehousing solutions; optimized Redshift clusters through performance tuning, query optimization, and data distribution strategies.
- Utilized Amazon CloudWatch for monitoring and gaining insights into application and infrastructure performance; set up CloudWatch alarms and automated actions based on defined thresholds.
- Implemented AWS CI/CD pipelines using services such as AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy.
- Implemented infrastructure-as-code (IaC) practices using AWS CloudFormation and AWS CDK.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available for downstream analysis.
- Designed and deployed automated ETL workflows using AWS Lambda, organized and cleansed the data in S3 buckets using AWS Glue, and processed the data using Amazon Redshift.
- Worked on creating Glue jobs, managed policies, and utilized S3 for storage and backup on AWS.
- Developed the process to ingest data from a web service and load it into DynamoDB.
- Worked with Spark applications in Python, loading high-volume files with different schemas into PySpark DataFrames in a distributed environment and processing them to reload into DynamoDB tables.
- Designed and developed pipelines using Azure Databricks, automated the pipelines for the ETL processes, and maintained the workloads in the process.
- Implemented Python ETL for the source ingestions and transformations based on the input keys.
- Expertise in migrating existing applications, Talend data feeds, and ETL pipelines to Hadoop, Snowflake, and AWS.
- Built data pipelines and ETL using Airflow; implemented the ETL architecture for enhancing the data and optimized workflows by building DAGs in Apache Airflow to schedule the ETL jobs, using additional Airflow components such as pools, executors, and multi-node functionality (see the sketch after this section).
- Experienced in automating various cloud operations using Airflow operators.
Environment: AWS, S3, Lambda, Glue, Databricks, Airflow, Python, Tableau, Power BI, PySpark
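A minimal sketch of an Airflow DAG shaped like the ETL orchestration described above (extract to S3, cleanse with Glue, load to Redshift), assuming Airflow 2.x; the task callables, bucket, and job names are hypothetical placeholders.

```python
# Sketch only: a three-step ETL DAG with placeholder task bodies.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_s3():
    print("pull source files into s3://raw-bucket/ (placeholder)")

def cleanse_with_glue():
    print("trigger the Glue cleansing job (placeholder)")

def load_to_redshift():
    print("COPY cleansed data into Redshift (placeholder)")

with DAG(
    dag_id="daily_etl_pipeline",          # hypothetical DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    cleanse = PythonOperator(task_id="cleanse_with_glue", python_callable=cleanse_with_glue)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    extract >> cleanse >> load            # linear dependency chain
```

In practice the placeholder callables would be replaced with AWS operators or boto3 calls, with pools and executors configured as noted above.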
Group Health, Seattle, WA | Jan 2017 - Aug 2018
Role: Data Engineer
Description: Worked on extraction of healthcare claims data and electronic health records (EHR) for analysis and reporting to support business decisions.
Responsibilities:
- Worked extensively with the GCP cloud platform, gaining hands-on experience with Dataproc, BigQuery, Cloud Storage, Dataflow, Cloud SQL, and Datastore.
- Designed and deployed data pipelines using Cloud Storage, Dataproc, and Cloud Composer.
- Utilized Google Cloud Dataflow to integrate data from both on-prem (MySQL, Cassandra) and cloud (Cloud Storage, Cloud SQL) sources, applying transformations for loading into Google BigQuery.
- Developed data ingestion pipelines on a Google Cloud Dataproc Spark cluster using Google Cloud Dataflow and Spark SQL; also worked with Google Cloud Firestore and Cloud Bigtable.
- Used Google Cloud Build and Cloud Source Repositories for CI/CD, Google Cloud Identity and Access Management (IAM) for authentication, and Apache Ranger for authorization.
- Developed ETL jobs for data extraction and loading into BigQuery (see the sketch after this section).
- Integrated Cloud Functions with GCP services such as Google Cloud Storage (GCS) and BigQuery with Looker.
- Filtered data in Google Cloud Storage (GCS) using Elasticsearch and loaded it into BigQuery with Looker.
- Analyzed large datasets using Pandas and performed regression modeling with SciPy, incorporating Teradata data when necessary.
- Designed and deployed data pipelines using Cloud Storage, Google Cloud Dataproc, and Apache Airflow.
- Collaborated with data analysts and business stakeholders to create custom dashboards and reports using tools such as Data Studio, connected to BigQuery for real-time data visualization.
- Extensive experience working with NoSQL databases and their integration with Google Cloud Platform, including Google Cloud Firestore, Google Cloud Datastore, MongoDB Atlas, Cassandra on GCP, and HBase on GCP.
- Designed and deployed functions that seamlessly integrate with other GCP services, enabling real-time data processing, event-driven triggers, and automated workflows.
- Set up and configured Google Data Transfer Services to automate the extraction and loading of data from various sources into BigQuery, ensuring data freshness and consistency.
- Implemented access controls and encryption mechanisms for data stored in GCS buckets, ensuring data security and compliance with privacy regulations.
- Monitored and troubleshot data pipelines and storage solutions using GCP's Stackdriver and Cloud Monitoring.
Environment: Google Cloud Storage, BigQuery, Dataproc, Airflow, Python, MySQL, MongoDB, Cassandra, Erwin
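A minimal sketch of a GCS-to-BigQuery batch load of the kind described above, using the google-cloud-bigquery client; the project, bucket path, and table names are placeholders.

```python
# Sketch only: load newline-delimited JSON files from GCS into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="my-healthcare-project")   # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                                          # infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://claims-landing/ehr/*.json",                         # hypothetical bucket path
    "my-healthcare-project.analytics.claims_raw",             # hypothetical target table
    job_config=job_config,
)
load_job.result()                                             # wait for completion
print(f"Loaded {load_job.output_rows} rows")
```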
PayPal, Bangalore, India | Mar 2016 - Aug 2016
Role: Hadoop Developer
Description: The main goal of the project was to translate the available data into information that supports meaningful decisions.
Responsibilities:
- Involved in configuring Hadoop MapReduce and HDFS; developed multiple MapReduce jobs and used NiFi for data cleaning and preprocessing.
- Imported and exported data between HDFS and the Oracle database using Sqoop.
- Installed and configured Hadoop clusters for major Hadoop distributions.
- Used Hive, Pig, and Informatica as ETL tools for event joins, filters, transformations, and pre-aggregations.
- Created partitions and bucketing across state in Hive to handle structured data, used alongside Elasticsearch.
- Developed workflows in Oozie to orchestrate a series of Pig scripts to cleanse data, such as removing personal information or merging many small files into a handful of large, compressed files, using Pig pipelines in the data preparation stage.
- Involved in moving all log files generated from various sources to HDFS for further processing through Elasticsearch, Kafka, Flume, and Informatica, and processing the files.
- Extensively used Pig to communicate with Hive using HCatalog and with HBase using handlers.
- Used Spark SQL with the Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
- Used Spark SQL to read and write tables stored in Hive (see the sketch after this section).
- Performed Sqoop transfers through HBase tables for processing of data into several NoSQL databases: Cassandra and MongoDB.
- Created tables, secondary indices, and join indices in the Teradata development environment for testing.
- Captured data logs from the web server and Elasticsearch into HDFS using Flume for analysis.
- Managed and reviewed Hadoop log files.
- Designed and developed Scala workflows for pulling data from cloud-based systems and applying transformations on it.
- Designed, developed, and optimized data processing pipelines using Apache Spark.
- Designed, built, and maintained data ingestion pipelines using Scala FS2, Akka Streams, and Kafka Streams.
- Experienced in developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS.
- Processed data into HDFS by developing solutions, analyzed the data using MapReduce and Hive, and produced summary results from Hadoop for downstream systems.
- Extracted, transformed, and loaded data sources to generate CSV data files with Python programming and SQL queries.
- Developed Spark applications using Python and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Developed the strategy and implementation for Hadoop Impala integration with the existing RDBMS ecosystem using Apache Spark.
- Worked with Spark using Scala to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, pair RDDs, and Spark on YARN.
- Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
- Performed streaming of pipelines using Spark Streaming and Stream Analytics to analyze the data from data-driven workflows.
- Used Sqoop to dump data from the MySQL relational database into HDFS for processing and to export data back to the RDBMS.
- Strong data warehouse and database design skills.
- Used Spark DataFrames to create datasets, applying business transformations and data cleansing operations.
- Used Pig and Hive in the analysis of data.
- Worked with Flume to import log data from the reaper logs and syslogs into the Hadoop cluster.
- Involved in managing running and pending MapReduce tasks through the Cloudera Manager console.
- Hands-on experience with NoSQL databases such as HBase and Cassandra for proofs of concept (POCs) storing URLs, images, products, and supplement information in real time.
- Worked on Hive for analysis and for transforming files from different analytical formats to text files.
- Involved in writing Hive queries for data analysis with respect to business requirements.
- Worked on Spark using Python and Spark SQL for faster testing and processing of data.
- Involved in loading data from the Linux file system to HDFS.
Environment: Hive, Pig, MapReduce, Apache NiFi, Sqoop, Oozie, Flume, Kafka, Informatica, EMR, Storm, HBase, Unix, Linux, Python, Spark, SQL, Hadoop 1.x, HDFS, GitHub, Python scripting, Scala
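A minimal sketch of a Spark SQL job that reads a Hive table, aggregates it, and writes a summary table back to Hive, similar in shape to the Spark work described above; the database, table, and column names are illustrative only.

```python
# Sketch only: Hive-backed Spark SQL aggregation with placeholder names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("usage-summary")
    .enableHiveSupport()          # read/write Hive-managed tables
    .getOrCreate()
)

events = spark.sql("SELECT user_id, event_type, amount FROM raw_db.events")

summary = (
    events
    .filter(F.col("amount") > 0)
    .groupBy("user_id", "event_type")
    .agg(F.sum("amount").alias("total_amount"),
         F.count("*").alias("event_count"))
)

summary.write.mode("overwrite").saveAsTable("analytics_db.usage_summary")
spark.stop()
```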
VIE Techno Solutions, Bangalore, India | Jun 2015 - Jan 2016
Role: Data Analyst
Description: The main goal of the project was to translate the available data into information that supports meaningful decisions.
Responsibilities:
- Understood basic business analysis concepts for logical data modeling, data flow processing, and database design.
- Interfaced with users and business analysts to gather information and requirements.
- Conducted analysis and profiling of potential data sources after high-level determination of sources during initial scoping by the project team.
- Responsible for logical and physical data modeling, database design, star schemas, data analysis, documentation, implementation, and support.
- Involved in data modeling and design of data marts using ER/Studio.
- Involved in the study of the business logic and understanding the physical system and the terms and conditions for the sales data mart.
- Struck a balance between project scope and user requirements by ensuring coverage of minimum user requirements without making the solution too complex to handle.
- Gathered requirements from users by participating in JAD sessions; a series of meetings was conducted with the business system users to gather reporting requirements, and reporting was delivered using Tableau.
- Performed requirement analysis to determine how the proposed enhancements would affect the current system.
- Created the LDM and PDM in 3NF using the ER/Studio tool and converted the logical models to the physical design.
- Designed and developed changes and enhancements; enhanced the existing submission and refining programs to incorporate the new calculations.
- Interacted with end users to identify key dimensions and measures that were quantitatively relevant.
- Used reverse engineering to connect to the existing database and create a graphical representation (E-R diagram) using Erwin 4.0.
- Designed a simple dimensional model of a business that sells products in different markets and evaluates business performance over time.
- Coordinated data profiling and data mapping with business subject matter experts, data stewards, data architects, ETL developers, and data modelers.
- Developed logical and physical data models using the Erwin tool across the subject areas based on the specifications and established referential integrity of the system.
- Normalized the database into 3NF for the data warehouse.
- Involved in dimensional modeling, identifying the facts and dimensions.
- Maintained and enhanced the data model with changes, furnishing definitions, notes, reference values, and checklists.
- Generated DDL scripts for database modification, Teradata macros, views, and SET tables.
- Developed ETL programs using Informatica to implement the business requirements and created shell scripts to fine-tune the ETL flow of the Informatica workflows.
- Used Python to clean the data and remove outliers before creating reports (see the sketch after this section).
- Reported and created dashboards for Global Services & Technical Services using SSRS, Oracle BI, and Excel.
- Designed, developed, and maintained reports, ad hoc queries, and analytical tools; provided analysis and issue resolution on business-reported concerns.
- Produced metrics and other report objects and components; prepared user guides and provided training on new and enhanced reporting solutions.
- Performed ongoing monitoring, automation, and refinement of reports and BI solutions.
Environment: Teradata, SQL Server 2005/2017, Erwin 7.5, XML, Excel, Access, Informatica
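A minimal sketch of the Python outlier-removal step mentioned above, using a common IQR filter (the source does not specify the exact method used); the file and column names are hypothetical.

```python
# Sketch only: drop rows whose value falls outside the IQR fences before reporting.
import pandas as pd

def drop_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

sales = pd.read_csv("sales_extract.csv")                 # hypothetical extract
clean = drop_outliers(sales, column="order_amount")      # hypothetical column
clean.to_csv("sales_report_input.csv", index=False)      # feeds the report build
```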
Trigeo Technologies | Dec 2014 - Apr 2015
Role: SQL Developer
Responsibilities:
- Worked with architects and senior developers to design and enforce established standards for building reporting solutions.
- Aided with interpretation of raw data, statistical results, and compiled information.
- Worked with business stakeholders to define and execute test plans and obtain implementation sign-off.
- Understood the business requirements, use cases, and process flow to come up with conceptual, logical, and physical models.
- Created the physical model from the logical model, following enterprise naming standards for indexes, foreign key constraints, databases, and views.
- Generated DDL from the physical model and ran the DDL against the database to create physical tables.
- Populated data from source to destination using SSIS (ETL) packages, performing joins and other logic to parse the data.
- Performed unit testing and debugging and set test conditions based on requirements.
- Blended data from multiple databases into one report by selecting primary keys from each database for data validation.
- Used Excel VLOOKUP, PivotTable, and Access query functionality to research data issues, and cleaned, reformatted, and documented user satisfaction survey data.
- Developed a data gathering application.
- Continually explored opportunities to improve processes in the ongoing development of analytical tools and models, ad hoc reports, dashboards, and analysis.
- Coordinated and completed ad hoc inter-department requests and requests from strategic partners by providing decision support and consultative service.
- Scripted in T-SQL against tables and developed test procedure cases.
- Exposure to database administration principles and practices, including security, permissions, and alerts.
- Extracted, manipulated, and analyzed data and created reports using T-SQL (see the sketch after this section).
- Used Excel to set up pivot tables and create various reports from a set of data returned by a SQL query.
- Designed a database in Access from data in Excel, linking them so any changes in Excel are automatically reflected when the linked table is viewed or queried in Access.
- Created data validation rules for entering data and importing data from Excel spreadsheets.
- Extensive use of the Import/Export Wizard to convert T-SQL queries, data, and reports between formats such as XML, MS Excel, TTF, and PDF files and SQL data types, and vice versa.
- Used DDL and DML to write triggers and stored procedures to check data integrity and payment verification at early stages before calling them.
Environment: Excel, Tableau, SQL Server, SSIS, MySQL, Oracle
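A minimal sketch of pulling a T-SQL result set into an Excel-ready extract for pivot-table reporting, as described above, assuming pyodbc and pandas; the server, database, table, and file names are placeholders.

```python
# Sketch only: run a T-SQL query and export the result for Excel pivot tables.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=reporting-server;DATABASE=SurveyDB;Trusted_Connection=yes;"  # placeholders
)

query = """
    SELECT region, product, survey_score, response_date
    FROM dbo.SatisfactionSurvey                      -- hypothetical table
    WHERE response_date >= DATEADD(month, -3, GETDATE())
"""

df = pd.read_sql(query, conn)                        # run the T-SQL query
df.to_excel("satisfaction_extract.xlsx", index=False)  # source for pivot tables
conn.close()
```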
Education
Master of Science in Engineering from UTEP
BE in Electrical Engineering from NMIT