
Vineesha C
Sr. Data Engineer
845-276-4109 | [email protected]
Location: Remote, USA
Relocation: Any
Visa: GC-EAD


Technical Summary:
Experienced Senior Data Engineer with 8+ years of expertise in building robust data pipelines and infrastructure.
Proficient in using Databricks and Spark for data analysis and transformation; skilled in implementing complex merge, update, and delete operations with Delta Lake (a short sketch of this pattern follows this summary).
Expertise in developing Spark Scala scripts for mining and transforming large datasets while maintaining performance and delivering timely insights.
Strong background in designing and deploying microservices for concurrency and high traffic handling.
Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as dbt and DataStage.
Maintained accurate records and documentation for all mechanical projects.
Well-versed in Azure services for data ingestion and processing, along with optimization techniques like query refactoring and Redis integration.
Collaborative team player who has provided guidance on PySpark to development teams and worked closely with data science groups for preprocessing and feature engineering.
Proficient in creating data lakes on AWS and ingesting data from diverse sources using technologies like API Gateway, Lambda, and Kinesis Firehose.
Skilled in Spark transformations and Python for big data processing, with a focus on optimizing data flow across all stages.
Used Collibra for Data Quality and lineage.
Good understanding of ODI navigators such as Designer, Topology, Operator, and the Security tab.
Proficient in Teradata Data Warehouse and Kafka for stream-processing. Extensive experience in designing complex data processing rules and SSIS packages for loading into data warehouses.
Strong aptitude for customer insights generation using tools like Gainsight, along with process improvements using Alteryx and SQL.
Collaborative and adept at automating business processes and data storytelling using Tableau and Agile methodologies.
Extensive experience in creating ETL pipelines from various sources including Segment, MongoDB, and MySQL shards.
Demonstrated expertise in building complex SQL queries for testing data flow. Well-versed in creating user behavior, engagement, and sales analytics data lakes.
Adept at stream-processing using Kafka and designing rules engines for data processing.
Proficient in using Microsoft Azure services to bring data together, cleanse, transform, and optimize it for storage and use.
Skilled in automating tasks and deploying production-standard code.
Proficient in predictive and prescriptive analytics using regression models. Strong background in implementing logistic regression and Random Forest ML models using Python packages.
Adept at training models, implementing ensemble learning techniques, and optimizing models through hyperparameter tuning. Collaborative communicator with a history of partnering with various departments to provide data-driven solutions.
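
For illustration, the following is a minimal PySpark sketch of the Delta Lake merge/update/insert pattern referenced above, assuming a Databricks or Spark environment with the Delta Lake library available; the staging path, table location, and column names are hypothetical.

# Minimal Delta Lake upsert sketch; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

# Incoming batch of changed records (assumed columns: id, status, updated_at).
updates_df = spark.read.format("parquet").load("/mnt/staging/orders_updates")

# Target Delta table that receives the merge.
target = DeltaTable.forPath(spark, "/mnt/delta/orders")

(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdate(set={"status": "u.status", "updated_at": "u.updated_at"})
    .whenNotMatchedInsertAll()
    .execute())

Deletes follow the same pattern through whenMatchedDelete(), which is how the merge/update/delete use cases above are typically expressed.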

Technical Skills:
Languages: Java, Scala, Python, SQL, C/C++
Big Data Ecosystem: Hadoop, YARN, Flume, Sqoop, Oozie, Airflow, Zookeeper, Talend, MapReduce, Kafka, Spark, Pig, Hive
Hadoop Distributions: Cloudera Enterprise, Databricks, Hortonworks, EMC Pivotal
Databases: Oracle, MS SQL Server, PostgreSQL, DB2, MySQL, Erwin
Streaming Tools: Kafka, RabbitMQ
Cloud: AWS (Glue, RDS, Kinesis, DynamoDB, Redshift, EMR), GCP, Microsoft Azure
Operating Systems: Linux (Red Hat/Ubuntu/CentOS), Windows 10/8.1/7/XP
Testing: Hadoop Testing, Hive Testing
Application Servers: Apache Tomcat, JBoss, WebSphere, RHEL, Windows
Tools and Technologies: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo
IDEs: IntelliJ, Eclipse, NetBeans

Educational Qualification:


Masters in Computer and Information Sciences
University of Missouri, Kansas City, USA August 2013 - May 2015
GPA: 3.5/4.0
Bachelor of Technology in Computer and Information Science, GEC, India July 2009 - May 2013
GPA: 3.4/4.0

Professional Experience:

CGFNS International Inc. Philadelphia, PA Jan 2022 - Present
Sr. Data Engineer:
Responsibilities:
Extensively used Databricks notebooks for interactive analysis with Spark APIs.
Used Delta Lake merge, update, and delete operations to enable complex use cases.
Developed Spark Scala scripts for mining data and performed transformations on large datasets to support ongoing insights and reports.
Experience using android debugging tools like Logcat, Android Monitor using Android Studio. Worked with ADB commands and Appium node.js commands.
Worked on Collibra administration by creating Communities, Domains, Custom Asset Types, and Relations, and by managing users.
Implemented scalable microservices to handle concurrency and high traffic.
Ingested data into one or more Azure services and processed the data in Azure Databricks.
Extensive experience in data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
Experience on Palantir Foundry and Data warehouses (SQL Azure and Confidential Redshift/RDS).
Created and executed TestComplete automated test scripts.
Developed Spark applications in Python on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the sketch at the end of this project entry).
Developed python application for Google Analytics aggregation and reporting and used Django configuration to manage URLs and application parameters.
Worked on Dimensional Data modelling in Star and Snowflake schemas and Slowly Changing Dimensions (SCD).
Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
Designed and implemented a scalable and efficient data pipeline for collecting, processing, and analyzing large volumes of advertising data, resulting in a 30% reduction in data processing time.
Designed and analyzed mechanical systems and components using CAD software (e.g., SolidWorks) and finite element analysis (FEA) tools.
Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation and Google Cloud Platform (GCP).
Supporting Continuous storage in AWS using Elastic Block Storage, S3. Created Volumes and configured Snapshots for EC2 instances.
Administered Master Data Management (MDM) and Extract, Transform, Load (ETL) data application environments.
Designed, built and managed ELT data pipeline, leveraging Airflow, python, dbt, Stitch Data and GCP solutions.
Implemented data quality checks and monitoring processes to ensure the integrity and reliability of the advertising data, reducing data errors by 15%.
Used Oracle Data Integrator (ODI) to develop processes for extracting, cleansing, transforming, integrating, and loading data into the data warehouse database.
Used Data Frame API in Scala for converting the distributed collection of data organized into named columns, developing predictive analytic using Apache Spark Scala APIs.
Utilize Alation to monitor and manage data quality, implementing checks and validations within the platform.
Configure and maintain Alation connectors to various data sources, enabling the automatic discovery and cataloging of new data assets.
Developed Python scripts using both DataFrame/SQL/Dataset and RDD/MapReduce APIs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
Solid Experience and understanding of Implementing large scale Data warehousing Programs and E2E Data Integration Solutions on Snowflake Cloud, AWS Redshift, Informatica Intelligent Cloud Services & Informatica PowerCenter integrated with multiple Relational databases.
Developed Hive queries to pre-process the data required for running the business process.
Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline which can be written to Glue Catalog and can be queried from Athena.
Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
Implemented a generalized solution model using AWS SageMaker.
Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
Programmed in Hive, Spark SQL, Java, and Python to streamline the incoming data and build the data pipelines to get the useful insights, and orchestrated pipelines.
Experienced in creating metamodel/operating model in Collibra as per Confidential data governance council requirements.
Worked on ETL pipeline to source these tables and to deliver this calculated ratio data from AWS to Datamart (SQL Server) & Credit Edge server.
Environment: Hortonworks, Hadoop, HDFS, AWS Glue, Palantir, ODI, AWS Athena, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Airflow, Erwin, Python, Scala, Spark, Spark SQL, AWS, GCP, SQL Server, Tableau, ETL
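
A rough sketch of the multi-schema CSV loading described above (aligning each feed to a common layout before appending to a Hive ORC table); the feed paths, target columns, and table name are assumptions rather than the project's actual code.

# Hypothetical sketch: load CSV feeds with differing schemas into one Hive ORC table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = (SparkSession.builder
         .appName("csv-to-hive-orc-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Assumed target layout for the Hive table.
target_columns = {"order_id": "string", "amount": "double", "region": "string"}

# Read each feed separately so schema inference is applied per source.
for feed in ["/data/landing/feed_a", "/data/landing/feed_b"]:  # illustrative paths
    df = spark.read.option("header", "true").option("inferSchema", "true").csv(feed)
    # Align to the target layout; missing columns are added as typed nulls.
    for name, dtype in target_columns.items():
        df = (df.withColumn(name, col(name).cast(dtype))
              if name in df.columns
              else df.withColumn(name, lit(None).cast(dtype)))
    (df.select(*target_columns)
       .withColumn("source_feed", lit(feed))
       .write.mode("append").format("orc")
       .saveAsTable("analytics.orders_raw"))  # hypothetical Hive table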

HP Inc. Vancouver, WA Mar 2020 - Dec 2021
Sr. Data Engineer:
Responsibilities:

Responsible for building the data lake in Amazon AWS, ingesting structured shipment and master data from Azure Service Bus through AWS API Gateway, Lambda, and Kinesis Firehose into S3 buckets.
Implemented data pipelines for big data processing using Spark transformations and the Python API on clusters in AWS.
Created complex SQL queries in the Teradata Data Warehouse environment to test the data flow across all stages.
Integrated data sources from Kafka for data stream-processing in Spark using AWS Network.
Reviewed all Kafka unit test case documents, Palantir Foundry, Talend, and EQA documents completed by the team against a review checklist, and contributed to development for the same.
Designed a rules engine in Spark SQL that processes millions of records on a Spark cluster over Azure Data Lake.
Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production (a scheduling sketch follows this project entry).
Oversaw the production and assembly of mechanical components, ensuring quality control and adherence to specifications.
Led efforts in data cataloging, utilizing Alation's features to organize, classify, and document data assets.
Manage metadata within Alation, including data lineage, data quality, and other relevant attributes.
Established and oversaw connections for new data assets, as well as managed existing ones in large-scale deployments.
Staged API and Kafka data (in JSON format) into Snowflake by flattening it for different functional services.
Collaborated with ETSO and business units to define and manage connectivity requirements and firewall configurations.
Extensively involved in designing the SSIS packages to load data into Data Warehouse.
Built customer insights on customer/service utilization, booking, and CRM data using Gainsight.
Executed process improvements in data workflows using Alteryx processing engine and SQL.
Collaborated with business owners of products for understanding business needs and automated business processes and data storytelling in Tableau.
Designed and developed a data pipeline to collect data from multiple sources and inject it in Hadoop, Hive data lake using Talend Bigdata, Spark.
Extracted toxic and hazardous substances data from the internet and loaded it into MySQL using Beautiful Soup and Python.
Implemented a one-time migration of multi-state data from SQL Server to Snowflake using Python and SnowSQL.
Extensively used ODI tool to create data warehousing OLAP model.
Implemented user behavior, engagement, retention, and sales analytics data lake.
Created ETL pipelines to process data from Segment, MongoDB, and multiple MySQL shards.
Mentored and guided analysts on building purposeful analytics tables in dbt for cleaner schemas.
Implemented pipeline using PySpark. Also used Talend spark components.
Implemented Agile methodology for building data applications and framework development.
Implemented business processing models using predictive & prescriptive analytics on transactional data with regression.
Implemented logistic regression and Random Forest ML models using Python packages.

Environment: Python, Bigdata, Hadoop, HBase, Hive, Palantir, Spark, PySpark, Cloudera, Kafka, Sqoop, ODI, Jenkins, Unix Shell scripting, Airflow, GitHub, SQL, Tableau, Power BI.
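
As an example of the kind of daily Airflow automation mentioned above, here is a minimal DAG sketch assuming Airflow 2.x; the DAG id, schedule, and job commands are illustrative, not the production workflow.

# Minimal Airflow DAG sketch for daily scheduling; names and commands are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="daily_rules_engine",               # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_kafka_batch",
        bash_command="python /opt/jobs/extract_kafka_batch.py",  # assumed script path
    )
    transform = BashOperator(
        task_id="run_spark_rules",
        bash_command="spark-submit /opt/jobs/rules_engine.py",   # assumed Spark job
    )
    extract >> transform  # run the Spark rules only after the extract succeeds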


Philips Inc. Andover, MD Nov 2018 - Feb 2020
Data Engineer:
Responsibilities:

Worked with NIFI, Teradata, and DB2 interfaces to accomplish ETL tasks using Spark streaming.
Loaded data from Teradata to Hadoop Cluster by using TDCH scripts.
Created pipelines for StreamSets Data Collector to read data from Kafka and write it to HDFS and MapR DB.
Used Python and SQL programming to extract, transform and load data from source to CSV data files.
Conducted experiments and tests to evaluate product performance and troubleshoot mechanical issues.
Presented monthly reports and performed data cleanup by implementing MapReduce jobs on YARN with Hadoop clusters.
Loaded the data of various data formats such as JSON, XML with the help of Python into data frames for analysis.
Proficient knowledge of Big Data components such as Hive, PySpark, Map Reduce, HDFS.
Used Python, SQL scripts, and Git to develop Python modules and DataMarts.
Created data frames from the csv files and loaded them to Spark and retrieved the data using Spark SQL.
Performed structure and profile scans using Informatica Enterprise Data Catalog (EDC) and Data Privacy Management (DPM) tools.
Loaded JSON-format log files into Hive external tables and then used HiveQL to access the data.
Performed data loading and transformations in PySpark by implementing user-defined functions (see the sketch at the end of this project entry).
Implemented fact tables to refer to any dimension tables and STAR schema for the data warehouses.
Used Spark RDDs and Scala to convert Hive/SQL queries into Spark transformations.
Implemented lookup and staging table principles together with the HBase row key for entering data into HBase tables.
Developed Spark streaming jobs in Scala to access the data from Kafka and modified it to fit into the HBase database.
Monitored the system's status by developing Data visualizations for the system logs.
Created Python programs for several data formats, including JSON and XML, to facilitate easy access to and management of large amounts of data.
Created Scala code and used Object-Oriented Design (OOD) to create mathematical models in Spark analytics.
Worked on the HBase (NoSQL) database and configured MySQL to hold the Hive metadata.
Performed manual and functional testing on the developed applications.
Used Map Reduce jobs to perform analysis on the data imported into HDFS.
Automated the tasks, by creating workflows using Oozie, to load the data into HDFS.
Created Hive DDLs for activities such as creating, reading, updating, and deleting Hive tables.
Gained exposure to the internal workings of RDDs and the Spark architecture by developing RDDs and applying performance-enhancing techniques while processing data from local files, HDFS, and RDBMS sources.
Assisted the business decisions by extracting meaningful insights from the raw data.

Environment: DataMart, Kafka, Spark, Teradata, DB2, TDCH, Map Reduce, YARN, Hive, HiveQL, Scala, Python, HDFS, HBase.
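
Below is a small sketch of the PySpark user-defined-function pattern referenced above; the sample rows and column names are made up for illustration.

# Sketch of a PySpark UDF applied during load; data and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# Stand-in for log data loaded earlier in the pipeline.
df = spark.createDataFrame(
    [("dev-001 ", "philips"), (" DEV-002", "PHILIPS")],
    ["device_id", "vendor"],
)

@udf(returnType=StringType())
def normalize_id(raw):
    """Trim and upper-case identifiers before they reach the warehouse tables."""
    return raw.strip().upper() if raw is not None else None

df.withColumn("device_id", normalize_id(col("device_id"))).show()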

AgFirst Columbia, SC Jul 2017 - Oct 2018
Data Engineer:
Responsibilities:

Managed security groups on AWS, focusing on high availability, fault tolerance, and auto-scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data (a simplified example follows this project entry). Created various types of data visualizations using Python and Tableau.
Wrote various data normalization jobs for new data ingested into Redshift.
Created various complex SSIS/ETL packages to Extract, Transform and Load data.
Established and upheld data architecture, ensuring accuracy, consistency, and security.
Demonstrated proficiency with database environments such as Oracle, MS-SQL Server, MySQL, and DB2.
Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group in Kafka.
Used Kafka features such as distribution, partitioning, and the replicated commit-log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology for each job.
Oversaw and administered Informatica Enterprise Data Catalog (EDC), Axon, Data Privacy Management (DPM), and Data Masking application environments, or comparable data management/governance applications.
Migrated on premise database structure to Confidential Redshift data warehouse.
Was responsible for ETL and data validation using SQL Server Integration Services.
Worked on Big data on AWS cloud services i.e., EC2, S3, EMR and DynamoDB
Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
Involved in the Forward Engineering of the logical models to generate the physical model using Erwin and generate Data Models using ERwin and subsequent deployment to Enterprise Data Warehouse.
Defined and deployed monitoring, metrics, and logging systems on AWS.
Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems.
Updated and upheld network diagrams and related connectivity and firewall documents.
Implemented and managed ETL solutions and automated operational processes.
Defined facts, dimensions and designed the data marts using Ralph Kimball's Dimensional Data Mart modeling methodology using Erwin.
Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
Created ad hoc queries and reports to support business decisions SQL Server Reporting Services (SSRS).
Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.

Environment: Informatica, RDS, NOSQL, Snowflake Schema, Apache Kafka, Python, Zookeeper, SQL Server, Erwin, Oracle, Redshift, MySQL, PostgreSQL
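
A simplified example of the data quality validation scripts mentioned above, comparing source and target row counts and checking a key column; the schema, table, and column names are assumptions.

# Hedged sketch of a load-validation check; schemas, tables, and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("dq-check-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Compare row counts between the staging layer and the warehouse.
source_count = spark.sql("SELECT COUNT(*) AS c FROM staging.loans_raw").first()["c"]
target_count = spark.sql("SELECT COUNT(*) AS c FROM dw.loans").first()["c"]
if source_count != target_count:
    raise ValueError(f"Row count mismatch: staging={source_count}, warehouse={target_count}")

# Spot-check a critical key column for unexpected nulls after the load.
null_keys = spark.table("dw.loans").filter(col("loan_id").isNull()).count()
if null_keys:
    raise ValueError(f"{null_keys} rows loaded without a loan_id")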


United Biosource Corporation Bethesda, MD Mar 2016 - Jun 2017
Data Analyst
Responsibilities:

Involved in designing and developing logical and physical data models using Erwin DM.
Worked with DB2 Enterprise, Oracle Enterprise, Teradata 13, Mainframe sources, Netezza, flat files, and dataset operational sources.
Worked with various process improvements, normalization, de-normalization, data extraction, data cleansing, and data manipulation.
Performed data management projects and fulfilled ad-hoc requests according to user specifications by utilizing data management software programs and tools like TOAD, MS Access, Excel, XLS and SQL Server.
Worked with requirements management, workflow analysis, source data analysis, data mapping, Metadata management, data quality, testing strategy and maintenance of the model.
Used DVO to validate the data moving from source to target (a simplified reconciliation sketch follows this project entry).
Created requests in Answers and viewed the results in various views such as title view, table view, compound layout, chart, pivot table, ticker, and static view.
Assisted in producing OLAP cubes and wrote queries to produce reports using SQL Server Analysis Services (SSAS) and Reporting Services (SSRS); edited, upgraded, and maintained the ASP.NET website and IIS server.
Used SQL Profiler for troubleshooting, monitoring, and optimization of SQL Server and non-production database code as well as T-SQL code from developers and QA.
Assigned relevant IT, regulatory, breach, and records management policies to the identified data assets.
Investigated and validated lineage models associated with identified data assets.
Involved in extracting data from various sources such as Oracle databases, XML, flat files, and CSV files and loading it into the target warehouse.
Created complex mappings in Informatica PowerCenter Designer using Aggregator, Expression, Filter, and Sequence transformations.
Designed the ER diagrams, logical model (relationship, cardinality, attributes, and candidate keys) and physical database (capacity planning, object creation and aggregation strategies) for Oracle and Teradata as per business requirements using Erwin.
Designed Power View and Power Pivot reports and designed and developed the Reports using SSRS.
Designed and created MDX queries to retrieve data from cubes using SSIS.
Created SSIS Packages using SSIS Designer for exporting heterogeneous data from OLE DB Source, Excel Spreadsheets to SQL Server.
Extensively worked in SQL, PL/SQL, SQL*Plus, SQL*Loader, query performance tuning, DDL scripts, and database objects such as tables, views, indexes, synonyms, and sequences.
Developed and supported the extraction, transformation, and load (ETL) process for a data warehouse.

Environment: Erwin 9.1, Netezza, Oracle 8.x, SQL, PL/SQL, SQL*Plus, SQL*Loader, Informatica, CSV, Teradata 13, T-SQL, SQL Server, SharePoint, Pivot Tables, Power View, DB2, SSIS, DVO, Linux, MDM, ETL, Excel, SAS, SSAS, SPSS, SSRS
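
For illustration, a simplified source-to-target reconciliation in the spirit of the DVO validation above, written in Python with pyodbc; the DSNs, table pairs, and columns are hypothetical.

# Illustrative source-to-target row-count reconciliation; DSNs and tables are hypothetical.
import pyodbc

src = pyodbc.connect("DSN=oracle_src")     # assumed ODBC DSN for the source system
tgt = pyodbc.connect("DSN=sqlserver_dw")   # assumed ODBC DSN for the target warehouse

def row_count(conn, table):
    """Return COUNT(*) for the given table on the given connection."""
    cursor = conn.cursor()
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]

# Hypothetical staging-to-warehouse table pairs to reconcile.
pairs = [("CLAIMS_STG", "dbo.FactClaims"), ("MEMBERS_STG", "dbo.DimMember")]
for source_table, target_table in pairs:
    s, t = row_count(src, source_table), row_count(tgt, target_table)
    status = "OK" if s == t else "MISMATCH"
    print(f"{source_table} -> {target_table}: source={s}, target={t} [{status}]")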

SSP America Inc. Ashburn, VA Dec 2014 - Feb 2016
Data Analyst:
Responsibilities:

Used Python tools such as Pandas, Matplotlib, Seaborn, NumPy, Scikit-learn, and Plotly to perform data cleaning, feature selection, feature engineering, and extensive statistical analysis.
Conducted Exploratory Data Analysis (EDA) using Python libraries such as pandas, matplotlib, seaborn, and plotly.
Involved in feature engineering of raw data by performing missing value imputation, normalization and standardization, and conversion of categorical features to numerical using LabelEncoder and OneHotEncoder so the data is readable by machine learning models.
Involved in training models using Linear Regression and Support Vector Machine and improved performance by adapting Ensemble Learning techniques like Bagging and Boosting.
Optimized models using hyperparameter tuning methods such as grid search (see the sketch at the end of this section).
Collaborated with different departments in the organization to understand business requirements and to provide the best solution.
Used graphical packages in Python such as Seaborn and Matplotlib to produce ROC curves that visually represent the true positive rate vs. the false positive rate, and likewise produced precision-recall curve visualizations.
Analyze and extract data from various Confidential databases using SQL queries.
Performed data cleaning and manipulation using Excel VLOOKUPs, pivot tables, and other advanced Excel functions.
Matched business content and glossaries with the data assets identified through scans.
Created project presentations for the business, stakeholders, and clients using MS PowerPoint.
Researched, updated, and validated data underlying spreadsheet production, strategically filling gaps using Microsoft tools.
Prepared reports using MS Excel.
Created process flowchart presentations and defects management using JIRA, Visio, PowerPoint, and MS Excel.
Participated in daily Agile and Scrum meetings to provide feedback and updates on the project.
Managed spreadsheets and maintained data integrity to ensure accurate data availability for higher management.
Reviewed and discussed proposed issues with the Line of Business, compliance, risk partners, audit teams, and SMEs to ensure sufficient details are included.
Maintained data flow documentation and performed object mapping using BI tools and validation.
Maintained and performed process analysis and design of BI reports.
Involved in data mining, transformation and loading from the source systems to the target system.
Finished any additional application administration duties as assigned.

Environment: Tableau, Azure, Informatica, Oracle server, PL/SQL, Linux, Python
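
A short scikit-learn sketch of the grid-search tuning and ROC visualization described above, using a synthetic dataset; the estimator and parameter grid are illustrative choices, not the original project's models.

# Sketch of hyperparameter tuning with grid search and an ROC curve on toy data.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Small hyperparameter grid evaluated with 5-fold cross-validation on ROC AUC.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# ROC curve: true positive rate vs. false positive rate for the tuned model.
RocCurveDisplay.from_estimator(search.best_estimator_, X_test, y_test)
plt.show()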