Shreeja Soni - Data Engineer |
[email protected] |
Location: San Bernardino, California, USA |
Relocation: Yes |
Visa: GC |
SHREEJA
Email: [email protected]  PH: 503-583-8672

Sr. Data Engineer

Professional Summary

- Overall 8+ years of professional experience in data systems development and business systems, including designing and developing as a Data Engineer and Data Analyst.
- Extensive experience in IT data analytics projects; hands-on experience migrating on-premise ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
- Good experience in all phases of the SDLC; participated in daily scrum meetings with cross-functional teams.
- Excellent experience in developing and designing data integration and migration solutions in MS Azure.
- Hands-on experience with ETL tools such as AWS Glue; developed Glue scripts to load data from S3 to Redshift tables (see the sketch below).
- Expertise with container systems like Docker, container orchestration with EC2 Container Service and Kubernetes, and provisioning with Terraform; managed Docker orchestration and containerization using Kubernetes.
- Excellent understanding of and hands-on experience with AWS, including S3 and EC2.
- Worked on Snowflake for big data loads using PySpark.
- Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
- Experience importing and exporting data using stream-processing platforms like Flume and Kafka.
- Implemented large Lambda architectures using AWS data platform capabilities like AWS Glue, Elasticsearch, S3, EC2, and Redshift.
- Good experience using SSIS and SSRS to create and manage reports for an organization.
- Proficient in designing and implementing data structures and in commonly used business intelligence tools for data analysis.
- Extensive experience writing Storm topologies that accept events from a Kafka producer and emit them into Cassandra.
- Excellent experience working with data modeling tools like Erwin, PowerDesigner, and ER/Studio.
- Proficient working experience with big data tools like Hadoop, Azure Data Lake, and AWS Redshift.
- Strong experience in data migration, data cleansing, transformation, integration, data import, and data export.
- Excellent technical and analytical skills with a clear understanding of design goals for OLTP development and dimensional modeling for OLAP.
- Strong experience in migrating data warehouses and databases onto Hadoop/NoSQL platforms.
- Designed and developed Oracle PL/SQL and shell scripts for data conversions and data cleansing.
- Participated in requirements sessions to gather requirements along with business analysts and product owners.
- Experience designing components using UML: Use Case, Class, Sequence, Deployment, and Component diagrams for the requirements.
- Experience implementing a log producer in Scala that watches application logs and transforms incremental log records.
- Extensive experience writing UNIX shell scripts and automating ETL processes using UNIX shell scripting.
- Strong experience using Excel and MS Access to dump data and analyze it based on business needs.
- Experienced in working with scripting technologies like Python and UNIX shell scripts.
- Good knowledge of Amazon Web Services (AWS) concepts like EMR and EC2; successfully loaded files to HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
- Excellent knowledge of big data infrastructure: distributed file systems (HDFS) and parallel processing (the MapReduce framework).
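As a minimal illustration of the Glue-based S3-to-Redshift loads referenced above, a sketch of such a Glue job in PySpark; the bucket paths, column mappings, and catalog connection name are hypothetical placeholders, not details taken from the resume.

# Minimal AWS Glue job sketch: read Parquet from an S3 data lake and load it into Redshift.
# Bucket, schema, table, and connection names below are hypothetical.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.transforms import ApplyMapping

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw files from the S3 data lake as a DynamicFrame.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-data-lake/campaigns/"]},  # hypothetical path
    format="parquet",
)

# Rename/cast columns on the way into the warehouse table.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("campaign_id", "string", "campaign_id", "string"),
        ("spend", "double", "spend_usd", "double"),
        ("event_ts", "string", "event_ts", "timestamp"),
    ],
)

# Write to Redshift through a Glue catalog connection; COPY staging files go to the temp dir.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",  # hypothetical connection name
    connection_options={"dbtable": "analytics.campaigns", "database": "dw"},
    redshift_tmp_dir="s3://example-data-lake/tmp/redshift/",
)

job.commit()

Glue stages the mapped data in the temporary S3 directory and loads Redshift via COPY, which is generally faster than row-by-row JDBC inserts.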
- Experienced in building data warehouses on the Azure platform using Azure Databricks and Azure Data Factory.
- Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
- Expert in Amazon EMR, S3, ECS, ElastiCache, DynamoDB, and Redshift.

Professional Experience

Sr. Data Engineer
AIG, Houston, TX
April 2023 to Present

Responsibilities:
- Responsible for the design, development, and implementation of dataflow pipelines.
- Built production-ready batch data pipelines using AWS Glue, an S3 data lake, and EMR with PySpark, SQL, and Python.
- Coordinated with business customers to gather business requirements.
- Implemented Spark DataFrames and the Spark SQL API for faster, more efficient data processing.
- Worked extensively on AWS Glue to create batch pipelines.
- Extracted and loaded data into the data lake environment (MS Azure) using Sqoop; the data was accessed by business users.
- Used SDLC (Systems Development Life Cycle) methodologies such as RUP and Agile.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/big data concepts.
- Experienced in developing web services with Python.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience moving data between GCP and Azure using Azure Data Factory.
- Developed ETL processes in AWS Glue to migrate campaign data from external sources (S3, ORC, Parquet, and text files) into AWS Redshift.
- Participated in JAD sessions, requirements gathering, and identification of business subject areas.
- Imported and exported data between MySQL and HDFS using Sqoop and managed data coming from different sources.
- Used a reverse-engineering approach to redefine entities, relationships, and attributes in the data model.
- Used Python to extract weekly information from XML files.
- Worked with Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase in Azure.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Implemented the big data solution using Hadoop, Hive, and Informatica to pull/load data into HDFS.
- Implemented Kafka producers with custom partitions, configured brokers, and implemented high-level consumers to build out the data platform.
- Designed and produced logical and physical data models for the financial platform and other in-house applications running on Oracle databases.
- Created dimensional models based on star schemas and designed them using Erwin.
- Involved in analyzing raw files from the S3 data lake using AWS Athena and Glue without loading the data into a database.
- Worked with ETL tools to migrate data from various OLAP and OLTP databases to the data mart.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Developed PySpark code in Snowflake for FiOS Digital.
- Worked in the Azure environment for development and deployment of custom Hadoop applications.
- Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.
- Wrote Python scripts to parse XML documents and load the data into a database (see the sketch below).
- Developed numerous MapReduce jobs in Scala for data cleansing and analyzing data in Impala.
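A minimal sketch of the XML-parsing and database-load scripts mentioned above, using only the Python standard library; the XML layout, file name, and SQLite table stand in for the real (unspecified) sources and target.

# Parse a weekly XML extract and load the rows into a database table.
# The input format <records><record id="..."><amount>...</amount></record></records> is hypothetical.
import sqlite3
import xml.etree.ElementTree as ET

tree = ET.parse("weekly_extract.xml")  # hypothetical input file
rows = [
    (rec.get("id"), float(rec.findtext("amount", default="0")))
    for rec in tree.getroot().iter("record")
]

# Load the parsed rows into a local SQLite table (stand-in for the target database).
conn = sqlite3.connect("staging.db")
conn.execute("CREATE TABLE IF NOT EXISTS weekly_amounts (record_id TEXT, amount REAL)")
conn.executemany("INSERT INTO weekly_amounts VALUES (?, ?)", rows)
conn.commit()
conn.close()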
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Worked on developing Pig scripts for change data capture and delta-record processing between newly arrived data and data already in HDFS.
- Optimized Hive queries to extract customer information from HDFS.
- Analyzed partitioned and bucketed data using Hive and computed various metrics for reporting.
- Used Python scripts to update content in the database and manipulate files.
- Built Azure Data Warehouse table datasets for Power BI reports.

Environment: Hive, Sqoop, Kafka, Apache Spark, Python, AWS, Glue, Postgres, Git, Bitbucket

Sr. Data Engineer
Cummins, Columbus, Indiana
August 2020 to March 2023

Responsibilities:
- Involved in all phases of the software development life cycle (SDLC) using the Agile Scrum methodology.
- Worked on the big data team to migrate RDBMS data to HDFS through Sqoop.
- Understood MapReduce programs and converted them to Hive and Pig.
- Created and managed Hive jobs through Oozie.
- Migrated flat files and RDBMS data into HDFS through Talend using big data components and Sqoop.
- Helped develop a style guide for BI visualizations covering component design, item naming, and data source creation and naming.
- Created list, crosstab, chart, repeater, drill-through, and master-detail query reports using Jaspersoft Studio.
- Involved in analyzing raw files from the S3 data lake using AWS Athena and Glue without loading the data into a database.
- Customized the login page and home page per client specifications by adding client logos and background images in Cognos 11.0.6; created role-based custom login and home page designs.
- Worked on PySpark data sources, PySpark DataFrames, Spark SQL, and Streaming.
- Created dashboards and reports using Cognos advanced visualizations.
- Created reports using tabular SQL to reduce the load on the Framework Manager model.
- Created query prompts, calculations, conditions, and filters in the reports; developed prompt pages and cascaded input prompts to display only the selected entries in a report.
- Developed SCD Type 1, Type 2, and Type 3 loads in Informatica using Lookup, Expression, and other advanced transformations.
- Built Informatica mappings, sessions, and workflows, and checked code changes in and out through Informatica version control.

Environment: MS Azure, Sqoop, Agile, Kafka, Erwin 9.7, Spark, PySpark, MongoDB, Hive, Python, Git, Bitbucket

Data Engineer
GDIT, Rensselaer, NY
May 2018 to July 2020

Responsibilities:
- Worked as a Hive team member and was involved in designing High Availability for HiveServer, which is a single point of failure and the data warehouse solution for querying and analyzing large sets of big data.
- Involved in the design review of the High Availability (HA) feature for Hive.
- Worked as an HBase team member; HBase is the column-oriented database built over HDFS. Worked with Apache HBase committers and mentored team members.
- Involved in requirement analysis, design, execution, and automation of unit test cases for HDFS, Hive, MapReduce, and HBase in JUnit.
- Worked on PySpark data sources, PySpark DataFrames, Spark SQL, and Streaming using Scala (see the streaming sketch below).
- Skilled in Hadoop cluster setup, monitoring, and administration.
- Resolved customer queries related to installation, configuration, administration, etc.
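A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern behind the Spark Streaming and Kafka items above (the resume also cites Scala for this work); the broker, topic, JSON schema, and paths are hypothetical, and the spark-sql-kafka connector is assumed to be available on the classpath.

# Read a Kafka topic as a stream, parse the JSON payload, and append it to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

# Shape of the JSON messages on the topic (hypothetical).
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Kafka source: each record's value is a JSON string.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "app-events")                   # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Continuously append the parsed events to HDFS; the checkpoint tracks stream progress.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events/")
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()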
- Experience with non-functional testing tools such as heap dump analyzers, thread dump analyzers, GC log analyzers, and profilers.
- Designed and developed automation frameworks and automation suites using Java, JUnit, and Ant.
- Worked on improving performance for several Huawei Hadoop versions.
- Good knowledge of Linux commands and scripting.
- Contributed patches to the Apache open-source HBase component for major bugs.
- Participated in product functional reviews, test specifications, and document reviews.
- Executed MapReduce jobs and built data lakes.
- Worked on ZooKeeper, BookKeeper, and data analytics.

Data Engineer
Dhruvsoft Services Private Limited, Hyderabad, India
October 2016 to February 2018

Responsibilities:
- Analyzed functional specifications based on project requirements.
- Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
- Extended Hive core functionality by writing custom UDFs in Java.
- Developed Hive queries for user requirements.
- Worked on multiple POCs implementing a data lake for multiple data sources, including Teamcenter, SAP, Workday, and machine logs.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Worked on the MS SQL Server PDW migration for the MSBI warehouse.
- Planned, scheduled, and implemented Oracle to MS SQL Server migrations for AMAT in-house applications and tools.
- Worked on the Solr search engine to index incident-report data and developed dashboards in the Banana reporting tool.
- Integrated Tableau with the Hadoop data source to build dashboards providing various insights into the organization's sales.
- Worked on Spark to build BI reports in Tableau; Tableau was integrated with Spark using Spark SQL.
- Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Developed workflows in LiveCompare to analyze SAP data and reporting.
- Worked on Java development, support, and tools support for in-house applications.
- Participated in daily scrum meetings and iterative development.
- Built search functionality for searching through millions of files for logistics groups.

Data Engineer
Ceequence Technologies, Hyderabad, India
June 2015 to September 2016

Responsibilities:
- Created high- and low-level design documents for the various modules.
- Reviewed designs to ensure adherence to standards, templates, and corporate guidelines.
- Validated design specifications against proof-of-concept results and technical considerations.
- Worked on implementing pipelines and analytical workloads using big data technologies such as Hadoop, Spark, Hive, and HDFS.
- Experienced in designing and deploying Hadoop clusters and various big data analytic tools, including Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark, and Impala.
- Performed analysis of the existing source systems, understood the Informatica/Teradata-based applications, and provided the services required for development and maintenance of the applications.
- Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Functions, Cloud DNS, Cloud Storage, and Cloud Deployment Manager, along with the SaaS, PaaS, and IaaS concepts of cloud computing and their implementation on GCP (see the sketch below).
- Coordinated with the application support team and helped them understand the business and the components needed to integrate, extract, transform, and load data.
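A minimal sketch of a Cloud Storage-to-BigQuery load, tying the Cloud Storage item above to the BigQuery and GCS experience cited in the summary; the project, dataset, table, and bucket names are hypothetical placeholders.

# Load a CSV file from Google Cloud Storage into a BigQuery table using the google-cloud-bigquery client.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Kick off the load job from Cloud Storage and block until it finishes.
load_job = client.load_table_from_uri(
    "gs://example-bucket/exports/sales.csv",      # hypothetical bucket/object
    "example-project.analytics.sales",            # hypothetical destination table
    job_config=job_config,
)
load_job.result()

print(f"Loaded {client.get_table('example-project.analytics.sales').num_rows} rows.")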
- Analyzed and developed data integration templates to extract, cleanse, transform, integrate, and load data to data marts for user consumption.
- Reviewed code against standards and checklists.
- Created a deployment document for the developed code and provided support during the code migration phase.
- Created the initial unit test plan to demonstrate that the software, scripts, and databases developed conform to the design document.
- Provided support during the integration testing and user acceptance phases of the project, as well as hypercare support post-deployment.

Environment: Informatica PowerCenter 9.6.1, PowerExchange, Teradata database and utilities, Oracle, GCP, Python, Business Objects, Tableau, flat files, UC4, big data, HDFS, Maestro scheduler, Unix

Technical Skills

Big Data Technologies: Hive, Apache Spark, HBase, Oozie, MongoDB, Kafka
Programming Languages: Java, Python, PySpark
RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 9i/11g
Data Modeling Tools: Erwin
BI Tools: Power BI, Tableau
Cloud Platforms: GCP, AWS, Microsoft Azure, Glue
OS: Windows [...], Linux/Unix