Nithisha M - Data Engineer
[email protected]
Location: Overland Park, Kansas, USA
Relocation: Open to relocation
Visa: H4 EAD
PROFESSIONAL SUMMARY:
- 8+ years of experience in the IT industry as a Data Engineer, with extensive hands-on experience in the Apache Hadoop ecosystem and enterprise application development.
- Good knowledge of extracting models and trends from raw data in collaboration with the data science team.
- Good understanding of Spark and MPP architectures, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Implemented and customized Jira workflows to streamline data engineering processes, ensuring efficient task management from initiation to completion.
- Developed and implemented scalable and reliable data architecture on AWS, utilizing services such as Amazon S3, Amazon Redshift, and Amazon RDS to ensure efficient data storage and retrieval.
- Implemented version control using Git to track codebase changes efficiently.
- Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, YARN, Oozie, and ZooKeeper.
- Extensive experience across domains including healthcare, finance, and IT.
- Experience working with public and private cloud platforms such as Amazon Web Services (AWS) and Microsoft Azure.
- Collaborated with team members to establish and follow Git workflows, ensuring code integrity.
- Configured GitHub webhooks for real-time notifications and integration with external tools.
- Provisioned and managed scalable infrastructure on AWS to support data processing tasks.
- Implemented AWS Lambda functions for serverless data processing and automation.
- Designed and maintained AWS Glue ETL jobs for efficient data extraction and transformation.
- Adept at architecting and managing PostgreSQL and Snowflake databases, ensuring data integrity, optimal performance, and efficient indexing.
- Expertise in Python and R for data analysis, scripting, and automation, with a strong focus on developing reusable and scalable code.
- Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
- Experience in analyzing data using R, SAS, and Python.
- Experience in machine learning, deep learning, and data mining with large datasets of structured and unstructured data, covering data acquisition, data validation, predictive modeling, and data visualization.
- Involved in the entire data science project life cycle, including data extraction, data cleaning, statistical modeling, and data visualization with large structured and unstructured data sets.
- Knowledge of Apache Spark and of developing data processing and analysis algorithms using Python.
- Experience in building models with deep learning frameworks such as TensorFlow, PyTorch, and Keras.
- Extensively worked with Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK, and scikit-learn).
- Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
- Data-driven and highly analytical, with working knowledge of statistical modeling approaches and methodologies (clustering, segmentation, variable reduction, regression analysis, hypothesis testing, decision trees, machine learning) and of an ever-evolving regulatory environment.
- Experience in developing Spark applications using Spark SQL, PySpark, and Data Lake in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the PySpark sketch below).
- Managed AWS S3 buckets for storing and retrieving large volumes of structured and unstructured data.
- Proficient in Power BI, Tableau, Qlik, and R-Shiny data visualization tools to analyze large datasets and create visually powerful, actionable, interactive reports and dashboards.
- Experienced in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
- Excellent understanding of Agile and Scrum development methodologies.
- Ability to maintain a fun, casual, professional, and productive team atmosphere.
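Below is a minimal, illustrative PySpark sketch of the multi-format extraction, transformation, and aggregation pattern described in the summary; the paths, column names, and application name are hypothetical placeholders, not taken from any client project.

```python
# Illustrative only: hypothetical paths and column names.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract: read the same logical dataset arriving in multiple file formats.
csv_events = spark.read.option("header", True).csv("/data/raw/events_csv/")
json_events = spark.read.json("/data/raw/events_json/")

# Transform: align the schemas and combine the sources.
events = csv_events.select("customer_id", "event_type", "event_ts").unionByName(
    json_events.select("customer_id", "event_type", "event_ts")
)

# Aggregate: daily usage counts per customer via the DataFrame API.
daily_usage = (
    events.withColumn("event_date", F.to_date("event_ts"))
          .groupBy("customer_id", "event_date")
          .agg(F.count("*").alias("event_count"))
)

# Load: land the curated result as partitioned Parquet.
daily_usage.write.mode("overwrite").partitionBy("event_date").parquet("/data/curated/daily_usage/")
```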
TECHNICAL SKILLS:
Programming Languages: Python, R, SAS, SQL, C, C#, Java, C++
Data and Analytics Tools: PySpark, Apache Hadoop ecosystem (Hive, Pig, ZooKeeper), SQL, Enterprise Miner, PyTorch, Keras, scikit-learn, TensorFlow, OpenCV
Scripting Languages: JavaScript, jQuery, Python, shell script (bash, sh)
Database Management: MySQL, SQL Server, PostgreSQL, MongoDB, Oracle 10g, Presto
Python Libraries/Packages: NumPy, SciPy, Boto, Pickle, PySide, PyTables, DataFrames, Pandas, Matplotlib, SQLAlchemy, httplib2, urllib2, Beautiful Soup, PyQuery
Web Development: HTML, CSS, JavaScript, AJAX, jQuery, Django, Flask
Tools: Power BI, Tableau, Qlik, R-Shiny, Alteryx
Cloud Computing: AWS (EC2, S3, EMR, Redshift), Azure (Data Lake, Data Factory, SQL), GCP
IDEs: PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, Sublime Text, Visual Studio Code, IntelliJ, Eclipse

PROFESSIONAL EXPERIENCE:

Client: Mayo Clinic, Rochester, MN    May 2020 - Present
Role: Sr. Data Engineer
Responsibilities:
- Used Microsoft Azure to facilitate seamless data movement and scheduling for cloud-based technologies, including Azure Blob Storage and Azure SQL Database.
- Extracted, transformed, and loaded data from various source systems to Azure data storage services using Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Integrated Jira with Agile methodologies, facilitating iterative development cycles and improving collaboration between data engineering and cross-functional teams.
- Utilized Git hooks for automated pre-commit and pre-push checks to enhance code quality.
- Maintained and updated documentation using the GitHub Wiki for clear project understanding.
- Independently managed the development of ETL processes from development to delivery.
- Utilized data platform technology to centralize and organize diverse datasets from various sources.
- Designed, developed, and implemented ETL processes using Informatica PowerCenter to extract, transform, and load data; developed custom alerts using Azure Data Factory, Azure SQL DB, and Logic Apps.
- Created and managed workflow schedules, dependencies, and execution sequences in Informatica PowerCenter.
- Proficient in creating and managing user stories in Jira, aligning data engineering tasks with project requirements and business needs.
- Designed, implemented, and managed NoSQL databases such as MongoDB, ensuring efficient data storage and retrieval.
- Implemented Git tags for release management and tracking project milestones.
- Received comprehensive training in Hadoop, Linux, the ELK Stack, Kubernetes, Docker, Cloudera, machine learning, TensorFlow, Elasticsearch, and web development and deployment.
- Developed and designed data models and structures, creating ETL jobs for data acquisition and manipulation.
- Gained a deep understanding of data sources, implemented data standards, and maintained data quality and master data management.
- Worked with enterprise data modeling teams on the creation of logical models.
- Developed complex SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support Tableau reports.
- Implemented monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
- Used data-quality KPIs to ensure that data meets established standards, improving the reliability and accuracy of analytics and decision-making.
- Demonstrated expertise in utilizing Jira for issue tracking and resolution, ensuring timely identification and resolution of data engineering challenges and bottlenecks.
- Implemented intricate business logic through T-SQL stored procedures, functions, views, and advanced query concepts; designed and implemented container orchestration systems with Docker Swarm.
- Administered and optimized SQL databases, particularly PostgreSQL, for structured and relational data storage.
- Created and maintained data models for both NoSQL and SQL databases to meet specific application and business requirements.
- Analyzed data residing in Azure Data Lake and Blob Storage by integrating with Databricks.
- Designed and implemented Hadoop-based solutions for large-scale data processing, storage, and analysis.
- Designed and implemented robust Snowflake data architectures, ensuring optimal performance and scalability.
- Monitored Git repositories for vulnerabilities and applied security best practices.
- Leveraged data platform technology to enable seamless integration with external applications, databases, and services.
- Collaborated with cross-functional teams through GitHub Issues and Pull Requests.
- Configured custom fields in Jira to capture and track relevant data engineering metrics, providing insightful analytics for continuous process improvement.
- Developed and maintained data models in Snowflake, aligning them with business requirements and best practices.
- Designed and implemented Power BI reports and dashboards for effective data visualization and analysis.
- Extracted and transformed data from diverse sources for integration into Power BI solutions.
- Implemented data modeling and relationships in Power BI to ensure accurate and meaningful insights.
- Implemented and managed Apache Kafka clusters for real-time data streaming and event-driven architectures.
- Led epic and sprint planning sessions using Jira, ensuring alignment of data engineering tasks with project goals and milestones.
- Designed and developed Kafka producers and consumers to facilitate efficient data communication between applications (see the sketch following this section).
- Collaborated with stakeholders to understand data requirements and translate them into effective Tableau solutions.
- Implemented GitHub Actions for continuous integration and automated testing.
- Optimized Tableau performance by creating efficient data extracts and implementing best practices.
- Utilized Logic Apps for decision-making actions based on workflows.
Environment: Microsoft SQL, BIDS, PySpark, Spark, T-SQL, REST, SOAP, Docker, Tableau, ETL, Oracle, Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Python, Hive, Sqoop, Hortonworks, Linux, Unix, SSIS, SQL, SSRS, Azure, PostgreSQL, Power BI, Informatica, Snowflake.
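The Kafka producer/consumer work referenced above can be illustrated with a minimal sketch using the kafka-python library; the broker address, topic name, and payload below are hypothetical placeholders.

```python
# Minimal kafka-python sketch; broker address, topic, and payload are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # placeholder broker
TOPIC = "patient-events"       # placeholder topic

# Producer: publish JSON-encoded events to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"event": "appointment_created", "id": 123})
producer.flush()

# Consumer: read events from the beginning of the topic and decode them.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)
```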
Client: USAA, San Antonio, TX    Mar 2018 - Apr 2020
Role: Data Engineer
Responsibilities:
- Managed data storage solutions on AWS, including S3, Glacier, and other storage services.
- Administered and optimized AWS database services such as RDS, DynamoDB, and Redshift for efficient data storage and retrieval.
- Integrated Jira with version control systems (e.g., Git), enhancing traceability and providing a comprehensive view of code changes associated with data engineering tasks.
- Implemented and maintained ETL processes using AWS Glue for seamless data extraction, transformation, and loading.
- Optimized and troubleshot the integration of test cases into the CI/CD pipeline using Docker images.
- Applied data cleansing, validation, and transformation rules using Informatica transformations for accurate and reliable data.
- Conducted Git training sessions for team members to enhance version control proficiency.
- Created customized reports and dashboards in Jira, providing stakeholders with real-time visibility into the progress and performance of data engineering projects.
- Utilized version control systems to manage and track changes to Informatica workflows, mappings, and configurations.
- Worked on various Spark optimization techniques for memory management, garbage collection, serialization, and custom partitioning.
- Developed Spark programs to parse raw data, populate staging tables, and store refined data in partitioned tables in the Enterprise Data Warehouse (EDW).
- Optimized database queries for MongoDB and PostgreSQL to enhance overall system performance.
- Planned and executed data migration tasks between NoSQL and SQL databases, ensuring data integrity and consistency.
- Developed Spark applications to implement various aggregation and transformation functions using Spark RDDs and Spark SQL.
- Worked on DB2 SQL connections from Spark Scala code for Select, Insert, and Update operations.
- Documented and communicated branching strategies and Git workflows for project consistency.
- Created and managed repositories on GitHub, facilitating seamless code collaboration.
- Utilized broadcast joins in Spark to optimize joins without shuffling data across nodes (see the sketch following this section).
- Implemented Oozie Scheduler systems to automate pipeline workflows and orchestrate Spark jobs.
- Designed and implemented Sqoop incremental jobs, reading data from DB2 and loading it into Hive tables for generating interactive reports using Tableau.
- Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive tables for faster data processing.
- Optimized Kafka configurations for high availability, fault tolerance, and scalability.
- Ensured seamless data integration using Snowflake, optimizing ETL workflows for efficiency.
- Optimized SQL queries and performance in Snowflake to enhance data retrieval and processing speed.
- Collaborated with business stakeholders to understand reporting requirements and translated them into Power BI solutions.
- Established and maintained data governance standards within Power BI for data accuracy and consistency.
- Ensured data accuracy and integrity in Tableau visualizations through proper data validation.
- Implemented interactive and user-friendly features in Tableau dashboards for an enhanced user experience.
- Virtualized servers using Docker for test and development environments, with configuration automation using Docker containers.
- Facilitated data profiling in the exploration and discovery of data characteristics, allowing data engineers and analysts to understand the structure and content of datasets.
- Worked with the Hue GUI for scheduling jobs, file browsing, job browsing, and Metastore management.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Python, Hive, Sqoop, Hortonworks, Linux, Unix, SSIS, SQL, SSRS, AWS, Tableau, PostgreSQL, Power BI, Informatica, Snowflake.
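The broadcast-join bullet above can be illustrated with a minimal PySpark sketch; the dataset paths, table roles, and join key are hypothetical placeholders.

```python
# Hypothetical broadcast-join example; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

claims = spark.read.parquet("/warehouse/staging/claims/")    # large fact table
members = spark.read.parquet("/warehouse/dim/members/")      # small dimension table

# Broadcasting the small table ships it to every executor,
# avoiding a shuffle of the large table across the cluster.
enriched = claims.join(broadcast(members), on="member_id", how="left")

enriched.write.mode("overwrite").parquet("/warehouse/curated/claims_enriched/")
```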
Client: Cummins, Mumbai, India    July 2016 - Oct 2017
Role: Data Engineer
Responsibilities:
- Developed an automated process in the Azure cloud for daily data ingestion from web services into Azure SQL DB.
- Created streaming pipelines using Azure Event Hubs and Stream Analytics to analyze data for dealer efficiency and open table counts from IoT-enabled tables.
- Used Databricks with Azure Data Factory (ADF) for computing large volumes of data.
- Performed ETL operations in Azure Databricks with dbt, connecting to different relational database source systems using JDBC connectors (see the sketch following this section).
- Collaborated with data engineers to integrate Kafka with various data sources and destinations.
- Developed ETL pipelines using SSIS and NiFi from SQL and Oracle source systems and loaded the data into HDFS.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Converted SQL code to Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Created SSRS reports/dashboards from source data before loading it into RDDs.
- Used SSIS, NiFi, and Sqoop for ETL operations to create data flow pipelines.
- Developed and implemented Hive queries and functions for loading, evaluating, filtering, and storing data.
- Wrote shell scripts to automate the SQL-to-Spark conversion in Linux.
- Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
- Utilized the Spark API over Hortonworks Hadoop YARN for analytics on data in Hive.
- Implemented Scala scripts using RDDs and DataFrames/SQL/Datasets in Spark 1.6 and Spark 2.1 for data aggregation, queries, and writing data.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and RDDs.
- Implemented and managed security measures within Snowflake, including role-based access control and data encryption.
- Developed Hive queries to process data and generate data cubes for visualization.
- Monitored Kafka cluster performance, troubleshot issues, and implemented necessary optimizations.
- Developed Python scripts for file validations in Databricks and automated processes using ADF.
- Utilized Python for exploratory data analysis, gaining insights and preparing data for Power BI reporting.
- Developed Python scripts for extract, transform, load (ETL) processes to prepare data for Power BI.
- Established and enforced security measures for NoSQL and SQL databases, managing user access and permissions.
- Used Python and Groovy scripts for NiFi transformations in the pipelines.
- Designed and implemented effective indexing strategies to enhance query performance and response times in MongoDB and PostgreSQL.
- Involved in creating Hive tables and loading and analyzing data using Hive queries.
- Conducted Tableau Server administration tasks, including user access management and security configurations.
- Integrated Tableau with various data sources, databases, and data warehouses for seamless data connectivity.
- Stayed updated on Power BI features and updates, incorporating new functionality into existing solutions.
- Created visually compelling and interactive reports in Power BI for effective data visualization.
- Optimized MapReduce programs using combiners, partitioners, and custom counters to deliver the best results.
- Developed tools using Python, shell scripting, and XML to automate menial tasks.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Python, Hive, Sqoop, Hortonworks, Linux, UNIX, SSIS, SQL, SSRS, Azure, Tableau, PostgreSQL, Power BI, Snowflake.
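The JDBC-based ETL bullet above can be illustrated with a minimal PySpark sketch of a JDBC extract in a Databricks-style job; the connection URL, credentials, table, and output path are hypothetical placeholders, and the appropriate JDBC driver is assumed to be on the cluster classpath.

```python
# Hypothetical JDBC extract into a Spark DataFrame; URL, credentials, and tables are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-extract").getOrCreate()

jdbc_url = "jdbc:sqlserver://source-db.example.com:1433;databaseName=dealer_ops"  # placeholder

# Extract: pull the source table over JDBC.
orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.service_orders")  # placeholder table
    .option("user", "etl_user")               # placeholder credentials
    .option("password", "********")
    .load()
)

# Transform: light cleanup before landing the data in the lake.
cleaned = orders.dropDuplicates(["order_id"]).filter("order_status IS NOT NULL")

# Load: write to a partitioned Parquet area for downstream processing.
cleaned.write.mode("append").partitionBy("order_date").parquet("/mnt/datalake/curated/service_orders/")
```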
Client: Micron, Hyderabad, India    May 2015 - June 2016
Role: SQL Developer
Responsibilities:
- Ensured data security on AWS by enforcing measures through IAM, KMS, and VPC to safeguard sensitive information.
- Provided Tableau training and support to end users and team members.
- Integrated diverse data sources and formats using AWS services such as DataSync and Database Migration Service (DMS).
- Designed and developed Power BI data models for efficient data representation and analysis.
- Leveraged AWS Lambda for serverless computing, enabling scalable and cost-effective data processing.
- Stayed updated on Tableau features, enhancements, and industry best practices.
- Designed, built, and optimized end-to-end ETL pipelines for efficient and reliable data movement.
- Performed data wrangling on JSON-format vehicle data, cleaning and structuring required fields into Pandas data frames.
- Developed SQL scripts to extract quote and customer data from the GDW (Geico Data Warehouse), and monitored and analyzed the performance of both NoSQL and SQL databases.
- Troubleshot and resolved issues related to Hadoop cluster performance and data processing.
- Stayed informed about Hadoop ecosystem updates and best practices for data management.
- Contributed to the development and execution of data warehousing strategies leveraging Snowflake.
- Established and maintained metadata management processes within Snowflake for enhanced data governance.
- Implemented Spark using Python and Spark SQL for efficient testing and processing of large datasets.
- Developed Spark Streaming jobs in Python to consume messages from Kafka and download JSON files from AWS S3 buckets.
- Identified and proactively resolved bottlenecks and issues to ensure optimal database performance.
- Implemented data cleansing and transformation using Python, ensuring high-quality data for Power BI dashboards.
- Integrated Python scripts seamlessly into Power BI workflows for enhanced data processing capabilities.
- Conducted training sessions and provided support to teams using Kafka for data streaming applications.
- Created and managed Power BI data gateways for secure and efficient data connectivity.
Environment: Spark, Python, SQL, Kafka, JSON, AWS, Tableau, Power BI, ETL, Snowflake, Pandas, Hadoop.

EDUCATION:
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, Hyderabad, TS, India
B.Tech in Computer Science and Engineering    June 2011 - March 2015
Major in Computer Science