Nitish Kumar - DATA ENGINEER |
[email protected] |
Location: Dallas, Texas, USA |
Relocation: |
Visa: H1B |
NITISH KUMAR
Data Engineer | [email protected] | Phone: 832-307-1698

Professional Summary:
- Over 10 years of extensive IT experience as a Data Engineer, specializing in designing data-intensive applications using the Hadoop ecosystem and Big Data analytics, cloud data engineering (AWS, Azure), data visualization, warehousing, reporting, and data quality solutions.
- Worked with big data tools and technologies (Hadoop, Hive, Spark, HBase, Kafka) to create and build data lakes.
- Experience using Snowflake, Hadoop distributions such as Cloudera and Hortonworks, Amazon AWS (EC2, EMR, RDS, Redshift, DynamoDB, Snowball), and Databricks (Data Factory, notebooks, etc.).
- Experienced in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
- Expertise in implementing Spark and Scala applications using higher-order functions for both batch and interactive analysis requirements.
- Proficient in creating pipelines in Azure Data Factory (ADF) and Azure Databricks, and in setting up Azure Databricks Delta tables in Azure Cloud.
- Extensive hands-on experience with various AWS components, including EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, and DynamoDB, ensuring secure and efficient data management in the AWS public cloud.
- Experienced in configuring Apache Airflow for workflow management, creating workflows in Python, and utilizing DAGs in Airflow for sequential and parallel job execution.
- Optimized PySpark jobs to run on Kubernetes clusters to achieve faster data processing.
- Proficient in Spark architecture with Databricks and Structured Streaming. Skilled in setting up AWS and Microsoft Azure with Databricks and utilizing Databricks Workspace for business analytics and cluster management.
- Demonstrated understanding of the fact/dimension data warehouse design model, including star and snowflake design methods.
- Skilled in building Snowpipe, with deep knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.
- Designed and developed logical and physical data models using concepts such as star schema, snowflake schema, and slowly changing dimensions.
- Proficient in handling various file formats such as Avro, Parquet, ORC, JSON, and XML, and compression techniques such as Snappy and Bzip2.
- Extensive hands-on experience with SQL and NoSQL databases, including MongoDB, HBase, and SQL Server. Developed Java applications for data management in MongoDB and HBase.
- Proficient in defining user stories, driving the Agile board in JIRA, and participating in sprint demos and retrospectives.
- Strong experience with SQL and NoSQL databases, data modeling, and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python.
- Experienced with source control repositories such as Bitbucket, SVN, and Git.
- Good knowledge of Google Cloud Platform (GCP), including BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, GSUTIL, BQ command-line utilities, Dataproc, and Stackdriver.
- Familiarity with CI/CD using containerization technologies such as Docker and Jenkins.
- Experience with Agile/Scrum development and Waterfall project execution methodologies.
- Proficient in working with operating systems such as Windows, Linux, UNIX, and Ubuntu.
- Provide support to development, testing, and operations teams during new system deployments.
- Experience working with NoSQL database technologies, including Cassandra, MongoDB, and HBase.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Hive, YARN, HUE, Oozie, Apache Spark, Sqoop
Hadoop Distributions: Cloudera, Hortonworks
Programming Languages: Java, C/C++, Scala
Scripting Languages: Shell scripting, Python scripting
Databases: MySQL, Oracle, Teradata, DB2
Version Control Tools: SVN, Git, GitHub
Operating Systems: Windows, Linux
Development IDEs: Eclipse IDE, Python IDE
Cloud Technologies: AWS (EMR, S3, EC2), Azure, and Google Cloud Platform (GCP)

Professional Experience

Senior Data Engineer | Aug 2023 to Present
Cigna - Bloomfield, CT
- Design, develop, and test ETL processes in AWS.
- Implemented AWS IAM for managing user permissions of applications that run on EC2 instances.
- Used Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Worked with AWS EMR to run Spark and Hive applications, using Airflow as the scheduling and orchestration tool.
- Created PySpark data frames to bring data from DB2 to Amazon S3. Moved on-prem jobs to run on AWS cloud.
- Hands-on experience working with Snowflake to move data from S3 to Snowflake and vice versa.
- Worked on performance tuning of Snowflake ETL jobs and implemented a row-level security solution in Snowflake.
- Worked with Kafka and Spark Structured Streaming to build a sales pipeline that supports reporting needs.
- Hands-on experience working with Databricks and Delta tables.
- Deployed applications onto AWS Lambda with HTTP triggers and integrated them with API Gateway.
- Developed multiple ETL Hive scripts for data cleansing and transformations.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Developed Spark applications in Python (PySpark) in a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
- Exported data from Hive to an AWS S3 bucket for further near-real-time analytics.
- Ingested data in real time from Apache Kafka into Hive and HDFS.
- Developed streaming applications using Spark Structured Streaming, Kafka, and S3 integration to perform real-time data analysis.
- Processed data in Alteryx to create TDEs for Tableau reporting.
- Used Sqoop to import and export data between RDBMS and HDFS.
- Wrote a Python library built on NumPy and Pandas to detect errors in daily reports.
- Debugged Spark code in local and cluster modes.
- Experience working with the Python data stack (Pandas, Dask, scikit-learn, statsmodels, NumPy, Matplotlib, Seaborn).
- Created a complete processing engine based on the Hortonworks distribution, enhanced for performance.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Implemented Kerberos and Ranger security authentication protocols for the existing cluster.
- Built and published customized interactive reports and dashboards, with report scheduling, using Tableau Server. Migrated a few BO reports to Tableau Desktop views.
- Completed end-to-end design and development of an Apache NiFi flow, which acts as the agent between the middleware team and the EBI team and executes all the actions mentioned above.
- Involved in advanced procedures such as text analytics and processing using in-memory computing capabilities like Apache Spark written in Scala.
Environment: AWS EMR, Athena, AWS S3, Tableau, Hive, Spark, Java, SQL Server, PySpark, Python (Pandas, NumPy), Hortonworks, Linux, Azure, Redshift, Maven, Git, JIRA, ETL, Toad 9.6, UNIX shell scripting, Scala.
Data Engineer | May 2021 to July 2023
Truist - Charlotte, NC
- Experience using different types of stages, such as Transformer, Aggregator, Merge, Join, Lookup, Sort, Remove Duplicates, Funnel, Filter, and Pivot, for developing jobs.
- Created pipelines to load data using ADF.
- Built and maintained Docker container clusters managed by Kubernetes, using Linux, Bash, Git, and Docker.
- Built data pipelines using Python and Apache Airflow for ETL-related jobs, inserting data into Oracle.
- Created job flows using Airflow in Python and automated the jobs. Airflow has a separate stack for developing DAGs and runs jobs on EMR or EC2 clusters.
- Wrote queries in MySQL and native SQL.
- Created pipelines in Azure Data Factory utilizing Linked Services to extract, transform, and load data from many sources, such as Azure SQL Data Warehouse, the write-back tool, and backwards.
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream data in DBFS.
- Deployed models as Python packages, as APIs for backend integration, and as services in a microservices architecture with a Kubernetes orchestration layer for the Docker containers.
- Worked on big data integration and analytics based on Hadoop, Solr, PySpark, Kafka, Storm, and webMethods.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data, and analyzed them by running Hive queries.
- Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Used AWS to create storage resources and define resource attributes, such as disk type or redundancy type, at the service level.
- Created data tables utilizing PyQt to display customer and policy information and to add, delete, and update customer records.
- Worked on data management disciplines, including data integration, modeling, and other areas directly relevant to business intelligence/business analytics development.
- Developed tools using Python, shell scripting, and XML to automate tasks.
- Involved in database migration methodologies and integration conversion solutions to convert legacy ETL processes into an Azure Synapse compatible architecture.
- Created clusters to classify control and test groups.
- Developed multiple notebooks using PySpark and Spark SQL in Databricks for data extraction, analyzing and transforming the data according to business requirements.
- Built Jenkins jobs for CI/CD infrastructure for GitHub repos.
- Automated and monitored AWS infrastructure with Terraform for high availability and reliability, reducing infrastructure management time by 90% and improving system uptime.
- Integrated Azure Data Factory with Blob Storage to move data through Databricks for processing, then to Azure Data Lake Storage and Azure SQL Data Warehouse.
Environment: ER/Studio, Teradata, SSIS, SAS, Excel, T-SQL, SSRS, Tableau, SQL Server, Cognos, pivot tables, graphs, MDM, PL/SQL, ETL, DB2, Oracle, SQL, Informatica PowerCenter, etc.

Data Engineer | Oct 2019 to April 2021
Costco Travels - Seattle, WA
- Developed pipelines using Hive (HQL) to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Analyzed and gathered business requirements from clients, conceptualized solutions with technical architects, verified the approach with appropriate stakeholders, and developed end-to-end scenarios for building the application.
- Derived data from relational databases to perform complex data manipulations and conducted extensive data checks to ensure data quality.
- Performed data wrangling to clean, transform, and reshape the data utilizing the NumPy and Pandas libraries.
- Worked with datasets of varying size and complexity, including both structured and unstructured data, and participated in all phases of data mining, data cleaning, data collection, variable selection, feature engineering, model development, validation, and visualization; performed gap analysis.
- Optimized many SQL statements and PL/SQL blocks by analyzing the execution plans of SQL statements, and created and modified triggers, SQL queries, and stored procedures for performance improvement.
- Implemented predictive analytics and machine learning algorithms in Databricks to forecast key metrics in the form of designed dashboards on AWS (S3/EC2) and the Django platform for the company's core business.
- Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with scikit-learn preprocessing. Performed data imputation using various methods in the scikit-learn package in Python.
- Used Sqoop to move data from the Oracle database into Hive by creating delimiter-separated files, using these files in an external location as an external Hive table, and further moving the data into refined tables in Parquet format using Hive queries.
- Used Teradata utilities such as FastExport and MLOAD for handling various data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
- Evaluated the performance of the Databricks environment by converting complex Redshift scripts to Spark SQL as part of a new technology adoption project.
- Led engagement planning: developed and managed Tableau implementation plans for stakeholders, ensuring timely completion and successful delivery according to stakeholder expectations.
- Managed workload and utilization of the team. Coordinated resources and processes to achieve Tableau implementation plans.
Environment: R, Python, ETL, Agile, Data Quality, RStudio, Tableau, Data Governance, Supervised & Unsupervised Learning, Java, NumPy, SciPy, Hadoop, Sqoop, HDFS, Spark SQL, Pandas, PostgreSQL, AWS (EC2, RDS, S3), Matplotlib, scikit-learn, Shiny.

ETL Engineer | Nov 2017 to Aug 2019
Info Edge India Ltd - Hyderabad
- Involved in the requirement analysis, design, coding, and implementation phases of the project.
- Loaded data from Teradata to HDFS using Teradata Hadoop connectors.
- Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and Spark SQL APIs.
- Wrote new Spark jobs in Scala to analyze customer and sales history data.
- Used Kafka to get data from many streaming sources into HDFS.
- Involved in collecting and aggregating substantial amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Experience in Hive partitioning, bucketing, and collections, performing several types of joins on Hive tables.
- Created Hive external tables to perform ETL on data generated on a daily basis.
- Wrote HBase bulk load jobs to load processed data into HBase tables by converting it to HFiles.
- Performed validation on the ingested data to filter and cleanse the data in Hive.
- Created Sqoop jobs to handle incremental loads from RDBMS into HDFS and applied Spark transformations.
- Loaded the data into Hive tables from Spark and used the Parquet columnar format.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
- Developed Sqoop import scripts for importing reference data from Teradata.
- Used Docker Compose to create Kubernetes clusters and maintained them in QA.
- Maintained Python program libraries, user manuals, and technical documentation.
- Managed large datasets using Pandas data frames and MySQL.
- Developed monitoring and notification tools using Python.
- Worked on monitoring, managing, and troubleshooting the Hadoop log files.
- Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
Environment: Apache Spark, HDFS, TDCH, Kafka, Flume, Hive, HBase, Sqoop, Oozie, Teradata, Kubernetes, MySQL, Python, Pandas, Splunk.

Hadoop Developer | Jan 2017 to Oct 2017
CA Technologies - Hyderabad
- Extensively used Agile methodology as the organization standard to implement the data models.
- Used a microservice architecture with Spring Boot based services interacting through a combination of REST and Apache Kafka message brokers.
- Created several types of data visualizations using Python and Tableau.
- Extracted large volumes of data from AWS using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Imported legacy data from SQL Server and Teradata into Amazon S3.
- Performed regression testing for golden test cases from the state (end-to-end test cases) and automated the process using Python scripts.
- Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms. Expertise in R, MATLAB, Python, and their respective libraries.
- Developed Spark Structured Streaming to read data from Kafka in real-time and batch modes, apply different modes of change data capture (CDC), and then load the data into Hive.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Performed K-means clustering, regression, and decision trees in R.
- Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python.
- Applied different dimensionality reduction techniques, such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), to the feature matrix.
- Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
- Responsible for the design and development of Python programs/scripts to prepare, transform, and harmonize data sets in preparation for modeling.
- Worked with market mix modeling to strategize advertisement investments and better balance the ROI on advertisements.
- Used Python (NumPy, SciPy, Pandas, scikit-learn, Seaborn) and Spark (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages in Python, and built models using deep learning frameworks.
- Implemented univariate, bivariate, and multivariate analysis on the cleaned data to get actionable insights on the 500-product sales data, using visualization techniques in Matplotlib, Seaborn, and Bokeh, and created reports in Power BI.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
Environment: Spark, Python, HDFS, Hive, AWS, Redshift, Pig, Sqoop, Scala, Kafka, shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile methodology.

Associate Software Engineer | June 2015 to Dec 2016
3i InfoTech - Mumbai, India
- Performed data analysis, data profiling, and requirement analysis.
- Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format. Used Hive UDFs to flatten the JSON data.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Involved in managing and reviewing Hadoop log files.
- Developed various Python scripts to find vulnerabilities in SQL queries by doing SQL injection, permission checks, and analysis.
- Designed and deployed AWS solutions using EC2, S3, EBS, Elastic Load Balancer (ELB), Auto Scaling groups, and OpsWorks.
- Created Hive tables and was involved in data loading and writing Hive UDFs. Developed Hive UDFs for rating aggregation.
- Implemented MapReduce programs to handle semi-structured and unstructured data such as XML, JSON, and Avro data files, and sequence files for log files.
- Designed NoSQL schemas in HBase.
- Involved in weekly walkthroughs and inspection meetings to verify the status of the testing efforts and the project as a whole.
Environment: Spark, Scala, Hive, JSON, AWS, MapReduce, Hadoop, Python, XML, NoSQL, HBase, and Windows.

Educational Background
Bachelor's in Computer Science, V.R. Siddhartha Engineering College, Andhra Pradesh, India