
Naveen Kumar - Data Architect/Engineer
naveenkumarp0201@gmail.com
Location: Albany, Texas, USA
Relocation: yes
Visa: H1B
LinkedIn: Naveen K

Mobile: +1 612-213-5126 | Email: naveenkumarp0201@gmail.com

Summary:

14+ years of experience designing end-to-end data solutions using Databricks, PySpark, and Java, and performing system analysis in the Manufacturing, Agriculture, Banking, Finance, Insurance, and Telecom sectors.
Developed Data Lake, Delta Lake, and Lakehouse architectures.
Implemented a single ETL pipeline framework using PySpark and Spark (Java) for big data processing on platforms such as Azure Databricks, Apache Spark, and Hadoop.
Created and extended data frameworks, including storage (Fivetran, Hevo Data, ADLS), analysis, and engineering layers, with custom frameworks.
Proficient in SQL/NoSQL, optimized Hive performance, and analyzed data using HiveQL and Spark SQL.
Designed and maintained high-performance ELT/ETL processes as a Data Architect.
Demonstrated expertise in Hadoop Architecture, Cloud computing infrastructure (AWS), and Spark streaming modules.
Experienced in Dimensional Data Modeling, Lambda Architecture, Batch processing, and Oozie.
Proficiency in core Java, Python, multi-threading, multiprocessing, JDBC, Shell Scripting, and Java APIs.
Collaborated with Dev and QA teams to ensure data accuracy and integrity.
Built prototypes of ML models, addressing diverse customer needs, and translated business needs into ML problems.
Worked with various programming languages and technologies including Scala, JSON, XML, REST, SOAP Web services, Groovy, MVC, Eclipse, Weblogic, Websphere, and Apache Tomcat servers.
Extensive knowledge of Data Modeling, Data Conversions, Data integration, and Data Migration.
Experienced in data extraction, transformation, and loading from heterogeneous systems such as flat files, Excel, Oracle, Teradata, and MS SQL Server.
Proficient in UNIX/Linux commands, scripting, and deploying applications on servers.
Good understanding of Machine Learning algorithms and strategies for Data Analysis.
Strong skills in algorithms, data structures, object-oriented design, design patterns, documentation, and QA/testing.
Worked in fast-paced Agile Teams, practicing Test-Driven development and testing in scrum teams.
Ability to collaborate with cross-functional teams, including data scientists, developers, and business stakeholders.
Excellent domain knowledge in Insurance, Telecom, Manufacturing.


Education:

MS (IT) from IIITM-Kerala, Thiruvananthapuram.
B.Tech (ECE) from Kakatiya University, Warangal.


Technical Skills:




Big Data Technologies: Azure Cloud, Airflow, ADF, PySpark, Azure Databricks, AWS EMR, S3, EC2 Fleet, Spark 2.2/2.0/1.6, Hortonworks HDP, Hadoop, MapReduce, Pig, Hive, Apache Spark, Spark SQL, Informatica PowerCenter 9.6.1/8.x, Kafka, NoSQL, Elastic MapReduce (EMR), Hue, YARN, NiFi, Impala, Sqoop, Solr, Oozie, Fivetran, Hevo Data
ML: NumPy, SciPy, Scikit-learn, TensorFlow, Keras, Pandas, Matplotlib, Plotly
Databases: HBase, Microsoft SQL Server, MySQL, Snowflake; relational and NoSQL databases
Platforms (OS): Red Hat Linux, Ubuntu, Windows NT/2000/XP
Programming Languages: Java, Scala, SQL, UNIX shell scripting, JDBC, Python, Perl
Security Management: Hortonworks Ambari, Cloudera Manager, Apache Knox, XA Secure, Kerberos
Web Technologies: DHTML, HTML, XHTML, XML, XSL (XSLT, XPath), XSD, CSS, JavaScript, SOAP, RESTful, Agile, Design Patterns
Data Warehousing: Informatica PowerCenter/PowerMart/Data Quality/Big Data, Pentaho, ETL Development, Amazon Redshift, IDQ
Database Tools: JDBC, Hadoop, Hive, NoSQL, SQL Navigator, SQL Developer, TOAD, SQL*Plus, SAP BusinessObjects
Data Modeling: Rational Rose, Erwin 7.3/7.1/4.1/4.0
Code Editors: Eclipse, IntelliJ IDEA, PyCharm


Projects:

UPL Ltd June 2022 - Sep 2024
Data Architect

Designed and implemented the concept of a mono-ETL pipeline that addresses roughly 80% of the organization's ETL requirements, using Databricks, Python, and PySpark.
Developed a Machine learning model that improved the accuracy of predictions by 20%.
Implemented delta lake, and optimized most of the data pipelines.
Proven ability to lead and manage a team of data engineers.
Experience with machine learning algorithms and techniques, such as regression, classification, clustering, and natural language processing.
Created a UPL customer churn ML model using a combination of XGBoost and LightGBM, and a cross-sell recommendation model using association rule mining (ARM) approaches and data mining.
Implemented a deal price prediction model and a health-of-the-deal score for the next customer, along with the ERD.
Experience with cloud computing platforms such as AWS and Azure; data governance: strong understanding of RBAC/ABAC.
Experience with Azure Data Engineering technologies, such as Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure SQL Data Warehouse, Azure Data Lake Storage, Azure Data Explorer, Azure Stream Analytics, and Azure Machine Learning.
Designed and implemented ETL pipelines using SnapLogic to integrate cloud and on-premise data sources efficiently.
Experience with data modeling, data warehousing, data lakes, and data pipelines.
Experience with data lineage, data leak prevention, security, and compliance.
Experience with SQL, Python, and other programming languages.
Experience with cloud computing platforms, such as Azure, AWS, and Google Cloud Platform.
Projects and accomplishments:
Led the development and implementation of a data pipeline that improved the efficiency of data processing by 50%.
Managed a team of 10 data engineers who built a data lake that enabled the company to store and analyze petabytes of data.
Developed and optimized data transformation workflows in DBT for improved data modeling and analytics.
Automated data validation and quality checks using SnapLogic and DBT, ensuring accurate and reliable reporting.
Defined a FastAPI path operation that calls the ML model's REST API, in a new Python service importing the FastAPI and uvicorn packages (a sketch follows this list).
Designed and implemented a pipeline migrator tool from Hevo Data to Fivetran.
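Illustrative only: a minimal sketch of such a FastAPI path operation forwarding requests to an ML model's REST endpoint. The endpoint URL, route name, and request fields are hypothetical, not taken from the project.

# app.py - sketch of a path operation that relays features to a model-serving REST API
import requests
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODEL_ENDPOINT = "https://example-model-serving/api/score"  # hypothetical scoring URL

class ChurnRequest(BaseModel):
    customer_id: str
    features: dict

@app.post("/predict/churn")
def predict_churn(req: ChurnRequest):
    # forward the features to the model-serving REST API and relay its answer
    resp = requests.post(MODEL_ENDPOINT, json=req.dict(), timeout=10)
    resp.raise_for_status()
    return {"customer_id": req.customer_id, "prediction": resp.json()}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Running the file starts uvicorn on port 8000; a POST to /predict/churn sends the payload to the model service and returns its response.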

Environment: Azure Cloud, Airflow, ADF, PySpark, Azure Databricks, Azure Data Lake Storage, Azure Key Vault, Azure Logic Apps, Java, Python, Azure Web Apps, REST API, ML, Fivetran, Hevo Data, DBT.

Ford, MI, USA May 2017 - June 2021
Role: Sr. Data Engineer/ML
Responsibilities:

Loaded data from HDFS into Spark RDDs for running predictive analytics.
Used HiveContext, which provides a superset of the functionality in SQLContext, and preferred the HiveQL parser for queries that read data from Hive tables (fact, syndicate).
Modeled Hive partitions extensively for data separation and faster processing, and followed Hive best practices for tuning.
Utilized machine learning techniques to identify campaigns likely to see a decline in product sales, as well as the characteristics of customers most likely to increase their purchases.
Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on the RDDs.
Cached RDDs for better performance and performed actions on each RDD.
Created Hive fact tables on top of raw data from different retailers, partitioned by time dimension key, retailer name, and data supplier name, which were further processed and pulled by the analytics service engine.
Developed complex, maintainable Python and Scala code that satisfies application requirements for data processing and analytics using built-in libraries.
Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from sources, performing transformations and read/write operations, and saving results to output directories in HDFS/AWS S3.
Responsible for building scalable distributed data solutions in an Amazon EMR cluster environment.
Worked with the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (a sketch follows this list).
Loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
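Illustrative only: a minimal sketch of that Kafka-to-Parquet flow using the Spark Streaming (DStream) Kafka integration available in the Spark 1.6-2.x releases listed above; the broker address, topic name, and HDFS path are hypothetical.

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Kafka 0.8 integration, Spark 1.6-2.x

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 30)  # 30-second micro-batches

# direct stream of (key, value) pairs from a hypothetical topic
stream = KafkaUtils.createDirectStream(
    ssc, ["vehicle_events"], {"metadata.broker.list": "broker1:9092"})

def save_batch(time, rdd):
    # each micro-batch arrives as an RDD; convert it to a DataFrame and append as Parquet
    if not rdd.isEmpty():
        df = spark.read.json(rdd.map(lambda kv: kv[1]))  # value payload assumed to be JSON
        df.write.mode("append").parquet("hdfs:///data/raw/vehicle_events")

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()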

Environment: Azure, Databricks, Spark (Java), Spark (Scala), HBase, Spark, K-Means, Scala, Oozie, Bitbucket, GitHub

Project 2:

Used the job-management scheduler Apache Oozie to execute workflows.
Used Ambari to monitor node health and the status of jobs in Hadoop clusters.
Designed and implemented data warehouses and data marts using Kimball methodology components such as the data warehouse bus, conformed facts and dimensions, slowly changing dimensions, surrogate keys, star schemas, and snowflake schemas (an SCD Type 2 sketch follows this list).
Worked on Tableau to build customized interactive reports, worksheets, and dashboards.
Implemented Kerberos for strong authentication to provide data security.
Implemented LDAP and Active Directory for Hadoop clusters.
Worked on Apache Solr for indexing and load-balanced querying to search for specific data in larger datasets.
Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
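Illustrative only: a minimal sketch of a Slowly Changing Dimension Type 2 refresh in PySpark, assuming hypothetical tables dim_customer (current dimension) and stg_customer (daily snapshot) keyed on customer_id, with address as the tracked attribute.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

dim = spark.table("dim_customer")   # customer_id, address, valid_from, valid_to, is_current
stg = spark.table("stg_customer")   # customer_id, address (today's snapshot)
today = F.current_date()

# customers whose tracked attribute changed versus their current dimension row
snap = stg.select("customer_id", F.col("address").alias("new_address"))
changed = (dim.filter("is_current = 1")
              .join(snap, "customer_id")
              .filter("address <> new_address")
              .select("customer_id"))

# current rows of changed customers get closed out ...
cur_changed = dim.filter("is_current = 1").join(changed, "customer_id", "left_semi")
closed = cur_changed.withColumn("valid_to", today).withColumn("is_current", F.lit(0))

# ... every other existing row is carried over untouched ...
untouched = dim.exceptAll(cur_changed)

# ... and a fresh current row is opened with the new attribute values
opened = (stg.join(changed, "customer_id", "left_semi")
             .withColumn("valid_from", today)
             .withColumn("valid_to", F.lit(None).cast("date"))
             .withColumn("is_current", F.lit(1)))

# write to a new table rather than overwriting the table being read in the same job
untouched.unionByName(closed).unionByName(opened) \
         .write.mode("overwrite").saveAsTable("dim_customer_v2")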
Environment: AWS S3, EMR, Lambda, CloudWatch, Amazon Redshift, Spark (Java), Spark (Scala), Athena, Hive, HDFS, Spark, Scala, Oozie, Bitbucket, GitHub.

Farmers Insurance Oct 2013 - Dec 2015
Project: FNWL, CA
Role: Hadoop Consultant
Projects: Farmers FNWL, Liberty Mutual
Responsibilities:
Understood the requirements and prepared the architecture document for the Big Data project.
Worked with the Hortonworks distribution.
Supported MapReduce Java programs running on the cluster.
Optimized Amazon Redshift clusters, Apache Hadoop clusters, data distribution, and data processing.
Developed MapReduce programs to process Avro files and produce results by performing calculations on the data, and also performed map-side joins.
Imported bulk data into HBase using MapReduce programs.
Used the REST API to access HBase data for analytics.
Designed and implemented incremental imports into Hive tables.
Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
Imported and exported data from different relational data sources like DB2, SQL Server, and Teradata to HDFS using Sqoop.
Migrated complex map reduce programs into in memory Spark processing using Transformations and actions.
Worked on a POC for IoT device data with Spark.
Used Scala to store streaming data in HDFS and implemented Spark for faster data processing.
Worked on creating RDDs and DataFrames for the required input data and performed the data transformations using Spark with Python (a sketch follows this list).
Involved in developing Spark SQL queries and DataFrames: importing data from sources, performing transformations and read/write operations, and saving results to output directories in HDFS.
Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
Developed Pig scripts for the analysis of semi-structured data.
Developed Pig UDFs for manipulating data according to business requirements and also worked on developing custom Pig loaders.
Worked on Oozie workflow engine for job scheduling.
Developed Oozie workflow for scheduling and orchestrating the ETL process.
Experienced in managing and reviewing the Hadoop log files using Shell scripts.
Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
Worked on different file formats like Sequence files, XML files and Map files using MapReduce Programs.
Worked with Avro Data Serialization system to work with JSON data formats.
Used AWS S3 to store large amounts of data in a single repository.
Involved in building applications using Maven and integrating with continuous integration servers like Jenkins to build jobs.
Used the Enterprise Data Warehouse database to store information and make it accessible across the organization.
Responsible for preparing technical specifications, analyzing functional Specs, development and maintenance of code.
Worked with the Data Science team to gather requirements for various data mining projects
Wrote shell scripts to automate rolling day-to-day processes.
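Illustrative only: a minimal sketch of that RDD-to-DataFrame flow in PySpark; the input path, delimiter, column names, and output location are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-transform").getOrCreate()
sc = spark.sparkContext

# load a raw delimited feed from HDFS as an RDD and split it into fields
raw = sc.textFile("hdfs:///data/raw/claims/*.txt")
rows = (raw.map(lambda line: line.split("|"))
           .filter(lambda f: len(f) == 3)                 # drop malformed records
           .map(lambda f: (f[0], f[1], float(f[2]))))     # claim_id, state, amount

# promote the RDD to a DataFrame for SQL-style transformations
df = spark.createDataFrame(rows, ["claim_id", "state", "amount"])

# example transformation: total claim amount per state
summary = df.groupBy("state").agg(F.sum("amount").alias("total_amount"))

# persist the results back to HDFS
summary.write.mode("overwrite").parquet("hdfs:///data/processed/claims_by_state")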

JPMorgan Chase, Columbus, OH May 2011 - Oct 2013
Role: Sr. ETL Consultant
Environment: Hadoop, Informatica 9.01, Teradata 13, Ab Initio, UNIX, DB2.

Responsibilities:
Gathered requirements for the RDM project, which involves implementing EDW data quality fixes and the Retail data mart.
Prepared functional and technical specification design documents for building the Member Data Mart according to the ICDW Banking Model.
Responsible for data gathering from multiple sources such as Teradata and Oracle.
Created Hive tables to store the processed results in a tabular format.
Wrote MapReduce jobs in Java to process the log data.
Implemented external and managed tables using Hive.
Worked with the Teradata analysis team, using Big Data technologies, to gather the business requirements.
Fixed erroneous data as part of the data reconciliation process.
Used partitioning and bucketing concepts for performance optimization in Hive (a sketch follows this list).
Responsible for delivering the Informatica artifacts for the mart-specific semantic layer for subject areas such as Reference, Third Party, Involved Party, Event, and Customer.
Reviewed deliverables and ensured code quality before delivery to the client through code review and testing.
Involved in implementing Kimball's methodology, OLAP, SCDs (Type 1, Type 2, and Type 3), star schema, and snowflake schema.
Prepared and implemented automated UNIX scripts to execute the end-to-end history load process.
Prepared the job execution tool (Tivoli) design to run the Membership Reporting Data Mart in the production environment.
Managed versioning of mappings, scripts, and documents in a version control tool (SCM).
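Illustrative only: a minimal sketch of the partitioning and bucketing layout, expressed with the PySpark writer API to keep the examples in one language (the equivalent HiveQL DDL uses PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS); the database, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

events = spark.table("staging.customer_events_raw")

# partition by date so queries prune to the days they touch, and bucket by the
# join key so joins and aggregations on customer_id avoid full shuffles
(events.write
       .partitionBy("event_date")
       .bucketBy(32, "customer_id")
       .sortBy("customer_id")
       .mode("overwrite")
       .format("parquet")
       .saveAsTable("mart.customer_events"))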