
Amit Khanna - Senior Data Engineer
[email protected]
Location: Reston, Virginia, USA
Relocation: Any
Visa: H1B
Amit Khanna
Lead Data Engineer
________________________________________
Professional Summary
10+ years of experience in Data Analysis, Data Science, and Data Mining with large datasets of structured and unstructured data, including Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
Experience in coding SQL/PL SQL using Procedures, Triggers, and Packages.
Extensive experience in Text Analytics, developing statistical, machine learning, and data mining solutions to various business problems, and generating data visualizations using R and Python.
Excellent Knowledge of Relational Database Design, Data Warehouse/OLAP concepts, and methodologies.
Data-driven and highly analytical, with working knowledge of statistical modeling approaches and methodologies (clustering, regression analysis, hypothesis testing, decision trees, machine learning), business rules, and an ever-evolving regulatory environment.
Professional working experience in Machine Learning algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, K-Means Clustering and Association Rules.
Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
Experience with data visualization using tools like ggplot2, Matplotlib, Seaborn, and Tableau, and using Tableau to publish and present dashboards and storylines on web and desktop platforms.
Experienced in Python data manipulation for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
Worked on creating access control privileges for Snowpipe.
Automated continuous data loading using cloud messaging.
Well experienced in Normalization, De-Normalization and Standardization techniques for optimal performance in relational and dimensional database environments.
Worked on Time Travel in Snowflake and performed various operations on Snowflake tables using Time Travel.
Experience in multiple software tools and languages to provide data-driven analytical solutions to decision makers or research teams.
Worked on setting up DATA_RETENTION_TIME_IN_DAYS in Snowflake (a brief sketch follows this summary).
Extracted data from SageMaker to Snowflake using Snowpark.
Worked on different types of Lifecycle Configurations in SageMaker.
Automated the pipenv setup in SageMaker through an AWS S3 bucket.
Created SageMaker notebook instances using CloudFormation templates.
Automated the AWS consumer using CloudFormation templates.
Worked on Snowflake cloning of data from disparate sources.
In-depth knowledge of AWS cloud services such as compute, network, storage, and Identity & Access Management (IAM).
Hands-on experience in configuring network architecture on AWS with VPCs, subnets, Internet Gateways, NAT, and route tables.
Experience in the configuration, deployment, and support of cloud services, including Amazon Web Services (AWS).
Performed load disaggregation using microscopic power data and pattern recognition.
Applied machine learning to time-series data to perform load disaggregation.
Familiar with predictive models using numeric and classification prediction algorithms like support vector machines and neural networks, and ensemble methods like bagging, boosting, and random forests to improve the efficiency of predictive models.
Worked on Text Mining and Sentiment Analysis for extracting unstructured data from social media platforms like Facebook, Twitter, and Reddit.
Good knowledge of NoSQL databases like MongoDB and HBase.
Extensive experience in designing and implementation of continuous integration, continuous delivery, continuous deployment through Jenkins.
Extensively worked on the Erwin tool with features like reverse engineering, forward engineering, subject areas, domains, naming standards documents, etc.
Experience in using various packages in R and Python like ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, and rpy2.
Excellent experience and knowledge of Machine Learning, Mathematical Modeling, and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
Experienced with Integration Services (SSIS), Reporting Services (SSRS), and Analysis Services (SSAS).
Developed, maintained, and taught new tools and methodologies related to data science and high-performance computing.
Extensive hands-on experience and high proficiency with structured, semi-structured, and unstructured data, using a broad range of data science programming languages and big data tools including R, Python, Spark, SQL, scikit-learn, and Hadoop MapReduce.
Technical proficiency in designing and data modeling for online applications; solution lead for architecting Data Warehouse/Business Intelligence applications.
Experienced in Cluster Analysis, Principal Component Analysis (PCA), Association Rules, and Recommender Systems.
Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
Adept in statistical programming languages like R and Python including Big Data technologies like Hadoop, Hive.
Hands-on experience with RStudio for doing data pre-processing and building machine learning algorithms on different datasets.
Collaborated with the lead Data Architect to model the Data warehouse in accordance with FSLDM subject areas, 3NF format, and Snowflake schema.
Worked and extracted data from various database sources like Oracle, SQL Server, and DB2.
Implemented machine learning algorithms on large datasets to understand hidden patterns and capture insights.
Predictive Modelling Algorithms: Logistic Regression, Linear Regression, Decision Trees, K-Nearest Neighbors, Bootstrap Aggregation (Bagging), Naive Bayes Classifier, Random Forests, Boosting, Support Vector Machines.
Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6, Ubuntu 13/14, and Cosmos.
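
A minimal illustrative sketch of the Snowflake data-retention and Time Travel work referenced above, using the Snowflake Python connector; the account, credentials, warehouse, and table names (ORDERS, ORDERS_RESTORED) are placeholders, not details from any actual engagement.

import snowflake.connector

# Placeholder connection details; real values would come from a secrets store.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)
cur = conn.cursor()
try:
    # Set the Time Travel retention window on a table (in days).
    cur.execute("ALTER TABLE ORDERS SET DATA_RETENTION_TIME_IN_DAYS = 7")
    # Query the table as it existed one hour ago via Time Travel.
    cur.execute("SELECT COUNT(*) FROM ORDERS AT(OFFSET => -3600)")
    print("Row count one hour ago:", cur.fetchone()[0])
    # Zero-copy clone of the table as of that same point in time.
    cur.execute("CREATE TABLE ORDERS_RESTORED CLONE ORDERS AT(OFFSET => -3600)")
finally:
    cur.close()
    conn.close()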




Technical Skills:

Languages: Python, R
Machine Learning: Regression, Polynomial Regression, Random Forest, Logistic Regression, Decision Trees, Classification, Clustering, Association, Simple/Multiple Linear Regression, Kernel SVM, K-Nearest Neighbors (K-NN)
OLAP/BI/ETL Tools: Business Objects 6.1/XI, MS SQL Server 2008/2005 Analysis Services (MS OLAP, SSAS), Integration Services (SSIS), Reporting Services (SSRS), Performance Point Server (PPS), Oracle 9i OLAP, MS Office Web Components (OWC11), DTS, MDX, Crystal Reports 10, Crystal Enterprise 10 (CMC)
Web Technologies: JDBC, HTML5, DHTML, XML, CSS3, Web Services, WSDL
Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner
Big Data Technologies: Spark, Hive, HDFS, MapReduce, Pig, Kafka
Databases: SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra, SAP HANA; query layers: SQL, Hive, Impala, Pig, Spark SQL
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0
Version Control Tools: SVN, GitHub
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse
Operating Systems: Windows, Linux, Unix, macOS, Red Hat


Professional Experience:

Allstate, Northfield Township, IL Dec 2021 - Present
Lead Data Engineer
Responsibilities:
Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications, and executed machine learning use cases with Spark ML and MLlib (a brief sketch follows this role's entry).
Identified areas of improvement in existing business by unearthing insights by analyzing vast amounts of data using machine learning techniques.
The application was based on a service-oriented architecture and used Python 2.7, Django 1.5, JSF 2, Spring 2, Ajax, HTML, and CSS for the frontend.
Interpreted problems and provided solutions to business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
Designed and developed NLP models for sentiment analysis.
Developed mapping parameters and variables to support SQL override.
Created mapplets to use them in different mappings.
Developed mappings to load into staging tables and then to Dimensions and Facts.
Used existing ETL standards to develop these mappings.
Worked on machine learning on large size data using Spark and MapReduce.
Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.
Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
Developed and implemented an R/Shiny application that showcases machine learning for business forecasting.
Worked with the Caffe deep learning framework.
Worked on R packages to interface with the Caffe deep learning framework.
Generated graphs and reports using the ggplot2 package in RStudio for analytical models.
Worked on data wrangling libraries such as dplyr, tidyr, and plyr for data munging and data analysis.
Extracted, transformed, and loaded data sources to generate CSV data files with Python programming and SQL queries.
Expertise in creating HDInsight clusters and Storage Accounts with an end-to-end environment for running jobs.
Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.
Stored and retrieved data from data-warehouses using Amazon Redshift.
Worked on Teradata queries, Teradata indexes, and utilities such as MultiLoad (MLOAD), TPump, FastLoad, and FastExport.
Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.
Created ETL scripts for ad-hoc requests to retrieve data from analytics sites.
Created ETL scripts to retrieve data feeds and page metrics from Google Analytics services (for the Star Wars site).
Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
Used pandas, NumPy, Seaborn, SciPy, matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.
Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
Worked with Data Architects and IT Architects to understand the movement and storage of data, using ER/Studio 9.7.
Participated in all phases of data mining; data collection, data cleaning, developing models, validation, visualization and performed Gap analysis.
Performed data manipulation and aggregation from different sources using Nexus, Toad, Business Objects, Powerball, and Smart View.
Implemented Agile methodology for building an internal application.
Good knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary NameNode, and MapReduce concepts.
Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, VBA, SAS, MATLAB, AWS, SPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML.
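
A minimal illustrative sketch of a Spark ML classification pipeline of the kind referenced above; the feature columns, labels, and sample rows are made-up placeholders, not actual project data.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("claims-classifier").getOrCreate()

# Tiny illustrative dataset: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 20.0, 0), (2.0, 35.0, 0), (8.0, 70.0, 1), (9.0, 80.0, 1)],
    ["feature_a", "feature_b", "label"],
)

# Assemble features and fit a logistic regression model in one pipeline.
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)

predictions = model.transform(df)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print("Training AUC:", auc)

spark.stop()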

Chubb Group, Warren, NJ Jun 2019 - Dec 2021
Sr. Data Engineer
Responsibilities:
Setup storage and data analysis tools in Amazon Web services cloud computing infrastructure.
Conducted research on the development and design of sampling methodologies and analyzed data for pricing of the client's products.
Developed and implemented an R/Shiny application that showcases machine learning for business forecasting.
Used pandas, NumPy, Seaborn, SciPy, matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.
Installed and used the Caffe deep learning framework.
Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
Worked with Data Architects and IT Architects to understand the movement and storage of data, using ER/Studio 9.7.
Participated in all phases of data mining; data collection, data cleaning, developing models, validation, visualization and performed Gap analysis.
Hosted standalone applications on a webpage, embedded interactive charts in R Markdown documents, and built dashboards.
Built interactive web applications that can execute R code on the backend.
Automated resulting scripts using Apache Airflow and shell scripting to ensure daily execution in production.
Configured Apache Airflow to connect to the S3 bucket and the Snowflake data warehouse and created DAGs to run in Airflow.
Worked on both upstream and downstream jobs in Airflow.
Worked on XComs to let tasks communicate with each other in Airflow.
Performed data manipulation and aggregation from different sources using Nexus, Toad, Business Objects, Powerball, and Smart View.
Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary NameNode, and Map Reduce concepts.
Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications, and executed machine learning use cases with Spark ML and MLlib.
Implemented data ingestion and cluster handling for real-time processing using Apache Kafka.
Programmed a utility in Python that used multiple packages (SciPy, NumPy, pandas).
Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
Loaded bulk data using the COPY command.
Performed data migration from AWS S3 to Snowflake using COPY and performed data validation (a brief sketch follows this role's entry).
Used Streamlit, which enables developers to quickly and easily write, share, and deploy data applications.
Used the Snowpark API to work with Python APIs, streamlined the data using pandas, and performed various data wrangling operations.
Performed production-level tests on Snowflake using dbt.
Worked on various data clean room features, including Secure Data Sharing, row access policies, stored procedures, and streams and tasks.
Performed Data Transformation using Snowflake and AWS Glue.
Automated the pipeline on Snowflake using tags.
Performed Snowflake performance tuning using table scans, column optimization, and latency analysis.
Worked on transient tables and permanent tables and performed extensive Time Travel operations.
Implemented tag-based masking policy conditions written to protect column data based on the policy assigned to the tag.
Worked on incremental data changes using dbt and performed various dbt operations.
Worked through various snapshot challenges, including detecting row changes.
Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage.
Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we would be able to assign each document a response label for further classification.
Configured AWS CloudWatch to monitor AWS resources, including creating customized AWS scripts to monitor various application, system, and instance metrics.
Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
Handled importing data from various data sources, performed transformations using Hive, Map Reduce, and loaded data into HDFS.
Interacted with Business Analysts, SMEs, and other Data Architects to understand business needs and functionality for various project solutions.
Researched, evaluated, architected, and deployed new tools, frameworks, and patterns to build sustainable Big Data platforms for the clients.
Involved in Continuous Integration (CI) and Continuous Delivery (CD) process implementation using Jenkins along with shell scripts.
Identified and executed process improvements; hands-on in various technologies such as Oracle, Informatica, and Business Objects.
Designed both 3NF data models for ODS, OLTP systems and dimensional data models using Star and Snowflake Schemas.
Used Scala collection framework to store and process the complex consumer information.
Used Scala functional programming concepts to develop business logic.
Developed programs in Java and Scala/Spark for data reformatting after extraction from HDFS for analysis.
Developed Spark scripts by using Scala shell commands as per the requirement.
Processed schema-oriented and non-schema-oriented data using Scala and Spark.
Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
Provided architecture and design as the product was migrated to Scala, the Play framework, and Sencha UI.
Implemented applications with Scala along with Akka and Play framework.
Expert in implementing advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
Auction web app: calculated bids for energy auctions utilizing Scala, JPA, and Oracle.
Built Kafka-Spark-Cassandra Scala simulator for MetiStream, a big data consultancy; Kafka-Spark-Cassandra prototypes.
Developed a RESTful API using Scala for tracking open-source projects in GitHub and computing the in-process metrics information for those projects.
Direct connectivity from Power BI or Azure Analysis Services is not available; Power BI Dataflows can connect with Azure Data Lake Storage Gen2, which is integrated with Azure Data Lake Analytics (U-SQL).
Used ADLS Gen2; when connecting through the DFS endpoint, this is a metadata-only operation, which results in significantly improved performance for the data load, particularly at higher data volumes.
Used Groovy as a scripting language inside an application for building scripts.
Implemented the MuleSoft integration platform for connecting SaaS and enterprise applications in the cloud.
Designed and developed a decision tree application using Neo4J graph database to model the nodes and relationships for each decision
Worked with NoSQL databases like MongoDB to save and retrieve data.
Implemented Neo4j to integrate graph databases with relational databases and to efficiently store, handle, and query highly connected elements in the data model, especially for object-oriented and relational developers.


Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, ScalaNLP, Spark, Kafka, MongoDB, logistic regression, Apache, Jenkins, Hadoop, PySpark, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, Airflow, AWS, Snowflake, and the Snowpark API.
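
A minimal illustrative sketch of the Airflow-driven S3-to-Snowflake load described above; the DAG id, external stage (MY_S3_STAGE), target table (RAW_EVENTS), and credentials are placeholders, and a real pipeline might use the Snowflake provider operators instead of a plain PythonOperator.

from datetime import datetime

import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def copy_s3_to_snowflake(**context):
    # Placeholder credentials; in practice these would come from an Airflow
    # connection or a secrets backend.
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="my_password",
        warehouse="LOAD_WH", database="RAW_DB", schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        # MY_S3_STAGE is assumed to be an external stage pointing at the bucket.
        cur.execute(
            "COPY INTO RAW_EVENTS FROM @MY_S3_STAGE "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
        rows_loaded = cur.rowcount
    finally:
        conn.close()
    # The return value is pushed to XCom so a downstream task can read it.
    return rows_loaded


def validate_load(**context):
    rows_loaded = context["ti"].xcom_pull(task_ids="copy_s3_to_snowflake")
    if not rows_loaded:
        raise ValueError("No rows loaded from S3 into Snowflake")


with DAG(
    dag_id="s3_to_snowflake_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    copy_task = PythonOperator(
        task_id="copy_s3_to_snowflake",
        python_callable=copy_s3_to_snowflake,
    )
    validate_task = PythonOperator(
        task_id="validate_load",
        python_callable=validate_load,
    )
    copy_task >> validate_task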

Nationwide Mutual Insurance Company, Columbus, OH Jan 2013 - May 2019
Sr. Data Engineer
Responsibilities:
Utilized Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS, and Python, along with a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
Worked on automating, configuring, and deploying instances in AWS cloud environments and data centers; also familiar with EC2, CloudWatch, Elastic IPs, and managing security groups on AWS.
Utilized the engine to increase user lifetime by 45% and triple user conversions for target categories.
Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
Used version control tools like Git 2.x and build tools like Apache Maven/Ant.
Worked on analyzing data from Google Analytics, AdWords, Facebook, etc.
Evaluated models using cross-validation, the log loss function, and ROC curves, used AUC for feature selection, and worked with Elastic technologies like Elasticsearch and Kibana.
Performed Data Profiling to learn about behavior with various features such as traffic pattern, location, Date and Time etc.
Categorized comments into positive and negative clusters from different social networking sites using Sentiment Analysis and Text Analytics
Used Python scripts to update content in the database and manipulate files
Skilled in using dplyr and pandas in R and Python for performing exploratory data analysis.
Performed multinomial logistic regression, decision tree, random forest, and SVM modeling to classify whether a package would be delivered on time for the new route.
Implemented the Web service client for login verification, credit reports and applicant information using Apache Axis 2 web service
Used Jenkins for Continuous Integration Builds and deployments (CI/CD).
Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
Performed data cleaning and feature selection using MLlib package in PySpark and working with deep learning frameworks such as Caffe, Neon.
Developed Spark/Scala, R, and Python code for regular expression (regex) projects in the Hadoop/Hive environment with Linux/Windows for big data resources.
Used the K-Means clustering technique to identify outliers and to classify unlabeled data (a brief sketch follows this role's entry).
Tracked operations using sensors until certain criteria were met using Airflow.
Responsible for different data mapping activities from source systems to Teradata using utilities like TPump, FastExport (FEXP), BTEQ, MultiLoad (MLOAD), and FastLoad (FLOAD).
Used Groovy as a scripting language inside an application, for building the scripts.
Implemented MuleSoft integration platform for connecting SaaS and enterprise applications in the cloud.
Designed and developed a decision tree application using Neo4J graph database to model the nodes and relationships for each decision.


Environment: R, Python, HDFS, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Apache, Jenkins, Metadata, MS Excel, Mainframes, MS Visio, MapReduce, Rational Rose, SQL, and MongoDB.
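
A minimal illustrative sketch of K-Means-based outlier identification of the kind referenced above; the synthetic data, cluster count, and distance threshold are assumptions for demonstration only.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Two well-separated clusters plus a few far-away points acting as outliers.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
cluster_b = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
outliers = np.array([[25.0, 25.0], [-20.0, 15.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned centroid; points far beyond the
# typical distance are flagged as outliers.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = distances.mean() + 3 * distances.std()
outlier_indices = np.where(distances > threshold)[0]
print("Flagged outlier rows:", outlier_indices)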
