
Saketh Pilli
Sr. Data Scientist
[email protected]
Location: Chicago, Illinois, USA
Relocation: Open
Visa: H1B

PROFESSIONAL SUMMARY

10+ years of overall IT experience, including 8+ years interpreting and analyzing data to drive business solutions.
Hands-on experience developing predictive models across domains such as telecom, healthcare, retail, and marketing, as well as leading teams of data scientists.
Proficient in identifying trends and discovering insights from high-dimensional data sets using a variety of supervised and unsupervised algorithms.
Worked extensively with the Python and R stacks, including libraries such as scikit-learn, pandas, NumPy, SciPy, dplyr, ggplot2, seaborn, and matplotlib.
Deep Learning experience using libraries such as Theano, TensorFlow and Keras.
Experience deploying models using Azure ML Studio, Azure Databricks, AWS SageMaker, and GCP Vertex AI; adept with Spark and Hadoop MapReduce.
Knowledge and experience extracting information from text data using Natural Language Processing (NLP) methods such as Bag of Words, Sentiment Analysis, Topic Modeling with LDA, and TF-IDF.
Experience in data collection using Scrapy, querying with SQL Server, and building dashboards in Tableau.
Strong knowledge of experimental design and various statistical analyses such as hypothesis testing and A/B tests, as well as discovering product insights using funnel analysis and segmentation.
More than one year of university-level teaching experience in Python programming.
Able to communicate results clearly to technical as well as non-technical audiences.
Strong ability to wear multiple hats to complete the tasks at hand.

TECHNICAL SKILLS

Programming Languages: Python, R, SAS, C, basic Java
Machine Learning: Regression, Classification, Decision Trees, Random Forests, Boosting (XGBoost), SVM, k-means, k-modes, k-prototype, Recommender Systems, Neural Networks (CNN, RNN), Regularization (L1, L2), Naïve Bayes, PCA
Deep Learning: TensorFlow, Keras, Theano
Natural Language Processing: NLTK, word2vec
Cloud Technologies: Spark, Azure ML Studio, AWS EC2, S3, Hive, Hadoop MapReduce
Statistics & Data Mining: Hypothesis Testing, Time Series Analysis, A/B Testing, Association Rules
Analytical Packages: NumPy, SciPy, scikit-learn, pandas, dplyr
Data Visualization Tools: Matplotlib, Seaborn, ggplot2
Databases & Dashboards: SQL Server, MySQL, MongoDB, Tableau
Others: Mixpanel, SPSS, R Shiny, MS Excel, KNIME

EDUCATION

Master's in Data Science, Indiana University Bloomington, Indiana, United States
Coursework: Statistics (Frequentist, Bayesian), Machine Learning, Applied Machine Learning, Data Visualization, Big Data, Time Series, Categorical Data Analysis, Database Design
Bachelor's in Engineering, Birla Institute of Technology & Science (BITS Pilani), India


PROFESSIONAL EXPERIENCE
Magna, Feb 2023 – Present
Role: Data Scientist/Data Engineer
Project: Cosma Sustainability Process
Responsibilities:
Gathering data from sensors that monitor part production and energy, water, air, and gas consumption in manufacturing plants across 10 countries and 25 divisions, and writing the data to AWS Timestream and S3 using AWS Lambda.
Building AWS Greengrass components to interface with the various devices in the manufacturing plant as a part of the modernization process.
Developed dashboards in Tableau to provide visibility into per-part utility consumption, helping division managers monitor trends and identify utility leaks.
Built an anomaly detection solution that identifies utility leaks and higher-than-normal consumption values and alerts the divisions to abnormal equipment readings using AWS SNS.
Currently working on automating the process using CDK for Terraform with GitHub Actions, covering connection to IoT devices, provisioning of clusters, and writing time-series data into AWS Timestream.
Using ChatGPT to generate ideas and summaries (anomaly detection in industry), regexes (data cleaning), simple SQL queries (data analysis), and boilerplate code to reduce development time.
Supporting ad-hoc requests from other teams regarding data science solution monitoring, data quality issues, and data pipelines.
Helping the team with automatic GMAW (Gas Metal Arc Welding) sectional analysis, which involves building CNNs (Convolutional Neural Networks) to segment images into individual components; the images are then reassembled to identify key points that serve as references for weld measurements.
Currently working on a binary classification model (scikit-learn) to identify patterns that determine whether a process produces a good or a faulty part (a minimal sketch follows this list).
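A minimal sketch of the kind of good-vs-faulty part classifier described above, assuming tabular process features; the data and feature construction here are illustrative stand-ins, not the actual plant data:

```python
# Sketch: good-vs-faulty part classifier on tabular process features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in for real process data: e.g., energy, water, air/gas consumption per part.
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # good (1) vs faulty (0)
```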
Dell, Aug 2022 – Feb 2023
Role: Lead Data Scientist/MLOps
Project: Product Authoring Recommendations

Responsibilities:
Involved in gathering data from APIs to create data sets with historical information on configurations sold through offline and online channels in various countries.
Responsible for data extraction and ingestion from different data sources into a common database by creating pipelines using Python, PySpark, Hive/HDFS, and SQL Server; the resulting data set is consumed by data scientists for product recommendations.
Cleaned the data and identified features relevant to the candidate machine learning models. Potential approaches included framing the problem as a supervised classification task and moving to a hybrid recommender system based on the results of the initial approach.
Experimented with approaches such as multi-label classification, association rule mining, and similarity-based methods to understand the interactions between the components of a configuration.
Built APIs (FastAPI) and integrated them with Dash (for data visualization) to load the ML model and generate recommendations on user request for the POC (a sketch follows this list).
Converted Jupyter notebooks to pipelines using Kubeflow Kale and performed hyperparameter tuning with Kubeflow Katib.
Selected the best ML model using Kubeflow Kale and served it using KFServing.
Deployed the APIs to production using GKE (Google Kubernetes Engine) and performed model monitoring using DKube Monitor
Led a team of 2 data scientists through iterations of the data science workflow, helping generate ideas, clean up code, and optimize code to move the project forward.
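A hedged sketch of how the POC recommendation endpoint above might look in FastAPI; the artifact path, the model.recommend interface, and the request/response shapes are assumptions, not the actual service:

```python
# Sketch: serving a pickled recommender behind a FastAPI endpoint.
import pickle
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("recommender.pkl", "rb") as f:  # hypothetical model artifact
    model = pickle.load(f)

class ConfigRequest(BaseModel):
    components: List[str]  # components already in the configuration

@app.post("/recommend")
def recommend(req: ConfigRequest):
    # model.recommend(...) is a hypothetical interface returning ranked components.
    return {"recommendations": model.recommend(req.components, top_k=5)}
```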
GAP (contracted through DPP Tech), Mar 2021 – Aug 2022
Role: Machine Learning Engineer/ML Ops
Project: Size and Pack Model Building and Operationalization (Batch and Real-time)

Responsibilities:
Involved in identifying data drift and concept drift for the clustering and random forest models already built by the Size and Pack team, which included building custom metrics to track drift (e.g., Wasserstein distance; a sketch follows this list).
Built custom data quality reports by integrating Great Expectations package with the ML development pipeline to provide good quality data for the ML model to consume.
Facilitated easy model management for the data scientists through integration of ML Flow and KubeFlow in the workspace. This involved writing scripts that the data scientists can place in their code to enable tracking and reproducibility.
Refactored code and operationalized machine learning models built by data scientists in Azure Databricks; this was achieved by chaining notebooks within the Azure Databricks environment.
Successfully delivered a POC (proof of concept) to test whether third-party MLOps tools such as Neptune AI, DVC, and Argo CI/CD could be used in the current environment.
Wrote data transformations using PySpark.
Built CI/CD pipelines in Codefresh, deployed a Docker image of the model to a Kubernetes cluster, and tested the effectiveness of the test pipeline; orchestration was performed through the Kubeflow Pipelines UI, with experiments run in Airflow.
Built a dummy ML model with AutoML to simulate the actual model and wrote logic to obtain real-time and near-real-time size profiles using structured streaming (Kafka), enabling faster end-to-end testing of the data pipelines.
Collaborated with data engineers to understand the current data pipeline and made changes to fit the current use case by the use of structured streaming.
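A minimal sketch of a custom drift metric like the one mentioned above, using SciPy's Wasserstein distance between a reference window and a live window; the alert threshold is an assumed placeholder to be tuned per feature:

```python
# Sketch: flagging feature drift with the first Wasserstein distance.
import numpy as np
from scipy.stats import wasserstein_distance

def drift_score(reference: np.ndarray, current: np.ndarray) -> float:
    """Wasserstein distance between two empirical feature distributions."""
    return wasserstein_distance(reference, current)

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, size=5000)  # training-time feature values
current = rng.normal(loc=0.3, size=5000)    # shifted production values

score = drift_score(reference, current)
print(f"drift score: {score:.3f}", "DRIFT" if score > 0.2 else "ok")  # 0.2 is illustrative
```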

American Express, Phoenix, Mar 2019 – Mar 2021
Role: Data Scientist/Senior Data Scientist
Project: Hierarchy Affiliation for Hospitality Clients

Identified and correctly affiliated impacted merchants to their corresponding parent top of chain (TOC) to help automate hierarchy reviews. This identifies out-of-hierarchy locations and ties them back to the TOC, which improves merchant loyalty (all locations receive American Express rewards) and saves costs (locations that do not belong are removed).
Tested multiple machine learning solutions, including unsupervised clustering, One-Class SVM, and Isolation Forests, to identify in-hierarchy and out-of-hierarchy locations.
Helped generate a heuristic using text similarity approaches (Levenshtein, Jaro-Winkler, Jaccard, and Soft Jaccard) to measure name, brand, parent-name, industry, and active-location similarity, along with proximity (distance based on geocodes), to identify potential parents for a given brand location (a sketch follows this list).
Brands this project helped included Hilton Hotels, Wyndham Hotels, and fast-food chains such as Subway, McDonald's, and Jamba Juice.
Helped automate the reviews, saving man-hours and improving on the previous manual approach.
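A simplified sketch of the name-similarity piece of the heuristic above, using token-level Jaccard plus difflib's SequenceMatcher ratio as a stand-in for the Levenshtein/Jaro-Winkler measures; the blend weights are illustrative:

```python
# Sketch: blended merchant-name similarity for hierarchy affiliation.
from difflib import SequenceMatcher

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two names."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def name_similarity(a: str, b: str) -> float:
    # Blend token overlap with character-level similarity (weights assumed).
    char_sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.5 * jaccard(a, b) + 0.5 * char_sim

print(name_similarity("Hilton Garden Inn Chicago",
                      "Hilton Garden Inn - Chicago Downtown"))
```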

Project: Merchant Call Prediction
Responsibilities:
Built a classification model (XGBoost) to reduce the number of calls made to the call center for servicing customer complaints (a sketch follows this list).
The idea is to predict which category a user is most likely to call about and service the user through chat, saving costs.
Data was collected from the Adobe Analytics application, which tracks user visits to the website and modules such as Disputes and Chargebacks.
Involved in data cleaning, feature engineering, model selection, model building and model deployment.
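A minimal sketch of a call-category classifier of the kind described above, using the XGBoost scikit-learn API; the synthetic features and category count are stand-ins for the Adobe Analytics clickstream data:

```python
# Sketch: multi-class XGBoost model predicting likely call category.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))    # stand-in for clickstream/visit features
y = rng.integers(0, 4, size=2000)  # call categories, e.g. disputes, chargebacks, ...

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```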

Project: Supplier Cancellation
Responsibilities:
Involved in the end-to-end delivery of the use case, from data gathering, problem formulation, data profiling, data analysis, and data modeling to software refactoring, pipeline testing, and deployment of the machine learning solution.
The goal of the use case is to identify suppliers at high risk of cancellation and bucket them based on risk scores. The solution is used by Client Level Managers to monitor activity and prevent churn.
Generated features from scratch from internal data warehouses using PySpark (window functions and built-in functions).
Built a baseline model (logistic regression, binary classification) and used other approaches (XGBoost, LightGBM) to improve on it; the XGBoost model achieved ~75% recall and was chosen as the deployment solution.
For interpretation, the SHAP (SHapley Additive exPlanations) package was used to identify the top features contributing to cancellation (a sketch follows this list).
Modularized, tested, and packaged the machine learning model using AWS SageMaker and deployed it in the Amex environment.
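A minimal sketch of the SHAP interpretation step, shown with a toy XGBoost model; the feature names and data are illustrative, not the actual supplier features:

```python
# Sketch: SHAP attributions for a tree-based cancellation-risk model.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(500, 5)),
                 columns=[f"feature_{i}" for i in range(5)])  # illustrative names
y = (X["feature_0"] > 0).astype(int)  # stand-in cancellation label

model = XGBClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # ranks features driving cancellation risk
```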

Project: Reject Payable Prediction
Applied traditional forecasting techniques such as Moving Average, Exponential Smoothing, and SARIMAX (seasonal ARIMA) to forecast reject payable amounts over the next 7 days (a sketch follows this list).
Built a baseline model using Prophet (fbprophet) and compared it with the traditional forecasting approaches.
Added exogenous variables that impact the forecasts and built an LSTM (time series) in Keras to generate multi-feature, multi-step forecasts using both window-based and sequence-based approaches.
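A minimal sketch of a 7-day SARIMAX forecast using statsmodels; the series is synthetic with weekly seasonality, and the (p, d, q) and seasonal orders are placeholders rather than the tuned values:

```python
# Sketch: 7-day-ahead SARIMAX forecast of a daily amount series.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
idx = pd.date_range("2020-01-01", periods=200, freq="D")
y = pd.Series(100 + 10 * np.sin(np.arange(200) * 2 * np.pi / 7)
              + rng.normal(scale=2, size=200), index=idx)  # weekly pattern + noise

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 1, 7))
fit = model.fit(disp=False)
print(fit.forecast(steps=7))  # next 7 days of forecast values
```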

Project: Merchant Spend Analysis
Responsibilities:
Applied supervised regression models to predict total merchant spend and unsupervised clustering to identify high-performing merchants, with the aim of increasing the client's share of merchant wallet.
Formulated the data problem, identified data sources (D&B, iClick, internal), validated and cleaned data, extracted features, and performed data extraction in PySpark.
Built Random Forest and Decision Tree models in scikit-learn and then moved to Spark MLlib.
Ran a k-means model on the results of the random forest model to identify clusters of merchants, thereby identifying high-performing merchants for similar merchants to emulate.
Modularized, tested, and packaged the machine learning model using Maven and deployed it in the Amex environment.


Data Cabinet, California, Mar 2018 – Feb 2019
Role: Data Scientist
Project: Analysis of Hospital Visits (k-modes, scikit-learn, SQL Server)

Responsibilities:
Involved in sourcing data into Jupyter notebooks using pyodbc.
Performed basic descriptive analysis and data pre-processing, dealing with missing values, scaling, etc.
Performed unsupervised analysis using k-modes as well as k-prototype to find underlying structure in the data set.
Found the optimal k using the elbow method and silhouette scores, and examined the clusters to identify insights for the client (a sketch follows).
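A minimal sketch of choosing k via silhouette scores, shown here with k-means on synthetic data for brevity; the same procedure applies to k-modes with a categorical dissimilarity measure:

```python
# Sketch: scanning k and scoring each clustering with the silhouette.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
# Three synthetic clusters standing in for the hospital-visit features.
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher is better
```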

Project: Sentiment Analysis and Topic Modeling for Company Reviews (Scrapy, NLTK)
Responsibilities:
Performed data collection by scraping the reviews of desired companies using Scrapy.
Pre-processed the data by standardizing location information and ensured data consistency through sanity checks.
Conducted further pre-processing that included stemming, lemmatization, removing stop words etc.
Built word clouds and performed sentiment analysis using the SentimentIntensityAnalyzer class from vaderSentiment (a sketch follows this list).
Performed topic modeling using LDA (Latent Dirichlet Allocation) to automatically discover topics.
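A minimal sketch of the VADER sentiment step, using NLTK's bundled SentimentIntensityAnalyzer (the standalone vaderSentiment package exposes the same class); the review text is illustrative:

```python
# Sketch: VADER sentiment scoring for scraped company reviews.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

reviews = ["Great culture and supportive management.",
           "Long hours and poor work-life balance."]
for review in reviews:
    # compound is in [-1, 1]: negative to positive sentiment.
    print(analyzer.polarity_scores(review)["compound"], review)
```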

Environment: Scrapy, NLTK, word2vec, Python, scikit-learn, pandas, NumPy, seaborn, vaderSentiment

Indiana University, Bloomington, Jan 2016 – Dec 2017
Role: Data Modeler
Project: Audio Classification with Machine Learning (R, Python, scikit-learn, Keras)

Responsibilities:
Sourced the required WAV audio files and checked the quality of the files.
Used the warbleR package in R to prepare the audio files, extracting from each file a feature vector of measures such as duration, mean frequency, and peak frequency.
Dealt with missing values (by removing them) and exported the data for analysis in scikit-learn.
Performed AdaBoost and SVM modeling on the training set and measured performance using a confusion matrix and cross-validation scores.
Improved the algorithm's performance by implementing a multi-layer perceptron deep neural network in Keras using activation functions such as sigmoid, softmax, and ReLU (a sketch follows).
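A minimal sketch of a Keras multi-layer perceptron like the one described above; the feature count, layer sizes, and class count are placeholders, not the original architecture:

```python
# Sketch: small MLP for audio-feature classification in Keras.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # 20 stands in for the number of warbleR features (duration, mean/peak frequency, ...).
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(4, activation="softmax"),  # one unit per audio class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```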

Role: Python Developer & Business Intelligence Analyst
Responsibilities:
Involved in data collection process and maintaining consistency of data.
Created dashboards in Tableau to identify trends in student performance that help in early intervention to improve scores.

Project: IU Bus Route Optimization (R, MySQL, Google Maps, GPS Visualizer)
Responsibilities:

Sourced required data from IU Bloomington Transit service center and added additional information from other data sources such as weather data, schedule data, GPS data.
Consolidated all the data sources into a single comprehensive database for efficient data mining purposes and to be a single point of information for all transit related information.
Converted the data into JSON format and analyzed it by generating a series of graphs to understand trends in transit/dwell time across the Fall semester months.
Provided recommendations to the university which led to increased convenience for the travelers.

Environment: Python 3.5.2, RStudio, Tableau, scikit-learn, pandas, numpy, Keras, warbleR, seaborn, matplotlib, Spark, R Shiny

Saras Analytics, Hyderabad, India, Oct 2014 – Nov 2015
Role: Data Scientist
Project: Churn Prediction (Python, SQL Server, Decision Trees, matplotlib)

Responsibilities:
Involved in understanding the business objectives for the predictive model and converting the business problem into a concrete analytics solution.
Identified existing descriptive features from raw data sources such as customer demographics records, customer billing records, transactional call records, and the sales team's transactional database.
Integrated all the features relevant for prediction into a single data set and made data dictionaries for future reference.
Used SQL Server to pull data points from the sales and retention teams' databases.
Created new features such as Bill Change Amount Percentage and Number of Handsets to incorporate business understanding acquired from the executive and retention teams.
Deployed a decision tree model built with scikit-learn, helping reduce the churn rate from over 16% to 11%.
Identified important features through the analysis, which were used by the operations and marketing departments.

Project: Customer Segmentation (k-means, PCA, Random Forests, scikit-learn, pandas)

Responsibilities:
Sourced the required data for analysis from the client's marketing team.
Conducted sanity checks and inspected the data set for any errors.
Performed Exploratory Data Analysis, Feature engineering, and two-sample t-tests to identify under-performing and out-performing market channels for ticket booking.
Implemented k-means algorithm to identify natural clusters among customers using pre-determined k as per business requirements.
Examined the clusters to understand key differences among the customer segments.
Used Random forests to find important features within each cluster and generated decision trees to examine the factors leading to higher booking rates.
Environment: SQL Server, Tableau, Python 3, scikit-learn, pandas, numpy, scipy, R, ggplot2

Integreon, Noida, India, Aug 2013 – Sep 2014
Role: Data Analyst
Project: Marketing Mix Modeling (R, Linear Regression, Excel)

Responsibilities:
Analyzed advertising expenditure worth USD 70M across TV, digital, and promotions to calculate the ROI of product sales.
Performed data preparation by transforming variables using adstock and saturation techniques to capture media carry-over and diminishing-returns effects in the data (a sketch follows this list).
Built a linear regression model to predict product sales from Television Rating Points (TRP) and digital impressions, yielding a successful marketing mix model.
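A minimal sketch of the adstock and saturation transformations described above; the decay rate and saturation constant are illustrative, not the fitted values:

```python
# Sketch: geometric adstock plus a simple saturation curve for media spend.
import numpy as np

def adstock(spend: np.ndarray, decay: float = 0.5) -> np.ndarray:
    """Carry over prior ad effect: a_t = x_t + decay * a_{t-1}."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry
        out[t] = carry
    return out

def saturate(a: np.ndarray, half_sat: float = 100.0) -> np.ndarray:
    """Diminishing returns: response is half-saturated at half_sat."""
    return a / (a + half_sat)

tv_spend = np.array([120, 80, 0, 0, 150, 60], dtype=float)
print(saturate(adstock(tv_spend)))  # transformed regressor for the sales model
```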

Project: Health Platform Analysis (A/B Testing, Tableau, Optimizely, Mixpanel)
Responsibilities:
Worked with the Product Manager to identify and define success metrics for the platform.
Integrated user data from various sources such as web, mobile etc., into Mixpanel for further analysis.
Developed A/B tests using Optimizely to test call-to-action (CTA) buttons for the platform.
Performed Funnel analysis on the integrated data set to understand user engagement.
Built custom dashboards using Tableau to provide visibility into product analytics.
Successfully implemented the insights and achieved an increase in conversion rates from 1% to 3%.

Environment: R, Python, Optimizely, Mixpanel, Tableau, MongoDB

KN Bio Sciences, Hyderabad, India, July 2012 – July 2013
Role: Data Analytics Analyst
Project: Biologics Market Opportunity Assessment (Data Visualization, Primary and Secondary Research)

Responsibilities:

Identified value chain, market structure, key drivers and barriers that impact the Biologics market.
Utilized primary and secondary sources to build a framework identifying broad market segments, helping the client define their market positioning from a growth standpoint.
Made client recommendations using a variety of data visualization techniques such as bubble plots and Gantt charts, which led to new revenue streams amounting to USD 5 MM annually.