Siddharth - Data Scientist
[email protected]
Location: Remote, USA
Relocation: Yes
Visa: H1B
Siddharth
Data Scientist
Contact: 4699727527 EXT: 108 | [email protected]
Relocation: Yes | Visa: H1B

PROFESSIONAL SUMMARY
Versatile and innovative Generative AI Specialist and Data Engineer with extensive experience in Azure, Amazon, and Google Cloud environments.
Demonstrated ability in developing and managing end-to-end data pipelines across multiple platforms.
Adept at leveraging Azure AI and Machine Learning Studio; expert in both data science and data engineering disciplines.
Proven track record in GPU optimization, RAG vector search, and NLP and LLM technologies.
Experience with Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
Experience in Data Science/Machine Learning across domains such as Data Analytics, Predictive Modelling, Natural Language Processing (NLP), and Deep Learning.
Proficient in a wide variety of data science languages and libraries: Python, R, SQL, PySpark, Scikit-Learn, NumPy, SciPy, Pandas, NLTK, TextBlob, Gensim, SpaCy, Keras, and TensorFlow.
Experienced in facilitating the entire lifecycle of a data science project: data cleaning, data extraction, data pre-processing, dimensionality reduction, algorithm implementation, and validation.
Expert in machine learning algorithms such as Ensemble Methods (Random Forests), Linear, Polynomial, and Logistic Regression, Regularized Linear Regression, Support Vector Machines (SVM), Deep Neural Networks, Extreme Gradient Boosting, Decision Trees, K-Means, K-NN, Gaussian Mixture Models, and Naive Bayes.
Proficient in creating, debugging, scheduling, and monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes (a minimal sketch follows the Technical Skills section).
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Experience using Jira/Azure DevOps for ticketing and issue tracking and Jenkins for continuous integration and continuous deployment.
Experienced in data integration validation and data quality controls for ETL processes and data warehousing using MS Visual Studio, SSIS, SSAS, and SSRS.
Experience with the version control tool Git and build tools like Apache Maven/Ant.
Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, PySpark, and Spark SQL.
Skilled in data management, including data munging, data cleaning, data analytics, data visualization, and Big Data ecosystems using Hadoop, Hive, HDFS, MapReduce, Spark, Airflow, Snowflake, Teradata, Flume, Yarn, Oozie, and Zookeeper.
Solid understanding of Agile methodologies, Scrum stories, and sprints in a SQL- and Oracle-centric environment, bolstered by robust data analytics and data wrangling skills.
Capable of facilitating the entire lifecycle of a data project: data cleaning, data extraction, data pre-processing, dimensionality reduction, algorithm implementation, back testing, and validation.

TECHNICAL SKILLS:
Big Data Technologies: Azure Data Factory, Hadoop, MapReduce, HDFS, Hive, HBase, NiFi, Airflow, Apache Spark
Databases: Azure Cosmos DB, PostgreSQL Flexible Server, MySQL, Azure SQL DB, MongoDB
Programming: Python, R, Java, Shell script, SQL, Markdown
Machine Learning: LLM, LSTM, ResNet-50, RNN, CNN, Regression (Linear and Logistic), Decision Trees, Random Forest, SVM, KNN, PCA
ML Frameworks: Promptflow, LangChain, Pandas, Keras, NumPy, TensorFlow, Scikit-Learn, NLTK
Cloud Technologies: Azure, AWS, GCP
Azure Tools: Azure OpenAI, Machine Learning Studio, Promptflow, Azure Notebooks
AWS Tools: EC2, S3, Glue, Athena
Versioning Tools: Git, GitHub, Bitbucket
Operating Systems: Windows, Ubuntu Linux, macOS
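The summary above mentions scheduling Airflow jobs that batch-load data into Snowflake. Below is a minimal sketch of that pattern, assuming Airflow 2.4 or later with the apache-airflow-providers-snowflake package; the DAG name, connection ID, stage, and target table are hypothetical placeholders, not details from any listed project.

```python
# Minimal Airflow DAG sketch: extract a daily batch, then COPY it into Snowflake.
# Connection IDs, stage, and table names below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def extract_batch(**context):
    """Hypothetical extract step: pull the day's records and stage them."""
    # In a real pipeline this would query a source system and write the batch
    # to a stage (e.g. S3 or an internal Snowflake stage) keyed by the run date.
    print(f"extracting batch for {context['ds']}")


with DAG(
    dag_id="daily_etl_to_snowflake",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_batch", python_callable=extract_batch)

    load = SnowflakeOperator(
        task_id="load_to_snowflake",
        snowflake_conn_id="snowflake_default",  # assumed Airflow connection
        sql="COPY INTO analytics.daily_facts FROM @etl_stage/{{ ds }}/;",
    )

    extract >> load  # the load runs only after the extract succeeds
```

The downstream COPY step is gated on the extract task, which is the dependency pattern Airflow DAGs are built around.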
PROFESSIONAL EXPERIENCE:

BirlaSoft, Remote | Sept 2023 to Present
Role: Data Scientist / Gen-AI Data Engineer
Description: Birlasoft is a multi-shore business application global IT services provider with a presence in the United States, Europe, Asia-Pacific, and India.
Responsibilities:
Designed and implemented end-to-end data pipelines across AWS, GCP, and Azure platforms, ensuring seamless data flow and integration.
Used R to generate regression models for statistical forecasting.
Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, DataFrames, and Azure Databricks.
Implemented a POC of RAG vector search, utilizing the Azure OpenAI Ada 2 embeddings model and Machine Learning Studio notebooks for a custom GPT (see the sketch after this section).
Deployed NLP models into production environments and monitored their performance over time.
Stayed up to date with the latest research and developments in NLP and incorporated them into project workflows.
Worked with Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
Constructed and maintained production-level data engineering pipelines, optimizing ETL/ELT jobs from sources such as Azure SQL, Azure Data Factory, Blob Storage, Azure SQL Data Warehouse, and Azure Data Lake Analytics.
Designed and implemented streaming solutions using Kafka or Azure Stream Analytics.
Deployed several database environments, such as Cosmos DB for PostgreSQL and PostgreSQL Flexible Server with the pgvector extension, while choosing the right vector store for Retrieval Augmented Generation (RAG) vector search.
Developed AI and ML models using the Ada 2 embedding model and GPT-4 via Azure OpenAI, with applications in public healthcare.
Ensured data integrity and standardization through meticulous data preparation, including unit standardization and log transformation.
Programmed a utility in Python that used multiple packages (SciPy, NumPy, Pandas).
Built models using statistical techniques like Bayesian HMM and machine learning classification models like XGBoost, SVM, and Random Forest.
Set up CI/CD (Continuous Integration & Delivery) using Concourse pipelines to deploy code to DEV/STG/PRD environments.
Converted complex R code to Python, enhancing the efficiency and maintainability of data processes.
Performed exploratory data analysis on several datasets as part of Phase 1 of the project.
Environment: Azure OpenAI, Azure SQL DB, Azure Notebooks, Azure Data Factory, Azure Functions, Power BI, Promptflow, LangChain, PostgreSQL Flexible Server, pgvector extension, Cosmos DB, SharePoint, R, Python, Agile.
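A minimal sketch of the RAG vector-search pattern described in this role, assuming an Azure OpenAI embeddings deployment (named text-embedding-ada-002 here for illustration) and a PostgreSQL Flexible Server with the pgvector extension; the endpoint, DSN, and doc_chunks table are hypothetical placeholders rather than actual project values.

```python
# Minimal RAG vector-search sketch: embed a question with an Azure OpenAI
# embeddings deployment, then retrieve the nearest chunks from PostgreSQL +
# pgvector. Endpoint, deployment name, DSN, and table/column names are
# illustrative placeholders.
import os

import psycopg2
from openai import AzureOpenAI  # openai>=1.0

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)


def embed(text: str) -> list[float]:
    """Return the embedding vector for a piece of text."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return resp.data[0].embedding


def top_k_chunks(question: str, k: int = 5) -> list[str]:
    """Cosine-distance search over a hypothetical doc_chunks(content, embedding) table."""
    vec = "[" + ",".join(str(x) for x in embed(question)) + "]"  # pgvector literal
    with psycopg2.connect(os.environ["PG_DSN"]) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM doc_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s;",
            (vec, k),
        )
        return [row[0] for row in cur.fetchall()]


# The retrieved chunks would then be passed as grounding context to a GPT-4
# chat completion to answer the question.
```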
Comcast, Philadelphia | Sept 2020 to Aug 2023
Role: Data Scientist / Big Data Engineer
Description: Comcast Corporation is an American multinational telecommunications conglomerate headquartered in Philadelphia, Pennsylvania.
Responsibilities:
Designed and set up an Enterprise Data Lake to support various use cases, including analytics, processing, storing, and reporting of voluminous, rapidly changing data.
Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns (see the sketch after this section).
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
Used the AWS Glue catalog with crawlers to ingest data from S3 and perform SQL query operations.
Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
Implemented statistical modeling with the XGBoost machine learning package in Python to determine the predicted probabilities of each model.
Built classification models including Logistic Regression, SVM, Decision Tree, and Random Forest to predict customer churn rate.
Analyzed data using SQL, R, and Python, and presented analytical reports to management and technical teams.
Performed data cleaning and feature selection using the machine learning package in PySpark and worked with deep learning frameworks such as TensorFlow and Keras.
Performed Natural Language Processing (NLP) tasks such as sentiment analysis, entity recognition, topic modeling, and text summarization using Python libraries such as NLTK, TextBlob, SpaCy, and Gensim.
Segmented customers based on demographic, geographic, behavioral, and psychographic data using K-Means clustering.
Designed and implemented end-to-end systems for data analytics and automation, integrating custom visualization tools using Python and Tableau.
Used Pandas DataFrames, NumPy, Jupyter Notebook, SciPy, Scikit-Learn, TensorFlow, and Keras as tools for machine learning and deep learning.
Wrote complex SQL statements against the RDBMS database to filter data for analytics.
Used Apache Spark for big data processing, streaming, SQL, and machine learning (ML).
Developed PySpark code for AWS Glue jobs and for EMR.
Used Data Build Tool (dbt) for transformations in the ETL process, along with AWS Lambda.
Scheduled all jobs using Airflow scripts in Python, adding tasks to DAGs and defining dependencies between the tasks.
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon Athena.
Used Spark SQL through the Python interface, which automatically converts RDD case classes to schema RDDs.
Wrote various data normalization jobs for new data ingested into Amazon S3 and Amazon Athena.
Developed ETL pipelines in and out of the data warehouse using a combination of Python and PySpark.
Managed the entire product site on Tableau and QuickSight while dealing with products relating to various clients.
Created a pipeline to query the Athena database in real time for the size and number of columns involved, in order to reduce data size and optimize the pipelines.
Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources.
Involved in creating, debugging, scheduling, and monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes.
Developed a reusable framework, leveraged for future migrations, that automates ETL from RDBMS systems to the Data Lake using Spark Data Sources and Hive data objects.
Conducted data blending and data preparation using SQL for Tableau consumption and published data sources to Tableau Server.
Worked with version control tools like GitHub.
Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
Created Athena data sources on S3 buckets for ad hoc querying and business dashboarding using QuickSight and Tableau reporting tools.
Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
Environment: AWS EMR, EC2, S3, RDS, Athena, Glue, Auto Scaling, Elasticsearch, Lambda, Amazon SageMaker, Apache Spark, Hive, MapReduce, Snowflake, Python, Tableau, Agile.
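A minimal PySpark sketch of the column-hashing step noted above, assuming the client-specified columns are masked with SHA-256; the source/target paths and column list are hypothetical placeholders.

```python
# Minimal PySpark sketch: hash client-specified columns in the raw data with
# SHA-256 before it is persisted downstream. Paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

SENSITIVE_COLUMNS = ["customer_id", "email", "phone"]  # hypothetical client list

spark = SparkSession.builder.appName("mask-sensitive-columns").getOrCreate()

raw = spark.read.parquet("s3://raw-bucket/customers/")  # placeholder source

masked = raw
for col in SENSITIVE_COLUMNS:
    # sha2 returns a hex digest; cast to string first so numeric columns hash cleanly
    masked = masked.withColumn(col, F.sha2(F.col(col).cast("string"), 256))

masked.write.mode("overwrite").parquet("s3://curated-bucket/customers_masked/")
```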
HID Global, Palm Beach Gardens, FL | Sept 2019 to Aug 2020
Role: Data Scientist / Big Data Engineer
Responsibilities:
Worked with TensorFlow, Keras, NumPy, Scikit-Learn, the tf.data API, and Jupyter Notebook in Python at various stages of developing, maintaining, and optimizing machine learning models.
Extracted fingerprint image data stored on the local network to conduct exploratory data analysis (EDA), cleaning, and organization.
Ran the NFIQ algorithm to ensure data quality by collecting the high-score images, and created histograms to compare distributions of different datasets.
Used R and Spark (PySpark, MLlib) to implement different machine learning algorithms, including Generalized Linear Models, SVM, Random Forest, Boosting, and Neural Networks.
Performed data imputation using the Scikit-learn package in Python.
Designed and implemented the system architecture for an Amazon EC2 based cloud-hosted solution for the client.
Performed data pulls from AWS S3 buckets.
Used AWS Transcribe to obtain call transcripts and perform text processing.
Developed MapReduce/Spark Python modules for predictive analytics and machine learning in Hadoop on AWS.
Transformed the image dataset to protocol buffers, serialized it, and stored it in the TFRecord data format.
Loaded the data onto GPUs and achieved half-precision FP16 training on Nvidia Titan RTX and Titan V GPUs with TensorFlow 1.14.
Optimized the TFRecord data ingestion pipeline using the tf.data API and made it scalable by streaming over the network, enabling training of models on datasets larger than CPU memory.
Automated training and optimization of model hyperparameters to quickly conduct and test 50 different variations of the model, storing the results and generating automated reports.
Maintained models created by other data scientists and retrained them with different variations of datasets.
Created tooling for other data scientists to help them become more effective at exploring the data and other tasks.
Productized existing TensorFlow models by converting them to TFLite format, which allowed integration with existing C++ and Android applications.
Conducted transfer learning using a ResNet50 pretrained model by freezing the bottom layers and retraining the top layers with fingerprint images (see the sketch after this section).
Visualized what the internal layers of the CNN are seeing by creating Class Activation Maps (CAM).
Used validation and test sets to avoid overfitting the model and ensure accurate predictions, and measured performance using confusion matrices and ROC curves.
Environment: Hadoop, Agile, MapReduce, Snowflake, Spark, Hive, Kafka, Python, R, Airflow, JSON, AWS, EC2, S3, Athena, Glue, Auto Scaling, EKS, ELB, TensorFlow, Keras.
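A minimal sketch of the ResNet50 transfer-learning setup described above, written against the current tf.keras API rather than the TensorFlow 1.14 stack mentioned in the role; the class count, image size, and data directories are hypothetical placeholders.

```python
# Minimal transfer-learning sketch: freeze a ResNet50 base pretrained on
# ImageNet and train a small classification head on fingerprint images.
# Class count, image size, and data directories are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

NUM_CLASSES = 4  # hypothetical number of fingerprint classes

base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional base

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical input pipeline built from a directory of labelled images.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/fingerprints/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/fingerprints/val", image_size=(224, 224), batch_size=32)

model.fit(train_ds, validation_data=val_ds, epochs=5)
```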
CISCO, San Francisco, CA | March 2018 to Jul 2019
Role: Data Scientist / Big Data Engineer
Responsibilities:
Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
Set up storage and data analysis tools on Amazon Web Services (AWS) cloud computing infrastructure.
Worked with several R packages including knitr, dplyr, SparkR, Causal Infer, and Space-Time.
Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python for developing various machine learning algorithms.
Determined customer satisfaction and helped enhance customer experience using NLP.
Used R and Spark (PySpark, MLlib) to implement different machine learning algorithms, including Generalized Linear Models, SVM, Random Forest, Boosting, and Neural Networks.
Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
Used data quality validation techniques to validate Critical Data Elements (CDE) and identified various anomalies.
Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
Worked on NoSQL databases like Cassandra.
Programmed a utility in Python that used multiple packages (SciPy, NumPy, Pandas).
Implemented classification using supervised algorithms like Logistic Regression, Decision Trees, KNN, and Naive Bayes (see the sketch after this section).
Used different file formats like text files, Sequence Files, Avro, Record Columnar (RC), and ORC.
Used Amazon Web Services (AWS) such as EC2 and S3 for small data sets.
Designed both 3NF data models for ODS and OLTP systems and dimensional data models using Star and Snowflake schemas.
Updated Python scripts to match training data with our database stored in AWS CloudSearch so that we could assign each document a response label for further classification.
Created SQL tables with referential integrity and developed queries using SQL, SQL*Plus, and PL/SQL.
Designed and developed Use Case, Activity, and Sequence Diagrams and Object-Oriented Design (OOD) using UML and Visio.
Environment: AWS, R, Machine Learning Algorithms, Anaconda, Predictive Analytics, Deep Learning Algorithms, CNN, HCNN, Python, Data Mining, Data Collection, Data Cleaning, Validation, HDFS, Hive, OLAP, Metadata, MS Excel, SQL, and MongoDB.
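A minimal scikit-learn sketch of the supervised classification comparison mentioned above (Logistic Regression, Decision Tree, KNN, Naive Bayes); the bundled dataset and hyperparameters are stand-ins, not project data.

```python
# Minimal scikit-learn sketch: compare several supervised classifiers on a
# held-out split. The bundled dataset is a stand-in, not project data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "naive_bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```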
Pitney Bowes, Los Angeles, California | Jan 2018 to March 2018
Role: Data Scientist / Data Engineer
Responsibilities:
Implemented machine learning methods, optimization, and visualization, including statistical methods such as Regression Models, Decision Trees, Naïve Bayes, Ensemble Classifiers, Hierarchical Clustering, and Semi-Supervised Learning on different datasets using Python.
Participated in data acquisition with the Data Engineering team to extract historical and real-time data using Hadoop MapReduce and HDFS.
Developed merge jobs in Python to extract and load data into a MySQL database.
Used a test-driven approach for developing the application and implemented unit tests using the Python unittest framework.
Prepared scripts in Python and Shell to automate administration tasks.
Maintained PL/SQL objects such as packages, triggers, and procedures.
Completed a highly immersive data science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB, and Hadoop.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the Cosmos activity.
Tested various machine learning algorithms such as Support Vector Machines (SVM), Random Forest, and trees with XGBoost, and concluded that Decision Trees were the champion model.
Built models using statistical techniques like Bayesian HMM and machine learning classification models like XGBoost, SVM, and Random Forest.
Worked with different data formats such as JSON and XML and applied machine learning algorithms in Python.
Environment: Machine Learning, R Language, Hadoop, Big Data, Azure, Python, Spark, Scala, HBase, MySQL, MongoDB, Agile.

Tesco PLC, Bangalore, India | Dec 2014 to July 2017
Role: Data Engineer / Developer
Responsibilities:
Collected data from the end client, performed ETL, and defined a uniform standard format.
Wrote queries to retrieve data from the SQL Server database to get a sample dataset containing basic fields.
Performed string formatting on the dataset, converting hours from a date format to a numerical integer.
Used Python libraries like Matplotlib to visualize the numerical columns of the dataset, such as day of week, age, hour, and number of screens.
Developed Hive queries for analysis and exported the result set from Hive to MySQL using Sqoop after processing the data.
Created HBase tables to store various formats of data coming from different portfolios.
Worked on improving the performance of existing Pig and Hive queries.
Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and R with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
Researched ensemble learning methods like Random Forest, Bagging, and Gradient Boosting, and picked the final model based on the confusion matrix, ROC, and AUC.
Worked on missing value imputation and outlier identification with statistical methodologies using Pandas and NumPy.
Tuned the hyperparameters of the above models using Grid Search to find the optimum models, and designed and implemented K-Fold cross-validation to test and verify model significance (see the sketch after this section).
Developed a dashboard and story in Tableau showing the benchmarks and a summary of the models' measures.
Used tools like R, Python, and MS Excel extensively to analyze data from multiple perspectives and deliver a robust machine learning algorithm.
Created new tools and business processes that simplify, standardize, and enable operational excellence.
Used tools like Tableau for drilling down into data, creating insightful reports, and garnering actionable business insights.
Environment: Tableau, R, MS Outlook, SQL Server, Python (Scikit-Learn, NumPy, Pandas, Matplotlib), Hadoop.
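A minimal sketch of the Grid Search plus K-Fold cross-validation workflow described above; the synthetic dataset, parameter grid, and scoring metric are illustrative assumptions.

```python
# Minimal sketch: tune a Random Forest with GridSearchCV over stratified K-fold
# splits and report the best parameters and score. Synthetic data and grid
# values are illustrative stand-ins, not project data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1,
)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best cross-validated ROC AUC: {search.best_score_:.3f}")
```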
Sapiens, India | Apr 2012 to Nov 2014
Role: Data Analyst
Responsibilities:
Worked with leadership teams to implement tracking and reporting of operations metrics across global programs.
Worked with large data sets, automated data extraction, and built monitoring/reporting dashboards and high-value, automated business intelligence solutions (data warehousing and visualization).
Gathered business requirements and interacted with users and SMEs to get a better understanding of the data.
Performed data entry, data auditing, creation of data reports, and monitoring of all data for accuracy.
Designed, developed, and modified various reports.
Performed data discovery and built a stream that automatically retrieves data from a multitude of sources (SQL databases, external data such as social network data, and user reviews) to generate KPIs using Tableau.
Wrote ETL scripts in Python/SQL for extracting and validating the data (see the sketch after this section).
Created data models in Python to store data from various sources.
Interpreted raw data using a variety of tools (Python, R, Excel Data Analysis ToolPak), algorithms, and statistical/econometric models (including regression techniques, decision trees, etc.) to capture the bigger picture of the business.
Created and presented dashboards to provide analytical insights into data for the client.
Translated requirement changes, analyzing and providing data-driven insights into their impact on the existing database structure as well as existing user data.
Worked primarily on SQL Server, creating stored procedures, functions, triggers, indexes, and views using T-SQL.
Environment: SQL Server, ETL, SSIS, SSRS, Tableau, Excel, R, Python, Django.
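A minimal sketch of the Python/SQL extraction-and-validation pattern mentioned above, assuming a SQL Server source reachable through SQLAlchemy with the pyodbc driver; the connection string, table, and validation rules are hypothetical placeholders.

```python
# Minimal ETL validation sketch: pull a table from SQL Server via SQLAlchemy,
# run basic data-quality checks, and write a clean extract. Connection string,
# table name, and the checks themselves are illustrative placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@dbserver/reporting"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

orders = pd.read_sql(
    "SELECT order_id, customer_id, amount, order_date FROM dbo.orders", engine
)

# Basic validation: required keys present, no duplicates, sane amounts.
assert orders["order_id"].notna().all(), "order_id must not be null"
assert not orders["order_id"].duplicated().any(), "order_id must be unique"
assert (orders["amount"] >= 0).all(), "amounts must be non-negative"

orders.to_csv("orders_extract.csv", index=False)
print(f"validated and exported {len(orders)} rows")
```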