Krishna Mo - Sr. Data Scientist / GenAI / ML |
[email protected] |
Location: Dallas, Texas, USA |
Relocation: Yes |
Visa: GC |
Krishna M
Sr. Data Scientist / GenAI / ML | [email protected] | +1 (940) 226-2019

Professional Summary:
- Senior Data Scientist and Data Analyst with 10+ years of professional experience in statistical modeling, data mining, machine learning (ML), artificial intelligence (AI), Tableau, and data exploration and visualization of structured and unstructured datasets, building deep learning models grounded in business understanding to deliver insights that drive key business decisions.
- Experience applying AI techniques such as natural language processing (NLP), large language models (LLMs), computer vision, and machine learning algorithms to process and clean raw data.
- Integrated Docker containers with Kubernetes to orchestrate and scale a high-availability web application.
- Utilized GCP resources, namely BigQuery, Cloud Composer, Compute Engine, Kubernetes clusters, and GCP storage buckets, to build production ML infrastructure.
- Expertise in building data pipelines that automate data processing and keep it consistent.
- Solid understanding of data modeling, data collection, data cleansing, data warehouse/data mart design, Snowflake, ETL, BI, OLAP, and client/server applications.
- Proficient in handling data with libraries such as Pandas, NumPy, NLTK, TextBlob, Gensim, and scikit-learn.
- Expertise in cleaning, blending, and transforming data from multiple sources using Alteryx.
- Experience with Hadoop and Spark, including Java-based development on both.
- Proficient in visualizing data with the Python Matplotlib, Seaborn, and Plotly packages.
- Expertise in statistical analysis using hypothesis testing methods such as one-sample z-tests, t-tests, two-sample tests, chi-square tests, and ANOVA (see the short sketch after this group of bullets).
- Exposure to the AWS cloud environment, including S3, DynamoDB, EC2, and IAM.
- Designed, developed, and implemented GenAI, LLM, and NLP models on production ML infrastructure.
- Sound expertise in web scraping using Python packages such as requests, bs4, lxml, and Selenium.
- Ability to write optimized, complex SQL queries using GROUP BY, window functions, CASE statements, joins, etc.
- Utilized a range of technologies and methodologies including Spark, Hadoop, HBase, Kafka, Spark Streaming, Caffe, TensorFlow, MLOps, and Python.
- Applied various machine learning techniques such as classification, regression, and dimensionality reduction.
- Facilitated Agile ceremonies such as sprint planning, daily stand-ups, sprint reviews, and retrospectives.
- Understanding of AWS CloudFormation, RDS, SNS, SQS, Lambda functions, and VPC.
- Experience with Apache Spark, an open-source distributed computing system, and its Python APIs.
- Utilize GenAI models to generate new creative text formats and answer questions.
- Adaptive to learning new technologies, with sound knowledge of deep learning frameworks and packages such as Keras, PyTorch, and TensorFlow.
- Facilitate Scrum meetings to track progress, address roadblocks, and ensure team alignment.
- Understanding of various machine learning algorithms such as linear regression, Lasso and Ridge, logistic regression, KNN, naive Bayes, decision trees, random forests, k-means, DBSCAN, XGBoost, AdaBoost, gradient descent, isolation forests, Gaussian mixture models, SVM/SVC, matrix factorization, etc.
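A minimal sketch of the hypothesis tests named above, in Python with SciPy; the samples, group sizes, and contingency table are synthetic values assumed purely for illustration:

    # Hedged sketch: one-sample t-test, two-sample t-test, chi-square, and ANOVA
    # on made-up data, mirroring the tests listed in the summary.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample_a = rng.normal(loc=5.0, scale=1.0, size=100)  # e.g., a metric before a change
    sample_b = rng.normal(loc=5.3, scale=1.0, size=100)  # e.g., the metric after a change

    # One-sample t-test: is the mean of sample_a different from 5.0?
    t_stat, p_val = stats.ttest_1samp(sample_a, popmean=5.0)
    print(f"one-sample t-test: t={t_stat:.3f}, p={p_val:.3f}")

    # Two-sample t-test: do the two samples have different means?
    t_stat, p_val = stats.ttest_ind(sample_a, sample_b)
    print(f"two-sample t-test: t={t_stat:.3f}, p={p_val:.3f}")

    # Chi-square test of independence on a made-up 2x2 contingency table.
    table = np.array([[30, 20], [25, 25]])
    chi2, p_val, dof, _ = stats.chi2_contingency(table)
    print(f"chi-square: chi2={chi2:.3f}, p={p_val:.3f}, dof={dof}")

    # One-way ANOVA across three groups.
    group_c = rng.normal(loc=5.1, scale=1.0, size=100)
    f_stat, p_val = stats.f_oneway(sample_a, sample_b, group_c)
    print(f"ANOVA: F={f_stat:.3f}, p={p_val:.3f}")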
- MLOps covers deploying machine learning models into production environments, whether on-premises, in the cloud, or at the edge.
- Understanding of the AWS CLI and the Python boto3 package for connecting to AWS from Python.
- Ability to handle natural language processing (NLP) tasks with NLTK, Gensim, large language models (LLMs), Hugging Face, TextBlob, etc.
- Experienced in leveraging Alteryx's spatial tools for geospatial analysis and mapping.
- Spearheaded the transition from Waterfall to Agile, resulting in a 30% increase in project delivery speed.
- Good understanding of and experience with deep learning frameworks such as TensorFlow and PyTorch, especially for tasks involving complex data like images, text, or time series.
- Basic understanding of Docker, Kubernetes, and Jenkins (CI/CD).
- Facilitated all Scrum ceremonies and ensured effective communication among team members and stakeholders.
- Worked with open-source database systems like PostgreSQL and MySQL, which provide powerful SQL capabilities.
- Familiar with deploying pre-trained LLMs in production environments using frameworks such as TensorFlow, PyTorch, and Hugging Face's Transformers library, and integrating them into larger applications and pipelines.
- Used managed GCP services that provide auto-scaling, monitoring, and integration with other GCP services.
- Integrated Kafka with Spark frameworks for real-time data processing.
- Skilled in data parsing, manipulation, and preparation, including profiling and describing data contents.
- Strong experience in the analysis, design, development (including Scala), testing, and implementation of Business Intelligence solutions using data warehouse/data mart design, ETL, and client/server applications.

Technical Skills:
Programming Languages: Python (3.x), C, R, HTML, CSS, AngularJS, Java
Algorithms: Linear Regression, Lasso & Ridge, Logistic Regression, KNN, Naive Bayes, Decision Tree, Random Forest, Apriori, K-Means, DBSCAN, XGBoost, AdaBoost, Gradient Descent, Isolation Forest, Gaussian Mixture Models, SVM, Deep Learning Neural Networks, Simulated Annealing
Natural Language Packages: NLTK, Gensim, Hugging Face, TextBlob, googletrans, NLP, LLMs
Data Science Packages: Pandas, NumPy, SciPy, scikit-learn, XGBoost, Keras, TensorFlow, PyTorch, SHAP, MLOps
Visualization: Matplotlib, Seaborn, Plotly
Misc: Flask, Bottle, Streamlit, requests, bs4, lxml, Selenium
Cloud: AWS (EC2, IAM, VPC, CloudFormation, RDS, Lambda, SNS, SQS, S3), Azure, GCP
Version Control: Git
Big Data: Hive, PySpark
Operating Systems: Windows, Linux (Ubuntu, Red Hat)
Reporting Tools: Tableau, Power BI, MSBI (SSIS, SSRS, SSAS)
Databases: SQL Server, Oracle, MySQL, SQLite, HBase, MongoDB, Cassandra, PostgreSQL, DynamoDB

Professional Experience

Client: US Bank, Dallas, TX                                Jan 2023 - Present
Role: Sr. Data Scientist / GenAI / ML
Responsibilities:
- Created parameters, action filters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Developed and trained AI, machine learning, and deep learning models to solve various predictive and classification tasks.
- Interacted with architects on solutions for data visualization using Tableau and Python packages.
- Applied GenAI models and techniques to unlock creativity and solve complex problems.
- Successfully delivered multiple NLP solutions, including a chatbot that helps customers troubleshoot claim issues and recommends actions (see the intent-classification sketch below).
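One way such a chatbot could route customer messages is zero-shot intent classification with Hugging Face's Transformers pipeline; a minimal sketch, where the model choice, example message, and intent labels are assumptions for illustration, not details from the engagement:

    # Hedged sketch: zero-shot intent classification for a claims chatbot.
    # The model name and candidate intents below are illustrative assumptions.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    message = "My claim was denied and I don't understand why."
    intents = ["claim status", "claim denial", "update policy", "general question"]

    result = classifier(message, candidate_labels=intents)
    # result["labels"] is sorted by score; the top intent drives the routing.
    print(result["labels"][0], round(result["scores"][0], 3))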
- Built and published customized interactive reports and dashboards, with report scheduling, on Tableau Server.
- Leveraged MLOps CI/CD pipelines to automate the process of building, testing, and deploying machine learning models.
- Used Java where it helped work more effectively, whether writing custom MapReduce jobs in Hadoop or developing Spark applications.
- Worked with PyTorch and TensorFlow, the most popular open-source deep learning frameworks.
- Able to work across both GCP and Azure clouds in parallel.
- Utilized Google SQL and MySQL to extract, manipulate, and analyze large datasets for actionable insights.
- Developed and deployed GPT-based LLM and NLP models to automate legal document analysis, saving 30% of manual review time.
- Facilitated Scrum ceremonies including sprint planning, daily stand-ups, sprint reviews, and retrospectives.
- Managed an Agile project to enhance forecasting accuracy using advanced analytics and machine learning techniques.
- Used LLMs in AI systems designed to understand the intricacies of human language and generate intelligent, creative responses to queries.
- Developed data pipelines, transformations, and models using ETL to migrate data to a Snowflake data warehouse.
- Proficient in NumPy for numerical computing and in Pandas for manipulating, cleaning, and analyzing large datasets.
- Responsible for operations and support of the big data analytics platform, Splunk, and Tableau visualization.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
- Involved in data ingestion to Azure Data Lake and Azure Databricks by building pipelines in Azure Data Factory (ADF).
- Continuously explore new technologies within GCP and GenAI; experienced with both generative and discriminative machine learning algorithms.
- Used Pandas, NumPy, Seaborn, Matplotlib, scikit-learn, SciPy, and NLTK in Python to develop various machine learning algorithms.
- Responsible for building an Azure cloud enterprise data platform, including establishing connections between Azure resources (ADF, Databricks, ADLS Gen2, and storage-layer access for ADF).
- Communicated insights and findings from data analysis effectively through visualization.
- Checked that AI models are unbiased and do not perpetuate existing societal biases present in the training data.
- Proficient in using open-source LLMs such as GPT-2 and BERT for NLP tasks including text generation, text classification, sentiment analysis, and named entity recognition.
- Spearheaded the Agile transformation within the data team, leading to a 25% increase in project delivery speed.
- Utilized Apache Spark SQL with Python to develop and execute big data analytics and machine learning applications, and ran machine learning use cases under Spark ML and MLOps (see the PySpark sketch below).
- Designed and developed NLP models for sentiment analysis.
- Used SQL to interact with relational databases such as PostgreSQL and MySQL.
- Worked with Java in enterprise environments where many existing systems and applications are built on it.
- Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases on AWS EMR and S3.
- Facilitated all Scrum ceremonies, including sprint planning, reviews, and retrospectives, ensuring the team adhered to Agile principles and delivered high-quality results.
- Optimized query performance and the Snowflake architecture.
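A minimal PySpark sketch of running Spark SQL from Python, as referenced in the Spark SQL bullet above; the view name, columns, and sample rows are illustrative assumptions:

    # Hedged sketch: register a DataFrame as a temp view and query it with SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("analytics-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("2023-01-01", "retail", 120.0), ("2023-01-01", "online", 340.0)],
        ["date", "channel", "revenue"],
    )
    df.createOrReplaceTempView("sales")

    # Run SQL directly against the registered view.
    daily = spark.sql("""
        SELECT date, channel, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY date, channel
    """)
    daily.show()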
- PyTorch's API is Pythonic and intuitive, making it easier to write and debug code; both the TensorFlow and PyTorch communities are growing, with rich ecosystems of libraries and resources for deep learning research and development.
- Led an Agile project team to analyze sales performance data and identify key revenue growth drivers.
- Worked on machine learning over large datasets using Spark SQL, MySQL, and MapReduce.
- Applied various machine learning algorithms and statistical modeling to identify volume, using the scikit-learn package in Python and MATLAB.
- Skilled in fine-tuning pre-trained LLMs to adapt them to specific tasks and domains, leveraging techniques such as transfer learning and domain adaptation.
- Strong foundation in natural language processing (NLP) with generative models and text generation.
Environment: SQL Server, Oracle, MS Office, LLMs, TensorFlow, GenAI, NLP, Scrum, Snowflake, PyTorch, Java, Google SQL, scikit-learn, Machine Learning, Agile, Pandas, Artificial Intelligence, MySQL, PostgreSQL, GCP, Azure, NumPy, ML Infrastructure, Teradata, GPT, MLOps, Informatica, ER Studio, XML, R connector, Python, R, Tableau 9.2.

Client: Terumo Heart Inc, Michigan                         Oct 2020 - Nov 2022
Role: Sr. Data Scientist / Sr. Data Analyst
Project: Product Claim Analysis / Competitor Products Analysis (E-Commerce)
Responsibilities:
- Worked with large amounts of structured and unstructured data.
- Knowledge of machine learning concepts (generalized linear models, regularization, random forests, time series models, etc.).
- Used GenAI monitoring systems to track the performance of deployed models, detecting anomalies or drift in data distributions that could affect model accuracy.
- Worked with business intelligence and visualization tools such as BusinessObjects, Tableau, and Chart.io.
- Strong knowledge of deploying GenAI models into production environments to make real-time predictions or automate decision-making processes.
- Developed and executed ETL pipelines utilizing AWS Glue to process and transform large datasets from multiple sources.
- Used Alteryx to enhance customer segmentation, resulting in a 15% increase in targeted marketing campaign efficiency.
- Led a cross-functional Agile team to design and implement a data processing system using Apache Kafka and Spark.
- Relied on NumPy, the fundamental package for numerical computing in Python, and Pandas, whose data structures and functions for manipulating structured data make it indispensable for preprocessing and analysis.
- Guided Scrum, from orchestrating daily stand-ups to facilitating sprint planning, reviews, and retrospectives.
- Monitored the performance of deployed models in real time, tracking key metrics such as accuracy, latency, and throughput.
- Utilized Amazon Athena for SQL queries on data stored in S3, resulting in a 30% enhancement in query performance.
- Experienced with AWS Glue as a serverless service: with no infrastructure to provision or manage, the focus stays on data analysis and transformation.
- Used Java's own libraries and frameworks for data analysis and machine learning.
- Deployed GUI pages using JSP, JSTL, HTML, DHTML, XHTML, CSS, JavaScript, AJAX, and Python.
- Utilized GCP's machine learning libraries, such as TensorFlow, for advanced feature engineering tasks (see the sketch below).
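A minimal TensorFlow sketch of one such feature-engineering step, standardizing numeric features with a Normalization layer; the feature values below are synthetic assumptions:

    # Hedged sketch: learn per-feature mean/variance from raw data, then scale it.
    import numpy as np
    import tensorflow as tf

    raw_features = np.array(
        [[25.0, 50000.0], [40.0, 82000.0], [31.0, 61000.0]], dtype="float32"
    )

    norm = tf.keras.layers.Normalization()  # learns per-feature statistics
    norm.adapt(raw_features)                # fit the layer on the raw data

    scaled = norm(raw_features)             # zero-mean, unit-variance features
    print(scaled.numpy())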
- Utilized GPT models to develop natural language processing (NLP) solutions, including chatbots and text tools.
- Used PostgreSQL and MySQL for data preprocessing tasks such as data cleaning, filtering, and joining.
- Experience with the leading platform for building Python programs that work with human language data.
- Specialize in large language models (LLMs), used in artificial intelligence (AI) programs to recognize and generate text, among other tasks.
- Expertise in visualizing data insights and trends with Tableau.
- Proficient in performing complex statistical analyses with Alteryx to uncover patterns and insights.
- Automated data extraction, transformation, and loading processes with Glue Workflows, reducing manual intervention by 50%.
- Managed an Agile project to migrate the existing data warehouse to AWS Redshift.
- Extensive knowledge of the Snowflake database, including database schemas and table structures.
- Experienced in evaluating LLM performance using metrics such as accuracy, precision, recall, F1-score, and perplexity, and adept at interpreting model outputs to assess their quality (see the sketch below).
- Able to preprocess and clean messy data, handle outliers, and perform data wrangling efficiently using Python libraries, which is crucial for real-world data analysis.
- Optimized Athena queries by partitioning data and using appropriate file formats such as Parquet and ORC.
- Evaluated machine learning model performance using various metrics and techniques; cross-validation, hyperparameter tuning, and model comparison helped select the best-performing model for each task.
- Proficient in the Scrum framework, including sprint planning, backlog grooming, and sprint retrospectives.
- Collaborated with software developers who predominantly use Java for application development.
- Optimized resource usage and costs by leveraging GCP's pricing models and cost management tools.
- Addressed the challenges of scaling machine learning workloads to handle increasing volumes of data and requests.
- Ability to convey complex data insights in an understandable and compelling manner using Tableau.
- Experience in dimensional modeling (star and snowflake schemas), transactional modeling, and slowly changing dimensions (SCD).
- Interfaced Python's Pandas library and R directly with PostgreSQL and MySQL databases.
- Experienced in documenting LLM development processes, experiments, and results, and proficient in communicating technical concepts and findings to both technical and non-technical stakeholders.
- Combined expertise in data analysis, machine learning, and cloud computing on GCP to derive insights from data, build predictive models, and deploy them at scale to drive business value.
- Used artificial intelligence (AI) algorithms to explore and visualize the characteristics of the data.
- Strong programming skills in Python and R, with experience in AI frameworks like TensorFlow and PyTorch.
- Developed predictive analytics using PySpark and Spark SQL on Databricks to extract, transform, and uncover insights from raw data.
- Created Alteryx workflows to automate data cleaning, reducing data processing time by 40%.
- Used the AWS Glue Data Catalog with Amazon Athena to query data with SQL.
- Experienced in scaling LLM training and inference using parallel and distributed computing frameworks such as TensorFlow's distributed training, PyTorch's DistributedDataParallel, or distributed platforms like Apache Spark and Spark SQL.
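A minimal sketch of the classification-metric evaluation mentioned above (accuracy, precision, recall, F1) using scikit-learn; the predicted and true labels are made up for illustration:

    # Hedged sketch: score model outputs against gold labels with standard metrics.
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    y_true = ["positive", "negative", "positive", "neutral", "negative"]
    y_pred = ["positive", "negative", "neutral", "neutral", "negative"]

    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")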
- Implemented the project in a Linux environment.
Environment: R 3.0, Erwin 9.5, Tableau 8.0, MDM, GenAI, LLMs, Amazon Athena, Scrum, Python, Snowflake, PostgreSQL, TensorFlow, PyTorch, Google SQL, AWS Glue, Alteryx, Agile, Machine Learning, Artificial Intelligence, AWS, GPT, GCP, NumPy, Pandas, QlikView, MLOps, PL/SQL, HDFS, Teradata 14.1, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, R Studio, Mahout, Java, Hive.

Client: Verizon, Irving, Texas                             Jun 2018 - Sep 2020
Role: Data Scientist / Data Analyst
Project: Production Scheduling and Optimization
Responsibilities:
- Designed and developed a call quality control system using Python and machine learning techniques such as a random forest regressor built with scikit-learn (see the pipeline sketch after this group of bullets).
- Conducted feature selection using GridSearchCV to identify and select the most important features for predicting call quality.
- Conducted data cleaning and preprocessing using Pandas and NumPy to ensure the data was suitable for analysis.
- Used standardization techniques to normalize the data and ensure consistency across all call quality metrics.
- Applied version-control principles to manage changes to machine learning models, datasets, and codebases (MLOps).
- Developed and utilized scikit-learn modeling pipelines to automate the entire modeling process and increase efficiency.
- Integrated and analyzed data from various sources (SQL databases, APIs, Excel files) using Alteryx to provide comprehensive business intelligence reports.
- Collaborated with cross-functional teams, contributing to the Agile development process and ensuring timely delivery of analytics solutions.
- Utilized the SciPy library for statistical analysis and hypothesis testing to identify potential correlations between call quality metrics.
- Conducted regular testing and maintenance to ensure the call quality control system always ran optimally.
- Conducted data exploration to identify potential outliers and anomalies in the call quality data.
- Created documentation and training materials to enable stakeholders to understand and use the call quality control system effectively.
- Worked closely with the Scrum Master and product owner to ensure the team delivered value to the customer each sprint.
- Performed statistical analyses and hypothesis testing with Alteryx, providing actionable insights for product development and optimization.
- Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
- Worked in a fast-paced Agile development environment to quickly analyze, develop, and test potential use cases for the business.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SQL.
- Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize the raw data.
- Successfully loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
- Configured, monitored, and automated Amazon Web Services, and was involved in deploying the content cloud platform on AWS using EC2, S3, and EBS.
- Used AWS Lambda to perform data validation, filtering, and transformations for every data change in a DynamoDB table and to load the transformed data into another data store.
- Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
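A minimal sketch of the modeling pipeline described above: standardization plus a random forest regressor tuned with GridSearchCV; the synthetic features and the parameter grid are illustrative assumptions:

    # Hedged sketch: scikit-learn Pipeline + GridSearchCV for a regression task.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))  # e.g., jitter, latency, packet loss, signal inputs
    y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.1, size=200)

    pipe = Pipeline([
        ("scale", StandardScaler()),                      # normalize the metrics
        ("rf", RandomForestRegressor(random_state=0)),    # call-quality regressor
    ])

    grid = GridSearchCV(
        pipe,
        param_grid={"rf__n_estimators": [100, 300], "rf__max_depth": [None, 10]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, round(grid.best_score_, 3))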
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on Azure.
- Prioritized delivering customer value by aligning development efforts with customer needs and gathering continuous feedback, notably through Sprint Reviews.
- Utilized Spark, Hadoop, HBase, Kafka, Spark Streaming, Caffe, TensorFlow, MLOps, and Python with a broad variety of machine learning methods, including classification, regression, dimensionality reduction, etc.
- Experience applying LLMs to legal aspects such as intellectual property rights related to data and algorithms, data governance, and compliance.
- Performed time series analysis using Tableau Desktop, and created detail-level summary reports and dashboards using KPIs and visualized trend analysis.
Environment: Python, R Studio, Oracle, Snowflake, Machine Learning, Regression, KNN, SVM, SQL, Decision Tree, Scrum, Random Forest, XGBoost, Alteryx, AWS, Azure, Agile, Collaborative Filtering, Ensembles, Pandas, GCP, NLP, R, MLOps, Flink, Spark, Hive, MapReduce, Hadoop, scikit-learn, Keras, TensorFlow, Seaborn, NumPy, SciPy, MySQL, Tableau.

Client: Norfolk Southern Corporation, Atlanta, GA          Oct 2015 - Nov 2017
Role: Data Analyst
Project: Audience Profiling (Marketing, Media Tech)
Responsibilities:
- Responsible for analyzing the data in Hive tables, extracting it with PySpark SQL.
- Analyzed the data with RDDs using Apache Spark and the Python API, PySpark.
- Preprocessed the data using PySpark: removed duplicates, handled missing values, and performed feature engineering.
- Leveraged cloud-based technologies like AWS Glue and Lambda to optimize data workflows and enhance analytical capabilities.
- Optimized data storage and query performance by partitioning datasets and using efficient file formats such as Parquet or ORC on Amazon S3, leveraging AWS Glue and Athena.
- Segmented customers by features such as demographics, purchase behavior, browsing behavior, and customer lifetime value.
- Increased team efficiency by 15% by implementing Agile practices and optimizing workflows.
- Built an ETL pipeline to ingest and transform sales data from multiple sources using AWS Glue, resulting in a centralized data warehouse.
- Utilized Amazon Athena to provide real-time analytics and reporting capabilities to the sales team.
- Built regression models to predict customer lifetime value from historical data using algorithms such as linear regression, Lasso and Ridge, and random forest regression.
- Created data pipelines by collaborating with data engineering on ETL pipelines.
- Performed the elbow method to find the optimal K for k-means clustering, and built a clustering model to segment customers using algorithms such as k-means and Gaussian mixture models (see the sketch after this section).
- Evaluated the clusters with silhouette scores to measure the quality of the clustering results.
- The model helped identify which customers to target with different products, increasing sales.
Environment: Hive, PySpark, MLOps, AWS Glue, K-Means, Agile, Amazon Athena, Gaussian Mixture Models, Linear Regression, Python, Lasso, Ridge, Random Forest, SQL.
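A minimal sketch of the k-means selection workflow described in the section above (elbow method on inertia, then silhouette scoring); the clustered data is synthetic:

    # Hedged sketch: sweep k, watch inertia drop (elbow) and silhouette peak.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    # Three synthetic, well-separated customer segments in 2D feature space.
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 4, 8)])

    for k in range(2, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sil = silhouette_score(X, km.labels_)
        # inertia_ falls sharply until the "elbow"; silhouette peaks near the true k.
        print(f"k={k} inertia={km.inertia_:.1f} silhouette={sil:.3f}")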
Client: Intuit, Bangalore, India                           Oct 2013 - Sep 2015
Role: Data Analyst
Project: Financial Data Analysis (Finance)
Responsibilities:
- Proficient in understanding and analyzing data in the financial domain; met with stakeholders to understand and frame business problems.
- Extracted data from various sources, including flat files and databases such as Oracle, using SQL.
- Analyzed the data for gaps and performed exploratory data analysis, including univariate and bivariate analysis, with the help of Pandas.
- Analyzed the data to identify gaps and to study the various patterns in the financial transactions.
- Led an Agile team to develop an interactive dashboard that provides real-time customer insights using Tableau.
- Classified transactions with clustering algorithms such as k-means, analyzing and labeling the clusters to create profiles for each segment.
- Identified anomalous/fraudulent transactions using the isolation forest algorithm, marking flagged transactions for further investigation.
- Checked for fraudulent transactions based on the transaction type, amount, reason, account type, frequency, location, time, age, balance, rolling balance, etc., and created a dashboard using Power BI.
- Used Git for version control.
- Deployed the analysis results and created data pipelines to classify future transactions in an AWS EC2 environment, exposing the model via Flask and Docker.
Environment: AWS (EC2, ECS), K-Means, Flask, Git, Agile, Power BI, Isolation Forest, Oracle SQL, EDA (Univariate and Bivariate Analysis), Python, Pandas, scikit-learn.

Education: B.Tech in Computer Science Engineering, Amrita University, India, 2013.