| Sai Sriram - Senior Gen AI Data Engineer |
| [email protected] |
| Location: Charlotte, North Carolina, USA |
| Relocation: Yes |
| Visa: GC |
| Resume file: Sai Sriram- Senior AI Data Engineer Resume_1781712883562.docx |
|
Sai Sriram
Senior AI Data Engineer
[email protected] | +1 (704) 750-0607 | linkedin.com/in/sai-sriram-2803b1229

PROFESSIONAL SUMMARY:
Senior AI Data Engineer with 12+ years of experience building production-grade data platforms, AI/ML pipelines, and enterprise data architectures across healthcare, financial services, insurance, and retail at regulated enterprise scale.
Specialized in engineering end-to-end AI data pipelines for agentic and RAG-based systems, delivering HL7 FHIR ingestion, vector embedding ETL, and feature store architectures that reduced prior authorization decision turnaround from 9 days to 1.4 days.
Constructed vector embedding ETL pipelines using OpenAI Embeddings API, LlamaIndex, and Pinecone, indexing 3.2M+ clinical and 10M+ insurance documents into hybrid BM25 and dense vector retrieval layers with a 40% recall improvement.
Designed and operated real-time streaming data platforms using Apache Kafka, Spark Streaming, and Apache Airflow, processing 2M+ daily payment events, clinical records, and claims documents into model-ready and analytics-ready datasets.
Implemented PHI-safe feature engineering and de-identification pipelines under HIPAA and CMS governance, assembling 120K+ de-identified training datasets for LLM fine-tuning and evaluation corpora for RAG quality assessment.
Engineered Feast Feature Store architectures delivering sub-10ms entity-level feature serving, maintaining online-offline feature parity across GNN fraud detection, demand forecasting, and clinical risk scoring production pipelines.
Built data modeling and schema design frameworks using PostgreSQL, Amazon Redshift, Hadoop HDFS, and Apache Hive, covering 3NF, star/snowflake, partitioning strategies, and query optimization to support BI, ML, and Gen AI workloads.
Established data quality and validation frameworks embedded inside ETL jobs, enforcing anomaly detection, referential integrity, temporal split checks, and label leakage controls across training dataset assembly for all ML and Gen AI models.
Implemented MLOps and LLMOps observability pipelines using MLflow, Langfuse, Arize AI, Ragas, and LangSmith, capturing data lineage, retrieval relevance, embedding drift, and model quality metrics to drive data pipeline improvement.
Built scalable ETL and ELT pipelines using Apache Spark, Airflow, and Hive, landing feature-engineered datasets into AWS S3, Azure Blob Storage, and GCP BigQuery across multi-million POS, clinical, and financial records.
Deployed containerized AI data services using Docker and Kubernetes across AWS ECS Fargate, Azure Kubernetes Service, and GCP Cloud Run, with Jenkins and GitHub Actions CI/CD pipelines automating schema validations.
Leveraged GCP Vertex AI and GCP BigQuery for ML training experiments and fraud analytics reporting alongside AWS SageMaker and Azure ML, improving training cost efficiency and ensuring cloud platform redundancy across enterprise AI.
Implemented HL7 FHIR-compliant REST API integrations across 12 upstream health plan systems on AWS and Azure cloud boundaries, standardizing clinical payload schemas and maintaining HIPAA-compliant data residency.
Architected HIPAA-compliant AI data platforms across AWS (SageMaker, ECS Fargate, Glue), Azure (OpenAI Service, AKS, Data Factory), and GCP (Vertex AI, BigQuery, Cloud Run), distributing workloads intelligently based on cost, compliance, and performance requirements.
Applied SHAP explainability and unified data governance frameworks across fraud, clinical, and claims pipelines on AWS, Azure, and GCP, ensuring cross-cloud audit trail consistency for OCC, CMS, and HIPAA compliance reviews.

TECHNICAL SKILLS:
Programming Languages: Python, Scala, SQL, Java, JavaScript, R, C#, TypeScript, Bash/Shell Scripting
AI Data Platforms, Retrieval & Embeddings: Pinecone, Weaviate, FAISS, pgvector, OpenAI Embeddings API, BM25 Hybrid Search, Semantic Search, Document Chunking, Feast Feature Store, Vector Index Design and Maintenance, Metadata Filtering
Data Engineering, ETL & Streaming: Apache Spark, Spark Streaming, Apache Kafka, Apache Airflow, Hadoop HDFS, Apache Hive, ETL/ELT Pipelines, Batch and Real-Time Processing, Data Modeling (3NF, Star/Snowflake), Feature Engineering for ML/AI, Data Quality and Validation, CDC, Data Lakehouse
Databases, Storage & Caching: PostgreSQL, MySQL, MongoDB, Amazon Redshift, AWS S3 Data Lakes, Redis, Schema Design, Partitioning & Indexing, Query Optimization, Data Warehousing, OLAP/OLTP
ML / AI & NLP Data Workloads: Scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, Keras, LSTM, CNN-LSTM, GNNs, Deep Graph Library, BERT, BioBERT, spaCy, NLTK, Sentence Transformers, Word2Vec, GloVe, SHAP, SMOTE
LLM & GenAI Data Foundations: LangChain, LlamaIndex, GPT-4, GPT-4o, Azure OpenAI Service, AWS Bedrock, Meta Llama 3, RAG Data Pipelines, Embedding ETL, Evaluation Corpora Curation, QLoRA, LoRA, Hugging Face PEFT, SFT, Prompt & Context Schema Design
MLOps / LLMOps & Observability: MLflow, Langfuse, Arize AI, Ragas, LangSmith, Experiment Tracking, Data & Model Lineage, Drift Detection, Automated Retraining, Model Registry, Prompt/Data Regression Testing, CI/CD
Cloud, DevOps & Infrastructure: AWS (EC2, S3, EMR, SageMaker, Bedrock, Glue, ECS Fargate), Azure OpenAI Service, GCP Vertex AI, Docker, Kubernetes, Terraform, FastAPI, Flask, REST APIs, Git, Jenkins, GitHub Actions, CI/CD Pipelines, Linux
Data Visualization & BI: Tableau, Power BI, Grafana, Prometheus, Matplotlib, Seaborn, KPI Dashboards, Pipeline Telemetry, Data Quality Dashboards
Governance & Regulatory Standards: HIPAA, PHI Data Handling, HL7 FHIR, CMS Regulatory Compliance, Prior Authorization Workflows, ICD-9, ICD-10, Financial Risk Data, Insurance Claims Data, OCC Model Risk

PROFESSIONAL EXPERIENCE:

Client: Centene Corporation - St Louis, MO Sep 2024 to Present
Role: Senior AI Data Engineer
Responsibilities:
Architected an end-to-end AI data platform for Centene's prior authorization automation system, designing ingestion, embedding ETL, feature store, and vector retrieval layers serving a LangGraph multi-agent pipeline across 28M members and 4M+ annual requests.
Analyzed Centene's prior authorization data flows across 28M members, profiling claims, eligibility, and clinical feeds to surface ingestion bottlenecks and data quality gaps impacting 4M+ annual auth requests across 14 state health plan markets.
Translated CMS prior authorization mandate requirements into concrete data architecture specifications, pipeline throughput targets, freshness SLAs, and audit-ready lineage standards, collaborating with compliance, clinical ops, and IT security teams.
Mapped source-to-target data contracts for 12 upstream health plan systems, standardizing HL7 and proprietary payloads into a unified prior authorization data model powering both agentic decisioning and downstream analytics workloads.
Engineered HL7 FHIR-compliant ingestion pipelines using Apache Kafka and AWS Glue, streaming clinical notes and eligibility records from 12 upstream health plan systems into an AWS S3-backed data lake with governed staging and curated zones aligned with HIPAA PHI handling and CMS audit trail requirements.
Modeled normalized and dimensional schemas in PostgreSQL for member, provider, policy, and authorization entities, enabling consistent joins, governance enforcement, and high-performance querying for Gen AI, BI, and actuarial workloads.
Constructed vector embedding ETL pipelines using OpenAI text-embedding-3-large and LlamaIndex, chunking and indexing 3.2M clinical policy documents into a Pinecone vector store with versioned metadata and namespace isolation.
Orchestrated hybrid retrieval data services combining Pinecone dense vector search with BM25 sparse retrieval, exposing reusable APIs that improved policy guideline retrieval recall by approximately 40% over the legacy keyword-matching system.
Automated document re-embedding and policy re-indexing schedules using Apache Airflow DAGs triggered by CMS guideline updates, maintaining vector index currency and eliminating stale retrieval across quarterly policy amendment cycles.
Implemented a Redis-backed eligibility cache holding precomputed member profiles and plan entitlement snapshots, cutting eligibility lookup latency by 60%-65% during peak prior authorization decision windows.
Assembled PHI-safe training and evaluation datasets for Llama 3 fine-tuning on AWS SageMaker, orchestrating QLoRA and Hugging Face PEFT training jobs with checkpoint management and model artifact registration, and ran parallel fine-tuning experiments on GCP Vertex AI to compare cloud training cost and GPU throughput profiles before promoting the optimal model into the production Azure OpenAI-backed inference layer.
Provisioned data feeds and schemas powering the LangGraph multi-agent orchestration layer, standardizing request and feature payload formats for clinical criteria extraction, policy retrieval, eligibility verification, and recommendation agents.
Built LLM evaluation data pipelines using Ragas and LangSmith, capturing retrieval relevance, context utilization, and answer quality metrics into structured stores to guide data-driven RAG tuning and pipeline iteration.
Enforced role-based access controls and column-level PHI masking across AWS S3, PostgreSQL, and Azure OpenAI Service, aligning storage, feature, and inference layers with HIPAA security standards and enterprise data governance policies.
Deployed containerized data platform services on AWS ECS Fargate using Docker and Terraform for primary production workloads, provisioned Azure Blob Storage and Azure Data Factory pipelines for cross-regional health plan data exchange, and configured GCP Cloud Storage and GCP Cloud Run as a disaster recovery and overflow compute layer for embedding refresh and re-indexing jobs during peak policy amendment cycles.
Established LLMOps CI/CD workflows using Jenkins and GitHub Actions, automating schema validations, pipeline tests, embedding refresh jobs, and staged rollouts across development, QA, and production environments.
Implemented Spark-optimized batch processing for PHI-safe de-identification and embedding pre-computation, reducing overnight dataset preparation windows by 35%-40% and ensuring clinical data was inference-ready before daily authorization.
Optimized LLM inference cost profiles across Azure OpenAI Service (primary HIPAA-compliant endpoint), AWS Bedrock (secondary redundancy layer), and GCP Vertex AI (tertiary overflow), implementing dynamic traffic routing, request batching, and context window tuning strategies that collectively reduced per-authorization AI infrastructure spend by approximately 40%-45% while maintaining sub-3-second end-to-end latency.
Instrumented pipeline telemetry with Prometheus and Grafana, monitoring ingestion throughput, lag, vector index freshness, cache hit rates, and token utilization to detect issues before they impacted downstream clinician workflows.
Monitored LLM and embedding drift signals via LangSmith trace analytics, wiring Airflow-triggered retraining and re-indexing workflows when decision alignment or retrieval relevance fell below agreed SLA thresholds.
Enabled a 68% straight-through auto-approval rate and reduced average decision turnaround from 9 days to 1.4 days by delivering reliable, governed data pipelines underpinning Centene's agentic AI prior authorization platform.
Environment: Python, Apache Kafka, AWS Glue, AWS S3, AWS ECS Fargate, AWS SageMaker, Apache Airflow, PostgreSQL, Redis, Pinecone, OpenAI Embeddings API, LlamaIndex, LangGraph, LangChain, GPT-4o, Meta Llama 3, QLoRA, Hugging Face PEFT, Azure OpenAI Service, Azure Blob Storage, Azure Data Factory, GCP Vertex AI, GCP Cloud Storage, GCP Cloud Run, Terraform, Docker, Jenkins, GitHub Actions, Ragas, LangSmith, Prometheus, Grafana, HL7 FHIR REST APIs, Git

Client: Liberty Mutual Insurance - Boston, MA Jun 2022 to Aug 2024
Role: Senior Data Engineer
Responsibilities:
Designed an end-to-end claims Gen AI data platform spanning Apache Kafka ingestion, embedding ETL, Pinecone vector indexes, and governed feature stores supporting LLM pipelines and adjuster reporting teams processing 2M+ annual claims.
Assessed Liberty Mutual's legacy claims data flows across 2M+ annual claims, profiling document sources, formats, and latency hotspots that drove 17-18 day adjudication cycles and constrained adjuster capacity across regional processing centers.
Mapped source-to-target data contracts for incident reports, medical records, repair estimates, and policy contracts, standardizing disparate payloads into a unified claims document schema for downstream RAG, analytics, and workflow tools.
Built Apache Kafka ingestion pipelines streaming multi-format claims documents into a centralized processing queue, ensuring durable, ordered event capture for both batch and near-real-time claims processing workloads.
Implemented document staging and normalization layers in PostgreSQL, enforcing consistent identifiers, metadata enrichment, and referential integrity across claims, policy, and customer entities before AI processing.
Developed a preprocessing framework using LangChain text splitters and custom chunking rules, converting 500-page policy documents into semantically coherent segments optimized for high-quality RAG retrieval and reuse.
Generated dense vector embeddings using OpenAI text-embedding-ada-002, indexing 10M+ claims and policy document vectors into Pinecone and FAISS with robust sharding and namespace strategies for sub-second retrieval.
Orchestrated daily document refresh and re-embedding workflows using Apache Airflow DAGs, keeping vector indexes synchronized with new claims, endorsements, and policy amendments without manual backfill intervention.
Integrated pgvector into PostgreSQL to support hybrid relational-plus-vector queries, enabling retrieval workflows that joined semantic search results with structured claim and policy attributes in a single database operation.
Assembled balanced training and evaluation datasets for a Hugging Face BERT claims routing model across product lines and jurisdictions, running fine-tuning jobs on AWS SageMaker as the primary training platform and validating dataset pipeline reproducibility using GCP Vertex AI Pipelines as a secondary experiment environment before promoting the production model into the Azure Kubernetes Service inference layer.
Logged embeddings, prompts, hyperparameters, and evaluation metrics into MLflow, building an auditable catalog of data configurations and experiments across 30+ RAG and classification pipeline iterations.
Tuned Kafka topic partitioning, Airflow scheduling, and Pinecone index parameters, reducing average claims data processing latency and supporting the reduction of adjudication cycle time from 17-18 days to approximately 10-11 days.
Deployed RAG inference and retrieval services as containerized FastAPI microservices on Azure Kubernetes Service, scaling horizontally under regional adjuster query bursts to sustain 2-3 second response SLAs, with AWS ECS handling batch claims processing workloads and GCP Cloud Run provisioned as a serverless overflow layer for traffic spike absorption.
Configured AWS Bedrock as a secondary LLM inference layer alongside OpenAI GPT-4, wiring routing logic and observability so production traffic could fail over without disrupting downstream data contracts.
Provisioned Grafana and Prometheus dashboards tracking ingestion lag, vector index freshness, LLM latency distributions, hallucination flags, and queue throughput across all active claims pipelines in real time.
Implemented automated prompt regression and retrieval quality tests using a 1,200-document golden evaluation dataset, gating deployments when relevance or accuracy metrics degraded after provider model updates.
Optimized Gen AI infrastructure cost across Azure OpenAI Service (primary LLM endpoint), AWS Bedrock, and GCP Vertex AI, implementing request batching, context pruning, and Redis intermediate caching strategies that decreased per-claim AI infrastructure spend while preserving SLA targets across all active claims processing queues.
Enabled an approximately 50%-55% reduction in adjuster document review hours and cut claims cycle time from 17-18 days to 10-11 days.
Environment: Python, Apache Kafka, Apache Airflow, AWS SageMaker, AWS ECS, Azure Kubernetes Service (AKS), Azure OpenAI Service, GCP Vertex AI Pipelines, GCP Cloud Run, LangChain, LlamaIndex, GPT-3.5 Turbo, GPT-4, OpenAI Embeddings API (text-embedding-ada-002), Pinecone, FAISS, pgvector, Hugging Face Transformers, BERT, AWS Bedrock, PostgreSQL, Redis, Grafana, Prometheus, MLflow, FastAPI, Docker, Kubernetes, REST APIs, Git, Jenkins

Client: JPMorgan Chase - New York, NY Aug 2020 to May 2022
Role: Senior Data Engineer
Responsibilities:
Designed the end-to-end fraud data platform architecture spanning Apache Kafka ingestion, Spark Streaming feature computation, transaction graph construction, Feast Feature Store serving, and low-latency REST scoring endpoints for $6T+ daily payment transactions.
Assessed JPMorgan's fraud data landscape, profiling stream sources, schemas, and latency constraints that limited rule-based engines from capturing cross-account fraud patterns at the scale required for enterprise payment authorization workflows.
Designed canonical transaction, account, device, and merchant schemas in PostgreSQL and partitioned Parquet data lake tables on AWS S3, and provisioned GCP BigQuery as a data warehouse for fraud pattern reporting, risk analytics, and model performance dashboards consumed by fraud operations and compliance teams across JPMC's hybrid cloud infrastructure.
Constructed Spark Streaming ingestion pipelines consuming Kafka transaction event streams, cleansing and normalizing 2M+ daily payment events into a centralized feature computation layer with strict ordering and idempotency guarantees.
Engineered dynamic transaction graphs using Deep Graph Library, linking accounts, devices, merchants, and IPs to compute 180-200 structural and behavioral features capturing multi-hop fraud ring patterns for GNN model training.
Partitioned and stored transaction graph datasets in AWS S3 using Parquet columnar formatting, improving offline GNN training I/O efficiency by 30%-35%, and replicated curated graph feature datasets to GCP Cloud Storage for distributed training jobs run on GCP AI Platform, enabling cross-cloud GNN experiment comparison and cost-optimized GPU utilization.
Implemented batched feature engineering jobs in Apache Spark, generating rolling aggregates, velocity features, peer-group statistics, and relationship metrics, feeding GNN models and downstream risk dashboards.
Provisioned a Feast Feature Store serving precomputed entity-level features with sub-10ms retrieval latency, ensuring consistent online-offline feature parity across GNN training and real-time fraud scoring inference paths.
Automated daily feature refresh and transaction graph reconstruction workflows using Apache Airflow DAGs, maintaining up-to-date entity representations across 300M+ customer accounts without manual engineering intervention.
Assembled high-quality training datasets for PyTorch GNN and XGBoost ensemble models, enforcing temporal splits, label leakage checks, and balanced fraud/non-fraud sampling strategies across all model development cycles.
Logged dataset versions, graph schema configurations, and experiment metadata into MLflow, enabling full traceability from production models back to underlying data snapshots and feature definitions for compliance reviews.
Integrated Redis as a low-latency cache for high-frequency entity risk scores and recent transaction aggregates, reducing redundant graph traversals and compute load by 55%-60% during fraud spike events.
Connected fraud scoring outputs to downstream payment authorization systems via Kafka topics and REST APIs, embedding real-time fraud decisions into existing transaction flows without architectural disruption.
Applied SHAP-based feature attribution to ensemble outputs, persisting explanation summaries to ensure fraud decisions could be tied back to specific features for model risk, OCC regulatory reviews, and internal governance audits.
Implemented full-stack pipeline observability using Prometheus and Grafana on AWS infrastructure, integrated GCP Cloud Monitoring for GCP-hosted training job telemetry, and established unified alerting dashboards tracking ingestion lag, feature freshness, inference latency, and error rates across all AWS and GCP platform components to detect drift before it impacted fraud scoring accuracy.
Improved fraud detection recall by 20%-22% and reduced false-positive case volume by 35%-38%.
Environment: Python, Apache Spark, Spark Streaming, Apache Kafka, Apache Airflow, AWS S3, AWS EC2, AWS SageMaker, BigQuery, GCP Cloud Storage, GCP AI Platform, GCP Cloud Monitoring, PyTorch, Deep Graph Library, XGBoost, SHAP, Feast Feature Store, MLflow, Redis, FastAPI, Flask, Docker, Kubernetes, Grafana, Prometheus, PostgreSQL, SQL, Parquet, Git, Jenkins

Client: Walmart - Bentonville, AR Sep 2018 to Jul 2020
Role: Data Engineer
Responsibilities:
Designed the end-to-end demand forecasting data architecture spanning POS, weather, and promotion feed ingestion into Hadoop HDFS, curated Apache Hive layers, Spark feature computation, and Flask model-serving APIs for 4,700+ store locations.
Audited POS and inventory data across 4,700+ store locations, profiling data quality issues and latency gaps driving recurring shelf stockouts and overstock accumulation across key product categories and seasonal demand cycles.
Partnered with supply chain and replenishment teams to convert informal restocking rules into formal ML forecasting requirements, defining MAPE targets, forecast horizons, and store-SKU coverage expectations for demand prediction.
Built end-to-end Spark ETL pipelines on Hadoop HDFS, ingesting daily POS records and weather feeds into partitioned Apache Hive tables, and configured GCP Cloud Storage and GCP BigQuery as a parallel analytical layer, enabling supply chain and merchandising teams to run ad-hoc demand analytics and forecast accuracy queries at scale.
Implemented robust data validation and reconciliation checks inside ETL jobs, detecting anomalies in transaction volumes, missing calendar entries, and weather feed gaps before they impacted downstream model training accuracy.
Engineered 130-140 time-series and categorical features in Apache Spark, including rolling windows, holiday proximity flags, store-cluster encodings, and markdown indicators, tailored to different product formats and store types.
Created reusable feature computation libraries in Python and Scala, standardizing feature definitions and ensuring the same logic applied consistently across historical training datasets and real-time scoring pipelines.
Automated daily feature refresh and pipeline reprocessing workflows using Apache Airflow DAGs, reducing manual pipeline intervention by 65%-70% and keeping store-SKU feature sets continuously up to date.
Introduced historical snapshot tables and backfill processes for late-arriving POS and promotion data, enabling training datasets to reflect the information available at prediction time accurately and preventing label leakage.
Logged all training datasets, feature set versions, and model artifacts into MLflow on AWS S3-backed artifact storage, capturing complete lineage from raw HDFS partitions to deployed models, and ran parallel LSTM forecasting experiments on GCP Vertex AI to benchmark cloud training performance and cost efficiency against the primary SageMaker infrastructure.
Packaged demand forecasting models as Flask REST API services with request/response contracts designed around store, SKU, and horizon inputs sourced directly from curated Hive datasets and the feature store layer.
Deployed containerized inference services on AWS EC2 with Docker and auto-scaling groups tuned for peak promotional periods, sustaining 10,000-12,000 concurrent prediction requests without performance degradation.
Reduced inventory overstock by approximately 16%-18% and improved shelf availability to 95%-96%.
Environment: Python, Scala, Apache Spark, Hadoop HDFS, Apache Hive, Apache Airflow, Apache Kafka, AWS S3, AWS EC2, AWS EMR, AWS SageMaker, GCP Vertex AI, GCP BigQuery, GCP Cloud Storage, GCP Cloud Run, XGBoost, LightGBM, LSTM, TensorFlow 2.0, Keras, MLflow, Flask, Docker, Tableau, SQL, Git, Jenkins

Client: Grapesoft Solutions - Hyderabad, India Jul 2016 to Jun 2018
Role: Data Engineer
Responsibilities:
Analyzed multi-client text data landscapes across product reviews, feedback forms, and support tickets, profiling source quality and volume distributions to scope NLP sentiment data pipelines for e-commerce and enterprise clients.
Outlined the end-to-end NLP data architecture spanning text ingestion, preprocessing layers, embedding computation, training dataset generation, and online inference endpoints backed by PostgreSQL and MongoDB databases.
Designed normalized schemas in PostgreSQL for reviews, products, users, and labels, alongside MongoDB collections for semi-structured text payloads and experiment metadata storage supporting iterative model development.
Built modular text preprocessing pipelines using NLTK and spaCy, standardizing tokenization, stopword removal, lemmatization, and sequence padding across multiple client domains and enterprise text corpora.
Constructed Word2Vec and GloVe embedding ETL flows that trained and loaded pre-trained vectors, mapped vocabularies, and generated dense 300-dimensional embeddings for downstream LSTM sentiment classification models.
Implemented ingestion jobs in Python to pull customer reviews and support tickets from REST APIs and client databases into centralized staging tables, enforcing idempotent loads and quality checks on key fields.
Assembled training, validation, and test splits with careful handling of class imbalance, deduplication, and temporal leakage, ensuring robust performance estimates across diverse client datasets and product domains.
Created document-level indexing and query strategies in PostgreSQL and MongoDB to support fast sampling of labeled and unlabeled text for iterative model training, evaluation, and systematic error analysis.
Developed reusable feature generation modules in Python to derive TF-IDF vectors, n-gram statistics, and sentiment lexicon scores, enabling flexible experimentation across classical and deep-learning NLP pipelines.
Trained LSTM and CNN-LSTM models in TensorFlow 1.x and Keras on Word2Vec embeddings, maintaining clear separation between data loading, batching, and checkpoint management for reproducibility across client projects.
Delivered a 25%-28% improvement in client review categorization accuracy across e-commerce and enterprise support datasets, enabling clients to automate sentiment-driven product and service analytics workflows at production scale.
Environment: Python, TensorFlow 1.x, Keras, PyTorch, Scikit-learn, NLTK, spaCy, Word2Vec, GloVe, LSTM, CNN-LSTM, TF-IDF, SVM, Naïve Bayes, Pandas, NumPy, PostgreSQL, MongoDB, Apache Spark 2.x, Flask, REST APIs, Docker, Git, Tableau, Agile/Scrum

Client: Apollo Hospitals - Hyderabad, India Jun 2014 to Jun 2016
Role: Data Analyst
Responsibilities:
Analyzed Apollo's EHR data landscape across multiple hospital systems, profiling ICD-9 coding standards, table structures, and data quality issues impacting patient readmission risk modeling and clinical reporting accuracy.
Partnered with clinical SMEs and hospital IT teams to translate readmission risk use cases into concrete data pipeline requirements, defining cohort boundaries, refresh frequencies, and retention policies for EHR analytics.
Outlined end-to-end data architecture for readmission risk scoring, spanning source EHR databases, staging schemas, feature engineering jobs, model input datasets, and downstream risk-report distribution feeds.
Designed unified patient encounter schemas in MySQL and PostgreSQL, standardizing keys for encounters, diagnoses, procedures, medications, and lab results to support cross-hospital analytics and longitudinal patient histories.
Built ETL jobs in Python, Pandas, and SQL to integrate ICD-9 diagnosis codes, lab results, medication history, and discharge summaries from multiple transactional systems into consolidated 2.1M-record analytical datasets.
Implemented robust data cleansing routines covering missing value imputation, outlier handling, and categorical encoding, reducing noise in training datasets and stabilizing downstream model performance across diabetic and cardiac cohorts.
Optimized relational schemas using indexing strategies, partitioning by encounter date and diagnosis group, and stored procedures, decreasing clinical analytics query runtimes by 32%-36% across departmental reporting and batch extraction workflows.
Developed clinical text preprocessing pipelines with NLTK and TF-IDF vectorization, converting physician notes and discharge summaries into structured features integrated with tabular EHR attributes for richer readmission models.
Assembled reproducible training, validation, and test datasets from EHR snapshots, enforcing temporal splits and leakage checks so readmission labels never contaminated feature windows at train time.
Provisioned analytical sandboxes on Hadoop and Spark 1.x, exporting curated patient encounter datasets into HDFS and enabling distributed feature computation for large clinical cohorts.
Delivered a 22%-25% improvement in high-risk patient identification accuracy, enabling proactive clinical intervention and reducing preventable readmissions across inpatient departments through reliable, well-governed EHR data pipelines.
Environment: Python, R, Scikit-learn, XGBoost, Pandas, NumPy, SciPy, NLTK, TF-IDF, Theano, Matplotlib, Seaborn, Jupyter Notebook, MySQL, PostgreSQL, Apache Hadoop, Apache Spark 1.x, SVM, Random Forest, Logistic Regression, SMOTE, ICD-9, Git, Shell Scripting, Cron Scheduling, Agile/Scrum

Education:
Bachelor of Engineering in Information Technology - Gayatri Vidya Parishad College Of Engineering, India | 2014 |