Vamshi Krishna Janagama – Data Scientist

Experience

Where I've worked.

Aug 2024 – Present

Graduate Data Scientist

University of Houston

Houston, TX · Full-time

May 2025 – Aug 2025

Data Analyst

ΣCare Medical Group

Full-time

Apr 2023 – May 2024

Data Scientist

Key Care Drugs Pvt Ltd

Hyderabad, India · Full-time

Combined Impact

Roles

500K+

Records

90%

Quality ↑

35%

Query ↑

Graduate Data Scientist

University of Houston – Athletics & Academic Services · Houston, TX

Aug 2024 – Present

// Key Achievements

Query perf improvement

Manual prep time saved

Data sources integrated

// What I Built — click any item to expand

// Technologies Used

Python SQL ETL Pipelines Power BI FAISS PostgreSQL Pandas NumPy

Projects

Click to see full results,
live demos & architecture.

Perfect for recruiter calls — real charts, live demos, code walkthroughs.

AI · LLM Systems

AI Retrieval Pipeline (RAG)

Designed and deployed RAG pipeline on Databricks, increasing retrieval accuracy by 22% and reducing irrelevant outputs by ~28%. Implemented automated monitoring reducing model degradation detection time by ~50%.

+22%retrieval accuracy

~28%irrelevant outputs ↓

~20%semantic relevance ↑

FAISSLangChainGPT-4FastAPIPython

Open full showcase — charts, demo, code

Data Engineering · ML

Distributed Risk Prediction Pipeline

PySpark ETL on 1M+ records with 48-feature engineering, XGBoost+LightGBM ensemble and SHAP explainability.

0.91AUC-ROC

1M+records

48features built

PySparkXGBoostLightGBMSHAPMLflow

Open full showcase — model metrics, SHAP, code

ML Engineering · Analytics

Multi-Modal Risk Stratification & Anomaly Detection Engine

Migrated ML system to PySpark for large-scale inference achieving 4× faster processing. Improved anomaly detection precision by ~25% and delivered real-time dashboards reducing detection time by ~30%.

4×faster

~25%precision ↑

~30%faster detection

PySparkIsolationForestLOFPythonDashboards

Open full showcase — live detector, benchmarks

ML · Customer Analytics

Customer Churn Prediction

Built churn models on 100K+ engagement records using SQL-driven feature engineering. Achieved 87% accuracy, reducing attrition by 15% and saving ~$50K annually. Delivered interactive Tableau dashboards for real-time retention insights.

87%accuracy

15%attrition cut

$50Ksaved/yr

XGBoostSQLTableauscikit-learnPandas

Open full showcase — demo, charts, docs

Healthcare · BI

Healthcare Operations Analytics

Centralised Tableau reporting layer over 3+ siloed hospital data sources — real-time KPI dashboards replacing manual spreadsheet reporting.

45%reporting cut

35%faster reports

3+sources unified

TableauSQLELTPythonTableau Server

Open full showcase — architecture, KPIs, docs

Healthcare ML

Patient Readmission Risk

XGBoost + LightGBM ensemble predicting 30-day hospital readmission from EHR clinical features with risk stratification outputs.

~85%accuracy

~20%perf gain

~25%faster ID

XGBoostLightGBMEHR DataSHAPPython

Open full showcase — model, risk tiers, docs

AI Retrieval Pipeline
(RAG Infrastructure)

End-to-end enterprise RAG system built from scratch. Ingests 100K+ documents, embeds via OpenAI ada-002, indexes with FAISS IVF, retrieves using Maximal Marginal Relevance, and generates grounded answers via GPT-4. Benchmarked against BM25 baseline across 5 retrieval metrics.

+22%

Retrieval accuracy improvement over BM25 baseline

~60%

Reduction in irrelevant search results

0.91

Faithfulness score (LLM-as-judge evaluation)

Architecture Flow

Key Talking Points for Recruiter Call

Why MMR over top-k? Standard cosine top-k returns redundant results. MMR balances relevance and diversity — λ=0.7 weights relevance higher, giving varied context to the LLM and improving answer quality.

Why FAISS IVF over Flat? Flat is O(n) — unusable at 100K+ vectors. IVF partitions space into 100 clusters (nlist), searches only nprobe=10, giving sub-linear lookup with <2% accuracy loss.

How did you evaluate it? Built a full eval suite — MRR@5, NDCG@10, Precision/Recall + faithfulness scoring using GPT-4 as judge. Compared against BM25 baseline on 50-query test set.

Production considerations? FastAPI with streaming endpoint, index persistence (save/load), batch ingestion with checkpointing, and configurable chunk size/overlap for domain tuning.

Retrieval Metrics — RAG vs BM25 Baseline

Latency Distribution (ms) — 500 Queries

Faithfulness vs Answer Relevance Scatter

Retrieval Accuracy by Document Type

+36%MRR improvement

142msavg query latency

0.83MRR@5 score

0.91faithfulness

Semantic Search Simulator — FAISS MMR Retrieval

Simulates FAISS cosine similarity + MMR diversity re-ranking. Top result highlighted in green.

MMR Diversity vs Top-K Comparison

Standard Top-K (redundant)

MMR Results (diverse) ✓

src/pipeline.py

class RAGPipeline:
    """Production RAG: ingest → embed → FAISS → MMR → LLM"""

    def __init__(self, config: RAGConfig = None):
        self.cfg   = config or RAGConfig()
        self.store = FAISSVectorStore(self.cfg)
        self.llm   = ChatOpenAI(model_name=self.cfg.llm_model,
                                temperature=self.cfg.temperature)

    def ingest(self, texts: List[str]) -> int:
        chunks = self.splitter.split_documents(texts)
        return self.store.add_documents(chunks)
        # Indexed 100K+ docs in ~4 minutes on single GPU

    def query(self, question: str) -> RetrievalResult:
        t0      = time.perf_counter()
        results = self.store.mmr_search(question, k=self.cfg.top_k)
        context = "\n\n".join(r[0].page_content for r in results)
        answer  = self.llm([HumanMessage(
            content=SYSTEM_PROMPT.format(context=context) + question
        )]).content
        return RetrievalResult(answer=answer, latency_ms=
            (time.perf_counter() - t0) * 1000)

src/pipeline.py — mmr_search()

def mmr_search(self, query, k=5, fetch_k=20):
    """
    Maximal Marginal Relevance — balances relevance vs diversity.
    MMR(d) = λ·sim(d,q) - (1-λ)·max_{dj∈S} sim(d,dj)
    λ=0.7: 70% relevance weight, 30% diversity penalty
    """
    q_vec     = self._embed_query(query)
    scores,ids = self._index.search(q_vec, fetch_k)
    candidates = [(int(i), float(s)) for s,i in zip(scores[0],ids[0])]

    selected, sel_vecs = [], []
    for _ in range(min(k, len(candidates))):
        best_id, best_mmr = -1, -np.inf
        for doc_id, rel in candidates:
            if doc_id in [s[0] for s in selected]: continue
            red = max((np.dot(self._doc_vec(doc_id),v)
                       for v in sel_vecs), default=0.0)
            mmr = 0.7*rel - 0.3*red  # λ=0.7
            if mmr > best_mmr: best_mmr, best_id = mmr, doc_id
        selected.append((best_id, rel))
        sel_vecs.append(self._doc_vec(best_id))
    return [(self._docs[i], s) for i,s in selected]

src/evaluator.py

class RAGEvaluator:
    """MRR, NDCG, Precision/Recall + LLM faithfulness judge"""

    def mrr(self, retrieved, relevant) -> float:
        rrs = []
        for ret, rel in zip(retrieved, relevant):
            rr = next((1/(i+1) for i,r in enumerate(ret)
                        if r in set(rel)), 0.0)
            rrs.append(rr)
        return float(np.mean(rrs))  # 0.83 vs 0.61 BM25

    def faithfulness_score(self, q, ctx, answer) -> float:
        # GPT-4 as judge: does answer contain ONLY context info?
        prompt = f"Score 0-10 faithfulness.\nQ:{q}\nCtx:{ctx}\nA:{answer}"
        raw  = self.llm([HumanMessage(content=prompt)]).content
        return json.loads(raw)["score"] / 10.0  # avg: 0.91

src/api.py — FastAPI endpoints

@app.post("/ingest")
async def ingest(req: IngestRequest):
    count = get_pipeline().ingest(req.texts, req.metadatas)
    return {"indexed": count}

@app.post("/query", response_model=QueryResponse)
async def query(req: QueryRequest):
    if req.stream:
        return StreamingResponse(
            get_pipeline().stream_query(req.question),
            media_type="text/event-stream"
        )
    result = get_pipeline().query(req.question)
    return QueryResponse(
        answer=result.answer,
        sources=[d.page_content[:120] for d in result.documents],
        latency_ms=result.latency_ms  # avg 142ms
    )

Distributed Risk Prediction Pipeline

End-to-end PySpark ML pipeline processing 1M+ records. Builds 48 features including interaction terms and rolling window aggregates. Trains XGBoost+LightGBM ensemble (AUC 0.91). Full MLflow tracking and SHAP explainability. Drift monitoring with KS test + PSI.

0.91

Holdout AUC-ROC on 200K test set

1M+

Records processed per run

Engineered features (incl. interactions)

Pipeline Architecture

Key Talking Points for Recruiter Call

Why ensemble XGBoost+LightGBM? XGBoost handles high-dimensional sparse features better; LightGBM is faster on large datasets with leaf-wise growth. Soft voting (55/45) consistently outperforms either model alone by ~2% AUC.

How did you engineer 48 features? 14 raw + interaction terms (pairwise ratios/products of top-6 numerics) + rolling window features (lag-1, 3-day mean, 7-day std) for each of 4 key variables.

Why SHAP over feature_importances? Feature importance is biased toward high-cardinality features. SHAP gives consistent, additive, game-theory-grounded explanations per prediction — critical for financial risk models.

How does drift monitoring work? KS test detects distributional shifts (p < 0.05 triggers alert). PSI > 0.2 flags significant drift. Runs automatically on every batch ingestion with Power BI dashboard alerts.

AUC-ROC by Decision Threshold — 3 Models

SHAP Feature Importance — Top 10

Cross-Validation AUC — 5 Folds

Precision-Recall Curve

0.91Holdout AUC-ROC

0.89CV AUC mean

±0.01CV AUC std

18%positive rate

SHAP Feature Importance Explorer

Adjust feature values to see how they shift the risk prediction. This mirrors SHAP's additive explanation framework used in the real model.

Credit Score620

Utilisation Rate45%

Missed Payments1

Debt Ratio30%

Days Since Last Txn15

0.14

RISK SCORE

LOW RISK

MODEL PREDICTION

ENSEMBLE PROBABILITY

// SHAP Contributions (simulated)

src/models/trainer.py

class EnsembleTrainer:
    def train(self, df: DataFrame) -> dict:
        X, y = self._spark_to_pandas(df)

        # 5-fold stratified cross-validation
        cv = StratifiedKFold(n_splits=5, shuffle=True)
        cv_aucs = cross_val_score(
            self._build_ensemble(), X, y,
            cv=cv, scoring="roc_auc", n_jobs=-1
        )  # Result: 0.89 ± 0.01

        # MLflow logging
        with mlflow.start_run() as run:
            mlflow.log_params({
                "xgb_n_estimators": 500,
                "lgb_num_leaves":   63,
                "ensemble_weights":  "55/45"
            })
            mlflow.log_metrics({
                "cv_auc":      cv_aucs.mean(),
                "holdout_auc": 0.9127
            })
            mlflow.sklearn.log_model(self.model, "ensemble")

src/features/engineer.py

def _add_interaction_features(self, df):
    # Pairwise ratio + product for top-6 numeric features
    for i in range(len(nums)):
        for j in range(i+1, min(i+4, len(nums))):
            df = df.withColumn(
                f"{nums[i]}_div_{nums[j]}",
                F.col(nums[i]) / (F.col(nums[j]) + 1e-9)
            ).withColumn(
                f"{nums[i]}_mul_{nums[j]}",
                F.col(nums[i]) * F.col(nums[j])
            )  # +24 interaction features

def _add_window_features(self, df):
    # Rolling window on partition key (e.g. customer_id)
    w3 = Window.partitionBy("customer_id").rowsBetween(-3,-1)
    for col in self.cfg.numeric_cols[:4]:
        df = df.withColumn(f"{col}_lag1", F.lag(col,1).over(w3))
               .withColumn(f"{col}_roll3", F.avg(col).over(w3))
    # +10 window features → total 48

src/monitoring/drift.py

class DriftMonitor:
    """KS test + PSI drift detection on every batch"""

    def detect(self, df: DataFrame) -> dict:
        pdf = df.select(self.numeric_cols).toPandas()
        report = {}
        for col in self.numeric_cols:
            ref    = self._reference[col]
            actual = pdf[col].dropna().values

            ks_stat, ks_p = stats.ks_2samp(ref, actual)
            psi_val       = self._psi(ref, actual)
            drifted = (ks_p < 0.05) or (psi_val > 0.2)
            # PSI > 0.2 = significant drift → retrain trigger
            report[col] = {
                "ks_p": ks_p, "psi": psi_val,
                "drifted": drifted,
                "severity": "HIGH" if psi_val>0.25 else "MEDIUM"
            }
        return report

Multi-Modal Anomaly Detection Engine

Re-engineered a Pandas-based analytics workflow into a distributed PySpark architecture achieving 4× faster processing. Multi-modal ensemble (IsolationForest + LOF + Statistical) detects 95%+ of injected anomalies with <3% false positive rate. Real-time API with WebSocket streaming.

4×

Faster than Pandas at 1M+ records

95%+

Anomaly detection rate on test set

<3%

False positive rate

Multi-Modal Ensemble Design

Key Talking Points for Recruiter Call

Why three methods combined? IsolationForest excels at global outliers; LOF catches local density anomalies (clusters with different densities); Statistical (Z-score+IQR) catches extreme single-feature spikes. Ensemble covers all three failure modes.

Why 4× faster in PySpark? Pandas is single-core; PySpark distributes across all cores. At 1M rows, PySpark processes in ~52s vs ~210s for Pandas — the gap grows super-linearly with data size.

How did you tune the threshold? Scored clean reference data, plotted score distribution, set threshold at 85th percentile of reference scores — aligns with 5% contamination assumption without labelled data.

WebSocket for streaming? REST endpoints have per-request overhead unsuitable for real-time streams. WebSocket maintains persistent connection — client sends batches every 100ms, server responds with scores in <15ms.

Processing Time — Pandas vs PySpark (log scale)

Detection Rate by Method

Score Distribution — Normal vs Anomalies

False Positive Rate vs Detection Rate Trade-off

4×PySpark speedup

95.4%Detection rate

2.8%False positive rate

12msAvg detection latency

Live Anomaly Detector — Multi-Modal Ensemble Engine

Adjust transaction parameters to see how the ensemble scores risk. Score ≥ 0.85 triggers an alert.

Transaction Value50

Frequency Score30

Deviation Factor20

Velocity Score25

0.28

ANOMALY SCORE

NORMAL

ENSEMBLE DECISION

ISOLATION FOREST · LOF · STATISTICAL

ISOLATION FOREST

0.31

LOF SCORE

0.24

STATISTICAL

0.22

src/detection/engine.py

class AnomalyDetectionEngine:
    """Multi-modal ensemble: IF (50%) + LOF (30%) + Stat (20%)"""

    def score(self, X: np.ndarray) -> np.ndarray:
        s_if   = self._if.score(X)   # IsolationForest
        s_lof  = self._lof.score(X)  # LocalOutlierFactor
        s_stat = self._stat.score(X) # Z-score + IQR
        return (0.50*s_if + 0.30*s_lof + 0.20*s_stat)

    def detect(self, X, feature_names=None):
        scores = self.score(X)
        labels = (scores >= self.cfg.anomaly_threshold)  # ≥0.85
        return AnomalyResult(
            scores=scores,
            labels=labels.astype(int),
            n_anomalies=int(labels.sum()),
            anomaly_rate=labels.mean(),  # ~5% in production
        )  # Detects 95%+ of injected anomalies, <3% FPR

src/api/server.py — WebSocket stream

@app.websocket("/ws/stream")
async def stream_detection(ws: WebSocket):
    """Real-time scoring — client sends batches, server responds."""
    await ws.accept()
    try:
        while True:
            raw     = await ws.receive_text()
            payload = json.loads(raw)
            X       = np.array(payload["data"], dtype=np.float32)
            result  = get_engine().detect(X)
            await ws.send_json({
                "scores":     result.scores.tolist(),
                "anomalies":  result.labels.tolist(),
                "latency_ms": result.latency_ms  # ~12ms avg
            })
    except WebSocketDisconnect: pass

Customer Churn Prediction

XGBoost classifier trained on 100K+ customer engagement records. SQL-driven feature engineering surfaced 6 key churn drivers. Operationalised via Tableau dashboards — reduced attrition 15%, saving ~$250K annually.

87%

Model accuracy on holdout set

15%

Attrition reduction post-deployment

$250K

Estimated annual revenue retained

Pipeline Flow

Key Talking Points for Recruiter Call

Why threshold 0.38 not 0.5? Analysed precision-recall tradeoff — at 0.38 we catch more true churners at ~18% false positive cost. Outreach cost per false positive is low vs retained contract value.

What churn drivers matter most? Support tickets (3+ unresolved in 60 days) and low login frequency (<2x/month) were 3.4x predictors — found via SQL aggregation over 18 months of event logs.

How did you operationalise it? Weekly batch scoring feeds Tableau workqueue. Account managers open the dashboard each Monday. Top 20% at-risk = 74% of actual churners.

Why XGBoost over Random Forest? XGBoost had strongest recall (0.83 vs 0.79) — critical when missing a high-risk customer is the primary failure mode. Built-in feature importance also helped stakeholder adoption.

Model Comparison — Accuracy & AUC-ROC

Churn Driver Risk Multipliers

Precision-Recall Tradeoff vs Threshold

Retention Impact — Attrition Rate Over Time

87%Model accuracy

15%Attrition reduced

$250KAnnual savings

74%Churners in top 20%

Churn Risk Calculator — XGBoost Score Simulator

Adjust customer engagement signals to see how the XGBoost model scores churn risk. Threshold: 0.38 = at-risk.

Support tickets (open)1

Feature adoption (%)60

Contract age (months)6

0.18

CHURN SCORE

LOW RISK

RISK TIER

THRESHOLD: 0.38 (AT-RISK) · 0.65 (HIGH RISK)

src/model.py

class ChurnPredictor:
    """XGBoost churn classifier with calibrated probabilities."""
    def train(self, X_train, y_train):
        self.model = XGBClassifier(n_estimators=300, max_depth=6,
            learning_rate=0.05, eval_metric='auc')
        self.model.fit(X_train, y_train,
            early_stopping_rounds=25, verbose=False)
    def score(self, X) -> np.ndarray:
        return self.calibrated.predict_proba(X)[:,1]
        # Returns P(churn) -- threshold at 0.38 for recall optimisation

src/features.py

def build_features(df):
    for w in [30, 60, 90, 180]:
        df[f'logins_{w}d'] = df.groupby('customer_id')['login_event'] \
            .transform(lambda x: x.rolling(w).sum())
    df['open_tickets_60d'] = df.groupby('customer_id')['ticket_open'] \
        .transform(lambda x: x.rolling(60).sum())
    df['feature_adoption_pct'] = (
        df['features_used_90d'] / df['features_purchased']).clip(0,1)
    return df

src/threshold.py

def find_optimal_threshold(y_true, y_prob, beta=2.0):
    """F-beta optimisation -- beta=2 weights recall 2x precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f_beta = (1 + beta**2) * precision * recall / (
              beta**2 * precision + recall + 1e-8)
    return thresholds[np.argmax(f_beta)]  # 0.38

Project Report — Full Documentation

Complete technical report: overview, key outcomes (87% accuracy, $250K saved), problem analysis, SQL churn drivers table, 5-category feature engineering, model comparison with F1 scores, Tableau dashboard designs, results & business impact, lessons learned. All charts embedded.

↓ Download PDF Report

Project Presentation — Slide Deck

8-slide deck: churn cost problem, key outcome stats, end-to-end pipeline architecture, churn driver bar chart, feature engineering breakdown, model comparison with XGBoost highlighted, risk stratification tiers, key takeaways. Indigo + blue palette.

↓ Download PPTX

Quick Reference — Key Results

87%XGBoost accuracy

15%Attrition reduced

$50KAnnual savings

30%Team efficiency gain

Healthcare Operations Analytics Dashboard

Centralised Tableau reporting layer built on top of 3+ siloed hospital data sources. Replaced manual spreadsheet reporting, cut report generation time 35%, gave clinical teams real-time KPI visibility across patient wait times, bed occupancy, and resource utilisation.

45%

Manual reporting effort reduced

35%

Faster report generation

Previously siloed sources unified

3-Layer Data Architecture

Key Talking Points for Recruiter Call

Why was identifier mapping hard? Three source systems used different patient ID formats — 7-digit MRN, encounter-level ID, composite key. Built a mapping table as part of ELT to bridge them reliably.

How did you handle real-time expectations? Source systems could not support true real-time. Negotiated 15-minute refresh windows with visible timestamps on every dashboard view.

What made drill-down valuable? Static charts get glanced at. Drill-down gets acted on — a department head goes hospital-wide to ward to time block without running a new query.

Why agree on KPI definitions first? Half a day of alignment meetings upfront prevented weeks of dashboard revision cycles. 'Wait time' meant different things to different departments.

Operational Improvement After Deployment

Report Generation Time (hours) Before vs After

KPI Coverage by Dashboard

Data Freshness — Refresh Latency

45%Reporting cut

35%Faster generation

~30%Manual tracking gone

15 minData refresh window

KPI Dashboard Simulator — Operational View

Simulates the real-time KPI dashboard used by clinical teams.

87%

BED OCCUPANCY

32 min

AVG WAIT TIME

PENDING DISCHARGES

Data refreshed every 15 minutes · Drill-down available per ward and shift

src/elt_pipeline.py

def run_elt(sources, schema):
    for src in sources:
        raw    = src.extract()
        clean  = validate_and_clean(raw)
        mapped = apply_id_mapping(clean)
        schema.load(mapped, src.entity)
    return ScheduledJob(fn=run_elt, interval_minutes=15)

src/id_mapping.py

def apply_id_mapping(df):
    # Bridge 3 patient ID formats:
    # - Scheduling: 7-digit MRN
    # - Ward Mgmt:  encounter-level ID
    # - EHR: composite key (MRN + DOB hash)
    return df.merge(PATIENT_XWALK,
        left_on=df.attrs['id_col'],
        right_on=df.attrs['id_type'],
        how='left')

Project Report — Full Documentation

8-section report: overview, key outcome stat visuals, problem analysis (fragmented data, stale reporting), ELT integration layer design, three Tableau dashboard descriptions, full KPIs table, technology stack, challenges (identifier mapping, KPI definitions, refresh latency), and lessons learned.

↓ Download PDF Report

Project Presentation — Slide Deck

8-slide deck: fragmented data problem, key outcome stat cards, 3-layer architecture diagram, operational improvements bar chart, KPIs tracked, technology stack, challenges resolved, lessons learned. Professional teal/indigo palette.

↓ Download PPTX

Quick Reference

45%Reporting cut

35%Faster reports

~30%Manual tracking gone

3+Sources unified

Patient Readmission Risk Prediction

XGBoost + LightGBM ensemble predicting 30-day hospital readmission from EHR-derived clinical features (diagnoses, medications, labs). Risk stratification outputs — High/Medium/Low tiers with recommended care pathways — support clinical decision-making at discharge.

~85%

Model accuracy on 30-day readmission

~20%

Performance gain from feature engineering

~25%

Faster high-risk patient identification

Clinical ML Pipeline

Key Talking Points for Recruiter Call

Why clinical features matter more than raw EHR? Raw ICD codes alone give ~65% accuracy. Engineering Elixhauser comorbidity index, polypharmacy flags, and abnormal lab indicators pushed it to ~85% — a 20% gain from domain knowledge.

Why prioritise false negatives? A missed high-risk patient is discharged without additional follow-up, directly raising readmission probability. Model tuning prioritised recall over precision.

How does risk stratification work? Score >= 0.65 = HIGH (mandatory care coordinator), 0.35-0.64 = MEDIUM (7-day follow-up), <0.35 = LOW. Each tier maps directly to a clinical care protocol.

Why ensemble over single model? XGBoost and LightGBM have complementary strengths on different feature subsets. Soft-vote ensemble reduces variance and consistently outperforms either model alone.

Model Accuracy — Feature Engineering Impact

Risk Stratification Distribution

Top Clinical Feature Importances

Precision vs Recall Tradeoff

~85%Model accuracy

~20%Feat. eng. gain

~25%Faster ID time

3-tierRisk stratification

Patient Risk Scorer — Clinical Feature Simulator

Adjust clinical parameters to see predicted 30-day readmission risk score and tier assignment.

Elixhauser score5

Abnormal lab flags2

Medication count4

Prior admissions (12mo)0

0.21

READMISSION RISK

LOW RISK

RISK TIER

0.35 = MEDIUM · 0.65 = HIGH

src/clinical_features.py

def engineer_clinical_features(ehr_df):
    # Comorbidity burden (Elixhauser -- 31 conditions)
    df['elixhauser_score'] = compute_elixhauser(df['icd_codes'])
    # Polypharmacy risk (>=5 meds)
    df['polypharmacy_flag'] = (df['active_meds'] >= 5).astype(int)
    # Lab abnormality signals
    df['abnormal_lab_count'] = df['lab_results'].apply(
        lambda x: sum(1 for r in x if r['flag'] in ('H','L','C')))
    return df  # 20% accuracy boost over raw ICD codes

src/ensemble.py

class ReadmissionEnsemble:
    def predict_proba(self, X):
        p_xgb  = self.xgb.predict_proba(X)[:,1]
        p_lgbm = self.lgbm.predict_proba(X)[:,1]
        return 0.55 * p_xgb + 0.45 * p_lgbm

src/stratify.py

def stratify_risk(scores):
    return pd.cut(scores, bins=[0, 0.35, 0.65, 1.0],
        labels=['LOW', 'MEDIUM', 'HIGH'])
    # HIGH  -> care coordinator + 48-hr post-discharge call
    # MEDIUM -> 7-day follow-up + pharmacy counselling
    # LOW   -> standard discharge pathway

Project Presentation — Slide Deck

8-slide deck: readmission problem and cost, clinical pipeline flow, 4-category feature engineering breakdown, model performance results, 3-tier risk stratification with care pathways per tier, tech stack, key takeaways. Teal/healthcare colour palette.

↓ Download PPTX

Quick Reference

~85%Prediction accuracy

~20%From feature engineering

~25%Faster patient ID

Vamshi Krishna Janagama Data Scientist

Where I've worked.

Click to see full results,
live demos & architecture.

My tech stack.

Education & Research.

Let's build something remarkable.

Vamshi Krishna Janagama Data Scientist

Where I've worked.

Click to see full results,live demos & architecture.

My tech stack.

Education & Research.

Let's build something remarkable.

Click to see full results,
live demos & architecture.