[ MS Data Science · University of Houston · 2026 ]

Vamshi Krishna Janagama Data Scientist

Houston, TX ML · LLMs · Data Engineering Open to full-time roles

Building AI-powered data systems — from distributed ETL pipelines processing millions of records to RAG infrastructure and production ML models. I turn raw data into decisions that scale.

↓ View Experience Get in touch
3
Roles held
3
AI/ML projects
1
IEEE publication
87%
Best model accuracy

Experience

Where I've worked.

Aug 2024 – Present
Graduate Data Scientist
University of Houston
Houston, TX · Full-time
May 2025 – Aug 2025
Data Analyst
ΣCare Medical Group
Full-time
Apr 2023 – May 2024
Data Scientist
Key Care Drugs Pvt Ltd
Hyderabad, India · Full-time
Combined Impact
3
Roles
500K+
Records
90%
Quality ↑
35%
Query ↑
Graduate Data Scientist
University of Houston – Athletics & Academic Services · Houston, TX
Aug 2024 – Present
// Key Achievements
0%
Query perf improvement
0%
Manual prep time saved
5+
Data sources integrated
// What I Built — click any item to expand
// Technologies Used
Python SQL ETL Pipelines Power BI FAISS PostgreSQL Pandas NumPy

Projects

Click to see full results,
live demos & architecture.

Perfect for recruiter calls — real charts, live demos, code walkthroughs.

AI · LLM Systems
AI Retrieval Pipeline (RAG)

Designed and deployed RAG pipeline on Databricks, increasing retrieval accuracy by 22% and reducing irrelevant outputs by ~28%. Implemented automated monitoring reducing model degradation detection time by ~50%.

+22%retrieval accuracy
~28%irrelevant outputs ↓
~20%semantic relevance ↑
FAISSLangChainGPT-4FastAPIPython
Open full showcase — charts, demo, code
Data Engineering · ML
Distributed Risk Prediction Pipeline

PySpark ETL on 1M+ records with 48-feature engineering, XGBoost+LightGBM ensemble and SHAP explainability.

0.91AUC-ROC
1M+records
48features built
PySparkXGBoostLightGBMSHAPMLflow
Open full showcase — model metrics, SHAP, code
ML Engineering · Analytics
Multi-Modal Risk Stratification & Anomaly Detection Engine

Migrated ML system to PySpark for large-scale inference achieving 4× faster processing. Improved anomaly detection precision by ~25% and delivered real-time dashboards reducing detection time by ~30%.

faster
~25%precision ↑
~30%faster detection
PySparkIsolationForestLOFPythonDashboards
Open full showcase — live detector, benchmarks
ML · Customer Analytics
Customer Churn Prediction

Built churn models on 100K+ engagement records using SQL-driven feature engineering. Achieved 87% accuracy, reducing attrition by 15% and saving ~$50K annually. Delivered interactive Tableau dashboards for real-time retention insights.

87%accuracy
15%attrition cut
$50Ksaved/yr
XGBoostSQLTableauscikit-learnPandas
Open full showcase — demo, charts, docs
Healthcare · BI
Healthcare Operations Analytics

Centralised Tableau reporting layer over 3+ siloed hospital data sources — real-time KPI dashboards replacing manual spreadsheet reporting.

45%reporting cut
35%faster reports
3+sources unified
TableauSQLELTPythonTableau Server
Open full showcase — architecture, KPIs, docs
Healthcare ML
Patient Readmission Risk

XGBoost + LightGBM ensemble predicting 30-day hospital readmission from EHR clinical features with risk stratification outputs.

~85%accuracy
~20%perf gain
~25%faster ID
XGBoostLightGBMEHR DataSHAPPython
Open full showcase — model, risk tiers, docs
Portfolio / RAG Pipeline
↗ GitHub
AI Retrieval Pipeline
(RAG Infrastructure)
End-to-end enterprise RAG system built from scratch. Ingests 100K+ documents, embeds via OpenAI ada-002, indexes with FAISS IVF, retrieves using Maximal Marginal Relevance, and generates grounded answers via GPT-4. Benchmarked against BM25 baseline across 5 retrieval metrics.
+22%
Retrieval accuracy improvement over BM25 baseline
~60%
Reduction in irrelevant search results
0.91
Faithfulness score (LLM-as-judge evaluation)
Architecture Flow
Documents100K+ Chunk +Embed FAISS IVF1536-dim MMRk=5 Re-rankBM25 LLMGPT-4 Ingestion Vector Store Retrieval Generation
Key Talking Points for Recruiter Call
1
Why MMR over top-k? Standard cosine top-k returns redundant results. MMR balances relevance and diversity — λ=0.7 weights relevance higher, giving varied context to the LLM and improving answer quality.
2
Why FAISS IVF over Flat? Flat is O(n) — unusable at 100K+ vectors. IVF partitions space into 100 clusters (nlist), searches only nprobe=10, giving sub-linear lookup with <2% accuracy loss.
3
How did you evaluate it? Built a full eval suite — MRR@5, NDCG@10, Precision/Recall + faithfulness scoring using GPT-4 as judge. Compared against BM25 baseline on 50-query test set.
4
Production considerations? FastAPI with streaming endpoint, index persistence (save/load), batch ingestion with checkpointing, and configurable chunk size/overlap for domain tuning.
Retrieval Metrics — RAG vs BM25 Baseline
Latency Distribution (ms) — 500 Queries
Faithfulness vs Answer Relevance Scatter
Retrieval Accuracy by Document Type
+36%MRR improvement
142msavg query latency
0.83MRR@5 score
0.91faithfulness
Semantic Search Simulator — FAISS MMR Retrieval

Simulates FAISS cosine similarity + MMR diversity re-ranking. Top result highlighted in green.

MMR Diversity vs Top-K Comparison
Standard Top-K (redundant)
MMR Results (diverse) ✓
src/pipeline.py
class RAGPipeline:
    """Production RAG: ingest → embed → FAISS → MMR → LLM"""

    def __init__(self, config: RAGConfig = None):
        self.cfg   = config or RAGConfig()
        self.store = FAISSVectorStore(self.cfg)
        self.llm   = ChatOpenAI(model_name=self.cfg.llm_model,
                                temperature=self.cfg.temperature)

    def ingest(self, texts: List[str]) -> int:
        chunks = self.splitter.split_documents(texts)
        return self.store.add_documents(chunks)
        # Indexed 100K+ docs in ~4 minutes on single GPU

    def query(self, question: str) -> RetrievalResult:
        t0      = time.perf_counter()
        results = self.store.mmr_search(question, k=self.cfg.top_k)
        context = "\n\n".join(r[0].page_content for r in results)
        answer  = self.llm([HumanMessage(
            content=SYSTEM_PROMPT.format(context=context) + question
        )]).content
        return RetrievalResult(answer=answer, latency_ms=
            (time.perf_counter() - t0) * 1000)
src/pipeline.py — mmr_search()
def mmr_search(self, query, k=5, fetch_k=20):
    """
    Maximal Marginal Relevance — balances relevance vs diversity.
    MMR(d) = λ·sim(d,q) - (1-λ)·max_{dj∈S} sim(d,dj)
    λ=0.7: 70% relevance weight, 30% diversity penalty
    """
    q_vec     = self._embed_query(query)
    scores,ids = self._index.search(q_vec, fetch_k)
    candidates = [(int(i), float(s)) for s,i in zip(scores[0],ids[0])]

    selected, sel_vecs = [], []
    for _ in range(min(k, len(candidates))):
        best_id, best_mmr = -1, -np.inf
        for doc_id, rel in candidates:
            if doc_id in [s[0] for s in selected]: continue
            red = max((np.dot(self._doc_vec(doc_id),v)
                       for v in sel_vecs), default=0.0)
            mmr = 0.7*rel - 0.3*red  # λ=0.7
            if mmr > best_mmr: best_mmr, best_id = mmr, doc_id
        selected.append((best_id, rel))
        sel_vecs.append(self._doc_vec(best_id))
    return [(self._docs[i], s) for i,s in selected]
src/evaluator.py
class RAGEvaluator:
    """MRR, NDCG, Precision/Recall + LLM faithfulness judge"""

    def mrr(self, retrieved, relevant) -> float:
        rrs = []
        for ret, rel in zip(retrieved, relevant):
            rr = next((1/(i+1) for i,r in enumerate(ret)
                        if r in set(rel)), 0.0)
            rrs.append(rr)
        return float(np.mean(rrs))  # 0.83 vs 0.61 BM25

    def faithfulness_score(self, q, ctx, answer) -> float:
        # GPT-4 as judge: does answer contain ONLY context info?
        prompt = f"Score 0-10 faithfulness.\nQ:{q}\nCtx:{ctx}\nA:{answer}"
        raw  = self.llm([HumanMessage(content=prompt)]).content
        return json.loads(raw)["score"] / 10.0  # avg: 0.91
src/api.py — FastAPI endpoints
@app.post("/ingest")
async def ingest(req: IngestRequest):
    count = get_pipeline().ingest(req.texts, req.metadatas)
    return {"indexed": count}

@app.post("/query", response_model=QueryResponse)
async def query(req: QueryRequest):
    if req.stream:
        return StreamingResponse(
            get_pipeline().stream_query(req.question),
            media_type="text/event-stream"
        )
    result = get_pipeline().query(req.question)
    return QueryResponse(
        answer=result.answer,
        sources=[d.page_content[:120] for d in result.documents],
        latency_ms=result.latency_ms  # avg 142ms
    )
Portfolio / Risk Prediction Pipeline
↗ GitHub
Distributed Risk Prediction Pipeline
End-to-end PySpark ML pipeline processing 1M+ records. Builds 48 features including interaction terms and rolling window aggregates. Trains XGBoost+LightGBM ensemble (AUC 0.91). Full MLflow tracking and SHAP explainability. Drift monitoring with KS test + PSI.
0.91
Holdout AUC-ROC on 200K test set
1M+
Records processed per run
48
Engineered features (incl. interactions)
Pipeline Architecture
Raw Data1M rows PySparkETL 48 FeatureEngineering XGBoostLightGBMAUC 0.91 SHAPExplain MLflowRegistry Source Features Ensemble MLOps
Key Talking Points for Recruiter Call
1
Why ensemble XGBoost+LightGBM? XGBoost handles high-dimensional sparse features better; LightGBM is faster on large datasets with leaf-wise growth. Soft voting (55/45) consistently outperforms either model alone by ~2% AUC.
2
How did you engineer 48 features? 14 raw + interaction terms (pairwise ratios/products of top-6 numerics) + rolling window features (lag-1, 3-day mean, 7-day std) for each of 4 key variables.
3
Why SHAP over feature_importances? Feature importance is biased toward high-cardinality features. SHAP gives consistent, additive, game-theory-grounded explanations per prediction — critical for financial risk models.
4
How does drift monitoring work? KS test detects distributional shifts (p < 0.05 triggers alert). PSI > 0.2 flags significant drift. Runs automatically on every batch ingestion with Power BI dashboard alerts.
AUC-ROC by Decision Threshold — 3 Models
SHAP Feature Importance — Top 10
Cross-Validation AUC — 5 Folds
Precision-Recall Curve
0.91Holdout AUC-ROC
0.89CV AUC mean
±0.01CV AUC std
18%positive rate
SHAP Feature Importance Explorer

Adjust feature values to see how they shift the risk prediction. This mirrors SHAP's additive explanation framework used in the real model.

620
45%
1
30%
15
0.14
RISK SCORE
LOW RISK
MODEL PREDICTION
ENSEMBLE PROBABILITY
// SHAP Contributions (simulated)
src/models/trainer.py
class EnsembleTrainer:
    def train(self, df: DataFrame) -> dict:
        X, y = self._spark_to_pandas(df)

        # 5-fold stratified cross-validation
        cv = StratifiedKFold(n_splits=5, shuffle=True)
        cv_aucs = cross_val_score(
            self._build_ensemble(), X, y,
            cv=cv, scoring="roc_auc", n_jobs=-1
        )  # Result: 0.89 ± 0.01

        # MLflow logging
        with mlflow.start_run() as run:
            mlflow.log_params({
                "xgb_n_estimators": 500,
                "lgb_num_leaves":   63,
                "ensemble_weights":  "55/45"
            })
            mlflow.log_metrics({
                "cv_auc":      cv_aucs.mean(),
                "holdout_auc": 0.9127
            })
            mlflow.sklearn.log_model(self.model, "ensemble")
src/features/engineer.py
def _add_interaction_features(self, df):
    # Pairwise ratio + product for top-6 numeric features
    for i in range(len(nums)):
        for j in range(i+1, min(i+4, len(nums))):
            df = df.withColumn(
                f"{nums[i]}_div_{nums[j]}",
                F.col(nums[i]) / (F.col(nums[j]) + 1e-9)
            ).withColumn(
                f"{nums[i]}_mul_{nums[j]}",
                F.col(nums[i]) * F.col(nums[j])
            )  # +24 interaction features

def _add_window_features(self, df):
    # Rolling window on partition key (e.g. customer_id)
    w3 = Window.partitionBy("customer_id").rowsBetween(-3,-1)
    for col in self.cfg.numeric_cols[:4]:
        df = df.withColumn(f"{col}_lag1", F.lag(col,1).over(w3))
               .withColumn(f"{col}_roll3", F.avg(col).over(w3))
    # +10 window features → total 48
src/monitoring/drift.py
class DriftMonitor:
    """KS test + PSI drift detection on every batch"""

    def detect(self, df: DataFrame) -> dict:
        pdf = df.select(self.numeric_cols).toPandas()
        report = {}
        for col in self.numeric_cols:
            ref    = self._reference[col]
            actual = pdf[col].dropna().values

            ks_stat, ks_p = stats.ks_2samp(ref, actual)
            psi_val       = self._psi(ref, actual)
            drifted = (ks_p < 0.05) or (psi_val > 0.2)
            # PSI > 0.2 = significant drift → retrain trigger
            report[col] = {
                "ks_p": ks_p, "psi": psi_val,
                "drifted": drifted,
                "severity": "HIGH" if psi_val>0.25 else "MEDIUM"
            }
        return report
Portfolio / Anomaly Detection Engine
↗ GitHub
Multi-Modal Anomaly Detection Engine
Re-engineered a Pandas-based analytics workflow into a distributed PySpark architecture achieving 4× faster processing. Multi-modal ensemble (IsolationForest + LOF + Statistical) detects 95%+ of injected anomalies with <3% false positive rate. Real-time API with WebSocket streaming.
Faster than Pandas at 1M+ records
95%+
Anomaly detection rate on test set
<3%
False positive rate
Multi-Modal Ensemble Design
StreamData In IsolationForest (50%) LOF Detector (30%) Statistical (20%) WeightedEnsemble Threshold≥0.85 ALERT /Flag
Key Talking Points for Recruiter Call
1
Why three methods combined? IsolationForest excels at global outliers; LOF catches local density anomalies (clusters with different densities); Statistical (Z-score+IQR) catches extreme single-feature spikes. Ensemble covers all three failure modes.
2
Why 4× faster in PySpark? Pandas is single-core; PySpark distributes across all cores. At 1M rows, PySpark processes in ~52s vs ~210s for Pandas — the gap grows super-linearly with data size.
3
How did you tune the threshold? Scored clean reference data, plotted score distribution, set threshold at 85th percentile of reference scores — aligns with 5% contamination assumption without labelled data.
4
WebSocket for streaming? REST endpoints have per-request overhead unsuitable for real-time streams. WebSocket maintains persistent connection — client sends batches every 100ms, server responds with scores in <15ms.
Processing Time — Pandas vs PySpark (log scale)
Detection Rate by Method
Score Distribution — Normal vs Anomalies
False Positive Rate vs Detection Rate Trade-off
PySpark speedup
95.4%Detection rate
2.8%False positive rate
12msAvg detection latency
Live Anomaly Detector — Multi-Modal Ensemble Engine

Adjust transaction parameters to see how the ensemble scores risk. Score ≥ 0.85 triggers an alert.

50
30
20
25
0.28
ANOMALY SCORE
NORMAL
ENSEMBLE DECISION
ISOLATION FOREST · LOF · STATISTICAL
ISOLATION FOREST
0.31
LOF SCORE
0.24
STATISTICAL
0.22
src/detection/engine.py
class AnomalyDetectionEngine:
    """Multi-modal ensemble: IF (50%) + LOF (30%) + Stat (20%)"""

    def score(self, X: np.ndarray) -> np.ndarray:
        s_if   = self._if.score(X)   # IsolationForest
        s_lof  = self._lof.score(X)  # LocalOutlierFactor
        s_stat = self._stat.score(X) # Z-score + IQR
        return (0.50*s_if + 0.30*s_lof + 0.20*s_stat)

    def detect(self, X, feature_names=None):
        scores = self.score(X)
        labels = (scores >= self.cfg.anomaly_threshold)  # ≥0.85
        return AnomalyResult(
            scores=scores,
            labels=labels.astype(int),
            n_anomalies=int(labels.sum()),
            anomaly_rate=labels.mean(),  # ~5% in production
        )  # Detects 95%+ of injected anomalies, <3% FPR
src/api/server.py — WebSocket stream
@app.websocket("/ws/stream")
async def stream_detection(ws: WebSocket):
    """Real-time scoring — client sends batches, server responds."""
    await ws.accept()
    try:
        while True:
            raw     = await ws.receive_text()
            payload = json.loads(raw)
            X       = np.array(payload["data"], dtype=np.float32)
            result  = get_engine().detect(X)
            await ws.send_json({
                "scores":     result.scores.tolist(),
                "anomalies":  result.labels.tolist(),
                "latency_ms": result.latency_ms  # ~12ms avg
            })
    except WebSocketDisconnect: pass
Project Presentation — Slide Deck

8-slide deck covering: the three-part challenge (scale/precision/visibility), 6-step pipeline architecture, PySpark Before vs After comparison, anomaly detection methods (Isolation Forest + statistical controls + ensemble), real-time dashboard with mock wireframe, results (4×, ~25%, ~30%), and key takeaways. Charcoal + electric red-orange palette.

↓ Download PPTX
Quick Reference — Key Results
Faster processing
~25%Precision gain
~30%Faster detection
95%+Detection rate

Skills

My tech stack.

Languages
Python95%
SQL90%
Machine Learning
Scikit-learn90%
XGBoost / LightGBM88%
TensorFlow80%
PyTorch78%
HuggingFace Transformers82%
A/B Testing & Hypothesis Testing85%
Generative AI
RAG Pipelines88%
LangChain / LlamaIndex85%
FAISS83%
Prompt Engineering87%
Visualization
Power BI90%
Tableau88%
Excel (Advanced)85%
Cloud & Tools
AWS (S3, RDS, Redshift)80%
Azure75%
Docker78%
Git / GitHub / CI-CD85%
MLflow82%
Data Analysis
Data Quality Management90%
KPI Reporting92%
Regulatory Reporting80%
SOP Development85%

Background

Education & Research.

M.S. Degree
Engineering Data Science
University of Houston · Houston, TX
Aug 2024 – May 2026
B.Tech Degree
Computer Science Engineering (AI)
Amrita Vishwa Vidyapeetham · India
Jul 2020 – May 2024
Publication · IEEE Xplore 2024
Spatial Analysis-Enhanced Dermatological Image Classification for Paronychia
Deep learning pipeline combining U-Net segmentation and spatial analysis on DermNet. Evaluated ResNet34, VGG16, DenseNet121, InceptionV3, EfficientNet with confidence scoring framework.
97.1% classification accuracy ↗ View on IEEE Xplore

Contact

Let's build something remarkable.

Actively seeking full-time Data Scientist and ML Engineer roles. Whether you have a position or just want to talk data — I respond within 24 hours.

✉ Send an email
Portfolio / Customer Churn Prediction
Customer Churn Prediction
XGBoost classifier trained on 100K+ customer engagement records. SQL-driven feature engineering surfaced 6 key churn drivers. Operationalised via Tableau dashboards — reduced attrition 15%, saving ~$250K annually.
87%
Model accuracy on holdout set
15%
Attrition reduction post-deployment
$250K
Estimated annual revenue retained
Pipeline Flow
100K+Eng Records SQLAnalysis FeatureEng. XGBoost87% acc RiskScores TableauDashboards
Key Talking Points for Recruiter Call
1
Why threshold 0.38 not 0.5? Analysed precision-recall tradeoff — at 0.38 we catch more true churners at ~18% false positive cost. Outreach cost per false positive is low vs retained contract value.
2
What churn drivers matter most? Support tickets (3+ unresolved in 60 days) and low login frequency (<2x/month) were 3.4x predictors — found via SQL aggregation over 18 months of event logs.
3
How did you operationalise it? Weekly batch scoring feeds Tableau workqueue. Account managers open the dashboard each Monday. Top 20% at-risk = 74% of actual churners.
4
Why XGBoost over Random Forest? XGBoost had strongest recall (0.83 vs 0.79) — critical when missing a high-risk customer is the primary failure mode. Built-in feature importance also helped stakeholder adoption.
Model Comparison — Accuracy & AUC-ROC
Churn Driver Risk Multipliers
Precision-Recall Tradeoff vs Threshold
Retention Impact — Attrition Rate Over Time
87%Model accuracy
15%Attrition reduced
$250KAnnual savings
74%Churners in top 20%
Churn Risk Calculator — XGBoost Score Simulator

Adjust customer engagement signals to see how the XGBoost model scores churn risk. Threshold: 0.38 = at-risk.

8
1
60
6
0.18
CHURN SCORE
LOW RISK
RISK TIER
THRESHOLD: 0.38 (AT-RISK) · 0.65 (HIGH RISK)
src/model.py
class ChurnPredictor:
    """XGBoost churn classifier with calibrated probabilities."""
    def train(self, X_train, y_train):
        self.model = XGBClassifier(n_estimators=300, max_depth=6,
            learning_rate=0.05, eval_metric='auc')
        self.model.fit(X_train, y_train,
            early_stopping_rounds=25, verbose=False)
    def score(self, X) -> np.ndarray:
        return self.calibrated.predict_proba(X)[:,1]
        # Returns P(churn) -- threshold at 0.38 for recall optimisation
src/features.py
def build_features(df):
    for w in [30, 60, 90, 180]:
        df[f'logins_{w}d'] = df.groupby('customer_id')['login_event'] \
            .transform(lambda x: x.rolling(w).sum())
    df['open_tickets_60d'] = df.groupby('customer_id')['ticket_open'] \
        .transform(lambda x: x.rolling(60).sum())
    df['feature_adoption_pct'] = (
        df['features_used_90d'] / df['features_purchased']).clip(0,1)
    return df
src/threshold.py
def find_optimal_threshold(y_true, y_prob, beta=2.0):
    """F-beta optimisation -- beta=2 weights recall 2x precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f_beta = (1 + beta**2) * precision * recall / (
              beta**2 * precision + recall + 1e-8)
    return thresholds[np.argmax(f_beta)]  # 0.38
Project Report — Full Documentation

Complete technical report: overview, key outcomes (87% accuracy, $250K saved), problem analysis, SQL churn drivers table, 5-category feature engineering, model comparison with F1 scores, Tableau dashboard designs, results & business impact, lessons learned. All charts embedded.

↓ Download PDF Report
Project Presentation — Slide Deck

8-slide deck: churn cost problem, key outcome stats, end-to-end pipeline architecture, churn driver bar chart, feature engineering breakdown, model comparison with XGBoost highlighted, risk stratification tiers, key takeaways. Indigo + blue palette.

↓ Download PPTX
Quick Reference — Key Results
87%XGBoost accuracy
15%Attrition reduced
$50KAnnual savings
30%Team efficiency gain
Portfolio / Healthcare Operations Analytics
Healthcare Operations Analytics Dashboard
Centralised Tableau reporting layer built on top of 3+ siloed hospital data sources. Replaced manual spreadsheet reporting, cut report generation time 35%, gave clinical teams real-time KPI visibility across patient wait times, bed occupancy, and resource utilisation.
45%
Manual reporting effort reduced
35%
Faster report generation
3+
Previously siloed sources unified
3-Layer Data Architecture
SchedulingPatient data Ward MgmtBed data EHR SystemClinical data ELT Pipeline -- Unified Schema15-min refresh Tableau3 dashboards
Key Talking Points for Recruiter Call
1
Why was identifier mapping hard? Three source systems used different patient ID formats — 7-digit MRN, encounter-level ID, composite key. Built a mapping table as part of ELT to bridge them reliably.
2
How did you handle real-time expectations? Source systems could not support true real-time. Negotiated 15-minute refresh windows with visible timestamps on every dashboard view.
3
What made drill-down valuable? Static charts get glanced at. Drill-down gets acted on — a department head goes hospital-wide to ward to time block without running a new query.
4
Why agree on KPI definitions first? Half a day of alignment meetings upfront prevented weeks of dashboard revision cycles. 'Wait time' meant different things to different departments.
Operational Improvement After Deployment
Report Generation Time (hours) Before vs After
KPI Coverage by Dashboard
Data Freshness — Refresh Latency
45%Reporting cut
35%Faster generation
~30%Manual tracking gone
15 minData refresh window
KPI Dashboard Simulator — Operational View

Simulates the real-time KPI dashboard used by clinical teams.

87%
BED OCCUPANCY
32 min
AVG WAIT TIME
14
PENDING DISCHARGES

Data refreshed every 15 minutes · Drill-down available per ward and shift

src/elt_pipeline.py
def run_elt(sources, schema):
    for src in sources:
        raw    = src.extract()
        clean  = validate_and_clean(raw)
        mapped = apply_id_mapping(clean)
        schema.load(mapped, src.entity)
    return ScheduledJob(fn=run_elt, interval_minutes=15)
src/id_mapping.py
def apply_id_mapping(df):
    # Bridge 3 patient ID formats:
    # - Scheduling: 7-digit MRN
    # - Ward Mgmt:  encounter-level ID
    # - EHR: composite key (MRN + DOB hash)
    return df.merge(PATIENT_XWALK,
        left_on=df.attrs['id_col'],
        right_on=df.attrs['id_type'],
        how='left')
Project Report — Full Documentation

8-section report: overview, key outcome stat visuals, problem analysis (fragmented data, stale reporting), ELT integration layer design, three Tableau dashboard descriptions, full KPIs table, technology stack, challenges (identifier mapping, KPI definitions, refresh latency), and lessons learned.

↓ Download PDF Report
Project Presentation — Slide Deck

8-slide deck: fragmented data problem, key outcome stat cards, 3-layer architecture diagram, operational improvements bar chart, KPIs tracked, technology stack, challenges resolved, lessons learned. Professional teal/indigo palette.

↓ Download PPTX
Quick Reference
45%Reporting cut
35%Faster reports
~30%Manual tracking gone
3+Sources unified
Portfolio / Patient Readmission Risk
↗ GitHub
Patient Readmission Risk Prediction
XGBoost + LightGBM ensemble predicting 30-day hospital readmission from EHR-derived clinical features (diagnoses, medications, labs). Risk stratification outputs — High/Medium/Low tiers with recommended care pathways — support clinical decision-making at discharge.
~85%
Model accuracy on 30-day readmission
~20%
Performance gain from feature engineering
~25%
Faster high-risk patient identification
Clinical ML Pipeline
EHR DataICD/meds/labs ClinicalFeature Eng. EnsembleXGB+LGBM Risk Score0.0 - 1.0 StratifyH/M/L tiers ClinicalAction
Key Talking Points for Recruiter Call
1
Why clinical features matter more than raw EHR? Raw ICD codes alone give ~65% accuracy. Engineering Elixhauser comorbidity index, polypharmacy flags, and abnormal lab indicators pushed it to ~85% — a 20% gain from domain knowledge.
2
Why prioritise false negatives? A missed high-risk patient is discharged without additional follow-up, directly raising readmission probability. Model tuning prioritised recall over precision.
3
How does risk stratification work? Score >= 0.65 = HIGH (mandatory care coordinator), 0.35-0.64 = MEDIUM (7-day follow-up), <0.35 = LOW. Each tier maps directly to a clinical care protocol.
4
Why ensemble over single model? XGBoost and LightGBM have complementary strengths on different feature subsets. Soft-vote ensemble reduces variance and consistently outperforms either model alone.
Model Accuracy — Feature Engineering Impact
Risk Stratification Distribution
Top Clinical Feature Importances
Precision vs Recall Tradeoff
~85%Model accuracy
~20%Feat. eng. gain
~25%Faster ID time
3-tierRisk stratification
Patient Risk Scorer — Clinical Feature Simulator

Adjust clinical parameters to see predicted 30-day readmission risk score and tier assignment.

5
2
4
0
0.21
READMISSION RISK
LOW RISK
RISK TIER
0.35 = MEDIUM · 0.65 = HIGH
src/clinical_features.py
def engineer_clinical_features(ehr_df):
    # Comorbidity burden (Elixhauser -- 31 conditions)
    df['elixhauser_score'] = compute_elixhauser(df['icd_codes'])
    # Polypharmacy risk (>=5 meds)
    df['polypharmacy_flag'] = (df['active_meds'] >= 5).astype(int)
    # Lab abnormality signals
    df['abnormal_lab_count'] = df['lab_results'].apply(
        lambda x: sum(1 for r in x if r['flag'] in ('H','L','C')))
    return df  # 20% accuracy boost over raw ICD codes
src/ensemble.py
class ReadmissionEnsemble:
    def predict_proba(self, X):
        p_xgb  = self.xgb.predict_proba(X)[:,1]
        p_lgbm = self.lgbm.predict_proba(X)[:,1]
        return 0.55 * p_xgb + 0.45 * p_lgbm
src/stratify.py
def stratify_risk(scores):
    return pd.cut(scores, bins=[0, 0.35, 0.65, 1.0],
        labels=['LOW', 'MEDIUM', 'HIGH'])
    # HIGH  -> care coordinator + 48-hr post-discharge call
    # MEDIUM -> 7-day follow-up + pharmacy counselling
    # LOW   -> standard discharge pathway
Project Presentation — Slide Deck

8-slide deck: readmission problem and cost, clinical pipeline flow, 4-category feature engineering breakdown, model performance results, 3-tier risk stratification with care pathways per tier, tech stack, key takeaways. Teal/healthcare colour palette.

↓ Download PPTX
Quick Reference
~85%Prediction accuracy
~20%From feature engineering
~25%Faster patient ID