SNOW: Agent-Based Feature Generation from Clinical Notes for Outcome Prediction

1Department of Management Science and Engineering, Stanford University
2Department of Radiation Oncology, Stanford University School of Medicine
3Graduate Business School Research Hub, Stanford University
4Department of Medicine (Oncology), Stanford University
5Department of Medicine, Stanford University
6Department of Urology, Stanford University
7Veterans Affairs Palo Alto Health Care System
8Department of Electrical Engineering, Stanford University
9Operations, Information and Technology, Stanford Graduate Business School
*Corresponding author. Email: jyw@stanford.edu
Served as equally contributing co-senior authors

Abstract

Electronic health records (EHRs) contain rich unstructured clinical notes that could enhance predictive modeling, yet extracting meaningful features from these notes remains challenging. Current approaches range from labor-intensive manual clinician feature generation (CFG) to fully automated representational feature generation (RFG) methods that lack interpretability and clinical relevance.

Here we introduce SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that autonomously generates structured clinical features from unstructured notes without human intervention. We evaluated SNOW against manual CFG, clinician-guided LLM approaches, and RFG methods for predicting 5-year prostate cancer recurrence in 147 patients from Stanford Healthcare.

While manual CFG achieved the highest performance (AUC-ROC: 0.771 ± 0.036), SNOW matched this performance (0.761 ± 0.046) without requiring any clinical expertise, significantly outperforming both baseline features alone (0.691 ± 0.079) and all RFG approaches. SNOW's specialized agents handle feature discovery, extraction, validation, post-processing, and aggregation, creating interpretable features that capture complex clinical information typically accessible only through manual review.

Scalable Note-to-Outcome Workflow (SNOW)


SNOW Feature Generation Example

See how SNOW validates and automatically refines feature extraction

Complete agent workflow for feature percent_core_involvement_left_apex_medial

Step 1: Feature Discovery Agent

Action: Defines new feature from clinical note analysis

Feature Definition: Percentage of biopsy core involved with cancer for left apex medial region. Higher percentage indicates greater tumor burden and is associated with higher risk of biological failure.

Initial Instructions: "Extract the percentage value from phrases like 'COMPRISING X% OF THE CORE' or 'INVOLVING X% OF THE CORE'. Return as integer 0-100."
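The initial instructions amount to a simple pattern rule. A minimal Python sketch of that rule follows; the regex and function name are illustrative assumptions, since in SNOW the extraction is performed by an LLM agent following the natural-language instructions, not by a hand-written regex.

```python
import re

# Illustrative pattern for phrases like "COMPRISING 50% OF THE CORE"
# or "INVOLVING 30% OF THE CORE" (case-insensitive, per the notes).
PERCENT_PATTERN = re.compile(
    r"(?:COMPRISING|INVOLVING)\s+(\d{1,3})\s*%\s+OF\s+THE\s+CORE",
    re.IGNORECASE,
)

def extract_core_involvement(note_text: str) -> float:
    """Return the stated core-involvement percentage, or NaN if absent."""
    match = PERCENT_PATTERN.search(note_text)
    if match is None:
        return float("nan")  # missing value, flagged later by the validation agent
    return float(match.group(1))

print(extract_core_involvement("... COMPRISING 50% OF THE CORE ..."))  # 50.0
```

As the walkthrough below shows, a rule this narrow returns NaN whenever the note reports a linear measurement (e.g. "1.0 CM") instead of a percentage, which is exactly what the validation agent catches.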

Step 2: Feature Extraction Agent (Attempt 1)

Action: Extracts feature values from all 147 patient notes using initial instructions

Results: Successfully extracts percentages when explicitly stated, but returns NaN for notes with linear measurements (e.g., "1.0 CM" instead of percentage)

Step 3: Feature Validation Agent (Validation 1)

Action: Reviews extracted values against source notes

Analysis:

  • Note 0: "COMPRISING 50% OF THE CORE" → ✓ 50.0 (correct)
  • Note 74: "PROSTATE ADENOCARCINOMA, GLEASON SCORE 4 + 4 = 8, 1.0 CM" → ⚠ NaN (missing)
  • Note 105: "PROSTATIC ADENOCARCINOMA, 0.7 CM" → ⚠ NaN (missing)

Decision: Re-extract

Reasoning: Missing values when linear measurements available. Both formats represent tumor burden.

Updated Instructions: "When percentage not provided but linear measurement available, calculate percentage by dividing tumor length by total core length from gross description section."

Step 4: Feature Extraction Agent (Attempt 2)

Action: Re-extracts feature values using updated instructions that handle linear measurements

Results: Now calculates percentages from linear measurements, but introduces calculation error in some cases

Step 5: Feature Validation Agent (Validation 2)

Action: Reviews re-extracted values

Analysis:

  • Note 39: Cancer 2mm, Core 0.8cm → ✗ 250.0% (should be 25%)
  • Note 107: "INVOLVING 1.0 cm of 1 CM CORE" → ✓ 100.0 (correct)

Decision: Re-extract

Reasoning: Decimal point calculation error causing impossible values over 100%

Updated Instructions: "Ensure all calculated percentages are capped at 100% maximum. Double-check all calculations to prevent decimal place errors."
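The fix from Validation 2 boils down to two safeguards: normalize both lengths to a common unit before dividing (the 250% error comes from mixing mm and cm), and cap the result at 100%. A hedged sketch, with function and table names invented for illustration:

```python
# Illustrative unit table; normalizing to millimetres before dividing
# avoids the decimal-place error seen in Note 39 (2 mm vs 0.8 cm).
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0}

def percent_involvement(tumor_len: float, tumor_unit: str,
                        core_len: float, core_unit: str) -> float:
    """Core-involvement percentage from linear measurements, capped at 100%."""
    tumor_mm = tumor_len * UNIT_TO_MM[tumor_unit.lower()]
    core_mm = core_len * UNIT_TO_MM[core_unit.lower()]
    return min(100.0, 100.0 * tumor_mm / core_mm)

# Note 39: 2 mm of cancer in a 0.8 cm core -> 25%, not 250%
print(percent_involvement(2.0, "mm", 0.8, "cm"))  # 25.0
# Note 107: "INVOLVING 1.0 cm of 1 CM CORE" -> 100%
print(percent_involvement(1.0, "cm", 1.0, "cm"))  # 100.0
```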

Step 6: Feature Extraction Agent (Attempt 3)

Action: Re-extracts feature values with decimal correction and 100% cap

Results: All values now correctly extracted and within valid range (0-100%)

Step 7: Feature Validation Agent (Validation 3)

Action: Final quality check on corrected values

Analysis:

  • Note 0: "COMPRISING 50% OF THE CORE" → ✓ 50.0
  • Note 5: "comprising 100% of 1/1 core" → ✓ 100.0
  • Note 14: "NO PROSTATIC GLANDULAR TISSUE" → ✓ NaN (appropriate)
  • Note 6: "NO SIGNIFICANT ABNORMALITY" → ✓ 0.0

Quality Metrics:

  • Missing value rate: 2.04% (acceptable)
  • All values within valid range (0-100%)
  • Correctly handles explicit percentages, calculated values, and missing data

Decision: Proceed with Feature

Feature Accepted

After 3 validation cycles and 2 re-extractions, the feature percent_core_involvement_left_apex_medial successfully passes all quality checks and proceeds to the final dataset for model training.

Total Process: 7 agent interactions (1 discovery + 3 extractions + 3 validations)
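The seven interactions above follow a generic extract-validate loop. A minimal sketch of that control flow, with the LLM-backed agents stubbed out as plain callables and an illustrative cycle budget (not a documented SNOW parameter):

```python
MAX_CYCLES = 5  # illustrative budget; the example above converged in 3

def refine_feature(notes, instructions, extract, validate):
    """Repeat extraction until the validation agent accepts the values.

    `extract(notes, instructions)` returns per-note values; `validate`
    returns a verdict dict with a decision and, on rejection, revised
    extraction instructions (as in Steps 3 and 5 above).
    """
    for _ in range(MAX_CYCLES):
        values = extract(notes, instructions)
        verdict = validate(notes, values)
        if verdict["decision"] == "accept":
            return values
        instructions = verdict["updated_instructions"]
    raise RuntimeError("feature did not converge within the cycle budget")
```

In SNOW proper, accepted features then pass to the post-processing and aggregation agents before model training.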

Methods Compared

Overview of all feature generation approaches evaluated in this study

Manual Clinician Feature Generation (CFG)

Gold standard approach. Expert oncologists and data scientists collaborated for over a year to manually curate clinically relevant features from medical records. Requires extensive per-patient clinical expertise for feature definition and extraction from biopsy reports.

High Performance · Not Scalable · Interpretable

SNOW (Scalable Note-to-Outcome Workflow)

Our proposed method. Fully autonomous multi-agent LLM system with specialized agents for feature discovery, extraction, validation, post-processing, and aggregation. Requires no human intervention or clinical expertise.

High Performance · Scalable · Interpretable

Clinician-Guided LLM (CLFG)

Semi-automated approach. Uses LLMs with expert-written prompts to extract predefined clinician features. Includes detailed instructions based on manual processing experience, with post-processing following clinical rules.

Good Performance · Partially Scalable · Interpretable

Baseline Features

Structured data only. Features from structured EHR sources requiring minimal clinical expertise: demographics, maximum pre-treatment PSA, and Charlson Comorbidity Index. Does not utilize unstructured clinical notes.

Low Performance · Scalable · Interpretable

Representational Feature Generation (RFG) Methods

14 automated NLP approaches that generate latent features from clinical text without human oversight. Despite extensive evaluation, none provided additional predictive value beyond baseline features.

Traditional NLP
  • Bag-of-Words (Classic & TF-IDF)
  • 2-gram, 3-gram, 4-gram variants
Transformer Models
  • BERT, DistilBERT
  • ClinicalBERT, Longformer
Advanced Embeddings
  • Fine-tuned Mistral-7B-v0.3
  • OpenAI text-embedding models

Low Performance · Scalable · Not Interpretable
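To make the RFG family concrete, here is a toy TF-IDF featurizer in the spirit of the bag-of-words baselines listed above. It is a stdlib-only sketch for illustration; the study's actual RFG pipelines (and their tokenization and weighting choices) are not specified here.

```python
import math
from collections import Counter

def tfidf_features(notes: list[str]) -> list[dict[str, float]]:
    """Toy TF-IDF: map each note to a sparse term -> weight dictionary."""
    docs = [note.lower().split() for note in notes]
    n = len(docs)
    # Document frequency: in how many notes each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    features = []
    for doc in docs:
        tf = Counter(doc)
        features.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return features

feats = tfidf_features(["no significant abnormality", "no residual tumor"])
# A term shared by every note ("no") gets idf = log(2/2) = 0
print(feats[0]["no"])  # 0.0
```

Such latent weights feed a downstream classifier directly; unlike SNOW's named clinical features, the resulting dimensions carry no clinical meaning, which is the interpretability gap noted above.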

Results

Key Discoveries

🎯 SNOW Matches Expert-Level Performance

SNOW achieved an AUC-ROC of 0.761 ± 0.046, matching the gold-standard manual clinician feature generation (CFG) performance of 0.771 ± 0.036 without requiring any clinical expertise or human intervention. This represents the first demonstration of a fully automated system matching expert-driven performance for clinical feature generation.

🚀 Significant Improvement Over Baseline

All feature generation methods (except RFG) substantially outperformed baseline features alone (0.691 ± 0.079). SNOW provided a 10.1% improvement in AUC-ROC over baseline, demonstrating the critical value of extracting information from unstructured clinical notes for outcome prediction.

📊 RFG Methods Failed to Provide Value

None of the 'RFG + baseline' approaches outperformed the baseline features alone. These results suggest that although RFG methods may introduce new signal, they also inflate the dimensionality of the feature space, making models substantially harder to optimize in a small-sample setting of 147 patients. Because they did not improve model performance, we excluded them from the performance plot above.

⚖️ Quality vs. Scalability Trade-off Resolved

The results demonstrate that SNOW successfully bridges the gap between clinical expertise and scalable automation. While manual CFG delivers the highest performance, SNOW achieves comparable results with full automation, eliminating the traditional trade-off between quality and scalability in clinical feature engineering.

🔍 Interpretable Feature Generation

Unlike black-box RFG methods, SNOW generates interpretable structured features that capture complex clinical information. The system autonomously discovered and validated features across 14 prostate regions, including Gleason scores, tumor percentages, and cancer presence indicators that are directly interpretable by clinicians.

🏥 Clinical Deployment Potential

The combination of expert-level performance, full automation, and clinical interpretability positions SNOW as a viable solution for real-world clinical deployment, potentially transforming how healthcare organizations leverage unstructured EHR data for predictive modeling at scale.

BibTeX

@misc{wang2025agentbasedfeaturegenerationclinical,
      title={Agent-Based Feature Generation from Clinical Notes for Outcome Prediction}, 
      author={Jiayi Wang and Jacqueline Jil Vallon and Neil Panjwani and Xi Ling and Sushmita Vij and Sandy Srinivas and John Leppert and Mark K. Buyyounouski and Mohsen Bayati},
      year={2025},
      eprint={2508.01956},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.01956}, 
}