SNOW: Agent-Based Feature Generation from Clinical Notes for Outcome Prediction

1Department of Management Science and Engineering, Stanford University
2Department of Radiation Oncology, Stanford University School of Medicine
3Graduate Business School Research Hub, Stanford University
4Department of Medicine (Oncology), Stanford University
5Department of Medicine, Stanford University
6Department of Urology, Stanford University
7Veterans Affairs Palo Alto Health Care System
8Department of Electrical Engineering, Stanford University
9Operations, Information and Technology, Stanford Graduate Business School
*Corresponding author. Email: jyw@stanford.edu
Served as equally contributing co-senior authors

Abstract

Electronic health records (EHRs) contain rich unstructured clinical notes that could enhance predictive modeling, yet extracting meaningful features from these notes remains challenging. Current approaches range from labor-intensive manual clinician feature generation (CFG) to fully automated representational feature generation (RFG) methods that lack interpretability and clinical relevance.

Here we introduce SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that autonomously generates structured clinical features from unstructured notes without human intervention. We evaluated SNOW against manual CFG, clinician-guided LLM approaches, and RFG methods for predicting 5-year prostate cancer recurrence in 147 patients from Stanford Healthcare.

While manual CFG achieved the highest performance (AUC-ROC: 0.771 ± 0.036), SNOW matched this performance (0.761 ± 0.046) without requiring any clinical expertise, significantly outperforming both baseline features alone (0.691 ± 0.079) and all RFG approaches. SNOW's specialized agents handle feature discovery, extraction, validation, post-processing, and aggregation, creating interpretable features that capture complex clinical information typically accessible only through manual review.

Scalable Note-to-Outcome Workflow (SNOW)


SNOW Feature Generation Example

See how SNOW validates and automatically refines feature extraction

Complete agent workflow for feature percent_core_involvement_left_apex_medial

Step 1: Feature Discovery Agent

Action: Defines new feature from clinical note analysis

Feature Definition: Percentage of biopsy core involved with cancer for left apex medial region. Higher percentage indicates greater tumor burden and is associated with higher risk of biological failure.

Initial Instructions: "Extract the percentage value from phrases like 'COMPRISING X% OF THE CORE' or 'INVOLVING X% OF THE CORE'. Return as integer 0-100."
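The initial instructions amount to a simple pattern rule. A minimal Python sketch of that rule follows; the regex and function name are illustrative assumptions, since in SNOW the extraction is performed by an LLM agent following the natural-language instructions, not by a hand-written regex.

```python
import re

# Illustrative pattern for phrases like "COMPRISING 50% OF THE CORE"
# or "INVOLVING 30% OF THE CORE" (case-insensitive, per the notes).
PERCENT_PATTERN = re.compile(
    r"(?:COMPRISING|INVOLVING)\s+(\d{1,3})\s*%\s+OF\s+THE\s+CORE",
    re.IGNORECASE,
)

def extract_core_involvement(note_text: str) -> float:
    """Return the stated core-involvement percentage, or NaN if absent."""
    match = PERCENT_PATTERN.search(note_text)
    if match is None:
        return float("nan")  # missing value, flagged later by the validation agent
    return float(match.group(1))

print(extract_core_involvement("... COMPRISING 50% OF THE CORE ..."))  # 50.0
```

As the walkthrough below shows, a rule this narrow returns NaN whenever the note reports a linear measurement (e.g. "1.0 CM") instead of a percentage, which is exactly what the validation agent catches.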

Step 2: Feature Extraction Agent (Attempt 1)

Action: Extracts feature values from all 147 patient notes using initial instructions

Results: Successfully extracts percentages when explicitly stated, but returns NaN for notes with linear measurements (e.g., "1.0 CM" instead of percentage)

Step 3: Feature Validation Agent (Validation 1)

Action: Reviews extracted values against source notes

Analysis:

  • Note 0: "COMPRISING 50% OF THE CORE" → ✓ 50.0 (correct)
  • Note 74: "PROSTATE ADENOCARCINOMA, GLEASON SCORE 4 + 4 = 8, 1.0 CM" → ⚠ NaN (missing)
  • Note 105: "PROSTATIC ADENOCARCINOMA, 0.7 CM" → ⚠ NaN (missing)

Decision: Re-extract

Reasoning: Missing values when linear measurements available. Both formats represent tumor burden.

Updated Instructions: "When percentage not provided but linear measurement available, calculate percentage by dividing tumor length by total core length from gross description section."

Step 4: Feature Extraction Agent (Attempt 2)

Action: Re-extracts feature values using updated instructions that handle linear measurements

Results: Now calculates percentages from linear measurements, but introduces calculation error in some cases

Step 5: Feature Validation Agent (Validation 2)

Action: Reviews re-extracted values

Analysis:

  • Note 39: Cancer 2mm, Core 0.8cm → ✗ 250.0% (should be 25%)
  • Note 107: "INVOLVING 1.0 cm of 1 CM CORE" → ✓ 100.0 (correct)

Decision: Re-extract

Reasoning: Decimal point calculation error causing impossible values over 100%

Updated Instructions: "Ensure all calculated percentages are capped at 100% maximum. Double-check all calculations to prevent decimal place errors."
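The fix from Validation 2 boils down to two safeguards: normalize both lengths to a common unit before dividing (the 250% error comes from mixing mm and cm), and cap the result at 100%. A hedged sketch, with function and table names invented for illustration:

```python
# Illustrative unit table; normalizing to millimetres before dividing
# avoids the decimal-place error seen in Note 39 (2 mm vs 0.8 cm).
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0}

def percent_involvement(tumor_len: float, tumor_unit: str,
                        core_len: float, core_unit: str) -> float:
    """Core-involvement percentage from linear measurements, capped at 100%."""
    tumor_mm = tumor_len * UNIT_TO_MM[tumor_unit.lower()]
    core_mm = core_len * UNIT_TO_MM[core_unit.lower()]
    return min(100.0, 100.0 * tumor_mm / core_mm)

# Note 39: 2 mm of cancer in a 0.8 cm core -> 25%, not 250%
print(percent_involvement(2.0, "mm", 0.8, "cm"))  # 25.0
# Note 107: "INVOLVING 1.0 cm of 1 CM CORE" -> 100%
print(percent_involvement(1.0, "cm", 1.0, "cm"))  # 100.0
```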

Step 6: Feature Extraction Agent (Attempt 3)

Action: Re-extracts feature values with decimal correction and 100% cap

Results: All values now correctly extracted and within valid range (0-100%)

Step 7: Feature Validation Agent (Validation 3)

Action: Final quality check on corrected values

Analysis:

  • Note 0: "COMPRISING 50% OF THE CORE" → ✓ 50.0
  • Note 5: "comprising 100% of 1/1 core" → ✓ 100.0
  • Note 14: "NO PROSTATIC GLANDULAR TISSUE" → ✓ NaN (appropriate)
  • Note 6: "NO SIGNIFICANT ABNORMALITY" → ✓ 0.0

Quality Metrics:

  • Missing value rate: 2.04% (acceptable)
  • All values within valid range (0-100%)
  • Correctly handles explicit percentages, calculated values, and missing data

Decision: Proceed with Feature

Feature Accepted

After 3 validation cycles and 2 re-extractions, the feature percent_core_involvement_left_apex_medial successfully passes all quality checks and proceeds to the final dataset for model training.

Total Process: 7 agent interactions (1 discovery + 3 extractions + 3 validations)
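The seven interactions above follow a generic extract-validate loop. A minimal sketch of that control flow, with the LLM-backed agents stubbed out as plain callables and an illustrative cycle budget (not a documented SNOW parameter):

```python
MAX_CYCLES = 5  # illustrative budget; the example above converged in 3

def refine_feature(notes, instructions, extract, validate):
    """Repeat extraction until the validation agent accepts the values.

    `extract(notes, instructions)` returns per-note values; `validate`
    returns a verdict dict with a decision and, on rejection, revised
    extraction instructions (as in Steps 3 and 5 above).
    """
    for _ in range(MAX_CYCLES):
        values = extract(notes, instructions)
        verdict = validate(notes, values)
        if verdict["decision"] == "accept":
            return values
        instructions = verdict["updated_instructions"]
    raise RuntimeError("feature did not converge within the cycle budget")
```

In SNOW proper, accepted features then pass to the post-processing and aggregation agents before model training.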

Methods Compared

Overview of all feature generation approaches evaluated in this study

Manual Clinician Feature Generation (CFG)

Gold standard approach. Expert oncologists and data scientists collaborated for over a year to manually curate clinically relevant features from medical records. Requires extensive per-patient clinical expertise for feature definition and extraction from biopsy reports.

High Performance · Not Scalable · Interpretable

SNOW (Scalable Note-to-Outcome Workflow)

Our proposed method. Fully autonomous multi-agent LLM system with specialized agents for feature discovery, extraction, validation, post-processing, and aggregation. Requires no human intervention or clinical expertise.

High Performance · Scalable · Interpretable

Clinician-Guided LLM (CLFG)

Semi-automated approach. Uses LLMs with expert-written prompts to extract predefined clinician features. Includes detailed instructions based on manual processing experience, with post-processing following clinical rules.

Good Performance · Partially Scalable · Interpretable

Baseline Features

Structured data only. Features from structured EHR sources requiring minimal clinical expertise: demographics, maximum pre-treatment PSA, and Charlson Comorbidity Index. Does not utilize unstructured clinical notes.

Low Performance · Scalable · Interpretable

Representational Feature Generation (RFG) Methods

14 automated NLP approaches that generate latent features from clinical text without human oversight. Despite extensive evaluation, none provided additional predictive value beyond baseline features.

Traditional NLP
  • Bag-of-Words (Classic & TF-IDF)
  • 2-gram, 3-gram, 4-gram variants
Transformer Models
  • BERT, DistilBERT
  • ClinicalBERT, Longformer
Advanced Embeddings
  • Fine-tuned Mistral-7B-v0.3
  • OpenAI text-embedding models

Low Performance · Scalable · Not Interpretable
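To make the RFG family concrete, here is a toy TF-IDF featurizer in the spirit of the bag-of-words baselines listed above. It is a stdlib-only sketch for illustration; the study's actual RFG pipelines (and their tokenization and weighting choices) are not specified here.

```python
import math
from collections import Counter

def tfidf_features(notes: list[str]) -> list[dict[str, float]]:
    """Toy TF-IDF: map each note to a sparse term -> weight dictionary."""
    docs = [note.lower().split() for note in notes]
    n = len(docs)
    # Document frequency: in how many notes each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    features = []
    for doc in docs:
        tf = Counter(doc)
        features.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return features

feats = tfidf_features(["no significant abnormality", "no residual tumor"])
# A term shared by every note ("no") gets idf = log(2/2) = 0
print(feats[0]["no"])  # 0.0
```

Such latent weights feed a downstream classifier directly; unlike SNOW's named clinical features, the resulting dimensions carry no clinical meaning, which is the interpretability gap noted above.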

Results

Key Discoveries

🎯 SNOW Matches Expert-Level Performance

SNOW achieved an AUC-ROC of 0.761 ± 0.046, matching the gold-standard manual clinician feature generation (CFG) performance of 0.771 ± 0.036 without requiring any clinical expertise or human intervention. This represents the first demonstration of a fully automated system matching expert-driven performance for clinical feature generation.

🚀 Significant Improvement Over Baseline

All feature generation methods (except RFG) substantially outperformed baseline features alone (0.691 ± 0.079). SNOW provided a 10.1% improvement in AUC-ROC over baseline, demonstrating the critical value of extracting information from unstructured clinical notes for outcome prediction.

📊 RFG Methods Failed to Provide Value

None of the 'RFG + baseline' approaches outperformed the baseline features alone. These results suggest that although RFG methods may introduce new signal, they also inflate the dimensionality of the feature space, making models substantially harder to optimize in a small-sample setting of 147 patients. Because they did not improve model performance, we excluded them from the performance plot above.

⚖️ Quality vs. Scalability Trade-off Resolved

The results demonstrate that SNOW successfully bridges the gap between clinical expertise and scalable automation. While manual CFG delivers the highest performance, SNOW achieves comparable results with full automation, eliminating the traditional trade-off between quality and scalability in clinical feature engineering.

🔍 Interpretable Feature Generation

Unlike black-box RFG methods, SNOW generates interpretable structured features that capture complex clinical information. The system autonomously discovered and validated features across 14 prostate regions, including Gleason scores, tumor percentages, and cancer presence indicators that are directly interpretable by clinicians.

🏥 Clinical Deployment Potential

The combination of expert-level performance, full automation, and clinical interpretability positions SNOW as a viable solution for real-world clinical deployment, potentially transforming how healthcare organizations leverage unstructured EHR data for predictive modeling at scale.

BibTeX

@misc{wang2025agentbasedfeaturegenerationclinical,
      title={Agent-Based Feature Generation from Clinical Notes for Outcome Prediction}, 
      author={Jiayi Wang and Jacqueline Jil Vallon and Neil Panjwani and Xi Ling and Sushmita Vij and Sandy Srinivas and John Leppert and Mark K. Buyyounouski and Mohsen Bayati},
      year={2025},
      eprint={2508.01956},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.01956}, 
}