Introduction
This session explores how artificial intelligence (AI) can transform risk prediction in oncology, with a focus on breast cancer risk from screening mammography. It contrasts traditional biomarker-based approaches (e.g., mammographic density, Tyrer-Cuzick) with outcome-driven deep learning models that learn directly from raw imaging. The talk covers clinical deployment, model development and validation, interpretability, bias and equity considerations, pitfalls in dataset curation, and practical strategies for responsible implementation in clinical workflows.
The Evolving Role of AI in Medicine
AI can augment medicine along three axes: automating human-performable tasks, performing tasks beyond human capability, and generating novel clinical or biological insights. Risk prediction exemplifies tasks where AI can exceed human performance by leveraging subtle image patterns imperceptible to clinicians.
Key Points
- Three categories: automate human tasks; enable supra-human tasks (e.g., future risk prediction); derive new insights.
- AI can reshape the diagnostic paradigm by using raw data (imaging, pathology, genomics, EHR) without lossy human-crafted abstractions.
- Real-world data beyond clinical trials is underused; scalable AI can harness retrospective hospital data to improve care.
Traditional Risk Assessment and the Density Biomarker
Mammographic density emerged historically as a risk biomarker, but its predictive value is modest and subject to inter-reader variability. Conventional multivariable risk models achieve limited discrimination and improve only slightly with density.
Key Points
- Classic risk models achieve AUC ≈ 0.60; adding density increases AUC marginally (≈ 0.63).
- Density assessment suffers from substantial inter- and intra-radiologist variability, especially in middle categories.
- In a large cohort, dense breasts are common (~42%), yet the absolute incident cancer difference between dense and non-dense is small (e.g., ~8/1000 vs ~6/1000), diluting clinical actionability.
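The dilution claim is easy to check with back-of-envelope arithmetic. The sketch below uses only the rounded figures quoted above (~42% dense, ~8/1000 vs ~6/1000 incidence); it is illustrative arithmetic, not a clinical calculation:

```python
# How little the density biomarker concentrates risk, using the talk's
# rounded figures: ~42% of women are dense; incidence ~8/1000 (dense)
# vs ~6/1000 (non-dense).
p_dense = 0.42           # fraction of screened women with dense breasts
inc_dense = 8 / 1000     # incident cancers per woman, dense group
inc_nondense = 6 / 1000  # incident cancers per woman, non-dense group

# Expected cancers per 1,000 screened women in each group.
cancers_dense = p_dense * inc_dense * 1000
cancers_nondense = (1 - p_dense) * inc_nondense * 1000
total = cancers_dense + cancers_nondense

# Fraction of all cancers "captured" by flagging everyone with dense breasts.
capture = cancers_dense / total
print(f"flag {p_dense:.0%} of women, capture {capture:.1%} of cancers")
```

With these rounded inputs, flagging 42% of women captures only about 49% of incident cancers, consistent with the ~51% capture rate for density quoted later in the talk: barely better than random flagging.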
Automating Density Assessment: Clinical Deployment
Automated deep learning-based density scoring was deployed at Massachusetts General Hospital (MGH), where it improved consistency and workflow integration but did not materially improve downstream risk prediction.
Key Points
- High-quality training yielded automated density estimates with strong radiologist concordance and acceptance in routine practice.
- Automation enhances consistency but does not overcome the biomarker’s inherent predictive limitations.
- Lessons: easy-to-define tasks can use off-the-shelf tools; risk prediction demands more sophisticated, domain-specific modeling.
Learning Risk Directly from Outcomes
Outcome-based learning trains models to predict future cancer directly from imaging and known longitudinal outcomes, avoiding hand-crafted biomarkers.
Key Points
- Training data: more than 200,000 mammograms with longitudinal follow-up and ~1,200 incident cancers; minimal exclusions (only exams with missing outcomes were dropped).
- Model inputs: raw mammograms; optional clinical risk factors (age, BRCA, etc.).
- Architecture innovations included: multi-view fusion, device/domain harmonization, and modeling risk across time horizons (year-by-year risk curves).
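One way to realize "risk across time horizons" is to have the network emit a non-negative hazard per future year and accumulate hazards into a monotone cumulative risk curve. The sketch below shows only that accumulation step; the logit values are invented, and this is a simplified illustration of the idea, not the speaker's exact implementation:

```python
import math

def cumulative_risk(yearly_logits):
    """Turn per-year hazard logits into a monotone cumulative risk curve.

    Each year's hazard is sigmoid(logit); cumulative risk through year k is
    1 - prod_{t<=k}(1 - hazard_t), which is non-decreasing by construction.
    """
    risks, survival = [], 1.0
    for logit in yearly_logits:
        hazard = 1.0 / (1.0 + math.exp(-logit))
        survival *= (1.0 - hazard)
        risks.append(1.0 - survival)
    return risks

# Hypothetical per-year logits (in a real model these would come from the
# image encoder, optionally combined with clinical risk factors).
curve = cumulative_risk([-4.0, -3.5, -3.2, -3.0, -2.8])
assert all(a <= b for a, b in zip(curve, curve[1:]))  # monotone by design
```

The design choice matters: predicting hazards rather than independent per-horizon probabilities guarantees that a woman's 5-year risk can never be lower than her 1-year risk.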
Model Performance, Calibration, and Clinical Utility
Image-based risk models substantially outperform standard clinical risk tools, enabling more targeted screening strategies.
Key Points
- One-year risk AUC: Tyrer-Cuzick ≈ 0.66 vs imaging model ≈ 0.88; performance remains superior at 2- and 5-year horizons.
- Triage efficiency: Top 10% highest-risk by the model captured ≈ 57.8% of cancers within 1 year, outperforming density (≈ 51%) while flagging far fewer women (10% vs ~42%).
- Risk is not synonymous with density; high-risk and density can be discordant, reflecting richer image-derived phenotypes.
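The triage metric above (fraction of cancers captured among the top-k% of risk scores) is simple to compute once per-woman scores and outcomes are linked. A sketch on synthetic data follows; the cohort, scores, and resulting capture rate are made up for illustration and do not reproduce the talk's figures:

```python
import random

def capture_at_top_k(scores, labels, k=0.10):
    """Fraction of positive outcomes found among the top-k fraction of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_flag = max(1, int(len(scores) * k))
    flagged = set(order[:n_flag])
    positives = [i for i, y in enumerate(labels) if y == 1]
    return sum(i in flagged for i in positives) / len(positives)

random.seed(0)
# Synthetic cohort of 5,000 women with a 1% cancer rate; cancers tend to
# receive higher model scores than non-cancers.
labels = [1] * 50 + [0] * 4950
scores = [random.gauss(1.0 if y else 0.0, 1.0) for y in labels]
cap = capture_at_top_k(scores, labels)
print(f"top 10% of scores captures {cap:.1%} of cancers")
```

Random flagging at 10% would capture ~10% of cancers, so any capture rate well above that reflects genuine risk concentration, which is the basis of the density-vs-model comparison above.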
External Validation and Generalizability
Robustness across populations, institutions, and devices is essential; external evaluations support generalizability when models are appropriately trained.
Key Points
- Models trained at MGH sustained performance at Karolinska Institute and in an Asian population (Taipei), despite device and demographic differences.
- Device calibration differences (even within a single vendor) introduce distributional shifts; explicit harmonization is necessary.
- Prospective, population-specific validation remains critical for clinical deployment.
Defining Outcomes and Label Integrity
Outcome specification and label quality determine what the model learns; misuse of proxies can encode biases or spurious signals.
Key Points
- Outcome choices (e.g., “any cancer within X years,” invasive vs in situ, receptor status) must be clinically meaningful and supported by sufficient sample size.
- Link to tumor registries and verified follow-up to ensure accurate case ascertainment and censoring.
- Cautionary example: an algorithm trained on healthcare cost as a proxy for medical need (the widely reported UnitedHealth case) systematically underestimated the needs of Black patients, encoding socioeconomic and racial disparities.
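Registry linkage and censoring determine whether a "cancer within X years" label is even assignable. A minimal sketch of that logic follows; the function and field names are hypothetical, and real pipelines must also handle registry lag, multiple primaries, and exam laterality:

```python
from datetime import date

def outcome_label(exam_date, cancer_date, last_followup, horizon_years=5):
    """Return 1 (cancer within the horizon), 0 (verified cancer-free through
    the horizon), or None (censored: insufficient follow-up to label)."""
    horizon_days = horizon_years * 365
    if cancer_date is not None:
        days_to_cancer = (cancer_date - exam_date).days
        if 0 <= days_to_cancer <= horizon_days:
            return 1
    followup_days = (last_followup - exam_date).days
    if followup_days >= horizon_days:
        return 0
    return None  # censored: drop from training rather than mislabel as negative

# Cancer 2 years after the exam -> positive at a 5-year horizon.
assert outcome_label(date(2015, 1, 1), date(2017, 1, 1), date(2020, 1, 1)) == 1
# No cancer recorded, but only 1 year of follow-up -> censored, not negative.
assert outcome_label(date(2019, 1, 1), None, date(2020, 1, 1)) is None
```

The `None` branch is the important one: treating censored exams as negatives is exactly the kind of proxy misuse the bullet above warns against.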
Bias, Equity, and Responsible Evaluation
Bias predates AI; modern models offer an opportunity to improve equity if trained and evaluated correctly.
Key Points
- Legacy clinical models (e.g., Tyrer-Cuzick) underperform in non-White populations; AI can mitigate this with diverse, representative training data.
- Evaluate across subgroups (race/ethnicity, age, breast size, implants), settings (academic vs community), and devices.
- Establish human–machine workflows to detect systematic biases and support clinician oversight.
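Subgroup evaluation amounts to computing the same discrimination metric within each stratum. A sketch using a rank-based AUC (the probability a random positive outscores a random negative) on synthetic data; the group labels and scores are placeholders:

```python
def auc(scores, labels):
    """Rank-based AUC: P(random positive outscores random negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auc(scores, labels, groups):
    """AUC computed separately per subgroup (race, age band, device, site...)."""
    out = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        out[g] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return out

scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.6, 0.3]
labels = [1,   0,   1,   0,   1,   0,   1,   0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(subgroup_auc(scores, labels, groups))  # perfect separation in both groups
```

Reporting only the pooled AUC can mask a subgroup where the model fails; the per-stratum table is what makes bias auditable.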
Interpretability, Spurious Correlations, and Adversarial Vulnerabilities
Interpretability methods can elucidate model focus, yet models remain vulnerable to spurious correlations and adversarial perturbations.
Key Points
- Saliency/attention maps show spatial focus; over successive years, risk heatmaps increasingly localize to future lesion sites, differentiating short-horizon “occult disease” from long-horizon “tissue predisposition.”
- Models can infer patient attributes (age, BMI proxy, menopause status, density) from images, explaining limited additive value of questionnaires.
- Beware data leakage and spurious features (e.g., year/device changes, hospital identity) that inflate performance without learning pathology.
- Adversarial examples highlight theoretical vulnerabilities; secured pipelines and QA processes are essential.
Implementation Considerations and Modeling Strategy
Successful deployment demands tailored modeling and rigorous engineering beyond off-the-shelf components.
Key Points
- Layered approach (the colors refer to the speaker's slide coding): generic baseline architectures ("blue"); advanced, domain-adapted ML ("gold"); task-specific custom methods ("green"). Each successive layer materially improves performance.
- Multi-view aggregation, cross-device/domain adaptation, and time-to-event risk modeling are crucial for mammography risk prediction.
- Train/dev/test segregation with locked test evaluation prevents optimistic bias; calibration and decision thresholds should align with clinical workflows.
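A basic calibration check bins predicted risks and compares each bin's mean prediction to its observed event rate. The sketch below uses equal-width bins and invented toy data; the bin count and numbers are illustrative only:

```python
def calibration_table(pred, obs, n_bins=4):
    """Per-bin (index, count, mean predicted risk, observed event rate),
    using equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(pred, obs):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    table = []
    for i, b in enumerate(bins):
        if not b:
            continue  # skip empty bins rather than divide by zero
        mean_pred = sum(p for p, _ in b) / len(b)
        event_rate = sum(y for _, y in b) / len(b)
        table.append((i, len(b), mean_pred, event_rate))
    return table

# Toy data in which observed rates roughly track predicted risk.
pred = [0.05, 0.10, 0.30, 0.35, 0.60, 0.65, 0.85, 0.90]
obs  = [0,    0,    0,    1,    1,    0,    1,    1]
for i, n, mp, er in calibration_table(pred, obs):
    print(f"bin {i}: n={n} mean_pred={mp:.2f} observed={er:.2f}")
```

Discrimination (AUC) and calibration are separate properties: a model can rank women correctly yet systematically over- or under-state absolute risk, which is what clinical decision thresholds depend on.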
Common Pitfalls in AI for Imaging Risk Prediction
Missteps in data and design can mislead development and hinder translation.
Key Points
- Artificially easy datasets (e.g., large, obvious tumors) overestimate performance; focus on clinically challenging, contemporary cases.
- Excluding subgroups (e.g., small breasts) limits applicability and propagates inequity.
- Missing tumor registry linkage, inadequate follow-up, or proxy labels impair outcome validity.
- Ignoring device/version shifts and site differences degrades robustness.
- Overreliance on “plug-and-play” tools can yield subpar models; domain-specific customization is often necessary.
Future Directions: From Prediction to Insight
Beyond prediction, AI can advance disease understanding and personalized care by integrating multimodal data and revealing latent biology.
Key Points
- Use large-scale real-world data to complement trial evidence and guide individualized screening, prevention, and surveillance.
- Translate model behavior into mechanistic hypotheses (e.g., tissue microenvironment signatures of susceptibility).
- Expand to multimodal fusion (pathology, genomics, clinical text) to refine risk stratification and therapeutic decisions.
Conclusion
AI-driven, outcome-based imaging models markedly improve breast cancer risk prediction over traditional tools, enabling targeted, equitable screening. Real-world clinical deployment requires precise outcome definition, rigorous external validation, bias auditing, interpretability, and device-aware modeling. With careful curation and implementation, these systems can shift practice from coarse biomarkers to individualized, data-rich risk assessment and ultimately generate new insights into disease biology and care.