When “Accuracy” Isn’t Enough: Balancing Sensitivity and Specificity in Flow Cytometry


Flow cytometry helps identify abnormal or malignant blood and bone marrow cells.

But accurately detecting these rare cells requires a careful balance of sensitivity and specificity.

Sensitivity is the ability to find every truly abnormal cell, while specificity is the ability to correctly recognize normal cells, avoiding the false positives that trigger needless alarms.

If this balance is off, high accuracy may look impressive on paper, but the test will fail where it matters most: in real clinical decisions.

What does it mean to “balance” sensitivity and specificity?

In medical diagnostics, sensitivity and specificity are not “good” or “bad” numbers. Rather, they are dials that you adjust depending on the desired outcome of the test.

This is much like the trade-offs we face when choosing between deterministic and stochastic models, or interpretable and black-box models.

In the field of hematology, this balancing act becomes especially crucial in flow cytometry.

Detecting rare malignant cells in diseases such as chronic lymphocytic leukemia (CLL) or acute myeloid leukemia (AML) requires precise management of multiple factors.

Too much sensitivity results in an overwhelming number of false alarms.

Too much specificity, and you risk overlooking the very few cells that prove disease persistence.

Some AI algorithms can be tuned to optimize sensitivity and specificity based on the clinical context.

The goal is to improve diagnostic accuracy to support informed clinical decision-making, rather than relying on a single performance metric, such as overall accuracy.

Flow cytometry and rare-cell detection: why it’s so challenging

Flow cytometry can measure millions of cells per sample across dozens of biomarkers. In measurable residual disease (MRD) detection, the task is to find the abnormal cells hidden among those millions.

The malignant ones could number in the hundreds, or even fewer.

In other words, we’re trying to find a needle in a haystack. 

Let’s break it down:

  • Data overload: Each sample contains hundreds of thousands to millions of data points, each representing a cell with its own marker expression pattern. 
  • Rarity of malignant events: In follow-up MRD monitoring, there may be only one malignant cell per million. 
  • Clinical consequence: One misclassified cell may not matter. However, a few repeated errors can flip a patient’s result from negative to positive, which could alter treatment or remission status.

This is why flow cytometry is not just a data challenge but also a diagnostic responsibility. 

Why “accuracy” alone is the wrong scoreboard 

Imagine analyzing 100,000 cells, where 1,000 are malignant (1%). 

If your algorithm mechanically labels every cell as healthy, it achieves 99% accuracy but misses all 1,000 malignant cells. 

Picture the true sample distribution (99,000 healthy and 1,000 malignant) next to the trivial classifier’s predictions (100,000 “healthy,” 0 malignant).

Although the classifier is 99% accurate, it misses all the malignant cells, proving that accuracy alone can be misleading in highly imbalanced settings. 

Statistically “impressive.” But clinically, it’s disastrous. 

This is why accuracy can be misleading in imbalanced datasets, where disease is rare. 

In measurable residual disease testing, it’s possible to have 99.9999% accuracy and still fail at the task that matters most: detecting the one malignant cell that signals relapse. 

Metrics such as sensitivity, specificity, precision, and recall provide a more accurate representation of clinical performance than raw accuracy. 

Breaking down the metrics: TP, TN, FP, and FN

All diagnostic models’ predictions fall into one of four categories: 

  • True Positive (TP): correctly identifies a malignant cell as malignant 
  • True Negative (TN): correctly identifies a healthy cell as healthy 
  • False Positive (FP): incorrectly labels a healthy cell as malignant 
  • False Negative (FN): misses a malignant cell, labeling it healthy 
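These four categories can be counted directly from paired predictions and ground-truth labels. Here is a minimal sketch in Python, using a small made-up set of cell-level labels (1 = malignant, 0 = healthy), not real cytometry data:

```python
# Hypothetical per-cell labels: 1 = malignant, 0 = healthy.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]  # ground truth
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]  # model predictions

# Count each of the four outcome categories.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 2 4 1 1
```

Every prediction lands in exactly one of the four bins, so the counts always sum to the number of cells analyzed.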

Accuracy only considers the total number of correct answers (TP + TN), ignoring the cost of mistakes. 

But in diagnostics, not all mistakes are equal. 

Missing one malignant clone (FN) can be far more damaging than a few false positives (FP). 

That’s why clinicians and data scientists focus on sensitivity (how well you detect disease) and specificity (how well you avoid misdiagnosis). 

There are four possible outcomes for every diagnostic prediction.

Understanding these categories helps clinicians interpret model performance beyond accuracy. 

Let’s revisit the example of 100,000 cells with 1,000 malignant (1%):  

The trivial classifier (“everything is healthy”) scores:  

  • Accuracy: (TP + TN) / Total = (0 + 99,000) / 100,000 = 99%  
  • Sensitivity (Recall): TP / (TP + FN) = 0 / 1,000 = 0%  
  • Specificity: TN / (TN + FP) = 99,000 / 99,000 = 100%
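These three calculations can be reproduced in a few lines of Python. The counts below are simply the ones from the worked example above, not real assay data:

```python
# Trivial classifier ("everything is healthy") on 100,000 cells,
# 1,000 of which are truly malignant.
tp, fn = 0, 1_000        # every malignant cell is missed
tn, fp = 99_000, 0       # every healthy cell is labeled healthy

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # also called recall
specificity = tn / (tn + fp)

print(f"{accuracy:.0%} {sensitivity:.0%} {specificity:.0%}")  # 99% 0% 100%
```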

From a clinical perspective, this system is a disaster: it perfectly identifies healthy cells but misses every malignant cell.  

This is why clinicians (and modelers) emphasize sensitivity and specificity rather than raw accuracy.

In MRD detection, sensitivity ensures that we don’t overlook the rare malignant cells that indicate disease.

Specificity ensures we don’t drown in false positives that could mislabel a patient as sick.  

Why manual gating isn’t a perfect “gold standard” 

For decades, expert cytometrists have manually gated flow plots, using experience and intuition to draw boundaries. But even experts disagree. 

  • Operator variability: Two analysts may define different gates for the same population.  
  • Subjectivity: Some draw tight gates to minimize false positives, while others are more permissive. 
  • Batch effects: Day-to-day shifts in staining or instrument calibration can subtly alter results. 
  • Scalability challenge: Modern cytometers measure dozens of parameters, so the volume and dimensionality of the resulting data make manual gating increasingly unsustainable and time-consuming.

Most importantly, manual gating can miss outlier malignant cells that do not fit the textbook profile. 

Ironically, machine learning models, which are often more inclusive, may flag these atypical cells; measured against a manually gated “ground truth,” such flags count as false positives, even though they may, in fact, be clinically relevant. 

This is a reminder that automation does not merely replicate human reasoning; it can extend it. 

Screening vs. confirmation: two different diagnostic tasks 

The balance between sensitivity and specificity depends on the purpose of the test. 

Screening or triage: 

Goal: Don’t miss the disease. 

Priority: High sensitivity, even if that means more false positives. 

Example: An early MRD screen that flags every suspicious signal prevents patients from being falsely reassured.  

In flow cytometry, screening may involve rapid analysis of cell populations to identify samples containing abnormal or rare cells that require further study. 

Confirmatory testing: 

Goal: Provide high-confidence results that guide treatment. 

Priority: High specificity, minimizing false positives. 

Example: Before stopping therapy, you need near-certain evidence that no malignant cells remain. A false positive could lead to unnecessary chemotherapy or distress for the patient. 

In flow cytometry, confirmatory assays typically use more detailed or functional analyses to characterize abnormal cells definitively and support an accurate diagnosis. 

The same AI model can be tuned differently depending on whether it serves as a first-pass safety net (screening) or as a final arbiter (confirmation). This distinction is often overlooked, yet it is essential for real-world clinical deployment. 
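As a sketch of how one predictor might be tuned for each role, the hypothetical helper functions below pick a decision threshold on a model’s malignancy scores: the screening version keeps sensitivity at or above a target, while the confirmation version keeps specificity at or above a target. The scores and labels are toy values, not real cytometry output:

```python
def confusion(scores, labels, threshold):
    """Count TP/FN/TN/FP when cells scoring >= threshold are called malignant."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp, fn, tn, fp

def screening_threshold(scores, labels, min_sensitivity=0.95):
    """Highest threshold that still catches the target fraction of malignant cells."""
    best = None
    for t in sorted(set(scores)):
        tp, fn, tn, fp = confusion(scores, labels, t)
        if tp / (tp + fn) >= min_sensitivity:
            best = t  # keep raising the bar while sensitivity holds
    return best

def confirmation_threshold(scores, labels, min_specificity=0.95):
    """Lowest threshold that keeps healthy cells below the false-positive target."""
    for t in sorted(set(scores)):
        tp, fn, tn, fp = confusion(scores, labels, t)
        if tn / (tn + fp) >= min_specificity:
            return t
    return None

# Toy example: higher score = more likely malignant.
scores = [0.05, 0.20, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1]

print(screening_threshold(scores, labels, min_sensitivity=1.0))      # 0.6
print(confirmation_threshold(scores, labels, min_specificity=0.95))  # 0.9
```

The screening threshold is deliberately permissive (it admits a false positive rather than lose a malignant cell), while the confirmation threshold is deliberately strict; both come from the same scores.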

Checklist for evaluating flow cytometry ML models

When evaluating or deploying an ML predictor for MRD detection in hematology, consider the following:  

  1. Does it correctly identify truly rare malignant cells as true positives?  
  2. Does it prevent the system from being flooded with false alarms (false positives)?  
  3. Are the results reproducible across machines and operators?  
  4. Does it detect atypical malignant populations that manual gating might overlook? 

Here are some rules of thumb for clinicians:  

  • Never trust accuracy alone because it hides the rare-cell problem.
  • Always ask how many false positives accompany the true positives.
  • Keep in mind that manual gating isn’t perfect and may miss malignant outliers.
  • Adjust the sensitivity and specificity settings based on whether you’re screening or confirming.

Key takeaways: 

  • Accuracy can be misleading in highly imbalanced datasets 
  • Sensitivity and specificity should be adjusted according to the clinical purpose, whether screening or confirmation 
  • Manual gating has limitations, but AI can improve consistency and detect events that might be overlooked 
  • Trust in diagnostics comes from interpretability, reproducibility, and clear communication, not from black-box predictions 

A note from flow cytometry experts: 

Balancing sensitivity and specificity is more than just a technical exercise. It reflects how we approach patient care. Every diagnostic decision involves risk, trade-offs, and uncertainty. 

The art and science of flow cytometry are evolving quickly. With the rise of interpretable AI, clinicians and scientists now have tools that bring consistency and clarity to one of the most complex analyses in medicine. 

But no algorithm can replace judgment. Real progress lies in the collaboration of technology and expertise, helping clinicians see what matters most: every cell in every patient. 
