Medical outcome data
are often utilized to compare treatments or providers. Because
patient outcomes may be influenced by severity of illness, treatment
effectiveness, or chance {1-5}, such studies must account for differences
in the prevalence of patient risk factors (case-mix). In some
instances, it is possible to reduce or eliminate outcome variability
due to case-mix through randomization, which should, on average,
balance both known and unknown risk factors. Unfortunately,
this is impractical in most real-life situations such as the comparison
of results among institutions. Other study designs rely on covariate
matching or the use of propensity scores {6, 7},
both of which balance only known risk-factors. However, in the
majority of existing observational studies in medicine
and surgery, case-mix has usually been accounted for through risk
adjustment. Using statistical modeling techniques, typically
some form of multivariable regression analysis {8}, investigators
study the association between individual risk factors (also referred
to as predictor variables or covariates) and outcomes while holding
constant the effect of others. Once the impact of each risk factor
is determined from a given population sample, it then becomes
possible to estimate the probability of the outcome for patients
having particular combinations of these risk factors.
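As a minimal sketch of this last step, the following Python fragment
converts a patient's risk factors into an estimated probability using
a logistic model. The intercept, coefficients, and variable names are
invented for illustration and are not taken from any published model.

```python
import math

# Hypothetical coefficients on the log-odds scale; illustrative only,
# not from any published cardiac surgery risk model.
INTERCEPT = -5.0
COEFFICIENTS = {
    "age_decades_over_50": 0.5,
    "reoperation": 1.0,
    "emergency": 1.5,
}

def predicted_risk(patient):
    """Return the estimated probability of the outcome for one patient."""
    log_odds = INTERCEPT + sum(
        weight * patient.get(name, 0.0)
        for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# The odds ratio for each risk factor is the exponential of its coefficient.
odds_ratios = {name: math.exp(w) for name, w in COEFFICIENTS.items()}

low_risk = {"age_decades_over_50": 1, "reoperation": 0, "emergency": 0}
high_risk = {"age_decades_over_50": 3, "reoperation": 1, "emergency": 1}

print(round(predicted_risk(low_risk), 3), round(predicted_risk(high_risk), 3))
```

Each coefficient describes the change in log-odds for one risk factor
with the others held constant, which is why the same fitted equation
can be applied to any combination of risk factors.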
Although researchers
have utilized risk models for many years, their broader applicability
and critical importance were more fully appreciated following
the 1986 release of unadjusted hospital outcome data by the Health
Care Financing Administration (HCFA, now called CMS or the Centers
for Medicare and Medicaid Services). Providers correctly argued
that such data did not account for patient severity, and this
led directly to the development of a number of high quality clinical
databases and risk models, especially in cardiac surgery. Coronary
artery bypass grafting (CABG) has been a particular focus of such
research, not only because of the desire of cardiac surgeons to
improve patient outcomes, but also because regulators and insurers
have sought greater control over this high-profile, costly, and
frequently performed procedure.
Some of the original
cardiac surgery databases were voluntary (Northern New England
Cardiovascular Disease Study Group [NNE] {9}) and some were mandated
by state or federal law (NY{10}, NJ{11}, Veterans Affairs Administration
[VA]{12}). Soon after the HCFA release of unadjusted outcome data
in 1986, the Society of Thoracic Surgeons established an Ad Hoc
Committee on Risk Factors for Coronary Artery Bypass Surgery {13}.
Subsequently, a committee under the direction of Dr. Richard Clark
began work on the development of the STS National Cardiac Database
(STS NCD). This database was formally established in 1989 and
the software was released to STS members in 1990 {14-16}. Over
the subsequent 13 years this has evolved to become one of the
largest single-specialty databases in the world, containing data
on over 2.4 million patients from 60% of US cardiac programs.
In this review, we
present some fundamental aspects of risk model development and
validation, the current uses and limitations of such models in
cardiac surgery (with special attention to CABG), and prospects
for the future.
DATA
Sources
No risk adjustment
model is better than the data upon which it is based. Administrative
data, such as that from the CMS MEDPAR database, provide one of
the most commonly used sources for observational studies. Such
data are readily available, relatively inexpensive, and contain
information on millions of patients {2, 17}. However, because
these administrative data have been collected primarily for billing
purposes rather than for clinical studies, critical variables
such as ejection fraction are unavailable, and differentiation
of co-morbidities from complications is problematic. The latter
deficiency may exaggerate the predictive ability of risk models
derived from such data and may occasionally lead to the paradoxical
appearance of higher predictive accuracy than that of models derived
from clinical databases {2, 17, 18}. This results from the inappropriate
inclusion in the model of pre-terminal complications that are
highly correlated with mortality. Administrative databases may
exclude important variables that are not billable diagnoses ("field
saturation") {19}, they limit the number of secondary diagnoses,
and they generally have insufficient flexibility to classify properly
certain co-morbidities {17}, all of which limit the accuracy of
risk models derived from them.
Core Variables
Griffith and associates
{20} found that critical clinical variables known to be associated
with mortality (e.g. ejection fraction, emergency procedures)
were not included in the first Pennsylvania CABG database, leading
to problematic accuracy. Hannan and associates compared models
based upon the New York clinical cardiac surgery database (CSRS)
with models derived from the New York administrative database
(SPARCS) {21} and with the HCFA MEDPAR database {17}. In both
instances, models derived from the clinical database were found
to provide superior performance. Accuracy of the models based
upon administrative data was improved substantially by the addition
of a few critical clinical variables (ejection fraction, re-operation,
left main obstruction).
Tu and associates {4}
have determined a limited set of six core variables (age, gender,
acuity, reoperation, left main coronary obstruction, ejection
fraction) beyond which they believe there is little incremental
improvement in model performance. Jones and associates {22} from
the Cooperative CABG Database Project identified seven core variables
(age, gender, left ventricular function, acuity, left main disease,
reoperation, number of diseased vessels). They found that acuity,
reoperation, and age accounted for the majority of predictive
information in most CABG databases. In the STS NCD, 78% of the
explained variance from the entire 28 variable model is derived
from the eight most important predictors (age, surgical acuity,
reoperative status, creatinine level, dialysis, shock, chronic
lung disease, and ejection fraction).
All studies confirm
that the gold standard for data is a specialty-specific, prospectively
maintained clinical database such as the STS NCD. Such databases
should contain, at a minimum, a core set of variables that has
been demonstrated to be associated with outcome.
Definitions
Certain caveats regarding
data apply generally to all regression models including those
employed for cardiac surgery. One of the most important is strict
standardization of data definitions. Even for a seemingly unambiguous
endpoint like mortality, there are important statistical and policy
implications of using (a) in-hospital mortality, regardless of
when it occurs, (b) thirty-day all-cause mortality, regardless
of where it occurs, and (c) operative mortality, defined
as either (a) or (b). A fixed time period is statistically preferable
{1} although more difficult to obtain than in-hospital mortality
from databases that are not payer-based. Osswald and colleagues
{23, 24} have studied the implications of different definitions
of "early" post-CABG mortality, especially in light of advancements
in postoperative
care. Particularly for higher risk patients where the early postoperative
phase may be prolonged, these investigators assert that the true
early mortality will be underestimated unless a complete tally
of deaths, and their time of occurrence, is compiled for the first
6 months after surgery.
Quality
Whenever practical,
continuous data should be used as such to avoid the arbitrariness
and loss of valuable information that occurs with categorization
{25}. Transformations of the measurement scale may be required
for the values of the variables to be commensurate with the assumptions
of the risk factor model being used (so-called linearizing transformations).
Data entry software should contain internal quality controls for
out of range, inconsistent, contradictory, and missing data. Ideally,
values for missing data should be substituted using multiple imputation
techniques {26, 27}. Institutions should receive periodic reports
regarding their data quality and any anomalies in recording the
values of variables compared with regional and national averages.
All these features are included in the STS NCD.
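The kind of internal quality control described above might be
sketched as follows. The field names and plausible ranges are
hypothetical and are not drawn from the STS NCD data dictionary.

```python
# Hypothetical field names and plausible ranges; illustrative only.
PLAUSIBLE_RANGES = {
    "age_years": (18, 110),
    "ejection_fraction_pct": (5, 90),
    "creatinine_mg_dl": (0.1, 20.0),
}

def check_record(record):
    """Return a list of data-quality problems found in one patient record."""
    problems = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    # Example of a contradiction check between two related fields.
    if record.get("reoperation") and record.get("prior_cardiac_operations") == 0:
        problems.append("reoperation flagged but prior_cardiac_operations is 0")
    return problems

clean = {"age_years": 67, "ejection_fraction_pct": 45, "creatinine_mg_dl": 1.1}
suspect = {"ejection_fraction_pct": 150, "creatinine_mg_dl": 1.1,
           "reoperation": True, "prior_cardiac_operations": 0}
print(check_record(suspect))
```

Flagging a problem at data entry, rather than silently correcting it,
leaves the decision about the true value with the data manager.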
MODEL DEVELOPMENT
Risk model development
requires considerable statistical expertise and judgment, a caveat
that is sometimes forgotten in this era of ubiquitous, powerful,
off-the-shelf statistical software. For example, the type of modeling
strategy and validation techniques may differ depending on whether
the purpose of the model is description of relationships
(e.g. comparison of providers or treatments) or prediction
of future events {28, 29}. Spiegelhalter has demonstrated the
importance of such considerations when the aim of the model is
probabilistic prediction to aid in patient selection and counseling
{29}. Even when the same basic model is used, Naftel has demonstrated
numerous technical reasons that different multivariable equations
might be developed by different statisticians from the same data
{30}.
Three principal techniques
have been utilized for the construction of cardiac surgery risk
models. Bayesian models were used initially for the STS NCD because
they are robust with regard to missing data, an important problem
in the early database experience. As data completeness improved,
logistic models were substituted in 1995 {31, 32}. Logistic regression
models continue to be the most common statistical technique for
cardiac surgery risk modeling, not only for the STS National Database
but also for those developed in New York {33}, the Veterans Affairs
Administration {34}, and the Northern New England Cardiovascular
Disease Study Group {35}. Some groups have employed simple additive
scores with weights derived from the logistic regression model
{11, 36}. Comparative studies {37} have generally demonstrated
that logistic models offer the best overall performance.
Some have suggested
that the next major advance in model performance would come from
the application of algorithmic models, sometimes called machine
learning techniques {38}, of which artificial neural networks
are an example {39}. These models permit complex, non-linear information
processing, thus avoiding one of the constraints of logistic models.
However, two studies {40, 41}, one using 80,606 patients from
the STS NCD {40}, failed to demonstrate any significant improvement
over logistic or Bayesian models.
Strategies for logistic
regression model development include data reduction techniques
and variable selection, interaction terms and transformations
of variables where appropriate, imputation of missing data, verification
of model assumptions, and model validation {8, 42, 43}. One fundamental
and still controversial question is the number of variables to
include in a risk model. An excessive number of covariates may
lead to some predictors having statistical significance but not
biologic or clinical relevance, over-fitting of the model, numerical
instability, and increased cost and difficulty of data collection
{4, 8, 42, 43}. Harrell {8, 42} has recommended that the number
of covariates considered for inclusion in such models be less
than one-tenth the number of end points observed in the dataset
(percent occurrence x sample size, for early events), which typically
requires some data reduction technique to decrease the number
of candidate variables. Univariate screening of candidate covariates
followed by forward or backward stepwise selection is commonly
employed for this purpose. However, some statisticians including
Harrell have criticized this variable selection technique on theoretical
grounds, and research continues on the relative merits of parsimonious
models versus those that retain most or all potential predictors
{8, 29, 44, 45}. The use of "machine-learning" variable selection
methods such as bagging (bootstrap aggregating) is also
gaining popularity {46, 47}.
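The rule of thumb attributed to Harrell above reduces to simple
arithmetic, sketched here. The sample size and event rate are invented
for illustration, and the function names are not from any library.

```python
def expected_events(sample_size, event_rate):
    """Number of end points: percent occurrence x sample size."""
    return round(sample_size * event_rate)

def covariate_limit(n_events, events_per_covariate=10):
    """Candidate covariates should number fewer than this value."""
    return n_events / events_per_covariate

events = expected_events(5000, 0.03)   # 150 deaths among 5,000 patients
print(events, covariate_limit(events))  # 150 15.0
```

With a 3% event rate, a dataset of 5,000 patients therefore supports
consideration of fewer than 15 candidate covariates, which illustrates
why data reduction is usually needed when events are uncommon.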
Finally, most studies
demonstrate that models have inferior performance when applied
to patient groups other than the one from which they were developed.
Ivanov and colleagues {48} found that institution or region-specific
custom models performed best, followed by recalibrated models
using the covariate set from a "ready-made" model but different
weights. Applying unmodified off-the-shelf models to other groups
of patients provided the least satisfactory performance. In contrast,
Nashef and colleagues {49} recently reported that the additive
EuroSCORE cardiac surgery risk-prediction model functioned well
when applied to North American populations.
MODEL VALIDITY
Any statistical risk
model must be scrutinized to determine whether it functions reliably
for its intended purpose. Numerous types of validity have been
summarized by Daley {1} including face validity (the model
is reasonable to experts), content validity (all important
variables have been included), attributional validity (risk
adjustment is adequate to ensure that differences in outcome are
not due to patient characteristics) and predictive validity.
The latter is usually assessed by means of model calibration
and discrimination, typically employing a cross-validation
technique in which the whole dataset is split into development
and validation subsets. Harrell has recommended using the
entire dataset for model development, then validating using another
technique such as bootstrapping {8}.
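A simple random split into development and validation subsets might
look like the sketch below. The function name, the default fraction,
and the fixed seed are assumptions made for illustration; the
bootstrap approach recommended by Harrell is not shown.

```python
import random

def split_development_validation(records, validation_fraction=0.5, seed=0):
    """Randomly divide records into (development, validation) lists."""
    shuffled = list(records)
    # A fixed seed makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

development, validation = split_development_validation(range(100), 0.25)
print(len(development), len(validation))  # 75 25
```

The model is fitted only on the development subset, and calibration
and discrimination are then measured on the validation subset.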
Calibration assesses
the concordance between deciles of observed and expected risk
as measured by the Hosmer-Lemeshow test {43}, and answers the
question "In 100 patients with the same estimated mortality risk
(R%) as mine, would the observed number of deaths equal R?" Most
CABG risk models are well calibrated overall but may produce estimates
that are too extreme in the lowest and especially the highest
risk subsets of patients, particularly in small datasets {45,
50}. Shrinkage of regression coefficients may provide more accurate
and realistic prediction for future patients {8, 45, 50, 51}.
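The decile comparison underlying the Hosmer-Lemeshow test can be
sketched as follows. This is an illustrative implementation written
for this review, not code from any cited source; it returns only the
test statistic, which is conventionally compared with a chi-square
distribution having (number of groups - 2) degrees of freedom.

```python
def calibration_table(predicted, died, n_groups=10):
    """Group patients by predicted risk.

    Returns one (n, observed deaths, expected deaths) tuple per group.
    """
    ranked = sorted(zip(predicted, died))
    n = len(ranked)
    table = []
    for g in range(n_groups):
        chunk = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        observed = sum(d for _, d in chunk)
        expected = sum(p for p, _ in chunk)
        table.append((len(chunk), observed, expected))
    return table

def hosmer_lemeshow_statistic(table):
    """Sum over groups of (O - E)^2 / (E * (1 - E / n))."""
    total = 0.0
    for n, observed, expected in table:
        variance = expected * (1.0 - expected / n)
        if variance > 0:
            total += (observed - expected) ** 2 / variance
    return total

# Invented data: two risk strata whose observed deaths match expectation.
predicted = [0.1] * 50 + [0.5] * 50
died = [1] * 5 + [0] * 45 + [1] * 25 + [0] * 25
print(calibration_table(predicted, died, n_groups=2))
```

A small statistic indicates close agreement between observed and
expected deaths across the risk groups; inspecting the table itself
shows whether disagreement is concentrated in the extreme groups.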
Discrimination is the
more demanding test and measures the tradeoff between the specificity
and sensitivity of the risk model at various probability cut points.
It asks the question "How well does the model separate patients
who die from those who survive?" and may be interpreted as the
percentage of discordant (meaning one death and one survivor)
patient-pairs for which the model predicts a higher mortality
risk for the patient who actually dies {43, 52, 53}. This percentage
is equal to the area under the ROC curve {43} and is called the
c-index or c-statistic {8, 42}. Unfortunately, most CABG risk
models have only moderate discrimination as measured by the c-index.
This has significant implications for their use in individual
patient counseling {52, 54}. Risk models may accurately predict
that three out of one hundred patients with a given set of risk
factors will die postoperatively, but they cannot identify which
three.
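The pairwise interpretation of the c-index can be computed directly,
as in the sketch below. Counting tied predictions as one half is a
common convention assumed here, and the four-patient dataset is
invented for illustration.

```python
def c_index(predicted, died):
    """Fraction of death/survivor pairs in which the patient who died
    was assigned the higher predicted risk (ties count one half)."""
    deaths = [p for p, d in zip(predicted, died) if d]
    survivors = [p for p, d in zip(predicted, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for risk_died in deaths:
        for risk_survived in survivors:
            if risk_died > risk_survived:
                concordant += 1.0
            elif risk_died == risk_survived:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))

print(c_index([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation.
The double loop examines every pair, so it is practical only for
modest sample sizes; rank-based formulas give the same result faster.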
The similar and limited
discrimination of most CABG risk models (VA, NNE, STS, NY) suggests
that we still await the quantum improvement in risk prediction
anticipated by Steen {39} over a decade ago. The performance of
current models is limited by as yet unknown predictors, difficulty
in measuring or representing certain complex clinical states,
random catastrophic events such as serious protamine reactions
or sudden hemorrhage, and similar occurrences that are rare in
the population but important in the individual patient {22, 40,
55-57}. Because known patient characteristics will never explain
all the variance in cardiac surgery outcomes, the performance
of any risk model has inherent limitations.
USES OF RISK MODELS
One of the most important
uses of cardiac risk models, including the STS risk model, has
been for academic research. Typically this has involved estimation
of the effect of risk factors or particular therapies on patient
outcome. Logistic models are well suited to this function because
they readily provide odds ratios for each risk factor.
STS NCD studies of preoperative risk factors include the impact
of race {58}, gender {59}, and obesity {60}. Similar investigations
of therapeutic options have included the value of IMA use {61,
62}, beta blockade {63}, and off-pump techniques {64}. The STS
NCD has also been utilized to clarify our understanding of the
relationship between volume and outcome for CABG {65-67}.
A second use for risk
models is the development of tools that aid in everyday patient
management. These would include patient care algorithms or critical
pathways scientifically based upon risk-adjusted studies {56}.
One such management tool is the use of risk models in individual
patient counseling or as a decision support tool for clinicians
choosing between different interventions (e.g. coronary artery
bypass versus percutaneous angioplasty). Pocket cards (Northern
New England Cardiovascular Disease Study Group) {68} or handheld
computers {69} have been used to generate bedside risk estimates
for individual patients. Obviously, in order to properly apply
such methods, values for each risk factor in the model must be
available. Because of the modest discrimination of most risk models,
which limits the accuracy of individual patient prediction, it
is recommended that such information be presented to patients
not simply as a probability estimate, but accompanied by confidence
limits that demonstrate the uncertainty of those estimates.
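One way such confidence limits might be obtained from a logistic
model is sketched below, using a normal approximation on the log-odds
scale. The log-odds and standard error shown are hypothetical; in
practice the standard error would come from the fitted model's
covariance matrix.

```python
import math

def expit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def risk_with_confidence_limits(log_odds, standard_error, z=1.96):
    """Return (lower limit, estimate, upper limit) as probabilities.

    The interval is built on the log-odds scale and then
    back-transformed, so the limits always lie between 0 and 1.
    """
    return (
        expit(log_odds - z * standard_error),
        expit(log_odds),
        expit(log_odds + z * standard_error),
    )

# Hypothetical patient: log-odds of -3.5 with a standard error of 0.4.
lower, estimate, upper = risk_with_confidence_limits(-3.5, 0.4)
print(f"{estimate:.1%} (95% limits {lower:.1%} to {upper:.1%})")
```

These limits express uncertainty in the estimated probability for
patients with this risk profile; they do not indicate which
individual patients will die.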
One of the most common
uses of risk models is to compare provider performance. This is
statistically challenging from the outset because of the low incidence
of the binary outcome (operative mortality) as well as the highly
variable and often small sample sizes from different providers.
Typically, as in New York State, profiling has been achieved by
aggregating the probabilities of death of each patient treated
by a given provider based upon the results of logistic modeling.
This aggregate mortality is used together with the observed provider
mortality to construct a ratio of observed to expected mortalities,
or O/E ratio. However, modern research in provider profiling has
demonstrated potential deficiencies with this approach. Standard
techniques may not accurately reflect the unobserved true mortality
of low volume providers, and they do not adequately account for
clustering (non-random allocation) of observations within
providers (such as the prevalence of heart failure patients at
transplant centers {56}). The net effect may be an underestimation
of random inter-provider variability, an overestimation of systematic
inter-provider variability, and an increased likelihood of falsely
classifying a provider as an outlier. Hierarchical or multilevel
models have been designed to address these concerns and some advocate
their use for profiling whenever feasible {44, 70-75}.
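The standard O/E calculation described above can be sketched as
follows. Multiplying the ratio by a reference mortality rate to obtain
a risk-adjusted rate is a common convention assumed here, and all
numbers are invented. The sketch does not implement the hierarchical
models mentioned above.

```python
def observed_to_expected(predicted_risks, deaths):
    """Ratio of observed deaths to the sum of predicted probabilities."""
    expected = sum(predicted_risks)
    observed = sum(deaths)
    return observed / expected

def risk_adjusted_rate(oe_ratio, reference_rate):
    """Scale a reference mortality rate by the provider's O/E ratio."""
    return oe_ratio * reference_rate

# Invented provider: 200 patients, each with 2% predicted risk, 6 deaths.
risks = [0.02] * 200
deaths = [1] * 6 + [0] * 194
oe = observed_to_expected(risks, deaths)
print(round(oe, 2), round(risk_adjusted_rate(oe, 0.03), 3))
```

With only six observed deaths, one additional death or one fewer
would shift the ratio substantially, which illustrates the difficulty
with low-volume providers noted above.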
Whatever method is
used for provider profiling, identification of statistical outliers
is only a starting point for further analysis {56, 76}. All risk
models have inherent limitations, and the results obtained from
different models and by different statisticians may vary {28-30,
50, 72}. The results derived from any risk model should be regarded
as one element to be considered in conjunction with other traditional
indices of competent surgical care. The Society of Thoracic Surgeons
firmly holds that those surgical programs identified as statistical
outliers should not be arbitrarily labeled "substandard". Outlying
providers' coding practices and individual mortalities should
be carefully analyzed, and process or structural causes should
be sought to explain any truly aberrant results {12, 77-80}.
Finally, risk-adjusted
outcomes have also been used in northern New England {81} and
Minnesota {82} as the basis for confidential continuous quality
improvement (CQI) activities, and in the Veterans Affairs Administration
for both confidential monitoring of performance and CQI activities
{12, 78}. Here, the main goal is not public accountability but
rather provider-initiated determination of best practice, benchmarking,
and regional or system-wide improvement. This has resulted in
mortality reduction that appears comparable to that achieved using
public report cards {83}. The Society of Thoracic Surgeons has implemented
process improvement initiatives based on analyses derived from
the STS NCD {62}.
LIMITATIONS AND DISADVANTAGES OF RISK MODELS
As previously described,
risk models do not have perfect discrimination, the ability to
predict the death of specific individuals. Because our ability
to determine expected outcome is limited, risk-adjusted mortality
rates derived from the O/E ratio are also subject to error. Publication
of risk-adjusted mortality "accurate" to the nearest hundredth
of a percentage may be misleading to the public, and all the more
so when these are not accompanied by confidence limits. Furthermore,
even the best available risk models explain only a small proportion
of the variability in cardiac surgery outcomes, a significant
liability if the goal is to assess inter-provider differences
in quality of care based upon their relative risk-adjusted mortalities
{28, 84}. Statisticians continue to evolve new and more sophisticated
techniques to model complex biological phenomena, and it is hoped
that the performance of risk models will continue to improve.
From a health policy
perspective, some argue that current risk models place too much
emphasis on mortality as the sole endpoint, which may ultimately
decrease access to surgery for those who might benefit most (high-risk
case avoidance) {13, 55, 70, 85}. For similar reasons, such
models may encourage gaming of the reporting system when
used for provider profiling {70}. Emphasis on outcome endpoints
such as mortality may deflect attention from important process
and structural aspects of care. Finally, challenges common
to all cardiac surgery databases include their cost, especially
at a time when revenues from heart surgery are declining, and
privacy issues related to HIPAA.
CURRENT STATUS AND FUTURE DIRECTION
Between 1990 and 1997,
the number of centers contributing patients to the STS NCD grew
from 105 to 450, and it currently contains clinical data on 2.4
million patients {16, 76, 86}. Results over the decade 1990-1999
demonstrate a progressive increase in preoperative risk and a
decline in both observed mortality and the O/E ratio {86}.
Although it is already
the dominant cardiac surgical database in the world, it is the
goal of the STS to achieve 100% participation of all cardiac surgery
providers in the U.S. This, together with increased efforts to
validate the accuracy of the submitted data, will eliminate any
remaining concerns about the voluntary nature of the database,
although there has generally been close correlation between the
results obtained from voluntary (STS NCD, NNE) and mandatory databases
(VA, NY) {76, 86}. Recent studies matching STS NCD results with
those from CMS claims data suggest substantial agreement both
with regard to completeness and observed mortality. The STS NCD
should become the dominant source of accurate information for
the federal and state governments regarding outcome and reimbursement
issues, and there is also the opportunity to partner with industry
in the assessment of new technologies.
A systematic review
of the STS NCD was begun in 1997 {16, 76}. The STS Definitions
Committee worked with the American College of Cardiology to eliminate
unimportant variables and to develop a new data dictionary. The
original 512 fields were reduced to 217 core and 255 extended
fields. Discussions continue among representatives of the major
cardiac surgery and cardiology databases to resolve inconsistent
data definitions.
The Duke Clinical Research
Institute (DCRI) became the data analysis and warehouse center
for the STS NCD during 1998-1999. In addition to their sophisticated
resources for data cleaning and verification, DCRI provides detailed
semiannual reports comparing individual programs with their region
and with the overall STS national dataset. Regular national meetings
of hospital data managers have substantially enhanced uniformity
of coding practices and completeness.
From models that focused
primarily on CABG mortality, the STS NCD has evolved to a family
of related risk models for prediction of outcomes following valve
replacement with or without CABG {87, 88}, congenital heart surgery
{89}, and general thoracic surgery. There is increased awareness
of the importance of endpoints other than operative mortality,
including perioperative morbidity and length of stay {15, 16,
76, 90, 91}. Postoperative complications and readmissions (which
are delayed complications) appear to measure different and complementary
aspects of care compared with hospital mortality {92, 93}, and
not all measures of outcome are equally indicative of program
quality. Future emphasis will also include documentation of long-term
follow-up data, functional status and quality of life measures,
and the relationship of clinical factors to hospital costs. Finally,
there will be an increasing emphasis on process measures
(e.g. IMA and beta blocker use) to assess and improve provider
performance. This approach may offer the greatest potential to
enhance the results of all cardiac surgery programs and to reduce
interprovider variability. It also diminishes our reliance upon
sophisticated, yet still imperfect, methods of risk-adjustment
{56, 62, 76}.
Cardiac surgery remains
at the forefront of risk model development and clinical quality
monitoring. With advancements in statistical methodology, expanding
enrollment in major databases such as the STS NCD and EuroSCORE,
and the firm commitment of cardiac surgeons, our profession will
maintain its leadership in this vital area of health care.