Evidence Based Guidelines
 
 
Supported by an Unrestricted Educational Grant from
Bayer Healthcare, Pharmaceuticals, West Haven, CT


 
 

Data

Model Development

Model Validity

Uses of Risk Models

Limitations and Disadvantages of Risk Models

Current Status and Future Direction

References

Cardiac Surgery Risk Models

A Position Paper
from
The Society of Thoracic Surgeons Workforce on Evidence-Based Medicine

Running Head: Cardiac surgery risk models

David M. Shahian, MD (Taskforce Chair) (1); Eugene H. Blackstone, MD (2); Fred H. Edwards, MD (3); Frederick L. Grover, MD (4); Gary L. Grunkemeier, PhD (5); David C. Naftel, PhD (6); Samer A. M. Nashef, FRCS (7); William C. Nugent, MD (8); Eric D. Peterson, MD, MPH (9)

(1) Lahey Clinic, Burlington, MA 01805; (2) Cleveland Clinic Foundation, Cleveland, OH 44195; (3) University of Florida, Jacksonville, FL 32209-6597; (4) University of Colorado HSC, Denver, CO 80262; (5) Providence Health System, Portland, OR 97225; (6) University of Alabama, Birmingham, AL 35294-0007; (7) Papworth Hospital, Cambridge, United Kingdom CB3 8RE; (8) Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756; (9) Duke Clinical Research Institute, Durham, NC 27715

Keywords: Health policy, statistics, database
Corresponding Author: David M. Shahian, MD, Department of Thoracic and Cardiovascular Surgery, Lahey Clinic, 41 Mall Road, Burlington, MA 01805
Phone: 781-744-8575; Fax: 781-744-5641
E-mail: David.M.Shahian@Lahey.org

 

Medical outcome data are often used to compare treatments or providers. Because patient outcomes may be influenced by severity of illness, treatment effectiveness, or chance {1-5}, such studies must account for differences in the prevalence of patient risk factors (case-mix). In some instances, it is possible to reduce or eliminate outcome variability due to case-mix through randomization, which is expected to balance both known and unknown risk factors. Unfortunately, this is impractical in most real-life situations such as the comparison of results among institutions. Other study designs rely on covariate matching or the use of propensity scores {6, 7}, both of which balance only known risk factors. However, in the majority of existing observational studies in medicine and surgery, case-mix has been accounted for through risk adjustment. Using statistical modeling techniques, typically some form of multivariable regression analysis {8}, investigators study the association between individual risk factors (also referred to as predictor variables or covariates) and outcomes while holding constant the effect of others. Once the impact of each risk factor is determined from a given population sample, it then becomes possible to estimate the probability of the outcome for patients having particular combinations of these risk factors.

Although researchers have utilized risk models for many years, their broader applicability and critical importance were more fully appreciated following the 1986 release of unadjusted hospital outcome data by the Health Care Financing Administration (HCFA, now called CMS or the Centers for Medicare and Medicaid Services). Providers correctly argued that such data did not account for patient severity, and this led directly to the development of a number of high quality clinical databases and risk models, especially in cardiac surgery. Coronary artery bypass grafting (CABG) has been a particular focus of such research, not only because of the desire of cardiac surgeons to improve patient outcomes, but also because regulators and insurers have sought greater control over this high-profile, costly, and frequently performed procedure.

Some of the original cardiac surgery databases were voluntary (Northern New England Cardiovascular Disease Study Group [NNE] {9}) and some were mandated by state or federal law (NY {10}, NJ {11}, Veterans Affairs Administration [VA] {12}). Soon after the HCFA release of unadjusted outcome data in 1986, the Society of Thoracic Surgeons established an Ad Hoc Committee on Risk Factors for Coronary Artery Bypass Surgery {13}. Subsequently, a committee under the direction of Dr. Richard Clark began work on the development of the STS National Cardiac Database (STS NCD). This database was formally established in 1989 and the software was released to STS members in 1990 {14-16}. Over the subsequent 13 years it has evolved to become one of the largest single-specialty databases in the world, containing data on over 2.4 million patients from 60% of US cardiac programs.

In this review, we present some fundamental aspects of risk model development and validation, the current uses and limitations of such models in cardiac surgery (with special attention to CABG), and prospects for the future.

DATA

Sources

No risk adjustment model is better than the data upon which it is based. Administrative data, such as those from the CMS MEDPAR database, provide one of the most commonly used sources for observational studies. Such data are readily available, relatively inexpensive, and contain information on millions of patients {2, 17}. However, because these administrative data have been collected primarily for billing purposes rather than for clinical studies, critical variables such as ejection fraction are unavailable, and differentiation of co-morbidities from complications is problematic. The latter deficiency may exaggerate the predictive ability of risk models derived from such data and may occasionally lead to the paradoxical appearance of higher predictive accuracy than that of models derived from clinical databases {2, 17, 18}. This results from the inappropriate inclusion in the model of pre-terminal complications that are highly correlated with mortality. Administrative databases may exclude important variables that are not billable diagnoses ("field saturation") {19}, they limit the number of secondary diagnoses, and they generally have insufficient flexibility to classify certain co-morbidities properly {17}. All of these shortcomings limit the accuracy of risk models derived from them.

Core Variables

Griffith and associates {20} found that critical clinical variables known to be associated with mortality (e.g. ejection fraction, emergency procedures) were not included in the first Pennsylvania CABG database, leading to problematic accuracy. Hannan and associates compared models based upon the New York clinical cardiac surgery database (CSRS) with models derived from the New York administrative database (SPARCS) {21} and with the HCFA MEDPAR database {17}. In both instances, models derived from the clinical database were found to provide superior performance. Accuracy of the models based upon administrative data was improved substantially by the addition of a few critical clinical variables (ejection fraction, re-operation, left main obstruction).

Tu and associates {4} have identified a limited set of six core variables (age, gender, acuity, reoperation, left main coronary obstruction, ejection fraction) beyond which they believe there is little incremental improvement in model performance. Jones and associates {22} from the Cooperative CABG Database Project identified seven core variables (age, gender, left ventricular function, acuity, left main disease, reoperation, number of diseased vessels). They found that acuity, reoperation, and age accounted for the majority of predictive information in most CABG databases. In the STS NCD, 78% of the explained variance from the entire 28-variable model is derived from the eight most important predictors (age, surgical acuity, reoperative status, creatinine level, dialysis, shock, chronic lung disease, and ejection fraction).

All studies confirm that the gold standard for data is a specialty-specific, prospectively maintained clinical database such as the STS NCD. Such databases should contain, at a minimum, a core set of variables that has been demonstrated to be associated with outcome.

Definitions

Certain caveats regarding data apply generally to all regression models including those employed for cardiac surgery. One of the most important is strict standardization of data definitions. Even for a seemingly unambiguous endpoint like mortality, there are important statistical and policy implications of using (a) in-hospital mortality, regardless of when it occurs, (b) thirty-day all-cause mortality, regardless of where it occurs, and (c) operative mortality, defined as either (a) or (b). A fixed time period is statistically preferable {1} although more difficult to obtain than in-hospital mortality from databases that are not payer-based. Osswald and colleagues {23, 24} have studied the implications of different definitions of "early" post-CABG mortality, especially in light of advancements in postoperative care. Particularly for higher risk patients where the early postoperative phase may be prolonged, these investigators assert that the true early mortality will be underestimated unless a complete tally of deaths, and their time of occurrence, is compiled for the first 6 months after surgery.

Quality

Whenever practical, continuous data should be used as such to avoid the arbitrariness and loss of valuable information that occurs with categorization {25}. Transformations of the measurement scale may be required for the values of the variables to be commensurate with the assumptions of the risk factor model being used (so-called linearizing transformations). Data entry software should contain internal quality controls for out of range, inconsistent, contradictory, and missing data. Ideally, values for missing data should be substituted using multiple imputation techniques {26, 27}. Institutions should receive periodic reports regarding their data quality and any anomalies in recording the values of variables compared with regional and national averages. All these features are included in the STS NCD.

MODEL DEVELOPMENT

Risk model development requires considerable statistical expertise and judgment, a caveat that is sometimes forgotten in this era of ubiquitous, powerful, off-the-shelf statistical software. For example, the type of modeling strategy and validation techniques may differ depending on whether the purpose of the model is description of relationships (e.g. comparison of providers or treatments) or prediction of future events {28, 29}. Spiegelhalter has demonstrated the importance of such considerations when the aim of the model is probabilistic prediction to aid in patient selection and counseling {29}. Even when the same basic model is used, Naftel has demonstrated numerous technical reasons that different multivariable equations might be developed by different statisticians from the same data {30}.

Three principal techniques have been utilized for the construction of cardiac surgery risk models. Bayesian models were used initially for the STS NCD because they are robust with regard to missing data, an important problem in the early database experience. As data completeness improved, logistic models were substituted in 1995 {31, 32}. Logistic regression models continue to be the most common statistical technique for cardiac surgery risk modeling, not only for the STS National Database but also for those developed in New York {33}, the Veterans Affairs Administration {34}, and the Northern New England Cardiovascular Disease Study Group {35}. Some groups have employed simple additive scores with weights derived from the logistic regression model {11, 36}. Comparative studies {37} have generally demonstrated that logistic models offer the best overall performance.
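To make the logistic approach concrete, the sketch below shows how a fitted logistic model converts a patient's risk factors into a predicted probability of death, and how each coefficient yields an odds ratio. The intercept, coefficients, and variable names are hypothetical values chosen only for illustration; they are not taken from the STS, New York, VA, or NNE models.

```python
import math

# Hypothetical log-odds coefficients, for illustration only.
# They are NOT the published weights of any cardiac surgery risk model.
INTERCEPT = -5.0
COEFFICIENTS = {
    "age_decades_over_50": 0.40,  # per decade of age above 50
    "reoperation": 0.90,          # 1 if prior cardiac operation, else 0
    "emergency": 1.20,            # 1 if emergency procedure, else 0
}


def predicted_mortality(patient):
    """Return the logistic-model probability of death for one patient.

    `patient` maps risk-factor names to values; missing factors count as 0.
    """
    log_odds = INTERCEPT + sum(
        weight * patient.get(name, 0.0) for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(name):
    """Odds ratio associated with a one-unit increase in the named factor."""
    return math.exp(COEFFICIENTS[name])


example = {"age_decades_over_50": 2.0, "reoperation": 1.0, "emergency": 0.0}
print(f"Predicted mortality: {predicted_mortality(example):.3f}")
print(f"Odds ratio for reoperation: {odds_ratio('reoperation'):.2f}")
```

An additive score of the kind mentioned above would replace the exponential transformation with a simple sum of rounded weights, trading some accuracy for bedside convenience.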

Some have suggested that the next major advance in model performance would come from the application of algorithmic models, sometimes called machine learning techniques {38}, of which artificial neural networks are an example {39}. These models permit complex, non-linear information processing, thus avoiding one of the constraints of logistic models. However, two studies {40, 41}, one using 80,606 patients from the STS NCD {40}, failed to demonstrate any significant improvement over logistic or Bayesian models.

Strategies for logistic regression model development include data reduction techniques and variable selection, interaction terms and transformations of variables where appropriate, imputation of missing data, verification of model assumptions, and model validation {8, 42, 43}. One fundamental and still controversial question is the number of variables to include in a risk model. An excessive number of covariates may lead to some predictors having statistical significance but not biologic or clinical relevance, over-fitting of the model, numerical instability, and increased cost and difficulty of data collection {4, 8, 42, 43}. Harrell {8, 42} has recommended that the number of covariates considered for inclusion in such models be less than one-tenth the number of end points observed in the dataset (percent occurrence x sample size, for early events), which typically requires some data reduction technique to decrease the number of candidate variables. Univariate screening of candidate covariates followed by forward or backward stepwise selection is commonly employed for this purpose. However, some statisticians including Harrell have criticized this variable selection technique on theoretical grounds, and research continues on the relative merits of parsimonious models versus those that retain most or all potential predictors {8, 29, 44, 45}. The use of "machine-learning" variable selection methods such as bagging (bootstrap aggregating) is also gaining popularity {46, 47}.
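The one-in-ten rule of thumb attributed to Harrell above reduces to simple arithmetic. The following sketch, using a hypothetical function name and made-up sample sizes, estimates the approximate ceiling on candidate covariates from the sample size and the event rate.

```python
def max_candidate_covariates(sample_size, event_rate, events_per_variable=10):
    """Approximate ceiling on candidate covariates under the one-in-ten rule.

    sample_size:         number of patients in the development dataset
    event_rate:          proportion of patients with the end point (e.g. 0.03)
    events_per_variable: end points required per candidate covariate
    """
    # Number of end points = percent occurrence x sample size
    expected_events = round(sample_size * event_rate)
    return expected_events // events_per_variable


# Illustrative numbers only: 5,000 patients with 3% operative mortality
# give 150 deaths, which supports roughly 15 candidate covariates.
print(max_candidate_covariates(5000, 0.03))
```

With a low-incidence end point such as operative mortality, even a sizeable single-institution dataset supports only a modest number of candidate variables, which is why data reduction is usually needed.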

Finally, most studies demonstrate that models have inferior performance when applied to patient groups other than the one from which they were developed. Ivanov and colleagues {48} found that institution or region-specific custom models performed best, followed by recalibrated models using the covariate set from a "ready-made" model but different weights. Applying unmodified off-the-shelf models to other groups of patients provided the least satisfactory performance. In contrast, Nashef and colleagues {49} recently reported that the additive EuroSCORE cardiac surgery risk-prediction model functioned well when applied to North American populations.

MODEL VALIDITY

Any statistical risk model must be scrutinized to determine whether it functions reliably for its intended purpose. Numerous types of validity have been summarized by Daley {1} including face validity (the model is reasonable to experts), content validity (all important variables have been included), attributional validity (risk adjustment is adequate to insure that differences in outcome are not due to patient characteristics) and predictive validity. The latter is usually assessed by means of model calibration and discrimination, typically employing a cross-validation technique in which the whole dataset is split into development and validation subsets. Harrell has recommended using the entire dataset for model development, then validating using another technique such as bootstrapping {8}.

Calibration assesses the concordance between deciles of observed and expected risk as measured by the Hosmer-Lemeshow test {43}, and answers the question "In 100 patients with the same estimated mortality risk (R%) as mine, would the observed number of deaths equal R?" Most CABG risk models are well calibrated overall but may produce estimates that are too extreme in the lowest and especially the highest risk subsets of patients, particularly in small datasets {45, 50}. Shrinkage of regression coefficients may provide more accurate and realistic prediction for future patients {8, 45, 50, 51}.
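The decile comparison behind the Hosmer-Lemeshow test can be sketched as follows. This is a minimal illustration with hypothetical function and variable names: patients are sorted by predicted risk, split into groups of nearly equal size, and the observed and expected deaths in each group are compared. The resulting statistic is conventionally referred to a chi-square distribution with (groups - 2) degrees of freedom; the p-value lookup is omitted here.

```python
def hosmer_lemeshow(predicted, observed, groups=10):
    """Return (statistic, table) for a Hosmer-Lemeshow style calibration check.

    predicted: model probabilities of death, one per patient
    observed:  outcomes, 1 = died, 0 = survived
    table:     list of (n, observed_deaths, expected_deaths) per risk group
    """
    # Sort patients by predicted risk only (stable for tied predictions)
    pairs = sorted(zip(predicted, observed), key=lambda pair: pair[0])
    n_total = len(pairs)
    statistic = 0.0
    table = []
    for g in range(groups):
        # Integer slice boundaries give groups of nearly equal size
        chunk = pairs[g * n_total // groups:(g + 1) * n_total // groups]
        n = len(chunk)
        if n == 0:
            continue
        expected = sum(p for p, _ in chunk)
        deaths = sum(y for _, y in chunk)
        mean_risk = expected / n
        variance = n * mean_risk * (1.0 - mean_risk)
        if variance > 0:
            statistic += (deaths - expected) ** 2 / variance
        table.append((n, deaths, expected))
    return statistic, table


# Tiny made-up example with two risk groups
stat, rows = hosmer_lemeshow([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], groups=2)
for n, deaths, expected in rows:
    print(f"n={n} observed={deaths} expected={expected:.2f}")
```

A small statistic indicates that observed and expected deaths agree within each risk stratum; inspecting the table row by row shows whether miscalibration is concentrated in the lowest or highest risk groups.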

Discrimination is the more demanding test and measures the tradeoff between the specificity and sensitivity of the risk model at various probability cut points. It asks the question "How well does the model separate patients who die from those who survive?" and may be interpreted as the percentage of discordant (meaning one death and one survivor) patient-pairs for which the model predicts a higher mortality risk for the patient who actually dies {43, 52, 53}. This percentage is equal to the area under the ROC curve {43} and is called the c-index or c-statistic {8, 42}. Unfortunately, most CABG risk models have only moderate discrimination as measured by the c-index. This has significant implications for their use in individual patient counseling {52, 54}. Risk models may accurately predict that three out of one hundred patients with a given set of risk factors will die postoperatively, but they cannot identify which three.
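The pairwise interpretation of the c-index described above can be computed directly. The sketch below uses hypothetical names and counts tied predictions as half-concordant, which is a common convention and an assumption here rather than something specified in the text.

```python
def c_index(predicted, observed):
    """Proportion of death/survivor pairs in which the patient who died
    was assigned the higher predicted risk (ties count as one half).

    predicted: model probabilities of death, one per patient
    observed:  outcomes, 1 = died, 0 = survived
    """
    deaths = [p for p, y in zip(predicted, observed) if y == 1]
    survivors = [p for p, y in zip(predicted, observed) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for risk_died in deaths:
        for risk_survived in survivors:
            if risk_died > risk_survived:
                concordant += 1.0
            elif risk_died == risk_survived:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


# Made-up example: two deaths and two survivors, three of four pairs concordant
print(c_index([0.8, 0.3, 0.6, 0.2], [1, 1, 0, 0]))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of deaths from survivors. The double loop is quadratic in the number of patients, so for large registries a rank-based calculation would be preferable.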

The similar and limited discrimination of most CABG risk models (VA, NNE, STS, NY) suggests that we still await the quantum improvement in risk prediction anticipated by Steen {39} over a decade ago. The performance of current models is limited by as yet unknown predictors, difficulty in measuring or representing certain complex clinical states, random catastrophic events such as serious protamine reactions or sudden hemorrhage, and similar occurrences that are rare in the population but important in the individual patient {22, 40, 55-57}. Because known patient characteristics will never explain all the variance in cardiac surgery outcomes, the performance of any risk model has inherent limitations.

USES OF RISK MODELS

One of the most important uses of cardiac risk models, including the STS risk model, has been for academic research. Typically this has involved estimation of the effect of risk factors or particular therapies on patient outcome. Logistic models are well suited to this function because they readily provide odds ratios for each risk factor. STS NCD studies of preoperative risk factors include the impact of race {58}, gender {59}, and obesity {60}. Similar investigations of therapeutic options have included the value of IMA use {61, 62}, beta blockade {63}, and off-pump techniques {64}. The STS NCD has also been utilized to clarify our understanding of the relationship between volume and outcome for CABG {65-67}.

A second use for risk models is the development of tools that aid in everyday patient management. These would include patient care algorithms or critical pathways scientifically based upon risk-adjusted studies {56}. One such management tool is the use of risk models in individual patient counseling or as a decision support tool for clinicians choosing between different interventions (e.g. coronary artery bypass versus percutaneous angioplasty). Pocket cards (Northern New England Cardiovascular Disease Study Group) {68} or handheld computers {69} have been used to generate bedside risk estimates for individual patients. Obviously, in order to properly apply such methods, values for each risk factor in the model must be available. Because of the modest discrimination of most risk models, which limits the accuracy of individual patient prediction, it is recommended that such information be presented to patients not simply as a probability estimate, but accompanied by confidence limits that demonstrate the uncertainty of those estimates.

One of the most common uses of risk models is to compare provider performance. This is statistically challenging from the outset because of the low incidence of the binary outcome (operative mortality) as well as the highly variable and often small sample sizes from different providers. Typically, as in New York State, profiling has been achieved by aggregating the probabilities of death of each patient treated by a given provider based upon the results of logistic modeling. This aggregate mortality is used together with the observed provider mortality to construct a ratio of observed to expected mortalities, or O/E ratio. However, modern research in provider profiling has demonstrated potential deficiencies with this approach. Standard techniques may not accurately reflect the unobserved true mortality of low volume providers, and they do not adequately account for clustering (non-random allocation) of observations within providers (such as the prevalence of heart failure patients at transplant centers {56}). The net effect may be an underestimation of random inter-provider variability, an overestimation of systematic inter-provider variability, and an increased likelihood of falsely classifying a provider as an outlier. Hierarchical or multilevel models have been designed to address these concerns and some advocate their use for profiling whenever feasible {44, 70-75}.
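The O/E calculation described above can be sketched in a few lines. Function names and data are hypothetical. The conversion to a risk-adjusted mortality rate (O/E ratio multiplied by the overall population mortality) is a commonly used convention and is stated here as an assumption; the sketch does not address the clustering and low-volume problems that hierarchical models are intended to handle.

```python
def observed_to_expected(predicted, observed):
    """O/E ratio for one provider.

    predicted: model probabilities of death for the provider's patients
    observed:  outcomes for the same patients, 1 = died, 0 = survived
    """
    expected_deaths = sum(predicted)  # aggregate of individual probabilities
    observed_deaths = sum(observed)
    if expected_deaths <= 0:
        raise ValueError("expected deaths must be positive")
    return observed_deaths / expected_deaths


def risk_adjusted_mortality(oe_ratio, overall_mortality):
    """Risk-adjusted mortality: O/E ratio scaled by the overall mortality rate."""
    return oe_ratio * overall_mortality


# Made-up provider: 200 patients, each with 2.5% predicted risk, 4 deaths
predicted = [0.025] * 200
observed = [1] * 4 + [0] * 196
ratio = observed_to_expected(predicted, observed)
print(f"O/E ratio: {ratio:.2f}")
print(f"Risk-adjusted mortality: {risk_adjusted_mortality(ratio, 0.03):.4f}")
```

An O/E ratio below 1 suggests fewer deaths than the model predicted for that case-mix, but with small volumes a single additional death shifts the ratio considerably, which is why point estimates should be accompanied by confidence limits.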

Whatever method is used for provider profiling, identification of statistical outliers is only a starting point for further analysis {56, 76}. All risk models have inherent limitations, and the results obtained from different models and by different statisticians may vary {28-30, 50, 72}. The results derived from any risk model should be regarded as one element to be considered in conjunction with other traditional indices of competent surgical care. The Society of Thoracic Surgeons firmly holds that those surgical programs identified as statistical outliers should not be arbitrarily labeled "substandard". Outlying providers' coding practices and individual mortalities should be carefully analyzed, and process or structural causes should be sought to explain any truly aberrant results {12, 77-80}.

Finally, risk-adjusted outcomes have also been used in northern New England {81} and Minnesota {82} as the basis for confidential continuous quality improvement (CQI) activities, and in the Veterans Affairs Administration for both confidential monitoring of performance and CQI activities {12, 78}. Here, the main goal is not public accountability but rather provider-initiated determination of best practice, benchmarking, and regional or system-wide improvement. This has resulted in mortality reduction that appears comparable to that achieved using public report cards {83}. The Society of Thoracic Surgeons has implemented process improvement initiatives based on analyses derived from the STS NCD {62}.

LIMITATIONS AND DISADVANTAGES OF RISK MODELS

As previously described, risk models do not have perfect discrimination, that is, the ability to predict the death of specific individuals. Because our ability to determine expected outcome is limited, risk-adjusted mortality rates derived from the O/E ratio are also subject to error. Publication of risk-adjusted mortality "accurate" to the nearest hundredth of a percentage point may be misleading to the public, all the more so when the figures are not accompanied by confidence limits. Furthermore, even the best available risk models explain only a small proportion of the variability in cardiac surgery outcomes, a significant liability if the goal is to assess inter-provider differences in quality of care based upon their relative risk-adjusted mortalities {28, 84}. Statisticians continue to develop new and more sophisticated techniques to model complex biological phenomena, and it is hoped that the performance of risk models will continue to improve.

From a health policy perspective, some argue that current risk models place too much emphasis on mortality as the sole endpoint, which may ultimately decrease access to surgery for those who might benefit most (high-risk case avoidance) {13, 55, 70, 85}. For similar reasons, such models may encourage gaming of the reporting system when used for provider profiling {70}. Emphasis on outcome endpoints such as mortality may deflect attention from important process and structural aspects of care. Finally, challenges common to all cardiac surgery databases include their cost, especially at a time when revenues from heart surgery are declining, and privacy issues related to HIPAA.

CURRENT STATUS AND FUTURE DIRECTION

Between 1990 and 1997, the number of centers contributing patients to the STS NCD grew from 105 to 450, and it currently contains clinical data on 2.4 million patients {16, 76, 86}. Results over the decade 1990-1999 demonstrate a progressive increase in preoperative risk and a decline in both observed mortality and the O/E ratio {86}.

Although it is already the dominant cardiac surgical database in the world, it is the goal of the STS to achieve 100% participation of all cardiac surgery providers in the U.S. This, together with increased efforts to validate the accuracy of the submitted data, will eliminate any remaining concerns about the voluntary nature of the database, although there has generally been close correlation between the results obtained from voluntary (STS NCD, NNE) and mandatory databases (VA, NY) {76, 86}. Recent studies matching STS NCD results with those from CMS claims data suggest substantial agreement both with regard to completeness and observed mortality. The STS NCD should become the dominant source of accurate information for the federal and state governments regarding outcome and reimbursement issues, and there is also the opportunity to partner with industry in the assessment of new technologies.

A systematic review of the STS NCD was begun in 1997 {16, 76}. The STS Definitions Committee worked with the American College of Cardiology to eliminate unimportant variables and to develop a new data dictionary. The original 512 fields were reduced to 217 core and 255 extended fields. Discussions continue among representatives of the major cardiac surgery and cardiology databases to resolve inconsistent data definitions.

The Duke Clinical Research Institute (DCRI) became the data analysis and warehouse center for the STS NCD during 1998-1999. In addition to their sophisticated resources for data cleaning and verification, DCRI provides detailed semiannual reports comparing individual programs with their region and with the overall STS national dataset. Regular national meetings of hospital data managers have substantially enhanced uniformity of coding practices and completeness.

From models that focused primarily on CABG mortality, the STS NCD has evolved to a family of related risk models for prediction of outcomes following valve replacement with or without CABG {87, 88}, congenital heart surgery {89}, and general thoracic surgery. There is increased awareness of the importance of endpoints other than operative mortality, including perioperative morbidity and length of stay {15, 16, 76, 90, 91}. Postoperative complications and readmissions (which are delayed complications) appear to measure different and complementary aspects of care compared with hospital mortality {92, 93}, and not all measures of outcome are equally indicative of program quality. Future emphasis will also include documentation of long-term follow-up data, functional status and quality of life measures, and the relationship of clinical factors to hospital costs. Finally, there will be an increasing emphasis on process measures (e.g. IMA and beta blocker use) to assess and improve provider performance. This approach may offer the greatest potential to enhance the results of all cardiac surgery programs and to reduce interprovider variability. It also diminishes our reliance upon sophisticated, yet still imperfect, methods of risk-adjustment {56, 62, 76}.

Cardiac surgery remains at the forefront of risk model development and clinical quality monitoring. With advancements in statistical methodology, expanding enrollment in major databases such as the STS NCD and EuroSCORE, and the firm commitment of cardiac surgeons, our profession will maintain its leadership in this vital area of health care.