Methodological Considerations in Developing Hospital Composite Performance Measures
Sean M. O'Brien, PhD
Department of Biostatistics & Bioinformatics, Duke University Medical Center
[email protected]

Introduction
- A composite performance measure is a combination of two or more related indicators (e.g., process measures, outcome measures)
- Useful for summarizing a large number of indicators
- Reduces a large number of indicators into a single, simple summary

Example #1 of 3: CMS / Premier Hospital Quality Incentive Demonstration Project
source: http://www.premierinc.com/quality-safety/tools-services/p4p/hqi/images/composite-score.pdf

Example #2 of 3: U.S. News & World Report Hospital Rankings
2007 Rankings: Heart and Heart Surgery
Rank  Hospital                                                         Score
#1    Cleveland Clinic                                                 100.0
#2    Mayo Clinic, Rochester, Minn.                                     79.7
#3    Brigham and Women's Hospital, Boston                              50.5
#4    Johns Hopkins Hospital, Baltimore                                 48.6
#5    Massachusetts General Hospital, Boston                            47.6
#6    New York-Presbyterian Univ. Hosp. of Columbia and Cornell         45.6
#7    Texas Heart Institute at St. Luke's Episcopal Hospital, Houston   45.0
#8    Duke University Medical Center, Durham, N.C.                      42.2
source: http://www.usnews.com

Example #3 of 3: Society of Thoracic Surgeons Composite Score for CABG Quality
STS Database Participant Feedback Report

STS Composite Quality Rating

Why Composite Measures?
- Simplifies reporting
- Facilitates ranking
- More comprehensive than a single measure
- More precise than a single measure

Limitations of Composite Measures
- Loss of information
- Requires subjective weighting
  - No single objective methodology
  - Hospital rankings may depend on the weights
- Hard to interpret
  - May seem like a black box
  - Not always clear what is being measured

Goals
- Discuss methodological issues & approaches for constructing composite scores
- Illustrate inherent limitations of composite scores

Outline
- Motivating Example: U.S. News & World Report Best Hospitals
- Case Study: Developing a Composite Score for CABG

Motivating Example: U.S. News & World Report Best Hospitals
2007 Quality Measures for Heart and Heart Surgery
- Reputation Score (based on a physician survey: the percent of physicians who list your hospital in the top 5)
- Mortality Index (risk-adjusted 30-day; ratio of observed to expected number of mortalities for AMI, CABG, etc.)
- Structure Component
  - Volume (discharges)
  - Nursing index
  - Nurse magnet hospital
  - Advanced services
  - Patient services
  - Trauma center

Motivating Example: U.S. News & World Report Best Hospitals 2007
"structure, process, and outcomes each received one-third of the weight."
- America's Best Hospitals 2007 Methodology Report

Motivating Example: U.S. News & World Report Best Hospitals 2007
Example Data: Heart and Heart Surgery, Duke University Medical Center (source: usnews.com)
- Reputation: 16.2%
- Mortality index: 0.77
- Discharges: 6,624
- Nursing index: 1.6
- Nurse magnet hosp: Yes
- Advanced services: 5 of 5
- Patient services: 6 of 6
- Trauma center: Yes

Which hospital is better?

Hospital A:
- Reputation: 5.7%
- Mortality index: 0.74
- Discharges: 2,922
- Nursing index: 2.0
- Nurse magnet hosp: Yes
- Advanced services: 5 of 5
- Patient services: 6 of 6
- Trauma center: Yes

Hospital B:
- Reputation: 14.3%
- Mortality index: 1.10
- Discharges: 10,047
- Nursing index: 2.0
- Nurse magnet hosp: Yes
- Advanced services: 5 of 5
- Patient services: 6 of 6
- Trauma center: Yes

Despite Equal Weighting, Results Are Largely Driven by Reputation

Rank  Hospital                                                         Overall Score   Reputation Score
#1    Cleveland Clinic                                                     100.0           67.7%
#2    Mayo Clinic, Rochester, Minn.                                         79.7           51.1%
#3    Brigham and Women's Hospital, Boston                                  50.5           23.5%
#4    Johns Hopkins Hospital, Baltimore                                     48.6           19.8%
#5    Massachusetts General Hospital, Boston                                47.6           20.4%
#6    New York-Presbyterian Univ. Hosp. of Columbia and Cornell             45.6           18.5%
#7    Texas Heart Institute at St. Luke's Episcopal Hospital, Houston       45.0           20.1%
#8    Duke University Medical Center, Durham, N.C.                          42.2           16.2%
(source of data: http://www.usnews.com)

[Figure: scatterplot of overall score (40-100) vs. reputation score (10%-70%) for the 2007 rankings]

Lesson for Hospital Administrators (?)
- The best way to improve your score is to boost your reputation
- Focus on publishing, research, etc.
- Improving your mortality rate may have only a modest impact

Lesson for Composite Measure Developers
- No single objective method of choosing weights

- Equal weighting may not always behave like it sounds

Case Study: Composite Measurement for Coronary Artery Bypass Surgery

Background
- Society of Thoracic Surgeons (STS) Adult Cardiac Database
  - Since 1990
  - Largest quality improvement registry for adult cardiac surgery
  - Primarily for internal feedback; increasingly used for reporting to 3rd parties
- STS Quality Measurement Taskforce (QMTF)
  - Created in 2005
  - First task: develop a composite score for CABG for use by 3rd-party payers

Why Not Use the CMS HQID Composite Score?
- Choice of measures
  - Some HQID measures are not available in STS
  - (Also, some nationally endorsed measures are not included in HQID)
- Weighting of process vs. outcome measures
  - HQID is heavily weighted toward process measures
  - STS QMTF surgeons wanted a score that was heavily driven by outcomes

Our Process for Developing Composite Scores
- Review specific examples of composite

scores in medicine (example: CMS HQID)
- Review and apply approaches from other disciplines (psychometrics)
- Explore the behavior of alternative weighting methods in real data
- Assess the performance of the chosen methodology

CABG Composite Scores in HQID (Year 1)
Process Measures (4 items):
- Aspirin prescribed at discharge
- Antibiotics < 1 hour prior to incision
- Prophylactic antibiotic selection
- Antibiotics discontinued < 48 hours
Outcome Measures (3 items):
- Inpatient mortality rate
- Postop hemorrhage/hematoma
- Postop physiologic/metabolic derangement
Process Score + Outcome Score = Overall Composite

CABG Composite Scores in HQID: Calculation of the Process Component Score

Based on an opportunity model:
- Each time a patient is eligible to receive a care process, there is an opportunity for the hospital to deliver the required care
- The hospital's score for the process component is the percent of opportunities for which the hospital delivered the required care

Hypothetical example with N = 10 patients:
- Aspirin at Discharge: 9/9 (100%)
- Antibiotics Initiated: 9/10 (90%)
- Antibiotics Selection: 10/10 (100%)
- Antibiotics Discontinued: 9/9 (100%)
Process score = (9 + 9 + 10 + 9) / (9 + 10 + 10 + 9) = 37/38 = 97.4%
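The opportunity-model arithmetic can be sketched in a few lines. The function name and data layout here are illustrative, not from the HQID specification; the counts are the slide's hypothetical N = 10 example.

```python
# Sketch of an opportunity-model process score: pool care opportunities
# across measures and report the percent for which care was delivered.
# Each (numerator, denominator) pair is (care delivered, eligible opportunities).

def opportunity_score(measures):
    """Percent of all opportunities for which the required care was delivered."""
    delivered = sum(num for num, _ in measures)
    opportunities = sum(den for _, den in measures)
    return 100.0 * delivered / opportunities

hqid_example = [
    (9, 9),    # aspirin at discharge
    (9, 10),   # antibiotics < 1 hour prior to incision
    (10, 10),  # prophylactic antibiotic selection
    (9, 9),    # antibiotics discontinued < 48 hours
]

print(round(opportunity_score(hqid_example), 1))  # 37/38 opportunities = 97.4
```

Note that pooling opportunities (37/38) is not the same as averaging the four per-measure rates; measures with more eligible patients implicitly get more weight.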

CABG Composite Scores in HQID: Calculation of the Outcome Component
- Risk-adjusted using the 3M(TM) APR-DRG(TM) model
- Based on the ratio of observed / expected outcomes
- Outcome measures are:
  - Survival index
  - Avoidance index for hematoma/hemorrhage
  - Avoidance index for physiologic/metabolic derangement

Survival index = (observed # of patients surviving) / (expected # of patients surviving)
Interpretation:
- index < 1 implies worse-than-expected survival
- index > 1 implies better-than-expected survival
(Avoidance indexes have analogous definitions & interpretations)

CABG Composite Scores in HQID: Combining Process and Outcomes
Equal weight for each measure (4 process measures, 3 outcome measures), so each individual measure is weighted 1/7:
  4/7 x Process Score
+ 1/7 x survival index
+ 1/7 x avoidance index for hemorrhage/hematoma
+ 1/7 x avoidance index for physiologic derangement
= Overall Composite Score

Strengths & Limitations
Advantages:
- Simple
- Transparent
- Avoids subjective weighting
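To make the 4/7 vs. 1/7 weighting concrete, here is a sketch of the combination rule. The function names are illustrative, the index values passed in are invented, and all inputs are put on a common near-1.0 scale for the example.

```python
# Sketch of the HQID-style CABG composite: each of the 7 measures gets
# weight 1/7, so the process score (4 measures) carries 4/7 of the weight.

def survival_index(observed_survivors, expected_survivors):
    # > 1 implies better-than-expected survival, < 1 worse-than-expected
    return observed_survivors / expected_survivors

def hqid_composite(process_score, survival, hemorrhage_avoidance, derangement_avoidance):
    return (4/7) * process_score + (1/7) * survival \
        + (1/7) * hemorrhage_avoidance + (1/7) * derangement_avoidance

# Hypothetical hospital: 97.4% process adherence, outcome indexes near 1.0
score = hqid_composite(0.974, survival_index(98, 97), 0.99, 1.01)
```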

Disadvantages:
- Ignores uncertainty in the performance measures
- Not able to calculate confidence intervals
An unexpected feature:
- Heavily weighted toward process measures, as shown below

CABG Composite Scores in HQID: Exploring the Implications of Equal Weighting
- HQID performance measures are publicly reported for the top 50% of hospitals
- Used these publicly reported data to study the weighting of process vs. outcomes
Publicly Reported HQID Data: CABG Year 1, process measures and outcome measures

Process Performance vs. Overall Composite Decile Ranking
[Figure: average of process measures (75%-100%) by composite decile ranking (1st, 2nd, other)]

Outcome Performance vs. Overall Composite Decile Ranking
[Figure: average of outcome measures (98.5%-101.5%) by composite decile ranking (1st, 2nd, other)]

Explanation: Process Measures Have a Wider Range of Values
[Figure: the two panels side by side: process measures span roughly 75%-100% across hospitals, while outcome measures span roughly 98.5%-101.5%]
The amount that outcomes can increase or decrease the composite score is small relative to process measures.
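The dominance of the wide-scale component can be demonstrated numerically. The values below are invented but mimic the spreads just described (process adherence roughly 75%-100%, outcome indexes roughly 0.985-1.015).

```python
# Numeric illustration of the unequal-scales point: when measures on very
# different scales are combined with equal per-measure weights, the
# wide-range measure drives the composite.

def equal_weight_composite(process, outcome_index):
    # process expressed as a proportion so both inputs sit near 1.0
    return (4/7) * process + (3/7) * outcome_index

best_outcomes = equal_weight_composite(process=0.85, outcome_index=1.015)
best_process = equal_weight_composite(process=0.99, outcome_index=0.985)

# The hospital with the best plausible outcomes but weak process adherence
# still scores below the hospital with strong processes and the worst outcomes.
```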

Process vs. Outcomes: Conclusions
- Outcomes will only have an impact if a hospital is on the threshold between a better and a worse classification
- This weighting may have advantages:
  - Outcomes can be unreliable (chance variation, imperfect risk adjustment)
  - Process measures are actionable
- But it is not transparent

Lessons from HQID
- Equal weighting may not behave like it sounds
- If you prefer to emphasize outcomes, you must account for unequal measurement scales, e.g., standardize the measures to a common scale or weight process and outcomes unequally

Goals for STS Composite Measure
- Heavily weight outcomes
- Use statistical methods to account for small sample sizes & rare outcomes
- Make the implications of the weights as transparent as possible
- Assess whether inferences about hospital performance are sensitive to the choice of statistical / weighting methods

Outline
- Measure selection
- Data
- Latent variable approach to composite measures
- STS approach to composite measures

The STS Composite Measure for CABG

Criteria for Measure Selection
- Use the Donabedian model of quality: structure, process, outcomes
- Address three temporal domains: preoperative, intraoperative, postoperative
- Choose measures that meet various criteria for validity:
  - Endorsed by NQF
  - Captured in STS
  - Adequately risk-adjusted
  - Adequate data quality

The STS Composite Measure for CABG: Selected Measures
Process Measures:
- Internal mammary artery (IMA)
- Preoperative betablockers
- Discharge antiplatelets
- Discharge betablockers
- Discharge antilipids
Risk-Adjusted Outcome Measures:
- Operative mortality
- Prolonged ventilation
- Deep sternal infection
- Permanent stroke
- Renal failure
- Reoperation

NQF Measures Not Included in the Composite:
- Inpatient mortality (redundant with operative mortality)
- Participation in a quality improvement registry
- Annual CABG volume

Other Measures Not Included in the Composite:
- HQID measures not captured in STS (antibiotic selection & timing; postop hematoma/hemorrhage; postop physiologic/metabolic derangement)
- Structural measures
- Patient satisfaction
- Appropriateness
- Access
- Efficiency

Data
- STS database: 133,149 isolated CABG operations during 2004; 530 providers
- Inclusion/exclusion:
  - Exclude sites with > 5% missing data on any process measure
  - For discharge meds, exclude in-hospital mortalities

  - For IMA usage, exclude redo CABG
  - Impute missing data to negative (i.e., did not receive the process measure)

Distribution of Process Measures in STS
[Figure: histograms of hospital-specific usage rates (%)]
- IMA: median = 93.5%, IQR 89.7% to 96.2%
- DC Antiplatelets: median = 94.9%, IQR 91.0% to 97.4%
- DC Antilipids: median = 79.6%, IQR 67.3% to 88.8%
- DC Beta Blockers: median = 85.0%, IQR 76.6% to 90.5%
- Preop Beta Blockers: median = 73.1%, IQR 64.4% to 79.4%

Distribution of Outcome Measures in STS
[Figure: histograms of hospital-specific unadjusted event rates (%)]
- Prolonged Ventilation: median = 7.7%, IQR 4.9% to 11.2%
- Sternal Infection: median = 1.1%, IQR 0.6% to 1.7%
- Renal Failure: median = 2.8%, IQR 1.7% to 4.6%
- Stroke: median = 0.2%, IQR 0.0% to 0.7%
- Reoperation: median = 4.8%, IQR 3.3% to 6.8%
- Mortality: median = 2.2%, IQR 1.3% to 3.3%

Latent Variable Approach to

Composite Measures
- Psychometric approach
- Quality is a latent variable: not directly measurable, not precisely defined
- Quality indicators are the observable manifestations of this latent variable
- Goal is to use the observed indicators to make inferences about the underlying latent trait
[Diagram: latent variable (Quality) with arrows to observed indicators X1 through X5]

Common Modeling Assumptions
Case #1: A single latent trait
- All variables measure the same thing (unidimensionality)
- Variables are highly correlated (internal consistency)
- Imperfect correlation is due to random measurement error
- Can compensate for random measurement error by collecting lots of variables and averaging them
Case #2: More than a single latent trait
- Can identify clusters of variables that each describe a single latent trait (and meet the assumptions of Case #1)
NOTE: Measurement theory does not indicate how to reduce multiple distinct latent traits into a single dimension
- Beyond the scope of measurement theory
- Inherently normative, not descriptive

Models for a Single Latent Trait

Latent Trait Logistic Model (Landrum et al. 2000)

Example of a latent trait logistic model applied to 4 medication measures:
[Diagram: latent variable "Quality of Perioperative Medical Management" with arrows to X1 (Preop Betablocker), X2 (Discharge Betablocker), X3 (Discharge Antiplatelets), and X4 (Discharge Antilipids); each measure contributes an observed numerator and denominator, and pi_j denotes the underlying true probability]

Technical Details of Latent Trait Analysis
  log[pi_1 / (1 - pi_1)] = alpha_1 + beta_1 * Q   (preop betablockers)
  log[pi_2 / (1 - pi_2)] = alpha_2 + beta_2 * Q   (discharge betablockers)
  log[pi_3 / (1 - pi_3)] = alpha_3 + beta_3 * Q   (discharge antiplatelets)
  log[pi_4 / (1 - pi_4)] = alpha_4 + beta_4 * Q   (discharge antilipids)
- Q is an unobserved latent variable
- Goal is to estimate Q for each participant
- Use the observed numerators and denominators
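A small simulation can make the model's structure concrete. The item parameters (alpha_j, beta_j) below are invented, and this sketch only generates data from the model rather than fitting it.

```python
# Simulation sketch of the latent trait logistic model: a hospital's latent
# quality Q shifts the log-odds of every medication measure at once.
import math
import random

random.seed(1)

def simulate_hospital(Q, items, n_patients=100):
    """Given latent quality Q, draw a numerator for each medication measure."""
    numerators = []
    for alpha, beta in items:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * Q)))  # inverse logit
        numerators.append(sum(random.random() < p for _ in range(n_patients)))
    return numerators

# (alpha_j, beta_j) for the four medication items -- invented values
items = [(1.0, 0.8),   # preop betablockers
         (2.9, 0.7),   # discharge antiplatelets
         (1.7, 0.9),   # discharge betablockers
         (1.4, 1.0)]   # discharge antilipids

high_q = simulate_hospital(Q=1.5, items=items)
low_q = simulate_hospital(Q=-1.5, items=items)
# A higher latent Q pushes every item's usage rate upward simultaneously.
```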

[Figure: fitted latent trait logistic model: frequency of usage (%) vs. latent quality (-2 to 2) for preop betablockers, DC antiplatelets, DC betablockers, and DC antilipids]

Latent Trait Analysis
Advantages:

- Quality can be estimated efficiently
- Concentrates information from multiple variables into a single parameter
- Avoids having to determine weights

Latent Trait Analysis
Disadvantages:
- Hard for sites to know where to focus improvement efforts, because weights are not stated explicitly
- Strong modeling assumptions:
  - A single latent trait (unidimensionality)
  - Latent trait is normally distributed
- One major assumption is not stated explicitly but can be derived by examining the model: the hospital log-odds parameters of the individual items are assumed to be 100% correlated
  - A very unrealistic assumption!
  - The model did not fit the data

Table 1. Correlation between hospital log-odds parameters implied by the IRT model

                           DISCHARGE       DISCHARGE    DISCHARGE
                           ANTIPLATELETS   ANTILIPIDS   BETABLOCKER
DISCHARGE ANTILIPIDS           1.00
DISCHARGE BETABLOCKER          1.00           1.00
PREOPERATIVE BETABLOCKER       1.00           1.00          1.00

Table 2. Estimated correlation between hospital log-odds parameters

                           DISCHARGE       DISCHARGE    DISCHARGE
                           ANTIPLATELETS   ANTILIPIDS   BETABLOCKER
DISCHARGE ANTILIPIDS           0.38
DISCHARGE BETABLOCKER          0.30           0.15
PREOPERATIVE BETABLOCKER       0.34           0.19          0.50

The Model Also Did Not Fit When Applied to Outcomes
Estimated correlations between hospital log-odds parameters:

          VENT    INFECT   STROKE   RENAL    REOP
INFECT    0.46
STROKE    0.15     0.49
RENAL     0.49     0.50     0.16
REOP      0.16     0.54     0.65     0.40
MORT      0.43     0.43     0.44     0.54     0.61

Latent Trait Analysis: Conclusions
The model did not fit the data!

- Each measure captures something different
- # of latent variables = # of measures?
- Cannot use latent variable models to avoid choosing weights

The STS Composite Method
Step 1. Quality measures are grouped into 4 domains
Step 2. A summary score is defined for each domain
Step 3. Hierarchical models are used to separate true quality differences from random noise and case-mix bias
Step 4. The domain scores are standardized to a common scale
Step 5. The standardized domain scores are combined into an overall composite score by adding them

Preview: The STS Hospital Feedback Report
- Overall composite score, with confidence interval
- Domain-specific scores
- 3-star rating categories
- Graphical display of the STS distribution

Step 1. Quality Measures Are Grouped Into Four Domains
- Perioperative Medical Care (bundle): preop B-blocker, discharge B-blocker, discharge antilipids, discharge ASA
- Operative Technique: IMA usage
- Risk-Adjusted Mortality (measure): operative mortality
- Risk-Adjusted Morbidity (bundle): stroke, renal failure, reoperation, sternal infection, prolonged ventilation

Of Course, Other Ways of Grouping Items Are Possible
Taxonomy of Animals in a Certain Chinese Encyclopedia*
a) Those that belong to the Emperor
b) Embalmed ones
c) Tame ones
d) Suckling pigs
e) Sirens
f) Fabulous ones
g) Stray dogs
h) Those included in the present classification
i) Frenzied ones
j) Innumerable ones
k) Those drawn with a very fine camelhair brush
l) Others
m) Those that have just broken a water pitcher
n) Those that from a long way off look like flies

*According to Michel Foucault, The Order of Things, 1966

Step 2. A Summary Measure Is Defined for Each Domain
- Medications: all-or-none composite endpoint, the proportion of patients who received ALL four medications (except where contraindicated)
- Morbidities: any-or-none composite endpoint, the proportion of patients who experienced AT LEAST ONE of the five morbidity endpoints

All-Or-None / Any-Or-None
Advantages:
- No need to determine weights
- Reflects important values: emphasizes systems of care and a high benchmark
- Simple to analyze statistically, using methods for binary (yes/no) endpoints
Disadvantages:
- Choice to treat all items equally may be criticized

Step 2. A Summary Measure Is Defined for Each Domain
- Perioperative Medical Care (bundle): proportion of patients who received all 4 medications
- Operative Technique: proportion of patients who received an IMA
- Risk-Adjusted Mortality: proportion of patients who experienced operative mortality
- Risk-Adjusted Morbidity (bundle): proportion of patients who experienced at least one major morbidity

Step 3. Use Hierarchical Models to Separate True Quality Differences from Random Noise
proportion of successful outcomes = numerator / denominator = true probability + random error
- Hierarchical models estimate the true probabilities
- Variation in performance measures = variation in true probabilities + variation caused by random error

Example of Hierarchical Models
[Figure: mortality rates (0.0 to 0.04) in a sample of STS hospitals: observed estimates vs. hierarchical estimates]

Step 3. Use Hierarchical Models to Separate True Quality Differences from Case Mix
- Variation in performance measures = variation in true probabilities + variation caused by random error
- Variation in true probabilities = variation caused by the hospital + variation caused by case mix
- Adjusting for case mix yields risk-adjusted mortality/morbidity

Advantages of Hierarchical Model Estimates
- Less variable than a simple proportion (shrinkage)
- Borrows information across hospitals
- Our version also borrows information across measures
- Adjusts for case mix differences

Estimated Distribution of True Probabilities (Hierarchical Estimates)
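The shrinkage behavior of hierarchical estimates can be illustrated with a deliberately simplified empirical-Bayes (beta-binomial) sketch. This is not the actual STS model, which is a Bayesian hierarchical model that also borrows information across measures and adjusts for case mix; all counts below are invented.

```python
# Simplified illustration of shrinkage: each hospital's observed mortality
# rate is pulled toward the overall mean, with small hospitals pulled hardest.

def shrunken_rate(deaths, n, prior_mean, prior_strength=100):
    """Beta-binomial posterior mean: a weighted average of the observed
    rate (weight n) and the prior mean (weight prior_strength)."""
    return (deaths + prior_strength * prior_mean) / (n + prior_strength)

prior_mean = 0.022  # overall mortality rate, echoing the ~2.2% STS median

small = shrunken_rate(deaths=3, n=50, prior_mean=prior_mean)     # observed 6.0%
large = shrunken_rate(deaths=60, n=1000, prior_mean=prior_mean)  # observed 6.0%

# Both hospitals observed 6.0%, but the small hospital's estimate is pulled
# much closer to 2.2% than the large hospital's.
```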

[Figure: histograms of hospital-specific hierarchical estimates (%)]
- Mortality: median = 2.2%, IQR 1.8% to 2.8%
- Morbidity: median = 13.0%, IQR 10.0% to 16.5%
- IMA: median = 5.6%, IQR 4.2% to 7.3%
- Medication usage: median = 52.6%, IQR 41.8% to 63.6%

Step 4. The Domain Scores Are Standardized to a Common Scale

Step 4a. Consistent Directionality
- Directionality needs to be consistent in order to sum the measures
- Solution: measure success instead of failure
  - IMA usage rate and all-or-none medication adherence: higher is already better
  - Risk-adjusted mortality rate: higher is worse, so use Probability of NO mortality = 1 - Probability of mortality
  - Risk-adjusted any-morbidity rate: higher is worse, so use Probability of NO morbidity = 1 - Probability of morbidity

[Figure: histograms of hospital-specific rates (%) after the directionality flip]
- Mortality avoidance: median = 97.8%, IQR 97.2% to 98.2%
- Morbidity avoidance: median = 87.0%, IQR 83.5% to 90.0%
- IMA: median = 5.6%, IQR 4.2% to 7.3%
- Medication usage: median = 52.6%, IQR 41.8% to 63.6%

Step 4b. Standardization

Each measure is re-scaled by dividing by its standard deviation (sd).

Notation:
- pi_meds = probability of receiving all medications
- pi_IMA  = probability of receiving an IMA
- pi_mort = probability of NO operative mortality
- pi_morb = probability of NO major morbidity

standardized meds measure = pi_meds / sd_meds
standardized IMA measure  = pi_IMA / sd_IMA
standardized mort measure = pi_mort / sd_mort
standardized morb measure = pi_morb / sd_morb

Step 5. The Standardized Domain Scores Are Combined by Adding Them

Composite = pi_mort/sd_mort + pi_morb/sd_morb + pi_IMA/sd_IMA + pi_meds/sd_meds

where pi denotes the hierarchical estimate of the corresponding probability.

For presentation purposes, the composite is then rescaled again:

Composite = (1/c) x [ pi_mort/sd_mort + pi_morb/sd_morb + pi_IMA/sd_IMA + pi_meds/sd_meds ]

where c = 1/sd_mort + 1/sd_morb + 1/sd_IMA + 1/sd_meds
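The standardize-and-sum rule, including the rescaling by c, can be sketched as follows. The domain probabilities below are invented, and the standard deviations echo the illustrative values quoted later in this deck; neither is from the actual STS harvest.

```python
# Sketch of the STS standardize-and-sum composite (Steps 4b and 5).
# pi_* are hierarchical estimates of "success" probabilities with consistent
# directionality (e.g., mortality AVOIDANCE rather than mortality).

def sts_composite(pi, sd):
    """Weighted average of domain probabilities with weights 1/sd_d,
    normalized by c = sum of 1/sd_d so the score stays on the 0-100 scale."""
    weighted_sum = sum(pi[d] / sd[d] for d in pi)
    c = sum(1.0 / sd[d] for d in sd)
    return weighted_sum / c

pi = {"mort": 97.8, "morb": 87.0, "ima": 93.5, "meds": 52.6}  # percent scale
sd = {"mort": 0.5, "morb": 4.2, "ima": 5.8, "meds": 14.3}     # illustrative sds

score = sts_composite(pi, sd)
# Because every pi lies between 0 and 100 and the weights are normalized,
# the composite is a weighted average and must also lie between 0 and 100.
```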

(This guarantees that the final score will be between 0 and 100.)

Distribution of Composite Scores
[Figure: histogram of estimated composite scores, roughly 86 to 98; median = 95.0%, IQR 94.0% to 95.6%. Fall 2007 harvest data, rescaled to lie between 0 and 100.]

Goals for STS Composite Measure

- Heavily weight outcomes
- Use statistical methods to account for small sample sizes & rare outcomes
- Make the implications of the weights as transparent as possible
- Assess whether inferences about hospital performance are sensitive to the choice of statistical / weighting methods

Exploring the Implications of Standardization
If items were NOT standardized:
- Items with a large scale would disproportionately influence the score
- Example: medications would dominate mortality
- A 1% improvement in mortality would have the same impact as a 1% improvement in any other domain
[Figure: ranges of provider-specific rates (0-100% axes): all-or-none medication usage spans a wide range, while risk-standardized mortality spans a narrow one]

Exploring the Implications of Standardization

Composite = pi_mort/0.5 + pi_morb/4.2 + pi_IMA/5.8 + pi_meds/14.3

After standardizing, a 1-point difference in mortality has the same impact as:
- an 8% improvement in the morbidity rate
- an 11% improvement in use of IMA
- a 28% improvement in use of all medications

Item-Total Correlation: Composite Is Weighted Toward Outcomes
[Figure: bar chart of item-total correlations]
- Mortality: 0.78
- Morbidity: 0.65
- Meds: 0.56
- IMA: 0.48

Sensitivity Analyses
Key question: Are inferences about hospital quality sensitive to the choice of methods? (If not, then the stakes are not so high.)
Analysis:

Calculate composite scores using a variety of different methods and compare the results.

Sensitivity Analysis: Within-Domain Aggregation (Opportunity Model vs. All-Or-None Composite)
[Figure: scatterplot of the all-or-none medication score vs. the simple-average (opportunity model) score]
- Spearman rank correlation = 0.98
- Agreement within 20 percentile points = 99%
- Agreement on top quartile = 93%
- Pairwise concordance = 94%
- One hospital's rank changed by 23 percentile points
- No hospital was ranked in the top quartile by one method and in the bottom half by the other

Sensitivity Analysis: Method of Standardization
Method 2: Re-scale to (0,1) by dividing by the range instead of the standard deviation, where range denotes the maximum minus the minimum (across hospitals):

Composite = pi_mort/range_mort + pi_morb/range_morb + pi_IMA/range_IMA + pi_meds/range_meds

[Figure: Method 2 vs. Method 1 (normalized by standard deviation); Spearman r = 0.99]

Method 3: Don't standardize (simple average, no rescaling):

Composite = pi_mort + pi_morb + pi_IMA + pi_meds

[Figure: Method 3 vs. Method 1 (normalized by standard deviation); Spearman r = 0.84]

Sensitivity Analysis: Summary
- Inferences about hospital quality are generally robust to minor variations in the methodology
- However, standardizing vs. not standardizing has a large impact on hospital rankings

Performance of Hospital Classifications Based on the STS Composite Score

- Bottom tier: 99% Bayesian probability that the provider's true score is lower than the STS average
- Top tier: 99% Bayesian probability that the provider's true score is higher than the STS average
- Middle tier: < 99% certain whether the provider's true score is lower or higher than the STS average

Results of Hypothetical Tier System in 2004 Data
- Below average: N = 70
- Indistinguishable from average: N = 407
- Above average: N = 53

Ability of Composite Score to Discriminate Performance on Individual Domains
[Figure: bar charts by tier]

                            Low Tier   Middle Tier   High Tier
Risk-adjusted mortality       3.0%        2.4%         1.7%
IMA usage                    88.1%       93.7%        95.7%
Any-or-none morbidity        18.1%       13.5%         9.8%
All-or-none medications      35.7%       48.1%        66.4%

Summary of STS Composite Method

- Use of an all-or-none composite for combining items within domains
- Combining domains was based on rescaling and adding
- Estimation via Bayesian hierarchical models
- Hospital classifications based on Bayesian probabilities

Advantages
- Rescaling and averaging is relatively simple, even if the estimation method is not
- Hierarchical models help separate true quality differences from random noise
- Bayesian probabilities provide a rigorous approach to accounting for uncertainty when classifying hospitals (control of false positives, etc.)

Limitations
- Validity depends on the collection of individual measures
  - Choice of measures was limited by practical considerations (e.g., availability in STS), although the measures were endorsed by NQF
  - Weak correlation between measures
- Reporting a single composite score entails some loss of information
- Results will depend on the choice of methodology
  - We made these features transparent: examined the implications of our choices and performed sensitivity analyses

Summary
- Composite scores have inherent limitations
- The implications of the weighting method are not always obvious
- Empirical testing & sensitivity analyses can help elucidate the behavior and limitations of a composite score
- The validity of a composite score depends on its fitness for a particular purpose (possibly different considerations for P4P vs. public reporting)

Extra Slides

Comparison of Tier Assignments Based on Composite Score vs. Mortality Alone
Mortality only:
- Worse than average: N = 6
- Indistinguishable from average: N = 524
- Better than average: N = 0

Composite score:
- Worse than average: N = 70
- Indistinguishable from average: N = 407
- Better than average: N = 53

EXTRA SLIDES: STAR RATINGS VS. VOLUME
[Figure: frequency of star categories by volume]

EXTRA SLIDES: HQID METHOD APPLIED TO STS MEASURES

Finding #1. Composite Is Primarily Determined by Outcome Component
[Figure: (A) HQID-style composite score vs. the process component; (B) HQID-style composite score vs. the outcome component. The composite correlates weakly with the process component (r = 0.07) and strongly with the outcome component (r = 0.93).]

Finding #2. Individual Measures Do Not Contribute Equally to Composite
[Figure: percent of explained variation attributable to each individual measure; contributions range from 49% and 41% for the largest measures down to 1% and 0% for the smallest]

Explanation: Process & Survival Components Have Unequal Measurement Scales
[Figure: ranges of the process adherence rate (%) and the survival index across sites]

EXTRA SLIDES: CHOOSING MEASURES

Process or Outcomes?
[Diagram: processes that impact patient outcomes vs. processes that are currently measured]
[Diagram: outcomes are produced by processes that impact patient outcomes, plus randomness]

Structural Measures?
[Diagram: structure influences the processes that impact patient outcomes, which together with randomness produce outcomes]

EXTRA SLIDES: ALTERNATE PERSPECTIVES FOR DEVELOPING COMPOSITE SCORES

Perspectives for Developing Composites
Normative perspective:
- The concept being measured is defined by the choice of measures and their weighting, not vice versa
- Weighting different aspects of quality is inherently normative: the weights reflect a set of values (whose values?)

Perspectives for Developing Composites
Behavioral perspective:
- The primary goal is to provide an incentive
- Optimal weights are the ones that will cause the desired behavior among providers
- Issues: Reward outcomes or processes? "Rewarding X while hoping for Y"
