
Measuring Disease Management Results in Small Accounts: It Doesn't Look Easy, and It's Not As Easy As It Looks

The techniques which have evolved to measure financial results in health plans and sizable employers turn out to be, at best, time-consuming and inconclusive exercises for small accounts. Beyond that, they are often actually misleading. The DMAA Guidelines themselves acknowledge as much. They recommend a population size well into the five figures (though they are silent on an exact number) as the minimum necessary for reasonably valid population-based measurement. DMPC generally agrees, with 50,000 employees as the lowest level at which pre-post can reasonably be applied, though a Medicaid disabled population, a Medicare population, or even a municipal or union population might generate valid-enough results in the lower five figures of total lives, owing to the higher proportion of disease-eligible members.

So what to do for the rest of the employer community, which consists mostly of employers with fewer than 50,000 employees? Eight observations and recommendations follow. DMPC certification in small-group measurement is per item. Slides from reports or proposals containing any item below may be stamped as DMPC-Certified for small-group measurement, using our seal. The entire report or proposal isn't certified, just the pieces which conform to one of the items below. (See rules for details.)

#1: The classic disease-specific plausibility indicators need to be expanded.

The heart attack and angina attack rates are each about 1 per 1000 in the <65 population. A year in which a 1000-member group has one more or one fewer cardiac event is likely to be due to luck. That's why you need a much larger set of datapoints than just disease-specific events. This larger set of datapoints -- your "small group plausibility indicator" -- should start with total hospital days, admissions, and ER visits. Note that a change in ER copay will confound these results and make the ER portion invalid, with one exception: if the copay increases and the number of visits does not decrease, that is a datapoint indicating that the DM/wellness program is failing to reduce ER visits even with a financial disincentive.

Next, subtract maternal/neonatal, cancer, trauma, and surgeries other than heart surgeries, amputations, and bariatric surgeries. (These ICD9 groupings will shortly be available from DMPC.) That calculation yields something approaching total possibly impactible medical/surgical admissions and ER visits. Of course that subset will include some which truly aren't impactible, but if your DM/wellness program is working, over a period of a few years there should be a positive trend of declining admissions here, even with the "white noise" of non-avoidable medical admissions. (Don't even attempt to do this on a cost basis with inflation adjustments -- stick to utilization measurement only.) Note that in this case it is "DM/wellness," rather than one or the other. You can't separate out the two effects, which is one objective reason for contracting for both from the same source. (Some prefer to purchase "best of breed," but other things equal, measurement is easier on a consolidated program.)
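As a rough illustration only, here is a minimal sketch of that subtraction on a claims extract, using Python. The column names and category labels are hypothetical placeholders; the actual ICD9 groupings are the ones DMPC will publish.

# Minimal sketch: tally "possibly impactible" admissions and ER visits by year
# after removing excluded categories. Column names and category labels are
# hypothetical; the real ICD9 groupings are to come from DMPC.
import pandas as pd

EXCLUDED_CATEGORIES = {
    "maternal_neonatal", "cancer", "trauma",
    "other_surgery",  # surgeries other than heart, amputation, and bariatric
}

def impactible_events(claims: pd.DataFrame) -> pd.DataFrame:
    """Count admissions and ER visits per year, excluding the categories above.

    Expects one row per admission or ER visit, with columns
    'year', 'event_type' ('admission' or 'er_visit'), and 'category'.
    """
    in_scope = claims[~claims["category"].isin(EXCLUDED_CATEGORIES)]
    return in_scope.groupby(["year", "event_type"]).size().unstack(fill_value=0)

# A declining count of in-scope admissions over a few years is the trend you
# are looking for, despite the "white noise" of non-avoidable admissions.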

Those adjustments are more likely to introduce bias and calculation mistakes than to correct for actual trend. Just count days, and apply a standard per diem cost to avoided days. Also, cap long stays at perhaps 10 days to prevent wild distortions.

If you have 1000 people and you are paying $3 PMPM, you need a reduction of about 20 days just to break even. 1000 people should generate about 150 days relating to medical admissions, so 20 days is a noticeable change.1 For those who are familiar with Ariel Linden's published findings, that level of reduction is also roughly what he would cite as the number of reduced days needed to break even. Hospital days can bounce year to year, so look for at least two years of this sustained reduction from your previous trend. After that, look for it to be sustained, but don't penalize the program if you don't see further improvement.
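To make the break-even arithmetic explicit, here is a minimal sketch using the $3 PMPM figure above; the $1,800 standard per diem is an illustrative assumption implied by the 20-day figure, not a benchmark.

# Minimal sketch: avoided inpatient days needed to cover the program fee.
# The per diem is an illustrative assumption, not a benchmark.
def breakeven_days(members: int, pmpm_fee: float, per_diem: float) -> float:
    """Days of avoided inpatient care needed to offset the annual program fee."""
    annual_fee = members * pmpm_fee * 12
    return annual_fee / per_diem

print(breakeven_days(members=1000, pmpm_fee=3.00, per_diem=1800.00))  # 20.0 days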

This should work down to the group level of about 1000 total lives. But what about groups well below that? What "plausibility indicator" can be applied which has enough datapoints to create statistical significance even in a group with 100 people? That is the subject of the next recommendation.

#2: Use the classic plausibility indicators only to disprove, not to prove

A financial finding of savings based on pre-post analysis, confirmed by a reduction in the event rates for most of the key plausibility indicators, is persuasive evidence of a program's success in a large group. However, in a small group the plausibility indicators are much more likely to blip up and down on their own, because of the small number of data points. Therefore a financial finding accompanied by reductions in event rates could still be due to a blip. However, a financial finding of savings with no change or an increase in the classic plausibility indicators is invalid. It would be impossible to have truly achieved savings in the five classic conditions if there is no change in the event rates for those conditions.

#3: Measure unscheduled paid time off (PTO), which correlates fairly closely with (but overstates) absences.

Quite simply, if you did anything at all in wellness and disease management, employees should be spending less unscheduled time away from work. This indicator should "bounce" randomly much less than hospitalizations, because the number of datapoints is much greater. A company with 100 employees might have 200 unscheduled PTO days, but might have only 4 medical admissions. A year in which you have three admissions would then represent a 25% reduction, but no one would automatically attribute that to a program. Conversely, the same 25% reduction in unscheduled PTO would be a reduction of fifty days of lost work time – clearly a dramatic improvement which doesn’t happen randomly.
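For groups that do track it, a minimal sketch of the year-over-year comparison might look like the following; the HR extract column names are hypothetical.

# Minimal sketch: year-over-year trend in unscheduled PTO days per employee.
# Column names ('year', 'employee_id', 'unscheduled_pto_days') are hypothetical.
import pandas as pd

def unscheduled_pto_per_employee(hr_extract: pd.DataFrame) -> pd.Series:
    """Average unscheduled PTO days per employee, by year.

    Expects one row per employee per year.
    """
    return hr_extract.groupby("year")["unscheduled_pto_days"].mean()

# With roughly 200 unscheduled PTO days spread across 100 employees, a 25%
# drop (about 50 days) is far less likely to be random noise than a
# one-admission swing in a group with only four admissions.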

Of course, the majority of groups still don't track unscheduled PTO or even total PTO, but spending a large sum on a wellness/DM program should provide an excuse to start. Note that some supervisors are looser than others in docking their direct reports for absences, but that type of white noise should wash out, and the reduction goal is great enough that even with white noise there should be a noticeable decline. It is also quite possible that the absence number will be wrong...but it should be consistently wrong year after year, which would not affect the trend in absence measurement.

Make sure to include short-term disability in your tally of PTO. For long-term disability, apply the same test described in #4 below for high-medical-cost cases: should this episode of LTD have been prevented through disease management?

Why "validated" absence measurements are not acceptable

An instrument is "validated" if the subjectively collected information reasonably mirrors what really happened: for instance, someone says their absences went from four days to two days, and a review of the data shows that to be the case.

We have our doubts about whether anyone would complete such a survey honestly. However, our main objection is that these surveys are generally not sent to everyone, only to people who had been identified as being at risk or having had absences in the previous period. As anyone who has attended DMPC seminars on regression to the mean in disease management knows, that type of sampling creates a significant bias towards showing improvement even where none existed, for the simple reason that people who had no absences during the baseline are not surveyed. Their absences could easily increase (at best they will stay the same) but that increase would not be captured.

For instance, suppose the program were offered only to people with four or more absences. If someone goes from zero absences to four and back to zero across three successive periods, the program would show only the improvement in the third period, not the deterioration in the second period. This person would not have received a survey or been invited to join the program in the first period. Only after recording four absences would this person be flagged.
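A minimal simulation of that sampling bias, using a made-up absence distribution, shows why surveying only the high-absence group guarantees apparent improvement even when nothing changes:

# Minimal sketch: simulate regression to the mean when only employees with
# four or more baseline absences are surveyed. The absence distribution is
# made up purely for illustration.
import random

random.seed(0)
THRESHOLD = 4

def simulate(n_employees: int = 10_000) -> None:
    baseline = [random.randint(0, 8) for _ in range(n_employees)]
    followup = [random.randint(0, 8) for _ in range(n_employees)]  # no real change

    surveyed = [(b, f) for b, f in zip(baseline, followup) if b >= THRESHOLD]
    avg_base = sum(b for b, _ in surveyed) / len(surveyed)
    avg_follow = sum(f for _, f in surveyed) / len(surveyed)
    # The surveyed subset "improves" (roughly 6 days down to 4) even though the
    # population as a whole did not change, purely because it was selected for
    # having high baseline absences.
    print(f"surveyed baseline avg: {avg_base:.2f}, follow-up avg: {avg_follow:.2f}")

simulate()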

#4: Have an unbiased HIPAA business associate third party review all your high-cost hospitalizations (those over a threshold) to determine which should have been prevented.

You can do this using claims data, though accessing case management notes is much more helpful. Ideally you'd like a chart review, such as an External Review Organization would provide. Comparing the number of these cases to the number in the baseline is worthwhile but not conclusive; in any case, that's not the main reason to do it.

The main reason to do it is to determine how many of these cases should have been prevented by disease management. Are these cases in the disease management "wheelhouse" like a heart attack or amputation? If so, inquire of the DM vendor (or internal entity) what exactly happened in this case. Was the person flagged? If not, should s/he have been? What contacts or contact attempts were made? What was the content of the phone calls? What other follow-up was there?

If you are in or contemplating a DM contract with a vendor, this is the kind of thing which should be decided on in advance, to the level of specificity of even agreeing upon who the reviewer should be.

Note: Though it is far too early to know what the benchmark should be, do not expect anything close to 100% of the outlier cases to fall outside the purview of "should have been caught" by disease management. Also, don't expect any two unbiased experts to come up with exactly the same result, though their results should be close.

#5: Give your satisfaction survey some "teeth."

Too often these surveys are desultory affairs, conducted by the vendor itself. They ask whether someone is satisfied, whether they would recommend the program to a friend, whether it has been helpful, etc. Obviously most will say yes -- the program doesn't cost them any money, so of course most people will think it's a good deal. And those who don't, because they didn't waste any money on it, won't be dissatisfied enough to bother to complete the survey.

What is needed to measure small accounts is to survey the perceived value of the program, value being defined as helpfulness in relation to cost. You probably don't want to tell the respondents exactly how much these programs cost, so the satisfaction survey tries to determine what the respondents would trade the program for in terms of vouchers redeemable for other health-related activities, and/or how much they themselves would pay for it. (DMPC has such a survey -- a simple ten questions -- for member use.) Note: Expect the average willingness-to-pay to be less than the price, because only a portion of the benefits accrues to the member. (The rest accrues to the employer.) The question is just: how much less? If the average person would pay only $5/year, then the program should be further scrutinized. Though it is too early to tell, one would hope that the average person would think his or her employer is paying at least $50 a year.

Certification for #5 can be achieved using the DMAA model satisfaction survey with a small but important change. Before the very first question there is a statement along the lines of: "If you feel that this survey has been sent to you in error and you are not participating in a disease management program, please disregard this survey."

Whether someone believes they are participating in such a program is a critical piece of data. This statement must be turned into a question: "Are you participating in a disease management program now?" (Y/N). Since only people who are known to be in the program get a survey, the percentage answering "no" is, by itself, data. Note that this can only be done on an automated or live phone survey -- in a mail survey someone would simply not respond, and you won't know whether they aren't responding for this reason or for a myriad of others.

#6: Get right to the biometric findings.

While research correlates reduced expenses with reduced risk factors, Chief Financial Officers (CFOs) are usually skeptical of being asked to pay on the strength of that correlation. But if you measure this risk objectively, with companywide mandatory health screens on an annual basis, it becomes more believable. While a CFO may not believe the exact numbers generated by the research, most CFOs would agree that a companywide improvement in weight, blood pressure, lipid levels, and other biometrics is worth paying for through disease management and wellness. But these measurements have to be fairly close to mandatory, and more than a subjective assessment.

Ideally one would like to do a paired analysis, comparing the same people's findings year-over-year. Consider two situations in which there is no change on average: in one, a number of people who had been in the high-risk range get better while others who had tested normal get worse; in the other, most people stay about the same as in the initial test. Note that a paired analysis is only valid enough for certification where the biometric testing is close to mandatory, if not mandatory. Otherwise, the people who agree to be tested multiple times are the ones most likely to have shown improvement. (This is exactly the fallacy in most wellness reports involving results from repeat tests -- only a subset repeats...and that subset is the most motivated subset.)
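A minimal sketch of such a paired comparison, counting movement into and out of the high-risk range, follows; the threshold and column names are hypothetical placeholders.

# Minimal sketch: paired year-over-year biometric comparison, counting how many
# of the same members moved into or out of the high-risk range. The threshold
# and column names are hypothetical.
import pandas as pd

HIGH_RISK_THRESHOLD = 140  # e.g., systolic blood pressure; illustrative only

def paired_risk_movement(year1: pd.DataFrame, year2: pd.DataFrame) -> dict:
    """Join the same members across two screening years and count transitions.

    Each frame needs columns 'member_id' and 'value'.
    """
    merged = year1.merge(year2, on="member_id", suffixes=("_y1", "_y2"))
    improved = ((merged["value_y1"] >= HIGH_RISK_THRESHOLD) &
                (merged["value_y2"] < HIGH_RISK_THRESHOLD)).sum()
    worsened = ((merged["value_y1"] < HIGH_RISK_THRESHOLD) &
                (merged["value_y2"] >= HIGH_RISK_THRESHOLD)).sum()
    return {"improved": int(improved), "worsened": int(worsened),
            "paired_members": len(merged)}

# Many "improved" offset by many "worsened" (same average) corresponds to the
# first situation described above; few transitions either way corresponds to
# the second.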

The first situation implies that the wellness and/or DM programs are working, and also that there needs to be a change in corporate culture so that people don't deteriorate absent the program. For instance, maybe the cafeteria food needs to be made healthier, or maybe the focus of wellness needs to shift from the high-risk group to everyone, even though the wellness program appears to be working.

The second situation -- remember, in this example, the average is the same in both groups -- implies exactly the opposite: that neither wellness nor DM is having a noticeable impact.

At some point benchmarks will be available from DMPC to determine whether a group's performance puts it more closely in the first category or the second.

#7: Use the DMPC-approved measurement on your book of business, backed with plausibility indicators, and be very clear that this is an average for all your groups

To do this, you must be DMPC-certified for savings measurement. And you must use the plausibility indicators.

#8: Split the group in two and compare the halves on total trend (excluding outliers)

If you have 10,000 or more employees, you can do the intervention on roughly half and not on the other half. (If you have more than 10,000, one group needs to have at least 5,000 people in it.) The groups should, going in, be comparable in all the obvious ways -- benefits design, age, claims history. Then compare the trends, removing obvious outliers from both, such as transplants, neonates, multiple trauma, and cases > $100,000. (Some would argue the neonates point, saying those are preventable -- that can be done either way.)
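As a rough illustration of the comparison itself, here is a minimal sketch that contrasts per-member cost trends in the two halves after dropping outlier members; only the $100,000 cutoff from the text is shown (the category exclusions such as transplants and neonates would be applied similarly), and the column names are hypothetical.

# Minimal sketch: compare per-member claims trends between the intervention
# half and the comparison half, excluding members whose annual total exceeds
# $100,000. Column names ('group', 'year', 'member_id', 'paid_amount') are
# hypothetical.
import pandas as pd

OUTLIER_CUTOFF = 100_000

def trend_by_group(claims: pd.DataFrame, base_year: int, current_year: int) -> pd.Series:
    """Percent change in average per-member paid claims, by group."""
    annual = (claims.groupby(["group", "year", "member_id"])["paid_amount"]
              .sum().reset_index())
    annual = annual[annual["paid_amount"] <= OUTLIER_CUTOFF]
    per_member = (annual.groupby(["group", "year"])["paid_amount"]
                  .mean().unstack("year"))
    return (per_member[current_year] - per_member[base_year]) / per_member[base_year]

# A visibly lower trend in the intervention half, sustained over a year or two,
# is the signal to look for.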

Watch-outs

In each case you need to watch out for confounders. There are actually only a few which would skew your results more than a little and which happen with some frequency. Examples: Did your company recently expand by hiring a lot of entry-level employees? Did it make an acquisition? Did it offer an early retirement buyout? Did it change its PTO policies? Did it make a major change in medical benefits design?

Finally, each of these is fairly low-cost compared to a complete financial ROI calculation. The only one which requires significant time is #4, and even that shouldn't be more than $150/case for no more than a handful of cases. (A full review might take longer, but in this case you only want to answer one question -- no need to get into a full quality-of-care discussion.)

How to pay for these measurements

For health plans and vendors who are selling these programs and measurements, many of them represent revenue and customer retention opportunities. The ability to help an HR department track absences, for instance, has a monetary value, as does the collection of biometrics.

For employers who are buying these programs, there is no real option other than using some of these measurements. Without them, there is no way of knowing whether a program is working or not. Several of these measurements are low in cost and could be done by the data warehouse vendor. The others are higher in cost but are trivial compared to the cost of productivity lost to absenteeism and medical expense. A company would never undertake a quality improvement program in its manufacturing operations without having a way to measure the results; surely the same should be true in human resources.

Why we are not counting risk scores towards small group measurement certification.

Some people have asked us about using risk scores. Of course you are welcome to use whatever measurements you want, but we find that this one is not reflective of a group's performance for this particular purpose and therefore are not certifying it. For instance, suppose someone has a heart attack. Heart attacks in the <65 population are among the most "voluntary" of major medical events, if not the most "voluntary" one. A screen would identify many if not most people who are close enough to infarcting that they are going to have a heart attack within the next twelve months. Someone who fails the screen should have been given an intervention before infarcting. Therefore the program failed this person. That should be a mark against the program.

However, following a heart attack, the person's risk score would also increase year-over-year, thus increasing the risk score for the population as a whole. This would be a mark in favor of the program, increasing the "degree of difficulty" adjustment, as though the risk score were a totally independent variable. We feel that risk scores mix independent and dependent variables and are therefore not a good indicator for measuring success.

Having said that, we believe that a cross-sectional comparison of risk scores and risk score trends is a good idea and we hope to collect that data in the future.

1. Days should correlate loosely with admissions. If your days went down but your admissions didn't, it is unlikely that the program can be credited, since the change took place within the inpatient stay itself. Days are preferred to admissions for estimating cost-effectiveness because most payors generally pay on that basis.


Disease Management Purchasing Consortium International, Inc.

890 Winter Street, Suite 208
Waltham, MA 02451
Phone: 781 856 3962
Fax: 781 884 4150
Email: alewis@dismgmt.com