Monday, 10 August 2009

Validity, reliability & generalisability of project results; the science of improvement?

I am aware that the use (and sometimes invention) of management gobbledygook words to describe actions and intentions regarding quality improvement may sometimes be more of a hindrance than a help. After an inspirational week spent with the Veterans Affairs Quality Scholars in Vermont, I got thinking more about this.

"Spread" and "Sustainability" are my two pet problems as words. Spread is difficult to describe and many people use it in different ways and for different dynamics, thus creating more confusion. I have worried for a long time that there is no such issue as sustainability in QI if we are doing proper continuous improvement.

So I wonder what reframing and mindset shifts we get when we use words from the science discipline - after all, many call what we do the "science of improvement".

Validity
Generally in science, validity measures the extent to which a test, experiment or method actually does what it was designed to do.
I wonder how many improvement projects get "good results" yet the overall aim is not identifiably or actually reached. For example, a project designed to reduce length of stay in hospital may use average LOS as its measure. Over 6 months this may show an increase, despite much work. Average LOS is total bed days divided by the number of discharges, and what may have happened is that other work on prevention and lowering readmission rates has removed short-stay admissions: total bed days fall, but the number of discharges falls faster, so the average rises. Apart from an average not being a good improvement measure, it is possible the actions taken on LOS were not focused on the overall aim - which could have been to reduce cost, to change the experience for the patient, and so on.
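To make the arithmetic concrete, here is a rough sketch in Python with entirely invented numbers, showing how average LOS can rise even as total bed days fall:

    # Baseline: 100 discharges, 40 of them short-stay readmissions.
    baseline_stays = [2] * 40 + [8] * 60    # 40 two-day stays, 60 eight-day stays
    baseline_bed_days = sum(baseline_stays)                  # 560 bed days
    baseline_avg = baseline_bed_days / len(baseline_stays)   # 5.6 days

    # After (hypothetical) prevention work removes the 40 short-stay readmissions:
    followup_stays = [8] * 60
    followup_bed_days = sum(followup_stays)                  # 480 bed days - lower
    followup_avg = followup_bed_days / len(followup_stays)   # 8.0 days - higher!

    print(f"Total bed days: {baseline_bed_days} -> {followup_bed_days}")
    print(f"Average LOS:    {baseline_avg:.1f} -> {followup_avg:.1f}")

Bed days fall by 80 yet average LOS rises from 5.6 to 8.0 days, because the short stays that disappeared were also discharges in the denominator.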

In statistical terms, this raises the question of whether the sample used exhibits the characteristics of the population.
This may be one of the reasons why spread does not happen. We choose populations outside the norm (people willing to change, contexts that are already prepared, sites given extra help, and so on) and then, when they get good results, we require the "norm" to copy them. In many cases the results, the change process and the toolkit produced are designed for a very small population and have poor validity across the wider intended adopting group.
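One simple check before claiming results will travel is to compare the pilot site's case mix with the wider population. Here is a minimal sketch, again with invented data, using a standardised mean difference (a common rule of thumb treats values above about 0.1 as a meaningful imbalance):

    from statistics import mean, stdev

    pilot_ages = [72, 68, 75, 70, 69, 74, 71, 73]             # pilot ward patients
    hospital_ages = [55, 81, 62, 90, 45, 77, 66, 88, 59, 70]  # hospital-wide sample

    # Standardised mean difference between pilot and population.
    pooled_sd = ((stdev(pilot_ages) ** 2 + stdev(hospital_ages) ** 2) / 2) ** 0.5
    smd = abs(mean(pilot_ages) - mean(hospital_ages)) / pooled_sd

    print(f"Pilot mean age {mean(pilot_ages):.1f} vs hospital {mean(hospital_ages):.1f}")
    print(f"Standardised mean difference: {smd:.2f}")  # ~0.21 with these invented data

Age is just a stand-in here; in practice you would compare whatever characteristics (willingness to change, preparation, extra help) made the pilot population unusual.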

Reliability
Statistically, this is the amount of credence placed in a result: the precision of the measurement as it is repeated over a specific period of time.
For improvement projects that use control charts, reliability will show as the extent to which the process is in control. On a more macro level, the reliability of a measure is weakened when the measurement method changes over time or when the measurement is open to "gaming".
The measures most open to gaming (and, I think, the most lacking in reliability) are targets like "95% of patients to wait no longer than 4 hours".
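As a concrete illustration of "the extent to which the process is in control", here is a minimal sketch of the standard XmR (individuals) control chart calculation, using invented weekly waiting-time figures:

    # Invented weekly average waiting times (hours) for one department.
    data = [3.1, 2.8, 3.4, 3.0, 2.9, 3.6, 3.2, 6.0, 3.0, 2.7]

    mean_x = sum(data) / len(data)
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)

    # Standard XmR limits: mean +/- 2.66 x average moving range.
    ucl = mean_x + 2.66 * mr_bar
    lcl = mean_x - 2.66 * mr_bar

    print(f"Mean {mean_x:.2f}, limits [{lcl:.2f}, {ucl:.2f}]")
    for week, x in enumerate(data, start=1):
        if x > ucl or x < lcl:
            print(f"Week {week}: {x} is outside the limits - a special-cause signal")

With these numbers, week 8 falls outside the upper limit; until such signals are understood, a single "good result" drawn from the process deserves little credence.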

Reliability can also mean the probability that a measurement (or intervention) will perform its intended function over time and within a given set of conditions.
This definition reminds me of the talk about "sustainability". If results drop off, or the way of measuring becomes "unsustainable" (usually due to other changes in the system), then the problem may be more one of design than of a loss of momentum (or however you conceive of "sustainability").

Generalisability
To draw general inferences from specific results; to make generally or universally applicable.
This is about demonstrating that the improvement work carried out in Ward 10 is applicable throughout the hospital. To what extent can the other 15 wards copy what has been done and get the same result? In my experience we often end up with, say, half the other wards adopting something, and of those, not all get the same results as the originating ward (some may in fact do better). This is about spread, and to effect generalisability the originating project needs to be able to describe its contextual factors and anything else that may be contributing to its results. Without this, adopting teams, and the managers who would like the work adopted, have little knowledge of the generalisability of the work.
I also see a lot of "the result from hospital A was a £10,000 saving, so all ten of our hospitals can achieve a total of £100,000 if they do the same". I suggest this calculation is meaningless without a demonstration of the probability of generalisability.
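To show why, here is the same sum sketched with two hedging factors added; the adoption probability and relative effect size below are pure assumptions for illustration, not data:

    pilot_saving = 10_000     # observed saving at hospital A (from the example above)
    n_hospitals = 10

    p_adopt = 0.5             # assumed: only half the sites adopt at all
    relative_effect = 0.7     # assumed: adopters average 70% of the pilot's effect

    naive_total = n_hospitals * pilot_saving
    expected_total = n_hospitals * p_adopt * relative_effect * pilot_saving

    print(f"Naive claim:     £{naive_total:,}")        # £100,000
    print(f"Hedged estimate: £{expected_total:,.0f}")  # £35,000 under these assumptions

Even these generous assumptions cut the headline figure by nearly two thirds, which is why the naive multiplication needs a demonstrated probability of generalisability behind it.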

The SQUIRE Guidelines have been developed to help overcome some of the lack of rigour in publishing improvement work. In particular, they address this contextual/generalisability issue.
