The Red Herring of Reliability
The issue of reliability regularly features in the DSM debate. Advocates of the DSM system report favorable reliability statistics in one or other of the various revisions (e.g., Brown, Di Nardo, Lehman, & Campbell, 2001), while critics paint a different picture (e.g., Timimi, 2011). Kirk and Kutchins (e.g., 1992, 1997) have written extensively on the topic of DSM reliability. They claim that although it is widely accepted that the DSM system greatly improved diagnostic reliability, even 20 years after publication of the DSM-III there was still not “a single major study showing that DSM (any version) is routinely used with high reliability by regular mental health clinicians” (1997, p. 53).
The concerted efforts to clarify and improve the reliability of the DSM have led us down a blind alley. Even if the DSM were perfectly reliable, it would still be a flawed system; perfect reliability will not solve the DSM’s problems. Perfect reliability would mean that two people would always reach the same diagnostic decision about the information presented to them. Different clinicians, for example, would identify “disorganized schizophrenia” or “obsessive compulsive disorder” from the same case material. But would they be correct? If two clinicians agree that someone meets the diagnostic criteria for oppositional defiant disorder, does that mean the person really has oppositional defiant disorder? Given that the DSM is neutral with respect to etiology, what has someone “got” when they’ve “got” a DSM disorder?
The system of Zodiac signs may well be a system approaching perfect reliability. With very little training, and no more information than a person’s date of birth, two raters could easily agree on whether someone belonged to the Sagittarius or the Pisces (for example) group. Is this person “really” a Sagittarian or a Piscean? Could we say that the person was born with Sagittariusness?
With minimal instruction people could also be taught to reliably identify the constellations in the night sky. What would that tell us? If we could achieve perfect agreement between raters at selecting Orion from a range of constellations we still would not have progressed any further in our understanding of the universe.
The fundamental problem with reliability is that it will not, and cannot, tell us what we want to know. Reliability tells us the extent to which different raters arrive at the same conclusion. That is, it tells us about the raters, not the person being rated; it is silent about the thing being rated.
All we can conclude when two people reach the same diagnostic decision after reviewing the same case material is that these people are organized to have similar pattern recognition systems, at least for recognizing the particular disorder in question. The important point is that the agreement the two people reach tells us nothing about the person on whom the case material was based.
When we investigate reliability we find which diagnoses raters can agree on and which they have difficulty agreeing about; nothing more. We learn nothing about the legitimacy of the diagnostic categories and we move no closer to understanding the nature of psychological distress or how it might best be treated.
Understanding what information the reliability statistic provides is crucial, but this is often not fully appreciated. In a recent letter to the editor of the American Journal of Psychiatry, Spitzer and his colleagues (Spitzer, Williams, & Endicott, 2012) cited an earlier article (Spitzer, Forman, & Nee, 1979) about the reliability obtained in the DSM-III field trials. Reflecting on a conclusion drawn in the 1979 paper, they reported that kappas (the reliability statistic) greater than 0.7 were considered to indicate “good agreement as to whether or not the patient has a disorder within that diagnostic class” (Spitzer, Forman, & Nee, 1979, as cited in Spitzer, Williams, & Endicott, 2012). Agreeing about what a patient is considered to have, however, provides no certainty about what might actually be troubling the patient.
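For readers unfamiliar with the statistic, kappa corrects raw percentage agreement for the agreement two raters would reach by chance alone, so a kappa of 0.7 is more stringent than 70% raw agreement. The following is a minimal sketch of Cohen’s kappa in Python; the function name and the diagnostic labels are invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: proportion of cases given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap given each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for ten cases; the two raters agree on 8 of 10.
rater_1 = ["MDD"] * 5 + ["GAD"] * 5
rater_2 = ["MDD"] * 4 + ["GAD"] * 5 + ["MDD"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.6
```

Note that the 80% raw agreement here yields a kappa of only 0.6, because with two equally frequent labels the raters would already agree half the time by chance; this is exactly why a kappa threshold is a statement about raters’ consistency with each other, not about what the rated person actually has.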
Developing exercises to help clinicians agree more readily with each other might be an important undertaking in its own right, but we need to be very clear about what this agreement does and does not tell us. It certainly doesn’t tell us anything about the person we have reached agreement about.
The DSM system of classification via symptom pattern groupings is akin to organizing categories of automobile drivers based on the way they hold the steering wheel. Those drivers who hold the steering wheel with two hands at the top could be allocated to one category, those who hold it with one hand on the side allocated to another, and so on. With this system we could study the different driver categories and, with sufficient numbers in our studies, we could discern differences between the groups. We could even look for differences in brain circuits and brain chemistry to attempt to understand the basis of their differences. We could develop observation methods to achieve high levels of agreement between independent raters regarding which categories drivers should be assigned to and we could develop intervention programs to improve driver behavior. Regardless of the effort involved in developing this system it would achieve successes only serendipitously because it is not a valid way of understanding driving behavior.
Although the terms “reliability” and “validity” are commonly used together, they are very different constructs. Reliability for the DSM is irrelevant because the DSM is not a valid way of understanding people or their problems. Progress in understanding and treating disturbances to mental health will never be achieved by improving the reliability of a fatally flawed system. We would be well justified in abandoning efforts to improve reliability and starting again at the beginning by defining mental health. From a sensible, accurate, and precise understanding of mental health we could then move to an appreciation of mental health problems and a more sophisticated and nuanced approach to their resolution.
Brown, T. A., Di Nardo, P. A., Lehman, C. L., & Campbell, L. A. (2001). Reliability of DSM-IV anxiety and mood disorders: implications for the classification of emotional disorders. Journal of Abnormal Psychology, 110, 49-58.
Kirk, S. A., & Kutchins, H. (1992). The selling of DSM: The rhetoric of science in psychiatry. New Brunswick, NJ: AldineTransaction.
Kutchins, H., & Kirk, S. A. (1997). Making us crazy. DSM: The psychiatric bible and the creation of mental disorders. New York: Free Press.
Spitzer, R., Forman, J., & Nee, J. (1979). DSM-III field trials, I: Initial interrater diagnostic reliability. American Journal of Psychiatry, 136, 815-817.
Spitzer, R. L., Williams, J. B. W., & Endicott, J. (2012). Standards for DSM-5 reliability. American Journal of Psychiatry, 169, 537.
Timimi, S. (2011). Campaign to abolish psychiatric diagnostic systems such as ICD and DSM (CAPSID). Retrieved on 25 May 2013 from http://www.criticalpsychiatry.net/wp-content/uploads/2011/05/CAPSID11.pdf