The issue of reliability regularly features in the DSM debate. Advocates of the DSM system report favorable reliability statistics in one or other of the various revisions (e.g., Brown, Di Nardo, Lehman, & Campbell, 2001), while critics paint a different picture (e.g., Timimi, 2011). Kirk and Kutchins (e.g., 1992, 1997) have written extensively on the topic of DSM reliability. They claim that although it is widely accepted that the DSM system greatly improved diagnostic reliability, even 20 years after publication of the DSM-III there was still not “a single major study showing that DSM (any version) is routinely used with high reliability by regular mental health clinicians” (1997, p. 53).

The concerted efforts to clarify and improve the reliability of the DSM have led us down a blind alley. Even if the DSM were perfectly reliable, it would still be a flawed system. Perfect reliability will not solve the DSM’s problems. Perfect reliability would mean that two people would always come to the same diagnostic decision about the information presented to them. Different clinicians, for example, would identify “disorganized schizophrenia” or “obsessive compulsive disorder” from the same case material. But would they be correct? If two clinicians agree that someone meets the diagnostic criteria for oppositional defiant disorder, does that mean the person really has oppositional defiant disorder? Given that the DSM is neutral with respect to etiology, what has someone “got” when they’ve “got” a DSM disorder?

The system of Zodiac signs may well be a system approaching perfect reliability. With very little training, and no more information than a person’s date of birth, two raters could easily agree on whether someone belonged to the Sagittarius or the Pisces (for example) group. Is this person “really” a Sagittarian or a Piscean? Could we say that the person was born with Sagittariusness?

With minimal instruction people could also be taught to reliably identify the constellations in the night sky. What would that tell us? If we could achieve perfect agreement between raters at selecting Orion from a range of constellations we still would not have progressed any further in our understanding of the universe.

The fundamental problem with reliability is that it will not, and cannot, tell us what we want to know. Reliability tells us the extent to which different raters arrive at the same conclusion. That is, it tells us about the raters, not the person being rated. It is silent about whatever is being rated.

All we can conclude when two people reach the same diagnostic decision after reviewing the same case material is that these people have similar pattern recognition systems, at least for recognizing the particular disorder in question. The important point is that the agreement the two people reach tells us nothing about the person on whom the case material was based.

When we investigate reliability we find which diagnoses raters can agree on and which they have difficulty agreeing about; nothing more. We learn nothing about the legitimacy of the diagnostic categories and we move no closer to understanding the nature of psychological distress or how it might best be treated.

Understanding what information the reliability statistic provides is crucial, but often not fully appreciated. In a recent letter to the editor of the American Journal of Psychiatry, Spitzer and his colleagues (Spitzer, Williams, & Endicott, 2012) cited an earlier article (Spitzer, Forman, & Nee, 1979) about the reliability obtained in the DSM-III field trials. In the 2012 letter, Spitzer et al. reflected on a conclusion drawn in the 1979 paper. They reported that kappas (the reliability statistic) greater than 0.7 were considered to be “good agreement as to whether or not the patient has a disorder within that diagnostic class” (Spitzer, Forman, & Nee, 1979, as cited in Spitzer, Williams, & Endicott, 2012). Agreeing about what a patient is considered to have or not have, however, provides no certainty about what might actually be troubling the patient.
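To make concrete what a kappa value is built from, here is a minimal sketch of Cohen's kappa for two raters. It is an illustration only, not the procedure used in the DSM-III field trials: the function name `cohen_kappa`, the label "ODD" (standing in for oppositional defiant disorder), and the ten ratings are all invented for the example. Kappa compares the proportion of cases on which the raters agree with the agreement expected by chance from how often each rater uses each label.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same cases.

    Hypothetical helper written for this illustration.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: proportion of cases given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: computed from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Invented ratings of ten cases by two clinicians (illustrative data only).
rater_a = ["ODD", "ODD", "none", "none", "ODD", "none", "none", "ODD", "none", "none"]
rater_b = ["ODD", "ODD", "none", "none", "ODD", "none", "ODD", "ODD", "none", "none"]

kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.8: observed agreement 0.9, chance agreement 0.5
```

The only inputs to the calculation are the two lists of labels. Nothing about the actual state of the people being rated enters the computation, so a kappa of 0.8 here describes how closely the two clinicians track each other and nothing else.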

Developing exercises to help clinicians agree more readily with each other might be an important undertaking in its own right, but we need to be very clear about what this agreement does and does not tell us. It certainly doesn’t tell us anything about the person we have reached agreement about.

The DSM system of classification via symptom pattern groupings is akin to organizing categories of automobile drivers based on the way they hold the steering wheel. Those drivers who hold the steering wheel with two hands at the top could be allocated to one category, those who hold it with one hand on the side allocated to another, and so on. With this system we could study the different driver categories and, with sufficient numbers in our studies, we could discern differences between the groups. We could even look for differences in brain circuits and brain chemistry to attempt to understand the basis of their differences. We could develop observation methods to achieve high levels of agreement between independent raters regarding which categories drivers should be assigned to and we could develop intervention programs to improve driver behavior. Regardless of the effort involved in developing this system it would achieve successes only serendipitously because it is not a valid way of understanding driving behavior.

Although the terms “reliability” and “validity” are commonly used together, they are very different constructs. The reliability of the DSM is irrelevant because the DSM is not a valid way of understanding people or their problems. Progress in understanding and treating disturbances to mental health will never be achieved by improving the reliability of a fatally flawed system. We would be well justified in abandoning efforts to improve reliability and starting instead at the beginning, by defining mental health. From a sensible, accurate, and precise understanding of mental health we may then move to an appreciation of mental health problems and a more sophisticated and nuanced approach to their resolution.


Tim is a Professor in Mental Health at the Centre for Remote Health in Alice Springs, Australia, where he conducts mental health research and provides a clinical psychology service within the public mental health service. He has a PhD in Clinical Psychology from the University of Queensland (Australia) and an MSc in Statistics from the University of St Andrews (Scotland). He has over 100 publications including books, book chapters, and peer-reviewed publications in scientific journals and has presented his work at national and international conferences. Tim has developed a transdiagnostic cognitive therapy called the Method of Levels (MOL) which adopts a patient-centred view of mental health disorders and seeks to help patients resolve the distress underlying particular symptom patterns rather than focussing on the symptoms themselves. He has also pioneered a patient-led system of service delivery in which patients determine the frequency and duration of treatment sessions. His interests in mental health centre around the importance of control to psychological wellbeing and service provision and he prioritises the perspective of the individual in understanding psychological distress and helping in its amelioration.