Let’s fact-check The New Yorker on the opioid crisis and learn how to get answers from data. It’s easier than most people think!
They recently posted a tweet in which they make a pretty shocking claim:
— The New Yorker (@NewYorker) March 14, 2018
The link goes to a visually striking, very dramatic presentation of a very politicized issue. The piece, entitled Faces of an Epidemic, is dated October 30, 2017, and is largely about the photographs. There is some text, and the piece makes this claim a few pages in:
Opioids now kill more than fifty thousand Americans a year, ten thousand more than AIDS did at the peak of that epidemic—more, too, than gun homicides and motor-vehicle accidents. Opioid overdoses are now the leading cause of death for Americans under the age of fifty.
Those are a couple of very interesting claims! It’s pretty believable that opioid overdoses claim more lives than homicides of any sort, but motor-vehicle accidents? The leading cause of death for Americans under fifty years old? More than fifty thousand a year?
We have some incredible claims about an unfortunately political issue, and no citations, all of which are red flags. If a claim seems a bit difficult to believe, how do you find out whether it’s true? It seemed dubious to me, so I had a look. Let me show you how to figure this sort of thing out from the comfort of your own terminal.
First We Need the Data
This is light data science, but don’t let that put you off it. Luckily, the data should be readily available. There are vast troves of open data, and the US government is often pretty good about releasing data. The Census Bureau, the EPA, and even the Social Security Administration have a few big chunks of open data. You can download the source code for the Apollo Guidance Computer.
In our case, the CDC is the relevant Three-Letter Agency. Among other things, they’re charged with containing epidemics, so hospitals in the US feed them records. Every time you get sick, most especially if you die, they hear about it one way or another. A quick search gets you to their data, and the piece that’s relevant here is a dataset called the Mortality Multiple Cause Files. You can get data back to 1968!
We’ll get the 2016 report, as it’s the most recent available. (The data goes through an unbelievably arcane pipeline, so there’s quite a bit of lag.) It’s about a hundred megs, but uncompressing it gets you a 1.3GB file.
Here’s the downside of all this open data: it’s still the government producing it. I was worried it would be in some proprietary format, but it’s just plain text. It’s not at all readable, though: it’s all fixed-width fields and encoded. I’ve written more than my share of ETL code, and you might be surprised (or horrified) at how much data still resides in fixed-width files.
The CDC helpfully provides a guide to decoding the data. If you want to play with the data, you’ll want to keep that around.
How to Get Answers
First, we have to decide the questions we want to ask. To verify the claims, we want to know how many Americans under 50 die of opioid overdoses, and whether that’s the leading cause.
There are a lot of approaches. R is an excellent tool for this. I often load and parse things in irb and then play with the data in memory. PostgreSQL is designed to interactively query large datasets, and it has facilities for this sort of thing. For one-off programs, as long as it gets you an accurate answer in a timely manner, there aren’t any objectively bad approaches. I’ll sometimes change approaches when one tool seems preferable to another. This time, I started and ended with AWK. It’s been a standard tool in Unix for decades, and it’s very easy to write quick awk scripts to process big chunks of text.
Getting Our Hands Dirty
For serious matters like “How bad is the opioid crisis?”, I think it’s important to show your work, even if you’re a robot. If it’s not your own work, give a source. This keeps you honest, and it helps people learn about the process. Since the point here is to show a process by example, we’ll go into detail, and the source code is embedded below.
The data we have is a fixed-width file: each line is a fixed size (490 characters for this one), and each field starts at a fixed offset from the beginning of the line. As the CDC’s guide indicates, age is a two-part field. We’ll start by parsing that: character 70 represents the units, and 71-74 the number. “Year” is the largest unit provided, so if the unit is years and the age is less than 50, or if the age is in any unit besides years, we have a record of a death of an American under 50 in 2016.
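Here is a minimal sketch of that age test, written in Python rather than the AWK I actually used, purely for illustration. The column positions are the ones described above. The unit code `"1"` for years and the `fake_record` helper are assumptions made for this example, so check the CDC’s guide before trusting either.

```python
YEARS = "1"  # assumed unit code for "years"; verify against the CDC guide


def is_under_50(record: str) -> bool:
    """Return True if a fixed-width record describes a death under age 50.

    Columns are 1-indexed in the article, so column 70 is index 69 and
    columns 71-74 are the slice [70:74].
    """
    unit = record[69]
    number = int(record[70:74])  # int() copes with the zero padding
    if unit != YEARS:
        # Months, days, hours and so on all mean "less than a year old".
        return True
    return number < 50


def fake_record(unit: str, age: int) -> str:
    """Build a hypothetical 490-character record with only the age filled in."""
    return (" " * 69 + unit + f"{age:04d}").ljust(490)


if __name__ == "__main__":
    for unit, age in [("1", 49), ("1", 50), ("2", 6)]:
        print(unit, age, is_under_50(fake_record(unit, age)))
```

Anything that isn’t measured in years is treated as under 50, exactly as described above, so it’s worth checking how the real file encodes an unknown age before relying on this.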
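AWK is great at handling delimited text, but for fixed-width files, we have to use `substr()`. Since `substr()` returns a string and the field is zero-padded, the age was being compared as text rather than as a number, which upset the comparisons. To force AWK to treat it as a number, I incremented and then decremented the age by 1.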
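The same trap exists in most languages. A small Python illustration (not the AWK from my script) of why a zero-padded age compared as text gives the wrong answer:

```python
raw_age = "0051"  # a zero-padded age as it would appear in the file

# Text comparison goes character by character: "0" sorts before "5",
# so a 51-year-old looks "less than" 50.
as_text = raw_age < "50"

# Converting to a number first gives the comparison we actually want.
as_number = int(raw_age) < 50

print(as_text, as_number)  # prints: True False
```

Adding and then subtracting 1 in AWK does the same job as `int()` here: it forces the value to be a number before the comparison happens.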
We can already get some useful information, namely how many of the 2.7 million records apply to people under 50: 267,647.
The cause of death is encoded as an ICD-10 code in columns 146-150, and contributing factors are listed in columns 344-444. Accidental overdoses are marked X40-X44, intentional overdoses X60-X64, and unknown intent Y10-Y14. The substances involved are listed among the contributing factors; opioids are T40.0 through T40.4, and to be generous we’ll add T40.6, which signifies “unknown narcotics”. So to select records where the cause of death matches an overdose and there were any opioids involved, we’ll have to look through both fields.
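As a rough sketch of that two-field check, again in Python rather than AWK: the function name `classify` is made up for this example, and I’m assuming the file stores codes without the dot (so T40.1 appears as `T401`) and separates contributing codes with spaces. Both assumptions should be checked against the CDC’s guide.

```python
import re
from typing import Optional

# X40-X44 accidental, X60-X64 intentional, Y10-Y14 undetermined intent.
OVERDOSE_CAUSE = re.compile(r"X4[0-4]|X6[0-4]|Y1[0-4]")
# T40.0 through T40.4 plus T40.6, assuming the dot is dropped in the file.
OPIOID_CODE = re.compile(r"T40[0-46]")

INTENT = {"X4": "accidental", "X6": "intentional", "Y1": "undetermined"}


def classify(cause: str, contributing: str) -> Optional[str]:
    """Return the intent of an opioid overdose, or None if it isn't one.

    `cause` is the underlying-cause field and `contributing` is the raw
    contributing-factors field, both as sliced out of the record.
    """
    cause = cause.strip()
    if not OVERDOSE_CAUSE.match(cause):
        return None
    # Both conditions must hold: an overdose cause AND an opioid listed.
    if not any(OPIOID_CODE.match(code) for code in contributing.split()):
        return None
    return INTENT[cause[:2]]


if __name__ == "__main__":
    print(classify("X42 ", "X42  T401 T509 "))  # overdose with an opioid listed
    print(classify("I219", "I219 T401 "))       # opioid present, but not an overdose
    print(classify("X42 ", "X42  T405 "))       # overdose, but no opioid code
```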
The script loops through each record and increments a total. (This isn’t strictly necessary in AWK; we could use the NR variable, which holds the number of records read so far.) Then it parses out the age and the ICD-10 classification for the cause of death. When it finds someone under 50, it increments another counter for that total. Next, it checks whether the record matches the causes we’re looking for and, if so, increments a counter for that, as well as one of two more counters to report whether the death was deemed accidental or intentional. Finally, it pulls off the first character of the ICD-10 classification to track major classifications, which allows us to see how other causes rank.
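To make the shape of that loop concrete, here is a sketch in Python. It is not my AWK script, just the same steps under the same assumptions: the column positions given above, `"1"` as the unit code for years, and codes stored without the dot. The `fake_record` helper and the sample records are invented so the example runs without the 1.3GB file.

```python
import re
from collections import Counter

OVERDOSE = re.compile(r"X4[0-4]|X6[0-4]|Y1[0-4]")
OPIOID = re.compile(r"T40[0-46]")  # assumes T40.1 is stored as "T401"


def tally(lines):
    """Count records, under-50 deaths, opioid overdoses, and causes by letter."""
    totals = Counter()
    by_letter = Counter()
    for line in lines:
        totals["all"] += 1
        unit, number = line[69], int(line[70:74])
        if unit == "1" and number >= 50:  # "1" = years (assumed)
            continue
        totals["under_50"] += 1
        cause = line[145:150].strip()
        by_letter[cause[:1]] += 1  # first letter = major classification
        contributing = line[343:444].split()
        if OVERDOSE.match(cause) and any(OPIOID.match(c) for c in contributing):
            totals["opioid_overdose"] += 1
            if cause.startswith("X4"):
                totals["accidental"] += 1
            elif cause.startswith("X6"):
                totals["intentional"] += 1
    return totals, by_letter


def fake_record(unit, age, cause, contributing=""):
    """Build a hypothetical 490-character record for testing."""
    head = " " * 69 + unit + f"{age:04d}"    # columns 1-74
    head = head.ljust(145) + cause.ljust(5)  # columns 146-150
    return (head.ljust(343) + contributing).ljust(490)


sample = [
    fake_record("1", 34, "X42", "X42  T401"),
    fake_record("1", 72, "I219", "I219"),
    fake_record("1", 45, "I219", "I219"),
    fake_record("2", 3, "P072", "P072"),
    fake_record("1", 29, "X64", "X64  T402"),
]
totals, by_letter = tally(sample)
print(dict(totals))
print(by_letter.most_common())
```

To run it against the real data you would pass an open file instead of `sample`, for example `tally(open(path))`, where `path` points at the uncompressed file.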
The Verdict: The New Yorker Exaggerated
When we run it, we get a different result than the New Yorker: opioid overdoses only account for 29,995 deaths, not quite “more than fifty thousand”, the number the New Yorker claimed. The leading causes of death for Americans under fifty years old are heart disease (35,888) and cancer (31,289). They were correct that opioid overdoses cause more deaths than transportation accidents, which totaled 24,887.
I don’t know where the New Yorker got the statistics they used or what the basis for their claims was. Maybe they used a projection, maybe there’s a flaw in their data or methodology, or maybe there’s a flaw in mine. (It could be all three!) If you’re cynical enough, you could probably come up with several other theories, but we’re left to speculate because they didn’t show their work.
The Source Code
If you’re curious or you want to try to spot a bug (I spotted one while writing this), you can see the code here:
If you find any bugs in the code or flaws in the methodology, please do let us know. Don’t believe everything you read.
I’d also like, for the sake of clarity, to point out that this article is intended to show how anyone with enough interest can find answers, using publicly available data and simple tools. If you’re willing to roll up your sleeves, you can fact-check the New Yorker, or satisfy some other curiosity you might have about the world. I strongly encourage you to do so! (But do feel free to hire us if it warrants calling in the professionals.) It took about half an hour because my work involves this sort of thing, but it’s possible for nearly anyone to do.
To further clarify, I don’t intend to trivialize the problem of opioid abuse, but I do think accuracy is important, especially in matters of public policy, and it’s prudent to be suspicious of any numbers cited in the vicinity of a political issue.