EPM Channel | Categorizing Numeric Variables

Eight years ago, while attending UseR! 2007, the international R user group meeting held that year in Ames, Iowa, I had the opportunity to participate in an intense tutorial on regression modeling strategies conducted by Vanderbilt professor Frank Harrell. It was a terrific class, for me combining a well-timed review with a wealth of never-before-seen material.

Three takeaways from Frank Harrell’s lectures I’ve always kept close are to be attentive to non-linearity, to be attentive to interaction effects among independent variables, and to be wary of categorizing continuous, numeric attributes.

The danger in categorizing numeric variables makes a lot of intuitive sense. It certainly seems reasonable that recoding a continuous attribute into a small number of categories might throw away potentially telling information.

In addition, as Harrell notes: “Categorization assumes that the relationship between the predictor and the response is flat within intervals; this assumption is far less reasonable than a linearity assumption in most cases. [Also,] researchers seldom agree on the choice of cutpoint, thus there is a severe interpretation problem. One study may provide an odds ratio for comparing BMI > 30 with BMI <= 30, another for comparing BMI > 28 with BMI <= 28. Neither of these has a good definition and they have different meanings.”

I was reminded of Harrell’s admonitions as I read a blog post the other day by Columbia statistician Andrew Gelman. Gelman was providing statistical commentary on an article by Anne Case and her husband, newly minted Nobel laureate Angus Deaton, that posits “a marked increase in the all-cause mortality of middle-aged white non-Hispanic men and women in the United States between 1999 and 2013. This change reversed decades of progress in mortality and was unique to the United States; no other rich country saw a similar turnaround.”

Gelman, however, offered a different interpretation of the finding than the authors, posing the question “…could this pattern be an artifact of the coarseness of the age category?” Might the findings have more to do with changes over time in the composition of the coarse 45-54 age category than they do with increasing mortality per se? In other words, how does the age category itself change between 1999 and 2013 – and how might that influence the findings?

Because of demographic trends, it’s almost certain that the 45-54 age category will be weighted more heavily toward higher ages in 2013 than it was in 1999. So if mortality increases with age in the 45-54 range, the rate for the category as a whole will be higher in 2013 – even if death rates at each single year of age are unchanged from 1999.
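The composition effect described above can be sketched with a few lines of Python. Every number here is invented for illustration (the baseline rate, the 8% rise per year of age, and the two population shapes are assumptions, not census or CDC figures); the point is only that an older mix within the bin raises the bin-wide rate while the rate at each age stays fixed.

```python
# Hypothetical illustration: every number here is invented, not census or CDC data.
ages = list(range(45, 55))  # single years of age 45..54

# Assumed death rates per 100,000, rising 8% with each year of age.
# The same rates are used for both years, so nothing "really" changes.
rate_per_100k = {age: 300.0 * 1.08 ** (age - 45) for age in ages}

# Assumed populations by single year of age: same total, different shape.
pop_1999 = {age: 1000 - 20 * (age - 45) for age in ages}  # tilted younger
pop_2013 = {age: 820 + 20 * (age - 45) for age in ages}   # tilted older


def mean_age(population):
    """Population-weighted average age within the bin."""
    return sum(a * n for a, n in population.items()) / sum(population.values())


def crude_rate(population, rates):
    """Deaths per 100,000 for the whole bin, ignoring its age mix."""
    deaths = sum(n * rates[a] / 100_000 for a, n in population.items())
    return 100_000 * deaths / sum(population.values())


crude_1999 = crude_rate(pop_1999, rate_per_100k)
crude_2013 = crude_rate(pop_2013, rate_per_100k)

print(f"mean age   1999: {mean_age(pop_1999):.2f}  2013: {mean_age(pop_2013):.2f}")
print(f"crude rate 1999: {crude_1999:.1f}  2013: {crude_2013:.1f}")
```

In this toy setup the 2013 bin-wide rate comes out higher than the 1999 one purely because the bin is older on average, which is the artifact Gelman asks about.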

Demographers and epidemiologists compare morbidity and mortality across structurally different populations all the time, having developed “standardization” methods to “equate” the disparate distributions. The ultimate question is what would mortality look like if the populations were identical.
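One common form of standardization is direct standardization: apply each population's age-specific rates to a single shared "standard" age distribution. A minimal sketch follows, assuming invented rates and populations, a hypothetical uniform 5% rise in every age-specific rate, and the 1999 population as the standard; none of these choices comes from the studies discussed here.

```python
# Hypothetical illustration of direct standardization; all numbers are invented.
ages = list(range(45, 55))

# Assumed age-specific death rates per 100,000 in each year.
rates_1999 = {age: 300.0 * 1.08 ** (age - 45) for age in ages}
rates_2013 = {age: 1.05 * rates_1999[age] for age in ages}  # assumed 5% true rise

# Assumed populations: the 2013 bin is tilted toward older ages.
pop_1999 = {age: 1000 - 20 * (age - 45) for age in ages}
pop_2013 = {age: 820 + 20 * (age - 45) for age in ages}


def weighted_rate(rates, weights):
    """Average of age-specific rates, weighted by an age distribution."""
    total = sum(weights.values())
    return sum(rates[a] * weights[a] for a in weights) / total


# Crude rates weight each year's rates by that year's own age mix.
crude_ratio = weighted_rate(rates_2013, pop_2013) / weighted_rate(rates_1999, pop_1999)

# Directly standardized rates weight both years by the same standard mix
# (here the 1999 population, an arbitrary choice for this sketch).
standard = pop_1999
adjusted_ratio = weighted_rate(rates_2013, standard) / weighted_rate(rates_1999, standard)

print(f"crude increase:    {100 * (crude_ratio - 1):.1f}%")
print(f"adjusted increase: {100 * (adjusted_ratio - 1):.1f}%")
```

With a shared standard, the adjusted comparison recovers the assumed 5% rise, while the crude comparison overstates it because it also picks up the shift in age mix.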

Gelman computes some back-of-the-napkin estimates to conclude that “…the Case and Deaton estimates are biased because they don’t account for the increase in average age of the 45-54 bin during the period they study. After we correct for this bias, we no longer find an increase in mortality among whites in this category. Instead, the curve is flat.”

The stakes are high in this “discussion”, so there’s no shortage of reaction in blogspace. Already, commentary has made it to The New York Times. Responding to Gelman, Case and Deaton note that “If we want to be more precise about the age range involved, we could say that for all single years of age from 47 to 52, mortality rates are increasing,”… “So the overall increase in mortality is not due to failure to age adjust.”

In response to the more granular numbers adduced by the authors, Gelman concedes that mortality is not flat as he first argued; instead, “mortality rates among non-Hispanic whites aged 45-54 increased by an average of about 4% after controlling for age.” The increase was 12 percent without the age adjustment, suggesting that age bias accounted for about two-thirds of the increase but did not entirely explain it.
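The "two-thirds" figure follows from simple arithmetic on the two percentages quoted above. The sketch below assumes the share is computed as the gap between the unadjusted and adjusted increases divided by the unadjusted increase; the variable names are illustrative.

```python
# Percent increases quoted in the text.
unadjusted_increase_pct = 12.0  # without age adjustment
adjusted_increase_pct = 4.0     # after controlling for age

# Assumed decomposition: the part of the raw increase that disappears
# once age is controlled for is attributed to the age-mix shift.
share_due_to_age = (unadjusted_increase_pct - adjusted_increase_pct) / unadjusted_increase_pct

print(f"share attributable to age composition: {share_due_to_age:.0%}")  # 67%
```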

In the end, no matter how the questions are ultimately resolved, not much good will emerge from lumping ten single years of age into one age category variable, disposing of a wealth of potential relationship information in the process.

My take – and I suspect Frank Harrell’s as well – is that the use of detailed, annual information would have saved much analytic angst. Be it resolved to avoid coarse, categorized recodes and work with continuous, interval-level data directly when available.

By Steve Miller, from: http://www.information-management.com/blogs/big-data-analytics/categorizing-numeric-variables-a-cautionary-tale-10027721-1.html?utm_medium=email&ET=informationmgmt:e5513265:2047253a:&utm_source=newsletter&utm_campaign=daily-nov%2011%202015&st=email