James Garrett

Explorations of probabilistic thinking

Some thoughts on statistical modeling and data science

Some years ago, in a large pharmaceutical company, an academic paper circulated among some members of the statistical staff (of which I was one). It addressed the question of how logistic regression compares to machine learning (ML) methods in predictive accuracy on clinical trial data. The authors applied logistic regression models and found predictive performance equal to or better than what had been reported earlier with ML methods.

The statisticians greeted the paper with an uncomfortably tribal enthusiasm. I think someone may literally have written, “Hooray for our side!” If that wasn't the literal statement, the discourse within the group certainly ran along that line, such was the feeling of siege that prevailed.

(Unfortunately, I cannot find that paper now. A search on this topic uncovers many papers on the relative predictive accuracy of logistic regression on clinical data. It's been a topic of some concern, apparently.)

I had been studying statistical modeling methods advocated by Dr. Frank Harrell, Jr., whom I like to call “The most respected ignored statistician in America.” He's an elected fellow of the American Statistical Association, which is as close to a Nobel Prize as the statistics world comes. He specializes in exploratory modeling of clinical data. He mined decades' worth of statistical thinking to synthesize an exploratory modeling workflow that is purported to be thorough, efficient, and likely to yield replicable results.

The funny thing about this methodology is that at any single point in the process, it is utterly familiar to statisticians. Logistic regression here; sure, feeling right at home. Model selection criteria there; got it. Assessment of correlations among predictor variables; of course. However, when you put all the pieces together and take it from A to Z, the whole is not familiar at all. On two occasions, with two large corporations, I sat with large statistics groups and heard Dr. Harrell walk through his process. I left the room having watched audience members nod their heads and say that the good Doctor had made a lot of convincing points. But would they incorporate these ideas in their own work? “No, my clients would never let me.” That's why I say Harrell is highly respected yet mostly ignored.

One aspect that figures prominently in Harrell's workflow is the inclusion of spline expansions to capture simple non-linearity in continuous predictors. In over ten years of work within that large statistical group, I had seen lots of logistic regression models, but not one included a spline expansion.
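For readers who haven't seen what this looks like in practice, here is a minimal sketch in Python. (Harrell's own tooling lives in the R rms package, which I'm not showing here; I'm using patsy's cr() natural cubic spline basis through statsmodels as a rough stand-in for his restricted cubic splines.) The data is simulated purely for illustration, with the effect of a hypothetical “age” variable on the log-odds deliberately curved; the variable names and effect sizes are my own inventions, not from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only): the effect of age on the log-odds
# of the outcome is deliberately curved, so a purely linear term misfits.
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(30, 80, n)
biomarker = rng.normal(0.0, 1.0, n)
log_odds = -1.0 + 0.04 * (age - 55) - 0.002 * (age - 55) ** 2 + 0.6 * biomarker
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))
df = pd.DataFrame({"y": y, "age": age, "biomarker": biomarker})

# Ordinary logistic regression with age entered linearly...
linear_fit = smf.logit("y ~ age + biomarker", data=df).fit(disp=False)

# ...versus the same model with a natural cubic spline expansion of age
# (patsy's cr(); a stand-in for restricted cubic splines a la rms::rcs).
spline_fit = smf.logit("y ~ cr(age, df=4) + biomarker", data=df).fit(disp=False)

print(f"AIC, linear age term:     {linear_fit.aic:.1f}")
print(f"AIC, spline-expanded age: {spline_fit.aic:.1f}")
```

Nothing about this is exotic, which is rather the point: it is one small change to a model that every statistician already knows how to fit.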

So it was interesting indeed when I looked more closely at the paper which “scored one for our side.” It didn't give a lot of detail about how the logistic regression models were fitted, but they contained spline expansions. It seemed extremely likely that the creators of the models were aware of Harrell's process. At any rate, by using splines they were out of step with typical practice.

In fact, this paper wasn't a win for “our side,” if “our side” refers to statisticians who fit models in the typical manner. It was a win for something completely different, neither ML nor typical statistical practice. It was a win for what statistical modeling could be, but rarely is. I believe my colleagues were a little quick to accept this paper as representing their practice. It suggested a middle way between traditional statistical modeling and ML, a way which can predict as well as ML, can be as flexible in many respects, and can be more informative to boot. If the data is appropriate.

I intend to write a follow-up essay soon suggesting why I think Harrell's approach works well for most clinical data sets, so stay tuned. I'll also offer some thoughts on a way to categorize statistical and ML modeling methods according to their behavioral properties, to aid in picking the right method for the data set at hand. We really shouldn't be organized into tribes at all; there is no best modeling method, only a method that is best at exploiting the features of a particular data set. What are those features? I'll offer my suggestions soon.