Explorations of probabilistic thinking

James Garrett

This blog goes out to people who are not professional statisticians or data analysts, but who need to work with one. Or hire one. If this describes you, here’s something very important: you need to know about archetypes of statistical practice.

In particular, if you need support for a problem that does not directly involve clinical trials, you should be discerning about what skills you need, and you can’t assume that just any statistician will do.

So, what in the world is an archetype of statistical practice?

In my nearly-30-year career as a statistician working in industry, it’s struck me that there are three main patterns of statistical work:

  • The clinical statistician
  • The industrial statistician
  • The machine-learning or algorithmic statistician

These are work areas, which call for certain work practices or mental orientations. Any specific statistician naturally adopts the practice pattern for their area of work, or they naturally migrate to the work that fits their orientation pattern. I call these “archetypes” of statistical orientation.

A person can have more than one archetype, and some may be able to adapt their work pattern to the different work at hand. But some live predominantly in one area and would perform poorly in a different context without specific training.

You may be asking, “Where do data scientists fit into this system?” I first came up with this system long ago, before “data scientist” was a thing, so humor me and let me parse the statistical world first, and then I’ll make an attempt at placing data scientists into it.

A caveat: my career experience has been in the health field (diagnostic devices and, a bit, pharmaceuticals). If you’re coming from a different background, perhaps things appear a bit different to you? I’d be interested to hear your thoughts.

With that, the three archetypes follow.

The clinical statistician

The clinical statistician supports clinical studies interpreted in an inherently adversarial context, whether they’re supporting a research manuscript for publication or a drug application to health authorities. They make arguments concerning the appropriate degree of evidence contained in the data for or against a hypothesis, sufficient to convince a skeptical audience. Therefore they must make sure everything is in order and all critical assumptions are met.

“Biostatistics” is the subfield of statistics that focuses on clinical trials. There are challenges that arise uniquely in clinical trials, and general statisticians may not be that familiar with some of them.

To illustrate a situation in which argument is key, and it’s critical to not make mistakes: in graduate school I took a math course called “Real Analysis”, which is the theoretical development of calculus. For one exam we were given one hour to prove 10 statements. In that hour I finished proving three of them; for the others I wrote down some thoughts but did not fully prove the statements. I was careful to say what I was confident in and why.

For confidently proving only 30% of the statements, I earned an “A” on the exam. I demonstrated that I was not extremely creative in math, but I knew what followed from what, and the latter is a valuable skill. In fact, mathematics is an edifice in which one true statement is based on another, so it strongly depends on there being no wrong things in the framework. This carries over to arguing for a skeptical audience: say what you can, don’t say what you can’t, and above all, don’t say a wrong thing. A wrong thing could cause your whole case to crumble. And if they find you saying one wrong thing, and you’re unaware that it’s wrong, can they believe other things you’re saying?

I’ve seen clinical statisticians look at a careful post-hoc exploratory analysis (which can be critical to the business) and make a blanket statement that statistical inference in exploratory analyses can’t be relied upon. This is true, but quite unhelpful. More helpful would be to ask: how compromised is this particular exploratory analysis? This is one example of an orientation that is not wrong but is inappropriate to the context.

The industrial statistician

The industrial statistician focuses on experiments with many controllable factors, such as in a lab experiment. Their goal is to optimize a product or process, or to develop a predictive model of a system. Here it is critical to determine the most important contributors to the process and to understand different sources of variability. Once the most important factors are identified and characterized, less-important factors are of little interest. Assessing evidence for relationships sufficient to convince a skeptical audience is not important; what is important instead is reaching findings that will move the project forward usefully, even if they’re imperfect approximations of reality.

There is a technology for handling a large number of predictor variables under experimental control, especially where every single observation is expensive (e.g., it requires an entire run of a pilot manufacturing line).

For instance, the field of “fractional factorial designs” can develop a useful local model for 5 controlled factors by collecting only 9 runs (8 runs that perturb all 5 factors and one run in the nominal “center” in order to detect whether a linear model is adequate); an additional 10th run at the center point would be valuable in order to estimate pure error. This assumes that interactions involving 3 factors or more are trivial, which is realistic in most cases. 9 or 10 runs stands in contrast to the 2^5 = 32 runs one might naïvely expect as a minimal perturbation of 5 factors (not counting center points or replication). 32 runs will support estimation of all possible interactions up to 5-way, but it is highly, highly unlikely that all such interactions will be active. Finding a balanced subset of runs in 5-dimensional space is a nontrivial exercise, and a geometrically appealing one. This is a useful skill, and one that a clinical statistician could very well have no awareness of.
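To make the arithmetic concrete, here is a minimal base-R sketch of one such design (my own illustration under the stated assumptions, not a recommendation for any particular study): a 2^(5-2) fractional factorial giving 8 runs across 5 two-level factors, plus one center run.

## Full 2^3 factorial in A, B, C, then two generated columns
base8 <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
design <- transform(base8,
                    D = A * B,  # generator: D aliased with the AB interaction
                    E = A * C)  # generator: E aliased with the AC interaction
## One run at the nominal center to check whether a linear model is adequate
design <- rbind(design, data.frame(A = 0, B = 0, C = 0, D = 0, E = 0))
design  # 9 runs; a second center run would also give a pure-error estimate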

There is an analogy with numerical optimization routines. Note that such algorithms work iteratively, and at each iteration they make a local linear or quadratic approximation to the function of interest. It isn’t critical that the function be truly linear or quadratic; the approximation only needs to be good enough to move the search process forward. The same is true for modeling of experiment data: the model needs to be a good enough local approximation that it moves the project forward; it need not be correct.

Industrial work, in its best form, is iterative. A rule of thumb is to spend no more than 20% of your budget on your first experiment, because you fully expect to run subsequent experiments. Those experiments can incorporate previous findings, and it is efficient not to waste resources on factors or estimates that, given prior data, can be neglected. In a sequential context, correctness is less important than in the clinical context, provided results are correct enough; after all, if the conclusions of one experiment are a little off, the next experiment will refine them.

There is an art to working with a team to elicit a list of all potential factors, develop a strategy to handle them, and to bring the team on board with the experimental design. Thus there is a bit that is ineffable and social in the practice of industrial statistics.

The machine-learning or algorithmic statistician

Statistics is about learning from data, and standard statistical methods develop this learning by making assumptions that might be more useful than true. Do errors follow a Gaussian (normal) distribution? Are relationships linear? And we begin to realize that these assumptions, while convenient, are not actually required or even the main idea. We can do about as well if we assume that a relationship is smooth, rather than linear, for instance. This direction of interest leads to semiparametric modeling, multivariate clustering, semiparametric density estimation, and predictive models that allow for arbitrary complexity of interaction (neural nets, random forest, support vector machines). This flavor of statistician also takes responsibility for assessing a model’s generalizability to future data.
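As a toy illustration of that shift in assumptions (my own sketch, not drawn from any particular project), compare a straight-line fit with a penalized smooth on data whose true relationship is smooth but not linear:

## Simulated data with a smooth, non-linear relationship
library(mgcv)
set.seed(1)
d <- data.frame(x = runif(200, 0, 10))
d$y <- sin(d$x) + rnorm(200, sd = 0.3)

fit.linear <- lm(y ~ x, data = d)      # assumes linearity
fit.smooth <- gam(y ~ s(x), data = d)  # assumes only smoothness

AIC(fit.linear, fit.smooth)  # the smooth fit is strongly preferred here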

However, the big dichotomy in statistics is between clinical and industrial; the machine-learning orientation is usually added to one of the others, when it is expressed at all.

Data scientists and statisticians

Where is the boundary between data science and statistics? Is there one? Much ink has been spilled on this, so it is probably foolish to pursue it here…but what the heck, completeness demands it. My own experience (with clinical biomarker exploration) suggests the following:

  • Data scientists are intentional about the craft of programming and managing data. While a few statisticians have intentionally nurtured that craft, the community doesn’t see it as a universal need. A randomly selected data scientist is likely to be a better programmer than a randomly selected statistician.
  • Data scientists take ownership of data pipelines and data handling, more so than statisticians.
  • Statisticians own the question of inference. If you’re making statistical inference, you’re doing statistics, whether or not you consider yourself a statistician.
  • Data scientists tend to take more responsibility than statisticians for understanding the scientific background.
  • I see a bifurcation in the data science community: some analysts work by looping over a small set of hypothesis tests and plots and then interpreting the resulting pile of output, while others eagerly adopt machine-learning and clustering methods.

What does this all mean?

In the health-related industry, most statisticians work in the clinical area, and support the clinical archetype. This is very important work with its own idiosyncrasies, and it’s good that these practitioners generally adopt the archetype appropriate to the work. Some of these practitioners may also be able to adopt one or more of the other archetypes when placed into a different context, while others may not.

The proportion of clinical statisticians among statisticians in health-related areas is so high that many in the industry don’t realize that there is anything else. They refer to all statisticians as “biostatisticians” and expect that clinical statisticians can address any statistical need. This is a big error and can lead to business and project issues.

It would be just as big of an error to place one of the other archetypes into a clinical role if they cannot adopt the clinical orientation. However, given that the clinical role is so prevalent and highly developed, this direction of error rarely happens, or at least it is caught right away. It is the clinical-to-other direction that is more likely to be undetected, and to lead to problems. All it takes is a blasé pointy-haired boss to put the wrong person into place and the stage for mayhem is set.

In essence, managers in health-related organizations need to know that not all statisticians are alike.

Jim Garrett

  1. Introduction
  2. Why binning is so seductive
  3. Statistical efficiency
  4. Representing Nature

Introduction

Over and over I’ve found my data analysis strategies to be contrary to those of many of my peers. One area of differing inclination is that, when given continuous data, I try very hard to analyze it as continuous, rather than binning it. My preference is, according to standard principles, the theoretically preferred approach: binning discards information, and discarding information must cost statistical efficiency. Nevertheless, this textbook advice is almost universally ignored.

I’m going to argue here that binning is a bad practice, but not for statistical efficiency reasons. Instead, I argue for trying hard to analyze continuous data on the grounds of clarity. The more we meet Nature where she is, the more clearly we can understand her and make reasonable decisions.

Note: I’ve realized that this blogging platform doesn’t allow one to include figures, unless one has access to the server, which I don’t. I’m in the process of setting up my own web site, at which point I’ll move my blog there. In the meantime, in lieu of figures, I’ll include code to produce the figures. Perhaps this will offer a little tutorial benefit. At least, I’m trying to make lemonade out of a lemon.

Here’s some setup code:

library(rms)
library(mgcv)

## A crude hand-sketched example
template <-
    data.frame(x = c(0, 4, 5, 6, 7, 8,    9,  10),
	       y = c(0, 0, 1, 3, 4, 4.25, 4.5, 4.75))

## A function to linearly interpolate
tempFun <- 
    approxfun(x = template$x,
	      y = template$y)

## Generate uniformly distributed biomarker values
set.seed(123)
datbin <- data.frame(x = runif(300, 0, 10))

## Transform template to probability scale, ranging from 0.1 to 0.8
invLogit <- function(r) 1 - 1 / (1 + exp(r))

LogitLow <- log(0.1 / 0.9) # -2.197225
LogitHigh <- log(0.8 / 0.2) # 1.386294

datbin$p <-
    invLogit((LogitHigh - LogitLow) / tempFun(10) * tempFun(datbin$x) + LogitLow)

MedianBin <- median(datbin$x) # 4.778207

datbin$XGp <-
    factor(ifelse(datbin$x >= MedianBin, "High", "Low"),
	   levels = c("Low", "High"))

set.seed(456)
datbin$YBin <- rbinom(nrow(datbin), size = 1, prob = datbin$p)

Why binning is so seductive

As mentioned, virtually any academic statistician will argue against binning continuous data on efficiency grounds, and yet the practice is pervasive. Why? I don’t know for sure, but my guesses are as follows.

First, our customers ask for it. Very often a statistician’s customer is a medical clinical researcher, and many clinical researchers are accustomed to seeing a comparison of discrete groups. In fact, many clinical researchers aren’t even aware that any analysis other than comparing groups is possible when the response is not continuous. For instance, if the response variable is binomial with a hypothesized probability p that depends on independent study factors, a non-statistician researcher may imagine that a probability is a property of a group, i.e., a proportion of the whole: to evaluate the probability of an outcome given a study factor, one must define a subgroup based on the factor and then count successful outcomes per subgroup. Modelers know that logistic regression can describe the probability p as changing continuously as a function of other factors: it’s possible for every single study subject to have their own distinct probability p.
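As a minimal sketch using the simulated data from the setup code above, a plain logistic regression assigns every subject their own fitted probability, with no grouping step at all:

fit.logistic <- glm(YBin ~ x, family = binomial, data = datbin)
head(predict(fit.logistic, type = "response"))  # one fitted probability per subject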

Therefore a clinical researcher asks for a subgroup definition because they prefer to think that way or they’re unaware of alternatives. Next, the pliable statistical analyst may want to please his customer by fulfilling the request as quickly and directly as possible. This raises questions about what a statistical analyst’s ideal role in collaboration is, but that’s for another day….

Second, binning can simplify analysis. Continuous data cannot be depended upon to be nicely normally-distributed as in textbook cases. There can be non-normal distributions, skewness, outliers, etc. When relating to other variables, we similarly cannot rely on relationships being linear. All of these problems simply go away when we bin. Deciding not to bin is not a simple choice, but is rather a commitment to cultivate an entirely new toolbox of analysis strategies suitable for messy cases.

With the analyst caught between customers demanding binning on the one hand, and perhaps limited confidence in their continuous-data toolbox on the other, it’s no surprise that most analysts simply give the customer what they ask for.

Still, I challenge statistical analysts to start nurturing that toolbox. If you want to know how, watch this space.

Statistical efficiency

Let’s consider the question of statistical efficiency and power with a simple hypothetical example. Suppose we are assessing a possible biomarker in a clinical trial, and the biomarker is assayed on 300 subjects. Biomarker values are uniformly distributed over a range. Suppose the clinical outcome is binary. An increased biomarker value contributes to an increased probability of clinical response, and the true relationship is per the piecewise-linear curve generated as follows:

plot(0:10,
     invLogit((LogitHigh - LogitLow) / tempFun(10) * tempFun(0:10) + LogitLow),
     type = "l",
     ylim = c(0, 1),
     main = "True probability of response",
     xlab = "Biomarker value",
     ylab = "Probability of response")

(This produces a piecewise-linear curve that is somewhat sigmoidal, starting from a probability of 0.1 (where it remains for some time), and then it increases rapidly to roughly 0.7. Then it continues increasing, but at a slower rate. Over the range of the curve (from 0 to 10), it reaches a maximum probability of 0.8.)

Even though the curve is crudely piecewise linear, it has some features that complicate real-world analysis:

  • It is not linear, not even on the logit scale.
  • The probability of clinical response does not reach zero at the low end of the biomarker range, nor does it reach 1.0 at the high end. The biomarker is informative but there are clinical exceptions.
  • At the high end, increasing biomarker value contributes to increased probability, but at a slower rate.

We want to assess whether the biomarker is worth investigating further, and if so, the nature of the relationship. I’ll carry out three alternative strategies:

  1. Calculate the median value of the biomarker and split cases into biomarker “High” and “Low” groups accordingly. Apply Fisher’s Exact Test to test for association between binary biomarker group and binary clinical outcome.
  2. Fit a generalized linear model relating binary outcome to a smoothing spline estimate of the biomarker’s contribution. Use an approximate test against the null model of no relationship. Optionally, we can assess evidence for non-linearity. Obtain an estimate of the relating curve, with pointwise confidence intervals.
  3. Similarly, fit a logistic regression model, but use a natural spline expansion for continuous biomarker values. Obtain a likelihood-ratio test against the null of no relationship, and as with the GAM, assess the evidence for non-linearity and obtain an estimate of the curve mediating the relationship.

Should we include fitting a continuous model that assumes linearity? I think not, because we don’t know if that’s the case, and a non-linear relationship is quite plausible.

With alternative (1), Fisher’s Exact Test gives a p-value “< 2.2e-16”. Statistical efficiency is not an issue here.

(The following code evaluates Fisher’s Exact Test with the X grouping against the binary outcome.)

fisher.test(datbin$XGp, datbin$YBin)

With the GAM–alternative (2)–we also find <2e-16 as a p-value against the null hypothesis of no relationship. The following code generates an estimate of this relationship:

GamMod.init <- gam(YBin ~ s(x), family = binomial, data = datbin)

summary(GamMod.init)

plot(GamMod.init,
     trans = invLogit,
     ylim = c(0, 1),
     main = "Logit outcome vs biomarker",
     xlab = "Biomarker",
     ylab = "Logit of outcome probability")

(The figure shows a fitted curve that tracks the true curve, but it is implausibly “wiggly”.)

This tracks the true curve but is too wiggly to be plausible. If we manually force the smoothing parameter to be large enough that the curve is nearly monotonic, we find:

GamMod <- gam(YBin ~ s(x, sp = 0.05), family = binomial, data = datbin)

plot(GamMod,
     trans = invLogit,
     main = "Logit outcome vs biomarker",
     xlab = "Biomarker",
     ylab = "Logit of outcome probability")

(The code generates a less-wiggly curve that still tracks pretty well.)

The predicted probabilities are not very different between these two fits, and both give p-values of <2e-16 against the null model.

## initial model
summary(GamMod.init)

## Add a linear component to the initial model so that linear and non-linear 
## components can be assessed separately.
GamMod.init.lin <- gam(YBin ~ x + s(x), family = binomial, data = datbin)

summary(GamMod.init.lin)

## smoother model
summary(GamMod)

(The initial and the smoother GAM models both show a p-value for the relationship between X and outcome of <2e-16. Additionally, a linear term is added to the initial model in order to assess the non-linear contribution separately. This shows that the p-values for the linear and the non-linear contributions are both very small.)

Applying alternative (3), unpenalized spline regression, we obtain the following curve:

## Use rms package to enable nice ANOVA
RegModBin <- lrm(YBin ~ rcs(x, parms = 5), data = datbin)


## Use base or "stock" glm to support likelihood ratio test via base 
## anova function

RegModBin.s <- glm(YBin ~ rcs(x, parms = 5), family = binomial, data = datbin)

## Plot
## Set of points on X axis for plotting
TmpSeq <- seq(0, 10, length = 200)

## Get predictions on logistic scale, then calculate confidence limits
## on that scale, then transform to probability scale
Preds <-
    predict(RegModBin,
	    newdata = data.frame(x = TmpSeq),
	    type = "lp", se.fit = T)
## has components "linear.predictors" "se.fit"

CIReg <- 
    data.frame(Est = invLogit(Preds$linear.predictors),
	       Low = invLogit(qnorm(0.025,
				    mean = Preds$linear.predictors,
				    sd = Preds$se.fit)),
	       High = invLogit(qnorm(0.975,
				     mean = Preds$linear.predictors,
				     sd = Preds$se.fit)))

plot(range(TmpSeq), c(0, 1), type = "n",
     main = "Outcome probability vs. biomarker",
     xlab = "Biomarker",
     ylab = "Outcome probability")
polygon(x = c(TmpSeq, rev(TmpSeq), TmpSeq[1]),
	y = c(CIReg$Low, rev(CIReg$High), CIReg$Low[1]),
	col = "thistle", border = NA)
lines(TmpSeq, CIReg$Est)
rug(datbin$x[datbin$YBin == 0])
rug(datbin$x[datbin$YBin == 1], side = 3)

In this plot the Y-axis is on the probability scale rather than the logit scale. This is substantially equivalent to either GAM model. Here we can give an informative ANOVA breakdown:

## Generate the ANOVA table
anova(RegModBin)

Factor     Chi-Square d.f. P     
x          69.85      4    <.0001
 Nonlinear  9.89      3    0.0195
TOTAL      69.85      4    <.0001

## Likelihood ratio test using base R
anova(glm(YBin ~ 1, family = binomial, data = datbin), RegModBin.s, test = "Chisq")

This indicates that there is overwhelming evidence that the biomarker influences the outcome, and furthermore there is strong evidence of a nonlinear component, i.e., a departure from linearity on the logit scale. While this table doesn’t show exactly how small the p-value is, a standard likelihood-ratio test yields <2e-16, just as the other methods do.

In summary, all three approaches indicate strong evidence that the biomarker influences the clinical response. This does not support the idea that the continuous approach is more powerful. The median split represents the biomarker with 1 degree of freedom, while the continuous approaches use roughly 4 degrees of freedom. They yield more information, but “cost” more. It’s a fair trade, but it’s not clear that one always has more power than the other.

Rather, the reason that I recommend a continuous approach is that, in one step, we (1) assess evidence for a non-null relationship and (2) gain a reasonable estimate of that relationship. Further, we do the latter without carrying out substantial optimization or multiple looks at the response, which compromises statistical reliability.

Now let’s think a little more about real life.

Representing Nature

Here’s a true story illustrating how failure to look at continuous data can lead to self-imposed confusion and obfuscation, also cost real money, and delay important projects.

As the resident expert in clinical diagnostic assays in a large pharmaceutical company (that’s not saying a great deal when the company didn’t nurture such expertise), I was pulled into an apparent assay issue impinging on an oncology clinical trial. The assay measured gene copy number (GCN) for a specific gene; a GCN value above a specified cutoff was a study enrollment criterion. That is, it was a companion diagnostic assay (CDx) for the therapy under study. Two labs carried out testing for the study, each serving a different geographic region.

Recently there had been some operational issues with the assay which had required troubleshooting; the assay vendor had confirmed the issue, put a fix in place, and, for good measure, both labs repeated operator proficiency validations. Then patient screening for the study resumed. After some time, however, the trial team noticed that the “prevalence” (incidence of GCN over cutoff) was higher at one lab than the other. It was decided to pause study enrollment once again. Note that for a pharmaceutical company, completing trials quickly is the coin of the realm; pausing a trial was a Very Serious Matter. Meetings were held, numbers were compiled, and still bigger meetings were held. Finally this expanding process grew to include a bystander previously unaware of the entire study, i.e., me.

When I joined my first meeting, it was chaired by the head of Oncology, which, for the company organization at the time, reported to the CEO. Lots of highly-paid senior people were there; this was one of those meetings where the cash clock ticked quickly.

I was given the information that had been compiled up to then. This included assay positivity counts at each site. I asked, “Where are the GCN numbers for each of the sites? What does the GCN distribution look like at each site?”

Such information had not been compiled!

Consider for a moment what this indicates about priorities and corporate culture. The company was expending significant resources, pausing a trial and using a top executive’s time. It would have been the simplest thing to pull together the GCN values; they were available in the clinical database, waiting to be queried. There’s no question that the values actually measured would give a more complete representation than the processed positivity calls. Yet no one had compiled them. If this describes your organization too, you have work to do!

As an aside, there’s another aspect to this: when you’re troubleshooting a data-related issue, investigate the data process from first information acquisition to final result before you invest time in a lot of other approaches. At what point does the data begin to look anomalous? Or, if you prefer, start at the end and work towards the beginning. The point is to be systematic and to “scan” the whole process to come to an understanding of the state of things. Looking at continuous data often means looking upstream; I also have an expensive war story about this. The error is committed again and again.

But back to our story: in not much time the team pointed me to GCN numbers and I was able to determine that the clinical cutoff for GCN was very near the median GCN value. This is not necessarily an error, but it is definitely problematic: a small shift in the distribution of measured values, such as can easily happen with many assays, will induce a substantial change in positivity rate. I fitted density estimates; they had similar shapes but one was shifted slightly higher than the other. If I shifted the higher density down by the difference in medians, the densities lined up quite well, and furthermore the apparent positivity rates closely agreed.
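In code, the check looked roughly like the following sketch. The values here are simulated stand-ins (gcn.lab1, gcn.lab2, and the cutoff are all hypothetical), since the real GCN data can’t be shown:

## Simulated stand-ins for the two labs' GCN values
set.seed(789)
gcn.lab1 <- rlnorm(200, meanlog = 1.30, sdlog = 0.25)
gcn.lab2 <- rlnorm(200, meanlog = 1.38, sdlog = 0.25)  # slightly shifted lab

## Shift lab 2 down by the difference in medians and compare densities
shift <- median(gcn.lab2) - median(gcn.lab1)
plot(density(gcn.lab1),
     main = "GCN densities, lab 2 shifted by the median difference",
     xlab = "Gene copy number")
lines(density(gcn.lab2 - shift), lty = 2)
legend("topright", legend = c("Lab 1", "Lab 2, shifted"), lty = 1:2)

## Apparent positivity rates, with a hypothetical cutoff near the median
cutoff <- 3.8
mean(gcn.lab1 >= cutoff)
mean(gcn.lab2 - shift >= cutoff)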

This difference in median was smaller than the noise in the assay, so an assay scientist wouldn’t worry too much about it, and would certainly recommend against placing the assay in a context where a trivial change (relative to assay variability) would be interpreted gravely. Red flags and warnings should have gone up when the clinical cutoff was suggested.

The team decided to carry out a paired sample study. Regulatory requirements prevented either lab from sending clinical samples to the other lab, but the assay vendor could split samples, test them, and send them to each lab. Then we could compare each lab to the vendor. Long story short, it turned out there was a small difference between the vendor and both labs, and in the same direction, but this difference was not meaningful. The trial resumed. Frankly, while there was an observed difference, when interpreted with quantitative data and an understanding of variability, there was no real issue. By looking at processed data and not turning quickly to the underlying quantities when questions arose, the team had gotten themselves into a very costly tizzy.

Here’s what I’ve taken away from these experiences:

  • Model continuous data when possible. It’s probably closer to the data that Nature gave you than binned or otherwise processed data. Transforming such data towards a Gaussian distribution is fine; in fact, it’s often beneficial.
  • Analyzing data close to what Nature gave you will give you a more complete assessment. It may have some analysis complications, but it will give the decision-making team more confidence.

For more on this topic, check out Frank Harrell Jr.’s essay How to Do Bad Biomarker Research and his chapter on Information Loss.

#continuousdata #dataanalysis #statistics

Some thoughts on statistical modeling and data science

James Garrett

Some years ago, in a large pharmaceutical company, an academic paper circulated among some of the members of the statistical staff (of which I was one). It addressed the question of how logistic regression compared to machine learning (ML) methods in predictive accuracy with clinical trial data. The authors of the paper applied logistic regression models and found equal or better predictive performance than had been reported earlier with ML methods.

The paper was greeted by the statisticians with an uncomfortably tribal quality—I think someone may have literally written, “Hooray for our side!” If that wasn't the literal statement, the discourse within the group certainly ran along that line, such was the feeling of siege that prevailed.

(Unfortunately, I cannot find that paper now. A search on this topic uncovers many papers on the relative predictive accuracy of logistic regression on clinical data. It's been a topic of some concern, apparently.)

I had been studying statistical modeling methods advocated by Dr. Frank Harrell, Jr., whom I like to call “The most respected ignored statistician in America.” He's an elected fellow of the American Statistical Association, which is as close to a Nobel Prize as the statistics world comes. He specializes in exploratory modeling of clinical data. He mined decades' worth of statistical thinking to synthesize an exploratory modeling workflow that is purported to be thorough, efficient, and likely to yield replicable results.

The funny thing about this methodology is that at any single given point in its process, it is utterly familiar to statisticians. Logistic regression here; sure, feeling right at home. Model selection criteria there; got it. Assessment of correlations among predictor variables; of course. However, when you put all the pieces together and take it from A to Z, the whole is not familiar at all. On two occasions, at two large corporations, I sat with large statistics groups and heard Dr. Harrell walk through his process. Both times I left the room watching audience members nod their heads and say that the good Doctor had made a lot of convincing points. But would they incorporate these ideas in their own work? “No, my clients would never let me.” That’s why I say Harrell is highly respected yet mostly ignored.

One aspect that figures prominently in Harrell's workflow is inclusion of spline expansions to enable simple non-linearity for continuous predictors. In over ten years of work within the large statistical group, I had seen lots of logistic regression models, but not one included a spline expansion.
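For concreteness, here is a minimal sketch of the kind of model Harrell’s workflow favors, with simulated data and hypothetical variable names (my illustration, not his code): a restricted cubic spline for a continuous predictor inside an ordinary logistic regression.

library(rms)

## Simulated data with a non-linear effect of age on a binary outcome
set.seed(42)
d <- data.frame(age = runif(300, 30, 80))
d$y <- rbinom(300, size = 1, prob = plogis(-6 + 0.0025 * d$age^2))

RcsFit <- lrm(y ~ rcs(age, 4), data = d)  # 4-knot restricted cubic spline
anova(RcsFit)  # separates linear and non-linear contributions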

So it was interesting indeed when I looked more closely at the paper which “scored one for our side.” It didn't give a lot of detail about how the logistic regression models were fitted, but they contained spline expansions. It seemed extremely likely that the creators of the models were aware of Harrell's process. At any rate, by using splines they were out of step with typical practice.

In fact, this paper wasn't a win for “our side,” if “our side” refers to statisticians who fit models in the typical manner. It was a win for something completely different, neither ML nor typical statistical practice. It was a win for what statistical modeling could be, but rarely is. I believe my colleagues were a little quick to accept this paper as representing their practice. It suggested a middle way between traditional statistical modeling and ML, a way which can predict as well as ML, can be as flexible in many respects, and can be more informative to boot. If the data is appropriate.

I intend to write a follow-up essay soon suggesting why I think Harrell's approach works well for most clinical data sets, so stay tuned. I'll also offer some thoughts on a way to categorize statistical and ML modeling methods according to their behavioral properties, to aid in picking the right method for the data set at hand. We really shouldn't be organized in tribes at all; there is no best modeling method, there is only a method that is best at exploiting the features of a particular data set. What are those features? I'll offer my suggestions soon.