As many of you know, I’m organizing the 2013 CADD Gordon conference (July 21st to 25th, Mount Snow, Vermont). A Gordon conference is a rare opportunity to actually “confer”, and there’s a lot to talk about! The subject will be the use of statistics in molecular modeling. Not molecular modeling. The statistics of molecular modeling. Not statistics. The statistics of molecular modeling. I repeat myself because in the course of organizing the conference I’ve needed to. Many times!
“So you mean a conference on modeling with some statistics.” Er, no.
“You do realize this is the CADD Gordon meeting, not the Statistics Gordon conference, right?”
Yes, I do realize this. That’s why it’s a conference on statistics in molecular modeling. I’ve come to realize that this is a strange concept to most in our field. And that, actually, is the point: it shouldn’t be a strange concept. This summer, I hope to make a dent in that perception—not by talking about boring statistics, but by talking about how to be better at what we do.
Hal Varian, the chief economist at Google, has said that the “sexy” job of the next decade will be that of statistician. When people laugh, he points out that few would have predicted that computer engineering would rise to preeminence in the 1990s. His rationale is straightforward: from finance to sports to politics, statistics has suddenly become the “it” thing. And what do we have in molecular modeling? Not much.

Take one very simple aspect of statistics: the concept of confidence limits, a.k.a. error bars. Simple, but fairly central to two aspects of modeling: (A) what the expected range in performance of a technique might be, and (B) whether method X is better than method Y with some confidence. It’s difficult to see why this would be a controversial or technically demanding requirement for publications in our area, and yet it is astounding how rarely it is seen or correctly interpreted.

A classic example of the importance of (A) in the wider world is the 1997 Red River flood at Grand Forks, North Dakota. The predicted flood level from the National Weather Service (NWS) was 49 feet, leading the citizens of Grand Forks to prepare barriers of about 51 feet. What the NWS did not include was the error bar on that prediction, which was plus or minus nine feet! In fact, the river crested at 54 feet, causing an estimated $3.5 billion in damages. One explanation offered for the omission was that the NWS did not want to look uncertain! Yet they effectively left out the most important part of the prediction.
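To make (A) concrete, here is a minimal sketch of how bootstrap resampling can put an error bar on a performance metric such as RMSE. The data are entirely synthetic (a hypothetical model with about one log unit of error); the point is the procedure, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "measured vs. predicted" affinities (hypothetical pIC50 values).
measured = rng.normal(6.0, 1.5, size=200)
predicted = measured + rng.normal(0.0, 1.0, size=200)  # ~1 log unit of model error

def rmse(x, y):
    return float(np.sqrt(np.mean((x - y) ** 2)))

# Bootstrap: resample molecules with replacement, recompute the metric each time.
stats = []
for _ in range(2000):
    idx = rng.integers(0, len(measured), len(measured))
    stats.append(rmse(measured[idx], predicted[idx]))

lo, hi = np.percentile(stats, [2.5, 97.5])
print(f"RMSE = {rmse(measured, predicted):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Twenty lines, no special software, and suddenly a reported RMSE comes with the range a reader actually needs to judge it.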
Physics uses statistics too!
Of course, error bars in molecular modeling are not quite such a matter of life and death (we hope), but the desire to look more precise than we truly are does come into play. I think this is especially true in areas of modeling that claim to be physics-based. In fact, I recently heard one young modeler declare that he had no need to learn statistics because he did structure-based modeling! I can only assume this perception arises because real physics can have startling accuracy: precision tests of quantum electrodynamics, for instance, agree with theory to a few parts per billion. But this is irrelevant here. In any and all blind challenges to date, physics-based approaches have proved just as imprecise in their predictions of our “wet” world as empirical methods; we have seen this repeatedly in the SAMPL events. And it’s not as if physics itself doesn’t require good statistics. How else, then, to explain the ubiquitous pronouncements in our journals of the superiority of one method over another, as opposed to the more nuanced, infinitely more useful, but perhaps less fundable, probability of superiority?
Error bars, or the lack of them, are just the tip of the iceberg. Take, for example, the quite common case of measurements reported within some assay range, with a substantial subset of results falling outside that range (e.g., affinity measurements). How are the out-of-range results treated? (A) Set to the limit of the assay, e.g., 10 millimolar or whatever, or (B) ignored. Clearly neither (A) nor (B) is ideal: the first introduces unknown error into any regression model, and the latter throws away information. Did you know that there are variants of R-squared, called pseudo-R-squareds, that allow you to use both? Shouldn’t this be something our field investigates?
Speaking of the limitations of experiments, when was the last time you saw a modeling paper that actually acknowledged such a thing? There was a nice paper by the Abbott crew (Hajduk, Muchmore & Brown, DDT, 14, April 2009) that tried to quantify the effect of experimental noise and assay range on R-squared; as far as I can tell, it was widely ignored. Crystal structures permit an estimation of coordinate precision, but this is rarely used in assessing the RMSD of pose prediction. When is electron density used, as it should be, to quantify this? Even the simple procedure of sampling from the potential experimental uncertainty is lacking. How much of the widely discussed concept of “activity cliffs” is due to experimental error? In the world according to Reverend Bayes there are ways to interpret experimental results that include our expectations, yet this is never discussed, nor is the concept of experimental design (in any formal sense).
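In the spirit of that Abbott analysis, a short simulation (with invented noise levels) shows how assay noise alone caps the R-squared that even a *perfect* model could achieve against measured data:

```python
import numpy as np

rng = np.random.default_rng(2)

true_affinity = rng.normal(7.0, 1.0, 1000)   # "true" pIC50s, 1 log unit of spread

results = []
for noise_sd in (0.1, 0.3, 0.5, 1.0):        # hypothetical assay noise, log units
    measured = true_affinity + rng.normal(0, noise_sd, 1000)
    r2 = np.corrcoef(true_affinity, measured)[0, 1] ** 2
    # Theoretical ceiling: var(true) / (var(true) + var(noise))
    ceiling = 1.0 / (1.0 + noise_sd ** 2)
    results.append(r2)
    print(f"noise {noise_sd:.1f}: observed R^2 {r2:.2f} (ceiling {ceiling:.2f})")
```

With assay noise comparable to the spread of the data, no model, however good, can honestly report an R-squared much above 0.5—worth remembering the next time a paper claims one.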
Suppose you have multiple measurements, each with a different reported error. How do you average these values? The simple answer is to weight each by the inverse square of its error (more uncertain measurements are weighted less). But suppose we don’t know the error for one of the measurements and that measurement is at odds with the others—is there not a risk that including it in the average actually worsens the prediction? “Any measurement is better than none” is the falsehood that has been the mantra driving the industry toward cheaper but less accurate assays for years. And no measurement does not mean no information; for example, molecular similarity often gives you a respectable null-model estimate.
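The inverse-variance recipe fits in a few lines; the three pKi values and their errors below are hypothetical.

```python
import numpy as np

def weighted_mean(values, errors):
    """Inverse-variance weighted average of measurements with reported errors."""
    values = np.asarray(values, dtype=float)
    weights = 1.0 / np.asarray(errors, dtype=float) ** 2
    mean = np.sum(weights * values) / np.sum(weights)
    sem = np.sqrt(1.0 / np.sum(weights))  # standard error of the weighted mean
    return mean, sem

# Three hypothetical pKi measurements with different reported errors:
mean, sem = weighted_mean([6.2, 6.5, 7.4], [0.1, 0.2, 0.8])
print(f"{mean:.2f} +/- {sem:.2f}")
```

Note how the noisy 7.4 barely moves the average—which is exactly why a stray value with *unknown* error is so dangerous: with no error bar, there is no principled weight to give it.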
And what about null models? Untold numbers of papers proclaim the greatness of their particular method but fail to compare it to anything simpler, such as a 2D fingerprint search as a null model in virtual screening. This illustrates the problem that even a very basic aspect of the scientific method—the concept of a control experiment—is typically ignored by our journals. While it’s great that a method works (ignoring the probable cherry-picking that goes on, especially in academia), it should matter whether a (statistically) equivalent result arises from a (reliable) method such as 2D (or ROCS, if you’ll forgive me some bias!). There are many flavors of null models—molecular weight, for instance, acts as a useful one for scoring functions in affinity prediction—but many papers don’t seem to include any.
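The comparison can be made quantitative: compute a virtual-screening AUC for the fancy method and for the null model, then bootstrap the difference. Everything below is synthetic (invented score distributions for 50 actives among 500 compounds); the procedure, not the numbers, is the point.

```python
import numpy as np

rng = np.random.default_rng(3)

def roc_auc(scores, labels):
    """AUC = probability a random active outranks a random decoy (Mann-Whitney)."""
    scores, labels = np.asarray(scores), np.asarray(labels, bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = np.r_[np.ones(50, dtype=bool), np.zeros(450, dtype=bool)]
fancy = np.where(labels, rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
simple = np.where(labels, rng.normal(0.8, 1.0, 500), rng.normal(0.0, 1.0, 500))

# Bootstrap the AUC difference between the "fancy" method and the null model.
diffs = []
for _ in range(1000):
    idx = rng.integers(0, 500, 500)
    diffs.append(roc_auc(fancy[idx], labels[idx]) - roc_auc(simple[idx], labels[idx]))
lo, hi = np.percentile(diffs, [2.5, 97.5])

print(f"AUC fancy {roc_auc(fancy, labels):.2f} vs null {roc_auc(simple, labels):.2f}; "
      f"95% CI on difference [{lo:.2f}, {hi:.2f}]")
```

If the confidence interval on the difference straddles zero, the headline claim of superiority is in trouble—exactly the control experiment our journals should be demanding.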
A trickier concept—but one we ought to appreciate in an empirical field—is the future risk of adjustable parameters. Future risk just means performance on data we have yet to see, and it is typically misjudged through a training and test set in which the training set includes examples from all time periods, rather than a true separation into “past” and “future” data. While I think the field is slowly learning this one, there is still a reliance on cross-validation or y-scrambling to “prove” you haven’t over-trained. It’s not that these techniques are bad; they are just incomplete. For example, cross-validation might suggest that a model is bad—e.g., over-trained—but how often will it be wrong? Bad models can appear good, and good models bad. How do you know the chance of this for your model? And how does this classification error change with the size and composition of the dataset? There are some quite lovely results from information theory that give bounds on expected performance given a certain number of parameters, but they are mostly unknown in CADD.
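One way to see the problem, again with synthetic data: given many descriptors and few compounds, an ordinary least-squares model fit to pure noise looks excellent on its training set. An out-of-sample check (here leave-one-out) exposes it in this instance—but how often such checks mislead, and at what dataset sizes, is exactly the question raised above.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 60, 50                      # few compounds, many descriptors
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # intercept + descriptors
y = rng.normal(size=n)             # pure noise: there is nothing to learn

# Training R^2 of ordinary least squares
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_train = 1 - np.sum((y - X @ beta) ** 2) / ss_tot

# Leave-one-out q^2: predict each point from a model that never saw it
press = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press += (y[i] - X[i] @ b) ** 2
q2 = 1 - press / ss_tot

print(f"training R^2 = {r2_train:.2f}, leave-one-out q^2 = {q2:.2f}")
```

A training R-squared north of 0.8 from nothing at all—and with fewer compounds or a temporally clustered dataset, even the leave-one-out check becomes far less trustworthy.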
And this is only explicit parameterization. What about implicit parameters, such as the selection of a few parameters from a large set, or the choice of method in the first place, or the choice of system? These can all be handled within the context of a larger (Bayesian) framework. But that’s the key word: work. And who, other than the more puritanical, is going to do more work than necessary to get published, especially if journals continue to have such low standards? There are societal forces at work here, not just ones of rigor. And even with standards, how useful are they if we can’t reproduce the work of others? One of Barry Honig’s aphorisms that I try to pass along is “It’s not science until someone else does it.” It’s when a group of us starts to use an approach that we really begin to learn, and reproducibility is a good starting point.
Hype, glorious hype
Another issue is the ability to see through hype. I’ve commented in a previous entry about how management in pharma finds hype totally irresistible, whether it be about experimental approaches, classes of targets, management techniques (how’s that lean six sigma working out for you?) or simply the latest in computer innovation. There’s some real progress in the latter: Google Translate or IBM’s Watson come to mind. However, there is a lot of nonsense out there too. How do you distinguish between the two? Well, a firm grounding in statistics doesn’t hurt. I’ll try to give an example from the hype surrounding “Deep Learning” in the near future.
I could go on. But let’s get back to my plans for the Gordon conference. There are clearly a lot of simple methods that ought to be standard in modeling: how to calculate and interpret error bars, how to fit straight lines, how to deal with outliers, etc. To address some of these, I plan morning sessions where members of our community can present methods they have found useful, along with the science such approaches enabled. It’s my hope to capture these and other approaches in written form and also in a web-accessible interface for the community. There are also the bigger issues, such as how non-ideal our data is for standard statistical tests, how to deal with parameter risk, experimental error and null models. Each of these will have its own session, with two speakers and plenty of time for discussion. I don’t know if we can “solve” these issues at such a short meeting, but we can perhaps make a start. Then there are the societal issues, in which journals feature prominently. A skeptic might claim that no matter how happy-clappy a meeting we have amongst the faithful, it is all for naught if the journals don’t get on board. Well, I have a backup plan. I won’t elaborate just yet, but don’t worry, I’m not starting my own journal. Thought of that. Came to my senses long ago.
Finally, as I alluded to earlier, there are other fields out there, either having success with a more statistical approach or perhaps suffering for the lack of one. I’ve invited a panel of five external speakers: Steve Ziliak, an economist from Roosevelt University and coauthor of the excellent The Cult of Statistical Significance; Cosma Shalizi from CMU, who works in statistics and machine learning and has written popular articles on bootstrapping and Bayesian reasoning; George Wolford II, from the department of psychology at Dartmouth, an Ig Nobel Prize winner for a priceless paper on fMRI studies; Carson Chow from the NIH, who has applied Bayesian analysis to a variety of topics, including obesity in the United States; and, finally, Elizabeth Iorns, a medical researcher who has set up a company, ScienceExchange, to help reproduce experimental results for companies that can no longer trust published findings. These five will give evening talks, each bringing a different and, we hope, illuminating perspective.
I think statistics is an important part of what we ought to do as modelers, but usually don’t. The cost of not using statistics is less thorough work and shakier progress. So come to Mount Snow in July (incongruous as that sounds) and we’ll try to make a difference. Knowing some statistics will not only make us better scientists; it is also part of the rich intellectual heritage of the twentieth century. It’s time to catch up!
Visit the GRC CADD 2013 website.