What Matters for Driving Distance?

Math addendum: Is this really valid?

Here's a mathematical fine point that you may or may not be interested in. If you're not into math, you don't need to understand -- nor even read -- this note. Skip it if it holds no interest for you.

While I was working on answering Reinout's question, I started to wonder whether it is valid to compare a deterministic computer/physics model with the type of statistical model Reinout gleans from the PGAtour.com statistics. The comparison seems to give reasonable answers, but it is mathematically suspect. Here's the problem: the statistical model and the computer model are not the same, so it may be partly or perhaps even mostly luck that they give the same answers.

The computer model takes a set of launch conditions, and computes a carry distance that physics says those launch conditions will produce.
The statistical model takes a set of data points, each of which is an average of a season's worth of swings for one single professional golfer. We then look at the distribution of those points. That is, the carry distance for Bubba Watson's row on the spreadsheet is the average of all his drives, the ball speed is the average of all his drives, etc. All those drives are represented as a single point in the scatter plots -- each point is the statistical summary for one player.

It may seem frivolous to question testing a deterministic mathematical model with a statistical model. In science, theories are tested that way all the time. When there is any randomness or outside influences in the experimental results, scientists turn to the sort of graphs Reinout presented -- at least superficially. But this is substantially different. If we were using statistics to test the mathematical model behind TrajectoWare Drive, the statistical base would have one point per measured drive -- not one point representing a season's worth of measured drives.

What does this do to the statistics we observe? At the very least, it is probably making R-squared much smaller than it should be. If each data point were a single drive (rather than the average of many drives), I would expect the trend line slopes to be pretty similar to what we see. But I would expect the points to line up much better along that line, not scatter all over the page. And that would result in a much better correlation, with an R-squared closer to 1.0.

Why do we see so much spread in the data? Because even a single player's driving statistics are not uniform. For instance, data will be taken on holes where the player used a driver (certainly the intent of Reinout's study), but also on holes where he used a 3-wood, or perhaps even an iron. Uphill and downhill. Into the wind, with the wind, crosswind, and combinations thereof. What do we get when we average all those drives into a single point? I honestly don't know. And that is exactly the problem!

Think about this: When Reinout does the statistical curve fitting, he is tacitly assuming that the statistical fit will reflect repeated use of the single-instance computer model. But that assumption is mathematically valid only if the computer model is linear. If the model is nonlinear, then the distributions are warped by the nonlinearity, and the carry computed from the average launch conditions will not necessarily equal the average of the measured carries. And we know (looking at the launch space surface) that carry distance as a function of ball speed, launch angle, and spin is not linear. Perhaps the restriction of launch space implied by properly fitted drivers keeps us in a region where the function is close enough to linear that the linearity assumption does not do any damage.
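To make the linearity point concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: `toy_carry` and `linear_carry` are hypothetical stand-ins, not the physics inside TrajectoWare Drive, and the four "drives" are invented numbers rather than PGA Tour data. It only shows that averaging the inputs and then computing gives the same answer as computing and then averaging when the model is linear, and a different answer when it is not.

```python
from statistics import mean

def toy_carry(ball_speed_mph, launch_deg):
    """Hypothetical NONLINEAR carry model, in yards.

    Carry falls off quadratically as launch angle moves away from 12 degrees.
    This is an illustrative stand-in, not a real trajectory model.
    """
    return 1.7 * ball_speed_mph - 0.5 * (launch_deg - 12.0) ** 2

def linear_carry(ball_speed_mph, launch_deg):
    """Hypothetical LINEAR carry model, in yards, for comparison."""
    return 1.7 * ball_speed_mph + 2.0 * launch_deg

# Invented drives for one player: (ball speed in mph, launch angle in degrees).
drives = [(160.0, 8.0), (165.0, 12.0), (170.0, 16.0), (175.0, 12.0)]

def average_then_compute(model, drives):
    """Feed the season-average launch conditions to the model once."""
    avg_speed = mean(speed for speed, _ in drives)
    avg_launch = mean(launch for _, launch in drives)
    return model(avg_speed, avg_launch)

def compute_then_average(model, drives):
    """Run the model on every drive, then average the resulting carries."""
    return mean(model(speed, launch) for speed, launch in drives)

for name, model in [("linear", linear_carry), ("nonlinear", toy_carry)]:
    from_averages = average_then_compute(model, drives)
    from_each_drive = compute_then_average(model, drives)
    gap = from_averages - from_each_drive
    print(f"{name:9s}  from averages: {from_averages:.2f}  "
          f"from each drive: {from_each_drive:.2f}  gap: {gap:.2f}")
```

For the linear model the two numbers agree. For the toy nonlinear model the average-launch-conditions calculation overstates the average carry, because the player's off-optimum launch angles cost distance that disappears when the angles are averaged first. How big that gap is for real drivers depends on how curved the real launch space is over the range of conditions a player actually produces, which this sketch does not try to estimate.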

So the fact that the computer model continues to give the same information as the statistics might be coincidence. More likely, it is a rough approximation to what we would get if we gathered the statistics properly -- one drive per data point. Either way, we got lucky.
