Tuesday, July 30, 2013
Correlations
As you guys know, I think that correlations are horribly misused. The first rule to remember: correlation is not causation. That means all we are doing is showing relationships, regardless of whether those relationships are in fact dependent on each other. While we know that triples are a causative agent of runs scored, simply showing a relationship between triples and runs does NOT imply that you have determined the causative relationship.
I ran a regression of runs scored against all the offensive events, using all team offenses from 2003-2012 (300 team seasons). The correlation was a healthy r=.965. While we get reasonable run values in many cases (.50 for singles, .78 for doubles, 1.45 for HR, .32 for UBB, -.11 for regular outs, -.13 for strikeouts), we also get weird results, like a triple being worth 1.28 runs.
The gap between a triple and a double according to the regression is .50 runs, while it's only .17 between the triple and the HR. There are a half-dozen ways to show that this is ridiculous, in terms of the DIRECT cause-effect between triples and runs. Now, triples can be capturing other things, like speed and taking the extra base. But by itself, the triple is not generating 1.28 runs. Or it can be that teams that happened to hit a lot of triples also happened to hit better with men on base, leading to an overfit (that is, you are taking the actual variation in performance with men on base, and trying to find SOMETHING that somehow links to it, just by pure luck).
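To make the "pure luck" point concrete, here is a small simulation sketch. Everything in it is assumed for illustration: the "true" run values, the team-season means and spreads of each event, and the size of the leftover noise are made-up numbers, not the 2003-2012 data. It fits the same kind of team-season regression 200 times on 300 simulated teams and reports how much each fitted weight bounces around from one set of simulated seasons to the next.

```python
import numpy as np

EVENTS = ["1B", "2B", "3B", "HR"]
# ASSUMED "true" run values, used only to generate the simulated runs.
TRUE_W = np.array([0.47, 0.77, 1.04, 1.40])
# ASSUMED team-season mean and standard deviation of each event count.
MEAN = np.array([950.0, 290.0, 30.0, 165.0])
SD = np.array([50.0, 25.0, 9.0, 30.0])


def fitted_weights(rng, n_teams=300, noise_sd=25.0):
    """Simulate n_teams team-seasons and return the regression weights."""
    # Event counts for each team (columns follow EVENTS).
    X = rng.normal(MEAN, SD, size=(n_teams, len(EVENTS)))
    # Runs = true linear weights + everything the events do not explain.
    runs = X @ TRUE_W + rng.normal(0.0, noise_sd, size=n_teams)
    # Ordinary least squares with an intercept column.
    A = np.column_stack([X, np.ones(n_teams)])
    coef, *_ = np.linalg.lstsq(A, runs, rcond=None)
    return coef[:-1]  # drop the intercept


rng = np.random.default_rng(0)
fits = np.array([fitted_weights(rng) for _ in range(200)])
spread = fits.std(axis=0)  # how much each fitted weight varies across trials

for name, s in zip(EVENTS, spread):
    print(f"{name}: sd of fitted weight = {s:.3f}")
```

Under these assumptions, the fitted triple weight varies far more from trial to trial than the single or HR weight, simply because teams differ much less in triples than in the other events, so the regression has less to work with.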
You get similarly silly things, like the HB being worth .39 runs while the unintentional walk is worth .32 runs. While the UBB does occur with first base open more often (which depresses its value relative to the hit batter), the difference is nowhere close to .07 runs, but rather closer to .02 runs. But perhaps the HB contains extra information: teams that are being hit by pitches might also be good at other things, like hitting HR with men on base.
Then you have the CS being worth only -.02 runs. Naturally, the CS has to be larger in magnitude than a regular batting out (-.11 in this regression), since you have to account for the erasing of the runner, in addition to the out.
Finally, the intercept is 42 runs. That is, even after you account for every batting event, there are still another 42 runs unaccounted for (0.25 runs per game) that are neither directly nor indirectly related to those batting events. If we force the intercept to zero, all the run values change slightly to compensate, and in the process the run value of the CS moves to zero (actually, .003, which is essentially zero).
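Here is a minimal sketch of what forcing the intercept to zero does mechanically. It uses simulated data with a single made-up "event" count per team-season; the constant of 42, the slope of 0.5, the spread of the event counts and the noise level are all assumptions chosen for illustration, not the real regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# ASSUMED values used only to build the simulated team-seasons.
TRUE_INTERCEPT, TRUE_SLOPE = 42.0, 0.5
events = rng.normal(1400.0, 150.0, size=n)
runs = TRUE_INTERCEPT + TRUE_SLOPE * events + rng.normal(0.0, 10.0, size=n)

# Fit 1: ordinary least squares with an intercept column.
A = np.column_stack([events, np.ones(n)])
(slope_free, intercept_free), *_ = np.linalg.lstsq(A, runs, rcond=None)

# Fit 2: same data, but the line is forced through the origin.
(slope_forced,), *_ = np.linalg.lstsq(events[:, None], runs, rcond=None)

print(f"with intercept: slope={slope_free:.3f}, intercept={intercept_free:.1f}")
print(f"forced to zero: slope={slope_forced:.3f}")
```

With no constant available, the forced fit has nowhere to put those 42 runs except in the per-event value, so the slope gets pushed up. In a regression with many events, that compensation gets spread across all the coefficients.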
Maximizing correlation is NOT the goal, and it's not even a valid test. It's a useful first-sniff tool.
And most importantly, aggregating 162 games into one line is a terrible decision. There's no reason to do it. In fact, it introduces a systematic bias: a team that has great baserunners, for example, will end up scoring more runs than the regular batting stats would suggest, and the regression may overfit in trying to account for that.
If you want to do the work, at the very least, run the correlation at the game-level. And if you do that, do it at the inning-level. And if you do that, you may as well use the 24-base-out matrix. And once you do that, you realize that regression ends up being useless to you.
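For reference, the usual run-expectancy bookkeeping behind that last step can be written as below. The notation is mine, not from any particular source: RE(s) is the average number of runs scored from base-out state s to the end of the inning, and s_before and s_after are the states on either side of the play.

```latex
\[
\text{run value of a play}
  = RE(s_{\text{after}}) - RE(s_{\text{before}}) + \text{runs scored on the play}
\]
```

Average that over every occurrence of an event and you have its run value directly, with no fitting involved. A CS, for instance, moves the inning to a state with one more out AND one fewer runner, so its value has to be worse than that of a regular out that leaves the runner in place.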