Tangotiger Blog

Tuesday, July 30, 2013

Correlations

By Tangotiger

As you guys know, I think that correlations are horribly misused. The first rule to remember: correlation is not causation. That means all we are doing is showing relationships, regardless of whether those relationships in fact depend on each other. While we know that triples are a causative agent of runs scored, simply showing a relationship between triples and runs does NOT imply that you have determined the causative relationship.

I ran a regression on all team offenses from 2003-2012 (300 team seasons), with all the offensive events as predictors of runs scored. The correlation was a healthy r=.965. While we get reasonable run values in many cases (.50 for singles, .78 for doubles, 1.45 for HR, .32 for UBB, -.11 for regular outs, -.13 for strikeouts), we also get weird results, like a triple being worth 1.28 runs.
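A minimal sketch of this kind of team-season regression, in Python with NumPy. The data below is synthetic and the simulated run values in `SIM_VALUES` are arbitrary inputs chosen for illustration, not the 2003-2012 results quoted above; `fit_run_values` is a hypothetical helper name.

```python
import numpy as np

def fit_run_values(events, runs):
    """Least-squares fit of runs on event counts, with an intercept.

    events: (n_team_seasons, n_event_types) array of counts.
    Returns (coefficients with the intercept first, r between fitted and actual runs).
    """
    X = np.column_stack([np.ones(len(events)), np.asarray(events, dtype=float)])
    y = np.asarray(runs, dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = np.corrcoef(X @ coefs, y)[0, 1]
    return coefs, r

# Synthetic stand-in for 300 team seasons (ASSUMED numbers, not real data).
rng = np.random.default_rng(0)
n = 300
events = rng.poisson(lam=[950, 290, 30, 170], size=(n, 4))  # 1B, 2B, 3B, HR
SIM_VALUES = np.array([0.47, 0.77, 1.04, 1.40])             # arbitrary simulation inputs
runs = events @ SIM_VALUES + rng.normal(0, 20, n)           # plus unexplained noise

coefs, r = fit_run_values(events, runs)
for name, value in zip(["intercept", "1B", "2B", "3B", "HR"], coefs):
    print(f"{name:>9}: {value:6.2f}")
print(f"r = {r:.3f}")
```

Because triples are rare and vary little from team to team, their fitted coefficient is usually the least stable of the four in a setup like this, even when the simulation is perfectly linear.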

The gap between a triple and a double according to the regression is .50 runs, while it's only .17 between the triple and the HR. There are a half-dozen ways to show that this is ridiculous in terms of the DIRECT cause-effect between triples and runs. Now, triples can be capturing other things, like speed and taking the extra base. But by itself, the triple is not generating 1.28 runs. Or it can be that teams that happened to hit a lot of triples also happened to hit better with men on base, leading to an overfit (that is, you are taking the actual variation in performance with men on base, and trying to find SOMETHING that somehow links to it, just by pure luck).
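The "capturing other things" effect can be reproduced in a small simulation. This is a sketch under stated assumptions: a hidden `speed` factor (hypothetical) makes teams hit more triples and also adds baserunning runs on its own, and the direct values `TRUE_2B` and `TRUE_3B` are made-up simulation inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
TRUE_2B, TRUE_3B = 0.8, 1.0                     # ASSUMED direct run values

speed = rng.normal(0, 1, n)                     # hidden team speed
triples = rng.poisson(30 * np.exp(0.3 * speed)) # fast teams hit more triples
doubles = rng.poisson(290, n)
baserunning_runs = 8 * speed                    # runs from speed, not from triples
runs = (TRUE_2B * doubles + TRUE_3B * triples
        + baserunning_runs + rng.normal(0, 10, n))

def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Speed left out: the triple coefficient absorbs the baserunning runs.
triple_coef = ols([doubles, triples], runs)[2]
# Speed included: the triple coefficient falls back toward its direct value.
triple_coef_with_speed = ols([doubles, triples, speed], runs)[2]

print(f"triple value, speed omitted:  {triple_coef:.2f}")
print(f"triple value, speed included: {triple_coef_with_speed:.2f}")
```

In this setup the regression credits the triple with runs that speed produced, which is the same shape of problem as a triple coming out at 1.28 runs.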

You get similarly silly things, like the HB being worth .39 runs while the unintentional walk is worth .32 runs. While the UBB does occur with first base open more often (thereby depressing its value relative to the hit batter), the true difference is nowhere close to .07 runs, but rather closer to .02 runs. But perhaps the HB contains extra information: teams that get hit by pitches might also be good at other things, like hitting HR with men on base.

Then you have the CS being worth only -.02 runs. Naturally, the CS has to be larger in magnitude than a regular batting out (-.11 here), since you have to account for the erasing of the runner, in addition to the out.

Finally, the intercept is 42 runs. That is, even after you account for every batting event, there are still another 42 runs unaccounted for (about 0.26 runs per game) that are neither directly nor indirectly related to those batting events. If we force the intercept to zero, all the run values change slightly to compensate, and in the process the run value of the CS moves to .003, which is essentially zero.
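A rough sketch of what forcing the intercept to zero does mechanically: the column of ones is dropped from the design matrix, so the remaining coefficients have to shift to absorb whatever the intercept was carrying. The data and the simulated coefficients below are assumptions for illustration only, and `fit` is a hypothetical helper.

```python
import numpy as np

def fit(X, y, intercept):
    """Least-squares fit; returns (coefficients, residual sum of squares)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if intercept:
        X = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coefs) ** 2))
    return coefs, rss

# Synthetic team seasons (ASSUMED numbers): hits and outs only, to keep it short.
rng = np.random.default_rng(2)
n = 300
hits = rng.poisson(1400, n)
outs = rng.poisson(4100, n)
X = np.column_stack([hits, outs])
runs = 42 + 0.6 * hits - 0.04 * outs + rng.normal(0, 20, n)

with_int, rss_with = fit(X, runs, intercept=True)
no_int, rss_without = fit(X, runs, intercept=False)

print("with intercept:   ", np.round(with_int, 3))  # [intercept, hits, outs]
print("intercept forced 0:", np.round(no_int, 3))   # [hits, outs]
```

The fit with an intercept can never have a larger residual sum of squares than the fit without one, so a slightly better fit says nothing about whether the individual run values are sensible.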

Maximizing correlation is NOT the goal, and it's not even a valid test. It's a useful first-sniff tool.

And most importantly, aggregating 162 games into one line is a terrible decision. There's no reason to do it. In fact, it introduces a systematic bias: for example, a team that has great baserunners will end up scoring more runs than its regular batting stats would suggest, and the regression may overfit as it tries to explain those extra runs.

If you want to do the work, at the very least, run the correlation at the game-level. And if you do that, do it at the inning-level. And if you do that, you may as well use the 24-base-out matrix. And once you do that, you realize that regression ends up being useless to you.
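To make the base-out approach concrete, here is a minimal sketch of how a run value falls out of a run-expectancy table: the value of a play is the run expectancy after it, minus the run expectancy before it, plus any runs that scored. The numbers in `RUN_EXP` are placeholders invented for illustration (only three of the 24 cells are filled in), and the plays are hypothetical.

```python
from collections import defaultdict

# PLACEHOLDER run expectancies, NOT measured values:
# (bases occupied, outs) -> expected runs to the end of the inning.
RUN_EXP = {
    ("---", 0): 0.50,
    ("1--", 0): 0.90,
    ("---", 1): 0.27,
}

def play_value(table, before, after, runs_on_play):
    """Change in run expectancy plus runs that scored on the play."""
    after_re = 0.0 if after is None else table[after]  # None marks the third out
    return after_re - table[before] + runs_on_play

def event_values(table, plays):
    """Average play value for each event type."""
    sums, counts = defaultdict(float), defaultdict(int)
    for event, before, after, runs in plays:
        sums[event] += play_value(table, before, after, runs)
        counts[event] += 1
    return {event: sums[event] / counts[event] for event in sums}

# Hypothetical plays: (event, state before, state after, runs scored on the play)
plays = [
    ("OUT", ("---", 0), ("---", 1), 0),
    ("CS", ("1--", 0), ("---", 1), 0),
    ("HR", ("---", 0), ("---", 0), 1),
]

values = event_values(RUN_EXP, plays)
for event, value in values.items():
    print(f"{event}: {value:+.2f}")
```

With these placeholder cells the caught stealing costs more than the batting out, because it removes the runner as well as adding the out, and no regression was needed to get there.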

(8) Comments • 2013/08/01 • Statistical_Theory
