Tangotiger Blog

A blog about baseball, hockey, life, and whatever else there is.

Tuesday, July 30, 2013

Correlations

As you guys know, I think that correlations are horribly misused.  The first rule to remember: correlation is not causation.  That means all we are doing is showing relationships, regardless of whether those relationships are in fact dependent on each other.  While we know that triples are a causative agent of runs scored, simply showing a relationship between triples and runs does NOT imply that you have determined the causative relationship.

I ran a regression on all team offenses from 2003-2012 (300 team seasons), with all the offensive events against runs scored.  The correlation was a healthy r=.965.  While we get reasonable run values in many cases (.50 for singles, .78 for doubles, 1.45 for HR, .32 for UBB, -.11 for regular outs, -.13 for strikeouts), we also get weird results, like a triple being worth 1.28 runs.
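For anyone who wants to see the mechanics, here is a minimal sketch of that kind of team-season regression.  It is not the code behind the numbers above: the data are synthetic, the generating weights and typical event counts are assumptions (the 1.05 for triples in particular is made up), and only four events are included, so the run totals are not realistic.  The helper names `solve`, `ols` and `pearson_r` are hypothetical.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y, intercept=True):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    x = [([1.0] + list(r)) if intercept else list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

def pearson_r(u, v):
    """Plain Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

random.seed(0)
EVENTS = ["1B", "2B", "3B", "HR"]
# Assumed generating weights (runs per event); the 3B value is invented for the demo.
TRUE_WEIGHT = {"1B": 0.50, "2B": 0.78, "3B": 1.05, "HR": 1.45}
# Assumed (mean, sd) of team-season event counts; rough guesses, not measured.
TYPICAL = {"1B": (950, 60), "2B": (290, 25), "3B": (30, 8), "HR": (165, 30)}

# 300 synthetic team seasons: event counts, then runs = weighted events + noise.
rows = [[random.gauss(*TYPICAL[e]) for e in EVENTS] for _ in range(300)]
runs = [sum(TRUE_WEIGHT[e] * c for e, c in zip(EVENTS, row)) + random.gauss(0, 25)
        for row in rows]

coefs = ols(rows, runs)  # coefs[0] is the intercept
fitted = [coefs[0] + sum(b * c for b, c in zip(coefs[1:], row)) for row in rows]

print("intercept:", round(coefs[0], 1))
for e, b in zip(EVENTS, coefs[1:]):
    print(e, "true", TRUE_WEIGHT[e], "estimated", round(b, 2))
print("r:", round(pearson_r(fitted, runs), 3))
```

Change the seed a few times and watch the estimated triple value: an event that happens only a few dozen times a season gives the regression little to work with, so its coefficient tends to wander more than the others even when the overall r looks healthy.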

The gap between a triple and a double according to the regression is .50 runs, while it's only .17 between the triple and HR.  There are a half-dozen ways to show that this is ridiculous, in terms of the DIRECT cause-effect between triples and runs.  Now, triples can be capturing other things, like speed and taking the extra base.  But by itself, the triple is not generating 1.28 runs.  Or it can be that teams that happened to hit a lot of triples also happened to hit better with men on base, leading to an overfit (that is, you are taking the actual variation in performance with men on base, and trying to find SOMETHING that somehow links to it, just by pure luck).

You get similarly silly things, like the HB being worth .39 runs while the unintentional walk is worth .32 runs.  While the UBB does occur with first base open more often (thereby depressing its value relative to the hit batter), the true difference is nowhere close to .07 runs, but rather closer to .02 runs.  But perhaps the HB contains extra information: teams that are being hit by pitches might also be good at other things, like hitting HR with men on base.

Then you have the CS being worth only -.02 runs.  Naturally, the CS has to be larger in magnitude than a regular batting out, since you have to account for the erasing of the runner, in addition to the out.
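The sanity checks so far are just arithmetic on the reported coefficients, and can be laid out as a small sketch.  The numbers are the ones quoted in this post; the dictionary and variable names are only for illustration.

```python
# Regression run values as quoted in the post.
COEF = {
    "1B": 0.50, "2B": 0.78, "3B": 1.28, "HR": 1.45,
    "UBB": 0.32, "HB": 0.39, "OUT": -0.11, "SO": -0.13, "CS": -0.02,
}

gap_3b_2b = COEF["3B"] - COEF["2B"]   # extra value of a triple over a double
gap_hr_3b = COEF["HR"] - COEF["3B"]   # extra value of a HR over a triple
gap_hb_ubb = COEF["HB"] - COEF["UBB"] # hit batter versus unintentional walk

# A caught stealing costs an out AND a baserunner, so it should cost more
# than a regular batting out. The regression says the opposite.
cs_costlier_than_out = abs(COEF["CS"]) > abs(COEF["OUT"])

print("3B minus 2B:", round(gap_3b_2b, 2))
print("HR minus 3B:", round(gap_hr_3b, 2))
print("HB minus UBB:", round(gap_hb_ubb, 2))
print("CS costlier than a regular out?", cs_costlier_than_out)
```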

Finally, the intercept is 42 runs.  That is, even after you account for every batting event, there are still another 42 runs unaccounted for (about 0.26 runs per game) that are neither directly nor indirectly related to those batting events.  If we force the intercept to zero, all the run values change slightly to compensate, and in the process the run value of the CS moves to essentially zero (.003, to be exact).
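Why do the run values move when the intercept is forced to zero?  A one-variable toy case shows the mechanism.  This is a sketch with made-up numbers (y = 1 + 2x), not the team data, and the function names are hypothetical: once the constant is taken away, the slope has to absorb it.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def fit_through_origin(x, y):
    """Least squares for y = b*x with the intercept forced to zero; returns b."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Toy data, generated exactly from y = 1 + 2x.
x = [1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0]

intercept, slope = fit_with_intercept(x, y)
slope_no_intercept = fit_through_origin(x, y)

print("with intercept:", intercept, slope)
print("through origin:", slope_no_intercept)
```

With the intercept allowed, the fit recovers the constant and the slope that generated the data.  Forced through the origin, the slope comes out larger, because it is now also doing the job of the missing constant.  With many events in the regression, that compensation gets spread across all the coefficients.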

Maximizing correlation is NOT the goal, and it's not even a valid test.  It's a useful first-sniff tool.

And most importantly, to aggregate 162 games into one line is a terrible decision.  There's no reason to do it.  In fact, it introduces a systematic bias: a team that has great baserunners, for example, will end up scoring more runs than the regular batting stats would suggest.  And the regression, in trying to account for those runs, may overfit.

If you want to do the work, at the very least, run the correlation at the game-level.  And if you do that, do it at the inning-level.  And if you do that, you may as well use the 24-base-out matrix.  And once you do that, you realize that regression ends up being useless to you.
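To make the base-out idea concrete, here is a minimal sketch of valuing a single event as the change in run expectancy plus any runs that scored.  The run expectancy numbers below are hypothetical placeholders, not a measured table, and only four of the 24 states are filled in; `RE` and `run_value` are names made up for this example.

```python
# Hypothetical placeholder run expectancies, keyed by (runners, outs).
# "---" is bases empty, "1--" is a runner on first. NOT measured values.
RE = {
    ("---", 0): 0.48,
    ("---", 1): 0.26,
    ("1--", 0): 0.86,
    ("1--", 1): 0.51,
}

def run_value(re, before, after, runs_scored=0):
    """Change in run expectancy across one event, plus runs scored on it.

    after=None means the event ended the inning (run expectancy of zero).
    """
    re_after = 0.0 if after is None else re[after]
    return re_after - re[before] + runs_scored

# Caught stealing: runner on first with 0 out becomes bases empty with 1 out.
caught_stealing = run_value(RE, ("1--", 0), ("---", 1))
# Regular batting out: bases empty with 0 out becomes bases empty with 1 out.
batting_out = run_value(RE, ("---", 0), ("---", 1))
# Solo HR: the base-out state is unchanged and one run scores.
solo_hr = run_value(RE, ("---", 0), ("---", 0), runs_scored=1)

print("CS:", round(caught_stealing, 2))
print("out:", round(batting_out, 2))
print("solo HR:", round(solo_hr, 2))
```

Average that quantity over every occurrence of an event, using a real run expectancy table, and you have a run value for the event with no regression involved.  Even with placeholder numbers, the CS comes out far costlier than a regular out, because the runner is erased along with the out.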

 

 

(8) Comments • 2013/08/01 • Statistical_Theory
