Tangotiger Blog

A blog about baseball, hockey, life, and whatever else there is.

Saturday, January 03, 2015

Overfitting

This fellow tries to simplify the algorithm to estimate who will make the HOF. However, because he only works with backwards data, he offers NO opportunity to test with out-of-sample data. That leads to massive overfitting. How massive? His first rule for hitters is that a player will make the HOF if he scores more than 1197 runs (at an 88% success rate). Steve Finley is at 1443. He'll never make it. Neither will Luis Gonzalez, Bernie Williams, Brett Butler, Darrell Evans, Tony Phillips, Julio Franco, Dave Parker, Ray Durham, Chili Davis, Don Baylor, or Edgar Renteria. And a host of others. Could a first pass have excluded these guys before using Runs as the first line of defense? Sure, you start with WAR, which the author intentionally ignored. What you REALLY want to do in these tests is identify the most important variables first, those that lead to automatics, and then worry about the nuances later.
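To make the out-of-sample point concrete, here is a minimal sketch in Python. It is not the researcher's method: the function names (precision, fit_cutoff) are made up for illustration, and every number in the toy samples is invented except Finley's 1443 runs. It fits a runs cutoff on one sample, then checks how the same cutoff does on players it has never seen.

```python
def precision(players, cutoff):
    """Share of players with runs > cutoff who were inducted.

    `players` is a list of (runs, inducted) pairs. Returns None if
    nobody clears the cutoff.
    """
    flagged = [inducted for runs, inducted in players if runs > cutoff]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


def fit_cutoff(players, min_flagged=3):
    """Pick the cutoff with the best in-sample precision.

    Every observed runs total is tried as a candidate cutoff; ties keep
    the lowest one. min_flagged (a hypothetical guard) stops the search
    from settling on a cutoff that only one or two players clear.
    """
    best = None
    for cutoff in sorted({runs for runs, _ in players}):
        flagged = [inducted for runs, inducted in players if runs > cutoff]
        if len(flagged) < min_flagged:
            continue
        score = sum(flagged) / len(flagged)
        if best is None or score > best[1]:
            best = (cutoff, score)
    return best


# ASSUMPTION: invented toy data, not real career totals.
train = [(900, False), (1000, False), (1100, False),
         (1200, True), (1300, True), (1400, True), (1500, True)]

# ASSUMPTION: invented holdout sample, except 1443 (Steve Finley, not inducted).
holdout = [(1250, False), (1350, False), (1443, False), (1600, True)]

cutoff, in_sample = fit_cutoff(train)
out_of_sample = precision(holdout, cutoff)

print(f"cutoff: runs > {cutoff}")
print(f"in-sample precision: {in_sample:.0%}")
print(f"out-of-sample precision: {out_of_sample:.0%}")
```

The rule looks perfect on the sample it was fit to, and falls apart on the holdout. Without a holdout, you never see the second number.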

You can see this with his first two tests for pitchers: Wins over 229 and years LESS than 25. That was an obvious overfit, when the more natural fit would have been to start with wins over 299. And I doubt the second variable is really number of years; maybe it's number of losses or W/L differential or win%, something more real and natural.

Finally: start with the Bill James Hall of Fame monitor. That would have been PERFECT for the researcher to have used, as Bill provides a series of well thought-out and reasonable tests. The researcher could then try to reduce it as much as he could, while still getting more reasonable output than a model that predicts Julio Franco for the Hall of Fame and Nolan Ryan not.

(31) Comments • 2015/01/23 • Statistical_Theory
