Tangotiger Blog


Monday, April 08, 2013

DIPS and BABIP

By Tangotiger 10:16 AM

Pizza gives it a once-over.  For those new to the whole debate, you may find this piece interesting (solvingdips.pdf).

The issue with focusing on BABIP is that it's half of a bigger equation.  If you do BABIP, then you MUST do SLGBIP (or, basically, just do wBABIP, which is the wOBA version of BABIP).  On top of which, you must include DP (and reach on error... anyone who ignores reach on error is doing it wrong).

A GB pitcher has a higher BABIP than a FB pitcher.  However, for each BIP, the FB pitcher will give up more extra-base hits.  And for each BIP, the GB pitcher will get more DP.  In the end, what we REALLY care about is the run value.  And the run value is identical for GB and FB (excluding HR).

In order to get a balanced view, you need to look at run value per BIP.  And the significance there will shrink.  This is why something as simple as K minus BB+HB-IB per PA works.  It's because whatever truth you do find in BABIP, there's a counter-truth that reduces its significance.
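The simple estimator mentioned above is just arithmetic; here is a minimal sketch (the function name and the stat line are mine, purely for illustration):

```python
# Sketch of the simple estimator above: strikeouts minus net free passes
# (walks plus hit batters, minus intentional walks), per plate appearance.

def k_minus_bb_rate(k, bb, hb, ib, pa):
    """(K - (BB + HB - IB)) per PA."""
    return (k - (bb + hb - ib)) / pa

# Hypothetical season line: 180 K, 55 BB, 8 HB, 5 IB over 800 PA.
rate = k_minus_bb_rate(180, 55, 8, 5, 800)   # 0.1525
```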



#1    Tangotiger 2013/04/09 (Tue) @ 11:32

MGL asks a question in the comments, and Pizza responds:

We’re not talking here about how far to regress. That’s a different set of analyses. We have two variables fighting it out to see who is better at predicting the outcome of the next ball in play, which is the true measure of how good a predictor is. These analyses tell you that recent history does a better job modeling the outcome of the next BIP than does league average. That right there suggests that the standard DIPS assumption that everyone is league average deep down should be treated with suspicion.

So at that moment, he is better described as a .240 BABIP rather than a .300 BABIP.

As far as I can tell, Pizza IS talking about regression, just as MGL is positing.

Furthermore, his last paragraph I find incredibly hard to believe.  Pizza is outright saying that, given a pitcher with a .240 BABIP in his last 100 BIP and an either/or choice, it’s better to describe him as a .240 pitcher than a .300 pitcher.

My best guess: I’d call such a pitcher a true .297 or .298 pitcher.  Maybe I’m wrong, and we can say he’s a .295 pitcher.  I don’t believe any answer below .290.  This would set the regression point of r=.50 at BIP=500.

And to set it so that Pizza would claim he’s more likely a true .240 than a true .300, that would set r=.50 at BIP=99 (or lower).
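Tango's numbers can be reproduced with the standard shrinkage formula, reading "regression point of r=.50 at BIP=c" as the usual regression constant (this is my sketch of that reading; the function name is mine):

```python
def regressed_babip(observed, n_bip, league=0.300, c=500):
    """Shrink an observed BABIP toward the league mean; c is the sample
    size at which observation and league average get equal weight."""
    return league + (observed - league) * n_bip / (n_bip + c)

# With c=500 (Tango's floor), .240 over 100 BIP regresses to .290:
est_500 = regressed_babip(0.240, 100, c=500)

# For ".240" to beat ".300" as a descriptor, the estimate must land below
# the .270 midpoint, which requires c < 100, i.e. r=.50 at BIP=99 or lower:
est_99 = regressed_babip(0.240, 100, c=99)
```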

No way.

Either that, or I have no idea what Pizza did in his article to suggest his claim.

 


#2    KJOK 2013/04/09 (Tue) @ 13:41

I almost hate to suggest this about someone as smart as Pizza, but it almost looks like he’s misinterpreting his own results.  Seems like he’s saying he’s found the last 100 BIP are in general more predictive than just using the league average, BUT most of those would be much less extreme than .240 BABIP.  Any extreme result over the last 100 BIP, like .240 or .360, is not going to be more predictive than the .300, I’m almost certain.


#3    Tangotiger 2013/04/09 (Tue) @ 13:58

He seems to be quite clear:

At 10 BIP, the league BABIP had a 4-to-1 edge in predictive power, consistent with what we’ve been taught about BABIP all these years. But as the sampling frame crept up, the pitcher’s recent results on balls in play started to become a relatively stronger predictor. By the 100-BIP sampling frame, a pitcher’s recent performance was the stronger of the two predictors. Around 150 BIP, it was about a 60/40 split in favor of the pitcher’s recent results, and it stayed around that ratio up to 250 BIP.

Sorry, but I just don’t get it.


#4    Pizza Cutter 2013/04/09 (Tue) @ 14:13

KJOK, .240 was the example used by MGL in the comment that he left, which is why I used that number.  .240 might be a rather large deviation, but it could theoretically happen.


#5    Guy 2013/04/09 (Tue) @ 14:15

My guess is that Pizza means that, at 100 BIP, a model using actual performance is a more accurate predictor than simply assuming league average for all players.  However, that model will almost certainly heavily regress (e.g. it might be BABIP = .297 + .01*Performance).  And while the model may be more accurate on average, it doesn’t follow that it will be more accurate for extreme values.  So it certainly won’t be the case that “he is better described as a .240 BABIP rather than a .300 BABIP.”  What you could say, I think, is that it’s likely this pitcher will be below .300 on his next BIP.


#6    Tangotiger 2013/04/09 (Tue) @ 14:31

Guy: no one would dispute that claim.  Naturally, ANY metric is better if it uses
a) some past observation PLUS league average,
b) rather than ONLY league average.

But, Pizza isn’t saying that.  He said that of the two, past observation is a stronger predictor than league average.  That means he’s weighting the past observation more.

And he confirms that with his statement that .240 is a better predictor than league average.

***

Maybe a specific example for Pizza: if we find all starting pitchers who gave up 25 or fewer non-HR hits in their first 100 BIP to start their 2012 season (such that the average will be about… I dunno .210), what is your estimate as to the BABIP we’ll observe in their 101st BIP?  And what is your estimate for what we’ll observe for their 101st through 200th BIP?


#7    Pizza Cutter 2013/04/09 (Tue) @ 14:38

It looks like a previous comment I left got swallowed.

My argument is that the concept of “regression to the mean” is based on the assumption that a player has a _static_ true talent level.  DIPS has elevated this to the idea that true talent for BABIP is something of a mathematical constant.  But basically, we take a number and assume that the player holds this value for an extended period of time (a year?)

I question that underlying assumption.  These results suggest that a model where our approximation of a pitcher’s true talent is left to vary within the year is better than assuming that static number.  My assumption is that the last 100 BIP are the best indicator for what a pitcher’s talent level is right at this moment, and it outperformed the static model in my head-to-head test.

In that case, we have nothing stable to which we can regress back.  And given what we know about the reliability of BABIP, trying to regress the last 100 BIP back to the league mean will basically result in regressing back to the league mean.


#8    Steve C 2013/04/09 (Tue) @ 14:40

Is this all driving to a weighted predictor that would look like xBABIP = a*(100BABIP) + b*(careerBABIP) + c*(lgBABIP)?

We know that if a=0 and there are 3000+/- BIP then b>c

Correct me if I’m wrong, but Pizza is saying that the true talent in the now (next 10 BIP) is volatile enough that the previous 100 tells us more than we think.

All this really tells me is that pitchers are streaky and the BIP outcomes are piecewise smooth because of this.


#9    Pizza Cutter 2013/04/09 (Tue) @ 14:40

To put that in layman’s terms, I’m making an argument for streakiness.


#10    Steve C 2013/04/09 (Tue) @ 14:43

Proof #2 sounds like it is quite a bit like PZR.  Am I understanding that correctly?


#11    Tangotiger 2013/04/09 (Tue) @ 14:46

A player’s true talent level varies year-to-year, day-to-day, and second-to-second.  This is because he’s human, and this applies not only to ballplayers, but to any of us.

And since we know pitchers perform worse each time through the order, we even have empirical data that shows us that his talent level changed that day.

***

Now, if you want to suggest that you would use:
a. league average
b. his career totals and/or, a certain number of BIP, but not the last 100 BIP
c. the last 100 BIP

And such that c. is weighted (per BIP) more than b. is, then that’s perfectly fine.

Indeed, that’s the precept of Marcel.  Indeed, this is made clear with Day-to-Day Marcels, that we weight more recent performances higher.

***

So far, we are all in agreement (presuming I captured your feelings above).  Where I have a problem is with your response to MGL:

“So at that moment, he is better described as a .240 BABIP rather than a .300 BABIP.”

I still don’t get that.


#12    Tangotiger 2013/04/09 (Tue) @ 14:48

We know pitchers are “streaky” within game, because of the times-thru-order effect.

But, are you making the argument for streakiness, game-to-game?  If so, then you should show the streaks at the game level.

By clumping at the BIP level, you may instead be capturing the times-thru-order effect.


#13    Guy 2013/04/09 (Tue) @ 14:51

Pizza:
Two questions. 
1) Are you saying that if you are asked to estimate the hit probability for a pitcher’s next BIP, and you have to choose only between A) Lg BABIP, and B) that pitcher’s BABIP for his previous 100 BIP, you would select B?
2) Same question, but for the pitcher with a .240 BABIP over his past 100 BIP.

I’d be shocked if B is the correct answer to Q1.  And it can’t possibly be correct for Q2.


#14    Pizza Cutter 2013/04/09 (Tue) @ 15:05

Because 100 BIP will cover more than one game (and probably more than a month), I can’t make a claim to modeling in-game streaks.  But you could call that some sort of lunar-progression-level streak.

The .240 vs. .300 remark is a simple statistical statement.  These analyses say that the past 100 BIP (.240) beat the league average (.300) assumption.  If I had to choose between either .240 or .300, my results say I should pick .240.  Perhaps there’s a different formula that describes things better, but given the choices, the numbers speak very clearly.

And that’s significant, because it means that there’s something better than just assuming .300 for everyone.


#15    Tangotiger 2013/04/09 (Tue) @ 15:12

Maybe someone wants to flex their Retro-muscles, and perform the test at the bottom of Tango/6.

Unfortunately, I can’t do any baseball research for the public, so I’m hoping someone steps up to the plate (er, mound?) here.

***

And Pizza, the 101st BIP will be part of the game with the 100th BIP about 90% of the time.  Naturally, it won’t be part of the 70th and earlier BIP.  But, there is some bias here.

I think the test would be better if you looked at five starts of BIP, and then looked at the 6th start, and saw how it relates to the prior 5.

***

And also run it for K per PA.


#16    MGL 2013/04/09 (Tue) @ 15:13

What bothered me about the article (which was excellent, BTW, even if I did not understand the mathematics), was that he continually mentioned that the sabermetric understanding of DIPS was that pitchers had NO control over their BABIP and that any deviation was completely random.

Even in his answer to my question, he said:

That right there suggests that the standard DIPS assumption that everyone is league average deep down should be treated with suspicion.

That is completely wrong. There is no assumption that “everyone is league average.” Voros never said that and none of the other DIPS researchers said that.

Pizza seems, at times, to be arguing against something that does not exist.

Pitcher BABIP, like everything else, is a combination of true talent and luck (random variance). It just so happens that there is a lot more random variance in a pitcher’s BABIP for a given sample size than in other things which we are typically interested in. There is no “magic” with respect to BABIP and DIPS.

And because of that, if you want to be somewhat lazy, you can simply substitute a “mean” for a pitcher’s actual BABIP when dealing with relatively small samples of BIP rather than go through the regression process, because even after going through the regression process you are going to get an answer that is very close to the mean, again for relatively small samples of BIP. However, for pretty large samples, like 5 or 10 years, using a mean would be a pretty big mistake. Probably not as big as using actual BIP, but a mistake nonetheless.

Another thing that Pizza (and others) argue against (albeit more implicitly), which also simply isn’t true, is the notion that it is proper to regress a pitcher’s actual BABIP toward the “league average.” That is true if we know nothing about the pitcher other than his BABIP. However, if we have access to all of that pitcher’s stats, we would use his GB and K rates, pitch velocities, and other things (e.g., whether he is a knuckler or not, whether he is a “pounder” or a “nibbler”) to establish the population mean, which would not necessarily be the same as the “league mean.”


#17    Tangotiger 2013/04/09 (Tue) @ 15:20

And to piggyback on MGL/16 (which I ditto), the most important characteristic is simply whether he’s a starter or reliever.

The Rule of 17 suggests that pitching in relief will drop your BABIP by a whopping 17 points.  That’s going to be more relevant than any single thing you can find.  Indeed, it might be more relevant than EVERY other thing combined you can find.

(Though, pitching in relief implicitly brings with it that you are throwing the ball faster.)


#18    Pizza Cutter 2013/04/09 (Tue) @ 15:35

In the interests of investigating whether this theory works at the extremes, I ran some new analyses.

Using the same basic framework as I did in the original, I took the league average and the past 100 BIP and let them fight it out in the same logistic regression.

I only took cases where the last 100 BIP yielded a prediction of .280 or lower, then .275 or lower, then .270 or lower, etc.  There does come a point where league BABIP is a better predictor, and it seems to happen somewhere between .270 and .265.  However, it should be noted that the past 100 BIP still holds some significant sway, even as you descend further.

Perhaps .240 is too lucky to believe, but .270 is not.


#19    Guy 2013/04/09 (Tue) @ 15:37

Yes, I was just thinking that the starter/reliever distinction is very important.  If Mariano shows up with a .270 BABIP on his last 100 BIP, that will be a better estimate for him than league average.  But it won’t be a better estimate than the league closer average. 

I’m not so worried about the other biases Tango mentions, as 100 BIP is about 5 starts. 


#20    Pizza Cutter 2013/04/09 (Tue) @ 15:38

Tango/17 - Yeah, whether reliever or starter probably does make a difference.  I just haven’t gotten there yet.

MGL/16 - Maybe no one actually argues that out loud, but it seems that for all we know and have known (see all the links I put in there to other DIPS doubters!), we still use ERA estimators that assume a “normal” BABIP.  It’s still “in the air”.


#21    Tangotiger 2013/04/09 (Tue) @ 15:43

Instead of the “or lower”, can you show ranges?  That is:
.010-.200
.210-.250
.260
.270
.280
.290

If I’m following you, you are saying that for a pitcher who gives up, say, 28 hits per 100 BIP, that performance is more predictive than league average.

But, if a pitcher gives up, say, 23 hits per 100 BIP, then the league average is more predictive?

That the closer to league average the performance, the more predictive the PITCHER’s performance.  But, the further away from league average, then the more predictive the LEAGUE average?

Sounds to me more like park or fielding factors or something.  That a pitcher who gives up 28 hits on 100 BIP is more predictive than the league average, because of the park and fielding variables wrapped in there.

But if a pitcher gives up 21 hits on 100 BIP, then all the park and fielding factors linked to that is too small to counter the random variation, and so, we just bet on league average.

I think this is what’s happening here.


#22    Tangotiger 2013/04/09 (Tue) @ 15:51

In other words:

Let’s say you take all the hitters at Coors.  And then, you split them up into three groups:
1. observed a wOBA of .320 or less
2. observed a wOBA of .320-.360
3. observed a wOBA of .360 or higher

Assume league average is .320.

So, if you look at all the Coors hitters who we observe in the .320-.360 range, we have to figure that Coors helped them.  And so, if we wanted to know how they’d hit at Coors, chances are we’re better off using their production than league average.

But, for the guys who hit .360+ (say they average .420 as a group), well, it’s hard to accept that they are going to repeat that .420.  So, we end up forecasting them to be LOWER than group #2!

It’s a paradox of sorts.


#23    Pizza Cutter 2013/04/09 (Tue) @ 15:53

There could be park factors in there (I didn’t control for that).  Although in my original article, I addressed fielding.  Pitcher skill appears to weight more heavily than fielding.

Small amendment, if a pitcher gives up 23 hits on his 100 *immediately previous* BIP, you believe the league average more.


#24    Tangotiger 2013/04/09 (Tue) @ 15:53

In TAngo/22, I’m imposing the either/or: either you choose the recent performance, or you choose league average.


#25    Tangotiger 2013/04/09 (Tue) @ 16:04

Here’s a more obvious example:

Say the league average is 19% K rate (.19 K per PA).

You have two pitchers who, after facing 10 batters got:
PitcherA: 2 strikeouts (i.e., .20 K per PA)
PitcherB: 8 strikeouts (i.e., .80 K per PA)

Now, my questions:
1. What predicts PitcherA’s next 10 batters: his observed 20% K rate, or the league average 19% K rate?

2. What predicts PitcherB’s next 10 batters: his observed 80% K rate, or the league average 19% K rate?

For Q2, it’s obvious: it’s the league average.

But, for Q1, maybe he’s facing bad hitters, maybe the pitcher is a bit above average, maybe park conditions help K, etc.  So, you might go with his observed performance.

Paradox.

So, I think you need to go back to the utensil guy’s research.  He was right all along.


#26    Peter Jensen 2013/04/09 (Tue) @ 16:47

I am not a Bpro subscriber so I couldn’t read the article to see what Pizza’s methodology actually was.  Here is what I did.  I took every starter from 2005 through 2011 who had more than 200 BIP in a year.  I computed a rolling average of his BABIP for his previous 100 BIP.  I compared his BABIP in his most recent 100 BIP with his BABIP in his 100 BIP before that (in the same year).  So BABIP for BIP 1-100 compared to BIP 101-200 as one trial, BABIP in BIP 201-300 compared to BIP 301-400 as trial 2, etc.  If the difference of the two BABIPs in each trial was less than the difference between the BABIP of the most recent 100 and .300, I gave a win to recency.  If they tied I gave a tie; otherwise I gave a win to league average.

The results over 3798 trials were 1232 wins for recency, 405 ties, and 2161 wins for league average.  That is with no adjustment for the quality of the pitcher’s team defense.  No adjustment should mean more wins for recency, since the defense’s effect on the BABIP rate is being credited entirely to the pitcher.
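If I read Peter's procedure right, each trial scores like this (a sketch; the function and argument names are mine, with `recent100` as the 100-BIP stretch being predicted):

```python
def trial_winner(recent100, previous100, league=0.300):
    """One trial, as Peter describes it: does the previous 100-BIP BABIP
    or the flat league average come closer to the most recent 100 BIP?"""
    err_recency = abs(recent100 - previous100)
    err_league = abs(recent100 - league)
    if err_recency < err_league:
        return "recency"
    if err_recency == err_league:
        return "tie"
    return "league"

# e.g. a .320 stretch following a .310 stretch is a win for recency;
# a .240 stretch following a .350 stretch is a win for league average.
```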


#27    Guy 2013/04/09 (Tue) @ 16:53

Peter:  Did you check to see if the average BABIP for starting pitchers in these years was in fact .300?  Looks to me like the mean is more like .296. Not sure how that impacts your results…..


#28    Peter Jensen 2013/04/09 (Tue) @ 16:58

Guy - Since every group of 100 is only going to have a precision of .010, it won’t impact my results at all.  Even if you throw all the ties to recency, it is a long way from supporting Pizza.


#29    Pizza Cutter 2013/04/09 (Tue) @ 17:07

My methodology was based on logit regression.  I took the rolling average for the last 100 BIP immediately before this particular BIP.  So, for BIP #101, I’m looking at the BABIP from #1-100.  For BIP #102, I’m looking at #2-101, etc.


#30    Peter Jensen 2013/04/09 (Tue) @ 17:08

Guy - I double checked. The average BABIP for this group of starters was .292 because of the selection bias of having to pitch enough to allow over 200 BIP in a season.  Changes the results to 1409 for recency and 2389 for league average.


#31    Tangotiger 2013/04/09 (Tue) @ 17:11

Peter: great stuff!

Now, if you are up for it, what if you change it to BIP=200?

That is, at what point do recency wins equal league average wins?


#32    Tangotiger 2013/04/09 (Tue) @ 17:14

And since Peter did not have a tie, he’s effectively using a league average of .295.  That’s because with 100 BIP, either you gave up 29 or 30 hits.  Comparing to .291 or .299 is the same thing.

Peter: can you also break it down to how I was saying in Tango/21?

That is, do we in fact see the league average “win” more, the more extreme the pitcher performance?


#33    Peter Jensen 2013/04/09 (Tue) @ 17:17

Pizza - I would think that your methodology would be heavily biased by the quality of the opposing team, home team bias, and umpire bias.  How did you adjust for the pitcher’s team defense?


#34    Pizza Cutter 2013/04/09 (Tue) @ 18:39

Peter/33 - the article presented three different lines of argument.  We’ve mostly been discussing #1 on recency vs. league average on what predicts BABIP on this BIP better.

The analysis concerning defense was slightly different.  I used the 93-99 Retrosheet data, which has batted ball location data (flawed as it may be), and created an xBABIP based on where the pitcher’s balls in play went.  For defense, I took the BABIP of what happened when the pitcher on the mound wasn’t pitching.  I also incorporated batter xBABIP (same basic idea as the pitcher) and league average BABIP.  Batter xBABIP reigned supreme here, but pitcher was second, and generally out-did defense by a 2/1 margin.


#35    MGL 2013/04/09 (Tue) @ 19:15

“we still use ERA estimators that assume a “normal” BABIP.”

Yes we do. As I said, when faced with the (false) dichotomy of using actual BABIP or some mean BABIP, it is correct to use the latter for almost any but the largest of sample sizes. That is NOT the same thing as saying, or even implying, that variation in BABIP is random, that pitchers have no control over their BABIP, etc.

Pizza, you are perpetuating a myth, and I don’t know why you don’t correct that, rather than defend yourself by stating that sabermetric insiders are doing the same thing. They are not. There is no saberist that I know of that has ever said that a pitcher’s BABIP has no skill component.

This article (again, as much as I like it, and it is great work) is a PERFECT example of how complex statistical wrangling can mislead the reader and even the researcher.

Tango is 100% correct, BTW. Of course an actual BABIP that is close to the league mean will be a better predictor of future BABIP than one that is far away from the mean, IF the mean is wrong!

Here is another example: Say the mean you are using is .300, but this is a reliever, and the mean SHOULD be .290.

If the reliever throws a .240 BABIP, his future BABIP will be around .290, much closer to the incorrect .300 mean you are using.

But if his BABIP is .290 in 10 or 100 BIP, then his future BABIP will still be .290, which of course is closer to his actual than the incorrect mean you are using.

So this whole “recent small-sample BABIP is a better predictor than using league average” thing, while true, is simply an “illusion” caused by using the wrong mean!

What if we used the mean BA for all baseball players in the world, rather than just major leaguers, and that mean BA was .350. Well, if an MLB player hits .250 in 100 AB that is likely a better predictor of his future BA than .350.


#36    TomC 2013/04/09 (Tue) @ 20:25

That doesn’t really work though.  If the “true mean” is .290, and he’s comparing to .300 instead, then performance will only do as well or better than .300 when performance is between .280 and .300.  Given that the binomial standard deviation for n=100, p=.3 is about .045, that’s not going to be anywhere near half the time.
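TomC's arithmetic is easy to check with an exact binomial (a sketch under his scenario: true BABIP .290, benchmark .300, 100 BIP):

```python
from math import comb, sqrt

n, p = 100, 0.29          # 100 BIP, true BABIP of .290

# Spread of observed BABIP over 100 BIP (TomC's "about .045"):
sd = sqrt(p * (1 - p) / n)

def pmf(k):
    # exact binomial probability of k hits on n balls in play
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The observed rate does as well or better than the .300 benchmark only
# when it lands in [.280, .300], i.e. 28, 29, or 30 hits:
p_window = sum(pmf(k) for k in (28, 29, 30))   # roughly a quarter
```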


#37    Pizza Cutter 2013/04/09 (Tue) @ 22:44

MGL/35 - Per your suggestion, I ran analyses in which I corrected for the starter vs. reliever BABIP issue, and when I did so, league BABIP (adjusted for role) outperformed the last 100 BIP, although the victory was narrow and there was still a significant contribution from recent history. 

However, when I moved the sampling for the recency effect to 150 (I believe I only briefly mentioned this in the article, but 150-160 was the point where the recency effect was strongest… 100 was the point where it just barely poked its head over league average), recency emerged again as the better predictor.

So even controlling for starter/reliever, we can still find that recency effect performs better.  The bigger problem is that DIPS-based estimators don’t do us the service of controlling for starter/reliever.

I’m worried that people are missing the broader point.  Sure, there have been plenty of arguments made that DIPS rests on a flawed assumption (I’ve made some of them).  But there are also a lot of other articles that follow the template that “Smith had a low BABIP last year, prepare for him to regress to the mean.”  Maybe it’s not the same people, but it’s something that’s baked into the community more generally.  People still use FIP as the basis for all sorts of statements.

I’m not even so invested in the thought that the league average BABIP is winning or losing slightly as a predictor (at best, it wins by about a 60/40 ratio in terms of variance explained).  The fact that it is contributing and in a non-trivial amount is good enough for me.  Beyond the myth-busting element, it suggests a way that we might be able to begin looking at guys who have wide deviations in BABIP and picking apart which ones are talent-based vs. those that really were just random noise.


#38    MGL 2013/04/10 (Wed) @ 01:57

I’m worried that people are missing the broader point.  Sure, there have been plenty of arguments made that DIPS rests on a flawed assumption (I’ve made some of them).

I don’t know what you mean by “DIPS” (the formula, the idea, etc.) but it certainly does not rest on a flawed assumption. It rests on an excellent assumption. One, using some mean BABIP (league average is usually fine) is MUCH better than using actual BABIP (if actual BABIP is close to the mean, then who the heck cares which one you use?) for any sample but a really, really large one. 

Two, that BABIP necessarily includes fielder proficiency, whereas HR, K, and BB do not.

But there are also a lot of other articles that follow the template that “Smith had a low BABIP last year, prepare for him to regress to the mean.”

Are you suggesting that that is not true? It is 100% true, assuming that the mean is that of the population that the pitcher comes from, with respect to BABIP talent. And if the BABIP that Smith had last year was really far away from even a league mean, then it is also true even if we use the league mean as the number to regress toward.

I still have no idea of the difference between the recent BABIP “winning or losing” and the amount to regress toward some mean. How does recency over 150 BIP “win” over league BABIP, when clearly if a pitcher’s BABIP is .250 (or .350) over his last 150 BIP, his BABIP over the next 150 (or 10 or 1000) is going to be very close to .290-.295? How does the .250 (or .350) “win” over .296 (a league average BABIP)?

Is that some kind of semantical trick related to the statistical tests you have done? To me, the “winning” number means being closer to the future BABIP than the losing number.

Clearly if a pitcher is .240 or .250 or .370 or .210 or .330, etc., that number is not going to “win”, i.e., be closer to his future BABIP, than whatever mean you want to use (.296, .292, .302, etc.)...


#39    Tangotiger 2013/04/10 (Wed) @ 07:22

The correct term is regression TOWARD the mean, and not TO the mean.  TOWARD means going in the direction of, while TO means going AT that point.

EVERYTHING regresses TOWARD the mean.  We should be past discussing this point.

The ONLY question on the table is the DEGREE of that regression, how much, the amount of regression.

People may use FIP to suggest that regression is 100%, but we’ve already established that EVERYTHING regresses TOWARD, not TO, so, anyone who uses 100% is wrong.

And we already know that the year-to-year correlation for full-time starters for BABIP is around r=.20.  So, again, we KNOW there’s some sort of dependency there.  We can tweak out the fielders and park by looking at team-switchers, and the r is still healthy, at .15 or something.  There IS a skill.

Furthermore, we can look at career numbers, and again there, we see enough spread in observations that it’s not related to luck.  Again, we know there’s a skill.

***

Anyway, all you have to do Pizza (or Peter), is run the results in Tango/21, so that we can see the paradox of this “winning”.

That the winning-thing doesn’t tell us what Pizza is implying it does.  That really, it’s telling us more about the pitcher’s environment than the pitcher.


#40    Kincaid 2013/04/10 (Wed) @ 23:56

Pizza, can you give the output of your logit regression here (if that information is subscription only, I understand)?  I’m trying to reproduce the general result by running a logit regression using the past 100 BIP to predict the next 1 BIP, but the results I am getting are showing the predicted value staying very tightly around the league average.

I took all balls in play from 2003-2012 and ran a logit regression of each pitcher’s BIP #101 on his BIP #1-100, #102 on #2-101, etc up to #5000-5100.  I got

Coefficients:
            Estimate Std. Error z value   
(Intercept) -0.98872    0.01281 -77.202  
data$BABIP   0.38968    0.04308   9.046

Converting that to an expected BABIP for an observed .240 BABIP over the previous 100 BIP:

x = .39*.240 - .99
xBABIP = e^x/(1 + e^x)
= .290

compared to a league average of .293 over all the observations.  The results I’m getting are showing the observed BABIP barely moving the predicted value.
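Kincaid's conversion checks out when run directly from the quoted coefficients (a straight inverse-logit; the helper and variable names are mine):

```python
from math import exp

def inv_logit(x):
    return 1 / (1 + exp(-x))

intercept, slope = -0.98872, 0.38968   # Kincaid's fitted coefficients

# Expected BABIP on the next BIP given .240 over the previous 100:
xbabip_240 = inv_logit(intercept + slope * 0.240)   # ~.290

# Feeding in the league average instead barely moves the prediction:
xbabip_293 = inv_logit(intercept + slope * 0.293)   # ~.294
```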


#41    Pizza Cutter 2013/04/11 (Thu) @ 14:13

I had slightly different inputs (1993-2012, I converted the “tracking” BABIP to logged odds ratio before entering it, I only allowed in-season predictions, so no cross-feeding from one year into another).

But to answer the question:

.240 has a logged odds ratio of -1.153

Coefficients were

tracking BABIP (100 BIP sampling frame) was .077
constant was -.801

x = .077 * -1.153 - .801
x = -.890, xBABIP in the model is .286, so our models are roughly saying the same thing.  Baseline BABIP in my sample (among cases considered) was .295.

Looks like my model assumes that the tracking average has a little more pull.

 


#42    Tangotiger 2013/04/11 (Thu) @ 14:28

Ok, so what are Kincaid and Pizza saying?  That given 100 BIP at a .240 level, we’d expect the 101st BIP to be somewhere between 3 points (Kincaid, regressing 96%) and 9 points (Pizza, regressing 84%) below league average?

I would have guessed 97% regression for talent level.  But considering we’re talking about the same team, possibly the same park, partly the same game, I would have expected somewhere around 93-95%.

I don’t understand what all the arguing is about then.  I mean, the 84% is still a healthy amount of regression.  It’s actually much less regression than I expected.  So, that’s very interesting.

But framing it the way Pizza originally did, which as we learned is nothing more than a paradox, doesn’t help us.

The focus should be on the amount of regression, and not on the “head to head winning”.


#43    Guy 2013/04/11 (Thu) @ 14:35

I think part of the confusion in this discussion is that when Pizza says recency “wins” over league average with >150 BIP, he means that his model’s prediction—not the prior BABIP—will beat league average as a predictor.  So in his example, a .240 prior performance yields a prediction of .286, and roughly 60% of the time the actual BABIP will be closer to .286 than to .295.  Pizza:  am I interpreting that correctly?  So that would still be consistent with a very substantial regression TOWARD the mean.


#44    Tangotiger 2013/04/11 (Thu) @ 15:01

Guy: then we can’t reconcile this statement:

“So at that moment, he is better described as a .240 BABIP rather than a .300 BABIP.”


#45    Guy 2013/04/11 (Thu) @ 15:09

Tango:  Right, unless Pizza meant a PROJECTED .240 pitcher.  However, I’m not sure such a beast can be found in the wild.  Indeed, it would probably require a pitcher to post a negative BABIP….


#46    MGL 2013/04/11 (Thu) @ 16:51

Guy, wouldn’t that be obvious? I would expect that a proper projection, for ANY number of BIP, would beat league average more than half the time. The only reason it wouldn’t, in small sample sizes, would be because of “slop” in the projection model.

I mean, if my model were merely to regress actual BABIP toward the mean by an amount equal to 4000/(4000+BIP), shouldn’t that beat using the league mean more than half the time for ANY number of BIP? Obviously for all intents and purposes the projected BABIP will be the same as the league mean for small sample sizes.
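MGL’s toy model can be written out directly; the 4000-BIP ballast is his illustrative number, not a fitted constant, and .295 is assumed as the league mean:

```python
def project_babip(observed, bip, lg_avg=0.295, ballast=4000):
    # Regress observed BABIP toward the league mean by an amount
    # equal to 4000 / (4000 + BIP), per MGL's sketch.
    return (bip * observed + ballast * lg_avg) / (bip + ballast)

# 100 BIP at .240: the projection barely moves off the league mean.
print(round(project_babip(0.240, 100), 4))  # 0.2937
```

For small samples the projection is, for all intents and purposes, the league mean; only as BIP grows does the observed rate pull the projection toward itself.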


#47    Pizza Cutter 2013/04/11 (Thu) @ 16:54

Through this entire conversation, I’ve been working from a variance explained framework.  When I say that .240 is the better descriptor than .300, I’m saying that the variable based on the recency effect (which could have a value of .240, theoretically) is a better predictor (statistically) when you look at contributions to -2 log fit statistics, which is the logit analogue of R-squared.  In fairness, when I started to look at some of those more extreme cases (see #18), league average came back into the lead.  Throughout, the two kinda danced in that 60/40 range relative to one another.

When the recency effect goes head to head with the “everyone is league average” assumption, then the recency effect has a roughly 60/40 edge over league average in general, again in terms of variance explained.  Even if you reverse those numbers and consistently find that league average is the better predictor, it still means that a significant chunk of what’s going on can be found in the recency variable.

Perhaps the more accurate statement for me to make would be “If I could only have one piece of information to work with to predict the outcome of the next ball in play, I’d rather have a projection based on the pitcher’s recent performance than a projection based on the league average.”

Of course, we could have both if we wanted them, and that’s where it becomes a false dichotomy.  Then again, for my own purposes, it’s also not the part that interests me about these findings.  The fact that recency goes toe-to-toe with the “everyone is league average” assumption and either bests it or holds its own has some pretty big implications for how we understand DIPS and balls in play.

Recency implies that there is some contribution either of the surrounding environmental factors or of the pitcher himself that affects results on balls in play.  That’s one of those things that people have generally nodded their heads to, but no one seems to have done anything about, principally because year-to-year reliability and my own work with split-half reliability suggested that these contributions were not robust.  If the effect of talent is not very strong and the league mean is so overwhelming (and a constant), then a “let’s regress everyone to the mean” approach makes sense.  But these analyses, with what I think is a more sophisticated statistical method, show that the effect is robust. 

There’s another fascinating part.  If everyone is basically league average give or take some random noise, then in the long run, everyone will regress to .300 and there’s nothing we can do about it except hope for dumb luck to be on our side.  But if there are either environmental or talent-based factors in play here, then it opens up the possibility that we can identify those factors, and which pitchers/situations contain them, and maybe some of those factors could be manipulated.  In any case, it opens up the idea that some players are more immune to regression to the mean than others, which is why I talked about questioning the underlying assumption of regression to the mean as a framework (#7).  In the aggregate, “regress everyone back to .300” might perform decently well, but maybe we can reach a more nuanced and statistically better understanding, and perhaps we shouldn’t regress everyone at the same rate.

And because the analytic framework (logit) allows our estimation of true talent to vary even from BIP to BIP, rather than just looking at year-level data (or more), it allows us a much more fine-grained look at how some of these effects might play out over time.

If all that anyone gets out of my piece is “DIPS is not random, there are major contributions of pitcher talent and/or environmental factors around the pitcher, and this presents a new framework in which we might understand those factors” then I am a happy man.


#48    Guy 2013/04/11 (Thu) @ 17:14

“If I could only have one piece of information to work with to predict the outcome of the next ball in play, I’d rather have a projection based on the pitcher’s recent performance than a projection based on the league average.”

Pizza:  I think you are losing people with this notion of a split in variance explained by recency vs. the average. (Or at least, you’re losing me).  I’m not sure what it means to say the average “explains” 40%—or any %—of the variance.  The average is a constant—it doesn’t “explain” any variance.  And what would a “projection based on the average” be, if not the league average itself?

The other point I would make is that what you call “recency” is indistinguishable from any player/team/field ability that differs from league average.  To the extent you find that the recent data adds information, it may simply be evidence that this pitcher/team has a mean that isn’t league average, not that “recent” data has special value because players are “streaky.”  And while you seem to find either possibility to be a revelation, what you’re hearing here is that all of us already assume pitchers/teams vary in their hit prevention ability.  This just isn’t a new or surprising idea.


#49    Tangotiger 2013/04/11 (Thu) @ 17:42

I’d rather have a projection based on the pitcher’s recent performance than a projection based on the league average

I don’t know what it means to say “based”.

I mean, if you do:
101stBABIP
= 95% lgBABIP
+ 5% recencyBABIP

That’s “based” on both.  How do you make it based on ONLY recencyBABIP?  Well, you could do this:
101stBABIP
= .280
+ 5% recencyBABIP

Except, that’s just a trick.  That .280 means something: it’s 95% of the lgBABIP.
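Tango’s point that the constant secretly carries the league average can be made concrete (the 95/5 split is his illustrative weighting, and .295 is an assumed league mean):

```python
lg_babip, recency_babip = 0.295, 0.240  # assumed league mean; recent observed

# Explicit blend of both variables:
blended = 0.95 * lg_babip + 0.05 * recency_babip

# The "trick": fold 95% of the league average into a constant, then
# claim the projection is based "only" on recency.
constant = 0.95 * lg_babip              # = .28025, Tango's ".280"
tricked = constant + 0.05 * recency_babip

print(blended, tricked)  # identical: the constant IS the league-average term
```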

***

Now, what is “based on league average”?  That’s EXACTLY league average.  What else could it be?

***

I’m as confused as ever.

 

 


#50    MGL 2013/04/11 (Thu) @ 17:49

“If all that anyone gets out of my piece is “DIPS is not random, there are major contributions of pitcher talent and/or environmental factors around the pitcher, and this presents a new framework in which we might understand those factors” then I am a happy man.”

With all due respect, how is that different from what we already know about DIPS and what has been written in 300 articles since Voros?


#51    Pizza Cutter 2013/04/12 (Fri) @ 11:31

A projection based on the recency variable would simply be the equation that the computer spits out when I ask it to do a best fit on my training data set using only the recency variable as a predictor.  See #41.

The prediction based on league average would be slightly off from the actual league average in this specific case, because the training data set only includes data from the 101st BIP onward for each season (and only pitchers who were good enough to get to 100 BIP in a season).  You could use just plain old league average if you want.

If you wanted to include both the league average and the recency variable, that’s fine.  The program will spit out some coefficients for both inputs.  FWIW, that equation, including both variables, will be a better fit than the equation produced by either variable alone.

Guy/48 - As to your questions on variance explained, if we were doing linear regression, I would quote R-squared as an index of how well the model fits the observed data.  In logit, which is what I’m using here, it’s -2 log.  In linear regression (or correlation) you want to find the variable that’s most closely correlated with the outcome and the same basic principle applies here.  The program says that the recency variable is more closely related to the outcome in question.  We know because it picks up a greater amount of the variance.

In linear regression, you can take a variable out of a regression and see how it changes the R-squared.  In binary logit, you can take a variable out and see how it changes the -2 log.  That’s what I mean when I talk about variance contributions.  If you remove recency, the loss in -2 log has about a 60:40 ratio to the loss that happens if you remove league average.
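The “-2 log” fit statistic Pizza is describing can be sketched directly; this is just the standard binary log-likelihood scaled by -2 (lower is a better fit), not Pizza’s actual code:

```python
import math

def neg2_log_likelihood(probs, outcomes):
    # -2 * log-likelihood of binary outcomes (hit on BIP = 1) given
    # predicted probabilities.  Removing a variable from the logit
    # model and watching this number rise is the analogue of watching
    # R-squared fall in linear regression.
    ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
             for p, y in zip(probs, outcomes))
    return -2.0 * ll

print(round(neg2_log_likelihood([0.5, 0.5], [1, 0]), 3))  # 2.773
```

A model whose predicted probabilities track the outcomes more closely produces a smaller -2 log value, which is the sense in which one variable “picks up a greater amount of the variance” than another.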

You are correct that the recency variable could simply be a proxy for pitcher/team ability at preventing hits on balls in play.  To test that, I inserted the pitcher’s seasonal BABIP into the equation.  Even after this, the recency variable continued to be a significant predictor of what was coming next.

And that time dependency element is the new information that I think my work adds.  We can appreciate (and begin to measure) how not only does BABIP skill vary between pitchers, but also over time _within_ a pitcher.  Again, if we allow for the fact that those fluctuations might be more than just random noise, as these results suggest, then perhaps we can map out what factors affect those fluctuations.


#52    Tangotiger 2013/04/12 (Fri) @ 12:43

“using only the recency variable as a predictor”

In light of what I said in Tango/49, I will reiterate that you are implicitly using the league average as the second variable.  You are NOT using “only” recency.


#53    Tangotiger 2013/04/12 (Fri) @ 13:10

In other words: whether Pizza uses “only” recency, or explicitly uses both recency and league average, he’ll end up with virtually the same amount of weight on the recency variable.


#54    TomC 2013/04/12 (Fri) @ 13:15

I think I get it now.  There’s a league average each year (for the qualified PAs he looks at) that’s different each year. You can average those yearly figures to get an overall “league average” for the sample period.  Then he’s looking at PAs and finding that the correlation to the drift of the yearly league-average BABIP from the overall average is about the same as the correlation to the recent performance.  Not that it’s as important as league average, but that it’s as important as the drift in league average (which is obviously a few percent, max, of league average for BABIP).


#55    Guy 2013/04/12 (Fri) @ 13:35

To test that, I inserted the pitcher’s seasonal BABIP into the equation.  Even after this, the recency variable continued to be a significant predictor of what was coming next. And that time dependency element is the new information that I think my work adds.

OK, I agree this could be a new insight.  But it really depends on how powerful the recency data is vs. seasonal/career measures.  Even if you posit a constant pitcher talent (at least at the seasonal level), you’d expect some small extra power from recency simply because it will capture some of whether the next pitch is home/away, quality of opposing hitters, seasonal weather factors, pitcher role (relief vs starter), and probably other factors I haven’t thought of. So the question is, beyond these exogenous factors, can you demonstrate that pitchers’ own talent varies?  And by how much?  Does recency allow us to predict a pitcher’s BABIP with more than an extra .001 or .002 accuracy? 

The one true pitcher factor you are almost certainly measuring is the impact of injuries, which I imagine do impact a pitcher’s BABIP ability for a period of time. But not sure how useful that info is, unless we can use a rising BABIP as an injury flag.


#56    Tangotiger 2013/04/12 (Fri) @ 13:57

I would suggest that this work has more applicability if you did K/PA.  While we ALWAYS weight more recent data more heavily, when it comes to K/PA, you would weight the recent data disproportionately more, and strikingly so.

If you are looking for signs of injury, it’ll be much easier to find in a changing K/PA than in a changing BABIP.

***

And I don’t see the time-dependency as something new.  We’ve done day-to-day Marcels in the past, and any forecasting system weights more recent data more. 


#57    Pizza Cutter 2013/04/12 (Fri) @ 14:26

Guy/55, again using the variance explanation ratio, it was about 80:20 with season BABIP being a better predictor than recency.  The recency data may not be tremendously powerful, but we at least know that it’s statistically separate from the pitcher’s overall talent level.

I would agree that there might be some other factors that are driving those fluctuations that aren’t related to the pitcher himself, although I would point out that with a sampling frame of 100-150 BIP, we’re talking about at least a big enough sample to where some of those external factors would wash out.  Or at least it’s not a silly thing to say, prior to really digging into the issue (which I haven’t yet).

For me, I’m not really interested in BABIP per se.  Predicting to the third decimal place is nice, but I’d rather spend my time understanding how the variance is constructed.  Which factors play the biggest role, and knowing that, how could a team manipulate those factors to their advantage?


#58    Pizza Cutter 2013/04/12 (Fri) @ 14:36

Tango/52-53, In the sense that when I ask the program to give me a best fit formula with only the recency variable in there, it will give me something in the form of y = ax + b.  I’ll probably get b that’s pretty close to the league average, because that’s a good place to start from.  You’re correct that I’ll end up with roughly the same amount of weight on the recency variable, and that amount will be significant and, from what it seems like, fairly considerable.


#59    Tangotiger 2013/04/12 (Fri) @ 14:49

Right, so at least we agreed there’s no such thing as “only” the recency variable.

And, we agree that the pitcher (and/or his circumstances) will contribute towards the future BABIP.

And we agree that more recent data is more relevant than older data.

Here, do me a favor (Pizza or Kincaid), and run this:
TEST#1
x1: BIP 1 through 50
x2: BIP 51 through 100

y: BIP 151 through 200

(I put in a gap, so as to at least ensure we are not correlating same-game information.)

Now, run your process that uses x1 and x2 as the independent variables, and y as the dependent.

What do I expect to happen?  The coefficient for x2 will be BARELY larger than x1.

You can try something more stark:
TEST#2
x1: BIP 1 through 50
x2: BIP 101 through 150

y: BIP 201 through 250

Now, we put even more space between x1 and x2, which should make the recency of x2 have more weight.

My guess is that we’ll be barely able to see anything again.

Finally, really stretch it out:
TEST#3
x1: BIP 1 through 50
x2: BIP 301 through 350

y: BIP 401 through 450

I think we might see something there.  Maybe as much as 10% more weight, but that’s pretty much it.

(All data should be from same-year.)


#60    Guy 2013/04/12 (Fri) @ 14:49

I would point out that with a sampling frame of 100-150 BIP, we’re talking about at least a big enough sample to where some of those external factors would wash out

Yes and no.  For a starting pitcher, about 10% of a 100-BIP sample will come from the current game on average. So that captures a lot of potential influences: catcher, other fielders, umpire, park, home/away, opposing hitters, day/night, temp.  For the other 90% of the sample, they will tend to have similar temperature (seasonal), pitcher health/injury, and role (for mixed-use pitchers). Could those factors account for what you see?  I obviously don’t know, but seems like it could account for a fair amount of what is a pretty small pie to start with.


#61    Pizza Cutter 2013/04/12 (Fri) @ 15:27

Guy/60 - It’s probably better (I go over this in the article) to use either a window of 150 or 160 BIP.  That helps to water down how much conflation we would have.  The good news is that a lot of this info can be found on Retrosheet, so we can compensate for a messy model by having a ton of statistical power.

And I’d argue that some of those are features, rather than bugs.  If, for example, we see that temperature is affecting a pitcher, we may not be able to change the temp, but we might ask whether it’s an issue of the temperature affecting the spin on his pitches or whether it’s a body-comfort issue.


#62    Kincaid 2013/04/12 (Fri) @ 23:04

League average is not a very useful variable because it doesn’t vary much.  If you take it out, its function gets mostly absorbed by the intercept in the regression equation, like Tango has been saying, so the Likelihood Ratio Test won’t pick it up as an important variable.  But, like Tom C said, it’s just that the drift in league average year-to-year isn’t that important, mostly because it’s so small.  If you take the intercept out of the regression equation (which effectively uses the league average variable as the intercept), then the league average variable becomes much more important than the observed data variable.  It’s the intercept that is driving most of the results of the model, which represents the regression toward league average.

*****

Tango/#59

Data is 2003-2012, 1-50 includes all samples from 1-50 up to 50-99.  For example, the first regression includes regressing BIP #200-249 on BIP #50-99 and BIP #100-149.

This is regular linear regression, not logit regression:

Coefficients:
            Estimate Std. Error t value   
(Intercept) 0.274282   0.001312  209.01
1-50        0.032466   0.003188   10.19 
51-100      0.046258   0.003212   14.40

Multiple R-squared: 0.003231, Adjusted R-squared: 0.003211 

------

Coefficients:
            Estimate Std. Error t value
(Intercept) 0.285384   0.001472 193.814
1-50        0.019839   0.003592   5.523
101-150     0.007544   0.003630   2.078

Multiple R-squared: 0.0004508, Adjusted R-squared: 0.0004254 

------

Coefficients:
            Estimate Std. Error t value
(Intercept) 0.275950   0.001912 144.306
1-50        0.027979   0.004603   6.079
301-350     0.031377   0.004506   6.964

Multiple R-squared: 0.001775, Adjusted R-squared: 0.001732
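Plugging Kincaid’s first set of coefficients back in shows how little the observed windows move the prediction (a sketch; the numbers are copied from the first regression output above):

```python
def xbabip_linear(babip_1_50, babip_51_100,
                  intercept=0.274282, b1=0.032466, b2=0.046258):
    # Linear-regression prediction from Kincaid's first test:
    # y = intercept + b1 * (BABIP over BIP 1-50) + b2 * (BABIP over BIP 51-100)
    return intercept + b1 * babip_1_50 + b2 * babip_51_100

print(round(xbabip_linear(0.240, 0.240), 4))  # ~0.2932
print(round(xbabip_linear(0.295, 0.295), 4))  # ~0.2975
```

The two slopes sum to about .079, i.e. the model keeps under 8% of the observed windows and regresses the rest toward the mean, which is the ballpark Tango describes below in #63.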

 


#63    Tangotiger 2013/04/12 (Fri) @ 23:20

Kincaid: excellent, thanks!

The one where we expected recency to have the most effect is the last one.  The coefficient is 12% higher for the most recent group.  That’s right where I expected it.

Now, the next one we’d expect is the second test, but that one is totally off.  That one suggests the OLDER data is more relevant than the more recent one.

Finally, the first test, we expected very little variation, but we got a tremendous amount.

Overall, inconsistent, and I don’t think it points to much.

In any case, we’re talking about taking 3% to 8% of any of the three samples, and regressing 95%, which is what we’ve always expected to happen.


#64    Tangotiger 2013/04/12 (Fri) @ 23:24

And by the way, those standard errors are why regression drives me bananas.  Between the first and second tests, for the 1-50 BIP sample, the coefficients are 4.3 standard deviations apart, and that’s because we changed the second parameter (x2).  That’s how sensitive these things are, and it’s why we get hugely different values, especially for test 2.

Basically, run a regression often enough or in different ways, and you’re bound to find something.

Thanks, Kincaid, for running all that.

