I'll get to the headline question in a second. First, the setup.
Suppose you have two teams: the Expos, who are favoured to beat the Spiders in 52% of their matchups on a neutral site.
When playing at the Expos' home site, the Expos win 56% of the time. When they play as visitors against the Spiders, they win only 48% of the time. This is a typical home site advantage in baseball.
In a winner-take-all game, at the Expos home site, the Expos have a 56% win expectancy.
In a best-of-7 series, where the Expos are home for 4 of the 7 games, the Expos have a win expectancy of 55.6%.
That's right: you may think that there is more randomness in a single game, but that is not true. The Expos have a better chance of winning a one-game series than a best-of-7 series.
ANSWER TO THE QUESTION
Indeed, in a best-of-3, where all the games are at the home site of the Expos, the Expos have a win expectancy of 59.0%.
Do you know how long a best-of-X series needs to be, with an even split of home-site games (except for the last game, naturally), to match that? Would you believe... 27? That's right, a best-of-27 series, with 14 games at the Expos' home site and 13 games at the Spiders' home site, has the same win expectation as a best-of-3 with every game at the Expos' home site.
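If you want to check these numbers yourself, here is a minimal sketch (the function name and structure are mine; the 56%/48% home/road figures are the ones from the setup above, and the ordering of home and road games doesn't change the result):

```python
from functools import lru_cache

def series_win_prob(p_home, p_road, home_games, road_games, wins_needed):
    """Exact probability the favourite wins a best-of series,
    given its single-game win probability at home and on the road."""
    @lru_cache(maxsize=None)
    def go(h, r, need):
        if need == 0:
            return 1.0              # favourite has clinched the series
        if h + r < need:
            return 0.0              # not enough games left to clinch
        if h > 0:
            return p_home * go(h - 1, r, need - 1) + (1 - p_home) * go(h - 1, r, need)
        return p_road * go(h, r - 1, need - 1) + (1 - p_road) * go(h, r - 1, need)
    return go(home_games, road_games, wins_needed)

# Best-of-7, 4 home / 3 road: about .556
print(round(series_win_prob(0.56, 0.48, 4, 3, 4), 3))
# Best-of-3, all games at home: about .590
print(round(series_win_prob(0.56, 0.48, 3, 0, 2), 3))
# Best-of-27, 14 home / 13 road: about .590 as well
print(round(series_win_prob(0.56, 0.48, 14, 13, 14), 3))
```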
Next time someone complains that a 3-game series is too short, or too random, remind them of the above fact.
Having thoroughly refuted, several times, both in my own work and in that of other independent researchers, the idea that spray direction is the missing ingredient in the x-stats, the question remains: what are the missing ingredients?
Someone brought up the case of Isaac Paredes, who is a heavy pull batter. However, there is another attribute of Paredes: he does not hit the ball hard. Now, you may think that the x-stats ALREADY account for the exit velocity. After all, the two main ingredients are launch angle and launch speed. We account for the launch speed. Don't we? Well, once again, I must talk about the difference between modeling a PLAY and modeling a PLAYER. The x-stats, traditionally, evaluate PLAYS. But, since we are interested in PLAYERS, we limit the variables so that we focus on the PLAYERS. In other words, yes, we evaluate each play, one at a time. But instead of considering AS MANY variables as we can that went into that play, we consider AS FEW variables as we can: only those over which the player themselves has a strong influence.
Launch speed is an easy one to include on an event by event level. Launch angle as well (the easiest one that separates groundballs from home runs). The Spray Direction is one that is needed on the play, but is not needed for the player (as we've learned many times). So, we ignore that one. We include the Seasonal Sprint Speed of the runner, as that's important on groundballs.
Which gets us back to Launch Speed. Remember last night, I created a profile of each batter, to establish their Spray Tendency? Well, what if we do the same thing, but with Launch Speed? That is, let's create a profile of a batter based on how hard they hit the ball.
Now, you may think: we ALREADY account for this on a play level, right? Yes, we do. But, what if a 100mph batted ball by Isaac Paredes is different from a 100mph batted ball by Giancarlo Stanton, even when both are hit at 20 degrees of launch? In other words, we want Launch Speed to pull double-duty: we want to know the launch speed on that play, but we also want to know the batter's seasonal launch speed.
So, do we see a bias based on a batter's seasonal launch speed? Yes. Yes, we do.
Here's what I did, so you can feel free to replicate. I'm treating the 2016-2019 years as one season and the 2020-present (thru June 5, 2024) years as a second season. I do this on the idea that a player has a general speed tendency that spans multiple years. This lets me increase my sample size for each season. I also make sure that a batter that hits on both sides is considered two distinct players.
The speed tendency follows the Escape Velocity method for Adjusted speed: greatest(88, h_launch_speed). For every batted ball, I take the greater of the launch speed and 88. And I average that.
Anyway, I use the same Pascal method of binning I did last night, the 10/20/40/20/10 split.
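Here is a minimal sketch of that profiling step, assuming one row per batted ball with a batter id and launch speed (the column names are hypothetical, and the straight percentile split below is just one way to implement the 10/20/40/20/10 grouping):

```python
import pandas as pd

def speed_profile(bb: pd.DataFrame) -> pd.DataFrame:
    """Seasonal launch-speed tendency per batter: average of max(88, launch_speed),
    then a 10/20/40/20/10 percentile split into five groups."""
    bb = bb.copy()
    bb["adj_speed"] = bb["launch_speed"].clip(lower=88)   # greatest(88, launch_speed)
    prof = (bb.groupby("batter_id")["adj_speed"]
              .agg(adj_speed="mean", bip="count")
              .reset_index())
    prof["speed_group"] = pd.cut(prof["adj_speed"].rank(pct=True),
                                 bins=[0, .10, .30, .70, .90, 1.0],
                                 labels=["weakest", "weak", "middle", "strong", "strongest"],
                                 include_lowest=True)
    return prof
```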
So, on to the fascinating results. For the weakest batters, the Paredes and Arraez and so on, their xwOBA was .306, while their actual wOBA was .318. That is an enormous bias of 12 wOBA points. The next weakest batters had .339 xwOBA and .345 actual wOBA for a bias of 6 points.
The strongest batters had an xwOBA of .452 and a wOBA of .442, for a 10 point shortfall. The next set of strongest batters had an xwOBA of .411 and a wOBA of .402 for a 9 point shortfall. The middle group were pretty much even.
Now, before we get TOO excited, what else could cause this? I have a few thoughts, but let me just leave this here for now.
Suppose a team starts the season 3-0, but you had them entering the season to end 81-81. What is your new forecast? If that 3-0 record means nothing at all, then you'd assume they play .500 the rest of the way (in 159 games), and you add 3 wins to that, and you get 82.5 wins. In other words, they gain +1.5 wins in the final season forecast, based strictly on the fact that 3-0 is +1.5 wins ahead of 1.5-1.5.
But, the pre-season forecast can't carry the weight of an infinite number of prior games, such that adding 3-0 to it means nothing at all. Suppose that the pre-season forecast has the weight of three full seasons. Here's what happens to the final season forecast after streaks of 1 to 20 games, and how much information those extra wins give us.
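Here is a minimal sketch of that update, using the three-season (486 game) prior purely as the illustrative weight:

```python
def season_forecast(streak_wins, prior_weight_games=486, season_games=162, prior_win_pct=0.500):
    """Final-season win forecast for a team expected to go 81-81,
    after it opens the season with an unbeaten streak of `streak_wins` games.
    The pre-season forecast is treated as `prior_weight_games` of .500 ball."""
    talent = (prior_win_pct * prior_weight_games + streak_wins) / (prior_weight_games + streak_wins)
    remaining = season_games - streak_wins
    return streak_wins + talent * remaining

for n in (1, 3, 10, 20):
    print(n, round(season_forecast(n), 1))
# With a three-season prior, a 3-0 start forecasts about 83 wins,
# versus 82.5 if the streak carried no information at all.
```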
This three-season weight is an ILLUSTRATION. We need to figure out that weight. My expectation is that that weight is going to be somewhere between 1 and 3 seasons of weight. Aspiring Saberists, Assemble.
From 2016 to 2023, the pitcher who stands out among the lowest BABIPs in MLB is Justin Verlander. With 3196 balls in play, he has allowed 142 fewer hits than the league average rate. That is a substantial number, by far the highest in that time frame. In second place is Kershaw, at 95 hits better than average. In third place is Scherzer, at 83 better than average.
This seems like a perfect refutation of Voros and DIPS, which we discussed in Part 1. If I asked you to name the three best pitchers since 2016, Verlander, Kershaw, and Scherzer could very well make up that top 3. So the fact that potentially the three best pitchers in MLB also happen to have the best results on balls in play is not noteworthy in the least. The next names on the list, however, are Julio Teheran, Cristian Javier, John Means, Tony Gonsolin, Yusmeiro Petit, and on and on it goes. deGrom is 114th out of 654 pitchers. Gerrit Cole is 89th. Aaron Nola is 466th. Wheeler is 296th. The second WORST pitcher on hits on batted balls also happens to be the 8th best in FIP-based WAR: Kevin Gausman. These 8 pitchers named above, by the way, lead in WAR on Fangraphs, a metric that ignores balls in play. Looked at holistically, this better frames the original issue Voros found: how much attribution can we possibly give the pitcher on balls in play?
Part of the problem we have is how I even introduced it in the first paragraph. I said Verlander is among league-lows in BABIP. But more accurately, we should say that Verlander AND HIS FIELDERS are among the league-leaders. We can't just bypass his fielders. And we still have the issue that so much of what happens on batted balls goes beyond the pitcher and his fielders and their park. Random Variation weighs heavily, in ways that you don't see in other stats.
Ok, let's get into it. We can use Statcast data to directly determine the contributions of the pitcher. We can look at their launch angle and speed to determine how effective they are. When we do that, we can see that Verlander gives up a lot of soft batted balls, to the point that he ALSO happens to be the best pitcher in baseball since 2016 on launch-based hits on balls in play. Well, take that Voros! Except, well, the magnitude is not there. When we look at Verlander and his fielders, their BABIP suggests 142 fewer hits than league average. But when we look at Verlander and his allowed launch angle and speed, that suggests 76 fewer hits than league average. The breakdown for Verlander looks like this:
+22 fielders with Verlander
+76 Verlander using launch angle+speed
+44 everything else, including Random Variation
====
+142 Verlander's team, when Verlander is on mound
Here is Kevin Gausman:
-15 fielders with Gausman
-68 Gausman using launch angle+speed
-2 everything else, including Random Variation
====
-85 Gausman's team, when Gausman is on mound
Gausman is interesting in that we can explain the ENTIRETY of the poor BABIP with himself and his fielders. He's had the bad luck of having poor fielding behind him, 15 hits worse than average. But the rest of the outcomes are because of Gausman himself. Relying on FIP-only for Gausman would not be a good idea.
We can do this for every pitcher. I will show you a chart (click to embiggen), that shows, on the x-axis, how well each pitcher, and their team, do, compared to league average. You can see Verlander on the far-right and Gausman on the far-left.
On the y-axis is the direct contribution of each pitcher. In some cases, the two correspond, like with Verlander and Scherzer and Kershaw and Gausman. In other cases, there is little overlap. Take for example Adam Wainwright:
+10 fielders with Wainwright
-75 Wainwright using launch angle+speed
+26 everything else, including Random Variation
====
-39 Wainwright's team, when Wainwright is on mound
This is a mess to resolve. Wainwright has been hit very, very hard. Indeed, he's been league-worst, using launch angle and speed, at 75 more hits allowed than league average that we can directly attribute to his launch characteristics. He's had the good fortune of playing with good fielders. When he was on the mound, they made 10 more outs than average fielders would have. There were another 26 extra outs that we can't attribute to the pitcher or his fielders. Whether this is Random Variation, or it's the fielding alignment mandated by his coaches, or Wainwright somehow managing to get more balls hit closer to his fielders, we can't really tell. All in all, Wainwright's team, with Wainwright on the mound, only gave up 39 more hits than league average.
Sometimes you get into issues like Zach Eflin, who is better than league average on how hard he is hit, and yet is worse than league average when he is on the mound with hits on balls in play. Do we really want to attribute to Eflin things that he has no control over, simply because he happens to be on the mound when those bad things happen? Why not attribute some of that to his fielders, who are equally not-complicit, but are equally present? Or maybe, stop attributing things that we don't know who to attribute to, simply because we've identified they are present? Sins of the Father and all that.
This is what it looks like if you compare the direct contribution of the pitcher, using their launch angle and speed, to the "everything else" I've been talking about. As you can see, virtually no correlation. In other words, after having identified the contribution of the pitcher directly by how hard they are being hit, whatever is left over has no association to that. Whatever is left, which is going to be mostly Random Variation, has likely very little to do with that pitcher.
When you look closely at the first chart, we can come up with the general point: about half of the results we can attribute to the pitcher. In some cases more, in some cases less. In some cases, there's a reverse effect (like Eflin). But, if we simply use as our starting point that we'll count half of the outcome and give it to the pitcher, then we've taken a big step forward in better attributing outcomes to the underlying contribution.
Should we completely ignore hits on balls in play? No. The pitcher is not a pitching machine. There is some influence there.
Should we completely accept hits on balls in play? Also no. The pitcher is not in total control here. There's a lot happening that has no bearing on the pitcher.
Should we split the difference, give them half, and move on? For the pre-Statcast years: yes. Without any additional information as to their direct impact, then we have to infer their impact. And it's about half of what you see. Basically, BABIP is somewhere between Pitching Machine 4587 and that pitcher. And that's how much attribution we should give the pitcher. In Statcast years, we have more information, and so we can better attribute the impact of the pitcher to the outcome when they are present.
We can of course be a little fancier, and figure out fielder influence as well, but that's a story for another thread.
Twenty years ago, Voros shook the saber community with one of the most important saber discoveries to that point, and still a top ten discovery of all saber-time. He called it DIPS, or Defense-Independent Pitching Statistics. My tiny contribution to that was FIP, which is merely a shortcut to the full-fledged DIPS. Had I not invented FIP, Voros would have eventually created it anyway.
The illustrations that Voros provided were extremely compelling. In 1999 and 2000, Pedro Martinez had perhaps the greatest stretch of two pitching seasons ever, in the history of baseball. It's difficult to even decide which of the two seasons was the better one. His ERAs were 2.07 and 1.74, and this is in the middle of the high scoring era. He had 313 strikeouts in one of the seasons and 284 in the other. And this is while pitching only 213 and 217 innings each season. In the season where he gave up 32 more hits, he also gave up 8 fewer HR. All in all, it's hard to decide which of the two seasons was better, and in any case, the two stood together as perhaps the best pitching seasons back to back.
What did Voros point out? If you remove the strikeouts and homeruns, and compare the non-HR hits to all remaining batted balls, what he called BABIP (batting average on balls in play), Pedro had close to a league-low .236 one season and close to a league-high .323 the other season. This seemed ridiculous on its face. How could perhaps the greatest pitcher ever, having one of his two best pitching seasons ever, allow hits on balls-in-play at a close to league-high rate? And how did he pair that up with a league-low rate in the other season?
This would suggest that allowing non-HR hits on balls-in-play might be pretty random. After all, Pedro would not pair a league-leading strikeout rate one season with a league-low strikeout rate another season and STILL be one of the best pitchers ever. You couldn't do that with walks either, or homeruns. It just doesn't work like that. But, non-HR hits on balls-in-play? Well, it happened. And it wasn't just Pedro either. While pitchers had a fairly stable SO, BB, HR year to year, their BABIP fluctuated greatly.
In retrospect, we should have known. Because Random Variation would have told us. But, no one ever looked, not until Voros. The key point of his discovery is that Voros created the denominator: balls in play. That was the key. Once that was done, then you could apply basic statistical principles to determine how much Random Variation could have impacted BABIP. Assuming 500 balls in play, then one standard deviation was roughly 0.46 divided by root-500 or 20 points. Two standard deviations is 40 points. So, going from 2 standard deviations worse than average to 2 standard deviations better than average is not that noteworthy from a performance standpoint. Look hard enough, and someone will do that year after year. In 1999-2000, that just happened to be Pedro. Even Pedro was subject to Random Variation.
Still, what do you do with this information, that Pedro had a .323 and .236 BABIP in back to back seasons? This is where you get into ATTRIBUTION and IDENTIFICATION. Suppose that pitching was done via pitching machines. And through Random Variation, you will end up with some games with 3 hits and other games with 13 hits. Nothing changes. It's the same machine, the same opposing batters, the same fielding alignment. Nothing changes. Except, because of Random Variation, you will get a random result of hits. We've identified the entity on the mound (Pitching Machine 4587). But do we attribute the results to that machine? Or, is the machine simply inconsequential?
Now, humans are different: they are humans. And when it comes to human behaviour and human talent, they can influence results. Now, just because they can influence SOME of the results, doesn't mean they can influence ALL the results. We can identify who the pitcher is on the mound, but do we attribute everything that happens to the pitcher? After all, we have human fielders involved, and we have the vagaries of the park and weather that day. The batters change, and heck, every ball is like a snowflake: no two balls are alike.
Just because we've identified Pedro, and we've calculated a BABIP of .323 one season and .236 another season doesn't mean we attribute all of that to Pedro. There's other entities involved here. Pedro cannot possibly absorb all those outcomes, given that he's one influence.
At the time, twenty years ago, I was involved in a discussion and research effort called Solving DIPS, which basically determined, through basic statistical principles, that Random Variation was the largest agent, while the pitcher and fielders were also significant agents, as was the park.
Next up: we'll set aside all that theory and look at things more factually.
We can look at the win expectancy to make the call, if you want to be more objective. That game was in the top of the 10th, with one out, and a runner on 2B. Getting a run-scoring single adds .26 wins, while a first-and-third single adds .10 wins (over and above the first-and-second default scenario). Let's say therefore a single would add +.18 wins. An out drops it by .09 wins. So, the breakeven is getting a basehit 33% of the time (which is a wOBA of .300).
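As a quick check on that breakeven arithmetic, here is a minimal sketch (the .18 and .09 win values are the ones quoted above; the ~0.9 wOBA weight for a single is an approximation):

```python
def breakeven_hit_rate(win_gain_on_hit, win_loss_on_out):
    """Hit rate at which swinging is win-neutral: p * gain = (1 - p) * loss."""
    return win_loss_on_out / (win_gain_on_hit + win_loss_on_out)

p = breakeven_hit_rate(0.18, 0.09)
print(round(p, 3))          # ~0.333, a basehit one-third of the time
print(round(p * 0.9, 2))    # ~.30 wOBA, using ~0.9 as the wOBA weight of a single
```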
To think of it more simply: an IBB is typically win-neutral. Whether the batter gets an IBB or is allowed to be pitched to normally, the end result is typically the same. So, all that Cabrera has to be thinking is whether he can match his true talent wOBA. Assuming he's .400, that's what he needs to be hitting. And getting a hit 45% of the time is a .400 wOBA. So, if Cabrera thought in the moment, "I can get a hit here at least 50/50, I am swinging", then tip of the cap for being faced with a situation that you likely never prepared for, and letting your instincts take over.
Poisson is a great distribution to use, when you have low probability events. Hockey scoring for example follows a (mostly) Poisson distribution. In baseball, a low probability event would be a Pitch Timer infraction: there's about one infraction per 1000 pitches, which certainly qualifies as a low probability event.
Poisson has a useful property: the mean of the population is ALSO the variance of the population. And since the variance is the square of the standard deviation, this means we also know the standard deviation. The average number of infractions is about 20, so that sets the standard deviation as root20 or about 4.5 infractions. Two standard deviations covers about 95% of the data (which in this case is about 28-29 clubs), and so we expect almost all the clubs to be +/- 9 from the mean. Or a range of 18 infractions, top to bottom, after removing the 1 or 2 extreme clubs.
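Here is that arithmetic as a minimal sketch, using the roughly 20 infractions per club noted above:

```python
import math

mean_infractions = 20                       # league-average infractions per club
sd = math.sqrt(mean_infractions)            # Poisson: the variance equals the mean
lo, hi = mean_infractions - 2 * sd, mean_infractions + 2 * sd
print(round(sd, 1))                         # ~4.5 infractions
print(round(lo), "to", round(hi))           # ~11 to ~29, an 18-infraction spread
```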
And, what do we find? After removing the Rays (31) and Mariners (8), the remaining 28 clubs range from 29 to 11 infractions, a range of... 18 infractions. That's the exact same number we'd expect from Random Variation.
In other words, the distribution of values we are seeing is exactly consistent with what you'd find from a Poisson Distribution, whereby there is no time-attentive skill (at the club level).
Remember: the job of a saberist is to look at the data and remove Random Variation. If you don't do that, then you CANNOT draw any conclusions from the data. Data is just data. Evidence is data with Random Variation removed. And the above data provides no evidence of a time-attentive skill at the club level.
Back in 2006, I developed a rather simple method to establish the true talent level of sports leagues (ANY sports league). For baseball specifically, and I'll use the A's as an example: you add 35 wins and 35 losses (a 35-35 record) to their current record, which you then convert to a win%. That's it.
So, for the A's, their 10-38 record (aka their OBSERVED record) gets a 35-35 record added (aka their PRIOR). Together that gives us 45-73, which as a percentage is .381 win%, their true talent record, their rest-of-season record (aka their POSTERIOR). With 114 games remaining, at a .381 win%, that gives us 43.5 rest-of-season wins. And so, our expectation is they will end the season with 53-54 wins.
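Here is that calculation as a minimal sketch:

```python
def rest_of_season(wins, losses, prior_wins=35, prior_losses=35, season_games=162):
    """True-talent win% from adding a 35-35 prior to the observed record,
    plus the implied final win total."""
    talent = (wins + prior_wins) / (wins + losses + prior_wins + prior_losses)
    remaining = season_games - wins - losses
    return talent, wins + talent * remaining

talent, final_wins = rest_of_season(10, 38)    # the A's example above
print(round(talent, 3), round(final_wins, 1))  # .381 and about 53.5 wins
```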
Now, historically, how well has this method held up with such extreme teams? Well, thanks to Dan at Fangraphs, we can test our theory. I'm going to use only teams since 1950, which gives me 19 teams. Those teams averaged 11.8 wins in 48 games. Our expectation for their rest-of-season win% is .397, for a final Won-Loss record of 54-101. And what was their actual final Won-Loss record? Would you believe.... 54-101.
In other words, this method, developed 17 years ago to handle a broad set of teams, works perfectly with the most extreme of extreme teams.
Last week, I saw a Twitter poll asking if a player who hit a homerun was a baserunner. From there, I generated my own polls, all starting with the same premise: you sent 26 batters to the plate, all 26 were retired, then the 27th batter did something unusual. And so the question at the end was: how many baserunners did this team have for this game?
If you have 27 ground outs, I think we can all agree that we have no baserunners. Even though the batter-runner is running, he's in the runner's lane, and the defense got the batter-runner out. But if he was safe, he's now a baserunner for the next batter. If he got to second base on a double, he is no longer a batter-runner, but a runner going from 1B to 2B (no different than any other runner starting at 1B). If he got thrown out trying to stretch it into a double, he was a runner thrown out at 2B.(*)
(*) Though you can also argue that the runner has to still be on the bases for the next batter. So your typical homerun has no baserunning. Being thrown out at second base on your own turn at bat has no baserunning.
So, it would seem that once the batter SAFELY reaches first base, that establishes that we now have a baserunner.
However.
Let me provide two examples, and you tell me how you see it. You have a fence-clearing hit, where the excited batter skips over first base. The defense will appeal the play at 1B, and the batter is out. The batter gets no HR, no single, no basehit of any kind. There's no walk or hit batter or error or defensive interference. There is nothing positive that happens here in the record book. We have a 27-up, 27-down perfect game. And so, no baserunners. The batter never safely reached first base, never claimed it.
Now, how about a 4-ball walk, with an errant pitch. The batter is awarded first base by the umpire, and in his excitement to try to get to 2nd base on the errant pitch (the ball is live after all), he skips over first base. The defense appeals the play at first base, and the player is out. The exact same thing has happened as with the fence-clearing hit.
Except.
Well, this is officially a walk, in the scoring rules. In both cases, the walk and the fence-clearing hit, the batter has the right to go to first base without chance of being put out. The batter however has the obligation to touch first base. Once he skips over that obligation, his right to first base no longer exists.
This now becomes a discussion of scoring rules. There's nothing to credit the batter for the fence-clearing hit. You can't give him a single, because he never touched first base. We can give him a walk for the 4 balls, because the umpire awards those on the spot, regardless of whether the batter does anything with it.
And so, when it comes to deciding what is a baserunner, are we necessarily tied to the idea that it must be based on the scoring rules applied to the batter? Or, can we say that in either of these extreme cases, the team did not have a baserunner at all? They both did the exact same thing (skipped over first base, out on appeal at first base). They both had the right to go to first base without being put out. The only difference is we have a scoring rule category for one, and not the other.
I see this question often, especially when you get one of those scenarios where a team had a 99% chance of winning, and then they end up losing the game. And it seems to happen more often than you would think. Except, well, it doesn't.
I have a win probability model for baseball, which is available on a few pages on my site as well as in The Book. It's the same model that Fangraphs and Reference use on their sites, and that we use on Savant. So, I decided to check to see how often a team was predicted to win 99.5%+ of the time and how often they did actually win.
Since 1998, we've had 4.6 million plate appearances. Of those, on over 140,000 occasions the expectation was that the home team would win 99.5%+ of the time (for an average of 99.84%). And how often did they actually win? 99.90%. Based on this single data point, we were a tiny bit conservative. I repeated this at every step, from 0% to 100%, a total of 101 bins. The average bin had 45,000 plate appearances.
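A minimal sketch of that calibration check, assuming one row per plate appearance with the model's pre-PA home win probability and the eventual outcome (the column names are hypothetical):

```python
import pandas as pd

def calibration_table(df: pd.DataFrame) -> pd.DataFrame:
    """Bin predicted home-team win probability into 101 bins (0%..100%) and
    compare the average prediction to the actual win rate in each bin.
    Expects columns: pred_home_wp (0 to 1) and home_won (0 or 1)."""
    df = df.copy()
    df["bin"] = (df["pred_home_wp"] * 100).round().astype(int)
    return (df.groupby("bin")
              .agg(n=("home_won", "size"),
                   predicted=("pred_home_wp", "mean"),
                   actual=("home_won", "mean"))
              .reset_index())
```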
And as the chart below shows: the model is close to perfect. This is the first time I've tested it like this, and I'm not surprised by the results. The model itself is a straightforward recursive probability model. Given the limited number of states in baseball (24 base-out states), there's really not much that could go wrong.
I can't really speak to football or hockey, especially for those very very end of game scenarios, where the clock is huge, and the playing approach changes drastically. But hockey follows Poisson, which really means it would be straightforward to create a model to handle various scenarios. I don't know anything about football, but given the talented folks who have worked on this, I presume their models are quite good.
Last night, I compared past ERA to past FIP, to see at which point in a player's career past ERA overtakes past FIP in being more instructive of a pitcher's future career ERA. That point never arrived, much to my surprise. I had reasoned that since hits on balls-in-park (BABIP) will provide some signal after 5-8 years, that by 10+ years of a pitcher's career it would be better to fully consider BABIP than to fully ignore it. And that by 10+ years, all the Random Variation in ERA due to sequencing would have finally subsided. I wasn't sure then why I was wrong, but I was wrong.
So, instead of relying on past ERA, which has so many things going into it, this time I will focus just on past wOBA. The difference between past FIP and past wOBA is that FIP ignores BABIP while wOBA embraces it. There is exactly one difference between the two: BABIP. Therefore, in comparing past FIP and past wOBA (converting wOBA into an ERA-scale naturally), we can see exactly when BABIP will matter.
Just to give you the answer so there's no suspense: it doesn't. wOBA never overtakes FIP, much to my even greater surprise. To the extent that we take an all-or-nothing position on BABIP: we should take nothing, always. We'll try next time a weighted approach. Not here though.
How about comparing wOBA to ERA? Surely we expect wOBA to be better than ERA in the short-run, and that there's a point at which ERA takes over wOBA? Yes! Finally, we get a payoff here. wOBA is preferred to ERA up until 90 preceding games, so about 810 IP. Afterwards, ERA takes a very slight lead. Not enough that it pulls away from wOBA, but enough that wOBA is the laggard among FIP, ERA, and wOBA, when we consider 5+ seasons.
The difference between wOBA and ERA is that wOBA ignores sequencing of events, plus those other aspects of baserunning (SB, CS, WP, etc), and a few other things that go into ERA (DP, pitcher fielding, etc). So, up until 4 seasons of data, we much prefer wOBA over ERA. And then after that, ERA starts to take over albeit only slightly.
Here's the chart (click to embiggen). I also zoomed in on the scale on the left so we can better see all three lines.
Update
The weighted approach is noted in comment #1 below. I have also included that weighted chart here:
This is going to end with good research, so if you want to jump straight to that, go ahead. Otherwise, bear with me as I lay down the whole premise.
Thesis
As many of you know, Fangraphs uses FIP as its core metric for their implementation of WAR (fWAR), while Reference (rWAR) uses ERA (or more precisely RA/9). I've been proceeding on the basis that FIP is a more accurate representation for pitchers for a single season, while ERA or RA/9 is more accurate for their careers. And, I've been guessing that the point at which they both are equally relevant is about 5-8 seasons. If we think of a season as 200 IP, that's about 22 9-inning games. And so, 5-8 seasons is 110 to 176 games.
So, with 1 to 3 seasons, I've been suggesting weighting it as 75% fWAR and 25% rWAR, while for 10+ seasons, weighting it as 75% rWAR and 25% fWAR. Something along those lines. And just as an educated guess.
Reasons
Why do we think ERA would take over FIP? Since FIP completely ignores everything outside of SO, BB, HBP, HR, it stands to reason that eventually all those missing things would act as a signal to break free from the small sample size noise. What would those things be? Well, we have those 70% of plate appearances that are balls-in-park. That would seem to be a big deal. Eventually, a pitcher's BABIP is going to start to matter, more than totally ignoring it as FIP does. There's a bunch of other little things involving baserunners: SB, CS, PK, BK, WP, PB. There's also the sequencing of events, whether a pitcher scatters hits across the game or gets them clumped into one inning.
There's even more. In fact, I was reading an old article from Bill James last night on an unrelated issue (Mussina and Hoffman for the Hall of Fame), and he was discussing this very topic. He actually points out everything that ERA captures that FIP ignores. Let me quote the relevant part:
======== Start Bill James ======================
... some people who figure WAR may be (again, not certain who is doing what) but some people may be calculating value based not on actual runs allowed, but on formula estimates of the number of runs the pitcher person could have been expected to allow based on his peripheral numbers. Trevor Hoffman has a 2.87 career ERA, but 3.08 FIP (Fielder Independent Pitching) based on his strikeouts, walks, and home runs allowed, whereas Mike the Moose has a 3.68 career ERA but a 3.57 FIP based on his peripherals.
Again, questionable adjustment. Pitchers do lots of things that have SOME impact on their ERA, other than getting strikeouts and allowing walks, home runs and hit batsmen. They also:
a) Hold baserunners well or poorly,
b) Pick runners off,
c) Field their position,
d) Throw Wild Pitches,
e) Induce ground balls, and
f) Pitch to the situation at a certain level.
When figuring data for a SEASON, the discrepancies between ERA and FIP are probably mostly random factors which the pitcher does not control, and we should probably prefer FIP to ERA. But when dealing with CAREER records of several hundred innings—and we are dealing with career records here—when dealing with career records, it is much more likely that the discrepancies between ERA and FIP are created by factors (a) through (f) above, and then the actual ERA is almost certainly the more instructive number. In a more sophisticated sabermetric analysis, we would rely more on FIP when dealing with small data groups, but we would rely more on actual ERA when the number of innings for an individual pitcher is larger.
======== End Bill James ======================
So, this exactly matches my view. Well, until today. Bill and I, and probably hundreds of you out there, probably have the same instinct on this matter. At some point, real baseball things start to matter. Our job as saberists is to figure that point out. Bill didn't take a guess as to when the swap would happen, when ERA is preferred to FIP, other than that oblique reference to several hundred innings. But, conceptually, he and I are aligned. Well, as I said until today.
The Research
Let me now discuss my research. For each pitcher, starting in 1998, I asked this question: What is their career ERA and career FIP to-date, and how many outs did they record to-date? I asked this question after every single one of their games, for every single pitcher, from 1998 to 2021.
For example, as of May 30, 2008, Justin Verlander had pitched in 76 games. At that point in his career, he had recorded 1422 outs and allowed 209 ER, for a to-date ERA of 3.97. AFTER THAT, he's recorded 8067 outs and allowed 930 ER for an ERA of 3.11. Verlander has 454 games through 2021, so that's how many such answers I have for Verlander. I repeated this for every pitcher since 1998.
For this particular May 30 game, with a career to-date of 1422 outs, I convert that to an equivalent number of games by dividing by 27 outs per game. That number is 52, which means the equivalent of 52 9-inning games. So, in his preceding 52 games, he has an ERA of 3.97. There are 346 pitchers in this pool of 52 preceding-games. Running a correlation of their to-date ERA to their rest-of-career ERA, we get r=0.35. Repeating this for their to-date FIP to their rest-of-career ERA, we get r=0.47. Conclusion? Given 52 9-inning games, to-date FIP is more instructive than to-date ERA in determining the rest-of-career ERA. This is not a surprise, since this is the equivalent of two or three seasons.
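A minimal sketch of that loop, assuming a table with one row per pitcher-game carrying the to-date and rest-of-career figures (the column names are hypothetical):

```python
import pandas as pd

def instructiveness_by_games(df: pd.DataFrame) -> pd.DataFrame:
    """For each count of preceding 9-inning-equivalent games, correlate
    to-date ERA and to-date FIP against rest-of-career ERA.
    Expects columns: outs_to_date, era_to_date, fip_to_date, era_rest_of_career."""
    df = df.copy()
    df["eq_games"] = df["outs_to_date"] // 27     # 27 outs = one 9-inning game
    rows = []
    for g, grp in df.groupby("eq_games"):
        if len(grp) < 30:                         # skip tiny pools
            continue
        rows.append({"eq_games": g,
                     "pitchers": len(grp),
                     "r_era": grp["era_to_date"].corr(grp["era_rest_of_career"]),
                     "r_fip": grp["fip_to_date"].corr(grp["era_rest_of_career"])})
    return pd.DataFrame(rows)
```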
So, I repeated this for 53 and 54 and 55... and, well, you name it. Up until game 137, the to-date FIP always has a higher correlation than to-date ERA for future ERA. Since 137 games is right in my wheelhouse of expectation, this was quite comforting. Ok, so the swap happens there, right, that's the point when ERA takes over? No, not exactly. It actually bounced back and forth from game 138 to game 173. So in-between these points, ERA and FIP are equally instructive for the future. It is only when we look at the preceding 174 games (roughly 8-9 seasons) that ERA takes over. Well, takes over until 185 games. After that point, FIP takes over again.
The Conclusion
Trying to be conservative, I'll call it a draw starting at game 138. So, once you have 138 preceding games of 9 innings, or 1242 IP, that's the point that ERA and FIP are equally instructive, and it stays that way no matter how much more information you get. So, I was wrong that a swap would eventually happen. It doesn't. FIP is preferred for one season, and for a few seasons. And it's only when you have 7+ seasons that FIP and ERA are equally preferred.
The points Bill James brought up, as reasonable as they are, as much as I've believed in something like them for the last twenty years, well, this particular evidence doesn't support it. I'm looking forward to the Aspiring Saberists seeing what they can do here.
Here's the complete chart, from 1 preceding game to 217 preceding games. (click to embiggen) FIP takes a huge early lead, and holds it strongly for many years, until FIP and ERA are basically even, and they stay there.
Addendum
I need to reiterate that FIP is a DESCRIPTIVE stat, not predictive. It uses actual data and simply weights it and scales it. That's all it's doing. It is no different than SLG that weights various events (singles, doubles, triples, homers) and ignores the rest (walks, hit batters, steals, wild pitches, etc). SLG isn't saying those things aren't important. It simply says that it doesn't need it for what it's trying to do. OBP is not making a claim that walks and HR are equal in impact, just that for what it's measuring, getting on base, they are equal. So, that's what FIP is: it is DESCRIBING one facet of pitching.
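For reference, here is the standard FIP calculation, which just reweights the pitcher's actual SO, BB, HBP, HR totals and puts them on an ERA scale (the constant is league- and year-dependent, roughly 3.1):

```python
def fip(hr, bb, hbp, so, ip, fip_constant=3.10):
    """Fielding Independent Pitching: a descriptive reweighting of the
    defense-independent events, put on an ERA scale by the constant."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + fip_constant
```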
And given the above research, ERA just contains too much random variation, likely from sequencing. If we want to include hits on balls-in-park, then let's include it in some other metric, and not rely on ERA to capture that information. Same with SB and WP and DP and whatnot. Let's find out what is keeping ERA from really having the impact it should have. My guess is the sequencing of events predominantly. That's the likely source of the Random Variation. And hits on balls-in-park is probably a close second.
I'm always interested in saber research, constantly seeking it out. The latest one I found, on Reddit, is called 3D wOBA. First off: what a great name! I wish I had thought of it.
Anyway, the researcher notes that xwOBA is principally driven by speed and angle, and notes the lack of use of the Spray Angle. Which is true. And intentional.
As I've said many times in the past: it's a question of describing the PLAY or the PLAYER. Why is FIP popular? Because it describes the PLAYER. Doesn't giving up 0 hits in a game mean it's a great game? Yes. But, does it mean it's a great PITCHER? No, not necessarily. That's because hits that stay in the park are subject to a great deal of random variation having nothing to do with the pitcher himself. You have the fielders, the fielding alignment, and the park. Not to mention most pitchers are so similar in BABIP talent that it requires a GREAT number of batted balls to find the signal amongst all that noise.
Anyway, back to the matter at hand. The researcher very (very) helpfully provided the data. And so, it took just a couple of minutes for me to do the test I needed to do: compare wOBAcon, xwOBAcon, and 3DwOBAcon to NEXT SEASON'S wOBAcon. Why do we want to do that? Because that data is unbiased. It's describing the PLAYER. And that is what I care about. And really, when you think about it, most of the time, that's what you care about too.
Anyway, here are the correlations. A straight wOBAcon to wOBAcon correlation is an r=0.55 (the sample had an average of 359 batted balls). This gives us a ballast value (the regression amount) of 293 batted balls.
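The ballast comes straight from the correlation and the sample size: if r = N / (N + ballast), then ballast = N * (1 - r) / r. A quick check:

```python
def ballast(r, n):
    """Regression amount implied by a correlation r over n trials: r = n / (n + ballast)."""
    return n * (1 - r) / r

print(round(ballast(0.55, 359)))   # ~294, in line with the ~293 above (difference is rounding of r)
```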
How about the xwOBAcon from Savant (as shown in their spreadsheet anyway)? That's a correlation of r=0.56. We learn a little bit more, but not much more, than just using their actual performance. But at least, directionally, it's where we want it.
Now, finally the superbly named 3DwOBAcon, how did it do? Correlation of r=0.52. Wait, it's LOWER? Yes, it is. And this is very typical when you overfit your data. You are so focused on trying to explain THAT PLAY that you ignore what really matters here: the players themselves.
When you run a correlation to same-plays, the Savant xwOBAcon has a correlation of r=0.83 while 3DwOBAcon is slightly higher at r=0.85. The good thing here is that their model, as they've wanted to tune it, works fine. The extra dimension, the spray angle, does in fact better help describe the play in question.
But, from the PLAYER perspective, the spray angle is mostly noise. And so when you use that information as a critical component to describe the player TALENT, and so be able to predict next season, you are introducing noise into that prediction. It's like trying to use ERA to explain ERA next season instead of FIP. Or using win% to predict next season's win% instead of using ERA to predict next season's win%.
And this is why, by and large, we don't use BABIP to evaluate pitchers. And this is why, by and large, we don't use spray angles to evaluate batters.
This is an article I wrote back in 2008 (first published at The Hardball Times, then subsumed into Fangraphs). It deals with aging curves, as applied to fielders (SS specifically).
It uses an approach I favor, the Matched Pairs, to control for as many of the variables as I could (except for the one variable we are interested in, namely age).
It uses the method that I like the most, the Delta Method.
In addition, to get the full curve, I use the Chained Delta Method.
Finally, because of selection bias, I had to use regression (at least in a simplistic way, to show the effect). So, that would be the Regressed Chained Delta Method using Matched Pairs.
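A minimal sketch of the core (unchained, unregressed) Delta Method step, assuming a table with one row per player-season (the column names are hypothetical):

```python
import pandas as pd

def delta_method(df: pd.DataFrame) -> pd.DataFrame:
    """Average year-over-year change in a rate stat, by age, using matched pairs:
    the same player observed in consecutive seasons, weighted by the harmonic
    mean of his playing time in the two seasons.
    Expects columns: player_id, age, rate, weight (e.g. PA or innings)."""
    df = df.sort_values(["player_id", "age"])
    nxt = df.groupby("player_id").shift(-1)
    pairs = df.assign(next_age=nxt["age"], next_rate=nxt["rate"], next_weight=nxt["weight"])
    pairs = pairs[pairs["next_age"] == pairs["age"] + 1].copy()    # consecutive seasons only
    pairs["w"] = 2 / (1 / pairs["weight"] + 1 / pairs["next_weight"])
    pairs["wd"] = (pairs["next_rate"] - pairs["rate"]) * pairs["w"]
    out = (pairs.groupby("age")
                .agg(pairs=("wd", "size"), wd=("wd", "sum"), w=("w", "sum"))
                .reset_index())
    out["avg_delta"] = out["wd"] / out["w"]
    # Chain the avg_delta values cumulatively (and regress for selection bias)
    # to turn these per-age deltas into the full aging curve.
    return out[["age", "pairs", "avg_delta"]]
```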
Anyway, I'm posting it here because I had a heckavu time finding it myself, and it's a good primer on doing an Aging Curves study.
When a true .550 team faces a true .520 team, the expected one-game winning percentage for the better team is about .530. This can be easily shown using the Odds Ratio Method. But for ease of explanation, it's enough to say that the .550 team is 30 points ahead of the .520 team. And so, they will win 53% of the time. That's for a neutral site game.
The historical winning percentage for the home team (aka the home team advantage) is about .540. This means that if you expect to win 53% of the time in a neutral-site game, then you would win 57% of the time at home. That's for a single home game.
What happens with a three-game series at home? How often would such a team win that series? The chance of winning back to back games is .57 x .57. The chance of losing the first, then winning the next two is .43 times .57 x .57. The chance of winning the first and third and losing the second is the same thing.
Add them up:
.3249 = .57 x .57
.1397 = .43 x .57 x .57
.1397 = .57 x .43 x .57
And we get .604. So, the chance of a true .550 team beating a true .520 team, in a three-game series, with all three games at home, is 60.4%. Remember that number, .604, as this is our target.
Now, what if it was two games at home, and one on the road? For ease of description, we'll assume the first game is on the road, and the next two are at home. But the ordering really doesn't matter. Our better team is expected to win 57% of the time at home and 49% on the road.
Add them up:
.2793 = .49 x .57
.1657 = .51 x .57 x .57
.1201 = .49 x .43 x .57
And we get .565. So, the chance of a true .550 team beating a true .520 team, in a three-game series, with two games at home and one on the road, is 56.5%.
We can create a simple shortcut by taking two-thirds of 57% and one-third 49% to give us 54.3% as the constant single-game winning percentage. And plugging in .543 single-game winning % gives us a 3-game series winning percentage of 56.5%.
We can repeat this for a 5-game series (3 at home, 2 on the road), 7-game series and so on. This is what we get as the chance that this better team will win the playoff series:
.570 1-game
.565 3-game
.571 5-game
.578 7-game
.584 9-game
.590 11-game
.596 13-game
.602 15-game
Remember I said to remember .604? Well, there you go. When you have one team that is 30 points better than the other team (in this illustration, we use .550 v .520), the better team will win 60.4% of a three-game all-home series. And the better team will win 60.2% of a 15-game 8/7 home/away split series.
In other words, the key here is that the length of the series is not what really matters. What really matters is the home site advantage. You can especially see this in comparing the 1-game to the 7-game series.
Now, if you want to argue that the home-site advantage isn't what it traditionally is, that maybe there are reasons the home-site advantage shouldn't be what it is in a playoff series, that's fine. That's a separate argument (and one you'd have to prove, not just assume).
But the fact of the matter is that to the extent the home-site advantage exists, it compounds itself in an all-home series.
Also note that all this is very sensitive to the gap between teams. Instead of a 30 point gap between the two teams, suppose we had a 50 point gap. In that case, an all-home 3-game series would lead to a .634 series win%. And this is how it looks with the traditional playoff setup:
.590 1-game
.594 3-game
.608 5-game
.620 7-game
.632 9-game
So you can see that in this case, a 9-game 5/4 split series would give you the same expectation as a 3-game all-home series.
Since 1999, there have been 23 batters(*) with at least 3000 plate appearances that passed through an 0-1 count. By the time their plate appearances ended, their wOBA was 33 to 59 points below their career wOBA. In other words, none of these players ended up performing better when passing through an 0-1 count than at 0-0.
(*) For purposes of this study, I broke up the players based on their bat-side and the opposing pitch-hand.
The league average drop is 47 points. What should we have expected in terms of range, based purely on Random Variation given 3000+ plate appearances? One standard deviation in wOBA is about 0.5/sqrt(PA), meaning 9 points. Two standard deviations is therefore 18 points. And so, we'd expect to find 95% of batters to have a difference of 47+/-18 or 29 to 65 points in wOBA difference. We ended up with 33 to 59 points of drop. In other words, that Nick Markakis had a wOBA drop of only 33 points in an 0-1 count and that Alex Rodriguez had a drop of 59 points doesn't actually mean anything about them specifically: we expected to see those kinds of drops, just by Random Variation.
Ok, how about 2000+ plate appearances? Chris Davis has a 93 point drop (or 46 more than the average) and Marco Scutaro has only a 9 point drop (or 38 less than the average). All 147 batters still show a drop overall, as we'd expect. But what about the range? Well, with 2000 PA, one standard deviation is 11 points. Here we see changes of 3 to 4 standard deviations. Does that mean that what Davis and Scutaro did is real? No! What it DOES mean is that we can't ascribe the ENTIRE difference to Random Variation. We still have a very healthy amount of that deviation that is nothing but Random Variation. But not all of it.
How about 1000+ plate appearances? Now we have Pedro Alvarez (against RHP anyway) who drops 100 points (or 53 more than average) on an 0-1 count, while Khalil Greene (against RHP) drops only 2 points (or 45 less than average). One standard deviation for 1000 PA is 16 points. We are again around 3 standard deviations. With 589 batters, we expect to see results beyond 3 SD, just by Random Variation, for roughly one or two batters.
What we can do is calculate the z-scores for each batter. A z-score is simply the number of standard deviations from the mean that the player's performance is. These extreme batters that we've noted have a z-score around 3. What matters is the distribution of all these z-scores. If the standard deviation of these z-scores is exactly 1, that's exactly what we'd see if what we observe is purely Random Variation. If there is an "0-1 skill", then we'd expect a spread wider than normal, and so the standard deviation of the z-scores would be more than 1. And the more skill there is, the higher the standard deviation of the z-scores.
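A minimal sketch of that test, assuming one row per batter with his wOBA drop after 0-1 and his PA count (the column names are hypothetical):

```python
import pandas as pd

def zscore_spread(df: pd.DataFrame, league_avg_drop: float) -> float:
    """Standard deviation of per-batter z-scores for the 0-1 wOBA drop.
    A value near 1.0 means the spread is what Random Variation alone produces.
    Expects columns: woba_drop (in wOBA points) and pa."""
    one_sd = 0.5 / df["pa"] ** 0.5 * 1000      # ~0.5/sqrt(PA), converted to points
    z = (df["woba_drop"] - league_avg_drop) / one_sd
    return float(z.std())
```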
And what do we see? The standard deviation of the z-scores is... 1.0. In other words, no 0-1 skill.
I'll go through the other 10 counts to see what kind of results we get. My expectation is that maybe at 3-2 we will see something, because that is such an extreme count. Of course, even if there is a 3-2 skill, no batter will intentionally go from a 3-1 to a 3-2 count because he can "handle" a 3-2 count better than the typical batter. Whatever skill he may have at 3-2 can't possibly overcome the inherent advantage he'd have just by being in a 3-1 count.
The blog post below and the attached document were written entirely by my son, Philip.
The purpose of this blog post is to explain how I came up with an equation that models the seams of a baseball in 3D space. This equation is subject to a scaling variable that allows for the distance of separation between the seams to be controlled. I then also give a preview of how we then can go from a 3D spherical map to a 2D one.
Thank you especially to Desmos and Math3D for their terrific online software
This is going to make a few small assumptions, that will only affect the precision, not the accuracy. And, we'll have to remember the acceleration part of the kinematic equation you would have learned from school: "one half, a-t-squared".
a is acceleration, meaning gravity. You may know it as 9.81 m/s2, where s2 is seconds squared. Or as 32.2 feet/s2. For our purposes, we want inches, which is 386 in/s2.
t is time, or how much time the ball is in the air. Here's where we make two assumptions.
One is that the ball will travel 53 feet from release to plate crossing.
The other is that the final speed of the ball is 92% of the initial speed, meaning the average speed of the ball is 96% of the initial speed.
So, if we have a pitch thrown at 90mph, which is 132 feet per second, the average speed is 96% of that, or 126.7 feet per second. Since the ball travels 53 feet, then 53/126.7 gives us 0.418 seconds.
Therefore, "one half, a-t-squared" is 0.5 x 386 x 0.418^2 = 33.8 inches. And that's how much gravity affects a ball thrown at 90 mph.
Here's a shortcut formula you can use if you don't want to go through all that:
(523 / Speed) ^2
So, in the above case, 523/90, squared is 33.8 inches.
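Here are both calculations as a minimal sketch, using the assumptions above (53 feet of travel, average speed at 96% of release speed):

```python
def gravity_drop_inches(mph, release_to_plate_ft=53, avg_speed_frac=0.96, g_in_s2=386):
    """Vertical drop due to gravity: one half, a-t-squared."""
    avg_speed_fps = mph * 5280 / 3600 * avg_speed_frac   # mph to ft/s, then average speed
    t = release_to_plate_ft / avg_speed_fps              # flight time in seconds
    return 0.5 * g_in_s2 * t ** 2

print(round(gravity_drop_inches(90), 1))     # ~33.8 inches
print(round((523 / 90) ** 2, 1))             # the shortcut gives the same ~33.8
```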
WOWY (With Or Without You) is a crude, but very effective, concept to describe the impact of players. In 1989-1990, Roy had a 2.53 GAA, with a league-leading .912 save percentage, on his way to winning the Vezina (Top Goalie) and finishing fifth in the Hart (MVP). His backup, Brian Hayward, had a tough season in comparison, but was actually just a bit below league average, with a 3.37 GAA and .878 save percentage. (Yes, kids, save percentage at .880 was at one time considered average.) Redlight Racicot gave up 3 goals on 6 shots as the third member of the goalie team.
So, in games without-Roy (which in this case is virtually all Hayward), the Habs gave up 3.45 GAA, which, pro-rated to Roy's time on ice, is 183 goals allowed. Habs with Roy gave up 134 goals. So, with-Roy is 49 fewer goals (-49) than without-Roy. We can repeat this process for every season of Roy with the Habs (-215 goals) and with the Avs (-92) to give us a WOWY of -307 goals for Roy.
There is one major assumption here, and a minor one. The major is that the "without" Roy goalies are "average". Of course, their very presence as backups to Roy likely presumes that they are bench-level at best, otherwise, they wouldn't be his backups. Fortunately, this bias would exist for all top goalies (Hasek, Luongo, Brodeur, et al). And we're also fortunate that top goalies have long careers, so a lot of the uncertainty of their backups will get reduced. In a long career, top goalies will have a wide array and number of backups.
The minor is that the teams play "similarly" with Roy as without Roy. Ideally, we'd want this to be true. But, it's not necessarily the case. This is something that we'd have to prove to be true, or at least provide the uncertainty level to the extent that it's true.
One technical note: when I do WOWY, I actually use the harmonic mean. So in the above case, the pro-rating is not to the 53 equivalent 60 minute games Roy played, but 37 games. The end result is that Roy, with-Habs, ends up at -149 goals in 360 harmonic-games. I then pro-rate that total to -215 goals in his 532 actual (60 minute equivalent) games. For someone like Roy, it works out well enough.
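Here is a minimal sketch of one season's WOWY under that harmonic-mean weighting; the actual bookkeeping across seasons (and the final pro-rating back to actual games) may differ in detail:

```python
def season_wowy(with_gaa, with_games, without_gaa, without_games):
    """One season's WOWY: the GAA gap, credited over the harmonic mean of the
    with/without games, so that thin 'without' samples carry less weight.
    Negative means the goalie allowed fewer goals than his backups' rate."""
    harmonic_games = 2.0 / (1.0 / with_games + 1.0 / without_games)
    return (with_gaa - without_gaa) * harmonic_games

# Roy, 1989-90, using the figures above (the backup games total is a guess,
# purely for illustration): season_wowy(2.53, 53, 3.45, 27)
```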
Where it breaks is with the Cal Ripken seasons. When Jacques Plante plays a full season (every single minute), there is no Habs-without-Plante. So that entire season is discarded. But even with other seasons of limited backup time, those seasons have their impact severely reduced. The less Plante plays, the more we will end up counting those games. Of course, the less he plays, the less good he is. So, we get into a tough situation here.
To the extent that what I did works (to whatever degree you can accept), the top Habs goalies were
-294 Plante
-215 Roy
-210 Dryden
-142 Price
-43 Huet
If you were to create a Mount Rushmore of Habs goalies, those top 4 would invariably be it. At the bottom are Gerry McNeil and Bunny Larocque. But when your without-McNeil is principally Plante and without-Bunny is principally Dryden, this points to the limitation we have, that you don't get the wide array you'd like to have.
There are corrections we can apply, by going through an iterative process, and looking at performances outside the seasons in question. We'll get to that next time.
The NHL season gave us an unprecedented experiment on head-to-head and common-opponent theories, what I've come to call WOWY (with or without you).
Let's take the Habs/Leafs. In head-to-head games in the regular season, the Leafs scored 34 goals to the 25 from the Habs. So the Leafs scored 58% of the goals. Against non-Hab opponents (which is every other Canadian team), the Leafs scored 55% of the goals. The Habs against those same common opponents, scored 50% of the goals. In other words, against common opponents, the Leafs are a bit better than the Habs. Against each other, the Leafs are far better than the Habs. Which is closer to reality, their 10 head to head games (aka With), or their 46 common-opponent (aka Without) games? In this case, the With is closer.
The Oilers scored 63% of the goals in their games against the Jets. Against common-opponent, Oilers are 53% and Jets 55%. As we know, the Jets demolished the Oilers, scoring 64% of the playoff goals. In this case, the Without is closer.
How about the other teams? Let's remove the two matchups that are too close to call.
Bruins/Caps are 51% in favor of Boston in the With, while the without is a slight favorite to Boston 56% to 55%. Either way, we'd assume an even split. Boston scored 62% of the playoff goals, but as I said, we won't learn anything here.
Avs/Blues: Avs scored 55% of their head to head goals (and naturally Blues are 45%). While against common-opponent, it was 60% Avs, 51% Blues. In other words, whether you go with H2H or common-opponent, you get the same conclusion. We won't learn anything here. For sake of posterity, Avs scored 74% of the goals in the playoffs.
So let's get on to the last 4 matchups and see how they stack up:
Knights/Wild: head-to-head, they each scored 24 goals. Against common opponents, the Knights scored 63% of the goals and the Wild scored 54%. So the Knights should have been heavy favorites using Without and even-odds using With. So far, the Knights have scored 56% of the goals. So, the Without is closer.
Pens/Islanders: Pens scored 58% of the With goals, while the Isles are ahead in the Without: 57% to 55%. In the playoffs, we know the Isles are way ahead, with 57% of the goals. So, the Without is again closer. Not only that, but the With is wildly deceiving.
Bolts/Panthers: Florida scored 56% of their head-to-head goals. Against common-opponent, Bolts were a bit better at 58% to 55%. In playoffs, Bolts were far better, scoring 59% of the goals. The Without is much better.
Canes/Predators: Canes scored 59% of the head-to-head goals. Against common-opponent, Canes were a bit better, 57% to 52%. In the playoffs, Canes scored 58% of the goals. In this case, the With is closer.
Adding it all up, in the six matchups that gave us a conclusion: 4 were better with the Without and 2 were better with the With. Given that the number of games played With/Without were something like 8/48 for most teams, the volume of the Without certainly gave those games a leg up here.
The true question therefore is not an either/or. Rather, the question is if the head-to-head games should be given more weight than the common-opponent games. That is, what weighting of the head-to-head games relative to common-opponent gives us the best predictor?
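One way to frame that weighting question, as a sketch (the weight w is the unknown we'd be solving for, and the simple symmetric centering of the common-opponent estimate is my own assumption):

```python
def blended_goal_share(h2h_share, h2h_games, common_for, common_against, common_games, w):
    """Blend the head-to-head goal share with a common-opponent estimate, giving
    each head-to-head game w times the weight of a common-opponent game.
    The common-opponent estimate is 0.5 plus half the gap between the two
    clubs' common-opponent goal shares."""
    common_est = 0.5 + (common_for - common_against) / 2
    total_weight = w * h2h_games + common_games
    return (w * h2h_games * h2h_share + common_games * common_est) / total_weight

# Leafs/Habs example from above: 58% head-to-head over 10 games, 55% vs 50%
# against common opponents over 46 games; w is what we'd tune against results.
# blended_goal_share(0.58, 10, 0.55, 0.50, 46, w=2.0)
```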
We'll look at that next time, once the second round is over, and we've got more data to work with.