Baseball fans who have no use for advanced metrics can realize the flaws in evaluating pitchers by their won-lost records, but may struggle to understand the inherent flaws in the more commonly used earned run average. Henry Chadwick invented ERA in the 19th century to measure the effect of defense on pitching performance, but not until Voros McCracken explained the concept of Defense Independent Pitching Statistics (DIPS) did our understanding of the relationship between pitching and defense take a big step forward. McCracken explained that pitchers controlled the rates of whiffing, walking, and getting walloped with home runs, showing that the correlation between these statistics in consecutive years was strong. Though he inferred an ability for hurlers to control these numbers, another finding suggested little persistence in their Batting Average on Balls in Play (BABIP), leading to the conclusion that ERAs were dependent on defense (or luck), and therefore very volatile.
Armed with this information, sabermetricians began to develop methods of estimating ERA by controlling for the factors that can muddy the proverbial waters. These estimators evaluate pitching performance based on what pitchers actually control, which makes tracking their abilities more accurate. Watching trends in the skills that pitchers control can help us better grasp whether shifts in ERA are the result of changes from the individual or from external factors. Since then, many competing estimators have emerged, each with its own strengths and weaknesses. Perhaps the most popular ERA estimator is Fielding Independent Pitching (FIP), which uses the following straightforward formula: FIP = 3.20 + (3*BB – 2*K + 13*HR)/IP, where the 3.20 is a constant dependent on the league and year, used to place the output on the ERA scale.
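As a rough illustration, the FIP calculation can be sketched in a few lines of Python. This assumes the bracketed sum is divided by innings pitched, which is the denominator the article later attributes to FIP and xFIP. The function name, the decimal-innings input, and the season line are hypothetical, and 3.20 is simply the constant quoted above rather than a value for any particular league or year.

```python
def fip(bb: float, k: float, hr: float, ip: float, constant: float = 3.20) -> float:
    """Fielding Independent Pitching: constant + (3*BB - 2*K + 13*HR) / IP.

    `ip` is innings pitched as a decimal (e.g. 200.0), not the "200.1" box-score notation.
    `constant` is league- and year-dependent; 3.20 is the value quoted in the text.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return constant + (3 * bb - 2 * k + 13 * hr) / ip


# Hypothetical season line, not a real pitcher.
example_fip = fip(bb=50, k=180, hr=20, ip=200.0)
print(round(example_fip, 2))
```

Because the denominator is innings rather than batters faced, anything that changes how many outs a pitcher records per batter also moves the result, even if his walks, strikeouts, and home runs are unchanged.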
Researchers have noted that, among the defense-independent statistics, home runs are by far the least predictable. Although home-run rate has shown itself to be more repeatable than BABIP, its lack of persistence makes such a comparison similar to justifying a D grade by mentioning that other classmates failed the test. Further research revealed that the percentage of fly balls that left the yard (HR/FB) sported about as little persistence as BABIP, and second-generation estimators attempted to eliminate HR/FB luck from estimation. One of the more obvious adjustments is simply to approximate the number of home runs that would have been hit if the pitcher had neutral luck in the fly ball department (xHR), and then recompute FIP with this estimate. This metric, known as Expected Fielding Independent Pitching (xFIP), uses the regular FIP formula but replaces HR with xHR. This estimator marked an upgrade over FIP given the accepted notion that HR/FB has much more of a foundation in luck than actual skill, but there was still ample room for improvement.
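A minimal sketch of the xFIP idea follows. The text does not spell out how the neutral-luck home run total is computed, so this assumes it is the pitcher's fly balls multiplied by a league-average HR/FB rate. That assumption, the function names, and the 0.10 rate used in the example are placeholders for illustration, not published values.

```python
def expected_home_runs(fly_balls: float, league_hr_per_fb: float) -> float:
    """ASSUMPTION: neutral-luck home runs = fly balls * league-average HR/FB."""
    return fly_balls * league_hr_per_fb


def xfip(bb: float, k: float, fly_balls: float, ip: float,
         league_hr_per_fb: float, constant: float = 3.20) -> float:
    """Regular FIP formula with HR replaced by the expected home run total."""
    if ip <= 0:
        raise ValueError("ip must be positive")
    xhr = expected_home_runs(fly_balls, league_hr_per_fb)
    return constant + (3 * bb - 2 * k + 13 * xhr) / ip


# Hypothetical pitcher; 0.10 is a placeholder league HR/FB rate.
example_xfip = xfip(bb=50, k=180, fly_balls=220, ip=200.0, league_hr_per_fb=0.10)
print(round(example_xfip, 2))
```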
Nate Silver invented QERA back in 2006 for Baseball Prospectus to adjust for a few issues with FIP and xFIP, and while he referred to the stat as a toy, it represented a big step forward in the methodology of estimators. The formula, QERA = (2.69 – .66*GB% + 3.88*BB% – 3.4*K%)^2, derives one of its main benefits from the fact that it accounts for non-linear run scoring: the more baserunners allowed, the higher the percentage that will score. It also removes the bias that arises because innings pitched totals are subject to batted-ball luck, so that a pitcher with a higher BABIP will have a lower K/IP even if he strikes out the same percentage of hitters. QERA has a problem of its own, in that GB% is really GB/Ball in Play (or GB/BIP), while BB% and K% are measured per batter faced (BB/PA and SO/PA).
In other words, for pitchers who strike out and walk large numbers of hitters, changes in ground balls per ball in play affect their QERA as much as they do for pitchers who barely strike out or walk any hitters, even though the latter group’s ground-ball rate actually represents a higher tally. Further, while QERA picks up some of the interaction between walk, strikeout, and ground-ball rates, it does not necessarily weight them correctly.
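The denominator mismatch can be made concrete with a short sketch. The `qera` function follows the formula quoted above. The second helper converts a change in GB/BIP into ground balls per plate appearance, under the simplifying assumption that balls in play per PA are roughly 1 – BB% – K% (ignoring home runs and hit batters). Both pitcher profiles are invented for illustration.

```python
def qera(gb_per_bip: float, bb_per_pa: float, k_per_pa: float) -> float:
    """QERA as quoted in the text: (2.69 - .66*GB% + 3.88*BB% - 3.4*K%)^2."""
    inner = 2.69 - 0.66 * gb_per_bip + 3.88 * bb_per_pa - 3.4 * k_per_pa
    return inner ** 2


def extra_ground_balls_per_pa(delta_gb_per_bip: float,
                              bb_per_pa: float, k_per_pa: float) -> float:
    """Ground balls per PA added by a given rise in GB/BIP.

    ASSUMPTION: BIP/PA is approximated as 1 - BB% - K% (no HR or HBP).
    """
    bip_per_pa = 1.0 - bb_per_pa - k_per_pa
    return delta_gb_per_bip * bip_per_pa


# Two hypothetical pitchers who each raise GB/BIP by five points.
contact_pitcher = extra_ground_balls_per_pa(0.05, bb_per_pa=0.05, k_per_pa=0.10)
power_pitcher = extra_ground_balls_per_pa(0.05, bb_per_pa=0.12, k_per_pa=0.25)
print(contact_pitcher, power_pitcher)
```

The contact pitcher gains more actual ground balls per batter faced from the same rise in GB/BIP, yet the term inside QERA's parentheses moves by the same amount for both.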
With that in mind, we have invented a new statistic, Skill-Interactive Earned Run Average (SIERA), which corrects the problems with old estimators while adding a few more realistic assumptions. This was done first by un-foiling all of the individual components in QERA while adjusting for the ground-ball denominator issue, and then testing which interactions and squared terms were relevant by using multiple linear regression analysis. Essentially, we changed the GB/BIP to (GB-FB-PU)/PA and evaluated all of the terms in the expanded regression, removing those with insignificant p-values; while the QERA formula only shows three variables, un-foiling the formula reveals several more. We identified two terms that were not useful: the squared term of walks, and the interaction between walk and strikeout rate. The squared terms on strikeout and ground-ball rates were both significant, and we also found important interactions between walks and grounders and between whiffs and grounders that have strong effects on run scoring.
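To see where the extra terms come from, it may help to write out the expansion of a squared form like QERA's. The symbols a, b, c, d below are generic placeholders for the intercept and the three coefficients, and W, K, G stand for the walk, strikeout, and ground-ball rates; none of them are fitted values.

```latex
\begin{aligned}
(a + bW + cK + dG)^2 ={}& a^2 + 2ab\,W + 2ac\,K + 2ad\,G \\
 &+ b^2 W^2 + c^2 K^2 + d^2 G^2 \\
 &+ 2bc\,WK + 2bd\,WG + 2cd\,KG
\end{aligned}
```

The squared form ties every squared and interaction coefficient to products of the linear ones, whereas a regression on the ten expanded terms can fit each coefficient separately and drop the ones that prove insignificant.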
As a result, SIERA accomplishes the following:
- Allows for the fact that a high ground-ball rate is more useful to pitchers who walk more batters, due to the potential that double plays wipe away runners.
- Allows for the fact that a low fly-ball rate (and therefore, a low HR rate) is less useful to pitchers who strike out a lot of batters (e.g. Johan Santana's FIP tends to be higher than his ERA because the former treats all HR the same, even though Santana's skill set portends that the bombs he allows will usually be solo shots).
- Allows for the fact that adding strikeouts is more useful when you don't strike out many guys to begin with, since more runners get stranded.
- Allows for the fact that adding ground balls is more useful when you already allow a lot of ground balls because there are frequently runners on first.
- Corrects for the fact that QERA used GB/BIP instead of GB/PA (e.g. Joel Pineiro is all contact, so increasing his ground-ball rate means more ground balls than if Oliver Perez had done it, given he's not a high contact guy).
- Corrects for the fact that FIP and xFIP use IP as a denominator, which means that luck on balls in play changes one's FIP.
The new ground-ball statistic used is: (GB-(FB+PU))/PA. Now walks, strikeouts, and grounders use the same denominator, avoiding any type of weighting issues. GB/PA could have been used instead of GB/BIP, but our findings suggested that line drives per ball in play exhibited virtually no persistence, and did not represent a pitcher skill. When his line-drive rate is low, the pitcher is probably just lucky, but ground-ball, fly-ball, and pop-up rates will increase to make up the difference. Since ground-ball rate for the league as a whole is similar to the sum of fly-ball and pop-up rates, using the difference between the two eliminates some of the luck that would make this estimator look bigger than its britches. For the same reason, pop-up rate was allowed to negatively affect SIERA because it is a symptom of the pitcher throwing a ball that generates an upward trajectory, which could lead to an increase in home runs. A pitcher's skills are throwing strikes, making hitters miss, and throwing with angles and spins such that the trajectory of the ball is downward when it hits the bat. A pop-up almost always represents an out, but it also represents a potential problem for the pitcher in the future.
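The ground-ball term is simple enough to sketch directly. The function name and the batted-ball counts below are hypothetical; the formula itself is the one given above.

```python
def net_ground_ball_rate(gb: int, fb: int, pu: int, pa: int) -> float:
    """(GB - (FB + PU)) / PA: ground balls minus fly balls and pop-ups, per batter faced.

    Line drives are left out of the numerator entirely.
    """
    if pa <= 0:
        raise ValueError("pa must be positive")
    return (gb - (fb + pu)) / pa


# Hypothetical batted-ball profile over 800 batters faced.
example_rate = net_ground_ball_rate(gb=300, fb=200, pu=40, pa=800)
print(example_rate)
```

A pitcher whose ground balls roughly equal his fly balls plus pop-ups lands near zero, so the term measures how far his batted-ball mix tilts downward or upward rather than how many balls avoid being line drives.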
Simply running a regression analysis to predict park-adjusted ERA and developing a statistic that introduces these improvements to Defense Independent Pitching Statistics would be useless if it did not predict ERA better than other statistics. Not only did SIERA emerge as the leader in ERA estimators, we discovered more importantly that using the same regression analysis on different datasets shows that the coefficients developed continue to predict ERA better than other estimators, proving that our analysis was not biased by retroactively predicting the mark. Specifically, using 2003-08 data to generate a formula and then testing it on 2009 pitchers, SIERA emerged as the best estimator of park-adjusted ERA in the following year and the best at predicting same-year ERA amongst the estimators that treat HR/FB as luck; FIP and tRA consider it to be more skill-laden.
In other words, it is impossible to best FIP in terms of same-year mirroring unless HR/FB is treated as a skill, but tests have shown that HR/FB itself is unstable and not indicative of something within the control of the pitcher. FIP and tRA beat the estimators that do not credit the pitcher for this luck at predicting same-year Earned Run Average, but SIERA overtakes both in predicting future performance, which is arguably much more important. After all, the primary goal of ERA estimators is to approximate a skill set that can successfully generate low ERAs while being as accurate as possible in the modeling and assumptions deriving the formula.
In the coming days, we will explain in more detail the derivation of SIERA, provide some tests to check its performance, and offer examples of pitchers for whom the metric performs vastly better than other estimators. The last part is very important, as a small change in ERA estimation is not necessarily a big deal unless there are pitchers who are perpetually underrated or overrated by similar statistics. This is certainly true in the case of SIERA and FIP for a player like Santana, whose solo home run tendencies are inaccurately punished by FIP in a way that underestimates his skill by a significant amount. The introduction of a metric that properly accounts for all that was mentioned above helps to evaluate pitchers in a more precise and useful way than ever before.
For now, we leave you with the formula for the statistic that will be kept here moving forward and will soon be found on the revamped reports:
SIERA = 6.145 – 16.986*(SO/PA) + 11.434*(BB/PA) – 1.858*((GB-FB-PU)/PA) + 7.653*((SO/PA)^2) +/– 6.664*(((GB-FB-PU)/PA)^2) + 10.130*(SO/PA)*((GB-FB-PU)/PA) – 5.195*(BB/PA)*((GB-FB-PU)/PA)
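For readers who want to try the formula, here is a hedged Python sketch using the coefficients printed above. The passage does not say how the "+/–" on the squared ground-ball term is resolved. This sketch assumes the term is subtracted when (GB-FB-PU)/PA is positive and added when it is negative, and the `squared_term_sign` argument lets you override that assumption. The function name and the sample counts are hypothetical.

```python
from typing import Optional


def siera(so: int, bb: int, gb: int, fb: int, pu: int, pa: int,
          squared_term_sign: Optional[float] = None) -> float:
    """SIERA from raw counts, using the coefficients given in the text."""
    if pa <= 0:
        raise ValueError("pa must be positive")
    k = so / pa                  # strikeouts per batter faced
    w = bb / pa                  # walks per batter faced
    g = (gb - fb - pu) / pa      # net ground-ball rate per batter faced

    # ASSUMPTION: the "+/-" term is negative for net ground-ball pitchers
    # and positive for net fly-ball pitchers. Override via squared_term_sign.
    if squared_term_sign is None:
        squared_term_sign = -1.0 if g >= 0 else 1.0

    return (6.145
            - 16.986 * k
            + 11.434 * w
            - 1.858 * g
            + 7.653 * k ** 2
            + squared_term_sign * 6.664 * g ** 2
            + 10.130 * k * g
            - 5.195 * w * g)


# Hypothetical pitcher: 800 batters faced, 160 strikeouts, 64 walks.
example_siera = siera(so=160, bb=64, gb=300, fb=200, pu=40, pa=800)
print(round(example_siera, 3))
```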
For instance, in Nate's formula: QERA = (a + b*BB% + c*K% + d*GB%)^2, it's essential that b is positive and that c & d are negative, but that requires that b*c is negative when it is actually positive in nearly every regression we ran, and it also requires that d^2 is positive, when it actually should be negative.
Thanks for highlighting this point, though. I think it makes the metric more transparent.
Nice work.
In Wednesday's article, this will be fleshed out a little more, but the methodology behind a lot of the quadratic and interaction terms entails asking "which pitchers need this most?" That's why GB rate is more important for pitchers who allow more base-runners, and why marginal improvements in K-rate are less important the higher the K-rate goes.
Matt and Eric, was there any thought to adding a second-order term in BB rate? I was plotting the numbers for an average pitcher, of K rate vs. SIERA, and found it to look too linear (by eye) in BB rate.
We did test the second order term for BB-rate, which we'll explain in more detail on Wednesday, but it kept coming up as insignificant so we left it out of the equation.
There is speculation among Braves fans that his circle-change (for which he was famous) behaved somewhat like a knuckleball, which is a pitch that is known to outperform many ERA predictors (I'm assuming that Wakefield probably outperforms SIERA as well). But that's all anecdotal evidence, obviously.
But seriously .... very nice work ... glad to see BP's new blood picking up where the older writer left off, and improving upon the work.
Also a correlation matrix of the same estimators over time, i.e., interseasonal correlations for each indicator with itself and with the other estimators, e.g., 2008 and 2009 will do.
Finally, some information about the correlations (intra- and interseasonal) for pitchers with different numbers of innings pitched (let's just say high (150+), medium (75-149), and low (<75) (or some such breakdown).
If you can't put all of this into the article, perhaps you could offer it as a downloadable spreadsheet or table?
Thanks for considering this.
Over the next four days Matt and I have articles going really in-depth into a number of topics, one of which is testing SIERA against FIP, xFIP, QERA, tRA, ERA-Park, ERA, you get the picture. So just hold tight. We're going to have the data in the articles themselves too.
Isn't that backwards? If you have a higher BABIP, you will face more hitters per batted ball, as more will get on base. If you K the same percentage of batters faced, you should get MORE K's per inning. In the extreme case, if two pitchers both strike out every third hitter, and one has a BABIP of .000 and the other has a BABIP of 1.000, the first will have one K/IP, and the latter will have 3 K/IP.
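The arithmetic in that extreme case can be checked with a small sketch. It uses a deliberately stripped-down model in which every plate appearance ends in either a strikeout or a ball in play, with no walks, home runs, double plays, or outs on the bases. The function name is made up for illustration.

```python
def strikeouts_per_inning(k_per_pa: float, babip: float) -> float:
    """K/IP in a toy model where each PA is a strikeout or a ball in play.

    ASSUMPTION: no walks, home runs, hit batters, double plays, or baserunning outs,
    so outs come only from strikeouts and from balls in play that are not hits.
    """
    outs_per_pa = k_per_pa + (1.0 - k_per_pa) * (1.0 - babip)
    if outs_per_pa <= 0:
        raise ValueError("no outs are ever recorded with these inputs")
    batters_per_inning = 3.0 / outs_per_pa
    return k_per_pa * batters_per_inning


low_babip = strikeouts_per_inning(k_per_pa=1 / 3, babip=0.0)
high_babip = strikeouts_per_inning(k_per_pa=1 / 3, babip=1.0)
print(low_babip, high_babip)
```

Under these assumptions the same strikeout percentage yields more strikeouts per inning as BABIP rises, because each inning takes more batters to complete.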
My other question is about weather, umpire and pitching to ballparks. I know that all three affect scoring, and when added together can do so significantly. But never do I see any mention of them in really, anything that anyone does. It really bothers me that we can see that over 30+ starts the same offense may produce dramatically different Runs/Game for a given starter on the same team, but it is somehow assumed that the umpires and the game conditions for each of those starters "even out". These things count, but they're not counted.
*I've lost count of the number of ways this can be demonstrated. Most obviously, there is a significant correlation between team BABIP allowed and the three true outcomes; staffs that are better according to the latter also allow a lower BABIP just as you'd expect. But here's perhaps the best bit of evidence yet:
Take all the pitchers with 200+ BFP in consecutive seasons for the same club in the same park since we've had UZR data (2002-2009). The change in BABIP is of course correlated to the change in team UZR Range + Error, with the correlation surprisingly weak (r = .19) because BABIP is just so damn noisy. Now, adjust the yearly change in K/BFP for age and change in role (starter vs. relief) and toss that into the regression. Surprise! The change in BABIP is also significantly (p = .02) correlated to the change in K rate. That is very unlikely to be caused by changes in FB/GB ratio that correlate to K rate changes and very likely to be what it looks like: you strike out more batters, they also hit the ball less hard (and the opposite).
It's also interesting that the change in BABIP with age (again, with change in UZR included in the regression) precisely mirrors the change in HR / Contact with age. In both cases, there is no change until age 28 or 29, then a worsening. (Contrast to K rate, which significantly improves to age 27, then declines at the same rate). The worsening of BABIP after age 29 is not significant (p = .25), but in this data set neither is the improvement in BB rate by young pitchers (p = .26), and we're pretty sure that's real.
(Some of this is in a thread at SoSH looking at the impact of team defense on pitching.)
I remain somewhat skeptical about whether HR/FB really should be regressed essentially to the mean, but admittedly I've barely studied it, let alone studied it as much as I've studied BABIP.
Can you tell me where I can readily obtain Pop-Ups as a statistic in the typical MLB boxscore?