Recently, there has been a lot of digital ink spilled about ERA estimators—statistics that take a variety of inputs and come up with a pitcher’s expected ERA given those inputs. Swing a cat around a room, and you’ll find yourself with a dozen of the things, as well as a very agitated cat. Among those is SIERA, which has lately migrated from here to Fangraphs.com in a new form, one more complex but not necessarily more accurate. We have offered SIERA for roughly 18 months, but have had a difficult time convincing anyone, be they our readers, other practitioners of sabermetrics, or our own authors, that SIERA was a significant improvement on other ERA estimators.
The logical question was whether or not we were failing to do the job of explaining why SIERA was more useful than other stats, or if we were simply being stubborn in continuing to offer it instead of simpler, more widely adopted stats. The answer depends on knowing what the purpose of an ERA estimator is. When evaluating a pitcher’s performance, there are three questions we can ask that can be addressed by statistics: How well he has pitched, how he accomplished what he’s done, and how he will do in the future. The first can be answered by Fair RA (FRA), the third by rest-of-season PECOTA. The second can be addressed by an ERA estimator like SIERA, but not necessarily SIERA itself, which boasts greater complexity than more established ERA estimators such as FIP but can only claim incremental gains in accuracy.
Some fights are worth fighting. The fight to replace batting average with better measures of offense was worth fighting. The fight to replace FIP with more complicated formulas that add little in the way of quality simply isn’t. FIP is easy to understand and it does the job it’s supposed to as well as anything else proposed. It isn’t perfect, but it does everything a measure like SIERA does without the extra baggage, so FIP is what you will see around here going forward.
Why ERA Estimators?
Why do we want to know a player’s expected ERA? We can measure actual ERA just fine, after all, and we have good reason to believe that ERA is not the best reflection of a pitcher’s value because of the unearned run. Component ERA estimators are frequently used to separate a pitcher’s efforts from those of his defense, but you end up throwing out a lot of other things (like how he pitched with men on base) along with the defensive contributions. Here at BP, we use Fair RA to apportion responsibility between pitching and defense while still giving a pitcher credit for his performance with men on base. Sean Smith, who developed the Wins Above Replacement measure used at Baseball Reference, has his own method of splitting credit that doesn’t rely on component ERA estimators. They simply aren’t necessary to handle the split credit issue.
What about ERA estimators’ utility in projecting future performance? Let’s take a look at how the most popular component ERA estimator works in terms of prediction. Fielding Independent Pitching was developed independently by Tom Tango and Clay Dreslough, and is simply:
(13*HR + 3*BB – 2*SO)/IP +3.20
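In code, that’s nothing more than a one-liner; here’s a minimal Python sketch (the 3.20 constant is the fixed value from the formula above, not derived from the league run environment, and the sample line is made up):

def fip(hr, bb, so, ip, constant=3.20):
    """Fielding Independent Pitching: (13*HR + 3*BB - 2*SO)/IP + constant."""
    return (13 * hr + 3 * bb - 2 * so) / ip + constant

# A hypothetical season line: 20 HR, 50 BB, 200 SO in 210 IP
print(round(fip(20, 50, 200, 210), 2))  # 3.25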
I computed FIP for the first 50 games of the season, from 1993 to 2010, and looked at how it predicted rest-of-season ERA, weighted by innings pitched. As expected, the root mean square error (RMSE) is very large, at 5.32. If you repeat the exercise but use the previous year’s ERA instead, RMSE drops to 2.78. After 100 games, the difference still persists; ERA has an RMSE of 4.52, compared to 9.64 for FIP. (The reason RMSE goes up the further you are into the season is that you are predicting ERA over substantially fewer games.) Large sample sizes are our friend when it comes to projection.
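For those who want to follow along, the innings-weighted RMSE looks roughly like the sketch below; the column names are hypothetical stand-ins, not our actual database fields:

import numpy as np

def weighted_rmse(predicted, actual, innings):
    """Root mean square error, weighting each pitcher by innings pitched."""
    predicted, actual, innings = map(np.asarray, (predicted, actual, innings))
    return np.sqrt(np.average((predicted - actual) ** 2, weights=innings))

# e.g., FIP through a team's first 50 games vs. rest-of-season ERA:
# weighted_rmse(df["fip_first_50"], df["era_rest_of_season"], df["ip_rest_of_season"])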
All component ERA estimators share the same substantial weakness as predictors of future ERA: their advantage over plain ERA shows up only in small samples. I tried predicting 2010 pitching stats using the aggregate of pitching stats from 2003 through 2009, so I could test how several ERA estimators fared as sample sizes increased dramatically. In addition to FIP, I used xFIP, a FIP derivative developed by Dave Studeman that uses outfield flies to estimate home runs allowed, and Skill-Interactive ERA, a metric originally developed here at BP by Matt Swartz and Eric Seidman.
Here is the RMSE for 2003-2010 at increments of 100 innings:
IP   | ERA  | FIP  | xFIP | SIERA
100  | 1.52 | 1.45 | 1.40 | 1.33
200  | 1.20 | 1.19 | 1.18 | 1.14
300  | 1.02 | 1.07 | 1.05 | 1.03
400  | .98  | 1.02 | 1.02 | 1.02
500  | .77  | .88  | .90  | .91
600  | .75  | .88  | .93  | .96
700  | .81  | .91  | .93  | .99
800  | .77  | .88  | .91  | .96
900  | .77  | .90  | .91  | .98
1000 | .81  | .94  | .97  | 1.04
Looking at all pitchers with at least 100 innings, SIERA outperforms the other estimators (but not by much, as I’ll explain momentarily) as well as ERA. By about 400 innings, though, the gap between the ERA estimators (as well as the gap between them all and ERA) has essentially disappeared. As you add more and more innings, plain old ERA outperforms the estimators, with basic FIP ranking second-best.
If you want a projection, the key thing you want is as large a sample as possible. If you need both a large sample and the most recent data, weighted according to their predictive value, you want something like rest-of-season PECOTA. So what is the function of these estimators, if not prediction? The answer, I think, is explanation. These ERA estimators allow us to see how a pitcher arrived at the results he got: whether they came from the three true outcomes, his defense, or his performance with runners on.
Enter SIERA—Twice
SIERA was doomed to marginalization at the outset. It was difficult to mount a compelling case that its limited gains in predictive power were worth the added complexity. As a result, FIP, easier to compute and understand, has retained popularity. FIP is the teacher who can explain the subject matter and keep you engaged, while SIERA drones on in front of an overhead projector. When FIP provides you an explanation, you don’t need to sit around asking your classmates if they have any idea what he just said. If ERA estimators are about explanation, then the question we need to ask is, do these more complicated ERA estimators explain more of pitching than FIP? The simple answer is they don’t. All of them teach fundamentally the same syllabus with little practical difference.
Fangraphs recently rolled out a revised SIERA; a natural question is whether or not New Coke addresses the concerns I’ve listed above with old Coke. The simplest answer I can come up with is, they’re still both mostly sugar water with a bit of citrus and a few other flavorings, and in blind taste tests you wouldn’t be able to tell the difference.
I took the values for 2010 published at Fangraphs and here, and looked at the mean absolute error and root mean square error, weighted by innings pitched. I got values of .12 and .19, respectively. In other words, roughly 50 percent of the time a player’s new SIERA was within .12 runs per nine of the old SIERA, and 68 percent of the time it was under .20 runs per nine. There is only a very modest disagreement between the two formulas.
In order to arrive at these modest differences, SIERA substantially bulks up, adding three new coefficients and an adjustment to baseline it to the league average for each season. I ran a set of weighted ordinary least squares regressions to, in essence, duplicate both versions of SIERA on the same data set (new SIERA uses proprietary BIS data, as opposed to the Gameday-sourced batted ball data we use here at Baseball Prospectus). Comparing the two regression fits, there is no significant difference in measures of goodness of fit such as adjusted r-squared. Using the new SIERA variables gives us .33, while omitting the three new variables gives us .32.
We can go one step further and eliminate the squared terms and the interactive effects altogether, and look at just three per-PA rates (strikeouts, walks and “net ground balls,” or grounders minus fly balls) and the percentage of a pitcher’s innings as a starter. The adjusted r-squared for this simpler version of SIERA? Still .32.
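For the technically inclined, the stripped-down regression looks something like this sketch (statsmodels, with hypothetical column names; the per-PA weighting is my assumption, since the text above says only “weighted OLS”):

import statsmodels.api as sm

def fit_simple_siera(df):
    """Weighted OLS on three per-PA rates plus starter share of innings.
    Column names are hypothetical; weights assumed to be batters faced."""
    X = sm.add_constant(df[["so_rate", "bb_rate", "net_gb_rate", "start_pct"]])
    return sm.WLS(df["era"], X, weights=df["pa"]).fit()

# fit_simple_siera(pitcher_seasons).rsquared_adj  ->  about .32 on the data set described above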
So what does the extra complexity of the six interactive and exponential variables add to SIERA? Mostly, multicollinearity. An ordinary least squares regression takes a set of one or more input variables, called the independent variables, and uses them to best predict another variable, the dependent variable. It does this by looking at how the independent variables are related to the dependent variable. When the independent variables are also related to each other, it’s possible to throw off the coefficients estimated by the regression.
When multicollinearity is present, it doesn’t affect how well the regression predicts the dependent variable, but it does mean that the individual coefficients can misrepresent the true relationship between the variables. This is how a term that had a negative coefficient in old SIERA can have a positive coefficient in new SIERA. The fundamental relationship between that variable and a pitcher’s ERA hasn’t changed; players haven’t suddenly started trying to make outs instead of scoring runs, or anything like that.
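If you want to see the effect for yourself, here is a toy illustration (deliberately not SIERA): two nearly identical predictors produce unstable coefficients from one resample of the data to the next, even though the overall fit barely moves.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2 * x1 + rng.normal(size=n)            # only x1 truly drives y

for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(n, size=n)   # bootstrap resample
    X = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    fit = sm.OLS(y[idx], X).fit()
    print(fit.params[1:].round(2), round(fit.rsquared, 3))
# The two slope estimates swing around (and can even flip sign) from resample to
# resample, while r-squared stays essentially the same: the predictions are fine,
# the coefficients aren't.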
Adding additional complexity doesn’t help SIERA explain pitching performance in a statistical sense, and it makes it harder for SIERA to explain pitching performance to an actual human being. If you want to know how a pitcher’s strikeouts influence his performance in SIERA, you have no fewer than four variables that explicitly consider strikeout rate, and several others that are at least somewhat correlated with strikeout rate as well. Interpreting what SIERA says about any one pitcher is essentially like reading a deck of tarot cards: you end up with a bunch of vague symbols to which you can fit any number of rationalizations after the fact.
The world doesn’t need even one SIERA; recent events have conspired to give us two. That’s wholly regrettable, but we’re doing our part to help by eliminating it from our menu of statistical offerings and replacing it with FIP. Fangraphs, which already offers FIP, is welcome to this demonstrably redundant measure.
FIP or xFIP?
The next question is why not use a measure like xFIP instead of FIP? The former is much simpler than SIERA and performs better on standard testing than FIP, while being indistinguishable from SIERA in terms of predictive power. While xFIP is simpler than SIERA, it still drags additional levels of complexity along with it, including stringer-provided batted ball data. The question is whether or not that data adds to our fundamental understanding of how a pitcher allows and prevents runs.
One of the fundamental concepts to understand is the notion of true score theory (otherwise known as classical test theory). In essence, what it says is this: in a sample, every measurement is a product of two things—the “true” score being measured, and measurement error. (We can break measurement error down into random and systemic error, and we can introduce other complexities; this is the simplest version of the idea, but still quite powerful on its own.)
You’re probably familiar with the principle of regression to the mean, the idea that extreme observations tend to become less extreme over time. If you select the top ten players from a given time period and look at their batting average, on-base percentage, earned run average, whatever, you see that the average rate after that time period for all those players will be closer to the mean (or average) than their rates during that time period. That’s simply true score theory in action.
Misunderstanding regression to the mean can lead to a lot of silly ideas in sports analysis. It’s why people still believe in a Madden curse. It’s why people believe in the Verducci Effect. It’s why we ended up with a lot of batted-ball data in our supposed skill-based metrics. I’m generally known as a batted ball data skeptic, so I’m sure it comes as no surprise that I regularly lose sleep over statistics like home runs per fly ball, a component of xFIP.
xFIP uses the league average HR/FB rate times a pitcher’s fly balls allowed in place of a pitcher’s actual home runs allowed. While it’s true that a pitcher’s fly ball rate is predictive of his future home run rate on contact, it’s actually less predictive than his home runs on contact (HR/CON). Looking at split halves from 1989 through 1999[1], here’s how well a pitcher’s stats in one half predict his home run rate on contact in the other half:
        | R^2
FB/CON  | .023
HR/CON  | .035
Both    | .045
For those who care about the technical method, I used the average number of batted balls (contact) in the two halves as the weights in a weighted ordinary least squares regression. The coefficients of the regression including both terms are:
HR_CON_ODD = 0.019 + 0.158*HR_CON_EVEN + 0.032 * FB_RT_EVEN
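In sketch form, that split-half regression looks something like the following (hypothetical column names; per the description above, the weights are the average contact in the two halves):

import statsmodels.api as sm

def split_half_fit(df):
    """Predict odd-half HR/CON from even-half HR/CON and FB rate,
    weighting by the average contact in the two halves."""
    weights = (df["con_even"] + df["con_odd"]) / 2
    X = sm.add_constant(df[["hr_con_even", "fb_rate_even"]])
    return sm.WLS(df["hr_con_odd"], X, weights=weights).fit()

# split_half_fit(halves).params  ->  roughly 0.019, 0.158, 0.032, as above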
This result seems to fly in the face of previously published research. Most analysts seem to feel that a pitcher’s fly ball rate is more predictive of future home runs allowed than his home run rate. How do we reconcile these seemingly contradictory findings? The answer is simply this: estimating home run rates using a pitcher’s fly ball rate leads to a much smaller spread than using observed home run rates. Taking data from the 2010 season, and fitting a normal distribution to actual HR/CON as well as to expected HR/CON given FB/CON, gives us:
When you create an expected home run rate based on batted ball data, what you get is something that's well correlated with HR/CON but has a smaller standard deviation, so in tests where the standard deviation affects the results, like root mean square error, it produces a better-looking result, without adding any predictive value.
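Here’s a toy demonstration of that effect (made-up data, not the 2010 sample): shrinking a predictor toward the mean lowers its RMSE against the target even though its correlation with the target, which is to say its actual predictive content, is unchanged.

import numpy as np

rng = np.random.default_rng(7)
future = rng.normal(size=5000)                 # the thing we're trying to predict
estimate = future + rng.normal(size=5000)      # a noisy estimate of it
shrunk = 0.5 * estimate                        # same information, half the spread

rmse = lambda p, a: np.sqrt(np.mean((p - a) ** 2))
print(np.corrcoef(estimate, future)[0, 1], np.corrcoef(shrunk, future)[0, 1])  # identical
print(rmse(estimate, future), rmse(shrunk, future))   # the shrunken version scores better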
Aside from questions of batted ball bias, however, there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that was. This is absolutely incorrect. You can break a pitcher’s fly ball rate down into two components: home runs on contact, and fly balls on balls in play. (This ignores the very small number of home runs that are scored as line drives, which are rare enough not to affect this sort of analysis.) Fly balls on balls in play are a much poorer predictor of future home runs than home runs on contact, with an r-squared of only .014.
While regressing the spread of HR/CON is a good idea, using fly ball rates to come up with expected HR/CON does this essentially by accident, and in doing so it throws out a lot of valuable information that is preserved in HR/CON but washed out when non-HR fly balls are given equal weight. In the rush to throw out the bath water, the baby goes with it.
No Estimating Zone
Proponents of skill-based metrics such as SIERA and xFIP (as distinct from FIP) may claim that results trump theory, that their additional predictive value alone proves their utility. But just as the greater predictive value of fly ball rates is a mirage, so is the greater predictive value of skill-based estimators. What happens when we fit a normal distribution to our ERA estimators, along with ERA? Again looking at 2010 data:
The estimators have a smaller standard deviation than observed ERA. SIERA has a lower spread than FIP, and xFIP has a lower spread than both (although SIERA and xFIP are much closer to each other than SIERA is to FIP). Let’s look at how well each of these predicts next-season ERA from 2003 through 2010 (using a weighted root mean square error), as well as the weighted SD of each among the pitchers who pitched in back-to-back seasons:
     | ERA  | FIP  | xFIP | SIERA
RMSE | 2.88 | 2.07 | 1.86 | 1.86
SD   | 1.71 | .83  | .49  | .53
The ERA estimators are all closer to the next-season ERA than ERA is, but they have a much smaller spread as well. So what happens if we give all the ERA estimators (and ERA) the same standard deviation? We can do this simply, using z-scores. I gave each metric the same SD as xFIP, which was (just barely; you can’t even tell the difference when you round the values) the leading estimator in the testing above, and voila:
     | ERA  | FIP  | xFIP | SIERA
RMSE | 1.94 | 1.87 | 1.86 | 1.84
In essence, what we’ve done here is very crudely regress the other stats (except SIERA, which we actually expanded slightly) and re-center them on the league average ERA of that season. Now SIERA slightly beats xFIP, but in a practical sense both of them are in a dead heat with FIP, and ERA isn’t that far behind. As I said, this is an extremely crude way of regressing to the mean. If we wanted to do this for a real analysis, we’d want to look at how much playing time a player has had. In this example we regress a guy with 1 IP as much as a guy with 200 IP, which is sheer lunacy.
In a real sense, that’s what we do whenever we use a skill-based metric like xFIP or SIERA. We are using a proxy for regression to the mean that doesn’t explicitly account for the amount of playing time a pitcher has had. We are, in essence, trusting in the formula to do the right amount of regression for us. And like using fly balls to predict home runs, the regression to the mean we see is a side effect, not anything intentional.
Simply producing a lower standard deviation doesn't make a measure better at predicting future performance in any real sense; it simply makes it less able to measure the distance between good pitching and bad pitching. And having a lower RMSE based upon that lower standard deviation doesn't provide evidence that skill is being measured. In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on.
[1] The years in question were chosen because those are the years for which we have Project Scoresheet batted ball data, which represents the largest span of publicly available batted ball data on record.
I think the important thing is to evaluate the work on its own merits, regardless of the source - let the evidence be the guide. I started working on this before I had any inkling that Fangraphs would be picking up SIERA - unfortunately I was away for a while to attend SABR and so the article you see here got delayed. Once SIERA was released at Fangraphs, we delayed the article for a few days so I could review the changes they made and see if they addressed any of the concerns I outlined. As I note above, they didn't, so I incorporated some new material to mention that and we went with it.
I'm a little saddened that it worked out this way, actually - I don't think Fangraphs is adding anything that helps them communicate important baseball info, given that they were already running FIP to begin with. And I think that the way the timing of this happened, it puts the focus more on the personalities involved than on the quality of research.
Now we get an incredibly time-consuming rebuttal to someone else's work on a different site. This comes on the heels of an even more off-target critique of Fangraphs' rest-of-season weighting algorithm. Exactly what weighting are you using again for PECOTA RoS, Colin? Oh wait: you never answered that repeated question, instead spending a whole article doing forensic analysis to uncover and slam someone else's algorithm.
I have no allegiance to Fangraphs or BP or any other site for that matter. But the trend here at BP is undeniably disturbing. More effort is being put toward saying someone else is wrong than toward creating and optimizing your own models and tools -- at least based on what is being published, which is all that really matters. The PECOTA pitcher win-loss projections were unbelievably screwed up, beyond any shred of credibility; the 10-year projections never came. But instead of reading articles about what happened and how it will be improved for next year, we're reading about how some ex-employee should stop trying to improve a once-prized algorithm for estimating true pitcher skill over small samples? Color me saddened as well.
There are other problems I discussed in the last thread that I won't harp on further.
Unlike evo34, I have a dog in this fight; I've published twice on BP, I've been acknowledged in the book credits, and I've had a number of pleasant conversations with BP authors both live and via e-mail. I want BP to succeed.
But an article saying, "Math is hard; incremental improvement isn't useful," combined with a somewhat opaque statistical attack, is either anti-stathead or appears to stem from personal animosity.
Steven Goldman deserves some heat for this, too. There's a time when an editor needs to slow down the horses.
At some point since then - after the first week they were up, as we were giving it constant scrutiny during that stretch, they were overwritten somehow with the old values. This is quite embarrassing, and probably means that someone ran a piece of pre-season prep code which needed to be deactivated. I take full responsibility for it being removed. I'm not clear on why the customers who were most interested in seeing it fixed (such as jrmayne) were not appropriately informed when it was published in April, but obviously it wasn't handled professionally on our part at any stage.
All customer suggestions are read and discussed. Some are "works as designed". Some cannot be fixed or implemented in a timely manner. Pre-season PECOTA modifications would generally come under one of these two categories. Some suggestions lead to re-evaluation of statistical methods, as this FIP/SIERA discussion evinces. But ones such as UPSIDE - which simply involve repairing/restoring stats that have been online before - should be fixed in a timely manner. We'll make sure that any such future fixes are addressed in a timely manner, and that the solutions are communicated.
Also, it was mentioned several times in the pre-season, but the order of tables on the PECOTA cards seems odd. The injury and contract info tables, which are not of primary interest to most users and take up a ton of room, are listed above projected playing time, forecast percentiles, and now the 10-year forecast.
http://www.baseballprospectus.com/article.php?type=2&articleid=14171#88302
I'll be happy to try and expand upon that, if you can please tell me what you find lacking in that explanation.
But this is from the July 19 thread:
Rest-of-season PECOTA: We've made changes to the weighting of recent results that reduced the impact of 2009-2010 performance on players' rest-of-season projections, making them less susceptible to being swayed by small-sample stats. Those changes are now reflected in the "2011 Projections" table at the top of each player card, as well as in our Playoff Odds.
[End quote]
And several people noted that you'd previously said that Fangraphs was wrong for its too-2011-heavy weighting. While the paragraph was parsed in different ways (and we never found out which way was right - the logically coherent and grammatically coherent readings appear to be at odds), it appears to say that prior years are having their weighting reduced, to get closer to Fangraphs' weightings. (Yes, Fangraphs is doing something else that appears mistaken, but you vigorously stressed the goal of not overselling recent performance.)
So:
Step 1. Fangraphs is wrong! They're weighting 2011 way, way too much! Dummies.
Step 2. The approximate weighting Fangraphs is using for 2011 (for players with similar PT rates in each year) looks right to some of the unwashed masses.
Step 3. We've changed our system to weight prior years less (and, by necessity, 2011 more.)
Either Step 1 or Step 3 has a flaw. What I want to know is if I've misread Step 3.
--JRM
Then why on earth is this being published now, at the very same time the former BP author is writing about this stat on a different website? Why not two weeks ago? Or in August? Or in the off-season? I hardly believe switching from SIERA to FIP is an urgent change that must be made and justified this very second unless there's some sort of legal/IP issue involved, in which case just say that. Doing this now seems intentionally confrontational and needlessly petty.
Completely irrelevant thought. I just don't like that axiom.
I'm also not quite sure I buy the "overly complex" argument, since the "overly complex/hard to describe in layman's terms" argument has been used when discussing why it was so hard to update PECOTA post-Nate Silver, or to explain why PECOTA projected a player a certain way or gave a comparable player that didn't seem to match.
The mistake people make (and I don't exclude myself from this, at least when resorting to the shorthand lingo of conversational sabermetrics) is to conflate those ERA estimators with a full-on projection system that takes into account multiple years of data, park and league adjustments, physical comps (in the case of PECOTA), and other situations. Saying a pitcher's ERA will go in a given direction because his FIP or SIERA says so is a shortcut, but you'd rather have the larger sample and the more complex beast of a projection system, at least for some purposes.
(assuming that B is, in fact, better, and that the perceived improvement isn't illusory)
Also, isn't SIERA just "an ERA estimator that accounts for a pitcher's ability to control BABIP"? Unless I'm missing something huge, despite the lengthy introduction it really wasn't any more difficult to "understand" than, say, VORP.
Second, and more important, if it can't be explained in conversation it'll never gain acceptance beyond the sabremetric stove. No fair complaining about the obstinacy of a thick-headed baseball mainstream if the stats you're pushing on them are beyond explanation.
Regarding SIERA, I kind of simplistically used it in my mind as calculating a pitcher's performance if BABIP/luck wasn't a factor.
Still, for one of my overall points, I could've used FRAA or any other single metric that has some complexity beneath it that no one has really explained sufficiently to me in the many years I've had a BP subscription. I understand metrics are complex. I may not even understand how or why they work. But I can generally tell if they are useful or not without knowing the nitty gritty beneath the hood. So I don't really buy the complexity argument, since SIERA seems to "make sense" more than FRAA.
The statement and mindset I get curious about is: "It was difficult to mount a compelling case that its limited gains in predictive power were worth the added complexity." This indicates that there are some benefits to SIERA that aren't in FIP. Should I, as a customer, care about how simple or complex a metric's mechanics are? I can definitely appreciate the effort it takes to produce metrics, but I honestly think that my subscription fee pays, in part, for research into extra precision and usefulness. If I were in a restaurant, I really don't know or think about how difficult it is to cook a meal, but I do think about its taste.
So, finding that BP is reverting to another metric that does not have those "limited gains" does make me wonder if there are other BP metrics that are not as precise. Kinda like finding out that I'd been served a watered-down wine, I'd start sniffing suspiciously at the rest of the meal.
One is that it obscures the logic of the underlying processes. Matt claims that the extra terms have real baseball meaning, but he hasn't tested that, much less proven it. I'm skeptical that any of them have much real baseball meaning, but even if some of them do, I'm quite certain that not all of them do. So, when you look at SIERA, how are you to know which terms have real baseball meaning and what they mean? If one of them is wrong, what effect does that have on the whole formula? If something in FIP is wrong, say the coefficient for the BB should be 2.5 instead of 3, it's pretty easy to see how that would affect the result.
Secondly, complexity limits applicability. SIERA is limited to the batted ball data era because of its requirement for groundball data. Who knows if it would work in the minors, or Japanese players, or elsewhere, for example? It's much easier to get the needed data for FIP and to test its applicability in other leagues and levels.
Third, the more complicated a multivariate regression, the easier it is for it to break--in other words, for the conditions that applied in sample to change in ways that cause your result to vary in ways you didn't anticipate--and the harder it tends to be to realize that this has happened.
To my way of thinking, your criticism here is well-explained, and doesn't feel personal. The article did not succeed on those terms, at least to me.
SIERA: 1.844
QERA: 1.842
So, it's difficult to paint this as a partisan dispute.
From the glossary:
QERA =(2.69+K%*(-3.4)+BB%*3.88+GB%*(-0.66))^2
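For reference, that glossary formula is simple enough to drop into a few lines of Python (rates expressed as decimals; the example inputs here are made up):

def qera(k_pct, bb_pct, gb_pct):
    """QERA per the glossary formula; rates as decimals (e.g., .18 for 18%)."""
    return (2.69 - 3.4 * k_pct + 3.88 * bb_pct - 0.66 * gb_pct) ** 2

print(round(qera(0.18, 0.08, 0.45), 2))  # 4.37 for a roughly league-average line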
I know from reports that I design that, even with the documentation, sometimes I get too close to the data and can't see the forest for the proverbial trees.
So, to rephrase:
I'm curious if, over the last 18 months, someone besides Matt didn't outline what each part of SIERA did or check if that part was needed.
Or is the person who did that no longer with BP?
As to who exactly did what when behind the scenes, it's certainly not my place to share that, if it's even appropriate to be shared at all.
I'm not aware of what Matt or anyone else at BP did with SIERA prior to my joining. I can say that in the first few months I was here, the stats group had an extensive discussion about ERA estimators. A lot of the things that Colin and I have said here and that Matt has said in his series at Fangraphs were part of that discussion, if in more nascent forms.
I'm probably a skeptic of more baseball stats than I am an advocate, and unnecessary and/or unexamined complexity is one big reason why.
But home runs as a percentage of line drives is low: only about 2 percent of line drives become home runs (as compared to about 11 percent of outfield fly balls).
I've always wondered how the ball-tracking systems define the threshold for a "line drive," and how consistent those definitions are throughout the industry. And that's before getting into the subjective experience of judging a line drive...
Obviously you can consider launch speed, also, and that complicates the picture a bit. Here's a link to a graph I made of that a while back:
http://fastballs.files.wordpress.com/2010/06/vla_vs_sob_zoom.png
Also, regarding the RMSE tables in the final section, are those differences significant at all, or are they within the margin of error of the estimate? (My statistical knowledge is pretty rudimentary, so I'm not quite sure what the proper word is.) Especially for the final one with the normalized SDs: is 1.84 really better than 1.87, or is it more like the RMSE is 1.84 +/- 0.05, so you can't really say it's "better"? I'm also surprised ERA is that close to the estimators. Does this mean that the "true" ERA ability of pitchers has much less spread than measured, due to fielding and batted-ball luck?
A bit of a technical note - as I stated in the article, I attempted to duplicate the regressions used to build new SIERA off my own dataset (using different batted ball data than what is used at Fangraphs, so I wouldn't expect a complete match), and I got a p-value for that coefficient of .46. That's very, very large (typically if a p-value is > .1 or .05 it will be omitted from the regression), and it indicates to me that the inclusion of that variable isn't statistically significant (which would be why the sign could flip like that without affecting the goodness of fit of the regression.)
If I omit the two variables with p-values above .4 (the other being the squared netGB term), I actually see the adjusted r-squared rise (although not by a statistically significant amount) and several other model selector tests (such as the Akaike information criterion and the Bayesian information criterion) improve; those are tests that control for the tendency of adding new variables to a regression to increase the r-squared by "overfitting" rather than better predicting the relationship between variables.
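For anyone who wants to replicate that kind of check, the model-comparison step looks roughly like this sketch (column names, variable lists, and the PA weighting are all hypothetical stand-ins):

import statsmodels.api as sm

def compare_models(df, full_terms, trimmed_terms):
    """Fit the full and trimmed specifications and report adjusted r-squared,
    AIC, and BIC; lower AIC/BIC favors the smaller model when the extra
    terms aren't pulling their weight."""
    for name, terms in [("full", full_terms), ("trimmed", trimmed_terms)]:
        fit = sm.WLS(df["era"], sm.add_constant(df[terms]), weights=df["pa"]).fit()
        print(name, round(fit.rsquared_adj, 3), round(fit.aic, 1), round(fit.bic, 1))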
We did independent testing of SIERA at "The Book" blog and it did well. Yes, it's complex, but the intent was to pick up on some of the nuances of BABIP.
I haven't finished reading Matt's recent series at FanGraphs, but there are some thought-provoking issues raised. I'm doing some of my own research (off and on) on the relationship between a pitcher's true-talent ground ball rate and his BABIP allowed (it's not linear) and on his HR per LD+FB (also not linear). Another thing that's not linear is the relationship between runs created per PA (linear weights, wOBA) and runs per game (ERA).
I was a math major at one time 30 years ago, but today Matt has better math skills than I do to run the regression analyses that show how multiple terms, not just ground ball rate, interact to predict base hit and home run rates. Matt's SIERA research published in these recent articles has discovered the non-linearity I spoke of, as well as recognizing that one player's fly balls or line drives are not the same as another's.
Regardless of whether xFIP is a better estimator than SIERA, I believe these concepts will be important issues in the next few years as we try to improve our modeling of batted balls and the projections we build from them.
However, as someone who mixed the ingredients for a house-brand soda manufacturer back in the '90s, I can tell you that in fact there are multiple kinds of citrus flavoring in cola, as well as a variety of other odd things.
I drink way too much Pepsi, but can't stand Coke. I've also got a bottle of Mello Yello sitting here, one of my favorite flavors, much more than Mountain Dew. But I have a hard time distinguishing between 7-Up and Sprite, they are interchangeable to me.
It's just that I (and apparently several others) found the slap at SIERA to be distasteful.
Now I read BP for more subjective articles, like those by Jason Parks, Kevin Goldstein, and John Perrotto, which don't rely on BP's flawed metrics as much.
It's pretty disheartening to see so much go into slamming a former colleague and competitor when so much has languished broken on this site for so long this season. Constructive criticism is well and good, but get your own house in order first and show you can keep it that way before you curiously time the release of an article deriding a competitor.
Come on guys, get your heads together and figure it out. If BP and FanGraphs agreed on this, then Baseball-Reference and ESPN might buy into it and we could all talk the same language.
- hsr/Scoresheetwiz
The added complexity of SIERA does not give us better info.
There are disagreements in the sabermetric community. I'm not sure how "getting our heads together" will solve that. What do you think we've been doing? This discussion is all part of getting to the truth, but it happens over a period of years. It's not a matter of one person suddenly dispensing some enlightenment and everyone else bowing at the feet of their insight. Even something like Voros and DIPS took years to hash out. It's not for lack of effort and dialogue, certainly.
I mean, if SIERA was flawed (as some commentary suggests), weren't people at BP checking on it/understanding it? Isn't there some kind of peer review process? If it was complex to the point that its application was limited, as Mike suggests, why was it adopted? What new evidence or insight came to light that wasn't in all the previous articles about SIERA? Or is it just impossible to use/understand SIERA now that Matt isn't with BP?
Was SIERA flawed (as some commentary suggests)?
Weren't people at BP checking on it/understanding it?
But for me what it boils down to is mindset. I believe that Colin has been leading a change in that regard. I use Colin's name because he's in charge of the stats here, but really all of it is a group effort and there are many people to whom the positive changes should be credited. That is also not to say that this mindset was not present at BP previously either to a greater or lesser degree. I certainly hold a number of the former BP writers and founders in high regard.
What I saw that brought me back around to BP enough that I was willing to put my own name and effort to it was an emphasis on good, logical sabermetric thinking. That everything should be tested to see if it was true. I find Colin to be the clearest sabermetric thinker in the field. If there's something dodgy with the maths or the approach, he can spot it like no one else can. I need to be around someone like that to clarify my own thinking and work. I believe I have unique insights into how the game works and how to handle data, but I can also get off track in my thinking. Colin asks good questions of me that focus my ideas in a profitable direction. Dan Turkenkopf is also particularly good at that. (You guys may not see Dan's name much on bylines, but he's a big reason that everything keeps going around here stat-wise.)
For example, Colin asked a number of questions of me as I was researching my latest piece on batter plate approaches that helped to shape what the final article looked like, and he reminded me of some of Russell Carleton's previous work on the topic that is pushing me in good directions for future research.
It's that sort of thing that I believe is driving us here at BP toward a better understanding of the game. That will ultimately filter into our stats, too, and it has to some extent already, but I see the stats and their specifics as the branches of the tree, with the trunk being a solid sabermetric mindset and dialogue. The fruit, hopefully, is more enjoyment of the game for all of us, reader and writer alike.
Yet, my "beef" isn't so much with retiring SIERA. My issues are threefold:
1. I didn't buy the complexity explanation since it made me doubt the rigor BP puts into its other statistics.
2. I didn't understand why it was a BP staple if there were doubts about why it was needed.
3. The timing (which Colin addressed) and "suddenness" made it look like the retiring of SIERA had to do with Matt's departure, which could have been resolved if BP was better at announcing who was joining/leaving.
I've never been the type to threaten about cancelling my subscription and I'm definitely not talking about that right now. BP's added a lot of writers over the last few years. Still, I do get curious why people are leaving. I see a BP article debunking the usefulness of SIERA after using it as a staple for the last 18 months, yet I see FanGraphs pick it up. How am I supposed to interpret that?
I appreciate you sharing your concerns. I definitely can tell that you have the best interests of BP at heart here as a long-time subscriber and reader who wants to be able to "trust the process" (as a long-suffering Royals fan, I think I'm entitled to use that phrase) and not have to worry whether you're just watching a pretty facade with rotting framework behind it. Your thoughts are always welcome here, and you are more than welcome to email me personally if you don't feel that you get a good enough answer in these comments.
There are, for lack of a better word, political considerations around personnel decisions, both on the part of BP management and the contributor who chooses to or is asked to leave. I know some of the details of the departure of Matt and other former BP writers, but some of them I do not know. I am glad that I am not in charge of figuring out how to manage all that kind of stuff. Steve Goldman wrote a good piece, I thought, a couple months back on that topic, and anything I would have to add to that would be worse.
I can explain further about what I mean about complexity being a negative if that is helpful. I'm not sure that it is.
Regarding ERA estimators in general, the big unresolved question for them is: What is an ERA estimator trying to do? And from that, how do you know if it is successful? If ERA estimators are not trying to predict next year's ERA, then why do we use that as the test for which ERA estimator is the best? If they are trying to tell us the pitcher's true performance in a small sample, what is the true standard for that against which we can measure? There is none, because that is an unresolved question of sabermetrics, and it will be if and until we figure that out.
We don't know how to disentangle the pitcher's performance from the performance of the batters and the fielders and the umpires in small samples. We know much better how to do that in large samples, but for that, we don't need ERA estimators.
I don't worship at the altar of FIP or kwERA any more than I do at the altar of SIERA. They are all tools that tell us something, but we don't really know what that something is, other than that it is, for FIP and kwERA at least, fielding independent, and for SIERA mostly so. So that's definitely worth more than nothing, to know what part of a pitcher's performance was independent of his fielders (other than his catcher). But it is most definitely not telling us exactly what part of his performance was due to his own effort and exactly what part was due to factors out of his control (call it luck, or what have you). We, as a sabermetric field, don't know how to separate that yet at the game level. One of my quests is to figure that out.
I guess, through no particular fault of Colin's, that this article just touched me wrong and got me a bit confused and saddened in a few different ways. Shows I care, right?
*pulls the annual off the shelf and thumbs through Colin's statistical introduction*
I stand corrected and apologize. I'm not sure why I thought that.
Was SIERA incorporated into the PECOTA changes?
Loved this article. Loved the one you wrote explaining the issues with defensive metrics. I read the introductory article on SIERRA and thought it was a bunch of screwed up data mining with unsupported variable interactions. Thank you for dumping it.
-Tom A.
So no one would use a generic HR/FB rate or raw xFIP to predict second-half ERA. You'd obviously under-project HR rate and ERA of every Rockie pitcher and over-project every Padre by doing so. A stat like xFIP is only useful when compared to itself (across pitchers on same team, or across years of same pitcher); it's not well-designed to directly predict actual ERA and I don't think this is a particularly new revelation.
http://www.fangraphs.com/blogs/index.php/new-siera-part-five-of-five-what-didnt-work/
"Park effects themselves come with a bit of noise, though, and the effect of park-adjusting the batted-ball data didn’t help the predictions. The gain simply wasn’t large enough: The standard deviations of the difference between actual and park-adjusted fly ball and ground-ball rates was only 1.4% and 1.5%, respectively, and the park factors themselves were always between 95.2 and 103.9 (the average park factor is set to 100).
...
There were only 16 other pitchers with ERA changes more than .05 in either direction.
As a result, very few people saw major changes to their SIERAs, and the combined effect was a weaker prediction."
"Wait, BP doesn't like stats to be complex? Since when? It took me three years to understand some of their stats. I've read numerous over-my-head articles explaining how the complexities led to a slightly better understanding of the data interpretation and prediction models. In light of this history, 'this algorithm is bad because it is complex' sounds disingenuous."
The timing is also extremely troubling. Discretion being the better part of valor, waiting to publish a criticism of a recently departed coworker would have been the better choice. While peer review is great, this was not just peer review, it was a critical review of a stat that up until the moment this article was published I thought that BP was 110% behind. And it comes right on the heels of a very harsh criticism of fangraphs' prediction system, which was quickly followed by "oops, our prediction system had some big errors in it." I do applaud, however, the civility of this particular comment thread.
I wholeheartedly agree that BP needs to fix its internal process. PECOTA cards have had numerous errors for far too long, both before and after the newer BP staffers have been on the job. But BP also needs to fix a communications issue: what I took away from this article was "we want our stats to be 'close enough for government work.'"
Really? Because I know of a completely free web site where the stats are close enough for government work . . .
No, absolutely not. I'm sorry if that's what you got from this.
We want our whole statistical approach to be the best in helping us and you understand and appreciate the game of baseball. There are times when that is best served by complexity, and there are times when that is best served by simplicity.
The reason we are best served by simplicity in this case is because of how little we as a sabermetric field know about how to assign credit/blame for an individual batted ball among the various participants on the field. We know fairly well how to do that once we get thousands of such events to evaluate, but not at the granular level. If we can establish some falsifiable theories about how that happens, and prove them out, then we can proceed to building more complex explanations of how to assign the credit for what a pitcher's true performance was.
Even FIP is a bit of a lie if you take it to be a true and complete record of the pitcher's performance. If you only take it to be a record of the fielding-independent portion of his performance, then you're on more solid footing. (I don't see many people take it that way in its common usage.) It's a bit harder to remember that you're looking at a lie and exactly what sort of a lie it is when you're looking at SIERA or xFIP or something other more elaborate construction that someone is using to purport to tell you a pitcher's true performance over some time period. That's why complexity is a negative here. If ERA estimators were all about telling the truth, the whole truth, and nothing but the truth, we wouldn't have this problem.
I'm as eager for a better solution to it as everyone else.
So if someone decides on his own to use them as projection systems (as Colin does), he needs to make the adjustments to make it a fair test. Obviously, not every pitcher remains on the same team year-to-year; but most of them do. Same thing with fielding: there is going to be a reasonably high correlation of team defense year-to-year, which will help ERA as a predictor and hurt the fielding-independent estimators.
The short version is that if something is designed to neutralize park and defense (as FIP, xFIP and SIERA are to various degrees), it's not right to use it as a raw predictor of something (actual ERA) that is affected heavily by park and defense. Either make a new metric that is designed to be a predictor of raw stats, or create a new test.
Estimators like xFIP try to show what HR rate a pitcher would likely yield IF he were magically placed in a random/neutral park against a random/neutral schedule. Wyers tests what happens when the player is NOT in this random/neutral environment, but rather in one that is highly correlated to his previous (known) environment. There is no reason one should try to use raw xFIP or SIERA to predict a season in a partially-known environment. If the goal of these metrics is to predict next season accurately (which Wyers seems to be assuming), the correct method is to estimate the biases of each pitcher's next-season environment and make the necessary adjustments to the xFIP/SIERA formulas or to the stats being measured.
Without really understanding 75% of what you wrote in this article (I'm no statistician), it seems to me that the problem could have been solved with a little added accessibility and usability for commoners like myself.
SIERA's formula has been in the glossary since it was introduced. You can get the glossary entry by clicking on SIERA in any stats report where it's a column. The link goes here: http://www.baseballprospectus.com/glossary/index.php?search=SIERA
Maybe it's because I'm married to a graphic designer and have a picky view of presentation now, but the stats and glossary pages are the least usable parts of this site.
No I'm not cancelling my subscription over it. Kevin and Jason make it money well spent.
2) That is essentially what FIP vs. SIERA boils down to, isn't it?
3) Even if the answer to question no. 1 is yes, at a small enough sample size, GB/FB has to be more reliable. Has anyone determined how large of a sample you need for home run rates to be a better indicator than GB/FB if it ever is?
4) That GB/FB is better data than HR/9 at a small enough sample leads me to conclude that SIERA has its use. Where GB/FB data is incomplete, FIP has its use. Why can't they co-exist? Just change FIP to HR-SIERA.
5) Aren't GB/FB/LD and pop-up data getting more and more reliable, accurate, and available?
6) If so, aren't we going in the wrong direction by retreating from SIERA back to FIP? Shouldn't we just be continuing to refine SIERA and keep it state-of-the-art, just as Matt is doing (or, at least, attempting to do)? Really, SIERA was originally just a refinement of FIP or xERA, no?
First, I really like the analysis. This could easily be a week's worth of articles for me to work my way through. Hopefully sleeping on it for one night was enough for it all to sink in.
There are two and a half takeaways that I have from the initial table, neither of which is new information for me:
1) SIERA slightly outperforms xFIP and FIP in the prediction of long-term future ERA when given a small sample size to work from (definitely at 100 IPs, and very slightly at 200 IPs).
1.5) After about two seasons of data for a young starter (300 IPs), the estimators are about as useful a predictor as ERA itself.
2) After 300 IPs, ERA actually becomes a slightly better predictor of future ERA than the estimators.
Why does SIERA outperform the other estimators with the small sample of data? The components may be needlessly complicated once you get to 300IPs, but for small data samples, those extra variables seem to have some added predictive power.
I would LOVE to have seen QERA side-by-side in that first table.
And, as evo34 notes above, if prediction was the real goal of the estimators, then park factors would almost certainly need to be included. Would park factors offer any added value?
For everything else you presented, I'm going to have to re-read and re-sleep-on-it. Could be an interesting and entertaining weekend. Thank you.
1. Thanks to Richard Bergstrom for initiating a conversation that was informative and managed to be both passionate and respectful. He asked some questions that I think a lot of us had on our minds.
2. I agree with some of the other commenters here that the timing for the release of this article was poor. The timing served to link two events together (the decision for BP to discontinue their relationship with Matt Swartz and SIERA, and the decision for Fangraphs to pick them up) that didn’t need to be linked together.
3. Having stated number two, I found that Colin's analysis in the article was compelling, and my sabermetric man-crush on Colin continues unabated.
4. I always thought that a strength of SIERA was that it did not use a variable constant (I call it a "fudge factor") to adjust for run environment. Ideally, any run estimator does not need to be told what the run environment is; the run estimator should tell you. Now that SIERA v2 has a fudge factor, I find it less compelling. (Yes, I know that FIP uses one too, and it bothers me.)
5. As Colin mentions, if two of the coefficients in SIERA have switched their signs from v1 to v2, this calls into question whether the coefficients tell us anything real, or if they just force the hoped-for result to match the data. This also makes me find SIERA less compelling.
6. For me, FIP is most useful as a backward-looking barometer of what happened than a forward-looking estimator of future ERA. I know that Player X will never sustain that HR/FB percentage of 1.8 over the long term, but he did it this year, and that added real value to his team this year.
7. I look at ERA, FIP, xFIP, and SIERA myself trying to figure out who should replace Brett Anderson on my fantasy team, but I recognize that all of them need to come with a complementary shaker of salt. For predictive run estimators, it looks like the quest for the holy grail continues. And the quest is enjoyable to read about and discuss.
But you've certainly given me something to think about, particularly the contention that HR/Con is better than HR/OF (or HR/FB). That is counterintuitive to me, and I'm going to have to read your explanation several more times to understand it.
To me, xFIP is a useful stat that tells you something about a pitcher, but (IIRC) I resisted putting it on the THT stats page because I didn't see it as a "reference" stat. That was silly of me, I guess. Many readers asked to see it, so we eventually added it.
Similarly, we used to run "ERA-FIP" at THT, which was also something requested by readers. I was kind of uncomfortable with that too and when we got a request to run ERA-xFIP too, I refused. I thought it put too much emphasis onto a single number and calculation. Time has shown that I'm in the minority in that position.
I guess I am someone who is uncomfortable with the quest for a complex stat that explains everything. I am leery of issues like multicollinearity and other things I can't pronounce or understand. I intuitively won't trust a stat I can't understand. I make only two exceptions: projection systems and total-win based systems.
So I wish you guys success on your own statistical quests but please: don't try to do too much. Keep it simple.
I'm completely with you on your second-to-last paragraph, and my skepticism extends to projection systems and total-win based systems, too, though there's sometimes a need to use them. For example, when it comes to my fantasy league, I love that there is a projection system I can use. In a theoretical world with infinite time, I might like to roll my own so that I understood all the pieces better, but in the real world, I'm glad that Colin and others are doing it for me.
One of my favorite sabermetric articles ever was the one that Bill James wrote in the 1987 Abstract about how to evaluate a statistic. That has been very influential on future development as a sabermetician. I want to be able to understand clearly what a statistic is telling me about a player or about the game. The mathematical pieces should simply be ways of representing other concepts with numbers or equations. That may be my physics training coming into play, too. I learned most of my higher math through physics classes. Anyhow, that's simply a different way of saying that I hear your perspective loud and clear.
I agree 100% that sometimes there is a need to have projection systems, even if we don't totally understand them. There are a few things you're willing to take on faith. Not many, just a few.
I've said this before, but I continue to be puzzled that component approaches to ERA leave out pitcher SB/CS and baserunner pickoff analysis. Derek Lowe is almost absurdly easy to steal on, while some lefties have virtually no stolen base attempts against them. It seems likely that this difference has a statistically significant impact on actual ERA.
I've been trying to analyze this stuff a bit, but the data I can find aren't completely accurate. For example, a successful pickoff attempt that is negated by a bad throw by the firstbaseman, allowing the runner to get to second, is scored as a stolen base and charged to the pitcher who picked the baserunner off in the first place!
Or is this the kind of complexity that everyone is railing against?
I tried again using the queries Colin set up for this study a couple days ago, running the queries against some FIP_PLUS metrics with various parameters, and again failed to come up with anything useful.
Re: studes / Fast: I actually agree with Dave and Mike's overall preference for stats that mean more specific things - or that we at least fully understand when we see them. Generally, the pitching stats I mainly look at are K/9, BB/9, and GB/FB - with due consideration to park, defense, work load, durability, etc. Knowing those components (K, BB, & GB%) informs how to apply that pitcher's context for whatever time frame you want to look at. But there needs to be a stat that also sums up those three main components in one number, just for quick reference. It is not a question of which kind of stat is better; it is preferable to have both.
Are you implying that the distribution of expected home runs based on contact rate is wider than that for fly ball rate?
"Fly balls on balls in play are a much poorer predictor of future home runs than home runs on contact, with an r-squared of only .014."
But your table shows an R-squared of .023 for FB/CON. What's the diff between .023 and .014?
And aren't those abysmally low R-squared figures? I'm used to getting R Squareds in the .20 to .30 range.
(In-play fly balls + Home Runs on Fly Balls)/(Balls in play + Home Runs)
If you remove home runs from both the numerator and denominator, you get FB/BIP.
What I was trying to illustrate here is that you can divide fly balls into two types - ones that are home runs, and ones that aren't. FB/CON treats them the same for predicting future home runs, but home runs have greater predictive value than FB that aren't home runs.
A good test might be to ask whether HR/FB is more predictive of second half home runs than HR/CON?
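To put toy numbers on that decomposition, here is a minimal Python sketch; every count below is invented purely for illustration, and it simply follows the definitions stated above rather than any real data:

```python
# Toy counts (hypothetical) to show the FB/CON decomposition described above.
in_play_fb = 120       # fly balls that stayed in the park
hr = 15                # home runs, treated here as fly balls that left the park
balls_in_play = 400    # all batted balls that stayed in the park

contact = balls_in_play + hr                 # contact = balls in play + home runs
fb_per_con = (in_play_fb + hr) / contact     # FB/CON: both kinds of fly ball count equally
fb_per_bip = in_play_fb / balls_in_play      # FB/BIP: home runs removed from top and bottom
hr_per_con = hr / contact                    # HR/CON: only the fly balls that left the park

print(f"FB/CON = {fb_per_con:.3f}, FB/BIP = {fb_per_bip:.3f}, HR/CON = {hr_per_con:.3f}")
```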
"...there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that is a home run."
But isn't this just as true of HR/CON? In fact, couldn't you turn this argument around and say that this is why HR/FB is preferable to HR/CON? Because fly balls are more predictive of future home runs than, say, ground balls?
I love the topic and the conversation.
But it's pretty hypocritical (of all of us) to stand on our high horses and decry the "world is flat" mentality of people like Murray Chass as ridiculous, yet be completely unable to present a unified alternative.
It's the same problem I see at work. More people like to yell and holler about the problems, but nobody really talks about a solution.
P.S., this wasn't directed at Colin or BP, it was just a general bitch session.
I see no hypocrisy at all.
Sabermetrics tries to follow some of the rules of academia, with peer review, case studies, and all that... and it also has many competing ideas at play. Yet I'm not sure an article with this kind of tone and timing would appear in an academic journal. It'd be like a professor leaving University A for University B, followed by a bunch of articles from professors at University A immediately debunking their former colleague's work.
Damned if you do, damned if you don't. In a few weeks, this article will drop off the home page and we'll all move on with life.
BTW, Colin is definitely saying that SIERA is "bad" in some ways. Look again at his statements about multicollinearity or why using HR/FB is bad (which I still don't get). Mike is saying that an approach that relies too much on regression is, indeed, "bad."
The staff who are still at BP are most definitely committed to learning about how to measure and understand pitching. I would count myself among the foremost in that desire.
SIERA is being retired because we believe that approach is dangerous, in that it can convince you it's teaching you things about pitching that will end up being false, and because we believe there are better ways to learn about those things.
We are not retiring from the discussion of learning about quantifying pitching skill and performance. Far from it! Matt Swartz and I are going to have very different approaches to the subject going forward. Obviously, I believe my line of inquiry will be more fruitful, or I wouldn't be pursuing it. That's not to say you shouldn't listen to both of us and decide for yourself what to think.
Above, I mentioned axioms I don't like; another one is, "Good enough is the enemy of great." That one's crap because, look, it's nice if we can gain a hundredth of a decimal point of "accuracy" in projecting a stat for a game of sports, but not if it takes utilizing differential equations and 10th-order polynomials. Especially when compared to (for example) FIP, which is a plain ol' 7th-grade pre-algebra formula that gets us 90% of the way there. I don't see complexity for minimal gain as progress in the war against outdated statistics like Wins, Saves, and "scrappiness". In this particular case, unless there's a compelling argument why we need to add complexity, good enough truly is good enough.
I think that's the layman's terms of what you guys are trying to say, right Mike? It's not that you DON'T want to improve, it's just that you want to use a little bit of common sense in how you go about it.
Those of us who don't sleep with a copy of "Applied Regression Analysis and Generalized Linear Models" thank you.
From a can of worms standpoint, this could be interpreted ominously if there are other major differences between the people who are here now as opposed to those who were here 18 months ago.
Also, "dangerous" is a very strong word and, at worst, is not the kind of word that encourages further discussion. Ideally, both your line of inquiry and Matt's should both furthering our understanding of baseball... I am definitely not in the mood to choose sides or consider this whole discussion an either/or proposition.
And I'm not sure what's ominous about that. It's not the case that people who disagree are let go from BP. We have all sorts of internal disagreements. As I've stated elsewhere, some of the former writers are people who I hold in great esteem. Matt has done good work, too. But it should be obvious that there has been a change in philosophy at BP over the last couple years, and some of that is related to a change in personnel (with causation running in both directions). One could obviously read too much into that if one assumes that every departure is due solely to that reason. In some cases it probably is and others it has nothing to do with it.
Let me be clearer on the ominous part. I've been here for five or so years now. I've seen many people come and go and I also understand that further research can lead to better metrics or modifying older metrics. I don't think I've seen Davenport translations or PECOTA or MORP or other BP stats or projection systems debunked and thrown completely and immediately out the window when the creator left BP.
Let's say, in a hypothetical, that Jay Jaffe left BP tomorrow. He's been using JAWS to evaluate Hall of Famers for quite a few years. JAWS isn't my favorite metric in the world, but I understand how and why he uses it to evaluate Hall of Famers, and I find what he does with it pretty entertaining and insightful. So, if the month he left, JAWS were debunked in a BP article because the new guard thought it was a horrible metric, I would wonder why no one said or did anything sooner. I would wonder why I was told for a month or a year, "Hey, this has the BP Seal of Approval," and then, when the hypothetical Jaffe left, "Hey, this was never any good." It'd make me question the vetting process of everything BP produces and wonder whether metrics are used because they are a) good/proven/vetted or b) popular with whoever is in charge at BP.
Am I being extreme or overblowing things? Maybe. I like to think I'm not the kind of person to exaggerate. But before I saw this article, I didn't have these kinds of doubts and questions.
As to metrics thrown out completely when the creator left--Nate Silver's Secret Sauce got that treatment:
http://www.baseballprospectus.com/article.php?articleid=12085
Nate's obviously a sharp guy, so take that for what you will, but I see/saw it as a good thing.
Re:
"It'd make me question the vetting process of everything BP produces and wonder whether metrics are used because they are a) good/proven/vetted or b) popular with whoever is in charge at BP."
It's honestly probably some of both, but I'd also point out that those two are not mutually exclusive options. In addition, it's a good thing when our less stat-heavy writers at BP tell us, "We can't use this stat in our writing, it doesn't make sense and we can't explain it or justify it to the readers." (Hypothetical example--I'm not talking about SIERA.) We are going to be human in our understanding and our choices here, for all that entails, good and bad.
Vetting is not simply a matter of Colin looking at a stat and calculating some standard errors and p-values and blessing it or consigning it to the dustbin. It's a process of filtering through various people on the team with perspectives that include statistical knowledge, philosophy of what we're trying to do at BP, knowledge of how the game works, connection with what the readers want and need, and hopefully a good dose of common sense. It is also, at times, going to be a matter of trial and error.
Re: Secret Sauce was given a little wiggle room as possibly a run of bad luck, and it wasn't retired as soon as Nate left, either. The interesting thing is that Secret Sauce was used for longer than SIERA, but the Unfiltered blurb and the rationalpastime article combined were about half the size, word-count wise, of the SIERA retirement article. Also, "What I'd hope you take away from this is not that Secret Sauce is worthless" and that article in general are quite different in tone from "In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on" and the SIERA article.
Re: Both
My impression with the article and comments is that people at BP either a) thought it took the wrong approach or b) weren't really sure how or why SIERA worked since coefficients seemed to be plucked out of a hat. I would think the people at BP vetting/validating it would have an understanding of it and might have addressed some of the issues Colin had raised in the last 18 months.
To be a bit more explicit, before this article, I was quite sure that the stats people at BP checked each other's work and had an understanding of the metrics being used. The impression I got from this article was that SIERA was either broken or not optimal for the last 18 months and people either didn't understand its complexity or didn't think it was worth the effort to fix. So then, I began asking "Were these concerns raised in the past? If so, why bring them up now? If not, was someone validating SIERA?"
In any event, Dave's response satisfied me in regard to the timing. The tone? As I said below, I thought it should've been more of a "State of the Prospectus" piece wishing Matt luck, followed a few days or weeks later by an article in this vein comparing SIERA and FIP. I just think mingling the business/politics of a person leaving with the simultaneous research/critique/criticism/debunking of that person's idea leads one to question the critique and wonder why it did not occur sooner.
As I've also indicated, I'm not some die-hard SIERA fan. But these circumstances _did_ make me wonder/question some of the processes at BP. And I'd like to thank you and the rest of the staff for trying to address my concerns.
I guess I just view the question of "learning about how to measure and understand pitching" as very different from finding the best method to forecast a statistic. Now, I haven't looked into the discussion of the merits of the metric too deeply, but a change of sign in a coefficient isn't necessarily a big problem when the terms in a regression change (due to correlation between the regressors and the regressand, changes in the coefficients are expected, and the region around 0 isn't special).
It is precisely the fact that it is being advanced by Matt and others as a tool for explaining the causal effects of pitching results that bothers me, and all my objections to it are in that regard.
For what it's worth, I think the Angrist-Imbens-Rubin methods of causal econometrics are under-utilized by those interested in the causal relationships in baseball. E.g. the answer to the question "what would happen if player X were to throw their fastball more often?" is best answered with their techniques, not assuming that the run value of their fastball is fixed at a given number.
I have no problem with people openly disagreeing with each other. In fact, I think it's healthy. But I agree with you that it's to no one's credit if people start getting personal and slinging mud.
I also agree that there is a little bit of mud slinging here. Colin can be harsh in his assessment at times. But it's to his credit that he doesn't hold back in his assessments and, as far as I know, doesn't let things get personal.
As I said in the article - the decision to retire SIERA was not a decision to go to war with the sabermetric community; the goal was actually to surrender this fight entirely. After I had written the first draft of this article, Fangraphs started running the revised SIERA on their site. After they published the revised formula, we made the decision to evaluate the revised formula and incorporate that material into the article.
Once Fangraphs started publishing SIERA, I don't know that we were faced with a "good" option as to how to proceed. Acknowledge the revision, and we look like we're attacking. Ignore it, and we look like we're not open to outside criticism, or like we don't think our critique stands up against the revision. Publish it now, and again we look like we're attacking. Wait longer, and we perpetuate a whole lot of confusion by leaving two versions of SIERA in the wild without addressing concerns from our readers over whether or not we would incorporate the new revisions.
But at the end of the day - you can now go to Fangraphs or BP and find the same ERA estimator. (We calculate the annual constants slightly differently - they use one constant for both leagues, while we use a separate constant for each league - but by and large they're exactly the same measure.) I think in the long run, the decision to retire SIERA in favor of FIP ends up accentuating commonality, not difference, in the sabermetric community.
I agree, and that's what I'm applauding. At the same time, I see Richard's point about the tone and timing. As I said, damned if you do, damned if you don't. I have no beef with this decision.
Patriot's (developer/co-developer of BaseRuns and Pythagenpat) blog:
http://walksaber.blogspot.com/2011/07/saying-nothing-about-era-estimators.html
Tango and MGL's Book blog:
http://www.insidethebook.com/ee/index.php/site/comments/siera_updated/
"cwyers Colin Wyers
I developed a new ERA estimator: 1*ERA+0. The r-squared is f@cking fantastic."
Stay classy.
By the way, re-reading Colin's point about SIERA's smaller standard deviation makes me wonder if the heavily regressed PECOTA projections this year are going to benefit in terms of RMSE relative to previous projections that had a higher SD. So perhaps the same standardization will benefit the comparison between projection systems after this year.
To the second - yes, I think this is something that affects projection systems as well, and it's something I hope to address further after the season. A quick note on that: I did a quick comparison between this season's PECOTA pitching projections and the Marcels (I chose the Marcels primarily because they come with standardized player IDs, thus making this sort of comparison easy).
Looking at the standard deviation of players in common between the two systems (weighted by the harmonic mean of IP projected by each), in terms of ERA, I get:
PECOTA: .53
Marcels: .24
Now bear in mind that I omitted players without an explicit Marcels projection; if, as Tango suggests, I forecast players who debut in MLB this season to perform with the league-average ERA, that will further drop the SD of the Marcels relative to PECOTA.
So I don't think, in a comparison with other projection systems, PECOTA is necessarily benefiting from "heavy regression" - PECOTA is regressing far less than at least one popular projection system.
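For anyone curious what a comparison like that looks like mechanically, here is a rough Python sketch. The player rows, system labels, and ERA/IP values are all invented, and the harmonic-mean weighting is just one plausible reading of "weighted by the harmonic mean of IP projected by each," not the actual query:

```python
import math

# Invented projections: (player, ERA_system_A, IP_system_A, ERA_system_B, IP_system_B)
projections = [
    ("pitcher_1", 3.40, 180, 3.70, 170),
    ("pitcher_2", 4.10, 150, 4.05, 160),
    ("pitcher_3", 4.80,  60, 4.30,  70),
]

def weighted_sd(values, weights):
    # Standard deviation of projected ERA, weighted by the per-player weights.
    mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(var)

# Weight each player by the harmonic mean of the two systems' projected IP.
weights = [2 * ip_a * ip_b / (ip_a + ip_b) for _, _, ip_a, _, ip_b in projections]
sd_a = weighted_sd([era_a for _, era_a, _, _, _ in projections], weights)
sd_b = weighted_sd([era_b for _, _, _, era_b, _ in projections], weights)
print(f"Weighted SD of ERA projections -- system A: {sd_a:.2f}, system B: {sd_b:.2f}")
```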
Your analysis has made me question whether I care about using the projection system with the lowest RMSE -- I may be willing to sacrifice a certain amount of error if the SD is more representative of reality, taking into consideration the difference between observed variance and true-talent variance. Other tests (like heads-up) can be used, of course, to meet those goals.
I think those are both interesting questions, and I don't have an answer for you at this point in time, but it's something I'm going to research.
"Your analysis has made me question whether i care about using the projection system with the lowest RSME -- I may be willing to sacrifice a certain amount of error if the SD is more representative of reality, taking into consideration the difference between observed variance and true talent variance. Other tests (like heads-up) can be used, of course, to meet those goals."
I would agree - I think that, to the extent that you can maintain accuracy of your system otherwise, a larger SD of your forecasts is better. Given that, I'm starting to think about tests like heads-up testing and other alternatives to straight RMSE testing. I'm really hoping that some outside analysts do the same - it'd be nice if some people who don't have "skin in the game," so to speak, are coming up with similar answers.
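For readers unfamiliar with the idea, a heads-up test can be as simple as counting, pitcher by pitcher, which system's projection came closer to the actual result. A minimal sketch with invented numbers, just to show the shape of the test:

```python
# Invented rows: (actual_ERA, system_A_projection, system_B_projection)
results = [
    (3.55, 3.40, 3.90),
    (4.60, 4.10, 4.45),
    (5.10, 4.80, 4.30),
]

# Count, for each pitcher, which system's projection landed closer to the actual ERA.
wins_a = sum(1 for actual, a, b in results if abs(actual - a) < abs(actual - b))
wins_b = sum(1 for actual, a, b in results if abs(actual - b) < abs(actual - a))
ties = len(results) - wins_a - wins_b
print(f"System A closer: {wins_a}, system B closer: {wins_b}, ties: {ties}")
```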
This is, of course, untrue. The article was published on July 25, a full week after SIERA was introduced at Fangraphs.
Earlier this year, Matt Swartz sent me an email saying that he wanted to take SIERA somewhere else. I told him that we didn't have a problem with that. I also told Matt that we planned on running SIERA for the rest of 2011, and that we would be retiring the stat after that.
When Matt recently announced that he was re-publishing SIERA at Fangraphs, we were fine with that, and it didn't change our plans at all. What we weren't expecting was for SIERA to be re-formulated and published under the same name. The Internet now has two different sets of numbers attached to the same stat name on two different websites. This is absolutely certain to generate unnecessary and frustrating confusion on the part of people who aren't closely following the topic. Now we can expect questions: why doesn't your SIERA match their SIERA? Theirs is newer, and they say it is better--why are you still running old numbers? Which SIERA am I expected to use? This is not a position we want to be in.
At this point you can argue that SIERA itself suffers when there are multiple versions floating around. We weren't sure what we were going to do about an ERA estimator when we decided to retire SIERA, but faced with the prospect of contributing to a confusing situation, we decided that changing our timetable and replacing SIERA immediately was the right thing to do. Our options at that point were to quietly replace SIERA with something else or to announce what we were doing. We decided that Colin should fast-track his SIERA retirement article and that we would use it to announce the change. When the article was ready on Friday, we posted it.
I do wish something to this effect had been posted at the beginning of the article or on Unfiltered, to avoid the appearance of ripping on Matt/Fangraphs.
What I am most disappointed with though, is the lack of THIS eminently reasonable explanation up front. This entire article (and the yanking of SIERA itself) should have been prefaced with this statement, and how it wasn't is completely beyond me. Then Colin's tone, while still disappointing, would at least have been cast in a different light.
The ironic thing is, I never used SIERA or TAV in my own writing or analysis, but with all this stirred up, I have much more interest in SIERA than I ever had before. Perhaps, this is all a joke and that was the plot all along.
The issue here is what we would use SIERA or FIP for. If it is for seeing how well someone is pitching over a segment of a season - and I suggest that is its most practical purpose - then Colin's chart comparing RMSEs at 100-inning increments is misleading. Let's see how these metrics compare at 20 innings, 40, 60, 80, 100, 125, 150, 200, 300, and 400.
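A rough Python sketch of what that comparison would look like mechanically; the data below are invented stand-ins, and with real data each cutoff would carry its own set of pitcher rows:

```python
import math

def weighted_rmse(rows):
    # rows: iterable of (estimate_through_cutoff, rest_of_season_era, rest_of_season_ip)
    num = sum(ip * (est - era) ** 2 for est, era, ip in rows)
    den = sum(ip for _, _, ip in rows)
    return math.sqrt(num / den)

# Toy data keyed by innings cutoff (all numbers made up for illustration).
samples = {
    40:  [(3.10, 4.20, 140), (5.00, 4.40, 120), (4.20, 3.60, 150)],
    100: [(3.60, 4.05, 90),  (4.70, 4.45, 80),  (4.00, 3.70, 100)],
}

for cutoff, rows in sorted(samples.items()):
    print(f"through {cutoff} IP: weighted RMSE = {weighted_rmse(rows):.2f}")
```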
It comes down to the question of, what do we mean by how well somebody is pitching? I mean, I can identify certain deficiencies in how ERA evaluates a pitcher's performance. It ignores "unearned" runs, based on an inconsistently applied and essentially capricious definition of which runs are unearned. It gives the pitcher credit (or blame) for the actions of his team's defense. It gives a pitcher credit (or blame) for how well his follow-on pitchers do at handling inherited runners.
SIERA addresses one of those issues - it (largely) doesn't hold a pitcher accountable for the actions of his defense (although if there is a hit/no-hit bias in the batted ball data, there is a possibility that it is still giving the pitcher some responsibility for the actions of the defense). But by regressing against ERA, and including terms such as ground ball rate and starter IP percentage, SIERA perpetuates the other two failings of ERA that I note above.
The trouble is that in terms of assessing a pitcher's in-sample performance, there's nothing to test SIERA against. If there was a consensus on a stat that was better at measuring a pitcher's performance in sample, we wouldn't have to test SIERA against it, everybody would just use it. So SIERA tests against out-sample ERA as a proxy for "skill." As discussed above, in real (as opposed to fantasy) baseball, ERA is deficient as a measure of actual performance. But more importantly - there are things that can occur over a small sample that are not repeatable but are certainly part of how well he is pitching for that period of time. Testing against out-sample ERA doesn't tell you which metric does the best at measuring these aspects of performance, it tells you which is best at ignoring those parts of performance. To some extent it comes down to how much you care about those aspects of performance, compared to aspects that are more readily measured.
Here's where I could say something about how an overly reductive approach treats players as Strat-O-Matic cards, but let's be blunt - a whole lot of the interest in these metrics is because people almost literally want to know what a player's Strat card will look like for an out-sample period, such as the rest of the season. The trouble here is not so much SIERA itself, but misuse of the metrics. To the extent that sabermetric education has occurred outside of a few scattered enclaves on the Internet, what's important is not that we've taught people to use OPS instead of RBI, or ERA instead of wins, or our new trendier metrics instead of the simpler metrics that the last generation of sabermetricians fought to get established in the first place. It's the fact that we have now conditioned large swaths of people to have a Pavlovian response to predictions based on a limited number of observations - "small sample size." I mean, we've gotten to the point where reporters who cover sports for a living are more well-versed in things like true-score theory and measurement error and other bedrock principles of statistics than reporters who cover health, the economy, or any number of other topics.
Misusing SIERA as a projection system utterly ignores that; more to the point, chasing after minimal improvements in predictive accuracy from samples with selective endpoints utterly ignores that. And I say "misuse" advisedly - Matt himself has clearly stated, repeatedly, that SIERA is not a projection system and is not intended to be one. So when it comes to questions like, "what's the best stat to use to predict a pitcher's future performance based on 40 IP," to me that sounds like trying to decide what caliber bullet to use to shoot yourself in the foot with. If we really want to predict a pitcher's future performance, we need to use more than 40 IP; the most valuable data that comes with a pitcher's stat line over that small a sample is his name.
"...we have now conditioned large swaths of people to have a Pavlovian response to predictions based on a limited number of observations - 'small sample size.'"
1. The starter IP percentage acts as a proxy for the "Rule of 17" that I noted on my blog a year or two ago. That is, the more you pitch as a starter, the higher your BABIP and the higher your HR/PA. Since SIERA purposefully ignores both BABIP and HR, starter IP percentage is an excellent parameter for it to include. That's not to say it's used in the best way, but that it's used at all at least shows that Matt had a great insight here. That SIERA is the only estimator that even thought to include it is a huge feather in its cap.
As an example, it would be ridiculously easy for me to change the "3.2" constant in FIP to something else, like "3.3" for starters, "2.9" for relievers, and a sliding scale for the rest. (A rough sketch of this idea appears just after this comment.)
2. I agree that the true test must be against RA9. Though, in reality, any test against RA9 will deliver virtually the same result as ERA (if you select a random group of pitchers... but, if the only pitchers you select are all GB pitchers, then ERA will not be a good test).
The distinction between ERA and RA9, with regards to what we are talking about here, is being very nitpicky. That's not a bad thing, but it's not a good thing either.
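To make the sliding-constant example above concrete, here is a toy Python sketch. The 2.9 and 3.3 endpoints come straight from the comment; the simple FIP form and the linear blend between them are illustrative assumptions, not a vetted recalibration:

```python
def fip_with_role_constant(hr, bb, so, ip, starter_ip_fraction):
    # Slide the constant between a reliever value (2.9) and a starter value (3.3)
    # based on the share of the pitcher's innings thrown as a starter.
    # This is an illustrative assumption, not an endorsed formula.
    constant = 2.9 + (3.3 - 2.9) * starter_ip_fraction
    return (13 * hr + 3 * bb - 2 * so) / ip + constant

# Invented stat lines: a full-time starter and a full-time reliever.
print(round(fip_with_role_constant(hr=22, bb=55, so=170, ip=200, starter_ip_fraction=1.0), 2))
print(round(fip_with_role_constant(hr=4, bb=22, so=75, ip=65, starter_ip_fraction=0.0), 2))
```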
http://www.insidethebook.com/ee/index.php/site/comments/my_issue_with_regression_equations/
You say pretty clearly that you wish that SIERA adopted a fundamental model of baseball team scoring, like BaseRuns, so that we wouldn't have to disentangle these sorts of effects.
I understand that a lot of what I'm talking about when I critique SIERA relative to something like FIP is trivial or, as you say, "nitpicky." But that's because the only difference between SIERA and these other stats are trivial, "nitpicky" differences. That's kind of the point here, isn't it? The correlation between SIERA and xFIP, as reported by Matt, is .94. The difference between the two in predicting out-sample ERA, again according to Matt, is .01 runs per nine innings. If your position is "differences that trivial don't matter," then doesn't it follow that SIERA doesn't matter?
I agree that the presentation of SIERA obfuscates more than it enlightens.
I also know that there are kernels of truth in there that would be much more powerful if it followed the model that Patriot is espousing.
So, SIERA would benefit from further exploration and better presentation. It still deserves a place, and whether it's at Fangraphs or here, it doesn't really matter.
I mean FIP deserves a place too, and it took until you came here to give it that place, which is a ridiculously long time to present such a simple stat.
But the appellation SIERA belongs to a particular metric with a particular framework (if you wish to be pedantic, you could say two metrics, but the two versions of SIERA are far more similar than they are different). Not carrying that specific stat doesn't prevent BP from doing research into some of the concepts behind it. It's not like BP is incapable of contributing to the discussion of fielding metrics just because we call ours something other than UZR - calling it UZR would do nothing but confuse people who have (rightly) come to expect UZR to mean a very specific thing.
I agree with you that the implementation of SIERA "obfuscates more than it enlightens." I think us keeping the name for a significantly different metric, while the original SIERA still exists in the wild, would do much the same. Neither point means that we're leaving the conversation.
If you base a prediction just on the 40 innings, you are assuming the latter, which misses the point, because the whole question is what these 40 IP mean.
If you base a prediction on past performance plus the new 40 IP, then your prediction essentially assumes the former - that the guy is the same pitcher.
In other words, any prediction is close to useless because it assumes an answer to the question you're asking, but it doesn't really answer the question.
I think what people really want to know is the probability of this 40 IP sample occurring given the past performance of this pitcher. If you can look at that 40 IP sample and say there was less than a 10% chance of it happening randomly given this guy's previous level of performance, then maybe you conclude that the chance of him having improved is large enough to warrant picking him up in your fantasy league.
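One crude way to frame that question, sketched in Python with made-up numbers and a deliberately simple Poisson model of runs allowed (a real analysis would want something richer than this):

```python
from math import exp, factorial

# Assumptions for illustration only: runs allowed are Poisson around the
# pitcher's established run average, and the numbers below are invented.
established_ra9 = 4.50   # what we believe the pitcher's true run average to be
ip = 40.0
runs_observed = 10       # hypothetical hot stretch: 10 runs in 40 IP (2.25 RA9)

expected_runs = established_ra9 * ip / 9.0

def poisson_cdf(k, lam):
    # P(X <= k) for a Poisson(lam) count
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(k + 1))

p = poisson_cdf(runs_observed, expected_runs)
print(f"P({runs_observed} or fewer runs in {ip:.0f} IP given a true {established_ra9} RA9) = {p:.3f}")
```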
1) Why were PECOTA pitcher win-loss record projections so screwed up in the pre-season?
2) What was changed in the PECOTA rest-of-season weighting scheme a few days ago, and why was it done?
3) Why is it assumed that a good test of ERA estimators is using the past 1000 IP to predict the next 200 IP?
4) Why was the park environment ignored in assessing HR rates vs. flyball rates as predictive indicators?
5) Why was schedule ignored in assessing HR rates vs. flyball rates as predictive indicators?
6) "Aren't you graphing actual HR rates vs. predicted HR rates based on flyball rates? Of course the actual HR rate is going to be wider--that's the nature of real life vs. projection, right?
Are you implying that the distribution of expected home runs based on contact rate is wider than that for fly ball rate?"
7) "A good test might be to ask whether HR/FB is more predictive of second half home runs than HR/CON?"
8) "...there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that is a home run."
But isn't this just as true of HR/CON? In fact, couldn't you turn this argument around and say that this is why HR/FB is preferable to HR/CON? Because fly balls are more predictive of future home runs than, say, ground balls?"
If you look at first and second half HR/CON vs FB/CON rates for 2002 - 2010 (cutoff 200 TBF in each half, using 180 days into the season as the 1st half / 2nd half split) you get this:
R^2:
FB/CON: .084
HR/CON: .046
This is more in line with previous research and shows the batted ball data to be useful when predicting second half home runs.
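For anyone who wants to run a test along these lines, here is a sketch of the calculation in Python. The file name and column names are hypothetical; only the 200 TBF cutoff in each half comes from the comment above, and this is just one reading of the split-half setup being described:

```python
import pandas as pd

# Hypothetical data file: one row per pitcher with first- and second-half
# batted-ball counts (columns assumed: tbf_1st, tbf_2nd, fb_1st, hr_1st,
# contact_1st, hr_2nd, contact_2nd).
halves = pd.read_csv("pitcher_halves_2002_2010.csv")
halves = halves[(halves.tbf_1st >= 200) & (halves.tbf_2nd >= 200)]

halves["fb_con_1st"] = halves.fb_1st / halves.contact_1st
halves["hr_con_1st"] = halves.hr_1st / halves.contact_1st
halves["hr_con_2nd"] = halves.hr_2nd / halves.contact_2nd

# Which first-half rate tracks second-half HR/CON better?
for predictor in ["fb_con_1st", "hr_con_1st"]:
    r = halves[predictor].corr(halves["hr_con_2nd"])
    print(f"{predictor} vs. second-half HR/CON: r^2 = {r ** 2:.3f}")
```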
You know, pissing and moaning about tone is unbecoming when you're desperately trying to do a hatchet job, and ignoring facts and context to get the job done.
Thanks. Whatever the value of 40 innings' worth of data is, it does have some value. There are situations, whether in predicting success or describing hotness, where you want the best stat for analyzing those 40 innings. You might be comparing two players who are very close, and that stat could be the best arbiter... and whatever value GB data has regarding HR yields over that small a sample, I still believe it has more value than HR/9. I do not claim to have made that notion up; others I respect think so as well. I just don't have a study handy that shows it... and whatever the benefits or detriments of all of SIERA's many complicated nuances vs. FIP's simplicity, the most significant one is that it uses GB data instead of HR/9 data.
If Colin and Mike want to concentrate their efforts on PITCHf/x and HITf/x because this is small potatoes, go for it. I look forward to their findings. Meanwhile, Fangraphs has a working SIERA if I feel the need to check it out.
TangoTiger makes a case that HR per fly ball is, to some small degree, a skill: http://www.insidethebook.com/ee/index.php/site/comments/pre_introducing_batted_ball_fip_part_2/
Paapfly shows just how great Matt Cain is at generating a low HR/FB: http://www.paapfly.com/2011/02/matt-cain-ignores-xfip-again-and-again.html
So a newer, improved SIERA would include appropriately regressed HR/FB data.
SIERA purposefully limits itself to:
1. not using HR
2. not using prior years
Those are constraints it imposes upon itself.
Marcel purposefully uses HR and uses prior years, as does PECOTA.
FIP purposefully limits itself to BB, HB, SO, HR, and current year.
All these decisions are made because the metrics are trying to answer a specific question.
So, don't expect any of them to change to be "better", because they have already decided that they want to be limited to some extent.
What's going on here? You poach us like a minor league team, so we'll poach back to gain major league acceptance?
Personally, I always enjoyed David's interviews. However, they were a guilty pleasure. I didn't make time to read them often enough.
And yeah, I liked David's interviews too.