Recently, there has been a lot of digital ink spilled about ERA estimators—statistics that take a variety of inputs and come up with a pitcher’s expected ERA given those inputs. Swing a cat around a room, and you’ll find yourself with a dozen of the things, as well as a very agitated cat. Among those is SIERA, which has lately migrated from here to Fangraphs.com in a new form, one more complex but not necessarily more accurate. We have offered SIERA for roughly 18 months, but have had a difficult time convincing anyone, be they our readers, other practitioners of sabermetrics, or our own authors, that SIERA was a significant improvement on other ERA estimators.
The logical question was whether or not we were failing to do the job of explaining why SIERA was more useful than other stats, or if we were simply being stubborn in continuing to offer it instead of simpler, more widely adopted stats. The answer depends on knowing what the purpose of an ERA estimator is. When evaluating a pitcher’s performance, there are three questions we can ask that can be addressed by statistics: How well he has pitched, how he accomplished what he’s done, and how he will do in the future. The first can be answered by Fair RA (FRA), the third by rest-of-season PECOTA. The second can be addressed by an ERA estimator like SIERA, but not necessarily SIERA itself, which boasts greater complexity than more established ERA estimators such as FIP but can only claim incremental gains in accuracy.
Some fights are worth fighting. The fight to replace batting average with better measures of offense was worth fighting. The fight to replace FIP with more complicated formulas that add little in the way of quality simply isn’t. FIP is easy to understand and it does the job it’s supposed to as well as anything else proposed. It isn’t perfect, but it does everything a measure like SIERA does without the extra baggage, so FIP is what you will see around here going forward.
Why ERA Estimators?
Why do we want to know a player’s expected ERA? We can measure actual ERA just fine, after all, and we have good reason to believe that ERA is not the best reflection of a pitcher’s value because of the unearned run. Component ERA estimators are frequently used to separate a pitcher’s efforts from those of his defense, but you end up throwing out a lot of other things (like how he pitched with men on base) along with the defensive contributions. Here at BP, we use Fair RA to apportion responsibility between pitching and defense while still giving a pitcher credit for his performance with men on base. Sean Smith, who developed the Wins Above Replacement measure used at Baseball Reference, has his own method of splitting credit that doesn’t rely on component ERA estimators. They simply aren’t necessary to handle the split credit issue.
What about ERA estimators’ utility in projecting future performance? Let’s take a look at how the most popular component ERA estimator works in terms of prediction. Fielding Independent Pitching was developed independently by Tom Tango and Clay Dreslough, and is simply:
(13*HR + 3*BB – 2*SO)/IP +3.20
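In code, that’s nothing more than a one-liner; here’s a minimal Python sketch (the 3.20 constant is the fixed value from the formula above, not derived from the league run environment, and the sample line is made up):

def fip(hr, bb, so, ip, constant=3.20):
    """Fielding Independent Pitching: (13*HR + 3*BB - 2*SO)/IP + constant."""
    return (13 * hr + 3 * bb - 2 * so) / ip + constant

# A hypothetical season line: 20 HR, 50 BB, 200 SO in 210 IP
print(round(fip(20, 50, 200, 210), 2))  # 3.25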
I computed FIP for the first 50 games of the season, from 1993 to 2010, and looked at how it predicted rest-of-season ERA, weighted by innings pitched. As expected, the root mean square error (RMSE) is very large, at 5.32. If you repeat the exercise but use the previous year’s ERA instead, RMSE drops to 2.78. After 100 games, the difference still persists; ERA has an RMSE of 4.52, compared to 9.64 for FIP. (The reason RMSE goes up the further you are into the season is that you are predicting ERA over substantially fewer games.) Large sample sizes are our friend when it comes to projection.
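For those who want to follow along, the innings-weighted RMSE looks roughly like the sketch below; the column names are hypothetical stand-ins, not our actual database fields:

import numpy as np

def weighted_rmse(predicted, actual, innings):
    """Root mean square error, weighting each pitcher by innings pitched."""
    predicted, actual, innings = map(np.asarray, (predicted, actual, innings))
    return np.sqrt(np.average((predicted - actual) ** 2, weights=innings))

# e.g., FIP through a team's first 50 games vs. rest-of-season ERA:
# weighted_rmse(df["fip_first_50"], df["era_rest_of_season"], df["ip_rest_of_season"])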
All component ERA estimators share the same substantial weakness as predictors of future ERA: their advantage over plain ERA shows up only in small samples. I tried predicting 2010 pitching stats using the aggregate of pitching stats from 2003 through 2009, so I could test how several ERA estimators fared as sample sizes increased dramatically. In addition to FIP, I used xFIP, a FIP derivative developed by Dave Studeman that uses outfield flies to estimate home runs allowed, and Skill-Interactive ERA, a metric originally developed here at BP by Matt Swartz and Eric Seidman.
Here is the RMSE for 2003-2010 at increments of 100 innings:
IP   | ERA  | FIP  | xFIP | SIERA
100  | 1.52 | 1.45 | 1.40 | 1.33
200  | 1.20 | 1.19 | 1.18 | 1.14
300  | 1.02 | 1.07 | 1.05 | 1.03
400  | .98  | 1.02 | 1.02 | 1.02
500  | .77  | .88  | .90  | .91
600  | .75  | .88  | .93  | .96
700  | .81  | .91  | .93  | .99
800  | .77  | .88  | .91  | .96
900  | .77  | .90  | .91  | .98
1000 | .81  | .94  | .97  | 1.04
Looking at all pitchers with at least 100 innings, SIERA outperforms the other estimators (but not by much, as I’ll explain momentarily) as well as ERA. By about 400 innings, though, the gap between the ERA estimators (as well as the gap between them all and ERA) has essentially disappeared. As you add more and more innings, plain old ERA outperforms the estimators, with basic FIP ranking second-best.
If you want a projection, the key thing you want is as large a sample as possible. If you need both a large sample and the most recent data, weighted according to their predictive value, you want something like rest-of-season PECOTA. So what is the function of these estimators, if not prediction? The answer, I think, is explanation. These ERA estimators allow us to see how a pitcher arrived at the results he got: whether they came from the three true outcomes, his defense, or his performance with runners on.
Enter SIERA—Twice
SIERA was doomed to marginalization at the outset. It was difficult to mount a compelling case that its limited gains in predictive power were worth the added complexity. As a result, FIP, easier to compute and understand, has retained popularity. FIP is the teacher who can explain the subject matter and keep you engaged, while SIERA drones on in front of an overhead projector. When FIP provides you an explanation, you don’t need to sit around asking your classmates if they have any idea what he just said. If ERA estimators are about explanation, then the question we need to ask is, do these more complicated ERA estimators explain more of pitching than FIP? The simple answer is they don’t. All of them teach fundamentally the same syllabus with little practical difference.
Fangraphs recently rolled out a revised SIERA; a natural question is whether or not New Coke addresses the concerns I’ve listed above with old Coke. The simplest answer I can come up with is, they’re still both mostly sugar water with a bit of citrus and a few other flavorings, and in blind taste tests you wouldn’t be able to tell the difference.
I took the values for 2010 published at Fangraphs and here, and looked at the mean absolute error and root mean square error, weighted by innings pitched. I got values of .12 and .19, respectively. In other words, roughly 50 percent of the time a player’s new SIERA was within .12 runs per nine of the old SIERA, and 68 percent of the time it was under .20 runs per nine. There is only a very modest disagreement between the two formulas.
In order to arrive at these modest differences, SIERA substantially bulks up, adding three new coefficients and an adjustment to baseline it to the league average for each season. I ran a set of weighted ordinary least squares regressions to, in essence, duplicate both versions of SIERA on the same data set (new SIERA uses proprietary BIS data, as opposed to the Gameday-sourced batted ball data we use here at Baseball Prospectus). Comparing the two regression fits, there is no significant difference in measures of goodness of fit such as adjusted r-squared. Using the new SIERA variables gives us .33, while omitting the three new variables gives us .32.
We can go one step further and eliminate the squared terms and the interactive effects altogether, and look at just three per-PA rates (strikeouts, walks and “net ground balls,” or grounders minus fly balls) and the percentage of a pitcher’s innings as a starter. The adjusted r-squared for this simpler version of SIERA? Still .32.
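For the technically inclined, the stripped-down regression looks something like this sketch (statsmodels, with hypothetical column names; the per-PA weighting is my assumption, since the text above says only “weighted OLS”):

import statsmodels.api as sm

def fit_simple_siera(df):
    """Weighted OLS on three per-PA rates plus starter share of innings.
    Column names are hypothetical; weights assumed to be batters faced."""
    X = sm.add_constant(df[["so_rate", "bb_rate", "net_gb_rate", "start_pct"]])
    return sm.WLS(df["era"], X, weights=df["pa"]).fit()

# fit_simple_siera(pitcher_seasons).rsquared_adj  ->  about .32 on the data set described above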
So what does the extra complexity of the six interactive and exponential variables add to SIERA? Mostly, multicollinearity. An ordinary least squares regression takes a set of one or more input variables, called the independent variables, and uses them to best predict another variable, the dependent variable. It does this by looking at how the independent variables are related to the dependent variable. When the independent variables are also related to each other, it’s possible to throw off the coefficients estimated by the regression.
When multicollinearity is present, it doesn’t affect how well the regression predicts the dependent variable, but it does mean that the individual coefficients can misrepresent the true relationship between the variables. This is how a term that had a negative coefficient in old SIERA can have a positive coefficient in new SIERA. The fundamental relationship between that variable and a pitcher’s ERA hasn’t changed; players haven’t suddenly started trying to make outs instead of scoring runs, or anything like that.
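If you want to see the effect for yourself, here is a toy illustration (deliberately not SIERA): two nearly identical predictors produce unstable coefficients from one resample of the data to the next, even though the overall fit barely moves.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1
y = 2 * x1 + rng.normal(size=n)            # only x1 truly drives y

for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(n, size=n)   # bootstrap resample
    X = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    fit = sm.OLS(y[idx], X).fit()
    print(fit.params[1:].round(2), round(fit.rsquared, 3))
# The two slope estimates swing around (and can even flip sign) from resample to
# resample, while r-squared stays essentially the same: the predictions are fine,
# the coefficients aren't.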
Adding additional complexity doesn’t help SIERA explain pitching performance in a statistical sense, and it makes it harder for SIERA to explain pitching performance to an actual human being. If you want to know how a pitcher’s strikeouts influence his performance in SIERA, you have no fewer than four variables that explicitly consider strikeout rate, and several others that are at least somewhat correlated with strikeout rate as well. Interpreting what SIERA says about any one pitcher is essentially like reading a deck of tarot cards: you end up with a bunch of vague symbols to which you can fit any number of rationalizations after the fact.
The world doesn’t need even one SIERA; recent events have conspired to give us two. That’s wholly regrettable, but we’re doing our part to help by eliminating it from our menu of statistical offerings and replacing it with FIP. Fangraphs, which already offers FIP, is welcome to this demonstrably redundant measure.
FIP or xFIP?
The next question is why not use a measure like xFIP instead of FIP? The former is much simpler than SIERA and performs better on standard testing than FIP, while being indistinguishable from SIERA in terms of predictive power. While xFIP is simpler than SIERA, it still drags additional levels of complexity along with it, including stringer-provided batted ball data. The question is whether or not that data adds to our fundamental understanding of how a pitcher allows and prevents runs.
One of the fundamental concepts to understand is the notion of true score theory (otherwise known as classical test theory). In essence, what it says is this: in a sample, every measurement is a product of two things—the “true” score being measured, and measurement error. (We can break measurement error down into random and systemic error, and we can introduce other complexities; this is the simplest version of the idea, but still quite powerful on its own.)
You’re probably familiar with the principle of regression to the mean, the idea that extreme observations tend to become less extreme over time. If you select the top ten players from a given time period and look at their batting average, on-base percentage, earned run average, whatever, you see that the average rate after that time period for all those players will be closer to the mean (or average) than their rates during that time period. That’s simply true score theory in action.
Misunderstanding regression to the mean can lead to a lot of silly ideas in sports analysis. It’s why people still believe in a Madden curse. It’s why people believe in the Verducci Effect. It’s why we ended up with a lot of batted-ball data in our supposed skill-based metrics. I’m generally known as a batted ball data skeptic, so I’m sure it comes as no surprise that I regularly lose sleep over statistics like home runs per fly ball, a component of xFIP.
xFIP uses the league average HR/FB rate times a pitcher’s fly balls allowed in place of a pitcher’s actual home runs allowed. While it’s true that a pitcher’s fly ball rate is predictive of his future home run rate on contact, it’s actually less predictive than his home runs on contact (HR/CON). Looking at split halves from 1989 through 1999[1], here’s how well a pitcher’s stats in one half predict his home run rate on contact in the other half:
        | R^2
FB/CON  | .023
HR/CON  | .035
Both    | .045
For those who care about the technical method, I used the average number of batted balls (contact) in the two halves as the weights in a weighted ordinary least squares regression. The coefficients of the regression including both terms are:
HR_CON_ODD = 0.019 + 0.158*HR_CON_EVEN + 0.032 * FB_RT_EVEN
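In sketch form, that split-half regression looks something like the following (hypothetical column names; per the description above, the weights are the average contact in the two halves):

import statsmodels.api as sm

def split_half_fit(df):
    """Predict odd-half HR/CON from even-half HR/CON and FB rate,
    weighting by the average contact in the two halves."""
    weights = (df["con_even"] + df["con_odd"]) / 2
    X = sm.add_constant(df[["hr_con_even", "fb_rate_even"]])
    return sm.WLS(df["hr_con_odd"], X, weights=weights).fit()

# split_half_fit(halves).params  ->  roughly 0.019, 0.158, 0.032, as above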
This result seems to fly in the face of previously published research. Most analysts seem to feel that a pitcher’s fly ball rate is more predictive of future home runs allowed than his home run rate. How do we reconcile these seemingly contradictory findings? The answer is simply this: estimating home run rates using a pitcher’s fly ball rate leads to a much smaller spread than using observed home run rates. Taking data from the 2010 season, and fitting a normal distribution to actual HR/CON as well as to expected HR/CON given FB/CON, gives us:
When you create an expected home run rate based on batted ball data, what you get is something that's well correlated with HR/CON but has a smaller standard deviation, so in tests where the standard deviation affects the results, like root mean square error, it produces a better-looking result, without adding any predictive value.
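Here’s a toy demonstration of that effect (made-up data, not the 2010 sample): shrinking a predictor toward the mean lowers its RMSE against the target even though its correlation with the target, which is to say its actual predictive content, is unchanged.

import numpy as np

rng = np.random.default_rng(7)
future = rng.normal(size=5000)                 # the thing we're trying to predict
estimate = future + rng.normal(size=5000)      # a noisy estimate of it
shrunk = 0.5 * estimate                        # same information, half the spread

rmse = lambda p, a: np.sqrt(np.mean((p - a) ** 2))
print(np.corrcoef(estimate, future)[0, 1], np.corrcoef(shrunk, future)[0, 1])  # identical
print(rmse(estimate, future), rmse(shrunk, future))   # the shrunken version scores better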
Aside from questions of batted ball bias, however, there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that was. This is absolutely incorrect. You can break a pitcher’s fly ball rate down into two components: home runs on contact, and fly balls on balls in play. (This ignores the very small number of home runs that are scored as line drives, which are rare enough not to affect this sort of analysis.) Fly balls on balls in play are a much poorer predictor of future home runs than home runs on contact, with an r-squared of only .014.
While regressing the spread of HR/CON is a good idea, using fly ball rates to come up with expected HR/CON does this essentially by accident, and in doing so it throws out a lot of valuable information that is preserved in HR/CON but washed out when non-HR fly balls are given equal weight. In the rush to throw out the bath water, the baby goes with it.
No Estimating Zone
Proponents of skill-based metrics such as SIERA and xFIP (as distinct from FIP) may claim that results trump theory, that their additional predictive value alone proves their utility. But just as the greater predictive value of fly ball rates is a mirage, so is the greater predictive value of skill-based estimators. What happens when we fit a normal distribution to our ERA estimators, along with ERA? Again looking at 2010 data:
The estimators have a smaller standard deviation than observed ERA. SIERA has a lower spread than FIP, and xFIP has a lower spread than both (although SIERA and xFIP are much closer to each other than SIERA is to FIP). Let’s look at how well each of these predicts next-season ERA from 2003 through 2010 (using a weighted root mean square error), as well as the weighted SD of each among the pitchers who pitched in back-to-back seasons:
     | ERA  | FIP  | xFIP | SIERA
RMSE | 2.88 | 2.07 | 1.86 | 1.86
SD   | 1.71 | .83  | .49  | .53
The ERA estimators are all closer to the next-season ERA than ERA is, but they have a much smaller spread as well. So what happens if we give all the ERA estimators (and ERA) the same standard deviation? We can do this simply, using z-scores. I gave each metric the same SD as xFIP, which was (just barely; you can’t even tell the difference when you round the values) the leading estimator in the testing above, and voila:
     | ERA  | FIP  | xFIP | SIERA
RMSE | 1.94 | 1.87 | 1.86 | 1.84
In essence, what we’ve done here is very crudely regress the other stats (except SIERA, which we actually expanded slightly) and re-center them on the league average ERA of that season. Now SIERA slightly beats xFIP, but in a practical sense both of them are in a dead heat with FIP, and ERA isn’t that far behind. As I said, this is an extremely crude way of regressing to the mean. If we wanted to do this for a real analysis, we’d want to look at how much playing time a player has had. In this example we regress a guy with 1 IP as much as a guy with 200 IP, which is sheer lunacy.
In a real sense, that’s what we do whenever we use a skill-based metric like xFIP or SIERA. We are using a proxy for regression to the mean that doesn’t explicitly account for the amount of playing time a pitcher has had. We are, in essence, trusting in the formula to do the right amount of regression for us. And like using fly balls to predict home runs, the regression to the mean we see is a side effect, not anything intentional.
Simply producing a lower standard deviation doesn't make a measure better at predicting future performance in any real sense; it simply makes it less able to measure the distance between good pitching and bad pitching. And having a lower RMSE based upon that lower standard deviation doesn't provide evidence that skill is being measured. In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on.
[1] The years in question were chosen because those are the years for which we have Project Scoresheet batted ball data, which represents the largest span of publicly available batted ball data on record.
I think the important thing is to evaluate the work on its own merits, regardless of the source - let the evidence be the guide. I started working on this before I had any inkling that Fangraphs would be picking up SIERA - unfortunately I was away for a while to attend SABR and so the article you see here got delayed. Once SIERA was released at Fangraphs, we delayed the article for a few days so I could review the changes they made and see if they addressed any of the concerns I outlined. As I note above, they didn't, so I incorporated some new material to mention that and we went with it.
I'm a little saddened that it worked out this way, actually - I don't think Fangraphs is adding anything that helps them communicate important baseball info, given that they were already running FIP to begin with. And I think that the way the timing of this happened, it puts the focus more on the personalities involved than on the quality of research.
Now we get an incredibly time-consuming rebuttal to someone else's work on a different site. This comes on the heels of an even more off-target critique of Fangraphs' rest-of-season weighting algorithm. Exactly what weighting are you using again for PECOTA RoS, Colin? Oh wait: you never answered that repeated question, instead spending a whole article doing forensic analysis to uncover and slam someone else's algorithm.
I have no allegiance to Fangraphs or BP or any other site for that matter. But the trend here at BP is undeniably disturbing. More effort is being put toward saying someone else is wrong than toward creating and optimizing your own models and tools -- at least based on what is being published, which is all that really matters. The PECOTA pitcher win-loss projections were unbelievably screwed up, beyond any shred of credibility; the 10-year projections never came. But instead of reading articles about what happened and how it will be improved for next year, we're reading about how some ex-employee should stop trying to improve a once-prized algorithm for estimating true pitcher skill over small samples? Color me saddened as well.
There are other problems I discussed in the last thread that I won't harp on further.
Unlike evo34, I have a dog in this fight; I've published twice on BP, I've been acknowledged in the book credits, and I've had a number of pleasant conversations with BP authors both live and via e-mail. I want BP to succeed.
But an article saying, "Math is hard; incremental improvement isn't useful," combined with a somewhat opaque statistical attack, is either anti-stathead or appears to stem from personal animosity.
Steven Goldman deserves some heat for this, too. There's a time when an editor needs to slow down the horses.
At some point since then - after the first week they were up, as we were giving it constant scrutiny during that stretch, they were overwritten somehow with the old values. This is quite embarrassing, and probably means that someone ran a piece of pre-season prep code which needed to be deactivated. I take full responsibility for it being removed. I'm not clear on why the customers who were most interested in seeing it fixed (such as jrmayne) were not appropriately informed when it was published in April, but obviously it wasn't handled professionally on our part at any stage.
All customer suggestions are read and discussed. Some are "works as designed". Some cannot be fixed or implemented in a timely manner. Pre-season PECOTA modifications would generally come under one of these two categories. Some suggestions lead to re-evaluation of statistical methods, as this FIP/SIERA discussion evinces. But ones such as UPSIDE - which simply involve repairing/restoring stats that have been online before - should be fixed in a timely manner. We'll make sure that any such future fixes are addressed in a timely manner, and that the solutions are communicated.
Also, it was mentioned several times in the pre-season, but the order of tables on the PECOTA cards seems odd. The injury and contract info tables, which are not of primary interest to most users and take up a ton of room, are listed above projected playing time, forecast percentiles, and now the 10-year forecast.
http://www.baseballprospectus.com/article.php?type=2&articleid=14171#88302
I'll be happy to try and expand upon that, if you can please tell me what you find lacking in that explanation.
But this is from the July 19 thread:
Rest-of-season PECOTA: We've made changes to the weighting of recent results that reduced the impact of 2009-2010 performance on players' rest-of-season projections, making them less susceptible to being swayed by small-sample stats. Those changes are now reflected in the "2011 Projections" table at the top of each player card, as well as in our Playoff Odds.
[End quote]
And several people noted that you'd previously said that Fangraphs was wrong for its too-2011-heavy weighting. While the paragraph was parsed in different ways (and we never found out which way was right - the logically coherent and grammatically coherent readings appear to be at odds), it appears to say that prior years are having their weighting reduced, to get closer to Fangraphs' weightings. (Yes, Fangraphs is doing something else that appears mistaken, but you vigorously stressed the goal of not overselling recent performance.)
So:
Step 1. Fangraphs is wrong! They're weighting 2011 way, way too much! Dummies.
Step 2. The approximate weighting Fangraphs is using for 2011 (for players with similar PT rates in each year) looks right to some of the unwashed masses.
Step 3. We've changed our system to weight prior years less (and, by necessity, 2011 more.)
Either Step 1 or Step 3 has a flaw. What I want to know is if I've misread Step 3.
--JRM
Then why on earth is this being published now, at the very same time the former BP author is writing about this stat on a different website? Why not two weeks ago? Or in August? Or in the off-season? I hardly believe switching from SIERA to FIP is an urgent change that must be made and justified this very second unless there's some sort of legal/IP issue involved, in which case just say that. Doing this now seems intentionally confrontational and needlessly petty.
Completely irrelevant thought. I just don't like that axiom.
I'm also not quite sure I buy the "overly complex" argument, since the "overly complex/hard to describe in layman's terms" argument has been used when discussing why it was so hard to update PECOTA post-Nate Silver, or to explain why PECOTA projected a player a certain way or gave a comparable player that didn't seem to match.
The mistake people make (and I don't exclude myself from this, at least when resorting to the shorthand lingo of conversational sabermetrics) is to conflate those ERA estimators with a full-on projection system that takes into account multiple years of data, park and league adjustments, physical comps (in the case of PECOTA), and other situations. Saying a pitcher's ERA will go in a given direction because his FIP or SIERA says so is a shortcut, but you'd rather have the larger sample and the more complex beast of a projection system, at least for some purposes.
(assuming that B is, in fact, better, and that the perceived improvement isn't illusory)
Also, isn't SIERA just "an ERA estimator that accounts for a pitcher's ability to control BABIP"? Unless I'm missing something huge, despite the lengthy introduction it really wasn't any more difficult to "understand" than, say, VORP.
Second, and more important, if it can't be explained in conversation it'll never gain acceptance beyond the sabremetric stove. No fair complaining about the obstinacy of a thick-headed baseball mainstream if the stats you're pushing on them are beyond explanation.
Regarding SIERA, I kind of simplistically used it in my mind as calculating a pitcher's performance if BABIP/luck wasn't a factor.
Still, for one of my overall points, I could've used FRAA or any other single metric that has some complexity beneath it that no one has really explained sufficiently to me in the many years I've had a BP subscription. I understand metrics are complex. I may not even understand how or why they work. But I can generally tell if they are useful or not without knowing the nitty gritty beneath the hood. So I don't really buy the complexity argument, since SIERA seems to "make sense" more than FRAA.
The statement and mindset I get curious about is: "It was difficult to mount a compelling case that its limited gains in predictive power were worth the added complexity." This indicates that there are some benefits to SIERA that aren't in FIP. Should I, as a customer, care about how simple or complex a metric's mechanics are? I can definitely appreciate the effort it takes to produce metrics, but I honestly think that my subscription fee pays, in part, for research into extra precision and usefulness. If I were in a restaurant, I really don't know or think about how difficult it is to cook a meal, but I do think about its taste.
So, finding that BP is reverting to another metric that does not have those "limited gains" does make me wonder if there are other BP metrics that are not as precise. Kinda like finding out that I'd been served a watered-down wine, I'd start sniffing suspiciously at the rest of the meal.
One is that it obscures the logic of the underlying processes. Matt claims that the extra terms have real baseball meaning, but he hasn't tested that, much less proven it. I'm skeptical that any of them have much real baseball meaning, but even if some of them do, I'm quite certain that not all of them do. So, when you look at SIERA, how are you to know which terms have real baseball meaning and what they mean? If one of them is wrong, what effect does that have on the whole formula? If something in FIP is wrong, say the coefficient for the BB should be 2.5 instead of 3, it's pretty easy to see how that would affect the result.
Secondly, complexity limits applicability. SIERA is limited to the batted ball data era because of its requirement for groundball data. Who knows if it would work in the minors, or Japanese players, or elsewhere, for example? It's much easier to get the needed data for FIP and to test its applicability in other leagues and levels.
Third, the more complicated a multivariate regression, the easier it is for it to break--in other words, for the conditions that applied in sample to change in ways that cause your result to vary in ways you didn't anticipate--and the harder it tends to be to realize that this has happened.
To my way of thinking, your criticism here is well-explained, and doesn't feel personal. The article did not succeed on those terms, at least to me.
SIERA: 1.844
QERA: 1.842
So, it's difficult to paint this as a partisan dispute.
From the glossary:
QERA =(2.69+K%*(-3.4)+BB%*3.88+GB%*(-0.66))^2
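For reference, that glossary formula is simple enough to drop into a few lines of Python (rates expressed as decimals; the example inputs here are made up):

def qera(k_pct, bb_pct, gb_pct):
    """QERA per the glossary formula; rates as decimals (e.g., .18 for 18%)."""
    return (2.69 - 3.4 * k_pct + 3.88 * bb_pct - 0.66 * gb_pct) ** 2

print(round(qera(0.18, 0.08, 0.45), 2))  # 4.37 for a roughly league-average line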
I know from reports that I design that, even with the documentation, sometimes I get too close to the data and can't see the forest for the proverbial trees.
So, to rephrase:
I'm curious if, over the last 18 months, someone besides Matt didn't outline what each part of SIERA did or check if that part was needed.
Or is the person who did that no longer with BP?
As to who exactly did what when behind the scenes, it's certainly not my place to share that, if it's even appropriate to be shared at all.
I'm not aware of what Matt or anyone else at BP did with SIERA prior to my joining. I can say that in the first few months I was here, the stats group had an extensive discussion about ERA estimators. A lot of the things that Colin and I have said here and that Matt has said in his series at Fangraphs were part of that discussion, if in more nascent forms.
I'm probably a skeptic of more baseball stats than I am an advocate, and unnecessary and/or unexamined complexity is one big reason why.
But home runs as a percentage of line drives is low: only about 2 percent of line drives become home runs (as compared to about 11 percent of outfield fly balls).
I've always wondered how the ball-tracking systems define the threshold for a "line drive," and how consistent those definitions are throughout the industry. And that's before getting into the subjective experience of judging a line drive...
Obviously you can consider launch speed, also, and that complicates the picture a bit. Here's a link to a graph I made of that a while back:
http://fastballs.files.wordpress.com/2010/06/vla_vs_sob_zoom.png
Also, regarding the RMSE tables in the final section, are those differences significant at all, or are they within the margin of error of the estimate? (My statistical knowledge is pretty rudimentary, so I'm not quite sure what the proper word is.) Especially for the final one with the normalized SDs: is 1.84 really better than 1.87, or is it more like the RMSE is 1.84 +/- 0.05, so you can't really say it's "better"? I'm also surprised ERA is that close to the estimators. Does this mean that the "true" ERA ability of pitchers has much less spread than measured, due to fielding and batted-ball luck?
A bit of a technical note - as I stated in the article, I attempted to duplicate the regressions used to build new SIERA off my own dataset (using different batted ball data than what is used at Fangraphs, so I wouldn't expect a complete match), and I got a p-value for that coefficient of .46. That's very, very large (typically if a p-value is > .1 or .05 it will be omitted from the regression), and it indicates to me that the inclusion of that variable isn't statistically significant (which would be why the sign could flip like that without affecting the goodness of fit of the regression.)
If I omit the two variables with p-values above .4 (the other being the squared netGB term), I actually see the adjusted r-squared rise (although not by a statistically significant amount) and several other model selector tests (such as the Akaike information criterion and the Bayesian information criterion) improve; those are tests that control for the tendency of adding new variables to a regression to increase the r-squared by "overfitting" rather than better predicting the relationship between variables.
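For anyone who wants to replicate that kind of check, the model-comparison step looks roughly like this sketch (column names, variable lists, and the PA weighting are all hypothetical stand-ins):

import statsmodels.api as sm

def compare_models(df, full_terms, trimmed_terms):
    """Fit the full and trimmed specifications and report adjusted r-squared,
    AIC, and BIC; lower AIC/BIC favors the smaller model when the extra
    terms aren't pulling their weight."""
    for name, terms in [("full", full_terms), ("trimmed", trimmed_terms)]:
        fit = sm.WLS(df["era"], sm.add_constant(df[terms]), weights=df["pa"]).fit()
        print(name, round(fit.rsquared_adj, 3), round(fit.aic, 1), round(fit.bic, 1))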
We did independent testing of SIERA at "The Book" blog and it did well. Yes, it's complex, but the intent was to pick up on some of the nuances of BABIP.
I haven't finished reading Matt's recent series at FanGraphs, but there are some thought-provoking issues raised. I'm doing some of my own research (off and on) on the relationship between a pitcher's true-talent ground ball rate and his BABIP allowed (it's not linear) and on his HR per LD+FB (also not linear). Another thing that's not linear is the relationship between runs created per PA (linear weights, wOBA) and runs per game (ERA).
I was a math major at one time 30 years ago, but today Matt has better math skills than I do to run the regression analyses that show how multiple terms, not just ground ball rate, interact to predict base hit and home run rates. Matt's SIERA research published in these recent articles has discovered the non-linearity I spoke of, as well as recognizing that one player's fly balls or line drives are not the same as another's.
Regardless of whether xFIP is a better estimator than SIERA, I believe these concepts will be important issues in the next few years as we try to improve our modeling of batted balls and the projections we build from them.
However, as someone who mixed the ingredients for a house-brand soda manufacturer back in the '90s, I can tell you that in fact there are multiple kinds of citrus flavoring in cola, as well as a variety of other odd things.
I drink way too much Pepsi, but can't stand Coke. I've also got a bottle of Mello Yello sitting here, one of my favorite flavors, much more than Mountain Dew. But I have a hard time distinguishing between 7-Up and Sprite, they are interchangeable to me.
It's just that I (and apparently several others) found the slap at SIERA to be distasteful.
Now I read BP for more subjective articles, like those by Jason Parks, Kevin Goldstein, and John Perrotto, which don't rely on BP's flawed metrics as much.
It's pretty disheartening to see so much go into slamming a former colleague and competitor when so much has languished broken on this site for so long this season. Constructive criticism is well and good, but get your own house in order first and show you can keep it that way before you curiously time the release of an article deriding a competitor.
Come on guys, get your heads together and figure it out. If BP and FanGraphs agreed on this, then Baseball-Reference and ESPN might buy into it and we could all talk the same language.
- hsr/Scoresheetwiz
The added complexity of SIERA does not give us better info.
There are disagreements in the sabermetric community. I'm not sure how "getting our heads together" will solve that. What do you think we've been doing? This discussion is all part of getting to the truth, but it happens over a period of years. It's not a matter of one person suddenly dispensing some enlightenment and everyone else bowing at the feet of their insight. Even something like Voros and DIPS took years to hash out. It's not for lack of effort and dialogue, certainly.
I mean, if SIERA was flawed (as some commentary suggests), weren't people at BP checking on it/understanding it? Isn't there some kind of peer review process? If it was complex to the point that its application was limited, as Mike suggests, why was it adopted? What new evidence or insight came to light that wasn't in all the previous articles about SIERA? Or is it just impossible to use/understand SIERA now that Matt isn't with BP?
Was SIERA flawed (as some commentary suggests)?
Weren't people at BP checking on it/understanding it?
But for me what it boils down to is mindset. I believe that Colin has been leading a change in that regard. I use Colin's name because he's in charge of the stats here, but really all of it is a group effort and there are many people to whom the positive changes should be credited. That is also not to say that this mindset was not present at BP previously either to a greater or lesser degree. I certainly hold a number of the former BP writers and founders in high regard.
What I saw that brought me back around to BP enough that I was willing to put my own name and effort to it was an emphasis on good, logical sabermetric thinking. That everything should be tested to see if it was true. I find Colin to be the clearest sabermetric thinker in the field. If there's something dodgy with the maths or the approach, he can spot it like no one else can. I need to be around someone like that to clarify my own thinking and work. I believe I have unique insights into how the game works and how to handle data, but I can also get off track in my thinking. Colin asks good questions of me that focus my ideas in a profitable direction. Dan Turkenkopf is also particularly good at that. (You guys may not see Dan's name much on bylines, but he's a big reason that everything keeps going around here stat-wise.)
For example, Colin asked a number of questions of me as I was researching my latest piece on batter plate approaches that helped to shape what the final article looked like, and he reminded me of some of Russell Carleton's previous work on the topic that is pushing me in good directions for future research.
It's that sort of thing that I believe is driving us here at BP toward a better understanding of the game. That will ultimately filter into our stats, too, and it has to some extent already, but I see the stats and their specifics as the branches of the tree, with the trunk being a solid sabermetric mindset and dialogue. The fruit, hopefully, is more enjoyment of the game for all of us, reader and writer alike.
Yet, my "beef" isn't so much with retiring SIERA. My issues are threefold:
1. I didn't buy the complexity explanation since it made me doubt the rigor BP puts into its other statistics.
2. I didn't understand why it was a BP staple if there were doubts about why it was needed.
3. The timing (which Colin addressed) and "suddenness" made it look like the retiring of SIERA had to do with Matt's departure, which could have been resolved if BP was better at announcing who was joining/leaving.
I've never been the type to threaten about cancelling my subscription and I'm definitely not talking about that right now. BP's added a lot of writers over the last few years. Still, I do get curious why people are leaving. I see a BP article debunking the usefulness of SIERA after using it as a staple for the last 18 months, yet I see FanGraphs pick it up. How am I supposed to interpret that?
I appreciate you sharing your concerns. I definitely can tell that you have the best interests of BP at heart here as a long-time subscriber and reader who wants to be able to "trust the process" (as a long-suffering Royals fan, I think I'm entitled to use that phrase) and not have to worry whether you're just watching a pretty facade with rotting framework behind it. Your thoughts are always welcome here, and you are more than welcome to email me personally if you don't feel that you get a good enough answer in these comments.
There are, for lack of a better word, political considerations around personnel decisions, both on the part of BP management and the contributor who chooses to or is asked to leave. I know some of the details of the departure of Matt and other former BP writers, but some of them I do not know. I am glad that I am not in charge of figuring out how to manage all that kind of stuff. Steve Goldman wrote a good piece, I thought, a couple months back on that topic, and anything I would have to add to that would be worse.
I can explain further about what I mean about complexity being a negative if that is helpful. I'm not sure that it is.
Regarding ERA estimators in general, the big unresolved question for them is: What is an ERA estimator trying to do? And from that, how do you know if it is successful? If ERA estimators are not trying to predict next year's ERA, then why do we use that as the test for which ERA estimator is the best? If they are trying to tell us the pitcher's true performance in a small sample, what is the true standard for that against which we can measure? There is none, because that is an unresolved question of sabermetrics, and it will be if and until we figure that out.
We don't know how to disentangle the pitcher's performance from the performance of the batters and the fielders and the umpires in small samples. We know much better how to do that in large samples, but for that, we don't need ERA estimators.
I don't worship at the altar of FIP or kwERA any more than I do at the altar of SIERA. They are all tools that tell us something, but we don't really know what that something is, other than that it is, for FIP and kwERA at least, fielding independent, and for SIERA mostly so. So that's definitely worth more than nothing, to know what part of a pitcher's performance was independent of his fielders (other than his catcher). But it is most definitely not telling us exactly what part of his performance was due to his own effort and exactly what part was due to factors out of his control (call it luck, or what have you). We, as a sabermetric field, don't know how to separate that yet at the game level. One of my quests is to figure that out.
I guess, through no particular fault of Colin's, that this article just touched me wrong and got me a bit confused and saddened in a few different ways. Shows I care, right?
*pulls the annual off the shelf and thumbs through Colin's statistical introduction*
I stand corrected and apologize. I'm not sure why I thought that.
Was SIERA incorporated into the PECOTA changes?
Loved this article. Loved the one you wrote explaining the issues with defensive metrics. I read the introductory article on SIERRA and thought it was a bunch of screwed up data mining with unsupported variable interactions. Thank you for dumping it.
-Tom A.
So no one would use a generic HR/FB rate or raw xFIP to predict second-half ERA. You'd obviously under-project HR rate and ERA of every Rockie pitcher and over-project every Padre by doing so. A stat like xFIP is only useful when compared to itself (across pitchers on same team, or across years of same pitcher); it's not well-designed to directly predict actual ERA and I don't think this is a particularly new revelation.
http://www.fangraphs.com/blogs/index.php/new-siera-part-five-of-five-what-didnt-work/
"Park effects themselves come with a bit of noise, though, and the effect of park-adjusting the batted-ball data didn’t help the predictions. The gain simply wasn’t large enough: The standard deviations of the difference between actual and park-adjusted fly ball and ground-ball rates was only 1.4% and 1.5%, respectively, and the park factors themselves were always between 95.2 and 103.9 (the average park factor is set to 100).
...
There were only 16 other pitchers with ERA changes more than .05 in either direction.
As a result, very few people saw major changes to their SIERAs, and the combined effect was a weaker prediction."
"Wait, BP doesn't like stats to be complex? Since when? It took me three years to understand some of their stats. I've read numerous over-my-head articles explaining how the complexities led to a slightly better understanding of the data interpretation and prediction models. In light of this history, 'this algorithm is bad because it is complex' sounds disingenuous."
The timing is also extremely troubling. Discretion being the better part of valor, waiting to publish a criticism of a recently departed coworker would have been the better choice. While peer review is great, this was not just peer review, it was a critical review of a stat that up until the moment this article was published I thought that BP was 110% behind. And it comes right on the heels of a very harsh criticism of fangraphs' prediction system, which was quickly followed by "oops, our prediction system had some big errors in it." I do applaud, however, the civility of this particular comment thread.
I wholeheartedly agree that BP needs to fix its internal process. PECOTA cards have had numerous errors for far too long, both before and after the newer BP staffers have been on the job. But BP also needs to fix a communications issue: what I took away from this article was "we want our stats to be 'close enough for government work.'"
Really? Because I know of a completely free web site where the stats are close enough for government work . . .
No, absolutely not. I'm sorry if that's what you got from this.
We want our whole statistical approach to be the best in helping us and you understand and appreciate the game of baseball. There are times when that is best served by complexity, and there are times when that is best served by simplicity.
The reason we are best served by simplicity in this case is because of how little we as a sabermetric field know about how to assign credit/blame for an individual batted ball among the various participants on the field. We know fairly well how to do that once we get thousands of such events to evaluate, but not at the granular level. If we can establish some falsifiable theories about how that happens, and prove them out, then we can proceed to building more complex explanations of how to assign the credit for what a pitcher's true performance was.
Even FIP is a bit of a lie if you take it to be a true and complete record of the pitcher's performance. If you only take it to be a record of the fielding-independent portion of his performance, then you're on more solid footing. (I don't see many people take it that way in its common usage.) It's a bit harder to remember that you're looking at a lie and exactly what sort of a lie it is when you're looking at SIERA or xFIP or something other more elaborate construction that someone is using to purport to tell you a pitcher's true performance over some time period. That's why complexity is a negative here. If ERA estimators were all about telling the truth, the whole truth, and nothing but the truth, we wouldn't have this problem.
I'm as eager for a better solution to it as everyone else.
So if someone decides on his own to use them as projection systems (as Colin does), he needs to make the adjustments to make it a fair test. Obviously, not every pitcher remains on the same team year-to-year; but most of them do. Same thing with fielding: there is going to be a reasonably high correlation of team defense year-to-year, which will help ERA as a predictor and hurt the fielding-independent estimators.
The short version is that if something is designed to neutralize park and defense (as FIP, xFIP and SIERA are to various degrees), it's not right to use it as a raw predictor of something (actual ERA) that is affected heavily by park and defense. Either make a new metric that is designed to be a predictor of raw stats, or create a new test.
Estimators like xFIP try to show what HR rate a pitcher would likely yield IF he were magically placed in a random/neutral park against a random/neutral schedule. Wyers tests what happens when the player is NOT in this random/neutral environment, but rather in one that is highly correlated to his previous (known) environment. There is no reason one should try to use raw xFIP or SIERA to predict a season in a partially-known environment. If the goal of these metrics is to predict next season accurately (which Wyers seems to be assuming), the correct method is to estimate the biases of each pitcher's next-season environment and make the necessary adjustments to the xFIP/SIERA formulas or to the stats being measured.
Without really understanding 75% of what you wrote in this article (I'm no statistician), it seems to me that the problem could have been solved with a little added accessibility and usability for commoners like myself.
SIERA's formula has been in the glossary since it was introduced. You can get the glossary entry by clicking on SIERA in any stats report where it's a column. The link goes here: http://www.baseballprospectus.com/glossary/index.php?search=SIERA
Maybe it's because I'm married to a graphic designer and have a picky view of presentation now, but the stats and glossary pages are the least usable parts of this site.
No I'm not cancelling my subscription over it. Kevin and Jason make it money well spent.
2) That is essentially what FIP vs. SIERA boils down to, isn't it?
3) Even if the answer to question no. 1 is yes, at a small enough sample size, GB/FB has to be more reliable. Has anyone determined how large of a sample you need for home run rates to be a better indicator than GB/FB if it ever is?
4) That GB/FB is better data than HR/9 at a small enough sample leads me to conclude that SIERA has its use. Where GB/FB data is incomplete, FIP has its use. Why can't they co-exist? Just change FIP to HR-SIERA.
5) Aren't GB/FB/LD and pop-up data getting more and more reliable, accurate, and available?
6) If so, aren't we going in the wrong direction by retreating from SIERA back to FIP? Shouldn't we just be continuing to refine SIERA and keep it state-of-the-art, just as Matt is doing (or, at least, attempting to do)? Really, SIERA was originally just a refinement of FIP or xERA, no?
First, I really like the analysis. This could easily be a week's worth of articles for me to work my way through. Hopefully sleeping on it for one night was enough for it all to sink in.
There are two and a half takeaways that I have from the initial table, neither of which is new information for me:
1) SIERA slightly outperforms xFIP and FIP in the prediction of long-term future ERA when given a small sample size to work from (definitely at 100 IPs, and very slightly at 200 IPs).
1.5) After about two seasons of data for a young starter (300 IPs), the estimators are about as useful a predictor as ERA itself.
2) After 300 IPs, ERA actually becomes a slightly better predictor of future ERA than the estimators.
Why does SIERA outperform the other estimators with the small sample of data? The components may be needlessly complicated once you get to 300IPs, but for small data samples, those extra variables seem to have some added predictive power.
I would LOVE to have seen QERA side-by-side in that first table.
And, as evo34 notes above, if prediction was the real goal of the estimators, then park factors would almost certainly need to be included. Would park factors offer any added value?
For everything else you presented, I'm going to have to re-read and re-sleep-on-it. Could be an interesting and entertaining weekend. Thank you.
1. Thanks to Richard Bergstrom for initiating a conversation that was informative and managed to be both passionate and respectful. He asked some questions that I think a lot of us had on our minds.
2. I agree with some of the other commenters here that the timing for the release of this article was poor. The timing served to link two events together (the decision for BP to discontinue their relationship with Matt Swartz and SIERA, and the decision for Fangraphs to pick them up) that didn’t need to be linked together.
3. Having stated number two, I found that Colin's analysis in the article was compelling, and my sabermetric man-crush on Colin continues unabated.
4. I always thought that a strength of SIERA was that it did not use a variable constant (I call it a "fudge factor") to adjust for run environment. Ideally, any run estimator does not need to be told what the run environment is; the run estimator should tell you. Now that SIERA v2 has a fudge factor, I find it less compelling. (Yes, I know that FIP uses one too, and it bothers me.)
5. As Colin mentions, if two of the coefficients in SIERA have switched their signs from v1 to v2, this calls into question whether the coefficients tell us anything real, or if they just force the hoped-for result to match the data. This also makes me find SIERA less compelling.
6. For me, FIP is most useful as a backward-looking barometer of what happened than a forward-looking estimator of future ERA. I know that Player X will never sustain that HR/FB percentage of 1.8 over the long term, but he did it this year, and that added real value to his team this year.
7. I look at ERA, FIP, xFIP, and SIERA myself trying to figure out who should replace Brett Anderson on my fantasy team, but I recognize that all of them need to come with a complementary shaker of salt. For predictive run estimators, it looks like the quest for the holy grail continues. And the quest is enjoyable to read about and discuss.
But you've certainly given me something to think about, particularly the contention that HR/Con is better than HR/OF (or HR/FB). That is counterintuitive to me, and I'm going to have to read your explanation several more times to understand it.
To me, xFIP is a useful stat that tells you something about a pitcher, but (IIRC) I resisted putting it on the THT stats page because I didn't see it as a "reference" stat. That was silly of me, I guess. Many readers asked to see it, so we eventually added it.
Similarly, we used to run "ERA-FIP" at THT, which was also something requested by readers. I was kind of uncomfortable with that too and when we got a request to run ERA-xFIP too, I refused. I thought it put too much emphasis onto a single number and calculation. Time has shown that I'm in the minority in that position.
I guess I am someone who is uncomfortable with the quest for a complex stat that explains everything. I am leery of issues like multicollinearity and other things I can't pronounce or understand. I intuitively won't trust a stat I can't understand. I make only two exceptions: projection systems and total-win based systems.
So I wish you guys success on your own statistical quests but please: don't try to do too much. Keep it simple.
I'm completely with you on your second-to-last paragraph, and my skepticism extends to projection systems and total-win based systems, too, though there's sometimes a need to use them. For example, when it comes to my fantasy league, I love that there is a projection system I can use. In a theoretical world with infinite time, I might like to roll my own so that I understood all the pieces better, but in the real world, I'm glad that Colin and others are doing it for me.
One of my favorite sabermetric articles ever was the one that Bill James wrote in the 1987 Abstract about how to evaluate a statistic. That has been very influential on future development as a sabermetician. I want to be able to understand clearly what a statistic is telling me about a player or about the game. The mathematical pieces should simply be ways of representing other concepts with numbers or equations. That may be my physics training coming into play, too. I learned most of my higher math through physics classes. Anyhow, that's simply a different way of saying that I hear your perspective loud and clear.
I agree 100% that sometimes there is a need to have projection systems, even if we don't totally understand them. There are a few things you're willing to take on faith. Not many, just a few.
I've said this before, but I continue to be puzzled that component approaches to ERA leave out pitcher SB/CS and baserunner pickoff analysis. Derek Lowe is almost absurdly easy to steal on, while some lefties have virtually no stolen base attempts against them. It seems likely that this difference has a statistically significant impact on actual ERA.
I've been trying to analyze this stuff a bit, but the data I can find aren't completely accurate. For example, a successful pickoff attempt that is negated by a bad throw by the firstbaseman, allowing the runner to get to second, is scored as a stolen base and charged to the pitcher who picked the baserunner off in the first place!
Or is this the kind of complexity that everyone is railing against?
I tried again using the queries Colin set up for this study a couple days ago, running the queries against some FIP_PLUS metrics with various parameters, and again failed to come up with anything useful.
Re: studes / Fast: I actually agree with Dave and Mike's overall preference for stats that mean more specific things - or that we at least fully understand when we see them. Generally, the pitching stats I mainly look at are K/9, BB/9, and GB/FB - with due consideration to park, defense, work load, durability, etc. Knowing those components (K, BB, & GB%) informs how to apply that pitcher's context for whatever time frame you want to look at. But there needs to be a stat that also sums up those three main components in one number, just for quick reference. It is not a question of which kind of stat is better; it is preferable to have both.
Are you implying that the distribution of expected home runs based on contact rate is wider than that for fly ball rate?
"Fly balls on balls in play are a much poorer predictor of future home runs than home runs on contact, with an r-squared of only .014."
But your table shows an R-squared of .023 for FB/CON. What's the diff between .023 and .014?
And aren't those abysmally low R-squared figures? I'm used to getting R Squareds in the .20 to .30 range.
(In-play fly balls + Home Runs on Fly Balls)/(Balls in play + Home Runs)
If you remove home runs from both the numerator and denominator, you get FB/BIP.
What I was trying to illustrate here is that you can divide fly balls into two types - ones that are home runs, and ones that aren't. FB/CON treats them the same for predicting future home runs, but home runs have greater predictive value than FB that aren't home runs.
A good test might be to ask whether HR/FB is more predictive of second half home runs than HR/CON?
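To put toy numbers on that decomposition, here is a minimal Python sketch; every count below is invented purely for illustration, and it simply follows the definitions stated above rather than any real data:

```python
# Toy counts (hypothetical) to show the FB/CON decomposition described above.
in_play_fb = 120       # fly balls that stayed in the park
hr = 15                # home runs, treated here as fly balls that left the park
balls_in_play = 400    # all batted balls that stayed in the park

contact = balls_in_play + hr                 # contact = balls in play + home runs
fb_per_con = (in_play_fb + hr) / contact     # FB/CON: both kinds of fly ball count equally
fb_per_bip = in_play_fb / balls_in_play      # FB/BIP: home runs removed from top and bottom
hr_per_con = hr / contact                    # HR/CON: only the fly balls that left the park

print(f"FB/CON = {fb_per_con:.3f}, FB/BIP = {fb_per_bip:.3f}, HR/CON = {hr_per_con:.3f}")
```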
"...there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that is a home run."
But isn't this just as true of HR/CON? In fact, couldn't you turn this argument around and say that this is why HR/FB is preferable to HR/CON? Because fly balls are more predictive of future home runs than, say, ground balls?
I love the topic and the conversation.
But it's pretty hypocritical (of all of us) to stand on our high horses and decry the "world is flat" mentality of people like Murray Chass as ridiculous, yet be completely unable to present a unified alternative.
It's the same problem I see at work. More people like to yell and holler about the problems, but nobody really talks about a solution.
P.S., this wasn't directed at Colin or BP, it was just a general bitch session.
I see no hypocrisy at all.
Sabermetrics tries to follow some of the rules of academia, with peer review, case studies, and all that... and it also has many competing ideas at play. Yet I'm not sure an article with this kind of tone and timing would appear in an academic journal. It'd be like a professor leaving University A for University B, followed by a bunch of articles from professors at University A immediately debunking their former colleague's work.
Damned if you do, damned if you don't. In a few weeks, this article will drop off the home page and we'll all move on with life.
BTW, Colin is definitely saying that SIERA is "bad" in some ways. Look again at his statements about multicollinearity or why using HR/FB is bad (which I still don't get). Mike is saying that an approach that relies too much on regression is, indeed, "bad."
The staff who are still at BP are most definitely committed to learning about how to measure and understand pitching. I would count myself among the foremost in that desire.
SIERA is being retired because we believe that approach is dangerous, in that it can convince you it's teaching you things about pitching that will end up being false, and because we believe there are better ways to learn about those things.
We are not retiring from the discussion of learning about quantifying pitching skill and performance. Far from it! Matt Swartz and I are going to have very different approaches to the subject going forward. Obviously, I believe my line of inquiry will be more fruitful, or I wouldn't be pursuing it. That's not to say you shouldn't listen to both of us and decide for yourself what to think.
Above, I mentioned axioms I don't like; another one is, "Good enough is the enemy of great." That one's crap because, look, it's nice if we can gain a hundredth of a decimal point of "accuracy" in projecting a stat for a game of sports, but not if it takes utilizing differential equations and 10th-order polynomials. Especially when compared to (for example) FIP, which is a plain ol' 7th-grade pre-algebra formula that gets us 90% of the way there. I don't see complexity for minimal gain as progress in the war against outdated statistics like Wins, Saves, and "scrappiness". In this particular case, unless there's a compelling argument why we need to add complexity, good enough truly is good enough.
I think that's the layman's terms of what you guys are trying to say, right Mike? It's not that you DON'T want to improve, it's just that you want to use a little bit of common sense in how you go about it.
Those of us who don't sleep with a copy of "Applied Regression Analysis and Generalized Linear Models" thank you.
From a can of worms standpoint, this could be interpreted ominously if there are other major differences between the people who are here now as opposed to those who were here 18 months ago.
Also, "dangerous" is a very strong word and, at worst, is not the kind of word that encourages further discussion. Ideally, both your line of inquiry and Matt's should both furthering our understanding of baseball... I am definitely not in the mood to choose sides or consider this whole discussion an either/or proposition.
And I'm not sure what's ominous about that. It's not the case that people who disagree are let go from BP. We have all sorts of internal disagreements. As I've stated elsewhere, some of the former writers are people who I hold in great esteem. Matt has done good work, too. But it should be obvious that there has been a change in philosophy at BP over the last couple years, and some of that is related to a change in personnel (with causation running in both directions). One could obviously read too much into that if one assumes that every departure is due solely to that reason. In some cases it probably is and others it has nothing to do with it.
Let me be clearer on the ominous part. I've been here for five or so years now. I've seen many people come and go and I also understand that further research can lead to better metrics or modifying older metrics. I don't think I've seen Davenport translations or PECOTA or MORP or other BP stats or projection systems debunked and thrown completely and immediately out the window when the creator left BP.
Let's say, in a hypothetical, that Jay Jaffe left BP tomorrow. He's been using JAWS to evaluate Hall of Famers for quite a few years. JAWS isn't my favorite metric in the world, but I understand how and why he uses it to evaluate Hall of Famers, and I find what he does with it pretty entertaining and insightful. So, if the month he left, JAWS were debunked in a BP article because the new guard thought it was a horrible metric, I would wonder why no one said or did anything sooner. I would wonder why I was told for a month or a year, "Hey, this has the BP Seal of Approval," and then, when the hypothetical Jaffe left, "Hey, this was never any good." It'd make me question the vetting process of everything BP produces and wonder whether metrics are used because they are a) good/proven/vetted or b) popular with whoever is in charge at BP.
Am I being extreme or overblowing things? Maybe. I like to think I'm not the kind of person to exaggerate. But before I saw this article, I didn't have these kinds of doubts and questions.
As to metrics thrown out completely when the creator left--Nate Silver's Secret Sauce got that treatment:
http://www.baseballprospectus.com/article.php?articleid=12085
Nate's obviously a sharp guy, so take that for what you will, but I see/saw it as a good thing.
Re:
"It'd make me question the vetting process of everything BP produces and wonder whether metrics are used because they are a) good/proven/vetted or b) popular with whoever is in charge at BP."
It's honestly probably some of both, but I'd also point out that those two are not mutually exclusive options. In addition, it's a good thing when our less stat-heavy writers at BP tell us, "We can't use this stat in our writing, it doesn't make sense and we can't explain it or justify it to the readers." (Hypothetical example--I'm not talking about SIERA.) We are going to be human in our understanding and our choices here, for all that entails, good and bad.
Vetting is not simply a matter of Colin looking at a stat and calculating some standard errors and p-values and blessing it or consigning it to the dustbin. It's a process of filtering through various people on the team with perspectives that include statistical knowledge, philosophy of what we're trying to do at BP, knowledge of how the game works, connection with what the readers want and need, and hopefully a good dose of common sense. It is also, at times, going to be a matter of trial and error.
Re: Secret Sauce was given a little wiggle room as possibly a run of bad luck, and it wasn't retired as soon as Nate left, either. The interesting thing is that Secret Sauce was used for longer than SIERA, but the Unfiltered blurb and the rationalpastime article combined were about half the size, word-count wise, of the SIERA retirement article. Also, "What I'd hope you take away from this is not that Secret Sauce is worthless" and that article in general are quite different in tone from "In short, the gains claimed for SIERA are about as imaginary as they can get, and we feel quite comfortable in moving on" and the SIERA article.
Re: Both
My impression with the article and comments is that people at BP either a) thought it took the wrong approach or b) weren't really sure how or why SIERA worked since coefficients seemed to be plucked out of a hat. I would think the people at BP vetting/validating it would have an understanding of it and might have addressed some of the issues Colin had raised in the last 18 months.
To be a bit more explicit, before this article, I was quite sure that the stats people at BP checked each other's work and had an understanding of the metrics being used. The impression I got from this article was that SIERA was either broken or not optimal for the last 18 months and people either didn't understand its complexity or didn't think it was worth the effort to fix. So then, I began asking "Were these concerns raised in the past? If so, why bring them up now? If not, was someone validating SIERA?"
In any event, Dave's response satisfied me in regard to the timing. The tone? As I said below, I thought it should've been more of a "State of the Prospectus" piece wishing Matt luck, followed a few days or weeks later by an article in this vein comparing SIERA and FIP. I just think mingling the business/politics of a person leaving with the simultaneous research/critique/criticism/debunking of that person's idea leads one to question the critique and wonder why it did not occur sooner.
As I've also indicated, I'm not some die-hard SIERA fan. But these circumstances _did_ make me wonder/question some of the processes at BP. And I'd like to thank you and the rest of the staff for trying to address my concerns.
I guess I just view the question of "learning about how to measure and understand pitching" as very different from finding the best method to forecast a statistic. Now, I haven't looked into the discussion of the merits of the metric too deeply, but a change of sign in a coefficient isn't necessarily a big problem when the terms in a regression change (due to correlation between the regressors and the regressand, changes in the coefficients are expected, and the region around 0 isn't special).
It is precisely the fact that it is being advanced by Matt and others as a tool for explaining the causal effects of pitching results that bothers me, and all my objections to it are in that regard.
For what it's worth, I think the Angrist-Imbens-Rubin methods of causal econometrics are under-utilized by those interested in the causal relationships in baseball. E.g. the answer to the question "what would happen if player X were to throw their fastball more often?" is best answered with their techniques, not assuming that the run value of their fastball is fixed at a given number.
I have no problem with people openly disagreeing with each other. In fact, I think it's healthy. But I agree with you that it's to no one's credit if people start getting personal and slinging mud.
I also agree that there is a little bit of mud slinging here. Colin can be harsh in his assessment at times. But it's to his credit that he doesn't hold back in his assessments and, as far as I know, doesn't let things get personal.
As I said in the article - the decision to retire SIERA was not a decision to go to war with the sabermetric community; the goal was actually to surrender this fight entirely. After I had written the first draft of this article, Fangraphs started running the revised SIERA on their site. After they published the revised formula, we made the decision to evaluate the revised formula and incorporate that material into the article.
Once Fangraphs started publishing SIERA, I don't know that we were faced with a "good" option as to how to proceed. Acknowledge the revision, and we look like we're attacking. Ignore it, and we look like we're not open to outside criticism, or like we don't think our critique stands up against the revision. Publish it now, and again we look like we're attacking. Wait longer, and we perpetuate a whole lot of confusion by leaving two versions of SIERA in the wild without addressing concerns from our readers over whether or not we would incorporate the new revisions.
But at the end of the day - you can now go to Fangraphs or BP and find the same ERA estimator. (We calculate the annual constants slightly differently - they use one constant for both leagues, while we use a separate constant for each league - but by and large they're exactly the same measure.) I think in the long run, the decision to retire SIERA in favor of FIP ends up accentuating commonality, not difference, in the sabermetric community.
I agree, and that's what I'm applauding. At the same time, I see Richard's point about the tone and timing. As I said, damned if you do, damned if you don't. I have no beef with this decision.
Patriot's (developer/co-developer of BaseRuns and Pythagenpat) blog:
http://walksaber.blogspot.com/2011/07/saying-nothing-about-era-estimators.html
Tango and MGL's Book blog:
http://www.insidethebook.com/ee/index.php/site/comments/siera_updated/
"cwyers Colin Wyers
I developed a new ERA estimator: 1*ERA+0. The r-squared is f@cking fantastic."
Stay classy.
By the way, re-reading Colin's point about SIERA's smaller standard deviation makes me wonder if the heavily regressed PECOTA projections this year are going to benefit in terms of RMSE relative to previous projections that had a higher SD. So perhaps the same standardization will benefit the comparison between projection systems after this year.
To the second - yes, I think this is something that affects projection systems as well, and it's something I hope to address further after the season. A quick note on that: I did a quick comparison between this season's PECOTA pitching projections and the Marcels (I chose the Marcels primarily because they come with standardized player IDs, thus making this sort of comparison easy).
Looking at the standard deviation of players in common between the two systems (weighted by the harmonic mean of IP projected by each), in terms of ERA, I get:
PECOTA: .53
Marcels: .24
Now bear in mind that I omitted players without an explicit Marcels projection; if, as Tango suggests, I forecast players who debut in MLB this season to perform with the league-average ERA, that will further drop the SD of the Marcels relative to PECOTA.
So I don't think, in a comparison with other projection systems, PECOTA is necessarily benefiting from "heavy regression" - PECOTA is regressing far less than at least one popular projection system.
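For anyone curious what a comparison like that looks like mechanically, here is a rough Python sketch. The player rows, system labels, and ERA/IP values are all invented, and the harmonic-mean weighting is just one plausible reading of "weighted by the harmonic mean of IP projected by each," not the actual query:

```python
import math

# Invented projections: (player, ERA_system_A, IP_system_A, ERA_system_B, IP_system_B)
projections = [
    ("pitcher_1", 3.40, 180, 3.70, 170),
    ("pitcher_2", 4.10, 150, 4.05, 160),
    ("pitcher_3", 4.80,  60, 4.30,  70),
]

def weighted_sd(values, weights):
    # Standard deviation of projected ERA, weighted by the per-player weights.
    mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(var)

# Weight each player by the harmonic mean of the two systems' projected IP.
weights = [2 * ip_a * ip_b / (ip_a + ip_b) for _, _, ip_a, _, ip_b in projections]
sd_a = weighted_sd([era_a for _, era_a, _, _, _ in projections], weights)
sd_b = weighted_sd([era_b for _, _, _, era_b, _ in projections], weights)
print(f"Weighted SD of ERA projections -- system A: {sd_a:.2f}, system B: {sd_b:.2f}")
```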
Your analysis has made me question whether I care about using the projection system with the lowest RMSE -- I may be willing to sacrifice a certain amount of error if the SD is more representative of reality, taking into consideration the difference between observed variance and true-talent variance. Other tests (like heads-up) can be used, of course, to meet those goals.
I think those are both interesting questions, and I don't have an answer for you at this point in time, but it's something I'm going to research.
"Your analysis has made me question whether i care about using the projection system with the lowest RSME -- I may be willing to sacrifice a certain amount of error if the SD is more representative of reality, taking into consideration the difference between observed variance and true talent variance. Other tests (like heads-up) can be used, of course, to meet those goals."
I would agree - I think that, to the extent that you can maintain accuracy of your system otherwise, a larger SD of your forecasts is better. Given that, I'm starting to think about tests like heads-up testing and other alternatives to straight RMSE testing. I'm really hoping that some outside analysts do the same - it'd be nice if some people who don't have "skin in the game," so to speak, are coming up with similar answers.
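For readers unfamiliar with the idea, a heads-up test can be as simple as counting, pitcher by pitcher, which system's projection came closer to the actual result. A minimal sketch with invented numbers, just to show the shape of the test:

```python
# Invented rows: (actual_ERA, system_A_projection, system_B_projection)
results = [
    (3.55, 3.40, 3.90),
    (4.60, 4.10, 4.45),
    (5.10, 4.80, 4.30),
]

# Count, for each pitcher, which system's projection landed closer to the actual ERA.
wins_a = sum(1 for actual, a, b in results if abs(actual - a) < abs(actual - b))
wins_b = sum(1 for actual, a, b in results if abs(actual - b) < abs(actual - a))
ties = len(results) - wins_a - wins_b
print(f"System A closer: {wins_a}, system B closer: {wins_b}, ties: {ties}")
```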
This is, of course, untrue. The article was published on July 25, a full week after SIERA was introduced at Fangraphs.
Earlier this year, Matt Swartz sent me an email saying that he wanted to take SIERA somewhere else. I told him that we didn't have a problem with that. I also told Matt that we planned on running SIERA for the rest of 2011, and that we would be retiring the stat after that.
When Matt recently announced that he was re-publishing SIERA at Fangraphs, we were fine with that, and it didn't change our plans at all. What we weren't expecting was for SIERA to be re-formulated and published under the same name. The Internet now has two different sets of numbers attached to the same stat name on two different websites. This is absolutely certain to generate unnecessary and frustrating confusion on the part of people who aren't closely following the topic. Now we can expect questions: why doesn't your SIERA match their SIERA? Theirs is newer, and they say it is better--why are you still running old numbers? Which SIERA am I expected to use? This is not a position we want to be in.
At this point you can argue that SIERA itself suffers when there are multiple versions floating around. We weren't sure what we were going to do about an ERA estimator when we decided to retire SIERA, but faced with the prospect of contributing to a confusing situation, we decided that changing our timetable and replacing SIERA immediately was the right thing to do. Our options at that point were to quietly replace SIERA with something else or to announce what we were doing. We decided that Colin should fast-track his SIERA retirement article and that we would use it to announce the change. When the article was ready on Friday, we posted it.
I do wish something to this effect had been posted at the beginning of the article or on Unfiltered, to avoid the appearance of ripping on Matt/Fangraphs.
What I am most disappointed with though, is the lack of THIS eminently reasonable explanation up front. This entire article (and the yanking of SIERA itself) should have been prefaced with this statement, and how it wasn't is completely beyond me. Then Colin's tone, while still disappointing, would at least have been cast in a different light.
The ironic thing is, I never used SIERA or TAV in my own writing or analysis, but with all this stirred up, I have much more interest in SIERA than I ever had before. Perhaps, this is all a joke and that was the plot all along.
The issue here is what we would use SIERA or FIP for. If it is for seeing how well someone is pitching over a segment of a season - and I suggest that is its most practical purpose - then Colin's chart comparing RMSEs at 100-inning increments is misleading. Let's see how these metrics compare at 20 innings, 40, 60, 80, 100, 125, 150, 200, 300, and 400.
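A rough Python sketch of what that comparison would look like mechanically; the data below are invented stand-ins, and with real data each cutoff would carry its own set of pitcher rows:

```python
import math

def weighted_rmse(rows):
    # rows: iterable of (estimate_through_cutoff, rest_of_season_era, rest_of_season_ip)
    num = sum(ip * (est - era) ** 2 for est, era, ip in rows)
    den = sum(ip for _, _, ip in rows)
    return math.sqrt(num / den)

# Toy data keyed by innings cutoff (all numbers made up for illustration).
samples = {
    40:  [(3.10, 4.20, 140), (5.00, 4.40, 120), (4.20, 3.60, 150)],
    100: [(3.60, 4.05, 90),  (4.70, 4.45, 80),  (4.00, 3.70, 100)],
}

for cutoff, rows in sorted(samples.items()):
    print(f"through {cutoff} IP: weighted RMSE = {weighted_rmse(rows):.2f}")
```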
It comes down to the question of, what do we mean by how well somebody is pitching? I mean, I can identify certain deficiencies in how ERA evaluates a pitcher's performance. It ignores "unearned" runs, based on an inconsistently applied and essentially capricious definition of which runs are unearned. It gives the pitcher credit (or blame) for the actions of his team's defense. It gives a pitcher credit (or blame) for how well his follow-on pitchers do at handling inherited runners.
SIERA addresses one of those issues - it (largely) doesn't hold a pitcher accountable for the actions of his defense (although if there is a hit/no-hit bias in the batted ball data, there is a possibility that it is still giving the pitcher some responsibility for the actions of the defense). But by regressing against ERA, and including terms such as ground ball rate and starter IP percentage, SIERA perpetuates the other two failings of ERA that I note above.
The trouble is that in terms of assessing a pitcher's in-sample performance, there's nothing to test SIERA against. If there was a consensus on a stat that was better at measuring a pitcher's performance in sample, we wouldn't have to test SIERA against it, everybody would just use it. So SIERA tests against out-sample ERA as a proxy for "skill." As discussed above, in real (as opposed to fantasy) baseball, ERA is deficient as a measure of actual performance. But more importantly - there are things that can occur over a small sample that are not repeatable but are certainly part of how well he is pitching for that period of time. Testing against out-sample ERA doesn't tell you which metric does the best at measuring these aspects of performance, it tells you which is best at ignoring those parts of performance. To some extent it comes down to how much you care about those aspects of performance, compared to aspects that are more readily measured.
Here's where I could say something about how an overly reductive approach treats players as Strat-O-Matic cards, but let's be blunt - a whole lot of the interest in these metrics is because people almost literally want to know what a player's Strat card will look like for an out-sample period, such as the rest of the season. The trouble here is not so much SIERA itself, but misuse of the metrics. To the extent that sabermetric education has occurred outside of a few scattered enclaves on the Internet, what's important is not that we've taught people to use OPS instead of RBI, or ERA instead of wins, or our new trendier metrics instead of the simpler metrics that the last generation of sabermetricians fought to get established in the first place. It's the fact that we have now conditioned large swaths of people to have a Pavlovian response to predictions based on a limited number of observations - "small sample size." I mean, we've gotten to the point where reporters who cover sports for a living are more well-versed in things like true-score theory and measurement error and other bedrock principles of statistics than reporters who cover health, the economy, or any number of other topics.
Misusing SIERA as a projection system utterly ignores that; more to the point, chasing after minimal improvements in predictive accuracy from samples with selective endpoints utterly ignores that. And I say "misuse" advisedly - Matt himself has clearly stated, repeatedly, that SIERA is not a projection system and is not intended to be one. So when it comes to questions like, "what's the best stat to use to predict a pitcher's future performance based on 40 IP," to me that sounds like trying to decide what caliber bullet to use to shoot yourself in the foot with. If we really want to predict a pitcher's future performance, we need to use more than 40 IP; the most valuable data that comes with a pitcher's stat line over that small a sample is his name.
"...we have now conditioned large swaths of people to have a Pavlovian response to predictions based on a limited number of observations - 'small sample size.'"
1. The starter IP percentage acts as a proxy for the "Rule of 17" that I noted on my blog a year or two ago. That is, the more you pitch as a starter, the higher your BABIP and the higher your HR/PA. Since SIERA purposefully ignores both BABIP and HR, starter IP percentage is an excellent parameter for it to include. That's not to say it's used in the best way, but that it's used at all at least shows that Matt had a great insight here. That SIERA is the only estimator that even thought to include it is a huge feather in its cap.
As an example, it would be ridiculously easy for me to change the "3.2" constant in FIP to something else, like "3.3" for starters, "2.9" for relievers, and a sliding scale for the rest. (A rough sketch of this idea appears just after this comment.)
2. I agree that the true test must be against RA9. Though, in reality, any test against RA9 will deliver virtually the same result as ERA (if you select a random group of pitchers... but, if the only pitchers you select are all GB pitchers, then ERA will not be a good test).
The distinction between ERA and RA9, with regards to what we are talking about here, is being very nitpicky. That's not a bad thing, but it's not a good thing either.
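To make the sliding-constant example above concrete, here is a toy Python sketch. The 2.9 and 3.3 endpoints come straight from the comment; the simple FIP form and the linear blend between them are illustrative assumptions, not a vetted recalibration:

```python
def fip_with_role_constant(hr, bb, so, ip, starter_ip_fraction):
    # Slide the constant between a reliever value (2.9) and a starter value (3.3)
    # based on the share of the pitcher's innings thrown as a starter.
    # This is an illustrative assumption, not an endorsed formula.
    constant = 2.9 + (3.3 - 2.9) * starter_ip_fraction
    return (13 * hr + 3 * bb - 2 * so) / ip + constant

# Invented stat lines: a full-time starter and a full-time reliever.
print(round(fip_with_role_constant(hr=22, bb=55, so=170, ip=200, starter_ip_fraction=1.0), 2))
print(round(fip_with_role_constant(hr=4, bb=22, so=75, ip=65, starter_ip_fraction=0.0), 2))
```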
http://www.insidethebook.com/ee/index.php/site/comments/my_issue_with_regression_equations/
You say pretty clearly that you wish that SIERA adopted a fundamental model of baseball team scoring, like BaseRuns, so that we wouldn't have to disentangle these sorts of effects.
I understand that a lot of what I'm talking about when I critique SIERA relative to something like FIP is trivial or, as you say, "nitpicky." But that's because the only difference between SIERA and these other stats are trivial, "nitpicky" differences. That's kind of the point here, isn't it? The correlation between SIERA and xFIP, as reported by Matt, is .94. The difference between the two in predicting out-sample ERA, again according to Matt, is .01 runs per nine innings. If your position is "differences that trivial don't matter," then doesn't it follow that SIERA doesn't matter?
I agree that the presentation of SIERA obfuscates more than it enlightens.
I also know that there are kernels of truth in there that would be much more powerful if it followed the model that Patriot is espousing.
So, SIERA would benefit from further exploration and better presentation. It still deserves a place, and whether it's at Fangraphs or here, it doesn't really matter.
I mean FIP deserves a place too, and it took until you came here to give it that place, which is a ridiculously long time to present such a simple stat.
But the appellation SIERA belongs to a particular metric with a particular framework (if you wish to be pedantic, you could say two metrics, but the two versions of SIERA are far more similar than they are different). Not carrying that specific stat doesn't prevent BP from doing research into some of the concepts behind it. It's not like BP is incapable of contributing to the discussion of fielding metrics just because we call ours something other than UZR - calling it UZR would do nothing but confuse people who have (rightly) come to expect UZR to mean a very specific thing.
I agree with you that the implementation of SIERA "obfuscates more than it enlightens." I think us keeping the name for a significantly different metric, while the original SIERA still exists in the wild, would do much the same. Neither point means that we're leaving the conversation.
If you base a prediction just on the 40 innings, you are assuming the latter, which misses the point, because the whole question is what these 40 IP mean.
If you base a prediction on past performance plus the new 40 IP, then your prediction essentially assumes the former - that the guy is the same pitcher.
In other words, any prediction is close to useless because it assumes an answer to the question you're asking, but it doesn't really answer the question.
I think what people really want to know is the probability of this 40 IP sample occurring given the past performance of this pitcher. If you can look at that 40 IP sample and say there was less than a 10% chance of it happening randomly given this guy's previous level of performance, then maybe you conclude that the chance of him having improved is large enough to warrant picking him up in your fantasy league.
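One crude way to frame that question, sketched in Python with made-up numbers and a deliberately simple Poisson model of runs allowed (a real analysis would want something richer than this):

```python
from math import exp, factorial

# Assumptions for illustration only: runs allowed are Poisson around the
# pitcher's established run average, and the numbers below are invented.
established_ra9 = 4.50   # what we believe the pitcher's true run average to be
ip = 40.0
runs_observed = 10       # hypothetical hot stretch: 10 runs in 40 IP (2.25 RA9)

expected_runs = established_ra9 * ip / 9.0

def poisson_cdf(k, lam):
    # P(X <= k) for a Poisson(lam) count
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(k + 1))

p = poisson_cdf(runs_observed, expected_runs)
print(f"P({runs_observed} or fewer runs in {ip:.0f} IP given a true {established_ra9} RA9) = {p:.3f}")
```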
1) Why were PECOTA pitcher win-loss record projections so screwed up in the pre-season?
2) What was changed in the PECOTA rest-of-season weighting scheme a few days ago, and why was it done?
3) Why is it assumed that a good test of ERA estimators is using the past 1000 IP to predict the next 200 IP?
4) Why was the park environment ignored in assessing HR rates vs. flyball rates as predictive indicators?
5) Why was schedule ignored in assessing HR rates vs. flyball rates as predictive indicators?
6) "Aren't you graphing actual HR rates vs. predicted HR rates based on flyball rates? Of course the actual HR rate is going to be wider--that's the nature of real life vs. projection, right?
Are you implying that the distribution of expected home runs based on contact rate is wider than that for fly ball rate?"
7) "A good test might be to ask whether HR/FB is more predictive of second half home runs than HR/CON?"
8) "...there is a reason that this sort of analysis can be unsatisfactory: it assumes that a fly ball that wasn’t a home run is equally predictive of future home runs as a fly ball that is a home run."
But isn't this just as true of HR/CON? In fact, couldn't you turn this argument around and say that this is why HR/FB is preferable to HR/CON? Because fly balls are more predictive of future home runs than, say, ground balls?"
If you look at first and second half HR/CON vs FB/CON rates for 2002 - 2010 (cutoff 200 TBF in each half, using 180 days into the season as the 1st half / 2nd half split) you get this:
R^2:
FB/CON: .084
HR/CON: .046
This is more in line with previous research and shows the batted ball data to be useful when predicting second half home runs.
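For anyone who wants to run a test along these lines, here is a sketch of the calculation in Python. The file name and column names are hypothetical; only the 200 TBF cutoff in each half comes from the comment above, and this is just one reading of the split-half setup being described:

```python
import pandas as pd

# Hypothetical data file: one row per pitcher with first- and second-half
# batted-ball counts (columns assumed: tbf_1st, tbf_2nd, fb_1st, hr_1st,
# contact_1st, hr_2nd, contact_2nd).
halves = pd.read_csv("pitcher_halves_2002_2010.csv")
halves = halves[(halves.tbf_1st >= 200) & (halves.tbf_2nd >= 200)]

halves["fb_con_1st"] = halves.fb_1st / halves.contact_1st
halves["hr_con_1st"] = halves.hr_1st / halves.contact_1st
halves["hr_con_2nd"] = halves.hr_2nd / halves.contact_2nd

# Which first-half rate tracks second-half HR/CON better?
for predictor in ["fb_con_1st", "hr_con_1st"]:
    r = halves[predictor].corr(halves["hr_con_2nd"])
    print(f"{predictor} vs. second-half HR/CON: r^2 = {r ** 2:.3f}")
```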
You know, pissing and moaning about tone is unbecoming when you're desperately trying to do a hatchet job, and ignoring facts and context to get the job done.
Thanks. Whatever the value of 40 innings' worth of data is, it does have some value. There are situations, whether in predicting success or describing hotness, where you want the best stat for analyzing those 40 innings. You might be comparing two players who are very close, and that stat could be the best arbiter... and whatever value GB data has regarding HR yields over that small a sample, I still believe it has more value than HR/9. I do not claim to have made that notion up; others I respect think so as well. I just don't have a study handy that shows it... and whatever the benefits or detriments of all of SIERA's many complicated nuances vs. FIP's simplicity, the most significant one is that it uses GB data instead of HR/9 data.
If Colin and Mike want to concentrate their efforts on PITCHf/x and HITf/x because this is small potatoes, go for it. I look forward to their findings. Meanwhile, Fangraphs has a working SIERA if I feel the need to check it out.
TangoTiger makes a case that HR per fly ball is, to some small degree, a skill: http://www.insidethebook.com/ee/index.php/site/comments/pre_introducing_batted_ball_fip_part_2/
Paapfly shows just how great Matt Cain is at generating a low HR/FB: http://www.paapfly.com/2011/02/matt-cain-ignores-xfip-again-and-again.html
So a newer, improved SIERA would include appropriately regressed HR/FB data.
SIERA purposefully limits itself to:
1. not using HR
2. not using prior years
Those are constraints it imposes upon itself.
Marcel purposefully uses HR and uses prior years, as does PECOTA.
FIP purposefully limits itself to BB, HB, SO, HR, and current year.
All these decisions are made because the metrics are trying to answer a specific question.
So, don't expect any of them to change to be "better", because they have already decided that they want to be limited to some extent.
What's going on here? You poach us like a minor league team, so we'll poach back to gain major league acceptance?
Personally, I always enjoyed David's interviews. However, they were a guilty pleasure. I didn't make time to read them often enough.
And yeah, I liked David's interviews too.