Friday, December 28, 2012
Pitcher v Pitcher
One of my favorite little things to research is pitcher v pitcher. Poz has also stumbled on this, and presents us with some interesting info. Note: Jack Morris is all over his article.
Matt gives us some results on a couple of metrics. It’s good work at the beginning, though I got lost at the end.
It goes back to the work Glenn’s been doing, when he talks about FIP and predictiveFIP (or FutureFIP): create the metric to do what you need it to do.
Curt Schilling and Kevin Brown have pitched virtually the same number of innings. Their runs-allowed rates are extremely close (nearly identical per batter faced). Their FIPs are in the same ballpark. Schilling earned about 115MM$ in salary to 130MM$ for Brown. Even their W/L records are similar (216-146, 211-144). Is Schilling’s post-season record the differentiator that puts Schilling above the borderline, so that he will be on the Hall of Fame ballot year after year, building his case, while Brown was a one-and-done pitcher? Brown deserves to remain in the conversation.
ERA is a terrible idea. It takes something factual, the number of runs scored, and then decides which of those to attribute to one pitcher or another (in case of multiple pitchers in the same inning), and of those attributed, decides which are “earned” and which are “unearned”.
When you take something factual, and decide to split it up in some systematic fashion, you have to worry about systematic biases. And the larger the sample, the more the systematic bias will shine through.
Of the 1318 runs allocated to Curt Schilling, 1253 are declared as “earned” (a rate of 95%). He faced 13284 batters (in 3261 innings). That’s a shade under .1 runs per batter faced.
Of the 1357 runs allocated to Kevin Brown, 1185 are declared as “earned” (a rate of 87%). He faced 13542 batters (in 3256 innings). That’s an even smaller shade over .1 runs per batter faced.
Why do we have a bias? Curt Schilling was a flyball pitcher while Kevin Brown was a groundball pitcher. And errors are assigned to infielders far more than they are assigned to outfielders. The very fact that Kevin Brown allows a ball to hit the ground will likely lead to him getting less “earned” credit for anything bad that happens, as that “bad stuff” gets transferred to his fielders. But Kevin Brown had zero expectation of “perfect” fielders. A Brown groundball comes with the reality that it’ll get muffed more often than a Schilling flyball.
By allowing the scorer to decide that a pitcher gets absolved of blame in this biased manner, we perpetuate the systematic bias in the metric. And we end up with Curt’s ERA at 3.46, and Brown at 3.28.
“Yeah, yeah, whatever”, might be your reaction. After all, I picked the two most extreme pitchers of the current generation (true). And their RA9 is 3.75 for Brown and 3.64 for Schilling. ERA’s biased measure of +0.18 for Schilling becomes the unbiased -0.11 for Schilling. So we’re talking about a +/-0.15 gap at the extreme. True enough. If you don’t care, you don’t care. At least be aware of the issue, so you know enough to discard it as being mostly irrelevant.
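The arithmetic behind those figures is easy to check. Here is a minimal sketch using only the career totals quoted above; the innings are taken as the whole numbers given (any fractional innings are ignored), and the names `pitchers` and `per_nine` are just illustrative.

```python
# Career totals as quoted in the post (innings as whole numbers).
pitchers = {
    "Schilling": {"runs": 1318, "earned": 1253, "bf": 13284, "ip": 3261},
    "Brown": {"runs": 1357, "earned": 1185, "bf": 13542, "ip": 3256},
}


def per_nine(runs, innings):
    """Scale a run total to a per-nine-innings rate."""
    return 9 * runs / innings


summary = {}
for name, p in pitchers.items():
    summary[name] = {
        "earned_share": p["earned"] / p["runs"],    # share of runs ruled "earned"
        "runs_per_bf": p["runs"] / p["bf"],         # all runs per batter faced
        "era": per_nine(p["earned"], p["ip"]),      # earned runs only
        "ra9": per_nine(p["runs"], p["ip"]),        # every run allowed
    }

# Positive gap means Schilling's rate is higher (worse) than Brown's.
era_gap = summary["Schilling"]["era"] - summary["Brown"]["era"]
ra9_gap = summary["Schilling"]["ra9"] - summary["Brown"]["ra9"]

for name, s in summary.items():
    print(f"{name}: earned {s['earned_share']:.0%}, R/BF {s['runs_per_bf']:.4f}, "
          f"ERA {s['era']:.2f}, RA9 {s['ra9']:.2f}")
print(f"ERA gap {era_gap:+.2f}, RA9 gap {ra9_gap:+.2f}")
```

The sign of the gap flips between the two measures, which is the whole point.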
The other issue is the mid-inning pitching change, where inherited runners are charged as runs only if the following pitcher lets those runners score. So, the run gets counted entirely against the initial pitcher, and only if the following pitcher allows it to score. The reality is that we have a SHARED responsibility, but baseball record-keeping is fixated on giving entire credit to one player or another, be it runs allowed or games “won”. To reflect the reality of shared responsibility, we give the pitcher who leaves the game with runners on base a portion of those runners counting as runs scored, whether they scored or not. And we give the relieving pitcher the remaining positive amount of the run if the runner scored, and a NEGATIVE run amount if the runner was left on base (so that at the team level, it all adds up for that inning).
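One way to picture the shared-responsibility bookkeeping is the sketch below. It assumes you already have some estimate of how often a runner scores from a given base/out state; that `score_prob` input is hypothetical here (a run-expectancy table would supply it), and the function name is made up for illustration.

```python
def split_inherited_runner(score_prob, scored):
    """Split one inherited runner between the departing pitcher and the reliever.

    score_prob: assumed chance the runner scores from the state he was left in
                (hypothetical input, not a figure from the post).
    scored:     True if the runner actually came around to score.
    """
    departing = score_prob                              # charged whether he scores or not
    reliever = (1.0 if scored else 0.0) - score_prob    # remainder; negative if stranded
    return departing, reliever


# Runner left in a spot where he scores 40% of the time (made-up number).
print(split_inherited_runner(0.40, scored=True))    # departing 0.4, reliever gets the other 0.6
print(split_inherited_runner(0.40, scored=False))   # departing 0.4, reliever credited -0.4
```

In both cases the two shares sum to the runs that actually scored, so the team total for the inning is untouched.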
Be aware of the issues, and then decide if it’s relevant enough for you.
I used to be involved in a great many Primer semantical arguments years ago. One I just came across:
I never said FIP was magic or comprehensive or any of the things some of you seem to think that I wrote. I said that FIP is what the pitcher actually did. True statement.
No. It’s some of the stuff the pitcher actually did.
For example, throwing an 86 MPH meatball over the middle of the plate, that gets ripped in the gap for a triple is omitted.
Two people who are talking to each other, but don’t want to listen to each other. It’s clear what each side is trying to say.
FIP is only concerned with those things that don’t involve his fielders (though Mike Trout stealing or not stealing HR are on the edge there). It is AGNOSTIC on every other event.
This is no different than OBP only being concerned with whether a batter reached base, and not interested in how far he reached base. So, a walk = HR for purposes of OBP. It is AGNOSTIC as to how far the batter got.
So, why do we have a problem? Well, there is a big gap between a player’s OBP and his overall offensive production. Namely, his HR and extra bases and his baserunning. It’s an obvious gap, so obvious, that we don’t even need to point it out. No one is going to say that “OBP is almost the same as a hitter’s overall offense”.
With pitchers, the gap between a pitcher’s FIP and his overall defensive production is fairly “small” (with small being defined however one wishes to define it). So, we end up with one side thinking “small” means “non-existent”, and the other side thinking “small” means “still notable”.
FIP is impressive because it only needs 20-35% of a pitcher’s batters faced in order to explain a large portion of his talent.
***
I should also point out if we actually compiled 86mph fastballs over the heart of the plate, we would include that in a metric.
Dave makes the argument that teams seem to be eschewing ERA in favor of peripherals. He brings up the case of Liriano, who has had an ERA above 5.00 in three of the last four seasons, and he asks:
For whatever reason, Liriano has been consistently terrible at stranding runners, and while it’s easy to write that off as a fluke over a year or even two, it gets a bit tougher to believe that this is all just random variance in sequencing when he’s at 840 innings pitched and has a career LOB% under 70%.
Another article wishing to raise awareness of kwERA.
By the way, I’ve never been particularly fond of the name, and I just made it up out of convenience. I used to also call it szERA (strike zone ERA), but then I realized that really that would be better for something based on called balls and called strikes or something. So, I reverted to kwERA (k for strikeouts and w for walks).
Anyway, feel free to call it something else, and if you can sound it out, like FIP (fip) or wOBA (wuh-bah), all the better.
Glenn gives us some kwERA results.
kwERA is one of those little metrics that I love because it tries to do as little as possible, be as transparent as possible, and yet, stand shoulder-to-shoulder with the sabermetr-icky stats. It’s a gateway metric even more appealing than FIP, as it shocks you that you can just rely on a pitcher’s strikeout (K) rate and his walk (W) rate to come up with a metric that mirrors future ERA as well as anything else you can think of.
I should note that the idea for kwERA came from GuyM, who I’m guessing was probably influenced by FIP, which was based on DIPS (Voros), which might have been subconsciously inspired by DER (Bill James).
My role for kwERA was really to test Guy’s theory on how to combine SO and BB (as it turns out, 1 to 1), and to determine how reliable it is (as it turns out, quite high). Back then, we were having a discussion on K/BB as a ratio and K-BB as a differential (per PA), and the differential is really the only way to look at combining the two.
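For anyone who wants to play with it, here is a minimal sketch of the differential form. The 1-to-1 weighting of strikeouts and walks is from the discussion above; the intercept of 5.40 and the slope of 12 are the commonly quoted constants and are an assumption here, not something stated in this post. They would shift with the run environment.

```python
def kwera(so, bb, pa, intercept=5.40, slope=12.0):
    """ERA-scaled estimate from the per-PA strikeout-minus-walk differential.

    intercept and slope are assumed, commonly quoted values; tune them to
    the league run environment you care about.
    """
    return intercept - slope * (so - bb) / pa


# Hypothetical season line: 200 K and 50 BB over 800 PA.
print(round(kwera(200, 50, 800), 2))   # 3.15

# The ratio K/BB treats these two made-up pitchers as equals (both 4.0),
# while the differential per PA separates them.
print(round(kwera(80, 20, 800), 2), round(kwera(240, 60, 800), 2))
```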
Anyway, Glenn’s done a tremendous job in exploring the issue even more, and really has added plenty of research for people to sift through and be convinced by. What I liked about what Glenn has done is that he simply followed the evidence, rather than try to prove (twist or bend the data) whatever he wanted to prove from the outset.
kwERA actually made an appearance in the THT annual this year, which is nice to see.
A reader on Bill’s site suggests:
Well, having watched most of [Cliff Lee’s] starts during his cold streak reminded me of something I have always believed: won-lost records tell us something about the run context/environment in which the pitcher competed. And Lee did not and the won lost record tells you that. He pitched in hard luck, but also, he pitched well in a lot of games where the other guy pitched better, and that tells you something about the run environment of his games. So, no when you get out-pitched in about half your starts, what you are seeing is a low-run context, not the tragic result. Lee is a good pitcher who had awful luck. But he’s not Cy Young, but rather an illustration of why W-L, the only stat automatically adjusted to game conditions and environment, etc., has some value.
Cliff Lee and Cole Hamels had very similar pitching lines, pitching in front of the same fielders and in the same parks. Bill’s reader is suggesting that Lee’s and Hamels’ polar-opposite W/L records are indicative of the run environment they happened to find themselves in on those particular days. That there was something in the environment that legitimately caused Lee to pitch in Astrodome-like conditions, while Hamels was pitching in Coors-like conditions (not that extreme, but something like that).
What say you?
Love listening to the pitching coaches talk:
BW: Casey Janssen. It’s late and sharp. It doesn’t have a lot of depth to it; it’s actually a true cutter. If you want to talk about a true cutter, it’s a side-to-side pitch. It looks like a fastball and then cuts off on the same plane. Anything with depth, I call a slider. Casey’s is a true cutter, because it looks like a fastball until the very last moment, then it either cuts in on the hands of a lefty, or away from the barrel on a righty. He’s got the ability to manipulate the ball in that way.
A cutter is a pitch that’s mostly thrown at the belt. It’s not thrown down in the zone as much as a slider. It looks like a belt-high fastball coming in, but when you swing at it, it’s moved off center a little bit.
DL: Unless you’re Mariano Rivera, can a cutter be your primary pitch?
BW: No, it shouldn’t. Nobody in baseball has as good a cutter as Mariano, so he could go to it like that. A cutter needs to be set up with a straight fastball. You can’t just sit there and cut and cut and cut, with a mediocre cutter.
According to Fangraphs’ compiled data of Gameday classifications, for 2010-2012 (min 100 IP total), Mariano Rivera threw the cutter 84% of the time, followed by Dotel at 81% and Jansen (Dodgers) at 70%. Soria is at 54%, Sonnanstine at 52%, and after that it goes down to 39% and lower.
DL: Can you usually tell if someone is going to be able to handle that role?
BW: You have to find out. You don’t know. They work their way up the ladder from the sixth inning, seventh inning, eighth inning, then you see if they pitch the same way in the ninth inning. The guys who don’t change the way they pitch are the ones who can handle that role.
In other words, a “clutch” pitcher is someone who keeps his cool. He doesn’t “rise” to the occasion.
Glenn gives us some information to consider. I’d be very wary of converting the correlation he shows into a causation. Just because you show a relationship doesn’t mean that one is the causative agent to the other. In addition, if we start the chart say in the late 1980s, right around when Henke and Eck et al were redefining the position, it’s not clear we’d get the same kind of correlation.
In any case, it’s a fascinating subject, one I’ve devoted plenty of research to. You have the idea of whether a pitcher is better served if he knows his inning role rather than his leverage role. Not only can he prepare himself based on the clock, but there are also fewer wasted warmups. You have LHP come in to face a disproportionate number of LHH. You have RP who come in for a set number of pitches, so they can gas it more and pace themselves less. You have pitchers who are brought in as relievers, rather than being failed starters. But you also have double the population of relievers today compared to the past. Looking at SP as a control group is one step, but not the only step.
Anyway, it’s fascinating, and the more ideas and research we can get, then the better the answer we can provide.
The idea of pitching is to randomize your pitches. If you throw your fastball two-thirds of the time, you can’t sequence it fast-fast-other-fast-fast-other. You can’t 100% throw your fastball in say a 3-0 count or not throw a fastball in an 0-2 count. There are different rates for each count for each pitcher (and for each quality of batter, etc). The whole thing is a huge random number generator centered on a mean based on the game state and identity of the pitcher and batter.
Bill notes that some pitchers will “pitch backwards”. That even though 94% of pitches thrown on a 3-0 count are fastballs, there are some pitchers who throw their fastballs half as much as that. He went through each count and identified James Shields and Bronson Arroyo as the most “backwards” pitchers.
Shields and Arroyo have had plenty of success in MLB. The question therefore is if they could have had even more success with a more traditional pitching scheme, or did they have to throw “backwards” as they have in order to achieve that success?
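The “random number generator centered on a mean” idea can be sketched in a few lines. Only the 94% fastball rate on 3-0 comes from the discussion above; the other rates, the seed, and the names `FASTBALL_RATE` and `next_pitch` are made up for illustration.

```python
import random

# Fastball rate by count. Only the 3-0 figure is from the post; the rest are hypothetical.
FASTBALL_RATE = {"3-0": 0.94, "0-0": 0.65, "0-2": 0.45}


def next_pitch(count, rng, backwards_factor=1.0):
    """Draw one pitch type at random, given the count.

    backwards_factor scales the fastball rate; 0.5 would mimic a pitcher
    who throws his fastball half as often as the norm in that count.
    """
    rate = FASTBALL_RATE[count] * backwards_factor
    return "fastball" if rng.random() < rate else "other"


rng = random.Random(2012)   # fixed seed so the run is repeatable
sample = [next_pitch("3-0", rng) for _ in range(10_000)]
share = sample.count("fastball") / len(sample)
print(f"fastball share on 3-0: {share:.3f}")

backwards = [next_pitch("3-0", rng, backwards_factor=0.5) for _ in range(10_000)]
backwards_share = backwards.count("fastball") / len(backwards)
print(f"'backwards' pitcher on 3-0: {backwards_share:.3f}")
```

The draws cluster around the target rate without ever settling into a fixed fast-fast-other pattern.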
Jeff investigated something Billy Beane said, and finds evidence to back it up.
We had a fantastic thread a few weeks back where MGL showed the team UZR of Tigers fielders, based on which pitcher was on the mound.
The impetus for that research was that the Tigers had overall below-average fielders, and while Scherzer and Porcello had worse-than-average BABIP (batting average on balls in play), Verlander had a better-than-average BABIP. The gap between the two sets of pitchers was 50 points of BABIP. And the point I was trying to make was that just because Verlander played in front of bad fielders doesn’t mean that we should presume that had someone else been on that mound on that day that they would have gotten bad fielding support. That perhaps, just perhaps, those bad fielders happen to be playing well that day. And so, we shouldn’t be “over-adjusting” Verlander’s BABIP on the assumption that he got bad fielding support.
Just as you wouldn’t presume a good hitting team provides good run support to all its pitchers, or a good bullpen helps out all of its pitchers, neither should you presume that having a good set of fielders behind you means that you ended up receiving good fielding support.
And MGL showed that Verlander indeed got good fielding when he was pitching.
Anyway, so Glenn decided to run a correlation of team UZR and individual pitcher BABIP to see how strong the correlation is. And it’s pretty darn low (r=.13). Indeed, a pitcher’s high K rate had a stronger relationship to his low BABIP (r=.20). The reality is that the overwhelming reason for the spread in BABIP is simply random variation.
However, Glenn’s study is subject to his sample, and unfortunately, his sample included everyone with at least 20 IP (ostensibly 60 BIP). So, that more than anything is going to drive the results. Remember, the larger the number of trials, the higher the r. That’s because the larger the number of trials, the less random variation has an impact, and the more signal can rise above the noise.
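To see why the minimum matters, here is a rough sketch of how much of the observed BABIP spread is signal at a given number of balls in play. It treats each ball in play as a binomial trial; the true-talent spread of 10 points and the .300 mean are assumptions for illustration, not figures from Glenn’s study.

```python
import math


def signal_share_r(n_bip, talent_sd=0.010, mean_babip=0.300):
    """Expected correlation between observed and true-talent BABIP.

    talent_sd and mean_babip are assumed values. Any correlation between
    observed BABIP and something else gets shrunk by roughly this factor.
    """
    noise_var = mean_babip * (1 - mean_babip) / n_bip   # binomial sampling variance
    talent_var = talent_sd ** 2
    return math.sqrt(talent_var / (talent_var + noise_var))


for n in (60, 300, 1500):
    print(n, round(signal_share_r(n), 2))
```

Under these assumptions, going from 60 to 300 balls in play roughly doubles the ceiling on r.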
I’d like to see the study redone with pitchers with at least 300 balls in play.
Bill James said something interesting:
It would depend on how good the pitchers are. If you take pitchers of a certain quality (Jim Kaat, Tommy John, Jamie Moyer, Bob Gibson, Jim Palmer, Catfish Hunter, Juan Marichal. ...like that) it would be close to 5-10 in their last 15 decisions. If pitchers with lesser careers, it takes less failure to drive them out of the game, so the norm would be more like 6.5-8.5.
First a note about W/L record: when you look at it in an aggregate sense, W/L records work out ok. If you wanted to use ERA+ or ERA-, that’s fine too.
Second, a note about taking the final anything of a career: you have a selection bias. A bad player on a streak of good luck won’t retire. A good player on a streak of bad luck may retire, but a good player on a streak of terrible luck will retire. A bad player performing as he should will retire. So, you really really really have to be careful.
So, this is why Bill is saying what he’s saying. If you have a presumed true talent .600 pitcher, who happened to pitch at .450, then you’d expect a bounceback, and he might still be given a chance (he ran into a lot of bad luck). But if you have a presumed true talent .600 pitcher who happened to pitch at .300, then it calls into question whether he actually still is a .600 pitcher (he couldn’t have run into so much bad luck… it must be a lower talent level).
Anyway, so Bill is saying that great pitchers have a .333 or so win% in their final 15 decisions, and lesser pitchers have a .433 win%. A reader did the work, and came out with:
Using Retrosheet game log files since 1911 when they have complete info, and considering only games as a starter: The top 20 pitchers with the most career starts finished 107-193 = .357; all other pitchers with 200 starts finished .423.
So, Bill nailed it as well as you could have possibly expected. Note that the reader separated the pitchers into “top 20 with most career starts” and “all others with at least 200 starts”. Presumably, if we looked at pitchers with 100-199 career starts, their final 15 decisions would have been closer to .430-.450, and then all other pitchers with at least 15 starts would be above .450.
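For the record, the conversions between those W-L lines and win percentages are just this (a trivial sketch; `win_pct` is an illustrative helper name):

```python
def win_pct(wins, losses):
    """Winning percentage from a W-L record (half-wins allowed)."""
    return wins / (wins + losses)


print(f"Bill, great pitchers (5-10):      {win_pct(5, 10):.3f}")
print(f"Bill, lesser pitchers (6.5-8.5):  {win_pct(6.5, 8.5):.3f}")
print(f"Reader, top 20 (107-193):         {win_pct(107, 193):.3f}")
```

The reader’s .423 for the other pitchers is quoted as a percentage only, so there is no record to convert.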
You could probably use this information to establish a replacement level. I’d like to see this work repeated with ERA+ or ERA-. You also don’t have to limit it to last 15 decisions. You can just look at “last year”, if it makes it easier to process.
You can also include age, and limit the look to only those last seasons before age 30 and from age 30+. You should see similar type of results due to the selection bias.
I can’t find the rule right now, but it’s somewhere in here (pdf). The pitcher must maintain one foot on the rubber while the ball is in his hand. He can’t release the ball with both feet off the rubber. And yet:
These were the first two pitchers I looked for, and I wouldn’t be surprised if virtually all pitchers are like this. Why don’t they just change the rule?
I love Mariano Rivera. You love Mariano Rivera. We all love Mariano Rivera. And when it comes to the post-season, I’d probably trust Mariano Rivera to pitch from the seventh through the twelfth inning more than anyone else, including RJ in relief, Pedro in relief, or anyone else. And what Mariano Rivera has done in his 15 years in relief, the consistency he’s shown, the lack of aging, is flabbergasting.
So, I can see why if he’s out for the season, there’s this automatic despondency. But this is baseball, where one player can only have so much impact. And any player is replaceable, for the right price. Or if you can’t replace him completely, you can come awfully close.
How close? In 2011-12, Mariano Rivera pitched 69.2 innings. In 2012, Rafael Soriano pitched two fewer innings at 67.2. Rivera gave up 15 runs, while Soriano gave up 17. Soriano allowed 6 HR and 49 other hits, while Rivera allowed 3 HR and 44 other hits. Soriano struck out 69 and Rivera struck out 68. The walks are what set Rivera apart from the pitching universe, with only 8 walks, while Soriano had 21 (BB-IBB+HBP). And after all that, Rivera had 49 saves and Soriano had 42 (note that Rivera got 5 saves in April, preventing Soriano from getting any).
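Putting those two lines on a per-nine basis takes one small wrinkle: “69.2” means 69 and two-thirds innings, not 69.2. A minimal sketch, using only the runs and innings quoted above (the helper names are illustrative):

```python
def innings(notation):
    """Convert baseball innings notation ('69.2' = 69 and 2/3) to a number."""
    whole, _, thirds = notation.partition(".")
    return int(whole) + int(thirds or 0) / 3


def ra9(runs, ip_notation):
    """Runs allowed per nine innings."""
    return 9 * runs / innings(ip_notation)


rivera = ra9(15, "69.2")
soriano = ra9(17, "67.2")
print(f"Rivera RA9 {rivera:.2f}, Soriano RA9 {soriano:.2f}, gap {soriano - rivera:.2f}")
```

That works out to roughly a third of a run per nine innings between them.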
Soriano did a fantastic job in picking up the slack. Yes, his performance was a shade behind Mariano Rivera. But his performance was good enough that he could actually stand in Rivera’s shadow.
And the Yankees are in the playoffs. How much of a dropoff could there have possibly been between Mariano Rivera and Rafael Soriano?
Yes, the playoffs are a different thing, and Mariano is unmatched as I said. You don’t have to remind me of that. I just said it twice. I’ll say it a third time: he’s the best.
When you quantify Mariano Rivera’s value, you realize that losing CC Sabathia would be more devastating than losing Mariano Rivera.
In The Book, we were able to establish the huge advantage to being able to pitch in relief as opposed to pitching as a starter. There are good reasons for this:
1. No need to pace
2. Not needing to face a batter a second time later in the game
3. Pitching when it’s colder (hitting performance peaks somewhere around 5-6 PM)
There could be other reasons that we’ve never researched. For example, is it easier to pitch if you only have two pitches to throw, as opposed to mastering 3 or 4 (or even 5!)? But, none of that is why I’m posting this thread. I’m just establishing a point of fact.
Anyway, so the reason is that three years ago, I established the “Rule of 17”:
Basically, use the “rule of 17”: difference in BABIP is 17 points higher as starter. K/PA is 17% higher as reliever. And HR per contacted PA is 17% higher as starter. Walk rate is FLAT.
I could kick myself, but I didn’t notice that the Rule of 17 ALSO applies to runs allowed per 9 innings. I did note this:
I went ahead and calculated the wOBA for all the starters/relief, and for pitchers since 1993, it was .352 for starters and .323 for relievers, a gap of 29 wOBA points, almost identical to what I found in The Book. The relievers had a wOBA that was 8% less than starters. When I looked at the whole dataset (1953-2008), it was also the same 8%.
Roughly speaking, the relationship between wOBA (or OBP or SLG) to runs allowed is that you square that component (or multiply any of the two components). That’s why Bill James’ original Runs Created worked so well, because it was essentially OBP times SLG. Anyway, so if we see that wOBA as a reliever is 8% less than wOBA as a starter, then roughly speaking, we’d expect runs allowed to be about 16% less.
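Here is that squaring shortcut worked through with the two wOBA figures quoted above. It is only the rough rule of thumb described in the paragraph, not a BaseRuns or Markov calculation.

```python
starter_woba = 0.352
reliever_woba = 0.323

woba_ratio = reliever_woba / starter_woba    # reliever wOBA relative to starter wOBA
runs_ratio = woba_ratio ** 2                 # rule of thumb: runs scale with wOBA squared

print(f"wOBA drop: {1 - woba_ratio:.1%}")
print(f"implied runs-allowed drop: {1 - runs_ratio:.1%}")
```

Squaring an 8% drop gives a drop of about 16%, not exactly double, because 0.92 times 0.92 is a bit more than 0.84.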
In fact, if you put the starter component numbers into something like BaseRuns or a Markov calculator, and then you translate those component numbers into their equivalent reliever component numbers and put them into the same BaseRuns or Markov calculation, you’ll get just a bit above 17% as the drop in runs allowed rate.
That is, we can extend the Rule of 17 for starter-to-reliever conversion by also including Runs Allowed into the mix!
This becomes very useful for when we do research into comparing year-to-year changes in performance, and we see that pitchers move between roles. So, to be able to put everyone on the same scale, you can take a reliever’s RA9 and divide it by 0.83 (that is, 1 minus 17%), and that puts it onto the starter’s scale.
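As a minimal sketch, the scale conversion looks like this. The 17% figure is the rule stated above; the function names and the sample RA9 are made up, and this is an aggregate adjustment, not something to trust for any one pitcher.

```python
RULE_OF_17 = 0.17   # runs allowed are about 17% lower in relief


def reliever_to_starter_ra9(ra9):
    """Put a reliever's RA9 onto the starter scale."""
    return ra9 / (1 - RULE_OF_17)


def starter_to_reliever_ra9(ra9):
    """Inverse conversion: a starter's RA9 expressed on the reliever scale."""
    return ra9 * (1 - RULE_OF_17)


# Hypothetical reliever with a 3.32 RA9.
print(round(reliever_to_starter_ra9(3.32), 2))   # 4.0
```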
The Book details the issue of selection bias, so you have to be careful when you look at INDIVIDUAL pitchers (notably, a starter going into an emergency relief role will NOT see an improvement in production, and a failed starter turned into a reliever will, by selective sampling, “improve” because he was unlucky to have failed as a starter to begin with, more likely than not, and so on… it’s all in The Book).
But, in the aggregate, this adjustment will help more than it hurts, and having an adjustment is better than no adjustment.
More excellent work from Glenn.
It is a principle urging one to select from among competing hypotheses that which makes the fewest assumptions. The razor asserts that one should proceed to simpler theories until simplicity can be traded for greater explanatory power.
Indeed, in this case, we don’t even need to apply Occam’s Razor, since we get LESS explanatory power the more complex we make it!
***
Glenn shows that the simplest of all, simply K per PA, shows an RMSE in his sample of 0.87 runs. That basically means that about two-thirds of his pitchers were within 0.87 runs, if we tried to predict next year’s runs allowed. Is that good or not good? Doesn’t matter. That’s the baseline.
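If RMSE is unfamiliar, this is all it is: the square root of the average squared miss. A minimal sketch with made-up predictions and results (none of these numbers come from Glenn’s sample):

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between paired predictions and outcomes."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical next-year RA9 predictions and what actually happened.
predicted = [4.0, 3.5, 4.5]
actual = [4.5, 3.0, 5.5]
print(round(rmse(predicted, actual), 3))
```

If the misses are roughly normally distributed, about two-thirds of them fall within one RMSE of zero.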
Then, we add in walks and hit batters, which are the natural companions to K. The RMSE goes down to 0.86 runs. (Had I not rounded, the gap would be even closer, going from .868 to .862.) Here, you can apply Occam’s Razor. We made it a bit more complex and we got a bit more explanatory power. I think it’s easy enough to accept to make it K minus walks.
Then we add in HR (in the form of FIP). We made it a bit more complex, and it did WORSE (RMSE of .865). In this case, there is no tradeoff. Adding the HR actually created more noise. Naturally, it seems insane to not want to know if a pitcher gave up 5 HR, 15 HR, or 25 HR. We have to also remember that Glenn chose his sample specifically based on playing time. And one would think that a selection bias would prevent the HR from getting too high. So, if the sample is filled with guys with similar number of HR, then there’s not enough variation to impact the RMSE. That could be a reason. (You can apply the same reasoning for walks, since a starting pitcher who gives up a sh!tload of walks won’t be making 100-inning seasons for two straight years.)
Anyway, so maybe HR does have too much noise in it. So, we trade HR for FB%, in the form of xFIP. We now made it even more complex than FIP. And the RMSE got worse at 0.88 runs. It was even worse than just using K per PA. That is, adding in both walks and FB% not only cancelled out the signal of the walk, but also cancelled out part of the signal of the K.
Glenn also tried SIERA, which includes K and BB and FB%, and then adds more, and then combines them in a non-straightforward manner. RMSE is 0.874, which is worse than a straight K / PA. Once again, there is no tradeoff, because SIERA made it more complex and got a worse result.
Finally, a straight RA9 to RA9 correlation: RMSE of 0.880, which doesn’t help us any.
So, we’re back to the basics, and that is K minus walks (or walks+hitbatter - IBB) as the best overall result, which has the additional virtue of being the second simplest method we can conceive (after K/PA).
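Lining up the RMSE figures quoted above, from best to worst, makes the pattern plain. The numbers are as reported in this post; xFIP was only given to two decimals, so its place next to RA9 is approximate.

```python
# RMSE (runs) for predicting next year's runs allowed, as quoted in the post.
rmse_by_metric = {
    "K/PA": 0.868,
    "K-BB": 0.862,
    "FIP": 0.865,
    "xFIP": 0.88,      # reported to two decimals only
    "SIERA": 0.874,
    "RA9": 0.880,
}

ranked = sorted(rmse_by_metric.items(), key=lambda item: item[1])
for metric, error in ranked:
    print(f"{metric:<6} {error:.3f}")
```

The full spread from best to worst is under two hundredths of a run, which is worth keeping in mind before declaring any winner.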