Tangotiger Blog

A blog about baseball, hockey, life, and whatever else there is.

Parks

Friday, January 19, 2024

Statcast: Theoretical and Actual HR Park Factors

As we know, Coors helps batters tremendously with the carry of the ball, on the order of twenty feet on 400-foot batted balls.  However, Coors also happens to be the deepest park in MLB, sixteen feet deeper for homeruns.  So, on the one hand, the environment adds 20 feet, while on the other hand, its configuration costs the batter 16 feet.  The net effect is +4 feet.  While that may not sound like much, each foot adds 3% HR, for an estimated +12% HR.  Since 2020, Coors has been at +9%.  So, that's a pretty good match between the actual HR being hit and what we'd expect based on the park's physical and environmental characteristics.
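The arithmetic above can be sketched in a few lines. This is just the back-of-envelope version, applying the 3%-per-foot figure linearly; the function name and structure are mine, for illustration.

```python
# Back-of-envelope version of the paragraph above.  The 3%-per-foot
# figure and the Coors feet come from the post; everything else here
# is illustrative.

def theoretical_hr_index(env_feet, config_feet, pct_per_foot=0.03):
    """HR index (100 = neutral) from net feet gained or lost."""
    net_feet = env_feet + config_feet  # carry added minus extra fence depth
    return 100 * (1 + pct_per_foot * net_feet)

# Coors: environment adds ~20 feet, configuration costs ~16 feet
print(theoretical_hr_index(env_feet=20, config_feet=-16))
```

The +4 net feet works out to the +12% (an index of 112) cited above.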

Here it is for all ballparks with GABP leading the way on one end, and Comerica on the other end (click to embiggen).

(4) Comments • 2024/01/22 • Parks

Friday, May 19, 2023

Statcast: HR in X Parks

We have recently updated the process for the HR in X Parks on Savant.  We have both a Standard version and an Adjusted version.

Previously, both were based on the landing distance with a presumed 45 degree descent angle (which implies a 30 degree launch angle).  This made the math pretty easy, since distance to fence plus fence height is the distance the batted ball needs to pass.  And for the most part, since HR are hit around that launch angle, the math worked well enough.
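Here is that shortcut in code form, a minimal sketch: at a 45 degree descent angle, one foot of wall costs exactly one foot of distance, so the two just add. The sample numbers are illustrative.

```python
# The 45-degree-descent shortcut: at that angle, one foot of fence
# height trades one-for-one with a foot of distance, so the landing
# distance needed is just fence distance plus fence height.

def hr_distance(fence_distance_ft, fence_height_ft):
    """Landing distance a ball must reach to clear the fence,
    assuming a 45 degree descent angle."""
    return fence_distance_ft + fence_height_ft

# e.g. a 370-foot fence with an 8-foot wall plays like a 378-foot fence
print(hr_distance(370, 8))  # -> 378
```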

Now, we use the actual trajectory of the batted ball, comparing the position of the ball in 3D, every 1 msec, to the fence it is trying to clear.  So, a lot more intensive in the calculation and a little more precise.  We'll check later to see who wins and loses the most under the new method compared to the old method.
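A toy version of that trajectory comparison might look like the following. The real calculation works with full 3D positions every millisecond against the surveyed fence; this 2D sketch with made-up samples just shows the shape of the check.

```python
# Toy trajectory-vs-fence check: walk the time-ordered samples, and at
# the point the ball reaches the fence distance, see if it is above the
# wall.  Real data is 3D at 1 msec resolution; this is 2D and coarse.

def clears_fence(trajectory, fence_distance_ft, fence_height_ft):
    """trajectory: time-ordered (horizontal_distance_ft, height_ft) samples."""
    for dist, height in trajectory:
        if dist >= fence_distance_ft:
            return height > fence_height_ft
    return False  # ball landed short of the fence

# made-up samples from a deep fly ball
traj = [(0, 3), (100, 60), (200, 95), (300, 90), (380, 40), (410, 0)]
print(clears_fence(traj, fence_distance_ft=375, fence_height_ft=8))
```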

As for Standard v Adjusted: the Standard method uses the pure overlay, looking only at the actual fence configuration.  This doesn't work for Coors especially, as Coors has the deepest fences in baseball.  The Standard method would treat Coors as the least HR friendly park, rather than the most.  Obviously, that is nonsensical and the complete opposite of reality.  That's where the Adjusted method comes in, using the actual park impact by distance.  Coors, for example, adds about 20 feet to long flyballs, and so that's what is needed.

If you only use one method, use the Adjusted method.  If you want to use two methods, use the Adjusted method.  If you really understand what you are doing, use the Adjusted method.  Finally, if all else fails, then you can use the Standard method, but only in conjunction with the Adjusted method, and never in isolation.  Never, ever in isolation. Ever.

Friday, July 29, 2022

Statcast Lab: the many methods of HR park factors

There are at least three different ways to determine park factors as it relates to homeruns. I will detail each one here.  It is important to note that just because a park favors or does not favor home runs does not necessarily mean it has the same favorability on runs scored.  While naturally there is some kind of relationship, it is not dispositive.  Sometimes a park helps HR and hurts other areas (or vice versa).

Method 1: WOWY

The first way is the traditional way: Count the number of HR hit at a certain home park, and count the number of HR for games involving that team on the road. This is reasonable enough, since we are controlling for the batters and pitchers involved, on the idea that players will play just as often at home as on the road. So, we are comparing Yankees batters and pitchers (and their opponents) at Yankee Stadium, and at parks of Yankee road games.

I have an enhanced version that actually controls for the identity of the batters and pitchers, and their handedness (rather than assume it, as most naive methods do). While it's reasonable enough to think Giancarlo Stanton will play just as often at home as on the road, it's not a given. And it's certainly not a given that pitchers will split their time evenly. When we do it this way, we can ensure that no one player will have a disproportionate impact. This is the method we use at Baseball Savant, and you can see those results here. I call this method the WOWY (with or without you) method.
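In its most basic form (without the identity and handedness controls of the enhanced version), the WOWY index is just a home rate over a road rate. A sketch, with made-up numbers:

```python
# Basic WOWY park index: the team's (and its opponents') HR rate at the
# home park versus the same team's HR rate in its road games.  The
# enhanced version also matches batter/pitcher identity and handedness;
# this sketch, with made-up numbers, shows only the core ratio.

def wowy_hr_index(hr_home, pa_home, hr_road, pa_road):
    """Park HR index, 100 = neutral."""
    return 100 * (hr_home / pa_home) / (hr_road / pa_road)

print(round(wowy_hr_index(hr_home=129, pa_home=6100, hr_road=100, pa_road=6100)))
```

A result of 129 reads as HR being hit at a 29% higher rate at home than in the road parks, like GABP below.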

Over the Statcast period of 2016-present, the park most hitter friendly for home runs by this WOWY method is Great American Ballpark (GABP, home of the Reds), at an index value of 129. In other words, HR are hit at a 29% higher rate than the average park. On the flip side, Oracle Park (home of the Giants) has an index value of 72, or 28% fewer HR than average. The second least friendly park is Kauffman (home of the Royals) at a value of 76.

Method 2: HR in X Parks

The second way is what we call the HR in X Parks method. We have the trajectory of every batted ball (home run or not). And so what we can do is "overlay" these trajectories onto each of the other 29 parks. If the path of those batted balls clears the fence for a particular park, we call that an "expected" HR. If it doesn't, then we don't. If a Stanton blast travels 450 feet, we know that that blast would be a HR in every park, or 30/30, or 100%, or 1.00 "expected" HR or 1.0 xHR. On the other hand, a Stanton laser that barely clears a park, when overlayed onto the other 29 parks may or may not also be a HR. Let's suppose it would only clear two other parks. And so, this HR would get an xHR value of 0.1 (3/30). It can go the other way as well: a Stanton warning track out might be a HR in other parks. We add all those up.

Once we do that, we can compare the total number of actual HR hit in a certain park to the xHR hit in all parks. And we get a ratio. GABP comes out as the top HR park by this method as well at an index value of 130. The least HR-friendly park by the HR in X Parks method is Kauffman at an index value of 74, while the next park is Oracle at 79.
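The bookkeeping behind the method can be sketched like so; the trajectory-clearing counts here are made up, and the real version also applies the per-park environmental adjustment described below.

```python
# HR in X Parks bookkeeping: each batted ball earns an xHR equal to the
# fraction of the 30 parks whose fence its trajectory would clear; a
# park's index is actual HR there divided by the xHR of the balls hit
# there.  Counts below are made up for illustration.

def xhr(parks_cleared, n_parks=30):
    """Expected HR for one batted ball."""
    return parks_cleared / n_parks

def hr_in_x_parks_index(actual_hr, total_xhr):
    """Park index, 100 = neutral."""
    return 100 * actual_hr / total_xhr

print(xhr(30))   # no-doubter: clears all 30 parks, 1.0 xHR
print(xhr(3))    # wall-scraper: its own park plus two others, 0.1 xHR
print(round(hr_in_x_parks_index(actual_hr=260, total_xhr=200)))
```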

One medium note: In the enhanced version I applied above, it's not just a pure overlay. If we did that, Coors Field would turn into a massive PITCHER's park. Why is that? Because Coors is the park where the fence is the farthest in MLB, 16 feet deeper than the average park. So how is Coors so hitter friendly? Well, its elevation allows the ball to fly farther, some 20+ extra feet. Therefore, the trajectories are adjusted to account for the different environmental conditions of each park. You can see those values right here.

Method 3: Fence Configuration + Park Environmental Conditions

The third way is (almost) purely theoretical. Since we lidar each park, we know the exact distance of every fence, along with their height. We can determine the "HR distance" as the distance of the fence, plus its height. We can do this because HR blasts launched at around a 30 degree angle will have a descent angle near fence clearing of around 45 degrees. In other words, in order to know the distance needed to clear the fence, we just need to know the distance to landing, which is distance plus height of fence. It's a very nice and simple shortcut. Coors for example is 398 feet for HR distance, while GABP is the shortest at 374 (the average is 382).

The other piece of theoretical knowledge is that every foot of distance will change the HR rate by 3%. So, GABP's fence configuration has an index value of 126 and so will add 26% HR, while the Coors fence configuration has an index value of 62, and so will remove 38% of HR.

We then need one more value: the environmental conditions noted above, like Coors adding some 20+ feet to flyballs.  Every park has its own environmental conditions value.  We add or subtract those numbers from the fence configuration number to come up with an index value, which I will call Fence+Env for purposes here.  GABP once again leads with an index value of 131.  So, three different methods, and all three of them agree not only that GABP is the most HR-friendly of the parks, but pretty much agree on its magnitude as well.
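Treating the 3%-per-foot figure as compounding reproduces the index values in this post to within a point or so. The sketch below assumes that compounding reading, uses the 382-foot average HR distance given above, and assumes an environmental carry of about 22 feet for Coors (the post says "20+"); those are my assumptions for illustration.

```python
# Fence+Env sketch: index from how many feet shorter (or deeper) than
# average the park plays, after adding the environmental carry, at 3%
# HR per foot compounded.  The compounding reading is an assumption.

AVG_HR_DISTANCE = 382  # league-average fence distance plus height, feet

def fence_env_index(park_hr_distance_ft, env_feet=0.0, pct_per_foot=0.03):
    """Park HR index, 100 = neutral."""
    net_feet = AVG_HR_DISTANCE - park_hr_distance_ft + env_feet
    return 100 * (1 + pct_per_foot) ** net_feet

print(round(fence_env_index(398)))               # Coors, fence only: ~62
print(round(fence_env_index(374)))               # GABP, fence only: ~127
print(round(fence_env_index(398, env_feet=22)))  # Coors with carry: ~119
```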

Coors Field has these index values:

115: WOWY method

122: HR in X Parks

119: Fence+Env

That's pretty good agreement on a park as extreme as Coors.  On the flip side, the Fence+Env method gives us a new winner for least HR-friendly: Comerica Park at 73. Kauffman (77) and Oracle (80) are next in line.

Disagreements

A few times, we just don't have much agreement. Yankee Stadium is one such park. Using the WOWY method (which has the advantage of modeling player behaviour, but the disadvantage of being at the mercy of sampling issues), it comes in at a robust 115.

The HR in X Parks method (which has the advantage of tracking each batted ball, but the disadvantage of assuming you can simply overlay batted balls as if players wouldn't change their behaviour) comes in at 107.

Finally, the almost purely theoretical method of just relying on the fence configuration and implied environmental conditions (which is its advantage, with the disadvantage of having no actual observations) is a neutral 100. Interestingly, if we relied just on its short fences, Yankee Stadium would come in at 114. It's only when we apply the environmental conditions, which suppress distances by 4 feet at that park, that the 114 becomes 100.

Is it possible that the players, not really accounting for the environmental conditions, see the shorter distance and change their behaviour? Maybe. In the end, what really matters is how players respond in reality, rather than assuming players won't respond at all.

Attached are the various HR park factors since 2016 (click to embiggen).  The last column is just a simple average of the other three.  It's not necessarily what I would recommend using.  I would lean toward WOWY, though like all things, a weighting of each of the three is likely in order.

(3) Comments • 2022/07/30 • Parks

Thursday, May 26, 2022

HR in X parks update: includes additional park factors

When we first rolled the HR in X parks feature out on Savant, we knew we had a problem with Coors.  The idea was that we'd roll out iteration 1, and get back to iteration 2. That took far too long, but here it is.  

The first iteration only dealt with distances overlayed on the park configuration.  As you may know, or maybe you don't, Coors has the deepest fences in MLB.  The fences are so deep that, ceteris paribus, they suppress home runs by nearly 40%.  But when one park sits 5,280 feet above sea level, things are not equal.  Coors adds about 20 feet to long flyballs.  Suddenly, a park that nominally has its fences 15 feet deeper than the average park in effect has its fences 5 feet closer!  This is why Coors is a HR haven, as Coors HR Derby winner Pete Alonso can attest.

Check out the before/after of Alonso's HR in X Parks, iteration 1 and iteration 2. (Click to embiggen)

(1) Comments • 2022/05/26 • Parks

Friday, June 19, 2020

Park Impact, 6 of N: wOBAcon v xwOBAcon

Here’s a link to the series.

One of the issues we have in determining the Impact of each Park is how to tease out Random Variation. And by Random Variation, it’s not necessarily “luck”. It’s really things that happen randomly at that park. If for example Comerica has JD Martinez and Miguel Cabrera hitting warning track bombs disproportionately at home than away, that’s Random Variation to Comerica. Unless of course we can isolate it, in which case, we can turn that into a parameter on its own, and no longer part of Random Variation.

Let’s start with Chase Field, as it provides an ideal test case: in 2018, it introduced the Humidor. We know about Coors introducing the Humidor a long while back. Now, we have Chase doing it with Statcast at the ready.

If you want to know the presumed impact of the Humidor, I’ll direct you to David Kagan and Alan Nathan. The summary is thusly:

Balls stored in the humidor have a higher water content than they would if they were stored in the Rockies dugout. The higher water content means the balls are “mushier.” In addition, they weigh a bit more. As a result, such a ball will come off the bat with a lower exit velocity and thus won’t travel as far.

To put it even plainer: the ball has less bounciness. And if it has less bounciness, the exit velocity is lower. That means if you take two identical swings at Chase, one in 2016-17 (pre-Humidor) and one in 2018-19 (with Humidor), the one with the Humidor will result in a lower Quality of Contact. The wOBA will decrease.

The above physicists correctly predicted a drop in production. To show it in wOBAcon (wOBA on Contact) terms: here is how batters (both Chase batters and their opponents) did at Chase, versus how those same two teams did away from Chase. In other words, we’re applying WOWY at the team level.

  • 2016: +0.038
  • 2017: +0.034
  • 2018: +0.003
  • 2019: -0.003

In other words, pre-Humidor, Chase added about 36 points to a batter’s wOBAcon, while with-Humidor, it was neutral. This is based on the actual observed outcomes.

How about the Quality of Contact? We have xwOBAcon, which turns the exit velocity and launch angle of each batted ball into its equivalent wOBA value. So the “x” takes the launch characteristics of mph and degrees and puts it onto the wOBA scale. That’s xwOBAcon. And we repeat the WOWY method on xwOBAcon. Here’s what we get:

  • 2016: +0.037
  • 2017: +0.047
  • 2018: +0.007
  • 2019: +0.003

Notice anything? The quality of contact pre-Humidor was +0.042, while the with-Humidor was +0.005. These +42 and +5 points based on launch characteristics are astonishingly close to the +36 and 0 points based on the observed outcomes. In other words, we are able to capture the effect of the Humidor, based on both the launch characteristics as well as the outcomes. Let me put these numbers side by side, and I’ll drop the decimals, and just refer to this from here-on-out as wOBA points. The first set is the Observed Outcomes (Obs) and the second set is the estimated values based on the launch characteristics (X). I’ll also add the Difference of the two, which is the Random Variation, meaning “everything else”.

Chase Field

  • Season Obs X Diff
  • 2016: +38 +37 +01
  • 2017: +34 +47 -13
  • 2018: +03 +07 -04
  • 2019: -03 +03 -06

Here’s how it looks for Coors:

  • Season Obs X Diff
  • 2016: +54 +02 +52
  • 2017: +41 -14 +55
  • 2018: +76 +19 +57
  • 2019: +67 +15 +52

Focus on that Diff. Do you see how consistent that number is? So, even though the Observed Actual Outcomes fluctuated between very high and enormously high, almost all of that fluctuation can be attributed to the Quality of Contact that season.

And we expected a high Diff for Coors: it’s the elevation. We could in fact create a parameter for Elevation, which would account for about 50 points of wOBAcon. In other words, we can explain everything we saw at Coors based on elevation and fluctuation of Quality of Contact. And we were able to explain a good deal at Chase.
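The Diff column in the tables above is nothing more than observed minus expected, in wOBA points:

```python
# Diff = Observed park impact minus the part explained by Quality of
# Contact (launch characteristics); what's left is Random Variation
# plus anything not yet parameterized (like Coors' elevation).

def diff_points(obs, x):
    return obs - x

print(diff_points(54, 2))   # Coors 2016 from the table above -> 52
```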

How about the other parks? Based on Observed Outcomes, the two largest Pitcher Parks are Marlins Park and Citi Field. However, as luck would have it, two of the three parks that had the lowest Quality of Contact were… Marlins Park and Citi Field. Let me show you the table for each:

Marlins Park

  • Season Obs X Diff
  • 2016: -39 -26 -13
  • 2017: -32 -31 -01
  • 2018: -52 -25 -27
  • 2019: -10 -11 +01

So, Marlins Park is still a Pitcher Park, but not as extreme as we thought. Unless of course the poor Quality of Contact is endemic to Marlins Park: maybe there’s a Batter’s Eye issue or something else about Marlins Park that is causing poorer Quality of Contact. Until that is determined, we can presume that it’s Random Variation.

Citi Field

  • Season Obs X Diff
  • 2016: -16 -07 -09
  • 2017: -38 -23 -15
  • 2018: -42 -16 -26
  • 2019: -10 -16 +06

Similar to Marlins Park, Citi Field is a Pitcher Park, but not quite the Pitcher Park the outcomes would make you believe.

Now, based on this methodology, the largest Pitcher Park is Busch. Busch was already in the running, since it had the third largest effect based on Observed Outcomes. But based on the Quality of Contact, Busch was actually a positive (which is by luck we presume). The result is a park that is hugely favorable to the pitcher:

Busch Stadium

  • Season Obs X Diff
  • 2016: -21 +06 -27
  • 2017: -17 +09 -26
  • 2018: -25 +07 -32
  • 2019: -34 -03 -31

The difference column is very stable, which is always good to see.

One of the parks that I highlighted in the series was Progressive Field, which was jumping around in Park Impact. It was actually extreme enough to bother me to look for a reason. I was thinking it might be a Great Lakes effect. Until today.

Progressive Field

  • Season Obs X Diff
  • 2016: +34 +17 +17
  • 2017: +01 -03 +04
  • 2018: +16 +16 +00
  • 2019: +05 -17 +22

Compare 2016 to 2019: the observed outcomes suggested there was +29 more points to try to explain in 2016. But once you account for the quality of contact, that basically disappears. There’s still an effect, but not a high variability one. And so the Great Lakes effect might still be real, just not in a capricious manner like wind. Progressive is still a Batter Park, but now we can see that the fluctuations we’ve observed coincided to a great deal with the Quality of Contact. To say I was sabermetrically-ecstatic is an understatement.

Here’s how the parks look when we plot the Quality of Contact against “Everything Else”. If we did this right, we should see no correlation. And we see… no correlation.

(Click to embiggen.)

Parks at the top are Batter’s Parks. Coors we know, GABP because of the HR, and Globe Life Park is the Texas heat. (Something we could do, and will eventually, is break down the Park Impact based on elevation, distance of fences, temperature, and “everything else”.)

Parks at the bottom are Pitcher’s Parks. Oracle (Giants) and Oakland being together almost certainly points to The Bay. So, in the above breakdown, we’d probably add proximity-to-water. Kauffman has high elevation and far fences. The fences help the pitchers more than the elevation. Comerica we all know about, or at least Miguel Cabrera, JD Martinez and Nick Castellanos can attest to the warning track outs that would otherwise be HR anywhere else.

Most Barreled balls resulting in outs by season in the Statcast era

  • 2019 - Nick Castellanos (20)
  • 2018 - Nick Castellanos (17)
  • 2017 - Nick Castellanos (17)
  • 2016 - Miguel Cabrera (19)
  • 2015 - JD Martinez (19)
(3) Comments • 2020/06/19 • Parks

Wednesday, June 17, 2020

Park Impact, 5 of N: Do you need to account for the Park Impact to forecast player performance?  Or what do we mean by overfitting?

About twenty years ago, there were three things I learned from saber-legend @Voros:

  1. DIPS (which is the most profound discovery in the last 30 years)
  2. Binary tree for events (too long to explain, but basically, chain events as something happening or not happening)
  3. Park Factors are not all that they are cracked up to be

That last one is the one that was surprising. Now, I don’t know if Voros STILL holds to it, or how strongly he holds to it. And what I’m about to say is based on memory. Now, I have terrible short-term memory, but pretty good long-term memory, especially if it’s something that hit me like a ton of bricks. Which this one did. Basically, Voros was saying that no matter what he did, other than for extreme parks like Coors, park adjustments weren’t worth it.

As I detailed in the Park Impact series this week…

... there’s A LOT of Random Variation going on. Indeed, most of what we see can be explained as Random Variation Centered Around a True Mean (or RVCATM for long). The way we had always done Park Factors is to use the OBSERVED Park Factors and assume that they are TRUE. After all, we’d see Coors, and we really have no reason to think otherwise. What ended up happening is when we do that, we get into situations where a pitcher’s park will “play” as a batter’s park one year, and we’d adjust the stats that season AS IF IT WAS A BATTER’S PARK. This is what is meant by overfitting: you take an observation and you do not back out the Random Variation that contributed to that observation.

So, Voros argued, it was better to do nothing (except for those extreme parks). What he was really saying is that there was more noise than signal, and so doing nothing was better than doing too much (aka applying the overfit). Except for his exceptions like Coors.

This is actually what led to me releasing The Marcels. The Marcels doesn’t make ANY park adjustment. I didn’t think it could work, but it did. Or at least, it worked well enough.

A better approach is (probably) to use a 5-year or 10-year rolling average to establish the Park Impact. (The best would be to apply SME-regression, but let’s stick with observations only.) So, given a choice between using NO park factors at all, using a one-year OBSERVATIONAL park factor, or using a five-year OBSERVATIONAL park factor, this would likely be the order of preference:

  1. Five-year
  2. None
  3. One-year

(Again, Coors notwithstanding, and regression notwithstanding.)
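A five-year observational factor is just a rolling mean of the single-season observations; a minimal sketch, with made-up seasonal factors:

```python
# Rolling observational park factor (1.00 = neutral): average the last
# `window` observed seasons rather than trusting any single one.

def rolling_park_factor(seasonal_factors, window=5):
    recent = seasonal_factors[-window:]
    return sum(recent) / len(recent)

# a noisy park that "plays" both ways in single seasons
seasons = [0.90, 1.10, 0.90, 0.95, 0.92]
print(round(rolling_park_factor(seasons), 3))
```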

Park Impact, 4 of N: BACON

We just did homeruns.  Now, we turn our attention to contacted balls (meaning plate appearances that don't end in a strikeout, walk or hit batter).  So, that's the "CON" part of the metric.  And we are looking at hits or reached-on-error.  So, BACON, batting average on contact.

There's Coors Field again: players at Coors will add 38 points to their BACON compared to away parks.  Fenway, Globe Life (Rangers), and Chase (Arizona) are other big hitters here, though Chase-since-Humidor is closer to league average.  At the bottom are Petco and Citi, which will drop BACON by 20 points.  The Marlins, A's, and Mariners home parks are also costly.

We see the standard deviation for each park hovers around 10 points (all parks are 5 to 15 points).  What would we expect from Random Variation?  Also exactly 10 points.  In other words, we don't have to worry about externalities in terms of a park impact for each park.  Whatever variation you see, season to season, can be explained largely, even almost totally, by Random Variation.

In terms of Regression Toward the Mean: we would regress 41% for any one season, meaning we'd add 0.7 seasons of ballast.  Which means that if you had 7 seasons of observations, you'd regress only 10% toward the mean. So again, if you want to go with a rolling average, you will be pretty safe going with 7 seasons unregressed.  Unless, as noted, you know something major happened, like at Chase.
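The ballast arithmetic works like this: regressing 41% on a single season implies a ballast b satisfying b/(1+b) = 0.41, or about 0.7 seasons, and the same formula gives the seven-season figure.

```python
# Regression toward the mean with ballast: fraction to regress is
# ballast / (seasons + ballast).

def regression_amount(n_seasons, ballast=0.7):
    return ballast / (n_seasons + ballast)

print(round(regression_amount(1), 2))   # one season    -> ~0.41
print(round(regression_amount(7), 2))   # seven seasons -> ~0.09
```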

(Click to embiggen.)

(1) Comments • 2020/06/17 • Parks

Park Impact, 3 of N: Homeruns

In part 2, we looked at strikeouts.  If you go into the comments of that thread, I added the walks, which was fairly uninteresting.  Now, the homeruns.  As a reminder, or maybe I never mentioned it, I am using the Quick way.  That means looking at batters at home and away, as well as pitchers at home and away.  And "home" is the team's main home.  If teams play a game at a neutral site, neither is home.  And if a hurricane causes a team to host another team, but the host team bats first, the host team is considered home.

(Click to embiggen)

So, for those who haven't followed the series: this shows how many HR are added at that park, per plate appearance.  Coors adds 0.006 HR/PA, while Oracle Park subtracts 0.007 HR/PA (relative to the away parks).  More red, more HR, more blue, less HR.

The last column shows the standard deviation of the seasonal park impact numbers.  The mean is 0.003 (or 0.0031 for more precision).  What would we expect due to Random Variation (not even including wind, temperature, and park changes)?  That would be 0.003 (or 0.0030 for more precision).  In other words, the season to season changes you see are what we'd expect from Random Variation.  I have to say, I am surprised.  It's more likely that this Quick method is just not good enough.  I'll do it the good or better way soon.  Still, given the results I see, I doubt we'll see anything that will be noticeably different.  I mean, I might have expected 0.0035, maybe as high as 0.0040.  Which means that visually, it won't stand out any different.
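As a sanity check on that Random Variation figure, treating HR/PA as binomial in both the home and road halves gets you to the same place. The ~0.03 HR/PA rate and ~6,000 PA per half are my round-number assumptions, not from the post.

```python
# Expected SD of (home HR rate - road HR rate) from binomial noise
# alone, under assumed p ~ 0.03 HR/PA and ~6,000 PA in each half.

import math

def expected_rv_sd(p=0.03, pa_home=6000, pa_road=6000):
    var = p * (1 - p) / pa_home + p * (1 - p) / pa_road
    return math.sqrt(var)

print(round(expected_rv_sd(), 4))
```

That lands on ~0.0031, right where the observed seasonal SD sits.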

Anyway, given that, it's far more likely that we should use a standard park impact number, rather than a seasonal one.  If you want to argue for a rolling five-year, that's probably fine.  We should make adjustments by temperature, though that will only be felt at the game level, and hardly make a dent at the seasonal level.  And if you are aware of park configuration changes, like Citi went through some years back, you should use that too.  So, HR park impact numbers, more than any of them, require some SME (subject matter expertise) hand-holding.  

And here's how it looks comparing HR to wOBA.  As you'd expect, some strong relationship.

(4) Comments • 2020/06/17 • Parks

Tuesday, June 16, 2020

Park Impact, 2 of N: Strikeouts

Continuing our look after wOBA, we turn our attention to how the park impacts strikeouts.

The more red, the more the park adds strikeouts.  The more blue, the fewer the strikeouts.  Coors and Kauffman are the two parks that keep strikeouts lowest.  As it turns out, along with Chase, they are also the parks at the highest elevation.  By and large, fewer strikeouts means a batter's park and more strikeouts means a pitcher's park.  And these results are somewhat in-line with that, though there are a few exceptions, like Busch being a pitcher's park AND decreasing strikeouts.

That last column is the standard deviation of the park impact numbers for strikeouts.  We would have expected one SD to be 0.007.  Instead, the average of these SD is 0.008.  There's really no reason outside of Random Variation to see the bouncing up and down.  If it went up and stayed up for a park, we can attribute it to some change in the background that helps (or hurts) the batter's eye.  By and large, we don't see that.  

Globe Life Park looks like a candidate where that could have happened.  From 2001-2008, it helped the batter slightly, by reducing strikeouts by 0.005 per PA.  From 2009-2014, it helped more, by reducing strikeouts by 0.011.  Then from 2015-2019, it helped the batter tremendously by dropping strikeouts by 0.020 per PA.  That said, one SD is 0.009, showing that it's not that alarming, but if someone knows the changes at Globe Life Park, then we might learn something.

(Click to embiggen.)

 

(2) Comments • 2020/06/17 • Parks

Monday, June 15, 2020

Can a hitter’s park “play” as a pitcher’s park in some season?

Coors Field is a hitter’s park. The evidence is easy to see. Rockies batters scored 165 more runs at home than on the road in 2019, while Rockies pitchers allowed 130 more runs at Coors than away from Coors. The reason we do it this way is because we want to use the same players at Coors and away from Coors. So, we look at Rockies batters at Coors and away. And we look at Rockies pitchers at Coors and away. This way, it doesn’t matter if the team is filled with great batters or poor batters, or great pitchers or poor pitchers. It’s the same players at Coors and the same players away from Coors. This is the WOWY approach applied to parks: players with Coors compared to the same players without Coors.

Anyway, in 2019, that’s a whopping 295 more runs scored at Coors than away from Coors, in games involving Rockies players. Or if you prefer it in ratio form, there were 1.39 times as many runs scored at Coors as away from Coors, in games involving the Rockies. In 2018, it was 1.27. In 2014 a whopping 1.50, and in 2012, an even more whopping 1.57.

Indeed, in the history of Coors Field (which includes the addition of the Humidor), the park impact has a ratio of 1.39X. Which is the identical figure for 2019. It’s clear: Coors Field is a hitter’s park.

On Friday, July 12, 2019, the Rockies beat the Reds 3-2 at Coors. Did it “play” as a pitcher’s park? The next day still with the Reds, there were 26 runs scored. And on Sunday, it was 19 runs. Did this mean that on Saturday and Sunday it “played” as an extreme hitter’s park?

On July 3, there were six runs scored at Coors. The prior day, against the same team, it was 17 runs. Did a hitter’s park play as an extreme hitter’s park followed by playing as a pitcher’s park?

How about a double header on July 15 against the Giants? 21 runs in game one and 3 runs in game two. So, did Coors play as an extreme hitter’s park in the afternoon and an extreme pitcher’s park in the evening?

We all know the answer here: NO, Coors Field does not toggle between playing like a pitcher’s park and hitter’s park. It IS a hitter’s park, and sometimes, Random Variation conspires for fewer walks and hits to result than we’d normally expect at Coors. And sometimes, Random Variation conspires for more walks and hits to result than we’d normally expect at Coors. And we rely on the so-called law of large numbers to minimize Random Variation. And we may THINK that 81 games is enough to tell us if a park is a hitter’s park or a pitcher’s park. But no, 81 games is still not enough. In 2012, I said the ratio was 1.57 and in 2014, I said the ratio was 1.50. Those are the two largest ratios at Coors compared to away since 2000. But in 2013, it was 1.27. While still extreme in 2013, it wasn’t as enormously extreme as the prior year and the succeeding year. And in 2013, it was below the Coors historical average.  It’s a huge bounce back and forth.
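A quick simulation makes the same point: fix a true 1.39x Coors effect and a road scoring rate (both teams combined), draw 81-game seasons, and watch the observed ratio bounce. The ~9 total runs per road game and the Poisson model are my assumptions for illustration, not anything from the actual data.

```python
# Monte Carlo of a single-season observed park ratio under a fixed true
# effect.  Per-game run totals are drawn as Poisson (an assumption).

import math
import random

def simulated_ratio(true_ratio=1.39, road_mean=9.0, games=81, rng=None):
    rng = rng or random.Random()

    def poisson(lam):
        # Knuth's method: multiply uniforms until the product drops below e^-lam
        limit, k, prod = math.exp(-lam), 0, 1.0
        while prod > limit:
            k += 1
            prod *= rng.random()
        return k - 1

    home = sum(poisson(road_mean * true_ratio) for _ in range(games))
    road = sum(poisson(road_mean) for _ in range(games))
    return home / road

rng = random.Random(42)
ratios = sorted(simulated_ratio(rng=rng) for _ in range(1000))
print(round(ratios[25], 2), round(ratios[974], 2))  # middle ~95% of seasons
```

Even with the true ratio pinned at 1.39, single 81-game seasons routinely land in the 1.2s or the 1.5s, which is the Coors 2012-2014 bounce in miniature.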

So, now imagine a park that is at 0.90 one year, 1.10 another year and 0.90 in the third year. That’s still the kind of bounce that we saw with Coors in 2012-2014. Except now, because it’s below/above the 1.00 neutral line, it makes it seem as if the park “played” one way one year and another way another year.

The reality is that 81 games is not enough to figure out if a park is a batter’s park or a pitcher’s park.

More to come…

Wednesday, October 02, 2019

Statcast Lab: Park Bias Report in Pitch Speed

Continuing my look at uncovering park biases, if any, I now turn my attention to Pitch Speed.

The typical way I have done this in the past is the WOWY (with or without you) approach.  It's fairly straightforward, if a bit tricky to code.  You look at pitchers at each park, and compare their speeds there to their own speeds in the rest of the league's parks.  So, at Fenway and away from Fenway (and not just Redsox pitchers, but ALL pitchers who pitched at Fenway and away from Fenway).  You figure out their difference in speeds, weighted by the lesser of their number of pitches in the "two" parks.  Here's how that looks for 2018 and 2019. (click to embiggen)

Now, simply getting non-zero values doesn't establish a bias.  We have to figure out how much random variation could have contributed to that.  We see in the above that Yankee Stadium appears at the top in both years, while Globe Life Park was up one year and down the other.  This is a good sign that we've got some level of random variation.  A correlation of the two gives us an r of 0.50.  This means that about half of what you see (using this method) is signal and the other half is noise.  So seeing +0.30 in 2019 for Yankee Stadium would mean there is a bias of 0.15 mph.  Every other park in 2019 is less than +/- 0.1 mph.
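That last step is just shrinkage by the signal fraction: since the year-to-year r of 0.50 says roughly half the observed difference is signal, the estimated true bias is r times the observation.

```python
# Shrink an observed park speed difference by the signal fraction r.

def estimated_bias(observed_diff_mph, r=0.50):
    return r * observed_diff_mph

print(estimated_bias(0.30))   # Yankee Stadium 2019: +0.30 observed -> 0.15 mph
```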

This is a very weak bias.  And it's not even clear that this bias would necessarily be at the tracking level. There could be environmental reasons where the release speed is higher in one park or the other.

As I've linked above, and you have seen in my blog the past few months, I have a clearer method to look for park bias: we compare the home pitchers to the away pitchers in the same park.  If for example Citi Field is (literally) home to fireballers, we would not expect the tracking of the away pitchers to also have a high pitch speed.  But, if the Mets pitchers aren't that (pun intended) hot, but the tracking is showing them high, we'd expect the away pitchers to also have their speeds read hot.  

So, a flat line shows zero bias, and a sloped line at 45 degrees shows complete bias.  Here's how it looks in 2018 and 2019, limited to fastballs and sinkers only (click to embiggen):


To say I was sabermetrically ecstatic when I ran this a few minutes ago is to put it mildly.  Citi Field tracks the home pitchers hot, and the away pitchers not.  Which is what you'd expect on a team of fireballers.  Yankee Stadium does show the away pitchers slightly hot, consistent with the WOWY approach I just presented.

However, we can't just look at individual points.  The key is to look at all 60 points.  And all 60 points are scattered all over the place, with no correlation at all between the fastball speeds of home pitchers to their peers in the same park.

Also note that the range in speeds of the home pitchers is quite wide, at +/- 1.5 mph, while the away pitchers (made up of basically every other pitcher in the league) sit at +/- 0.5 mph (or -0.6 to +0.4).

As I do a year-end analysis of all the data points you've seen me post about in the past few years, I will run these home/away park bias reports, so we can see the extent to which biases exist (if any), and how we need to correct them (as we saw with the Catcher Framing).

Friday, April 26, 2019

Pre-Lab: Behavioural Park Impacts

I've been looking at park effects on and off for 20+ years.  Recently, it's been "on".  It was also "on" twelve years ago, where if you also look in the comments, I have park impact numbers for walks and strikeouts and other things.  At the time, I presumed the REASON was lighting and backdrops and whatnot.  While that is still a cause, there may be a more relevant cause: behaviour.

We are aware that players will change their behaviour based on the run environment.  It would not surprise you if, for example, I said that batters are going to take more pitches and swing less, and by doing so will swing with more authority, which may result in a more go-big-or-go-home approach.  So, the run environment, and the era environment, more than the park, would contribute to more K.  While we are all aware of this implicitly, I did not put 2 and 2 together and consider it at the park level as well.  After all, if it's a 6 run per game park at Coors, it doesn't matter (too much) if the rest of the league is at 5 or 4 runs per game.  Players respond to their environment (to a point).

I'll probably be going into "off" mode soon with park effects, but I just wanted to put this out there, in case aspiring saberists have interest.  When I get back "on" this subject, I'll dig more into it.

Wednesday, March 27, 2019

WOWY Framing, part 5 of N: barnstorming testing

In part 4, I talked about how to test for parks.  You really should read that first, as what I'm about to say next will be... unconventional.

Thanks for reading that.  This test will be based on the away team of each park.  Let's take the Expos (*).  They visit Shea (**).  The catchers for the Mets are now labelled "host of Expos".  Expos visit Fulton County (***) and the Braves catchers are also labelled "host of Expos".  So, I will have "Expos away" paired with the rest of the NL (****) "host of Expos".  I repeat all this for every team.

  • (*) yes, I am still in denial
  • (**) yes, I am still stuck in 1994
  • (***) yes, the Expos broke the division string
  • (****) no interleague play still

So, what do we expect?  Well, without doing ANY adjustments, we should see no correlation.  After all, we have the Expos not tied to any one park, nor to any one team.  The Expos are barnstorming their way around the league.  As are all other teams. This is the perfect control group. This is the image on the left below.

And AFTER we apply adjustments, we should ALSO not see any correlation.  This is how we can make sure we haven't overfitted the data.  Sweet right?  This is the image on the right below. (click to make bigger)

And this looks pretty good to me.


WOWY Framing, part 4 of N: pitcher and park adjustments

One of the key components of the WOWY process is being able to identify and effectively neutralize variables.  In this case, for Catcher Framing, we are interested in accounting for the pitcher and park.

One of the best ways to know that you need a park adjustment to begin with is to compare the performance of the home and away catchers.  Ideally, the correlation should be 0.  After all, they are two independent groups of catchers.  Well, they are dependent: they play in the same park.  

This is what it looks like for 2015-2018, doing Catcher Framing without doing a WOWY.


As you can see, there's a strong park influence.  The runs saved by the home catchers are certainly not independent of the away catchers.  The correlation is a high r=0.48.

Now, using the WOWY process, this is what I end up with.

I was able to remove the park influence, but it is possible I went slightly too far: the correlation is now r=0.14, in the negative direction.

Also notice how the points of the away catchers got much tighter (the data points along the y axis are much more compressed).  The home catchers of course won't compress as much, since those catchers are linked to their home parks.  The away catchers are essentially even across all the parks.

Anyway, any time you create a metric, always test for a park effect.  And after you (think you have) neutralized for the park, test again to make sure there's no more bias, or at least that the bias is reduced a good deal.

(7) Comments • 2019/08/30 • Ball_Tracking Parks

Tuesday, January 01, 2019

How much random variation is there in park factors?

A few of you are walking into the middle of a conversation. I'll try to catch you up. The Colorado Rockies, prior to the installation of their humidor, from 1995 to 2001 at Coors, had 13.8 runs per game (scored and allowed). Those very same Rockies, when they played away from Coors, had 8.8 runs per game. That is an enormous impact which we can attribute to Coors.

Wait, you are saying: how about controlling for the hitters and pitchers? This is why we are looking at all games involving the Rockies. Walker, Helton, Bichette, Big Cat, Hampton, they all played at home and on the road. In a scientific experiment, what you try to do is isolate one variable and plan on ceteris paribus (all other things equal). In this case, with a presumption that we have similar players at Coors and (in games involving Rockies only) away from Coors, the one variable we are isolating is the park, in this case, Coors.

Now that you are caught up, our park factor is 13.8/8.8 = 157%. In other words, runs come in 57% greater at Coors than away. From 2002-present, the Humidor era, we have 11.2 runs at Coors and (in games involving Rockies) 8.5 runs away from Coors, for a 132% park factor. (On some sites like @Fangraphs and @Baseball_ref, you will see that 132% presented as 116%, to reflect the factor you'd apply to a seasonal line: 132% for Rockies home games and 100% for Rockies away games.)

Now you are really caught up.

Sometimes, you will see jumps in park factors year to year for the same park. Take Arlington Park, where the stadium configuration has been virtually untouched since 1994:

http://www.seamheads.com/ballparks/ballpark.php?parkID=ARL02&tab=pf1

But the ballpark factors year to year are all over the place (these park factors are the "untouched", not the "split in half" factors):

http://www.seamheads.com/ballparks/ballpark.php?parkID=ARL02&tab=pf1

So, what happened? Is this weather, or is this random variation? Or both? The easiest way to figure out how much the weather and other externalities are affecting the park is to first figure out how much random variation can exist.

Enter wOBA. There are about 6000 PA in each park each year. One standard deviation in wOBA is 0.5/sqrt(PA) or 0.006455. When you have two sets of observations of 6000 PA, the standard deviation of the DIFFERENCE of those two distributions is multiplied by root2, or 0.006455 x 1.41 = 0.009. That is, based purely on random variation, we expect one SD in the difference of the home and away wOBA to be a whopping 0.009. 

To put this in a practical sense, you would observe .329 and .320 for home and away at one standard deviation. Divide the two, and the wOBA park factor is .329/.320 = 1.028. And since runs are proportional to the square of wOBA, then the run park factor that we can attribute to random variation is one SD = 1.057 or 5.7%.
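The arithmetic above can be sketched in a few lines of Python. The .3245 midpoint is an assumed league wOBA, chosen only so that a one-SD split lands on the .329/.320 pair in the text:

```python
import math

PA = 6000                         # roughly one park-season of PA
sd_woba = 0.5 / math.sqrt(PA)     # one SD of wOBA over 6000 PA: ~0.0065
sd_diff = sd_woba * math.sqrt(2)  # SD of the home-minus-away difference: ~0.009

mid = 0.3245                      # assumed league wOBA midpoint
home = mid + sd_diff / 2          # ~.329 at one SD
away = mid - sd_diff / 2          # ~.320 at one SD
woba_pf = home / away             # wOBA park factor: ~1.028
run_pf = woba_pf ** 2             # runs ~ wOBA squared: ~1.057
```

So one SD of pure random variation in a one-year run park factor works out to roughly 5.7%.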

If you look at Arlington, we can see that one SD of those one year park factors is 10.1%.

We can back out the random variation portion:

10.1^2 = 5.7^2 + ArlingtonSpread^2

ArlingtonSpread = 8.3%

Therefore, we can see, at least with Arlington, that it is subject to a great deal of variation (likely weather related) beyond just random variation.

Citi Field, since 2012 (the current configuration) has an observed spread of 6.7%, just barely more than the random variation. That leaves 

CitiSpread = 3.5%
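That back-out step assumes the random and real components add in quadrature; as a one-function sketch:

```python
import math

def real_spread(observed_sd, random_sd):
    """Back out the non-random spread, assuming the observed spread is the
    random and real components added in quadrature:
    observed^2 = random^2 + real^2."""
    return math.sqrt(observed_sd ** 2 - random_sd ** 2)

print(round(real_spread(10.1, 5.7), 1))  # Arlington: 8.3
print(round(real_spread(6.7, 5.7), 1))   # Citi Field: 3.5
```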

In 2018, Citi Field had a one-year park factor of 74%, compared to the 2012-2017 average of 89%. That difference of 15% is not (all) real. A good portion of it is random variation, some of it is the natural ebb and flow of weather patterns in New York. The rest COULD be real.

If I had to take an educated guess at the TRUE park effect of Citi Field in 2018, it would be a simple average of:

2018 observation of 74%

2012-2017 observation of 89%

Regression Toward the Mean of 100%

And so, 87-88%. Maybe it would be a 3-2-1 weighting instead? In that case, it would be 83%. But that's probably as low as you'd take it. It certainly is not 74%.
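To make the blend concrete, here's a quick sketch. Reading the weighting as 3-2-1 (most weight on the most recent observation) is my interpretation of the arithmetic that lands on 83%:

```python
def blend(estimates, weights):
    """Weighted average of park-factor estimates."""
    return sum(e * w for e, w in zip(estimates, weights)) / sum(weights)

estimates = [74, 89, 100]  # 2018 obs, 2012-2017 obs, regression to the mean
print(round(blend(estimates, [1, 1, 1])))  # simple average: 88
print(round(blend(estimates, [3, 2, 1])))  # 3-2-1 weighting: 83
```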

We can look into it next time.

Anyway, for aspiring saberists, go through this process for all the parks, and show us which parks have a large spread like Arlington and which are closer to pure random variation.

***

P.S. And this is for @pobguy, since he likes the controlled stadium of Tropicana: the spread in the one-year run factor of 1998-2018 is 6.6%, a bit higher than the 5.7% we'd expect from Random Variation.

Which, now that I think about it: 5.7% is actually the random variation based on wOBA only. RUNS however have an additional component, and that's the timing in stringing these events along. So, 5.7% is just random variation in INDEPENDENT wOBA. But then you need the random variation in the stringing of these. That'll probably bring it up to 6 to 6.5%. There's also random variation in that we're not TRULY having IDENTICAL players home and away. And so, maybe we do in fact get closer to 6.6-6.7%, the spread in park factors we observed at Tropicana (and Citi Field).

So, when we see the park run factor of The Trop, from 2009-2013 look like this: 100-81-83-88-94, that could very well be explained purely by random variation. And so, you would simply apply the same park factor each year.

***

P.S.2: Be careful because the other 29 parks may have changed, so, relative to the other 29, you will need to account for that.

(11) Comments • 2022/12/07 • Parks Statistical_Theory

Friday, August 24, 2018

Jacob deGrom flipping a Bayesian Coin

If god gave you a coin, and said that it was perfectly balanced and immune to tampering, and you proceeded to flip 10 heads in a row, your expectation for the next flip is exactly 50/50. That is because your PRIOR EXPECTATION was 50/50, and no amount of observed outcomes will move you off that prior.

In the real world, your expectation on the next flip will be a smidge over 50/50, with that smidge depending entirely on the strength of your prior expectation of 50/50.

Mathematically, it looks like this:

  • S: how strong is your prior
  • p: probability of your prior
  • F: number of observed flips
  • h: percentage of flips that are heads

Expected heads rate going forward = (S x p + F x h) / (S + F)

If your prior is god, S is infinity. Otherwise, your S is going to be somewhere between 1 and some large number, let's say a million. Let's say S is 100 for this example. p is 50%.

A prior is a "pre-flip", it's something that you presume has already happened. So, you had 100 pre-flips, of which 50 are pre-heads. You then got 10 straight heads. In other words, the total number of flips is not 10, but 110. The number of heads you observed is not 10 but 60. And so, your expectation going forward is 60/110, or about 55%.

You think: no way would I bet on 55%. In that case, the strength of your prior was not 100, but maybe 1000. You repeat the exercise, and now you see you had 1010 flips, of which 510 were heads, for an expectation in the future of 50.5% heads.
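The pseudo-flip bookkeeping is just a beta-style update; a minimal sketch:

```python
def posterior_heads(p, S, flips, heads):
    """Treat the prior as S pseudo-flips at rate p, then fold in the
    observed flips: expectation = (S*p + heads) / (S + flips)."""
    return (S * p + heads) / (S + flips)

print(posterior_heads(0.5, 100, 10, 10))   # 60/110, about 0.545
print(posterior_heads(0.5, 1000, 10, 10))  # 510/1010, about 0.505
```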

What S strength should your prior be?  It should be whatever you know about coins, the minting process, your history, and a good guess.

***

That's how this works. If you see Jacob deGrom pitching at Citi Field which has depressed run scoring at an enormous rate this year (at least that's what we have observed in a vacuum), you cannot simply ignore the history of Citi Field, or of MLB history, and think that Citi helped pitchers to the extent this year's observations would make you think.

Citi has had 6 years under its current configuration, with 9 years in all, PRIOR to this year.  You have to count them as... something.  Because of weather, and because other parks also change in comparison, we don't want to treat those observations as equally weighted to this year.  You'd probably want to count these 9 years as about six, with more weight to the more recent seasons (and making adjustments for the change in configuration).

You ALSO want to simply consider the possibility that Citi is a neutral park.  Maybe that gets a weight of 1.  So, you have six weighted years at about -11%, 1 year at 0%, and this current year (78% of a season) at -30%.  Add it up, and our expectation going forward is -11.5%.
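The add-it-up step looks like this, with the weights and effects taken straight from the paragraph above:

```python
# six weighted years of prior at -11%, one neutral year at 0%,
# and 78% of the current season at -30%
weights_and_effects = [(6.0, -11.0), (1.0, 0.0), (0.78, -30.0)]
total = sum(w * e for w, e in weights_and_effects)
weight = sum(w for w, _ in weights_and_effects)
print(round(total / weight, 1))  # -11.5
```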

Maybe you don't like that prior.  So go ahead, and try it out yourself.  You come up with your own prior.  Whatever it is, you will find that your true expectation is going to be much closer to -11% than -30%.

How Citi Field has affected deGrom this year is pretty much the same as every other year of his career.

(2) Comments • 2018/08/25 • Parks Statistical_Theory

Wednesday, September 21, 2016

Statcast Prelab: Evaluating the effect of positioning outfielders

So, I had this idea that I thought I could implement pretty quickly.  Of course, now that I see the results, I see that it's going to take me a bit longer.

The idea was this: what would happen if I position every fielder for every play in a standard spot for that position.  For example, the standard position for a RF is to play at +27 degrees, 294 feet from home plate.  Therefore, I'd ask "What would Adam Eaton do if he played at +27, 294 on every single play?"  I figured out how far each ball hit was from that point along with its hang time.  And therefore I could figure out his "needed speed" to make the play.  Since I already have Eaton's performance at each needed speed level, I could just apply that rate to Eaton.  And the results look great.  Except... except when they don't.
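A minimal sketch of the "needed speed" idea, assuming straight-line routes and a polar field coordinate system (angle off the home-to-center line, distance from home in feet). The function names and the example landing spot are made up for illustration:

```python
import math

def spot(angle_deg, dist_ft):
    """Convert polar field coordinates (angle off the line from home plate
    to center field, distance from home in feet) to x/y feet."""
    a = math.radians(angle_deg)
    return (dist_ft * math.sin(a), dist_ft * math.cos(a))

def needed_speed(fielder, landing, hang_time):
    """Straight-line speed (ft/s) the fielder needs to reach the landing
    spot before the ball does.  No reaction time or route inefficiency."""
    return math.dist(fielder, landing) / hang_time

rf = spot(27, 294)  # the standard RF spot: +27 degrees, 294 ft from home
print(round(needed_speed(rf, spot(20, 310), 4.5), 1))
```

With the needed speed in hand, you look up the fielder's historical catch rate at that speed level and apply it to the play.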

Welcome to Colorado.  Colorado is a park that tests everything we do.  And it's no exception here.  Whereas the average RF plays at +27/294, the average RF in Colorado plays at +28/308.  And those 14 feet in RF (15 in CF, 12 in LF) make all the difference in the world.  Without accounting for the park, and simply putting all the RF in Colorado games at +27/294, the estimated BABIP for the team went from .351 to .372.  It would look like positioning had a 21 point BABIP effect.  And technically it did.  But it wasn't because of the nuance of the game.  It was simply an understanding of the park.

The next-worst park was Chase Field in Arizona, as the DBacks had a similar 17 point BABIP effect in "smart" positioning.  Chase Field is where you'll find outfielders playing the farthest out, after Coors and Angel Stadium.  And rounding out the three best teams in BABIP effect were the Dodgers.  Dodgers?  Because of Dodger Stadium?  No, because they play in the same division as Colorado and Arizona.  So, they were getting the benefit of "smart" positioning because of all their games there.  And to top it off, the Padres ranked sixth.  Now, they could all very well be this smart.  But I can only figure that out after I tease out the park effects.  So, what started off as something that could have been really cool will end up being... well, still really cool, but it's going to need one extra layer of complexity to address.

More to come...

Wednesday, April 06, 2016

Exit speed and launch angle on distance

Alan sets a baseline model and then checks results for various ballparks due to elevation and atmospheric conditions.

(14) Comments • 2016/04/07 • Ball_Tracking Parks

Tuesday, January 19, 2016

Park Impact at Safeco and Petco

Interesting results.  Note that scoreboard changes, and any other non-field changes that can affect the wind, have to be part of the configuration changes.  I don't remember if KJOK, who has the most complete ballparks database available to the public, tracks that, but it's another thing to consider.

Wednesday, December 09, 2015

Elevation and HR

Some good research based on non-MLB parks:

The average number of home runs per game at sea level in this study is 1.10, so Adair’s calculations predict 1.49 home runs per game at 5500 feet. In fact, an extrapolation of the best fit line relating home runs to elevation using these data says that at 5500 feet we will see 1.76 home runs per game. That’s an increase of 65%, nearly double Adair's prediction. Pre-humidor Coors Field saw an average of 3.20 home runs per game as opposed to 1.93 home runs per game by Rockies batters at other stadiums, which is an increase of 66%, just about the same as this dataset, and so we may consider Adair’s prediction of 35% to be a lower bound.

(3) Comments • 2015/12/10 • Parks
