Sabermetrics 101: Data

I believe quite strongly that before you look at statistics you should be aware of the data sources and their limitations. Hence... this.

Prerequisites for understanding: None

Prerequisites for derivation: N/A

Generation 1: The Box Score

Every time you open up the newspaper and scan the box scores, you're taking in a huge amount of raw data. Banks and banks of numbers are arrayed for your consumption, giving you information about a batter's at-bats, hits, walks, runs, runs batted in - the works. Pitchers get innings pitched, earned runs, runs, strikeouts, walks, saves, etc. Fielders get... well, they don't get much. Errors, putouts, and assists are the story here. All in all, the 'old school' statistics are all found in the box score, and the first sabermetric knowledge 'rush' came from playing with box score numbers. The problem, of course, is that there is very little context in any generation one statistic - the box score alone offers no translation from the statistics it presents to actual runs or wins.

Generation 2: Play-by-play

Here's where it gets interesting. Instead of a snapshot of a baseball game, play-by-play data tells us the story. By recording every event that happens on the diamond, we can reconstruct the game state (the inning, score, outs, and runners on base) at every point in the game. Play-by-play data does not introduce any new statistics in and of itself. Instead, it provides proper context for the first generation metrics. This is extremely useful, because with the game states come linear weights, Base Runs, and WPA - in other words, translation tools to get from hits and strikeouts and all sorts of other statistics to runs and wins. The development of game state is probably the single most important event in the history of serious baseball analysis, enabling us to weight events on the field with far more accuracy than our crude guesses of a generation before. Play-by-play data is our Rosetta Stone.
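To make the idea of game state concrete, here is a minimal sketch of how a run expectancy table could be built from play-by-play records, and how a single event could then be given a run value. The data layout (a tuple of base state, outs, and runs scored on the play), the function names, and the toy innings are all assumptions made up for illustration; real play-by-play files are far richer, and published linear weights are built from many seasons of data.

```python
from collections import defaultdict

def run_expectancy(half_innings):
    """Average runs scored from each (bases, outs) state to the end of the half-inning.

    half_innings: list of half-innings, each a list of
    (bases, outs, runs_on_play) tuples describing the state BEFORE the play.
    This layout is a hypothetical simplification of real play-by-play data.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for plays in half_innings:
        # Runs still to come in this half-inning, counted from the current play.
        runs_remaining = sum(runs for _, _, runs in plays)
        for bases, outs, runs_on_play in plays:
            totals[(bases, outs)] += runs_remaining
            counts[(bases, outs)] += 1
            runs_remaining -= runs_on_play
    return {state: totals[state] / counts[state] for state in totals}

def run_value(re_table, before, after, runs_on_play):
    """Run value of one event: change in run expectancy plus runs scored.

    `after` is None when the play ends the half-inning (expectancy 0).
    """
    re_after = 0.0 if after is None else re_table.get(after, 0.0)
    return re_after - re_table[before] + runs_on_play

# Toy, made-up data: "1--" means a runner on first only, "---" means bases empty.
half_innings = [
    [("---", 0, 0), ("1--", 0, 0), ("1--", 1, 2), ("---", 1, 0), ("---", 2, 0)],
    [("---", 0, 0), ("---", 1, 0), ("---", 2, 0)],
]

re_table = run_expectancy(half_innings)
# Two-run homer with a man on first and one out, leaving bases empty, one out.
homer_value = run_value(re_table, ("1--", 1), ("---", 1), 2)
```

With only two innings of invented data the numbers themselves mean nothing; the point is the shape of the calculation. Averaging the run values of every single, walk, or home run across a large sample is, roughly speaking, where linear weights come from.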

Generation 3: Ball in play

While the second generation didn't introduce new statistics so much as it arranged them in a much more compelling fashion, the third brings us ball in play data. Generally, this is taken to mean knowing whether a batted ball was a grounder, line drive, fly ball, or pop fly, but there are also more granular measurements on velocity, angle off the bat, specific fielding zones, etc. Clearly, having this information gives us far more scope to evaluate the defence (remember, previously we were basically stuck with putouts, assists, and total chances to build from). With improved defensive metrics come improved pitching ones, too. While the actual change here isn't nearly as radical as the difference between the box score and play-by-play data, generation three statistics give us vital insights regarding pitching and defence. The problem, of course, with ball in play data is the source. The information comes from stringers who may have systematic bias in calling line drives fly balls or vice versa. Some are more reliable than others. There are at least three sources of ball in play information, all yielding different results, which can lead to some quite serious discrepancies when one algorithm is run against two different data sources. And to top all of this off, we don't even have a clear-cut definition of what our BIP categories actually mean! In other words, there's some significant measurement error involved in this sort of classification, and it can have a huge impact on our interpretation of the statistics.
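As a rough illustration of how stringer disagreement feeds through to the statistics, the sketch below compares two sets of batted-ball labels for the same eight balls in play. The labels, the two "sources", and the function name are entirely made up for this example; they do not represent any real data provider.

```python
from collections import Counter

def compare_sources(labels_a, labels_b):
    """Return (agreement rate, confusion counts) for two labelings of the same balls in play."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both sources must label the same non-empty set of balls in play")
    # Count every (source A label, source B label) pairing.
    confusion = Counter(zip(labels_a, labels_b))
    agree = sum(n for (a, b), n in confusion.items() if a == b)
    return agree / len(labels_a), confusion

# Hypothetical labels: GB = grounder, LD = line drive, FB = fly ball, PU = pop fly.
source_a = ["GB", "LD", "FB", "LD", "GB", "FB", "LD", "PU"]
source_b = ["GB", "FB", "FB", "LD", "GB", "FB", "FB", "PU"]

agreement, confusion = compare_sources(source_a, source_b)
ld_rate_a = source_a.count("LD") / len(source_a)
ld_rate_b = source_b.count("LD") / len(source_b)
```

In this invented sample the two sources agree on six of eight balls, yet the line drive rate is 37.5% according to one and 12.5% according to the other. Any metric that leans on line drive rate would tell two very different stories about the same pitcher.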

Generation 4: f/x

This is the most recent change in our information supply, and probably the most interesting. The f/x statistics basically monitor everything that happens on a baseball diamond. Everything. Pitch f/x sees high-speed cameras triangulate on the ball, picking up release point, velocity, spin information, movement, and whether or not the pitch was a strike (although there are still some manual measurements involved, and the camera system isn't accurate enough to take any of our values as gospel). This alone has huge implications for evaluating pitchers - with the caveat being that researchers have only just begun to explore them. Hit f/x trains those same cameras on the ball leaving the bat, allowing us for the first time to see how hard the ball was hit, and to where (admittedly, not including spin on a batted ball weakens hit f/x significantly). The as yet unimplemented 'Field f/x' tracks the ball and each and every player on the diamond, which has huge ramifications for our evaluation of defence. Together, the f/x generation is essentially the ultimate scouting tool in the form of pure, hard data. This is where we live, and these are exciting times.