Play Ball! Talking Data with an Analyst on MLB’s Opening Day

BIG DATA WORLD

In my life, it’s not really spring until I hear the first “Play ball!” in an MLB stadium. I love a lot of different things about baseball: the crack of the bat, the cadence of the game, the strategy the casual observer may not see. But the thing that keeps me coming back and filling scorebooks year after year is the statistics.

Baseball’s been known for stats for as long as there’s been baseball, so it makes sense to spend this Opening Day asking a proclaimed lover of data and baseball about the future of data in the sport he loves.

Mike Fast is an analyst with the Houston Astros. Prior to joining the Astros last offseason, Mike spent two years at Baseball Prospectus, where he quickly became known as a statistical expert. Before baseball, Fast also spent 17 years as a semiconductor engineer. You guessed it… Mike loves data. He was kind enough to indulge a friend in a little Q&A on data’s role and future in baseball.

This post is a bit longer than our usual, but I hope, like me, you find it to be a worthwhile and entertaining peek into baseball analysis and how it relates to the data issues that more traditional companies face on a daily basis.

Were you always fascinated by data, regardless of the industry or format?

What I love about data is the stories that it tells you if you spend enough time with it and pay close attention to what it is saying and what it is not saying. Hearing someone else's conclusions from the data is mildly interesting. Getting my own hands dirty with the data and learning something new is exciting.

The amount of data collected in baseball is growing all the time. Do you think baseball has hit the “Big Data” threshold yet in any form: volume, variety, or velocity?

Most of our data is probably not big data in the sense that is usually used by data scientists.  But it's getting to the point where we have to pay attention to the costs and time involved in data storage and transmission and where our ability to collect data outstrips our ability to analyze it.

It's fairly common for me to look at data sets with hundreds of thousands of records.  But what is honestly more of a challenge is marrying all of our diverse data sets together.  We have everything from scouting reports, which have a great deal of diverse detail in themselves, to medical reports to video to radar and camera tracking data to weather data to histories of transactions and contracts.

High-speed, high-definition video is probably the most imminent big data challenge for baseball in the more typical sense of that term.

From your perspective, what is most fascinating about the big data phenomenon?

What amazes me is how much there is to learn about the game of baseball.  The opportunities for helping baseball players and teams improve are endless.  But you have to be a good manager of data to make that happen, and that's only going to be more important as the data get bigger.

Sportvision has announced FIELDf/x to record where players are on the field with each hit. PITCHf/x was already available. How “big” do you think this new data is going to be, and how will it impact the role of the analyst, especially in scouting?

We’ll know exactly how quickly someone responds now!

One of the big challenges for Sportvision and MLBAM with FIELDf/x is compressing and transmitting huge amounts of video so that it can be analyzed in near real-time.  They record 15 frames per second of video from four cameras at every game. Then they have to process that video to figure out where the players are.  They are tracking anywhere from 16 to 19 people at a time (9 fielders, 4 umpires, two base coaches, a batter, and up to three runners, not counting ball boys or others who may run on the field temporarily), as well as the location of the baseball at all times. 

For a three-hour game, that's over three million data points. With 2400 games in a season, that adds up to billions of records, which is certainly a lot for analysts to parse. But the bigger infrastructure challenge is transmitting and processing the raw data in order to distill it down to those billions of data points.
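For readers who want to check the arithmetic behind those figures, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers Mike gives (15 frames per second, a three-hour game, 16 to 19 people plus the ball, 2400 games). The definition of a "data point" as one tracked object's position in one frame is my assumption, as are all the function and constant names; this is not how Sportvision actually counts or stores its records.

```python
# Back-of-the-envelope estimate of FIELDf/x tracking volume.
# Assumption: one "data point" = one tracked object's position in one frame,
# after the four camera feeds have been merged into a single position.

FRAMES_PER_SECOND = 15      # from the interview
GAME_HOURS = 3              # from the interview
GAMES_PER_SEASON = 2400     # from the interview
PEOPLE_MIN, PEOPLE_MAX = 16, 19   # people on the field, from the interview
BALL = 1                    # the baseball is tracked as well


def frames_per_game(fps=FRAMES_PER_SECOND, hours=GAME_HOURS):
    """Number of frames processed over one game."""
    return fps * hours * 3600


def points_per_game(tracked_objects):
    """Positions recorded in one game for a given number of tracked objects."""
    return frames_per_game() * tracked_objects


def points_per_season(tracked_objects, games=GAMES_PER_SEASON):
    """Positions recorded over a full season."""
    return points_per_game(tracked_objects) * games


low = PEOPLE_MIN + BALL    # 17 tracked objects
high = PEOPLE_MAX + BALL   # 20 tracked objects

print(f"Per game:   {points_per_game(low):,} to {points_per_game(high):,}")
print(f"Per season: {points_per_season(low):,} to {points_per_season(high):,}")
```

Under those assumptions the estimate lands at roughly three million positions per game and several billion per season, which lines up with the figures quoted above.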

For an analyst, more detailed data like PITCHf/x, HITf/x, and FIELDf/x raise the question of what you can learn about a player's skill level from a snapshot of his instantaneous performance, as compared to what you learn from the much larger sample of his longer-term record of performance over the course of several seasons.

What do you find to be the most fascinating statistic collected or calculated in baseball and why?

Hmm...tough question! Once I've learned something, I'm usually ready to move on to the next thing. Having said that, the PITCHf/x data is really a treasure trove of insight. I love that we now know in great detail what is in a pitcher's repertoire of pitches and how he uses it to approach batters. We've also gained a great deal of insight into the aerodynamics of a baseball.

FIELDf/x is just in its early stages, but I've learned some similarly fascinating things from it, for example, about the paths that base runners take around the bases or the routes that outfielders take to catch fly balls.  There was an academic paper published a few years ago that claimed that base runners took inefficient paths around the bases, but if you look at the FIELDf/x data, you can see that, in fact, the base runners on their own were already taking the efficient paths described by the researchers.

How much do you think the average baseball team relies on data today versus early baseball years?

There is a lot of variety in how front offices use information.  It is said that every team uses both scouting and statistics in some fashion, and I suppose that's true, but it's not a terribly descriptive platitude. 

It's how you use the information that matters, not where it came from. There are a lot of decisions to be made in baseball, from rosters to contracts to the amateur draft to player development to medical care and strength and conditioning, and on and on. In each of those areas, there is an opportunity to gather better information and apply it to better decision making. There is a lot of room for baseball to grow in empowering coaches and players and scouts with better information.

Because I’m an Astros fan and I can’t resist…knowing what you know at this point about the potential roster for 2013, how many games will the Astros win this year?

All of them in which we're leading after the last inning!  Hopefully it's a bigger number than a lot of people expect.

I hope so too, Mike!

Many thanks to Mike Fast (@FastBalls) of the Houston Astros for indulging this baseball fan with a little data talk on Opening Day. For more about the future of data, read, "The Next Oil."
