Major League Baseball PITCHf/x data has been freely available for years. However, even ‘tried and true’ methods of obtaining it are hard to implement if you’re not computer savvy, and it always seemed silly to me that installing a stack of technologies was required just to collect data. This (among other things) motivated pitchRx, an R package that simplifies PITCHf/x data collection and visualization. As a by-product, pitchRx also simplifies the collection of data from numerous XML files. This post demonstrates pitchRx’s functionality in the context of PITCHf/x data.

Once you have R installed, you must install pitchRx and load it into your R session:

install.packages("pitchRx")
library(pitchRx)

Now that pitchRx is installed, it’s time for data collection. Since there is a ton of PITCHf/x data out there, I recommend collecting no more than a year’s worth at a time. Even then, the process may require 4GB of RAM, so make sure your machine has at least that much.

data <- scrapeFX(start="2008-01-01", end="2009-01-01")

On my machine, it takes a little over an hour for this code to run. That time could be drastically reduced with a fast internet connection and a good processor (or parallel processing). While you’re waiting for your data, you should think about how you want to store it. There are two options that I will cover here:

  1. MySQL database

     This is the preferred method for storing the data. It is fairly simple to work with MySQL databases directly from R; it just takes time to set up your database and learn SQL syntax. I recommend that you first create empty tables in your database, with the appropriate type for each field (otherwise, for example, dates get treated as characters). You can access my template for the tables I have collected here, and a rough sketch of creating empty tables from R appears just after this list. Once you have created your empty MySQL tables “pitch” and “atbat”, you can run the following in your R console to append records to your SQL tables:

    library(RMySQL)
    drv <- dbDriver("MySQL")
    MLB <- dbConnect(drv, user = "your_user_name", password = "your_password",
                     port = your_port, dbname = "your_database_name", host = "your_host")
    dbWriteTable(MLB, value = data$pitch, name = "pitch", row.names = FALSE, append = TRUE)
    dbWriteTable(MLB, value = data$atbat, name = "atbat", row.names = FALSE, append = TRUE)

  2. csv files

     This method is for those who don’t have time to learn MySQL. Remember that we are dealing with a large amount of data, so reading and writing these csv files is going to require some patience. Anyway, here is the relevant code:

    write.csv(data$atbat, file = "08atbats.csv")
    write.csv(data$pitch, file = "08pitches.csv")
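
Back to the MySQL route for a moment: below is a minimal sketch of what creating an empty table with explicit field types might look like from R (the template mentioned above is the more complete reference). It assumes the MLB connection created earlier, and the column names and types shown are purely illustrative; a real table needs an appropriately typed column for every column in data$atbat.

# A rough sketch of creating an empty "atbat" table with explicit field types.
# The columns shown are illustrative only -- not the full set of fields.
dbSendQuery(MLB, "CREATE TABLE atbat (
  num INT,
  url VARCHAR(150),
  pitcher_name VARCHAR(100),
  batter_name VARCHAR(100),
  date DATE
)")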

Note that before you repeat this collection and storing process for 2009, 2010, and so on, you will want to clear your workspace each time, i.e.,

rm(data)
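
If you want to automate the whole thing, a loop along these lines repeats the collect, store, clear cycle for several seasons. This is only a sketch (the csv route is shown, but the dbWriteTable calls above would work just as well inside the loop):

# Sketch: scrape one season at a time, write it to disk, then free the memory
for (yr in 2008:2010) {
  data <- scrapeFX(start = paste0(yr, "-01-01"), end = paste0(yr + 1, "-01-01"))
  write.csv(data$atbat, file = paste0(yr, "atbats.csv"), row.names = FALSE)
  write.csv(data$pitch, file = paste0(yr, "pitches.csv"), row.names = FALSE)
  rm(data)
}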

You might be asking yourself: “Why does scrapeFX return two data frames?” Well, if you’re repeatedly querying subsets of the data and/or building a database with other tables (that can be linked back to these), then you want to keep them separate. You most certainly don’t want to keep this amount of information in memory, or it will slow down everything on your machine.

No matter what type of analysis you’re conducting, at some point you will probably want to merge the “atbat” table with the “pitch” table. If you have your MySQL tables set up, you can save time by joining the tables before loading them into R. Say you wanted all of the data you have on Mariano Rivera. Assuming that your MySQL database connection is in your R workspace:

Rivera <- dbGetQuery(MLB, "SELECT * FROM atbat INNER JOIN pitch ON
                           (atbat.num = pitch.num AND atbat.url = pitch.url)
                           WHERE atbat.pitcher_name = 'Mariano Rivera'")

If you’re going the csv files route, you’ll want to do something along these lines (it’s going to be much slower):

library(plyr)
atbats <- read.csv(file = "08atbats.csv")
Rivera_atbats <- subset(atbats, pitcher_name %in% "Mariano Rivera")
pitches <- read.csv(file = "08pitches.csv")
Rivera <- join(Rivera_atbats, pitches, by = c("num", "url"), type = "inner")

The really nice thing about using scrapeFX is that it automatically adds useful columns to the ‘pitch’ and ‘atbat’ tables. This is an attempt to limit the number of tables you need to conduct a meaningful analysis (at multiple levels). Here is a list of the columns derived from the source during the scraping process (a small usage example follows the list):

  1. pitcher_name: the relevant player name (instead of an ID number).

  2. batter_name: the relevant player name (instead of an ID number).

  3. num: added to the ‘pitch’ table; the (ordered) atbat number that corresponds to each pitch. It seems that tfs_zulu is meant to link these tables together, but I’ve found cases where the first pitch of an atbat happened before the actual atbat, which leads to discrepancies.

  4. count: the pitch count before the pitch was thrown.

  5. inning: the relevant inning of the game.

  6. top_inning: is it the top or the bottom of the inning?

  7. url: the file name the record was taken from. Convenient for joining tables together and investigating discrepancies.
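
As a quick illustration of why these columns are handy, here is a small example that cross-tabulates pitch selection by count with no extra lookup tables. It assumes the Rivera data frame joined above; pitch_type comes straight from the PITCHf/x source, while count is one of the derived columns:

# Sketch: how often does Rivera throw each pitch type in each count?
# (assumes the joined Rivera data frame from above)
with(Rivera, table(pitch_type, count))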
