I’ve made the switch from WordPress to Jekyll. See my new site here. I made this switch so I could have more control over managing files and a better framework for producing blog posts via knitr.


The latest release of pitchRx has me excited for two reasons. First, R 3.0.0 has the capability to host R Markdown vignettes generated via knitr, so an extension of the pitchRx demo page is now available as an official vignette. After installing pitchRx 0.4, either visit the direct link or enter help(package="pitchRx", help_type="html") then click on “User guides, package vignettes and other documentation”. In order for everything to render properly, use a modern browser with WebGL graphics enabled.

Second, strikeFX now has the option geom="subplot2d". This is a great way to look at the distribution of a discrete variable across subregions of the strike zone. The package vignette has an example of how to use this new functionality. Below you can see a more sophisticated example which combines the variables top_inning and des to compare called strikes and balls for both home and away pitching.

I finally released pitchRx on CRAN today and was pleasantly surprised by the initial reception. I failed to mention the ‘pitchRx demo page’ in the NEWS file on CRAN, so I wanted to mention it here. This page will be helpful if you want to harness the visualization capabilities of pitchRx from your R console (as opposed to the shiny GUI).

Let me guess. You love baseball. You’re probably a sabermetrics nerd. You might even keep your own baseball blog. You know about PITCHf/x, but don’t have the resources or computing chops to put it to use. If so, you’re in the right place. PITCHf/x visualization is now easier than ever (and free) thanks to pitchRx and shiny.

A web hosted version of this tool is available here. However, in order to create animations, you’ll need to run the local version. To run the local version, first install R, then enter the following into your R console:

shiny::runGitHub('shiny_apps', 'cpsievert', subdir='pitchRx')

This should eventually open your default web browser. Your browser page should look similar to the screenshot below. (I built this using Firefox. If you run into problems with your browser, please notify me and/or consider using Firefox.)

By default, the sample dataset in pitchRx is loaded. However, there is an option to upload any csv file from your local machine (if you don’t have data, but want some, see my post about obtaining PITCHf/x data).

This tool is exciting because you can generate complicated plots quickly and download them for later use (just click the “Download Current Plot” button). I don’t consider myself a baseball analyst, but I would love to see those of you who are use this work to complement your analyses (please cite pitchRx and/or this post).

Here is a short explanation of what is currently possible with this tool:

First, consider plotting options that apply to any plotting method listed below. These “universal” options include choosing x-axis and y-axis limits as well as column-wise and/or row-wise faceting.

  1. Animation of flight paths. Options include choice of different coloring variables.
  2. Visualization of strikezones. In this case, there are options that apply to any of the four plotting geometries. These options include adding contour lines to the current plot as well as adjusting vertical locations according to aggregate strikezones (i.e., average heights).
    1. Using a “point” geometry: Options include choice of different coloring variables.
    2. Using a “tile” geometry: Options include altering density definitions.
    3. Using a “hex” geometry: Options include altering density definitions as well as adjustment of hex heights and widths.
    4. Using a “bin” geometry: Options include altering density definitions as well as adjustment of bin heights and widths.

Altering density definitions is probably my favorite feature. It allows one to examine differences between two different strikezone densities. Since it’s difficult to do this visually, pitchRx provides the basis for plotting differenced densities. For example, one may want to subtract the density of “Balls” from “Called Strikes” to explore umpire influence:
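The idea behind a differenced density can also be sketched outside of pitchRx. The following is a minimal illustration, assuming simulated pitch locations and MASS::kde2d() for the density estimates; the data, grid limits and variable names are made up for the example, and this is not necessarily how pitchRx computes its densities internally.

```r
library(MASS)  # ships with R; provides kde2d()

set.seed(1)
# Simulated (hypothetical) pitch locations in feet:
# px is the horizontal location, pz is the vertical location
strikes <- data.frame(px = rnorm(500, 0, 0.5), pz = rnorm(500, 2.5, 0.5))
balls   <- data.frame(px = rnorm(500, 0, 1.2), pz = rnorm(500, 2.5, 1.2))

# Estimate both densities on the same grid so they can be subtracted cell by cell
lims <- c(-2, 2, 0.5, 4.5)
d_strikes <- kde2d(strikes$px, strikes$pz, n = 50, lims = lims)
d_balls   <- kde2d(balls$px, balls$pz, n = 50, lims = lims)

# Positive cells: called strikes are relatively more dense; negative cells: balls are
diff_density <- d_strikes$z - d_balls$z

image(d_strikes$x, d_strikes$y, diff_density,
      xlab = "Horizontal location (ft)", ylab = "Vertical location (ft)")
```

Using a shared grid for both estimates is the important design choice here: the subtraction is only meaningful when each cell of the two density matrices refers to the same location.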

Lastly, have fun and let me know what you think!

The ability to obtain Major League Baseball PITCHf/x data has been available for years. However, even ‘tried and true’ methods are hard to implement if you’re not computer savvy. Furthermore, it seemed silly to me that installing stack technologies was required for data collection. This (among other things) brought motivation for pitchRx, an R package that simplifies PITCHf/x data collection and visualization. As a by-product, pitchRx also simplifies collection of data from numerous XML files. This post demonstrates pitchRx‘s functionality in context to PITCHf/x data.

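Once you have R installed, you must install pitchRx (for example with install.packages("pitchRx"), assuming you are installing the CRAN release) and load it into your R session with library(pitchRx).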


Now that pitchRx is installed successfully, it’s time for data collection. Since there is a ton of PITCHf/x data out there, I recommend collecting no more than a year’s worth at a single time. Even then, it may require 4GB of RAM, so make sure your machine has at least 4GB.

data <- scrapeFX(start="2008-01-01", end="2009-01-01")

On my machine, it takes a little over an hour for this code to run. This could be drastically reduced with a fast internet connection and a good processor (or parallel processors). While you’re waiting for your data, you should think about how you want to store it. There are two options that I will cover here:

  1. MySQL database
     This is the preferred method for storing data. It is fairly simple to work with MySQL databases directly from R; it just takes time to set up your database and learn SQL syntax. I recommend that you first create empty tables in your database, with the appropriate format for each field (otherwise dates get treated as characters). You can access my template for tables I have collected here. Once you have created your empty MySQL tables “pitch” and “atbat”, you can run the following in your R console to append records to them:

    library(RMySQL)
    drv <- dbDriver("MySQL")
    # Replace each your_* placeholder with your own connection details (port must be a number)
    MLB <- dbConnect(drv, user="your_user_name", password="your_password", port=your_port, dbname="your_database_name", host="your_host")
    # append = TRUE adds rows to the existing tables instead of overwriting them
    dbWriteTable(MLB, value = data$pitch, name = "pitch", row.names = FALSE, append = TRUE)
    dbWriteTable(MLB, value = data$atbat, name = "atbat", row.names = FALSE, append = TRUE)

  2. csv files
     This method is for those who don’t have time to learn MySQL. Remember that we are dealing with a large amount of data, so reading and writing these csv files is going to require some patience. Anyway, here is some relevant code:

    # Write each table to its own file in the working directory
    write.csv(data$atbat, file="08atbats.csv")
    write.csv(data$pitch, file="08pitches.csv")

    Note that before you repeat this collection and storing process for 2009, 2010 and so on, you will want to clear your workspace each time, e.g. with rm(list = ls()).
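One way to organize the season-by-season routine is to build the date ranges up front and loop over them. This is only a sketch: the choice of years and the `ranges` and `prefix` helpers are my own illustration, and the scraping and writing calls are commented out because they are slow and need pitchRx loaded.

```r
# Build one (start, end) pair per season; the years here are only an example
years <- 2008:2010
ranges <- data.frame(
  year  = years,
  start = sprintf("%d-01-01", years),
  end   = sprintf("%d-01-01", years + 1L),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(ranges))) {
  # Two-digit prefix matching the "08atbats.csv" naming used above
  prefix <- substr(as.character(ranges$year[i]), 3, 4)
  message("Season ", ranges$year[i], ": ", ranges$start[i], " to ", ranges$end[i])
  # Uncomment to actually scrape and store one season at a time:
  # data <- scrapeFX(start = ranges$start[i], end = ranges$end[i])
  # write.csv(data$atbat, file = paste0(prefix, "atbats.csv"))
  # write.csv(data$pitch, file = paste0(prefix, "pitches.csv"))
  # rm(data); gc()  # free memory before the next season
}
```

Removing `data` at the end of each iteration keeps only one season in memory at a time, which is the point of collecting a single year per run.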


    You might be asking yourself: “Why does scrapeFX return two data frames?” Well, if you’re repeatedly querying subsets of data and/or going to be building a database with other tables (that can be linked back to these), then you want to keep them separate. You most certainly don’t want to keep this amount of information in your virtual memory, or else it will slow down everything on your machine.

    No matter what type of analysis you’re conducting, at some point you will probably want to merge the “atbat” table with the “pitch” table. If you have your MySQL tables set up, you can save time by joining your tables before loading them into R. Say you wanted all of the data you have on Mariano Rivera. Assuming that your MySQL database connection is in your R workspace:

    Rivera <- dbGetQuery(MLB, "SELECT * FROM atbat INNER JOIN pitch ON
    (atbat.num = pitch.num AND atbat.url = pitch.url)
    WHERE atbat.pitcher_name = 'Mariano Rivera'")

    If you’re going the csv files route, you would want to do something along these lines (it’s going to be much slower):

    library(plyr)  # provides join()
    atbats <- read.csv(file="08atbats.csv")
    Rivera_atbats <- subset(atbats, pitcher_name %in% "Mariano Rivera")
    pitches <- read.csv(file="08pitches.csv")
    Rivera <- join(Rivera_atbats, pitches, by = c("num", "url"), type="inner")
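To see why both num and url are needed as join keys, here is a small self-contained sketch using base R’s merge() on toy tables. The values are made up, and merge() stands in for the join shown above; the atbat number restarts in every game file, so num alone would match pitches from the wrong game.

```r
# Toy stand-ins for the 'atbat' and 'pitch' tables (values are made up)
atbats <- data.frame(
  num = c(1, 2, 1),
  url = c("game_a.xml", "game_a.xml", "game_b.xml"),
  pitcher_name = c("Mariano Rivera", "Mariano Rivera", "Someone Else"),
  stringsAsFactors = FALSE
)
pitches <- data.frame(
  num = c(1, 1, 2, 1),
  url = c("game_a.xml", "game_a.xml", "game_a.xml", "game_b.xml"),
  des = c("Ball", "Called Strike", "Foul", "Ball"),
  stringsAsFactors = FALSE
)

rivera_atbats <- subset(atbats, pitcher_name == "Mariano Rivera")
# merge() keeps only rows whose (num, url) pair appears in both tables
rivera <- merge(rivera_atbats, pitches, by = c("num", "url"))
print(rivera)
```

Joining on num alone would also pull in the pitch from game_b.xml, since that game has its own atbat number 1.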

    The really nice thing about using scrapeFX is that it automatically adds useful columns to the ‘pitch’ and ‘atbat’ tables. This is an attempt to limit the number of tables you need to conduct a meaningful analysis (at multiple levels). Here is a list of things derived from the source during the scraping process:

    1. pitcher_name: The pitcher’s name (instead of an ID number).

    2. batter_name: The batter’s name (instead of an ID number).

    3. num: This is added to the ‘pitch’ table and is the (ordered) atbat number that corresponds to each pitch. It seems that tfs_zulu is meant to link these tables together, but I’ve found cases where the first pitch of an atbat happened before the actual atbat. This leads to discrepancies.

    4. count: The pitch count before the pitch was thrown.

    5. inning: The relevant inning of the game.

    6. top_inning: Is it the top or bottom of the inning?

    7. url: The file name where the record was taken from. Convenient for joining tables together and investigating discrepancies.

Nerd alert! I just stumbled across Ramnath Vaidyanathan’s awesome R package (slidify) that makes publishing robust and interactive presentation slides easier than ever. Thus, I should update my previous post accordingly – with an outline on how to produce those same slides using slidify.

If you haven’t heard of Slidify, I recommend the demo video to get acquainted. For lazy people like me, Slidify can offer a workaround if you don’t have time to learn and integrate complicated stuff like pandoc, CSS and javascript.

Not much of the original real_time.Rmd file was changed to produce this real_time_slidy.Rmd file (seen below). All that was changed was the heading code, which does the conversion and formatting for you (as opposed to using pandoc, as in the previous post). If you’re just interested in viewing the slides that this creates, you can preview them here: real_time_slidy.html

I recently presented my research to the Statistical Graphics Group here at Iowa State. It was a shameless self-promotion of my R package, pitchRx, which makes Major League Baseball’s PITCHf/x data easier to obtain and analyze. People went crazy over the real-time animations in my presentation slides, so I thought I would give a straightforward tutorial on how to make your own!

First of all, make sure you have R (version 2.15.1 or newer), RStudio and pandoc installed on your machine. I’ve provided a screenshot of the real_time.Rmd file below, which produces the other files that are necessary for the animations. The code in this (R Markdown) file depends on several R packages, so make sure you have the required packages installed before knitting this file:

(1) devtools

(2) knitr

(3) R2SWF
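A quick way to check which of these packages are still missing is sketched below. The `needed`, `installed` and `missing_pkgs` names are my own, and the commented-out install line assumes all three packages are available from your configured repository.

```r
# Packages the R Markdown file depends on (from the list above)
needed <- c("devtools", "knitr", "R2SWF")

# requireNamespace() returns FALSE instead of erroring when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install the missing ones
} else {
  message("All required packages are installed.")
}
```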

Once the proper packages are installed, make sure you have an internet connection and click on the “Knit HTML” feature in RStudio.

Once the knitting process has finished, this creates both a markdown and HTML file in your working directory. After these files are created, open your favorite command-line interface and:

(1) Change your current directory to the working directory you are using in RStudio

(2) Enter the following command: pandoc -s -S -i -t dzslides real_time.md -o real_time.html

Now open real_time.html in a web browser… and voilà! Note that these slides can be viewed with any browser that supports HTML5. If your browser doesn’t support HTML5, go out and get yourself a real browser.

You can easily customize this code to animate any set of pitches that you might be interested in. Just edit the dates and player(s) in the scrapeFX() function. You also probably want to consider subsetting the FX data frame according to which pitch types you want. It’s that easy! If you think this is really cool and want to know more, just be patient and be on the lookout for a formal paper 🙂