PxP: The Making Of

PxP stands for both play-by-play and pitch-by-pitch, although if judged by how procrastinatingly* slow I go about it, the best description is probably peu à peu.

* I think I just made up a word.

Nowadays, baseball data is easily available, freely and abundantly. You want FIP, wOBA or WAR? You can get it, often in more than one flavor. Useful additives included, like adjustments for leagues and parks, baselines and weighted factors. While the precision and the functionality of said stats continue to rise, their complexity often only allows for acceptance, not for in-depth understanding of how they function.

We used to roam the baseball-information-highway in a Yugo 45. It couldn’t take us many places, but once you lifted the engine hood, there was no mystery how it worked*. Today, while we cruise along in our comfy 2009 Infiniti FX50, we know the exact outside temperature, gasoline consumption and time of arrival to all the beautiful destinations Yugo could never take us. Yet, we don’t even know where the spark plugs are, should we need to change one.

* When it worked. We actually owned one and my brightest memory of it is running 6 miles as an 11-year old kid to get my Dad after the car died on my Mom and me in the middle of nowhere. And when I say “middle of nowhere”, I mean it. It was a dusty road leading from our cottage on the coast of a small Adriatic island towards the village about 20 miles away with no signs of life around it except for crickets. It was noon. It was summer. And Hvar is the kind of an island where hotels attract tourists from North by offering them a free night stay for every day where a cloud crosses the horizon or the temperature drops beneath 100. Or something like that.

I also remember that few things perplexed me more than the news that, in mid-eighties, Yugos would be exported to America. It was a first chink in the armor of my Utopian dreams of a country where everyone is rich, beautiful, free and brave. I still believed the brave part, though.

We barely cook from scratch anymore. You know, go to the local market, buy flour, eggs, vegetables and fish and do our own stuff. Perhaps even experiment with the recipe. We take both our food and our baseball information preprocessed and while there is probably nothing wrong with it, it’s not the only way. The basic underlying data is also available, we have free access to information about every pitch thrown in the Majors, about every plate appearance, every painful strikeout and every glorious home run.

That’s what PxP: The Making Of is about. In a series of PxP analysis I’ll touch different topics that I find interesting, but in this article I want to show you that it is not always that complicated to cook from scratch. It’s not a complete cookbook, but rather a (hopefully) useful set of tips on how to organize your kitchen.


First, the somewhat standard preamble. I realize that AN consists of people whose technical skills and interests range from very basic to the levels that by far exceed mine. As anyone who writes, I wish for a broad audience and therefor hope that the former will pardon the complexity and the latter the simplicity of the explanations. If not, just write your favorite comment about super-nerds.

I included a very simple example of the whole process, try it out if you can.


Ingredients – how to get your hands on the raw data

I use two data sources for PxP analysis. One is retrosheet; I use that for play by play and basic pitch by pitch information. The other is the MLB site, where Pitch F/X data can be obtained from. Both are free. The data sets are not compatible, they use different identifiers for players and games and I have not cross-referenced them so far. However, I believe that it can be done with a reasonable effort and actually plan to do so for one of the analysis.


Downloading retrosheet data is extremely straightforward. Go to their site and click on either a year or a decade you want to download. The files are in a .zip format – unzipping them shouldn’t be a problem. One year worth of play-by-play data is about 2 MB, that’s less than one MP3 song.

The data is organized in files containing all home games per team and files containing each team’s roster for that given year. Both files are in a simple text format – you can open them with a notepad or a text editor of your choice. So, for example all of the A’s home games played in 2008 are contained in a file called 2008OAK.EVA, while the roster can be found in the file with the name OAK2008.ROS.


Pitch F/X Data can be found here and is structured in a more complex way, with deep sub-directory levels. You will first have to click on a link leading you to a desired year, then the next one, leading you to a desired month, than a day, than a game and finally a file containing the Pitch F/X data.

This is obviously something that is acceptable if you just want to have a look at one single game, but not something you can manually do if you want to analyze a series of games or whole seasons. Luckily, Mike Fast wrote a great article on how to not only download, but also store the Pitch F/X data from MLB site. That’s what I started with when I created my database for the Carlos Gonzalez analysis. Go to his site and follow the instructions and nothing much can go wrong. I did change some minor things in the scripts he offers, partially to make them compatible with MS SQL and partially to optimize performance when working with large sets of data, but the instructions are great as they are.

So, I will focus on retrosheet data for the remainder of this article.


We basically need 5 major tools, just like a baseball player. We need to download the data, we need to process it, to store it, to analyze it and, optionally, to display it nicely. Alternatively, just send the results of your analysis to danmerqury*, he sets the bar when nice graphs are a topic.

* My spell-checker suggests danmerqury be changed to “dangerous”. Beware,all.

Some of the tools I use are free to obtain, others are not. I feel more comfortable and work quicker using programs I need for work, too, so I opted for them. However, there are free tools available for every step along the way, so you can get the same results without having to pay any. As is, I use PERL, C#, MS SQL and Excel. What I do in C# can be also be done in PERL and MySQL and Open Office are quality alternatives to Microsoft tools.

Downloading is only complex for Pitch F/X data – check Mike’s site for how to do that. Retrosheet data can be on your computer in a matter of one or two clicks, as described above.

Processing the retrosheet data

Part of the processing can be done with a tool that retrosheet offers, but I opted to write my own. Partly because I want complete control and partly because I didn’t realize there was one available in the first place. I never said I was smart. I actually never tried the tool they have.

I would suggest you write your own, if you can stomach little programming at all. The first one I wrote a while ago to understand how streaks work is actually fairly simple. Retrosheet files can be a little intimidating at first, but they are pretty well structured. Have a look here to understand the structure in detail.

The core of the files are the lines beginning with “play”. Those are the ones that describe what is actually happening on the field. Let’s take a look at an example from a game that Mets played in 2005.


Every line describing a play consists of seven parts. The format is “play”, [inning], [top/bottom], [batter], [count], [pitch sequence], [result]. So in our example, the following happened:

Inning = 4
Top/Bottom = Top (0 indicates top, 1 bottom)
Batter = Jose Guillen (guilj001 can be linked to Jose Guillen in roster files)
Count = 1 and 1
Pitch sequence = called strike, ball, put in play
Result = Ground out, short stop to first baseman.

Or, later in the same game


Marlon Anderson came to bat in the bottom of the fifth, and after taking the first pitch for a ball, hit the line drive to left field that resulted in a single. Or


Esteban Loaiza was put out by a catcher on a pop-up to catcher in a foul territory.

So, let’s do a very simple practical example. Say, I want to know what is the best count to hit a home run. Or at least what is the count most HR were hit at. And say I want to know that for 2008.

    • The next, prepare a table in a database. I do it in MS SQL, the syntax is almost the same in MySQL. In this case it is an extremely simple table:

if exists (select * from sysobjects where name = ‘tbl_homerunsByCount’)
drop table tbl_homerunsByCount

create table tbl_homerunsByCount
HR_count char(2)

The table consists of only one field, the pitch count at which the HR was hit. It is two characters long, first for the balls, second one for strikes.

  • So, now comes the file processing stuff. I do mine in C#, but you can do it in any programming or scripting language that supports some connecting with the database. This is actually the whole thing you need. First a little part where you set the scope, get the files and connect to the database.

* I find the code much more easily readable when in color, and formatting colors here a hassle. Therefor the code chunks are bitmaps. If you want to copy it, here it is in a text form.

I stored the files in the C:\retrosheet directory and added my database server name, the name of the database itself and the credentials that I use to connect to it. That is the part where you need to put your personal settings.

After unzipping the 2008 file, I have 30 individual files and for this exercise I want to use them all. That’s why I leave the “team” variable empty, otherwise I could limit it to a certain team only. The “year” variable is helpful, when you already have more than one year downloaded.

This part gets a hold of all the files and then the real processing starts.



If you don’t program much, this might seem complicated. If you do, it seems more than trivial. There are two loops. The outer one fetches event files until there aren’t any left, the inner one fetches a line of text within the file, until there aren’t any left.

Then every such line is split along the commas and checked for content. If it is a play event with a home run as an outcome the pitch count is stored in the table we prepared in the previous step.

And that’s basically it. Thanks to enormous processing power we all enjoy in these modern times, a whole year of batting is processed in a minute. Now we are ready to analyze the data.

  • vignette17 wrote a great article on how to use SQL in what was at the time the nerdiest FanPost on AN. For our example we only need one simple statement:

select “count” = HR_count, “Number of HR” = count(*) from tbl_homerunsByCount
group by HR_count
order by “Number of HR” desc

What it does, is it tells the database it should find all the different entries (in our case all the different pitch counts), count the occurrences and sort it starting with the most frequent one. And the answer is:

count Number of HR
—– ————
00    860
10    593
11    561
01    484
32    466
22    427
21    404
12    361
20    282
31    274
02    150
30    16

Now, this is not an earth-shattering revelation, but it’s nevertheless interesting. And it’s your own. I, for one, am actually surprised that so few HR are hit on a 3:1, the count I personally loved the best, and the one where I used to swing for the warning track most often.

If you have any possibility to do so, try to implement this example in your own environment. Get it to run, compare the results, and then expand it little by little. There is a pleasure in concluding your own research, just like there is one in fixing your car or baking your own cake. Perhaps, you’ll get hooked – I would love to see any original work that I might have encouraged.

  • The final step is presenting the data. I normally copy the data in Excel and format it there. Occasionally, I will snuggle up to my wife and ask her to illustrate a certain point, but this is not something I generally encourage, at least not with my wife.

PxP – what can it do?

It’s a work in progress, really. Hitting part is rather easy, accounting for all the baserunner movements not so much. Once you delve deeper into the files you will find all kinds of things happening, like:


In this example, Jhonny(sic!) Peralta hit a fielders choice grounder to the third baseman, who realized that the runner on third was going on contact, threw to the catcher who returned the throw to the shortstop and tagged the runner out. The runner from third base advanced to third on the groundout and Peralta ended up on second, advancing on the throw.

Although baseball is extremely structured, it is not always easy to map the happenings and mishappenings on the field to abbreviations and then back to logical movements for everyone involved. Right now I can process about 98.6% of the games from last three years without encountering events the program can not interpret on its own. As I am interested in the big bodies of data more than single particular truths, I decided to live with that for right now, and treat the games I couldn’t interpret as rainouts. They are perfectly random and thus omitting them does not influence my analysis at all.

Playing with the data that is collected this way is fun.Hopefully I can tell you some stuff that you will find interesting, too. I’ll start soon.

Leave a Reply

Your email address will not be published. Required fields are marked *