PXP: Out of Control

We are probably looking at some 10,000 words here. Bouncing incoherently in my head are phrases like “true and observed level of talent”, “BABIP is not another word for luck” and “Maria Grazia Cucinotta”, in a thought process pretty much in line with the title of this article. I don’t really know yet where this is going to take us, but it will include graphs, a few Pozterikses and the biggest table you have ever seen on AN. And there already was a drawing, so I guess there is something for everyone.

Before we even try to evaluate athletes and their performances, we have to be very aware of the countless limitations and obstacles in our way, especially in baseball, where so many variables come into play. If we want to measure the weight of a rock, for example, we simply put it on a scale and read the result. The better the scale, the more precise* the result, but the principle is the same – we get a definite result. And as a society we have grown so accustomed to having exactly one correct answer to every question that we sometimes struggle when we encounter an impenetrable gray zone, such as the thick veil wrapped around one of our favorite topics – the true level of talent of baseball players.

* I really am in constant awe of how precisely we all manage to measure things nowadays. I work on the development of optical measurement instruments, and for many of our measurements the precision standard is one picometer. It means we measure stuff to the precision of one single picometer. Do you have any idea how bloody small that is? Think of the Earth (and while you are at it, check out the incredible BBC documentary “Planet Earth” if you haven’t done so yet), of everything there is on it, all the buildings, all the cars, all the cities, regions, countries, continents, the vastness of the oceans. Now imagine the whole planet shrinks to the size of a marble. You can barely recognize the shapes of the continents and you can perhaps guess the general location of some cities. And now imagine how small a single baseball would be somewhere inside that marble. That’s a picometer.

See, we cannot take baseball talent and put it on a scale. What we can do is interpret the results, try to strip them of every outside influence and hope that what we are left with is really what we were looking for in the first place. It is a little bit like calculating the weight of our rock by counting the glass shards after it was thrown through a window.

But that’s not the only problem. Even if we had the perfect tools for our reverse-engineering experiments, we would still fall short of a definite answer. Simply said – because there is none. Unlike the weight of the rock, that elusive true level of baseball talent changes constantly.

Athletes do not perform at a constant level. In baseball we are so occupied with trying to separate the essential from the noise that we sometimes forget that. Take track, for example. If you follow a certain sprinter or a high jumper* for a season, you will notice that their results vary from meet to meet. Or if you jog, ride a bike or make origami, you have surely noticed that you are fitter on some days than on others. Baseball players and their ability to perform are no different.

* Speaking of high jumpers, I once saw Dragutin Topic jump over 2.00 m (6’7″). Not impressed? He was 17, just a year my senior. Still not impressed? It was his first jump in the national championship, while almost half of the competition had already been eliminated at the lower heights. More? He never took his training pants off for the jump. Even more? He cleared the bar jumping freaking SCISSORS! It was the most cocky, awesome, obnoxious, incredible, this-is-my-field-and-you-all-better-go-home-now performance I have ever witnessed. The next year, he would jump higher than any junior in the world ever had. A few months ago he jumped higher than any 40-year-old ever had. I really should start practicing again.

So, in short, we are trying to measure something that changes constantly, and we can do so only by using tools that measure its somewhat distant consequences. That’s why we welcome any component that we can take out of the equation to make our job simpler. That’s why we so welcomed and embraced BABIP, and that’s why so many of us abuse it.

Some time ago it was observed that pitchers could not repeat their BABIP rates year over year to any significant degree, while they could do so much more reliably with their K, BB and HR rates. The simplified explanation was that while pitchers control* the latter three, BABIP is pure luck. Of course, that’s not how the people at the forefront of sabermetric research phrased it. They went to great lengths about the observed level of talent versus the true one, made zillions of correlation tables and were very careful about how they formulated their findings.

* Now, I guess nobody really believes that pitchers “control” BB, K and HR. As the NRA would say, pitchers don’t hit home runs, batters do. But pitchers do help, and what is meant by “control” is that their influence on such outcomes is strong enough to be noticed year after year. Their influence on BABIP, on the other hand, is not.
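For readers who want the stat itself spelled out, here is a minimal sketch of the standard BABIP formula: hits that stayed in the park, divided by balls put in play. The function name and the sample season line are invented for illustration and are not taken from the data behind this article.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play, BABIP is undefined")
    return (hits - home_runs) / balls_in_play


# Hypothetical season line, made up purely for illustration.
example = babip(hits=160, home_runs=20, at_bats=550, strikeouts=100, sac_flies=5)
print(round(example, 3))
```

Strikeouts and home runs are removed because the defense never gets a chance on them; walks never enter the picture because they are not at bats.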

But just like “through” turned into “thru”, the public at large wanted none of the complicated explanations. BABIP is luck, plain and simple, and high ones will come down and low ones will rise. For hitters and pitchers alike. Fantasy baseball sites started offering sell high / buy low advice based on BABIP, ESPN took to talking about it, and even today the majority of AN commenters use BABIP as a synonym for luck, even when they talk about hitters.

Only, it’s not.

It’s easy to see it for the hitters. Just go to any stat site and compare the BABIP of NL pitchers at the plate to that of the top 20 hitters, for example. The logical reasoning behind it is that better hitters will hit balls harder and hit more line drives, and as such make it much harder for the defense to throw them out. Makes sense, right?

But how about pitchers? Don’t they have a say in that? Is it really so that bad pitches put in play have the same chance of being an out as the good ones? While we accept that bad pitches lead to home runs, we assume that bad pitches don’t increase the chance of a base hit, because there is no evidence of year-to-year correlation in pitchers’ BABIP.

Ever watched a batting practice? Line drive after line drive, homer after homer. If such a BP fastball were thrown in a game and swung at, would it have a higher chance to be a base hit than a slider outside of the zone? Logic tells us “yes”, common BABIP sense tells us “no” and the correct answer is “You bet your sauerkraut it would”.


So, when the hitters connected on the meatball and put it in play, they had a batting average of .368, significantly higher than the 2010 baseline of .299. However, when they reached for the slider that was either too low or too far outside, they did significantly worse.

So, by taking a mix of all the pitchers, hitters, defenses, parks, counts and whatnot, we are hoping to isolate the effect the pitch quality itself has on the BABIP. What we get is the observed level of talent, and we are assuming that it somewhat accurately represents the true level of talent. Or in other words, it seems that, all other things being equal, an extremely bad pitch will be hit for a higher average than an extremely good one.

But it only seems so.

It could be that the meatball is only thrown on 3-1, where the batter is sitting on it. And the slider in the dirt is only swung at on 0-2, where the batter is choking up on the bat anyway and just slapping at it.

To test it somewhat further, let’s use MLB’s own Nasty Factor. And here is what its creators had to say about it.

The Nasty Factor evaluates several properties of each pitch, and rates the “nastiness” of the pitch on a scale from 0-100, based in part on the success or failure of opposing hitters against previous similar pitches. The Nasty Factor incorporates several different factors for each pitch, including:

  • Velocity — The greater the pitch’s velocity — as compared to that pitcher’s and the league’s range of speed for that pitch type — the greater the nastiness;
  • Sequence — The more the pitcher mixes up his pitches, the greater the nastiness… and certain pitch sequences are nastier than others, too;
  • Location — The closer to the edges of the strike zone the pitch is, the greater the nastiness, while pitches closer to the middle of the plate, and farther away outside the strike zone, decrease in nastiness;
  • Movement — The more movement the pitcher applies to the pitch — as compared to that pitcher’s and the league’s range of movement for that pitch type — the greater the nastiness.

The Nasty Factor also adjusts for how often the pitcher has faced the current batter during the game, as well as how often he has used the same pitch type against the same batter in the current at-bat and previously in the game.

You get the idea. High numbers represent tough pitches. How exactly MLB goes about weighing different components is irrelevant, because we will first test the nastiness on a known quantity, like swing and miss percentage. The nastier the pitch is, the higher the whiff rate should be. There is only one problem – the location:

The closer to the edges of the strike zone the pitch is, the greater the nastiness, while pitches closer to the middle of the plate, and farther away outside the strike zone, decrease in nastiness;

Now this is somewhat understandable, as neither the meatball down the middle nor a regular fastball two feet outside is really a nasty pitch. The first will normally be hammered, the second will normally be looked at. But for the purpose of this exercise, the swing and miss rates are exactly the opposite if one swings at both pitches. To eliminate this effect, I looked at all the pitches thrown last year where the Nasty Factor was over 50. I then grouped them into NF bands of 5 points and looked at how the bands correlate with swing and miss rates. Then I did the same for the BABIP.

If you look at the first column, you will notice that the numbers pretty much go up as the NF rises. It seems that as the pitches get tougher the swing and miss rates rise, and a simple correlation test confirms that. This is something we would logically expect if the NF really describes the toughness of the pitch, and the presented numbers seem to support the idea that MLB has done a pretty good job in that.

So, now that we believe that NF is a decent indicator of the quality of the pitch, we can turn to the column on the right. Again it seems that BABIP goes down as the nastiness increases and the -.82 correlation confirms that. It is lower than the correlation on the left hand side, but it is still a decently high number, so that we can say – it seems that all other things being equal, batting average on the balls in play will decrease with the toughness of the pitch.
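The banding and the correlation test described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code behind the table: the pitch records, the three outcome labels, the helper names `rates_by_band` and `pearson`, and every number in `toy` are made up, NF is assumed to be an integer, and fouls and home runs are ignored for simplicity.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rates_by_band(pitches, floor=50, width=5):
    """Return {band_start: (whiff_rate, babip)} for swings at pitches with NF > floor.

    Each pitch is a hypothetical (nf, outcome) tuple, where outcome is
    "whiff", "hit" (ball in play, base hit) or "out" (ball in play, out).
    Band 50 holds NF 51-55, band 55 holds NF 56-60, and so on.
    """
    counts = defaultdict(lambda: {"whiff": 0, "hit": 0, "out": 0})
    for nf, outcome in pitches:
        if nf <= floor:
            continue  # only pitches with NF over the floor are considered
        band = floor + ((nf - floor - 1) // width) * width
        counts[band][outcome] += 1
    rates = {}
    for band in sorted(counts):
        c = counts[band]
        swings = c["whiff"] + c["hit"] + c["out"]
        in_play = c["hit"] + c["out"]
        if in_play == 0:
            continue  # BABIP is undefined without balls in play
        rates[band] = (c["whiff"] / swings, c["hit"] / in_play)
    return rates


# Invented toy data: (NF, whiffs, hits, outs). Not the real 2010 numbers.
toy = [(52, 20, 24, 56), (58, 24, 22, 54), (63, 28, 20, 52), (67, 33, 18, 49)]
pitches = []
for nf, whiffs, hits, outs in toy:
    pitches += [(nf, "whiff")] * whiffs + [(nf, "hit")] * hits + [(nf, "out")] * outs

rates = rates_by_band(pitches)
bands = list(rates)
whiff_corr = pearson(bands, [rates[b][0] for b in bands])
babip_corr = pearson(bands, [rates[b][1] for b in bands])
print(rates, whiff_corr, babip_corr)
```

With only a handful of bands, a correlation coefficient like this describes the trend across the bands, not the pitch-by-pitch relationship, which is far noisier.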

An important limitation of our findings is that we do not control for the other factors. We couldn’t compare BABIP with pitch quality while holding everything else constant (same defense, same batters, same parks, same temperature, wind…). But by mixing in a large number of pitches from a wide variety of situations, we are hoping that the other factors will offset each other. It is naturally possible that, for example, pitchers only throw tough pitches to lousy batters, thus creating a false connection. It seems, however, logically unlikely.

This data indicates that tough pitches will indeed be hit for lower average. And pitch toughness being one thing controlled exclusively by the pitcher, we would expect the pitcher to control his BABIP, right? So, how come they don’t?

First, influencing something is not the same as controlling it. Even if better pitchers threw only nasty pitches all the time, it could be that the part of the BABIP that is influenced by the pitch itself is proportionally irrelevant. Think of packing your bag for a vacation. Whether you pack the second pair of shoes is completely up to you and has a direct influence on the overall weight of the plane you will board. But can you say that your packing decisions control the weight of the plane? Of course not – it depends much more on the type of the plane itself, the number of passengers, the length of the flight and the fuel in the tanks, the way other passengers packed their bags and so on. The weight of the plane will vary from flight to flight, although you might possess the skill to pack your bag light every time.

Our plane is our defense behind the pitcher, our fuel is the park, our temperature and wind are the other passengers… you get the idea.

So, while batters spread their at bats over different circumstances, pitchers tend to get theirs in bigger batches under the same continuous influence, because they face hitters in front of the same defense night after night (and in batches of some 25 per night instead of 4-5). A pitcher’s bottom line will be more influenced by the defensive prowess of his teammates, and as soon as that changes (be it through a trade of the pitcher himself, a new shortstop or even a yearly fluctuation in the defensive ability of one or two players on his team), his own performance will be strongly affected.

Say one defense performs 10% worse in 2011 than it did in 2010. It will have a direct 10% influence on a pitcher on that team. It will, however, have only some 1% influence on all the batters in the league, because they will, unlike the pitcher in question, have only some 10% of their at bats against that defense. And chances are that some other defense will be playing 10% better, so for batters it might not have any influence on their BABIP at all.
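The arithmetic in that example can be written out as a tiny sketch. It assumes, purely for illustration, that the effect scales linearly with the share of balls in play a player sends at the defense in question; `babip_shift` is a made-up helper, and the 10% figures are the rough ones used above.

```python
def babip_shift(defense_change, share_of_balls_in_play):
    """Relative BABIP shift a player sees when one defense changes.

    Assumes a simple linear model: the effect is the size of the change
    times the share of the player's balls in play hit at that defense.
    """
    return defense_change * share_of_balls_in_play


# The pitcher works in front of the declining defense all the time.
pitcher_shift = babip_shift(0.10, 1.0)

# A batter faces that defense in only some 10% of his at bats.
batter_shift = babip_shift(0.10, 0.10)

# If another defense he faces equally often improves by 10%, the two cancel.
net_batter_shift = babip_shift(0.10, 0.10) + babip_shift(-0.10, 0.10)

print(pitcher_shift, batter_shift, net_batter_shift)
```

Under these assumptions the pitcher absorbs the whole 10%, the batter about a tenth of that, and the batter’s net effect drops to zero once a second defense moves the other way.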

The predictive value of BABIP for pitchers is as good as non-existent. But just because the end result is not completely in their hands does not mean that pitchers don’t play a role in it. That’s because we have a big gap between the observed level of talent (the actual BABIP, heavily influenced by other factors) and the true level of talent (the part of BABIP that the pitcher controls). The above data does indicate that if pitchers had a different defense behind them every 5 at bats, we might see the same influence on BABIP as we do with the batters.

OK, only 2.5k words by now and I am pretty much done. We could spend another 7.5k discussing Maria Grazia, but I’ll leave that for a DLD.

I will however try to sum up my thoughts, to somewhat reduce the traditional “what the ^$% was he trying to say” effect:

  • We can’t measure baseball talent in itself, only the results of it
  • Interpreting the results is hard, because there are many different influences on them
  • What we can measure is the “observed level of talent”; what athletes are really capable of is their “true level of talent”
  • “True level of talent” changes constantly
  • Batters influence their BABIP, and we can notice that in the “observed level of talent”
  • Pitchers, too, influence their BABIP (much less than batters), but it’s near impossible to see it in the “observed level of talent”
  • Presented data seem to support the above notion
  • Batters have their at bats rather evenly spread over defenses, parks and time
  • Pitchers have theirs in bunches, in front of the same defense
  • Every change in defensive ability of his team will heavily influence pitchers, while having small effect on batters
  • If the pitchers, too, had a different defense every couple of at bats, they would probably have a higher influence on their BABIP than they do now

As a final act, please help yourself to the biggest mother&*%)ing table ever published in this part of the woods.

Click on it to enlarge. To get a little bit of a feeling for what results pitchers have with different pitches, roam around. Remember that no pitch can be seen in a vacuum, and that what works for one pitcher need not necessarily work for another. Still, just have fun and then come back to tell me just how much you appreciate all the hard work I put in during my countless hours in front of the VCR, counting pitch after pitch.
