Empirically testing park factors … part I

Okay. I’m sure you can guess what happened to me. Yup, fell into the park factor rabbit hole. . . .

But I have to say, it’s kind of cool down here!

Take this as a kind of postcard from my journeys—I’ll keep reporting and have a lot more to say when I get back (if I ever do).

But one thing I decided to do was replicate Bill Petti’s cool examination of the predictive power of park factors for players who switch teams. He reported finding that park factors overestimate the positive impact of a favorable park change, particularly for power hitters.

One limitation of the study, though, is that it compares the accuracy of park-factor predictions to nothing. All Petti does is calculate how much or how little a player’s performance changed after moving from one park to another.

That’s not enough information. We want to know how much closer to or further from a player’s observed performance our prediction would land if we didn’t consider park factors. Even if park factors overshoot the mark, they might still generate predictions closer to a player’s observed performance than whatever alternative prediction scheme we’d be using if we ignored park factors!

So I decided to test a park-factor informed model against a park-factor ignorant one.

I looked at home park wOBAs for players who played for different teams in successive seasons and who batted at least 150 times each year. 
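If you want to follow along, the player selection is just a self-merge of a season-level batting table on consecutive seasons. Here’s a rough pandas sketch; the file and column names (player_id, season, team, pa, woba) are my stand-ins, not the actual source data.

```python
import pandas as pd

# Hypothetical season-level batting table; file and column names are stand-ins
bat = pd.read_csv("season_batting.csv")  # player_id, season, team, pa, woba

# Pair each player-season with that player's following season
pairs = bat.merge(
    bat.assign(season=bat["season"] - 1),  # shift year-2 rows back one season
    on=["player_id", "season"],
    suffixes=("_y1", "_y2"),
)

# Keep players who changed teams and batted at least 150 times in each season
switchers = pairs[
    (pairs["team_y1"] != pairs["team_y2"])
    & (pairs["pa_y1"] >= 150)
    & (pairs["pa_y2"] >= 150)
].copy()
```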

As a baseline model, I predicted the new home park wOBA with a regression model that regresses year 2 wOBA on year 1, regardless of home park. Not a very sophisticated model, for sure, but better than nothing and a good place to start.
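In code, the baseline is one line of statsmodels. This is just a sketch under the column names assumed above; see the leave-one-out loop below for how the predictions are actually kept out-of-sample.

```python
import statsmodels.formula.api as smf

# Baseline: regress year-2 wOBA on year-1 wOBA, ignoring home park entirely
baseline = smf.ols("woba_y2 ~ woba_y1", data=switchers).fit()

# Park-ignorant prediction of each player's wOBA in his new home park
# (in-sample here; replaced by the leave-one-out version below)
switchers["pred_raw"] = baseline.predict(switchers)
```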

I then compared that prediction to the one we’d get after adjusting it for park factors. The park factors I used were Statcast’s. They provide both left- and right-handed batting factors for each park, so I used that information in the estimates. Switch-hitters got separate comparisons that reflected their performance from each side of the plate.
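The adjustment step I’ll sketch like this: rescale the raw prediction by the ratio of the new park’s factor to the old park’s factor for the batter’s side of the plate (Statcast’s factors are indexed to 100, so the ratio works directly). The pf_* and side columns are assumed lookups, and this is one plausible way to apply the factors, not the only one.

```python
# Hedged sketch: pf_old_L / pf_old_R / pf_new_L / pf_new_R and side are assumed
# lookup columns matched to each player's old and new home parks.
def adjusted_prediction(row):
    # Switch-hitters get one side-specific row per side in this setup,
    # so each row carries a single batting side ("L" or "R").
    side = row["side"]
    ratio = row[f"pf_new_{side}"] / row[f"pf_old_{side}"]  # factors indexed to 100
    return row["pred_raw"] * ratio

switchers["pred_adj"] = switchers.apply(adjusted_prediction, axis=1)
```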

Obviously, in estimating the this-year-as-a-function-of-last-year regression model, I excluded the players whose second-season performance was to be predicted! The predictions, in other words, are all out-of-sample ones in relation to the model, with and without park-factor adjustments. There were about 120 players who met the relevant criteria.
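One simple way to keep every prediction out-of-sample with a sample this small is a leave-one-out loop: refit the baseline with the player being predicted held out. Again, a sketch under the assumed column names.

```python
import pandas as pd
import statsmodels.formula.api as smf

loo_preds = []
for idx, row in switchers.iterrows():
    # Hold out the player being predicted, refit the baseline on everyone else,
    # then predict that player's year-2 wOBA out-of-sample
    fit = smf.ols("woba_y2 ~ woba_y1", data=switchers.drop(index=idx)).fit()
    loo_preds.append(
        fit.predict(pd.DataFrame({"woba_y1": [row["woba_y1"]]})).iloc[0]
    )

# Replaces the in-sample predictions above; recompute pred_adj from these
switchers["pred_raw"] = loo_preds
```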

In this way, I got two models to compare: a raw, park-factor-unadjusted next-year-from-last model; and a park-factor-adjusted one that applies the park factors to the raw model’s predictions.

What did I find? The models did equally well. On average, they were off by about 12 wOBA points (MAE, weighted for plate appearances)—better than I expected actually, given how simple the regression model was!
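(The accuracy metric is just a plate-appearance-weighted mean absolute error; something like this, with the same assumed columns.)

```python
import numpy as np

def weighted_mae(pred, actual, weights):
    # Mean absolute error, weighted by year-2 plate appearances
    return np.average(np.abs(np.asarray(pred) - np.asarray(actual)), weights=weights)

mae_raw = weighted_mae(switchers["pred_raw"], switchers["woba_y2"], switchers["pa_y2"])
mae_adj = weighted_mae(switchers["pred_adj"], switchers["woba_y2"], switchers["pa_y2"])
```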

But the cool thing is how each performed conditional on the performance of the player in the new park. Look:

Sure enough, as Petti found, the park-factor-adjusted prediction overestimated the performance of the power hitters.

 

But the unadjusted formula underestimated those same players’ performances by an equivalent degree!

That’s why they come out the same “on average.”
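If you want to see that pattern in numbers rather than a chart, bin the players by their observed new-park wOBA and look at each model’s mean signed error within the bins. A sketch, same assumptions as above:

```python
import pandas as pd

# Signed error: positive means the model over-predicted the player's new-park wOBA
switchers["err_raw"] = switchers["pred_raw"] - switchers["woba_y2"]
switchers["err_adj"] = switchers["pred_adj"] - switchers["woba_y2"]

# Quartiles of observed year-2 wOBA: the adjusted model over-predicts in the
# top bin, the unadjusted model under-predicts there by a similar amount
bins = pd.qcut(switchers["woba_y2"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
print(switchers.groupby(bins)[["err_raw", "err_adj"]].mean())
```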

Interesting, right?

A couple more points.

First, remember that for park-factor adjustments to be worthwhile, they have to make up for the explanatory power they drain from raw performance measures. That was the nerve of the last post.

In this very simple test, park factors just barely cleared the bar: the raw and park-factor-adjusted predictions were equally accurate. But if that’s what we kept observing as the models became more precise, that would be sort of bad for park factors; it would suggest they aren’t adding anything to raw performance predictions.

Second, the “simple” model used here should be expected both to underestimate the second-season performance of better-than-average performers and to overestimate the second-season performance of below-average ones. That’s because a pure “auto-regressive” (next from last) model identifies the equation that minimizes the deviation of predicted values from observed ones. If you then use that equation to predict, values higher than average get pulled down toward the average, and values below it get pulled up. Apply it again and again and eventually everyone is predicted to converge on the unconditional, sample-wide mean.
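Here’s the shrinkage in miniature (the numbers are made up for illustration, not estimates from the data):

```python
# With a fitted slope b < 1, each pass through the next-from-last equation
# pulls the prediction a fraction (1 - b) of the way back toward the mean
mean, b = 0.320, 0.45   # illustrative values only
woba = 0.400            # a genuinely good hitter

for year in range(1, 6):
    woba = mean + b * (woba - mean)
    print(year, round(woba, 3))
# 1 0.356, 2 0.336, 3 0.327, 4 0.323, 5 0.321 ... converging on the mean
```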

That’s a ridiculous divergence from reality! Genuinely better players will continue to do better than genuinely worse ones. A better model would figure that out.

And then the unadjusted model wouldn’t over- and underestimate so much at the extremes.

But what would happen to park factors? Whatever it is in Statcast’s factors that causes them to overestimate power hitters’ performance after a park change would presumably stay fixed. . . . If so, the balance could tip in favor of raw, unadjusted predictions, even if one is trying to forecast things like how changing parks will affect a player’s production. . . .

Or at least that’s one conjecture! To test it, we need more data . . .  more data . . .  MORE DATA!

Stay tuned-
