Evaluating the 2018 Projection Systems

As we all hunker down with our spreadsheets in preparation for draft season, it’s time for my favorite yearly tradition – evaluating the projection systems! Forgive the somewhat abbreviated post in comparison to past years’ analyses, but trust that all the same meticulous research is here. The goals, as always, are to evaluate the best projection systems on a per-category basis for the purposes of fantasy baseball, and to determine the best possible mix of projections (separated into rate stats and playing time) as we look ahead to ’19.

In this study, I’ll focus on the most commonly used projections – the same ones that appear in the Big Board: Steamer, PECOTA, ZiPS, ATC, The Bat, Fangraphs Depth Charts, and Fangraphs Fans. The categories of interest are the typical 5×5 categories, HR/R/RBI/SB/AVG for hitters, and W/SO/SV/ERA/WHIP for pitchers. Since for fantasy purposes we only care about the relative projections made by each system (i.e., we only need to know Trout is the best hitter in baseball, not exactly what his AVG will be), I’ll primarily use R squared to evaluate how well the projections correlated to actual results, but I’ll also include RMSE to show the absolute error in each projection system. The most common fantasy leagues draft about 300 players, broken out into 180 hitters, 90 SP, and 30 RP, so I’ve used the consensus top 300 players as determined by an average of the projection systems, and will only be evaluating the systems on their projections of those 300 players. One final adjustment – hitters that didn’t end up reaching 400 PA and pitchers that didn’t reach 35 IP have been thrown out of the sample to reduce the effect of short playing-time outliers (typically from injury).
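For concreteness, here’s a minimal Python sketch of the two metrics as described above – the function names and the HR numbers are invented for illustration, not pulled from the actual Big Board workbook:

```python
import numpy as np

def r_squared(projected, actual):
    """Squared Pearson correlation between projections and actuals."""
    r = np.corrcoef(projected, actual)[0, 1]
    return float(r ** 2)

def rmse(projected, actual):
    """Root mean square error: the typical absolute miss per player."""
    p, a = np.asarray(projected, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((p - a) ** 2)))

# Invented example: one system's HR projections vs. actuals for
# five hitters who cleared the 400 PA cutoff.
proj = [35, 28, 22, 30, 15]
act  = [39, 24, 25, 21, 18]
print(r_squared(proj, act), rmse(proj, act))
```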

Hitters

First, a definition – the Big Board mix for hitters this year, which combines the systems to produce the best overall results, is:

  • Playing Time: 41% Fans, 32% ZiPS, 27% Steamer

  • Rate Stats: 58% ATC, 42% Steamer

Note that rate stats like AVG were also evaluated as part of the ‘total’ projections by using a playing-time-weighted value indicated by an ‘n’ (e.g. “nAVG”). The SB (*) projections were evaluated on the basis of two separate populations and averaged: players who stole more than 5 bases, and those who stole 5 or fewer. Past analysis has shown that evaluating this as a single population gives undue credit for projecting the low-steal players, and not enough credit for accurately projecting high-steal players.
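In code, the blend and the split-population SB scoring look roughly like this – a sketch with made-up inputs, and note it assumes the >5 SB split is defined by actual steals, which the writeup doesn’t spell out:

```python
import numpy as np

def r_squared(projected, actual):
    r = np.corrcoef(projected, actual)[0, 1]
    return float(r ** 2)

# The hitter mix, applied per player (these inputs are made up):
pa  = {"Fans": 640.0, "ZiPS": 610.0, "Steamer": 625.0}
avg = {"ATC": 0.268, "Steamer": 0.251}
mix_pa  = 0.41 * pa["Fans"] + 0.32 * pa["ZiPS"] + 0.27 * pa["Steamer"]
mix_avg = 0.58 * avg["ATC"] + 0.42 * avg["Steamer"]  # feeds nAVG once weighted by PA

# Two-population SB scoring: score the high- and low-steal groups
# separately, then average the two R-squared values.
def sb_score(projected_sb, actual_sb):
    p, a = np.asarray(projected_sb, float), np.asarray(actual_sb, float)
    high = a > 5  # assumption: the split is defined by actual steals
    return (r_squared(p[high], a[high]) + r_squared(p[~high], a[~high])) / 2
```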

[Table: 2018 hitter projection R² by system – PA, the five categories, and per-PA (red) splits]

Starting with playing time, nearly every projection system struggles here every year. Between injuries, lineup spots, and role changes, playing time is just plain difficult to peg. Many people swear by the hand-curated playing time over at Fangraphs, but here we see it failed to live up to that hype for the 3rd straight year. By combining Steamer, ZiPS, and the Fans’ opinions of playing time, we get the best of each system, so the Big Board mix comes up with a significantly better overall projection of PAs.

The rate stat performances were more similar than different, although you may note some poor performances in R and RBI for a few systems. The systems that average multiple projections (ATC, FGDepth, the Big Board mix) come out on top, as might be expected. The net result is calculated as the percentage above (or below) average across the five categories for each system, and is also plotted below.
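To be explicit about the arithmetic, here’s one plausible reading of that net-result calculation – each system’s R squared in a category, relative to the all-system average for that category, averaged across the five categories (the R squared values below are invented, not the actual 2018 results):

```python
import numpy as np

# Invented R-squared values: rows are systems, columns are the five categories.
systems = ["Steamer", "ATC", "Big Board"]
r2 = np.array([
    [0.55, 0.40, 0.45, 0.62, 0.30],
    [0.58, 0.45, 0.50, 0.65, 0.33],
    [0.61, 0.48, 0.53, 0.68, 0.36],
])
net = (r2 / r2.mean(axis=0) - 1).mean(axis=1) * 100  # % above/below average
for name, pct in zip(systems, net):
    print(f"{name}: {pct:+.1f}%")
```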

[Chart: hitter net results – percentage above/below the five-category average R² for each system]

In the case of hitters, the Big Board mix noticeably beats all the others, topping the average five-category R sq. by about 13%. On a per-PA basis, ATC comes close, and FGDepth is not far behind that, but the improved playing time gives my mix the big advantage. I said in this piece last year that I hoped to see improved results from PECOTA, but they only got worse in ’18. I’m holding out hope that their new DRC+ improvements will help in 2019!

RMSE: This gives you an idea of the typical error in each category, for each system. Each counting stat is listed as error per 600 PA to normalize the values to an approximate full-season scale.
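If you want to reproduce the normalization, here’s a sketch of one reasonable implementation – scale each player’s projected and actual counting stats to a 600 PA season before taking the RMSE (the exact order of operations is my assumption):

```python
import numpy as np

def rmse_per_600_pa(proj_stat, proj_pa, actual_stat, actual_pa):
    """Scale each player's projected and actual counting stat to a
    600 PA season, then take the RMSE of the scaled values."""
    p = 600.0 * np.asarray(proj_stat, float) / np.asarray(proj_pa, float)
    a = 600.0 * np.asarray(actual_stat, float) / np.asarray(actual_pa, float)
    return float(np.sqrt(np.mean((p - a) ** 2)))

# Invented example: HR projections and actuals with differing playing time.
print(rmse_per_600_pa([30, 22], [620, 580], [35, 18], [650, 540]))
```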

[Table: hitter RMSE by category for each system, counting stats shown as error per 600 PA]

Pitchers

Another definition – the Big Board mix for pitchers this year, which combines the systems to produce the best overall results, is:

  • Playing Time: 72% ATC, 28% Steamer

  • Rate Stats: 32% ATC, 30% Steamer, 27% PECOTA, 11% The Bat

As with the hitter projections, weighted rate stats will be indicated by an ‘n’ (e.g. “nERA”). The IP, W, and SO (*) projections were evaluated on the basis of two separate populations and averaged: starters and relievers. Past analysis has shown that evaluating this as a single population gives undue credit for projecting the separation between these two populations. The SV (**) projections were evaluated for relievers only.
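Here’s what that split-population scoring looks like in code – a sketch that assumes a simple starter/reliever flag for each pitcher:

```python
import numpy as np

def r_squared(projected, actual):
    r = np.corrcoef(projected, actual)[0, 1]
    return float(r ** 2)

def split_score(projected, actual, is_starter):
    """Score starters and relievers separately and average the two
    R-squared values, so a system gets no credit merely for
    separating the two roles."""
    p = np.asarray(projected, float)
    a = np.asarray(actual, float)
    sp = np.asarray(is_starter, bool)
    return (r_squared(p[sp], a[sp]) + r_squared(p[~sp], a[~sp])) / 2

def sv_score(projected_sv, actual_sv, is_starter):
    """SV is evaluated over relievers only."""
    rp = ~np.asarray(is_starter, bool)
    return r_squared(np.asarray(projected_sv, float)[rp],
                     np.asarray(actual_sv, float)[rp])
```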

[Table: 2018 pitcher projection R² by system – IP, the five categories, and per-IP splits]

The new method for IP evaluation produces, in my opinion, a much more accurate result – pitcher playing time projection is hard. ATC rises above the rest though, so kudos to Ariel. The Big Board mix is made marginally better by incorporating a bit of Steamer.

ZiPS managed to do something amazing this year – its W/IP projection literally had an R squared of 0.00 against actual 2018 results. I don’t think I’ve ever seen that. Otherwise, we again see that many systems produced similar results, with a few poor performances in individual categories here or there. ATC and The Bat fared quite well overall, with Steamer just a bit behind them. Still, the Big Board mix takes the top spot from all of them by performing reasonably well in every category. ZiPS had a rough year on the pitching side – might the FG site projections improve by incorporating ATC into their depth chart projections instead of ZiPS? Again, the net results are plotted below:

[Chart: pitcher net results – percentage above/below the five-category average R² for each system]

One thing that sticks out here is that The Bat might have been one of the best if they’d had a better playing time projection! But beyond that, we also see just how impressive a year ATC had.

RMSE: The root mean square error for each pitching category… in this case, normalized to 200 IP (or 65, for SVs).
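Same scaling trick as on the hitting side, with the denominator as a parameter – note that reading the ‘65’ as a reliever-scale innings total is my assumption:

```python
import numpy as np

def rmse_per_ip(proj_stat, proj_ip, actual_stat, actual_ip, per=200.0):
    """Scale each pitcher's projected and actual counting stat to a
    fixed innings total (200 IP by default; pass per=65.0 for the
    reliever-scale SV numbers), then take the RMSE."""
    p = per * np.asarray(proj_stat, float) / np.asarray(proj_ip, float)
    a = per * np.asarray(actual_stat, float) / np.asarray(actual_ip, float)
    return float(np.sqrt(np.mean((p - a) ** 2)))
```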

[Table: pitcher RMSE by category for each system, normalized to 200 IP (65 for SV)]

15 thoughts on “Evaluating the 2018 Projection Systems”

  1. Been waiting for this excitedly! Thanks for running this analysis like in years past!

    Quick thoughts:

    1. Maybe for determining the best projections in terms of accuracy, we could start looking at how the systems have performed over the past 3-5 years and weight their historical performance that way. Year-to-year a projection system might struggle, but if it’s consistent then we should think about its future performance differently.

    2. Is it just me, or is the RMSE for ERA really high? Only predicting to within a .9 error is really wild.

    3. Is it plausible to completely decouple the playing time from the stat projections? Like taking 0% of Steamer’s stats but taking its playing time. Feels like things might get wacky and drift away from the core view each projection system wants to have on certain players. Same goes for injury – are projection systems not in the market of predicting injury? It obviously weighs into expected playing time, but what about performance itself?

    1. Good questions, taking them one at a time –
      1. I would like to move toward that eventually, but the past few years have had enough changes (ball juicing/unjuicing, introduction of statcast data and incorporation into projection systems, creation of new public projection systems) that it would not be worth doing IMO.
      2. Yeah, crazy huh? ERA fluctuates wildly year over year. It’s why you’re much better served trying to draft skills like K rate.
      3. I’m not sure I totally follow you – the one example I can think of where this would be the case is when players are projected for platoon roles, or reliever roles, and receive improved skills projections accordingly. But otherwise, these systems generally don’t have much in the way of human interference as far as I know, so no, they don’t concern themselves with predicting injury impacts on performance.

      1. Thanks for clarifying – yeah, I agree that there are some year-to-year changes, but it would still be interesting to see an average of the past 3-5 years and their performance. I imagine it wouldn’t be difficult since the hard work has already been done!

        For 3 I just meant that if you separate a projection’s stats and playing time, things might get messy. Like if one projection system has 50 fewer PA, then that would result in fewer R’s and RBI’s, so you’re kind of minimizing that if you only take the stats or only take the playing time. Unless you’re normalizing the stats on a per-PA basis and then multiplying by the factor of PA’s.

        And for 2 I want to generalize this to all stats and our drafting philosophy: Is there a way we can weigh how accurate the projections are in each stat category versus the weight we place on a player’s value? So like you said, since we can’t really predict ERA, then shouldn’t it have less of a linear weight in a player’s z-score? Whereas things like AVG and R’s and RBI’s are probably stickier and only have errors of ±10-15.

        I assume that when we’re calculating z-scores and valuing players, all 5 stats are given equal weight even though some are more predictable than others. Is this something to think about going forward?

        1. Oh yeah, well, the counting stats are examined both as-is and fully separated from the playing time – that’s the point of the red-colored half of each table. Hope that makes sense.

          You’ve asked a tough question re: weighting categories. I’ve thought about this in the past. The projection systems already heavily regress to the mean for stats that have higher variance. Since the z-scores are based on real-world variance, but the projections are calculated with that regression already baked in, I think the z-score method actually already neatly accounts for the issue you’re talking about! Fancy that, eh?

  2. Very interesting. I wonder how you reconcile this with Cohen’s analysis on FG, which showed ATC and THE BAT dominating all others under a simulated-draft methodology?

    1. Nothing to reconcile, just two different approaches arriving at slightly different answers. ATC performed very well in this analysis as well… so Ariel definitely had a good year! If there’s anything to be said against his method, it’s that it’s a bit too granular for my tastes – looking at how the systems performed on individual players (as opposed to the overall population) is going to make you more prone to single-year fluctuations.

        1. Harper Wallbanger

          Yes. It’s also quite likely that Steamer struggles under Ariel’s method because Steamer is the most ubiquitous projection system. The AAVs he’s using as a baseline price for players are heavily driven by what Steamer says about players. As ATC and The Bat get more popular (they’ve only been available on Fangraphs for 1-2 yrs now), that edge will disappear.

  3. Big fan of Big Board. I’ve been using it the past few years and have always finished in the money.

    Have a question regarding the Fans projections. Was the Fans’ playing time success last year a fluke, or have they been reliable in the past? It seems like so many players have fewer than 10 ballots submitted for Fan projections. Even Trout only has 28 ballots. I can’t argue with the numbers showing the success last year, but it just feels odd to be trusting the opinions of a dozen or so people.

    1. As far as I can tell, Fans do well with projecting position battles but not with projecting health (they are generally too optimistic). The reason their ‘opinions’ at least end up in the realm of possibility is the way FG asks the questions on the page – it sets you up to project things into a somewhat reasonable range by showing you the past performances side by side with your projection.

  4. Hey Ryan. Trying to understand how some of the hitting projections came together. The post states the following drives the Rate Stats projection for hitters:

    Rate Stats: 58% ATC, 42% Steamer

    Now let’s take Aaron Judge as an example. I downloaded the board and I see Judge projected for an AVG of .270 and an OPS of .940.

    When I go to Rotochamp (https://www.rotochamp.com/Baseball/Player.aspx?MLBAMID=592450) to check his projections, I see that ATC and Steamer provide the following projections:

    AVG: .268 (ATC), .251 (Steamer)
    OPS: .931 (ATC), .870 (Steamer)

    This doesn’t seem to make sense to me. If ATC and Steamer are the only two drivers of hitting projections, how can Judge be projected for a higher AVG and OPS than either of the two individual projection systems provides?

    Sorry if I’m missing something obvious here. Thanks for any insight you can provide.
