December 26th, 2019 / game grades // wepa // prediction /
Final scores are reasonable predictors of future performance, but game grades like DVOA and WEPA are measurably better
Individually, WEPA was the best game grade tested, beating both DVOA and PFF in descriptive and predictive power
Even when a game grade isn’t as predictive of future performance as other game grades, incorporating it into a blended game grade yields a metric that is more predictive than any individual game grade
Depending on implementation, a game grade that blends Point Differential, WEPA, and PFF tends to perform best
Point differential is the product of not only strategy and execution, but also randomness and performance in high leverage moments. While point differential determines wins and losses, it may not always be the best tool for measuring performance.
Game grades attempt to take performance measurement past the superficial level of point differential. For instance, DVOA uses the situational context of play level performance to remove randomness quantitatively, while PFF Grades use game tape to assess performance qualitatively.
By measuring performance through individual components, game grades often allow for performance attribution as well. Perhaps the best example of this is EPA, which allocates a fraction of every point scored to individual plays.
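As a sketch of how EPA-style attribution works: each play is credited with the change in expected points it caused. The expected-points lookup below is a hypothetical stand-in for a real expected points model, not the actual EPA implementation.

```python
# Hypothetical expected-points table keyed by (down, distance, yards to goal).
# A real model estimates these values from historical play-by-play data.
expected_points = {
    ("1st", 10, 75): 0.6,   # 1st & 10 from a team's own 25
    ("1st", 10, 50): 2.0,   # 1st & 10 at midfield
    ("2nd", 5, 45): 2.3,
}

def play_epa(state_before, state_after, points_scored=0):
    """EPA for one play: points scored plus the change in expected points."""
    ep_before = expected_points[state_before]
    ep_after = expected_points[state_after] if state_after else 0.0
    return points_scored + ep_after - ep_before

# A 25-yard completion from a team's own 25 to midfield:
print(round(play_epa(("1st", 10, 75), ("1st", 10, 50)), 1))  # 1.4
```

Summing per-play EPA over a game attributes the final margin, play by play, to the moments that produced it.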
Given that game grades are an attempt to measure performance in a more meaningful way, it is important to assess the effectiveness of the game grades themselves.
Game grades generally balance three main objectives:
Describe the final score of the game
Decompose the final score into components (e.g. players, phases, play type, etc)
Predict the final score of future games
A game grade will never measure the final score better than the point margin itself, but by correcting for leverage and randomness, game grades can be more predictive of future performance. If a game grade is not more predictive of future performance, then its only potential use is as an attribution tool for describing how a team won. Therefore, the best game grades should correlate to game margin, predict future performance better than game margin, and be built in a way that allows for deeper analysis of individual plays, situations, players, etc.
This analysis will consider 4 Game Grades:
Success Rate - The percent of a team’s plays that increase the team’s expected probability of winning the game
WEPA - A model that weights EPA based on how well specific play types predict future performance
DVOA - Football Outsiders’ proprietary game grade based on play-by-play data
PFF Grades - Team grades based on film study of every player on every play
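WEPA's core idea, scaling EPA by play type, can be sketched as follows. The weights and play records here are illustrative only, not the actual fitted WEPA coefficients:

```python
# Hypothetical play-type weights: play types that predict future performance
# well are weighted up, while noisy ones (e.g. fumble luck) are discounted.
WEIGHTS = {"pass": 1.2, "rush": 1.0, "fumble": 0.4, "interception": 0.7}

def wepa(plays):
    """Weighted EPA: sum of each play's EPA scaled by its play-type weight."""
    return sum(WEIGHTS.get(p["type"], 1.0) * p["epa"] for p in plays)

plays = [
    {"type": "pass", "epa": 0.8},
    {"type": "rush", "epa": -0.2},
    {"type": "fumble", "epa": -3.5},  # heavily discounted as mostly random
]
print(round(wepa(plays), 2))  # -0.64
```

Because the fumble's large negative EPA is mostly luck, the weighted grade penalizes it far less than raw EPA would.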
Point differential is a net score that is shared by each team. If a team wins by 10, their opponent definitionally lost by 10. While all game grades are “net” scores in the sense that they measure a team’s offensive and defensive performance, the grades are not necessarily shared by each team. It’s not uncommon for both teams to have positive (or negative) game grades. To derive a single “margin” grade shared by each team, game grades would have to be netted against each other such that the higher graded team always had a positive delta, and the lower graded team always had a perfectly inverse negative delta.
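In code, the netting described above is simply the difference between the two teams' grades; a minimal sketch:

```python
def net_grades(home_grade, away_grade):
    """Convert two standalone game grades into a shared net 'margin' grade.
    The higher-graded team gets a positive delta and its opponent gets the
    perfect inverse, mirroring how point differential behaves."""
    delta = home_grade - away_grade
    return delta, -delta

# Both teams graded positively (e.g. a shootout), but the net grade
# still declares exactly one "winner":
home_net, away_net = net_grades(12.4, 9.1)
print(round(home_net, 1), round(away_net, 1))  # 3.3 -3.3
```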
Past analysis has shown offensive performance to be far stickier than defensive performance. Using net grades could prove less predictive for offensive shootouts, as game grades are likely to weight offensive performance more heavily and could consider each team a “winner” individually.
For instance, WEPA graded both the Rams and the Chiefs as winners in their 2018 Monday Night Football classic:
To identify potential benefits and costs to using netted or non-netted grades, both approaches will be used. To keep the sample sizes identical, only home team grades will be used for non-net analysis.
In this analysis, a game grade’s ability to describe the final score of a game is measured by its correlation to the same game’s point differential. A game grade’s ability to predict future performance is measured by its RSQ to future point differential (i.e. out-of-sample point differential).
The OoS RSQ is measured across three rolling windows: 1 game, 4 games, and 8 games. The windows are equally sized and constrained to a single season. For instance, after a team’s 4th game, the 4 completed same-season games are used to make a prediction about the team’s next 4 same-season games (games 5, 6, 7, and 8). Because each team only plays 16 games a season, only 1 data point exists per team for the 8 game window. This window is analogous to a “First Half v Second Half” analysis:
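The windowed OoS RSQ calculation can be sketched with numpy. The toy inputs below are illustrative; the real analysis pools every team-season (which is why the 8-game window still yields a usable sample despite producing one data point per team):

```python
import numpy as np

def oos_rsq(grades, margins, window):
    """R-squared between a metric averaged over one window and point
    differential averaged over the next, non-overlapping window."""
    xs, ys = [], []
    for start in range(0, len(grades) - 2 * window + 1, window):
        xs.append(np.mean(grades[start:start + window]))
        ys.append(np.mean(margins[start + window:start + 2 * window]))
    r = np.corrcoef(xs, ys)[0, 1]
    return r ** 2

# Toy data: a perfectly consistent team, so windows predict each other exactly.
games = np.arange(16.0)
print(round(oos_rsq(games, games, window=4), 3))  # 1.0
```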
In general, game grades correlate strongly to the games they seek to describe, and in multiple instances, do a better job predicting future point differential than point differential does itself. That said, in nearly all situations, the “net” grade does a better job describing a game, but a worse job predicting future games**.
It’s intuitive that combining the performance of both teams yields a more accurate descriptive metric. It should also be intuitive as to why a non-netted score is more accurate at predicting future games. Game grade metrics include both offensive and defensive performance. If performance is successfully attributed to the team that caused it, then netting an opponent’s grade, which only contains information about the opponent’s performance, adds noise. Furthermore, past analysis has shown that a team controls its offensive performance more than its defensive performance, and thus any metric that only exists on a net basis (i.e. point differential or success rate) contains some inefficiency as it attributes offensive and defensive performance equally.
As individual game grades, non-net WEPA and non-net DVOA appear to offer the best performance. They both possess strong descriptive value, are measurably better at predicting future performance, and allow for attribution at the individual play level.
Though PFF grades do not perform as well as their quantitative play-by-play counterparts, there can be value in a less predictive measurement if that measurement encodes fundamentally different information. Depending on covariance profiles, combining two metrics can result in a new metric that possesses a better performance profile than either metric individually.
A blended metric that combines Point Margin, WEPA, and PFF (all standardized for scale) is more predictive than any of the three metrics individually:
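A blend like this can be sketched by z-scoring each metric and taking a weighted average. The data and weights below are illustrative, not the fitted values:

```python
import numpy as np

def blend(metrics, weights):
    """Standardize each metric (z-score) and combine with fixed weights,
    so metrics on different scales contribute proportionally."""
    z = [(m - m.mean()) / m.std() for m in metrics]
    return sum(w * zi for w, zi in zip(weights, z))

# Toy per-game values for four games (hypothetical):
point_diff = np.array([10.0, -3.0, 7.0, -14.0])
wepa = np.array([6.5, 1.2, 4.0, -9.0])
pff = np.array([72.0, 65.0, 70.0, 55.0])

blended = blend([point_diff, wepa, pff], [0.6, 0.2, 0.2])
print(np.round(blended, 2))
```

In practice the weights would be chosen by optimizing out-of-sample RSQ, and the best mix depends on how much non-overlapping information each metric contributes.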
Elo uses differences in Elo rating (and various macro factors like home field advantage) to predict win probability. Elo then compares this expectation against the actual game outcome to determine whether or not a team’s rating should be adjusted up or down. For instance, if a team was expected to lose, but instead won, its rating would be increased and its opponent’s rating decreased.
Elo’s post-game adjustment is only as accurate as its measure of outcome. Elo uses point differential as its measure of outcome, meaning any non-predictive randomness that exists in the final score is encoded into Elo’s adjustments and ratings. By replacing point differential with a more accurate game grade, Elo’s assessment of outcome and subsequent adjustments based on that assessment improve:
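Swapping a game grade in for point differential looks like the following sketch: a simplified Elo update where the observed outcome comes from a net game grade squashed into [0, 1]. The K factor and logistic scale here are hypothetical tuning parameters, not the values any published Elo model uses:

```python
import math

K = 20  # update speed (hypothetical)

def expected_score(rating_a, rating_b):
    """Standard Elo win expectancy from the rating gap."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, outcome):
    """Move ratings toward the observed outcome (1 = team A 'won' fully).
    'outcome' can come from the final score, or from a game grade so that
    less-random measurements drive the adjustment."""
    delta = K * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

def grade_to_outcome(net_grade, scale=14.0):
    """Squash a net game grade into [0, 1] with a logistic curve
    (scale is a hypothetical tuning parameter)."""
    return 1 / (1 + math.exp(-net_grade / scale))

# A 1500-rated underdog beats a 1600-rated favorite with a net grade of +6:
new_a, new_b = update(1500, 1600, grade_to_outcome(6.0))
print(round(new_a, 1), round(new_b, 1))
```

Because the underdog both won and out-graded expectation, its rating rises and its opponent's falls by the same amount; a win with a weak game grade would produce a smaller adjustment than the raw final score implies.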
Blending alternative game grades into a basic Elo model that only uses final scores and off-season regression improves the model’s score (as measured by Brier score) much in the same way it improves the RSQ of OoS point differential. Additionally, the model is made even more accurate by blending multiple game grades together. In this case, a game grade that is 60% point differential, 20% WEPA, and 20% PFF works best.
This approach also works in more advanced versions of Elo like nfelo, which factors in QB adjustments, Vegas lines, and smarter pre-season adjustments:
Again, a blended metric that includes Point Differential, WEPA, and PFF works best. However, in this case, a lower PFF weight is warranted, perhaps because nfelo and 538’s Elo already encode similar information by adjusting for individual QB play.
The superior predictive power of game grades proves that randomness can be corrected for. This finding doesn’t seem surprising given the broad adoption of advanced game grades like DVOA. Even those skeptical of analytics recognize the pitfalls of relying on point differential alone.
What is surprising, perhaps, is the superior performance of newer alternative game grades like success rate and WEPA. The most descriptive, predictive, and attributable measure of a team’s performance is not a paywalled black box model. Rather, it is an open source model built on an open source dataset. While advanced metrics like DVOA made football audiences fundamentally smarter, we now have the ability to build superior alternatives.
Adoption of open source alternatives will take time, but given the community and resources built around tools like nflscrapR, this adoption seems inevitable.
*The “D” in DVOA stands for Defense-Adjusted, meaning its game grades adjust over time based on a continuously updated notion of the opponent’s quality. Due to this feature, DVOA’s individual game grades use future information and should be valued for their descriptive rather than predictive ability. DVOA as a point-in-time metric (i.e. without future information) is only published as an aggregate team metric.
In theory, the predictive power of point-in-time DVOA game grades and team grades should be similar. As expected, a team’s point-in-time DVOA through 8 games, which includes no future information, is less predictive than its average DVOA game grade through the first 8 games (0.263 vs 0.278). However, the discrepancy is not large, suggesting that the defensive adjustments are not the most important element of DVOA.
** WEPA is optimized around its net grade’s ability to predict the last 8 games of the season. As a result, the relative predictiveness of the net 8 game window is likely higher than it would be had WEPA been optimized for a different objective function.