December 30th, 2021 / QBs // Replacement // Team and Scheme /
Though football is a team sport, the quarterback position is often thought of as an individual one. Quarterbacks control a significant portion of the offense, and as a result, there’s a natural desire to attribute success and failure to the individual. Whether through poor metrics like QB wins or better metrics like EPA per dropback, we present team performance as individual performance when we talk about QBs.
However, by looking at replacement performance versus expectation, market expectation for replacements, and the few natural experiments created by starting QBs changing teams, we see that “team and scheme” are, perhaps, more influential than previously thought.
To analyze how backups perform when replacing starters, we need a way to measure QB performance. Here, I will use 538’s QB model, which uses high-level box score stats (yards, attempts, TDs, rushes, etc.) to approximate QBR. Though the model has its shortcomings, it is highly accessible and easy to adjust for era. The model’s mechanics also present an easy way to compare a QB’s expected performance to their actual performance.
In 538’s model, each QB has an expected level of performance based on their historical performance. After each game, the model looks at the QB’s actual performance and uses that information to create a new expected level of performance for the QB. Though actual performance can fluctuate game to game, the expected level of performance is fairly predictive and stable, which makes it a good proxy for measuring QB quality.
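To make the "expectation, then update" loop concrete, here is a minimal sketch of a rolling expectation. The blending rule and the `weight` of 0.1 are my own illustrative assumptions, not 538’s published parameters, and the function names and game values are hypothetical.

```python
def update_expectation(expected, actual, weight=0.1):
    """Blend the prior expectation with the latest game's observed value.

    `weight` is an assumed, illustrative parameter: how much one game
    moves the expectation.
    """
    return (1 - weight) * expected + weight * actual


def rolling_expectations(initial, game_values, weight=0.1):
    """Return the expectation before each game, plus the final expectation."""
    expectations = [initial]
    for value in game_values:
        expectations.append(update_expectation(expectations[-1], value, weight))
    return expectations


# Hypothetical game-by-game values for one QB (not real data).
games = [150.0, 90.0, 210.0, 30.0]
history = rolling_expectations(initial=100.0, game_values=games)
for pre_game, observed in zip(history, games):
    print(f"expected {pre_game:7.2f}  actual {observed:7.2f}")
```

Because each game only nudges the expectation, single-game swings are damped, which is the stability property that makes the expected value a usable proxy for QB quality.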
To control for passing inflation over the last 20 years, all QB performance (both expected and actual) is presented relative to league median performance. In this context, a league-median QB has a value of 0, while the best and worst QBs will have strongly positive and strongly negative values, respectively:
With 538’s era-adjusted framework, we can now compare how QBs perform relative to expectation and analyze a replacement’s performance based on the quality of the starter they are replacing.
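The era adjustment described above amounts to subtracting each season’s league median from every QB value in that season. A rough sketch, assuming a simple list of `(season, qb, value)` rows; the row layout, names, and numbers are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict
from statistics import median


def era_adjust(rows):
    """Subtract each season's median value from every row in that season.

    rows: list of (season, qb, value) tuples (an assumed layout).
    """
    by_season = defaultdict(list)
    for season, _qb, value in rows:
        by_season[season].append(value)
    medians = {season: median(values) for season, values in by_season.items()}
    return [(season, qb, value - medians[season]) for season, qb, value in rows]


# Hypothetical raw values: the 2021 numbers are inflated across the board.
rows = [
    (2001, "A", 40.0), (2001, "B", 50.0), (2001, "C", 75.0),
    (2021, "D", 60.0), (2021, "E", 70.0), (2021, "F", 95.0),
]
adjusted = era_adjust(rows)
for season, qb, value in adjusted:
    print(season, qb, value)
```

After the adjustment, QB C in 2001 and QB F in 2021 land on the same value even though their raw numbers differ, which is the point of measuring against the median of the season.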
The first thing we notice is that replacement quarterbacks are more likely to be replacing below-average QBs. While the distribution of all starters centers on zero (i.e., an average QB), the distribution of replacements centers below zero. Note that this distribution shows not the replacement’s quality, but the quality of the QB they are replacing:
This distribution is fairly logical. Replacements play when the starter is injured or benched, and while all quarterbacks get injured, only bad quarterbacks get benched. It’s unsurprising, then, that replacement QBs do not perform well. The average 538 Elo grade for a replacement player in their first start is -43.
However, absolute performance says little about the potential influence “team and scheme” have on QB play. What matters more is the replacement’s performance relative to expectation. Since 538’s QB framework is a model with predicted and observed values, we can define the model's error as the difference between the two. When the error is positive, it means that the QB outperformed expectation, and when the error is negative, it means the QB underperformed expectation. Plotting “Replacement Error” against starter value, we see that there is a positive correlation between the two:
A starter’s quality should have no bearing on a replacement’s performance, and yet, we see that it does. The better the starter, the more likely the replacement QB is to exceed expectation. Predicting QB play in a single game is quite difficult due to the game-to-game variance that exists in the NFL, so a low R-squared is not terribly surprising. By binning starter quality and calculating an average, the relationship becomes clearer (albeit overstated):
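The binning step can be sketched as follows. This is a minimal illustration, assuming a fixed bin width of 50 points of starter value; the width, function name, and sample numbers are hypothetical rather than the ones used for the chart.

```python
import math
from collections import defaultdict


def binned_mean_error(starter_values, errors, bin_width=50.0):
    """Average the replacement error within fixed-width bins of starter value.

    Returns {bin_lower_edge: mean_error}, ordered by bin.
    """
    sums = defaultdict(lambda: [0.0, 0])  # lower edge -> [total error, count]
    for starter_value, error in zip(starter_values, errors):
        lower = math.floor(starter_value / bin_width) * bin_width
        sums[lower][0] += error
        sums[lower][1] += 1
    return {lower: total / n for lower, (total, n) in sorted(sums.items())}


# Hypothetical replacement starts (not real data).
starter_values = [-80.0, -60.0, -20.0, 10.0, 40.0, 70.0]
errors = [-30.0, -10.0, -5.0, 5.0, 15.0, 25.0]
bins = binned_mean_error(starter_values, errors)
for lower, mean_error in bins.items():
    print(f"[{lower:6.1f}, {lower + 50.0:6.1f}): {mean_error:6.1f}")
```

Averaging inside bins strips out most of the game-to-game noise, which is why the trend looks cleaner, and also why it looks stronger than the game-level fit really is.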
So, what does this mean? A model’s error term should be random. When a model’s error term correlates with some factor, it means the model isn’t capturing some dynamic present in that factor.
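As a rough sketch of that diagnostic: compute each replacement’s error (actual minus expected) and check whether it correlates with the value of the starter being replaced. The numbers below are made up for illustration, and the function names are my own.

```python
import math


def replacement_errors(expected, actual):
    """Model error per start: positive means the QB beat expectation."""
    return [a - e for e, a in zip(expected, actual)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical first starts: value of the replaced starter, and the
# replacement's expected and actual era-adjusted values.
starter_value = [-60.0, -30.0, 0.0, 30.0, 60.0]
expected = [-40.0, -45.0, -50.0, -40.0, -45.0]
actual = [-70.0, -50.0, -55.0, -20.0, -15.0]

errors = replacement_errors(expected, actual)
r = pearson_r(starter_value, errors)
print(f"correlation between starter value and replacement error: {r:.3f}")
```

If the model captured everything relevant, this correlation would sit near zero. A clearly positive value says the errors carry information about the situation the replacement is stepping into.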
Here, the error term of 538’s QB model correlates with the quality of the starter, and from this, we may infer that some factor exogenous to the QB (i.e., “team & scheme”) is influencing how the QB’s performance is measured. In layman’s terms, surrounding talent and scheme have an influence on QB performance. Good starters perform well, in part, because their surrounding talent and scheme are good. When that starter is replaced, their backup benefits from the same positive situation and exceeds the expectations we hold from their past performance and draft position.
While a correlating error term is interesting, it could just represent a flaw in 538’s model and its inability to consider surrounding talent. If “team & scheme” were systemically underappreciated, we’d expect to see a similar error in other models.
This is exactly what we see when looking at the best model out there, the Vegas spread:
When a replacement plays for a below-average starter, they underperform Vegas expectations and cover only 46% of games in their first start. Conversely, when a replacement plays in place of an above-average starter, they outperform Vegas expectations and cover 54% of games in their first start.
In effect, markets make the same mistake as 538’s model. They assume too much of a starting QB’s performance is individual and place too little emphasis on surrounding talent and coaching. “Team & scheme” is undervalued.
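For readers who want to reproduce this kind of split, here is a minimal sketch of a cover-rate calculation. It assumes each game is recorded as `(starter_value, margin, spread)` from the replacement’s team’s perspective, with a negative spread meaning that team is favored; pushes are excluded. The record layout and the sample games are hypothetical, and a starter value of exactly zero is arbitrarily grouped with the below-average bucket.

```python
def covered(margin, spread):
    """True if the team beat the spread, False if not, None on a push.

    margin: team points minus opponent points.
    spread: closing line for the team (negative = favored).
    """
    result = margin + spread
    if result == 0:
        return None
    return result > 0


def cover_rate_by_starter_quality(games):
    """Cover rate in first starts, split by the replaced starter's value."""
    tallies = {"above_average": [0, 0], "below_average": [0, 0]}  # [covers, decided]
    for starter_value, margin, spread in games:
        outcome = covered(margin, spread)
        if outcome is None:
            continue  # pushes count for neither side
        key = "above_average" if starter_value > 0 else "below_average"
        tallies[key][0] += int(outcome)
        tallies[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in tallies.items()}


# Hypothetical games (not real results).
games = [
    (45.0, 7, -3.0),    # favored by 3, won by 7: cover
    (30.0, -3, 6.5),    # 6.5-point underdog, lost by 3: cover
    (20.0, -10, 3.0),   # 3-point underdog, lost by 10: no cover
    (-25.0, 3, -3.0),   # favored by 3, won by 3: push, excluded
    (-40.0, -14, 7.0),  # 7-point underdog, lost by 14: no cover
    (-10.0, 1, 2.5),    # 2.5-point underdog, won by 1: cover
]
rates = cover_rate_by_starter_quality(games)
print(rates)
```

An efficient market should leave both buckets near 50%, so a persistent gap between them is the market analogue of a correlated error term.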
The ultimate test of “team & scheme” importance would be a true A/B test in which one QB played in two different situations (situation A vs situation B). Of course, it’s not possible to run such an experiment, but we can approximate one (sketchily) by looking at instances where a starting QB changed teams.
Starting QBs are a precious commodity in the NFL, and it’s rare to see them change teams. Furthermore, when teams do bring in a new starter, it’s typically part of a larger reset where the head coach or other players are replaced as well. As a result, there are very few “natural experiments” where a new starter joins a situation where the surrounding context is similar enough to make a point of comparison.
Since 1999, only 19 starting QBs have moved to a team that retained its head coach from the previous season. If “team & scheme” were not important, we would expect the QB’s performance on their previous team to predict their performance on the new team, and we wouldn't expect the previous starter’s performance to predict any of the performance of the new starter.
However, using 538’s era-adjusted QB values to measure performance, we do see that the performance of the previous starter helps us predict the performance of the new starter:
While the previous starter’s performance doesn’t correlate as well as the QB’s own past season, it does correlate strongly (0.459). Furthermore, when we use the previous starter’s performance in a regression to predict the new starter’s performance, it’s a (mildly) statistically significant variable that improves the regression’s fit. The “team & scheme” that the new QB is joining does seem to be fairly important.
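The regression comparison can be sketched with ordinary least squares: fit the new starter’s performance on the QB’s own prior season alone, then add the previous starter’s performance and see whether the fit improves. This is an illustrative sketch rather than the exact regression behind the numbers above. The data are invented, the helper names are hypothetical, and it reports only R-squared, not significance, which matters with a sample as small as 19.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Fit y = b0 + b1*x1 + ... and return (coefficients, r_squared)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot


# Hypothetical era-adjusted values for six QB moves (not real data).
own_prior = [20.0, -10.0, 35.0, 5.0, -25.0, 50.0]      # QB on his old team
prev_starter = [10.0, -30.0, -5.0, 40.0, -20.0, 15.0]  # QB he replaced
new_perf = [18.0, -12.0, 20.0, 17.0, -22.0, 33.0]      # QB on his new team

_, r2_own = ols([own_prior], new_perf)
beta, r2_both = ols([own_prior, prev_starter], new_perf)
print(f"R^2, own prior season only:    {r2_own:.3f}")
print(f"R^2, adding previous starter:  {r2_both:.3f}")
print(f"coefficient on previous starter: {beta[2]:.3f}")
```

Adding a predictor can never lower R-squared in a nested least-squares fit, so the size of the improvement and the stability of the coefficient are what deserve attention, not the mere fact that the fit got better.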
In some sense, there’s no such thing as a bad starting quarterback in the NFL. Almost every NFL quarterback was the best quarterback in the history of their high school or college, and the few “bad” quarterbacks who do make it to the NFL are quickly weeded out by their performance in training camps and games. NFL quarterbacks are elite athletes who, when put in the right situation, can have a standout game or even a standout season. Heck, even Andy Dalton had a season where he competed for the MVP.
It’s convenient to imagine QB performance as an easily discernible individual metric, but in reality, surrounding “team & scheme” are hugely influential. While it’s impossible to precisely measure just how large that influence is, the data presented here suggests it may be more important than we’ve given it credit for in the past. We shouldn’t forget that football is, after all, a team sport.