What a great World Cup! More goals, longer games, shocking defeats, and a thrilling final. And yet, if you actually look at the history of international soccer, you might wonder why anyone turns the TV on at all.
Here’s the thing: there have been 20 World Cups over more than 80 years, with an average of about 21 nations competing per cup (32 nations in the modern version). Yet only twelve countries have even made it to the final, and only eight countries have ever won it. Of those eight, four teams (Brazil, Germany, Italy, and Argentina) account for 75 percent of all the victories, and for MORE THAN HALF OF THE RUNNERS-UP.
To put this in perspective, of the 32 teams in the NFL, all but four have reached the Super Bowl, and more than half of all NFL teams have won the Super Bowl at least once.
So, international soccer is very much defined by a small number of teams of consistent, dramatically higher quality than the rest of the world.
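For anyone who wants to check that concentration, here is a minimal Python sketch. The tallies are typed in by hand from the public record of finals through 2014, with West Germany counted as Germany and Brazil counted as the 1950 runner-up (that tournament ended in a final group rather than a single final); the names `BIG_FOUR` and `share` are just illustrative.

```python
# World Cup results through 2014, tallied by country.
# Assumptions: West Germany is counted as Germany, and Brazil is counted
# as the 1950 runner-up (that edition ended with a final group stage).
winners = {"Brazil": 5, "Italy": 4, "Germany": 4, "Argentina": 2,
           "Uruguay": 2, "England": 1, "France": 1, "Spain": 1}
runners_up = {"Germany": 4, "Argentina": 3, "Netherlands": 3, "Brazil": 2,
              "Italy": 2, "Czechoslovakia": 2, "Hungary": 2,
              "Sweden": 1, "France": 1}

BIG_FOUR = ("Brazil", "Germany", "Italy", "Argentina")


def share(tally, teams):
    """Fraction of all entries in `tally` that belong to `teams`."""
    return sum(tally.get(team, 0) for team in teams) / sum(tally.values())


# Every country that has finished first or second at least once.
finalists = set(winners) | set(runners_up)

print(len(finalists), "countries have finished first or second")
print(f"Big four share of titles: {share(winners, BIG_FOUR):.0%}")
print(f"Big four share of runner-up finishes: {share(runners_up, BIG_FOUR):.0%}")
```

With these tallies, the four teams hold 15 of the 20 titles and 11 of the 20 second-place finishes.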
And yet! It is very hard to create any predictor of individual match outcomes, to the degree that upsets during the tournament are a normal and expected part of the game. Anyone watching the United States come inches away from stealing the match from a clearly superior Belgian team in the round of sixteen can see how an upset might occur.
This is the deep tension of international soccer: a totally dominant club of nations that remain vulnerable to earth-shaking upsets.
Big Data Loves a Challenge Like This!
Many people have tried to figure out how to wade through what is happening here and create predictive tools for figuring out match outcomes. It turns out that this is very hard. Andrew Gelman’s approach is typical of the efforts to make sense of the signal and the noise. He made a valiant effort to try to quantify variance based on a known metric of soccer quality called the Power Index and use it to predict goal differential and final outcomes.
It didn’t work so well, or to be specific, it predicted basically nothing about match outcome. After working through eight iterations of his hypothesis, Andrew decided that it was not a very good way to see what was going on, or, as he puts it: “My model sucks.”
It’s not your fault, Andrew. Big Data often backs itself into a trap: the assumption that, with sufficient understanding of some abstract inputs and outcomes, meaningful results can be had if we just choose the right statistical approach. But it is not so with soccer. Even with all the goals in all the World Cups, you could not get predictive levels of data. Goals are too random and too rare.
Randomness is huge in soccer. Scoring goals is just very, very hard. Even a perfectly set up scoring opportunity has only about a 50/50 chance of ending up in the net.
And talk about rare. Only about 173 goals were scored in this entire World Cup (a 30-year high, by the way), at a blistering pace of just under three goals a match. That gets you a LOT of variance, especially since something like 90% of those goals were the deciding differentiator in a match’s outcome.
So what to do? The point of statistical analysis is to squeeze variance out of an equation, but the essence of measuring performance on goals is that variance is huge no matter what you do.
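To get a feel for how much noise low scoring creates, here is a rough Python sketch. It assumes each side's goals follow independent Poisson distributions, which is a common simplification and not something taken from Gelman's model, and the scoring rates (1.7 and 1.0 goals per match) are made up for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k goals when the average is `rate` per match."""
    return exp(-rate) * rate ** k / factorial(k)


def match_probabilities(rate_strong, rate_weak, max_goals=15):
    """Return (strong side wins, draw, weak side wins).

    Assumes the two sides score independently; scorelines above
    `max_goals` are ignored because their probability is negligible.
    """
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, rate_strong) * poisson_pmf(b, rate_weak)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss


# Hypothetical rates: the stronger side averages 1.7 goals, the weaker 1.0.
win, draw, loss = match_probabilities(1.7, 1.0)
print(f"strong side wins {win:.0%}, draw {draw:.0%}, weak side wins {loss:.0%}")
```

Even with a clear edge in expected goals, the stronger side in this toy setup wins only a little more than half the time. That is the variance problem in miniature.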
And yet, and yet! There is this small club of teams that emerges through the variance to be dominant year after year. They are doing something to win. SOMETHING must be measurable. But what?
This is Where Information Architecture Comes Into Play
IA is, fundamentally speaking, the study of meaning, relationships, and coordination of those relationships. If we just look at goals or the abstract power scores, we are basically measuring an outcome without really getting to the meaning of soccer.
So what is soccer’s meaning, its *ontology*? At a fundamental level, soccer is an exercise in creating statistical chances to score and to prevent scoring. This is its essence. The concepts involve, for example:
- shots on goal
- plays in each third of a field
- certain kinds of plays that are more meaningful than others
- certain players that are more dangerous (or safe) than others
- the state of player fitness, as a function of the time of the game and whether the player is a substitute
The relationships, or *taxonomy*, between these events can often be used to understand how certain events might occur with higher frequency.
Finally, in soccer as in other free-flowing sports, certain sequences of events (their *choreography*) are more dangerous than others. A late-game substitute is more likely to score, as demonstrated brilliantly by German substitute Goetze scoring the World Cup’s final goal on a counterattack against an exhausted Argentinian defense (or, for you U.S. fans, Julian Green’s remarkable goal against Belgium).
The point of all this is that the interchange between meaning, relationships, and sequence is a much more relevant set of factors for determining the quality and outcomes of games.
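To make that a little more concrete, here is one way the three layers might be sketched in Python. Everything in it is hypothetical: the `Event` fields, the `Action` and `Third` categories, and the sample events are illustrative stand-ins, not a real match-data format.

```python
from dataclasses import dataclass
from enum import Enum


class Third(Enum):          # taxonomy: where on the field
    DEFENSIVE = "defensive"
    MIDDLE = "middle"
    ATTACKING = "attacking"


class Action(Enum):         # taxonomy: what kind of play
    PASS = "pass"
    CROSS = "cross"
    TACKLE = "tackle"
    SHOT_ON_GOAL = "shot_on_goal"


@dataclass(frozen=True)
class Event:                # ontology: one meaningful thing that happened
    minute: int
    team: str
    player: str
    action: Action
    third: Third
    is_substitute: bool = False


def possessions(events):
    """Choreography: split a chronological event list into runs by one team."""
    runs = []
    for event in events:
        if runs and runs[-1][-1].team == event.team:
            runs[-1].append(event)
        else:
            runs.append([event])
    return runs


def ends_in_shot(run):
    """True when a run of play finishes with a shot on goal."""
    return run[-1].action is Action.SHOT_ON_GOAL


# Made-up events: a late attack by Team A, then Team B restarting play.
events = [
    Event(111, "Team A", "midfielder", Action.PASS, Third.MIDDLE),
    Event(112, "Team A", "winger", Action.CROSS, Third.ATTACKING, is_substitute=True),
    Event(113, "Team A", "forward", Action.SHOT_ON_GOAL, Third.ATTACKING, is_substitute=True),
    Event(114, "Team B", "defender", Action.PASS, Third.DEFENSIVE),
]

runs = possessions(events)
for run in runs:
    print(run[0].team, [e.action.value for e in run], "shot:", ends_in_shot(run))
```

The point of the structure is that a sequence, not a single event, becomes the thing you count.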
Surprisingly, this is a kind of data that is actually readily available. There are even iPod apps that summarize soccer matches by the acts of their players.
Scoring chances are pretty easy to determine from these kinds of models, and it turns out that the best teams in the world get many more of these chances—and convert on them more frequently—than less successful teams.
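As a small illustration of what counting chances and conversions could look like, here is a Python sketch over an invented chance log. The log format, the team names, and the `chance_summary` helper are assumptions made for the example, not any app's actual output.

```python
from collections import defaultdict

# Hypothetical chance log: one (team, scored) pair per scoring chance.
chances = [
    ("Team A", True), ("Team A", False), ("Team A", False),
    ("Team A", True), ("Team A", False),
    ("Team B", False), ("Team B", True),
]


def chance_summary(chance_log):
    """Count chances and goals per team and compute the conversion rate."""
    totals = defaultdict(lambda: {"chances": 0, "goals": 0})
    for team, scored in chance_log:
        totals[team]["chances"] += 1
        totals[team]["goals"] += int(scored)
    return {
        team: {**t, "conversion": t["goals"] / t["chances"]}
        for team, t in totals.items()
    }


summary = chance_summary(chances)
for team, stats in summary.items():
    print(team, stats)
```

In this toy log Team A creates more than twice as many chances as Team B, which is the kind of gap the chance-based view is meant to surface.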
One thing about this analysis is that it is very POSITIONAL and RELATIONAL: that is, where the ball is on the field at a given point in time, who is going to get it, the angles of the attack, etc. It is overall a more accurate analysis but much more difficult to quantify statistically. However, there are quants who can do it. Many NBA teams use top-down video capture of players and then enter their movements into a computer to determine the best kinds of shots for certain players, and the best defenses to deal with them. Many coaches and players say this approach has completely revolutionized their approach to basketball.
It can also lead us to some powerful insights. If we looked at that data, we would find, for example, that the United States was unbelievably, almost impossibly lucky in this tournament, whereas the Germans were, generally speaking, a little unlucky, with the exception of the two games against the two best opponents they played: France and Argentina.
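One common way to put a number on "lucky" is to compare the goals a team actually scored with the goals its chances were worth, often called expected goals. This is a sketch of that general idea, not necessarily the analysis meant here, and the chance probabilities below are invented.

```python
def luck(goals_scored, chance_probabilities):
    """Goals scored minus expected goals.

    `chance_probabilities` holds an assumed scoring probability for each
    chance a team created. Positive means the team outperformed its
    chances; negative means it underperformed them.
    """
    expected_goals = sum(chance_probabilities)
    return goals_scored - expected_goals


# Hypothetical matches: few, poor chances but three goals...
lucky_side = luck(3, [0.1, 0.2, 0.1, 0.3])
# ...versus many good chances but only one goal.
unlucky_side = luck(1, [0.5, 0.4, 0.3, 0.5, 0.3])

print(f"lucky side: {lucky_side:+.1f} goals versus expectation")
print(f"unlucky side: {unlucky_side:+.1f} goals versus expectation")
```

Summed over a tournament, a large positive number suggests results that ran ahead of the chances created, and a negative one suggests the opposite.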
So Big Data, come back. Create statistical models based on the true Information Architecture of soccer, and I think you’ll find that you get a much better sense of why the best teams are as good as they are. And therein lies so much of the beauty and mystery of this wild, wonderful game.