The 2022 Qatar World Cup is here! It is the most widely watched sporting event in the world, so there is a lot riding on the outcome. But can we predict who will win?
At first, this seems to be a very hard prediction to make. There are 32 teams in the competition and a total of 64 matches to play. Although some teams are favoured, there is also a lot that could go wrong even for the best team, so there really are no guarantees.
But we can simplify the problem if we ask a more basic question: who will win in a match between team A and team B? If we have an answer to this, we can break down the entire tournament and predict the outcome match-by-match.
The Model
So what’s our answer? There are many approaches to take and the outcome of every match could depend on an enormous number of variables: Who is playing? How good are they as individuals? Have they been injured recently? Can they play well as a team? Who is managing them? Can they cope with the weather conditions? But this much detail can be a bit overwhelming, so it’s normally a good idea to start with something simple.
Instead, imagine that every team’s performance is governed by a single number: their overall skill level. We would like a model which simply takes in the skill levels of any two teams and predicts who will win. For example, θ_Germany and θ_Spain would be the skill levels of the German and Spanish national teams. If the German team has a higher skill level than the Spanish team, then we would like our model to predict that Germany is likely to win a match between the two. We therefore choose the following model, in which the probability that team A will beat team B is given by:
Pr(A beats B) = logistic(θ_A - θ_B)
When teams are closely matched then both have an appreciable chance of winning. But as the difference between their skill levels grows, the chance that the weaker team will prevail shrinks quickly to zero.
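To make the shape of this curve concrete, here is a minimal Python sketch of the formula above. The function names and the skill gaps are illustrative assumptions, not values from our trained model.

```python
import math

def logistic(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def win_probability(theta_a: float, theta_b: float) -> float:
    """Pr(A beats B) = logistic(theta_A - theta_B)."""
    return logistic(theta_a - theta_b)

# Hypothetical skill gaps, chosen only to show how the probability changes.
for gap in (0.0, 0.5, 1.0, 2.0, 4.0):
    p = win_probability(gap, 0.0)
    print(f"skill gap {gap:.1f}: stronger team wins with probability {p:.3f}")
```

Two evenly matched teams each get a probability of 0.5, and the two probabilities for any pairing always sum to one because draws are not part of this first model.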
This model is a form of logistic regression and can easily be trained and evaluated inside evoML.
Dataset
To train our model we use a dataset that contains historical information about a huge range of international matches since the 1880s. Each entry of our dataset contains the following information:
- Date of the match, the teams playing, and the location of the match.
- Whether the match was on a neutral pitch.
- The tournament, such as friendly, UEFA Euro, and FIFA World Cup.
- The score of each of the teams.
Preprocessing
Of course, data from the 19th century may not be relevant in predicting the outcome of the upcoming World Cup… Instead, we decided to use only the data belonging to the last 10 years. Furthermore, the tournament, city, and country columns were found to be extremely imbalanced and thus we decided to include only whether the match was on a neutral pitch.
But the most important piece of preprocessing we do is to encode which teams are playing in any given match. We take an approach similar to one-hot encoding. Each team is given its own column in the preprocessed dataset. The home team will be encoded with a +1, the away team will be encoded with a -1 and if a team is not participating in a match then that column will be left with a 0.
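For example, suppose we want to encode three matches, (team_1 vs team_3), (team_3 vs team_1) and (team_2 vs team_4), with the first-named team playing at home. The encoded data would look like the table below:

| match | team_1 | team_2 | team_3 | team_4 |
|---|---|---|---|---|
| team_1 vs team_3 | +1 | 0 | -1 | 0 |
| team_3 vs team_1 | -1 | 0 | +1 | 0 |
| team_2 vs team_4 | 0 | +1 | 0 | -1 |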
With the dataset encoded in this way we can train the model:
Pr(A beats B) = logistic(X·θ)
where X is an encoded row of the dataset and θ is the vector of the teams’ skill levels.
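The encoding and the dot product can be sketched in a few lines of Python. This is an illustration under stated assumptions: the first-named team is taken to be the home side, the skill values in `theta` are invented, and the neutral-pitch column is left out for brevity.

```python
import math

teams = ["team_1", "team_2", "team_3", "team_4"]
# (home, away) pairs; the first-named team is assumed to be the home side.
matches = [("team_1", "team_3"), ("team_3", "team_1"), ("team_2", "team_4")]

def encode_match(home, away, teams):
    """Return a row with +1 for the home team, -1 for the away team, 0 otherwise."""
    row = [0] * len(teams)
    row[teams.index(home)] = 1
    row[teams.index(away)] = -1
    return row

X = [encode_match(home, away, teams) for home, away in matches]

# Hypothetical skill levels, one per team column (illustrative values only).
theta = [1.2, 0.3, 0.8, -0.5]

def home_win_probability(row, theta):
    """logistic(X . theta); the dot product reduces to theta_home - theta_away."""
    score = sum(x * t for x, t in zip(row, theta))
    return 1.0 / (1.0 + math.exp(-score))

for (home, away), row in zip(matches, X):
    print(home, "vs", away, row, round(home_win_probability(row, theta), 3))
```

Because each row has exactly one +1 and one -1, the dot product picks out the difference between the two teams' skill levels, which is why this encoding reproduces the pairwise model introduced earlier.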
Finally, we need to create a suitable target column for the problem we want to solve. At first, we have just a simple question: which of the two teams will win? So the problem is an example of binary classification and we can simply filter out any matches which end in a draw, and add a column with the desired binary target.
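As a rough sketch of this step, the snippet below drops drawn matches and adds a binary label. The field names (`home_score`, `away_score`, `home_win`) and the example rows are assumptions made for illustration, not the dataset's actual schema.

```python
# Hypothetical rows; the field names are assumed, not taken from the real dataset.
results = [
    {"home_team": "team_1", "away_team": "team_3", "home_score": 2, "away_score": 0},
    {"home_team": "team_3", "away_team": "team_1", "home_score": 1, "away_score": 1},
    {"home_team": "team_2", "away_team": "team_4", "home_score": 0, "away_score": 3},
]

def add_binary_target(rows):
    """Drop draws and label the remaining matches: 1 = home win, 0 = away win."""
    labelled = []
    for row in rows:
        if row["home_score"] == row["away_score"]:
            continue  # a draw has no winner, so it is filtered out
        home_win = int(row["home_score"] > row["away_score"])
        labelled.append({**row, "home_win": home_win})
    return labelled

training_rows = add_binary_target(results)
print(training_rows)
```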
Model Training
Now that we have prepared our dataset, we are ready to train a model that will predict the outcome of a match. Although we had logistic regression in mind, inside evoML we can add a wide variety of other approaches to our trials, such as LightGBM, XGBoost, and other tree-based models, which can all be validated, compared and ranked. Despite its simplicity, we find that logistic regression performs best, with an accuracy of 74.8%, and we decide to use it to make our predictions.
Feature Importance – Team Rankings
Logistic regression is a very simple model, but our decision to use it has a key advantage: it is highly explainable. When we train this model it will learn to rank teams based on their skill levels. As explained above, these skill levels are stored inside the model coefficients and inside evoML they can easily be viewed by going to a model’s feature importance page. In the case of our logistic regression model we see the following:
Matching the encoded features to the original names, we get a ranking of the top 10 teams in our dataset:
- Brazil
- Argentina
- Spain
- France
- Belgium
- England
- Germany
- Portugal
- Italy
- Colombia
Our logistic regression model returns a very good estimate of the teams’ skills: 8 of its top 10 teams are also in the top 10 of the FIFA world rankings. Not bad for such a simple approach!
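To show how a ranking falls out of the coefficients, here is a toy stand-in for the training step. It is not evoML's training procedure: it fits one skill per team by plain gradient ascent on an L2-penalised log-likelihood, and the results, learning rate and penalty are all invented for illustration.

```python
import math

teams = ["team_1", "team_2", "team_3", "team_4"]
# Toy results as (home, away, home_win); invented purely for illustration.
results = [
    ("team_1", "team_2", 1), ("team_2", "team_1", 0),
    ("team_1", "team_3", 1), ("team_3", "team_1", 1),  # one upset
    ("team_1", "team_4", 1), ("team_4", "team_1", 0),
    ("team_2", "team_3", 1), ("team_3", "team_2", 0),
    ("team_2", "team_4", 1), ("team_4", "team_2", 0),
    ("team_3", "team_4", 1), ("team_4", "team_3", 0),
]

def fit_skills(results, teams, lr=0.1, l2=0.01, epochs=2000):
    """Fit one skill level per team by gradient ascent on the penalised log-likelihood."""
    index = {team: i for i, team in enumerate(teams)}
    theta = [0.0] * len(teams)
    for _ in range(epochs):
        grad = [-l2 * t for t in theta]  # gradient of the L2 penalty
        for home, away, home_win in results:
            h, a = index[home], index[away]
            p = 1.0 / (1.0 + math.exp(-(theta[h] - theta[a])))
            grad[h] += home_win - p
            grad[a] -= home_win - p
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return dict(zip(teams, theta))

skills = fit_skills(results, teams)
ranking = sorted(skills, key=skills.get, reverse=True)
print(ranking)
```

The small penalty keeps the skill of a team that never wins (or never loses) finite. Sorting the fitted coefficients is all that is needed to turn the model into a ranking.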
Group Stage
Now that we have a model which can predict the outcomes of any pairing of teams, we would like to use it to predict how all the teams will perform in the World Cup. This starts with the group stage in which the teams competing in the tournament are divided into 8 groups of 4. Each group plays a round-robin series of matches and the top two teams are selected for the knockout stage.
Unlike in the knockout stage, a match in the group stage never goes to a penalty shootout. A match between teams A and B therefore has three possible outcomes: team A wins, team B wins, or the match ends in a draw. A win earns a team three points and a loss earns zero, while a draw gives both teams a single point. If two teams are level on points at the end of their group’s matches then they are separated on goal difference.
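The scoring rules above can be sketched as a small function that builds a group table. The team names and scores are invented, and the sketch only applies the two criteria mentioned here (points, then goal difference); the official rules contain further tie-breakers that are not modelled.

```python
from collections import defaultdict

def group_table(results):
    """Rank teams by points, then goal difference.

    `results` holds (home, away, home_goals, away_goals) tuples.
    """
    points = defaultdict(int)
    goal_diff = defaultdict(int)
    for home, away, home_goals, away_goals in results:
        goal_diff[home] += home_goals - away_goals
        goal_diff[away] += away_goals - home_goals
        if home_goals > away_goals:
            points[home] += 3
            points[away] += 0  # registers the losing team in the table
        elif home_goals < away_goals:
            points[away] += 3
            points[home] += 0
        else:
            points[home] += 1
            points[away] += 1
    return sorted(points, key=lambda t: (points[t], goal_diff[t]), reverse=True)

# Invented scores for a hypothetical group of four.
example = [
    ("A", "B", 2, 0), ("C", "D", 1, 1), ("A", "C", 0, 1),
    ("B", "D", 4, 0), ("A", "D", 1, 1), ("B", "C", 2, 2),
]
print(group_table(example))
```

In this example B and A both finish on four points, and B is placed higher because of its better goal difference.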
Predicting the outcome of a group stage match is an example of multi-class classification. Although this situation is slightly more complicated than what we introduced above, all we need to do is supply evoML with a dataset containing our three possible outcomes and it will automatically train a model which can predict their individual probabilities. These probabilities can then be used in a Monte Carlo simulation of the entire group stage.
Monte Carlo Simulation
Now that we’ve mentioned Monte Carlo simulation your eyes may begin to glaze over… But this is actually a crucial tool in making any kind of useful prediction, so you had better pay attention. Monte Carlo methods use random sampling to find approximate numerical solutions to questions in probability theory when an exact solution is impossible. So why do we need it?
Well here’s the problem: There are 48 games in the group stage, each with 3 possible outcomes. That means there are 3^48 different ways the group stage can go! That’s more outcomes than there are grains of sand on Earth. If we only used our model to predict the most likely outcome of each game we would be looking at just a single one of these 3^48 possible futures, and it would almost certainly be wrong and not that useful.
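The size of this number is easy to check, since Python integers have arbitrary precision. The count of 48 games follows from 8 groups with 6 round-robin matches each.

```python
from math import comb

groups = 8
matches_per_group = comb(4, 2)   # every pair of the 4 teams meets once: 6 matches
games = groups * matches_per_group
outcomes_per_game = 3            # team A wins, team B wins, or a draw

total = outcomes_per_game ** games
print(games)             # number of group-stage games
print(total)             # exact number of possible group-stage outcomes
print(f"{total:.2e}")    # the same number in scientific notation
```

The result is roughly 8 × 10^22, far too many scenarios to enumerate one by one.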
But what are we really interested in? We want to know the chances of each team progressing to the knockout stage. So here’s the solution: instead of looking at just the most likely outcome of each game, we take a sample of the different possible outcomes. We can then make thousands of simulations of how the group stage will go and find what percentage of the time each team makes it through to the knockout stage.
For every match, our model gives three probabilities: the probability of team A winning, the probability of team B winning and the probability of a draw. In each simulation we go through all the matches in every group, sample the outcomes from this distribution, add up everyone’s points and build the scoreboard. Now, all we need to do is count how many times each team either wins or is the runner-up in their group. If the teams in second and third place have an equal number of points then we assume each team has a 50% chance of progressing. The results of this simulation are presented in the tables below:
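The procedure just described can be sketched as follows. This is a simplified illustration rather than our actual pipeline: `toy_probs` is a hypothetical placeholder for the trained model, the skill values and the fixed draw probability are invented, and any tie on points is broken uniformly at random (which gives each team a 50% chance when two teams are tied for second place).

```python
import math
import random
from collections import Counter
from itertools import combinations

def simulate_group(teams, outcome_probs, rng):
    """Play one simulated round robin and return the two teams that progress."""
    points = {team: 0 for team in teams}
    for a, b in combinations(teams, 2):
        p_a, p_draw, p_b = outcome_probs(a, b)
        outcome = rng.choices(["a", "draw", "b"], weights=[p_a, p_draw, p_b])[0]
        if outcome == "a":
            points[a] += 3
        elif outcome == "b":
            points[b] += 3
        else:
            points[a] += 1
            points[b] += 1
    # Shuffle before the stable sort so that ties on points are broken at random.
    order = list(teams)
    rng.shuffle(order)
    order.sort(key=lambda team: points[team], reverse=True)
    return order[:2]

def qualification_probabilities(teams, outcome_probs, n_sims=10_000, seed=0):
    """Estimate how often each team finishes in the top two of its group."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sims):
        counts.update(simulate_group(teams, outcome_probs, rng))
    return {team: counts[team] / n_sims for team in teams}

# Placeholder for the trained model: invented skills and a fixed draw probability.
toy_skills = {"A": 1.0, "B": 0.5, "C": 0.0, "D": -0.5}

def toy_probs(a, b):
    p_draw = 0.25
    p_a = (1 - p_draw) / (1 + math.exp(-(toy_skills[a] - toy_skills[b])))
    return p_a, p_draw, 1 - p_draw - p_a

print(qualification_probabilities(list(toy_skills), toy_probs))
```

Since exactly two teams progress in every simulation, the estimated probabilities within a group always add up to 2.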
Probabilities of teams in groups A-D making it to the last 16.
Probabilities of teams in groups E-H making it to the last 16.
For each team, we display the probability predicted by our model that they will be placed either 1st or 2nd in their group. These probabilities are a key additional piece of information in any prediction. For example, if we want to place a bet on any of these matches it is not good enough to just know what the most likely outcome is. We also need to think about how confident we are in our prediction and weight the amount of money we put at risk accordingly, otherwise in the long run we are almost guaranteed to lose money.
If you’re not gambling then you don’t need to worry about that. TL;DR here’s who we think will make the cut!
Out of the top 16 teams in the FIFA world rankings, we predict that 14 will reach the knockout stage, with the unfortunate exceptions of Italy (who did not qualify) and the United States (beaten by Iran?). Let’s see about that.
Knockout Stage
Next comes the knockout stage, which is very tense, but at least we don’t need to worry about draws anymore. All we need to do is sample some of the outcomes we predicted for the group stage, go step-by-step through the 16 matches of the knockout stage, and then we are done! We have predicted the entire World Cup using Monte Carlo methods and a model provided by evoML.
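A knockout simulation can be sketched in the same spirit. The assumptions here are loud ones: the 16 team names and their skills are invented, the bracket is fixed rather than sampled from the group-stage simulations, and each match is decided directly by the two-outcome model with no separate treatment of extra time or penalties.

```python
import math
import random
from collections import Counter

def play(a, b, skills, rng):
    """Return (winner, loser) of one knockout match; draws are not possible."""
    p_a = 1.0 / (1.0 + math.exp(-(skills[a] - skills[b])))
    return (a, b) if rng.random() < p_a else (b, a)

def simulate_knockout(bracket, skills, rng):
    """Run a 16-team single-elimination bracket.

    Returns (champion, runner_up, third); neighbours in `bracket` meet first.
    """
    teams = list(bracket)
    semi_losers, runner_up = [], None
    while len(teams) > 1:
        results = [play(teams[i], teams[i + 1], skills, rng)
                   for i in range(0, len(teams), 2)]
        if len(teams) == 4:
            semi_losers = [loser for _, loser in results]
        if len(teams) == 2:
            runner_up = results[0][1]
        teams = [winner for winner, _ in results]
    third, _ = play(semi_losers[0], semi_losers[1], skills, rng)  # third-place match
    return teams[0], runner_up, third

# Invented teams and skills, for illustration only.
skills = {f"T{i}": 1.5 - 0.2 * i for i in range(1, 17)}
bracket = list(skills)

rng = random.Random(0)
n_sims = 5000
champions = Counter(simulate_knockout(bracket, skills, rng)[0] for _ in range(n_sims))
title_odds = {team: champions[team] / n_sims for team in bracket}
print(sorted(title_odds.items(), key=lambda kv: kv[1], reverse=True)[:4])
```

Each simulated bracket contains 8 + 4 + 2 + 1 matches plus the third-place match, which gives the 16 knockout matches mentioned above.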
Probabilities of teams being in the final 4.
And The Winner Is…
We arrive at the following results: Brazil and France will make it to the final, after beating Argentina and Belgium, respectively, in the semi-finals. Brazil will beat France to win the World Cup, while Argentina will beat Belgium to finish third. That is to say, the most likely outcome predicted by our model is:
As they are historically the best-performing team, it’s not so surprising that the Brazilian team are the favourites this year, but this would be their first triumph since 2002, when they last reached the final! The tournament could also see France’s second title in a row, something which hasn’t happened since 1962 when this was achieved by, er, Brazil again.
So those are our predictions for the 2022 Qatar World Cup. The tournament began on Sunday with a match between the hosts and Ecuador, and concludes on the 18th of December with a final between… Well, we’ll see. Enjoy the tournament!