Scraping Football Data with Beautiful Soup in Python
In this post, we will look at how to scrape the football/soccer results from the Euro 2021 Championships.
We will use the BBC's results pages as the source for our data, and will use Python to extract the data we are interested in - notably, team names and the goals each team has scored.
To fetch the data, we'll use Python's popular requests library, which can be used to send HTTP requests to web endpoints. After fetching this data, we'll use BeautifulSoup to extract the data we need. We will also use some utilities from pandas to make our lives easier - and we will use pandas more when analyzing the data that we collect here!
By the end, we will have gathered every result from the tournament, including results from penalty shootouts!
A Jupyter Notebook with all the code for this post can be found here
If you are unfamiliar with BeautifulSoup, check out this video to learn the fundamentals.
Watch the video for this post below.
Objectives
In this post, you will learn:
- How to use the BeautifulSoup library to parse raw HTML content and extract the data we are interested in.
- How to use the requests library to send simple GET requests to a website/URL.
- How to construct code that fetches and extracts data across multiple pages that we are interested in.
BBC Results Pages
The BBC's URLs for the Euro 2021 results have the following general format.
- https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures/2021-06-11
- https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures/2021-06-12
- ...
- https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures/2021-07-10
- https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures/2021-07-11
Notice that each URL corresponds to a given date - the same base URL is used, followed by the date at the end of the URL. The tournament starts on the 11th of June and ends on the 11th of July.
We need to generate a range of dates, from the start-date until the end-date. We can use the pandas.date_range() function for that. Let's write these requirements into Python code.
# imports required for the tutorial
from bs4 import BeautifulSoup
import pandas as pd
import requests
# set the BBC base url, and the start and end dates
base_url = 'https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures'
start_date = '2021-06-11'
end_date = '2021-07-11'
# use the date_range() function from pandas to generate all the days between
# the start date and end date
tournament_dates = pd.date_range(start_date, end_date)
tournament_dates
>> DatetimeIndex(['2021-06-11', '2021-06-12', '2021-06-13', '2021-06-14',
'2021-06-15', '2021-06-16', '2021-06-17', '2021-06-18',
'2021-06-19', '2021-06-20', '2021-06-21', '2021-06-22',
'2021-06-23', '2021-06-24', '2021-06-25', '2021-06-26',
'2021-06-27', '2021-06-28', '2021-06-29', '2021-06-30',
'2021-07-01', '2021-07-02', '2021-07-03', '2021-07-04',
'2021-07-05', '2021-07-06', '2021-07-07', '2021-07-08',
'2021-07-09', '2021-07-10', '2021-07-11'],
               dtype='datetime64[ns]', freq='D')
We create a date range from the start date until the end date, giving us all days between (and including) those dates. The output of the date_range() call is a DatetimeIndex with an entry for every day in the range.
These can now be appended to our base_url to create all the BBC URLs that we will be scraping - let's do this in a list-comprehension.
urls = [f"{base_url}/{dt.date()}" for dt in tournament_dates]
For each date, we remove the time component (hours, minutes, seconds) by calling .date(), and use an f-string to append the date to the end of the base_url.
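As a quick sanity check, we can inspect the first and last generated URLs and confirm we have one URL per day. Here is a minimal, self-contained sketch using the same base URL and dates as above:

```python
import pandas as pd

base_url = 'https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures'
tournament_dates = pd.date_range('2021-06-11', '2021-07-11')

# .date() drops the time component; the f-string appends the date to the base URL
urls = [f"{base_url}/{dt.date()}" for dt in tournament_dates]

print(urls[0])   # ends with /2021-06-11
print(urls[-1])  # ends with /2021-07-11
print(len(urls)) # 31 days, inclusive of both endpoints
```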
Next, we want to send an HTTP request to the BBC pages to retrieve the page contents. These contents will be passed to the BeautifulSoup object, allowing us to parse the HTML and search the document for the data we want.
We will use the requests library to send HTTP GET requests to each of our URLs.
In accordance with 'ethical scraping' best practices, we'll sleep for 1 second in between each request to the BBC to avoid putting any strain on their servers.
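One way to bundle this polite-fetching pattern is a small helper that performs the GET and then pauses. This is just a sketch, not part of the original tutorial code; the User-Agent string is an invented example, but identifying your scraper is generally considered good practice.

```python
import time

import requests


def polite_get(url: str, delay: float = 1.0) -> requests.Response:
    """Send a GET request, then pause before the caller can issue the next one."""
    # the User-Agent value here is a placeholder - use something that identifies you
    response = requests.get(url, headers={'User-Agent': 'euro2021-results-tutorial'})
    time.sleep(delay)
    return response
```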
To perform a GET request to the first URL (for the first day in the tournament), we would use the following code.
response = requests.get(urls[0])
# decodes and returns HTML as a string - here we get the first 50 characters from the HTML document
response.text[:50]
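Before parsing, it is also worth checking that the request actually succeeded. The response object exposes a status code, and raise_for_status() raises an exception for 4xx/5xx responses. The sketch below constructs Response objects by hand so the behaviour can be shown without any network calls:

```python
import requests

# a 2xx response: raise_for_status() does nothing
ok = requests.Response()
ok.status_code = 200
ok.raise_for_status()

# a 4xx response: raise_for_status() raises requests.HTTPError
missing = requests.Response()
missing.status_code = 404
try:
    missing.raise_for_status()
except requests.HTTPError as err:
    print('request failed:', err)
```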
We can inspect the HTML for this page more clearly using a browser's developer tools. We are interested in getting the result for this URL.
Looking at the browser's developer tools, we can see the result is contained within an HTML article tag with a class of sp-c-fixture.
Let's grab this data using BeautifulSoup. To do this, we pass our response.text to the BeautifulSoup object, and set the parser as the html.parser to indicate that we're parsing HTML text.
soup = BeautifulSoup(response.text, 'html.parser')
We can use BeautifulSoup's search functionality to find the article tag with the class name sp-c-fixture.
The soup.find_all() function is used to search for all these tags - remember, there may be more than one on the page, as some days of football had up to 4 matches.
soup.find_all() returns a list of all the given tags that match the parameters it is given.
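To make this behaviour concrete, here is a tiny self-contained example (the markup is invented, but it mirrors the class we are targeting):

```python
from bs4 import BeautifulSoup

# two matching <article> tags and one that should be ignored
html = """
<article class="sp-c-fixture">game one</article>
<article class="sp-c-fixture">game two</article>
<article class="other">not a fixture</article>
"""
soup = BeautifulSoup(html, 'html.parser')
fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})

print(len(fixtures))     # 2
print(fixtures[0].text)  # game one
```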
Let's now call soup.find_all() to get all the article tags with the class sp-c-fixture.
fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})
The first parameter to soup.find_all() is the HTML tag - in this case, the article tag. The second parameter is a dictionary specifying additional matching criteria - in this case, we want only elements with the class sp-c-fixture.
This will give us a list of results for matches on the given date. For each result that we get, we can now drill down and get the data from the children of each article tag that has been collected.
Each fixture is represented by a Tag object - from there, we can extract the home and away team (and the goals scored) from the fixture using the following selectors.
- .sp-c-fixture__team--home .sp-c-fixture__team-name-trunc class for the home team.
- .sp-c-fixture__team--away .sp-c-fixture__team-name-trunc class for the away team.
- .sp-c-fixture__number--home class for the home team's number of goals.
- .sp-c-fixture__number--away class for the away team's number of goals.
Because we're using CSS selectors, we will use the .select_one() method rather than the .find() method.
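The space in a selector like .sp-c-fixture__team--home .sp-c-fixture__team-name-trunc is a descendant combinator: match an element with the second class anywhere inside an element with the first. The toy fixture below (a simplified assumption of the BBC's markup, not the real page) shows how this drills down to each team name:

```python
from bs4 import BeautifulSoup

# simplified, invented markup mirroring the class structure described above
html = """
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team--home">
    <abbr class="sp-c-fixture__team-name-trunc">Turkey</abbr>
  </span>
  <span class="sp-c-fixture__team--away">
    <abbr class="sp-c-fixture__team-name-trunc">Italy</abbr>
  </span>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')
home = soup.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
away = soup.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
print(home, away)  # Turkey Italy
```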
Let's write code to extract the first result of Turkey 0-3 Italy. Firstly, we'll write a helper method that allows us to pretty-print the result in that format.
def show_result(home, home_goals, away, away_goals) -> str:
    return f"{home} {home_goals} - {away_goals} {away}"
Now, we can extract the data we need using BeautifulSoup. Our fixtures list should have a single element, as there was only one game on the first day of the tournament - so we will index into this list at element zero to get the Tag, and can then extract the data using the class names we identified.
home = fixtures[0].select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
away = fixtures[0].select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
home_goals = fixtures[0].select_one('.sp-c-fixture__number--home').text
away_goals = fixtures[0].select_one('.sp-c-fixture__number--away').text
# call helper method to pretty-print the result with this data
show_result(home, home_goals, away, away_goals)
>> 'Turkey 0 - 3 Italy'
We can see that the correct result has been extracted successfully!
Let's apply this approach to all of the URLs, for each match day, in a loop. This will take a while to run, because we're sending many HTTP requests, and sleeping for a second between requests.
import time

results = []

for url in urls:
    # send the HTTP request, then sleep
    response = requests.get(url)
    time.sleep(1)
    # pass response data to BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # get all fixtures on the page
    fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})
    # iterate over fixtures to extract data and append to results list
    for fixture in fixtures:
        home = fixture.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
        away = fixture.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
        home_goals = fixture.select_one('.sp-c-fixture__number--home').text
        away_goals = fixture.select_one('.sp-c-fixture__number--away').text
        results.append(show_result(home, home_goals, away, away_goals))
Let's look at the first 5 results.
print(results[:5])
This gives the following output.
['Turkey 0 - 3 Italy',
'Wales 1 - 1 Switzerland',
'Denmark 0 - 1 Finland',
'Belgium 3 - 0 Russia',
'Austria 3 - 1 North Macedonia']
This works well, but there is one issue: we have no way of identifying the winner in knockout games that were decided on penalties. Let's find the penalty winners!
Knockout rounds started on the 26th of June, so we'll create a KNOCKOUT_GAMES_START variable using a pandas.Timestamp, and for each game we will check whether it falls on (or after) this date.
If it falls on or after the KNOCKOUT_GAMES_START date, we will search the document for the penalty data - this can be found in the .sp-c-fixture__win-message class.
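A useful property here is that select_one() returns None when nothing matches, so a match that went to penalties can be detected with a simple None check. A toy demonstration, again with simplified invented markup:

```python
from bs4 import BeautifulSoup

drawn = BeautifulSoup(
    '<article><span class="sp-c-fixture__win-message">'
    'Italy win 3-2 on penalties</span></article>',
    'html.parser')
regular = BeautifulSoup('<article></article>', 'html.parser')

print(drawn.select_one('.sp-c-fixture__win-message').text)  # the win message
print(regular.select_one('.sp-c-fixture__win-message'))     # None - no penalties
```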
Let's write code to adapt the loop, to take into account the potential for penalties.
results = []
KNOCKOUT_GAMES_START = pd.Timestamp('2021-06-26')

# we need to redefine our show_result() function to take penalties into account
def show_result(home, home_goals, away, away_goals, pens=None) -> str:
    if pens:
        return f"{home} {home_goals} - {away_goals} {away} ({pens})"
    return f"{home} {home_goals} - {away_goals} {away}"

for url in urls:
    response = requests.get(url)
    time.sleep(1)
    soup = BeautifulSoup(response.text, 'html.parser')
    # get all fixtures on the page
    fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})
    for fixture in fixtures:
        home = fixture.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
        away = fixture.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
        home_goals = fixture.select_one('.sp-c-fixture__number--home').text
        away_goals = fixture.select_one('.sp-c-fixture__number--away').text
        # split off the date from the end of the URL
        game_date = pd.Timestamp(url.split("/")[-1])
        # check whether this date falls within the knockout rounds;
        # if so, check for penalties, append the result, and continue
        if game_date >= KNOCKOUT_GAMES_START:
            pens = fixture.select_one('.sp-c-fixture__win-message')
            if pens is not None:
                results.append(show_result(home, home_goals, away, away_goals, pens.text))
                continue
        # if no penalties, we can use the same code as before (no final argument)
        results.append(show_result(home, home_goals, away, away_goals))
We first define the date on which the knockout games start. Inside the loop, we split the date off the end of each URL and compare it against KNOCKOUT_GAMES_START.
If the game falls on or after the start of the knockout rounds, we need to check whether the match went to penalties. If select_one() finds an element with the win-message class, we pass its text to show_result() as the final argument.
Let's look at the last 5 results from the tournament, 2 of which went to penalties, to see if this works.
print(results[-5:])
This gives the following output.
['Czech Rep 1 - 2 Denmark',
'Ukraine 0 - 4 England',
'Italy 1 - 1 Spain (Italy win 4-2 on penalties)',
'England 2 - 1 Denmark',
'Italy 1 - 1 England (Italy win 3-2 on penalties)']
This gives the correct output, showing the penalty winners where the match ended in a draw. This is good - we've got the data we needed from the page, and are displaying all the accurate results from the tournament!
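Since pandas is already in our toolbox, here is one hedged sketch of how these result strings could be pulled apart for analysis. The regex below assumes the exact "Home X - Y Away (optional penalties message)" format produced by show_result(); it is an illustration, not the modelling approach covered in the next lessons.

```python
import re

import pandas as pd

results = [
    'Turkey 0 - 3 Italy',
    'Italy 1 - 1 England (Italy win 3-2 on penalties)',
]

# lazy team-name groups let multi-word names like 'North Macedonia' match;
# the final group captures an optional parenthesised penalties message
pattern = re.compile(r'^(.*?) (\d+) - (\d+) (.*?)(?: \((.*)\))?$')
rows = [pattern.match(r).groups() for r in results]
df = pd.DataFrame(rows, columns=['home', 'home_goals', 'away_goals', 'away', 'pens'])
print(df)
```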
Summary
In this post, we've used popular Python tools - requests and BeautifulSoup - to successfully gather football result data from Euro 2021 results on the BBC website.
Next Steps
We would like to model this data in a more natural manner, rather than just having strings that represent the results.
In the next lessons, we will look at creating Python classes that model the data. In particular, we will create:
- A Result class to represent a result in the tournament, with attributes for each team and the goals each team scored in the match. We will create methods to determine who won a match, lost a match, whether the result was a draw, and more
- A TeamStat class to represent a team's overall statistics in the tournament (goals scored, goals conceded, games won on penalties, etc).
We will use type-annotated Python 3.7 dataclasses to model the data.
If you enjoyed this post, please subscribe to our YouTube channel and follow us on Twitter to keep up with our new content!
Please also consider buying us a coffee, to encourage us to create more posts and videos!