.. currentmodule:: socceraction.data.opta

=========================
Loading Opta data
=========================
|
|
|
`Opta's event stream data`_ comes in many different flavours. The
:class:`OptaLoader` class provides an API client enabling you to fetch
data from the following data feeds as Pandas DataFrames:

- Opta F1, F9 and F24 JSON feeds
- Opta F7 and F24 XML feeds
- StatsPerform MA1 and MA3 JSON feeds
- WhoScored.com JSON data

Currently, only loading data from local files is supported.
|
|
|
--------------------------
Connecting to a data store
--------------------------

First, you have to create an :class:`OptaLoader` object and configure it
for the data feeds you want to use.
|
|
|
Generic setup
=============

To set up an :class:`OptaLoader` you have to specify the root
directory, the filename hierarchy of the feeds and a parser for each feed.
For example::

    from socceraction.data.opta import OptaLoader, parsers

    api = OptaLoader(
        # Directory that contains the feed files
        root="data/opta",
        # File name pattern of each feed, relative to the root directory
        feeds={
            "f7": "f7-{competition_id}-{season_id}-{game_id}.xml",
            "f24": "f24-{competition_id}-{season_id}-{game_id}.xml",
        },
        # Parser class used for each feed
        parser={
            "f7": parsers.F7XMLParser,
            "f24": parsers.F24XMLParser,
        },
    )
|
|
|
|
|
Since the loader uses the directory structure and file names to determine
which files should be parsed, the root directory must follow the file
hierarchy defined in the ``feeds`` argument. A wide range of file names
and directory structures is supported. However, the competition, season, and
game identifiers must appear in the path of each file (in a directory name or
in the file name) so that the loader can locate the corresponding files for
each entity. For example, you might have grouped feeds by competition and
season as follows::

    root
    ├── competition_<competition_id>
    │   ├── season_<season_id>
    │   │   ├── f7_<game_id>.xml
    │   │   └── f24_<game_id>.xml
    │   └── ...
    └── ...

In this case, you can use the following feeds configuration::

    feeds = {
        "f7": "competition_{competition_id}/season_{season_id}/f7_{game_id}.xml",
        "f24": "competition_{competition_id}/season_{season_id}/f24_{game_id}.xml",
    }
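To make the role of the placeholders concrete, the sketch below shows one way
such a pattern could be matched against a relative file path to recover the
identifiers. It is an illustration only and not the loader's actual
implementation: the helper ``extract_ids`` is hypothetical, the identifiers in
the example path are made up, and the sketch assumes forward slashes and purely
alphanumeric identifiers.

.. code-block:: python

    import re


    def extract_ids(pattern, path):
        """Return the identifiers encoded in ``path``, or None if it does not match.

        Hypothetical helper for illustration; not part of socceraction.
        """
        # re.split with a capturing group alternates literal text (even
        # positions) and placeholder names (odd positions).
        parts = re.split(r"\{(\w+)\}", pattern)
        regex = ""
        for i, part in enumerate(parts):
            if i % 2 == 0:
                # Literal text: escape characters that are special in a regex.
                regex += re.escape(part)
            else:
                # Placeholder: capture a run of letters and digits under its name.
                regex += f"(?P<{part}>[A-Za-z0-9]+)"
        match = re.fullmatch(regex, path)
        return match.groupdict() if match else None


    pattern = "competition_{competition_id}/season_{season_id}/f7_{game_id}.xml"
    # Made-up identifiers, used only to show the mapping.
    ids = extract_ids(pattern, "competition_8/season_2017/f7_918893.xml")
    print(ids)
    # {'competition_id': '8', 'season_id': '2017', 'game_id': '918893'}

A path that lacks one of the components, such as ``competition_8/f7_918893.xml``,
yields ``None``, which is why every identifier has to be present somewhere in
the path.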
|
|
|
.. note::

    On Windows, the backslash character should be used as a path separator.
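If the same configuration has to work on more than one operating system, one
option is to build the patterns from their components instead of hard-coding
a separator. The sketch below assumes that the loader expects the platform's
native separator, as the note suggests; ``feed_pattern`` is a hypothetical
helper and not part of socceraction.

.. code-block:: python

    import os


    def feed_pattern(*parts):
        """Join the components of a feed pattern with the native path separator."""
        return os.sep.join(parts)


    # Same nested layout as above: "\" is used on Windows and "/" elsewhere.
    feeds = {
        "f7": feed_pattern(
            "competition_{competition_id}", "season_{season_id}", "f7_{game_id}.xml"
        ),
        "f24": feed_pattern(
            "competition_{competition_id}", "season_{season_id}", "f24_{game_id}.xml"
        ),
    }

The resulting dictionary can be passed as the ``feeds`` argument in the same way
as a hand-written one.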
|
|
|
Furthermore, a few standard configurations are provided. These are listed below.
|
|
|
|
|
Opta F7 and F24 XML feeds
=========================

.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/opta", parser="xml")

The root directory should have the following structure:

.. code-block::

    root
    ├── f7-{competition_id}-{season_id}-{game_id}.xml
    ├── f24-{competition_id}-{season_id}-{game_id}.xml
    └── ...
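Before loading a game it can be useful to check that all expected files are in
place. The following sketch fills in the ``f7`` and ``f24`` file name patterns
from the generic setup example and reports the files that are missing. The
function ``missing_feed_files`` is a hypothetical helper, not part of
socceraction, and any identifiers passed to it are up to the caller.

.. code-block:: python

    from pathlib import Path

    # File name patterns of the F7 and F24 XML feeds (from the generic setup).
    XML_FEEDS = {
        "f7": "f7-{competition_id}-{season_id}-{game_id}.xml",
        "f24": "f24-{competition_id}-{season_id}-{game_id}.xml",
    }


    def missing_feed_files(root, feeds, competition_id, season_id, game_id):
        """Return the expected feed files of one game that do not exist under ``root``."""
        missing = []
        for pattern in feeds.values():
            # Substitute the identifiers into the pattern to get the expected path.
            path = Path(root) / pattern.format(
                competition_id=competition_id,
                season_id=season_id,
                game_id=game_id,
            )
            if not path.is_file():
                missing.append(path)
        return missing

An empty list means that both feeds of the game were found. Patterns that do not
use every identifier also work, because ``str.format`` ignores unused keyword
arguments.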
|
|
|
|
|
Opta F1, F9 and F24 JSON feeds
==============================

.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/opta", parser="json")

The root directory should have the following structure:

.. code-block::

    root
    ├── f1-{competition_id}-{season_id}.json
    ├── f9-{competition_id}-{season_id}-{game_id}.json
    ├── f24-{competition_id}-{season_id}-{game_id}.json
    └── ...
|
|
|
StatsPerform MA1 and MA3 JSON feeds
===================================

.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/statsperform", parser="statsperform")

The root directory should have the following structure:

.. code-block::

    root
    ├── ma1-{competition_id}-{season_id}.json
    ├── ma3-{competition_id}-{season_id}-{game_id}.json
    └── ...
|
|
|
|
|
WhoScored
=========

`WhoScored.com`_ is a popular website that provides detailed live match statistics.
These statistics are compiled from Opta's event feed, which can be scraped
from the website's source code using a library such as `soccerdata`_. Once you
have downloaded the raw JSON data, you can parse it using the :class:`OptaLoader`
with:

.. code-block:: python

    from socceraction.data.opta import OptaLoader

    api = OptaLoader(root="data/whoscored", parser="whoscored")

The root directory should have the following structure:

.. code-block::

    root
    ├── {competition_id}-{season_id}-{game_id}.json
    └── ...
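To see which downloaded files fit this naming scheme, the pattern can be turned
into a glob by replacing every placeholder with a wildcard. This is a rough
sketch: ``find_feed_files`` is a hypothetical helper, not part of socceraction,
and the wildcard is a loose filter that also accepts names with additional
hyphens.

.. code-block:: python

    import re
    from pathlib import Path

    # File name pattern of the WhoScored layout shown above.
    WHOSCORED_FEED = "{competition_id}-{season_id}-{game_id}.json"


    def find_feed_files(root, pattern):
        """List the files under ``root`` whose relative path fits ``pattern``."""
        # Replace each {placeholder} by "*", e.g. "*-*-*.json".
        glob_pattern = re.sub(r"\{\w+\}", "*", pattern)
        return sorted(Path(root).glob(glob_pattern))

Calling ``find_feed_files("data/whoscored", WHOSCORED_FEED)`` returns the
matching paths in sorted order and skips unrelated files such as notes or
partial downloads with a different extension.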
|
|
|
|
|
Alternatively, the soccerdata library provides a wrapper that immediately
returns an :class:`OptaLoader` object for a scraped dataset.

.. code-block:: python

    import soccerdata as sd

    # Set up a scraper for the 2021/2022 Premier League season
    ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021)
    # Scrape all games and return an OptaLoader object
    api = ws.read_events(output_fmt="loader")

.. warning::

    Scraping data from WhoScored.com violates their terms of service. Legally,
    scraping this data is therefore a grey area. If you decide to use this
    data anyway, this is your own responsibility.
|
|
|
|
|
------------
Loading data
------------

Next, you can load the match event stream data and metadata by calling the
corresponding methods on the :class:`OptaLoader` object.

- :meth:`OptaLoader.competitions()`
- :meth:`OptaLoader.games()`
- :meth:`OptaLoader.teams()`
- :meth:`OptaLoader.players()`
- :meth:`OptaLoader.events()`
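The methods listed above are typically called in sequence, from competitions
down to the events of a single game. The sketch below bundles the per-game
calls in a hypothetical helper, ``load_game``. It assumes that ``games()``
takes a competition and a season identifier and that ``teams()``,
``players()`` and ``events()`` each take a game identifier; the identifiers in
the comments are made up.

.. code-block:: python

    def load_game(api, game_id):
        """Collect the team, player and event tables of a single game.

        ``api`` is a configured loader object. Hypothetical helper for
        illustration; not part of socceraction.
        """
        return {
            "teams": api.teams(game_id),
            "players": api.players(game_id),
            "events": api.events(game_id),
        }


    # Typical use (commented out because it needs socceraction and local data):
    #
    #   from socceraction.data.opta import OptaLoader
    #
    #   api = OptaLoader(root="data/opta", parser="xml")
    #   competitions = api.competitions()
    #   games = api.games(competition_id=8, season_id=2017)  # made-up IDs
    #   tables = load_game(api, game_id=918893)              # made-up ID

Keeping the three tables of a game together in one dictionary makes it easy to
pass them on to later processing steps as a unit.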
|
|
|
.. _Opta's event stream data: https://www.statsperform.com/opta-event-definitions/
.. _soccerdata: https://soccerdata.readthedocs.io/en/latest/datasources/WhoScored.html
.. _WhoScored.com: https://www.whoscored.com/