# Spacy Entity Linker

## Introduction

Spacy Entity Linker is a pipeline component for spaCy that performs linked entity extraction with Wikidata on a given document.
The entity linking system matches potential candidates from each sentence
(subject, object, prepositional phrase, compounds, etc.) against aliases from Wikidata. The package makes it easy to find the
category behind each entity (e.g. "banana" is of type "food" and "Microsoft" is of type "company"). It is therefore useful
for information extraction and labeling tasks.

The package was written before a working entity linking solution existed inside spaCy. Compared to spaCy's own
entity linking system, it has the following advantages:

- no extensive training required (entity matching via a database)
- the knowledge base can be updated dynamically without retraining
- entity categories can be resolved easily
- entities can be grouped by category

It also comes with a number of disadvantages:

- it is slower than the spaCy implementation because it uses a database to find entities
- it is not context-sensitive, because entity disambiguation uses the "max-prior method" (an improved
  method is in progress)
## Use

```python
import spacy  # version 3.0.6

# initialize language model
nlp = spacy.load("en_core_web_md")

# add pipeline (declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)

doc = nlp("I watched the Pirates of the Caribbean last silvester")

# returns all entities in the whole document
all_linked_entities = doc._.linkedEntities

# iterates over sentences and prints linked entities
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

# https://www.wikidata.org/wiki/Q194318 Pirates of the Caribbean Series of fantasy adventure films
# https://www.wikidata.org/wiki/Q12525597 Silvester the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)
```
### EntityCollection

An <code>EntityCollection</code> contains an array of entity elements. It can be accessed like an array, but also implements the following helper
functions:

- <code>pretty_print()</code> prints out information about all contained entities
- <code>print_super_entities()</code> groups and prints all entities by their super class

```python
doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
doc._.linkedEntities.print_super_entities()
# human (3) : Elon Musk,Bill Gates,Steve Jobs
# country (2) : South Africa,United States of America
# sovereign state (2) : South Africa,United States of America
# federal state (1) : United States of America
# constitutional republic (1) : United States of America
# democratic republic (1) : United States of America
```
### EntityElement

Each linked entity is an object of type <code>EntityElement</code>. Each entity provides the following methods:

- <code>get_description()</code> returns the description from Wikidata
- <code>get_id()</code> returns the Wikidata ID
- <code>get_label()</code> returns the Wikidata label
- <code>get_span()</code> returns the span from the spaCy document that contains the linked entity
- <code>get_url()</code> returns the URL of the corresponding Wikidata item
- <code>pretty_print()</code> prints out information about the entity element
- <code>get_sub_entities(limit=10)</code> returns an EntityCollection of all entities that derive from the current
  EntityElement (e.g. fruit -> apple, banana, etc.)
- <code>get_super_entities(limit=10)</code> returns an EntityCollection of all entities that the current EntityElement
  derives from (e.g. New England Patriots -> Football Team)
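
The relationship between sub and super entities can be pictured as a walk over directed edges. The following is a minimal sketch in plain Python, not the package's implementation: the `STATEMENTS` list is hand-written illustration data, the helper names `super_entities` and `sub_entities` are hypothetical, and it is an assumption that the real methods follow "instance of" and "subclass of" edges in this way.

```python
# Illustration data only: hand-written (source, property, target) edges.
INSTANCE_OF = 31   # Wikidata property P31
SUBCLASS_OF = 279  # Wikidata property P279

STATEMENTS = [
    ("New England Patriots", INSTANCE_OF, "American football team"),
    ("Dallas Cowboys", INSTANCE_OF, "American football team"),
    ("American football team", SUBCLASS_OF, "sports team"),
    ("apple", SUBCLASS_OF, "fruit"),
    ("banana", SUBCLASS_OF, "fruit"),
]


def super_entities(item, statements=STATEMENTS, limit=10):
    """Return up to `limit` targets of edges starting at `item`."""
    return [target for source, _prop, target in statements if source == item][:limit]


def sub_entities(item, statements=STATEMENTS, limit=10):
    """Return up to `limit` sources of edges pointing at `item`."""
    return [source for source, _prop, target in statements if target == item][:limit]


print(super_entities("New England Patriots"))  # ['American football team']
print(sub_entities("fruit"))                   # ['apple', 'banana']
```

The `limit` argument only truncates the result list, mirroring the `limit=10` default shown in the method signatures above.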
## Example

In the following example we use SpacyEntityLinker to find the football team mentioned in our text and explore
other football teams of the same type.

```python
doc = nlp("I follow the New England Patriots")

patriots_entity = doc._.linkedEntities[0]
patriots_entity.pretty_print()
# https://www.wikidata.org/wiki/Q193390
# New England Patriots
# National Football League franchise in Foxborough, Massachusetts

football_team_entity = patriots_entity.get_super_entities()[0]
football_team_entity.pretty_print()
# https://www.wikidata.org/wiki/Q17156793
# American football team
# organization, in which a group of players are organized to compete as a team in American football

for child in football_team_entity.get_sub_entities(limit=32):
    print(child)
# New Orleans Saints
# New York Giants
# Pittsburgh Steelers
# New England Patriots
# Indianapolis Colts
# Miami Seahawks
# Dallas Cowboys
# Chicago Bears
# Washington Redskins
# Green Bay Packers
# ...
```
### Entity Linking Policy

Currently, the only method for choosing an entity among several possible matches (e.g. Paris the city vs. Paris the
first name) is max-prior: the candidate that is most likely a priori is chosen, regardless of the surrounding context.
This method achieves around 70% accuracy at predicting the correct entities behind link
descriptions on Wikipedia.
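
As a rough illustration of the max-prior idea, the sketch below picks the candidate with the highest prior score. It is not the package's code: the `Candidate` class, the `max_prior` helper, the item IDs and the prior values are all made up for this example, and how the package actually computes its prior is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: int   # hypothetical ID, not a real Wikidata ID
    label: str
    prior: float   # assumed popularity score; the real prior may differ


def max_prior(candidates):
    """Return the candidate with the highest prior, or None if there are none."""
    return max(candidates, key=lambda c: c.prior, default=None)


paris_candidates = [
    Candidate(item_id=1, label="Paris (city)", prior=0.92),
    Candidate(item_id=2, label="Paris (first name)", prior=0.08),
]

best = max_prior(paris_candidates)
print(best.label)  # Paris (city)
```

Because the sentence never enters the decision, "Paris" resolves to the same candidate in every text, which is the lack of context sensitivity listed among the disadvantages.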
## Note

The Entity Linker is still experimental in its current state and should not be used in production.
## Performance

The current implementation supports only SQLite. This is advantageous for development because it does not require
any special setup or configuration. However, for more performance-critical use cases, a different database with
in-memory access (e.g. Redis) should be used. This may be implemented in the future.
## Installation

To install the package, run: <code>pip install spacy-entity-linker</code>

Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done by calling

<code>python -m spacy_entity_linker "download_knowledge_base"</code>

This will download and extract a ~500 MB file that contains a preprocessed version of Wikidata.
## Data

The knowledge base was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data

It was cleaned and post-processed, including filtering out entities of "overrepresented" categories such as

* villages in China
* train stations
* stars in the galaxy
* etc.

The purpose of the knowledge base cleaning was to reduce the knowledge base size while keeping the most useful entities for general-purpose applications.

Currently, the only way to change the knowledge base is a bit hacky and requires replacing or modifying the underlying SQLite database. You will find it under <code>site_packages/data_spacy_entity_linker/wikidb_filtered.db</code>. The database contains 3 tables:

* <b>aliases</b>
    * en_alias (English alias)
    * en_alias_lowercase (English alias, lowercased)
* <b>joined</b>
    * en_label (label of the Wikidata item)
    * views (number of views of the corresponding Wikipedia page in a given period of time)
    * inlinks (number of inlinks to the corresponding Wikipedia page)
    * item_id (Wikidata ID)
    * description (description of the Wikidata item)
* <b>statements</b>
    * source_item_id (references item_id)
    * target_item_id (references item_id)
    * edge_property_id
        * 279=subclass of (https://www.wikidata.org/wiki/Property:P279)
        * 31=instance of (https://www.wikidata.org/wiki/Property:P31)
        * 361=part of (https://www.wikidata.org/wiki/Property:P361)
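
To make the table layout more concrete, the sketch below rebuilds the three tables in an in-memory SQLite database and runs two typical lookups. Several things are assumptions made for illustration: the column types, the `item_id` column on `aliases` (needed here to join an alias to its item, but not listed above), storing IDs as integers without the `Q` prefix, and the `views`/`inlinks` numbers, which are made up. The helpers `lookup_alias` and `super_classes` are hypothetical and not part of the package.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE aliases (
    item_id INTEGER,            -- assumed column, used to join to joined.item_id
    en_alias TEXT,
    en_alias_lowercase TEXT
);
CREATE TABLE joined (
    item_id INTEGER PRIMARY KEY,
    en_label TEXT,
    views INTEGER,
    inlinks INTEGER,
    description TEXT
);
CREATE TABLE statements (
    source_item_id INTEGER,
    edge_property_id INTEGER,
    target_item_id INTEGER
);
""")

# Labels and IDs are taken from the example above; views and inlinks are made up.
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?, ?, ?)",
    [
        (193390, "New England Patriots", 1000, 500,
         "National Football League franchise in Foxborough, Massachusetts"),
        (17156793, "American football team", 100, 50,
         "organization, in which a group of players are organized to compete "
         "as a team in American football"),
    ],
)
conn.execute("INSERT INTO aliases VALUES (193390, 'Patriots', 'patriots')")
conn.execute("INSERT INTO statements VALUES (193390, 31, 17156793)")  # instance of


def lookup_alias(conn, text):
    """Find items whose lowercased alias matches `text`, most viewed first."""
    return conn.execute(
        "SELECT j.item_id, j.en_label FROM aliases a "
        "JOIN joined j ON j.item_id = a.item_id "
        "WHERE a.en_alias_lowercase = ? ORDER BY j.views DESC",
        (text.lower(),),
    ).fetchall()


def super_classes(conn, item_id):
    """Follow 'instance of' (31) and 'subclass of' (279) edges one step up."""
    return conn.execute(
        "SELECT j.en_label FROM statements s "
        "JOIN joined j ON j.item_id = s.target_item_id "
        "WHERE s.source_item_id = ? AND s.edge_property_id IN (31, 279)",
        (item_id,),
    ).fetchall()


print(lookup_alias(conn, "Patriots"))  # [(193390, 'New England Patriots')]
print(super_classes(conn, 193390))     # [('American football team',)]
```

Inspecting the real schema first (for example with <code>PRAGMA table_info(aliases)</code>) is advisable before modifying the shipped database, since the actual columns may differ from this sketch.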
## Versions
- <code>spacy_entity_linker>=0.0</code> (requires <code>spacy>=2.2,<3.0</code>) |
- <code>spacy_entity_linker>=1.0</code> (requires <code>spacy>=3.0</code>) |
## TODO |
- [ ] implement Entity Classifier based on sentence embeddings for improved accuracy |
- [ ] implement get_picture_urls() on EntityElement |
- [ ] retrieve statements for each EntityElement (inlinks + outlinks) |