update README for new version
README.md
CHANGED
@@ -1,66 +1,74 @@
|
|
1 |
# Spacy Entity Linker
|
2 |
|
3 |
## Introduction
|
4 |
-
|
5 |
-
a given Document.
|
6 |
The Entity Linking System operates by matching potential candidates from each sentence
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
|
11 |
-
The package was written before a working Linked Entity Solution existed inside spaCy. In comparison to spaCy's linked entity system, it has the following advantages:
|
12 |
- no extensive training required (entity-matching via database)
|
13 |
- knowledge base can be dynamically updated without retraining
|
14 |
- entity categories can be easily resolved
|
15 |
- grouping entities by category
|
16 |
|
17 |
It also comes with a number of disadvantages:
|
18 |
-
- it is slower than the spaCy implementation due to the use of a database for finding entities
|
19 |
-
- no context sensitivity due to the implementation of the "max-prior method" for entity disambiguation (an improved method for this is in progress)
|
20 |
|
21 |
|
22 |
## Use
|
|
|
23 |
```python
|
24 |
-
import spacy
|
25 |
|
26 |
-
#initialize language model
|
27 |
-
nlp = spacy.load("
|
28 |
|
29 |
-
#add pipeline (declared through entry_points in setup.py)
|
30 |
nlp.add_pipe("entityLinker", last=True)
|
31 |
|
32 |
-
doc = nlp("I watched the Pirates of the
|
33 |
|
34 |
-
#returns all entities in the whole document
|
35 |
-
all_linked_entities=doc._.linkedEntities
|
36 |
-
#iterates over sentences and prints linked entities
|
37 |
for sent in doc.sents:
|
38 |
    sent._.linkedEntities.pretty_print()
|
39 |
-
|
40 |
-
#OUTPUT:
|
41 |
-
#https://www.wikidata.org/wiki/Q194318
|
42 |
-
|
43 |
|
44 |
```
|
45 |
|
46 |
### EntityCollection
|
47 |
-
|
48 |
-
helper
|
|
|
|
|
49 |
- <code>pretty_print()</code> prints out information about all contained entities
|
50 |
- <code>print_super_entities()</code> groups and prints all entities by their super class
|
51 |
|
52 |
```python
|
53 |
doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
|
54 |
doc._.linkedEntities.print_super_entities()
|
55 |
-
#OUTPUT:
|
56 |
-
#human (3) : Elon Musk,Bill Gates,Steve Jobs
|
57 |
-
#country (2) : South Africa,United States of America
|
58 |
-
#sovereign state (2) : South Africa,United States of America
|
59 |
-
#federal state (1) : United States of America
|
60 |
-
#constitutional republic (1) : United States of America
|
61 |
-
#democratic republic (1) : United States of America
|
62 |
```
|
|
|
63 |
### EntityElement
|
|
|
64 |
Each linked entity is an object of type <code>EntityElement</code>. Each entity provides the following methods:
|
65 |
|
66 |
- <code>get_description()</code> returns description from Wikidata
|
@@ -69,73 +77,96 @@ each linked Entity is an object of type <code>EntityElement</code>. Each entity
|
|
69 |
- <code>get_span()</code> returns the span from the spacy document that contains the linked entity
|
70 |
- <code>get_url()</code> returns the url to the corresponding Wikidata item
|
71 |
- <code>pretty_print()</code> prints out information about the entity element
|
72 |
-
- <code>get_sub_entities(limit=10)</code> returns EntityCollection of all entities that derive from the current
|
73 |
-
|
|
|
|
|
74 |
|
75 |
## Example
|
76 |
-
|
77 |
-
|
|
|
78 |
|
79 |
```python
|
80 |
|
81 |
doc = nlp("I follow the New England Patriots")
|
82 |
|
83 |
-
patriots_entity=doc._.linkedEntities[0]
|
84 |
patriots_entity.pretty_print()
|
85 |
-
#OUTPUT:
|
86 |
-
#https://www.wikidata.org/wiki/Q193390
|
87 |
-
#193390
|
88 |
-
#New England Patriots
|
89 |
-
#National Football League franchise in Foxborough, Massachusetts
|
90 |
|
91 |
-
football_team_entity=patriots_entity.get_super_entities()[0]
|
92 |
football_team_entity.pretty_print()
|
93 |
-
#OUTPUT:
|
94 |
-
#https://www.wikidata.org/wiki/Q17156793
|
95 |
-
#17156793
|
96 |
-
#American football team
|
97 |
-
#organization, in which a group of players are organized to compete as a team in American football
|
98 |
|
99 |
|
100 |
for child in football_team_entity.get_sub_entities(limit=32):
|
101 |
-
|
102 |
-
|
103 |
-
|
104 |
-
|
105 |
-
|
106 |
-
|
107 |
-
|
108 |
-
|
109 |
-
|
110 |
-
|
111 |
-
|
112 |
-
|
113 |
-
|
114 |
```
|
115 |
|
116 |
### Entity Linking Policy
|
117 |
-
|
118 |
-
the
|
|
|
|
|
119 |
|
120 |
## Note
|
|
|
121 |
The Entity Linker is still experimental in its current state and should not be used in production.
|
122 |
|
123 |
## Performance
|
124 |
-
|
125 |
-
|
126 |
-
|
|
|
127 |
|
128 |
## Installation
|
129 |
|
130 |
To install the package run: <code>pip install spacy-entity-linker</code>
|
131 |
|
132 |
-
Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done by calling
|
133 |
|
134 |
<code>python -m spacy_entity_linker "download_knowledge_base"</code>
|
135 |
|
136 |
This will download and extract a ~500 MB file that contains a preprocessed version of Wikidata.
|
137 |
|
138 |
## TODO
|
139 |
-
|
|
|
140 |
- [ ] implement get_picture_urls() on EntityElement
|
141 |
- [ ] retrieve statements for each EntityElement (inlinks + outlinks)
|
|
|
1 |
# Spacy Entity Linker
|
2 |
|
3 |
## Introduction
|
4 |
+
|
5 |
+
Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on a given Document.
|
6 |
The Entity Linking System operates by matching potential candidates from each sentence
|
7 |
+
(subject, object, prepositional phrase, compounds, etc.) to aliases from Wikidata. The package makes it easy to find the
|
8 |
+
category behind each entity (e.g. "banana" is of type "food" and "Microsoft" is of type "company"). It is therefore useful
|
9 |
+
for information extraction tasks and labeling tasks.
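
As a rough illustration of the alias matching described above, the following sketch looks up candidate phrases in a small in-memory alias table. The table, the `find_candidates` helper and the normalization (strip and lower-case) are assumptions made for this example and are not taken from the package; the two IDs are the ones that appear in this README's example outputs.

```python
# Hypothetical alias table: lower-cased alias -> list of (entity_id, label).
# The package itself reads aliases from its Wikidata knowledge base instead.
ALIASES = {
    "pirates of the caribbean": [(194318, "Pirates of the Caribbean")],
    "new england patriots": [(193390, "New England Patriots")],
    "patriots": [(193390, "New England Patriots")],
}


def find_candidates(phrases):
    """Return {phrase: [(entity_id, label), ...]} for phrases with a known alias."""
    matches = {}
    for phrase in phrases:
        hits = ALIASES.get(phrase.strip().lower())
        if hits:
            matches[phrase] = hits
    return matches


# In a real pipeline the phrases would come from the parsed sentence.
candidates = find_candidates(["I", "the Patriots", "Patriots", "New England Patriots"])
print(sorted(candidates))  # ['New England Patriots', 'Patriots']
```

Note that "the Patriots" finds no match here because this toy lookup does not strip articles; how the package normalizes phrases is not covered by this sketch.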
|
10 |
+
|
11 |
+
The package was written before a working Linked Entity Solution existed inside spaCy. In comparison to spaCy's linked
|
12 |
+
entity system, it has the following advantages:
|
13 |
|
|
|
14 |
- no extensive training required (entity-matching via database)
|
15 |
- knowledge base can be dynamically updated without retraining
|
16 |
- entity categories can be easily resolved
|
17 |
- grouping entities by category
|
18 |
|
19 |
It also comes with a number of disadvantages:
|
|
|
|
|
20 |
|
21 |
+
- it is slower than the spaCy implementation due to the use of a database for finding entities
|
22 |
+
- no context sensitivity due to the implementation of the "max-prior method" for entity disambiguation (an improved
|
23 |
+
method for this is in progress)
|
24 |
|
25 |
## Use
|
26 |
+
|
27 |
```python
|
28 |
+
import spacy  # version 3.0.6
|
29 |
|
30 |
+
# initialize language model
|
31 |
+
nlp = spacy.load("en_core_web_md")
|
32 |
|
33 |
+
# add pipeline (declared through entry_points in setup.py)
|
34 |
nlp.add_pipe("entityLinker", last=True)
|
35 |
|
36 |
+
doc = nlp("I watched the Pirates of the Caribbean last silvester")
|
37 |
|
38 |
+
# returns all entities in the whole document
|
39 |
+
all_linked_entities = doc._.linkedEntities
|
40 |
+
# iterates over sentences and prints linked entities
|
41 |
for sent in doc.sents:
|
42 |
    sent._.linkedEntities.pretty_print()
|
43 |
+
|
44 |
+
# OUTPUT:
|
45 |
+
# https://www.wikidata.org/wiki/Q194318 194318 Pirates of the Caribbean Series of fantasy adventure films
|
46 |
+
# https://www.wikidata.org/wiki/Q12525597 12525597 Silvester the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)
|
47 |
|
48 |
```
|
49 |
|
50 |
### EntityCollection
|
51 |
+
|
52 |
+
contains an array of entity elements. It can be accessed like an array but also implements the following helper
|
53 |
+
functions:
|
54 |
+
|
55 |
- <code>pretty_print()</code> prints out information about all contained entities
|
56 |
- <code>print_super_entities()</code> groups and prints all entities by their super class
|
57 |
|
58 |
```python
|
59 |
doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
|
60 |
doc._.linkedEntities.print_super_entities()
|
61 |
+
# OUTPUT:
|
62 |
+
# human (3) : Elon Musk,Bill Gates,Steve Jobs
|
63 |
+
# country (2) : South Africa,United States of America
|
64 |
+
# sovereign state (2) : South Africa,United States of America
|
65 |
+
# federal state (1) : United States of America
|
66 |
+
# constitutional republic (1) : United States of America
|
67 |
+
# democratic republic (1) : United States of America
|
68 |
```
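
The grouping shown in the output above can be approximated with a plain dictionary. This is only a sketch of the idea, assuming each entity carries a list of super class labels; the `entities` data is hypothetical and the package's own implementation may differ.

```python
from collections import defaultdict

# Hypothetical (entity label, super classes) pairs, loosely following the output above.
entities = [
    ("Elon Musk", ["human"]),
    ("South Africa", ["country", "sovereign state"]),
    ("Bill Gates", ["human"]),
    ("Steve Jobs", ["human"]),
    ("United States of America", ["country", "sovereign state", "federal state"]),
]

# Collect the labels of all entities under each of their super classes.
groups = defaultdict(list)
for label, super_classes in entities:
    for super_class in super_classes:
        groups[super_class].append(label)

# Largest groups first; ties keep insertion order because sorted() is stable.
lines = [
    f"{super_class} ({len(members)}) : {','.join(members)}"
    for super_class, members in sorted(groups.items(), key=lambda kv: -len(kv[1]))
]
print("\n".join(lines))
# human (3) : Elon Musk,Bill Gates,Steve Jobs
# country (2) : South Africa,United States of America
# sovereign state (2) : South Africa,United States of America
# federal state (1) : United States of America
```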
|
69 |
+
|
70 |
### EntityElement
|
71 |
+
|
72 |
Each linked entity is an object of type <code>EntityElement</code>. Each entity provides the following methods:
|
73 |
|
74 |
- <code>get_description()</code> returns description from Wikidata
|
|
|
77 |
- <code>get_span()</code> returns the span from the spacy document that contains the linked entity
|
78 |
- <code>get_url()</code> returns the url to the corresponding Wikidata item
|
79 |
- <code>pretty_print()</code> prints out information about the entity element
|
80 |
+
- <code>get_sub_entities(limit=10)</code> returns EntityCollection of all entities that derive from the current
|
81 |
+
entityElement (e.g. fruit -> apple, banana, etc.)
|
82 |
+
- <code>get_super_entities(limit=10)</code> returns EntityCollection of all entities that the current entityElement
|
83 |
+
derives from (e.g. New England Patriots -> Football Team)
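
To make the two directions concrete, here is a toy sketch of sub and super entity lookups over a small hand-written hierarchy. The `SUPER` mapping and the two standalone functions are stand-ins invented for this example; they only mimic the names and the `limit` parameter of the methods listed above. Whether the real methods look further than one level is not stated here.

```python
# Toy "derives from" edges: child -> list of parents. Hypothetical data.
SUPER = {
    "apple": ["fruit"],
    "banana": ["fruit"],
    "fruit": ["food"],
    "New England Patriots": ["American football team"],
}


def get_super_entities(name, limit=10):
    """Direct parents of `name`, at most `limit` of them."""
    return SUPER.get(name, [])[:limit]


def get_sub_entities(name, limit=10):
    """Direct children of `name`, at most `limit` of them."""
    children = [child for child, parents in SUPER.items() if name in parents]
    return children[:limit]


print(get_sub_entities("fruit"))                   # ['apple', 'banana']
print(get_super_entities("New England Patriots"))  # ['American football team']
print(get_sub_entities("fruit", limit=1))          # ['apple']
```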
|
84 |
|
85 |
## Example
|
86 |
+
|
87 |
+
In the following example we will use SpacyEntityLinker to find the football team mentioned in our text and explore
|
88 |
+
other football teams of the same type.
|
89 |
|
90 |
```python
|
91 |
|
92 |
doc = nlp("I follow the New England Patriots")
|
93 |
|
94 |
+
patriots_entity = doc._.linkedEntities[0]
|
95 |
patriots_entity.pretty_print()
|
96 |
+
# OUTPUT:
|
97 |
+
# https://www.wikidata.org/wiki/Q193390
|
98 |
+
# 193390
|
99 |
+
# New England Patriots
|
100 |
+
# National Football League franchise in Foxborough, Massachusetts
|
101 |
|
102 |
+
football_team_entity = patriots_entity.get_super_entities()[0]
|
103 |
football_team_entity.pretty_print()
|
104 |
+
# OUTPUT:
|
105 |
+
# https://www.wikidata.org/wiki/Q17156793
|
106 |
+
# 17156793
|
107 |
+
# American football team
|
108 |
+
# organization, in which a group of players are organized to compete as a team in American football
|
109 |
|
110 |
|
111 |
for child in football_team_entity.get_sub_entities(limit=32):
|
112 |
+
    print(child)
|
113 |
+
# OUTPUT:
|
114 |
+
# New Orleans Saints
|
115 |
+
# New York Giants
|
116 |
+
# Pittsburgh Steelers
|
117 |
+
# New England Patriots
|
118 |
+
# Indianapolis Colts
|
119 |
+
# Miami Seahawks
|
120 |
+
# Dallas Cowboys
|
121 |
+
# Chicago Bears
|
122 |
+
# Washington Redskins
|
123 |
+
# Green Bay Packers
|
124 |
+
# ...
|
125 |
```
|
126 |
|
127 |
### Entity Linking Policy
|
128 |
+
|
129 |
+
Currently the only method for choosing an entity given different possible matches (e.g. Paris - city vs Paris -
|
130 |
+
firstname) is max-prior. This method achieves around 70% accuracy on predicting the correct entities behind link
|
131 |
+
descriptions on Wikipedia.
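
A minimal sketch of what max-prior selection means, assuming each alias maps to candidates with a prior that reflects how often the alias refers to that entity. The `CANDIDATES` table, its values and the `link_max_prior` helper are made up for illustration; how the package derives its priors is not described here.

```python
# Hypothetical candidates for the alias "paris": (label, prior). Values are made up.
CANDIDATES = {
    "paris": [
        ("Paris (city)", 0.92),
        ("Paris (first name)", 0.08),
    ],
}


def link_max_prior(alias):
    """Pick the candidate with the highest prior, ignoring the sentence context."""
    options = CANDIDATES.get(alias.lower(), [])
    if not options:
        return None
    return max(options, key=lambda option: option[1])[0]


print(link_max_prior("Paris"))     # Paris (city)
print(link_max_prior("Atlantis"))  # None
```

Because the sentence never enters the decision, a mention that actually refers to the first name would still resolve to the city. This is the missing context sensitivity listed among the disadvantages.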
|
132 |
|
133 |
## Note
|
134 |
+
|
135 |
The Entity Linker is still experimental in its current state and should not be used in production.
|
136 |
|
137 |
## Performance
|
138 |
+
|
139 |
+
The current implementation supports only SQLite. This is advantageous for development because it does not require
|
140 |
+
any special setup or configuration. However, for more performance-critical use cases, a different database with
|
141 |
+
in-memory access (e.g. Redis) should be used. This may be implemented in the future.
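
Until such a backend exists, one possible stopgap is to keep repeated lookups in memory. The sketch below is not the Redis setup mentioned above; it only places the standard library's `functools.lru_cache` in front of a SQLite query. The table name, column names and the `lookup` helper are hypothetical and do not reflect the package's actual schema.

```python
import sqlite3
from functools import lru_cache

# Stand-in database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aliases (alias TEXT, entity_id INTEGER)")
conn.executemany(
    "INSERT INTO aliases VALUES (?, ?)",
    [("new england patriots", 193390), ("pirates of the caribbean", 194318)],
)
conn.execute("CREATE INDEX idx_alias ON aliases (alias)")


@lru_cache(maxsize=10_000)
def lookup(alias):
    """Return entity ids for an alias; repeated aliases are served from memory."""
    rows = conn.execute(
        "SELECT entity_id FROM aliases WHERE alias = ?", (alias.lower(),)
    ).fetchall()
    return tuple(row[0] for row in rows)


print(lookup("New England Patriots"))  # (193390,)
print(lookup("New England Patriots"))  # same result, served from the cache
print(lookup.cache_info().hits)        # 1
```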
|
142 |
|
143 |
## Installation
|
144 |
|
145 |
To install the package run: <code>pip install spacy-entity-linker</code>
|
146 |
|
147 |
+
Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done by calling
|
148 |
|
149 |
<code>python -m spacy_entity_linker "download_knowledge_base"</code>
|
150 |
|
151 |
This will download and extract a ~500 MB file that contains a preprocessed version of Wikidata.
|
152 |
|
153 |
+
## Data
|
154 |
+
The knowledge base was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
|
155 |
+
|
156 |
+
It was cleaned and post-processed, including filtering out entities of "overrepresented" categories such as
|
157 |
+
* villages in China
|
158 |
+
* train stations
|
159 |
+
* stars in the Galaxy
|
160 |
+
* etc.
|
161 |
+
|
162 |
+
The purpose behind the knowledge base cleaning was to reduce the knowledge base size, while keeping the most useful entities for general purpose applications.
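
The kind of category filtering described above could look roughly like the following. This is a sketch under assumptions: the records, the category strings in `OVERREPRESENTED` and the `keep_useful` helper are invented for illustration, and the actual cleaning rules applied to the dataset are not documented here.

```python
# Hypothetical blocklist modeled on the categories named above.
OVERREPRESENTED = {"village in China", "train station", "star"}

# Hypothetical records: (entity label, category).
records = [
    ("New England Patriots", "American football team"),
    ("Some Village", "village in China"),
    ("Some Station", "train station"),
    ("Pirates of the Caribbean", "film series"),
]


def keep_useful(rows, blocked=OVERREPRESENTED):
    """Drop rows whose category is on the blocklist."""
    return [row for row in rows if row[1] not in blocked]


filtered = keep_useful(records)
print(len(records), "->", len(filtered))  # 4 -> 2
```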
|
163 |
+
## Versions
|
164 |
+
|
165 |
+
- <code>spacy_entity_linker>=0.0</code> (requires <code>spacy>=2.2,<3.0</code>)
|
166 |
+
- <code>spacy_entity_linker>=1.0</code> (requires <code>spacy>=3.0</code>)
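
As a small aid for picking the right release, the mapping above can be written as a helper. This is a hypothetical function, not part of the package. The `<1.0` upper bound for spaCy 2.x is an inference from the list above (releases from 1.0 on require spaCy 3) and is not stated explicitly.

```python
def linker_requirement(spacy_version):
    """Map a spaCy version string to a matching spacy_entity_linker requirement."""
    major, minor = (int(part) for part in spacy_version.split(".")[:2])
    if major >= 3:
        return "spacy_entity_linker>=1.0"
    if (major, minor) >= (2, 2):
        # Assumed upper bound: 1.0 and later require spaCy 3.
        return "spacy_entity_linker>=0.0,<1.0"
    raise ValueError(f"no compatible release listed for spaCy {spacy_version}")


print(linker_requirement("3.0.6"))  # spacy_entity_linker>=1.0
print(linker_requirement("2.3.5"))  # spacy_entity_linker>=0.0,<1.0
```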
|
167 |
+
|
168 |
## TODO
|
169 |
+
|
170 |
+
- [ ] implement Entity Classifier based on sentence embeddings for improved accuracy
|
171 |
- [ ] implement get_picture_urls() on EntityElement
|
172 |
- [ ] retrieve statements for each EntityElement (inlinks + outlinks)
|
setup.py
CHANGED
@@ -22,7 +22,7 @@ with open("README.md", "r") as fh:
|
|
22 |
|
23 |
setup(
|
24 |
name='spacy-entity-linker',
|
25 |
-
version='0.0
|
26 |
author='Emanuel Gerber',
|
27 |
author_email='[email protected]',
|
28 |
packages=['spacy_entity_linker'],
|
@@ -32,13 +32,11 @@ setup(
|
|
32 |
"Intended Audience :: Developers",
|
33 |
"Intended Audience :: Science/Research",
|
34 |
"License :: OSI Approved :: MIT License",
|
35 |
-
"Operating System :: POSIX :: Linux",
|
36 |
"Programming Language :: Cython",
|
37 |
"Programming Language :: Python",
|
38 |
-
"Programming Language :: Python ::
|
39 |
-
"Programming Language :: Python ::
|
40 |
-
"Programming Language :: Python :: 3"
|
41 |
-
"Programming Language :: Python :: 3.4"
|
42 |
],
|
43 |
description='Linked Entity Pipeline for spaCy',
|
44 |
long_description=long_description,
|
|
|
22 |
|
23 |
setup(
|
24 |
name='spacy-entity-linker',
|
25 |
+
version='1.0.0',
|
26 |
author='Emanuel Gerber',
|
27 |
author_email='[email protected]',
|
28 |
packages=['spacy_entity_linker'],
|
|
|
32 |
"Intended Audience :: Developers",
|
33 |
"Intended Audience :: Science/Research",
|
34 |
"License :: OSI Approved :: MIT License",
|
|
|
35 |
"Programming Language :: Cython",
|
36 |
"Programming Language :: Python",
|
37 |
+
"Programming Language :: Python :: 3.6",
|
38 |
+
"Programming Language :: Python :: 3.7",
|
39 |
+
"Programming Language :: Python :: 3.8"
|
|
|
40 |
],
|
41 |
description='Linked Entity Pipeline for spaCy',
|
42 |
long_description=long_description,
|