egerber1 committed on
Commit
abba49f
·
1 Parent(s): d6504ae

update README for new version

Files changed (2)
  1. README.md +96 -65
  2. setup.py +4 -6
README.md CHANGED
@@ -1,66 +1,74 @@
1
  # Spacy Entity Linker
2
 
3
  ## Introduction
4
- Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on
5
- a given Document.
6
  The Entity Linking System operates by matching potential candidates from each sentence
7
- (subject, object, prepositional phrase, compounds, etc.) to aliases
8
- from Wikidata. The package makes it easy to find the category behind each entity (e.g. "banana" is of type "food" and "Microsoft" is of type "company"). It
9
- is therefore useful for information extraction and labeling tasks.
 
 
 
10
 
11
- The package was written before a working Linked Entity Solution existed inside spaCy. In comparison to spaCy's linked entity system, it has the following advantages:
12
  - no extensive training required (entity-matching via database)
13
  - knowledge base can be dynamically updated without retraining
14
  - entity categories can be easily resolved
15
  - grouping entities by category
16
 
17
  It also comes along with a number of disadvantages:
18
- - it is slower than the spaCy implementation due to the use of a database for finding entities
19
- - no context sensitivity due to the implementation of the "max-prior method" for entity disambiguation (an improved method for this is in progress)
20
 
 
 
 
21
 
22
  ## Use
 
23
  ```python
24
- import spacy
25
 
26
- #initialize language model
27
- nlp = spacy.load("en_core_web_sm")
28
 
29
- #add pipeline (declared through entry_points in setup.py)
30
  nlp.add_pipe("entityLinker", last=True)
31
 
32
- doc = nlp("I watched the Pirates of the Carribean last silvester")
33
 
34
- #returns all entities in the whole document
35
- all_linked_entities=doc._.linkedEntities
36
- #iterates over sentences and prints linked entities
37
  for sent in doc.sents:
38
      sent._.linkedEntities.pretty_print()
39
-
40
- #OUTPUT:
41
- #https://www.wikidata.org/wiki/Q194318 194318 Pirates of the Caribbean Series of fantasy adventure films
42
- #https://www.wikidata.org/wiki/Q12525597 12525597 Silvester the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)
43
 
44
  ```
45
 
46
  ### EntityCollection
47
- An <code>EntityCollection</code> contains an array of entity elements. It can be accessed like an array but also implements the following
48
- helper functions:
 
 
49
  - <code>pretty_print()</code> prints out information about all contained entities
50
  - <code>print_super_entities()</code> groups and prints all entities by their super class
51
 
52
  ```python
53
  doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
54
  doc._.linkedEntities.print_super_entities()
55
- #OUTPUT:
56
- #human (3) : Elon Musk,Bill Gates,Steve Jobs
57
- #country (2) : South Africa,United States of America
58
- #sovereign state (2) : South Africa,United States of America
59
- #federal state (1) : United States of America
60
- #constitutional republic (1) : United States of America
61
- #democratic republic (1) : United States of America
62
  ```
 
63
  ### EntityElement
 
64
  Each linked entity is an object of type <code>EntityElement</code>. Each entity provides the following methods:
65
 
66
  - <code>get_description()</code> returns description from Wikidata
@@ -69,73 +77,96 @@ each linked Entity is an object of type <code>EntityElement</code>. Each entity
69
  - <code>get_span()</code> returns the span from the spacy document that contains the linked entity
70
  - <code>get_url()</code> returns the url to the corresponding Wikidata item
71
  - <code>pretty_print()</code> prints out information about the entity element
72
- - <code>get_sub_entities(limit=10)</code> returns EntityCollection of all entities that derive from the current entityElement (e.g. fruit -> apple, banana, etc.)
73
- - <code>get_super_entities(limit=10)</code> returns EntityCollection of all entities that the current entityElement derives from (e.g. New England Patriots -> Football Team)
 
 
74
 
75
  ## Example
76
- In the following example we will use SpacyEntityLinker to find the football team mentioned in our text
77
- and explore other football teams of the same type.
 
78
 
79
  ```python
80
 
81
  doc = nlp("I follow the New England Patriots")
82
 
83
- patriots_entity=doc._.linkedEntities[0]
84
  patriots_entity.pretty_print()
85
- #OUTPUT:
86
- #https://www.wikidata.org/wiki/Q193390
87
- #193390
88
- #New England Patriots
89
- #National Football League franchise in Foxborough, Massachusetts
90
 
91
- football_team_entity=patriots_entity.get_super_entities()[0]
92
  football_team_entity.pretty_print()
93
- #OUTPUT:
94
- #https://www.wikidata.org/wiki/Q17156793
95
- #17156793
96
- #American football team
97
- #organization, in which a group of players are organized to compete as a team in American football
98
 
99
 
100
  for child in football_team_entity.get_sub_entities(limit=32):
101
-     print(child)
102
- #OUTPUT:
103
- #New Orleans Saints
104
- #New York Giants
105
- #Pittsburgh Steelers
106
- #New England Patriots
107
- #Indianapolis Colts
108
- #Miami Seahawks
109
- #Dallas Cowboys
110
- #Chicago Bears
111
- #Washington Redskins
112
- #Green Bay Packers
113
- #...
114
  ```
115
 
116
  ### Entity Linking Policy
117
- Currently, the only method for choosing an entity among several possible matches (e.g. Paris the city vs. Paris the first name) is max-prior. This method achieves around 70% accuracy in predicting
118
- the correct entities behind link descriptions on Wikipedia.
 
 
119
 
120
  ## Note
 
121
  The Entity Linker is still experimental in its current state and should not be used in production.
122
 
123
  ## Performance
124
- The current implementation supports only SQLite. This is advantageous for development because
125
- it does not require any special setup or configuration. However, for more performance-critical use cases, a different
126
- database with in-memory access (e.g. Redis) should be used. This may be implemented in the future.
 
127
 
128
  ## Installation
129
 
130
  To install the package run: <code>pip install spacy-entity-linker</code>
131
 
132
- Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done by calling
133
 
134
  <code>python -m spacy_entity_linker "download_knowledge_base"</code>
135
 
136
  This will download and extract a ~500 MB file that contains a preprocessed version of Wikidata.
137
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
138
  ## TODO
139
- - [ ] implement Entity Classifier based on sentence embeddings for improved accuracy
 
140
  - [ ] implement get_picture_urls() on EntityElement
141
  - [ ] retrieve statements for each EntityElement (inlinks + outlinks)
 
1
  # Spacy Entity Linker
2
 
3
  ## Introduction
4
+
5
+ Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on a given Document.
6
  The Entity Linking System operates by matching potential candidates from each sentence
7
+ (subject, object, prepositional phrase, compounds, etc.) to aliases from Wikidata. The package makes it easy to find the
8
+ category behind each entity (e.g. "banana" is of type "food" and "Microsoft" is of type "company"). It is therefore useful
9
+ for information extraction and labeling tasks.
10
+
11
+ The package was written before a working Linked Entity Solution existed inside spaCy. In comparison to spaCy's linked
12
+ entity system, it has the following advantages:
13
 
 
14
  - no extensive training required (entity-matching via database)
15
  - knowledge base can be dynamically updated without retraining
16
  - entity categories can be easily resolved
17
  - grouping entities by category
18
 
19
  It also comes along with a number of disadvantages:
 
 
20
 
21
+ - it is slower than the spaCy implementation due to the use of a database for finding entities
22
+ - no context sensitivity due to the implementation of the "max-prior method" for entity disambiguation (an improved
23
+ method for this is in progress)
24
 
25
  ## Use
26
+
27
  ```python
28
+ import spacy  # version 3.0.6
29
 
30
+ # initialize language model
31
+ nlp = spacy.load("en_core_web_md")
32
 
33
+ # add pipeline (declared through entry_points in setup.py)
34
  nlp.add_pipe("entityLinker", last=True)
35
 
36
+ doc = nlp("I watched the Pirates of the Caribbean last silvester")
37
 
38
+ # returns all entities in the whole document
39
+ all_linked_entities = doc._.linkedEntities
40
+ # iterates over sentences and prints linked entities
41
  for sent in doc.sents:
42
      sent._.linkedEntities.pretty_print()
43
+
44
+ # OUTPUT:
45
+ # https://www.wikidata.org/wiki/Q194318 194318 Pirates of the Caribbean Series of fantasy adventure films
46
+ # https://www.wikidata.org/wiki/Q12525597 12525597 Silvester the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)
47
 
48
  ```
49
 
50
  ### EntityCollection
51
+
52
+ An <code>EntityCollection</code> contains an array of entity elements. It can be accessed like an array but also implements the following helper
53
+ functions:
54
+
55
  - <code>pretty_print()</code> prints out information about all contained entities
56
  - <code>print_super_entities()</code> groups and prints all entities by their super class
57
 
58
  ```python
59
  doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
60
  doc._.linkedEntities.print_super_entities()
61
+ # OUTPUT:
62
+ # human (3) : Elon Musk,Bill Gates,Steve Jobs
63
+ # country (2) : South Africa,United States of America
64
+ # sovereign state (2) : South Africa,United States of America
65
+ # federal state (1) : United States of America
66
+ # constitutional republic (1) : United States of America
67
+ # democratic republic (1) : United States of America
68
  ```
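
The grouped output above can be pictured as inverting a mapping from entities to their super classes. The following is a minimal, self-contained sketch that does not use the package: the `SUPER_CLASSES` table and the `group_by_super_class` and `format_groups` helpers are hypothetical and only mirror part of the output shown above.

```python
from collections import defaultdict

# Hypothetical toy data: entity label -> super-class labels (not read from Wikidata).
SUPER_CLASSES = {
    "Elon Musk": ["human"],
    "Bill Gates": ["human"],
    "Steve Jobs": ["human"],
    "South Africa": ["country", "sovereign state"],
    "United States of America": ["country", "sovereign state", "federal state"],
}


def group_by_super_class(super_classes):
    """Invert the mapping so that each super class lists its member entities."""
    groups = defaultdict(list)
    for entity, classes in super_classes.items():
        for cls in classes:
            groups[cls].append(entity)
    return dict(groups)


def format_groups(groups):
    """Render one line per super class, largest groups first."""
    ordered = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return [f"{cls} ({len(members)}) : {','.join(members)}" for cls, members in ordered]


for line in format_groups(group_by_super_class(SUPER_CLASSES)):
    print(line)
```

Because one entity can have several super classes, it may appear in more than one group, which is why the counts do not add up to the number of entities.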
69
+
70
  ### EntityElement
71
+
72
  Each linked entity is an object of type <code>EntityElement</code>. Each entity provides the following methods:
73
 
74
  - <code>get_description()</code> returns description from Wikidata
 
77
  - <code>get_span()</code> returns the span from the spacy document that contains the linked entity
78
  - <code>get_url()</code> returns the url to the corresponding Wikidata item
79
  - <code>pretty_print()</code> prints out information about the entity element
80
+ - <code>get_sub_entities(limit=10)</code> returns EntityCollection of all entities that derive from the current
81
+ entityElement (e.g. fruit -> apple, banana, etc.)
82
+ - <code>get_super_entities(limit=10)</code> returns EntityCollection of all entities that the current entityElement
83
+ derives from (e.g. New England Patriots -> Football Team)
84
 
85
  ## Example
86
+
87
+ In the following example we will use SpacyEntityLinker to find the football team mentioned in our text and explore
88
+ other football teams of the same type.
89
 
90
  ```python
91
 
92
  doc = nlp("I follow the New England Patriots")
93
 
94
+ patriots_entity = doc._.linkedEntities[0]
95
  patriots_entity.pretty_print()
96
+ # OUTPUT:
97
+ # https://www.wikidata.org/wiki/Q193390
98
+ # 193390
99
+ # New England Patriots
100
+ # National Football League franchise in Foxborough, Massachusetts
101
 
102
+ football_team_entity = patriots_entity.get_super_entities()[0]
103
  football_team_entity.pretty_print()
104
+ # OUTPUT:
105
+ # https://www.wikidata.org/wiki/Q17156793
106
+ # 17156793
107
+ # American football team
108
+ # organization, in which a group of players are organized to compete as a team in American football
109
 
110
 
111
  for child in football_team_entity.get_sub_entities(limit=32):
112
+     print(child)
113
+ # OUTPUT:
114
+ # New Orleans Saints
115
+ # New York Giants
116
+ # Pittsburgh Steelers
117
+ # New England Patriots
118
+ # Indianapolis Colts
119
+ # Miami Seahawks
120
+ # Dallas Cowboys
121
+ # Chicago Bears
122
+ # Washington Redskins
123
+ # Green Bay Packers
124
+ # ...
125
  ```
126
 
127
  ### Entity Linking Policy
128
+
129
+ Currently, the only method for choosing an entity among several possible matches (e.g. Paris the city vs. Paris the
130
+ first name) is max-prior. This method achieves around 70% accuracy in predicting the correct entities behind link
131
+ descriptions on Wikipedia.
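
As a rough illustration of why a prior-only policy ignores context, the sketch below picks, for a given alias, the candidate with the highest prior score regardless of the surrounding sentence. It is a toy model under stated assumptions, not the package's implementation: the `Candidate` type, the `max_prior` helper, and the prior values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    label: str
    description: str
    prior: float  # hypothetical popularity score, e.g. a normalized link count


# Hypothetical alias table; the numbers are illustrative, not real statistics.
ALIASES: Dict[str, List[Candidate]] = {
    "paris": [
        Candidate("Paris", "city", 0.9),
        Candidate("Paris", "first name", 0.1),
    ],
}


def max_prior(alias: str, table: Dict[str, List[Candidate]]) -> Optional[Candidate]:
    """Return the candidate with the highest prior, ignoring sentence context."""
    candidates = table.get(alias.lower(), [])
    return max(candidates, key=lambda c: c.prior, default=None)


# The same candidate is chosen no matter which sentence the alias appears in.
print(max_prior("Paris", ALIASES))
```

In this toy setup a sentence about a person named Paris would still resolve to the city, which is the kind of error a context-sensitive method would aim to avoid.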
132
 
133
  ## Note
134
+
135
  The Entity Linker is still experimental in its current state and should not be used in production.
136
 
137
  ## Performance
138
+
139
+ The current implementation supports only SQLite. This is advantageous for development because it does not require
140
+ any special setup or configuration. However, for more performance-critical use cases, a different database with
141
+ in-memory access (e.g. Redis) should be used. This may be implemented in the future.
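
To make the database lookup concrete, the sketch below runs an alias query against an in-memory SQLite database using Python's standard `sqlite3` module. The `aliases` table, its columns, the `lookup` helper, and the short alias "patriots" are assumptions made for this example and do not describe the schema shipped with the package; the two ids are taken from the example outputs in this README.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; a file path would persist it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aliases (alias TEXT, entity_id INTEGER)")
# An index on the lookup column avoids a full table scan per query.
conn.execute("CREATE INDEX idx_alias ON aliases (alias)")
conn.executemany(
    "INSERT INTO aliases (alias, entity_id) VALUES (?, ?)",
    [
        ("new england patriots", 193390),
        ("patriots", 193390),  # hypothetical short alias
        ("pirates of the caribbean", 194318),
    ],
)
conn.commit()


def lookup(connection, alias):
    """Return all entity ids whose alias matches the lowercased input."""
    rows = connection.execute(
        "SELECT entity_id FROM aliases WHERE alias = ?", (alias.lower(),)
    )
    return [row[0] for row in rows]


print(lookup(conn, "Patriots"))
```

Every candidate span in a sentence triggers at least one such query, so the per-query cost is a plausible place to look when throughput matters.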
142
 
143
  ## Installation
144
 
145
  To install the package run: <code>pip install spacy-entity-linker</code>
146
 
147
+ Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done by calling
148
 
149
  <code>python -m spacy_entity_linker "download_knowledge_base"</code>
150
 
151
  This will download and extract a ~500 MB file that contains a preprocessed version of Wikidata.
152
 
153
+ ## Data
154
+ The knowledge base was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
155
+
156
+ It was cleaned and post-processed, including filtering out entities of "overrepresented" categories such as
157
+ * village in China
158
+ * train stations
159
+ * stars in the Galaxy
160
+ * etc.
161
+
162
+ The purpose behind the knowledge base cleaning was to reduce the knowledge base size, while keeping the most useful entities for general purpose applications.
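
The filtering step described above can be sketched as a simple category-based filter. This is a hypothetical illustration only: the record layout, the `OVERREPRESENTED` set, and the `filter_entities` helper are assumptions, and the actual cleaning procedure is not documented here.

```python
# Hypothetical category labels treated as overrepresented.
OVERREPRESENTED = {"village in China", "train station", "star"}


def filter_entities(entities, excluded_categories):
    """Keep only entities whose categories do not intersect the excluded set."""
    return [
        entity
        for entity in entities
        if not excluded_categories.intersection(entity["categories"])
    ]


# Toy records; labels and categories are illustrative.
entities = [
    {"label": "New England Patriots", "categories": ["American football team"]},
    {"label": "Example Village", "categories": ["village in China"]},
    {"label": "Example Star", "categories": ["star"]},
]

kept = filter_entities(entities, OVERREPRESENTED)
print([entity["label"] for entity in kept])
```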
163
+ ## Versions:
164
+
165
+ - <code>spacy_entity_linker>=0.0</code> (requires <code>spacy>=2.2,<3.0</code>)
166
+ - <code>spacy_entity_linker>=1.0</code> (requires <code>spacy>=3.0</code>)
167
+
168
  ## TODO
169
+
170
+ - [ ] implement Entity Classifier based on sentence embeddings for improved accuracy
171
  - [ ] implement get_picture_urls() on EntityElement
172
  - [ ] retrieve statements for each EntityElement (inlinks + outlinks)
setup.py CHANGED
@@ -22,7 +22,7 @@ with open("README.md", "r") as fh:
22
 
23
  setup(
24
  name='spacy-entity-linker',
25
- version='0.0.6',
26
  author='Emanuel Gerber',
27
  author_email='[email protected]',
28
  packages=['spacy_entity_linker'],
@@ -32,13 +32,11 @@ setup(
32
  "Intended Audience :: Developers",
33
  "Intended Audience :: Science/Research",
34
  "License :: OSI Approved :: MIT License",
35
- "Operating System :: POSIX :: Linux",
36
  "Programming Language :: Cython",
37
  "Programming Language :: Python",
38
- "Programming Language :: Python :: 2",
39
- "Programming Language :: Python :: 2.7",
40
- "Programming Language :: Python :: 3",
41
- "Programming Language :: Python :: 3.4"
42
  ],
43
  description='Linked Entity Pipeline for spaCy',
44
  long_description=long_description,
 
22
 
23
  setup(
24
  name='spacy-entity-linker',
25
+ version='1.0.0',
26
  author='Emanuel Gerber',
27
  author_email='[email protected]',
28
  packages=['spacy_entity_linker'],
 
32
  "Intended Audience :: Developers",
33
  "Intended Audience :: Science/Research",
34
  "License :: OSI Approved :: MIT License",
 
35
  "Programming Language :: Cython",
36
  "Programming Language :: Python",
37
+ "Programming Language :: Python :: 3.6",
38
+ "Programming Language :: Python :: 3.7",
39
+ "Programming Language :: Python :: 3.8"
 
40
  ],
41
  description='Linked Entity Pipeline for spaCy',
42
  long_description=long_description,