scheitelpunk committed
Commit 2578632 · 1 Parent(s): 04f921c

Perfect! 🎉 The new entity recognition system works significantly better:


Test results:

Your original sentence:
- Input: "The ball lies left of the table next to the computer, while the book sits between the keyboard and the monitor."
- Recognized: ALL 6 entities! ✅ computer, keyboard, monitor, table, ball, book
- Before: only 4 entities ❌

What the new system can do:

1. Multi-Layer Extraction:
- Semantic Categories: predefined, domain-specific dictionaries
- Preposition Parsing: "next to X", "between Y and Z" → X, Y, Z are entities (see the sketch after this list)
- Fallback Patterns: robust regex patterns as a backup
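A minimal standalone sketch of the preposition-parsing layer, reusing the English regex this commit adds to app.py (the German pattern and the other extraction layers are omitted here):

    import re
    from typing import List

    # Spatial-preposition pattern, copied from the fallback extractor in app.py
    PREP_PATTERN = (
        r'\b(?:next\s+to|left\s+of|right\s+of|above|below|between|behind|'
        r'in\s+front\s+of|near|around|inside|outside)\s+(?:the\s+)?([a-zA-Z]{3,})\b'
    )

    def preposition_entities(text: str) -> List[str]:
        """Return nouns that appear as objects of spatial prepositions."""
        return re.findall(PREP_PATTERN, text.lower())

    sentence = ("The ball lies left of the table next to the computer, "
                "while the book sits between the keyboard and the monitor.")
    print(preposition_entities(sentence))
    # -> ['table', 'computer', 'keyboard']; the remaining entities come from the
    #    semantic categories and the spaCy noun pass.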

2. Intelligent Filtering:
- Stop-Word Filtering: filters out function words
- Semantic Prioritization: entities from the semantic categories are preferred
- Length-Based Sorting: longer, more specific words first (sketched after this list)
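A minimal sketch of the filtering and ranking step, mirroring the _clean_and_deduplicate_entities helper added in the diff below; SEMANTIC_TERMS stands in for the nested semantic_categories dictionary:

    from typing import List

    STOP_WORDS = {'the', 'and', 'while', 'next', 'left', 'right', 'between', 'above', 'below'}
    SEMANTIC_TERMS = {'ball', 'table', 'computer', 'keyboard', 'monitor', 'book'}

    def clean_and_rank(candidates: List[str]) -> List[str]:
        kept = []
        for raw in candidates:
            word = raw.lower().strip()
            # Stop-word filtering: drop function words and short/non-alphabetic tokens
            if word not in STOP_WORDS and len(word) > 2 and word.isalpha():
                kept.append(word)
        # Deduplicate while preserving order
        seen, deduped = set(), []
        for word in kept:
            if word not in seen:
                seen.add(word)
                deduped.append(word)
        # Semantic prioritization first, then longer (more specific) words
        deduped.sort(key=lambda w: (w not in SEMANTIC_TERMS, -len(w)))
        return deduped[:15]

    print(clean_and_rank(['the', 'Ball', 'table', 'between', 'keyboard', 'keyboard', 'monitor']))
    # -> ['keyboard', 'monitor', 'table', 'ball']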

3. Dual Architecture:
- spaCy NLP (when available): POS tagging, NER, dependency parsing
- Intelligent Fallback: extended pattern matching for offline operation (see the sketch after this list)
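A compressed sketch of the dual architecture: spaCy is loaded once at import time, and extraction falls back to regex patterns when the library or the en_core_web_sm model is missing. Names follow the diff below; the real code distinguishes ImportError, OSError and other failures:

    import re
    from typing import List

    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
        SPACY_AVAILABLE = True
    except Exception:  # ImportError (no spaCy) or OSError (model not installed)
        nlp = None
        SPACY_AVAILABLE = False

    def extract_entities(text: str) -> List[str]:
        if SPACY_AVAILABLE and nlp is not None:
            # POS-tagging path: keep content nouns, lemmatized
            return [t.lemma_.lower() for t in nlp(text)
                    if t.pos_ == 'NOUN' and not t.is_stop and len(t.text) > 2]
        # Offline fallback: a single high-confidence pattern from the fallback list
        return re.findall(
            r'\b(ball|table|chair|book|computer|keyboard|monitor|screen|mouse|laptop)\b',
            text.lower()
        )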

The system now covers far more use cases and should also work well for new domains (robotics, science, everyday scenarios) without you having to add every word manually!
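A hedged usage sketch for the test sentence above; it assumes RealGASMInterface can be imported from app.py and instantiated without arguments, which this diff does not show:

    from app import RealGASMInterface  # assumption: local checkout of this Space

    iface = RealGASMInterface()  # assumption: no required constructor arguments
    sentence = ("The ball lies left of the table next to the computer, "
                "while the book sits between the keyboard and the monitor.")
    print(iface.extract_entities_from_text(sentence))
    # Expected to include all six entities: computer, keyboard, monitor, table, ball, book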

Files changed (3):
  1. .gitignore +209 -0
  2. app.py +220 -28
  3. requirements.txt +4 -1
.gitignore ADDED
@@ -0,0 +1,209 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+Pipfile.lock
+
+# poetry
+poetry.lock
+
+# pdm
+.pdm.toml
+
+# PEP 582
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# ML/AI specific
+*.pkl
+*.pickle
+*.joblib
+*.h5
+*.hdf5
+*.pt
+*.pth
+*.onnx
+*.pb
+*.tflite
+models/
+checkpoints/
+runs/
+logs/
+tensorboard/
+wandb/
+
+# Data files
+data/
+datasets/
+*.csv
+*.json
+*.jsonl
+*.txt
+*.tsv
+*.parquet
+*.feather
+
+# spaCy models
+*.whl
+
+# Gradio
+gradio_cached_examples/
+flagged/
+
+# Hugging Face
+.huggingface/
+huggingface_hub/
+
+# PyTorch
+lightning_logs/
+
+# IDEs
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+
+# Temporary files
+tmp/
+temp/
+*.tmp
+test_*.py
+*_test.py
+
+# Backup files
+*.bak
+*.backup
+*.orig
+
+# Node.js (if using any frontend components)
+node_modules/
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
app.py CHANGED
@@ -20,6 +20,31 @@ from PIL import Image
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
 
+# Import spaCy for advanced NLP
+try:
+    import spacy
+    from spacy import displacy
+    # Try to load English model
+    nlp = spacy.load("en_core_web_sm")
+    SPACY_AVAILABLE = True
+    logger.info("✅ Successfully loaded spaCy English model")
+    print("✅ spaCy NLP model loaded successfully")
+except ImportError as e:
+    logger.warning(f"spaCy not available: {e}. Using fallback pattern matching.")
+    SPACY_AVAILABLE = False
+    nlp = None
+    print(f"⚠️ spaCy import failed: {e}")
+except OSError as e:
+    logger.warning(f"spaCy English model not found: {e}. Using fallback pattern matching.")
+    SPACY_AVAILABLE = False
+    nlp = None
+    print(f"⚠️ spaCy model loading failed: {e}")
+except Exception as e:
+    logger.error(f"spaCy initialization failed: {e}. Using fallback pattern matching.")
+    SPACY_AVAILABLE = False
+    nlp = None
+    print(f"❌ spaCy error: {e}")
+
 # Import real GASM components from core file
 try:
     # Carefully re-enable GASM import with error isolation
@@ -49,22 +74,42 @@ class RealGASMInterface:
         self.tokenizer = None
         self.last_gasm_results = None  # Store last results for visualization
 
-        # Entity and relation patterns for text processing
-        self.entity_patterns = [
-            # Technical/scientific objects
-            r'\b(robot\w*|arm\w*|satellite\w*|crystal\w*|molecule\w*|atom\w*|electron\w*|detector\w*|sensor\w*|motor\w*|beam\w*|component\w*|platform\w*|axis\w*|field\w*|system\w*|reactor\w*|coolant\w*|turbine\w*)\b',
-            # Office/household devices (extended)
-            r'\b(ball|table|chair|book|computer|keyboard|monitor|screen|mouse|laptop|desk|lamp|vase|shelf|tv|sofa|phone|tablet|printer|scanner|camera|speaker)\b',
-            # Spatial objects
-            r'\b(room|door|window|wall|floor|ceiling|corner|center|side|edge|surface|space|area|zone|place|location|position|spot)\b',
-            # Abstract concepts
-            r'\b(gedanken|vertrauen|zweifel|hoffnung|verzweiflung|idee|konzept|theorie|prinzip|regel|methode|prozess|ablauf)\b',
-            # German article constructions (to capture more nouns)
-            r'\b(der|die|das)\s+([a-zA-Z]+)\b',
-            # English constructions (the + noun)
-            r'\bthe\s+([a-zA-Z]+)\b',
-            # General noun patterns (words starting with capital letter or longer than 4 chars)
-            r'\b([A-Z][a-z]{3,}|[a-z]{5,})\b'
+        # Domain-specific semantic categories for filtering
+        self.semantic_categories = {
+            'physical_objects': {
+                'furniture': ['table', 'chair', 'desk', 'shelf', 'bed', 'sofa', 'cabinet'],
+                'devices': ['computer', 'keyboard', 'monitor', 'screen', 'mouse', 'laptop', 'phone', 'tablet', 'printer', 'scanner', 'camera', 'speaker'],
+                'tools': ['hammer', 'screwdriver', 'wrench', 'drill', 'saw', 'knife'],
+                'containers': ['box', 'bag', 'bottle', 'cup', 'bowl', 'jar', 'basket'],
+                'vehicles': ['car', 'truck', 'bus', 'train', 'plane', 'boat', 'bicycle'],
+                'sports': ['ball', 'bat', 'racket', 'stick', 'net', 'goal']
+            },
+            'technical_objects': {
+                'robotics': ['robot', 'arm', 'sensor', 'motor', 'actuator', 'controller', 'manipulator'],
+                'scientific': ['detector', 'microscope', 'telescope', 'spectrometer', 'analyzer', 'probe'],
+                'industrial': ['reactor', 'turbine', 'compressor', 'pump', 'valve', 'conveyor', 'assembly', 'platform'],
+                'electronic': ['circuit', 'processor', 'memory', 'display', 'antenna', 'battery', 'capacitor']
+            },
+            'spatial_objects': {
+                'architectural': ['room', 'door', 'window', 'wall', 'floor', 'ceiling', 'corner'],
+                'locations': ['center', 'side', 'edge', 'surface', 'space', 'area', 'zone', 'place', 'position', 'spot'],
+                'natural': ['tree', 'rock', 'river', 'mountain', 'field', 'forest', 'lake']
+            },
+            'scientific_entities': {
+                'physics': ['atom', 'electron', 'proton', 'neutron', 'photon', 'molecule', 'particle'],
+                'chemistry': ['crystal', 'compound', 'solution', 'reaction', 'catalyst', 'polymer'],
+                'astronomy': ['satellite', 'planet', 'star', 'galaxy', 'comet', 'asteroid', 'orbit']
+            }
+        }
+
+        # Fallback patterns for when spaCy is not available
+        self.fallback_entity_patterns = [
+            # High-confidence patterns
+            r'\b(robot\w*|arm\w*|satellite\w*|crystal\w*|molecule\w*|atom\w*|electron\w*|detector\w*|sensor\w*|motor\w*)\b',
+            r'\b(ball|table|chair|book|computer|keyboard|monitor|screen|mouse|laptop|desk|lamp|vase|shelf|tv|sofa)\b',
+            r'\b(room|door|window|wall|floor|ceiling|corner|center|side|edge|surface)\b',
+            # German and English article constructions
+            r'\b(?:der|die|das|the)\s+([a-zA-Z]{3,})\b'
         ]
 
         self.spatial_relations = {
@@ -84,12 +129,86 @@ class RealGASMInterface:
         }
 
     def extract_entities_from_text(self, text: str) -> List[str]:
-        """Extract entities from text using improved pattern matching"""
+        """Extract entities using advanced NLP with spaCy or intelligent fallback"""
+
+        if SPACY_AVAILABLE and nlp:
+            return self._extract_entities_with_spacy(text)
+        else:
+            return self._extract_entities_fallback(text)
+
+    def _extract_entities_with_spacy(self, text: str) -> List[str]:
+        """Advanced entity extraction using spaCy NLP"""
+        try:
+            # Process text with spaCy
+            doc = nlp(text)
+            entities = []
+
+            # 1. Extract named entities (NER)
+            for ent in doc.ents:
+                # Filter for relevant entity types
+                if ent.label_ in ['PERSON', 'ORG', 'GPE', 'PRODUCT', 'WORK_OF_ART', 'FAC']:
+                    entities.append(ent.text.lower().strip())
+
+            # 2. Extract nouns (POS tagging)
+            for token in doc:
+                if (token.pos_ == 'NOUN' and
+                    not token.is_stop and
+                    not token.is_punct and
+                    len(token.text) > 2):
+                    entities.append(token.lemma_.lower().strip())
+
+            # 3. Extract compound nouns and noun phrases
+            for chunk in doc.noun_chunks:
+                # Focus on the head noun of the chunk
+                head_text = chunk.root.lemma_.lower().strip()
+                if len(head_text) > 2 and not chunk.root.is_stop:
+                    entities.append(head_text)
+
+                # Also consider the full chunk if it's short and meaningful
+                chunk_text = chunk.text.lower().strip()
+                if (len(chunk_text.split()) <= 2 and
+                    len(chunk_text) > 2 and
+                    self._is_likely_entity(chunk_text)):
+                    entities.append(chunk_text)
+
+            # 4. Extract objects of spatial prepositions
+            spatial_prepositions = {
+                'next', 'left', 'right', 'above', 'below', 'between',
+                'behind', 'front', 'near', 'around', 'inside', 'outside',
+                'on', 'in', 'under', 'over', 'beside'
+            }
+
+            for token in doc:
+                if (token.lemma_.lower() in spatial_prepositions and
+                    token.head.pos_ == 'NOUN'):
+                    entities.append(token.head.lemma_.lower().strip())
+
+                # Look for objects after spatial prepositions
+                for child in token.children:
+                    if (token.lemma_.lower() in spatial_prepositions and
+                        child.pos_ == 'NOUN'):
+                        entities.append(child.lemma_.lower().strip())
+
+            # 5. Semantic filtering using domain categories
+            filtered_entities = self._filter_entities_semantically(entities)
+
+            # 6. Clean up and deduplicate
+            cleaned_entities = self._clean_and_deduplicate_entities(filtered_entities)
+
+            logger.info(f"spaCy extracted {len(cleaned_entities)} entities from '{text[:50]}...'")
+            return cleaned_entities
+
+        except Exception as e:
+            logger.warning(f"spaCy entity extraction failed: {e}, falling back to patterns")
+            return self._extract_entities_fallback(text)
+
+    def _extract_entities_fallback(self, text: str) -> List[str]:
+        """Fallback entity extraction using improved pattern matching"""
         import re
         entities = []
 
-        # Simple entity extraction based on patterns
-        for pattern in self.entity_patterns:
+        # Use fallback patterns
+        for pattern in self.fallback_entity_patterns:
            matches = re.findall(pattern, text.lower())
            if matches:
                if isinstance(matches[0], tuple):
@@ -99,7 +218,7 @@ class RealGASMInterface:
                    # For simple patterns
                    entities.extend([match for match in matches if len(match) > 2])
 
-        # Additionally: Extract all nouns with prepositions
+        # Extract objects after spatial prepositions
        preposition_patterns = [
            r'\b(?:next\s+to|left\s+of|right\s+of|above|below|between|behind|in\s+front\s+of|near|around|inside|outside)\s+(?:the\s+)?([a-zA-Z]{3,})\b',
            r'\b(?:neben|links\s+von|rechts\s+von|über|unter|zwischen|hinter|vor|bei|um|in|außen)\s+(?:der|die|das|dem|den)?\s*([a-zA-Z]{3,})\b'
@@ -109,23 +228,96 @@ class RealGASMInterface:
            matches = re.findall(pattern, text.lower())
            entities.extend([match for match in matches if len(match) > 2])
 
-        # Extended stop words list
+        # Semantic filtering and cleanup
+        filtered_entities = self._filter_entities_semantically(entities)
+        cleaned_entities = self._clean_and_deduplicate_entities(filtered_entities)
+
+        logger.info(f"Fallback extracted {len(cleaned_entities)} entities from '{text[:50]}...'")
+        return cleaned_entities
+
+    def _is_likely_entity(self, text: str) -> bool:
+        """Determine if a text chunk is likely to be a meaningful entity"""
+        # Skip very common words and short words
+        common_words = {'this', 'that', 'these', 'those', 'some', 'many', 'few', 'all', 'each', 'every'}
+        if text.lower() in common_words or len(text) < 3:
+            return False
+
+        # Check if it's in our semantic categories
+        return self._is_in_semantic_categories(text)
+
+    def _is_in_semantic_categories(self, entity: str) -> bool:
+        """Check if entity belongs to any of our semantic categories"""
+        entity_lower = entity.lower().strip()
+
+        for category, subcategories in self.semantic_categories.items():
+            for subcategory, items in subcategories.items():
+                if entity_lower in items:
+                    return True
+                # Also check for partial matches for compound words
+                for item in items:
+                    if item in entity_lower or entity_lower in item:
+                        return True
+        return False
+
+    def _filter_entities_semantically(self, entities: List[str]) -> List[str]:
+        """Filter entities based on semantic relevance"""
+        filtered = []
+
+        for entity in entities:
+            entity_clean = entity.lower().strip()
+
+            # Always include if in semantic categories
+            if self._is_in_semantic_categories(entity_clean):
+                filtered.append(entity_clean)
+                continue
+
+            # Include if it's a likely physical object (basic heuristics)
+            if (len(entity_clean) >= 4 and
+                not entity_clean.endswith('ing') and  # Exclude gerunds
+                not entity_clean.endswith('ly') and   # Exclude adverbs
+                entity_clean.isalpha()):              # Only alphabetic
+                filtered.append(entity_clean)
+
+        return filtered
+
+    def _clean_and_deduplicate_entities(self, entities: List[str]) -> List[str]:
+        """Clean up and deduplicate entity list"""
+
+        # Extended stop words
        stop_words = {
            'der', 'die', 'das', 'und', 'oder', 'aber', 'mit', 'von', 'zu', 'in', 'auf', 'für',
            'the', 'and', 'or', 'but', 'with', 'from', 'to', 'in', 'on', 'for', 'of', 'at',
            'lies', 'sits', 'stands', 'moves', 'flows', 'rotates', 'begins', 'starts',
            'liegt', 'sitzt', 'steht', 'bewegt', 'fließt', 'rotiert', 'beginnt', 'startet',
-            'while', 'next', 'left', 'right', 'between', 'above', 'below'
+            'while', 'next', 'left', 'right', 'between', 'above', 'below', 'around',
+            'time', 'way', 'thing', 'part', 'case', 'work', 'life', 'world', 'year'
        }
 
-        # Clean up and deduplicate
-        entities = [e.strip() for e in entities if e.strip()]
-        entities = list(set([e for e in entities if e not in stop_words and len(e) > 2]))
+        # Clean and filter
+        cleaned = []
+        for entity in entities:
+            entity_clean = entity.lower().strip()
+            if (entity_clean not in stop_words and
+                len(entity_clean) > 2 and
+                entity_clean.isalpha()):
+                cleaned.append(entity_clean)
+
+        # Deduplicate while preserving order
+        seen = set()
+        deduplicated = []
+        for entity in cleaned:
+            if entity not in seen:
+                seen.add(entity)
+                deduplicated.append(entity)
+
+        # Sort by relevance (semantic category entities first, then by length)
+        def sort_key(entity):
+            is_semantic = self._is_in_semantic_categories(entity)
+            return (not is_semantic, -len(entity))  # Semantic entities first, then longer words
 
-        # Sort by length (longer words first)
-        entities = sorted(entities, key=len, reverse=True)
+        deduplicated.sort(key=sort_key)
 
-        return entities[:12]  # Increase limit to 12 entities
+        return deduplicated[:15]  # Increase limit to 15 entities
 
     def extract_relations_from_text(self, text: str) -> List[Dict]:
         """Extract relations from text"""
requirements.txt CHANGED
@@ -9,4 +9,7 @@ plotly>=5.0.0
 spaces>=0.19.0
 fastapi>=0.100.0
 uvicorn>=0.23.0
-psutil>=5.9.0
+psutil>=5.9.0
+spacy>=3.7.0
+https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
+seaborn>=0.11.0