maksymdolgikh committed
Commit 473b250
1 Parent(s): 552ef3e

initial commit

Files changed (4):
  1. README.md +165 -6
  2. app.py +11 -0
  3. requirements.txt +2 -0
  4. seqeval_with_fbetal.py +179 -0
README.md CHANGED
@@ -1,13 +1,172 @@
  ---
- title: Seqeval With Fbeta
- emoji: 🦀
- colorFrom: yellow
  colorTo: red
  sdk: gradio
- sdk_version: 4.19.1
  app_file: app.py
  pinned: false
- license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: seqeval
+ emoji: 🤗
+ colorFrom: blue
  colorTo: red
  sdk: gradio
+ sdk_version: 3.19.1
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
+ description: >-
+   seqeval is a Python framework for sequence labeling evaluation.
+   seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
+
+   This is well-tested by using the Perl script conlleval, which can be used for
+   measuring the performance of a system that has processed the CoNLL-2000 shared task data.
+
+   seqeval supports the following formats:
+   IOB1
+   IOB2
+   IOE1
+   IOE2
+   IOBES
+
+   See the README.md file at https://github.com/chakki-works/seqeval for more information.
  ---

+ # Metric Card for seqeval
+
+ A modified version of the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) metric that includes an optional Fβ score.
+
+ ## Metric description
+
+ seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
+
+
+ ## How to use
+
+ Seqeval produces labelling scores along with their sufficient statistics from a source against one or more references.
+
+ It takes two mandatory arguments:
+
+ `predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
+
+ `references`: a list of lists of reference labels, i.e. the ground truth/target values.
+
+ It can also take several optional arguments:
+
+ `beta`: the weight `beta` of the Fβ score. The default value is `1.0`.
+
+ `suffix` (boolean): `True` if the IOB tag is a suffix (after type) instead of a prefix (before type), `False` otherwise. The default value is `False`, i.e. the IOB tag is a prefix (before type).
+
+ `scheme`: the target tagging scheme, which can be one of [`IOB1`, `IOB2`, `IOE1`, `IOE2`, `IOBES`, `BILOU`]. The default value is `None`.
+
+ `mode`: whether to count correct entity labels with incorrect I/B tags as true positives or not. If you want to only count exact matches, pass `mode="strict"` and a specific `scheme` value. The default is `None`.
+
+ `sample_weight`: an array-like of shape (n_samples,) that provides weights for individual samples. The default is `None`.
+
+ `zero_division`: which value to substitute as a metric value when encountering zero division. Should be one of [`0`, `1`, `"warn"`]. `"warn"` acts as `0`, but a warning is raised.
+
+
+ ```python
+ >>> seqeval = evaluate.load('seqeval')
+ >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> results = seqeval.compute(predictions=predictions, references=references)
+ ```
+
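+ To get the Fβ score, pass a non-default `beta`. A minimal sketch, assuming the modified metric is loaded from the local module as in this Space's `app.py`; with `beta=2` the keys `f2` (per type) and `overall_f2` are added to the result:
+
+ ```python
+ >>> seqeval = evaluate.load('seqeval_with_fbetal')
+ >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> results = seqeval.compute(predictions=predictions, references=references, beta=2)
+ >>> print(results["overall_f2"])
+ 0.5
+ ```
+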
+ ## Output values
+
+ This metric returns a dictionary with a summary of scores, both overall and per entity type:
+
+ Overall:
+
+ `accuracy`: the average [accuracy](https://huggingface.co/metrics/accuracy), on a scale between 0.0 and 1.0.
+
+ `precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
+
+ `recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
+
+ `f1`: the average [F1 score](https://huggingface.co/metrics/f1), which is the harmonic mean of the precision and recall. It also has a scale of 0.0 to 1.0.
+
+ `fbeta`: the micro-averaged Fβ score, returned under the key `overall_f{beta}` (e.g. `overall_f2` for `beta=2`) when a non-default `beta` is passed.
+
+ Per type (e.g. `MISC`, `PER`, `LOC`, ...):
+
+ `precision`: the average [precision](https://huggingface.co/metrics/precision), on a scale between 0.0 and 1.0.
+
+ `recall`: the average [recall](https://huggingface.co/metrics/recall), on a scale between 0.0 and 1.0.
+
+ `f1`: the average [F1 score](https://huggingface.co/metrics/f1), on a scale between 0.0 and 1.0.
+
+ `fbeta`: the Fβ score for that type, returned under the key `f{beta}` (e.g. `f2` for `beta=2`) when a non-default `beta` is passed.
+
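+ For reference, the Fβ score reported under these keys is derived from precision and recall in the same way as in `seqeval_with_fbetal.py`; a minimal sketch of that computation:
+
+ ```python
+ def fbeta(precision: float, recall: float, beta: float) -> float:
+     # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), with a guard against zero division
+     beta2 = beta ** 2
+     denom = beta2 * precision + recall
+     return (1 + beta2) * precision * recall / denom if denom else 0.0
+
+ # e.g. precision = 0.5, recall = 1.0, beta = 2  ->  5 * 0.5 * 1.0 / (4 * 0.5 + 1.0) ≈ 0.833
+ ```
+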
+
+ ### Values from popular papers
+ The 1995 "Text Chunking using Transformation-Based Learning" [paper](https://aclanthology.org/W95-0107) reported a baseline recall of 81.9% and a precision of 78.2% using non-deep-learning-based methods.
+
+ More recently, seqeval continues to be used for reporting performance on tasks such as [named entity detection](https://www.mdpi.com/2306-5729/6/8/84/htm) and [information extraction](https://ieeexplore.ieee.org/abstract/document/9697942/).
+
+
+ ## Examples
+
+ Maximal values (full match):
+
+ ```python
+ >>> seqeval = evaluate.load('seqeval')
+ >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> references = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> results = seqeval.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 1.0, 'overall_recall': 1.0, 'overall_f1': 1.0, 'overall_accuracy': 1.0}
+ ```
+
+ Minimal values (no match):
+
+ ```python
+ >>> seqeval = evaluate.load('seqeval')
+ >>> predictions = [['O', 'B-MISC', 'I-MISC'], ['B-PER', 'I-PER', 'O']]
+ >>> references = [['B-MISC', 'O', 'O'], ['I-PER', '0', 'I-PER']]
+ >>> results = seqeval.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2}, '_': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'overall_precision': 0.0, 'overall_recall': 0.0, 'overall_f1': 0.0, 'overall_accuracy': 0.0}
+ ```
+
+ Partial match:
+
+ ```python
+ >>> seqeval = evaluate.load('seqeval')
+ >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> results = seqeval.compute(predictions=predictions, references=references)
+ >>> print(results)
+ {'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 0.5, 'overall_recall': 0.5, 'overall_f1': 0.5, 'overall_accuracy': 0.8}
+ ```
+
+ ## Limitations and bias
+
+ seqeval supports the following IOB formats (short for inside, outside, beginning): `IOB1`, `IOB2`, `IOE1`, `IOE2`, `IOBES` (also in strict mode) and `BILOU` (only in strict mode).
+
+ For more information about IOB formats, refer to the [Wikipedia page](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) and the description of the [CoNLL-2000 shared task](https://aclanthology.org/W02-2024).
+
+
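+ For example, only exact matches can be counted by passing `mode="strict"` together with an explicit `scheme`; a sketch reusing the toy data from the examples above:
+
+ ```python
+ >>> seqeval = evaluate.load('seqeval')
+ >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+ >>> results = seqeval.compute(predictions=predictions, references=references, mode="strict", scheme="IOB2")
+ ```
+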
+ ## Citation
+
+ ```bibtex
+ @inproceedings{ramshaw-marcus-1995-text,
+     title = "Text Chunking using Transformation-Based Learning",
+     author = "Ramshaw, Lance and
+       Marcus, Mitch",
+     booktitle = "Third Workshop on Very Large Corpora",
+     year = "1995",
+     url = "https://www.aclweb.org/anthology/W95-0107",
+ }
+ ```
+
+ ```bibtex
+ @misc{seqeval,
+   title={{seqeval}: A Python framework for sequence labeling evaluation},
+   url={https://github.com/chakki-works/seqeval},
+   note={Software available from https://github.com/chakki-works/seqeval},
+   author={Hiroki Nakayama},
+   year={2018},
+ }
+ ```
+
+ ## Further References
+ - [README for seqeval at GitHub](https://github.com/chakki-works/seqeval)
+ - [CoNLL-2000 shared task](https://www.clips.uantwerpen.be/conll2002/ner/bin/conlleval.txt)
app.py ADDED
@@ -0,0 +1,11 @@
+ import sys
+
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ # Drop the Space's own directory from sys.path while loading the metric module, then restore it.
+ sys.path = [p for p in sys.path if p != "/home/user/app"]
+ module = evaluate.load("seqeval_with_fbetal")
+ sys.path = ["/home/user/app"] + sys.path
+
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ git+https://github.com/huggingface/evaluate@8dfe05784099fb9af55b8e77793205a3b7c86465
+ seqeval
seqeval_with_fbetal.py ADDED
@@ -0,0 +1,179 @@
+ # Copyright 2020 The HuggingFace Evaluate Authors.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ seqeval metric. """
+
+ import importlib
+ from typing import List, Optional, Union
+
+ import datasets
+ from seqeval.metrics import accuracy_score, classification_report
+
+ import evaluate
+
+
+ _CITATION = """\
+ @inproceedings{ramshaw-marcus-1995-text,
+     title = "Text Chunking using Transformation-Based Learning",
+     author = "Ramshaw, Lance and
+       Marcus, Mitch",
+     booktitle = "Third Workshop on Very Large Corpora",
+     year = "1995",
+     url = "https://www.aclweb.org/anthology/W95-0107",
+ }
+ @misc{seqeval,
+   title={{seqeval}: A Python framework for sequence labeling evaluation},
+   url={https://github.com/chakki-works/seqeval},
+   note={Software available from https://github.com/chakki-works/seqeval},
+   author={Hiroki Nakayama},
+   year={2018},
+ }
+ """
+
+ _DESCRIPTION = """\
+ seqeval is a Python framework for sequence labeling evaluation.
+ seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on.
+
+ This is well-tested by using the Perl script conlleval, which can be used for
+ measuring the performance of a system that has processed the CoNLL-2000 shared task data.
+
+ seqeval supports the following formats:
+ IOB1
+ IOB2
+ IOE1
+ IOE2
+ IOBES
+
+ See the README.md file at https://github.com/chakki-works/seqeval for more information.
+ """
+
+ _KWARGS_DESCRIPTION = """
+ Produces labelling scores along with their sufficient statistics
+ from a source against one or more references.
+
+ Args:
+     predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
+     references: List of List of reference labels (Ground truth (correct) target values)
+     beta: Weight beta of the F-beta score. default: 1.0
+     suffix: True if the IOB prefix is after type, False otherwise. default: False
+     scheme: Specify target tagging scheme. Should be one of ["IOB1", "IOB2", "IOE1", "IOE2", "IOBES", "BILOU"].
+         default: None
+     mode: Whether to count correct entity labels with incorrect I/B tags as true positives or not.
+         If you want to only count exact matches, pass mode="strict". default: None.
+     sample_weight: Array-like of shape (n_samples,), weights for individual samples. default: None
+     zero_division: Which value to substitute as a metric value when encountering zero division. Should be one of 0, 1,
+         "warn". "warn" acts as 0, but the warning is raised.
+
+ Returns:
+     'scores': dict. Summary of the scores for overall and per type
+         Overall:
+             'accuracy': accuracy,
+             'precision': precision,
+             'recall': recall,
+             'f1': F1 score, also known as balanced F-score or F-measure,
+             'fbeta': F-score with weight beta (only when beta != 1.0)
+         Per type:
+             'precision': precision,
+             'recall': recall,
+             'f1': F1 score, also known as balanced F-score or F-measure,
+             'fbeta': F-score with weight beta (only when beta != 1.0)
+ Examples:
+
+     >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+     >>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
+     >>> seqeval = evaluate.load("seqeval")
+     >>> results = seqeval.compute(predictions=predictions, references=references, beta=1.0)
+     >>> print(list(results.keys()))
+     ['MISC', 'PER', 'overall_precision', 'overall_recall', 'overall_f1', 'overall_accuracy']
+     >>> print(results["overall_f1"])
+     0.5
+     >>> print(results["PER"]["f1"])
+     1.0
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Seqeval(evaluate.Metric):
+     def _info(self):
+         return evaluate.MetricInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             homepage="https://github.com/chakki-works/seqeval",
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
+                     "references": datasets.Sequence(datasets.Value("string", id="label"), id="sequence"),
+                 }
+             ),
+             codebase_urls=["https://github.com/chakki-works/seqeval"],
+             reference_urls=["https://github.com/chakki-works/seqeval"],
+         )
+
+     def _compute(
+         self,
+         predictions,
+         references,
+         beta: float = 1.0,
+         suffix: bool = False,
+         scheme: Optional[str] = None,
+         mode: Optional[str] = None,
+         sample_weight: Optional[List[int]] = None,
+         zero_division: Union[str, int] = "warn",
+     ):
+         if scheme is not None:
+             try:
+                 scheme_module = importlib.import_module("seqeval.scheme")
+                 scheme = getattr(scheme_module, scheme)
+             except AttributeError:
+                 raise ValueError(f"Scheme should be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU], got {scheme}")
+         report = classification_report(
+             y_true=references,
+             y_pred=predictions,
+             suffix=suffix,
+             output_dict=True,
+             scheme=scheme,
+             mode=mode,
+             sample_weight=sample_weight,
+             zero_division=zero_division,
+         )
+         report.pop("macro avg")
+         report.pop("weighted avg")
+
+         if beta != 1.0:
+             # Add an F-beta score to every report entry (per type and micro avg),
+             # computed from the precision and recall already in the report.
+             beta2 = beta ** 2
+             for k, v in report.items():
+                 denom = beta2 * v["precision"] + v["recall"]
+                 if denom == 0:
+                     denom += 1
+                 v[f"f{beta}-score"] = (1 + beta2) * v["precision"] * v["recall"] / denom
+
+         overall_score = report.pop("micro avg")
+
+         scores = {
+             type_name: {
+                 "precision": score["precision"],
+                 "recall": score["recall"],
+                 "f1": score["f1-score"],
+                 # The f{beta} key only exists when a non-default beta was requested above.
+                 **({f"f{beta}": score[f"f{beta}-score"]} if beta != 1.0 else {}),
+                 "number": score["support"],
+             }
+             for type_name, score in report.items()
+         }
+         scores["overall_precision"] = overall_score["precision"]
+         scores["overall_recall"] = overall_score["recall"]
+         scores["overall_f1"] = overall_score["f1-score"]
+         if beta != 1.0:
+             scores[f"overall_f{beta}"] = overall_score[f"f{beta}-score"]
+         scores["overall_accuracy"] = accuracy_score(y_true=references, y_pred=predictions)
+
+         return scores