Natooz committed on
Commit 814113b · verified · 1 Parent(s): 175cd1f

Upload 5 files

Files changed (5):
  1. README.md +61 -7
  2. app.py +40 -0
  3. levenshtein.py +123 -0
  4. pyproject.toml +114 -0
  5. requirements.txt +3 -0
README.md CHANGED
@@ -1,14 +1,68 @@
  ---
- title: Levenshtein
- emoji: 🔥
- colorFrom: red
- colorTo: red
+ title: Levenshtein distance
+ emoji: ✍️
+ colorFrom: blue
+ colorTo: green
+ tags:
+ - evaluate
+ - metric
+ description: Levenshtein (edit) distance
  sdk: gradio
  sdk_version: 5.6.0
  app_file: app.py
  pinned: false
- license: apache-2.0
- short_description: Levenshtein (edit) distance metric
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Metric Card for the Levenshtein (edit) distance
+
+ ## Metric Description
+
+ This metric computes the Levenshtein distance, also commonly called "edit distance". The Levenshtein distance measures the minimum number of insertions, deletions and substitutions required to transform one string into another. It is a popular metric for text similarity.
+ This module directly calls the [Levenshtein package](https://github.com/rapidfuzz/Levenshtein) for fast execution speed.
+
+ ## How to Use
+
+ ### Inputs
+
+ - **predictions** *(string): sequence of prediction strings;*
+ - **references** *(string): sequence of reference strings;*
+ - **kwargs**: *keyword arguments to pass to the [Levenshtein.distance](https://rapidfuzz.github.io/Levenshtein/levenshtein.html#Levenshtein.distance) method.*
+
+ ### Output Values
+
+ Dictionary mapping to the average Levenshtein distance (lower is better) and the average ratio in [0, 1] (higher is better).
+
+ ### Examples
+
+ ```python
+ import evaluate
+
+ levenshtein = evaluate.load("Natooz/Levenshtein")
+ results = levenshtein.compute(
+     predictions=[
+         "foo", "baroo"  # 0 and 2 edits
+     ],
+     references=[
+         "foo", "bar"
+     ],
+ )
+ print(results)
+ # {'levenshtein': 1.0, 'levenshtein_ratio': 0.875}
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @ARTICLE{1966SPhD...10..707L,
+     author = {{Levenshtein}, V.~I.},
+     title = "{Binary Codes Capable of Correcting Deletions, Insertions and Reversals}",
+     journal = {Soviet Physics Doklady},
+     year = 1966,
+     month = feb,
+     volume = {10},
+     pages = {707},
+     adsurl = {https://ui.adsabs.harvard.edu/abs/1966SPhD...10..707L},
+     adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+ }
+ ```
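The numbers in the example above can be reproduced without loading the Space. Below is a minimal plain-Python sketch of the same computation, using a classic dynamic-programming edit distance instead of the Levenshtein package (illustrative only, not the module's actual code path):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


preds, refs = ["foo", "baroo"], ["foo", "bar"]
dists = [edit_distance(p, r) for p, r in zip(preds, refs)]
ratios = [d / (len(p) + len(r)) for d, p, r in zip(dists, preds, refs)]
print(sum(dists) / len(dists))        # average distance: 1.0
print(1 - sum(ratios) / len(ratios))  # ratio, higher is better: 0.875
```

The per-pair ratio divides each distance by the combined length of the two strings, so the final score stays in [0, 1] regardless of string lengths.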
app.py ADDED
@@ -0,0 +1,40 @@
+ """Application file."""
+
+ from pathlib import Path
+
+ import evaluate
+ import gradio as gr
+
+ module = evaluate.load("Natooz/levenshtein")
+
+ # Code taken and adapted from: https://github.com/huggingface/evaluate/blob/main/src/evaluate/utils/gradio.py
+ local_path = Path(__file__).parent
+ # if there are several input types, use first as default.
+ if isinstance(module.features, list):
+     (feature_names, feature_types) = zip(*module.features[0].items())
+ else:
+     (feature_names, feature_types) = zip(*module.features.items())
+ gradio_input_types = evaluate.utils.infer_gradio_input_types(feature_types)
+
+
+ def compute(data):
+     return module.compute(**evaluate.utils.parse_gradio_data(data, gradio_input_types))
+
+
+ gradio_app = gr.Interface(
+     fn=compute,
+     inputs=gr.Dataframe(
+         headers=feature_names,
+         col_count=len(feature_names),
+         row_count=1,
+         datatype=evaluate.utils.json_to_string_type(gradio_input_types),
+     ),
+     outputs=gr.Textbox(label=module.name),
+     description=module.info.description,
+     title=f"Metric: {module.name}",
+     article=evaluate.utils.parse_readme(local_path / "README.md"),
+ )
+
+
+ if __name__ == "__main__":
+     gradio_app.launch()
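For readers unfamiliar with the evaluate Gradio helpers: `parse_gradio_data` essentially transposes the row-wise Dataframe input into column-wise keyword arguments for `compute`. A simplified stand-in sketches the idea (`parse_rows` is a hypothetical name, not the actual evaluate implementation):

```python
def parse_rows(rows, feature_names=("predictions", "references")):
    # Transpose row-wise table data into column-wise keyword arguments,
    # roughly mirroring evaluate.utils.parse_gradio_data for string features.
    columns = list(zip(*rows))
    return {name: list(col) for name, col in zip(feature_names, columns)}


kwargs = parse_rows([["foo", "foo"], ["baroo", "bar"]])
print(kwargs)  # {'predictions': ['foo', 'baroo'], 'references': ['foo', 'bar']}
```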
levenshtein.py ADDED
@@ -0,0 +1,123 @@
+ """Levenshtein metric file."""
+
+ from __future__ import annotations
+
+ from typing import TYPE_CHECKING
+
+ import datasets
+ import evaluate
+
+ from Levenshtein import distance
+
+ if TYPE_CHECKING:
+     from collections.abc import Sequence
+
+ _CITATION = """\
+ @InProceedings{huggingface:levenshtein,
+     title = {Levenshtein (edit) distance},
+     authors={Nathan Fradet},
+     year={2024}
+ }
+ """
+
+ _DESCRIPTION = """\
+ This metric computes the Levenshtein (edit) distance.
+ It directly calls the "Levenshtein" package using the ``distance`` method:
+ https://rapidfuzz.github.io/Levenshtein/levenshtein.html#Levenshtein.distance
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ This metric computes the Levenshtein distance, also commonly called "edit distance".
+ The Levenshtein distance measures the minimum number of insertions, deletions and
+ substitutions required to transform one string into another. It is a popular metric
+ for text similarity.
+ This module directly calls the
+ [Levenshtein package](https://github.com/rapidfuzz/Levenshtein) for fast execution
+ speed.
+
+ Args:
+     predictions: list of prediction strings.
+     references: list of reference strings.
+     **kwargs: keyword arguments to pass to the [Levenshtein.distance](https://rapidfuzz.github.io/Levenshtein/levenshtein.html#Levenshtein.distance)
+         method.
+ Returns:
+     Dictionary mapping to the average Levenshtein distance (lower is better) and the
+     average ratio in [0, 1] (higher is better).
+ Examples:
+     >>> levenshtein = evaluate.load("Natooz/Levenshtein")
+     >>> results = levenshtein.compute(
+     ...     predictions=[
+     ...         "foo", "baroo"
+     ...     ],
+     ...     references=[
+     ...         "foo", "bar"
+     ...     ],
+     ... )
+     >>> print(results)
+     {'levenshtein': 1.0, 'levenshtein_ratio': 0.875}
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Levenshtein(evaluate.Metric):
+     """Module for the ``distance`` method of the "Levenshtein" package."""
+
+     def _info(self) -> evaluate.MetricInfo:
+         """
+         Return the module info.
+
+         :return: module info.
+         """
+         return evaluate.MetricInfo(
+             # This is the description that will appear on the modules page.
+             module_type="metric",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             # This defines the format of each prediction and reference
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string"),
+                     "references": datasets.Value("string"),
+                 }
+             ),
+             # Homepage of the module for documentation
+             homepage="https://huggingface.co/spaces/Natooz/Levenshtein",
+             # Additional links to the codebase or references
+             codebase_urls=[
+                 "https://github.com/rapidfuzz/Levenshtein",
+             ],
+             reference_urls=[
+                 "https://rapidfuzz.github.io/Levenshtein/levenshtein.html#Levenshtein.distance"
+             ],
+         )
+
+     def _compute(
+         self,
+         predictions: Sequence[str] | None = None,
+         references: Sequence[str] | None = None,
+         **kwargs,
+     ) -> dict[str, float]:
+         """
+         Return the average Levenshtein (edit) distance.
+
+         See the "Levenshtein" PyPI package documentation for the complete usage
+         information: https://rapidfuzz.github.io/Levenshtein/
+         """
+         if len(predictions) != len(references):
+             msg = "The number of predictions must be equal to the number of references."
+             raise ValueError(msg)
+
+         # Compute the distances
+         results, ratios = [], []
+         for prediction, reference in zip(predictions, references):
+             edit_distance = distance(prediction, reference, **kwargs)
+             results.append(edit_distance)
+             ratios.append(edit_distance / (len(prediction) + len(reference)))
+
+         # Return average distance and ratio
+         return {
+             "levenshtein": sum(results) / len(results),
+             "levenshtein_ratio": 1 - sum(ratios) / len(ratios),
+         }
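The `**kwargs` pass-through above reaches `Levenshtein.distance`, which per the rapidfuzz documentation accepts, for example, a `weights=(insertion, deletion, substitution)` cost triple. A stand-in weighted DP sketches the effect of non-unit weights (illustrative only, not the package's implementation):

```python
def weighted_distance(a: str, b: str, weights=(1, 1, 1)) -> int:
    # Weighted Levenshtein distance; weights = (insertion, deletion, substitution).
    ins, dele, sub = weights
    # prev[j] = cost of converting a[:i-1] to b[:j]
    prev = [j * ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * dele]  # converting a[:i] to "" takes i deletions
        for j in range(1, len(b) + 1):
            curr.append(min(
                prev[j] + dele,       # delete a[i-1]
                curr[j - 1] + ins,    # insert b[j-1]
                prev[j - 1] + (sub if a[i - 1] != b[j - 1] else 0),
            ))
        prev = curr
    return prev[-1]


print(weighted_distance("abc", "abd"))                     # 1: one substitution
print(weighted_distance("abc", "abd", weights=(1, 1, 2)))  # 2: substitution now costs 2
```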
pyproject.toml ADDED
@@ -0,0 +1,114 @@
+ [tool.ruff]
+ target-version = "py313"
+
+ [tool.ruff.lint]
+ extend-select = [
+     "ARG",
+     "A",
+     "ANN",
+     "B",
+     "BLE",
+     "C4",
+     "COM",
+     "D",
+     "E",
+     "EM",
+     "EXE",
+     "F",
+     "FA",
+     "FBT",
+     "G",
+     "I",
+     "ICN",
+     "INP",
+     "INT",
+     "ISC",
+     "N",
+     "NPY",
+     "PERF",
+     "PGH",
+     "PTH",
+     "PIE",
+     # "PL",
+     "PT",
+     "Q",
+     "RET",
+     "RSE",
+     "RUF",
+     "S",
+     # "SLF",
+     "SIM",
+     "T",
+     "TCH",
+     "TID",
+     "UP",
+     "W",
+ ]
+
+ # Each rule exclusion should be explained here.
+ # By default, we think it is better to select groups of rules (above) and exclude
+ # specific problematic rules, instead of selecting specific rules. By doing so, in case
+ # the ruff rule groups change, this requires us to check and handle the new rules or
+ # changes, making sure we stay up to date and keep the best practices.
+
+ # ANN003:
+ # Would mostly apply to args/kwargs that are passed to methods from dependencies, for
+ # which the signature can change depending on the version. This would either be too
+ # difficult to comply with and/or would add a lot of noqa exceptions. ANN002 is used
+ # as it adds very few "noqa" exceptions, but ANN003 would add too much complexity.
+
+ # ANN101 and ANN102:
+ # Yield errors for `self` in methods of classes, which is unnecessary.
+ # The existence of these rules is currently questioned; they are likely to be removed.
+ # https://github.com/astral-sh/ruff/issues/4396
+
+ # B905:
+ # The `strict` keyword argument for the `zip` built-in appeared with Python 3.10.
+ # As we support previous versions, we cannot comply (yet) with this rule. The
+ # exclusion should be removed when dropping support for Python 3.9.
+
+ # D107:
+ # We document classes at the class level (D101). This documentation should cover the
+ # way classes are initialized. So we do not document `__init__` methods.
+
+ # D203:
+ # "one-blank-line-before-class", incompatible with D211 (blank-line-before-class).
+ # We follow PEP 257 and other conventions by preferring D211 over D203.
+
+ # D212:
+ # "multi-line-summary-first-line", incompatible with D213
+ # (multi-line-summary-second-line).
+ # We follow PEP 257, which recommends putting the summary line on the second line,
+ # right after the opening quotes.
+
+ # FBT001 and FBT002:
+ # Refactoring all the methods to make boolean arguments keyword-only would add
+ # complexity and could break users' code. It is fine to have booleans as positional
+ # arguments with default values. For code readability though, we enable FBT003.
+
+ # COM812:
+ # Yields errors for one-line portions without a comma. Trailing commas are
+ # automatically set by ruff format anyway. This exclusion could be removed when this
+ # behavior is fixed in ruff.
+
+ # UP038:
+ # Recommends the `|` type union with `isinstance`, which is only supported since
+ # Python 3.10. The exclusion should be removed when dropping support for Python 3.9.
+
+ # ISC001:
+ # May cause conflicts when used with the ruff formatter. They recommend disabling it.
+ # We leave it enabled but keep this in mind.
+
+ ignore = [
+     "ANN003",
+     "ANN101",
+     "ANN102",
+     "B905",
+     "COM812",
+     "D107",
+     "D203",
+     "D212",
+     "FBT001",
+     "FBT002",
+     "UP038",
+ ]
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ evaluate>=0.4.0
+ Levenshtein>=0.26.0
+ datasets