Hugging Face – The AI community building the future.

accuracy

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative

bertscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

bleu

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics. Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Neither intelligibility nor grammatical correctness are not taken into account.

bleurt

BLEURT a learnt evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018) and then employing another pre-training phrase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better). See the project's README at https://github.com/google-research/bleurt#readme for more information.

brier_score

The Brier score is a measure of the error between two probability distributions.

cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

character

CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).

charcut_mt

CharCut is a character-based machine translation evaluation metric.

chrf

ChrF and ChrF++ are two MT evaluation metrics. They both use the F-score statistic for character n-gram matches, and ChrF++ adds word n-grams as well which correlates more strongly with direct assessment. We use the implementation that is already present in sacrebleu. The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534 See the README.md file at https://github.com/mjpost/sacreBLEU#chrf--chrf for more information.

code_eval

This metric implements the evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).

comet

Crosslingual Optimized Metric for Evaluation of Translation (COMET) is an open-source framework used to train Machine Translation metrics that achieve high levels of correlation with different types of human judgments (HTER, DA's or MQM). With the release of the framework the authors also released fully trained models that were used to compete in the WMT20 Metrics Shared Task achieving SOTA in that years competition. See the [README.md] file at https://unbabel.github.io/COMET/html/models.html for more information.

competition_math

This metric is used to assess performance on the Mathematics Aptitude Test of Heuristics (MATH) dataset. It first canonicalizes the inputs (e.g., converting "1/2" to "\frac{1}{2}") and then computes accuracy.

confusion_matrix

The confusion matrix evaluates classification accuracy. Each row in a confusion matrix represents a true class and each column represents the instances in a predicted class.

coval

CoVal is a coreference evaluation tool for the CoNLL and ARRAU datasets which implements of the common evaluation metrics including MUC [Vilain et al, 1995], B-cubed [Bagga and Baldwin, 1998], CEAFe [Luo et al., 2005], LEA [Moosavi and Strube, 2016] and the averaged CoNLL score (the average of the F1 values of MUC, B-cubed and CEAFe) [Denis and Baldridge, 2009a; Pradhan et al., 2011]. This wrapper of CoVal currently only work with CoNLL line format: The CoNLL format has one word per line with all the annotation for this word in column separated by spaces: Column Type Description 1 Document ID This is a variation on the document filename 2 Part number Some files are divided into multiple parts numbered as 000, 001, 002, ... etc. 3 Word number 4 Word itself This is the token as segmented/tokenized in the Treebank. Initially the *_skel file contain the placeholder [WORD] which gets replaced by the actual token from the Treebank which is part of the OntoNotes release. 5 Part-of-Speech 6 Parse bit This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterix with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column. 7 Predicate lemma The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-" 8 Predicate Frameset ID This is the PropBank frameset ID of the predicate in Column 7. 9 Word sense This is the word sense of the word in Column 3. 10 Speaker/Author This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data. 11 Named Entities These columns identifies the spans representing various named entities. 12:N Predicate Arguments There is one column each of predicate argument structure information for the predicate mentioned in Column 7. N Coreference Coreference chain information encoded in a parenthesis structure. More informations on the format can be found here (section "*_conll File Format"): http://www.conll.cemantix.org/2012/data.html Details on the evaluation on CoNLL can be found here: https://github.com/ns-moosavi/coval/blob/master/conll/README.md CoVal code was written by @ns-moosavi. Some parts are borrowed from https://github.com/clarkkev/deep-coref/blob/master/evaluation.py The test suite is taken from https://github.com/conll/reference-coreference-scorers/ Mention evaluation and the test suite are added by @andreasvc. Parsing CoNLL files is developed by Leo Born.

cuad

This metric wrap the official scoring script for version 1 of the Contract Understanding Atticus Dataset (CUAD). Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of more than 13,000 labels in 510 commercial legal contracts that have been manually labeled to identify 41 categories of important clauses that lawyers look for when reviewing contracts in connection with corporate transactions.

exact_match

Returns the rate at which the input predicted strings exactly match their references, ignoring any strings input as part of the regexes_to_ignore list.

f1

The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall)

frugalscore

FrugalScore is a reference-based metric for NLG models evaluation. It is based on a distillation approach that allows to learn a fixed, low cost version of any expensive NLG metric, while retaining most of its original performance.

glue

GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

google_bleu

The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. We therefore use a slightly different score for our RL experiments which we call the 'GLEU score'. For the GLEU score, we record all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then compute a recall, which is the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence, and a precision, which is the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence. Then GLEU score is simply the minimum of recall and precision. This GLEU score's range is always between 0 (no matches) and 1 (all match) and it is symmetrical when switching output and target. According to our experiments, GLEU score correlates quite well with the BLEU metric on a corpus level but does not have its drawbacks for our per sentence reward objective.

indic_glue

IndicGLUE is a natural language understanding benchmark for Indian languages. It contains a wide variety of tasks and covers 11 major Indian languages - as, bn, gu, hi, kn, ml, mr, or, pa, ta, te.

mae

Mean Absolute Error (MAE) is the mean of the magnitude of difference between the predicted and actual values.

mahalanobis

Compute the Mahalanobis Distance Mahalonobis distance is the distance between a point and a distribution. And not between two distinct points. It is effectively a multivariate equivalent of the Euclidean distance. It was introduced by Prof. P. C. Mahalanobis in 1936 and has been used in various statistical applications ever since [source: https://www.machinelearningplus.com/statistics/mahalanobis-distance/]

mape

Mean Absolute Percentage Error (MAPE) is the mean percentage error difference between the predicted and actual values.

mase

Mean Absolute Scaled Error (MASE) is the mean absolute error of the forecast values, divided by the mean absolute error of the in-sample one-step naive forecast on the training set.

matthews_correlation

Compute the Matthews correlation coefficient (MCC) The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. [source: Wikipedia]

mauve

MAUVE is a measure of the statistical gap between two text distributions, e.g., how far the text written by a model is the distribution of human text, using samples from both distributions. MAUVE is obtained by computing Kullback–Leibler (KL) divergences between the two distributions in a quantized embedding space of a large language model. It can quantify differences in the quality of generated text based on the size of the model, the decoding algorithm, and the length of the generated text. MAUVE was found to correlate the strongest with human evaluations over baseline metrics for open-ended text generation.

mean_iou

IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth. For binary (two classes) or multi-class segmentation, the mean IoU of the image is calculated by taking the IoU of each class and averaging them.

meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

mse

Mean Squared Error(MSE) is the average of the square of difference between the predicted and actual values.

nist_mt

DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score.

pearsonr

Pearson correlation coefficient and p-value for testing non-correlation. The Pearson correlation coefficient measures the linear relationship between two datasets. The calculation of the p-value relies on the assumption that each dataset is normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.

perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`. For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).

poseval

The poseval metric can be used to evaluate POS taggers. Since seqeval does not work well with POS data that is not in IOB format the poseval is an alternative. It treats each token in the dataset as independant observation and computes the precision, recall and F1-score irrespective of sentences. It uses scikit-learns's classification report to compute the scores.

precision

Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP) where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).

r_squared

The R^2 (R Squared) metric is a measure of the goodness of fit of a linear regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variable.

recall

Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN) Where TP is the true positives and FN is the false negatives.

rl_reliability

Computes the RL reliability metrics from a set of experiments. There is an `"online"` and `"offline"` configuration for evaluation.

roc_auc

This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance. This metric has three separate use cases: - binary: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation. - multiclass: The case in which there can be more than two different label classes, but each example still gets only one label. - multilabel: The case in which there can be more than two different label classes, and each example can have more than one label.

rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

sacrebleu

SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you. See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.

sari

SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system. Sari = (F1_add + F1_keep + P_del) / 3 where F1_add: n-gram F1 score for add operation F1_keep: n-gram F1 score for keep operation P_del: n-gram precision score for delete operation n = 4, as in the original paper. This implementation is adapted from Tensorflow's tensor2tensor implementation [3]. It has two differences with the original GitHub [1] implementation: (1) Defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly. (2) Fixes an alleged bug [2] in the keep score computation. [1] https://github.com/cocoxu/simplification/blob/master/SARI.py (commit 0210f15) [2] https://github.com/cocoxu/simplification/issues/6 [3] https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py

seqeval

seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on. This is well-tested by using the Perl script conlleval, which can be used for measuring the performance of a system that has processed the CoNLL-2000 shared task data. seqeval supports following formats: IOB1 IOB2 IOE1 IOE2 IOBES See the [README.md] file at https://github.com/chakki-works/seqeval for more information.

smape

Symmetric Mean Absolute Percentage Error (sMAPE) is the symmetric mean percentage error difference between the predicted and actual values defined by Chen and Yang (2004).

spearmanr

The Spearman rank-order correlation coefficient is a measure of the relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Positive correlations imply that as data in dataset x increases, so does data in dataset y. Negative correlations imply that as x increases, y decreases. Correlations of -1 or +1 imply an exact monotonic relationship. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

squad

This metric wrap the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD). Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

squad_v2

This metric wrap the official scoring script for version 2 of the Stanford Question Answering Dataset (SQuAD). Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

super_glue

SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.

ter

TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation. We use the implementation that is already present in sacrebleu (https://github.com/mjpost/sacreBLEU#ter), which in turn is inspired by the TERCOM implementation, which can be found here: https://github.com/jhclark/tercom. The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534 See the README.md file at https://github.com/mjpost/sacreBLEU#ter for more information.

trec_eval

The TREC Eval metric combines a number of information retrieval metrics such as precision and nDCG. It is used to score rankings of retrieved documents with reference values.

wer

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

wiki_split

WIKI_SPLIT is the combination of three metrics SARI, EXACT and SACREBLEU It can be used to evaluate the quality of machine-generated texts.

xnli

XNLI is a subset of a few thousand examples from MNLI which has been translated into a 14 different languages (some low-ish resource). As with MNLI, the goal is to predict textual entailment (does sentence A imply/contradict/neither sentence B) and is a classification task (given two sentences, predict one of three labels).

xtreme_s

XTREME-S is a benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval.

AIML-TUDA/VerifiableRewardsForScalableLogicalReasoning

VerifiableRewardsForScalableLogicalReasoning is a metric for evaluating logical reasoning in AI systems by providing verifiable rewards. It computes rewards through symbolic execution of candidate solutions against validation programs, enabling automatic, transparent and reproducible evaluation in AI systems.

AfibulIslam/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

AiMayI/rougeBench

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

AlhitawiMohammed22/CER_Hu-Evaluation-Metrics

Aye10032/loss_metric

Aye10032/top5_error_rate

Baleegh/Fluency_Score

Fluency Score is a metric designed to distinguish between old eloquent classical Arabic and Modern Standard Arabic. This is done by a text classifier that was finetuned for this task.

Bekhouche/NED

The Normalized Edit Distance (NED) is a metric used to quantify the dissimilarity between two sequences, typically strings, by measuring the minimum number of editing operations required to transform one sequence into the other, normalized by the length of the longer sequence. The NED ranges from 0 to 1, where 0 indicates identical sequences and 1 indicates completely dissimilar sequences. It is particularly useful in tasks such as spell checking, speech recognition, and OCR. The normalized edit distance can be calculated using the formula: NED = (1 - (ED(pred, gt) / max(length(pred), length(gt)))) Where: gt: ground-truth sequence pred: predicted sequence ED: Edit Distance, the minimum number of editing operations (insertions, deletions, substitutions) needed to transform one sequence into the other.

BridgeAI-Lab/Sem-nCG

Sem-nCG (Semantic Normalized Cumulative Gain) Metric evaluates the quality of predicted sentences (abstractive/extractive) in relation to reference sentences and documents using Semantic Normalized Cumulative Gain (NCG). It computes gain values and NCG scores based on cosine similarity between sentence embeddings, leveraging a Sentence-BERT encoder. This metric is designed to assess the relevance and ranking of predicted sentences, making it useful for tasks such as summarization and information retrieval.

BridgeAI-Lab/SemF1

SEM-F1 metric leverages the pre-trained contextual embeddings and evaluates the model generated semantic overlap summary with the reference overlap summary. It evaluates the semantic overlap summary at the sentence level and computes precision, recall and F1 scores. Refer to the paper `SEM-F1: an Automatic Way for Semantic Evaluation of Multi-Narrative Overlap Summaries at Scale` for more details.

BucketHeadP65/confusion_matrix

Compute confusion matrix to evaluate the accuracy of a classification. By definition a confusion matrix :math:C is such that :math:C_{i, j} is equal to the number of observations known to be in group :math:i and predicted to be in group :math:j. Thus in binary classification, the count of true negatives is :math:C_{0,0}, false negatives is :math:C_{1,0}, true positives is :math:C_{1,1} and false positives is :math:C_{0,1}.

BucketHeadP65/roc_curve

Compute Receiver operating characteristic (ROC). Note: this implementation is restricted to the binary classification task.

CZLC/rouge_raw

ROUGE RAW is language-agnostic variant of ROUGE without stemmer, stop words and synonymas. This is a wrapper around the original http://hdl.handle.net/11234/1-2615 script.

Camus-rebel/bertscore-with-torch_dtype

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

DaliaCaRo/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

DarrenChensformer/action_generation

TODO: add a description here

DarrenChensformer/eval_keyphrase

TODO: add a description here

DarrenChensformer/relation_extraction

TODO: add a description here

Dnfs/awesome_metric

TODO: add a description here

DoctorSlimm/bangalore_score

TODO: add a description here

DoctorSlimm/kaushiks_criteria

TODO: add a description here

Drunper/metrica_tesi

TODO: add a description here

Felipehonorato/eer

Equal Error Rate (EER) is a measure that shows the performance of a biometric system, like fingerprint or facial recognition. It's the point where the system's False Acceptance Rate (letting the wrong person in) and False Rejection Rate (blocking the right person) are equal. The lower the EER value, the better the system's performance. EER is used in various security applications, such as airports, banks, and personal devices like smartphones and laptops, to evaluate the effectiveness of the biometric system in correctly identifying users.

FergusFindley/character

CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).

FergusFindley/meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

FergusFindley/sacrebleu

SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you. See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.

FergusFindley/wer

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

Fritz02/execution_accuracy

TODO: add a description here

GMFTBY/dailydialog_evaluate

TODO: add a description here

GMFTBY/dailydialogevaluate

TODO: add a description here

Glazkov/mars

TODO: add a description here

He-Xingwei/sari_metric

SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system. Sari = (F1_add + F1_keep + P_del) / 3 where F1_add: n-gram F1 score for add operation F1_keep: n-gram F1 score for keep operation P_del: n-gram precision score for delete operation n = 4, as in the original paper. This implementation is adapted from Tensorflow's tensor2tensor implementation [3]. It has two differences with the original GitHub [1] implementation: (1) Defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly. (2) Fixes an alleged bug [2] in the keep score computation. [1] https://github.com/cocoxu/simplification/blob/master/SARI.py (commit 0210f15) [2] https://github.com/cocoxu/simplification/issues/6 [3] https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py

Ikala-allen/relation_extraction

This metric is used for evaluating the F1 accuracy of input references and predictions.

JP-SystemsX/nDCG

The Discounted Cumulative Gain is a measure of ranking quality. It is used to evaluate Information Retrieval Systems under the following 2 assumptions: 1. Highly relevant documents/Labels are more useful when appearing earlier in the results 2. Documents/Labels are relevant to different degrees It is defined as the Sum over all relevances of the retrieved documents reduced logarithmically proportional to the position in which they were retrieved. The Normalized DCG (nDCG) divides the resulting value by the best possible value to get a value between 0 and 1 s.t. a perfect retrieval achieves a nDCG of 1.

JerryGanst/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

JesseOU/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Josh98/nl2bash_m

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative

KaliSurfKukt/brier_score

The Brier score is a measure of the error between two probability distributions.

KevinSpaghetti/accuracyk

computes the accuracy at k for a set of predictions as labels

Khaliq88/execution_accuracy

TODO: add a description here

Kick28/wer

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

Kunjal7298/wer

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

LottieW/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

Markdown/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

MathewShen/bleu

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics. Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Neither intelligibility nor grammatical correctness are not taken into account.

Merle456/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

Muennighoff/code_eval_octopack

This metric implements code evaluation with execution across multiple languages as used in the paper "OctoPack: Instruction Tuning Code Large Language Models" (https://arxiv.org/abs/2308.07124).

NCSOFT/harim_plus

HaRiM+ is reference-less metric for summary quality evaluation which hurls the power of summarization model to estimate the quality of the summary-article pair. <br /> Note that this metric is reference-free and do not require training. It is ready to go without reference text to compare with the generation nor any model training for scoring.

NathanMad/bertscore-with-torch_dtype

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

Natooz/ece

Expected Calibration Error (ECE)

Natooz/levenshtein

Levenshtein (edit) distance

Natooz/mse

Mean Squared Error(MSE) is the average of the square of difference between the predicted and actual values.

Ndyyyy/bertscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

NikitaMartynov/spell-check-metric

This module calculates classification metrics e.g. precision, recall, F1, on spell-checking task.

NimaBoscarino/weat

TODO: add a description here

Ochiroo/rouge_mn

TODO: add a description here

Pipatpong/perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`. For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).

Qui-nn/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

RajithaRama/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Ransaka/cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

Remeris/rouge_ru

ROUGE RU is russian language variant of ROUGE with stemmer and stop words but without synonymas. It is case insensitive, meaning that upper case letters are treated the same way as lower case letters.

RiciHuggingFace/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

Ruchin/jaccard_similarity

Jaccard similarity coefficient score is defined as the size of the intersection divided by the size of the union of two label sets. It is used to compare the set of predicted labels for a sample to the corresponding set of true labels.

SEA-AI/box-metrics

built upon yolov5 iou functions. Outputs metrics regarding box fit

SEA-AI/det-metrics

Modified cocoevals.py which is wrapped into torchmetrics' mAP metric with numpy instead of torch dependency.

SEA-AI/horizon-metrics

This huggingface metric calculates horizon evaluation metrics using `seametrics.horizon.HorizonMetrics`.

SEA-AI/mot-metrics

TODO: add a description here

SEA-AI/panoptic-quality

PanopticQuality score

SEA-AI/ref-metrics

TODO: add a description here

SEA-AI/user-friendly-metrics

TODO: add a description here

Shaikh58/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Soroor/cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

SpfIo/wer_checker

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

TelEl/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

TwentyNine/sacrebleu

SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you. See the [README.md] file at https://github.com/mjpost/sacreBLEU for more information.

Usmankiani256/meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

Vallp/ter

TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation. We use the implementation that is already present in sacrebleu (https://github.com/mjpost/sacreBLEU#ter), which in turn is inspired by the TERCOM implementation, which can be found here: https://github.com/jhclark/tercom. The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534 See the README.md file at https://github.com/mjpost/sacreBLEU#ter for more information.

Vertaix/vendiscore

The Vendi Score is a metric for evaluating diversity in machine learning. See the project's README at https://github.com/vertaix/Vendi-Score for more information.

Vickyage/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

Viktorix/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Viona/fuzzy_reordering

TODO: add a description here

Viona/infolm

TODO: add a description here

Viona/kendall_tau

TODO: add a description here

Vipitis/shadermatch

compare rendered frames from shadercode, using a WGPU implementation

Vlasta/pr_auc

TODO: add a description here

Winfred13/cocoevaluate

TODO: add a description here

Yeshwant123/mcc

Matthews correlation coefficient (MCC) is a correlation coefficient used in machine learning as a measure of the quality of binary and multiclass classifications.

Zillionkkk/code_eval

This metric implements the evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).

abdusah/aradiawer

This new module is designed to calculate an enhanced Dialectical Arabic (DA) WER (AraDiaWER) based on linguistic and semantic factors.

abidlabs/mean_iou

IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth. For binary (two classes) or multi-class segmentation, the mean IoU of the image is calculated by taking the IoU of each class and averaging them.

abidlabs/mean_iou2

IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth. For binary (two classes) or multi-class segmentation, the mean IoU of the image is calculated by taking the IoU of each class and averaging them.

ag2435/my_metric

TODO: add a description here

agkphysics/ccc

Concordance correlation coefficient

ahnyeonchan/Alignment-and-Uniformity

akki2825/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

alvinasvk/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

andstor/code_perplexity

Perplexity measure for code.

angelasophie/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

angelina-wang/directional_bias_amplification

Directional Bias Amplification is a metric that captures the amount of bias (i.e., a conditional probability) that is amplified. This metric was introduced in the ICML 2021 paper ["Directional Bias Amplification"](https://arxiv.org/abs/2102.12594) for fairness evaluation.

anz2/iliauniiccocrevaluation

TODO: add a description here

argmaxinc/detailed-wer

Word Error Rate (WER) metric with detailed error analysis capabilities for speech recognition evaluation

arthurvqin/pr_auc

This metric computes the area under the curve (AUC) for the Precision-Recall Curve (PR). summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

aryopg/roc_auc_skip_uniform_labels

This metric computes the area under the curve (AUC) for the Receiver Operating Characteristic Curve (ROC). The return values represent how well the model used is predicting the correct classes, based on the input data. A score of `0.5` means that the model is predicting exactly at chance, i.e. the model's predictions are correct at the same rate as if the predictions were being decided by the flip of a fair coin or the roll of a fair die. A score above `0.5` indicates that the model is doing better than chance, while a score below `0.5` indicates that the model is doing worse than chance. This metric has three separate use cases: - binary: The case in which there are only two different label classes, and each example gets only one label. This is the default implementation. - multiclass: The case in which there can be more than two different label classes, but each example still gets only one label. - multilabel: The case in which there can be more than two different label classes, and each example can have more than one label.

aynetdia/semscore

SemScore measures semantic textual similarity between candidate and reference texts. It has been shown to strongly correlate with human judgment on a system-level when evaluating the instructing following capabilities of language models. Given a set of model-generated outputs and target completions, a pre-trained sentence-transformer is used to calculate cosine similarities between them.

bascobasculino/mot-metrics

TODO: add a description here

bcsp/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

bdsaglam/jer

Computes precision, recall, and f1 scores for joint entity-relation extraction.

berkatil/map

This is the mean average precision (map) metric for retrieval systems. It is the average of the precision scores computer after each relevant document is got. You can refer to [here](https://amenra.github.io/ranx/metrics/#mean-average-precision)

berkatil/mrr

This is the mean reciprocal rank (mrr) metric for retrieval systems. It is the average of the precision scores computer after each relevant document is got. You can refer to [here](https://amenra.github.io/ranx/metrics/#mean-reciprocal-rank)

bomjin/code_eval_octopack

This metric implements code evaluation with execution across multiple languages as used in the paper "OctoPack: Instruction Tuning Code Large Language Models" (https://arxiv.org/abs/2308.07124).

boschar/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

bowdbeg/docred

TODO: add a description here

bowdbeg/matching_series

Matching-based time-series generation metric

bowdbeg/patch_series

TODO: add a description here

brian920128/doc_retrieve_metrics

TODO: add a description here

bstrai/classification_report

Build a text report showing the main classification metrics that are accuracy, precision, recall and F1.

buelfhood/fbeta_score

Calculate FBeta_Score

buelfhood/fbeta_score_2

Calculate FBeta_Score

bugbounty1806/accuracy

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative

c12630/bertscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

c12630/bertscore2

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

carletoncognitivescience/peak_signal_to_noise_ratio

Image quality metric

chanelcolgate/average_precision

Average precision score.

ckb/unigram

TODO: add a description here

codeparrot/apps_metric

Evaluation metric for the APPS benchmark

cointegrated/blaser_2_0_qe

TODO: add a description here

cpllab/syntaxgym

Evaluates Huggingface models on SyntaxGym datasets (targeted syntactic evaluations).

d-matrix/dmxMetric

Evaluation function using lm-eval with d-Matrix integration. This function allows for the evaluation of language models across various tasks, with the option to use d-Matrix compressed models. For more information, see https://github.com/EleutherAI/lm-evaluation-harness and https://github.com/d-matrix-ai/dmx-compressor

d-matrix/dmx_perplexity

Perplexity metric implemented by d-Matrix. Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`. Note that this metric is intended for Causual Language Models, the perplexity calculation is only correct if model uses Cross Entropy Loss. For more information, see https://huggingface.co/docs/transformers/perplexity

daiyizheng/valid

TODO: add a description here

danasone/ru_errant

TODO: add a description here

danieldux/hierarchical_softmax_loss

TODO: add a description here

danieldux/isco_hierarchical_accuracy

The ISCO-08 Hierarchical Accuracy Measure is an implementation of the measure described in [Functional Annotation of Genes Using Hierarchical Text Categorization](https://www.researchgate.net/publication/44046343_Functional_Annotation_of_Genes_Using_Hierarchical_Text_Categorization) (Kiritchenko, Svetlana and Famili, Fazel. 2005) applied to the ISCO-08 classification scheme by the International Labour Organization.

dannashao/span_metric

This metric calculates both Token Overlap and Span Agreement precision, recall and f1 scores.

datenbergwerk/classification_report

Build a text report showing the main classification metrics that are accuracy, precision, recall and F1.

davebulaval/meaningbert

MeaningBERT is an automatic and trainable metric for assessing meaning preservation between sentences See the project's README at https://github.com/GRAAL-Research/MeaningBERT/tree/main for more information.

dayil100/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

dayil100/accents_unplugged_eval_WER

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

dgfh76564/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

dotkaio/competition_math

This metric is used to assess performance on the Mathematics Aptitude Test of Heuristics (MATH) dataset. It first canonicalizes the inputs (e.g., converting "1/2" to "\frac{1}{2}") and then computes accuracy.

dvitel/codebleu

CodeBLEU

ecody726/bertscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

eirsteir/perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`. For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).

erntkn/dice_coefficient

TODO: add a description here

flozi00/perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`. For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).

fnvls/bleu1234

TODO: add a description here

fnvls/bleu_1234

TODO: add a description here

franzi2505/detection_metric

Compute multiple object detection metrics at different bounding box area levels.

fschlatt/ner_eval

TODO: add a description here

gabeorlanski/bc_eval

This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

ginic/phone_errors

Error rates in terms of distance between articulatory phonological features can help understand differences between strings in the International Phonetic Alphabet (IPA) in a linguistically motivated way. This is useful when evaluating speech recognition or orthographic to IPA conversion tasks.

gjacob/bertimbauscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

gjacob/chrf

ChrF and ChrF++ are two MT evaluation metrics. They both use the F-score statistic for character n-gram matches, and ChrF++ adds word n-grams as well which correlates more strongly with direct assessment. We use the implementation that is already present in sacrebleu. The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534 See the README.md file at https://github.com/mjpost/sacreBLEU#chrf--chrf for more information.

gjacob/google_bleu

The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. We therefore use a slightly different score for our RL experiments which we call the 'GLEU score'. For the GLEU score, we record all sub-sequences of 1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then compute a recall, which is the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence, and a precision, which is the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence. Then GLEU score is simply the minimum of recall and precision. This GLEU score's range is always between 0 (no matches) and 1 (all match) and it is symmetrical when switching output and target. According to our experiments, GLEU score correlates quite well with the BLEU metric on a corpus level but does not have its drawbacks for our per sentence reward objective.

gjacob/wiki_split

WIKI_SPLIT is the combination of three metrics SARI, EXACT and SACREBLEU It can be used to evaluate the quality of machine-generated texts.

gnail/cosine_similarity

TODO: add a description here

gorkaartola/metric_for_tp_fp_samples

This metric is specially designed to measure the performance of sentence classification models over multiclass test datasets containing both True Positive samples, meaning that the label associated to the sentence in the sample is correctly assigned, and False Positive samples, meaning that the label associated to the sentence in the sample is incorrectly assigned.

guydav/restrictedpython_code_eval

Same logic as the built-in `code_eval`, but compiling and running the code using `RestrictedPython`

hack/test_metric

TODO: add a description here

hage2000/code_eval_stdio

The stdio version of of the ["code eval"](https://huggingface.co/spaces/evaluate-metric/code_eval) metrics, which handles python programs that read inputs from STDIN and print answers to STDOUT, which is common in competitive programming (e.g. CodeForce, USACO) : )

hage2000/my_metric

TODO: add a description here

haotongye-shopee/ppl

TODO: add a description here

harshhpareek/bertscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

helena-balabin/youden_index

Youden index for finding the ideal threshold in an ROC AUC curve

hemulitch/cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

here2infinity/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

hpi-dhc/FairEval

Fair Evaluation for Squence labeling

huanghuayu/multiclass_brier_score

brier_score metric for multiclass problem.

hynky/sklearn_proxy

TODO: add a description here

hyperml/balanced_accuracy

Balanced Accuracy is the average of recall obtained on each class. It can be computed with: Balanced Accuracy = (TPR + TNR) / N Where: TPR: True positive rate TNR: True negative rate N: Number of classes

idsedykh/codebleu

TODO: add a description here

idsedykh/codebleu2

TODO: add a description here

idsedykh/megaglue

TODO: add a description here

idsedykh/metric

TODO: add a description here

illorca/FairEval

Fair Evaluation for Squence labeling

ingyu/klue_mrc

This metric wrap the unofficial scoring script for [Machine Machine Reading Comprehension task of Korean Language Understanding Evaluation (KLUE-MRC)](https://huggingface.co/datasets/klue/viewer/mrc/train). KLUE-MRC is a Korean reading comprehension dataset consisting of questions where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. As KLUE-MRC has the same task format as SQuAD 2.0, this evaluation script uses the same metrics of SQuAD 2.0 (F1 and EM). KLUE-MRC consists of 12,286 question paraphrasing, 7,931 multi-sentence reasoning, and 9,269 unanswerable questions. Totally, 29,313 examples are made with 22,343 documents and 23,717 passages.

iyung/meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

jarod0411/aucpr

TODO: add a description here

jialinsong/apps_metric

Evaluation metric for the APPS benchmark

jijihuny/ecqa

TODO: add a description here

jjkim0807/code_eval

This metric implements the evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).

joseph7777777/accuracy

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative

jpxkqx/peak_signal_to_noise_ratio

Image quality metric

jpxkqx/signal_to_reconstruction_error

Signal-to-Reconstruction Error

juliakaczor/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

jzm-mailchimp/joshs_second_test_metric

TODO: add a description here

k4black/codebleu

Unofficial `CodeBLEU` implementation that supports Linux, MacOS and Windows.

kashif/mape

TODO: add a description here

kbmlcoding/apps_metric

Evaluation metric for the APPS benchmark

kdudzic/charmatch

TODO: add a description here

kgorman2205/accuracy

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative

kilian-group/arxiv_score

TODO: add a description here

kiracurrie22/precision

Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP) where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).

kyokote/my_metric2

TODO: add a description here

langdonholmes/cohen_weighted_kappa

TODO: add a description here

leslyarun/fbeta_score

Calculate FBeta_Score

lhy/hamming_loss

TODO: add a description here

lhy/ranking_loss

TODO: add a description here

livvie/accents_unplugged_eval

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

loubnabnl/apps_metric2

Evaluation metric for the APPS benchmark

lrhammond/apps-metric

Evaluation metric for the APPS benchmark

lvwerra/accuracy_score

"Accuracy classification score."

lvwerra/bary_score

TODO: add a description here

lvwerra/test

maggie-lee/exact_match

Returns the rate at which the input predicted strings exactly match their references, ignoring any strings input as part of the regexes_to_ignore list.

maksymdolgikh/seqeval_with_fbeta

seqeval is a Python framework for sequence labeling evaluation. seqeval can evaluate the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling and so on. This is well-tested by using the Perl script conlleval, which can be used for measuring the performance of a system that has processed the CoNLL-2000 shared task data. seqeval supports following formats: IOB1 IOB2 IOE1 IOE2 IOBES See the [README.md] file at https://github.com/chakki-works/seqeval for more information.

manueldeprada/beer

BEER 2.0 (BEtter Evaluation as Ranking) is a trained machine translation evaluation metric with high correlation with human judgment both on sentence and corpus level. It is a linear model-based metric for sentence-level evaluation in machine translation (MT) that combines 33 relatively dense features, including character n-grams and reordering features. It employs a learning-to-rank framework to differentiate between function and non-function words and weighs each word type according to its importance for evaluation. The model is trained on ranking similar translations using a vector of feature values for each system output. BEER outperforms the strong baseline metric METEOR in five out of eight language pairs, showing that less sparse features at the sentence level can lead to state-of-the-art results. Features on character n-grams are crucial, and higher-order character n-grams are less prone to sparse counts than word n-grams.

maqiuping59/table_markdown

Table evaluation metrics for assessing the matching degree between predicted and reference tables. It calculates precision, recall, and F1 score for table data extraction or generation tasks.

maryxm/code_eval

This metric implements the evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).

maysonma/lingo_judge_metric

mdocekal/multi_label_precision_recall_accuracy_fscore

Implementation of example based evaluation metrics for multi-label classification presented in Zhang and Zhou (2014).

mdocekal/precision_recall_fscore_accuracy

This metric calculates precision, recall, accuracy, and fscore for classification tasks using scikit-learn.

medmac01/bertscore-eval

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

mehulrk18/meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

mehulrk18/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

mfumanelli/geometric_mean

The geometric mean (G-mean) is the root of the product of class-wise sensitivity.

mgfrantz/roc_auc_macro

TODO: add a description here

mrcuddle/mean_iou

IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of union between the predicted segmentation and the ground truth. For binary (two classes) or multi-class segmentation, the mean IoU of the image is calculated by taking the IoU of each class and averaging them.

mtc/fragments

Fragments computes the extractiveness between source articles and their summaries. The metric computes two scores: coverage and density. The code is adapted from the newsroom package(https://github.com/lil-lab/newsroom/blob/master/newsroom/analyze/fragments.py). All credits goes to the authors of aforementioned code.

mtzig/cross_entropy_loss

computes the cross entropy loss

murinj/hter

HTER (Half Total Error Rate) is a metric that combines the False Accept Rate (FAR) and False Reject Rate (FRR) to provide a comprehensive evaluation of a system's performance. It can be computed with: HTER = (FAR + FRR) / 2 Where: FAR (False Accept Rate) = FP / (FP + TN) FRR (False Reject Rate) = FN / (FN + TP) TP: True positive TN: True negative FP: False positive FN: False negative

nevikw39/specificity

Specificity is the fraction of the negatives examples that were correctly labeled by the model as negatives. It can be computed with the equation: Specificity = TN / (TN + FP) Where TN is the true negatives and FP is the false positives.

nhop/L3Score

L3Score is a metric for evaluating the semantic similarity of free-form answers in question answering tasks. It uses log-probabilities of "Yes"/"No" tokens from a language model acting as a judge. Based on the SPIQA benchmark: https://arxiv.org/pdf/2407.09413

nlpln/tst

TODO: add a description here

noah1995/exact_match

Returns the rate at which the input predicted strings exactly match their references, ignoring any strings input as part of the regexes_to_ignore list.

nobody4/waf_metric

TODO: add a description here

nrmoolsarn/cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

ola13/precision_at_k

TODO: add a description here

oliviak-flpg/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

omidf/squad_precision_recall

This metric wrap the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD). Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

ooliverz/meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

ooliverz/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

ooliverz/wer

Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system with a WER of 0 being a perfect score.

openpecha/bleurt

BLEURT a learnt evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018) and then employing another pre-training phrase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better). See the project's README at https://github.com/google-research/bleurt#readme for more information.

phonemetransformers/segmentation_scores

metric for word segmentation scores

phucdev/blanc_score

BLANC is a reference-free metric that evaluates the quality of document summaries by measuring how much they improve a pre-trained language model's performance on the document's text. It estimates summary quality without needing human-written references, using two variations: BLANC-help and BLANC-tune.

phucdev/vihsd

ViHSD is a Vietnamese Hate Speech Detection dataset. This space implements accuracy and f1 to evaluate models on ViHSD.

pico-lm/blimp

BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars. For more information on BLiMP, see the [dataset card](https://huggingface.co/datasets/nyu-mll/blimp).

pico-lm/perplexity

This is a fork of the huggingface evaluate library's implementation of perplexity. Perplexity (PPL) is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`. For more information on perplexity, see [this tutorial](https://huggingface.co/docs/transformers/perplexity).

posicube/mean_reciprocal_rank

Mean Reciprocal Rank is a statistic measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness.

prajwall/mse

Mean Squared Error(MSE) is the average of the square of difference between the predicted and actual values.

red1bluelost/evaluate_genericify_cpp

TODO: add a description here

repllabs/mean_average_precision

TODO: add a description here

repllabs/mean_reciprocal_rank

TODO: add a description here

ronaldahmed/nwentfaithfulness

TODO: add a description here

saicharan2804/my_metric

Moses and PyTDC metrics

sakusakumura/bertscore

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks. See the project's README at https://github.com/Tiiiger/bert_score#readme for more information.

shalakasatheesh/squad

This metric wrap the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD). Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

shalakasatheesh/squad_v2

This metric wrap the official scoring script for version 2 of the Stanford Question Answering Dataset (SQuAD). Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

shirayukikun/sescore

SEScore: a text generation evaluation metric

shunzh/apps_metric

Evaluation metric for the APPS benchmark

sign/signwriting_similarity

The Symbol Distance Metric is a novel evaluation metric specifically designed for SignWriting, a visual writing system for signed languages. Unlike traditional string-based metrics (e.g., BLEU, chrF), this metric directly considers the visual and spatial properties of individual symbols used in SignWriting, such as base shape, orientation, rotation, and position. It is primarily used to evaluate model outputs in SignWriting transcription and translation tasks, offering a similarity score between a predicted and a reference sign.

sma2023/wil

sometimesanotion/ner_eval

TODO: add a description here

sorgfresser/valid_efficiency_score

TODO: add a description here

sportlosos/sescore

SEScore: a text generation evaluation metric

svenwey/logmetric

TODO: add a description here

tianzhihui-isc/cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

transZ/sbert_cosine

Sbert cosine is a metric to score the semantic similarity of text generation tasks This is not the official implementation of cosine similarity using SBERT See the project at https://www.sbert.net/ for more information.

transZ/test_parascore

ParaScore is a new metric to scoring the performance of paraphrase generation tasks See the project at https://github.com/shadowkiller33/ParaScore for more information.

unnati/kendall_tau_distance

TODO: add a description here

venkatasg/gleu

Generalized Language Evaluation Understanding (GLEU) is a metric initially developed for Grammatical Error Correction (GEC), that builds upon BLEU by rewarding corrections while also correctly crediting unchanged source text.

vineelpratap/cer

Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on character instead of word. Please refer to docs of WER for further information. Character error rate can be computed as: CER = (S + D + I) / N = (S + D + I) / (S + D + C) where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, N is the number of characters in the reference (N=S+D+C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often associated to the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system with a CER of 0 being a perfect score.

vladman-25/ter

TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation. We use the implementation that is already present in sacrebleu (https://github.com/mjpost/sacreBLEU#ter), which in turn is inspired by the TERCOM implementation, which can be found here: https://github.com/jhclark/tercom. The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534 See the README.md file at https://github.com/mjpost/sacreBLEU#ter for more information.

weiqis/pajm

a metric module to do Partial Answer & Justification Match (pajm).

whyen-wang/cocoeval

COCO eval

xu1998hz/sescore

SEScore: a text generation evaluation metric

xu1998hz/sescore_english_coco

SEScore: a text generation evaluation metric

xu1998hz/sescore_english_mt

SEScore: a text generation evaluation metric

xu1998hz/sescore_english_webnlg

SEScore: a text generation evaluation metric

xu1998hz/sescore_german_mt

SEScore: a text generation evaluation metric

yanxia123/bleu

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics. Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality. Neither intelligibility nor grammatical correctness are not taken into account.

ybelkada/cocoevaluate

TODO: add a description here

yonting/average_precision_score

Average precision score.

youssef101/accuracy

Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative

youssef101/f1

The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall)

yqsong/execution_accuracy

TODO: add a description here

yulong-me/yl_metric

TODO: add a description here

yuyijiong/quad_match_score

TODO: add a description here

yzha/ctc_eval

This repo contains code of an automatic evaluation metric described in the paper Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

zbeloki/m2

TODO: add a description here

zhangzc1213/meteor

METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. METEOR gets an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement on using simply unigram-precision, unigram-recall and their harmonic F1 combination.

zhangzc1213/rouge

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters. This metrics is a wrapper around Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

zsqrt/ter

TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation. We use the implementation that is already present in sacrebleu (https://github.com/mjpost/sacreBLEU#ter), which in turn is inspired by the TERCOM implementation, which can be found here: https://github.com/jhclark/tercom. The implementation here is slightly different from sacrebleu in terms of the required input format. The length of the references and hypotheses lists need to be the same, so you may need to transpose your references compared to sacrebleu's required input format. See https://github.com/huggingface/datasets/issues/3154#issuecomment-950746534 See the README.md file at https://github.com/mjpost/sacreBLEU#ter for more information.