QA-Evaluation-Metrics 📊

A fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models.

pip install qa-metrics is all you need!

🎉 Latest Updates

  • Version 0.2.19 Released!
    • Paper accepted to EMNLP 2024 Findings! 🎓
    • Enhanced PEDANTS with multi-pipeline support and improved edge case handling
    • Added support for OpenAI GPT-series and Claude-series models (openai >= 1.0)
    • Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via DeepInfra
    • Introduced a trained tiny-bert model for QA evaluation (18MB model size)
    • Added direct Hugging Face model download support for TransformerMatcher

🚀 Quick Start

Prerequisites

  • Python >= 3.6
  • openai >= 1.0

Installation

pip install qa-metrics

💡 Features

Our package offers six QA evaluation methods with varying strengths:

| Method | Best For | Cost | Correlation with Human Judgment |
|---|---|---|---|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
| PEDANTS | Both short & medium-form QA | Free | Very High |
| Neural Evaluation | Both short & long-form QA | Free | High |
| Open Source LLM Evaluation | All QA types | Free | High |
| Black-box LLM Evaluation | All QA types | Paid | Highest |

📖 Documentation

1. Normalized Exact Match

Method: em_match

Parameters

  • reference_answer (list of str): A list of gold (correct) answers to the question
  • candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

  • boolean: True if there are any exact normalized matches between gold and candidate answers

from qa_metrics.em import em_match

# Gold answers and a candidate answer, reused by the examples in later sections
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brothers Grimm's \"Iron Henry\""
# Boolean verdict from normalized exact match against the gold answers
match_result = em_match(reference_answer, candidate_answer)
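
To illustrate what "normalized" matching usually means, here is a minimal standalone sketch. It assumes SQuAD-style normalization (lowercasing, then stripping punctuation, English articles, and extra whitespace) and strict equality after normalization. The helper names `normalize_answer` and `exact_match_any` are hypothetical, and the package's own `em_match` may normalize or compare differently, for example on long candidates such as the sentence above.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_any(reference_answers: List[str], candidate_answer: str) -> bool:
    """True if the normalized candidate equals any normalized gold answer.

    Assumption: strict equality after normalization; this is an illustration,
    not the qa_metrics implementation.
    """
    candidate = normalize_answer(candidate_answer)
    return any(normalize_answer(ref) == candidate for ref in reference_answers)


golds = ["The Frog Prince", "The Princess and the Frog"]
print(exact_match_any(golds, "the princess and the frog."))
print(exact_match_any(golds, "A movie about a frog"))
```

Under this strict definition a candidate only matches when nothing but casing, punctuation, and articles separates it from a gold answer, which is why exact match suits short-form QA.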

2. F1 Score

Method: f1_score_with_precision_recall

Parameters

  • reference_answer (str): A gold (correct) answer to the question
  • candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

  • dictionary: Contains the F1 score, precision, and recall between a gold and candidate answer

Method: f1_match

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • threshold (float): F1 score threshold for considering a match (default: 0.5)

Returns

  • boolean: True if F1 score exceeds threshold for any gold answer

from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

# F1, precision, and recall against a single gold answer
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
# Boolean verdict against the full list of gold answers
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
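
For readers who want to see the arithmetic, the following is a rough sketch of token-level F1 and threshold matching. It assumes whitespace tokenization on lowercased text and a strict "greater than" comparison against the threshold. The names `token_f1` and `f1_match_any` and the dictionary keys are hypothetical, so the package's tokenization and return format may differ.

```python
from collections import Counter
from typing import Dict, List


def token_f1(reference_answer: str, candidate_answer: str) -> Dict[str, float]:
    """Token-overlap precision, recall, and F1 between two strings."""
    ref_tokens = reference_answer.lower().split()
    cand_tokens = candidate_answer.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def f1_match_any(reference_answers: List[str], candidate_answer: str,
                 threshold: float = 0.5) -> bool:
    """True if the F1 score exceeds the threshold for any gold answer."""
    return any(token_f1(ref, candidate_answer)["f1"] > threshold
               for ref in reference_answers)


stats = token_f1("the frog prince", "frog prince")
print(stats)
print(f1_match_any(["the frog prince"], "frog prince"))
```

In this example the candidate shares two of the gold answer's three tokens, so precision is 1, recall is 2/3, and F1 is 0.8.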

3. PEDANTS

Method: get_score

Parameters

  • reference_answer (str): A gold answer
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • float: The similarity score between two strings (0 to 1)

Method: get_highest_score

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: get_scores

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: evaluate

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • boolean: True if candidate answer matches any gold answer

Method: get_question_type

Parameters

  • reference_answer (list of str): List of gold answers
  • question (str): The question being evaluated

Returns

  • list: The type of the question (what, who, when, how, why, which, where)

Method: get_judgement_type

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

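  • list: A list of revised rules applicable to judging answer correctness

from qa_metrics.pedant import PEDANT

# Example question for the gold and candidate answers defined in section 1
question = "Which movie is loosely based on the Brothers Grimm's \"Iron Henry\"?"
pedant = PEDANT()
# Scores for every gold/candidate pair, then a single boolean verdict
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)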
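
The per-pair, best-pair, and boolean methods described above differ mainly in how they aggregate scores over several gold answers. The sketch below shows that aggregation pattern only. It uses a word-overlap (Jaccard) score as a stand-in and ignores the question, so it is not the PEDANTS classifier, and the function names, return shapes, and the 0.5 threshold are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Stand-in scorer in [0, 1]: word-set overlap. NOT the PEDANTS model."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def score_all(reference_answers: List[str], candidate_answer: str,
              scorer: Callable[[str, str], float] = jaccard) -> Dict[str, float]:
    """Score the candidate against every gold answer."""
    return {ref: scorer(ref, candidate_answer) for ref in reference_answers}


def best_pair(reference_answers: List[str], candidate_answer: str,
              scorer: Callable[[str, str], float] = jaccard) -> Tuple[str, float]:
    """Return the gold answer with the highest score, and that score."""
    scores = score_all(reference_answers, candidate_answer, scorer)
    best = max(scores, key=scores.get)
    return best, scores[best]


def is_match(reference_answers: List[str], candidate_answer: str,
             threshold: float = 0.5,
             scorer: Callable[[str, str], float] = jaccard) -> bool:
    """True if the best-scoring gold answer reaches the (assumed) threshold."""
    if not reference_answers:
        return False
    return best_pair(reference_answers, candidate_answer, scorer)[1] >= threshold


golds = ["The Frog Prince", "The Princess and the Frog"]
print(score_all(golds, "the princess and the frog"))
print(best_pair(golds, "the princess and the frog"))
print(is_match(golds, "iron henry"))
```

Because the scorer is passed in as a function, the same aggregation code would work with any scoring function that maps a gold/candidate pair to a number.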

4. Transformer Neural Evaluation

Method: get_score

Parameters

  • reference_answer (str): A gold answer
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • float: The similarity score between two strings (0 to 1)

Method: get_highest_score

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: get_scores

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: transformer_match

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • boolean: True if transformer model considers candidate answer equivalent to any gold answer

from qa_metrics.transformerMatcher import TransformerMatcher

# Supports `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`
tm = TransformerMatcher()
match_result = tm.transformer_match(reference_answer, candidate_answer, question)

5. LLM Integration

Method: prompt_gpt

Parameters

  • prompt (str): The input prompt text
  • model_engine (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
  • temperature (float): Controls randomness (0-1)
  • max_tokens (int): Maximum tokens in response

from qa_metrics.prompt_llm import CloseLLM

# `prompt` is your input text; YOUR_OPENAI_KEY is a placeholder for your API key
model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')

Method: prompt_claude

Parameters

  • prompt (str): The input prompt text
  • model_engine (str): Claude model to use
  • anthropic_version (str): API version
  • max_tokens_to_sample (int): Maximum tokens in response
  • temperature (float): Controls randomness (0-1)

# CloseLLM is imported in the previous example; YOUR_ANTHROPIC_KEY is a placeholder
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')

Method: prompt

Parameters

  • message (str): The input message text
  • model_engine (str): Model to use
  • temperature (float): Controls randomness (0-1)
  • max_tokens (int): Maximum tokens in response

from qa_metrics.prompt_open_llm import OpenLLM

# YOUR_DEEPINFRA_KEY is a placeholder for your DeepInfra API key
model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
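
The prompting methods above take free-form text, so using them for QA evaluation requires building a grading prompt and interpreting the reply. The sketch below shows one possible way to do that. The prompt wording, the helper names `build_eval_prompt` and `parse_verdict`, and the one-word reply convention are illustrative assumptions and are not part of qa_metrics.

```python
from typing import List


def build_eval_prompt(question: str, reference_answers: List[str],
                      candidate_answer: str) -> str:
    """Assemble a grading prompt (hypothetical template)."""
    golds = "; ".join(reference_answers)
    return (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Gold answers: {golds}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(response: str) -> bool:
    """True only if the first word of the reply is 'correct'."""
    words = response.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    return first == "correct"


eval_prompt = build_eval_prompt(
    "Which movie is loosely based on the Brothers Grimm's \"Iron Henry\"?",
    ["The Frog Prince", "The Princess and the Frog"],
    "The Princess and the Frog",
)
print(eval_prompt)
print(parse_verdict("Correct."))
```

The resulting string could be passed as `prompt` to `prompt_gpt` or `prompt_claude`, or as `message` to `prompt`. This assumes those methods return the model's reply as text, which is worth confirming before relying on `parse_verdict`.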

🤗 Model Hub

Our fine-tuned models are available on Hugging Face:

📚 Resources

📄 Citation

@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence}, 
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161}, 
}

📝 License

This project is licensed under the MIT License.

📬 Contact

For questions or comments, please contact: [email protected]
