title: Regard
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
  Regard aims to measure language polarity towards and social perceptions of a
  demographic (e.g. gender, race, sexual orientation).
Measurement Card for Regard
Measurement Description
The regard measurement returns the estimated language polarity towards and social perceptions of a demographic (e.g. gender, race, sexual orientation).
It uses a model trained on labelled data from the paper "The Woman Worked as a Babysitter: On Biases in Language Generation" (EMNLP 2019).
How to Use
This measurement takes one list of strings, or two lists when comparing the estimated polarity between two groups. Comparing two groups uses the "compare" configuration, as in Examples 2 to 4 below.
>>> import evaluate
>>> regard = evaluate.load("regard", "compare", module_type="measurement")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
>>> regard.compute(data = group1, references = group2)
Inputs
- data (list of str): prediction/candidate sentences, e.g. sentences describing a given demographic group.
- references (list of str) (optional): reference/comparison sentences, e.g. sentences describing a different demographic group to compare against.
- aggregation (str) (optional): determines the type of aggregation performed. If set to None, the regard scores of each string are returned for a single input, and the difference between the regard scores of the two groups is returned when references is given. Otherwise:
  - average: returns the average regard for each category (negative, positive, neutral, other) for each group
  - maximum: returns the maximum regard for each category for each group
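The aggregation options can be sketched in plain Python on scores shaped like the per-string output shown in the examples (one list of label/score entries per input string). This is a minimal illustration under stated assumptions, not the measurement's actual implementation: the helper names `scores_by_label` and `aggregate` are hypothetical, and the sample scores are hand-written to mirror the rounded values in Example 1.

```python
from collections import defaultdict
from statistics import mean


def scores_by_label(per_string_scores):
    """Collect, for each label, the scores it received across all strings."""
    grouped = defaultdict(list)
    for string_scores in per_string_scores:
        for entry in string_scores:
            grouped[entry["label"]].append(entry["score"])
    return dict(grouped)


def aggregate(per_string_scores, aggregation=None):
    """Hypothetical helper mirroring the documented aggregation options."""
    if aggregation is None:
        # No aggregation: keep the per-string scores as they are.
        return per_string_scores
    grouped = scores_by_label(per_string_scores)
    if aggregation == "average":
        return {label: mean(values) for label, values in grouped.items()}
    if aggregation == "maximum":
        return {label: max(values) for label, values in grouped.items()}
    raise ValueError(f"unknown aggregation: {aggregation!r}")


# Hand-written sample scores (assumption: rounded values taken from Example 1).
sample_scores = [
    [
        {"label": "neutral", "score": 0.95},
        {"label": "positive", "score": 0.02},
        {"label": "negative", "score": 0.02},
        {"label": "other", "score": 0.01},
    ],
    [
        {"label": "negative", "score": 0.97},
        {"label": "other", "score": 0.02},
        {"label": "neutral", "score": 0.01},
        {"label": "positive", "score": 0.0},
    ],
]

average_regard = aggregate(sample_scores, "average")
max_regard = aggregate(sample_scores, "maximum")
print(average_regard)
print(max_regard)
```

Because the sample scores are already rounded, the averages computed here can differ slightly from the values the measurement reports on unrounded model outputs.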
Output Values
With a single input:
regard: the regard scores of each string in the input list (if no aggregation is specified)
{'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
{'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
average_regard: the average regard for each category (negative, positive, neutral, other) (if aggregation = average)
{'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
max_regard: the maximum regard across all input strings (if aggregation = maximum)
{'neutral': 0.95, 'positive': 0.024, 'negative': 0.972, 'other': 0.019}
With two lists of inputs:
By default, this measurement outputs a dictionary of regard scores, one for each category (negative, positive, neutral, other), representing the difference in regard between the two groups (data minus references).
{'neutral': 0.35, 'negative': -0.36, 'other': 0.01, 'positive': 0.01}
With the aggregation='maximum' option, this measurement will output the maximum regard for each group:
{'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
With the aggregation='average' option, this measurement will output the average regard for each category (negative, positive, neutral, other):
{'neutral': 0.37, 'negative': 0.57, 'other': 0.05, 'positive': 0.01}
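The default comparison output can be related to the per-group averages with a short sketch. It assumes the difference is the data average minus the references average for each category, which is consistent with the rounded numbers in Examples 2 and 4 but is not confirmed here as the measurement's exact implementation. The function name `regard_difference` is hypothetical, and the input dictionaries are the rounded averages printed in Example 4.

```python
def regard_difference(average_data, average_references):
    """Per-category difference: data average minus references average.

    Assumption: a category missing from the references counts as 0.0.
    """
    return {
        label: average_data[label] - average_references.get(label, 0.0)
        for label in average_data
    }


# Rounded per-group averages as printed in Example 4 of this card.
average_data_regard = {"neutral": 0.48, "positive": 0.01, "negative": 0.5, "other": 0.01}
average_references_regard = {"negative": 0.96, "other": 0.02, "neutral": 0.02, "positive": 0.0}

difference = regard_difference(average_data_regard, average_references_regard)
print({label: round(value, 2) for label, value in difference.items()})
```

A positive value means the category scores higher on average for the data group than for the references group; a negative value means the opposite.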
Examples
Example 1 (single input):
>>> regard = evaluate.load("regard")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> results = regard.compute(data = group1)
>>> for d in results['regard']:
...     print({l['label']: round(l['score'],2) for l in d})
{'neutral': 0.95, 'positive': 0.02, 'negative': 0.02, 'other': 0.01}
{'negative': 0.97, 'other': 0.02, 'neutral': 0.01, 'positive': 0.0}
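When only the strongest category per string is of interest, the per-string scores can be reduced further. The sketch below assumes the structure used in Example 1 (each string maps to a list of entries with 'label' and 'score' keys). The helper name `top_label` is hypothetical, and the scores are hand-copied from the rounded output above rather than produced by the model.

```python
def top_label(string_scores):
    """Return the (label, score) pair with the highest score for one string."""
    best = max(string_scores, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# Hand-copied, rounded scores for the two strings of Example 1 (assumption).
regard_scores = [
    [
        {"label": "neutral", "score": 0.95},
        {"label": "positive", "score": 0.02},
        {"label": "negative", "score": 0.02},
        {"label": "other", "score": 0.01},
    ],
    [
        {"label": "negative", "score": 0.97},
        {"label": "other", "score": 0.02},
        {"label": "neutral", "score": 0.01},
        {"label": "positive", "score": 0.0},
    ],
]

top_labels = [top_label(scores) for scores in regard_scores]
for label, score in top_labels:
    print(label, score)
```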
Example 2 (comparison mode):
>>> regard = evaluate.load("regard", "compare")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
>>> results = regard.compute(data = group1, references = group2)
>>> print({k: round(v, 2) for k, v in results['regard_difference'].items()})
{'neutral': 0.46, 'positive': 0.01, 'negative': -0.46, 'other': -0.01}
Example 3 (returns the maximum regard score):
>>> regard = evaluate.load("regard", "compare")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
>>> results = regard.compute(data = group1, references = group2, aggregation = "maximum")
>>> print({k: round(v, 2) for k, v in results['max_data_regard'].items()})
{'neutral': 0.95, 'positive': 0.02, 'negative': 0.97, 'other': 0.02}
>>> print({k: round(v, 2) for k, v in results['max_references_regard'].items()})
{'negative': 0.98, 'other': 0.04, 'neutral': 0.03, 'positive': 0.0}
Example 4 (returns the average regard score):
>>> regard = evaluate.load("regard", "compare")
>>> group1 = ['xyz are described as mean', 'xyz are thought of as being too ambitious']
>>> group2 = ['xyz are known for making too much noise', 'xyz are described as often violent']
>>> results = regard.compute(data = group1, references = group2, aggregation = "average")
>>> print({k: round(v, 2) for k, v in results['average_data_regard'].items()})
{'neutral': 0.48, 'positive': 0.01, 'negative': 0.5, 'other': 0.01}
>>> print({k: round(v, 2) for k, v in results['average_references_regard'].items()})
{'negative': 0.96, 'other': 0.02, 'neutral': 0.02, 'positive': 0.0}
Citation(s)
@article{https://doi.org/10.48550/arxiv.1909.01326,
  doi = {10.48550/ARXIV.1909.01326},
  url = {https://arxiv.org/abs/1909.01326},
  author = {Sheng, Emily and Chang, Kai-Wei and Natarajan, Premkumar and Peng, Nanyun},
  title = {The Woman Worked as a Babysitter: On Biases in Language Generation},
  publisher = {arXiv},
  year = {2019}
}