---
language:
- en
tags:
- text-classification
license: cc0-1.0
library_name: transformers
widget:
- text: "sdfsdfa"
example_title: "Gibberish"
- text: "idkkkkk"
example_title: "Uncertainty"
- text: "Because you asked"
example_title: "Refusal"
- text: "I am a cucumber"
example_title: "High-risk"
- text: "My job went remote and I needed to take care of my kids"
example_title: "Valid"
---
# SANDS
_Semi-Automated Non-response Detection for Surveys_
Non-response detection designed to be used for open-ended survey text in conjunction with human reviewers.
## Model Details
Model Description: This model is a fine-tuned version of the supervised SimCSE BERT base uncased model. It was introduced at [AAPOR](https://www.aapor.org/) 2022 in the talk _Toward a Semi-automated item nonresponse detector model for open-response data_. The model is uncased, so it treats `important`, `Important`, and `ImPoRtAnT` the same.
* Developed by: [National Center for Health Statistics](https://www.cdc.gov/nchs/index.htm), Centers for Disease Control and Prevention
* Model Type: Text Classification
* Language(s): English
* License: CC0 1.0 (see the open source license section below)
Parent Model: For more details about SimCSE, we encourage users to check out the SimCSE [GitHub repository](https://github.com/princeton-nlp/SimCSE) and the [base model](https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased) on Hugging Face.
## How to Get Started with the Model
### Example: classifying a set of responses
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pandas as pd

# Load the model and tokenizer
model_location = "NCHS/SANDS"
model = AutoModelForSequenceClassification.from_pretrained(model_location)
tokenizer = AutoTokenizer.from_pretrained(model_location)

# Create example responses to test
responses = [
    "sdfsdfa",
    "idkkkkk",
    "Because you asked",
    "I am a cucumber",
    "My job went remote and I needed to take care of my kids",
]

# Run the model and compute a score for each response
with torch.no_grad():
    tokens = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
    output = model(**tokens)
    scores = torch.softmax(output.logits, dim=1).numpy()

# Display the scores in a table, one row per response
columns = ["Gibberish", "Uncertainty", "Refusal", "High-risk", "Valid"]
df = pd.DataFrame(scores, columns=columns, index=responses)
df.index.name = "Response"
print(df)
```
|Response| Gibberish| Uncertainty| Refusal| High-risk| Valid|
|--------|---------------|-----------------|-----------|-----------------|-----------|
|sdfsdfa| 0.998| 0.000| 0.000| 0.000| 0.000|
|idkkkkk| 0.002| 0.995| 0.001| 0.001| 0.001|
|Because you asked| 0.001| 0.001| 0.976| 0.006| 0.014|
|I am a cucumber| 0.001| 0.001| 0.002| 0.797| 0.178|
|My job went remote and I needed to take care of my kids| 0.000| 0.000| 0.000| 0.000| 1.000|
Alternatively, you can load the model using a pipeline:
```python
from transformers import pipeline
pipe = pipeline("text-classification", "NCHS/SANDS")
print(pipe(responses))
```
```python
[{'label': 'Gibberish', 'score': 0.9978908896446228},
{'label': 'Uncertainty', 'score': 0.9950007796287537},
{'label': 'Refusal', 'score': 0.9775006771087646},
{'label': 'High-risk', 'score': 0.9804121255874634},
{'label': 'Valid', 'score': 0.9997561573982239}]
```
With the pipeline, set `top_k` to see the full set of scores for each response:
```python
pipe(responses, top_k=5)
```
Finally, if you'd like to use a local GPU, set `device` to the GPU index (usually 0).
```python
pipe = pipeline("text-classification", "NCHS/SANDS", device=0)
```
## Uses
### Direct Uses
This model is intended to be used on open-ended survey responses during data cleaning, helping researchers filter out non-responsive or junk answers before analysis. For each response, the model returns a probability vector that sums to 1 across five categories: Gibberish, Uncertainty, Refusal, High-risk, and Valid.
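For example, the predicted category for each response can be read off with an argmax over the probability vector. A minimal sketch, reusing `model`, `responses`, and `scores` from the example above:
```python
import numpy as np

# Map each probability vector to its most likely category using the
# model's built-in id2label mapping.
labels = [model.config.id2label[i] for i in range(model.config.num_labels)]
predictions = [labels[i] for i in np.argmax(scores, axis=1)]
for response, label in zip(responses, predictions):
    print(f"{label:12s} {response}")
```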
### Response types
+ **Gibberish**: Nonsensical responses where the respondent entered text without regard for English syntax. Examples: `ksdhfkshgk` and `sadsadsadsadsadsadsad`.
+ **Refusal**: Responses in valid English that are either a direct refusal to answer the question asked or that bear no contextual relationship to the question. Examples: `Because` or `Meow`.
+ **Uncertainty**: Responses where the respondent does not understand the question, does not know the answer, or does not know how to respond. Examples: `I dont know` or `unsure what you are asking`.
+ **High-risk**: Responses that may be valid depending on the context and content of the question. These require human subject-matter expertise to classify as valid or not (see the triage sketch after this list). Examples: `Necessity` or `I am a cucumber`.
+ **Valid**: Responses that answer the question at hand and provide insight into the respondent's thoughts on its subject matter. Examples: `COVID began for me when my children’s school went online and I needed to stay home to watch them` or `staying home, avoiding crowds, still wear masks`.
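A minimal triage sketch built on these categories, reusing the `pipe` object and `responses` list from the examples above. The 0.5 confidence threshold is an illustrative assumption, not part of the model:
```python
keep, needs_review, drop = [], [], []
for response, result in zip(responses, pipe(responses)):
    # `result` holds the top label and its score for this response
    if result["label"] == "Valid" and result["score"] >= 0.5:
        keep.append(response)
    elif result["label"] == "High-risk" or result["score"] < 0.5:
        # Route high-risk and low-confidence answers to a human reviewer
        needs_review.append(response)
    else:
        drop.append(response)  # Gibberish, Uncertainty, Refusal
```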
## Misuses and Out-of-scope Use
The model has been trained specifically to identify non-response in open-ended survey responses, where the respondent has entered text but the answer does not address the question at hand or provide any meaningful insight. Some examples of these types of responses are `meow`, `ksdhfkshgk`, or `idk`. The model was fine-tuned on 3,000 labeled open-ended responses to web probes on questions relating to the COVID-19 pandemic, gathered from the [Research and Development Survey (RANDS)](https://www.cdc.gov/nchs/rands/index.htm) conducted by the Division of Research and Methodology at the National Center for Health Statistics. Web probes are questions implementing probing techniques from cognitive interviewing for use in survey question design, and they differ from traditional open-ended survey questions. Our labeled responses were limited in focus to COVID and health topics, so performance may degrade on responses outside this scope.
The responses the model was trained on come from both web- and phone-based open-ended probes. The model may be less effective on more traditional open-ended survey questions, or on responses collected through other mediums.
This model does not assess the factual accuracy of responses or filter out responses reflecting demographic biases. It was not trained to verify facts about people or events, so using it for such classification is out of scope.
We did not train the model to recognize non-response in any language other than English. Responses in other languages are out of scope, and the model will perform poorly on them; any correct classifications are an artifact of the base SimCSE or BERT models.
## Risks, Limitations, and Biases
To investigate if there were differences between demographic groups in sensitivity and specificity, we conducted two-tailed Z-tests across demographic groups. These included education (some college or less, bachelor’s or more), sex (male or female), mode (computer or telephone), race and ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, and all others who are non-Hispanic), and age (18-29, 30-44, 45-59, and 60+). There were 4,813 responses to 3 probes. To control the family-wise error rate, the Bonferroni correction was applied to the alpha level (α < 0.00167).
There were statistically significant differences in specificity between education levels, mode, and White and Black respondents. There were no statistically significant differences in sensitivity. Respondents with some college or less had lower specificity compared to those with more education (0.73 versus 0.80, p < 0.0001). Respondents who used a smartphone or computer to complete their survey had a higher specificity than those who completed the survey over the telephone (0.77 versus 0.70, p < 0.0001). Black respondents had a lower specificity than White respondents (0.65 versus 0.78, p < 0.0001). Effect sizes for education and mode were small (h = 0.17 and h = 0.16, respectively) while the effect size for race was between small and medium (h = 0.28).
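The reported effect sizes are consistent with Cohen's h for a difference between two proportions, h = 2·arcsin(√p₁) − 2·arcsin(√p₂). A quick check against the specificities above (small discrepancies reflect rounding in the reported values):
```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

print(f"education: {cohens_h(0.80, 0.73):.2f}")  # ~0.17
print(f"mode:      {cohens_h(0.77, 0.70):.2f}")  # ~0.16
print(f"race:      {cohens_h(0.78, 0.65):.2f}")  # ~0.29 (reported as 0.28)
```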
As the model was fine-tuned from SimCSE, itself fine-tuned from BERT, it will reproduce all biases inherent in these base models. Due to tokenization, the model may incorrectly classify typos, especially in acronyms. For example: `LGBTQ` is valid, while `LBGTQ` is classified as gibberish.
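One way to see the tokenization effect, reusing the `tokenizer` from the first example (the exact subword splits depend on the WordPiece vocabulary):
```python
# The correctly spelled acronym and the typo are split into different
# subword pieces; unfamiliar pieces can push a response toward Gibberish.
print(tokenizer.tokenize("LGBTQ"))
print(tokenizer.tokenize("LBGTQ"))
```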
## Training
#### Training Data
The model was fine-tuned on 3,200 labeled open-ended responses from [RANDS during COVID-19 Rounds 1 and 2](https://www.cdc.gov/nchs/rands/index.htm). The base SimCSE BERT model was trained on BookCorpus and English Wikipedia.
#### Training procedure
+ Learning rate: 5e-5
+ Batch size: 16
+ Number of training epochs: 4
+ Base Model pooling dimension: 768
+ Number of labels: 5
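A sketch of how a comparable fine-tune could be set up with the Hugging Face `Trainer`, using the hyperparameters listed above. `train_dataset` is a hypothetical placeholder, since the labeled RANDS responses are not distributed with the model:
```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Start from the supervised SimCSE checkpoint with a fresh 5-label head
base = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=5)

args = TrainingArguments(
    output_dir="sands-finetune",
    learning_rate=5e-5,              # learning rate from the list above
    per_device_train_batch_size=16,  # batch size from the list above
    num_train_epochs=4,              # training epochs from the list above
)

# `train_dataset` stands in for a tokenized dataset of labeled responses
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```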
## Suggested citation
```bibtex
@misc{cibellihibben2023sands,
  title={Semi-Automated Nonresponse Detection for Open-text Survey Data},
  author={Kristen Cibelli Hibben and Zachary Smith and Ben Rogers and Valerie Ryan and Paul Scanlon and Kristen Miller and Travis Hoppe},
  year={2023},
  url={https://huggingface.co/NCHS/SANDS},
  doi={10.57967/hf/0414}
}
```
## Open source license
Model and code, including source files and code samples if any in the content, are released as open source under the [Creative Commons Universal Public Domain (CC0 1.0)](https://creativecommons.org/publicdomain/zero/1.0/) dedication. This means you can use the code, model, and content in this repository in your own projects, except for any official trademarks.
Open source projects are made available and contributed to under licenses that, for the protection of contributors, make clear the projects are offered "as-is", without warranty, and that disclaim liability for damages resulting from using the projects. This model is no different; the open content license it is offered under includes such terms.