---
title: Word Count
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- measurement
description: >-
  Returns the total number of words, and the number of unique words in the input data.
---
# Measurement Card for Word Count
## Measurement Description
The `word_count` measurement returns the total number of words and the number of unique words in the input strings, using scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
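To make "word" concrete, here is a minimal standard-library sketch of the counting, not the measurement's actual code. It assumes the measurement relies on `CountVectorizer`'s defaults, which lowercase the text and keep only tokens of two or more word characters. The names `TOKEN_PATTERN` and `sketch_word_count` are hypothetical.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default tokenization, which
# lowercases the text and keeps tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def sketch_word_count(data):
    """Approximate total and unique word counts for a list of strings."""
    counts = Counter()
    for text in data:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return {
        "total_word_count": sum(counts.values()),
        "unique_words": len(counts),
    }


print(sketch_word_count(["hello world and hello moon"]))
# {'total_word_count': 5, 'unique_words': 4}
print(sketch_word_count(["I saw a cat"]))
# {'total_word_count': 2, 'unique_words': 2}
```

Under this assumption, single-character tokens such as "I" and "a" are not counted, so the reported totals can be lower than a whitespace-split count.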
## How to Use
This measurement requires a list of strings as input:
```python
>>> import evaluate
>>> data = ["hello world and hello moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
```
### Inputs
- **data** (list of `str`): The input list of strings for which the word count is calculated.
- **max_vocab** (`int`): (optional) the maximum number of most frequent words to consider (can be specified if the dataset is too large).
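The effect of `max_vocab` can be illustrated with a small standard-library sketch. It assumes, without confirmation from this card, that `max_vocab` is forwarded to `CountVectorizer`'s `max_features` parameter, which keeps only the most frequent terms across the corpus. The helper `capped_word_count` is hypothetical.

```python
import re
from collections import Counter

# Assumption: same default tokenization as CountVectorizer
# (lowercased tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def capped_word_count(data, max_vocab=None):
    """Count words while keeping only the max_vocab most frequent ones."""
    counts = Counter(
        token for text in data for token in TOKEN_PATTERN.findall(text.lower())
    )
    # most_common(None) returns every word, so no cap is applied by default.
    kept = counts.most_common(max_vocab)
    return {
        "total_word_count": sum(count for _, count in kept),
        "unique_words": len(kept),
    }


data = ["hello sun and goodbye moon", "foo bar foo bar"]
print(capped_word_count(data))
# {'total_word_count': 9, 'unique_words': 7}
print(capped_word_count(data, max_vocab=2))
# {'total_word_count': 4, 'unique_words': 2}
```

If the assumption holds, capping the vocabulary lowers both reported numbers, because occurrences of dropped words no longer contribute to the total.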
### Output Values
- **total_word_count** (`int`): the total number of words in the input string(s).
- **unique_words** (`int`): the number of unique words in the input string(s).
Output Example(s):
```python
{'total_word_count': 5, 'unique_words': 4}
```
### Examples
Example for a single string:
```python
>>> data = ["hello sun and goodbye moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 5, 'unique_words': 5}
```
Example for multiple strings:
```python
>>> data = ["hello sun and goodbye moon", "foo bar foo bar"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 9, 'unique_words': 7}
```
Example for a dataset from 🤗 Datasets:
```python
>>> import datasets
>>> imdb = datasets.load_dataset('imdb', split='train')
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=imdb['text'])
>>> print(results)
{'total_word_count': 5678573, 'unique_words': 74849}
```
## Citation(s)
## Further References
- [Sklearn `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)