nanom committed
Commit 469ae10
2 Parent(s): e8aad19 a69cb43

Updated README

Files changed (1)
  1. README.md +13 -102
README.md CHANGED
@@ -1,102 +1,13 @@
- # EDIA: Stereotypes and Discrimination in Artificial Intelligence
- [[Paper]](https://arxiv.org/abs/2207.06591) [[HuggingFace🤗 Demo]](https://huggingface.co/spaces/vialibre/edia)
-
- Language models and word representations obtained with machine learning contain discriminatory stereotypes. Here we present the EDIA project (Stereotypes and Discrimination in Artificial Intelligence). The project aimed to design and evaluate a methodology that allows social scientists and domain experts in Latin America to explore biases and discriminatory stereotypes present in word embeddings (WE) and language models (LM). It also allowed them to define the type of bias to explore and to perform an intersectional analysis using two binary dimensions (for example, female-male intersected with fat-skinny).
-
- EDIA contains several functions for detecting and inspecting biases in natural language processing systems based on language models or word embeddings. We provide models in Spanish and English so that users can explore biases in either language. Each of the following spaces contains different functions that address a particular aspect of the bias problem, and together they illuminate different but complementary parts of it.
-
- You can test and explore these functions in our live demo hosted on HuggingFace🤗 by clicking [here](https://huggingface.co/spaces/vialibre/edia).
- ## Installation
-
- Set up the code in a virtualenv:
-
- ```sh
- # Clone the repo
- $ git clone https://github.com/fvialibre/edia.git && cd edia
- # Create and activate a virtualenv
- $ python3 -m venv venv && source venv/bin/activate
- # Install requirements
- $ python3 -m pip install -r requirements.txt
- ```
- ## Setup data
-
- Before you can use the tool, you need to create the structure it requires to retrieve its data. We provide a script that does this automatically, as well as instructions on how to do it manually for further customization.
-
- ### Automatic setup
- The cloned repository includes a `setup.sh` script that you can run on Linux:
-
- ```sh
- $ ./setup.sh
- ```
-
- This creates a `data/` folder inside the repository and downloads two 100k-word embedding files (one English, one Spanish) and two vocabulary files (`Min` and `Full`, see [Manual setup](#Manual-setup)) from *Google Drive*.
-
- ### Manual setup
- To set up the structure manually, create a `data/` folder inside the `edia` repository you just cloned:
-
- ```sh
- $ mkdir data
- ```
-
- Then download the files you need into this newly created folder:
-
- * [Min vocabulary:](https://drive.google.com/file/d/1uI6HsBw1XWVvTEIs9goSpUVfeVJe-zEP/view?usp=sharing) Composed of only 56 words, for testing purposes only.
- * [Full vocabulary:](https://drive.google.com/file/d/1T_pLFkUucP-NtPRCsO7RkOuhMqGi41pe/view?usp=sharing) Composed of 1.2M words.
- * [Spanish word embeddings:](https://drive.google.com/file/d/1YwjyiDN0w54P55-y3SKogk7Zcd-WQ-eQ/view?usp=sharing) 100K Spanish word embeddings of 300 dimensions (from [Jorge Pérez's website](http://dcc.uchile.cl/~jperez)).
- * [English word embeddings:](https://drive.google.com/file/d/1EN0pp1RKyRwi072QhVWJaDO8KlcFZo46/view?usp=sharing) 100K English word embeddings of 300 dimensions (from [Eyal Gruss's github](https://github.com/eyaler/word2vec-slim)).
-
- > **Note**: You will need one of the two vocabulary files (`Min` or `Full`) unless you want to build the required structure yourself. The embeddings file, on the other hand, can be one of your own; we provide these two as working options.
-
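- As a quick sanity check that a downloaded embedding file is usable, you can load it with [gensim](https://radimrehurek.com/gensim/) (the library whose C binary format the tool accepts for `.bin` files). This is only a sketch, not part of the tool's documented workflow; it assumes gensim is installed in the virtualenv, and uses the English embeddings path referenced in [Tool Configuration](#Tool-Configuration):
-
- ```python
- # Sketch: verify that a downloaded .vec file loads and has the expected shape.
- from gensim.models import KeyedVectors
-
- # Same path the tool's configuration uses; adjust to your download location.
- kv = KeyedVectors.load_word2vec_format("data/100k_en_embedding.vec")
-
- print(kv.vector_size)                     # expected: 300
- print(kv.most_similar("woman", topn=5))   # nearest neighbors of an in-vocabulary word
- ```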
- ## Usage
- ```sh
- # If you are not already in the venv
- $ source venv/bin/activate
- $ python3 app.py
- ```
-
- ## Tool Configuration
-
- The file `tool.cfg` contains the configuration parameters for the tool:
-
- | **Name** | **Options** | **Description** |
- |---|---|---|
- | language | `es`, `en` | Changes the interface language. |
- | embeddings_path | `data/100k_es_embedding.vec`, `data/100k_en_embedding.vec` | Path to the word embeddings to use. You can use your own embedding file as long as it is in `.vec` format; if it has a `.bin` extension, only gensim's C binary format is valid. The listed options correspond to pretrained Spanish and English embeddings. |
- | nn_method | `sklearn`, `ann` | Method used to fetch nearest neighbors. `sklearn` uses the exact [scikit-learn nearest neighbors](https://scikit-learn.org/stable/modules/neighbors.html) calculation, so your embeddings must fit in memory; it is slower for large embeddings. [`ann`](https://pypi.org/project/annoy/1.0.3/) performs approximate nearest-neighbor search and is suitable for large embeddings that don't fit in memory. |
- | max_neighbors | (int) `20` | Number of neighbors used to fit the sklearn nearest neighbors method. |
- | context_dataset | `vialibre/splittedspanish3bwc` | Path to the split 3bwc dataset, optimized for word context search. |
- | vocabulary_subset | `mini`, `full` | Vocabulary required by the context search tool. |
- | available_wordcloud | `True`, `False` | Show the word cloud in the "Data" interface. |
- | language_model | `bert-base-uncased`, `dccuchile/bert-base-spanish-wwm-uncased` | `bert-base-uncased` is an English language model; `dccuchile/bert-base-spanish-wwm-uncased` is a Spanish one. You can inspect any BERT-base language model uploaded to the [Hugging Face Hub](https://huggingface.co/models). |
- | available_logs | `True`, `False` | Activate logging of user input. Logs are saved in the `logs/` folder. |
-
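- Putting these options together, a working configuration might look like the sketch below. This is an assumption for illustration only: the exact `tool.cfg` syntax is not shown in this README, so a plain `key = value` layout is assumed, with values taken from the options listed above:
-
- ```cfg
- # Hypothetical tool.cfg contents (layout assumed, values from the table above)
- language = en
- embeddings_path = data/100k_en_embedding.vec
- nn_method = sklearn
- max_neighbors = 20
- context_dataset = vialibre/splittedspanish3bwc
- vocabulary_subset = full
- available_wordcloud = True
- language_model = bert-base-uncased
- available_logs = False
- ```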
- ## Resources
- ### Video tutorials and user manuals
- * Word explorer: [[video]]() [manual: [es](https://shorturl.at/cgwxJ) | en]
- * Word bias explorer: [[video]]() [manual: [es](https://shorturl.at/htuEI) | en]
- * Phrase bias explorer: [[video]]() [manual: [es](https://shorturl.at/fkBL3) | en]
- * Data explorer: [[video]]() [manual: [es](https://shorturl.at/CIVY6) | en]
- * Crows-Pairs: [[video]]() [manual: [es](https://shorturl.at/gJLTU) | en]
- ### Interactive notebooks
- * How to use (*road map*): [[es](notebook/EDIA_Road_Map.ipynb) | en]
- * Classes and methods docs: [[es](notebook/EDIA_Docs.ipynb) | en]
-
- ## Citation Information
- ```bibtex
- @misc{https://doi.org/10.48550/arxiv.2207.06591,
-   doi = {10.48550/ARXIV.2207.06591},
-   url = {https://arxiv.org/abs/2207.06591},
-   author = {Alemany, Laura Alonso and Benotti, Luciana and González, Lucía and Maina, Hernán and Busaniche, Beatriz and Halvorsen, Alexia and Bordone, Matías and Sánchez, Jorge},
-   keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
-   title = {A tool to overcome technical barriers for bias assessment in human language technologies},
-   publisher = {arXiv},
-   year = {2022},
-   copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
- }
- ```
-
- ## License Information
- This project is under an [MIT license](LICENSE).
-
 
+ ---
+ title: Edia Full En
+ emoji: 👁
+ colorFrom: purple
+ colorTo: gray
+ sdk: gradio
+ sdk_version: 3.16.2
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference