Emily McMilin
committed on
Commit
·
a458815
1
Parent(s):
652f191
simplify text more. Update DAG image
app.py
CHANGED
@@ -481,25 +481,34 @@ with demo:
|
|
481 |
|
482 |
|
483 |
gr.Markdown("### Data Generating Process")
|
484 |
-
gr.Markdown("To pick values below that are most likely to cause spurious correlations, it helps to make some assumptions about the training
485 |
|
486 |
-
gr.Markdown("A plausible data generating process for both Wikipedia and Reddit sourced data is shown as a DAG below. These DAGs are prone to collider bias when conditioning on `access`. In other words, although in real life `place`, `date`, (subreddit) `interest` and gender are all unconditionally independent, when we condition on their common effect, `access`, they become conditionally dependent. Composing a dataset often requires the dataset maintainers to condition on `access`. Thus LLMs learn these dataset-induced dependencies, which appear to us as spurious correlations.")
|
487 |
|
488 |
gr.Markdown("""
|
489 |
<style>
|
490 |
img {
|
491 |
-
|
492 |
-
|
493 |
-
}
|
|
|
494 |
<center>
|
495 |
<img src="https://www.dropbox.com/s/4f07djirinl2qvy/show_g_crop.png?raw=1"
|
496 |
alt="DAG of possible data generating process for datasets used in training some of our LLMs.">
|
497 |
</center>
|
498 |
""")
|
499 |
|
500 |
-
gr.Markdown("
|
501 |
-
gr.Markdown("
|
502 |
-
|
|
|
503 |
demo.launch(debug=True)
|
504 |
|
|
|
505 |
# %%
|
|
|
481 |
|
482 |
|
483 |
gr.Markdown("### Data Generating Process")
|
484 |
+
gr.Markdown("To pick values below that are most likely to cause spurious correlations, it helps to make some assumptions about the training dataset's likely data generating process, and where selection bias may come in.")
|
485 |
+
|
486 |
+
gr.Markdown("A plausible data generating process for the Wiki-Bio and Reddit datasets is shown as a DAG below. The variables `W` (birth place, birth date, or subreddit interest) and `G` (gender) are independent of each other and have no ancestral variables. However, `W` and `G` may both play a role in causing one's access, `Z`. In the case of Wiki-Bio, a functional form of `Z` may capture the general trend that access has become less gender-dependent over time, but not in every place. In the case of Reddit TLDR, `Z` may capture that, despite some subreddits having gender-neutral topics, the specific style of moderation and community in a subreddit may reduce access for some genders.")
|
487 |
+
|
488 |
+
gr.Markdown("This DAG structure is prone to collider bias between `W` and `G` when conditioning on access, `Z`. In other words, although in real life *place*, *date*, and (subreddit) *interest* are each unconditionally independent of *gender*, when we condition on their common effect, *access*, they become conditionally dependent.")
|
489 |
+
|
490 |
+
gr.Markdown("The obvious solution, not conditioning on access, is unavailable to us, because conditioning on access is how we represent the process of selection into the dataset. Thus, a statistical relationship between `W` and `G` can be induced by the dataset formation, leading to possible spurious correlations, as shown here.")
|
491 |
+
|
492 |
|
|
|
493 |
|
494 |
gr.Markdown("""
|
495 |
<style>
|
496 |
img {
|
497 |
+
width: 30%;
|
498 |
+
max-width: 600px;
|
499 |
+
}
|
500 |
+
</style>
|
501 |
<center>
|
502 |
<img src="https://www.dropbox.com/s/4f07djirinl2qvy/show_g_crop.png?raw=1"
|
503 |
alt="DAG of possible data generating process for datasets used in training some of our LLMs.">
|
504 |
</center>
|
505 |
""")
|
506 |
|
507 |
+
gr.Markdown("### I Don't Buy It")
|
508 |
+
gr.Markdown("See something wrong above? Do you think we cherry-picked our examples? Try your own, including your own x-axis. Think we cherry-picked LLMs? Try the `add-your-own` model option. This demo _should_ work with any Hugging Face model that supports the [fill-mask](https://huggingface.co/models?pipeline_tag=fill-mask) task.")
|
509 |
+
gr.Markdown("Think our data generating process is wrong, or found an interesting spurious correlation you'd like to set as a default example? Use the community tab to discuss it or open a pull request with your fix.")
|
510 |
+
|
511 |
demo.launch(debug=True)
|
512 |
|
513 |
+
|
514 |
# %%
|
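
The collider-bias claim in the added Markdown text (independent `W` and `G` becoming dependent once we condition on access `Z`) can be checked with a small simulation. The sketch below is not part of `app.py`. The binary encoding of `W` and `G`, the linear form of the access probability, and the helper names `simulate` and `pearson` are all assumptions made for illustration only.

```python
import random


def simulate(n=100_000, seed=0):
    """Draw (w, g, z) rows from a toy W -> Z <- G collider structure."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # W and G are drawn independently: neither has any ancestors.
        w = int(rng.random() < 0.5)  # hypothetical binary W (e.g. later birth date)
        g = int(rng.random() < 0.5)  # hypothetical binary G
        # Assumed functional form: access depends on both W and G.
        p_access = 0.05 + 0.45 * w + 0.45 * g
        z = int(rng.random() < p_access)
        rows.append((w, g, z))
    return rows


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5


rows = simulate()

# Whole population: W and G should be (nearly) uncorrelated.
full_corr = pearson([w for w, g, z in rows], [g for w, g, z in rows])

# "Dataset": only rows with access (Z = 1) are selected, i.e. we condition on Z.
selected = [(w, g) for w, g, z in rows if z == 1]
selected_corr = pearson([w for w, g in selected], [g for w, g in selected])

print(f"corr(W, G) in full population: {full_corr:+.3f}")
print(f"corr(W, G) given access Z = 1: {selected_corr:+.3f}")
```

Under these assumed probabilities the first correlation stays close to zero while the second is clearly negative, which is the dataset-induced dependence the text describes. The sign and size of the induced correlation depend entirely on the assumed form of `p_access`.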