ABOUT_TEXT = """
## Web Languages Project

Welcome! This is a crowd-sourced effort to improve crawling
of low-resource languages. This dataset is public.

[Common Crawl](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages)
recognizes a lot of languages, and we can see that we don't have
enough content in languages like Hindi (500 million speakers!),
national languages of smaller countries like Hungarian, and regional
languages like Catalan.

We are interested in languages from all over the world. If you choose
to help, you'll be helping create lists of websites written in
languages that you read or speak.

### How can I contribute?

If you look below you'll see a huge list of living languages. If you
see one that looks interesting, click on it. You'll see a
language-specific document, probably mostly blank, that you can fill
out.

There are two ways to add to this document. If you aren't very familiar
with GitHub, you can copy the entire document into an email, fill it
out, and send it to web-languages ZAT commoncrawl ZOT org. We'll do the rest.

If you are familiar with GitHub and are logged in, click on the pen
icon in the upper right corner to start editing the document.
GitHub will request that you fork the repo. Do that, edit the
document, and finally create a pull request.

To see a partially completed example, look at the
[Welsh](living/welsh.md) entry.

Sometimes asking a Large Language Model can be helpful: "What are some
top websites written in the Welsh language?"

### What kind of websites are you looking for?

If you look at the template, we have requested URLs in a few
categories: News, Culture/History, Government, Political Parties, and
Other. Remember that we're looking for websites in this particular
language. If the language is only a part of the website, and that's
visible in the URL as https://example.com/catalan/, then that's the
URL you should add.

For a language like Hindi, with 500 million speakers, there are a lot
of websites to choose from. Please suggest websites that are important
and influential, and please think about diversity. Are all geographic
regions represented?

For a country-wide language like Hungarian, there is still probably a
wide variety of websites to choose from. If a website is entirely in
English, however, that's not what we're looking for.

For a regional language like Catalan, things are trickier. Catalan has
multiple names -- it's called Valencian in some parts of Spain -- and
use of the Catalan language is a part of a vigorous debate in Spanish
national and regional politics. You might not be able to find
Catalan-language content for every political party, and government
websites might offer Catalan content one day and remove it after
the next election. In that case, please do the best you can.

If your favorite language has its own Wikipedia -- [check the list here](https://en.wikipedia.org/wiki/List_of_Wikipedias) --
please include this link under "Other".
"""