ABOUT_TEXT = """ ## Web Languages Project Welcome! This is a crowd-sourced effort to improve crawling of low-resource languages. This dataset is public. [Common Crawl](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages) recognizes a lot of languages, and we can see that we don't have enough of languages like Hindi (500 million speakers!), smaller country languages like Hungarian, and regional languages like Catalan. We are interested in languages from all over the world. If you choose to help, you'll be helping create lists of websites related to languages that you read or speak. ### How can I contribute? If you look below you'll see a huge list of living languages. If you see one that looks interesting, click on it. You'll see a language-specific document, probably mostly blank, that you can fill out. There are 2 ways to add to this document. If you aren't very familiar with Github, you can copy the entire document into an email, fill it out, and send it to web-languages ZAT commoncrawl ZOT org. We'll do the rest. If you are familiar with Github, and are logged in, click on the pen icon in the upper right corner to start editing the document. Github will request that you fork the repo. Do that, edit the document, and finally create a pull request. To see a partially completed example, look at the [Welsh](living/welsh.md) entry. Sometimes asking a Large Language Model can be helpful: "What are some top websites written in the Welsh language?" ### What kind of websites are you looking for? If you look at the template, we have requested urls in a few categories: News, Culture/History, Government, Political Parties, and Other. Remember that we're looking for websites in this particular language. If the language is only a part of the website, and that's visible in the URL as https://example.com/catalan/, then that's the URL you should add. For a language like Hindi, with 500 million speakers, there are a lot of websites to choose from. Please suggest websites that are important and influential, and please think about diversity. Are all geographic regions represented? For a country-wide language like Hungarian, there are still probably a wide variety of websites to choose from. If a website is all English, however, that's not what we're looking for. For a regional language like Catalan, things are trickier. Catalan has multiple names -- it's called Valencian in some parts of Spain -- and use of the Catalan language is a part of a vigorous debate in Spanish national and regional politics. You might not be able to find Catalan-language content for every political party, and government websites might offer Catalan content one day and remove it after the next election. In that case, please do the best you can. If your favorite language has its own Wikipedia -- [check the list here](https://en.wikipedia.org/wiki/List_of_Wikipedias) -- please include this link under "Other". """