CC_FILTER
This is a Japanese (ja) Common Crawl filter: it distinguishes pages referenced from ja Wikipedia from random ja Common Crawl pages. It was built with the following procedure:
- Download the ja wiki dump file and extract every URL in it, giving about 4M URLs.
- Crawl 300K of those 4M URLs.
- Extract the plain text and drop pages whose content length is under 1k.
- Use langdetect to identify the language of each page; this leaves 101K ja pages in total.
- Randomly sample from Common Crawl 202303 and use langdetect to find 101K ja pages.
- Tokenize all text with "rinna/japanese-roberta-base".
- Feed the tokens to fastText to produce model.bin.
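
The last steps (length filter, tokenization, fastText input) can be sketched roughly as below. This is a minimal illustration, not the original build script: the function name `to_fasttext_lines`, the label names, the reading of "1k" as 1,000 characters, and the use of fastText's supervised `__label__` line format are all assumptions. The default tokenizer is plain whitespace splitting, standing in for the "rinna/japanese-roberta-base" tokenizer.

```python
from typing import Callable, Iterable, List

# Assumption: "1k" in the procedure means 1,000 characters.
MIN_CHARS = 1000


def to_fasttext_lines(
    texts: Iterable[str],
    label: str,
    tokenize: Callable[[str], List[str]] = str.split,
    min_chars: int = MIN_CHARS,
) -> List[str]:
    """Turn raw page texts into fastText supervised-format lines.

    Each output line looks like "__label__<label> tok1 tok2 ...".
    Pages shorter than `min_chars` are dropped.
    `tokenize` is a stand-in; the card uses "rinna/japanese-roberta-base".
    """
    lines = []
    for text in texts:
        if len(text) < min_chars:
            continue  # length filter from the procedure
        # fastText expects one example per line, so flatten newlines first.
        tokens = tokenize(text.replace("\n", " "))
        if tokens:
            lines.append(f"__label__{label} " + " ".join(tokens))
    return lines
```

With the real tokenizer, `tokenize` could be the `tokenize` method of `transformers.AutoTokenizer.from_pretrained("rinna/japanese-roberta-base")`. The wiki-referenced and Common Crawl lines would then be written to one training file and passed to `fasttext.train_supervised(input=...)`, with `save_model("model.bin")` writing the result. Whether the original model was trained in exactly this way is not stated in the card.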