---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier organizes web content into 24 categories based on the URL and text content of web pages.

The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

#### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL)
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```
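
For instance, a minimal sketch of constructing this input (the `url` and `text` values here are made-up placeholders):
```python
# Join the page URL and its text with a blank line in between,
# matching the {url}\n\n{text} format described above.
url = "http://www.example.com"             # hypothetical URL
text = "How to make a good sandwich? ..."  # hypothetical page text

web_page = f"{url}\n\n{text}"
```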

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

# The page URL, a blank line, then the page text
web_page = """http://www.example.com

How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```

You can convert the model's `logits` with a softmax to obtain a probability distribution over the following 24 categories (in the order of the label indices; see also `id2label` and `label2id` in the model config, and the decoding sketch after the list):
1. Academic Writing
2. Content Listing
3. Creative Writing
4. Customer Support
5. Comment Section
6. FAQ
7. Truncated
8. Knowledge Article
9. Legal Notices
10. Listicle
11. News Article
12. Nonfiction Writing
13. About (Org.)
14. News (Org.)
15. About (Pers.)
16. Personal Blog
17. Product Page
18. Q&A Forum
19. Spam / Ads
20. Structured Data
21. Documentation
22. Audio Transcript
23. Tutorial
24. User Review

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
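
Continuing the usage example above, predictions can be decoded into these category names via the `id2label` mapping; a minimal sketch:
```python
import torch

# Map predicted indices back to category names using the
# id2label mapping stored in the model config.
probs = outputs.logits.softmax(dim=-1)
top = torch.topk(probs[0], k=3)  # the three most likely formats
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
# For the sandwich example above, "Truncated" should rank first.
```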

#### Efficient Inference
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16
)
```
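
Loaded this way, the model is used exactly as in the example above. For instance, a sketch of batched inference, reusing the `tokenizer` from the earlier snippet (the `web_pages` list is a made-up example):
```python
# Classify several pages at once: pad to the longest document in
# the batch and truncate to the model's maximum sequence length.
web_pages = [
    "http://www.example.com\n\nHow to make a good sandwich?",
    "http://www.example.org\n\nContact us for support and returns.",
]

inputs = tokenizer(web_pages, return_tensors="pt",
                   padding=True, truncation=True)
with torch.inference_mode():
    logits = model(**inputs).logits

for page, idx in zip(web_pages, logits.argmax(dim=-1).tolist()):
    print(model.config.id2label[idx], "<-", page.splitlines()[0])
```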

## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```