---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- Alibaba-NLP/gte-base-en-v1.5
---
# WebOrganizer/FormatClassifier

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

#### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL)
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```
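Joining the two fields can be sketched as a small helper (purely illustrative; `format_input` is not part of the released code):

```python
def format_input(url: str, text: str) -> str:
    """Join the page URL and its text with a blank line, as the classifier expects."""
    return f"{url}\n\n{text}"

web_page = format_input("http://www.example.com", "How to make a good sandwich?")
```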

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```

You can convert the `logits` of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see `id2label` and `label2id` in the model config):
1. Academic Writing
2. Content Listing
3. Creative Writing
4. Customer Support
5. Comment Section
6. FAQ
7. Truncated
8. Knowledge Article
9. Legal Notices
10. Listicle
11. News Article
12. Nonfiction Writing
13. About (Org.)
14. News (Org.)
15. About (Pers.)
16. Personal Blog
17. Product Page
18. Q&A Forum
19. Spam / Ads
20. Structured Data
21. Documentation
22. Audio Transcript
23. Tutorial
24. User Review

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
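As a sketch, the softmax output can be turned into a ranked list of categories. The snippet below uses a dummy logits tensor in place of a real model call and hard-codes the label list above; in practice the logits come from `model(**inputs).logits` and the names from `id2label` in the model config:

```python
import torch

# The 24 format labels, in label order (see `id2label` in the model config).
labels = [
    "Academic Writing", "Content Listing", "Creative Writing", "Customer Support",
    "Comment Section", "FAQ", "Truncated", "Knowledge Article", "Legal Notices",
    "Listicle", "News Article", "Nonfiction Writing", "About (Org.)", "News (Org.)",
    "About (Pers.)", "Personal Blog", "Product Page", "Q&A Forum", "Spam / Ads",
    "Structured Data", "Documentation", "Audio Transcript", "Tutorial", "User Review",
]

# Dummy logits standing in for `model(**inputs).logits`.
logits = torch.randn(1, 24)
probs = logits.softmax(dim=-1)

# Print the three most likely categories with their probabilities.
top = probs[0].topk(3)
for p, i in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{labels[i]}: {p:.3f}")
```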

#### Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16
)
```


## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```