base_model:
- Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier

[[Paper](ARXIV_TBD)] [[Website](WEBSITE_TBD)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

##### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL) (using only text contents)
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier) (using URL and text contents)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL) (using only text contents)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```
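For reference, the two fields are simply joined by a blank line. A minimal sketch of assembling such an input (the `format_input` helper is illustrative, not part of this repo):

```python
def format_input(url: str, text: str) -> str:
    # Classifier input: the page URL, a blank line, then the page text
    return f"{url}\n\n{text}"

web_page = format_input("http://www.example.com", "How to make a good sandwich?")
# web_page == "http://www.example.com\n\nHow to make a good sandwich?"
```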

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)
```
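The snippet above stops at the raw `outputs`; to get a predicted category you would typically softmax the 24 logits and look the winning index up in `model.config.id2label`. A self-contained sketch of that step (the logits and category names below are made-up placeholders, not the model's actual label set):

```python
import math

def softmax(logits):
    # numerically stable softmax over raw classifier scores
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]           # placeholder scores for 3 of the 24 classes
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)

id2label = {0: "tutorial", 1: "news", 2: "qa_forum"}  # placeholder names
print(id2label[pred])               # prints "tutorial"
```

With the real model, the equivalent step is `outputs.logits.softmax(dim=-1)` followed by an argmax into `model.config.id2label`.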

We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This __requires installing `xformers`__ and loading the model like:
```python
AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
)
```
See details [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers).

## Citation
```bibtex
@article{wettig2025organize,
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  year={2025}
}
```
|