Text Classification · Transformers · Safetensors · new · custom_code

princeton-nlp committed (verified) · Commit be60cd3 · 1 parent: 624e965

Update README.md · Files changed (1): README.md (+15 −10)

The commit retitles the card from WebOrganizer/FormatClassifier-NoURL to WebOrganizer/FormatClassifier and documents the combined URL + text input format. The updated README.md:
base_model:
- Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier

[[Paper](ARXIV_TBD)] [[Website](WEBSITE_TBD)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

##### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL) (using only text contents)
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier) (using URL and text contents)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL) (using only text contents)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```
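
When classifying many documents, it can help to build this string with a small helper; the function below is ours for illustration, not part of the model card:
```python
def format_input(url: str, text: str) -> str:
    """Render a page into the expected input: the URL on the first
    line, then a blank line, then the page text."""
    return f"{url}\n\n{text}"
```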

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)
```
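
To map the logits onto the 24 format categories, take a softmax and look up the winning index. A minimal sketch, assuming the checkpoint populates `id2label` in its config (the full category definitions live in the taxonomy configs of the [GitHub repository](https://github.com/CodeCreator/WebOrganizer)):
```python
# Convert logits to probabilities over the 24 format categories.
probs = outputs.logits.softmax(dim=-1)
pred = probs.argmax(dim=-1).item()

# id2label is assumed to come from the checkpoint config; if it only
# contains generic LABEL_i names, map indices through the taxonomy
# config in the GitHub repository instead.
print(model.config.id2label[pred], round(probs[0, pred].item(), 3))
```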
 
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This __requires installing `xformers`__ and loading the model like:
```python
AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
)
```
See details [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers).
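
Putting the pieces together, the sketch below runs batched inference on the efficient path. It reflects our own assumptions rather than the card's instructions: `xformers` is installed, a CUDA device is available, and `bfloat16` is an acceptable precision:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,  # our choice; the card does not prescribe a dtype
).to("cuda").eval()

pages = [
    "http://www.example.com\n\nHow to make a good sandwich? [Click here to read article]",
]
inputs = tokenizer(pages, padding=True, truncation=True, return_tensors="pt").to("cuda")
with torch.inference_mode():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs.argmax(dim=-1).tolist())
```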

## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  year={2025}
}
```