---
datasets:
- dmitva/human_ai_generated_text
---

# 0xnu/AGTD-v0.1

The "0xnu/AGTD-v0.1" model distinguishes between human-generated and AI-generated text. Built on sophisticated algorithms, it offers exceptional accuracy and efficiency in text analysis and classification. The approach is detailed in the accompanying study, available [here](https://arxiv.org/abs/2311.15565).
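
As a usage illustration, the sketch below assumes the repository can be loaded as a standard Hugging Face sequence-classification checkpoint; the label names come from the model's `id2label` mapping and are not asserted here.

```Python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Assumption: the checkpoint loads as a standard sequence-classification model.
tokenizer = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")
model = AutoModelForSequenceClassification.from_pretrained("0xnu/AGTD-v0.1")

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label.get(predicted_id, predicted_id))
```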
## Instruction Format

```
<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
```

Pseudo-code for tokenizing instructions in the format above:
```Python
# "tok" is assumed to be a pre-loaded tokenizer for this model,
# e.g. tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1").
def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

# BOS_ID and EOS_ID are the ids of the tokenizer's BOS and EOS tokens;
# USER_MESSAGE_* and BOT_MESSAGE_* are placeholder strings.
ids = (
    [BOS_ID] +
    tokenize("[CLS]") + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
    tokenize(BOT_MESSAGE_1) + tokenize("[SEP]") +
    # … repeat the [INST] … [/INST] … [SEP] pattern for intermediate turns …
    tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
    tokenize(BOT_MESSAGE_N) + tokenize("[SEP]") + [EOS_ID]
)
```
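
For a variable number of turns, the same concatenation can be written as a loop. This is a sketch rather than part of the published tokenizer; it reuses the `tok`, `tokenize`, `BOS_ID`, and `EOS_ID` names from the pseudo-code above, and `build_input_ids` is a hypothetical helper name.

```Python
# Hypothetical helper generalising the concatenation above to any number of turns.
def build_input_ids(turns):
    """turns: list of (user_message, bot_message) string pairs."""
    ids = [BOS_ID] + tokenize("[CLS]")
    for user_message, bot_message in turns:
        ids += tokenize("[INST]") + tokenize(user_message) + tokenize("[/INST]")
        ids += tokenize(bot_message) + tokenize("[SEP]")
    return ids + [EOS_ID]
```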
Notes:

- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are integrated according to their definitions in the tokenizer configuration.
- `[INST]` and `[/INST]` are used to wrap instructions.
- The tokenize method should not automatically add BOS or EOS tokens, but it should add a prefix space.
- The `do_lower_case` parameter indicates that text is lowercased for consistent tokenization.
- `clean_up_tokenization_spaces` removes unnecessary spaces from the tokenized output.
- The `tokenize_chinese_chars` parameter enables special handling of Chinese characters.
- The maximum model length is set to 512 tokens (see the sketch after this list).
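
A minimal sketch for loading the tokenizer and confirming these settings (which attributes are exposed depends on the tokenizer class actually shipped with the model, so treat it as illustrative):

```Python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")

# Settings described in the notes above.
print(tok.model_max_length)    # expected to be 512
print(tok.do_lower_case)       # expected to be True for consistent lowercasing
print(tok.all_special_tokens)  # should include [CLS], [SEP], [PAD], [UNK], [MASK]

# Encode with truncation to the 512-token maximum.
ids = tok.encode("Example text to check.", truncation=True, max_length=512)
```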