---
datasets:
- dmitva/human_ai_generated_text
---

# 0xnu/AGTD-v0.1

The "0xnu/AGTD-v0.1" model distinguishes between human-generated and AI-generated text. Built on sophisticated algorithms, it offers exceptional accuracy and efficiency in text analysis and classification. The approach is detailed in the accompanying study, available [here](https://arxiv.org/abs/2311.15565).
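
As a usage illustration, the sketch below assumes the repository can be loaded as a standard Hugging Face sequence-classification checkpoint; the label names come from the model's `id2label` mapping and are not asserted here.

```Python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Assumption: the checkpoint loads as a standard sequence-classification model.
tokenizer = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")
model = AutoModelForSequenceClassification.from_pretrained("0xnu/AGTD-v0.1")

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label.get(predicted_id, predicted_id))
```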
## Instruction Format

```
<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
```

Pseudo-code for tokenizing instructions in the format above:
```Python
# "tok" is assumed to be a pre-loaded tokenizer for this model,
# e.g. tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1").
def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

# BOS_ID and EOS_ID are the ids of the tokenizer's BOS and EOS tokens;
# USER_MESSAGE_* and BOT_MESSAGE_* are placeholder strings.
ids = (
    [BOS_ID] +
    tokenize("[CLS]") + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
    tokenize(BOT_MESSAGE_1) + tokenize("[SEP]") +
    # … repeat the [INST] … [/INST] … [SEP] pattern for intermediate turns …
    tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
    tokenize(BOT_MESSAGE_N) + tokenize("[SEP]") + [EOS_ID]
)
```
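
For a variable number of turns, the same concatenation can be written as a loop. This is a sketch rather than part of the published tokenizer; it reuses the `tok`, `tokenize`, `BOS_ID`, and `EOS_ID` names from the pseudo-code above, and `build_input_ids` is a hypothetical helper name.

```Python
# Hypothetical helper generalising the concatenation above to any number of turns.
def build_input_ids(turns):
    """turns: list of (user_message, bot_message) string pairs."""
    ids = [BOS_ID] + tokenize("[CLS]")
    for user_message, bot_message in turns:
        ids += tokenize("[INST]") + tokenize(user_message) + tokenize("[/INST]")
        ids += tokenize(bot_message) + tokenize("[SEP]")
    return ids + [EOS_ID]
```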
Notes:

- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are integrated according to their definitions in the tokenizer configuration.
- `[INST]` and `[/INST]` are used to wrap instructions.
- The tokenize method should not automatically add BOS or EOS tokens, but it should add a prefix space.
- The `do_lower_case` parameter indicates that text is lowercased for consistent tokenization.
- `clean_up_tokenization_spaces` removes unnecessary spaces from the tokenized output.
- The `tokenize_chinese_chars` parameter enables special handling of Chinese characters.
- The maximum model length is set to 512 tokens (see the sketch after this list).
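
A minimal sketch for loading the tokenizer and confirming these settings (which attributes are exposed depends on the tokenizer class actually shipped with the model, so treat it as illustrative):

```Python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")

# Settings described in the notes above.
print(tok.model_max_length)    # expected to be 512
print(tok.do_lower_case)       # expected to be True for consistent lowercasing
print(tok.all_special_tokens)  # should include [CLS], [SEP], [PAD], [UNK], [MASK]

# Encode with truncation to the 512-token maximum.
ids = tok.encode("Example text to check.", truncation=True, max_length=512)
```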