File size: 6,664 Bytes
fdd4714
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4e110f2
 
 
 
fdd4714
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9f4c0a1
fdd4714
 
 
 
 
 
 
 
 
9f4c0a1
fdd4714
 
 
 
 
 
 
 
 
 
 
9f4c0a1
fdd4714
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9f4c0a1
8511385
9f4c0a1
 
 
 
 
 
 
fdd4714
9f4c0a1
 
fdd4714
 
 
8511385
fdd4714
 
9f4c0a1
fdd4714
9f4c0a1
 
fdd4714
 
 
 
9f4c0a1
fdd4714
9f4c0a1
fdd4714
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
license: llama3.1
base_model: meta-llama/Llama-3.1-70B
model-index:
- name: Tess-R1-Llama-3.1-70B
  results: []
---

<br>

![Tess-R1-Llama-3.1-70B](https://huggingface.co/migtissera/Tess-R1-Llama-3.1-70B/resolve/main/Tess-R1-2.jpg)

<br>


Welcome to the Tess-Reasoning-1 (Tess-R1) series of models. Tess-R1 is designed with test-time compute in mind, and has the capabilities to produce a Chain-of-Thought (CoT) reasoning before producing the final output. 

The model is trained to first think step-by-step, and contemplate on its answers. It can also write alternatives after contemplating. Once all the steps have been thought through, it writes the final output.

1. Step-by-step, Chain-of-Thought thinking process. Uses `<thinking>` `</thinking>` tags to indicate when the model is performing CoT.
2. `<contemplation>` `</contemplation>` tags are used when the model contemplate on its answers.
3. `<alternatively>` `</alternatively>` tags are used for alternate suggestions.
4. Finally, `<output>` `</output>` tags are used for the final output

# Important Note:
In a multi-turn conversation, only the contents between the `<output>` `</output>` tags (discarding the tags) should be carried forward. Otherwise the model will see out of distribution input data and will fail.


# Prompt Format

The model uses ChatML prompt format.

# System Message

The system message *must* be the following:

```You are Tess-R1, an advanced AI that was created for complex reasoning. Given a user query, you are able to first create a Chain-of-Thought (CoT) reasoning. Once the CoT is devised, you then proceed to first think about how to answer. While doing this, you have the capability to contemplate on the thought, and also provide alternatives. Once the CoT steps have been thought through, you then respond by creating the final output.```

# Inference

The model was trained mostly with Chain-of-Thought reasoning data, including the XML tags. However, to generalize model generations, some single-turn and multi-turn data without XML tags were also included. Due to this, in some instances the model does not produce XML tags and does not fully utilize test-time compute capabilities. There is two ways to get around this:

- Include a try/catch statement in your inference script, and only pass on the contents between the `<output>` `</output>` tags if it's available.
- Use the `<thinking>` tag as the seed in the generation. i.e: `f"{conversation}{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n<thinking>"`

I have included a sample Python script below.

```python
import torch, json
from transformers import AutoModelForCausalLM, AutoTokenizer
import re


class LLM(object):
    def __init__(self, model_path):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            load_in_4bit=False,
            trust_remote_code=False,
        )

        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=False
        )

        self.terminators = [
            self.tokenizer.convert_tokens_to_ids("<|end_of_text|>"),
            self.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    def generate_text(self, instruction):
        tokens = self.tokenizer.encode(instruction)
        tokens = torch.LongTensor(tokens).unsqueeze(0)
        tokens = tokens.to("cuda")

        instance = {
            "input_ids": tokens,
            "top_p": 1.0,
            "temperature": 0.75,
            "generate_len": 4096,
            "top_k": 50,
        }

        length = len(tokens[0])
        with torch.no_grad():
            rest = self.model.generate(
                input_ids=tokens,
                max_length=length + instance["generate_len"],
                use_cache=True,
                do_sample=True,
                top_p=instance["top_p"],
                temperature=instance["temperature"],
                top_k=instance["top_k"],
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                eos_token_id=self.terminators,
            )
        output = rest[0][length:]
        string = self.tokenizer.decode(output, skip_special_tokens=True)
        return f"{string}"

    def extract_output(self, text):
        pattern = r"<output>(.*?)</output>"
        match = re.search(pattern, text, re.DOTALL)
        content = match.group(1).strip()
        return content

    def respond_llama3(self, user_prompt):
        conversation = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are Tess-R1, an advanced AI that was created for complex reasoning. Given a user query, you are able to first create a Chain-of-Thought (CoT) reasoning. Once the CoT is devised, you then proceed to first think about how to answer. While doing this, you have the capability to contemplate on the thought, and also provide alternatives. Once the CoT steps have been thought through, you then respond by creating the final output.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"""
        llm_prompt = f"{conversation}{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
        answer = self.generate_text(llm_prompt)
        try:
            answer_output = self.extract_output(answer)
            return answer_output
        except:
            return answer


model_path = "neurolattice/Tess-R1-Llama-3.1-70B"

llm = LLM(model_path)

conversation = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are Tess-R1, an advanced AI that was created for complex reasoning. Given a user query, you are able to first create a Chain-of-Thought (CoT) reasoning. Once the CoT is devised, you then proceed to first think about how to answer. While doing this, you have the capability to contemplate on the thought, and also provide alternatives. Once the CoT steps have been thought through, you then respond by creating the final output.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"""
while True:
    user_input = input("You: ")
    llm_prompt = f"{conversation}{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    answer = llm.generate_text(llm_prompt)
    print("=" * 132)
    print(answer)
    try:
        answer_output = llm.extract_output(answer)
        print("=" * 132)
        print(answer_output)
        conversation = f"{llm_prompt}{answer_output}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    except:
        conversation = f"{llm_prompt}{answer}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
```