---
license: cc-by-nc-4.0
base_model:
- CohereForAI/c4ai-command-r7b-12-2024
library_name: gguf
pipeline_tag: text-generation
---
GGUF version of [c4ai-command-r7b-12-2024](https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024) from CohereForAI, intended for testing until llama.cpp adds official support.

Example `llama-cli` invocation, passing the chat turns as a raw prompt:

```sh
./build/bin/llama-cli -fa --no-display-prompt -c 0 -m ggml-c4ai-command-r-7b-12-2024-q4_k.gguf -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
```

See https://github.com/ggerganov/llama.cpp/issues/10816#issuecomment-2548574766 for the upstream support discussion. Sample output:

```
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 50000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1328.31 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 841
llama_new_context_with_model: graph splits = 324 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2760461191
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

I am Command, a sophisticated large language model built by the company Cohere. I assist users by providing thorough responses to a wide range of queries, offering information, and performing various tasks. My capabilities include answering questions, generating text, summarizing content, extracting data, and performing various other tasks based on the user's requirements.

I strive to provide accurate and helpful information while ensuring a positive and informative user experience. Feel free to ask me about any topic, and I'll do my best to assist you! [end of text]


llama_perf_sampler_print:    sampling time =      15.07 ms /   128 runs   (    0.12 ms per token,  8491.44 tokens per second)
llama_perf_context_print:        load time =    1076.84 ms
llama_perf_context_print: prompt eval time =     181.62 ms /    22 tokens (    8.26 ms per token,   121.13 tokens per second)
llama_perf_context_print:        eval time =    4938.01 ms /   105 runs   (   47.03 ms per token,    21.26 tokens per second)
llama_perf_context_print:       total time =    5163.42 ms /   127 tokens
```
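
For longer testing sessions, the same quant can also be served with `llama-server`. Below is a minimal sketch, assuming the flags from the CLI example above carry over; it queries the server's native `/completion` endpoint with the same raw prompt format, since the chat template may not be applied automatically while support is unofficial.

```sh
# Start the server with the same quant and context settings as the CLI example above.
./build/bin/llama-server -fa -c 0 -m ggml-c4ai-command-r-7b-12-2024-q4_k.gguf --port 8080

# Query the native /completion endpoint with the raw Command R prompt format.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>",
    "n_predict": 256
  }'
```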