---
license: cc-by-nc-4.0
base_model:
- CohereForAI/c4ai-command-r7b-12-2024
library_name: gguf
pipeline_tag: text-generation
---

GGUF conversion of [c4ai-command-r7b-12-2024](https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024) by CohereForAI, provided for testing until llama.cpp adds official support for this architecture.

Example invocation, passing the chat-format special tokens directly in the prompt:

```sh
./build/bin/llama-cli -fa --no-display-prompt -c 0 \
  -m ggml-c4ai-command-r-7b-12-2024-q4_k.gguf \
  -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
```
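For quick multi-turn tests, the same raw template can be extended by appending completed turns before the final `<|CHATBOT_TOKEN|><|START_RESPONSE|>`. The sketch below is an assumption-laden example, not a verified recipe: the `<|END_RESPONSE|>` marker closing the earlier assistant turn is taken from the upstream model's special tokens and does not appear in the command above, so adjust it if it does not match.

```sh
# Minimal sketch of a two-turn prompt (bash). The <|END_RESPONSE|> marker that
# closes the earlier assistant turn is an assumption based on the upstream
# model's special tokens, not taken from the command above.
PROMPT="<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|>"
PROMPT+="<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|>"
PROMPT+="<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>I am Command, a large language model built by Cohere.<|END_RESPONSE|><|END_OF_TURN_TOKEN|>"
PROMPT+="<|START_OF_TURN_TOKEN|><|USER_TOKEN|>What kinds of tasks can you help with?<|END_OF_TURN_TOKEN|>"
PROMPT+="<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"

./build/bin/llama-cli -fa --no-display-prompt -c 0 \
  -m ggml-c4ai-command-r-7b-12-2024-q4_k.gguf \
  -p "$PROMPT"
```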
https://github.com/ggerganov/llama.cpp/issues/10816#issuecomment-2548574766

```
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 50000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1328.31 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 841
llama_new_context_with_model: graph splits = 324 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2760461191
sampler params:
  repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
  dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
  top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
  mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

I am Command, a sophisticated large language model built by the company Cohere. I assist users by providing thorough responses to a wide range of queries, offering information, and performing various tasks. My capabilities include answering questions, generating text, summarizing content, extracting data, and performing various other tasks based on the user's requirements. I strive to provide accurate and helpful information while ensuring a positive and informative user experience. Feel free to ask me about any topic, and I'll do my best to assist you!
[end of text]

llama_perf_sampler_print:    sampling time =      15.07 ms /   128 runs   (    0.12 ms per token,  8491.44 tokens per second)
llama_perf_context_print:        load time =    1076.84 ms
llama_perf_context_print: prompt eval time =     181.62 ms /    22 tokens (    8.26 ms per token,   121.13 tokens per second)
llama_perf_context_print:        eval time =    4938.01 ms /   105 runs   (   47.03 ms per token,    21.26 tokens per second)
llama_perf_context_print:       total time =    5163.42 ms /   127 tokens
```
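The model can also be exercised over HTTP with `llama-server` from the same llama.cpp build. The following is a minimal sketch, not taken from the linked issue: it sends the raw prompt string to the server's `/completion` endpoint (port and `n_predict` are arbitrary choices), since the model's chat template is not applied automatically yet.

```sh
# Minimal sketch: serve the GGUF and query the /completion endpoint with the
# same raw prompt. Port 8080 and n_predict=256 are arbitrary choices.
./build/bin/llama-server -fa -c 8192 --port 8080 \
  -m ggml-c4ai-command-r-7b-12-2024-q4_k.gguf &

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>",
    "n_predict": 256
  }'
```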