Csaba Kecskemeti
csabakecskemeti's activity


All agreed.
I have a decent number of NVIDIA GPUs at home (a server with dual V100 + P40, and a workstation with a 4080 + 2x 3090), so I was primarily just curious how much suffering is involved in using AMD.
Obviously the biggest issue is software. You have to "hack" together the ROCm versions of everything. Every simple step in the NVIDIA ecosystem turns into a mini investigation on AMD :)
That's the main reason I'll put together a blog post with all the instructions, to make it easier for others.
So far I've written the test bench post (some BIOS settings were needed for the card to work, on the MI100 only, not the MI50), set up the ROCm drivers and PyTorch for ROCm, managed to run inference with llama.cpp, and ran a 20-epoch LoRA on the FP32 Llama 3.2 3B, producing a model.
More details later this week in a blog post.
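For anyone following along before the blog post is out, this is roughly the sanity check I run after installing the ROCm build of PyTorch (a minimal sketch; exact version strings will differ):

```python
# Verify the ROCm build of PyTorch sees the card.
# PyTorch keeps the torch.cuda.* API names even on ROCm, so the calls are the same as on NVIDIA.
import torch

print(torch.__version__)               # a ROCm wheel reports something like "2.x.x+rocmX.Y"
print(torch.cuda.is_available())       # True if the MI100/MI50 is visible to ROCm
print(torch.cuda.get_device_name(0))   # prints the AMD device name
```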

All correct!
What I called out: a used MI100 is in the same price range as a used V100 PCIe, which is why I'm comparing against that.
And yes, you're right that FP16 performance would be more useful to mention (AFAIK V100 112 TFLOPS, MI100 184 TFLOPS), but for the comparison it shows the same 164% (claimed) performance for the MI100.
Please note I'm building my hobby AI infra at home for myself, so I'm mainly constrained by TOPS/$ :D
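Just to make that ratio explicit (spec-sheet numbers from above, not measurements):

```python
# Both the FP32 and FP16 claimed peaks give roughly the same MI100/V100 ratio.
v100_fp32, mi100_fp32 = 14, 23      # TFLOPS (claimed peak)
v100_fp16, mi100_fp16 = 112, 184    # TFLOPS (claimed peak)

print(f"FP32 ratio: {mi100_fp32 / v100_fp32:.0%}")  # ~164%
print(f"FP16 ratio: {mi100_fp16 / v100_fp16:.0%}")  # ~164%
```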

I've got my hands on an AMD Instinct MI100. Used, it's about the same price as a V100, but on paper it has more compute (V100 ~14 vs MI100 ~23 FP32 TFLOPS), and the HBM has a faster clock, so the memory bandwidth is 1.2 TB/s.
For quantized inference it's a beast (the MI50 was also surprisingly fast).
For LoRA training, in this quick test I could not make the bnb (bitsandbytes) config work, so I'm running the fine-tune on the full-size model.
I'll share all the install, setup, and settings I've learned in a blog post, together with the cooling shroud 3D design.
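For context, this is the standard 4-bit bitsandbytes + LoRA setup I was aiming for (a rough sketch, not my exact script; the model id and LoRA parameters are illustrative). On ROCm it's the 4-bit load that didn't work for me in this quick test, so I dropped the quantization_config and fine-tuned the full-size model instead:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-3B"  # base model used in the test

# The usual QLoRA-style 4-bit quantization config; this is the part that
# failed for me on ROCm.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Omitting quantization_config here loads the full-size model instead.
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Attach LoRA adapters (illustrative rank/alpha/target modules).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```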

Here's a blog post about it:
http://devquasar.com/ai/reasoning-system-prompt/

28 natural voices, unlimited generations, and WebGPU acceleration. Perfect for journalists and content creators.
Test it with full articles—sounds amazingly human! 🎯🎙️
Xenova/kokoro-web

Here you go:
https://devquasar.com/guided-browsing/
This is the guided browsing demo using Chrome (Canary) built-in Gemini Nano.

Yes, I did try it with Chrome Canary.
(I even have a demo page that uses it, but I can't recall the name right now; I'll share it later.)
It's working fine, but:
- It's still only available in experimental Chrome.
- What about other browsers?
- You're locked in to one model.
- What if you're hosting your local AI on another local machine?
All in all, the Chrome built-in AI provides less flexibility, in my view.
Appreciate the comment, though.

This is obviously a prototype.
Security is a big concern here, but I believe it's possible to put together a proxy that is safe and doesn't allow anything other than forwarding generate requests between the browser and the local LLM.

LLmaaS - Local LLM as a Service
With LLmaaS, I propose leveraging locally running LLMs as a service, providing a standardized way for websites to access and utilize them for LLM-powered operations directly on the user’s device.
Demo, code, and a more detailed description:
https://devquasar.com/llmaas/
https://github.com/csabakecskemeti/LLmaaS
https://youtu.be/OOWGr8jcP5Q
Call for contributors
Join me in developing the LLmaaS proxy to make this a general-purpose tool for leveraging local LLMs on the web, with built-in security measures.
I'm looking for help making the proxy more generic, so it supports multiple local LLM services without any change on the HTML side; a minimal sketch of the idea is below.
Also looking for ideas on how to make the HTML part more modular and easier to use.
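To make the "safe forwarding proxy" idea concrete, here's a minimal sketch of what it could look like (this is not the actual LLmaaS code; the endpoint URL, port, and field allowlist are assumptions, and it targets any local OpenAI-compatible server such as a llama.cpp server or Ollama):

```python
# Minimal local proxy sketch: only whitelisted fields of a generate request are
# forwarded from the browser to a local OpenAI-compatible LLM endpoint.
from flask import Flask, jsonify, request
import requests

LOCAL_LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint
ALLOWED_KEYS = {"model", "messages", "temperature", "max_tokens"}  # drop everything else

app = Flask(__name__)

@app.after_request
def add_cors_headers(resp):
    # Web pages need CORS headers to be allowed to call the local proxy.
    resp.headers["Access-Control-Allow-Origin"] = "*"
    resp.headers["Access-Control-Allow-Headers"] = "Content-Type"
    return resp

@app.route("/generate", methods=["POST", "OPTIONS"])
def generate():
    if request.method == "OPTIONS":  # CORS preflight
        return "", 204
    payload = {k: v for k, v in (request.get_json(silent=True) or {}).items()
               if k in ALLOWED_KEYS}
    r = requests.post(LOCAL_LLM_URL, json=payload, timeout=120)
    return jsonify(r.json()), r.status_code

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```

The field filtering and the localhost-only binding are the kind of "does nothing but forward generate requests" security measures mentioned above; help hardening exactly this part is what I'm asking for.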

ok now all fixed

Restarted the space, and regarding the speed: I found I'd forgotten to offload the model to the GPU :D
try now

Here you can try it:
https://huggingface.co/spaces/DevQuasar/Mi50
But something seems off with my network or with HF; everything is very slow.
When I llama-benched the model I got 60 t/s on the MI50.
Anyway you can try it.
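# ROCR_VISIBLE_DEVICES=0 restricts the ROCm runtime to the first GPU, so llama-bench runs only on that card.
# In the results below, pp512 is prompt-processing throughput for a 512-token prompt and tg128 is token-generation throughput for 128 tokens.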
ROCR_VISIBLE_DEVICES=0 build/bin/llama-bench -m ~/Downloads/DevQuasar-R1-Uncensored-Llama-8B.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | ROCm | 99 | pp512 | 416.30 ± 0.07 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | ROCm | 99 | tg128 | 60.13 ± 0.02 |

Tested with lm-evaluation-harness on standard open llm leaderboard tests + hellaswag. Scores are improved in most. Details on the model card.
Model:
DevQuasar/DevQuasar-R1-Uncensored-Llama-8B
Quants:
DevQuasar/DevQuasar-R1-Uncensored-Llama-8B-GGUF

Here is the full result of the re-executed evaluation on deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested gen args.
I see some marginal changes in the scores, but not much. If this is right, the original Llama 3.1 8B wins more tests than the DeepSeek R1 distill. I'm not sure what's going on. If anyone can perform the eval, please share your results.
Again I can be totally wrong here.
Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json
Eval command: accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True
Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.5559 | ± | 0.0050 |
 | | none | 0 | acc_norm | ↑ | 0.7436 | ± | 0.0044 |
leaderboard_bbh | N/A | |||||||
- leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.8080 | ± | 0.0250 |
- leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5508 | ± | 0.0365 |
- leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4240 | ± | 0.0313 |
- leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.2240 | ± | 0.0264 |
- leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.5200 | ± | 0.0317 |
- leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.2360 | ± | 0.0269 |
- leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.4840 | ± | 0.0317 |
- leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.3240 | ± | 0.0297 |
- leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.4200 | ± | 0.0313 |
- leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.4040 | ± | 0.0311 |
- leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.6880 | ± | 0.0294 |
- leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.6240 | ± | 0.0307 |
- leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.4040 | ± | 0.0311 |
- leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.2945 | ± | 0.0379 |
- leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.4120 | ± | 0.0312 |
- leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
- leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.3440 | ± | 0.0301 |
- leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.5112 | ± | 0.0376 |
- leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
- leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2080 | ± | 0.0257 |
- leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1800 | ± | 0.0243 |
- leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1040 | ± | 0.0193 |
- leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3400 | ± | 0.0300 |
- leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
leaderboard_gpqa | N/A | |||||||
- leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2879 | ± | 0.0323 |
- leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.3004 | ± | 0.0196 |
- leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.3036 | ± | 0.0217 |
leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.4556 | ± | N/A |
 | | none | 0 | inst_level_strict_acc | ↑ | 0.4400 | ± | N/A |
 | | none | 0 | prompt_level_loose_acc | ↑ | 0.3087 | ± | 0.0199 |
 | | none | 0 | prompt_level_strict_acc | ↑ | 0.2957 | ± | 0.0196 |
leaderboard_math_hard | N/A | |||||||
- leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.4821 | ± | 0.0286 |
- leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.2033 | ± | 0.0364 |
- leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.2197 | ± | 0.0362 |
- leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0750 | ± | 0.0158 |
- leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.4026 | ± | 0.0396 |
- leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.4508 | ± | 0.0359 |
- leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0963 | ± | 0.0255 |
leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.2741 | ± | 0.0041 |
leaderboard_musr | N/A | |||||||
- leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5200 | ± | 0.0317 |
- leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3086 | ± | 0.0289 |
- leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3120 | ± | 0.0294 |

I've rerun hellaswag with the suggested config; the results haven't improved:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.5559 | ± | 0.0050 |
 | | none | 0 | acc_norm | ↑ | 0.7436 | ± | 0.0044 |
Command: accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True


Thx, will try

Thx, will try

If anyone wants to double check the results are posted here:
https://github.com/csabakecskemeti/lm_eval_results
Did I make a mistake, or is (at least this distilled version) not actually better than the competition?
I'll run the same on the Qwen 7B distilled version too.