Csaba Kecskemeti
csabakecskemeti's activity


All agreed.
I have a decent number of NVIDIA GPUs at home (a server with dual V100 + P40, and a workstation with a 4080 + 2x 3090), so I was primarily just curious how much suffering is involved in using AMD.
Obviously the biggest issue is software. You have to "hack" together the ROCm versions of everything. Every simple step in the NVIDIA ecosystem turns into a mini investigation on AMD :)
That's the main reason I'll put together a blog post with all the instructions, to make it easier for others.
So far I've written the test bench post (some BIOS settings were needed for the card to work, on the MI100 only, not the MI50), set up the ROCm drivers and PyTorch for ROCm, managed to run inference with llama.cpp, and ran a 20-epoch LoRA on the FP32 Llama 3.2 3B, producing a model.
More details later this week in a blog post.
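For anyone following along before the blog post is out, this is roughly the sanity check I run after installing the ROCm build of PyTorch (a minimal sketch; exact version strings will differ):

```python
# Verify the ROCm build of PyTorch sees the card.
# PyTorch keeps the torch.cuda.* API names even on ROCm, so the calls are the same as on NVIDIA.
import torch

print(torch.__version__)               # a ROCm wheel reports something like "2.x.x+rocmX.Y"
print(torch.cuda.is_available())       # True if the MI100/MI50 is visible to ROCm
print(torch.cuda.get_device_name(0))   # prints the AMD device name
```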

All correct!
What I called out: a used MI100 is in the same price range as a used V100 PCIe, which is why I'm comparing against that.
And yes, you're right that FP16 performance would be more useful to mention (AFAIK V100 112 TFLOPS, MI100 184 TFLOPS), but for the comparison it shows the same 164% (claimed) performance for the MI100.
Please note I'm building my hobby AI infra at home for myself, so I'm mainly constrained by TOPS/$ :D
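Just to make that ratio explicit (spec-sheet numbers from above, not measurements):

```python
# Both the FP32 and FP16 claimed peaks give roughly the same MI100/V100 ratio.
v100_fp32, mi100_fp32 = 14, 23      # TFLOPS (claimed peak)
v100_fp16, mi100_fp16 = 112, 184    # TFLOPS (claimed peak)

print(f"FP32 ratio: {mi100_fp32 / v100_fp32:.0%}")  # ~164%
print(f"FP16 ratio: {mi100_fp16 / v100_fp16:.0%}")  # ~164%
```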

I've got my hands on an AMD Instinct MI100. Used, it's about the same price as a V100, but on paper it has more compute (V100 ~14 vs MI100 ~23 FP32 TFLOPS), and the HBM has a faster clock, so the memory bandwidth is 1.2 TB/s.
For quantized inference it's a beast (the MI50 was also surprisingly fast).
For LoRA training, in this quick test I could not make the bnb (bitsandbytes) config work, so I'm running the fine-tune on the full-size model.
I'll share all the install, setup, and settings I've learned in a blog post, together with the cooling shroud 3D design.
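For context, this is the standard 4-bit bitsandbytes + LoRA setup I was aiming for (a rough sketch, not my exact script; the model id and LoRA parameters are illustrative). On ROCm it's the 4-bit load that didn't work for me in this quick test, so I dropped the quantization_config and fine-tuned the full-size model instead:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.2-3B"  # base model used in the test

# The usual QLoRA-style 4-bit quantization config; this is the part that
# failed for me on ROCm.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Omitting quantization_config here loads the full-size model instead.
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Attach LoRA adapters (illustrative rank/alpha/target modules).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```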

Here's a blog post about it:
http://devquasar.com/ai/reasoning-system-prompt/

28 natural voices, unlimited generations, and WebGPU acceleration. Perfect for journalists and content creators.
Test it with full articles—sounds amazingly human! 🎯🎙️
Xenova/kokoro-web

Here you go:
https://devquasar.com/guided-browsing/
This is the guided browsing demo using Chrome (Canary) built-in Gemini Nano.

Yes, I did try it with Chrome Canary.
(I even have a demo page that uses it, but I can't recall the name right now; I'll share it later.)
It's working fine, but:
- It's still only available in experimental Chrome.
- What about other browsers?
- You're locked in to one model.
- What if you're hosting your local AI on another local machine?
All in all, the Chrome built-in AI provides less flexibility, in my view.
Appreciate the comment, though.

This is obviously a prototype.
Security is a big concern here, but I believe it's possible to put together a proxy that is safe and doesn't allow anything other than forwarding generate requests between the browser and the local LLM.

LLmaaS - Local LLM as a Service
With LLmaaS, I propose leveraging locally running LLMs as a service, providing a standardized way for websites to access and utilize them for LLM-powered operations directly on the user’s device.
Demo, code, and a more detailed description:
https://devquasar.com/llmaas/
https://github.com/csabakecskemeti/LLmaaS
https://youtu.be/OOWGr8jcP5Q
Call for contributors
Join me in developing the LLmaaS proxy to make this a general-purpose tool for leveraging local LLMs on the web, with built-in security measures.
I'm looking for help making the proxy more generic, so it supports multiple local LLM services without any change on the HTML side; a minimal sketch of the idea is below.
Also looking for ideas on how to make the HTML part more modular and easier to use.
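To make the "safe forwarding proxy" idea concrete, here's a minimal sketch of what it could look like (this is not the actual LLmaaS code; the endpoint URL, port, and field allowlist are assumptions, and it targets any local OpenAI-compatible server such as a llama.cpp server or Ollama):

```python
# Minimal local proxy sketch: only whitelisted fields of a generate request are
# forwarded from the browser to a local OpenAI-compatible LLM endpoint.
from flask import Flask, jsonify, request
import requests

LOCAL_LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint
ALLOWED_KEYS = {"model", "messages", "temperature", "max_tokens"}  # drop everything else

app = Flask(__name__)

@app.after_request
def add_cors_headers(resp):
    # Web pages need CORS headers to be allowed to call the local proxy.
    resp.headers["Access-Control-Allow-Origin"] = "*"
    resp.headers["Access-Control-Allow-Headers"] = "Content-Type"
    return resp

@app.route("/generate", methods=["POST", "OPTIONS"])
def generate():
    if request.method == "OPTIONS":  # CORS preflight
        return "", 204
    payload = {k: v for k, v in (request.get_json(silent=True) or {}).items()
               if k in ALLOWED_KEYS}
    r = requests.post(LOCAL_LLM_URL, json=payload, timeout=120)
    return jsonify(r.json()), r.status_code

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```

The field filtering and the localhost-only binding are the kind of "does nothing but forward generate requests" security measures mentioned above; help hardening exactly this part is what I'm asking for.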

ok now all fixed

Restarted the space, and regarding the speed: I found I'd forgotten to offload the model to the GPU :D
try now

Here you can try it:
https://huggingface.co/spaces/DevQuasar/Mi50
But something seems off with my network or with HF; everything is very slow.
When I llama-benched the model I got 60 t/s on the MI50.
Anyway you can try it.
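# ROCR_VISIBLE_DEVICES=0 restricts the ROCm runtime to the first GPU, so llama-bench runs only on that card.
# In the results below, pp512 is prompt-processing throughput for a 512-token prompt and tg128 is token-generation throughput for 128 tokens.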
ROCR_VISIBLE_DEVICES=0 build/bin/llama-bench -m ~/Downloads/DevQuasar-R1-Uncensored-Llama-8B.Q8_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 8B Q8_0 | 7.95 GiB | 8.03 B | ROCm | 99 | pp512 | 416.30 ± 0.07 |
llama 8B Q8_0 | 7.95 GiB | 8.03 B | ROCm | 99 | tg128 | 60.13 ± 0.02 |

Tested with lm-evaluation-harness on standard open llm leaderboard tests + hellaswag. Scores are improved in most. Details on the model card.
Model:
DevQuasar/DevQuasar-R1-Uncensored-Llama-8B
Quants:
DevQuasar/DevQuasar-R1-Uncensored-Llama-8B-GGUF

Here is the full result of the re-executed evaluation on deepseek-ai/DeepSeek-R1-Distill-Llama-8B with the suggested gen args.
I see some marginal changes in the scores, but not much. If this is right, the original Llama 3.1 8B wins more tests than the DeepSeek R1 distill. I'm not sure what's going on. If anyone can perform the eval, please share your results.
Again I can be totally wrong here.
Full result data (results dated 2025-01-26):
https://github.com/csabakecskemeti/lm_eval_results/blob/main/deepseek-ai__DeepSeek-R1-Distill-Llama-8B/results_2025-01-26T22-29-00.931915.json
Eval command: accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag,leaderboard_gpqa,leaderboard_ifeval,leaderboard_math_hard,leaderboard_mmlu_pro,leaderboard_musr,leaderboard_bbh --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,do_sample=True
Eval output:
hf (pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype=float16), gen_kwargs: (temperature=0.6,top_p=0.95,do_sample=True), limit: None, num_fewshot: None, batch_size: auto:4 (1,16,64,64)
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.5559 | ± | 0.0050 |
 | | none | 0 | acc_norm | ↑ | 0.7436 | ± | 0.0044 |
leaderboard_bbh | N/A | |||||||
- leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.8080 | ± | 0.0250 |
- leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.5508 | ± | 0.0365 |
- leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4240 | ± | 0.0313 |
- leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.2240 | ± | 0.0264 |
- leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.5200 | ± | 0.0317 |
- leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.2360 | ± | 0.0269 |
- leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.4840 | ± | 0.0317 |
- leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.3240 | ± | 0.0297 |
- leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.4200 | ± | 0.0313 |
- leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.4040 | ± | 0.0311 |
- leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.6880 | ± | 0.0294 |
- leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.6240 | ± | 0.0307 |
- leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.4040 | ± | 0.0311 |
- leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.2945 | ± | 0.0379 |
- leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.4120 | ± | 0.0312 |
- leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.4600 | ± | 0.0316 |
- leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.3440 | ± | 0.0301 |
- leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.5112 | ± | 0.0376 |
- leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
- leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 0.2080 | ± | 0.0257 |
- leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.1800 | ± | 0.0243 |
- leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.1040 | ± | 0.0193 |
- leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3400 | ± | 0.0300 |
- leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.4880 | ± | 0.0317 |
leaderboard_gpqa | N/A | |||||||
- leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.2879 | ± | 0.0323 |
- leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.3004 | ± | 0.0196 |
- leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.3036 | ± | 0.0217 |
leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.4556 | ± | N/A |
 | | none | 0 | inst_level_strict_acc | ↑ | 0.4400 | ± | N/A |
 | | none | 0 | prompt_level_loose_acc | ↑ | 0.3087 | ± | 0.0199 |
 | | none | 0 | prompt_level_strict_acc | ↑ | 0.2957 | ± | 0.0196 |
leaderboard_math_hard | N/A | |||||||
- leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.4821 | ± | 0.0286 |
- leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.2033 | ± | 0.0364 |
- leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.2197 | ± | 0.0362 |
- leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0750 | ± | 0.0158 |
- leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.4026 | ± | 0.0396 |
- leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.4508 | ± | 0.0359 |
- leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0963 | ± | 0.0255 |
leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.2741 | ± | 0.0041 |
leaderboard_musr | N/A | |||||||
- leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5200 | ± | 0.0317 |
- leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.3086 | ± | 0.0289 |
- leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.3120 | ± | 0.0294 |

I've rerun hellaswag with the suggested config; the results haven't improved:
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.5559 | ± | 0.0050 |
 | | none | 0 | acc_norm | ↑ | 0.7436 | ± | 0.0044 |
Command: accelerate launch -m lm_eval --model hf --model_args pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,parallelize=True,dtype="float16" --tasks hellaswag --batch_size auto:4 --log_samples --output_path eval_results --gen_kwargs temperature=0.6,top_p=0.95,generate_until=64,do_sample=True


Thx, will try

Thx, will try

If anyone wants to double check the results are posted here:
https://github.com/csabakecskemeti/lm_eval_results
Did I make a mistake, or is (at least this distilled version) not actually better than the competition?
I'll run the same on the Qwen 7B distilled version too.