cancels does not work for zerogpu

#113
by skytnt - opened

If I cancel the streaming generation using cancels, it does not stop running, although it appears to stop running on the front end.

This is my space: https://huggingface.co/spaces/skytnt/midi-composer

I thought Gradio had to be queued for the cancellation to work?
However, I think it was implicitly queued in the middle of Gradio 3.x, so there may be an inconsistency around that in the Zero GPU space.

app.launch(server_port=opt.port, share=opt.share, inbrowser=True)

to

app.queue().launch(server_port=opt.port, share=opt.share, inbrowser=True)
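For reference, here is a minimal sketch of the `cancels` wiring being discussed. It is not taken from skytnt's Space: `slow_stream`, `build_demo`, and the component labels are hypothetical names chosen for illustration. It only shows how a stop button is normally connected to a running streaming event. Whether the cancellation actually reaches a function running under `@spaces.GPU` is the open question in this thread.

```python
import time


def slow_stream(n_steps=5, delay=0.5):
    """Hypothetical stand-in for a streaming generation function."""
    for step in range(n_steps):
        time.sleep(delay)  # pretend to do one chunk of work
        yield f"step {step + 1}/{n_steps}"


def build_demo():
    # Imported here so the generator above can be tried without Gradio installed.
    import gradio as gr

    with gr.Blocks() as demo:
        out = gr.Textbox(label="Progress")
        start = gr.Button("Generate")
        stop = gr.Button("Stop")
        # Keep the event returned by click() so another listener can cancel it.
        run_event = start.click(slow_stream, inputs=None, outputs=out)
        # A listener with fn=None and cancels=[...] only cancels the listed event.
        stop.click(fn=None, inputs=None, outputs=None, cancels=[run_event])
    return demo
```

To try it, call `build_demo().queue().launch()`, with the queue enabled explicitly as in the line above.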

Thanks, I'll try it.

No, still not working.

So is it purely a bug in the Zero GPU space...
Or worse, a bug in Spaces in general.
I understand that Zero GPU spaces are inevitably buggy due to the acrobatic things they do, but they are really buggy. 😓

It works fine on the original T4 GPU space. So it should be a bug in the Zero GPU space.

It's definitely a bug in the Zero GPU space. Same condition as this guy.
https://huggingface.co/spaces/zero-gpu-explorers/README/discussions/111

I also found that the seed does not seem to change in my space.

Is there something wrong with the torch generator...?

@spaces.GPU(duration=get_duration)
def run(model_name, tab, mid_seq, continuation_state, instruments, drum_kit, bpm, time_sig, key_sig, mid, midi_events,
        reduce_cc_st, remap_track_channel, add_default_instr, remove_empty_channels, seed, seed_rand,
        gen_events, temp, top_p, top_k, allow_cc):
~
        generator = torch.Generator(opt.device).manual_seed(seed)
~
        midi_generator = generate(model, mid, max_len=max_len, temp=temp, top_p=top_p, top_k=top_k,
                                  disable_patch_change=disable_patch_change, disable_control_change=not allow_cc,
                                  disable_channels=disable_channels, generator=generator)

No, I mean the random seed option.

Or maybe my input got mixed with other users'.

Got it... I put in a print statement, and it works in mysterious ways beyond my imagination.
The code does get into the random-seed branch and does generate a random number, but the same random numbers are being generated again on later passes?
Some of the variables are fixed! This may be just like what I encountered. But it's even worse because it changes halfway through.

    if seed_rand:
        seed = np.random.randint(0, MAX_SEED)
        print(f"Random Seed: {seed}")
    print(f"Numpy Randint: {np.random.randint(0, MAX_SEED)}")
    generator = torch.Generator(opt.device).manual_seed(seed)
    print(f"Seed: {seed}")
Pass 1:
Random Seed: 2060276394
Numpy Randint: 160616387
Seed: 2060276394
Pass 2:
Random Seed: 2106771594
Numpy Randint: 1326089247
Seed: 2106771594
Pass 3:
Random Seed: 2060276394
Numpy Randint: 160616387
Seed: 2060276394

I tried calling Python's standard random module instead, and it worked. I have a workaround.
Only NumPy's random function is buggy...?
Is it possible that NumPy itself is buggy?

    from random import randint
    if seed_rand:
        seed = np.random.randint(0, MAX_SEED)
        seed = randint(0, MAX_SEED)
        print(f"Random Seed: {seed}")
    print(f"Numpy Randint: {np.random.randint(0, MAX_SEED)}, Python Randint: {randint(0, MAX_SEED)}")
    generator = torch.Generator(opt.device).manual_seed(seed)
    print(f"Seed: {seed}")
Random Seed: 966084986
Numpy Randint: 446158108, Python Randint: 1564154308
Seed: 966084986

Random Seed: 813152059
Numpy Randint: 446158108, Python Randint: 1417789405
Seed: 813152059

Random Seed: 1891756477
Numpy Randint: 446158108, Python Randint: 1120703155
Seed: 1891756477

P.S.

Actual code.
https://huggingface.co/spaces/John6666/midi-composer_error/blob/main/app.py
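The workaround above can be written as a small helper. This is only a sketch under assumptions: `pick_seed` is a hypothetical name, and the real value of `MAX_SEED` is not shown in this thread, so `2**31 - 1` is a placeholder. One possible explanation for the repeated NumPy values (not confirmed here) is that each call runs in a worker forked from the main process; a forked child inherits a copy of NumPy's global RNG state, whereas Python's `random` module reseeds itself in the child. Drawing the seed from the OS entropy source avoids depending on either.

```python
import secrets

# Assumption: placeholder value, the Space's actual MAX_SEED is not shown here.
MAX_SEED = 2**31 - 1


def pick_seed(seed: int, seed_rand: bool) -> int:
    """Return a fresh seed when seed_rand is set, else the user-supplied seed."""
    if seed_rand:
        # secrets asks the OS for entropy on every call, so the result does not
        # depend on any in-process RNG state a worker might have inherited.
        # randbelow(MAX_SEED) covers [0, MAX_SEED), like np.random.randint(0, MAX_SEED).
        return secrets.randbelow(MAX_SEED)
    return seed
```

The returned value can then be passed to `torch.Generator(...).manual_seed(seed)` exactly as in the snippet above.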

Thanks for the test. NumPy's random works fine on my local machine and the original T4 GPU, so I think it's still a problem of zerogpu space.

That makes two bugs so far.

ZeroGPU Explorers org

Hey guys, I hope you are doing great. I wanted to know if we can host an LLM with vLLM on ZeroGPU and then use that API endpoint in a production-level application. How effective would it be, and roughly how many requests could you make before reaching the permitted limit?

I wanted to know if we can host an LLM with vLLM on ZeroGPU and then use that API endpoint in a production-level application.

Hi. If that's what you're looking for, I probably have a real-life example. It's someone else's space, but he's good at what he does, so it's helpful.
He was struggling with the fact that he couldn't let each user use their own Quota without explicitly having them log in, so his own Quota was consumed...
https://huggingface.co/spaces/gokaygokay/Random-Prompt-Generator/blob/main/llm_inference.py

self.flux_client = Client("KingNish/Realtime-FLUX", hf_token=self.huggingface_token)

Also, other Pro and Free users are often unaware of the sign-in feature; very few Zero GPU spaces have a sign-in button for Quota mitigation purposes, and I was under the same misapprehension until recently. I'm basically just playing around, which is fine, but some people seem to notice it only after joining for Quota mitigation purposes, which can be a problem.

I hope this helps you understand the situation...

ZeroGPU Explorers org

Ahh, interesting. Can you elaborate a bit further?

Of course I have no problem explaining the details, but I don't know which part to explain, so I will explain in general.

In gokaygokay's case, the program on his space lets users use other people's Zero GPU spaces with his Pro token, but he doesn't actually need such a long Quota; a Free user's or guest user's Quota is enough.
However, to get a user's token, a sign-in button has to be installed, which is a hurdle, not in terms of implementation difficulty, but in terms of user experience. In short, we don't want to require a token because it would scare users away.
Well, regardless of those issues, isn't this an example of how a Free Quota is sufficient to use a Zero GPU space via an endpoint?

As for the latter issue of no one knowing about the sign-in button for Quota mitigation, that is literally true.
I think the majority of people assume that if they subscribe to Pro, the Pro Quota will be applied on Spaces without any specific settings, and the same is true for the Quota on HF's Free account. I see people asking about this a lot on the forums and in Discussions, and I believed the same thing myself for some months, so...
There were a number of people trying hard to pass tokens for Quota mitigation.
Since HF has multiple password-like features, it never occurred to me that there is a third password, or sign-in.

I understand that this is a more secure design, but it is very impractical to go around putting buttons on other people's existing Zero GPU spaces. However, without the button, the Pro subscription is not worth much with respect to Quota.
It's not an urgent issue, but the situation would be improved a bit if the specifications were made known again to demo makers and users, or clearly stated in error messages, or if there were some kind of mild auto sign-in feature limited to Quota mitigation.

ZeroGPU Explorers org
edited Oct 5, 2024

Ahh yes, that I understood. However, what about the vLLM part? Does ZeroGPU support vLLM or not? The reason I ask is that, as we know, ZeroGPU works by efficiently distributing and sharing multiple A100 GPUs under the hood to give us the free GPUs we are currently using, but the vLLM architecture needs a dedicated GPU. So what do you think? Will it work or not?

I'm really sorry! I misunderstood from the beginning 🙀. You were talking about using the vLLM library, not a VLM or an LLM...
I haven't seen anyone try it in the Zero GPU space.
https://github.com/vllm-project/vllm
https://huggingface.co/spaces?sort=trending&search=vllm

ZeroGPU Explorers org

Ahh, that's what I was thinking. I was like, what am I asking and what is this guy replying? I am talking about the vLLM pipeline, man, which uses paged attention to stream inferences.

Sorry to waste your time. 😭
If I may make an excuse, I was just here to report a bug...
And it's music and image related software.
Since you asked me there, I mistakenly thought you were talking about UE and practical use of the Zero GPU space in general.

I think there are more active LLM-related users in the Posts section. But I haven't seen anyone on HF talking about vLLM...
The only talk about anything other than the HF libraries is the occasional question about LangChain or llama.cpp on the Forum. I do see a lot of talk about parallelism in general, though. But generally everyone is trying to do something with torch or accelerate.

ZeroGPU Explorers org

It's OK, man, at least you tried to help. That's what matters. That's why I like the HF community so much.
