Angel Camilo Guillen Guzman
acamilogg88
AI & ML interests
Enhanced AI software development
Recent Activity
reacted to singhsidhukuldeep's post with 🔥 · 5 months ago
Good folks at @PyTorch have just released torchao, a game-changing library for native architecture optimization.
-- How torchao Works (They threw the kitchen sink at it...)
torchao leverages several advanced techniques to optimize PyTorch models, making them faster and more memory-efficient. Here's an overview of its key mechanisms:
Quantization
torchao employs various quantization methods to reduce model size and accelerate inference:
• Weight-only quantization: Converts model weights to lower-precision formats like int4 or int8, significantly reducing memory usage.
• Dynamic activation quantization: Quantizes activations on the fly during inference, balancing performance and accuracy.
• Automatic quantization: The `autoquant` function intelligently selects the best quantization strategy for each layer in a model (see the sketch after this list).
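For a concrete feel, here is a minimal sketch of both paths, assuming the current torchao quantization API (`quantize_`, `int8_weight_only`, `autoquant`); `build_model()` is a hypothetical stand-in for your own model constructor:

```python
# A minimal sketch, assuming torchao >= 0.5 on a CUDA machine; `build_model()`
# is a hypothetical helper returning any torch.nn.Module with nn.Linear layers.
import torch
from torchao import autoquant
from torchao.quantization import quantize_, int8_weight_only

# Explicit scheme: weight-only int8, applied to the model in place.
model = build_model().eval().cuda()
quantize_(model, int8_weight_only())

# Alternative: autoquant benchmarks candidate kernels per layer and keeps the
# fastest; it is designed to compose with torch.compile.
fast_model = autoquant(torch.compile(build_model().eval().cuda(), mode="max-autotune"))
```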
Low-bit Datatypes
The library utilizes low-precision datatypes to speed up computations:
• float8: Enables float8 training for linear layers, offering substantial speedups for large models like LLaMA 3 70B (a training sketch follows this list).
• int4 and int8: Provide options for extreme compression of weights and activations.
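A hedged sketch of the float8 training path, assuming the `torchao.float8.convert_to_float8_training` helper and a GPU with float8 support; the toy model, dtype, and hyperparameters are illustrative only:

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

# Toy model; in practice this would be a large transformer's linear layers.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
).to(torch.bfloat16).cuda()

convert_to_float8_training(model)   # swap eligible nn.Linear layers in place
model = torch.compile(model)        # float8 kernels rely on torch.compile fusion

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).sum()
loss.backward()
optimizer.step()
```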
Sparsity Techniques
torchao implements sparsity methods to reduce model density:
• Semi-sparse weights: Combine quantization with sparsity for compute-bound models (see the sketch below).
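For the compute-bound case, here is a rough sketch assuming the `torchao.sparsity` helpers `sparsify_` and `semi_sparse_weight` and an Ampere-or-newer GPU; real workflows prune and fine-tune to the 2:4 pattern rather than masking ad hoc as done here:

```python
import torch
from torchao.sparsity import sparsify_, semi_sparse_weight

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).half().cuda().eval()

# Crude 2:4 mask for demo purposes: zero the 2 smallest of every 4 weights.
# Real workflows prune and fine-tune to this pattern to preserve accuracy.
with torch.no_grad():
    w = model[0].weight
    groups = w.view(-1, 4)
    idx = groups.abs().topk(2, dim=1, largest=False).indices
    groups.scatter_(1, idx, 0)

# Swap the dense weight for a 2:4 semi-sparse representation in place.
sparsify_(model, semi_sparse_weight())
```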
KV Cache Optimization
For transformer-based models, torchao offers KV cache quantization, leading to significant VRAM reductions for long context lengths.
Integration with PyTorch Ecosystem
torchao seamlessly integrates with existing PyTorch tools:
• Compatible with `torch.compile()` for additional performance gains.
• Works with FSDP2 for distributed training scenarios.
• Supports most PyTorch models available on Hugging Face out of the box (a combined sketch follows this list).
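Putting it together, a hypothetical end-to-end sketch with a Hugging Face checkpoint (the model id, dtype, and generation settings are illustrative, not taken from the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # small open example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()

quantize_(model, int8_weight_only())              # torchao weight-only quantization
model.forward = torch.compile(model.forward)      # extra gains from torch.compile

inputs = tokenizer("torchao makes PyTorch models", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```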
By combining these techniques, torchao enables developers to significantly improve the performance and efficiency of their PyTorch models with minimal code changes and accuracy impact.
Organizations
None yet
acamilogg88's activity
reacted to onekq's post with 🔥 · about 1 month ago
reacted to m-ric's post with 🔥 · 5 months ago
Emu3: Next-token prediction conquers multimodal tasks 🔥
This is the most important research in months: we're now very close to having a single architecture to handle all modalities. The folks at the Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.
What's the big deal?
Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And it's only 8B, but really strong:
🖼️ For image generation, it's matching the best specialized models out there, like SDXL.
👁️ In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
🎬 It's the first to nail video generation without using complicated diffusion techniques.
How does it work?
🧩 Emu3 uses a special tokenizer (SBER-MoVQGAN) to turn images and video clips into sequences of 4,096 tokens.
Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame (a conceptual sketch follows below).
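To make the recipe concrete, here is a conceptual sketch (not BAAI's code) of what "everything is one token stream" looks like; `text_tok` and `vision_tok` are hypothetical tokenizer objects:

```python
# Conceptual sketch only (not BAAI's code): every modality is mapped to
# discrete token ids, concatenated into one stream, and trained with the
# plain causal language-modeling objective.
import torch
import torch.nn.functional as F

def build_sequence(text, image, text_tok, vision_tok):
    # `text_tok` / `vision_tok` are hypothetical tokenizers: text becomes
    # word-piece ids, the image becomes discrete codebook ids.
    text_ids = text_tok.encode(text)
    vision_ids = vision_tok.encode(image)
    return torch.tensor(text_ids + vision_ids, dtype=torch.long)

def next_token_loss(model, tokens):
    # Predict token t+1 from tokens <= t, whether the target id encodes a
    # word, an image patch, or a video-frame patch.
    logits = model(tokens[:-1].unsqueeze(0))          # (1, T-1, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), tokens[1:])
```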
Caveats on the results:
In image generation, Emu3 beats SDXL, but it's also much bigger (8B vs 3.5B). It would be more difficult to beat the real diffusion GOAT, FLUX-dev.
In vision, the authors also don't show a comparison against all the current SOTA models like Qwen-VL or Pixtral.
This approach is exciting because it's simple (next-token prediction) and scalable (handles all sorts of data)!
Read the paper 👉 Emu3: Next-Token Prediction is All You Need (2409.18869)
reacted to singhsidhukuldeep's post with 🔥 · 5 months ago
Good folks at @PyTorch have just released torchao, a game-changing library for native architecture optimization.
Will this model be publicly available?
#1 opened 8 months ago by acamilogg88
reacted to clem's post · 10 months ago
We noticed that all the open-source models and datasets from https://huggingface.co/WizardLM in their personal Hugging Face account & in the Microsoft Hugging Face organization (https://huggingface.co/microsoft) have been made private by the author, which will lead some demos to fail (these models were collectively downloaded over a hundred thousand times a month).
This is the explanation that @WizardLM communicated a few hours ago: https://huggingface.co/posts/WizardLM/329547800484476#661e0d17bca1a6038b60503e
We apologize for the inconvenience & are trying to get in touch with the author & Microsoft in order to try to find a good resolution for community members. Let us know if you have any questions!
reacted to merve's post · 11 months ago
LLaVA-NeXT was recently merged into Hugging Face transformers, and it outperforms many closed-source models like Gemini on various benchmarks 🤩 Let's take a look!
Demo: merve/llava-next
Notebook: https://colab.research.google.com/drive/1afNudu72SNWZCYtCVrRlb9T9Vj9CFJEK?usp=sharing
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection, and Vicuna as the decoder ✨ (a usage sketch follows this post).
LLaVA 1.5 was released with Vicuna, but LLaVA-NeXT (1.6) comes with four different LLMs:
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
Mistral and Nous-Hermes-Yi-34B perform better and are more permissive for commercial use.
Moreover, according to the authors' findings, the improvements come from a more diverse, higher-quality data mixture and dynamic high resolution.
LLaVA based on Nous-Hermes-Yi-34B outperforms many other models, including Gemini, on various multimodal understanding and generation benchmarks.
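If you want to try it, a minimal usage sketch with transformers (assuming a release that includes LLaVA-NeXT, e.g. >= 4.39); the checkpoint id, prompt template, and image URL are illustrative examples from the llava-hf conversions:

```python
import torch
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"   # one of the four LLM variants
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "https://llava-vl.github.io/static/images/view.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"  # Mistral chat format

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```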
reacted to iofu728's post with 🔥 · 11 months ago
Welcome to LLMLingua-2, a small but powerful prompt-compression model trained via data distillation from GPT-4 for token classification with a BERT-level encoder. It excels at task-agnostic compression, surpasses LLMLingua on out-of-domain data, and offers 3x-6x faster performance (a usage sketch follows the links below).
@qianhuiwu
website: https://llmlingua.com/llmlingua2.html
code: https://github.com/microsoft/LLMLingua
demo: microsoft/llmlingua-2
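A hedged usage sketch with the llmlingua package (`pip install llmlingua`); the checkpoint id and compression rate are illustrative, following the project README:

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # select the LLMLingua-2 token-classification path
)

long_prompt = "..."  # your long context / few-shot prompt here
result = compressor.compress_prompt(long_prompt, rate=0.33)  # keep ~1/3 of tokens
print(result["compressed_prompt"])
```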