[FEEDBACK] Daily Papers
Note that this is not a post about adding new papers, it's about feedback on the Daily Papers community update feature.
How to submit a paper to the Daily Papers, like @akhaliq (AK)?
- Submitting is available to paper authors
- Only recent papers (less than 7d) can be featured on the Daily
Then drop the arxiv id in the form at https://huggingface.co/papers/submit
- Add media to the paper (images, videos) when relevant
- You can start the discussion to engage with the community
Please check out the documentation
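The 7-day rule above can be illustrated with a small sketch. It assumes the paper's age is counted from a single publication date; this thread does not say which date the submission form actually uses (arXiv submission date or announcement date), so `published` and the function name are hypothetical.

```python
from datetime import date, timedelta

# Window stated in the post: only papers less than 7 days old can be featured.
WINDOW = timedelta(days=7)


def is_recent(published: date, today: date, window: timedelta = WINDOW) -> bool:
    """Return True if `published` is strictly less than `window` before `today`.

    Assumption: the age is counted from one publication date. Which date the
    real submission form uses is not documented in this thread.
    """
    age = today - published
    return timedelta(0) <= age < window


if __name__ == "__main__":
    today = date(2024, 7, 10)  # hypothetical "today"
    for published in (date(2024, 7, 5), date(2024, 7, 1)):
        age_days = (today - published).days
        print(f"{published}: {age_days} days old, recent={is_recent(published, today)}")
```

If the age were counted from the original submission date, a paper held in moderation for several weeks could fail this check on the day it appears, which would be consistent with the "older than 7 days" reports further down this thread.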
We are excited to share our recent work on MLLM architecture design titled "Ovis: Structural Embedding Alignment for Multimodal Large Language Model".
Paper: https://arxiv.org/abs/2405.20797
Github: https://github.com/AIDC-AI/Ovis
Model: https://huggingface.co/AIDC-AI/Ovis-Clip-Llama3-8B
Data: https://huggingface.co/datasets/AIDC-AI/Ovis-dataset
@Yiwen-ntu for now we support only videos as paper covers in the Daily.
We are excited to share our work titled "Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models": https://arxiv.org/abs/2406.12644
Consistency-diversity realism Pareto fronts of conditional image generative models -- http://arxiv.org/abs/2406.10429
"Data Contamination Can Cross Language Barriers". -- https://arxiv.org/pdf/2406.13236
How do I add papers that are on Nature rather than arXiv?
Sharing our latest paper: CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation (https://arxiv.org/abs/2406.05365)
Gorgeous: Create Your Desired Character Facial Makeup from Any Ideas https://arxiv.org/abs/2404.13944
We are thrilled to announce the publication of my first research paper on model merging, Della-Merging. Della employs a magnitude-based sampling approach to eliminate redundant delta parameters, reducing interference when merging homologous models (those fine-tuned from the same backbone).
Paper: https://arxiv.org/abs/2406.11617
Github: https://github.com/declare-lab/della
Della outperforms existing homologous model merging techniques such as DARE and TIES. Across three expert models (LM, Math, Code) and their corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), Della achieves an improvement of 3.6 points over TIES and 1.2 points over DARE.
LVBench is a benchmark designed to evaluate and enhance the capabilities of multimodal models in understanding and extracting information from long videos up to two hours in duration. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks.
Paper: https://arxiv.org/abs/2406.08035
Github: https://github.com/THUDM/LVBench
STARLING: Self-supervised Training of Text-based Reinforcement Learning Agent with Large Language Models
Paper: https://arxiv.org/abs/2406.05872
Code: https://github.com/IBM/starling-agent
SIT: Fine-tuning Large Language Models with Sequential Instructions
Paper: https://arxiv.org/pdf/2403.07794
Data and model: https://seqit.github.io
Code: https://github.com/hanxuhu/SeqIns
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
paper: https://arxiv.org/pdf/2406.17419
code: https://github.com/MozerWang/Loong
TRAIT: Task Oriented In-Domain Data Augmentation (for Continual Pre-training of LLMs), https://arxiv.org/abs/2406.16694
We are excited to share our recent work: "Adam-mini: Use Fewer Learning Rates To Gain More" https://arxiv.org/abs/2406.16793
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini can also achieve 49.5% higher throughput than AdamW on Llama2-7B pre-training. The design of Adam-mini is inspired by certain Hessian structures we observed on Transformers. Code available at: https://github.com/zyushun/Adam-mini
We have developed a new text-to-video generation benchmark for metamorphic evaluation. We specifically design four major categories for time-lapse videos (as shown below), including biological, human-created, meteorological, and physical videos, and extend these to 75 subcategories.
paper: https://arxiv.org/abs/2406.18522
leaderboard: https://huggingface.co/spaces/BestWishYsh/ChronoMagic-Bench
code: https://github.com/PKU-YuanGroup/ChronoMagic-Bench
KV cache optimization for LLMs and MLLMs:
- LLMs: D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models, arxiv: https://arxiv.org/abs/2406.13035
- MLLMs: LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, arxiv: https://arxiv.org/html/2406.18139v1
We developed a generic schematic for the optimization loop to reduce the memory footprint of second-order, full-matrix adaptive optimizers.
Our target optimizers are the ones that store a window of past gradients, such as M-FAC and GGT, which usually require storing around 500 to 1000 gradients (equivalent to this many model copies in the GPU memory).
Our technique uses sparse/low-rank gradients and Error Feedback and shows we can reduce the memory footprint of optimizers' state by 30x for GGT and 45x to 60x for M-FAC.
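To make the scale concrete, here is a back-of-the-envelope sketch of the numbers above. It assumes each stored gradient is a dense float32 copy of the model (4 bytes per parameter). The parameter count is a hypothetical round number, not a figure from the paper, and the function names are illustrative only.

```python
def dense_window_bytes(num_params: int, window: int, bytes_per_value: int = 4) -> int:
    """Memory needed to keep `window` dense gradients, each as large as the model."""
    return num_params * window * bytes_per_value


def reduced_bytes(dense_bytes: float, reduction_factor: float) -> float:
    """Memory left after shrinking the optimizer state by `reduction_factor`."""
    return dense_bytes / reduction_factor


if __name__ == "__main__":
    num_params = 100_000_000  # hypothetical model size (assumption)
    for window in (500, 1000):  # window sizes quoted in the post
        dense = dense_window_bytes(num_params, window)
        print(f"window={window}: dense state = {dense / 1e9:.0f} GB")
        # Reduction factors quoted in the post: 30x (GGT), 45x to 60x (M-FAC).
        for label, factor in (("GGT 30x", 30), ("M-FAC 45x", 45), ("M-FAC 60x", 60)):
            print(f"  {label}: {reduced_bytes(dense, factor) / 1e9:.1f} GB")
```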
Why is this important?
In the case of M-FAC, which is an approximation of Natural Gradient (NG) (the best-known optimizer of this kind is K-FAC), our work allows using approximations of NG at larger scale, such as ResNet-18 / ImageNet and BERT-Base finetuning.
Please experiment with this NG approximation and let us know about your findings!
Our arxiv paper: https://arxiv.org/pdf/2306.06098
Our code on GitHub: https://github.com/IST-DASLab/EFCP
was waiting for papers to get verified on my account so i could submit but then the 7 day window closed :( any chance we can still submit ours?
Hi AK and HF team,
I would appreciate your considering my recent ArXiv paper "Model Callers for Transforming Predictive and Generative AI Applications" for inclusion in the HF daily papers. I could not submit directly to your site, since I don't already have a paper in HF DPs.
Paper: https://arxiv.org/abs/2406.15377
Github code: https://github.com/mukdal/modelcaller
Python library: pip install modelcaller
Abstract: We introduce a novel software abstraction termed "model caller," acting as an intermediary for AI and ML model calling, advocating its transformative utility beyond existing model-serving frameworks. This abstraction offers multiple advantages: enhanced accuracy and reduced latency in model predictions, superior monitoring and observability of models, more streamlined AI system architectures, simplified AI development and management processes, and improved collaboration and accountability across AI/ML/Data Science, software, data, and operations teams. Model callers are valuable for both creators and users of models within both predictive and generative AI applications. Additionally, we have developed and released a prototype Python library for model callers, accessible for installation via pip or for download from GitHub.
Thanks,
Mukesh Dalal
Hello AK and HF Team,
We would like to add our recent paper "Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification" to HF daily papers. I am putting this request here, since I don't already have a paper in HF daily papers.
Paper: https://arxiv.org/pdf/2407.02352
Authors: Pritish Sahu, Karan Sikka, Ajay Divakaran
Thanks,
Pritish Sahu
Hello AK and HF Team,
We would like to share our 2022 paper, recently published in Automation in Construction, Science, Elsevier: "Vitruvio: Conditional variational autoencoder to generate building meshes via single perspective sketches"
Paper: https://www.sciencedirect.com/science/article/pii/S0926580524002346?dgcid=author (50 days free access).
Our arxiv paper: https://arxiv.org/abs/2210.13634
We demonstrated the critical importance of considering building orientation in reconstruction projects. Additionally, we have provided a comprehensive baseline and dataset specifically for building reconstruction. Help us spread the word within the AEC industry to raise awareness about these advancements. Watch our video presenting the problem and our findings: VIDEO .
Code: https://github.com/CDInstitute/Vitruvio
Feel free to use this message on your social media, blog, or any platform where you wish to share your research and video.
Alberto Tono, Heyaojing Huang, Ashwin Agrawal, and Martin Fischer
@kramp
@akhaliq
Dear Team,
Thanks for the nice work. But our paper just released today was marked as "older than 7 days".
Could you please kindly check?
paper: https://arxiv.org/abs/2407.06182 code: https://github.com/Adamdad/vico and project page https://adamdad.github.io/vico/
Best,
@kramp
@akhaliq
We would like to share our recent paper "Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model" in HF daily papers. We find that even leading multimodal models such as GPT-4o or Claude-3.5-Sonnet have difficulty recognizing simple abstract images, e.g., reading the time from a clock, understanding a flowchart, or planning a route using a road map. Therefore, we design a multimodal self-instruct pipeline to synthesize an abstract image benchmark and training set, and fine-tune (SFT) a model so that it understands abstract images.
We would appreciate your interest in our paper!
Paper: https://arxiv.org/abs/2407.07053
Code: https://github.com/zwq2018/Multi-modal-Self-instruct
Dataset: https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct
Leaderboard: https://multi-modal-self-instruct.github.io/
Thanks,
Wenqi Zhang
Launching ORLM: the first open-source Operations Research LLM, powered by our OR-Instruct process!
ORLMs achieve SOTA on NL4OPT, MAMO, & the new IndustryOR benchmarks based on different 7B backbones!
Paper: https://arxiv.org/pdf/2405.17743
Code: https://github.com/Cardinal-Operations/ORLM
How do I submit a paper that has not been submitted to arXiv?
paper: https://cientgu.github.io/files/VisualSignalDecomposition.pdf
Hello AK and HF Team @kramp @akhaliq,
We would like to share our recent paper "OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion". We propose OV-DINO, a novel unified open-vocabulary detection approach that offers superior performance and effectiveness for practical real-world applications. It entails a Unified Data Integration pipeline that integrates diverse data sources for end-to-end pre-training, and a Language-Aware Selective Fusion module to improve the vision-language understanding of the model. It shows significant performance improvements on the COCO and LVIS benchmarks compared to previous methods, achieving relative improvements of +2.5% AP on COCO and +13.6% AP on LVIS compared to G-DINO in zero-shot evaluation.
If you are interested in our paper, we would greatly appreciate it!
Paper: https://arxiv.org/abs/2407.07844
Code: https://github.com/wanghao9610/OV-DINO
Thanks,
Hao Wang
Dear Team,
I would like to share our paper recently accepted by ECAI. In this paper, we introduce FlowLearn, a novel dataset designed to test the capabilities of Large Vision-Language Models (LVLMs) in understanding and interpreting flowcharts. To the best of our knowledge, this is the first release of a dataset specifically tailored for flowchart comprehension and includes a comprehensive evaluation of LVLMs. Our research reveals significant challenges these models face, such as recognizing textual and visual components and their relationships within flowcharts. Through comprehensive experiments, we assess various state-of-the-art LVLMs, highlighting the gaps and providing insights for future advancements in machine comprehension of graphical data.
Title: FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding
Paper: https://arxiv.org/abs/2407.05183v1
Dataset: https://huggingface.co/datasets/jopan/FlowLearn
Code: https://github.com/Jo-Pan/FlowLearn
Dear Team,
We propose a novel approach to modulate LLM behaviors through direct parameter editing, offering an alternative to traditional alignment methods. Our approach achieves efficient modulation, with up to 90% detoxification, at inference-level computational cost!
Paper: https://arxiv.org/abs/2407.08770
Code: https://github.com/lucywang720/model-surgery/
@akhaliq @kramp Hi AK and HF team,
We would like to share our recent paper, "AUITestAgent: Natural Language-Driven GUI Functional Bug Tester." We propose AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. Experiments on customized benchmarks demonstrate that AUITestAgent outperforms existing tools in the quality of generated GUI interactions and achieves a verification accuracy of 94%. In addition, field deployment at Meituan has shown AUITestAgent's practical usability.
Paper: https://arxiv.org/abs/2407.09018
Web: https://github.com/bz-lab/AUITestAgent
Thanks,
Yongxiang Hu
Hi everyone,
We'd like to share our objective evaluation of recent TTS systems.
Since many new TTS systems have been released, with a wide range of approaches, we think evaluating them is very important.
It would also be beneficial to have access to objective evaluation, since that way, models can be evaluated during training, and evaluating new models is easier.
We found that objective evaluation correlates well with human ratings when done across a wide set of factors such as Prosody, Speaker, Environment, etc.
The TTS Arena recently released by huggingface and @mrfakename was a great human evaluation to compare against, and our methods showed a high correlation with their scores. We also showed strong correlation with MOS scores from the Blizzard 2008 challenge and the Back to the Future Blizzard Paper from 2022.
Paper: https://arxiv.org/abs/2407.12707
Web: https://ttsdsbenchmark.com
HF Space: https://huggingface.co/spaces/ttsds/benchmark
Code: https://github.com/ttsds/ttsds
It would be amazing to have our benchmark featured in the Daily Papers!
Cheers,
Christoph
We would like to present AurA, a zero-overhead inference-time intervention on LLM activations to mitigate harmful behaviors, such as toxicity. We will present it next week at ICML 2024.
In the figure below we show the toxicity reduction between the original model (circles) and using our AurA intervention (stars), for different LLMs. PPL stands for Perplexity and RTP refers to the Real Toxicity Prompts dataset. We also show AurA can reduce toxicity even in the presence of adversarial prompts.
It would be great if you could feature it in the Daily Papers!
Paper: https://arxiv.org/abs/2407.12824
Code: https://github.com/apple/ml-aura/tree/main
@akhaliq @kramp Hi AK and HF team,
We are excited to present our ECCV 2024 paper, "ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation". This innovative approach is the first to explore score distillation in training text-to-3D generation models, marking a shift from optimization-based to learning-based generation. Unlike existing data-driven methods, our approach enables the training of a high-quality text-to-3D generator in an unsupervised manner. This method opens up numerous potential applications, and we will continue to explore its possibilities.
Figure. First row: ASD application with optimization-based text-to-3D. Second row: ASD applied to learning-based text-to-3D, enabling the training of text-to-3D generation models without real 3D data. ASD can extend the training corpus to up to 100,000 text prompts.
Paper: https://arxiv.org/pdf/2407.02040
Code: https://github.com/theEricMa/ScaleDreamer
Hi all,
We have been working on a pretty cool AI agent framework.
AutoGRAMS is a software 2.5 framework that makes it easy to build very complex AI agents using spreadsheets, Python code, or the AutoGRAMS scripting language.
Paper: https://arxiv.org/abs/2407.10049
Code: https://github.com/autograms/autograms
Happy to receive any feedback.
Hi all, wanted to share our recent work on context-conditioned reward modeling: https://arxiv.org/abs/2407.14916 (accompanying context-conditioned preference dataset to follow shortly).
We construct a Reasonable Preference Reversal (RPR) dataset (to follow) and use it to finetune a 7B parameter reward model that outperforms Llama3-70B (as well as other reward models) on context-conditioned preference queries.
Our work could be used to improve performance when conditioning reward models on principles (e.g., for constitutional AI) or user profiles (e.g. for pluralistic alignment), or other contexts. The goal is to reduce ambiguity in preference queries, and work toward improving human preference modeling.
@akhaliq @kramp Hi AK and HF team
We propose a novel method, called DreamCar, to reconstruct 3D real cars in moving-forward scenes. I guess you would be interested in https://arxiv.org/pdf/2407.16988
Here is our project page: https://xiaobiaodu.github.io/dreamcar-project/
@akhaliq @kramp Hi AK and HF team
We propose a novel dataset, called 3DRealCar, containing 2500 large-scale 3D real cars for various tasks. I guess you would be interested in https://arxiv.org/abs/2406.04875
Here is our project page: https://xiaobiaodu.github.io/3drealcar/index.html
@akhaliq
@kramp
Hi AK and HF team
This paper (https://arxiv.org/pdf/2407.18290) discusses several key questions in the current visual generation community.
However, it cannot be submitted to Daily Papers because arXiv held it in review for three weeks, even though it only appeared on arXiv today. Could you help post it on Daily Papers?
A survey paper (https://arxiv.org/abs/2407.20018) discusses efficient LLM training systems and infrastructure.
@akhaliq @kramp Hi AK and HF team,
Our recent work "FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models" (https://arxiv.org/abs/2407.11522) builds a new dataset FIRE that empowers VLMs to integrate user feedback into the refined responses spontaneously, and provides a comprehensive evaluation for the feedback-refining ability of existing methods. We also host our dataset, model, and demo on Huggingface.
Project: https://mm-fire.github.io/
Dataset: https://huggingface.co/datasets/PengxiangLi/FIRE
Model: https://huggingface.co/li-qing/llava-next-llama3-8b-student-fire
Gradio Demo: https://li-qing-fire.hf.space
@akhaliq
@kramp
Hi AK and HF team,
Our recent work, Lumina-mGPT (https://arxiv.org/abs/2408.02657), introduces a multimodal autoregressive transformer capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. We have released our code and model on GitHub (https://github.com/Alpha-VLLM/Lumina-mGPT)
Hi HF team
This new paper (https://arxiv.org/abs/2408.01050 https://huggingface.co/papers/2408.01050 ) discusses hyperparameter optimization of HuggingFace pipelines and vLLM in the context of code generation.
We are excited to share our recent work "ncRNA Coding Potential Prediction Using BiLSTM and Transformer Encoder-Based Model".
The paper has already been accepted.
Paper: https://pubs.acs.org/doi/10.1021/acs.jcim.4c01097
Github: https://github.com/Minami-su/nBAT
Model: https://huggingface.co/Minami-su/nBAT
Data: https://huggingface.co/Minami-su/nBAT
Hey
We're excited to share this paper on open human preferences for LLMs: https://arxiv.org/abs/2408.16961
Hi, we are excited to share our recent work "Relation DETR: Exploring Explicit Position Relation Prior for Object Detection".
Paper: https://arxiv.org/abs/2407.11699v1
Github: https://github.com/xiuqhou/Relation-DETR
Dataset: https://huggingface.co/datasets/xiuqhou/SA-Det-100k
Hi, we are excited to share our recent paper on modular large language models "Configurable Foundation Models: Building LLMs from a Modular Perspective".
In this paper, we provide a comprehensive overview of existing efforts to decompose LLMs into modules and conduct an empirical study to verify the modularity characteristic of densely trained LLMs, Llama3 and Mistral.
Paper: https://arxiv.org/abs/2409.02877
We are excited to share our recent work, "Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?"
Paper: https://arxiv.org/abs/2409.02727
Github: https://github.com/yixuantt/PoolingAndAttn
We're thrilled to share our latest work, "Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects"
Paper: https://arxiv.org/abs/2409.02653
We're thrilled to share our latest work, "Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models", the first first-order FL method with shared randomness that significantly enhances the scalability of existing federated full-parameter tuning approaches by achieving high computational efficiency, reduced communication overhead, and fast convergence, all while maintaining competitive model accuracy.
Paper: https://arxiv.org/abs/2409.06277
Github: https://github.com/allen4747/Ferret
Hi, I'd like to share our paper beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems
Paper https://arxiv.org/pdf/2409.10309
Github https://github.com/recombee/beeformer
Excited to share our latest preprint: "CodonTransformer: a multispecies codon optimizer using context-aware neural networks"!
CodonTransformer is a groundbreaking deep learning model that optimizes DNA sequences for heterologous protein expression across 164 species.
By leveraging the Transformer architecture and a novel training strategy named STREAM, it generates host-specific DNA sequences with natural-like codon patterns, minimizing negative regulatory elements.
Website
https://adibvafa.github.io/CodonTransformer/
GitHub (Please give us a star!)
https://github.com/Adibvafa/CodonTransformer
Colab Notebook (Try it out!)
https://adibvafa.github.io/CodonTransformer/GoogleColab
Model
https://huggingface.co/adibvafa/CodonTransformer
Paper
https://www.biorxiv.org/content/10.1101/2024.09.13.612903
No Saved Kaleidosope: an 100% Jitted Neural Network Coding Language with Pythonic Syntax
https://arxiv.org/abs/2409.11600
Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration
paper: https://arxiv.org/pdf/2408.01099
I'm excited to share our recent work with everyone: A Survey on the Honesty of Large Language Models. In this paper, we systematically review the current research on LLM honesty and propose potential future research directions, aiming to contribute to the development of this field.
Paper: https://arxiv.org/pdf/2409.18786
Project Page: https://github.com/SihengLi99/LLM-Honesty-Survey
I'm excited to share our recent work with everyone: Do We Need Domain-Specific Embedding Models? An Empirical Investigation. In this paper, we introduce FinMTEB, empirically analyze the significant performance drop of seven SOTA embedding models on domain-specific context with four controlling metrics, and rethink the necessity of domain-specific LLM-based embedding models and benchmarks.
Paper: https://arxiv.org/pdf/2409.18511
Project Page: https://github.com/yixuantt/FinMTEB
Leveraging Foundation Models for Efficient Federated Learning in Resource-restricted Edge Networks
Excited to share our latest preprint: "Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation"!
In recent years, we have seen progress of foundation models (RT-X models and V/LLM's) as embodied agents. However, methods to augment these models with general-purpose long-term/large-scale memory have been under-explored. Clearly, as the environment becomes larger (e.g. outdoors) in navigation/"mobile" manipulation, we need a general-purpose external memory.
We introduce Embodied-RAG, a General Non-Parametric Method for Retrieval and Generation.
Project Page: https://quanting-xie.github.io/Embodied-RAG-web/
Paper: https://arxiv.org/abs/2409.18313
Youtube Demo: https://youtu.be/LcB89Rdyxhg
More demos on the website!
Could you also enable indexing of other preprint servers like TechRxiv?
We are glad to share our preprint work: REAL: Response Embedding-based Alignment for LLMs, an efficient, offline method for selecting high-quality data for LLM alignment.
It uses the response embedding to select the preference and non-preference pairs for DPO fine-tuning.
The paper link is https://arxiv.org/pdf/2409.17169
Hello, everyone. We are pleased to present our paper: "Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding"
To the best of our knowledge, this is the first training-free acceleration method for auto-regressive text-to-image generation models.
You can access the full paper here: https://arxiv.org/abs/2410.01699
We're thrilled to share our recent works,
- "Collaborative Performance Prediction for Large Language Models": While scaling laws have been a popular method for predicting LLM performance on downstream tasks, our research shows that simpler approaches like matrix factorization and neural collaborative filtering can yield even better results. We encourage a collaborative framework where model design information is shared, allowing for accurate predictions of future models' performance on downstream tasks. Our framework supports integration with open-source leaderboards, such as Open Leaderboard and HELM, enabling developers to predict their models' performance by leveraging historical model data. You can access the full paper here: https://arxiv.org/abs/2407.01300.
- "RevisEval: Improving LLM-as-a-Judge via Response-Adapted References": Evaluation has long been a cornerstone of progress in text generation capabilities. With the limitations of traditional metrics, LLM-as-a-Judge has become a viable method for assessing generative abilities in open-ended tasks, though it still faces significant reliability gaps compared to human evaluation. By harnessing the revision capabilities of LLMs, we unlock the potential of references in traditional evaluations, generating response-adapted references that can significantly enhance general evaluation methods on various tasks. This approach not only boosts the accuracy of LLM-as-a-Judge but also revives traditional metrics like BLEU, enabling them to effectively evaluate tasks on benchmarks such as MTBench and Alpacafarm, with results that are even comparable to those of LLM-as-a-Judge. It also performs well in using weak LLMs for evaluation and mitigating positional bias. You can access the full paper here: https://arxiv.org/abs/2410.05193
M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation (https://arxiv.org/pdf/2405.16273), accepted by NeurIPS 2024
I don't have a paper, but I made a small sample framework researchers could use for sampling experiments.
Text4Seg: Reimagining Image Segmentation as Text Generation
Paper: https://arxiv.org/abs/2410.09855
Github: https://github.com/mc-lan/Text4Seg
Depth Any Video with Scalable Synthetic Data
Depth Any Video introduces a scalable synthetic data pipeline, capturing 40,000 video clips from diverse games, and leverages powerful priors of generative video diffusion models to advance video depth estimation. By incorporating rotary position encoding, flow matching, and a mixed-duration training strategy, it robustly handles varying video lengths and frame rates. Additionally, a novel depth interpolation method enables high-resolution depth inference, achieving superior spatial accuracy and temporal consistency over previous models.
Arxiv link: https://arxiv.org/abs/2410.10815
Project page: https://depthanyvideo.github.io
Code: https://github.com/Nightmare-n/DepthAnyVideo
Huggingface gradio demo: https://huggingface.co/spaces/hhyangcs/depth-any-video
We are excited to share our recently proposed code completion benchmark "Codev-Bench: How Do LLMs Understand Develop-Centric Code Completion?".
Paper: https://arxiv.org/abs/2410.01353
Code: https://github.com/LingmaTongyi/Codev-Bench
Dataset: https://huggingface.co/datasets/TongyiLingma/CodevBench
Hi AK and HF team,
Our paper https://arxiv.org/abs/2411.00785 titled "IGOR: Image-GOal Representations Are the Atomic Control Units for Foundation Models in Embodied AI" was only made public today, after being on hold at arXiv for more than 7 days. However, the Daily Papers submission website shows that it is more than 7 days old. We would appreciate your help in posting the paper to the Daily Papers.
Hi AK and HF team,
I am happy to introduce our MicroAdam optimizer, a low-memory variant of the Adam optimizer that has a memory footprint of 0.9d bytes, compared to 2d bytes for AdamW-8bits. We achieve this result by only storing 99% sparse gradients and reconstructing the optimizer states at each step, which is a fast operation due to our optimized implementation using CUDA kernels. MicroAdam was developed mainly with finetuning tasks in mind. Please check out our work:
(Paper): https://arxiv.org/pdf/2405.15593
(Code): https://github.com/IST-DASLab/MicroAdam
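As a rough illustration of the footprint figures quoted above, the sketch below converts the per-parameter costs into gigabytes. It assumes d denotes the number of model parameters; the 7-billion-parameter model is a hypothetical example, and the constant and function names are illustrative, not part of the MicroAdam code.

```python
# Optimizer-state cost in bytes per parameter, as quoted in the post.
MICROADAM_BYTES_PER_PARAM = 0.9
ADAMW_8BIT_BYTES_PER_PARAM = 2.0


def optimizer_state_gb(num_params: int, bytes_per_param: float) -> float:
    """Optimizer-state size in GB (1 GB = 1e9 bytes) for a model with `num_params` parameters."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    d = 7_000_000_000  # hypothetical model size (assumption)
    micro = optimizer_state_gb(d, MICROADAM_BYTES_PER_PARAM)
    adamw8 = optimizer_state_gb(d, ADAMW_8BIT_BYTES_PER_PARAM)
    print(f"MicroAdam:   {micro:.1f} GB")
    print(f"AdamW-8bit:  {adamw8:.1f} GB")
    print(f"Saved:       {adamw8 - micro:.1f} GB")
```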
Hi everyone,
I would like to introduce GridSearcher, a tool we have been developing in our DAS-Lab @ ISTA to speed up the hyper-parameter tuning process. GridSearcher is a pure Python project designed to bypass the bash scripts typically used to run grids of parameters for ML projects. It provides a more flexible and user-friendly way to manage and execute multiple programs in parallel. It is designed for systems where users have direct SSH access to machines and can run their Python scripts right away.
Do you have access to your GPUs via SLURM? No problem, you can run srun --gres=gpu:8 --partition=gpu100 --time=10-00:00:00 --mem=1000G --cpus-per-task=200 --pty bash
to request a bash session on your cluster and get direct SSH access to the node, then use GridSearcher.
I am sure our project will help you save time, please check out our code on GitHub:
(Code): https://github.com/IST-DASLab/GridSearcher/
We are excited to introduce Qwen2vl-Flux, a state-of-the-art multimodal image generation model that enhances FLUX with Qwen2VL's vision-language understanding capabilities.
Key Features:
- Enhanced vision-language understanding through Qwen2VL-7B
- Multiple generation modes including variation, img2img, inpainting, and ControlNet guidance
- Flexible attention mechanism and structural control
Technical Details:
- Integrates Qwen2VL (7B) with FLUX architecture
Resources:
We would like to introduce our new work: "HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads", which leverages different attention heads for real image editing. Our work is based on FLUX and is training-free!
Project page: https://yuci-gpt.github.io/headrouter/
Paper: https://arxiv.org/abs/2411.15034
We are excited to share a new efficient small language model architecture with parallel Mamba and Attention fusion - Hymba.
We study the tradeoff between Mamba and Attention, exploring how they can be combined, how the attention sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers.
The team delivered an end-to-end solution featuring a novel architecture, data selection, and a five-stage training setup, and trained both Base and Instruct models. The release is under an open license.
A standout feature is that the Hymba-1.5B Base model outperforms LLaMA 3.2-3B, despite being trained on 7× fewer tokens and achieving a 12× cache reduction.
Model: https://huggingface.co/collections/nvidia/hymba-673c35516c12c4b98b5e845f
Paper: https://www.arxiv.org/abs/2411.13676
Inspired by "big-little core" chip design, we introduce "one-big-many-small" grouping for efficient multi-model deployment, cutting storage costs from NM to (1+rN)M!
Paper: https://arxiv.org/abs/2406.08903
Github: github.com/thunlp/Delta-CoMe
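The storage claim can be checked with a quick calculation. The post does not define its symbols, so this sketch assumes N is the number of deployed models, M is the storage needed for one full model, and r is the fraction of M needed for each compressed small model. The numbers and function names are hypothetical.

```python
def baseline_storage(n_models: int, model_size: float) -> float:
    """Storage when every model is kept in full: N * M."""
    return n_models * model_size


def grouped_storage(n_models: int, model_size: float, ratio: float) -> float:
    """Storage for one big model plus N compressed small ones: (1 + r * N) * M."""
    return (1 + ratio * n_models) * model_size


if __name__ == "__main__":
    n_models = 8      # hypothetical number of models
    model_size = 16.0  # hypothetical size of one full model, in GB
    ratio = 0.125      # hypothetical compression ratio r
    full = baseline_storage(n_models, model_size)
    grouped = grouped_storage(n_models, model_size, ratio)
    print(f"N*M        = {full:.1f} GB")
    print(f"(1+r*N)*M  = {grouped:.1f} GB")
    print(f"saving     = {full / grouped:.1f}x")
```

Under this reading the saving grows with N: as N becomes large, the ratio N / (1 + rN) approaches 1/r.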
Hi AK and HF team,
I am happy to introduce our DiffusionDrive, a real-time end-to-end autonomous driving model, which is much faster (10x reduction in diffusion denoising steps), more accurate (3.5 higher PDMS on NAVSIM), and more diverse (64% higher mode diversity score) than the vanilla diffusion policy. Without bells and whistles, DiffusionDrive achieves record-breaking 88.1 PDMS on NAVSIM benchmark with the same ResNet-34 backbone by directly learning from human demonstrations, while running at a real-time speed of 45 FPS. Please check out our work:
(Paper): https://arxiv.org/abs/2411.15139
(Code): https://github.com/hustvl/DiffusionDrive
We also demonstrate robust and safe driving in a real-world application.
@akhaliq
@kramp
Dear AK and HF team,
We are pleased to share our latest research paper, "Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS," for your consideration, as we believe it may be of significant interest for HF Daily Paper. This work introduces HiAR-ICL, a novel paradigm to enhance the complex reasoning capabilities of large language models.
Unlike traditional in-context learning, HiAR-ICL shifts the focus from example-based analogical learning to abstract thinking patterns. It employs Monte Carlo Tree Search to explore reasoning paths and creates "thought cards" to guide inference. By dynamically matching test problems with appropriate thought cards through a proposed cognitive complexity framework, HiAR-ICL achieves a state-of-the-art accuracy of 79.6% with a 7B model on the challenging MATH benchmark, surpassing both GPT-4o and Claude 3.5.
Paper: https://arxiv.org/pdf/2411.18478
Project Page: https://jinyangwu.github.io/hiar-icl/
We would greatly appreciate your consideration of our paper for inclusion.
Best regards,
Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, Jianhua Tao
Hi @kramp and @akhaliq,
please could you help me verify my authorship claim for this paper? https://huggingface.co/papers/2411.15640
Today is the 6th day, and I need to be able to feature it on the Daily Papers.
I hope you're doing well! I would like to kindly request your assistance in verifying my authorship claim for this paper: https://huggingface.co/papers/2411.18478. Today marks the 6th day, and I would appreciate it if you could help expedite the verification process so that the paper can be featured on the daily papers.
Thank you so much for your help!
Best regards,
Jinyang Wu
Self-Supervised Unified Generation with Universal Editing: https://arxiv.org/pdf/2412.02114
Dear AK and HF team,
I would like to kindly request your assistance in verifying my authorship claim for this paper: https://huggingface.co/papers/2411.18478. Today marks the 7th day, and I would appreciate it if you could help expedite the verification process so that the paper can be featured on the daily papers.
Thank you so much for your help!
Best regards,
Mingyu Xu
@akhaliq
@kramp
Dear AK and HF team,
We would like to kindly request your assistance in sharing our latest research paper, released less than one month ago (Nov. 14): "Golden Noise for Diffusion Models: A Learning Framework". We believe it may be of significant interest for HF Daily Paper.
First, we identify a new concept termed noise prompt, which aims at turning a random noise into a golden noise by adding a small desirable perturbation derived from the text prompt. The golden noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Building upon this concept, we formulate a noise prompt learning framework that learns "prompted" golden noises associated with text prompts for diffusion models.
Second, to implement the formulated noise prompt learning framework, we propose the training dataset, namely the noise prompt dataset (NPD), and the learning model, namely the noise prompt network (NPNet). Specifically, we design a noise prompt data collection pipeline via re-denoise sampling, a way to produce noise pairs. We also incorporate AI-driven feedback mechanisms to ensure that the noise pairs are highly valuable. This pipeline enables us to collect a large-scale training dataset for noise prompt learning, so the trained NPNet can directly transform a random Gaussian noise into a golden noise to boost the performance of the T2I diffusion model.
Third, we conduct extensive experiments across various mainstream diffusion models, including StableDiffusion-xl (SDXL), DreamShaper-xl-v2-turbo, and Hunyuan-DiT, with 7 different samplers on 4 different datasets. We evaluate our model using 6 human preference metrics: Human Preference Score v2 (HPSv2), PickScore, Aesthetic Score (AES), ImageReward, CLIPScore, and Multi-dimensional Preference Score (MPS). As illustrated in Fig. 1, by leveraging the learned golden noises, not only are the overall quality and aesthetic style of the synthesized images visually enhanced, but all metrics also show significant improvements, demonstrating the effectiveness and generalization ability of our NPNet. For instance, on GenEval, our NPNet lets SDXL improve the classical evaluation metric HPSv2 by 18% (24.04→28.41), which even surpasses a recent, much stronger DiT-based diffusion model, Hunyuan-DiT (27.78). Furthermore, the NPNet is a compact and efficient neural network that functions as a plug-and-play module, introducing only 3% extra inference time per image compared to the standard pipeline, while requiring approximately 3% of the memory required by the standard pipeline. This efficiency underscores the practical applicability of NPNet in real-world scenarios.
Paper: https://arxiv.org/abs/2411.09502
Project Page: https://github.com/xie-lab-ml/Golden-Noise-for-Diffusion-Models
We would greatly appreciate your assistance and consideration of our paper for inclusion.
Best regards,
Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, Zeke Xie
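As a quick sanity check on the HPSv2 numbers quoted in this post, the relative gain from 24.04 to 28.41 can be recomputed. This is a minimal sketch; the helper name `relative_gain` is hypothetical and not taken from the paper or its code.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


sdxl_baseline = 24.04    # HPSv2 of SDXL, as quoted in the post
sdxl_with_npnet = 28.41  # HPSv2 of SDXL with NPNet, as quoted in the post
hunyuan_dit = 27.78      # HPSv2 of Hunyuan-DiT, as quoted in the post

gain = relative_gain(sdxl_baseline, sdxl_with_npnet)
print(f"{gain:.1f}%")  # 18.2%
print(sdxl_with_npnet > hunyuan_dit)  # True
```

The result rounds to the 18% stated in the post, and the improved SDXL score does exceed the quoted Hunyuan-DiT score.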
@akhaliq
@kramp
Dear AK and HF team,
We would like to kindly request your assistance in sharing our latest research paper, "Bringing Objects to Life: 4D generation from 3D objects".
We believe it may be of significant interest for HF Daily Paper.
Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts.
4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content.
In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object.
We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions.
We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.
Paper: https://arxiv.org/abs/2412.20422
Project Page: https://3-to-4d.github.io/3-to-4d/
We would greatly appreciate your assistance and consideration of our paper for inclusion.