Zesen Cheng

ClownRat

AI & ML interests

Multi-modal foundation models; segmentation, detection, and tracking

Recent Activity

upvoted a paper 1 day ago

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

authored a paper 1 day ago

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

upvoted a paper 7 days ago

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

ClownRat's activity

upvoted a paper 1 day ago

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Paper • 2503.14428 • Published 8 days ago • 8

upvoted 2 papers 7 days ago

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 122

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Paper • 2410.18558 • Published Oct 24, 2024 • 20

upvoted a paper 12 days ago

Transformers without Normalization

Paper • 2503.10622 • Published 13 days ago • 136

upvoted 2 papers 25 days ago

LongRoPE2: Near-Lossless LLM Context Window Scaling

Paper • 2502.20082 • Published 27 days ago • 36

Self-rewarding correction for mathematical reasoning

Paper • 2502.19613 • Published 28 days ago • 82

upvoted 2 articles about 1 month ago

Mixture of Experts Explained

Article • Published Dec 11, 2023 • 486

SigLIP 2: A better multilingual vision language encoder

Article • Published Feb 21 • 145

upvoted 3 papers about 1 month ago

Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 172

Small Models Struggle to Learn from Strong Reasoners

Paper • 2502.12143 • Published Feb 17 • 32

LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

Paper • 2502.13922 • Published Feb 19 • 25

upvoted 9 papers about 2 months ago

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Paper • 2501.12599 • Published Jan 22 • 111

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper • 2501.12948 • Published Jan 22 • 364

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Paper • 2501.07888 • Published Jan 14 • 15

MiniMax-01: Scaling Foundation Models with Lightning Attention

Paper • 2501.08313 • Published Jan 14 • 281

Valley2: Exploring Multimodal Models with Scalable Vision-Language Design

Paper • 2501.05901 • Published Jan 10 • 1

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Paper • 2408.15998 • Published Aug 28, 2024 • 87