Sailor2 Evaluation

community

Recent Activity

gabrielchua

authored a paper 21 days ago

Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security

Paper • 2507.19399 • Published 24 days ago • 1

gabrielchua

authored a paper 26 days ago

LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Paper • 2507.15339 • Published 28 days ago

binwang

authored a paper 27 days ago

MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Paper • 2507.14683 • Published 30 days ago • 125

gabrielchua

authored a paper 28 days ago

Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation

Paper • 2507.11966 • Published Jul 16

SivilTaram

authored a paper about 1 month ago

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Paper • 2507.12415 • Published Jul 16 • 41

kunato

authored 3 papers about 1 month ago

Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

Paper • 2502.12982 • Published Feb 18 • 18

Mind the Gap! Static and Interactive Evaluations of Large Audio Models

Paper • 2502.15919 • Published Feb 21 • 4

FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

Paper • 2506.16123 • Published Jun 19 • 8

gabrielchua

authored a paper about 1 month ago

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Paper • 2507.09820 • Published Jul 13

SivilTaram

authored a paper about 1 month ago

First Return, Entropy-Eliciting Explore

Paper • 2507.07017 • Published Jul 9 • 23

gabrielchua

authored a paper about 1 month ago

RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Paper • 2507.05980 • Published Jul 8 • 1

SivilTaram

authored a paper about 1 month ago

ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Paper • 2507.01004 • Published Jul 1 • 10

hynky

authored a paper about 2 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 65

SivilTaram

authored a paper 3 months ago

General-Reasoner: Advancing LLM Reasoning Across All Domains

Paper • 2505.14652 • Published May 20 • 23

yongzx

authored 4 papers 3 months ago

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Paper • 2406.10118 • Published Jun 14, 2024 • 33

Humanity's Last Exam

Paper • 2501.14249 • Published Jan 24 • 75

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks

Paper • 2410.18210 • Published Oct 23, 2024

Crosslingual Reasoning through Test-Time Scaling

Paper • 2505.05408 • Published May 8 • 8

dreamerdeo

authored 2 papers 4 months ago

FlowReasoner: Reinforcing Query-Level Meta-Agents

Paper • 2504.15257 • Published Apr 21 • 47

NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Paper • 2504.13055 • Published Apr 17 • 19