diff --git "a/papers_with_abstracts_parallel.csv" "b/papers_with_abstracts_parallel.csv" --- "a/papers_with_abstracts_parallel.csv" +++ "b/papers_with_abstracts_parallel.csv" @@ -407,7 +407,6 @@ Fully Dynamic Euclidean Bi-Chromatic Matching in Sublinear Update Time,Gramoz Go Poly2Vec: Polymorphic Fourier-Based Encoding of Geospatial Objects for GeoAI Applications,Maria Despoina Siampou Jialiang Li John Krumm Cyrus Shahabi Hua Lu,https://icml.cc/virtual/2025/poster/44245,"Encoding geospatial objects is fundamental for geospatial artificial intelligence (GeoAI) applications, which leverage machine learning (ML) models to analyze spatial information. Common approaches transform each object into known formats, like image and text, for compatibility with ML models. However, this process often discards crucial spatial information, such as the object's position relative to the entire space, reducing downstream task effectiveness. Alternative encoding methods that preserve some spatial properties are often devised for specific data objects (e.g., point encoders), making them unsuitable for tasks that involve different data types (i.e., points, polylines, and polygons). To this end, we propose Poly2Vec, a polymorphic Fourier-based encoding approach that unifies the representation of geospatial objects, while preserving the essential spatial properties. Poly2Vec incorporates a learned fusion module that adaptively integrates the magnitude and phase of the Fourier transform for different tasks and geometries.We evaluate Poly2Vec on five diverse tasks, organized into two categories. The first empirically demonstrates that Poly2Vec consistently outperforms object-specific baselines in preserving three key spatial relationships: topology, direction, and distance. The second shows that integrating Poly2Vec into a state-of-the-art GeoAI workflow improves the performance in two popular tasks: population prediction and land use inference." Compute or Load KV Cache? Why Not Both?,Shuowei Jin Xueshen Liu Qingzhao Zhang Zhuoqing Mao,https://icml.cc/virtual/2025/poster/45020,"Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix caching requests, improving system throughput and adapting to fluctuating resource availabilty. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves on average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments." 
QUTE: Quantifying Uncertainty in TinyML models with Early-exit-assisted ensembles for model-monitoring,Nikhil Pratap Ghanathe Steven J E Wilton,https://icml.cc/virtual/2025/poster/45956,"Uncertainty quantification (UQ) provides a resource-efficient solution for on-device monitoring of tinyML models deployed remotely without access to true labels. However, existing UQ methods impose significant memory and compute demands, making them impractical for ultra-low-power, KB-sized tinyML devices. Prior work has attempted to reduce overhead by using early-exit ensembles to quantify uncertainty in a single forward pass, but these approaches still carry prohibitive costs. To address this, we propose QUTE, a novel resource-efficient early-exit-assisted ensemble architecture optimized for tinyML models. QUTE introduces additional output blocks at the final exit of the base network, distilling early-exit knowledge into these blocks to form a diverse yet lightweight ensemble. We show that QUTE delivers superior uncertainty quality on tiny models, achieving comparable performance on larger models with 59% smaller model sizes than the closest prior work. When deployed on a microcontroller, QUTE demonstrates a 31% reduction in latency on average. In addition, we show that QUTE excels at detecting accuracy-drop events, outperforming all prior works." -Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization,Parvin Nazari Bojian Hou Davoud Ataee Tarzanagh Li Shen George Michailidis,https://openreview.net/forum?id=nYrj9ZwgYk, Balancing Model Efficiency and Performance: Adaptive Pruner for Long-tailed Data,Zhe Zhao HaiBin Wen Pengkun Wang ShuangWang Zhenkun Wang Qingfu Zhang Yang Wang,https://icml.cc/virtual/2025/poster/46627,"Long-tailed distribution datasets are prevalent in many machine learning tasks, yet existing neural network models still face significant challenges when handling such data. This paper proposes a novel adaptive pruning strategy, LTAP (Long-Tailed Adaptive Pruner), aimed at balancing model efficiency and performance to better address the challenges posed by long-tailed data distributions. LTAP introduces multi-dimensional importance scoring criteria and designs a dynamic weight adjustment mechanism to adaptively determine the pruning priority of parameters for different classes. By focusing on protecting parameters critical for tail classes, LTAP significantly enhances computational efficiency while maintaining model performance. This method combines the strengths of long-tailed learning and neural network pruning, overcoming the limitations of existing approaches in handling imbalanced data. Extensive experiments demonstrate that LTAP outperforms existing methods on various long-tailed datasets, achieving a good balance between model compression rate, computational efficiency, and classification accuracy. This research provides new insights into solving model optimization problems in long-tailed learning and is significant for improving the performance of neural networks on imbalanced datasets. The code is available at https://github.com/DataLab-atom/LT-VOTE." 
DANCE: Dual Unbiased Expansion with Group-acquired Alignment for Out-of-distribution Graph Fairness Learning,Yifan Wang Hourun Li Ling Yue Zhiping Xiao Jia Yang Changling Zhou Wei Ju Ming Zhang Xiao Luo,https://icml.cc/virtual/2025/poster/45227,"Graph neural networks (GNNs) have shown strong performance in graph fairness learning, which aims to ensure that predictions are unbiased with respect to sensitive attributes. However, existing approaches usually assume that training and test data share the same distribution, which rarely holds in the real world. To tackle this challenge, we propose a novel approach named Dual Unbiased Expansion with Group-acquired Alignment (DANCE) for graph fairness learning under distribution shifts. The core idea of our DANCE is to synthesize challenging yet unbiased virtual graph data in both graph and hidden spaces, simulating distribution shifts from a data-centric view. Specifically, we introduce the unbiased Mixup in the hidden space, prioritizing minor groups to address the potential imbalance of sensitive attributes. Simultaneously, we conduct fairness-aware adversarial learning in the graph space to focus on challenging samples and improve model robustness. To further bridge the domain gap, we propose a group-acquired alignment objective that prioritizes negative pair groups with identical sensitive labels. Additionally, a representation disentanglement objective is adopted to decorrelate sensitive attributes and target representations for enhanced fairness. Extensive experiments demonstrate the superior effectiveness of the proposed DANCE." Does Data Scaling Lead to Visual Compositional Generalization?,Arnas Uselis Andrea Dittadi Seong Joon Oh,https://icml.cc/virtual/2025/poster/45559,"Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will improve out-of-distribution performance, including compositional generalization. We test this premise through controlled experiments that systematically vary data scale, concept diversity, and combination coverage. We find that compositional generalization is driven by data diversity, not mere data scale. Increased combinatorial coverage forces models to discover a linearly factored representational structure, where concepts decompose into additive components. We prove this structure is key to efficiency, enabling perfect generalization from few observed combinations. Evaluating pretrained models (DINO, CLIP), we find above-random yet imperfect performance, suggesting partial presence of this structure. Our work motivates stronger emphasis on constructing diverse datasets for compositional generalization, and considering the importance of representational structure that enables efficient compositional learning." @@ -595,7 +594,6 @@ Feature Importance Metrics in the Presence of Missing Data,Henrik von Kleist Jos Multi-Session Budget Optimization for Forward Auction-based Federated Learning,Xiaoli Tang Han Yu Zengxiang Li Xiaoxiao Li,https://icml.cc/virtual/2025/poster/44752,"Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL data consumers (DCs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, a DC can trigger the FL training process multiple times. 
DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL DCs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MBOS-AFL). Based on hierarchical reinforcement learning, MBOS-AFL jointly optimizes intersession budget pacing and intra-session bidding for AFL DCs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for DCs in multi-session forward AFL." Testing Conditional Mean Independence Using Generative Neural Networks,Yi Zhang Linjun Huang Yun Yang Xiaofeng Shao,https://icml.cc/virtual/2025/poster/44626,"Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variable, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an $n^{-1/2}$ neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure." Isolated Causal Effects of Natural Language,Victoria Lin Louis-Philippe Morency Eli Ben-Michael,https://icml.cc/virtual/2025/poster/44884,"As language technologies become widespread, it is important to understand how changes in language affect reader perceptions and behaviors. These relationships may be formalized as the isolated causal effect of some focal language-encoded intervention (e.g., factual inaccuracies) on an external outcome (e.g., readers' beliefs). In this paper, we introduce a formal estimation framework for isolated causal effects of language. We show that a core challenge of estimating isolated effects is the need to approximate all non-focal language outside of the intervention. Drawing on the principle of omitted variable bias, we provide measures for evaluating the quality of both non-focal language approximations and isolated effect estimates themselves. We find that poor approximation of non-focal language can lead to bias in the corresponding isolated effect estimates due to omission of relevant variables, and we show how to assess the sensitivity of effect estimates to such bias along the two key axes of fidelity and overlap. In experiments on semi-synthetic and real-world data, we validate the ability of our framework to correctly recover isolated effects and demonstrate the utility of our proposed measures." 
-The Minimal Search Space for Conditional Causal Bandits,Francisco N. F. Q. Simoes Itai Feigenbaum Mehdi Dastani Thijs van Ommen,https://openreview.net/forum?id=GOWRex7nOA, Generalization Performance of Ensemble Clustering: From Theory to Algorithm,Xu Zhang Haoye Qiu Weixuan Liang Hui LIU Junhui Hou Yuheng Jia,https://icml.cc/virtual/2025/poster/46061,"Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive a convergence rate of generalization error bound and excess risk bound both of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than log $n$, i.e., $m,n\to \infty, m\gg \log n$, ensemble clustering is consistent. Furthermore, recognizing that $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to finite clusterings, we minimize the error between the empirical average clusterings and their expectation. From this, we theoretically demonstrate that to achieve better clustering performance, we should minimize the deviation (bias) of base clustering from its expectation and maximize the differences (diversity) among various base clusterings. Additionally, we derive that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1\%, 7.3\%, and 6.0\% on 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at https://github.com/xuz2019/GPEC." MathConstruct: Challenging LLM Reasoning with Constructive Proofs,Mislav Balunovic Jasper Dekoninck Nikola Jovanović Ivo Petrov Martin Vechev,https://icml.cc/virtual/2025/poster/44502,"While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 127 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, used to evaluate robustness. State-of-the-art LLMs solve only 41\% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation." Wait-Less Offline Tuning and Re-solving for Online Decision Making,Jingruo Sun Wenzhi Gao Ellen Vitercik Yinyu Ye,https://icml.cc/virtual/2025/poster/46114,"Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. 
However, LP-based methods are computationally expensive and often inefficient for large-scale applications. By contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from weaker regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP algorithms. Our algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In parallel, a first-order method runs during each interval between LP re-solves and smooths resource consumption. Our algorithm achieves $\mathcal{O}(\log (T/f) + \sqrt{f})$ regret and delivers a ""wait-less"" online decision-making process that balances computational efficiency and regret guarantees. Extensive experiments demonstrate at least $10$-fold improvements in regret over first-order methods and $100$-fold improvements in runtime over LP-based methods." @@ -642,7 +640,6 @@ From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning,Noa The Role of Sparsity for Length Generalization in LLMs,Noah Golowich Samy Jelassi David Brandfonbrener Sham M. Kakade Eran Malach,https://icml.cc/virtual/2025/poster/45242,"Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers which generalize attention heads successfully length-generalize on such tasks. As a bonus, our theoretical model allows us to provide justifications for techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling.We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a ``sparse'' dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which Position Coupling can successfully be applied to achieve length generalization." Learning Parametric Distributions from Samples and Preferences,Marc Jourdan Gizem Yüce Nicolas Flammarion,https://icml.cc/virtual/2025/poster/45822,"Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preferences-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. 
Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $\mathcal{O}(1/n)$---a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate; up to problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward." "Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation",Juno Kim Denny Wu Jason D. Lee Taiji Suzuki,https://icml.cc/virtual/2025/poster/46591,"A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time computation by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed \emph{metastable representation} of the reasoning dynamics can be distilled into a smaller, more efficient model." -On the Geometry of Regularization in Adversarial Training: High-Dimensional Asymptotics and Generalization Bounds,Matteo Vilucchio Nikolaos Tsilivis Bruno Loureiro Julia Kempe,https://openreview.net/forum?id=9Mxi4pMe4G, Retraining with Predicted Hard Labels Provably Increases Model Accuracy,Rudrajit Das Inderjit S Dhillon Alessandro Epasto Adel Javanmard Jieming Mao Vahab Mirrokni Sujay Sanghavi Peilin Zhong,https://icml.cc/virtual/2025/poster/43932,"The performance of a model trained with noisy labels is often improved by simply *retraining* the model with its *own predicted hard labels* (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. 
For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining." Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret,Bingshan Hu ZHIMING HUANG Tianyue H. Zhang Mathias Lécuyer Nidhi Hegde,https://icml.cc/virtual/2025/poster/45953,"We address differentially private stochastic bandit problems by leveraging Thompson Sampling with Gaussian priors and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private algorithm that enables trading off privacy and regret. DP-TS-UCB satisfies $ \tilde{O} \left(T^{0.25(1-\alpha)}\right)$-GDP and achieves $O \left(K\ln^{\alpha+1}(T)/\Delta \right)$ regret bounds, where $K$ is the number of arms, $ \Delta$ is the sub-optimality gap, $T$ is the learning horizon, and $\alpha \in [0,1]$ controls the trade-off between privacy and regret. Theoretically, DP-TS-UCB relies on anti-concentration bounds for the Gaussian distributions, linking the exploration mechanisms of Thompson Sampling and Upper Confidence Bound, which may be of independent research interest." SlimLLM: Accurate Structured Pruning for Large Language Models,Jialong Guo Xinghao Chen Yehui Tang Yunhe Wang,https://icml.cc/virtual/2025/poster/46559,"Large language models(LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, due to their enormous computational costs, the deployment and application of LLMs are often severely limited. To address this issue, structured pruning is an effective solution to compress the parameters of LLMs. Determining the importance of each sub-module in LLMs and minimizing performance loss are critical issues that need to be carefully addressed in structured pruning. In this paper, we propose an effective and fast structured pruning method named SlimLLM for large language models. For channel and attention head pruning, we evaluate the importance based on the entire channel or head, rather than merely aggregating the importance of individual elements within a sub-module. This approach enables a more holistic consideration of the interdependence among elements within the sub-module. In addition, we design a simple linear regression strategy for the output matrix to quickly recover performance. We also propose layer-based importance ratio to determine the pruning ratio for each layer. Based on the LLaMA benchmark results, our SlimLLM outperforms other methods and achieves state-of-the-art performance." @@ -678,10 +675,8 @@ Latent Mamba Operator for Partial Differential Equations,Karn Tiwari Niladri Dut OneForecast: A Universal Framework for Global and Regional Weather Forecasting,Yuan Gao Hao Wu Ruiqi Shu huanshuo dong Fan Xu Rui Ray Chen Yibo Yan Qingsong Wen Xuming Hu Kun Wang Jiahao Wu Li Qing Hui Xiong Xiaomeng Huang,https://icml.cc/virtual/2025/poster/46192,"Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. 
To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Codes link: \url{https://github.com/YuanGao-YG/OneForecast}." Perceptually Constrained Precipitation Nowcasting Model,Wenzhi Feng Xutao Li Zhe Wu Kenghong Lin Demin Yu Yunming Ye Yaowei Wang,https://icml.cc/virtual/2025/poster/45973,"Most current precipitation nowcasting methods aim to capture the underlying spatiotemporal dynamics of precipitation systems by minimizing the mean square error (MSE). However, these methods often neglect effective constraints on the data distribution, leading to unsatisfactory prediction accuracy and image quality, especially for long forecast sequences. To address this limitation, we propose a precipitation nowcasting model incorporating perceptual constraints. This model reformulates precipitation nowcasting as a posterior MSE problem under such constraints. Specifically, we first obtain the posteriori mean sequences of precipitation forecasts using a precipitation estimator. Subsequently, we construct the transmission between distributions using rectified flow. To enhance the focus on distant frames, we design a frame sampling strategy that gradually increases the corresponding weights. We theoretically demonstrate the reliability of our solution, and experimental results on two publicly available radar datasets demonstrate that our model is effective and outperforms current state-of-the-art models." Physics-Informed Generative Modeling of Wireless Channels,Benedikt Böck Andreas Oeldemann Timo Mayer Francesco Rossetto Wolfgang Utschick,https://icml.cc/virtual/2025/poster/45899,"Learning the site-specific distribution of the wireless channel within a particular environment of interest is essential to exploit the full potential of machine learning (ML) for wireless communications and radar applications. Generative modeling offers a promising framework to address this problem. However, existing approaches pose unresolved challenges, including the need for high-quality training data, limited generalizability, and a lack of physical interpretability. To address these issues, we combine the physics-related compressibility of wireless channels with generative modeling, in particular, sparse Bayesian generative modeling (SBGM), to learn the distribution of the underlying physical channel parameters. By leveraging the sparsity-inducing characteristics of SBGM, our methods can learn from compressed observations received by an access point (AP) during default online operation. Moreover, they are physically interpretable and generalize over system configurations without requiring retraining." 
-Sample-efficient diffusion-based control of complex nonlinear systems,Hongyi Chen Jingtao Ding Jianhai Shu Xinchun Yu Xiaojun Liang Yong Li Xiao-Ping Zhang,https://openreview.net/forum?id=rUXTjhFMG5, Scalable Equilibrium Sampling with Sequential Boltzmann Generators,Charlie B. Tan Joey Bose Chen Lin Leon Klein Michael M. Bronstein Alexander Tong,https://icml.cc/virtual/2025/poster/45137,"Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows with importance sampling to obtain uncorrelated samples under the target distribution. In this paper, we extend the Boltzmann generator framework with two key contributions, denoting our framework Sequential Boltzmann Generators (SBG). The first is a highly efficient Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In contrast to the equivariant continuous flows of prior methods, we leverage exactly invertible non-equivariant architectures which are highly efficient during both sample generation and likelihood evaluation. This efficiency unlocks more sophisticated inference strategies beyond standard importance sampling. In particular, we perform inference-time scaling of flow samples using a continuous-time variant of sequential Monte Carlo, in which flow samples are transported towards the target distribution with annealed Langevin dynamics. SBG achieves state-of-the-art performance w.r.t. all metrics on peptide systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra- and hexa-peptides that were thus far intractable for prior Boltzmann generators." Sidechain conditioning and modeling for full-atom protein sequence design with FAMPNN,Talal Widatalla Richard W. Shuai Brian Hie Possu Huang,https://icml.cc/virtual/2025/poster/45963,"Leading deep learning-based methods for fixed-backbone protein sequence design do not model protein sidechain conformation during sequence generation despite the large role the three-dimensional arrangement of sidechain atoms play in protein conformation, stability, and overall protein function. Instead, these models implicitly reason about crucial sidechain interactions based solely on backbone geometry and amino-acid sequence. To address this, we present FAMPNN (Full-Atom MPNN), a sequence design method that explicitly models both sequence identity and sidechain conformation for each residue, where the per-token distribution of a residue’s discrete amino acid identity and its continuous sidechain conformation are learned with a combined categorical cross-entropy and diffusion loss objective. We demonstrate learning these distributions jointly is a highly synergistic task that both improves sequence recovery while achieving state-of-the-art sidechain packing. Furthermore, benefits from explicit full-atom modeling generalize from sequence recovery to practical protein design applications, such as zero-shot prediction of experimental binding and stability measurements." 
-Spectral Informed Neural Networks,Tianchi Yu Yiming Qi Ivan Oseledets Shiyi Chen,https://openreview.net/forum?id=MZyfS1cf0u, Zebra: In-Context Generative Pretraining for Solving Parametric PDEs,Louis Serrano Armand Kassaï Koupaï Thomas X Wang Pierre ERBACHER Patrick Gallinari,https://icml.cc/virtual/2025/poster/46609,"Solving time-dependent parametric partial differential equations (PDEs) is challenging for data-driven methods, as these models must adapt to variations in parameters such as coefficients, forcing terms, and initial conditions. State-of-the-art neural surrogates perform adaptation through gradient-based optimization and meta-learning to implicitly encode the variety of dynamics from observations. This often comes with increased inference complexity. Inspired by the in-context learning capabilities of large language models (LLMs), we introduce Zebra, a novel generative auto-regressive transformer designed to solve parametric PDEs without requiring gradient adaptation at inference. By leveraging in-context information during both pre-training and inference, Zebra dynamically adapts to new tasks by conditioning on input sequences that incorporate context example trajectories. As a generative model, Zebra can be used to generate new trajectories and allows quantifying the uncertainty of the predictions. We evaluate Zebra across a variety of challenging PDE scenarios, demonstrating its adaptability, robustness, and superior performance compared to existing approaches." 3D-LMVIC: Learning-based Multi-View Image Compression with 3D Gaussian Geometric Priors,Yujun Huang Bin Chen Niu Lian Xin Wang Baoyi An Tao Dai Shu-Tao Xia,https://icml.cc/virtual/2025/poster/46162,"Existing multi-view image compression methods often rely on 2D projection-based similarities between views to estimate disparities. While effective for small disparities, such as those in stereo images, these methods struggle with the more complex disparities encountered in wide-baseline multi-camera systems, commonly found in virtual reality and autonomous driving applications. To address this limitation, we propose 3D-LMVIC, a novel learning-based multi-view image compression framework that leverages 3D Gaussian Splatting to derive geometric priors for accurate disparity estimation. Furthermore, we introduce a depth map compression model to minimize geometric redundancy across views, along with a multi-view sequence ordering strategy based on a defined distance measure between views to enhance correlations between adjacent views. Experimental results demonstrate that 3D-LMVIC achieves superior performance compared to both traditional and learning-based methods. Additionally, it significantly improves disparity estimation accuracy over existing two-view approaches." ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization,Wenhao Shen Wanqi Yin Xiaofeng Yang Cheng Chen Chaoyue Song Zhongang Cai Lei Yang Hao Wang Guosheng Lin,https://icml.cc/virtual/2025/poster/43558,"Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion-based HMR model in a preference optimization manner. 
First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to finetune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods. Code is available at:https://github.com/shenwenhao01/ADHMR." @@ -810,7 +805,6 @@ GrokFormer: Graph Fourier Kolmogorov-Arnold Transformers,GuoguoAi Guansong Pang N2GON: Neural Networks for Graph-of-Net with Position Awareness,Yejiang Wang Yuhai Zhao Zhengkui Wang Wen Shan Ling Li Qian Li Miaomiao Huang Meixia Wang Shirui Pan Xingwei Wang,https://icml.cc/virtual/2025/poster/43890,"Graphs, fundamental in modeling various research subjects such as computing networks, consist of nodes linked by edges. However, they typically function as components within larger structures in real-world scenarios, such as in protein-protein interactions where each protein is a graph in a larger network. This study delves into the Graph-of-Net (GON), a structure that extends the concept of traditional graphs by representing each node as a graph itself. It provides a multi-level perspective on the relationships between objects, encapsulating both the detailed structure of individual nodes and the broader network of dependencies. To learn node representations within the GON, we propose a position-aware neural network for Graph-of-Net which processes both intra-graph and inter-graph connections and incorporates additional data like node labels. Our model employs dual encoders and graph constructors to build and refine a constraint network, where nodes are adaptively arranged based on their positions, as determined by the network's constraint system. Our model demonstrates significant improvements over baselines in empirical evaluations on various datasets." Outlier-Aware Post-Training Quantization for Discrete Graph Diffusion Models,Zheng Gong Ying Sun,https://icml.cc/virtual/2025/poster/43639,"Discrete Graph Diffusion Models (DGDMs) mark a pivotal advancement in graph generation, effectively preserving sparsity and structural integrity, thereby enhancing the learning of graph data distributions for diverse generative applications. Despite their potential, DGDMs are computationally intensive due to the numerous low-parameter yet high-computation operations, thereby increasing the need of inference acceleration. A promising solution to mitigate this issue is model quantization. However, existing quantization techniques for Image Diffusion Models (IDMs) face limitations in DGDMs due to differing diffusion processes, while Large Language Model (LLM) quantization focuses on reducing memory access latency of loading large parameters, unlike DGDMs, where inference bottlenecks are computations due to smaller model sizes. 
To fill this gap, we introduce Bit-DGDM, a post-training quantization framework for DGDMs which incorporates two novel ideas: (i) sparse-dense activation quantization sparsely modeling the activation outliers through adaptively selected, data-free thresholds in full-precision and quantizing the remaining to low-bit, and (ii) ill-conditioned low-rank decomposition decomposing the weights into low-rank component enable faster inference and an $\alpha$-sparsity matrix that models outliers. Extensive experiments demonstrate that Bit-DGDM not only reducing the memory usage from the FP32 baseline by up to $2.8\times$ and achieve up to $2.5\times$ speedup, but also achieve comparable performance to ultra-low precision of up to 4-bit." Positional Encoding meets Persistent Homology on Graphs,Yogesh Verma Amauri H Souza Vikas K Garg,https://icml.cc/virtual/2025/poster/46612,"The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (Persistence-informed Positional Encoding), which is provably more expressive than both PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning. Code is available at https://github.com/Aalto-QuML/PIPE" -Pruning Spurious Subgraphs for Graph Out-of-Distribtuion Generalization,Tianjun Yao Haoxuan Li Yongqiang Chen Tongliang Liu Le Song Eric P. Xing Zhiqiang Shen,https://openreview.net/forum?id=5yeBvI1qtf, SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval,Nikolaos Chaidos Angeliki Dimitriou Maria Lymperaiou Giorgos Stamou,https://icml.cc/virtual/2025/poster/43841,"Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs driven from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermine retrieval reliability. To address these, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. 
Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing in advancing state-of-the-art in counterfactual image retrieval. The source code is available at https://github.com/nickhaidos/scenir-icml2025." SPHINX: Structural Prediction using Hypergraph Inference Network,Iulia Duta Pietro Lio,https://icml.cc/virtual/2025/poster/43825,"The importance of higher-order relations is widely recognized in numerous real-world systems. However, annotating them is a tedious and sometimes even impossible task. Consequently, current approaches for data modelling either ignore the higher-order interactions altogether or simplify them into pairwise connections. To facilitate higher-order processing, even when a hypergraph structure is not available, we introduce SPHINX, a model that learns to infer a latent hypergraph structure in an unsupervised way, solely from the final task-dependent signal. To ensure broad applicability, we design the model to be end-to-end differentiable, capable of generating a discrete hypergraph structure compatible with any modern hypergraph networks, and easily optimizable without requiring additional regularization losses. Through extensive ablation studies and experiments conducted on four challenging datasets, we demonstrate that our model is capable of inferring suitable latent hypergraphs in both transductive and inductive tasks. Moreover, the inferred latent hypergraphs are interpretable and contribute to enhancing the final performance, outperforming existing methods for hypergraph prediction." Test-Time Graph Neural Dataset Search With Generative Projection,Xin Zheng Wei Huang Chuan Zhou Ming Li Shirui Pan,https://icml.cc/virtual/2025/poster/46295,"In this work, we address the test-time adaptation challenge in graph neural networks (GNNs), focusing on overcoming the limitations in flexibility and generalization inherent in existing data-centric approaches. To this end, we propose a novel research problem, test-time graph neural dataset search, which seeks to learn a parameterized test-time graph distribution to enhance the inference performance of unseen test graphs on well-trained GNNs. Specifically, we propose a generative Projection based test-time Graph Neural Dataset Search method, named PGNDS, which maps the unseen test graph distribution back to the known training distribution through a generation process guided by well-trained GNNs. The proposed PGNDS framework consists of three key modules: (1) dual conditional diffusion for GNN-guided generative projection through test-back-to-training distribution mapping; (2) dynamic search from the generative sampling space to select the most expressive test graphs; (3) ensemble inference to aggregate information from original and adapted test graphs. Extensive experiments on real-world graphs demonstrate the superior ability of our proposed PGNDS for improved test-time GNN inference." 
@@ -966,7 +960,6 @@ Approximately Correct Label Distribution Learning,Weiwei Li Haitao Wu Yunan Lu X Focal-SAM: Focal Sharpness-Aware Minimization for Long-Tailed Classification,Sicong Li Qianqian Xu Zhiyong Yang Zitai Wang Linchao Zhang Xiaochun Cao Qingming Huang,https://icml.cc/virtual/2025/poster/44211,"Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. Seeing this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM's generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM." Human Cognition-Inspired Hierarchical Fuzzy Learning Machine,Junbiao Cui Qin Yue Jianqing Liang Jiye Liang,https://icml.cc/virtual/2025/poster/45364,"Classification is a cornerstone of machine learning research. Most of the existing classifiers assume that the concepts corresponding to classes can be precisely defined. This notion diverges from the widely accepted understanding in cognitive science, which posits that real-world concepts are often inherently ambiguous. To bridge this big gap, we propose a Human Cognition-Inspired Hierarchical Fuzzy Learning Machine (HC-HFLM), which leverages a novel hierarchical alignment loss to integrate rich class knowledge from human knowledge system into learning process. We further theoretically prove that minimizing this loss can align the hierarchical structure derived from data with those contained in class knowledge, resulting in clear semantics and high interpretability. Systematic experiments verify that the proposed method can achieve significant gains in interpretability and generalization performance." Learning Imbalanced Data with Beneficial Label Noise,Guangzheng Hu Feng Liu Mingming Gong Guanghui Wang Liuhua Peng,https://icml.cc/virtual/2025/poster/46163,"Data imbalance is a common factor hindering classifier performance. Data-level approaches for imbalanced learning, such as resampling, often lead to information loss or generative errors. Building on theoretical studies of imbalance ratio in binary classification, it is found that adding suitable label noise can adjust biased decision boundaries and improve classifier performance. This paper proposes the Label-Noise-based Re-balancing (LNR) approach to solve imbalanced learning by employing a novel design of an asymmetric label noise model. In contrast to other data-level methods, LNR alleviates the issues of informative loss and generative errors and can be integrated seamlessly with any classifier or algorithm-level method. We validated the superiority of LNR on synthetic and real-world datasets. Our work opens a new avenue for imbalanced learning, highlighting the potential of beneficial label noise." 
-Regression Trees Know Calculus,Nathan Wycoff,https://openreview.net/forum?id=S5v9UheAhP, Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation,Jintao Tong Ran Ma Yixiong Zou Guangyao Chen Yuhua Li Ruixuan Li,https://icml.cc/virtual/2025/poster/45369,"Cross-domain few-shot segmentation (CD-FSS) is proposed to first pre-train the model on a source-domain dataset with sufficient samples, and then transfer the model to target-domain datasets where only a few training samples are available for efficient finetuning. There are majorly two challenges in this task: (1) the domain gap and (2) finetuning with scarce data. To solve these challenges, we revisit the adapter-based methods, and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and we find the model's inherent structure could lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), which is a structure-based decoupler instead of loss-based ones like current works, to capture domain-specific information, thereby directing the model's attention towards domain-agnostic knowledge. Moreover, to prevent the potential excessive overfitting of DFN during the source-domain training, we further design the SAM-SVN method to constrain DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn knowledge specific to target domains. Extensive experiments demonstrate that our method surpasses the state-of-the-art method in CD-FSS significantly by 2.69% and 4.68% average MIoU in 1-shot and 5-shot scenarios, respectively." Be Confident: Uncovering Overfitting in MLLM Multi-Task Tuning,Wenke Huang Jian Liang G. W. Didi Zhu He Li Jiawei Shao Mang Ye Bo Du Dacheng Tao,https://icml.cc/virtual/2025/poster/44726,"Fine-tuning Multimodal Large Language Models (MLLMs) in multi-task learning scenarios has emerged as an effective strategy for achieving cross-domain specialization. However, multi-task fine-tuning frequently induces performance degradation on open-response datasets. We posit that free-form answer generation primarily depends on language priors, and strengthening the integration of visual behavioral cues is critical for enhancing prediction robustness. In this work, we propose Noise Resilient Confidence Alignment to address the challenge of open-response overfitting during multi-task fine-tuning. Our approach prioritizes maintaining consistent prediction patterns in MLLMs across varying visual input qualities. To achieve this, we employ Gaussian perturbations to synthesize distorted visual inputs and enforce token prediction confidence alignment towards the normal visual branch. By explicitly linking confidence calibration to visual robustness, this method reduces over-reliance on language priors. We conduct extensive empirical evaluations across diverse multi-task downstream settings via popular MLLM architectures. The comprehensive experiment demonstrates the effectiveness of our method, showcasing its ability to alleviate open-response overfitting while maintaining satisfying multi-task fine-tuning performance." 
CALM: Consensus-Aware Localized Merging for Multi-Task Learning,Kunda Yan Min Zhang Sen Cui Qu Zikun Bo Jiang Feng Liu Changshui Zhang,https://icml.cc/virtual/2025/poster/45426,"Model merging aims to integrate the strengths of multiple fine-tuned models into a unified model while preserving task-specific capabilities. Existing methods, represented by task arithmetic, are typically classified into global- and local-aware methods. However, global-aware methods inevitably cause parameter interference, while local-aware methods struggle to maintain the effectiveness of task-specific details in the merged model. To address these limitations, we propose a Consensus Aware Localized Merging (CALM) method which incorporates localized information aligned with global task consensus, ensuring its effectiveness post-merging. CALM consists of three key components: (1) class-balanced entropy minimization sampling, providing a more flexible and reliable way to leverage unsupervised data; (2) an efficient-aware framework, selecting a small set of tasks for sequential merging with high scalability; (3) a consensus-aware mask optimization, aligning localized binary masks with global task consensus and merging them conflict-free. Experiments demonstrate the superiority and robustness of our CALM, significantly outperforming existing methods and achieving performance close to traditional MTL."
We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity)---exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters." Leveraging Diffusion Model as Pseudo-Anomalous Graph Generator for Graph-Level Anomaly Detection,Jinyu Cai Yunhe Zhang Fusheng Liu See-Kiong Ng,https://icml.cc/virtual/2025/poster/44832,"A fundamental challenge in graph-level anomaly detection (GLAD) is the scarcity of anomalous graph data, as the training dataset typically contains only normal graphs or very few anomalies. This imbalance hinders the development of robust detection models. In this paper, we propose Anomalous Graph Diffusion (AGDiff), a framework that explores the potential of diffusion models in generating pseudo-anomalous graphs for GLAD. Unlike existing diffusion-based methods that focus on modeling data normality, AGDiff leverages the latent diffusion framework to incorporate subtle perturbations into graph representations, thereby generating pseudo-anomalous graphs that closely resemble normal ones. By jointly training a classifier to distinguish these generated graph anomalies from normal graphs, AGDiff learns more discriminative decision boundaries. The shift from solely modeling normality to explicitly generating and learning from pseudo graph anomalies enables AGDiff to effectively identify complex anomalous patterns that other approaches might overlook. Comprehensive experimental results demonstrate that the proposed AGDiff significantly outperforms several state-of-the-art GLAD baselines."
-Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks,Xianliang Xu Ting Du Wang Kong Bin Shan Ye Li Zhongyi Huang,https://openreview.net/forum?id=DM2aCc6LzI, Fast Tensor Completion via Approximate Richardson Iteration,Mehrdad Ghadiri Matthew Fahrbach Yunbum Kook Ali Jadbabaie,https://icml.cc/virtual/2025/poster/44964,"We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods to solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is often lost in TC regression problems, making direct extensions unclear. This work proposes a novel lifting method for approximately solving TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We analyze the convergence rate of our approximate Richardson iteration-based algorithm, and our empirical study shows that it can be 100x faster than direct methods for CP completion on real-world tensors." 
Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape,Tao Li Zhengbao He Yujun Li Yasheng Wang Lifeng Shang Xiaolin Huang,https://icml.cc/virtual/2025/poster/46534,"Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computation and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, offers an efficient solution by optimizing only low-rank matrices. Despite recent progress in improving LoRA's performance, the relationship between the LoRA optimization space and the full parameter space is often overlooked. A solution that appears flat in the loss landscape of the LoRA space may still exhibit sharp directions in the full parameter space, potentially compromising generalization. We introduce Flat-LoRA, which aims to identify a low-rank adaptation situated in a flat region of the full parameter space. Instead of adopting the well-established sharpness-aware minimization approach, which incurs significant computation and memory overheads, we employ a Bayesian expectation loss objective to preserve training efficiency. Further, we design a refined strategy for generating random perturbations to enhance performance and carefully manage memory overhead using random seeds. Experiments across diverse tasks—including mathematical reasoning, coding abilities, dialogue generation, instruction following, and text-to-image generation—demonstrate that Flat-LoRA improves both in-domain and out-of-domain generalization. Code is available at https://github.com/nblt/Flat-LoRA." MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training,Yang Luo Zangwei Zheng Ziheng Qin Zirui Zhu Yong Liu Yang You,https://icml.cc/virtual/2025/poster/45497,"Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language models' large-batch training, due to the information bottleneck in attention layers caused by the sharp increase of max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face this issue. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks relationships of weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio to constrain the max attention logit more effectively. Moreover, we further construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratio in large-batch training. It successfully improves the training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT." 
@@ -1160,12 +1152,10 @@ Latent Imputation before Prediction: A New Computational Paradigm for De Novo Pe
Learning Condensed Graph via Differentiable Atom Mapping for Reaction Yield Prediction,Ankit Ghosh Gargee Kashyap Sarthak Mittal Nupur Jain Raghavan B Sunoj Abir De,https://icml.cc/virtual/2025/poster/43812,"Yield of chemical reactions generally depends on the activation barrier, i.e., the energy difference between the reactant and the transition state. Computing the transition state from the reactant and product graphs requires prior knowledge of the correct node alignment (i.e., atom mapping), which is not available in yield prediction datasets. In this work, we propose YieldNet, a neural yield prediction model, which tackles these challenges. Here, we first approximate the atom mapping between the reactants and products using a differentiable node alignment network. We then use this approximate atom mapping to obtain a noisy realization of the condensed graph of reaction (CGR), which is a supergraph encompassing both the reactants and products. This CGR serves as a surrogate for the transition state graph structure. The CGR embeddings of different steps in a multi-step reaction are then passed into a transformer-guided reaction path encoder. Our experiments show that YieldNet can predict the yield more accurately than the baselines. Furthermore, the model is trained only under the distant supervision of yield values, without requiring fine-grained supervision of atom mapping." Physics-Informed Weakly Supervised Learning For Interatomic Potentials,Makoto Takamoto Viktor Zaverkin Mathias Niepert,https://icml.cc/virtual/2025/poster/44619,"Machine learning is playing an increasingly important role in computational chemistry and materials science, complementing expensive ab initio and first-principles methods. However, machine-learned interatomic potentials (MLIPs) often struggle with generalization and robustness, leading to unphysical energy and force predictions in atomistic simulations. To address this, we propose a physics-informed, weakly supervised training framework for MLIPs. Our method introduces two novel loss functions: one based on Taylor expansions of the potential energy and another enforcing conservative force constraints. This approach enhances accuracy, particularly in low-data regimes, and reduces the reliance on large, expensive training datasets. Extensive experiments across benchmark datasets show up to 2× reductions in energy and force errors for multiple baseline models. Additionally, our method improves the stability of molecular dynamics simulations and facilitates effective fine-tuning of ML foundation models on sparse, high-accuracy ab initio data. An implementation of our method and scripts for executing experiments are available at \url{https://github.com/nec-research/PICPS-ML4Sci}." David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training,Weijian Luo colin zhang Debing Zhang Zhengyang Geng,https://icml.cc/virtual/2025/poster/46154,"We propose Diff-Instruct(DI), a data-efficient post-training approach to one-step text-to-image generative models to improve its human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), which optimizes a human reward function while regularizing the generator to stay close to a reference diffusion process. 
Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization that substantially improves performance. Although such a score-based RLHF objective seems intractable when optimizing, we derive a strictly equivalent tractable loss function in theory that can efficiently compute its gradient for optimizations. Building upon this framework, we train DI-SDXL-1step, a 1-step text-to-image model based on Stable Diffusion-XL (2.6B parameters), capable of generating 1024x1024 resolution images in a single step. The 2.6B DI-SDXL-1step model outperforms the 12B FLUX-dev model in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result strongly supports the thought that with proper post-training, the small one-step model is capable of beating huge multi-step models. We will open-source our industry-ready model to the community."
-Debiased Orthogonal Boundary-driven Efficient Noise Mitigation,Hao Li Jiayang Gu Jingkuan Song An Zhang Lianli Gao,https://openreview.net/forum?id=kLOoc8a3bc, EvFocus: Learning to Reconstruct Sharp Images from Out-of-Focus Event Streams,Lin Zhu Xiantao Ma Xiao Wang Lizhi Wang Hua Huang,https://icml.cc/virtual/2025/poster/44694,"Event cameras are innovative sensors that capture brightness changes as asynchronous events rather than traditional intensity frames. These cameras offer substantial advantages over conventional cameras, including high temporal resolution, high dynamic range, and the elimination of motion blur. However, defocus blur, a common image quality degradation resulting from out-of-focus lenses, complicates the challenge of event-based imaging. Due to the unique imaging mechanism of event cameras, existing focusing algorithms struggle to operate efficiently on sparse event data. In this work, we propose EvFocus, a novel architecture designed to reconstruct sharp images from defocus event streams for the first time. Our work includes the development of an event-based out-of-focus camera model and a simulator to generate realistic defocus event streams for robust training and testing. EvFocus integrates a temporal information encoder, a blur-aware two-branch decoder, and a reconstruction and re-defocus module to effectively learn and correct defocus blur. Extensive experiments on both simulated and real-world datasets demonstrate that EvFocus outperforms existing methods across varying lighting conditions and blur sizes, proving its robustness and practical applicability in event-based defocus imaging." Geometric Feature Embedding for Effective 3D Few-Shot Class Incremental Learning,Xiangqi Li Libo Huang Zhulin An Weilun Feng Chuanguang Yang Boyu Diao Fei Wang Yongjun Xu,https://icml.cc/virtual/2025/poster/46035,"3D few-shot class incremental learning (FSCIL) aims to learn new point cloud categories from limited samples while preventing the forgetting of previously learned categories. This research area significantly enhances the capabilities of self-driving vehicles and computer vision systems. Existing 3D FSCIL approaches primarily utilize multimodal pre-trained models to extract the semantic features, heavily dependent on meticulously designed high-quality prompts and fine-tuning strategies. To reduce this dependence, this paper proposes a novel method for 3D FSCIL with Embedded Geometric features (3D-FLEG). 
Specifically, 3D-FLEG develops a point cloud geometric feature extraction module to capture category-related geometric characteristics. To address the modality heterogeneity issues that arise from integrating geometric and text features, 3D-FLEG introduces a geometric feature embedding module. By augmenting text prompts with spatial geometric features through these modules, 3D-FLEG can learn robust representations of new categories even with limited samples, while mitigating forgetting of the previously learned categories. Experiments conducted on several publicly available 3D point cloud datasets, including ModelNet, ShapeNet, ScanObjectNN, and CO3D, demonstrate 3D-FLEG's superiority over existing state-of-the-art 3D FSCIL methods. Code is available at https://github.com/lixiangqi707/3D-FLEG." Learning Adaptive Lighting via Channel-Aware Guidance,Qirui Yang Peng-Tao Jiang Hao Zhang Jinwei Chen Bo Li Huanjing Yue Jingyu Yang,https://icml.cc/virtual/2025/poster/44791,"Learning lighting adaptation is a crucial step in achieving good visual perception and supporting downstream vision tasks. Current research often addresses individual light-related challenges, such as high dynamic range imaging and exposure correction, in isolation. However, we identify shared fundamental properties across these tasks: i) different color channels have different light properties, and ii) the channel differences reflected in the spatial and frequency domains are different. Leveraging these insights, we introduce the channel-aware Learning Adaptive Lighting Network (LALNet), a multi-task framework designed to handle multiple light-related tasks efficiently. Specifically, LALNet incorporates color-separated features that highlight the unique light properties of each color channel, integrated with traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across all channels. Additionally, LALNet employs dual domain channel modulation for generating color-separated features and a mixed channel modulation and light state space module for producing color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources. We provide an anonymous online demo at LALNet." Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion,Xingpei Ma Jiaran Cai Yuansheng Guan Shenneng Huang Qiang Zhang Shunsi Zhang,https://icml.cc/virtual/2025/poster/46072,"Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. 
Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but also exhibits strong competitiveness in lip synchronization while offering improved flexibility in controlling emotion and head pose. The code will be available at https://github.com/Playmate111/Playmate."
-Pose Prior Learner: Unsupervised Categorical Prior Learning for Pose Estimation,Ziyu Wang Shuangpeng Han Mengmi Zhang,https://openreview.net/forum?id=0OdIwdiFb7, SITCOM: Step-wise Triple-Consistent Diffusion Sampling For Inverse Problems,Ismail Alkhouri Shijun Liang Cheng-Han Huang Jimmy Dai Qing Qu Saiprasad Ravishankar Rongrong Wang,https://icml.cc/virtual/2025/poster/46601,"Diffusion models (DMs) are a class of generative models that allow sampling from a distribution learned over a training set. When applied to solving inverse problems, the reverse sampling steps are modified to approximately sample from a measurement-conditioned distribution. However, these modifications may be unsuitable for certain settings (e.g., presence of measurement noise) and non-linear tasks, as they often struggle to correct errors from earlier steps and generally require a large number of optimization and/or sampling steps. To address these challenges, we state three conditions for achieving measurement-consistent diffusion trajectories. Building on these conditions, we propose a new optimization-based sampling method that not only enforces standard data manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates our proposed step-wise and network-regularized backward diffusion consistency that maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. By enforcing these conditions (implicitly or explicitly), our sampler requires significantly fewer reverse steps. Therefore, we refer to our method as Step-wise Triple-Consistent Sampling (SITCOM). Compared to SOTA baselines, our experiments across several linear and non-linear tasks (with natural and medical images) demonstrate that SITCOM achieves competitive or superior results in terms of standard similarity metrics and run-time." TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation,Tianyi Liang Jiangqi Liu Yifei Huang Shiqi Jiang Jianshen Shi Changbo Wang Chenhui Li,https://icml.cc/virtual/2025/poster/44110,"Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free approach that actively relocates objects before optimizing text regions, rather than directly reducing cross-attention which degrades image quality. 
Our method introduces: (1) a force-directed graph approach that detects conflicting objects and guides their relocation using cross-attention maps, and (2) a spatial attention constraint that ensures smooth background generation in text regions. Our method is plug-and-play, requiring no additional training while well balancing both semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across three seed datasets, TextCenGen outperforms existing methods by achieving 23\% lower saliency overlap in text regions while maintaining 98\% of the original semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM)." CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities,Yuxuan Zhu Antony Kellermann Dylan Bowman Philip Li Akul Gupta Adarsh Danda Richard Fang Conner Jensen Eric Ihli Jason Benn Jet Geronimo Avi Dhir Sudhit Rao Kaicheng Yu Twm Stone Daniel Kang,https://icml.cc/virtual/2025/poster/46522,"Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture-the-Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable attacks. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our experiments show that the state-of-the-art agent framework can exploit up to 13% of the vulnerabilities." @@ -1322,8 +1312,6 @@ Nesterov Method for Asynchronous Pipeline Parallel Optimization,Thalaiyasingam A
Attention-Only Transformers via Unrolled Subspace Denoising,Peng Wang Yifu Lu Yaodong Yu Druv Pai Qing Qu Yi Ma,https://icml.cc/virtual/2025/poster/45735,"Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of \textit{only} self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations \textit{at a linear rate} with respect to the number of layers. 
Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE." Recommendations with Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization,Suryanarayana Sankagiri Jalal Etesami Matthias Grossglauser,https://icml.cc/virtual/2025/poster/43663,"In this paper, we consider a recommender system that elicits user feedback through pairwise comparisons instead of ratings. We study the problem of learning personalised preferences from such comparison data via collaborative filtering. Similar to the classical matrix completion setting, we assume that users and items are endowed with low-dimensional latent features. These features give rise to user-item utilities, and the comparison outcomes are governed by a discrete choice model over these utilities. The task of learning these features is then formulated as a maximum likelihood problem over the comparison dataset. Despite the resulting optimization problem being nonconvex, we show that gradient-based methods converge exponentially to the latent features, given a warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend key concentration results commonly used in matrix completion to our model. Simulations reveal that the empirical performance of the method exceeds theoretical predictions, even when some assumptions are relaxed. Our work demonstrates that learning personalised recommendations from comparison data is both computationally and statistically efficient." BARK: A Fully Bayesian Tree Kernel for Black-box Optimization,Toby Boyne Jose Pablo Folch Robert Matthew Lee Behrang Shafei Ruth Misener,https://icml.cc/virtual/2025/poster/46003,"We perform Bayesian optimization using a Gaussian process perspective on Bayesian Additive Regression Trees (BART). Our BART Kernel (BARK) uses tree agreement to define a posterior over piecewise-constant functions, and we explore the space of tree kernels using a Markov chain Monte Carlo approach. Where BART only samples functions, the resulting BARK model obtains samples of Gaussian processes defining distributions over functions, which allow us to build acquisition functions for Bayesian optimization. Our tree-based approach enables global optimization over the surrogate, even for mixed-feature spaces. Moreover, where many previous tree-based kernels provide uncertainty quantification over function values, our sampling scheme captures uncertainty over the tree structure itself. Our experiments show the strong performance of BARK on both synthetic and applied benchmarks, due to the combination of our fully Bayesian surrogate and the optimization procedure." -Multivariate Conformal Prediction using Optimal Transport,Michal Klein Louis Béthune Eugene Ndiaye marco cuturi,https://openreview.net/forum?id=haEAhTexqm, -Temporal-Difference Variational Continual Learning,Luckeciano Carvalho Melo Alessandro Abate Yarin Gal,https://openreview.net/forum?id=6Uh73Wl8Je, FedBEns: One-Shot Federated Learning based on Bayesian Ensemble,Jacopo Talpini Marco Savi Giovanni Neglia,https://icml.cc/virtual/2025/poster/44060,"One-Shot Federated Learning (FL) is a recent paradigm that enables multiple clients to cooperatively learn a global model in a single round of communication with a central server. 
In this paper, we analyze the One-Shot FL problem through the lens of Bayesian inference and propose FedBEns, an algorithm that leverages the inherent multimodality of local loss functions to find better global models. Our algorithm leverages a mixture of Laplace approximations for the clients' local posteriors, which the server then aggregates to infer the global model. We conduct extensive experiments on various datasets, demonstrating that the proposed method outperforms competing baselines that typically rely on unimodal approximations of the local losses." Enabling Optimal Decisions in Rehearsal Learning under CARE Condition,Wen-Bo Du Hao-Yi Lei Lue Tao Tian-Zuo Wang Zhi-Hua Zhou,https://icml.cc/virtual/2025/poster/44293,"In the field of machine learning (ML), an essential type of decision-related problem is known as AUF (Avoiding Undesired Future): if an ML model predicts an undesired outcome, how can decisions be made to prevent it? Recently, a novel framework called rehearsal learning has been proposed to address the AUF problem. Despite its utility in modeling uncertainty for decision-making, it remains unclear under what conditions and how optimal actions that maximize the AUF probability can be identified. In this paper, we propose CARE (CAnonical REctangle), a condition under which the maximum AUF probability can be achieved. Under the CARE condition, we present a projection-Newton algorithm to select actions and prove that the algorithm achieves superlinear convergence to the optimal one. Besides, we provide a generalization method for adopting the algorithm to AUF scenarios beyond the CARE condition. Finally, we demonstrate that a closed-form solution exists when the outcome is a singleton variable, substantially reducing the time complexity of decision-making. Experiments validate the effectiveness and efficiency of our method." Identification of Latent Confounders via Investigating the Tensor Ranks of the Nonlinear Observations,Zhengming Chen Yewei Xia Feng Xie Jie Qiao Zhifeng Hao Ruichu Cai Kun Zhang,https://icml.cc/virtual/2025/poster/45026,"We study the problem of learning discrete latent variable causal structures from mixed-type observational data. Traditional methods, such as those based on the tensor rank condition, are designed to identify discrete latent structure models and provide robust identification bounds for discrete causal models. However, when observed variables—specifically, those representing the children of latent variables—are collected at various levels with continuous data types, the tensor rank condition is not applicable, limiting further causal structure learning for latent variables. In this paper, we consider a more general case where observed variables can be either continuous or discrete, and further allow for scenarios where multiple latent parents cause the same set of observed variables. We show that, under the completeness condition, it is possible to discretize the data in a way that satisfies the full-rank assumption required by the tensor rank condition. This enables the identifiability of discrete latent structure models within mixed-type observational data. Moreover, we introduce the two-sufficient measurement condition, a more general structural assumption under which the tensor rank condition holds and the underlying latent causal structure is identifiable by a proposed two-stage identification algorithm. Extensive experiments on both simulated and real-world data validate the effectiveness of our method." 
@@ -1337,7 +1325,6 @@ Time-Aware World Model for Adaptive Prediction and Control,Anh N Nhu Sanghyun So Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning,Chen-Xiao Gao Chenyang Wu Mingjun Cao Chenjun Xiao Yang Yu Zongzhang Zhang,https://icml.cc/virtual/2025/poster/44003,"Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning (RL) to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance." Constrained Exploitability Descent: An Offline Reinforcement Learning Method for Finding Mixed-Strategy Nash Equilibrium,Runyu Lu Yuanheng Zhu Dongbin Zhao,https://icml.cc/virtual/2025/poster/43717,"This paper proposes Constrained Exploitability Descent (CED), a model-free offline reinforcement learning (RL) algorithm for solving adversarial Markov games (MGs). CED combines the game-theoretical approach of Exploitability Descent (ED) with policy constraint methods from offline RL. While policy constraints can perturb the optimal pure-strategy solutions in single-agent scenarios, we find the side effect less detrimental in adversarial games, where the optimal policy can be a mixed-strategy Nash equilibrium. We theoretically prove that, under the uniform coverage assumption on the dataset, CED converges to a stationary point in deterministic two-player zero-sum Markov games. We further prove that the min-player policy at the stationary point follows the property of mixed-strategy Nash equilibrium in MGs. Compared to the model-based ED method that optimizes the max-player policy, our CED method no longer relies on a generalized gradient. Experiments in matrix games, a tree-form game, and an infinite-horizon soccer game verify that CED can find an equilibrium policy for the min-player as long as the offline dataset guarantees uniform coverage. Besides, CED achieves a significantly lower NashConv compared to an existing pessimism-based method and can gradually improve the behavior policy even under non-uniform data coverages. When combined with neural networks, CED also outperforms behavior cloning and offline self-play in a large-scale two-team robotic combat game." 
Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens,Jihwan Jeong Xiaoyu Wang Jingmin Wang Scott Sanner Pascal Poupart,https://icml.cc/virtual/2025/poster/44010,"Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies."
-Semi-gradient DICE for Offline Constrained Reinforcement Learning,Woosung Kim JunHo Seo Jongmin Lee Byung-Jun Lee,https://openreview.net/forum?id=wtcRXLaFrJ, ARS: Adaptive Reward Scaling for Multi-Task Reinforcement Learning,Myungsik Cho Jongeui Park Jeonghye Kim Youngchul Sung,https://icml.cc/virtual/2025/poster/45144,"Multi-task reinforcement learning (RL) encounters significant challenges due to varying task complexities and their reward distributions from the environment. To address these issues, in this paper, we propose Adaptive Reward Scaling (ARS), a novel framework that dynamically adjusts reward magnitudes and leverages a periodic network reset mechanism. ARS introduces a history-based reward scaling strategy that ensures balanced reward distributions across tasks, enabling stable and efficient training. The reset mechanism complements this approach by mitigating overfitting and ensuring robust convergence. Empirical evaluations on the Meta-World benchmark demonstrate that ARS significantly outperforms baseline methods, achieving superior performance on challenging tasks while maintaining overall learning efficiency. These results validate ARS's effectiveness in tackling diverse multi-task RL problems, paving the way for scalable solutions in complex real-world applications." "Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning",Adrià López Escoriza Nicklas Hansen Stone Tao Tongzhou Mu Hao Su,https://icml.cc/virtual/2025/poster/46087,"Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable sub-goals. In this work, we propose DEMO³, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. 
Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations." The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks,Walter Mayor Johan Obando-Ceron Aaron Courville Pablo Samuel Castro,https://icml.cc/virtual/2025/poster/44665,"The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance." @@ -1364,7 +1351,6 @@ On the Role of Label Noise in the Feature Learning Process,Andi Han Wei Huang Zh
Provable In-Context Vector Arithmetic via Retrieving Task Concepts,Dake Bu Wei Huang Andi Han Atsushi Nitanda Qingfu Zhang Hau-San Wong Taiji Suzuki,https://icml.cc/virtual/2025/poster/45998,"In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights." Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions,Yihao Xue Jiping Li Baharan Mirzasoleiman,https://icml.cc/virtual/2025/poster/43511,"Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. 
While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs." A General Representation-Based Approach to Multi-Source Domain Adaptation,Ignavier Ng Yan Li Zijian Li Yujia Zheng Guangyi Chen Kun Zhang,https://icml.cc/virtual/2025/poster/46089,"A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts." -Unsupervised Transfer Learning via Adversarial Contrastive Training,Chenguang Duan Yuling Jiao Huazhen Lin Wensen Ma Jerry Zhijian Yang,https://openreview.net/forum?id=90ghmFUwIT, Identifying Metric Structures of Deep Latent Variable Models,Stas Syrota Yevgen Zainchkovskyy Johnny Xi Benjamin Bloem-Reddy Søren Hauberg,https://icml.cc/virtual/2025/poster/45898,"Deep latent variable models learn condensed representations of data that, hopefully, reflect the inner workings of the studied phenomena. Unfortunately, these latent representations are not statistically identifiable, meaning they cannot be uniquely determined. Domain experts, therefore, need to tread carefully when interpreting these. 
Current solutions limit the lack of identifiability through additional constraints on the latent variable model, e.g. by requiring labeled training data, or by restricting the expressivity of the model. We change the goal: instead of identifying the latent variables, we identify relationships between them such as meaningful distances, angles, and volumes. We prove this is feasible under very mild model conditions and without additional labeled data. We empirically demonstrate that our theory results in more reliable latent distances, offering a principled path forward in extracting trustworthy conclusions from deep latent variable models." Safety Certificate against Latent Variables with Partially Unidentifiable Dynamics,Haoming Jing yorie nakahira,https://icml.cc/virtual/2025/poster/46176,"Many systems contain latent variables that make their dynamics partially unidentifiable or cause distribution shifts in the observed statistics between offline and online data. However, existing control techniques often assume access to complete dynamics or perfect simulators with fully observable states, which are necessary to verify whether the system remains within a safe set (forward invariance) or safe actions are consistently feasible at all times. To address this limitation, we propose a technique for designing probabilistic safety certificates for systems with latent variables. A key technical enabler is the formulation of invariance conditions in probability space, which can be constructed using observed statistics in the presence of distribution shifts due to latent variables. We use this invariance condition to construct a safety certificate that can be implemented efficiently in real-time control. The proposed safety certificate can continuously find feasible actions that control long-term risk to stay within tolerance. Stochastic safe control and (causal) reinforcement learning have been studied in isolation until now. To the best of our knowledge, the proposed work is the first to use causal reinforcement learning to quantify long-term risk for the design of safety certificates. This integration enables safety certificates to efficiently ensure long-term safety in the presence of latent variables. The effectiveness of the proposed safety certificate is demonstrated in numerical simulations." Computing Voting Rules with Improvement Feedback,Evi Micha Vasilis Varsamis,https://icml.cc/virtual/2025/poster/45074,"Aggregating preferences under incomplete or constrained feedback is a fundamental problem in social choice and related domains. While prior work has established strong impossibility results for pairwise comparisons, this paper extends the inquiry to improvement feedback, where voters express incremental adjustments rather than complete preferences. We provide a complete characterization of the positional scoring rules that can be computed given improvement feedback. Interestingly, while plurality is learnable under improvement feedback—unlike with pairwise feedback—strong impossibility results persist for many other positional scoring rules. Furthermore, we show that improvement feedback, unlike pairwise feedback, does not suffice for the computation of any Condorcet-consistent rule. We complement our theoretical findings with experimental results, providing further insights into the practical implications of improvement feedback for preference aggregation." 
@@ -1383,7 +1369,6 @@ EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vul
Sounding that Object: Interactive Object-Aware Image to Audio Generation,Tingle Li Baihe Huang Xiaobin Zhuang Dongya Jia Jiawei Chen Yuping Wang Zhuo Chen Gopala Anumanchipalli Yuxuan Wang,https://icml.cc/virtual/2025/poster/46382,"Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds." Synthesizing Software Engineering Data in a Test-Driven Manner,Lei Zhang Jiaxi Yang Min Yang Jian Yang Mouxiang Chen Jiajun Zhang Zeyu Cui Binyuan Hui Junyang Lin,https://icml.cc/virtual/2025/poster/45400,"We introduce SWE-Flow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, SWE-Flow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of SWE-Flow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, SWE-Flow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Eval benchmark. Our experiments show that fine-tuning open model on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at Github." Towards Practical Defect-Focused Automated Code Review,Junyi Lu Lili Jiang Xiaojia Li Jianbing Fang Fengjun Zhang Li Yang Chun Zuo,https://icml.cc/virtual/2025/poster/44165,"The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. 
To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2× improvement over standard LLMs and a 10× gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability." -A standard transformer and attention with linear biases for molecular conformer generation,Viatcheslav Gurev Timothy Rumbell,https://openreview.net/forum?id=BjjerMYL3F, All-atom Diffusion Transformers: Unified generative modelling of molecules and materials,Chaitanya K. Joshi Xiang Fu Yi-Lun Liao Vahe Gharakhanyan Benjamin Kurt Miller Anuroop Sriram Zachary Ward Ulissi,https://icml.cc/virtual/2025/poster/46288,"Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems -- such as molecules and materials -- the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representations of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on MP20, QM9 and GEOM-DRUGS datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, obtaining state-of-the-art results on par with molecule and crystal-specific models. ADiT uses standard Transformers with minimal inductive biases for both the autoencoder and diffusion model, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code: https://github.com/facebookresearch/all-atom-diffusion-transformer" AnalogGenie-Lite: Enhancing Scalability and Precision in Circuit Topology Discovery through Lightweight Graph Modeling,Jian Gao Weidong Cao Xuan Zhang,https://icml.cc/virtual/2025/poster/45643,"The sustainable performance improvements of integrated circuits (ICs) drive the continuous advancement of nearly all transformative technologies. Since its invention, IC performance enhancements have been dominated by scaling the semiconductor technology. Yet, as Moore's law tapers off, a crucial question arises: ***How can we sustain IC performance in the post-Moore era?*** Creating new circuit topologies has emerged as a promising pathway to address this fundamental need. 
This work proposes AnalogGenie-Lite, a decoder-only transformer that discovers novel analog IC topologies with significantly enhanced scalability and precision via lightweight graph modeling. AnalogGenie-Lite makes several unique contributions, including concise device-pin representations (i.e., advancing the best prior art from $O\left(n^2\right)$ to $O\left(n\right)$), frequent sub-graph mining, and optimal sequence modeling. Compared to state-of-the-art circuit topology discovery methods, it achieves $5.15\times$ to $71.11\times$ gains in scalability and 23.5\% to 33.6\% improvements in validity. Case studies on other domains' graphs are also provided to show the broader applicability of the proposed graph modeling approach. Source code: https://github.com/xz-group/AnalogGenie-Lite." Diagonal Symmetrization of Neural Network Solvers for the Many-Electron Schrödinger Equation,Kevin Han Huang Ni Zhan Elif Ertekin Peter Orbanz Ryan P Adams,https://icml.cc/virtual/2025/poster/45806,"Incorporating group symmetries into neural networks has been a cornerstone of success in many AI-for-science applications. Diagonal groups of isometries, which describe the invariance under a simultaneous movement of multiple objects, arise naturally in many-body quantum problems. Despite their importance, diagonal groups have received relatively little attention, as they lack a natural choice of invariant maps except in special cases. We study different ways of incorporating diagonal invariance in neural network ansatze trained via variational Monte Carlo methods, and consider specifically data augmentation, group averaging and canonicalization. We show that, contrary to standard ML setups, in-training symmetrization destabilizes training and can lead to worse performance. Our theoretical and numerical results indicate that this unexpected behavior may arise from a unique computational-statistical tradeoff not found in standard ML analyses of symmetrization. Meanwhile, we demonstrate that post hoc averaging is less sensitive to such tradeoffs and emerges as a simple, flexible and effective method for improving neural network solvers." @@ -1498,7 +1483,6 @@ Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries,Hua
Nonparametric Modern Hopfield Models,Jerry Yao-Chieh Hu Bo-Yu Chen Dennis Wu Feng Ruan Han Liu,https://icml.cc/virtual/2025/poster/43568,"We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. 
Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing *sparse-structured* modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue --- connection with transformer attention, fixed point convergence and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks." Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity,Haocheng Xi Shuo Yang Yilong Zhao Chenfeng Xu Muyang Li Xiuyu Li Yujun Lin Han Cai Jintao Zhang Dacheng Li Jianfei Chen Ion Stoica Kurt Keutzer Song Han,https://icml.cc/virtual/2025/poster/43743,"Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D full attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28$\times$ and 2.33$\times$ end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code will be open-sourced upon publication." "The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training",Matteo Saponati Pascal Josef Sager Pau Vilimelis Aceituno Thilo Stadelmann Benjamin F Grewe,https://icml.cc/virtual/2025/poster/44452,"Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remains unclear. We present a mathematical framework to analyze self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance.
Our theoretical findings are validated across multiple Transformer models — including ModernBERT, GPT, LLaMA3, and Mistral — and input modalities like text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models." -An Architecture Built for Federated Learning: Addressing Data Heterogeneity through Adaptive Normalization-Free Feature Recalibration,Vasilis Siomos Jonathan Passerat-Palmbach Giacomo Tarroni,https://openreview.net/forum?id=CUJcSQ19ao, Concept-Centric Token Interpretation for Vector-Quantized Generative Models,Tianze Yang Yucheng Shi Mengnan Du Xuansheng Wu Qiaoyu Tan Jin Sun Ninghao Liu,https://icml.cc/virtual/2025/poster/43803,"Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs---the codebook of discrete tokens---is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGMs transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at https://github.com/YangTianze009/CORTEX." Generative Human Trajectory Recovery via Embedding-Space Conditional Diffusion,Kaijun Liu Sijie Ruan Liang Zhang Cheng Long Shuliang Wang liang yu,https://icml.cc/virtual/2025/poster/45338,"Recovering human trajectories from incomplete or missing data is crucial for many mobility-based urban applications, e.g., urban planning, transportation, and location-based services. Existing methods mainly rely on recurrent neural networks or attention mechanisms. Though promising, they encounter limitations in capturing complex spatial-temporal dependencies in low-sampling trajectories. Recently, diffusion models show potential in content generation. However, most of the proposed methods are used to generate contents in continuous numerical representations, which cannot be directly adapted to human location trajectory recovery. In this paper, we introduce a conditional diffusion-based trajectory recovery method, namely, DiffMove. It first transforms locations in trajectories into the embedding space, in which the embedding denoising is performed, and then missing locations are recovered by an embedding decoder. DiffMove not only improves accuracy by introducing high-quality generative methods in the trajectory recovery, but also carefully models the transition, periodicity, and temporal patterns in human mobility. Extensive experiments based on two representative real-world mobility datasets are conducted, and the results show significant improvements (an average of 11% in recall) over the best baselines." How to Train Your Multi-Exit Model?
Analyzing the Impact of Training Strategies,Piotr Kubaty Bartosz Wójcik Bartłomiej Tomasz Krzepkowski Monika Michaluk Tomasz Trzcinski Jary Pomponi Kamil Adamczewski,https://icml.cc/virtual/2025/poster/43657,"Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. However, the implications of this choice are often overlooked, with studies typically adopting one approach without adequate justification. This choice influences training dynamics and its impact remains largely unexplored. In this paper, we introduce a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy. We demonstrate that conventionally used joint and disjoint regimes yield suboptimal performance. To address these limitations, we propose a mixed training strategy: the backbone is trained first, followed by the training of the entire multi-exit network. Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods we present strengths and weaknesses of the early exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy." @@ -1619,7 +1603,6 @@ Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mi MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations,Kaixuan Huang Jiacheng Guo Zihao Li Xiang Ji Jiawei Ge Wenzhe Li Yingqing Guo Tianle Cai Hui Yuan Runzhe Wang Yue Wu Ming Yin Shange Tang Yangsibo Huang Chi Jin Xinyun Chen Chiyuan Zhang Mengdi Wang,https://icml.cc/virtual/2025/poster/45435,"Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks when questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models. The project is available at https://math-perturb.github.io/." 
Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing,Tianci Liu Ruirui Li Zihan Dong Hui Liu Xianfeng Tang Qingyu Yin Linjun Zhang Haoyu Wang Jing Gao,https://icml.cc/virtual/2025/poster/43678,"Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated others or compromising their pre-trained capabilities. Previous efforts sought to update a small number of parameters of an LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits degraded ability to reason about the new knowledge. In this work, we identify a key issue: \textit{heterogeneous token overfitting} (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose {OVERTONE}, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method." MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance,Zhixuan Chen Xing Hu Dawei Yang Zukang Xu XUCHEN Zhihang Yuan Sifan Zhou JiangyongYu,https://icml.cc/virtual/2025/poster/46674,"Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE’s sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE's unique aggregation mechanism, which leads to varying degrees of correlation between different samples and their assigned experts. To address these challenges, we propose MoEQuant, a novel quantization framework tailored for MoE LLMs. MoEQuant includes two novel techniques: 1) Expert-Balanced Self-Sampling (EBSS) is an efficient sampling method that efficiently constructs a calibration set with balanced expert distributions by leveraging the cumulative probabilities of tokens and expert balance metrics as guiding factors. 2) Affinity-Guided Quantization (AGQ), which incorporates affinities between experts and samples into the quantization process, thereby accurately assessing the impact of individual samples on different experts within the MoE layer. Experiments demonstrate that MoEQuant achieves substantial performance gains (more than 10 points accuracy gain in the HumanEval for DeepSeekMoE-16B under 4-bit quantization) and boosts efficiency."
-Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies,Han Zhou Xingchen Wan Ruoxi Sun Hamid Palangi Shariq Iqbal Ivan Vulić Anna Korhonen Sercan O Arik,https://openreview.net/forum?id=uCKvHweh1g, Multi-Turn Code Generation Through Single-Step Rewards,Arnav Kumar Jain Gonzalo Gonzalez-Pumariega Wayne Chen Alexander M Rush Wenting Zhao Sanjiban Choudhury,https://icml.cc/virtual/2025/poster/44806,"We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$CODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$CODE at utilizing the execution feedback." Oracle-MoE: Locality-preserving Routing in the Oracle Space for Memory-constrained Large Language Model Inference,Jixian Zhou Fang Dong Ruijun Huang Hengjie Cao Mengyi Chen Yifeng Yang Anrui Chen Mingzhi Dong Yujiang Wang Dongsheng Li David A. Clifton Qin Lv Rui Zhu Chun Zhang Fan Yang Tun Lu Ning Gu Li Shang,https://icml.cc/virtual/2025/poster/43606,"Mixture-of-Experts (MoE) is widely adopted to deploy Large Language Models (LLMs) on edge devices with limited memory budgets. Although MoE is, in theory, an inborn memory-friendly architecture requiring only a few activated experts to reside in the memory for inference, current MoE architectures cannot effectively fulfill this advantage and will yield intolerable inference latencies of LLMs on memory-constrained devices. Our investigation pinpoints the essential cause as the remarkable temporal inconsistencies of inter-token expert activations, which generate overly frequent expert swapping demands dominating the latencies. To this end, we propose a novel MoE architecture, Oracle-MoE, to fulfill the real on-device potential of MoE-based LLMs. Oracle-MoE routes tokens in a highly compact space suggested by attention scores, termed the oracle space, to effectively maintain the semantic locality across consecutive tokens to reduce expert activation variations, eliminating massive swapping demands. Theoretical analysis proves that Oracle-MoE is bound to provide routing decisions with better semantic locality and, therefore, better expert activation consistencies. Experiments on the pretrained GPT-2 architectures of different sizes (200M, 350M, 790M, and 2B) and downstream tasks demonstrate that without compromising task performance, our Oracle-MoE has achieved state-of-the-art inference speeds across varying memory budgets, revealing its substantial potential for LLM deployments in industry." Organize the Web: Constructing Domains Enhances Pre-Training Data Curation,Alexander Wettig Kyle Lo Sewon Min Hannaneh Hajishirzi Danqi Chen Luca Soldaini,https://icml.cc/virtual/2025/poster/44718,"Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web.
The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation." @@ -1654,7 +1637,6 @@ What Limits Bidirectional Model's Generative Capabilities? A Uni-Bi-Directional Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas,Shiqi Chen Tongyao Zhu Ruochen Zhou Jinghan Zhang Siyang Gao Juan Carlos Niebles Mor Geva Junxian He Jiajun Wu Manling Li,https://icml.cc/virtual/2025/poster/44272,"Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing “under” or “behind” relationships between only two objects, pose significant challenges for current VLMs. We believe it is crucial to use the lens of mechanism interpretability, opening up the model and diving into model’s internal states to examine the interactions between image and text tokens during spatial reasoning. Our analysis of attention behaviors reveals significant differences in how VLMs allocate attention to image versus text. By tracing the areas of images that receive the highest attention scores throughout intermediate layers, we observe a notable pattern: errors often coincide with attention being misdirected towards irrelevant objects within the image. Moreover, such attention patterns exhibit substantial differences between familiar (e.g., “on the left side of ”) and unfamiliar (e.g.,“in front of ”) spatial relationships. Motivated by these findings, we propose ADAPTVIS based on inference-time confidence scores to sharpen the attention on highly relevant regions when the model exhibits high confidence, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50 absolute point improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible additional cost." XAttention: Block Sparse Attention with Antidiagonal Scoring,Ruyi Xu Guangxuan Xiao Haofeng Huang Junxian Guo Song Han,https://icml.cc/virtual/2025/poster/45650,"Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. 
Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks—including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation—XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications." ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation,Yupeng Hou Jianmo Ni Zhankui He Noveen Sachdeva Wang-Cheng Kang Ed H. Chi Julian McAuley Derek Zhiyuan Cheng,https://icml.cc/virtual/2025/poster/44439,"Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Our code is available at: https://github.com/google-deepmind/action_piece." -Manipulation Inversion by Adversarial Learning on Latent Statistical Manifold,Jialu Zhang yifei li Mai Xu Shengxi Li Shuting Liu Lai Jiang Shuhang Gu,https://openreview.net/forum?id=qyGurHI4As, The Logical Implication Steering Method for Conditional Interventions on Transformer Generation,Damjan Kalajdzievski,https://icml.cc/virtual/2025/poster/45970,"The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the ''linear representation hypothesis'', which is the idea that high level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations.
We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models." Understanding the Emergence of Multimodal Representation Alignment,Megan Tjandrasuwita Chanakya Ekbote Liu Ziyin Paul Pu Liang,https://icml.cc/virtual/2025/poster/46488,"Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research has primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not necessarily limited to, the degree of similarity between the modalities and the balance between redundant and unique information they provide for the task. Our findings suggest that alignment may not be universally beneficial; rather, its impact on performance varies depending on the dataset and task. These insights can help practitioners determine whether increasing alignment between modalities is advantageous or, in some cases, detrimental to achieving optimal performance." unMORE: Unsupervised Multi-Object Segmentation via Center-Boundary Reasoning,Yafei YANG Zihui Zhang Bo Yang,https://icml.cc/virtual/2025/poster/43684,"We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse. Our code and data are available at https://github.com/vLAR-group/unMORE."
@@ -1671,7 +1653,6 @@ Long-Short Alignment for Effective Long-Context Modeling in LLMs,Tianqi Du Haoti Self-Supervised Learning of Intertwined Content and Positional Features for Object Detection,Kang-Jun Liu Masanori Suganuma Takayuki Okatani,https://icml.cc/virtual/2025/poster/45621,"We present a novel self-supervised feature learning method using Vision Transformers (ViT) as the backbone, specifically designed for object detection and instance segmentation. Our approach addresses the challenge of extracting features that capture both class and positional information, which are crucial for these tasks. The method introduces two key components: (1) a positional encoding tied to the cropping process in contrastive learning, which utilizes a novel vector field representation for positional embeddings; and (2) masking and prediction, similar to conventional Masked Image Modeling (MIM), applied in parallel to both content and positional embeddings of image patches. These components enable the effective learning of intertwined content and positional features. We evaluate our method against state-of-the-art approaches, pre-training on ImageNet-1K and fine-tuning on downstream tasks. Our method outperforms the state-of-the-art SSL methods on the COCO object detection benchmark, achieving significant improvements with fewer pre-training epochs. These results suggest that better integration of positional information into self-supervised learning can improve performance on the dense prediction tasks." Simplifying DINO via Coding Rate Regularization,Ziyang Wu Jingyuan Zhang Druv Pai XuDong Wang Chandan Singh Jianwei Yang Jianfeng Gao Yi Ma,https://icml.cc/virtual/2025/poster/43820,"DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable --- many hyperparameters need to be carefully tuned to ensure that the representations do not collapse --- which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most such-motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of the DINO and DINOv2 which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning. Code and model checkpoints are available at https://github.com/RobinWu218/SimDINO." A Closer Look at Transformers for Time Series Forecasting: Understanding Why They Work and Where They Struggle,Yu Chen Nathalia Céspedes Payam Barnaghi,https://icml.cc/virtual/2025/poster/44262,"Time-series forecasting is crucial across various domains, including finance, healthcare, and energy. 
Transformer models, originally developed for natural language processing, have demonstrated significant potential in addressing challenges associated with time-series data. These models utilize different tokenization strategies, point-wise, patch-wise, and variate-wise, to represent time-series data, each resulting in different scope of attention maps. Despite the emergence of sophisticated architectures, simpler transformers consistently outperform their more complex counterparts in widely used benchmarks. This study examines why point-wise transformers are generally less effective, why intra- and inter-variate attention mechanisms yield similar outcomes, and which architectural components drive the success of simpler models. By analyzing mutual information and evaluating models on synthetic datasets, we demonstrate that intra-variate dependencies are the primary contributors to prediction performance on benchmarks, while inter-variate dependencies have a minor impact. Additionally, techniques such as Z-score normalization and skip connections are also crucial. However, these results are largely influenced by the self-dependent and stationary nature of benchmark datasets. By validating our findings on real-world healthcare data, we provide insights for designing more effective transformers for practical applications." -ASCENSION: Autoencoder-Based Latent Space Class Expansion for Time Series Data Augmentation,Matthieu OLEKHNOVITCH Dorian Joubaud Adrien Bolling Evgeny Zotov Sylvain KUBLER Maxime Cordy Mike Papadakis YVES LE TRAON,https://openreview.net/forum?id=pkdA4gC4p2, Continuously Updating Digital Twins using Large Language Models,Harry Amad Nicolás Astorga Mihaela van der Schaar,https://icml.cc/virtual/2025/poster/44291,"Digital twins are models of real-world systems that can simulate their dynamics in response to potential actions. In complex settings, the state and action variables, and available data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require fixed, well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that can accurately simulate across diverse state-action spaces using in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT's competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates." Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization,Luca Masserano Abdul Fatir Ansari Boran Han Xiyuan Zhang Christos Faloutsos Michael W. Mahoney Andrew Gordon Wilson Youngsuk Park Syama Sundar Rangapuram Danielle C. Maddix Bernie Wang,https://icml.cc/virtual/2025/poster/46131,"How to best develop foundational models for time series forecasting remains an important open question. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? 
To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) performs on par or better than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and is competitive with modern deep learning models trained specifically on each dataset; ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics; and iii) easily captures complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, sparse spikes, and non-stationary time series with varying frequencies evolving over time." LangTime: A Language-Guided Unified Model for Time Series Forecasting with Proximal Policy Optimization,Wenzhe Niu Zongxia Xie Yanru Sun Wei He Man Xu Chao Hao,https://icml.cc/virtual/2025/poster/45059,"Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) Cross-domain generalization. (2) Cross-modality alignment. (3) Error accumulation in autoregressive frameworks. To address these challenges, we proposed LangTime, a language-guided unified model for time series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to understand better and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional rewards function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting." @@ -1697,9 +1678,7 @@ Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach,Qian Conformity Score Averaging for Classification,Rui Luo Zhixin Zhou,https://icml.cc/virtual/2025/poster/45362,"Conformal prediction provides a robust framework for generating prediction sets with finite-sample coverage guarantees, independent of the underlying data distribution. However, existing methods typically rely on a single conformity score function, which can limit the efficiency and informativeness of the prediction sets. In this paper, we present a novel approach that enhances conformal prediction for multi-class classification by optimally averaging multiple conformity score functions.
Our method involves assigning weights to different score functions and employing various data splitting strategies. Additionally, our approach bridges concepts from conformal prediction and model averaging, offering a more flexible and efficient tool for uncertainty quantification in classification tasks. We provide a comprehensive theoretical analysis grounded in Vapnik–Chervonenkis (VC) theory, establishing finite-sample coverage guarantees and demonstrating the efficiency of our method. Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage." Direct Prediction Set Minimization via Bilevel Conformal Classifier Training,Yuanjie Shi Hooman Shahrokhi Xuesong Jia Xiongzhi Chen Jana Doppa Yan Yan,https://icml.cc/virtual/2025/poster/45700,"Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subset of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the {\em Direct Prediction Set Minimization (DPSM)} algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We analyze that DPSM has a learning bound of $O(1/\sqrt{n})$ (with $n$ training samples), while prior conformal training methods based on stochastic approximation for the quantile have a bound of $\Omega(1/s)$ (with batch size $s$ and typically $s \ll \sqrt{n}$). Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline with $20.46\%\downarrow$ in the prediction set size and validates our theory." Efficient Heterogeneity-Aware Federated Active Data Selection,Ying-Peng Tang Chao Ren Xiaoli Tang Sheng-Jun Huang Lizhen Cui Han Yu,https://icml.cc/virtual/2025/poster/44007,"Federated Active Learning (FAL) aims to learn an effective global model, while minimizing label queries. Owing to privacy requirements, it is challenging to design effective active data selection schemes due to the lack of cross-client query information. In this paper, we bridge this important gap by proposing the \underline{F}ederated \underline{A}ctive data selection by \underline{LE}verage score sampling (FALE) method. It is designed for regression tasks in the presence of non-i.i.d. client data to enable the server to select data globally in a privacy-preserving manner. Based on FedSVD, FALE aims to estimate the utility of unlabeled data and perform data selection via leverage score sampling. Besides, a secure model learning framework is designed for federated regression tasks to exploit supervision. FALE can operate without requiring an initial labeled set and select the instances in a single pass, significantly reducing communication overhead. Theoretical analysis establishes the query complexity for FALE to achieve constant factor approximation and relative error approximation.
Extensive experiments on 11 benchmark datasets demonstrate significant improvements of FALE over existing state-of-the-art methods." -Enhancing Pruned Models by Input Compensation,Weisen Jiang Shuhao Chen Baijiong Lin James Kwok Yu Zhang,https://openreview.net/forum?id=omSoakIsr6, False Coverage Proportion Control for Conformal Prediction,Alexandre Blain Bertrand Thirion Pierre Neuvial,https://icml.cc/virtual/2025/poster/45792,"Split Conformal Prediction (SCP) provides a computationally efficient way to construct confidence intervals in prediction problems. Notably, most of the theory built around SCP is focused on the single test point setting. In real life, inference sets consist of multiple points, which raises the question of coverage guarantees for many points simultaneously. While *on average*, the False Coverage Proportion (FCP) remains controlled, it can fluctuate strongly around its mean, the False Coverage Rate (FCR). We observe that when a dataset is split multiple times, classical SCP may not control the FCP in a majority of the splits. We propose CoJER, a novel method that achieves sharp FCP control in probability for conformal prediction, based on a recent characterization of the distribution of conformal $p$-values in a transductive setting. This procedure incorporates an aggregation scheme which provides robustness with respect to modeling choices. We show through extensive real data experiments that CoJER provides FCP control while standard SCP does not. Furthermore, CoJER yields shorter intervals than the *state-of-the-art method* for FCP control and only slightly larger intervals than standard SCP." -Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models,Yan-Shuo Liang Wu-Jun Li,https://openreview.net/forum?id=vmfRdANa6a, Navigating Conflicting Views: Harnessing Trust for Learning,Jueqing Lu Wray Buntine YUANYUAN QI Joanna Dipnall Belinda Gabbe Lan Du,https://icml.cc/virtual/2025/poster/43734,"Resolving conflicts is critical for improving the reliability of multi-view classification. While prior work focuses on learning consistent and informative representations across views, it often assumes perfect alignment and equal importance of all views, an assumption rarely met in real-world scenarios, as some views may express distinct information. To address this, we develop a computational trust-based discounting method that enhances the Evidential Multi-view framework by accounting for the instance-wise reliability of each view through a probability-sensitive trust mechanism. We evaluate our method on six real-world datasets using Top-1 Accuracy, Fleiss’ Kappa, and a new metric, Multi-View Agreement with Ground Truth, to assess prediction reliability.
We also assess the effectiveness of uncertainty in indicating prediction correctness via AUROC. Additionally, we test the scalability of our method through end-to-end training on a large-scale dataset. The experimental results show that computational trust can effectively resolve conflicts, paving the way for more reliable multi-view classification models in real-world applications. Codes available at: https://github.com/OverfitFlow/Trust4Conflict" Provable Maximum Entropy Manifold Exploration via Diffusion Models,Riccardo De Santi Marin Vlastelica Ya-Ping Hsieh Zebang Shen Niao He Andreas Krause,https://icml.cc/virtual/2025/poster/45982,"Exploration is critical for solving real-world decision-making problems such as scientific discovery, where the objective is to generate truly novel designs rather than mimic existing data distributions. In this work, we address the challenge of leveraging the representational power of generative models for exploration without relying on explicit uncertainty quantification. We introduce a novel framework that casts exploration as entropy maximization over the approximate data manifold implicitly defined by a pre-trained diffusion model. Then, we present a novel principle for exploration based on density estimation, a problem well-known to be challenging in practice. To overcome this issue and render this method truly scalable, we leverage a fundamental connection between the entropy of the density induced by a diffusion model and its score function. Building on this, we develop an algorithm based on mirror descent that solves the exploration problem as sequential fine-tuning of a pre-trained diffusion model. We prove its convergence to the optimal exploratory diffusion model under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we empirically evaluate our approach on both synthetic and high-dimensional text-to-image diffusion, demonstrating promising results." Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges,Nayoung Lee Ziyang Cai Avi Schwarzschild Kangwook Lee Dimitris Papailiopoulos,https://icml.cc/virtual/2025/poster/44828,"Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, our method enables models to solve problems far beyond their initial training distribution—for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically expand model capabilities while preserving architectural simplicity."
@@ -1770,7 +1749,6 @@ Convergence of Mean-Field Langevin Stochastic Descent-Ascent for Distributional Efficient Curvature-Aware Hypergradient Approximation for Bilevel Optimization,Youran Dong Junfeng Yang Wei Yao Jin Zhang,https://icml.cc/virtual/2025/poster/44051,"Bilevel optimization is a powerful tool for many machine learning problems, such as hyperparameter optimization and meta-learning. Estimating hypergradients (also known as implicit gradients) is crucial for developing gradient-based methods for bilevel optimization. In this work, we propose a computationally efficient technique for incorporating curvature information into the approximation of hypergradients and present a novel algorithmic framework based on the resulting enhanced hypergradient computation. We provide convergence rate guarantees for the proposed framework in both deterministic and stochastic scenarios, particularly showing improved computational complexity over popular gradient-based methods in the deterministic setting. This improvement in complexity arises from a careful exploitation of the hypergradient structure and the inexact Newton method. In addition to the theoretical speedup, numerical experiments demonstrate the significant practical performance benefits of incorporating curvature information." Efficient First-Order Optimization on the Pareto Set for Multi-Objective Learning under Preference Guidance,Lisha Chen Quan Xiao Ellen Hidemi Fukuda Xinyi Chen Kun Yuan Tianyi Chen,https://icml.cc/virtual/2025/poster/45381,"Multi-objective learning under user-specified preference is common in real-world problems such as multi-lingual speech recognition under fairness. In this work, we frame such a problem as a semivectorial bilevel optimization problem, whose goal is to optimize a pre-defined preference function, subject to the constraint that the model parameters are weakly Pareto optimal. To solve this problem, we convert the multi-objective constraints to a single-objective constraint through a merit function with an easy-to-evaluate gradient, and then, we use a penalty-based reformulation of the bilevel optimization problem. We theoretically establish the properties of the merit function, and the relations of solutions for the penalty reformulation and the constrained formulation. Then we propose algorithms to solve the reformulated single-level problem, and establish its convergence guarantees. We test the method on various synthetic and real-world problems. The results demonstrate the effectiveness of the proposed method in finding preference-guided optimal solutions to the multi-objective problem." Joint Learning of Energy-based Models and their Partition Function,Michael Eli Sander Vincent Roulet Tianlin Liu Mathieu Blondel,https://icml.cc/virtual/2025/poster/43730,"Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function. In this paper, we propose a novel min-min formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, parameterized as a neural network.
Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multilabel classification and label ranking." -Learning Multiple Initial Solutions to Optimization Problems,Elad Sharony Heng Yang Tong Che Marco Pavone Shie Mannor Peter Karkus,https://openreview.net/forum?id=EMQfiikGRJ, Momentum-Driven Adaptivity: Towards Tuning-Free Asynchronous Federated Learning,Wenjing Yan Xiangyu Zhong Xiaolu Wang Ying-Jun Angela Zhang,https://icml.cc/virtual/2025/poster/44676,"Asynchronous federated learning (AFL) has emerged as a promising solution to address system heterogeneity and improve the training efficiency of federated learning. However, existing AFL methods face two critical limitations: 1) they rely on strong assumptions about bounded data heterogeneity across clients, and 2) they require meticulous tuning of learning rates based on unknown system parameters. In this paper, we tackle these challenges by leveraging momentum-based optimization and adaptive learning strategies. We first propose MasFL, a novel momentum-driven AFL framework that successfully eliminates the need for data heterogeneity bounds by effectively utilizing historical descent directions across clients and iterations. By mitigating the staleness accumulation caused by asynchronous updates, we prove that MasFL achieves state-of-the-art convergence rates with linear speedup in both the number of participating clients and local updates. Building on this foundation, we further introduce AdaMasFL, an adaptive variant that incorporates gradient normalization into local updates. Remarkably, this integration removes all dependencies on problem-specific parameters, yielding a fully tuning-free AFL approach while retaining theoretical guarantees. Extensive experiments demonstrate that AdaMasFL consistently outperforms state-of-the-art AFL methods in run-time efficiency and exhibits exceptional robustness across diverse learning rate configurations and system conditions." Online Conformal Prediction via Online Optimization,Felipe Areces Christopher Mohri Tatsunori Hashimoto John Duchi,https://icml.cc/virtual/2025/poster/45619,"We introduce a family of algorithms for online conformal prediction with coverage guarantees for both adversarial and stochastic data. In the adversarial setting, we establish the standard guarantee: over time, a pre-specified target fraction of confidence sets cover the ground truth. For stochastic data, we provide a guarantee at every time instead of just on average over time: the probability that a confidence set covers the ground truth—conditioned on past observations—converges to a pre-specified target when the conditional quantiles of the errors are a linear function of past data. Complementary to our theory, our experiments spanning over $15$ datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data’s dependence, even when baselines are tuned on the test set.
We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a $10$\% reduction in confidence set sizes, more than a $30$\% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all $78$ out of $78$ pre-registered hypotheses. We provide documentation for the pypi package implementing our algorithms here: \url{https://conformalopt.readthedocs.io/}." Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness,Qi He Peiran Yu Ziyi Chen Heng Huang,https://icml.cc/virtual/2025/poster/43996,"Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also matches the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy." @@ -1801,7 +1779,6 @@ Learning Survival Distributions with the Asymmetric Laplace Distribution,Deming Robust Conformal Outlier Detection under Contaminated Reference Data,Meshi Bashari Matteo Sesia Yaniv Romano,https://icml.cc/virtual/2025/poster/43852,"Conformal prediction is a flexible framework for calibrating machine learning predictions, providing distribution-free statistical guarantees. In outlier detection, this calibration relies on a reference set of labeled inlier data to control the type-I error rate. However, obtaining a perfectly labeled inlier reference set is often unrealistic, and a more practical scenario involves access to a contaminated reference set containing a small fraction of outliers. This paper analyzes the impact of such contamination on the validity of conformal methods. We prove that under realistic, non-adversarial settings, calibration on contaminated data yields conservative type-I error control, shedding light on the inherent robustness of conformal methods. This conservativeness, however, typically results in a loss of power. To alleviate this limitation, we propose a novel, active data-cleaning framework that leverages a limited labeling budget and an outlier detection model to selectively annotate data points in the contaminated reference set that are suspected as outliers. By removing only the annotated outliers in this ``suspicious'' subset, we can effectively enhance power while mitigating the risk of inflating the type-I error rate, as supported by our theoretical analysis. Experiments on real datasets validate the conservative behavior of conformal methods under contamination and show that the proposed data-cleaning strategy improves power without sacrificing validity."
TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation,Gwen Yidou Weng Benjie Wang Guy Van den Broeck,https://icml.cc/virtual/2025/poster/45579,"As large language models (LMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either post-train LMs for each new attribute—expensive and inflexible—or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable gEneration), a novel framework that efficiently computes EAP and adapts to new attributes through tractable probabilistic reasoning and lightweight control. TRACE distills a Hidden Markov Model (HMM) from an LM and pairs it with a small classifier to estimate attribute probabilities, enabling exact EAP computation over the HMM’s predicted futures. This EAP is then used to reweigh the LM’s next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art detoxification results with only 20% decoding overhead, yields 76 low-resource personalized LMs within seconds, and seamlessly extends to composite attributes." Learning Likelihood-Free Reference Priors,Nicholas George Bishop Daniel Jarne Ornia Joel Dyer Ani Calinescu Michael J. Wooldridge,https://icml.cc/virtual/2025/poster/46468,"Simulation modeling offers a flexible approach to constructing high-fidelity synthetic representations of complex real-world systems. However, the increased complexity of such models introduces additional complications, for example when carrying out statistical inference procedures. This has motivated a large and growing literature on likelihood-free or simulation-based inference methods, which approximate (e.g., Bayesian) inference without assuming access to the simulator's intractable likelihood function. A hitherto neglected problem in the simulation-based Bayesian inference literature is the challenge of constructing minimally informative reference priors for complex simulation models. Such priors maximise an expected Kullback-Leibler distance from the prior to the posterior, thereby influencing posterior inferences minimally and enabling an ``objective'' approach to Bayesian inference that does not necessitate the incorporation of strong subjective prior beliefs. In this paper, we propose and test a selection of likelihood-free methods for learning reference priors for simulation models, using variational approximations to these priors and a variety of mutual information estimators. Our experiments demonstrate that good approximations to reference priors for simulation models are in this way attainable, providing a first step towards the development of likelihood-free objective Bayesian inference procedures."
-Mixed Likelihood Variational Gaussian Processes,Kaiwen Wu Craig Sanders Benjamin Letham Phillip Guan,https://openreview.net/forum?id=6vsAh1qBJb, Pareto-frontier Entropy Search with Variational Lower Bound Maximization,Masanori Ishikura Masayuki Karasuyama,https://icml.cc/virtual/2025/poster/46222,"This study considers multi-objective Bayesian optimization (MOBO) through the information gain of the Pareto-frontier.
To calculate the information gain, a predictive distribution conditioned on the Pareto-frontier plays a key role, which is defined as a distribution truncated by the Pareto-frontier. However, it is usually impossible to obtain the entire Pareto-frontier in a continuous domain, and therefore, the complete truncation cannot be known. We consider an approximation of the truncated distribution by using a mixture distribution consisting of two possible approximate truncations obtainable from a subset of the Pareto-frontier, which we call over- and under-truncation. Since the optimal balance of the mixture is unknown beforehand, we propose optimizing the balancing coefficient through the variational lower bound maximization framework, by which the approximation error of the information gain can be minimized. Our empirical evaluation demonstrates the effectiveness of the proposed method particularly when the number of objective functions is large." "A Generic Family of Graphical Models: Diversity, Efficiency, and Heterogeneity",Yufei Huang Changhu Wang Junjie Tang Weichi Wu Ruibin Xi,https://icml.cc/virtual/2025/poster/44227,"Traditional network inference methods, such as Gaussian Graphical Models, which are built on continuity and homogeneity, face challenges when modeling discrete data and heterogeneous frameworks. Furthermore, under high-dimensionality, the parameter estimation of such models can be hindered by the notorious intractability of high-dimensional integrals. In this paper, we introduce a new and flexible device for graphical models, which accommodates diverse data types, including Gaussian, Poisson log-normal, and latent Gaussian copula models. The new device is driven by a new marginally recoverable parametric family, which can be effectively estimated without evaluating the high-dimensional integration in high-dimensional settings thanks to the marginal recoverability. We further introduce a mixture of marginally recoverable models to capture ubiquitous heterogeneous structures. We show the validity of the desirable properties of the models and the effective estimation methods, and demonstrate their advantages over the state-of-the-art network inference methods via extensive simulation studies and a gene regulatory network analysis of real single-cell RNA sequencing data." Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions,Dongze Wu Yao Xie,https://icml.cc/virtual/2025/poster/44357,"Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a method built on Continuous Normalizing Flows (CNFs) for sampling from high-dimensional and multi-modal distributions. AF is trained with a dynamic Optimal Transport (OT) objective incorporating Wasserstein regularization, and guided by annealing procedures, facilitating effective exploration of modes in high-dimensional spaces. Compared to recent NF methods, AF significantly improves training efficiency and stability, with minimal reliance on MC assistance. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions." 
@@ -1892,7 +1869,6 @@ Test-Time Multimodal Backdoor Detection by Contrastive Prompting,Yuwei Niu Shuo
The Limits of Predicting Agents from Behaviour,Alexis Bellot Jonathan Richens Tom Everitt,https://icml.cc/virtual/2025/poster/44820,"As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent’s beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent’s behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent’s behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety." The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret,Lukas Fluri Leon Lang Alessandro Abate Patrick Forré David Krueger Joar Max Viktor Skalse,https://icml.cc/virtual/2025/poster/45208,"In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably." WMarkGPT: Watermarked Image Understanding via Multimodal Large Language Models,Songbai Tan Xuerui Qiu Yao Shu Gang Xu Linrui Xu Xiangyu Xu Huiping Zhuang Ming Li Fei Yu,https://icml.cc/virtual/2025/poster/45767,"Invisible watermarking is widely used to protect digital images from unauthorized use. Accurate assessment of watermarking efficacy is crucial for advancing algorithmic development. However, existing statistical metrics, such as PSNR, rely on access to original images, which are often unavailable in text-driven generative watermarking and fail to capture critical aspects of watermarking, particularly visibility. More importantly, these metrics fail to account for potential corruption of image content.
To address these limitations, we propose WMarkGPT, the first multimodal large language model (MLLM) specifically designed for comprehensive watermarked image understanding, without accessing original images. WMarkGPT not only predicts watermark visibility but also generates detailed textual descriptions of its location, content, and impact on image semantics, enabling a more nuanced interpretation of watermarked images. Tackling the challenge of precise location description and understanding images with vastly different content, we construct three visual question-answering (VQA) datasets: an object location-aware dataset, a synthetic watermarking dataset, and a real watermarking dataset. We introduce a meticulously designed three-stage learning pipeline to progressively equip WMarkGPT with the necessary abilities. Extensive experiments on synthetic and real watermarking QA datasets demonstrate that WMarkGPT outperforms existing MLLMs, achieving significant improvements in visibility prediction and content description. The datasets and code are released at https://github.com/TanSongBai/WMarkGPT."
-$DPOT_{L_0}$: Concealing Backdoored model updates in Federated Learning by Data Poisoning with $L_0$-norm-bounded Optimized Triggers,Yujie Zhang Neil Zhenqiang Gong Michael K. Reiter,https://openreview.net/forum?id=O6OugpQ1Pq, Adaptive Median Smoothing: Adversarial Defense for Unlearned Text-to-Image Diffusion Models at Inference Time,Xiaoxuan Han Songlin Yang Wei Wang Yang Li Jing Dong,https://icml.cc/virtual/2025/poster/45379,"Text-to-image (T2I) diffusion models have raised concerns about generating inappropriate content, such as ""nudity"". Despite efforts to erase undesirable concepts through unlearning techniques, these unlearned models remain vulnerable to adversarial inputs that can potentially regenerate such content. To safeguard unlearned models, we propose a novel inference-time defense strategy that mitigates the impact of adversarial inputs. Specifically, we first reformulate the challenge of ensuring robustness in unlearned diffusion models as a robust regression problem. Building upon the naive median smoothing for regression robustness, which employs isotropic Gaussian noise, we develop a generalized median smoothing framework that incorporates anisotropic noise. Based on this framework, we introduce a token-wise Adaptive Median Smoothing method that dynamically adjusts noise intensity according to each token's relevance to target concepts. Furthermore, to improve inference efficiency, we explore implementations of this adaptive method at the text-encoding stage. Extensive experiments demonstrate that our approach enhances adversarial robustness while preserving model utility and inference efficiency, outperforming baseline defense techniques." AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs,Anselm Paulus Arman Zharmagambetov Chuan Guo Brandon Amos Yuandong Tian,https://icml.cc/virtual/2025/poster/44613,"Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content.
Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLM show highly competitive results on the AdvBench and HarmBench datasets, that also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks." Black-Box Adversarial Attacks on LLM-Based Code Completion,Slobodan Jenko Niels Mündler Jingxuan He Mark Vero Martin Vechev,https://icml.cc/virtual/2025/poster/44302,"Modern code completion engines, powered by large language models (LLMs), assist millions of developers with their strong capabilities to generate functionally correct code. Due to this popularity, it is crucial to investigate the security implications of relying on LLM-based code completion. In this work, we demonstrate that state-of-the-art black-box LLM-based code completion engines can be stealthily biased by adversaries to significantly increase their rate of insecure code generation. We present the first attack, named INSEC, that achieves this goal. INSEC works by injecting an attack string as a short comment in the completion input. The attack string is crafted through a query-based optimization procedure starting from a set of carefully designed initialization schemes. We demonstrate INSEC's broad applicability and effectiveness by evaluating it on various state-of-the-art open-source models and black-box commercial services (e.g., OpenAI API and GitHub Copilot). On a diverse set of security-critical test cases, covering 16 CWEs across 5 programming languages, INSEC increases the rate of generated insecure code by more than 50%, while maintaining the functional correctness of generated code. We consider INSEC practical - it requires low resources and costs less than 10 US dollars to develop on commodity hardware. Moreover, we showcase the attack's real-world deployability, by developing an IDE plug-in that stealthily injects INSEC into the GitHub Copilot extension."
To construct the poison sample, we follow two key properties for the retrieval and generation process, and identify the solution by satisfying these properties. Besides, we also introduce a class query targeted poisoning attack, a more generalized strategy that extends the poisoning effect to an entire class of target queries. Extensive experiments on multiple query datasets, retrievers, and LVLMs demonstrate that our attack is highly effective in compromising VLRAG systems." Stealix: Model Stealing via Prompt Evolution,Zhixiong Zhuang Hui-Po Wang Maria-Irina Nicolae Mario Fritz,https://icml.cc/virtual/2025/poster/44026,"Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model’s data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized." Measuring Diversity: Axioms and Challenges,Mikhail Mironov Liudmila Prokhorenkova,https://icml.cc/virtual/2025/poster/46567,"This paper addresses the problem of quantifying diversity for a set of objects. First, we conduct a systematic review of existing diversity measures and explore their undesirable behavior in certain cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing measures has all three properties and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradictory. Unfortunately, the constructed examples are too computationally expensive (NP-hard) for practical use. Thus, we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice or proving that all such measures are NP-hard to compute."
-Risk Quadrangle and Robust Optimization Based on Extended $\varphi$-Divergence,Cheng Peng Anton Malandii Stan Uryasev,https://openreview.net/forum?id=UycrIxt96b, Theoretical guarantees on the best-of-n alignment policy,Ahmad Beirami Alekh Agarwal Jonathan Berant Alexander Nicholas D'Amour Jacob Eisenstein Chirag Nagpal Ananda Theertha Suresh,https://icml.cc/virtual/2025/poster/43750,"A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$." Probabilistic Factorial Experimental Design for Combinatorial Interventions,Divya Shyamal Jiaqi Zhang Caroline Uhler,https://icml.cc/virtual/2025/poster/45285,"A _combinatorial intervention_, consisting of multiple treatments applied to a single unit with potential interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce the _probabilistic factorial experimental design_, formalized from how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within a novel intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\frac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\frac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function that can be numerically optimized. We also explore several extensions of the design problem and finally validate our findings through simulations." Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment,Yifan Zhang Ge Zhang Yue Wu Kangping Xu Quanquan Gu,https://icml.cc/virtual/2025/poster/45103,"Modeling human preferences is crucial for aligning foundation models with human values. 
Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce \emph{preference embedding}, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0, following the language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model." Compact Matrix Quantum Group Equivariant Neural Networks,Edward Pearce-Crump,https://icml.cc/virtual/2025/poster/43997,"Group equivariant neural networks have proven effective in modelling a wide range of tasks where the data lives in a classical geometric space and exhibits well-defined group symmetries. However, these networks are not suitable for learning from data that lives in a non-commutative geometry, described formally by non-commutative $\mathcal{C}^{\ast}$-algebras, since the $\mathcal{C}^{\ast}$-algebra of continuous functions on a compact matrix group is commutative. To address this limitation, we derive the existence of a new type of equivariant neural network, called compact matrix quantum group equivariant neural networks, which encode symmetries that are described by compact matrix quantum groups. We characterise the weight matrices that appear in these neural networks for the easy compact matrix quantum groups, which are defined by set partitions. As a result, we obtain new characterisations of equivariant weight matrices for some compact matrix groups that have not appeared previously in the machine learning literature." LoRA Training Provably Converges to a Low-Rank Global Minimum Or It Fails Loudly (But it Probably Won't Fail),Junsu Kim Jaeyeon Kim Ernest K. Ryu,https://icml.cc/virtual/2025/poster/44076,"Low-rank adaptation (LoRA) has become a standard approach for fine-tuning large foundation models. However, our theoretical understanding of LoRA remains limited as prior analyses of LoRA's training dynamics either rely on linearization arguments or consider highly simplified setups. In this work, we analyze the LoRA loss landscape without such restrictive assumptions. We define two regimes: a ""special regime"", which includes idealized setups where linearization arguments hold, and a ""generic regime"" representing more realistic setups where linearization arguments do not hold. In the generic regime, we show that LoRA training converges to a global minimizer with low rank and small magnitude, or a qualitatively distinct solution with high rank and large magnitude. 
Finally, we argue that the zero-initialization and weight decay in LoRA training induce an implicit bias toward the low-rank, small-magnitude region of the parameter space—where global minima lie—thus shedding light on why LoRA training usually succeeds in finding global minima." -Tensor Product Attention Is All You Need,Yifan Zhang Yifeng Liu Huizhuo Yuan Zhen Qin Yang Yuan Quanquan Gu Andrew C Yao,https://openreview.net/forum?id=IEDkPrCLtE, Test-Time Training Provably Improves Transformers as In-context Learners,Halil Alperen Gozeten Muhammed Emrullah Ildiz Xuechen Zhang Mahdi Soltanolkotabi Marco Mondelli Samet Oymak,https://icml.cc/virtual/2025/poster/44720,"Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost." Towards a Formal Theory of Representational Compositionality,Eric Elmoznino Thomas Jiralerspong Yoshua Bengio Guillaume Lajoie,https://icml.cc/virtual/2025/poster/44520,"Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought and language. In AI, it enables a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, we lack satisfying formal definitions for it. Here, we propose such a definition called representational compositionality that is conceptually simple, quantitative, and grounded in algorithmic information theory. Intuitively, representational compositionality states that a compositional representation is both expressive and describable as a simple function of parts. We validate our definition on both real and synthetic data, and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We hope that our definition can inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought. We make our code available at https://github.com/EricElmoznino/complexity_compositionality." Optimal Transfer Learning for Missing Not-at-Random Matrix Completion,Akhil Jalan Yassir Jedra Arya Mazumdar Soumendu Sundar Mukherjee Purnamrita Sarkar,https://icml.cc/virtual/2025/poster/45622,"We study transfer learning for matrix completion in a Missing Not-at-Random (MNAR) setting that is motivated by biological problems. 
The target matrix $Q$ has entire rows and columns missing, making estimation impossible without side information. To address this, we use a noisy and incomplete source matrix $P$, which relates to $Q$ via a feature shift in latent space. We consider both the *active* and *passive* sampling of rows and columns. We establish minimax lower bounds for entrywise estimation error in each setting. Our computationally efficient estimation framework achieves this lower bound for the active setting, which leverages the source data to query the most informative rows and columns of $Q$. This avoids the need for *incoherence* assumptions required for rate optimality in the passive sampling setting. We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets."
To address these issues, we establish two key principles for ensuring generalization and derive the framework LLM-BP accordingly: (1) Unifying the attribute space with task-adaptive embeddings, where we leverage LLM-based encoders and task-aware prompting to enhance generalization of the text attribute embeddings; (2) Developing a generalizable graph information aggregation mechanism, for which we adopt belief propagation with LLM-estimated parameters that adapt across graphs. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches, achieving 8.10\% improvement with task-conditional embeddings and an additional 1.71\% gain from adaptive aggregation. The code and task-adaptive embeddings are publicly available." POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval,Yaoyang Liu Junlin Li Yinjun Wu zhen chen,https://icml.cc/virtual/2025/poster/44047,"Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance highly depends on how to decompose queries into smaller pieces, say phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, say RAG systems, could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternatively optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy. POQD is available at https://github.com/PKU-SDS-lab/POQD-ICML25."
-Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA,Shuangyi Chen Yuanxin Guo Yue Ju Hardik Dalal Ashish J Khisti,https://openreview.net/forum?id=u4mobiHTJl, StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models,Ya Jiang Chuxiong Wu Massieh Kordi Boroujeny Brian Mark Kai Zeng,https://icml.cc/virtual/2025/poster/44621,"Watermarking for large language models (LLMs) offers a promising approach to identifying AI-generated text. Existing approaches, however, either compromise the distribution of original generated text by LLMs or are limited to embedding zero-bit information that only allows for watermark detection but ignores identification. We present StealthInk, a stealthy multi-bit watermarking scheme that preserves the original text distribution while enabling the embedding of provenance data, such as userID, TimeStamp, and modelID, within LLM-generated text. This enhances fast traceability without requiring access to the language model's API or prompts. We derive a lower bound on the number of tokens necessary for watermark detection at a fixed equal error rate, which provides insights on how to enhance the capacity.
Comprehensive empirical evaluations across diverse tasks highlight the stealthiness, detectability, and resilience of StealthInk, establishing it as an effective solution for LLM watermarking applications."
-Unified (Semi) Unbalanced and Classic Optimal Transport with Equivalent Transformation Mechanism and KKT-Multiplier Regularization,Weiming Liu Xinting Liao Jun Dan Fan Wang Hua Yu Junhao Dong Shunjie Dong Lianyong Qi Yew-Soon Ong,https://openreview.net/forum?id=6hmQt7dTX3, Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale,Rogerio Bonatti Dan Zhao Francesco Bonacci Dillon Dupont Sara Abdali Yinheng Li Yadong Lu Justin Wagle Kazuhito Koishida Arthur Bucker Lawrence Keunho Jang Zheng Hui,https://icml.cc/virtual/2025/poster/45035,"Large language models (LLMs) show potential as computer agents, enhancing productivity and software accessibility in multi-modal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging as: (i) most benchmarks are limited to specific modalities/domains (e.g., text-only, web navigation, Q&A) and (ii) full benchmark evaluations are slow (on the order of multiple hours/days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real OS to use the same applications and tools available to human users when performing tasks. We create 150+ diverse tasks across representative domains that require agentic abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized for a full benchmark evaluation in as little as $20$ minutes. Our work not only speeds up the development and evaluation cycle of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multimodal LLMs as agents within the Windows computing environment---with the best achieving only a 19.5\% success rate compared to a human success rate of 74.5\%."
-A Physics-preserved Transfer Learning Method for Differential Equations,Hao-Ran Yang Chuan-Xian Ren,https://openreview.net/forum?id=a9VfggyjQa, Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling,Shuqi Lu Xiaohong Ji Bohang Zhang Lin Yao Siyuan Liu Zhifeng Gao Linfeng Zhang Guolin Ke,https://icml.cc/virtual/2025/poster/45004,"Molecular pretrained representations (MPR) has emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that only modeling these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules.
We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models." CoastalBench: A Decade-Long High-Resolution Dataset to Emulate Complex Coastal Processes,Zelin Xu Yupu Zhang Tingsong Xiao Maitane Olabarrieta Lizaso Jose M. Gonzalez-Ondina Zibo Liu Shigang Chen Zhe Jiang,https://icml.cc/virtual/2025/poster/44437,"Over 40\% of the global population lives within 100 kilometers of the coast, which contributes more than \$8 trillion annually to the global economy. Unfortunately, coastal ecosystems are increasingly vulnerable to more frequent and intense extreme weather events and rising sea levels. Coastal scientists use numerical models to simulate complex physical processes, but these models are often slow and expensive. In recent years, deep learning has become a promising alternative to reduce the cost of numerical models. However, progress has been hindered by the lack of a large-scale, high-resolution coastal simulation dataset to train and validate deep learning models. Existing studies often focus on relatively small datasets and simple processes. To fill this gap, we introduce a decade-long, high-resolution (<100m) coastal circulation modeling dataset on a real-world 3D mesh in southwest Florida with around 6 million cells. The dataset contains key oceanography variables (e.g., current velocities, free surface level, temperature, salinity) alongside external atmospheric and river forcings. We evaluated a customized Vision Transformer model that takes initial and boundary conditions and external forcings and predicts ocean variables at varying lead times. The dataset provides an opportunity to benchmark novel deep learning models for high-resolution coastal simulations (e.g., physics-informed machine learning, neural operator learning). The code and dataset can be accessed at https://github.com/spatialdatasciencegroup/CoastalBench." Compositional Flows for 3D Molecule and Synthesis Pathway Co-design,Tony Shen Seonghwan Seo Ross Irwin Kieran Didi Simon Olsson Woo Youn Kim Martin Ester,https://icml.cc/virtual/2025/poster/46473,"Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures.
We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity and synthesizability on all 15 targets from the LIT-PCBA benchmark, and 4.2x improvement in sampling efficiency compared to 2D synthesis-based baseline. To our best knowledge, our method is also the first to achieve state-of-the-art performance in both Vina Dock (-9.42) and AiZynth success rate (36.1\%) on the CrossDocked2020 benchmark."
Are High-Quality AI-Generated Images More Difficult for Models to Detect?,Yao Xiao Binbin Yang Weiyan Chen Jiahao Chen Zijie Cao ZiYi Dong Xiangyang Ji Liang Lin Wei Ke Pengxu Wei,https://icml.cc/virtual/2025/poster/43842,"The remarkable evolution of generative models has enabled the generation of high-quality, visually attractive images, often perceptually indistinguishable from real photographs to human eyes. This has spurred significant attention on AI-generated image (AIGI) detection. Intuitively, higher image quality should increase detection difficulty. However, our systematic study on cutting-edge text-to-image generators reveals a counterintuitive finding: AIGIs with higher quality scores, as assessed by human preference models, tend to be more easily detected by existing models. To investigate this, we examine how the text prompts for generation and image characteristics influence both quality scores and detector accuracy. We observe that images from short prompts tend to achieve higher preference scores while being easier to detect. Furthermore, through clustering and regression analyses, we verify that image characteristics like saturation, contrast, and texture richness collectively impact both image quality and detector accuracy. Finally, we demonstrate that the performance of off-the-shelf detectors can be enhanced across diverse generators and datasets by selecting input patches based on the predicted scores of our regression models, thus substantiating the broader applicability of our findings. Code and data are available at \href{https://github.com/Coxy7/AIGI-Detection-Quality-Paradox}{GitHub}."
-Attributes Shape the Embedding Space of Face Recognition Models,Pierrick Leroy Antonio Mastropietro Marco Nurisso Francesco Vaccarino,https://icml.cc/virtual/2025/poster/45064,"Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast).We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability.Code available here: https://github.com/mantonios107/attrs-fr-embs."
+Attributes Shape the Embedding Space of Face Recognition Models,Pierrick Leroy Antonio Mastropietro Marco Nurisso Francesco Vaccarino,https://icml.cc/virtual/2025/poster/45064, Balancing Preservation and Modification: A Region and Semantic Aware Metric for Instruction-Based Image Editing,Zhuoying Li Zhu Xu Yuxin Peng Yang Liu,https://icml.cc/virtual/2025/poster/45478,"Instruction-based image editing, which aims to modify the image faithfully towards instruction while preserving irrelevant content unchanged, has made significant progress. However, there is still no comprehensive metric for assessing the editing quality.
Existing metrics either require high costs concerning human evaluation, which hinders large-scale evaluation, or adapt from other tasks and lose specified concerns, failing to comprehensively evaluate the modification of instruction and the preservation of irrelevant regions, resulting in biased evaluation. To tackle it, we introduce a new metric, Balancing Preservation Modification (BPM), tailored for instruction-based image editing by explicitly disentangling the image into editing-relevant and irrelevant regions for specific consideration. We first identify and locate editing-relevant regions, followed by a two-tier process to assess editing quality: Region-Aware Judge evaluates whether the position and size of the edited region align with instruction, and Semantic-Aware Judge further assesses the instruction content compliance within editing-relevant regions as well as content preservation within irrelevant regions, yielding comprehensive and interpretable quality assessment. Moreover, the editing-relevant region localization in BPM can be integrated into image editing approaches to improve the editing quality, manifesting its wide application. We verify the effectiveness of the BPM metric on comprehensive instruction-editing data, and the results show that we yield the highest alignment with human evaluation compared to existing metrics, indicating efficacy. The code is available at https://joyli-x.github.io/BPM/." "Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark",Bingchen Miao Yang Wu Minghe Gao Qifan Yu Wendong Bu Wenqiao Zhang liyunfei Siliang Tang Tat-Seng Chua Juncheng Li,https://icml.cc/virtual/2025/poster/45451,"The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a step-wise multi-dimensional generalist reward model, which offers fine-grained signals for agent training and can choose better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with our crafted Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://github.com/antgroup/Similar." Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention,Dejia Xu Yifan Jiang Chen Huang Liangchen Song Thorsten Gernoth Liangliang Cao Zhangyang Wang Hao Tang,https://icml.cc/virtual/2025/poster/44487,"In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved.
Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality." @@ -2066,7 +2037,6 @@ MaskTwins: Dual-form Complementary Masking for Domain-Adaptive Image Segmentatio MiraGe: Editable 2D Images using Gaussian Splatting,Joanna Waczynska Tomasz Szczepanik Piotr Borycki Slawomir Tadeja Thomas Bohné Przemysław Spurek,https://icml.cc/virtual/2025/poster/45385,"Implicit Neural Representations (INRs) approximate discrete data through continuous functions and are commonly used for encoding 2D images. Traditional image-based INRs employ neural networks to map pixel coordinates to RGB values, capturing shapes, colors, and textures within the network’s weights. Recently, GaussianImage has been proposed as an alternative, using Gaussian functions instead of neural networks to achieve comparable quality and compression. Such a solution obtains a quality and compression ratio similar to classical INR models but does not allow image modification. In contrast, our work introduces a novel method, MiraGe, which uses mirror reflections to perceive 2D images in 3D space and employs flat-controlled Gaussians for precise 2D image editing. Our approach improves the rendering quality and allows realistic image modifications, including human-inspired perception of photos in the 3D world. Thanks to modeling images in 3D space, we obtain the illusion of 3D-based modification in 2D images. We also show that our Gaussian representation can be easily combined with a physics engine to produce physics-based modification of 2D images. Consequently, MiraGe allows for better quality than the standard approach and natural modification of 2D images." MIRROR: Make Your Object-Level Multi-View Generation More Consistent with Training-Free Rectification,Tianchi Xing Bonan Li Congying Han Xinmin Qiu Zicheng Zhang Tiande Guo,https://icml.cc/virtual/2025/poster/43478,"Multi-view Diffusion has greatly advanced the development of 3D content creation by generating multiple images from distinct views, achieving remarkable photorealistic results. However, existing works are still vulnerable to inconsistent 3D geometric structures (commonly known as Janus Problem) and severe artifacts. In this paper, we introduce MIRROR, a versatile plug-and-play method that rectifies such inconsistencies in a training-free manner, enabling the acquisition of high-fidelity, realistic structures without compromising diversity. 
Our key idea focuses on tracing the motion trajectory of physical points across adjacent viewpoints, enabling rectifications based on neighboring observations of the same region. Technically, MIRROR comprises two core modules: Trajectory Tracking Module (TTM) for pixel-wise trajectory tracking that labels identical points across views, and Feature Rectification Module (FRM) for explicitly adjustment of each pixel embedding on noisy synthesized images by minimizing the distance to corresponding block features in neighboring views, thereby achieving consistent outputs. Extensive evaluations demonstrate that MIRROR can seamlessly integrate with a diverse range of off-the-shelf object-level multi-view diffusion models, significantly enhancing both the consistency and the fidelity in an efficient way." More Than Meets the Eye: Enhancing Multi-Object Tracking Even with Prolonged Occlusions,Bishoy Galoaa Somaieh Amraee Sarah Ostadabbas,https://icml.cc/virtual/2025/poster/46062,"This paper introduces MOTE (MOre Than meets the Eye), a novel multi-object tracking (MOT) algorithm designed to address the challenges of tracking occluded objects. By integrating deformable detection transformers with a custom disocclusion matrix, MOTE significantly enhances the ability to track objects even when they are temporarily hidden from view. The algorithm leverages optical flow to generate features that are processed through a softmax splatting layer, which aids in the creation of a disocclusion matrix. This matrix plays a crucial role in maintaining track consistency by estimating the motion of occluded objects. MOTE's architecture includes modifications to the enhanced track embedding module (ETEM), which allows it to incorporate these advanced features into the track query layer embeddings. This integration ensures that the model not only tracks visible objects but also accurately predicts the trajectories of occluded ones, much like the human visual system. The proposed method is evaluated on multiple datasets, including MOT17, MOT20, and DanceTrack, where it achieves impressive tracking metrics--82.0 MOTA and 66.3 HOTA on the MOT17 dataset, 81.7 MOTA and 65.8 HOTA on the MOT20 dataset, and 93.2 MOTA and 74.2 HOTA on the DanceTrack dataset. Notably, MOTE excels in reducing identity switches and maintaining consistent tracking in complex real-world scenarios with frequent occlusions, outperforming existing state-of-the-art methods across all tested benchmarks." -OD³: Optimization-free Dataset Distillation for Object Detection,Salwa K. Al Khatib Ahmed Elhagry Shitong Shao Zhiqiang Shen,https://openreview.net/forum?id=w6imoBZT0F, One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation,Jianze Li Jiezhang Cao Yong Guo Wenbo Li Yulun Zhang,https://icml.cc/virtual/2025/poster/43875,"Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a one sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. 
We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and model will be released at \url{https://github.com/JianzeLi-114/FluxSR}." Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models,Zehan Wang Ziang Zhang Tianyu Pang Chao Du Hengshuang Zhao Zhou Zhao,https://icml.cc/virtual/2025/poster/43594,"Orientation is a fundamental attribute of objects, essential for understanding their spatial pose and arrangement. However, practical solutions for estimating the orientation of open-world objects in monocular images remain underexplored. In this work, we introduce Orient Anything, the first foundation model for zero-shot object orientation estimation. A key challenge in this task is the scarcity of orientation annotations for open-world objects. To address this, we propose leveraging the vast resources of 3D models. By developing a pipeline to annotate the front face of 3D objects and render them from random viewpoints, we curate 2 million images with precise orientation annotations across a wide variety of object categories. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions over three angles and predicts the object orientation by fitting these distributions. Besides, we propose several strategies to further enhance the synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy on both rendered and real images, demonstrating impressive zero-shot capabilities across various scenarios. Furthermore, it shows great potential in enhancing high-level applications, such as understanding complex spatial concepts in images and adjusting 3D object pose." PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop,Chenyu Li Oscar Michel Xichen Pan Sainan Liu Mike Roberts Saining Xie,https://icml.cc/virtual/2025/poster/45288,"Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development. 
Code is available at this repository: https://github.com/vision-x-nyu/pisa-experiments." @@ -2095,9 +2065,7 @@ Enhancing Graph Contrastive Learning for Protein Graphs from Perspective of Inva Feature-Mapping Topology Optimization with Neural Heaviside Signed Distance Functions,Aleksandr Kolomeitsev ANH-HUY PHAN,https://icml.cc/virtual/2025/poster/46233,"Topology optimization plays a crucial role in designing efficient and manufacturable structures. Traditional methods often yield free-form voids that, although providing design flexibility, introduce significant manufacturing challenges and require extensive post-processing. Conversely, feature-mapping topology optimization reduces post-processing efforts by constructing topologies using predefined geometric features. Nevertheless, existing approaches are significantly constrained by the limited set of geometric features available, the variety of parameters that each type of geometric feature can possess, and the necessity of employing differentiable signed distance functions. In this paper, we present a novel method that combines Neural Heaviside Signed Distance Functions (Heaviside SDFs) with structured latent shape representations to generate manufacturable voids directly within the optimization framework. Our architecture incorporates encoder and decoder networks to effectively approximate the Heaviside function and facilitate optimization within a unified latent space, thus addressing the feature diversity limitations of current feature-mapping techniques. Experimental results validate the effectiveness of our approach in balancing structural compliance, offering a new pathway to CAD-integrated design with minimal human intervention." FLAM: Frame-Wise Language-Audio Modeling,Yusong Wu Christos Tsirigotis Ke Chen Cheng-Zhi Anna Huang Aaron Courville Oriol Nieto Prem Seetharaman Justin Salamon,https://icml.cc/virtual/2025/poster/46310,"Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks." 
Galileo: Learning Global & Local Features of Many Remote Sensing Modalities,Gabriel Tseng Anthony Fuller Marlena Reil Henry Herzog Patrick Beukema Favyen Bastani James R Green Evan Shelhamer Hannah Kerner David Rolnick,https://icml.cc/virtual/2025/poster/44450,"We introduce a highly multimodal transformer to represent many remote sensing modalities - multispectral optical, synthetic aperture radar, elevation, weather, pseudo-labels, and more - across space and time. These inputs are useful for diverse remote sensing tasks, such as crop mapping and flood detection. However, learning shared representations of remote sensing data is challenging, given the diversity of relevant data modalities, and because objects of interest vary massively in scale, from small boats (1-2 pixels and fast) to glaciers (thousands of pixels and slow). We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling. Our dual global and local contrastive losses differ in their targets (deep representations vs. shallow input projections) and masking strategies (structured vs. not). Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks." -It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis,Sengim Karayalcin Marina Krček Stjepan Picek,https://openreview.net/forum?id=IhGc3ZM5zN, LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation,Chen-Chia Chang Wan-Hsuan Lin Yikang Shen Yiran Chen Xin Zhang,https://icml.cc/virtual/2025/poster/44935,"Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34\% higher success rates under a tight tolerance 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5\% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation." -Likelihood-based Finetuning of Protein Language Models for Few-shot Fitness Prediction and Design,Alex Hawkins-Hooker Shikha Surana Jakub Kmec Oliver Bent Paul Duckworth,https://openreview.net/forum?id=QUZ1xpU3P0, Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes,Jesse He Helen Jenne Herman Chau Davis Brown Mark Raugas Sara C.
Billey Henry Kvinge,https://icml.cc/virtual/2025/poster/44529,"Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate quiver mutation---an operation that transforms one quiver (or directed multigraph) into another---which is central to the theory of cluster algebras with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of mutation equivalence is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? In this paper, we use graph neural networks and AI explainability techniques to independently discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D$, adding to the growing evidence that modern machine learning models are capable of learning abstract and general rules from mathematical data." Targeted control of fast prototyping through domain-specific interface,Yu-Zhe Shi Mingchen Liu Hanlu Ma Qiao Xu Huamin Qu Kun He Lecheng Ruan Qining Wang,https://icml.cc/virtual/2025/poster/46378,"Industrial designers have long sought a natural and intuitive way to achieve the targeted control of prototype models---using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers' languages and modeling languages, including mismatch in abstraction levels, fluctuation in semantic precision, and divergence in lexical scopes. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface's operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface's potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models." The Case for Learned Provenance-based System Behavior Baseline,Yao Zhu Zhenyuan LI Yangyang Wei Shouling Ji,https://icml.cc/virtual/2025/poster/45219,"Provenance graphs describe data flows and causal dependencies of host activities, enabling to track the data propagation and manipulation throughout the systems, which provide a foundation for intrusion detection. However, these Provenance-based Intrusion Detection Systems (PIDSes) face significant challenges in storage, representation, and analysis, which impede the efficacy of machine learning models such as Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper presents a novel learning-based anomaly detection method designed to efficiently embed and analyze large-scale provenance graphs. 
Our approach integrates dynamic graph processing with adaptive encoding, facilitating compact embeddings that effectively address out-of-vocabulary (OOV) elements and adapt to normality shifts in dynamic real-world environments. Subsequently, we incorporate this refined baseline into a tag-propagation framework for real-time detection. Our evaluation demonstrates the method's accuracy and adaptability in anomaly path mining, significantly advancing the state-of-the-art in handling and analyzing provenance graphs for anomaly detection." @@ -2118,7 +2086,6 @@ PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Reliable Algorithm Selection for Machine Learning-Guided Design,Clara Fannjiang Ji Won Park,https://icml.cc/virtual/2025/poster/44164,"Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task—for example, to design novel proteins with high binding affinity to a therapeutic target—one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion—for example, that at least ten percent of designs’ labels exceed a threshold. It does so by combining designs’ predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference (Angelopoulos et al., 2023). The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method’s effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios." SAFER: A Calibrated Risk-Aware Multimodal Recommendation Model for Dynamic Treatment Regimes,Yishan Shen Yuyang Ye Hui Xiong Yong Chen,https://icml.cc/virtual/2025/poster/46321,"Dynamic treatment regimes (DTRs) are critical to precision medicine, optimizing long-term outcomes through personalized, real-time decision-making in evolving clinical contexts, but require careful supervision for unsafe treatment risks. Existing efforts rely primarily on clinician-prescribed gold standards despite the absence of a known optimal strategy, and predominantly using structured EHR data without extracting valuable insights from clinical notes, limiting their reliability for treatment recommendations. In this work, we introduce SAFER, a calibrated risk-aware tabular-language recommendation framework for DTR that integrates both structured EHR and clinical notes, enabling them to learn from each other, and addresses inherent label uncertainty by assuming ambiguous optimal treatment solution for deceased patients. Moreover, SAFER employs conformal prediction to provide statistical guarantees, ensuring safe treatment recommendations while filtering out uncertain predictions.
Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. These findings underscore SAFER’s potential as a trustworthy and theoretically grounded solution for high-stakes DTR applications." sciLaMA: A Single-Cell Representation Learning Framework to Leverage Prior Knowledge from Large Language Models,Hongru Hu Shuwen Zhang Yongin Choi Venkat S. Malladi Gerald Quon,https://icml.cc/virtual/2025/poster/46669,"Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models like Variational Auto-Encoder (VAE) and its variants struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) face limitations in computational cost and applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification, while maintaining computational efficiency. sciLaMA offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery." -Screener: Self-supervised Pathology Segmentation Model for 3D Medical Images,Mikhail Goncharov Eugenia Soboleva Mariia Donskova Ivan Oseledets Marina Munkhoeva Maxim Panov,https://openreview.net/forum?id=fNyfV5otuV, SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model,Zhao Yang Jiwei Zhu Bing Su,https://icml.cc/virtual/2025/poster/44082,"Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our Species-Profile Adaptive Collaborative Experts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners."
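The SPACE entry above credits its gains to a Mixture-of-Experts (MoE) backbone shared across species and genomic profiles, but the abstract does not specify the routing. The snippet below is only a minimal soft-MoE layer in PyTorch to illustrate the building block being named; the class name, sizes, and softmax gating are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal soft mixture-of-experts layer: a gating network produces a
    softmax over expert MLPs and the layer returns the weighted sum of
    their outputs. Illustrative only; not the SPACE routing."""
    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 128):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)              # (batch, experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, dim, experts)
        return torch.einsum("bde,be->bd", outs, weights)

emb = torch.randn(16, 64)           # e.g. 16 sequence-window embeddings
print(SoftMoE(dim=64)(emb).shape)   # torch.Size([16, 64])
```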
SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics,Suyuan Zhao YIZHEN LUO Ganbo Yang Yan Zhong Hao Zhou Zaiqing Nie,https://icml.cc/virtual/2025/poster/45384,"Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving spatial context of cells. Building foundational models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironment, and gene-scale gene expression profile. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice, to construct a set of ST sub-slices that aggregate macro-, micro- and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct SToCorpus-88M, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data through capturing and integrating multi-scale information." Unified Screening for Multiple Diseases,Yiğit Narter Alihan Hüyük Mihaela van der Schaar Cem Tekin,https://icml.cc/virtual/2025/poster/43498,"Current screening programs that focus on improving patient health while minimizing screening costs are tailored for individual diseases. Designing unified screening programs for multiple diseases requires carefully balancing competing disease risks, which is an open problem. In this work, we address this problem by casting unified screening as a referral problem, in which we choose to activate a subset of screening policies for individual diseases by accounting for competing risks that influence patient outcomes. We introduce a novel optimization framework that incorporates disease risks, budget constraints, and diagnostic error limits and characterize the structural properties of the optimal referral policy. For the unified screening of two diseases, we show that the optimal activation threshold for the screening of one disease depends on the risk of the other, resulting in decision boundaries with distinct risk-dependent profiles. We compare our unified model with independent screening programs that apply isolated activation thresholds for screening of each disease. Our approach optimizes screening decisions collectively, improving overall survival outcomes, particularly for patients with high disease risks." @@ -2164,7 +2131,6 @@ VerbalTS: Generating Time Series from Texts,Shuqi Gu Chuyue Li Baoyu Jing Kan Re GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras,Ekaterina Filimoshina Dmitry Shirokov,https://icml.cc/virtual/2025/poster/45802,"We propose, implement, and compare with competitors a new architecture of equivariant neural networks based on geometric (Clifford) algebras: Generalized Lipschitz Group Equivariant Neural Networks (GLGENN).
These networks are equivariant to all pseudo-orthogonal transformations, including rotations and reflections, of a vector space with any non-degenerate or degenerate symmetric bilinear form. We propose a weight-sharing parametrization technique that takes into account the fundamental structures and operations of geometric algebras. Due to this technique, GLGENN architecture is parameter-light and has less tendency to overfitting than baseline equivariant models. GLGENN outperforms or matches competitors on several benchmarking equivariant tasks, including estimation of an equivariant function and a convex hull experiment, while using significantly fewer optimizable parameters." How to Synthesize Text Data without Model Collapse?,Xuekai Zhu Daixuan Cheng Hengli Li Kaiyan Zhang Ermo Hua Xingtai Lv Ning Ding Zhouhan Lin Zilong Zheng Bowen Zhou,https://icml.cc/virtual/2025/poster/44341,"Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance." Lightweight-Mark: Rethinking Deep Learning-Based Watermarking,Yupeng Qiu Han Fang Ee-Chien Chang,https://icml.cc/virtual/2025/poster/44782,"Deep learning-based watermarking models play a crucial role in copyright protection across various applications. However, many high-performance models are limited in practical deployment due to their large number of parameters. Meanwhile, the robustness and invisibility performance of existing lightweight models are unsatisfactory. This presents a pressing need for a watermarking model that combines lightweight capacity with satisfactory performance. Our research identifies a key reason that limits the performance of existing watermarking frameworks: a mismatch between commonly used decoding losses (e.g., mean squared error and binary cross-entropy loss) and the actual decoding goal, leading to parameter redundancy. We propose two innovative solutions: (1) Decoding-oriented surrogate loss (DO), which redesigns the loss function to mitigate the influence of decoding-irrelevant optimization directions; and (2) Detachable projection head (PH), which incorporates a detachable redundant module during training to handle these irrelevant directions and is discarded during inference. 
Additionally, we propose a novel watermarking framework comprising five submodules, allowing for independent parameter reduction in each component. Our proposed model achieves better efficiency, invisibility, and robustness while utilizing only 2.2\% of the parameters compared to the state-of-the-art frameworks. By improving efficiency while maintaining robust copyright protection, our model is well suited for practical applications in resource-constrained environments. The DO and PH methods are designed to be plug-and-play, facilitating seamless integration into future lightweight models." -msf-CNN: Multi-Stage Fusion with Convolutional Neural Networks for TinyML,Zhaolan Huang Emmanel Baccelli,https://openreview.net/forum?id=YEm8MV6b6X, Neural Genetic Search in Discrete Spaces,Hyeonah Kim Sanghyeok Choi Jiwoo Son Jinkyoo Park Changhyun Kwon,https://icml.cc/virtual/2025/poster/43742,"Effective search methods are crucial for improving the performance of deep generative models at test time. In this paper, we introduce a novel test-time search method, Neural Genetic Search (NGS), which incorporates the evolutionary mechanism of genetic algorithms into the generation procedure of deep models. The core idea behind NGS is its crossover, which is defined as parent-conditioned generation using trained generative models. This approach offers a versatile and easy-to-implement search algorithm for deep generative models. We demonstrate the effectiveness and flexibility of NGS through experiments across three distinct domains: routing problems, adversarial prompt generation for language models, and molecular design." Noise-Guided Predicate Representation Extraction and Diffusion-Enhanced Discretization for Scene Graph Generation,Guoqing Zhang Shichao Kan Fanghui Zhang Wanru Xu Yue Zhang Yigang Cen,https://icml.cc/virtual/2025/poster/43766,"Scene Graph Generation (SGG) is a fundamental task in visual understanding, aimed at providing more precise local detail comprehension for downstream applications. Existing SGG methods often overlook the diversity of predicate representations and the consistency among similar predicates when dealing with long-tail distributions. As a result, the model's decision layer fails to effectively capture details from the tail end, leading to biased predictions. To address this, we propose a Noise-Guided Predicate Representation Extraction and Diffusion-Enhanced Discretization (NoDIS) method. On the one hand, expanding the predicate representation space enhances the model's ability to learn both common and rare predicates, thus reducing prediction bias caused by data scarcity. We propose a conditional diffusion model to reconstruct features and increase the diversity of representations for same-category predicates. On the other hand, independent predicate representations in the decision phase increase the learning complexity of the decision layer, making accurate predictions more challenging. To address this issue, we introduce a discretization mapper that learns consistent representations among similar predicates, reducing the learning difficulty and decision ambiguity in the decision layer. To validate the effectiveness of our method, we integrate NoDIS with various SGG baseline models and conduct experiments on multiple datasets. The results consistently demonstrate superior performance."
Open-Det: An Efficient Learning Framework for Open-Ended Detection,Guiping Cao Tao Wang Wenjian Huang Xiangyuan Lan Jianguo Zhang Dongmei Jiang,https://icml.cc/virtual/2025/poster/45000,"Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, incorporating with the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: https://github.com/Med-Process/Open-Det." @@ -2254,7 +2220,6 @@ TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data,Hengru Towards a Unified Framework of Clustering-based Anomaly Detection,Zeyu Fang Ming Gu Sheng Zhou Jiawei Chen Qiaoyu Tan Haishuai Wang Jiajun Bu,https://icml.cc/virtual/2025/poster/46624,"Unsupervised Anomaly Detection (UAD) plays a crucial role in identifying abnormal patterns within data without labeled examples, holding significant practical implications across various domains. Although the individual contributions of representation learning and clustering to anomaly detection are well-established, their interdependencies remain under-explored due to the absence of a unified theoretical framework. Consequently, their collective potential to enhance anomaly detection performance remains largely untapped. To bridge this gap, in this paper, we propose a novel probabilistic mixture model for anomaly detection to establish a theoretical connection among representation learning, clustering, and anomaly detection. By maximizing a novel anomaly-aware data likelihood, representation learning and clustering can effectively reduce the adverse impact of anomalous data and collaboratively benefit anomaly detection. Meanwhile, a theoretically substantiated anomaly score is naturally derived from this framework. Lastly, drawing inspiration from gravitational analysis in physics, we have devised an improved anomaly score that more effectively harnesses the combined power of representation learning and clustering. Extensive experiments, involving 17 baseline methods across 30 diverse datasets, validate the effectiveness and generalization capability of the proposed method, surpassing state-of-the-art methods." 
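The clustering-based anomaly detection entry above derives its anomaly score from a probabilistic mixture model. For orientation only, the sketch below shows the textbook form of that idea: fit a Gaussian mixture to nominal data and score test points by negative log-likelihood. The paper's anomaly-aware likelihood and its gravitation-inspired improved score are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))              # mostly-normal training data
test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)),     # nominal test points
                  rng.normal(6.0, 1.0, size=(5, 2))])    # off-cluster anomalies

# Fit a mixture to the nominal data; score test points by negative
# log-likelihood, so low-density points receive high anomaly scores.
gmm = GaussianMixture(n_components=3, random_state=0).fit(train)
anomaly_score = -gmm.score_samples(test)
print(np.round(anomaly_score, 2))   # the last five scores come out much larger
```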
Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities,Ruchika Chavhan Abhinav Mehrotra Malcolm Chadwick Alberto Gil Couto Pimentel Ramos Luca Morreale Mehdi Noroozi Sourav Bhattacharya,https://icml.cc/virtual/2025/poster/45817,"Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adopt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate for the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models." Variational Control for Guidance in Diffusion Models,Kushagra Pandey Farrin Marouf Sofian Felix Draxler Theofanis Karaletsos Stephan Mandt,https://icml.cc/virtual/2025/poster/44885,"Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing \emph{Diffusion Trajectory Matching (DTM)} that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear, non-linear, and blind inverse problems without requiring additional model training or specificity to pixel or latent space diffusion models. Our code will be available at https://github.com/czi-ai/oc-guidance." -Varying Manifolds in Diffusion: From Time-varying Geometries to Visual Saliency,Junhao Chen Manyi Li zherong pan Xifeng Gao Changhe Tu,https://openreview.net/forum?id=mIGCz3ZmmX, ZipAR: Parallel Autoregressive Image Generation through Spatial Locality,Yefei He Feng Chen Yuanyu He Shaoxuan He Hong Zhou Kaipeng Zhang Bohan Zhuang,https://icml.cc/virtual/2025/poster/45251,"In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel. To ensure alignment with the contextual requirements of each token, we employ an adaptive local window assignment scheme with rejection sampling analogous to speculative decoding.
By decoding multiple tokens in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining." A Cognac Shot To Forget Bad Memories: Corrective Unlearning for Graph Neural Networks,Varshita Kolipaka Akshit Sinha Debangan Mishra Sumit Kumar Arvindh Arun Shashwat Goel Ponnurangam Kumaraguru,https://icml.cc/virtual/2025/poster/44563,"Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. Because graph data does not follow the independently and identically distributed *i.i.d.* assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, which deteriorates the model's performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of *Corrective Unlearning*. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method,**Cognac**, which can unlearn the effect of the manipulation set even when only $5$% of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set, and is $8$x more efficient while also scaling to large datasets. We hope our work assists GNN developers in mitigating harmful effects caused by issues in real-world data, post-training." Commute Graph Neural Networks,Wei Zhuo Han Yu Guang Tan Xiaoxiao Li,https://icml.cc/virtual/2025/poster/46600,"Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. It enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments on 8 benchmarking datasets confirm the superiority of CGNN against 13 state-of-the-art methods." 
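The CGNN entry above weights each neighbour's message by its commute time to the central node, computed from a newly formulated digraph Laplacian. As a simplified reference point, the sketch below computes classical commute times on an undirected graph from the Laplacian pseudoinverse and turns them into inverse-commute-time aggregation weights; both the undirected simplification and the weighting rule are assumptions for illustration, not CGNN's exact scheme.

```python
import numpy as np

def commute_times(adj: np.ndarray) -> np.ndarray:
    """Pairwise commute times on an undirected graph via the Laplacian
    pseudoinverse: C(u, v) = vol(G) * (L+_uu + L+_vv - 2 L+_uv)."""
    deg = adj.sum(axis=1)
    lap_pinv = np.linalg.pinv(np.diag(deg) - adj)
    d = np.diag(lap_pinv)
    return deg.sum() * (d[:, None] + d[None, :] - 2 * lap_pinv)

# Toy 4-node path graph 0-1-2-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
C = commute_times(adj)
# One assumed way to turn commute times into aggregation weights:
# neighbours that are closer in commute time contribute more.
w = np.where(adj > 0, 1.0 / np.maximum(C, 1e-9), 0.0)
w = w / w.sum(axis=1, keepdims=True)
print(np.round(C, 2))
print(np.round(w, 2))
```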
@@ -2371,7 +2336,6 @@ The Missing Alignment Link of In-context Learning on Sequences,Harshvardhan Agar Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization,Zishun Yu Tengyu Xu Di Jin Karthik Abinav Sankararaman Yun He Wenxuan Zhou Zhouhao Zeng Eryk Helenowski Chen Zhu Sinong Wang Hao Ma Han Fang,https://icml.cc/virtual/2025/poster/46693,"Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, such as through self-correction and extensive long chain-of-thoughts. While promising in problem-solving, advanced long reasoning chain models exhibit an undesired single-modal behavior, where trivial questions require unnecessarily tedious long chains of thought. In this work, we propose a way to allow models to be aware of inference budgets by formulating it as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to ``understand'' the difficulty of queries and allocate inference budgets to harder ones. With different inference budgets, our best models are able to have a $4.14$\% and $5.74$\% absolute improvement ($8.08$\% and $11.2$\% relative improvement) on MATH500 using $2.16$x and $4.32$x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately $2$x those of self-consistency under the same budgets." Thinking LLMs: General Instruction Following with Thought Generation,Tianhao Wu Janice Lan Weizhe Yuan Jiantao Jiao Jason E Weston Sainbayar Sukhbaatar,https://icml.cc/virtual/2025/poster/43495,"LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning -- but can be applied toanytask. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without use of additional human data. We achieve this by an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model to evaluate their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning categories such as marketing, health and general knowledge, in addition to more traditional reasoning & problem-solving tasks." Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning,Jinlong Pang Na Di Zhaowei Zhu Jiaheng Wei Hao Cheng Chen Qian Yang Liu,https://icml.cc/virtual/2025/poster/43777,"Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant, uninformative, or even harmful. 
Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance.In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning." -ULPT: Prompt Tuning with Ultra-Low-Dimensional Optimization,Zijun Wu Yongchang Hao Lili Mou,https://openreview.net/forum?id=ypweyCzJRT, Understanding Bias Reinforcement in LLM Agents Debate,Jihwan Oh Minchan Jeong Jongwoo Ko Se-Young Yun,https://icml.cc/virtual/2025/poster/46607,"Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce $\textit{MetaNIM Arena}$, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose $\texttt{\textbf{DReaMAD}}$ ($\textbf{D}$iverse $\textbf{Rea}$soning via $\textbf{M}$ulti-$\textbf{A}$gent $\textbf{D}$ebate with Refined Prompt), a novel framework that (1) refines LLMs’ strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that $\texttt{\textbf{DReaMAD}}$ significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making." Understanding Chain-of-Thought in LLMs through Information Theory,Jean-Francois Ton Muhammad Faaiz Taufiq Yang Liu,https://icml.cc/virtual/2025/poster/45723,"Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short of accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information gain' at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. 
We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800k datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks." Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach,Changdae Oh Zhen Fang Shawn Im Xuefeng Du Yixuan Li,https://icml.cc/virtual/2025/poster/44373,"Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure the safe and reliable application of MLLMs in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights." @@ -2397,7 +2361,6 @@ CAT: Contrastive Adversarial Training for Evaluating the Robustness of Protectiv Geometric Median (GM) Matching for Robust k-Subset Selection from Noisy Data,Anish Acharya Sujay Sanghavi Alex Dimakis Inderjit S Dhillon,https://icml.cc/virtual/2025/poster/43976,"Data pruning -- the combinatorial task of selecting a small and representative subset from a large dataset -- is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, this work proposes Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median (GM), a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a $k$-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved $\mathcal{O}(1/k)$ convergence rate, outperforming $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings, making it a strong baseline for robust data pruning."
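The GM Matching entry above selects a k-subset whose mean approximates the geometric median (GM) of a noisy dataset. Under the assumption that a simple herding-style greedy rule is an acceptable stand-in for the authors' selection procedure, a minimal sketch pairs Weiszfeld's algorithm for the GM with such a greedy step:

```python
import numpy as np

def geometric_median(x: np.ndarray, iters: int = 100, eps: float = 1e-8) -> np.ndarray:
    """Weiszfeld iterations for the geometric median of the rows of x."""
    gm = x.mean(axis=0)
    for _ in range(iters):
        w = 1.0 / np.maximum(np.linalg.norm(x - gm, axis=1), eps)
        gm = (w[:, None] * x).sum(axis=0) / w.sum()
    return gm

def gm_matching(x: np.ndarray, k: int) -> np.ndarray:
    """Greedy, herding-style sketch: grow a k-subset whose running mean
    tracks the geometric median of the full (noisy) dataset."""
    gm = geometric_median(x)
    available = np.ones(len(x), dtype=bool)
    chosen, running_sum = [], np.zeros(x.shape[1])
    for t in range(1, k + 1):
        dists = np.linalg.norm((running_sum + x) / t - gm, axis=1)
        dists[~available] = np.inf
        idx = int(np.argmin(dists))
        chosen.append(idx)
        available[idx] = False
        running_sum += x[idx]
    return np.array(chosen)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.normal(20.0, 1.0, size=(40, 2))   # gross corruption
data = np.vstack([clean, outliers])
sel = gm_matching(data, k=20)
print(int((sel >= 200).sum()), "of 20 selected points are outliers")
```

Because the geometric median stays inside the clean cluster even under this level of corruption, the greedy rule tends to avoid the outliers entirely, which is the qualitative robustness the abstract claims for the full method.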
IT$^3$: Idempotent Test-Time Training,Nikita Durasov Assaf Shocher Doruk Oner Gal Chechik Alexei A Efros Pascal Fua,https://icml.cc/virtual/2025/poster/45551,"Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. While existing approaches like domain adaptation and test-time training (TTT) offer partial solutions, they typically require additional data or domain-specific auxiliary tasks. We present Idempotent Test-Time Training (IT3), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance, without any auxiliary task design. Our key insight is that enforcing idempotence---where repeated applications of a function yield the same result---can effectively replace domain-specific auxiliary tasks used in previous TTT methods. We theoretically connect idempotence to prediction confidence and demonstrate that minimizing the distance between successive applications of our model during inference leads to improved out-of-distribution performance. Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT3 consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures." LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models,Fanfei Li Thomas Klein Wieland Brendel Robert Geirhos Roland S. Zimmermann,https://icml.cc/virtual/2025/poster/45771,"Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers." 
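The IT$^3$ entry above adapts a model at inference time by minimizing the distance between successive applications of the model on the current test instance. Reading that sentence literally, a minimal sketch looks as follows; the toy MLP, the detached target, and the optimizer settings are illustrative assumptions rather than the paper's recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy model whose output lives in the same space as its input, so the model
# can be applied to its own prediction (a precondition for idempotence).
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def idempotence_step(x: torch.Tensor) -> float:
    """One test-time adaptation step: penalize the gap between successive
    applications of the model, f(f(x)) vs. f(x)."""
    y1 = model(x)
    y2 = model(y1)
    loss = ((y2 - y1.detach()) ** 2).mean()   # detached target is an assumed stabilizer
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

x_test = torch.randn(4, 8)   # a single out-of-distribution test batch
print([round(idempotence_step(x_test), 4) for _ in range(5)])   # gap typically shrinks
```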
-Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models,Ying Yang Jie Zhang Xiao Lv Di Lin Tao Xiang Qing Guo,https://openreview.net/forum?id=ORUVYM8OaJ, OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable?,Liangze Jiang Damien Teney,https://icml.cc/virtual/2025/poster/46662,"Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. Numerous algorithms exist to address specific settings, but choosing the right training algorithm for the right dataset without trial and error is difficult. Indeed, real-world applications often involve multiple types and combinations of shifts that are hard to analyze theoretically. Method: This work explores the possibility of learning the selection of a training algorithm for OOD generalization. We propose a proof of concept (OOD-Chameleon) that formulates the selection as a multi-label classification over candidate algorithms, trained on a dataset of datasets representing a variety of shifts. We evaluate the ability of OOD-Chameleon to rank algorithms on unseen shifts and datasets based only on dataset characteristics, i.e., without training models first, unlike traditional model selection. Findings: Extensive experiments show that the learned selector identifies high-performing algorithms across synthetic, vision, and language tasks. Further inspection shows that it learns non-trivial decision rules, which provide new insights into the applicability of existing algorithms. Overall, this new approach opens the possibility of better exploiting and understanding the plethora of existing algorithms for OOD generalization." Pixel-level Certified Explanations via Randomized Smoothing,Alaa Anani Tobias Lorenz Mario Fritz Bernt Schiele,https://icml.cc/virtual/2025/poster/45484,"Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at [https://github.com/AlaaAnani/certified-attributions](https://github.com/AlaaAnani/certified-attributions)."
Probably Approximately Global Robustness Certification,Peter Blohm Patrick Indri Thomas Gärtner SAGAR MALHOTRA,https://icml.cc/virtual/2025/poster/45128,"We propose and investigate probabilistic guarantees for the adversarial robustness of classification algorithms. While traditional formal verification approaches for robustness are intractable and sampling-based approaches do not provide formal guarantees, our approach is able to efficiently certify a probabilistic relaxation of robustness. The key idea is to sample an $\epsilon$-net and invoke a local robustness oracle on the sample. Remarkably, the size of the sample needed to achieve probably approximately global robustness guarantees is independent of the input dimensionality, the number of classes, and the learning algorithm itself. Our approach can, therefore, be applied even to large neural networks that are beyond the scope of traditional formal verification. Experiments empirically confirm that it characterizes robustness better than state-of-the-art sampling-based approaches and scales better than formal methods." @@ -2409,7 +2372,6 @@ SMART-PC: Skeletal Model Adaptation for Robust Test-Time Training in Point Cloud
Targeted Unlearning with Single Layer Unlearning Gradient,Zikui Cai Yaoteng Tan M. Salman Asif,https://icml.cc/virtual/2025/poster/46379,"Machine unlearning methods aim to remove sensitive or unwanted content from trained models, but typically demand extensive model updates at significant computational cost while potentially degrading model performance on both related and unrelated tasks. We propose Single Layer Unlearning Gradient (SLUG) as an efficient method to unlearn targeted information by updating a single critical layer using a one-time gradient computation. SLUG uses layer importance and gradient alignment metrics to identify the optimal layer for targeted information removal while preserving the model utility. We demonstrate the effectiveness of SLUG for CLIP, Stable Diffusion, and vision-language models (VLMs) in removing concrete (e.g., identities and objects) and abstract concepts (e.g., artistic styles). On the UnlearnCanvas benchmark, SLUG achieves comparable unlearning performance to existing methods while requiring significantly less computational resources. Our proposed approach offers a practical solution for targeted unlearning that is computationally efficient and precise. Our code is available at https://github.com/CSIPlab/SLUG" The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models,Zichao Li Xueru Wen Jie Lou Yuqiu Ji Yaojie Lu Xianpei Han Debing Zhang Le Sun,https://icml.cc/virtual/2025/poster/44766,"Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations.
Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling. Our source code is provided on https://github.com/alignrm/Generalizable-MM-RM." "Unveiling AI's Blind Spots: An Oracle for In-Domain, Out-of-Domain, and Adversarial Errors",Shuangpeng Han Mengmi Zhang,https://icml.cc/virtual/2025/poster/44793,"AI models make mistakes when recognizing images—whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a ""mentor"" model—a deep neural network designed to predict another ""mentee"" model’s errors. Our findings show that the mentor excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an ""oracle"" mentor model, dubbed SuperMentor, that can outperform baseline mentors in predicting errors across different error types from the ImageNet-1K dataset. Our framework paves the way for future research on anticipating and correcting AI model behaviors, ultimately increasing trust in AI systems. Our data and code are available athere." -What is Adversarial Training for Diffusion Models?,Maria Rosaria Briglia Mujtaba Hussain Mirza Giuseppe Lisanti Iacopo Masi,https://openreview.net/forum?id=xCrgcGytLR, Collapse-Proof Non-Contrastive Self-Supervised Learning,Emanuele Sansone Tim Lebailly Tinne Tuytelaars,https://icml.cc/virtual/2025/poster/43625,"We present a principled and simplified design of the projector and loss function for non-contrastive self-supervised learning based on hyperdimensional computing. We theoretically demonstrate that this design introduces an inductive bias that encourages representations to be simultaneously decorrelated and clustered, without explicitly enforcing these properties. This bias provably enhances generalization and suffices to avoid known training failure modes, such as representation, dimensional, cluster, and intracluster collapses. We validate our theoretical findings on image datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet-100. Our approach effectively combines the strengths of feature decorrelation and cluster-based self-supervised learning methods, overcoming training failure modes while achieving strong generalization in clustering and linear classification tasks." ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts,Samar Khanna Medhanie Irgau David B. Lobell Stefano Ermon,https://icml.cc/virtual/2025/poster/45411,"Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. 
An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 8% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the art approaches. Our ablation studies confirm the efficacy of our approach over other baselines such as PEFT. Code is available at: https://samar-khanna.github.io/ExPLoRA/" GraphCL: Graph-based Clustering for Semi-Supervised Medical Image Segmentation,Mengzhu Wang houcheng su Jiao Li Chuan Li Nan Yin Li Shen Jingcai Guo,https://icml.cc/virtual/2025/poster/45355,"Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data and significantly enhancing data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering for semi-supervised medical image segmentation (GraphCL) by jointly modeling graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to get the clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods." @@ -2420,12 +2382,10 @@ One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic- A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments,Yuchen Wang Hongjue Zhao Haohong Lin Enze Xu Lifang He Huajie Shao,https://icml.cc/virtual/2025/poster/46230,"This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a general-purpose framework that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. 
Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term to make the estimated latent states align with system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The source code will be publicly available upon publication." Bayesian Basis Function Approximation for Scalable Gaussian Process Priors in Deep Generative Models,Mehmet Yiğit Balık Maksim Sinelnikov Priscilla Ong Harri Lähdesmäki,https://icml.cc/virtual/2025/poster/44629,"High-dimensional time-series datasets are common in domains such as healthcare and economics. Variational autoencoder (VAE) models, where latent variables are modeled with a Gaussian process (GP) prior, have become a prominent model class to analyze such correlated datasets. However, their applications are challenged by the inherent cubic time complexity that requires specific GP approximation techniques, as well as the general challenge of modeling both shared and individual-specific correlations across time. Though inducing points enhance GP prior VAE scalability, optimizing them remains challenging, especially since discrete covariates resist gradient‑based methods. In this work, we propose a scalable basis function approximation technique for GP prior VAEs that mitigates these challenges and results in linear time complexity, with a global parametrization that eliminates the need for amortized variational inference and the associated amortization gap, making it well-suited for conditional generation tasks where accuracy and efficiency are crucial. Empirical evaluations on synthetic and real-world benchmark datasets demonstrate that our approach not only improves scalability and interpretability but also drastically enhances predictive performance." Discovering Physics Laws of Dynamical Systems via Invariant Function Learning,Shurui Gui Xiner Li Shuiwang Ji,https://icml.cc/virtual/2025/poster/44382,"We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of ideal pendulum's natural motion $\alpha^2 \sin{\theta_t}$ by observing pendulum dynamics in different environments, such as the damped environment $\alpha^2 \sin(\theta_t) - \rho \omega_t$ and powered environment $\alpha^2 \sin(\theta_t) + \rho \frac{\omega_t}{\left|\omega_t\right|}$. 
Here, we formulate this problem as an *invariant function learning* task and propose a new method, known as **D**isentanglement of **I**nvariant **F**unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws." -Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop,Yushan Jiang Wenchao Yu Geon Lee Dongjin Song Kijung Shin Wei Cheng Yanchi Liu Haifeng Chen,https://openreview.net/forum?id=DHRUMwDOXy, LAST SToP for Modeling Asynchronous Time Series,Shubham Gupta Thibaut Durand Graham W. Taylor Lilian Bialokozowicz,https://icml.cc/virtual/2025/poster/45155,"We present a novel prompt design for Large Language Models (LLMs) tailored to Asynchronous Time Series. Unlike regular time series, which assume values at evenly spaced time points, asynchronous time series consist of timestamped events occurring at irregular intervals, each described in natural language. Our approach effectively utilizes the rich natural language of event descriptions, allowing LLMs to benefit from their broad world knowledge for reasoning across different domains and tasks. This allows us to extend the scope of asynchronous time series analysis beyond forecasting to include tasks like anomaly detection and data imputation. We further introduce Stochastic Soft Prompting, a novel prompt-tuning mechanism that significantly improves model performance, outperforming existing finetuning methods such as QLORA. Through extensive experiments on real-world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets." Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting,Jiecheng Lu Shihao Yang,https://icml.cc/virtual/2025/poster/45192,"Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF.
By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency, comparing to SOTA TSF models." LSCD: Lomb--Scargle Conditioned Diffusion for Time series Imputation,Elizabeth Fons Alejandro Sztrajman Yousef El-Laham Luciana Ferrer Svitlana Vyetrenko Manuela Veloso,https://icml.cc/virtual/2025/poster/45821,"Time series with missing or irregularly sampled data are a persistent challenge in machine learning. Many methods operate on the frequency-domain, relying on the Fast Fourier Transform (FFT) which assumes uniform sampling, therefore requiring prior interpolation that can distort the spectra. To address this limitation, we introduce a differentiable Lomb--Scargle layer that enables a reliable computation of the power spectrum of irregularly sampled data.We integrate this layer into a novel score-based diffusion model (LSCD) for time series imputation conditioned on the entire signal spectrum. Experiments on synthetic and real-world benchmarks demonstrate that our method recovers missing data more accurately than purely time-domain baselines, while simultaneously producing consistent frequency estimates. Crucially, our method can be easily integrated into learning frameworks, enabling broader adoption of spectral guidance in machine learning approaches involving incomplete or irregular data." SpikF: Spiking Fourier Network for Efficient Long-term Prediction,Wenjie Wu Dexuan Huo Hong Chen,https://icml.cc/virtual/2025/poster/46411,"Spiking Neural Networks (SNNs) have demonstrated remarkable potential across many domains, including computer vision and natural language processing, owing to their energy efficiency and biological plausibility. However, their application in long-term prediction tasks remains underexplored, which is primarily due to two critical challenges: (1) current SNN encoding methods are unable to effectively encode long temporal information, leading to increased computational complexity and energy consumption; (2) though Transformer-based models have achieved state-of-the-art accuracy in temporal prediction tasks, the absence of proper positional encoding for spiking self-attention restricts Spiking Transformer from effectively utilizing positional information, resulting in performance degradation. To address these challenges, we introduce an attention-free framework, **Spik**ing **F**ourier Network (**SpikF**), that encodes input sequences in patches and employs an innovative frequency domain selection mechanism to effectively utilize the sequential properties of time-series data. Extensive evaluations on eight well-established long-term prediction datasets demonstrate that SpikF achieves an averaged $1.9\\%$ reduction in Mean Absolute Error (MAE) compared to state-of-the-art models, while lowering total energy consumption by $3.16\times$. Our code is available at https://github.com/WWJ-creator/SpikF." -TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster,Kanghui Ning Zijie Pan Yu Liu Yushan Jiang James Y. Zhang Kashif Rasul Anderson Schneider Lintao Ma Yuriy Nevmyvaka Dongjin Song,https://openreview.net/forum?id=TJuUelhGQr, Understanding and Improving Length Generalization in Recurrent Models,Ricardo Buitrago Albert Gu,https://icml.cc/virtual/2025/poster/46587,"Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. 
Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths---i.e. they fail to length generalize.In this work, we provide comprehensive empirical and theoretical analysis to support the \textit{unexplored states hypothesis}, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all \textit{attainable} states (i.e. states that would be attained if the recurrence was applied to long sequences).Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k\longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models." A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models,Shivam Kumar Yun Yang Lizhen Lin,https://icml.cc/virtual/2025/poster/46645,"In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings." A Simple Model of Inference Scaling Laws,Noam Itzhak Levi,https://icml.cc/virtual/2025/poster/46402,"Neural scaling laws have garnered significant interest due to their ability to predict model performance as a function of increasing parameters, data, and compute. In this work, we propose a simple statistical ansatz based on memorization to study scaling laws in the context of inference. Specifically, how performance improves with multiple inference attempts. We explore the coverage, or pass@k metric, which measures the chance of success over repeated attempts and provide a motivation for the observed functional form of the inference scaling behavior of the coverage in large language models (LLMs) on reasoning tasks. 
We then define an ""inference loss"", which exhibits a power law decay as the number of trials increases, and connect this result with prompting costs. We further test the universality of our construction by conducting experiments on a simple generative model, and find that our predictions are in agreement with the empirical coverage curves in a controlled setting. Our simple framework sets the ground for incorporating inference scaling with other known scaling laws." @@ -2465,7 +2425,6 @@ On Efficient Estimation of Distributional Treatment Effects under Covariate-Adap
Rethinking Causal Ranking: A Balanced Perspective on Uplift Model Evaluation,Minqin Zhu Zexu Sun Ruoxuan Xiong Anpeng Wu Baohong Li Caizhi Tang JUN ZHOU Fei Wu Kun Kuang,https://icml.cc/virtual/2025/poster/44364,"Uplift modeling is crucial for identifying individuals likely to respond to a treatment in applications like marketing and customer retention, but evaluating these models is challenging due to the inaccessibility of counterfactual outcomes in real-world settings. In this paper, we identify a fundamental limitation in existing evaluation metrics, such as the uplift and Qini curves, which fail to rank individuals with binary negative outcomes accurately. This can lead to biased evaluations, where biased models receive higher curve values than unbiased ones, resulting in suboptimal model selection. To address this, we propose the Principled Uplift Curve (PUC), a novel evaluation metric that assigns equal curve values of individuals with both positive and negative binary outcomes, offering a more balanced and unbiased assessment. We then derive the Principled Uplift Loss (PUL) function from the PUC and integrate it into a new uplift model, the Principled Treatment and Outcome Network (PTONet), to reduce bias during uplift model training. Experiments on both simulated and real-world datasets demonstrate that the PUC provides less biased evaluations, while PTONet outperforms existing methods. The source code is available at: https://github.com/euzmin/PUC." Telling Peer Direct Effects from Indirect Effects in Observational Network Data,Xiaojing Du Jiuyong Li Debo Cheng Lin Liu Wentao Gao XIONGREN CHEN Ziqi Xu,https://icml.cc/virtual/2025/poster/43930,"Estimating causal effects is crucial for decision-makers in many applications, but it is particularly challenging with observational network data due to peer interactions. Some algorithms have been proposed to estimate causal effects involving network data, particularly peer effects, but they often fail to tell apart diverse peer effects. To address this issue, we propose a general setting which considers both peer direct effects and peer indirect effects, and the effect of an individual's own treatment, and provide the identification conditions of these causal effects. To differentiate these effects, we leverage causal mediation analysis and tailor it specifically for network data. Furthermore, given the inherent challenges of accurately estimating effects in networked environments, we propose to incorporate attention mechanisms to capture the varying influences of different neighbors and to explore high-order neighbor effects using multi-layer graph neural networks (GNNs). Additionally, we employ the Hilbert-Schmidt Independence Criterion (HSIC) to further enhance the model’s robustness and accuracy. Extensive experiments on two semi-synthetic datasets derived from real-world networks and on a dataset from a recommendation system confirm the effectiveness of our approach.
Our findings have the potential to improve intervention strategies in networked systems, particularly in social networks and public health." Variational Counterfactual Intervention Planning to Achieve Target Outcomes,Xin Wang Shengfei Lyu Chi Luo Xiren Zhou Huanhuan Chen,https://icml.cc/virtual/2025/poster/44461,"A key challenge in personalized healthcare is identifying optimal intervention sequences to guide temporal systems toward target outcomes, a novel problem we formalize as counterfactual target achievement. In addressing this problem, directly adopting counterfactual estimation methods face compounding errors due to the unobservability of counterfactuals. To overcome this, we propose Variational Counterfactual Intervention Planning (VCIP), which reformulates the problem by modeling the conditional likelihood of achieving target outcomes, implemented through variational inference. By leveraging the g-formula to bridge the gap between interventional and observational log-likelihoods, VCIP enables reliable training from observational data. Experiments on both synthetic and real-world datasets show that VCIP significantly outperforms existing methods in target achievement accuracy." -Zero-Shot Learning of Causal Models,Divyat Mahajan Jannes Gladrow Agrin Hilmkil Cheng Zhang Meyer Scetbon,https://openreview.net/forum?id=S2XD3gIgzc, Clustering via Self-Supervised Diffusion,Roy Uziel Irit Chelly Oren Freifeld Ari Pakman,https://icml.cc/virtual/2025/poster/46196,"Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher–student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions." Heterogeneous Sufficient Dimension Reduction and Subspace Clustering,Lei Yan Xin Zhang Qing Mai,https://icml.cc/virtual/2025/poster/43791,"Scientific and engineering applications are often heterogeneous, making it beneficial to account for latent clusters or sub-populations when learning low-dimensional subspaces in supervised learning, and vice versa. In this paper, we combine the concept of subspace clustering with model-based sufficient dimension reduction and thus generalize the sufficient dimension reduction framework from homogeneous regression setting to heterogeneous data applications. In particular, we propose the mixture of principal fitted components (mixPFC) model, a novel framework that simultaneously achieves clustering, subspace estimation, and variable selection, providing a unified solution for high-dimensional heterogeneous data analysis. We develop a group Lasso penalized expectation-maximization (EM) algorithm and obtain its non-asymptotic convergence rate. Through extensive simulation studies, mixPFC demonstrates superior performance compared to existing methods across various settings. 
Applications to real world datasets further highlight its effectiveness and practical advantages." TANGO: Clustering with Typicality-Aware Nonlocal Mode-Seeking and Graph-Cut Optimization,Haowen Ma Zhiguo Long Hua Meng,https://icml.cc/virtual/2025/poster/43507,"Density-based mode-seeking methods generate a density-ascending dependency from low-density points towards higher-density neighbors.Current mode-seeking methods identify modes by breaking some dependency connections, but relying heavily on local data characteristics, requiring case-by-case threshold settings or human intervention to be effective for different datasets. To address this issue, we introduce a novel concept called typicality, by exploring the locally defined dependency from a global perspective, to quantify how confident a point would be a mode. We devise an algorithm that effectively and efficiently identifies modes with the help of the global-view typicality. To implement and validate our idea, we design a clustering method called TANGO, which not only leverages typicality to detect modes, but also utilizes graph-cut with an improved path-based similarity to aggregate data into the final clusters. Moreover, this paper also provides some theoretical analysis on the proposed algorithm. Experimental results on several synthetic and extensive real-world datasets demonstrate the effectiveness and superiority of TANGO. The code is available at https://github.com/SWJTU-ML/TANGO_code." @@ -2486,7 +2445,6 @@ WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales Efficient Quantification of Multimodal Interaction at Sample Level,Zequn Yang Hongfa Wang Di Hu,https://icml.cc/virtual/2025/poster/45816,"Interactions between modalities—redundancy, uniqueness, and synergy—collectively determine the composition of multimodal information. Understanding these interactions is crucial for analyzing information dynamics in multimodal systems, yet their accurate sample-level quantification presents significant theoretical and computational challenges. To address this, we introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable and measurable interaction.Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation, specifically tailored for sample-wise estimation in continuous distributions. Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling. The code is available at https://github.com/GeWu-Lab/LSMI_Estimator." Interaction-Aware Gaussian Weighting for Clustered Federated Learning,Alessandro Licciardi Davide Leo Eros Fanì Barbara Caputo Marco Ciccone,https://icml.cc/virtual/2025/poster/44632,"Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. 
However, conventional FL struggles with data heterogeneity and class imbalance, which degrade model performance. Clustered FL balances personalization and decentralized training by grouping clients with analogous data distributions, enabling improved accuracy while adhering to privacy constraints. This approach effectively mitigates the adverse impact of heterogeneity in FL. In this work, we propose a novel clustering method for FL, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution, allowing training of a more robust and personalized model on the identified clusters. FedGWC identifies homogeneous clusters by transforming individual empirical losses to model client interactions with a Gaussian reward mechanism. Additionally, we introduce the Wasserstein Adjusted Score, a new clustering metric for FL to evaluate cluster cohesion with respect to the individual class distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy, validating the efficacy of our approach." Online Differentially Private Conformal Prediction for Uncertainty Quantification,Qiangqiang Zhang Ting Li Xinwei Feng Xiaodong Yan Jinhan Xie,https://icml.cc/virtual/2025/poster/44617,"Traditional conformal prediction faces significant challenges with the rise of streaming data and increasing concerns over privacy. In this paper, we introduce a novel online differentially private conformal prediction framework, designed to construct dynamic, model-free private prediction sets. Unlike existing approaches that either disregard privacy or require full access to the entire dataset, our proposed method ensures individual privacy with a one-pass algorithm, ideal for real-time, privacy-preserving decision-making. Theoretically, we establish guarantees for long-run coverage at the nominal confidence level. Moreover, we extend our method to conformal quantile regression, which is fully adaptive to heteroscedasticity. We validate the effectiveness and applicability of the proposed method through comprehensive simulations and real-world studies on the ELEC2 and PAMAP2 datasets." -Robust Online Conformal Prediction under Uniform Label Noise,HuaJun Xi Kangdao Liu Hao Zeng Wenguang Sun Hongxin Wei,https://openreview.net/forum?id=UWXB0MJ43M, Scalable Sobolev IPM for Probability Measures on a Graph,Tam Le Truyen Nguyen Hideitsu Hino Kenji Fukumizu,https://icml.cc/virtual/2025/poster/45057,"We investigate the Sobolev IPM problem for probability measures supported on a graph metric space. Sobolev IPM is an important instance of integral probability metrics (IPM), and is obtained by constraining a critic function within a unit ball defined by the Sobolev norm. In particular, it has been used to compare probability measures and is crucial for several theoretical works in machine learning. However, to our knowledge, there are no efficient algorithmic approaches to compute Sobolev IPM effectively, which hinders its practical applications. In this work, we establish a relation between Sobolev norm and weighted $L^p$-norm, and leverage it to propose a *novel regularization* for Sobolev IPM. By exploiting the graph structure, we demonstrate that the regularized Sobolev IPM provides a *closed-form* expression for fast computation. This advancement addresses long-standing computational challenges, and paves the way to apply Sobolev IPM for practical applications, even in large-scale settings.
Additionally, the regularized Sobolev IPM is negative definite. Utilizing this property, we design positive-definite kernels upon the regularized Sobolev IPM, and provide preliminary evidences of their advantages for comparing probability measures on a given graph for document classification and topological data analysis." When to retrain a machine learning model,Florence Regol Leo Schwinn Kyle Sprague Mark Coates Thomas Markovich,https://icml.cc/virtual/2025/poster/44981,"A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs." An Expressive and Self-Adaptive Dynamical System for Efficient Function Learning,Chuan Liu Chunshu Wu Ruibing Song Ang Li Ying Nian Wu Tong Geng,https://icml.cc/virtual/2025/poster/45640,"Function learning forms the foundation of numerous scientific and engineering tasks. While modern machine learning (ML) methods model complex functions effectively, their escalating complexity and computational demands pose challenges to efficient deployment. In contrast, natural dynamical systems exhibit remarkable computational efficiency in representing and solving complex functions. However, existing dynamical system approaches are limited by low expressivity and inefficient training. To this end, we propose EADS, an Expressive and self-Adaptive Dynamical System capable of accurately learning a wide spectrum of functions with extraordinary efficiency. Specifically, (1) drawing inspiration from biological dynamical systems, we integrate hierarchical architectures and heterogeneous dynamics into EADS, significantly enhancing its capacity to represent complex functions. (2) We propose an efficient on-device training method that leverages intrinsic electrical signals to update parameters, making EADS self-adaptive at negligible cost. 
Experimental results across diverse domains demonstrate that EADS achieves higher accuracy than existing works, while offering orders-of-magnitude speedups and energy efficiency over traditional neural network solutions on GPUs for both inference and training, showcasing its broader impact in overcoming computational bottlenecks across various fields." @@ -2507,7 +2465,6 @@ Rethinking Point Cloud Data Augmentation: Topologically Consistent Deformation,J ELMO : Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces,Jinbin Zhang Nasib Ullah Erik Schultheis Rohit Babbar,https://icml.cc/virtual/2025/poster/44648,"Large output spaces, also referred to as Extreme multilabel classification (XMC), is a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by the number of labels ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable, and inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations---gradient fusion and chunking---enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7GiB required by the optimized SOTA method, Renee without compromising accuracy." Improved Coresets for Vertical Federated Learning: Regularized Linear and Logistic Regressions,Supratim Shit Gurmehak kaur chadha Surendra kumar Bapi Chatterjee,https://icml.cc/virtual/2025/poster/43903,"Coreset, as a summary of training data, offers an efficient approach for reducing data processing and storage complexity during training. In the emerging vertical federated learning (VFL) setting, where scattered clients store different data features, it directly reduces communication complexity. In this work, we introduce coresets construction for regularized logistic regression both in centralized and VFL settings. Additionally, we improve the coreset size for regularized linear regression in the VFL setting. We also eliminate the dependency of the coreset size on a property of the data due to the VFL setting. The improvement in the coreset sizes is due to our novel coreset construction algorithms that capture the reduced model complexity due to the added regularization and its subsequent analysis. In experiments, we provide extensive empirical evaluation that backs our theoretical claims. We also report the performance of our coresets by comparing the models trained on the complete data and on the coreset." 
Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence,İlker Işık Ramazan Gokberk Cinbis Ebru Aydin Gol,https://icml.cc/virtual/2025/poster/46588,"Language models lack the notion of interchangeable tokens: symbols that are semantically equivalent yet distinct, such as bound variables in formal logic. This limitation prevents generalization to larger vocabularies and hinders the model's ability to recognize alpha-equivalence, where renaming bound variables preserves meaning. We formalize this machine learning problem and introduce alpha-covariance, a metric for evaluating robustness to such transformations. To tackle this task, we propose a dual-part token embedding strategy: a shared component ensures semantic consistency, while a randomized component maintains token distinguishability. Compared to a baseline that relies on alpha-renaming for data augmentation, our approach demonstrates improved generalization to unseen tokens in linear temporal logic solving, propositional logic assignment prediction, and copying with an extendable vocabulary, while introducing a favorable inductive bias for alpha-equivalence. Our findings establish a foundation for designing language models that can learn interchangeable token representations, a crucial step toward more flexible and systematic reasoning in formal domains. Our code and project page are available at https://necrashter.github.io/interchangeable-token-embeddings" -Scaling Embedding Layers in Language Models,Da Yu Edith Cohen Badih Ghazi Yangsibo Huang Pritish Kamath Ravi Kumar Daogao Liu Chiyuan Zhang,https://openreview.net/forum?id=ZjrId3p45T, SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization,Runsheng Bai Bo Liu qiang liu,https://icml.cc/virtual/2025/poster/45671,"Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called \textbf{SKIM}: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A \textit{greedy algorithm} to solve approximately optimal bit allocation across weight channels, and 2. A \textit{trainable scaling vector} for non-differentiable K-means clustering. These techniques substantially improve the model performance and can be adapted to any given bit. Notably, in terms of perplexity, our method narrows the gap between quantized LLaMA models and their full precision counterparts by around \textbf{14\%} on average." Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging,Pierre Ablin Angelos Katharopoulos Skyler Seto David Grangier,https://icml.cc/virtual/2025/poster/45553,"Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances.We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. 
Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights.To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights.We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly.Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a size constraint." Geometric Contact Flows: Contactomorphisms for Dynamics and Control,Andrea Testa Søren Hauberg Tamim Asfour Leonel Rozo,https://icml.cc/virtual/2025/poster/43700,"Accurately modeling and predicting complex dynamical systems, particularly those involving force exchange and dissipation, is crucial for applications ranging from fluid dynamics to robotics, but presents significant challenges due to the intricate interplay of geometric constraints and energy transfer. This paper introduces Geometric Contact Flows (GFC), a novel framework leveraging Riemannian and Contact geometry as inductive biases to learn such systems. GCF constructs a latent contact Hamiltonian model encoding desirable properties like stability or energy conservation. An ensemble of contactomorphisms then adapts this model to the target dynamics while preserving these properties. This ensemble allows for uncertainty-aware geodesics that attract the system’s behavior toward the data support, enabling robust generalization and adaptation to unseen scenarios. Experiments on learning dynamics for physical systems and for controlling robots on interaction tasks demonstrate the effectiveness of our approach." @@ -2525,7 +2482,6 @@ DPCore: Dynamic Prompt Coreset for Continual Test-Time Adaptation,Yunbei Zhang A iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection,Huahui Yi Wei Xu Ziyuan Qin Xi Chen Xiaohu Wu Kang Li Qicheng Lao,https://icml.cc/virtual/2025/poster/44405,"Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks; however, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the iDPA framework, which comprises two main components: 1) Instance-level Prompt Generation (IPG), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (DPA), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as ODinM-13, and experiments demonstrate that iDPA outperforms existing SOTA methods, with FAP improvements of f 5.44%, 4.83%, 12.88%, and 4.59% in full data, 1-shot, 10-shot, and 50-shot settings, respectively." 
Knowledge-Guided Wasserstein Distributionally Robust Optimization,Zitao Wang Ziyuan Wang Molei Liu Nian Si,https://icml.cc/virtual/2025/poster/43697,"Wasserstein Distributionally Robust Optimization (WDRO) is a principled framework for robust estimation under distributional uncertainty. However, its standard formulation can be overly conservative, particularly in small-sample regimes. We propose a novel knowledge-guided WDRO (KG-WDRO) framework for transfer learning, which adaptively incorporates multiple sources of external knowledge to improve generalization accuracy. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning." Learning Time-Aware Causal Representation for Model Generalization in Evolving Domains,Zhuo He Shuang Li Wenze Song Longhui Yuan Jian Liang Han Li Kun Gai,https://icml.cc/virtual/2025/poster/45628,"Endowing deep models with the ability to generalize in dynamic scenarios is of vital significance for real-world deployment, given the continuous and complex changes in data distribution. Recently, evolving domain generalization (EDG) has emerged to address distribution shifts over time, aiming to capture evolving patterns for improved model generalization. However, existing EDG methods may suffer from spurious correlations by modeling only the dependence between data and targets across domains, creating a shortcut between task-irrelevant factors and the target, which hinders generalization. To this end, we design a time-aware structural causal model (SCM) that incorporates dynamic causal factors and the causal mechanism drifts, and propose Static-DYNamic Causal Representation Learning (SYNC), an approach that effectively learns time-aware causal representations. Specifically, it integrates specially designed information-theoretic objectives into a sequential VAE framework which captures evolving patterns, and produces the desired representations by preserving intra-class compactness of causal factors both across and within domains. Moreover, we theoretically show that our method can yield the optimal causal predictor for each time domain. Results on both synthetic and real-world datasets exhibit that SYNC can achieve superior temporal generalization performance."
-Learning to Plan with Personalized Preferences,Manjie Xu Xinyi Yang Wei Liang Chi Zhang Yixin Zhu,https://openreview.net/forum?id=r53lwSSfcI, Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent,Yongxian Wei Anke Tang Li Shen Zixuan Hu Chun Yuan Xiaochun Cao,https://icml.cc/virtual/2025/poster/45923,"Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental target of model merging: the merged model performs as closely as possible to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($\textit{i.e.}$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $\textit{data-free}$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $\textit{shared subspace}$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains." Parameter-Efficient Fine-Tuning of State Space Models,Kevin Galim Wonjun Kang Yuchen Zeng Hyung Il Koo Kangwook Lee,https://icml.cc/virtual/2025/poster/46398,"Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have become powerful tools for language modeling, offering high performance and linear scalability with sequence length. However, the application of parameter-efficient fine-tuning (PEFT) methods to SSM-based models remains largely underexplored. We start by investigating two fundamental questions on existing PEFT methods: (i) How do they perform on SSM-based models? (ii) Which parameters should they target for optimal results? Our analysis shows that LoRA and its variants consistently outperform all other PEFT methods. While LoRA is effective for linear projection matrices, it fails on SSM modules—yet still outperforms other methods applicable to SSMs, indicating their limitations. This underscores the need for a specialized SSM tuning approach. To address this, we propose Sparse Dimension Tuning (SDT), a PEFT method tailored for SSM modules. Combining SDT for SSMs with LoRA for linear projection matrices, we achieve state-of-the-art performance across extensive experiments." PTTA: Purifying Malicious Samples for Test-Time Model Adaptation,Jing Ma Hanlin Li Xiang Xiang,https://icml.cc/virtual/2025/poster/45190,"Test-Time Adaptation (TTA) enables deep neural networks to adapt to arbitrary distributions during inference. Existing TTA algorithms generally tend to select benign samples that help achieve robust online prediction and stable self-training. Although malicious samples that would undermine the model's optimization should be filtered out, it also leads to a waste of test data. 
To alleviate this issue, we focus on how to make full use of the malicious test samples for TTA by transforming them into benign ones, and propose a plug-and-play method, PTTA. The core of our solution lies in the purification strategy, which retrieves benign samples having opposite effects on the objective function to perform Mixup with malicious samples, based on a saliency indicator for encoding benign and malicious data. This strategy results in effective utilization of the information in malicious samples and an improvement of the models' online test accuracy. In this way, we can directly apply the purification loss to existing TTA algorithms without the need to carefully adjust the sample selection threshold. Extensive experiments on four types of TTA tasks as well as classification, segmentation, and adversarial defense demonstrate the effectiveness of our method. Code is available at https://github.com/HAIV-Lab/PTTA." @@ -2539,8 +2495,6 @@ A Theoretical Framework For Overfitting In Energy-based Modeling,Giovanni Catani Expressive Score-Based Priors for Distribution Matching with Geometry-Preserving Regularization,Ziyu Gong Jim Lim David I. Inouye,https://icml.cc/virtual/2025/poster/45058,"Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Non-parametric DM methods struggle with scalability and adversarial DM approaches suffer from instability and mode collapse.While likelihood-based methods are a promising alternative, they often impose unnecessary biases through fixed priors or require explicit density models (e.g., flows) that can be challenging to train.We address this limitation by introducing a novel approach to training likelihood-based DM using expressive score-based prior distributions.Our key insight is that gradient-based DM training only requires the prior's score function---not its density---allowing us to train the prior via denoising score matching. This approach eliminates biases from fixed priors (e.g., in VAEs), enabling more effective use of geometry-preserving regularization, while avoiding the challenge of learning an explicit prior density model (e.g., a flow-based prior). Our method also demonstrates better stability and computational efficiency compared to other diffusion-based priors (e.g., LSGM). Furthermore, experiments demonstrate superior performance across multiple tasks, establishing our score-based method as a stable and effective approach to distribution matching. Source code available at https://github.com/inouye-lab/SAUB." The Noisy Laplacian: a Threshold Phenomenon for Non-Linear Dimension Reduction,Alex Kokot Octavian-Vlad Murad Marina Meila,https://icml.cc/virtual/2025/poster/45836,"In this paper, we clarify the effect of noise on common spectrallymotivated algorithms such as Diffusion Maps (DM) for dimensionreduction. Empirically, these methods are much more robust to noisethan current work suggests. Specifically, existing consistency resultsrequire that either the noise amplitude or dimensionality must varywith the sample size $n$. We provide new theoretical resultsdemonstrating that low-frequency eigenpairs reliably capture thegeometry of the underlying manifold under a constant noise level, up to a dimension independent threshold $O(r^{-2})$, where $r$ is the noise amplitude. 
Our results rely on a decomposition of the manifold Laplacian in the Sasaki metric, a technique not used before in this area, to our knowledge. We experimentally validate our theoretical predictions. Additionally, we observe similar robust behavior for other manifold learning algorithms which are not based on computing the Laplacian, namely LTSA and VAE." Unsupervised Learning for Class Distribution Mismatch,Pan Du Wangbo Zhao Xinai Lu Nian Liu Zhikai Li Chaoyu Gong Suyun Zhao Hong Chen Cuiping Li Kai Wang Yang You,https://icml.cc/virtual/2025/poster/44666,"Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an ""other"" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM’s superiority over previous semi-supervised methods. Specifically, with a 60\% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes." -Using Unsupervised Dynamic Feature Selection to Enhance Latent Representations,Bruno Corcuera Sánchez Carlos Eiras-Franco Brais Cancela,https://openreview.net/forum?id=3g6ktAQn32, -A Proximal Operator for Inducing 2:4-Sparsity,Jonas M. Kübler Yu-Xiang Wang Shoham Sabach Navid Ansari Matthäus Kleindessner Kailash Budhathoki Volkan Cevher George Karypis,https://openreview.net/forum?id=jFC8SS8kWU, Aligned Multi Objective Optimization,Yonathan Efroni Ben Kretzu Daniel Jiang Jalaj Bhandari Zheqing Zhu Karen Ullrich,https://icml.cc/virtual/2025/poster/45445,"To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance tradeoffs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLM training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, such a phenomenon has not been examined from an optimization perspective. This leads to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of its superior performance compared to naive approaches." Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster,Sharan Vaswani Reza Babanezhad Harikandeh,https://icml.cc/virtual/2025/poster/45597,"Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD).
For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant $L$ and adapts to the ``local'' smoothness, enabling GD to converge faster. Existing theoretical analyses show that GD with Armijo-LS ($\texttt{GD-LS}$) can result in constant factor improvements over GD with a $1/L$ step-size (denoted as $\texttt{GD(1/L)}$). We strengthen these results and show that if the objective function satisfies a certain non-uniform smoothness condition, $\texttt{GD-LS}$ can result in a faster convergence rate than $\texttt{GD(1/L)}$. In particular, we prove that for convex objectives corresponding to logistic regression and multi-class classification, $\texttt{GD-LS}$ can converge to the optimum at a linear rate, and hence improves over the sublinear convergence of $\texttt{GD(1/L)}$. Furthermore, for non-convex objectives satisfying gradient domination (e.g., those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), $\texttt{GD-LS}$ can match the fast convergence of algorithms tailored for these specific settings. Finally, we prove that under the interpolation assumption, for convex losses, stochastic GD with a stochastic line-search can match the fast convergence of $\texttt{GD-LS}$." Automatic Differentiation of Optimization Algorithms with Time-Varying Updates,Sheheryar Mehmood Peter Ochs,https://icml.cc/virtual/2025/poster/43618,"Numerous optimization algorithms have a time-varying update rule thanks to, for instance, a changing step size, momentum parameter or, Hessian approximation. Often, such algorithms are used as solvers for the lower-level problem in bilevel optimization, and are unrolled when computing the gradient of the upper-level objective. In this paper, we apply unrolled or automatic differentiation to a time-varying iterative process and provide convergence (rate) guarantees for the resulting derivative iterates. We then adapt these convergence results and apply them to proximal gradient descent with variable step size and FISTA when solving partly-smooth problems. We test the convergence (rates) of these algorithms numerically through several experiments. Our theoretical and numerical results show that the convergence rate of the algorithm is reflected in its derivative iterates." @@ -2558,19 +2512,16 @@ SeedLoRA: A Fusion Approach to Efficient LLM Fine-Tuning,Yong Liu Di Fu Shenggan Triple-Optimistic Learning for Stochastic Contextual Bandits with General Constraints,Hengquan Guo Lingkai Zu Xin Liu,https://icml.cc/virtual/2025/poster/45489,"We study contextual bandits with general constraints, where a learner observes contexts and aims to maximize cumulative rewards while satisfying a wide range of general constraints.We introduce the Optimistic$^3$ framework, a novel learning and decision-making approach that integrates optimistic design into parameter learning, primal decision, and dual violation adaptation (i.e., triple-optimism), combined with an efficient primal-dual architecture. Optimistic$^3$ achieves $\tilde{O}(\sqrt{T})$ regret and constraint violation for contextual bandits with general constraints. This framework not only outperforms the state-of-the-art results that achieve $\tilde{O}(T^{\frac{3}{4}})$ guarantees when Slater's condition does not hold but also improves on previous results that achieve $\tilde{O}(\sqrt{T}/\delta)$ when Slater's condition holds ($\delta$ denotes the Slater's condition parameter), offering a $O(1/\delta)$ improvement. 
Note this improvement is significant because $\delta$ can be arbitrarily small when constraints are particularly challenging. Moreover, we show that Optimistic$^3$ can be extended to classical multi-armed bandits with both stochastic and adversarial constraints, recovering the best-of-both-worlds guarantee established in the state-of-the-art works, but with significantly less computational overhead." Linear convergence of Sinkhorn's algorithm for generalized static Schrödinger bridge,Rahul Choudhary Hanbaek Lyu,https://icml.cc/virtual/2025/poster/46671,"The classical static Schrödinger Bridge (SSB) problem, which seeks the most likely stochastic evolution between two marginal probability measures, has been studied extensively in the optimal transport and statistical physics communities, and more recently in machine learning communities in the surge of generative models. The standard approach to solve SSB is to first identify its Kantorovich dual and use Sinkhorn's algorithm to find the optimal potential functions. While the original SSB is only a strictly convex minimization problem, this approach is known to warrant linear convergence under mild assumptions. In this work, we consider a generalized SSB allowing any strictly increasing divergence functional, far generalizing the entropy functional $x\log x$ in the standard SSB. This problem naturally arises in a wide range of seemingly unrelated problems in entropic optimal transport, random graphs/matrices, and combinatorics. We establish Kantorovich duality and linear convergence of Sinkhorn's algorithm for the generalized SSB problem under mild conditions. Our results provide a new rigorous foundation for understanding Sinkhorn-type iterative methods in the context of large-scale generalized Schrödinger bridges." Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent,Ya-Chi Chu Wenzhi Gao Yinyu Ye Madeleine Udell,https://icml.cc/virtual/2025/poster/45486,"This paper investigates the convergence properties of the hypergradient descent method ($\texttt{HDM}$), a 25-year-old heuristic originally proposed for adaptive stepsize selection in stochastic first-order methods. We provide the first rigorous convergence analysis of $\texttt{HDM}$ using the online learning framework and apply this analysis to develop new state-of-the-art adaptive gradient methods with empirical and theoretical support. Notably, $\texttt{HDM}$ automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instability of $\texttt{HDM}$ reported in the literature and proposes efficient strategies to address it. We also develop two $\texttt{HDM}$ variants with heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show $\texttt{HDM}$ with heavy-ball momentum ($\texttt{HDM-HB}$) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, $\texttt{HDM-HB}$ often matches the performance of $\texttt{L-BFGS}$, an efficient and practical quasi-Newton method, using less memory and cheaper iterations."
-Adaptive Constrained Optimization for Neural Vehicle Routing,Chengrui Gao Haopu Shang Yuyang Jiang Ke Xue Chao Qian,https://openreview.net/forum?id=cCIQSqhuo2, Hybrid Quantum-Classical Multi-Agent Pathfinding,Thore Gerlach Loong Kuan Lee Frederic BARBARESCO Nico Piatkowski,https://icml.cc/virtual/2025/poster/45635,"Multi-Agent Path Finding (MAPF) focuses on determining conflict-free paths for multiple agents navigating through a shared space to reach specified goal locations. This problem becomes computationally challenging, particularly when handling large numbers of agents, as frequently encountered in practical applications like coordinating autonomous vehicles. Quantum Computing (QC) is a promising candidate in overcoming such limits. However, current quantum hardware is still in its infancy and thus limited in terms of computing power and error robustness. In this work, we present the first optimal hybrid quantum-classical MAPF algorithms which are based on branch-and-cut-and-price. QC is integrated by iteratively solving QUBO problems, based on conflict graphs. Experiments on actual quantum hardware and results on benchmark data suggest that our approach dominates previous QUBO formulations and state-of-the-art MAPF solvers." Self-Supervised Transformers as Iterative Solution Improvers for Constraint Satisfaction,Yudong Xu Wenhao Li Scott Sanner Elias Boutros Khalil,https://icml.cc/virtual/2025/poster/45737,"We present a Transformer-based framework for Constraint Satisfaction Problems (CSPs). CSPs find use in many applications and thus accelerating their solution with machine learning is of wide interest. Most existing approaches rely on supervised learning from feasible solutions or reinforcement learning, paradigms that require either feasible solutions to these NP-Complete CSPs or large training budgets and a complex expert-designed reward signal. To address these challenges, we propose ConsFormer, a self-supervised framework that leverages a Transformer as a solution refiner. ConsFormer constructs a solution to a CSP iteratively in a process that mimics local search. Instead of using feasible solutions as labeled data, we devise differentiable approximations to the discrete constraints of a CSP to guide model training. Our model is trained to improve random assignments for a single step but is deployed iteratively at test time, circumventing the bottlenecks of supervised and reinforcement learning. Experiments on Sudoku, Graph Coloring, Nurse Rostering, and MAXCUT demonstrate that our method can tackle out-of-distribution CSPs simply through additional iterations." SHIELD: Multi-task Multi-distribution Vehicle Routing Solver with Sparsity and Hierarchy,Yong Liang Goh Zhiguang Cao Yining Ma Jianan Zhou Mohammed Haroon Dupty Wee Sun Lee,https://icml.cc/virtual/2025/poster/46391,"Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity.
This improves both efficiency and generalization by allowing the model to dynamically select nodes to use or skip each decoder layer, providing the needed capacity to adaptively allocate computation for learning the task/distribution specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each." Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning,Jungtaek Kim,https://icml.cc/virtual/2025/poster/46666,"Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of efficiently finding a global optimum of an expensive-to-evaluate black-box function. In general, a probabilistic regression model is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based methods, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close and relatively far to a global optimum. Developing this line of research further, supervised classifiers are employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy are prone to be overconfident for known knowledge on global solution candidates. Supposing that we have access to unlabeled points, e.g., predefined fixed-size pools, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning to solve this challenge. Finally, we show the empirical results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool, and analyze the validity of our methods in diverse experiments." -Online Uniform Sampling: Randomized Learning-Augmented Approximation Algorithms with Application to Digital Health,Xueqing Liu Kyra Gan Esmaeil Keyvanshokooh Susan Murphy,https://openreview.net/forum?id=T2HxvNtDfJ, Achieving Linear Speedup and Near-Optimal Complexity for Decentralized Optimization over Row-stochastic Networks,Liyuan Liang Xinyi Chen Gan Luo Kun Yuan,https://icml.cc/virtual/2025/poster/45127,"A key challenge in decentralized optimization is determining the optimal convergence rate and designing algorithms to achieve it. While this problem has been extensively addressed for doubly-stochastic and column-stochastic mixing matrices, the row-stochastic scenario remains unexplored. This paper bridges this gap by introducing effective metrics to capture the influence of row-stochastic mixing matrices and establishing the first convergence lower bound for decentralized learning over row-stochastic networks. However, existing algorithms fail to attain this lower bound due to two key issues: deviation in the descent direction caused by the adapted gradient tracking (GT) and instability introduced by the Pull-Diag protocol. 
To address descent deviation, we propose a novel analysis framework demonstrating that Pull-Diag-GT achieves linear speedup—the first such result for row-stochastic decentralized optimization. Moreover, by incorporating a multi-step gossip (MG) protocol, we resolve the instability issue and attain the lower bound, achieving near-optimal complexity for decentralized optimization over row-stochastic networks." Constant Stepsize Local GD for Logistic Regression: Acceleration by Instability,Michael Crawshaw Blake Woodworth Mingrui Liu,https://icml.cc/virtual/2025/poster/43864,"Existing analysis of Local (Stochastic) Gradient Descent for heterogeneous objectives requires stepsizes $\eta \leq 1/K$ where $K$ is the communication interval, which ensures monotonic decrease of the objective. In contrast, we analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $\eta > 0$. With $R$ communication rounds and $M$ clients, we show convergence at a rate $\mathcal{O}(1/\eta K R)$ after an initial unstable phase lasting for $\widetilde{\mathcal{O}}(\eta K M)$ rounds. This improves upon the existing $\mathcal{O}(1/R)$ rate for general smooth, convex objectives. Our analysis parallels the single machine analysis of Wu et al. (2024) in which instability is caused by extremely large stepsizes, but in our setting another source of instability is large local updates with heterogeneous objectives." Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs,YOUHE JIANG Fangcheng Fu Xiaozhe Yao Guoliang HE Xupeng Miao Ana Klimovic Bin CUI Binhang Yuan Eiko Yoneki,https://icml.cc/virtual/2025/poster/43564,"Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied with varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study about serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. Particularly, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via mixed-integer linear programming, aiming at deducing the most cost-efficient serving plan under the constraints of price budget and real-time GPU availability. Remarkably, our approach effectively outperforms homogeneous and heterogeneous baselines under a wide array of scenarios, covering diverse workload traces, varying GPU availablilities, and multi-model serving. This casts new light on more accessible and efficient LLM serving over heterogeneous cloud resources." Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging,Junkang Liu Yuanyuan Liu Fanhua Shang Hongying Liu Jin Liu Wei Feng,https://icml.cc/virtual/2025/poster/45760,"For federated learning (FL) algorithms such as FedSAM, their generalization capability is crucial for real-word applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. 
We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of their counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts." QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration,HamidReza Imani Jiaxin Peng Peiman Mohseni Abdolah Amirany Tarek El-Ghazawi,https://icml.cc/virtual/2025/poster/44489,"The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands. These challenges become even more pronounced in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a serving system that employs \textit{similarity-based expert consolidation} to reduce the overall memory footprint by sharing similar experts across models. To ensure output quality, we introduce \textit{runtime partial reconfiguration}, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves competitive output quality while maintaining throughput comparable to serving a single model, and incurs only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85\% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG). Furthermore, experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness." Scalable First-order Method for Certifying Optimal k-Sparse GLMs,Jiachang Liu Soroosh Shafiee Andrea Lodi,https://icml.cc/virtual/2025/poster/46512,"This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an $\ell_0$ cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. 
Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed exactly in log-linear time complexity, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems." -Smoothed Normalization for Efficient Distributed Private Optimization,Egor Shulgin Sarit Khirirat Peter Richtárik,https://openreview.net/forum?id=F3hjbhyyRI, The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning,Shiwei Li Xiandi Luo Haozhao Wang Xing Tang Shijie Xu weihongluo Yuhua Li xiuqiang He Ruixuan Li,https://icml.cc/virtual/2025/poster/44777,"To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Update Decomposition (MUD), Block-wise Kronecker Decomposition (BKD), and Aggregation-Aware Decomposition (AAD), each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed MUD. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods. The code is available at https://github.com/Leopold1423/fedmud-icml25." Contextual Optimization Under Model Misspecification: A Tractable and Generalizable Approach,Omar Bennouna Jiawei Zhang Saurabh Amin Asuman E. Ozdaglar,https://icml.cc/virtual/2025/poster/44602,"Contextual optimization problems are prevalent in decision-making applications where historical data and contextual features are used to learn predictive models that inform optimal actions. However, practical applications often suffer from model misspecification due to incomplete knowledge of the underlying data-generating process, leading to suboptimal decisions. Existing approaches primarily address the well-specified case, leaving a critical gap in handling misspecified models. In this paper, we propose a novel Integrated Learning and Optimization (ILO) framework that explicitly accounts for model misspecification by introducing a tractable surrogate loss function with strong theoretical guarantees on generalizability, tractability, and optimality. Our surrogate loss aligns with the true decision performance objective, ensuring robustness to misspecification without imposing restrictive assumptions. The proposed approach effectively mitigates the challenges of non-convexity and non-smoothness in the target loss function, leading to efficient optimization procedures. We provide rigorous theoretical analysis and experimental validation, demonstrating superior performance compared to state-of-the-art methods. 
Our work offers a principled solution to the practically relevant challenge of model misspecification in contextual optimization." Guarantees of a Preconditioned Subgradient Algorithm for Overparameterized Asymmetric Low-rank Matrix Recovery,Paris Giampouras HanQin Cai Rene Vidal,https://icml.cc/virtual/2025/poster/45824,"In this paper, we focus on a matrix factorization-based approach for robust recovery of low-rank asymmetric matrices from corrupted measurements. We propose an Overparameterized Preconditioned Subgradient Algorithm (OPSA) and provide, for the first time in the literature, linear convergence rates independent of the rank of the sought asymmetric matrix in the presence of gross corruptions. Our work goes beyond existing results in preconditioned-type approaches addressing their current limitation, i.e., the lack of convergence guarantees in the case of asymmetric matrices of unknown rank. By applying our approach to (robust) matrix sensing, we highlight its merits when the measurement operator satisfies a mixed-norm restricted isometry property. Lastly, we present extensive numerical experiments that validate our theoretical results and demonstrate the effectiveness of our approach for different levels of overparameterization and corruption from outliers." @@ -2620,7 +2571,6 @@ Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Re Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning,Yuhui Wang Qingyuan Wu Dylan R. Ashley Francesco Faccio Weida Li Chao Huang Jürgen Schmidhuber,https://icml.cc/virtual/2025/poster/44887,"The Value Iteration Network (VIN) is an end-to-end differentiable neural network architecture for planning. It exhibits strong generalization to unseen domains by incorporating a differentiable planning module that operates on a latent Markov Decision Process (MDP). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze---a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introduce an ""adaptive highway loss"" that constructs skip connections to improve gradient flow. We evaluate our method on 2D/3D maze navigation environments, continuous control, and the real-world Lunar rover navigation task. We find that our new method, named Dynamic Transition VIN (DT-VIN), scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in complex environments." Strategic Planning: A Top-Down Approach to Option Generation,Max Ruiz Luyten Antonin Berthon Mihaela van der Schaar,https://icml.cc/virtual/2025/poster/43567,"Real-world human decision-making often relies on strategic planning, wherehigh-levelgoals guide the formulation of sub-goals and subsequent actions, as evidenced by domains such as healthcare, business, and urban policy. Despite notable successes in controlled settings, conventional reinforcement learning (RL) follows abottom-upparadigm, which can struggle to adapt to real-world complexities such as sparse rewards and limited exploration budgets. 
While methods like hierarchical RL and environment shaping provide partial solutions, they frequently rely on either ad-hoc designs (e.g., choosing the set of high-level actions) or purely data-driven discovery of high-level actions that still requires significant exploration. In this paper, we introduce a top-down framework for RL that explicitly leverages human-like strategy to reduce sample complexity, guide exploration, and enable high-level decision-making. We first formalize the Strategy Problem, which frames policy generation as finding distributions over policies that balance specificity and value. Building on this definition, we propose the Strategist agent, an iterative framework that leverages large language models (LLMs) to synthesize domain knowledge into a structured representation of actionable strategies and sub-goals. We further develop a reward shaping methodology that translates these strategies expressed in natural language into quantitative feedback for RL methods. Empirically, we demonstrate significantly faster convergence than conventional PPO. Taken together, our findings highlight that top-down strategic exploration opens new avenues for enhancing RL on real-world decision problems." Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales,Ju-Seung Byun Andrew Perrault,https://icml.cc/virtual/2025/poster/44897,"Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization." -Uncertainty-aware Preference Alignment for Diffusion Policies,Runqing Miao Sheng Xu Wai Kin Victor Chan Guiliang Liu,https://openreview.net/forum?id=zKbQUkh6qe, Wasserstein Policy Optimization,David Pfau Ian Davies Diana L Borsa João Guilherme Madeira Araújo Brendan Daniel Tracey Hado van Hasselt,https://icml.cc/virtual/2025/poster/44075,"We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update.
The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods." Zero-Shot Offline Imitation Learning via Optimal Transport,Thomas Rupf Marco Bagatella Nico Gürtler Jonas Frey Georg Martius,https://icml.cc/virtual/2025/poster/46209,"Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks. The code is available at https://github.com/martius-lab/zilot." C2IQL: Constraint-Conditioned Implicit Q-learning for Safe Offline Reinforcement Learning,Zifan LIU Xinran Li Jun Zhang,https://icml.cc/virtual/2025/poster/46250,"Safe offline reinforcement learning aims to develop policies that maximize cumulative rewards while satisfying safety constraints without the need for risky online interaction. However, existing methods often struggle with the out-of-distribution (OOD) problem, leading to potentially unsafe and suboptimal policies. To address this issue, we first propose Constrained Implicit Q-learning (CIQL), a novel algorithm designed to avoid the OOD problem. In particular, CIQL expands the implicit update of reward value functions to constrained settings and then estimates cost value functions under the same implicit policy. Despite its advantages, the further performance improvement of CIQL is still hindered by the inaccurate discounted approximations of constraints. Thus, we further propose Constraint-Conditioned Implicit Q-learning (C2IQL). Building upon CIQL, C2IQL employs a cost reconstruction model to derive non-discounted cumulative costs from discounted values and incorporates a flexible, constraint-conditioned mechanism to accommodate dynamic safety constraints. Experiment results on DSRL benchmarks demonstrate the superiority of C2IQL compared to baseline methods in achieving higher rewards while guaranteeing safety constraints under different threshold conditions."
@@ -2629,13 +2579,11 @@ Latent Action Learning Requires Supervision in the Presence of Distractors,Alexa Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning,Dongsu Lee Minhae Kwon,https://icml.cc/virtual/2025/poster/44612,"The goal of offline reinforcement learning (RL) is to extract the best possible policy from the previously collected dataset considering theout-of-distribution(OOD) sample issue. Offline model-based RL (MBRL) is a captivating solution capable of alleviating such issues through a \textit{state-action transition augmentation} with a learned dynamic model. Unfortunately, offline MBRL methods have been observed to fail in sparse rewarded and long-horizon environments for a long time. In this work, we propose a novel MBRL method, dubbed Temporal Distance-Aware Transition Augmentation (TempDATA), that generates additional transitions in a geometrically structured representation space, instead of state space. For comprehending long-horizon behaviors efficiently, our main idea is to learn state abstraction, which captures atemporal distancefrom bothtrajectory and transition levelsof state space. Our experiments empirically confirm that TempDATA outperforms previous offline MBRL methods and achieves matching or surpassing the performance of diffusion-based trajectory augmentation and goal-conditioned RL on the D4RL AntMaze, FrankaKitchen, CALVIN, and pixel-based FrankaKitchen." Vintix: Action Model via In-Context Reinforcement Learning,Andrei Polubarov Lyubaykin Nikita Alexander Derevyagin Ilya Zisman Denis Tarasov Alexander Nikulin Vladislav Kurenkov,https://icml.cc/virtual/2025/poster/44459,"In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems." Efficient Online Reinforcement Learning for Diffusion Policy,Haitong Ma Tianyi Chen Kai Wang Na Li Bo Dai,https://icml.cc/virtual/2025/poster/46396,"Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating policy gradient through the diffusion process incurs huge computational costs and instability, thus being expensive and not scalable. To enable efficient training of diffusion policies in online RL, we generalize the conventional denoising score matching by reweighting the loss function. 
The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RLs on most tasks, and the DPMD improves more than 120% over soft actor-critic on Humanoid and Ant." -Hadamard Representations: Augmenting Hyperbolic Tangents in RL,Jacob Eeuwe Kooi Mark Hoogendoorn Vincent Francois-Lavet,https://openreview.net/forum?id=ZCcIah9IZo, Hierarchical Reinforcement Learning with Uncertainty-Guided Diffusional Subgoals,Vivienne Huiling Wang Tinghuai Wang Joni Pajarinen,https://icml.cc/virtual/2025/poster/46632,"Hierarchical reinforcement learning (HRL) learns to make decisions on multiple levels of temporal abstraction. A key challenge in HRL is that the low-level policy changes over time, making it difficult for the high-level policy to generate effective subgoals. To address this issue, the high-level policy must capture a complex subgoal distribution while also accounting for uncertainty in its estimates. We propose an approach that trains a conditional diffusion model regularized by a Gaussian Process (GP) prior to generate a complex variety of subgoals while leveraging principled GP uncertainty quantification. Building on this framework, we develop a strategy that selects subgoals from both the diffusion policy and GP's predictive mean. Our approach outperforms prior HRL methods in both sample efficiency and performance on challenging continuous control benchmarks." Maximum Entropy Reinforcement Learning with Diffusion Policy,Xiaoyi Dong Jian Cheng Xi Sheryl Zhang,https://icml.cc/virtual/2025/poster/46039,"The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP." 
PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning,Dongchi Huang Jiaqi WANG Yang Li Chunhe Xia Tianle Zhang Kaige Zhang,https://icml.cc/virtual/2025/poster/44134,"Partial observability presents a significant challenge for safe reinforcement learning, as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer, a model-based safe reinforcement learning approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results demonstrate that our approach significantly outperforms existing methods in terms of safety and task-centric performance. Meanwhile, compared to alternative privileged model-based reinforcement learning methods, our approach exhibits superior performance and ease of training." R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models,Pengyi Li Jianye HAO Hongyao Tang Yifu Yuan Jinbin Qiao Zibin Dong YAN ZHENG,https://icml.cc/virtual/2025/poster/43934,"Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design. However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, leading to low optimization efficiency. To address the challenges, we propose an efficient automated reward design framework, called R*, which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. LLMs are employed as the mutation operator, and module-level crossover is proposed to facilitate efficient exploration and exploitation. To design more efficient reward parameters, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. Based on these critics, a voting mechanism is employed to collect the trajectory segments with high-confidence labels. These labeled segments are then used to refine the reward function parameters through preference learning. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions." Return Capping: Sample Efficient CVaR Policy Gradient Optimisation,Harry Mead Clarissa Costen Bruno Lacerda Nick Hawes,https://icml.cc/virtual/2025/poster/44577,"When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately.
We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: \url{https://github.com/HarryMJMead/cvar-return-capping}." -Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization,Daniel Palenicek Florian Vogt Jan Peters,https://openreview.net/forum?id=Pr2fNUGU06, SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models,Cansu Sancaktar Christian Gumbsch Andrii Zadaianchuk Pavel Kolev Georg Martius,https://icml.cc/virtual/2025/poster/44870,"Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain, often only uncover low-level interactions. In contrast, children’s play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game-like simulations SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction, as VLMs become more powerful." Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning,Bryan Lincoln Marques de Oliveira Luana Guedes Barros Martins Bruno Brandão Murilo Lopes da Luz telma woerle de lima soares Luckeciano Carvalho Melo,https://icml.cc/virtual/2025/poster/43653,"Effective visual representation learning is crucial for reinforcement learning (RL) agents to extract task-relevant information from raw sensory inputs and generalize across diverse environments. However, existing RL benchmarks lack the ability to systematically evaluate representation learning capabilities in isolation from other learning challenges. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that transforms the classic 8-tile puzzle into a visual RL task with images drawn from arbitrarily large datasets. SPGym's key innovation lies in its ability to precisely control representation learning complexity through adjustable grid sizes and image pools, while maintaining fixed environment dynamics, observation, and action spaces. This design enables researchers to isolate and scale the visual representation challenge independently of other learning components. Through extensive experiments with model-free and model-based RL algorithms, we uncover fundamental limitations in current methods' ability to handle visual diversity.
As we increase the pool of possible images, all algorithms exhibit in- and out-of-distribution performance degradation, with sophisticated representation learning techniques often underperforming simpler approaches like data augmentation. These findings highlight critical gaps in visual representation learning for RL and establish SPGym as a valuable tool for driving progress in robust, generalizable decision-making systems." SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels,Malte Mosbach Jan Niklas Ewertz Angel Villar-Corrales Sven Behnke,https://icml.cc/virtual/2025/poster/44962,"Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2 - state-of-the-art model-based RL algorithms - across a range of multi-object manipulation environments that require both relational reasoning and dexterous control. Videos and code are available at https://slot-latent-dynamics.github.io." @@ -2680,13 +2628,11 @@ Towards Global-level Mechanistic Interpretability: A Perspective of Modular Circ Understanding Fixed Predictions via Confined Regions,Connor Lawless Tsui-Wei Weng Berk Ustun Madeleine Udell,https://icml.cc/virtual/2025/poster/46631,"Machine learning models can assign fixed predictions that preclude individuals from changing their outcome. Existing approaches to audit fixed predictions do so on a pointwise basis, which requires access to an existing dataset of individuals and may fail to anticipate fixed predictions in out-of-sample data. This work presents a new paradigm to identify fixed predictions by finding confined regions of the feature space in which all individuals receive fixed predictions. This paradigm enables the certification of recourse for out-of-sample data, works in settings without representative datasets, and provides interpretable descriptions of individuals with fixed predictions. We develop a fast method to discover confined regions for linear classifiers using mixed-integer quadratically constrained programming. We conduct a comprehensive empirical study of confined regions across diverse applications. Our results highlight that existing pointwise verification methods fail to anticipate future individuals with fixed predictions, while our method both identifies them and provides an interpretable description." Validating Mechanistic Interpretations: An Axiomatic Approach,Nils Palumbo Ravi Mangal Zifan Wang Saranya Vijayakumar Corina S.
Pasareanu Somesh Jha,https://icml.cc/virtual/2025/poster/43956,"Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem." The Lock-in Hypothesis: Stagnation by Algorithm,Tianyi Qiu Zhonghao He Tejasveer Chugh Max Kleiman-Weiner,https://icml.cc/virtual/2025/poster/44167,"The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity in human ideas and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Website: https://thelockinhypothesis.com" -Fair Class-Incremental Learning using Sample Weighting,Jaeyoung Park Minsu Kim Steven Euijong Whang,https://openreview.net/forum?id=0iwwSdla9m, FDGen: A Fairness-Aware Graph Generation Model,Zichong Wang Wenbin Zhang,https://icml.cc/virtual/2025/poster/46368,"Graph generation models have shown significant potential across various domains. However, despite their success, these models often inherit societal biases, limiting their adoption in real-world applications. Existing research on fairness in graph generation primarily addresses structural bias, overlooking the critical issue of feature bias. To address this gap, we propose FDGen, a novel approach that defines and mitigates both feature and structural biases in graph generation models. Furthermore, we provide a theoretical analysis of how bias sources in graph data contribute to disparities in graph generation tasks. Experimental results on four real-world datasets demonstrate that FDGen outperforms state-of-the-art methods, achieving notable improvements in fairness while maintaining competitive generation performance." Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs,Yinong Oliver Wang Nivedha Sivakumar Falaah Arif Khan Katherine Metcalf Adam Golinski Natalie Mackraz Barry-John Theobald Luca Zappella Nicholas Apostoloff,https://icml.cc/virtual/2025/poster/44735,"The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness.
Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerf, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable benchmark for modern LLMs. Combining our metric and dataset, we provide insightful comparisons of eight open-source LLMs. For example, Mistral-8B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, this work provides a holistic framework for LLM evaluation by jointly assessing fairness and uncertainty, enabling the development of more transparent and accountable AI systems." Multiaccuracy and Multicalibration via Proxy Groups,Beepul Bharti Mary Versa Clemens-Sewall Paul Yi Jeremias Sulam,https://icml.cc/virtual/2025/poster/43844,"As the use of predictive machine learning algorithms increases in high-stakes decision-making, it is imperative that these algorithms are fair across sensitive groups. However, measuring and enforcing fairness in real-world applications can be challenging due to missing or incomplete sensitive group information. Proxy-sensitive attributes have been proposed as a practical and effective solution in these settings, but only for parity-based fairness notions. Knowing how to evaluate and control for fairness with missing sensitive group data for newer, different, and more flexible frameworks, such as multiaccuracy and multicalibration, remains unexplored. In this work, we address this gap by demonstrating that in the absence of sensitive group data, proxy-sensitive attributes can provably be used to derive actionable upper bounds on the true multiaccuracy and multicalibration violations, providing insights into a predictive model’s potential worst-case fairness violations. Additionally, we show that adjusting models to satisfy multiaccuracy and multicalibration across proxy-sensitive attributes can significantly mitigate these violations for the true, but unknown, sensitive groups. Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group data is incomplete or unavailable." Relative Error Fair Clustering in the Weak-Strong Oracle Model,Vladimir Braverman Prathamesh Dharangutte Shaofeng H.-C. Jiang Hoai-An Nguyen Chen Wang Yubo Zhang Samson Zhou,https://icml.cc/virtual/2025/poster/46348,"We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal fair clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. 
The study of fair clustering in the model is motivated by the important quest of achieving fairness with the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints), and we could in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. In contrast, previous results achieved a constant-factor $(>10)$ approximation for the standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem." An Efficient Private GPT Never Autoregressively Decodes,Zhengyi Li Yue Guan Kang Yang Yu Feng Ning Liu Yu Yu Jingwen Leng Minyi Guo,https://icml.cc/virtual/2025/poster/45418,"The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one and multiple tokens takes a similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions." -DictPFL: Efficient and Private Federated Learning on Encrypted Gradients,Jiaqi Xue Mayank Kumar Yuzhang Shang Shangqian Gao Mengxin Zheng Xiaoqian Jiang Qian Lou,https://openreview.net/forum?id=lSTJ628SXl, Differentially Private Boxplots,Kelly Ramsay Jairo Diaz-Rodriguez,https://icml.cc/virtual/2025/poster/44585,"Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequentially, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently, which is not always the case for a boxplot naively constructed from a single existing differentially private quantile algorithm. As a byproduct of this exposition, we introduce several new results concerning private quantile estimation. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms the naive boxplot. 
Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization." EgoPrivacy: What Your First-Person Camera Says About You?,Yijiang Li Genpei Zhang Jiacheng Cheng Yi Li Xiaojun Shan Dashan Gao Jiancheng Lyu Yuan Li Ning Bi Nuno Vasconcelos,https://icml.cc/virtual/2025/poster/44804,"While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: How much privacy information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale benchmark for the comprehensive evaluation of privacy risks in egocentric vision. EgoPrivacy covers three types of privacy (demographic, individual, and situational), defining seven tasks that aim to recover private information ranging from fine-grained (e.g., wearer's identity) to coarse-grained (e.g., age group). To further emphasize the privacy threats inherent to egocentric vision, we propose Retrieval-Augmented Attack, a novel attack strategy that leverages ego-to-exo retrieval from an external pool of exocentric videos to boost the effectiveness of demographic privacy attacks. An extensive comparison of the different attacks possible under all threat models is presented, showing that private information of the wearer is highly susceptible to leakage. For instance, our findings indicate that foundation models can effectively compromise wearer privacy even in zero-shot settings by recovering attributes such as identity, scene, gender, and race with 70–80% accuracy. Our code and data are available at https://github.com/williamium3000/ego-privacy." Empirical Privacy Variance,Yuzheng Hu Fan Wu Ruicheng Xian Yuhang Liu Lydia Zakynthinou Pritish Kamath Chiyuan Zhang David Forsyth,https://icml.cc/virtual/2025/poster/44071,"We propose the notion of empirical privacy variance and study it in the context of differentially private fine-tuning of language models. Specifically, we show that models calibrated to the same $(\varepsilon, \delta)$-DP guarantee using DP-SGD with different hyperparameter configurations can exhibit significant variations in empirical privacy, which we quantify through the lens of memorization. We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful. Finally, we take preliminary steps to understand empirical privacy variance. We propose two hypotheses, identify limitations in existing techniques like privacy auditing, and outline open questions for future research." 
@@ -2719,7 +2665,6 @@ Improving Out-of-Distribution Detection via Dynamic Covariance Calibration,Kaiyu Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning,Mahavir Dabas Si Chen Charles Fleming Ming Jin Ruoxi Jia,https://icml.cc/virtual/2025/poster/45159,"Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To this end, we introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model’s ability to handle harmful queries and preserving overall utility." Not All Wrong is Bad: Using Adversarial Examples for Unlearning,Ali Ebrahimpour-Boroojeny Hari Sundaram Varun Chandrasekaran,https://icml.cc/virtual/2025/poster/46097,"Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on ""exact"" unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, ""approximate"" methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) dataset. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random 10% of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing." "SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior",Jing-Jing Li Valentina Pyatkin Max Kleiman-Weiner Liwei Jiang Nouha Dziri Anne Collins Jana Schaich Borg Maarten Sap Yejin Choi Sydney Levine,https://icml.cc/virtual/2025/poster/45015,"The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured ""harm-benefit tree,"" which enumerates harmful and beneficial *actions* and *effects* the AI behavior may lead to, along with *likelihood*, *severity*, and *immediacy* labels that describe potential impacts on *stakeholders*. SafetyAnalyst then aggregates all effects into a harmfulness score using 28 fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this framework to develop an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On comprehensive benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability." -EA-PS: Estimated Attack Effectiveness based Poisoning Defense in Federated Learning under Parameter Constraint Strategy,Yidong Li Zikai Zhang Naiyue Chen ziyi li,https://openreview.net/forum?id=mZOeaIWBpR, GaussMarker: Robust Dual-Domain Watermark for Diffusion Models,Kecen Li Zhicong Huang Xinwen Hou Cheng Hong,https://icml.cc/virtual/2025/poster/44171,"As Diffusion Models (DM) generate increasingly realistic images, related issues such as copyright and misuse have become a growing concern. Watermarking is one of the promising solutions. Existing methods inject the watermark into the single-domain of initial Gaussian noise for generation, which suffers from unsatisfactory robustness. This paper presents the first dual-domain DM watermarking approach using a pipelined injector to consistently embed watermarks in both the spatial and frequency domains. To further boost robustness against certain image manipulations and advanced attacks, we introduce a model-independent learnable Gaussian Noise Restorer (GNR) to refine Gaussian noise extracted from manipulated images and enhance detection robustness by integrating the detection scores of both watermarks. GaussMarker efficiently achieves state-of-the-art performance under eight image distortions and four advanced attacks across three versions of Stable Diffusion with better recall and lower false positive rates, as preferred in real applications." Omni-Angle Assault: An Invisible and Powerful Physical Adversarial Attack on Face Recognition,Shuai Yuan Hongwei Li Rui Zhang Hangcheng Cao Wenbo Jiang Tao Ni Wenshu Fan Qingchuan Zhao Guowen Xu,https://icml.cc/virtual/2025/poster/46320,"Deep learning models employed in face recognition (FR) systems have been shown to be vulnerable to physical adversarial attacks through various modalities, including patches, projections, and infrared radiation. However, existing adversarial examples targeting FR systems often suffer from issues such as conspicuousness, limited effectiveness, and insufficient robustness. To address these challenges, we propose a novel approach for adversarial face generation, UVHat, which utilizes ultraviolet (UV) emitters mounted on a hat to enable invisible and potent attacks in black-box settings. Specifically, UVHat simulates UV light sources via video interpolation and models the positions of these light sources on a curved surface, specifically the human head in our study.
To optimize attack performance, UVHat integrates a reinforcement learning-based optimization strategy, which explores a vast parameter search space, encompassing factors such as shooting distance, power, and wavelength. Extensive experimental evaluations validate that UVHat substantially improves the attack success rate in black-box settings, enabling adversarial attacks from multiple angles with enhanced robustness." One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework,Feiran Li Qianqian Xu Shilong Bao Zhiyong Yang Xiaochun Cao Qingming Huang,https://icml.cc/virtual/2025/poster/45434,"Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods heavily rely on manually crafted text prompts, making it challenging to achieve a high erasure (efficacy) while minimizing the impact on other benign concepts (usability), as illustrated in Fig.1. In this paper, we attribute the limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution by directly integrating visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generating probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing outperforms state-of-the-art erasure approaches significantly with a better trade-off between efficacy and usability." @@ -2763,8 +2708,6 @@ Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gra Models of Heavy-Tailed Mechanistic Universality,Liam Hodgkinson Zhichao Wang Michael W. Mahoney,https://icml.cc/virtual/2025/poster/43908,"Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by the observation that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in Jacobians, Hessians, and weight matrices has led to the introduction of the concept of heavy-tailed mechanistic universality (HT-MU). Multiple lines of empirical evidence suggest a robust correlation between heavy-tailed metrics and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random matrix models---the high-temperature Marchenko-Pastur (HTMP) ensemble---to explore attributes that give rise to heavy-tailed behavior in trained neural networks.
Under this model, spectral densities with power laws on (upper and lower) tails arise through a combination of three independent factors (complex correlation structures in the data; reduced temperatures during training; and reduced eigenvector entropy), appearing as an implicit bias in the model structure, and they can be controlled with an ""eigenvalue repulsion"" parameter. Implications of our model on other appearances of heavy tails, including neural scaling laws, optimizer trajectories, and the five-plus-one phases of neural network training, are discussed." On the Convergence of Continuous Single-timescale Actor-critic,Xuyang Chen Lin Zhao,https://icml.cc/virtual/2025/poster/44001,"Actor-critic algorithms have been instrumental in boosting the performance of numerous challenging applications involving continuous control, such as highly robust and agile robot motion control. However, their theoretical understanding remains largely underdeveloped. Existing analyses mostly focus on finite state-action spaces and on simplified variants of actor-critic, such as double-loop updates with i.i.d. sampling, which are often impractical for real-world applications. We consider the canonical and widely adopted single-timescale updates with Markovian sampling in continuous state-action space. Specifically, we establish finite-time convergence by introducing a novel Lyapunov analysis framework, which provides a unified convergence characterization of both the actor and the critic. Our approach is less conservative than previous methods and offers new insights into the coupled dynamics of actor-critic updates." On the Generalization Ability of Next-Token-Prediction Pretraining,Zhihao Li Xue Jiang Liyuan Liu Xuelin Zhang Hong Chen Feng Zheng,https://icml.cc/virtual/2025/poster/44423,"Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs usually can be categorized as transformer decoder-only models (DOMs), utilizing Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, the theoretical understanding of how NTP pre-training affects the model's generalization behavior is lacking. To fill this gap, we establish the fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, where the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the representation learner and the token predictor, respectively. Furthermore, the upper bounds of covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences $N$, the maximum length of sequence $m$, and the count of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings."
-On the Robustness of Transformers against Context Hijacking for Linear Classification,Tianle Li Chenyang Zhang Xingwu Chen Yuan Cao Difan Zou,https://openreview.net/forum?id=C0RNxlDvFE, -PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders,Ruixin Guo Ruoming Jin Xinyu Li Yang Zhou,https://openreview.net/forum?id=1ueDWPv7j9, Rapid Overfitting of Multi-Pass SGD in Stochastic Convex Optimization,Shira Vansover-Hager Tomer Koren Roi Livni,https://icml.cc/virtual/2025/poster/45321,"We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase-transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = {\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024)." Refined generalization analysis of the Deep Ritz Method and Physics-Informed Neural Networks,Xianliang Xu Ye Li Zhongyi Huang,https://icml.cc/virtual/2025/poster/44863,"In this paper, we derive refined generalization bounds for the Deep Ritz Method (DRM) and Physics-Informed Neural Networks (PINNs). For the DRM, we focus on two prototype elliptic partial differential equations (PDEs): Poisson equation and static Schrödinger equation on the $d$-dimensional unit hypercube with the Neumann boundary condition. Furthermore, sharper generalization bounds are derived based on the localization techniques under the assumptions that the exact solutions of the PDEs lie in the Barron spaces or the general Sobolev spaces. For the PINNs, we investigate the general linear second order elliptic PDEs with Dirichlet boundary condition using the local Rademacher complexity in the multi-task learning setting. Finally, we discuss the generalization error in the setting of over-parameterization when solutions of PDEs belong to Barron space." The Power of Random Features and the Limits of Distribution-Free Gradient Descent,Ari Karchmer Eran Malach,https://icml.cc/virtual/2025/poster/43774,"We study the relationship between gradient-based optimization of parametric models (e.g., neural networks) and optimization of linear combinations of random features. 
Our main result shows that if a parametric model can be learned using mini-batch stochastic gradient descent (bSGD) without making assumptions about the data distribution, then with high probability, the target function can also be approximated using a polynomial-sized combination of random features. The size of this combination depends on the number of gradient steps and numerical precision used in the bSGD process. This finding reveals fundamental limitations of distribution-free learning in neural networks trained by gradient descent, highlighting why making assumptions about data distributions is often crucial in practice. Along the way, we also introduce a new theoretical framework called average probabilistic dimension complexity (adc), which extends the probabilistic dimension complexity developed by Kamath et al. (2020). We prove that adc has a polynomial relationship with statistical query dimension, and use this relationship to demonstrate an infinite separation between adc and standard dimension complexity." @@ -2807,14 +2750,11 @@ HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Tackling View-Dependent Semantics in 3D Language Gaussian Splatting,Jiazhong Cen Xudong Zhou Jiemin Fang Changsong Wen Lingxi Xie XIAOPENG ZHANG Wei Shen Qi Tian,https://icml.cc/virtual/2025/poster/44052,"Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints—a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7\% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa." BAnG: Bidirectional Anchored Generation for Conditional RNA Design,Roman Klypa Alberto Bietti Sergei Grudinin,https://icml.cc/virtual/2025/poster/44645,"Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of experimentally determined RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNA-BAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts.
We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein." Leveraging Per-Instance Privacy for Machine Unlearning,Nazanin Mohammadi Sepahvand Anvith Thudi Berivan Isik Ashmita Bhattacharyya Nicolas Papernot Eleni Triantafillou Daniel M. Roy Gintare Karolina Dziugaite,https://icml.cc/virtual/2025/poster/46697,"We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility–unlearning trade-off by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Rényi) divergence to retraining without an individual datapoint. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are borne out both for Stochastic Gradient Langevin Dynamics (SGLD) as well as for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. Altogether, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points." -PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery,Bowei He Lihao Yin Huiling Zhen Xiaokun Zhang Mingxuan Yuan Chen Ma,https://openreview.net/forum?id=ub44gZJNhk, "Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification",Eric Zhao Pranjal Awasthi Sreenivas Gollapudi,https://icml.cc/virtual/2025/poster/43608,"Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one---typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts---chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies."
TabFSBench: Tabular Benchmark for Feature Shifts in Open Environments,Zi-Jian Cheng Ziyi Jia Zhi Zhou Yu-Feng Li Lan-Zhe Guo,https://icml.cc/virtual/2025/poster/44787,"Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the shifted feature set importance has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation. Benchmark: LAMDASZ-ML/TabFSBench." WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting,Jiecheng Lu Xu Han Yan Sun Shihao Yang,https://icml.cc/virtual/2025/poster/45318,"We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results." One-Pass Feature Evolvable Learning with Theoretical Guarantees,Cun-Yuan Xing Meng-Zhang Qian Wu-Yang Chen Wei Gao Zhi-Hua Zhou,https://icml.cc/virtual/2025/poster/44326,"Feature evolvable learning studies the scenario where old features will vanish and new features will emerge when learning with data streams, and various methods have been developed by utilizing some useful relationships from old features to new features, rather than re-training from scratch.
In this work, we focus on two fundamental problems: How to characterize the relationships between two different feature spaces, and how to exploit those relationships for feature evolvable learning. We introduce the Kernel Ortho-Mapping (KOM) discrepancy to characterize relationships between two different feature spaces via kernel functions, and correlate it with the optimal classifiers learned from different feature spaces. Based on this discrepancy, we develop the one-pass algorithm for feature evolvable learning, which requires going through all instances only once without storing the entire or partial training data. Our basic idea is to take online kernel learning with the random Fourier features and incorporate some feature and label relationships via the KOM discrepancy for feature evolvable learning. We finally validate the effectiveness of our proposed method both theoretically and empirically." TeDS: Joint Learning of Diachronic and Synchronic Perspectives in Quaternion Space for Temporal Knowledge Graph Completion,Jiujiang Guo Mankun Zhao Wenbin Zhang Tianyi Xu Linying Xu Yu Jian Yu Mei Yu Ruiguo,https://icml.cc/virtual/2025/poster/43888,"Existing research on temporal knowledge graph completion treats temporal information as supplementary, without simulating various features of facts from a temporal perspective. This work summarizes features of temporalized facts from both diachronic and synchronic perspectives: (1) Diachronicity. Facts often exhibit varying characteristics and trends across different temporal domains; (2) Synchronicity. In specific temporal contexts, various relations between entities influence each other, generating latent semantics. To tackle the above issues, we design a quaternion-based model, TeDS, which divides timestamps into diachronic and synchronic timestamps to support dual temporal perception: (a) Two composite quaternions fusing time and relation information are generated by reorganizing synchronic timestamp and relation quaternions, and Hamilton operator achieves their interaction. (b) Each time point is sequentially mapped to an angle and converted to the scalar component of a quaternion using trigonometric functions to build diachronic timestamps. We then rotate relation by using Hamilton operator between it and diachronic timestamp. In this way, TeDS achieves deep integration of relations and time while accommodating different perspectives. Empirically, TeDS significantly outperforms SOTA models on six benchmarks." -Covariances for Free: Exploiting Mean Distributions for Federated Learning with Pre-trained Models,Dipam Goswami Simone Magistri Kai Wang Bartłomiej Twardowski Andrew D. Bagdanov Joost van de Weijer,https://openreview.net/forum?id=723VXasfGq, -Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation,Robert M. Gower Guillaume Garrigos Nicolas Loizou Konstantin Mishchenko Dimitris Oikonomou Fabian Schaipp,https://openreview.net/forum?id=Wgx6w9BALm, Discrete and Continuous Difference of Submodular Minimization,George Orfanides Tim Hoheisel Marwa El Halabi,https://icml.cc/virtual/2025/poster/45938,"Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of submodular (DS) functions, over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS.
For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares." A Unified Comparative Study with Generalized Conformity Scores for Multi-Output Conformal Regression,Victor Dheur Matteo Fontana Yorick Estievenart Naomi Desobry Souhaib Ben Taieb,https://icml.cc/virtual/2025/poster/45852,"Conformal prediction provides a powerful framework for constructing distribution-free prediction regions with finite-sample coverage guarantees. While extensively studied in univariate settings, its extension to multi-output problems presents additional challenges, including complex output dependencies and high computational costs, and remains relatively underexplored. In this work, we present a unified comparative study of nine conformal methods with different multivariate base models for constructing multivariate prediction regions within the same framework. This study highlights their key properties while also exploring the connections between them. Additionally, we introduce two novel classes of conformity scores for multi-output regression that generalize their univariate counterparts. These scores ensure asymptotic conditional coverage while maintaining exact finite-sample marginal coverage. One class is compatible with any generative model, offering broad applicability, while the other is computationally efficient, leveraging the properties of invertible generative models. Finally, we conduct a comprehensive empirical evaluation across 13 tabular datasets, comparing all the multi-output conformal methods explored in this work. To ensure a fair and consistent comparison, all methods are implemented within a unified code base." ExpProof : Operationalizing Explanations for Confidential Models with ZKPs,Chhavi Yadav Evan Laufer Dan Boneh Kamalika Chaudhuri,https://icml.cc/virtual/2025/poster/44593,"In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand. In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests. Our code is publicly available at : \url{https://github.com/emlaufer/ExpProof}." @@ -2826,9 +2766,7 @@ RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing,Jinyao Guo A Physics-Augmented Deep Learning Framework for Classifying Single Molecule Force Spectroscopy Data,Cailong Hua Sivaraman Rajaganapathy Rebecca A Slick Joseph Vavra Joseph M. Muretta James M. 
Ervasti Murti Salapaka,https://icml.cc/virtual/2025/poster/45169,"Deciphering protein folding and unfolding pathways under tension is essential for deepening our understanding of fundamental biological mechanisms. Such insights hold the promise of developing treatments for a range of debilitating and fatal conditions, including muscular disorders like Duchenne Muscular Dystrophy and neurodegenerative diseases such as Parkinson's disease. Single molecule force spectroscopy (SMFS) is a powerful technique for investigating forces involved in protein domains folding and unfolding. However, SMFS trials often involve multiple protein molecules, necessitating filtering to isolate measurements from single-molecule trials. Currently, manual visual inspection is the primary method for classifying single-molecule data; a process that is both time-consuming and requires significant expertise. Here, we both apply state-of-the-art machine learning models and present a novel deep learning model tailored to SMFS data. The proposed model employs a dual-branch fusion strategy; one branch integrates the physics of protein molecules, and the other operates independently of physical constraints. This model automates the isolation of single-molecule measurements, significantly enhancing data processing efficiency. To train and validate our approach, we developed a physics-based Monte Carlo engine to simulate force spectroscopy datasets, including trials involving single molecules, multiple molecules, and no molecules. Our model achieves state-of-the-art performance, outperforming five baseline methods on both simulated and experimental datasets. It attains nearly 100\% accuracy across all simulated datasets and an average accuracy of $79.6 \pm 5.2$\% on experimental datasets, using only $\sim$30 training samples, surpassing baseline methods by 11.4\%. Notably, even without expert annotations on experimental data, the model achieves an average accuracy of $72.0 \pm 5.9$\% when pre-trained on corresponding simulated datasets. With our deep learning approach, the time required to extract meaningful statistics from single-molecule SMFS trials is reduced from a day to under an hour. This work results in SMFS experimental datasets from four important protein molecules crucial to many biological pathways. To support further research, we have made our datasets publicly available and provided a Python-based toolbox (https://github.com/SalapakaLab-SIMBioSys/SMFS-Identification)." Retrieval Augmented Zero-Shot Enzyme Generation for Specified Substrate,Jiahe Du Kaixiong Zhou Xinyu Hong Zhaozhuo Xu Jinbo Xu Xiao Huang,https://icml.cc/virtual/2025/poster/45546,"Generating novel enzymes for target molecules in zero-shot scenarios is a fundamental challenge in biomaterial synthesis and chemical production. Without known enzymes for a target molecule, training generative models becomes difficult due to the lack of direct supervision. To address this, we propose a retrieval-augmented generation method that uses existing enzyme-substrate data to guide enzyme design. Our method retrieves enzymes with substrates that share structural similarities with the target molecule, leveraging functional similarities in catalytic activity. Since none of the retrieved enzymes directly catalyze the target molecule, we use a conditioned discrete diffusion model to generate new enzymes based on the retrieved examples. An enzyme-substrate relationship classifier guides the generation process to ensure optimal protein sequence distributions. 
We evaluate our model on enzyme design tasks with diverse real-world substrates and show that it outperforms existing protein generation methods in catalytic capability, foldability, and docking accuracy. Additionally, we define the zero-shot substrate-specified enzyme generation task and introduce a dataset with evaluation benchmarks." Retrieval-Augmented Language Model for Knowledge-aware Protein Encoding,Jiasheng Zhang Delvin Ce Zhang Shuang Liang Zhengpin Li Rex Ying Jie Shao,https://icml.cc/virtual/2025/poster/45183,"Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pre-training objectives, but lack explicit integration of task-oriented knowledge, making them suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose Knowledge-aware retrieval augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever learning to predict linkages between PKG and task proteins, Kara unifies the knowledge integration of the pre-training and fine-tuning stages with a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models in 6 representative tasks, achieving on average 5.1% improvements." -STNet: Spectral Transformation Network for Solving Operator Eigenvalue Problem,Hong Wang Jiang Yixuan Jie Wang Xinyi Li Jian Luo huanshuo dong,https://openreview.net/forum?id=2UpXNbZDyt, Sub-Sequential Physics-Informed Learning with State Space Model,Chenhui Xu Dancheng Liu Yuting Hu Jiajie Li Ruiyang Qin Qingxiao Zheng Jinjun Xiong,https://icml.cc/virtual/2025/poster/45079,"Physics-Informed Neural Networks (PINNs) are a kind of deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between PDE's continuity and PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architecture. Our code is available at Supplementary Material." 
-SymMaP: Improving Computational Efficiency in Linear Solvers through Symbolic Preconditioning,Hong Wang Jie Wang Minghao Ma Haoran Shao Haoyang Liu,https://openreview.net/forum?id=5sb0tv9MVn, Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing,Kaifeng Gao Jiaxin Shi Hanwang Zhang Chunping Wang Jun Xiao Long Chen,https://icml.cc/virtual/2025/poster/44902,"With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available: https://github.com/Dawn-LX/CausalCache-VDM" CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing,Yu Yuan Shizhao Sun Qi Liu Jiang Bian,https://icml.cc/virtual/2025/poster/44580,"Computer Aided Design (CAD) is indispensable across various industries. \emph{Text-based CAD editing}, which automates the modification of CAD models based on textual instructions, holds great potential but remains underexplored. Existing methods primarily focus on design variation generation or text-based CAD generation, either lacking support for text-based control or neglecting existing CAD models as constraints. We introduce \emph{CAD-Editor}, the first framework for text-based CAD editing. To address the challenge of demanding triplet data with accurate correspondence for training, we propose an automated data synthesis pipeline. This pipeline utilizes design variation models to generate pairs of original and edited CAD models and employs Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating regions requiring modification and infilling these regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively."
CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering,Zhe Zhang Mingxiu Cai Hanxiao Wang Gaochang Wu Tianyou Chai Xiatian Zhu,https://icml.cc/virtual/2025/poster/46359,"Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples.Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD." @@ -2837,7 +2775,6 @@ Efficient Multi-modal Long Context Learning for Training-free Adaptation,Zehong Elucidating the design space of language models for image generation,Xuantong LIU Shaozhe Hao Xianbiao Qi Tianyang Hu Jun Wang Rong Xiao Yuan Yao,https://icml.cc/virtual/2025/poster/45954,"The success of large language models (LLMs) in text generation has inspired their application to image generation. However, existing methods either rely on specialized designs with inductive biases or adopt LLMs without fully exploring their potential in vision tasks. In this work, we systematically investigate the design space of LLMs for image generation and demonstrate that LLMs can achieve near state-of-the-art performance without domain-specific designs, simply by making proper choices in tokenization methods, modeling approaches, scan patterns, vocabulary design, and sampling strategies. We further analyze autoregressive models' learning and scaling behavior, revealing how larger models effectively capture more useful information than the smaller ones. Additionally, we explore the inherent differences between text and image modalities, highlighting the potential of LLMs across domains. The exploration provides valuable insights to inspire more effective designs when applying LLMs to other domains. With extensive experiments, our proposed model, **ELM** achieves an FID of 1.54 on 256$\times$256 ImageNet and an FID of 3.29 on 512$\times$512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks." 
FreeMesh: Boosting Mesh Generation with Coordinates Merging,Jian Liu Haohan Weng Biwen Lei Xianghui Yang Zibo Zhao Zhuo Chen Song Guo Tao Han Chunchao Guo,https://icml.cc/virtual/2025/poster/45605,"The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric Per-Token-Mesh-Entropy (PTME) to evaluate the existing mesh tokenizers theoretically without any training. Building upon PTME, we propose a plug-and-play tokenization technique called coordinate merging. It further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent patterns of coordinates. Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and Edgerunner, we further validate the performance of our method. We hope that the proposed PTME and coordinate merging can enhance the existing mesh tokenizers and guide the further development of native mesh generation." From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs,Ang Cao Sergio Arnaud Oleksandr Maksymets Jianing Yang Ayush Jain Ada Martin Vincent-Pierre Berges Paul McVay Ruslan Partsey Aravind Rajeswaran Franziska Meier Justin Johnson Jeong Joon Park Alexander Sax,https://icml.cc/virtual/2025/poster/43636,"3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce \textbf{\emph{LIFT-GS}}, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7\% mAP on open-vocabulary instance segmentation (vs. 20.2\% prior SOTA) and consistent 10-30\% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2×, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: \url{https://liftgs.github.io}." -Improved Convex Decomposition with Ensembling and Boolean Primitives,Vaibhav Vavilala Florian Kluger Seemandhar Jain Bodo Rosenhahn David Forsyth,https://openreview.net/forum?id=nHaaNf5cOM, IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models,Hanting Wang Tao Jin Wang Lin Shulei Wang Hai Huang Shengpeng Ji Zhou Zhao,https://icml.cc/virtual/2025/poster/44764,"Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement.
The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available at GitHub." Large Displacement Motion Transfer with Unsupervised Anytime Interpolation,Guixiang Wang Jianjun Li,https://icml.cc/virtual/2025/poster/43896,"Motion transfer transfers the pose in a driving video to the object in a source image, so that the object in the source image moves. Although great progress has been made recently in unsupervised motion transfer, many unsupervised methods still struggle to accurately model large displacement motions when large motion differences occur between source and driving images. To solve the problem, we propose an unsupervised anytime interpolation based large displacement motion transfer method, which can generate a series of anytime interpolated images between source and driving images. By decomposing large displacement motion into many small displacement motions, the difficulty of large displacement motion estimation is reduced. In the process, we design a selector that can select the optimal interpolated image from the generated interpolated images for downstream tasks. Since there are no real images as labels in the interpolation process, we propose a bidirectional training strategy. Some constraints are added to the optimal interpolated image to generate a reasonable interpolated image. To encourage the network to generate high-quality images, a pre-trained Vision Transformer model is used to design constraint losses. Finally, experiments show that, compared with the large displacement motion between source and driving images, the small displacement motion between interpolated and driving images makes motion transfer easier to realize. Compared with existing state-of-the-art methods, our method has significant improvements in motion-related metrics." LRA-QViT: Integrating Low-Rank Approximation and Quantization for Robust and Efficient Vision Transformers,Beom Jin Kang NamJoon Kim Hyun Kim,https://icml.cc/virtual/2025/poster/45855,"Recently, transformer-based models have demonstrated state-of-the-art performance across various computer vision tasks, including image classification, detection, and segmentation. However, their substantial parameter count poses significant challenges for deployment in resource-constrained environments such as edge or mobile devices. Low-rank approximation (LRA) has emerged as a promising model compression technique, effectively reducing the number of parameters in transformer models by decomposing high-dimensional weight matrices into low-rank representations. Nevertheless, matrix decomposition inherently introduces information loss, often leading to a decline in model accuracy. Furthermore, existing studies on LRA largely overlook the quantization process, which is a critical step in deploying practical vision transformer (ViT) models.
To address these challenges, we propose a robust LRA framework that preserves weight information after matrix decomposition and incorporates quantization tailored to LRA characteristics. First, we introduce a reparameterizable branch-based low-rank approximation (RB-LRA) method coupled with weight reconstruction to minimize information loss during matrix decomposition. Subsequently, we enhance model accuracy by integrating RB-LRA with knowledge distillation techniques. Lastly, we present an LRA-aware quantization method designed to mitigate the large outliers generated by LRA, thereby improving the robustness of the quantized model. To validate the effectiveness of our approach, we conducted extensive experiments on the ImageNet dataset using various ViT-based models. Notably, the Swin-B model with RB-LRA achieved a 31.8\% reduction in parameters and a 30.4\% reduction in GFLOPs, with only a 0.03\% drop in accuracy. Furthermore, incorporating the proposed LRA-aware quantization method reduced accuracy loss by an additional 0.83\% compared to naive quantization." @@ -2866,9 +2803,7 @@ LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs – No S Overcoming Non-monotonicity in Transducer-based Streaming Generation,Zhengrui Ma Yang Feng Min zhang,https://icml.cc/virtual/2025/poster/45770,"Streaming generation models are utilized across fields, with the Transducer architecture being popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation. In this research, we address this issue by integrating Transducer's decoding with the history of input stream via a learnable monotonic attention. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the monotonic context representations, thereby avoiding the need to enumerate the exponentially large alignment space during training. Extensive experiments show that our MonoAttn-Transducer effectively handles non-monotonic alignments in streaming scenarios, offering a robust solution for complex generation tasks. Code is available at https://github.com/ictnlp/MonoAttn-Transducer." Reducing Tool Hallucination via Reliability Alignment,Hongshen Xu Zichen Zhu Lei Pan Zihan Wang Su Zhu Da Ma Ruisheng Cao Lu Chen Kai Yu,https://icml.cc/virtual/2025/poster/45001,"Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations—where models either select inappropriate tools or misuse them—pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types: tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination-aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool-use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. 
Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions. The code and data will be publicly available." SING: Spatial Context in Large Language Model for Next-Gen Wearables,Ayushi Mishra Yang Bai Priyadarshan Narayanasamy Nakul Garg Nirupam Roy,https://icml.cc/virtual/2025/poster/44194,"Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of an existing dataset for microstructure-assisted speech recordings, we synthetically create a dataset by using the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI’s Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of 25.72°—a substantial improvement compared to the 88.52° median error in existing work—with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inferring how many people were talking and their directions, with up to 5 people and a median DoA error of 16°. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences." -TableMaster: A Recipe to Advance Table Understanding with Language Models,Lang Cao,https://openreview.net/forum?id=6RNBm37sVe, Identifying Neural Dynamics Using Interventional State Space Models,Amin Nejatbakhsh Yixin Wang,https://icml.cc/virtual/2025/poster/44119,"Neural circuits produce signals that are complex and nonlinear. To facilitate the understanding of neural dynamics, a popular approach is to fit state space models (SSM) to the data and analyze the dynamics of the low-dimensional latent variables. Despite the power of SSM to explain the dynamics of neural circuits, these models have been shown to merely capture statistical associations in the data and cannot be causally interpreted. Therefore, an important research problem is to build models that can predict neural dynamics under causal manipulations. Here, we propose interventional state-space models (iSSM), a class of causal models that can predict neural responses to novel perturbations. We draw on recent advances in causal dynamical systems and present theoretical results for the identifiability of iSSM. In simulations of the motor cortex, we show that iSSM can recover the true latents and the underlying dynamics. In addition, we illustrate two applications of iSSM in biological datasets. First, we applied iSSM to a dataset of calcium recordings from ALM neurons in mice during photostimulation. Second, we applied iSSM to a dataset of electrophysiological recordings from macaque dlPFC during micro-stimulation.
In both cases, we show that iSSM outperforms SSM and results in identifiable parameters. The code is available at https://github.com/amin-nejat/issm." -LEAD: Large Foundation Model for EEG-Based Alzheimer’s Disease Detection,Yihe Wang Nan Huang Nadia Mammone Marco Cecchi Xiang Zhang,https://openreview.net/forum?id=cz4EevJGHf, MindAligner: Explicit Brain Functional Alignment for Cross-Subject Visual Decoding from Limited fMRI Data,Yuqin Dai Zhouheng Yao Chunfeng Song Qihao Zheng Weijian Mai Kunyu Peng Shuai Lu Wanli Ouyang Jian Yang Jiamin Wu,https://icml.cc/virtual/2025/poster/46635,"Brain decoding aims to reconstruct visual perception of human subject from fMRI signals, which is crucial for understanding brain's perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject to one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available." MindCustomer: Multi-Context Image Generation Blended with Brain Signal,Muzhou Yu Shuyun Lin Lei Ma Bo Lei Kaisheng Ma,https://icml.cc/virtual/2025/poster/46700,"Advancements in generative models have promoted text- and image-based multi-context image generation. Brain signals, offering a direct representation of user intent, present new opportunities for image customization. However, it faces challenges in brain interpretation, cross-modal context fusion and retention. In this paper, we present MindCustomer to explore the blending of visual brain signals in multi-context image generation. We first design shared neural data augmentation for stable cross-subject brain embedding by introducing the Image-Brain Translator (IBT) to generate brain responses from visual images. Then, we propose an effective cross-modal information fusion pipeline that mask-freely adapts distinct semantics from image and brain contexts within a diffusion model. It resolves semantic conflicts for context preservation and enables harmonious context integration. During the fusion pipeline, we further utilize the IBT to transfer image context to the brain representation to mitigate the cross-modal disparity. MindCustomer enables cross-subject generation, delivering unified, high-quality, and natural image outputs. Moreover, it exhibits strong generalization for new subjects via few-shot learning, indicating the potential for practical application. As the first work for multi-context blending with brain signal, MindCustomer lays a foundational exploration and inspiration for future brain-controlled generative technologies." 
Learning Safe Control via On-the-Fly Bandit Exploration,Alexandre Capone Ryan Kazuo Cosner Aaron Ames Sandra Hirche,https://icml.cc/virtual/2025/poster/44982,"Control tasks with safety requirements under high levels of model uncertainty are increasingly common. Machine learning techniques are frequently used to address such tasks, typically by leveraging model error bounds to specify robust constraint-based safety filters. However, if the learned model uncertainty is very high, the corresponding filters are potentially invalid, meaning no control input satisfies the constraints imposed by the safety filter. While most works address this issue by assuming some form of safe backup controller, ours tackles it by collecting additional data on the fly using a Gaussian process bandit-type algorithm. We combine a control barrier function with a learned model to specify a robust certificate that ensures safety if feasible. Whenever infeasibility occurs, we leverage the control barrier function to guide exploration, ensuring the collected data contributes toward the closed-loop system safety. By combining a safety filter with exploration in this manner, our method provably achieves safety in a general setting that does not require any prior model or backup controller, provided that the true system lies in a reproducing kernel Hilbert space. To the best of our knowledge, it is the first safe learning-based control method that achieves this." @@ -2887,21 +2822,15 @@ Unisolver: PDE-Conditional Transformers Towards Universal Neural PDE Solvers,Han
LASER: Attention with Exponential Transformation,Sai Surya Duvvuri Inderjit S Dhillon,https://icml.cc/virtual/2025/poster/44345,"Transformers have had tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax based dot-product attention. This mechanism plays a crucial role in Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 7.7 billion parameters, with an average improvement of up to 1.44% over standard attention on downstream evaluations and 1.65% finetuning improvements. Additionally, LASER demonstrates generalization performance improvement across a variety of tasks (vision, text and speech): Vision Transformer (ViT) on Imagenet, Conformer on the Librispeech speech-to-text and BERT with 2.2 billion parameters." LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models,Tzu-Tao Chang Shivaram Venkataraman,https://icml.cc/virtual/2025/poster/44230,"Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs.
Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique to support longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with Llama 3-V, mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 10.62$\times$ end-to-end speedup compared to existing approaches." On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding,Kevin Xu Issei Sato,https://icml.cc/virtual/2025/poster/45800,"Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture. That is, the analysis prompts the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding." -Communication Efficient Federated Learning via Model-Agnostic Projection Adaptation,Mohammad Mahdi Rahimi Younghyun Park Humaira Kousar Hasnain Irshad Bhatti Dong-Jun Han Jaekyun Moon,https://openreview.net/forum?id=ZUHo1WHVl1, Improving Soft Unification with Knowledge Graph Embedding Methods,Xuanming Cui Chionh Wei Peng Adriel Kuek Ser-Nam Lim,https://icml.cc/virtual/2025/poster/44679,"Neural Theorem Provers (NTPs) present a promising framework for neuro-symbolic reasoning, combining end-to-end differentiability with the interpretability of symbolic logic programming. However, optimizing NTPs remains a significant challenge due to their complex objective landscape and gradient sparsity. On the other hand, Knowledge Graph Embedding (KGE) methods offer smooth optimization with well-defined learning objectives but often lack interpretability. In this work, we propose several strategies to integrate the strengths of NTPs and KGEs, and demonstrate substantial improvements in both accuracy and computational efficiency. Specifically, we show that by leveraging the strength of structural learning in KGEs, we can greatly improve NTPs' poorly structured embedding space, while by substituting NTPs with efficient KGE operations, we can significantly reduce evaluation time by over 1000$\times$ on large-scale datasets such as WN18RR with a mild accuracy trade-off." -Oscillations Make Neural Networks Robust to Quantization,Jonathan Wenshøj Bob Pepin Raghavendra Selvan,https://openreview.net/forum?id=gqqKARGb8s, Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?,Antonia Wüst Tim Tobiasch Lukas Helff Inga Ibs Wolfgang Stammer Devendra Singh Dhami Constantin A.
Rothkopf Kristian Kersting,https://icml.cc/virtual/2025/poster/45299,"Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's o1, have emerged, seemingly demonstrating advanced reasoning capabilities across text and image modalities. However, the depth of these advances in language-guided perception and abstract reasoning remains underexplored, and it is unclear whether these models can truly live up to their ambitious promises. To assess the progress and identify shortcomings, we enter the wonderland of Bongard problems, a set of classic visual reasoning puzzles that require human-like abilities of pattern recognition and abstract reasoning. With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, when explicitly asked to recognize ground truth concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition." GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model,Zixiang Ai Zichen Liu Yuanhang Lei Zhenyu Cui Xu Zou Jiahuan Zhou,https://icml.cc/virtual/2025/poster/46482,"Pre-trained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19\% of trainable parameters." Peri-LN: Revisiting Normalization Layer in the Transformer Architecture,Jeonghoon Kim Byeongchan Lee Cheonbok Park Yeontaek Oh Beomjun Kim Taehwan Yoo Seongjin Shin Dongyoon Han Jinwoo Shin Kang Min Yoo,https://icml.cc/virtual/2025/poster/44675,"Selecting a layer normalization (LN) strategy that stabilizes training and speeds convergence in Transformers remains difficult, even for today’s large language models (LLM). 
We present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformers. Until recently, Pre-LN and Post-LN had long dominated practice despite their limitations in large-scale training. However, several open-source models have recently begun silently adopting a third strategy without much explanation. This strategy places the normalization layer **peripherally** around sublayers, a design we term **Peri-LN**. While Peri-LN has demonstrated promising performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis delineates the distinct behaviors of LN strategies, showing how each placement shapes activation variance and gradient propagation. To validate our theoretical insight, we conduct extensive experiments on Transformers up to $3.2$B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement of LN." -RSMerge: Bridging Head and Tail Classes via Subsampled Model Merging,Masih Aminbeidokhti Subhankar Roy Eric Granger Elisa Ricci Marco Pedersoli,https://openreview.net/forum?id=UcRPCN2Tvf, "SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training",Tianzhe Chu Yuexiang Zhai Jihan Yang Shengbang Tong Saining Xie Dale Schuurmans Quoc V Le Sergey Levine Yi Ma,https://icml.cc/virtual/2025/poster/44633,"Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks." Beyond Log-Concavity and Score Regularity: Improved Convergence Bounds for Score-Based Generative Models in W2-distance,Marta Gentiloni Silveri Antonio Ocello,https://icml.cc/virtual/2025/poster/44471,"Score-based Generative Models (SGMs) aim to sample from a target distribution by learning score functions using samples perturbed by Gaussian noise. Existing convergence bounds for SGMs in the $\mathcal{W}_2$-distance rely on stringent assumptions about the data distribution.
In this work, we present a novel framework for analyzing $\mathcal{W}_2$-convergence in SGMs, significantly relaxing traditional assumptions such as log-concavity and score regularity. Leveraging the regularization properties of the Ornstein-Uhlenbeck (OU) process, we show that weak log-concavity of the data distribution evolves into log-concavity over time. This transition is rigorously quantified through a PDE-based analysis of the Hamilton-Jacobi-Bellman equation governing the log-density of the forward process. Moreover, we establish that the drift of the time-reversed OU process alternates between contractive and non-contractive regimes, reflecting the dynamics of concavity. Our approach circumvents the need for stringent regularity conditions on the score function and its estimators, relying instead on milder, more practical assumptions. We demonstrate the wide applicability of this framework through explicit computations on Gaussian mixture models, illustrating its versatility and potential for broader classes of data distributions." Compositional Condition Question Answering in Tabular Understanding,Jun-Peng Jiang Tao Zhou De-Chuan Zhan Han-Jia Ye,https://icml.cc/virtual/2025/poster/44789,"Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder's inability to accurately recognize the content of a row, and the model's tendency to overlook conditions in the question. To address these, we introduce a new Compositional Condition Tabular Understanding method, called {\sc CoCoTab}. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce the conditional tokens between the visual patches and query embeddings, ensuring the model focuses on relevant parts of the table according to the conditions specified in the query. Additionally, we also introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at \url{https://github.com/LAMDA-Tabular/MMTU}." -Discrete Spatial Diffusion: Intensity-Preserving Diffusion Modeling,Javier E. Santos Roman Colman Agnese Marcato Nicholas Lubbers Yen Ting Lin,https://openreview.net/forum?id=VkqKOc0j9w, -"Disturbance-based Discretization, Differentiable IDS Channel, and an IDS-Correcting Code for DNA Storage",Alan J.X. Guo Mengyi Wei Yufan Dai Yali Wei Pengchen Zhang,https://openreview.net/forum?id=3EOll1fd1z, Field Matching: an Electrostatic Paradigm to Generate and Transfer Data,Alexander Kolesov S. I. Manukhov Vladimir Vladimirovich Palyulin Alexander Korotin,https://icml.cc/virtual/2025/poster/46213,"We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modelling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively.
We then learn the capacitor's electrostatic field using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments." Improved Discretization Complexity Analysis of Consistency Models: Variance Exploding Forward Process and Decay Discretization Scheme,Ruofeng Yang Bo Jiang Cheng Chen Shuai Li,https://icml.cc/virtual/2025/poster/46054,"Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, the current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decay discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) VE process and (2) decay stepsize and prove the state-of-the-art discretization complexity for consistency models. This result is competitive with the results of diffusion models and shows the potential of consistency models. To balance the computation and performance, previous empirical work further proposes a $2$-step consistency algorithm. In this work, we also analyze the role of $2$-step sampling and show that it improves the discretization complexity compared with one-step generation." -IO-LVM: Inverse Optimization Latent Variable Models with Graph-based Planning Applications,Alan Lahoud Erik Schaffernicht Johannes A. Stork,https://openreview.net/forum?id=k8wsUSPGgZ, Privacy Attacks on Image AutoRegressive Models,Antoni Kowalczuk Jan Dubiński Franziska Boenisch Adam Dziedzic,https://icml.cc/virtual/2025/poster/46325,"Image AutoRegressive generation has emerged as a new powerful paradigm with image autoregressive models (IARs) matching state-of-the-art diffusion models (DMs) in image quality (FID: 1.48 vs. 1.58) while allowing for a higher generation speed.However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to the ones of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a True Positive Rate at False Positive Rate = 1% of 86.38% vs. 6.38% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). 
Our results suggest a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are empirically significantly more vulnerable to privacy attacks compared to DMs that achieve similar performance. We release the code at https://github.com/sprintml/privacyattacksagainst_iars for reproducibility." Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers,Weilun Feng Chuanguang Yang Haotong Qin Xiangqi Li Yu Wang Zhulin An Libo Huang Boyu Diao Zixiang Zhao Yongjun Xu Michele Magno,https://icml.cc/virtual/2025/poster/45429,"Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present **Q-VDiT**, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the *Token aware Quantization Estimator* (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce *Temporal Maintenance Distillation* (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by **1.9$\times$**." RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression,Payman Behnam Yaosheng Fu Ritchie Zhao Po-An Tsai Zhiding Yu Alexey Tumanov,https://icml.cc/virtual/2025/poster/45253,"Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, end-to-end speedup of up to 3.7× as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme."
@@ -2920,10 +2849,8 @@ Automated Benchmark Generation for Repository-Level Coding Tasks,Konstantinos Ve BackSlash: Rate Constrained Optimized Training of Large Language Models,Jun Wu Jiangtao Wen Yuxing Han,https://icml.cc/virtual/2025/poster/45543,"The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (BackSlash), a novel training-time compression approach based on rate-distortion optimization (RDO). BackSlash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that BackSlash can reduce memory usage by 60\% - 90\% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, BackSlash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80\% pruning rates), and enables network simplification for accelerated inference on edge devices." Do NOT Think That Much for 2+3=? On the Overthinking of Long Reasoning Models,Xingyu Chen Jiahao Xu Tian Liang Zhiwei He Jianhui Pang Dian Yu Linfeng Song Qiuzhi Liu Mengfei Zhou Zhuosheng Zhang Rui Wang Zhaopeng Tu Haitao Mi Dong Yu,https://icml.cc/virtual/2025/poster/45540,"The remarkable performance of long reasoning models can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where long reasoning models generate redundant solutions that contribute minimally to accuracy and diversity, thereby wasting computational resources on simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by long reasoning models. Using a self-training paradigm, we propose strategies to mitigate overthinking, simplifying reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME. Our code is open-source and available at https://github.com/galaxyChen/overthinking." Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?,Simon Park Abhishek Panigrahi Yun Cheng Dingli Yu Anirudh Goyal Sanjeev Arora,https://icml.cc/virtual/2025/poster/43878,"Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning---even compared to LLMs on the same tasks presented in text form---giving rise to perceptions ofmodality imbalanceorbrittleness. 
Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought." -House of Cards: Massive Weights in LLMs,Jaehoon Oh Seungjun Shin Dokwan Oh,https://openreview.net/forum?id=uVuyDdHuZ0, Hypo3D: Exploring Hypothetical Reasoning in 3D,Ye Mao Weixun Luo Junpeng Jing Anlan Qiu Krystian Mikolajczyk,https://icml.cc/virtual/2025/poster/46012,"The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduceHypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the change is irrelevant to the question, models often incorrectly adjust their answers. The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D." IBCircuit: Towards Holistic Circuit Discovery with Information Bottleneck,Tian Bian Yifan Niu Chaohao Yuan Chengzhi Piao Bingzhe Wu Long-Kai Huang Yu Rong Tingyang Xu Hong Cheng Jia Li,https://icml.cc/virtual/2025/poster/46175,"Circuit discovery has recently attracted attention as a potential research direction to explain the non-trivial behaviors of language models. It aims to find the computational subgraphs, also known as circuits, within the model that are responsible for solving specific tasks. However, most existing studies overlook the holistic nature of these circuits and require designing specific corrupted activations for different tasks, which is inaccurate and inefficient. 
In this work, we propose an end-to-end approach based on the principle of Information Bottleneck, called IBCircuit, to holistically identify informative circuits. In contrast to traditional causal interventions, IBCircuit is an optimization framework for holistic circuit discovery and can be applied to any given task without tedious corrupted activation design. In both the Indirect Object Identification (IOI) and Greater-Than tasks, IBCircuit identifies more faithful and minimal circuits in terms of critical node components and edge components compared to recent related work." -Large Language Diffusion Models,Shen Nie Fengqi Zhu Zebin You Xiaolu Zhang Jingyang Ou Jun Hu JUN ZHOU Yankai Lin Ji-Rong Wen Chongxuan Li,https://openreview.net/forum?id=W2tWu0aikL, Large Language Models are Demonstration Pre-Selectors for Themselves,Jiarui Jin Yuwei Wu Haoxuan Li Xiaoting He Weinan Zhang Yiming Yang Yong Yu Jun Wang Mengyue Yang,https://icml.cc/virtual/2025/poster/44899,"In-context learning with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training dataset. However, previous few-shot in-context learning methods, which calculate similarity scores for choosing demonstrations, incur high computational costs by repeatedly retrieving large-scale datasets for each query. This is due to their failure to recognize that not all demonstrations are equally informative, and many less informative demonstrations can be inferred from a core set of highly informative ones. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel \emph{pre-selection} framework that identifies a core subset of demonstrations containing the most informative examples. This subset, referred to as the FEEDER set, consists of demonstrations that capture both the ''sufficiency'' and ''necessity'' information to infer the entire dataset. Notice that FEEDER is selected before the few-shot in-context learning, enabling more efficient few-shot demonstration selection from a smaller set. To identify FEEDER, we propose a novel, effective tree-based algorithm. Once selected, it can replace the original dataset, leading to improved efficiency and prediction accuracy in few-shot in-context learning. Additionally, FEEDER also benefits LLM fine-tuning: we propose a bi-level optimization method enabling more efficient training without sacrificing performance when datasets become smaller. Our experiments, on 6 text classification datasets, 1 reasoning dataset, and 1 semantic-parsing dataset, across 6 LLMs (ranging from 335M to 7B parameters), demonstrate that: (i) In few-shot inference, FEEDER achieves superior (or comparable) performance while utilizing only half the input training data. (ii) In fine-tuning, FEEDER significantly boosts the performance of LLMs." Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge,Swarnadeep Saha Xian Li Marjan Ghazvininejad Jason E Weston Tianlu Wang,https://icml.cc/virtual/2025/poster/45391,"LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied.
Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves new state-of-the-art performance for generative reward models on RewardBench and PPE, despite being trained on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models." Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation,Sadegh Mahdavi Muchen Li Kaiwen Liu Christos Thrampoulidis Leonid Sigal Renjie Liao,https://icml.cc/virtual/2025/poster/44681,"Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain."
In this paper, we develop Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator." Reflection-Window Decoding: Text Generation with Selective Refinement,Zeyu Tang Zhenhao Chen Xiangchen Song Loka Li Yunlong Deng Yifan Shen Guangyi Chen Peter Spirtes Kun Zhang,https://icml.cc/virtual/2025/poster/44024,"The autoregressive decoding for text generation in large language models (LLMs), while widely used, is inherently suboptimal due to the lack of a built-in mechanism to perform refinement and/or correction of the generated content. In this paper, we consider optimality in terms of the joint probability over the generated response, when jointly considering all tokens at the same time. We theoretically characterize the potential deviation of the autoregressively generated response from its globally optimal counterpart that is of the same length. Our analysis suggests that we need to be cautious when noticeable uncertainty arises during text generation, which may signal the sub-optimality of the generation history. To address the pitfall of autoregressive decoding for text generation, we propose an approach that incorporates a sliding reflection window and a pausing criterion, such that refinement and generation can be carried out interchangeably as the decoding proceeds. Our selective refinement framework strikes a balance between efficiency and optimality, and our extensive experimental results demonstrate the effectiveness of our approach." Retraining-free Merging of Sparse MoE via Hierarchical Clustering,I-Chun Chen Hsu-Shen Liu Wei-Fang Sun Chen-Hao Chao Yen-Chang Hsu Chun-Yi Lee,https://icml.cc/virtual/2025/poster/44392,"Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral.
The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments. Our implementation is available at https://github.com/wazenmai/HC-SMoE." -Risk-aware Direct Preference Optimization under Nested Risk Measure,Lijun Zhang Lin Li Yajie Qi Huizhong Song Yaodong Yang Jun Wang Wei Wei,https://openreview.net/forum?id=jssraGlHtK, T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling,Zhenyu Hou Xin Lv Rui Lu Jiajie Zhang Yujiang Li Zijun Yao Juanzi Li Jie Tang Yuxiao Dong,https://icml.cc/virtual/2025/poster/43762,"Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through over-sampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1." Test-Time Learning for Large Language Models,Jinwu Hu Zitian Zhang Guohao Chen Xutao Wen Chao Shuai Wei Luo Bin Xiao Yuanqing Li Mingkui Tan,https://icml.cc/virtual/2025/poster/44367,"While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation." 
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback,Yafu Li Xuyang Hu Xiaoye Qu Linjie Li Yu Cheng,https://icml.cc/virtual/2025/poster/46149,"Large language models (LLMs) have presented impressive performance but often lack the flexibility to adapt to human preferences quickly without retraining. Inspired by the recent efforts on test-time scaling, we make the first attempt to propose Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, eliminating the need to update model parameters. Instead of relying on purely numerical rewards, TPO translates reward signals into \emph{textual} critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth of the inference process. Through case studies, we illustrate how TPO exploits the innate capacity of LLM to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly." TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization,Mingkang Zhu Xi Chen Zhongdao Wang Bei Yu Hengshuang Zhao Jiaya Jia,https://icml.cc/virtual/2025/poster/45181,"Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at https://github.com/dvlab-research/TGDPO." Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models,Peijie Liu Fengli Xu Yong Li,https://icml.cc/virtual/2025/poster/45104,"Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. 
In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with logistic regression model, we introduce Dynamic CoT, a method that dynamically select between CoT and direct answer. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2\%, and Dynamic CoT reduces token consumption by more than 35\% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment." -Training Large Language Models to Reason Efficiently,Daman Arora Andrea Zanette,https://openreview.net/forum?id=hSAlC1SgcY, UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning,Jiawei Zhang Shuang Yang Bo Li,https://icml.cc/virtual/2025/poster/44008,"Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. Nevertheless, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, as these agents extensively reason or plan before executing final actions. In this work, we present UDora, a unified red teaming framework designed for LLM agents that dynamically hijacks the agent's reasoning processes to compel malicious behavior. Specifically, UDora first generates the model’s reasoning trace for the given task, then automatically identifies optimal points within this trace to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora." Unlocking Post-hoc Dataset Inference with Synthetic Data,Bihe Zhao Pratyush Maini Franziska Boenisch Adam Dziedzic,https://icml.cc/virtual/2025/poster/44819,"The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. 
Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations. Our code is available at https://github.com/sprintml/PostHocDatasetInference." Unnatural Languages Are Not Bugs but Features for LLMs,Keyu Duan Yiran Zhao Zhili Feng Jinjie Ni Tianyu Pang Qian Liu Tianle Cai Longxu Dou Kenji Kawaguchi Anirudh Goyal J Zico Kolter Michael Qizhe Shieh,https://icml.cc/virtual/2025/poster/44282,"Large Language Models (LLMs) have been observed to process non-human-readable text sequences, such as jailbreak prompts, often viewed as a bug for aligned LLMs. In this work, we present a systematic investigation challenging this perception, demonstrating that unnatural languages - strings that appear incomprehensible to humans but maintain semantic meanings for LLMs - contain latent features usable by models. Notably, unnatural languages possess latent features that can be generalized across different models and tasks during inference. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on-par with those trained on natural language, achieving (49.71) win rates in Length-controlled AlpacaEval 2.0 in average across various base models. In addition, through comprehensive analysis, we demonstrate that LLMs process unnatural languages by filtering noise and inferring contextual meaning from filtered words. Our code is publicly available at https://github.com/John-AI-Lab/Unnatural_Language." @@ -2973,12 +2898,10 @@ Federated Learning for Feature Generalization with Convex Constraints,Dongwon Ki NTK-DFL: Enhancing Decentralized Federated Learning in Heterogeneous Settings via Neural Tangent Kernel,Gabriel Thompson Kai Yue Chau-Wai Wong Huaiyu Dai,https://icml.cc/virtual/2025/poster/44433,"Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess data of different distributions reflecting local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-client model deviation and improves both accuracy and convergence in heterogeneous settings. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. 
We validate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalization. Source code for NTK-DFL is available at https://github.com/Gabe-Thomp/ntk-dfl" Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing,Ke Zhu Shu Yang Xiaofei Wang,https://icml.cc/virtual/2025/poster/44990,"External controls from historical trials or observational data can augment randomized controlled trials when large-scale randomization is impractical or unethical, such as in drug evaluation for rare diseases. However, non-randomized external controls can introduce biases, and existing Bayesian and frequentist methods may inflate the type I error rate, particularly in small-sample trials where external data borrowing is most critical. To address these challenges, we propose a randomization inference framework that ensures finite-sample exact and model-free type I error rate control, adhering to the “analyze as you randomize” principle to safeguard against hidden biases. Recognizing that biased external controls reduce the power of randomization tests, we leverage conformal inference to develop an individualized test-then-pool procedure that selectively borrows comparable external controls to improve power. Our approach incorporates selection uncertainty into randomization tests, providing valid post-selection inference. Additionally, we propose an adaptive procedure to optimize the selection threshold by minimizing the mean squared error across a class of estimators encompassing both no-borrowing and full-borrowing approaches. The proposed methods are supported by non-asymptotic theoretical analysis, validated through simulations, and applied to a randomized lung cancer trial that integrates external controls from the National Cancer Database." Multi-Objective Causal Bayesian Optimization,Shriya Bhatija Paul-David Zuercher Jakob Thumm Thomas Bohné,https://icml.cc/virtual/2025/poster/43849,"In decision-making problems, the outcome of an intervention often depends on the causal relationships between system components and is highly costly to evaluate. In such settings, causal Bayesian optimization (CBO) exploits the causal relationships between the system variables and sequentially performs interventions to approach the optimum with minimal data. Extending CBO to the multi-outcome setting, we propose multi-objective Causal Bayesian optimization (MO-CBO), a paradigm for identifying Pareto-optimal interventions within a known multi-target causal graph. Our methodology first reduces the search space by discarding sub-optimal interventions based on the structure of the given causal graph. We further show that any MO-CBO problem can be decomposed into several traditional multi-objective optimization tasks. Our proposed MO-CBO algorithm is designed to identify Pareto-optimal interventions by iteratively exploring these underlying tasks, guided by relative hypervolume improvement. Experiments on synthetic and real-world causal graphs demonstrate the superiority of our approach over non-causal multi-objective Bayesian optimization in settings where causal information is available." -A New Rejection Sampling Approach to $k\text{-}\mathtt{means}$++ with Improved Tradeoffs,Poojan Chetan Shah Shashwat Agrawal Ragesh Jaiswal,https://openreview.net/forum?id=FswxmvMOSG, Modified K-means Algorithm with Local Optimality Guarantees,Mingyi Li Michael R.
Metel Akiko Takeda,https://icml.cc/virtual/2025/poster/43694,"The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on its ability to achieve a globally optimal solution, a rigorous analysis of its local optimality guarantees is still lacking. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete sense, with the same computational complexity as the original K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice, while our proposed methods provide improved locally optimal solutions with reduced clustering loss. Our code is available at https://github.com/lmingyi/LO-K-means." PROTOCOL: Partial Optimal Transport-enhanced Contrastive Learning for Imbalanced Multi-view Clustering,Xuqian Xue Yiming Lei Qi Cai Hongming Shan Junping Zhang,https://icml.cc/virtual/2025/poster/45370,"While contrastive multi-view clustering has achieved remarkable success, it implicitly assumes a balanced class distribution. However, real-world multi-view data primarily exhibits imbalanced class distributions. Consequently, existing methods suffer performance degradation due to their inability to perceive and model such imbalance. To address this challenge, we present the first systematic study of imbalanced multi-view clustering, focusing on two fundamental problems: i. perceiving class imbalance distribution, and ii. mitigating representation degradation of minority samples. We propose PROTOCOL, a novel PaRtial Optimal TranspOrt-enhanced COntrastive Learning framework for imbalanced multi-view clustering. First, for class imbalance perception, we map multi-view features into a consensus space and reformulate the imbalanced clustering as a partial optimal transport (POT) problem, augmented with progressive mass constraints and weighted KL divergence for class distributions. Second, we develop POT-enhanced class-rebalanced contrastive learning at both feature and class levels, incorporating logit adjustment and class-sensitive learning to enhance minority sample representations. Extensive experiments demonstrate that PROTOCOL significantly improves clustering performance on imbalanced multi-view data, filling a critical research gap in this field." AutoEval Done Right: Using Synthetic Data for Model Evaluation,Pierre Boyeau Anastasios Nikolas Angelopoulos Tianle Li Nir Yosef Jitendra Malik Michael I. Jordan,https://icml.cc/virtual/2025/poster/45243,The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased.
From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set,Mara Finkelstein Daniel Deutsch Parker Riley Juraj Juraska Geza Kovacs Markus Freitag,https://icml.cc/virtual/2025/poster/44938,"As LLMs continue to become more powerful and versatile, human evaluation has become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure the capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks." -MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache,Leyang Xue Yao Fu Zhan Lu Chuanhao Sun Luo Mai Mahesh K. Marina,https://openreview.net/forum?id=BL7WMLJKZM, Active Learning with Selective Time-Step Acquisition for PDEs,Yegon Kim Hyunsu Kim Gyeonghoon Ko Juho Lee,https://icml.cc/virtual/2025/poster/44573,"Accurately solving partial differential equations (PDEs) is critical to understanding complex scientific and engineering phenomena, yet traditional numerical solvers are computationally expensive. Surrogate models offer a more efficient alternative, but their development is hindered by the cost of generating sufficient training data from numerical solvers. In this paper, we present a novel framework for active learning (AL) in PDE surrogate modeling that reduces this cost. Unlike the existing AL methods for PDEs that always acquire entire PDE trajectories, our approach strategically generates only the most important time steps with the numerical solver, while employing the surrogate model to approximate the remaining steps. This dramatically reduces the cost incurred by each trajectory and thus allows the active learning algorithm to try out a more diverse set of trajectories given the same budget. To accommodate this novel framework, we develop an acquisition function that estimates the utility of a set of time steps by approximating its resulting variance reduction. We demonstrate the effectiveness of our method on several benchmark PDEs, including the Burgers' equation, Korteweg–De Vries equation, Kuramoto–Sivashinsky equation, the incompressible Navier-Stokes equation, and the compressible Navier-Stokes equation. Experiments show that our approach improves performance by large margins over the best existing method. Our method not only reduces average error but also the 99\%, 95\%, and 50\% quantiles of error, which is rare for an AL algorithm. All in all, our approach offers a data-efficient solution to surrogate modeling for PDEs."
The Relationship Between No-Regret Learning and Online Conformal Prediction,Ramya Ramalingam Shayan Kiyani Aaron Roth,https://icml.cc/virtual/2025/poster/45708,"Existing algorithms for online conformal prediction---guaranteeing marginal coverage in adversarial settings---are variants of online gradient descent (OGD), but their analyses of worst-case coverage do not follow from the regret guarantee of OGD. What is the relationship between no-regret learning and online conformal prediction? We observe that although standard regret guarantees imply marginal coverage in i.i.d. settings, this connection fails as soon as we either move to adversarial environments or ask for group conditional coverage. On the other hand, we show a tight connection between threshold calibrated coverage and swap-regret in adversarial settings, which extends to group-conditional (multi-valid) coverage. We also show that algorithms in the follow-the-regularized-leader family of no-regret learning algorithms (which includes online gradient descent) can be used to give group-conditional coverage guarantees in adversarial settings for arbitrary grouping functions. Via this connection we analyze and conduct experiments using a multi-group generalization of the ACI algorithm of Gibbs & Candes (2021)." A Geometric Approach to Personalized Recommendation with Set-Theoretic Constraints Using Box Embeddings,Shib Sankar Dasgupta Michael Boratko Andrew McCallum,https://icml.cc/virtual/2025/poster/46603,"Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is “comedy and action, but not romance”. In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyperrectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30% overall."
However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have {\em discontinuous} and {\em unknown} link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis of the effectiveness of the proposed methods has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations." Learning Vision and Language Concepts for Controllable Image Generation,Shaoan Xie Lingjing Kong Yujia Zheng Zeyu Tang Eric P. Xing Guangyi Chen Kun Zhang,https://icml.cc/virtual/2025/poster/44412,"Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored. In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions with identifiability guarantees. We formulate concept learning as a latent variable identification problem, representing atomic concepts in each modality as latent variables, with a graphical model to specify their interactions across modalities. Our theoretical contribution is to provide component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. Building on these theoretical insights, we demonstrate the practical utility of our theory in a downstream task: text-to-image (T2I) generation. We develop a principled T2I model that explicitly learns atomic textual and visual concepts with sparse connections between them, allowing us to achieve image generation and editing at the atomic concept level. Empirical evaluations show that our model outperforms existing methods in T2I generation tasks, offering superior controllability and interpretability." Mixed-curvature decision trees and random forests,Philippe Chlenski Quentin Chu Raiyan R. Khan Kaizhu Du Antonio Khalil Moretti Itsik Pe'er,https://icml.cc/virtual/2025/poster/43603,"Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation respects manifold geometry while preserving the algorithmic properties that make decision trees effective.
In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. In benchmarks on a diverse suite of 57 classification, regression, and link prediction tasks, our product RFs ranked first on 29 tasks and came in the top 2 for 41. This highlights the value of product RFs as straightforward yet powerful new tools for data analysis in product manifolds. Code for our method is available at https://github.com/pchlenski/manify." -ShapeEmbed: a self-supervised learning framework for shape quantification,Anna Foix Romero Craig Russell Alexander Krull Virginie Uhlmann,https://openreview.net/forum?id=P0bQIvwnEZ, Near-optimal Sketchy Natural Gradients for Physics-Informed Neural Networks,Maricela Best Mckay Avleen Kaur Chen Greif Brian Wetton,https://icml.cc/virtual/2025/poster/44747,"Natural gradient methods for PINNs have achieved state-of-the-art performance with errors several orders of magnitude smaller than those achieved by standard optimizers such as ADAM or L-BFGS. However, computing natural gradients for PINNs is prohibitively computationally costly and memory-intensive for all but small neural network architectures. We develop a randomized algorithm for natural gradient descent for PINNs that uses sketching to approximate the natural gradient descent direction. We prove that the change of coordinate Gram matrix used in a natural gradient descent update has rapidly-decaying eigenvalues for a one-layer, one-dimensional neural network and empirically demonstrate that this structure holds for four different example problems. Under this structure, our sketching algorithm is guaranteed to provide a near-optimal low-rank approximation of the Gramian. Our algorithm dramatically speeds up computation time and reduces memory overhead. Additionally, in our experiments, the sketched natural gradient outperforms the original natural gradient in terms of accuracy, often achieving an error that is an order of magnitude smaller. Training time for a network with around 5,000 parameters is reduced from several hours to under two minutes. Training can be practically scaled to large network sizes; we optimize a PINN for a network with over a million parameters within a few minutes, a task for which the full Gram matrix does not fit in memory." Shifting Time: Time-series Forecasting with Khatri-Rao Neural Operators,Srinath Dama Kevin Course Prasanth B. Nair,https://icml.cc/virtual/2025/poster/44565,"We present an operator-theoretic framework for temporal and spatio-temporal forecasting based on learning acontinuous time-shift operator. Our operator learning paradigm offers a continuous relaxation of the discrete lag factor used in traditional autoregressive models, enabling the history of a system up to a given time to be mapped to its future values. We parametrize the time-shift operator using Khatri-Rao neural operators (KRNOs), a novel architecture based on non-stationary integral transforms with nearly linear computational scaling. Our framework naturally handles irregularly sampled observations and enables forecasting at super-resolution in both space and time. Extensive numerical studies across diverse temporal and spatio-temporal benchmarks demonstrate that our approach achieves state-of-the-art or competitive performance with leading methods." 
TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree,Yu-Yang Qian Yuan-Ze Xu Zhen-Yu Zhang Peng Zhao Zhi-Hua Zhou,https://icml.cc/virtual/2025/poster/44546,"Many real-world applications collect data in a streaming environment, where learning tasks are encountered sequentially. This necessitates continual learning (CL) to update models online, enabling adaptation to new tasks while preserving past knowledge to prevent catastrophic forgetting. Nowadays, with the flourish of large pre-trained models (LPMs), efficiency has become increasingly critical for CL, due to their substantial computational demands and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of Low-Rank Adapters), a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity to enable efficient CL, particularly for LPMs. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds to efficiently explore the task structure. Furthermore, we use sparse gradient updates to facilitate parameter optimization, making the approach better suited for LPMs. Theoretical analysis is provided to justify the rationale behind our approach, and experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach across various domains, including vision and natural language processing tasks."
Stochastic Smoothed Primal-Dual Algorithms for Nonconvex Optimization with Linear Inequality Constraints,Ruichuan Huang Jiawei Zhang Ahmet Alacaoglu,https://icml.cc/virtual/2025/poster/44818,"We propose smoothed primal-dual algorithms for solving stochastic nonconvex optimization problems with linear \emph{inequality} constraints. Our algorithms are single-loop and only require a single (or two) samples of stochastic gradients at each iteration. A defining feature of our algorithm is that it is based on an inexact gradient descent framework for the Moreau envelope, where the gradient of the Moreau envelope is estimated using one step of a stochastic primal-dual (linearized) augmented Lagrangian algorithm. To handle inequality constraints and stochasticity, we combine the recently established global error bounds in constrained optimization with a Moreau envelope-based analysis of stochastic proximal algorithms. We establish the optimal (in their respective cases) $O(\varepsilon^{-4})$ and $O(\varepsilon^{-3})$ sample complexity guarantees for our algorithms and provide extensions to stochastic linear constraints. Unlike existing methods, iterations of our algorithms are free of subproblems, large batch sizes or increasing penalty parameters in their iterations and they use dual variable updates to ensure feasibility." Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed,Savelii Chezhegov Klyukin Yaroslav Andrei Semenov Aleksandr Beznosikov Alexander Gasnikov Samuel Horváth Martin Takáč Eduard Gorbunov,https://icml.cc/virtual/2025/poster/45593,"Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the later ones. Gradient clipping provably helps to achieve good high-probability convergence for such noises. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. We extend our results to the case of AdaGrad/Adam with delayed stepsizes. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise." 
-Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Extensions to Distributed Optimization and Comparison Oracle,Nikita Maksimovich Kornilov Philip Zmushko Andrei Semenov Alexander Gasnikov Aleksandr Beznosikov,https://openreview.net/forum?id=chmWoDwwqk, Global Optimization with a Power-Transformed Objective and Gaussian Smoothing,Chen Xu,https://icml.cc/virtual/2025/poster/46360,"We propose a novel method, namely Gaussian Smoothing with a Power-Transformed Objective (GS-PowerOpt), that solves global optimization problems in two steps: (1) perform a (exponential) power-$N$ transformation to the not necessarily differentiable objective $f:\mathbb{R}^d\rightarrow \mathbb{R}$ and get $f_N$, and (2) optimize the Gaussian-smoothed $f_N$ with stochastic approximations. Under mild conditions on $f$, for any $\delta>0$, we prove that with a sufficiently large power $N_\delta$, this method converges to a solution in the $\delta$-neighborhood of $f$'s global optimum point, at the iteration complexity of $O(d^4\varepsilon^{-2})$. If we require that $f$ is differentiable and further assume the Lipschitz condition on $f$ and its gradient, the iteration complexity reduces to $O(d^2\varepsilon^{-2})$, which is significantly faster than the standard homotopy method. In most of the experiments performed, our method produces better solutions than other algorithms that also apply the smoothing technique." Can Transformers Learn Full Bayesian Inference in Context?,Arik Reuter Tim G. J. Rudner Vincent Fortuin David Rügamer,https://icml.cc/virtual/2025/poster/46240,"Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context—without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://github.com/ArikReuter/ICLforFullBayesianInference" Random Policy Evaluation Uncovers Policies of Generative Flow Networks,Haoran He Emmanuel Bengio Qingpeng Cai Ling Pan,https://icml.cc/virtual/2025/poster/43990,"The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. GFlowNets share a strong connection with reinforcement learning (RL) that typically aims to maximize reward. A number of recent works explored connections between GFlowNets and maximum entropy (MaxEnt) RL, which incorporates entropy regularization into the standard RL objective. 
However, the relationship between GFlowNets and standard RL remains largely unexplored, despite the inherent similarities in their sequential decision-making nature.While GFlowNets can discover diverse solutions through specialized flow-matching objectives, connecting them to standard RL can simplify their implementation through well-established RL principles and also improve RL's capabilities in diverse solution discovery (a critical requirement in many real-world applications), and bridging this gap can further unlock the potential of both fields. In this paper, we bridge this gap by revealing a fundamental connection between GFlowNets and one of the most basic components of RL -- policy evaluation. Surprisingly, we find that the value function obtained from evaluating a uniform policy is closely associated with the flow functions in GFlowNets. Building upon these insights, we introduce a rectified random policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets based on simply evaluating a fixed random policy, offering a new perspective. Empirical results across extensive benchmarks demonstrate that RPE achieves competitive results compared to previous approaches, shedding light on the previously overlooked connection between (non-MaxEnt) RL and GFlowNets." Rethinking Aleatoric and Epistemic Uncertainty,Freddie Bickford Smith Jannik Kossen Eleanor Trollope Mark van der Wilk Adam Foster Tom Rainforth,https://icml.cc/virtual/2025/poster/46057,"The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all the distinct quantities that researchers are interested in. To address this we present a decision-theoretic perspective that relates rigorous notions of uncertainty, predictive performance and statistical dispersion in data. This serves to support clearer thinking as the field moves forward. Additionally we provide insights into popular information-theoretic quantities, showing they can be poor estimators of what they are often purported to measure, while also explaining how they can still be useful in guiding data acquisition." Bayesian Inference for Correlated Human Experts and Classifiers,Markelle Kelly Alex James Boyd Sam Showalter Mark Steyvers Padhraic Smyth,https://icml.cc/virtual/2025/poster/43804,"Applications of machine learning often involve making predictions based on both model outputs and the opinions of human experts. In this context, we investigate the problem of querying experts for class label predictions, using as few human queries as possible, and leveraging the class probability estimates of pre-trained classifiers. We develop a general Bayesian framework for this problem, modeling expert correlation via a joint latent representation, enabling simulation-based inference about the utility of additional expert queries, as well as inference of posterior distributions over unobserved expert labels. We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions relative to baselines in the cost of querying human experts while maintaining high prediction accuracy." -Bayesian Parameter Shift Rules in Variational Quantum Eigensolvers,Samuele Pedrielli Christopher J. 
Anders Lena Funcke Karl Jansen Kim Andrea Nicoli Shinichi Nakajima,https://openreview.net/forum?id=NjOAQ2nJuL, Conditioning Diffusions Using Malliavin Calculus,Jakiw Pidstrigach Elizabeth Louise Baker Carles Domingo-Enrich George Deligiannidis Nikolas Nüsken,https://icml.cc/virtual/2025/poster/46698,"In generative modelling and stochastic optimal control, a central computational task is to modify a reference diffusion process to maximise a given terminal-time reward. Most existing methods require this reward to be differentiable, using gradients to steer the diffusion towards favourable outcomes. However, in many practical settings, like diffusion bridges, the reward is singular, taking an infinite value if the target is hit and zero otherwise. We introduce a novel framework, based on Malliavin calculus and centred around a generalisation of the Tweedie score formula to nonlinear stochastic differential equations, that enables the development of methods robust to such singularities. This allows our approach to handle a broad range of applications, like diffusion bridges, or adding conditional controls to an already trained diffusion model. We demonstrate that our approach offers stable and reliable training, outperforming existing techniques. As a byproduct, we also introduce a novel score matching objective. Our loss functions are formulated such that they could readily be extended to manifold-valued and infinite dimensional diffusions." Action-Constrained Imitation Learning,Chia-Han Yeh Tse-Sheng Nan Risto Vuorio Wei Hung Hung Yen Wu Shao-Hua Sun Ping-Chun Hsieh,https://icml.cc/virtual/2025/poster/45492,"Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with a larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency." ADDQ: Adaptive distributional double Q-learning,Leif Döring Benedikt Wille Maximilian Birr Mihail Bîrsan Martin Slowik,https://icml.cc/virtual/2025/poster/46093,"Bias problems in the estimation of Q-values are a well-known obstacle that slows down convergence of Q-learning and actor-critic methods. One reason for the success of modern RL algorithms is, in part, a direct or indirect overestimation reduction mechanism. We introduce an easy-to-implement method built on top of distributional reinforcement learning (DRL) algorithms to deal with the overestimation in a locally adaptive way.
Our framework ADDQ is simple to implement: existing DRL implementations can be improved with a few lines of code. We provide theoretical backup and experimental results in tabular, Atari, and MuJoCo environments, comparisons with state-of-the-art methods, and a proof of convergence in the tabular case." @@ -3038,15 +2958,12 @@ Ad Hoc Teamwork via Offline Goal-Based Decision Transformers,Xinzhi Zhang Hohei Finite-Time Global Optimality Convergence in Deep Neural Actor-Critic Methods for Decentralized Multi-Agent Reinforcement Learning,Zhiyao Zhang Myeung Suk Oh FNU Hairi Ziyue Luo Alvaro Velasquez Jia Liu,https://icml.cc/virtual/2025/poster/44842,"Actor-critic methods for decentralized multi-agent reinforcement learning (MARL) facilitate collaborative optimal decision making without centralized coordination, thus enabling a wide range of applications in practice. To date, however, most theoretical convergence studies for existing actor-critic decentralized MARL methods are limited to the guarantee of a stationary solution under linear function approximation. This leaves a significant gap between the highly successful use of deep neural actor-critic for decentralized MARL in practice and the current theoretical understanding. To bridge this gap, in this paper, we make the first attempt to develop a deep neural actor-critic method for decentralized MARL, where both the actor and critic components are inherently non-linear. We show that our proposed method enjoys a global optimality guarantee with a finite-time convergence rate of $\mathcal{O}(1/T)$, where $T$ is the total number of iterations. This marks the first global convergence result for deep neural actor-critic methods in the MARL literature. We also conduct extensive numerical experiments, which verify our theoretical results." GradPS: Resolving Futile Neurons in Parameter Sharing Network for Multi-Agent Reinforcement Learning,Haoyuan Qin Zhengzhu Liu Chenxing Lin Chennan Ma Songzhu Mei Siqi Shen Cheng Wang,https://icml.cc/virtual/2025/poster/45651,"Parameter-sharing (PS) techniques have been widely adopted in cooperative Multi-Agent Reinforcement Learning (MARL). In PS, all the agents share a policy network with identical parameters, which enjoys good sample efficiency. However, PS could lead to homogeneous policies that limit MARL performance. We tackle this problem from the angle of gradient conflict among agents. We find that the existence of futile neurons, whose update is canceled out by gradient conflicts among agents, leads to poor learning efficiency and diversity. To address this deficiency, we propose GradPS, a gradient-based PS method. It dynamically creates multiple clones for each futile neuron. For each clone, a group of agents with low gradient conflict shares the neuron's parameters. Our method can enjoy good sample efficiency by sharing the gradients among agents of the same clone neuron. Moreover, it can encourage diverse behaviors through independently updating an exclusive clone neuron. Through extensive experiments, we show that GradPS can learn diverse policies with promising performance. The source code for GradPS is available at \url{https://github.com/xmu-rl-3dv/GradPS}."
Reidentify: Context-Aware Identity Generation for Contextual Multi-Agent Reinforcement Learning,Zhiwei Xu Kun Hu Xin Xin Weiliang Meng Yiwei Shi Hangyu Mao Bin Zhang Dapeng Li Jiangjin Yin,https://icml.cc/virtual/2025/poster/43673,"Generalizing multi-agent reinforcement learning (MARL) to accommodate variations in problem configurations remains a critical challenge in real-world applications, where even subtle differences in task setups can cause pre-trained policies to fail. To address this, we propose Context-Aware Identity Generation (CAID), a novel framework to enhance MARL performance under the Contextual MARL (CMARL) setting. CAID dynamically generates unique agent identities through the agent identity decoder built on a causal Transformer architecture. These identities provide contextualized representations that align corresponding agents across similar problem variants, facilitating policy reuse and improving sample efficiency. Furthermore, the action regulator in CAID incorporates these agent identities into the action-value space, enabling seamless adaptation to varying contexts. Extensive experiments on CMARL benchmarks demonstrate that CAID significantly outperforms existing approaches by enhancing both sample efficiency and generalization across diverse context variants." -The Meta-Representation Hypothesis,Zhengpeng Xie Jiahang Cao Qiang Zhang Jianxiong Zhang Changwei Wang Renjing Xu,https://openreview.net/forum?id=P1krvpwfW6, DiLQR: Differentiable Iterative Linear Quadratic Regulator via Implicit Differentiation,Shuyuan Wang Philip D Loewen Michael Forbes Bhushan Gopaluni Wei Pan,https://icml.cc/virtual/2025/poster/44176,"While differentiable control has emerged as a powerful paradigm combining model-free flexibility with model-based efficiency, the iterative Linear Quadratic Regulator (iLQR) remains underexplored as a differentiable component. The scalability of differentiating through extended iterations and horizons poses significant challenges, hindering iLQR from being an effective differentiable controller. This paper introduces DiLQR, a framework that facilitates differentiation through iLQR, allowing it to serve as a trainable and differentiable module, either as or within a neural network. A novel aspect of this framework is the analytical solution that it provides for the gradient of an iLQR controller through implicit differentiation, which ensures a constant backward cost regardless of iteration, while producing an accurate gradient. We evaluate our framework on imitation tasks on famous control benchmarks. Our analytical method demonstrates superior computational performance, achieving up to $\textbf{128x}$ speedup and a minimum of $\textbf{21x}$ speedup compared to automatic differentiation. Our method also demonstrates superior learning performance ($\mathbf{10^6x}$) compared to traditional neural network policies and better model loss with differentiable controllers that lack exact analytical gradients. Furthermore, we integrate our module into a larger network with visual inputs to demonstrate the capacity of our method for high-dimensional, fully end-to-end tasks. Codes can be found on the project homepage~\url{https://sites.google.com/view/dilqr/}." 
Flow-based Domain Randomization for Learning and Sequencing Robotic Skills,Aidan Curtis Eric Li Michael Noseworthy Nishad Gothoskar Sachin Chitta Hui Li Leslie Pack Kaelbling Nicole E Carey,https://icml.cc/virtual/2025/poster/46239,"Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies learned in simulation. By randomizing properties of the environment during training, the learned policy can be robust to uncertainty along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate the problem of automatically discovering this sampling distribution via entropy-regularized reward maximization of a neural sampling distribution in the form of a normalizing flow. We show that this architecture is more flexible and results in better robustness than existing approaches to learning simple parameterized sampling distributions. We demonstrate that these policies can be used to learn robust policies for contact-rich assembly tasks. Additionally, we explore how these sampling distributions, in combination with a privileged value function, can be used for out-of-distribution detection in the context of an uncertainty-aware multi-step manipulation planner." Algorithmic Recourse for Long-Term Improvement,Kentaro Kanamori Ken Kobayashi Satoshi Hara Takuya Takagi,https://icml.cc/virtual/2025/poster/44455,"Algorithmic recourse aims to provide a recourse action for altering an unfavorable prediction given by a model into a favorable one (e.g., loan approval). In practice, it is also desirable to ensure that an action makes the real-world outcome better (e.g., loan repayment). We call this requirement improvement. Unfortunately, existing methods cannot ensure improvement unless we know the true oracle. To address this issue, we propose a framework for suggesting improvement-oriented actions from a long-term perspective. Specifically, we introduce a new online learning task of assigning actions to a given sequence of instances. We assume that we can observe delayed feedback on whether the past suggested action achieved improvement. Using the feedback, we estimate an action that can achieve improvement for each instance. To solve this task, we propose two approaches based on contextual linear bandit and contextual Bayesian optimization. Experimental results demonstrated that our approaches could assign improvement-oriented actions to more instances than the existing methods." -Fisher Divergence for Attribution through Stochastic Differential Equations,XIONGREN CHEN Jiuyong Li Jixue Liu Lin Liu Stefan Peters Thuc Duy Le Wentao Gao Xiaojing Du,https://openreview.net/forum?id=eHIc7SL0sr, Graph Inverse Style Transfer for Counterfactual Explainability,Bardh Prenkaj Efstratios Zaradoukas Gjergji Kasneci,https://icml.cc/virtual/2025/poster/46421,"Counterfactual explainability seeks to uncover model decisions by identifying minimal changes to the input that alter the predicted outcome. This task becomes particularly challenging for graph data due to the need to preserve structural integrity and semantic meaning. Unlike prior approaches that rely on forward perturbation mechanisms, we introduce Graph Inverse Style Transfer (GIST), the first framework to re-imagine graph counterfactual generation as a backtracking process, leveraging spectral style transfer.
By aligning the global structure with the original input spectrum and preserving local content faithfulness, GIST produces valid counterfactuals as interpolations between the input style and counterfactual content. Tested on 8 binary and multi-class graph classification benchmarks, GIST achieves a remarkable +7.6% improvement in the validity of produced counterfactuals and significant gains (+45.5%) in faithfully explaining the true class distribution. Additionally, GIST's backtracking mechanism effectively mitigates overshooting the underlying predictor's decision boundary, minimizing the spectral differences between the input and the counterfactuals. These results challenge traditional forward perturbation methods, offering a novel perspective that advances graph explainability." Selective Preference Aggregation,Shreyas Kadekodi Hayden McTavish Berk Ustun,https://icml.cc/virtual/2025/poster/45248,"Many applications in machine learning and decision-making rely on procedures to aggregate human preferences. In such tasks, individuals express ordinal preferences over a set of items through votes, ratings, or pairwise comparisons. We then summarize their collective preferences as a ranking. Standard methods for preference aggregation are designed to return rankings that arbitrate individual disagreements in ways that are faithful and fair. In this work, we introduce a paradigm for *selective aggregation*, where we can avoid the need to arbitrate dissent by abstaining from comparison. We summarize collective preferences as a *selective ranking* -- i.e., a partial order where we can only compare items where at least $100\cdot(1 - \tau)\%$ of individuals agree. We develop algorithms to build selective rankings that achieve all possible trade-offs between comparability and disagreement, and derive formal guarantees on their safety and stability. We conduct an extensive set of experiments on real-world datasets to benchmark our approach and demonstrate its functionality. Our results show selective aggregation can promote transparency and robustness by revealing disagreement and abstaining from arbitration."
TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference,Jack Min Ong Matthew Di Ferrante Aaron Pazdera Ryan Garner Sami Jaghouar Manveer Basra Max Ryabinin Johannes Hagemann,https://icml.cc/virtual/2025/poster/46281,"Large language models (LLMs) have proven to be very capable, but access to frontier models currently relies on inference providers. This introduces trust challenges: how can we be sure that the provider is using the model configuration they claim? We propose TOPLOC, a novel method for verifiable inference that addresses this problem. TOPLOC leverages a compact locality-sensitive hashing mechanism for intermediate activations, which can detect unauthorized modifications to models, prompts, or precision with 100\% accuracy, achieving no false positives or negatives in our empirical evaluations. Our approach is robust across diverse hardware configurations, GPU types, and algebraic reorderings, which allows for validation speeds significantly faster than the original inference. By introducing a polynomial encoding scheme, TOPLOC minimizes the memory overhead of the generated proofs by $1000\times$, requiring only 258 bytes of storage per 32 new tokens, compared to the 262 KB requirement of storing the token embeddings directly for Llama 3.1-8B-Instruct. Our method empowers users to verify LLM inference computations efficiently, fostering greater trust and transparency in open ecosystems and laying a foundation for decentralized, verifiable and trustless AI services." -Understanding the learned look-ahead behavior of chess neural networks,Diogo Cruz,https://openreview.net/forum?id=OcBAd0JPxv, Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models,Rei Higuchi Taiji Suzuki,https://icml.cc/virtual/2025/poster/44313,"Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models like the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn't guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods, showcasing its effectiveness and potential for significant improvement. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs." Preference learning made easy: Everything should be understood through win rate,Lily H Zhang Rajesh Ranganath,https://icml.cc/virtual/2025/poster/45135,"Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data.
First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices which affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives." FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering for Enabling Fair LLM-Based Recommender Systems,Arya Fayyazi Mehdi Kamal Massoud Pedram,https://icml.cc/virtual/2025/poster/44576,"We propose FACTER, a fairness-aware framework for LLM-based recommendation systems that integrates conformal prediction with dynamic prompt engineering. By introducing an adaptive semantic variance threshold and a violation-triggered mechanism, FACTER automatically tightens fairness constraints whenever biased patterns emerge. We further develop an adversarial prompt generator that leverages historical violations to reduce repeated demographic biases without retraining the LLM. Empirical results on MovieLens and Amazon show that FACTER substantially reduces fairness violations (up to 95.5%) while maintaining strong recommendation accuracy, revealing semantic variance as a potent proxy of bias."
Kearns & Li (1988) implies that the accuracy of the deterministic BOC without any fairness constraints is robust (Lipschitz) to malicious noise in the data distribution. We demonstrate that their robustness guarantee breaks down when we add fairness constraints. Hence, we consider the randomized Fair BOC, and our central result is that its accuracy is robust to malicious noise in the data distribution. Our robustness result applies to various fairness constraints---Demographic Parity, Equal Opportunity, Predictive Equality. Beyond robustness, we demonstrate that randomization leads to better accuracy and efficiency. We show that the randomized Fair BOC is nearly-deterministic, and gives randomized predictions on at most one data point, hence availing numerous benefits of randomness, while using very little of it." Policy Design for Two-sided Platforms with Participation Dynamics,Haruka Kiyohara Fan Yao Sarah Dean,https://icml.cc/virtual/2025/poster/43919,"In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics: viewers benefit from increases in provider populations, while providers benefit from increases in viewer population. Despite the importance of such “population effects” on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and recommender policy design on two-sided platforms under the population effects for the first time. Our control- and game-theoretic findings warn against the use of the standard “myopic-greedy” policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term social welfare by taking the population effects into account, and demonstrate its effectiveness in synthetic and real-data experiments. Our experiment code is available at https://github.com/sdean-group/dynamics-two-sided-market." -Reliable Image Quality Evaluation and Mitigation of Quality Bias in Generative Models,Hoin Jung Shenyu Lu De Wang Xiaoqian Wang,https://openreview.net/forum?id=zC0ZxjXixH, Janus: Dual-Server Multi-Round Secure Aggregation with Verifiability for Federated Learning,Lang Pu Jingjing Gu Chao Lin Xinyi Huang,https://icml.cc/virtual/2025/poster/45774,"Secure Aggregation (SA) is a cornerstone of Federated Learning (FL), ensuring that user updates remain hidden from servers. The advanced Flamingo (S\&P'23) has realized multi-round aggregation and improved efficiency. However, it still faces several key challenges: scalability issues with dynamic user participation, a lack of verifiability for server-side aggregation results, and vulnerability to Model Inconsistency Attacks (MIA) caused by a malicious server distributing inconsistent models. To address these issues, we propose $\textit{Janus}$, a generic SA scheme based on dual-server architecture. Janus ensures security against up to $n-2$ colluding clients (where $n$ is the total client count), which prevents privacy breaches for non-colluders. Additionally, Janus is model-independent, ensuring applicability across any FL model without specific adaptations. Furthermore, Janus introduces a new cryptographic primitive, Separable Homomorphic Commitment, which enables clients to efficiently verify the correctness of aggregation. 
Finally, extensive experiments show that Janus not only significantly enhances security but also reduces per-client communication and computation overhead from logarithmic to constant scale, with a tolerable impact on model performance." """Who experiences large model decay and why?"" A Hierarchical Framework for Diagnosing Heterogeneous Performance Drift",Harvineet Singh Fan Xia Alexej Gossmann Andrew Chuang Julian C. Hong Jean Feng,https://icml.cc/virtual/2025/poster/45315,"Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks ""Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?"" (Where?) and, if so, dives deeper to ask ""Can we explain this using more detailed variable(subset)-specific shifts?"" (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay." -Text-Image Dual Consistency-Guided OOD Detection with Pretrained Vision-Language Models,Fayi Le Wenwu He Chentao Cao Dong Liang Zhuo-Xu Cui,https://openreview.net/forum?id=jiJAGTW98t, Understanding Model Ensemble in Transferable Adversarial Attack,Wei Yao Zeliang Zhang Huayi Tang Yong Liu,https://icml.cc/virtual/2025/poster/43506,"Model ensemble adversarial attack has become a powerful method for generating transferable adversarial examples that can target even unknown models, but its theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attack. We first define transferability error to measure the error in adversarial transferability, alongside concepts of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant, which rigidly explains the origin of transferability error in model ensemble attack: the vulnerability of an adversarial example to ensemble components, and the diversity of ensemble components. Furthermore, we apply the latest mathematical tools in information theory to bound the transferability error using complexity and generalization terms, validating three practical guidelines for reducing transferability error: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step forward in understanding transferable model ensemble adversarial attacks."
Adversaries Can Misuse Combinations of Safe Models,Erik Jones Anca Dragan Jacob Steinhardt,https://icml.cc/virtual/2025/poster/45845,"Developers try to evaluate whether an AI system can accomplish malicious tasks before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for such misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, Python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. Our work suggests that even perfectly-aligned frontier systems enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation." Automated Red Teaming with GOAT: the Generative Offensive Agent Tester,Maya Pavlova Erik Brinkman Krithika Iyer Vítor Albiero Joanna Bitton Hailey Nguyen Cristian Canton Ferrer Ivan Evtimov Aaron Grattafiori,https://icml.cc/virtual/2025/poster/44754,"Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, and 91% against Llama 3.1 70B and 94% for GPT-4o when evaluated against larger models on the JailbreakBench dataset."
@@ -3071,7 +2986,6 @@ BaxBench: Can LLMs Generate Correct and Secure Backends?,Mark Vero Niels Mündle Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems,YUTONG WU Jie Zhang Yiming Li Chao Zhang Qing Guo Han Qiu Nils Lukas Tianwei Zhang,https://icml.cc/virtual/2025/poster/46436,"Vision Language Model (VLM) Agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system maintains its integrity during adversarial attacks. Multi-agent systems lack robustness, as a successful exploit against one agent can spread and infect other agents to undermine the entire system's integrity. We propose a defense, Cowpox, to provably enhance the robustness of a multi-agent system via a distributed mechanism that improves the recovery rate of agents by limiting the expected number of infections to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees." Hardware and Software Platform Inference,Cheng Zhang Hanna Foerster Robert D. Mullins Yiren Zhao Ilia Shumailov,https://icml.cc/virtual/2025/poster/44242,"It is now a common business practice to buy access to large language model (LLM) inference rather than self-host, because of significant upfront hardware infrastructure and energy costs. However, as a buyer, there is no mechanism to verify the authenticity of the advertised service including the serving hardware platform, e.g., that it is actually being served using an NVIDIA H100. Furthermore, there are reports suggesting that model providers may deliver models that differ slightly from the advertised ones, often to make them run on less expensive hardware. That way, a client pays a premium for capable model access on more expensive hardware, yet ends up being served by a (potentially less capable) cheaper model on cheaper hardware. In this paper we introduce hardware and software platform inference (HSPI) -- a method for identifying the underlying GPU architecture and software stack of a (black-box) machine learning model solely based on its input-output behavior. Our method leverages the inherent differences of various GPU architectures and compilers to distinguish between different GPU types and software stacks. By analyzing the numerical patterns in the model's outputs, we propose a classification framework capable of accurately identifying the GPU used for model inference as well as the underlying software configuration. Our findings demonstrate the feasibility of inferring GPU type from black-box models. We evaluate HSPI against models served on different real hardware and find that in a white-box setting we can distinguish between different GPUs with between 83.9% and 100% accuracy. Even in a black-box setting we are able to achieve results that are up to three times higher than random guess accuracy." Stay-Positive: A Case for Ignoring Real Image Features in Fake Image Detection,Anirudh Sundara Rajan Yong Jae Lee,https://icml.cc/virtual/2025/poster/45069,"Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector’s tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions.
These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay-Positive, an algorithm designed to constrain the detector’s focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay-Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images." -Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models,Yongcan Yu Yanbo Wang Ran He Jian Liang,https://openreview.net/forum?id=YaUyLKQGk0, Conformal Tail Risk Control for Large Language Model Alignment,Catherine Chen Jingyan Shen Zhun Deng Lihua Lei,https://icml.cc/virtual/2025/poster/45795,"Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society demands assurance of the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events, for instance, toxic answers, humiliating language, and offensive outputs. Due to the costly nature of acquiring human annotations, general-purpose scoring models have been created to automate the process of quantifying these tail events. This phenomenon introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for blackbox models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM with high confidence. The theoretical foundation of our method relies on the connection between conformal risk control and a traditional family of statistics, i.e., L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment." Improved Approximations for Hard Graph Problems using Predictions,Anders Aamand Justin Y. Chen Siddharth Gollapudi Sandeep Silwal Hao WU,https://icml.cc/virtual/2025/poster/46430,"We design improved approximation algorithms for NP-hard graph problems by incorporating predictions (e.g., learned from past data). Our prediction model builds upon and extends the $\varepsilon$-prediction framework by Cohen-Addad, d'Orsi, Gupta, Lee, and Panigrahi (NeurIPS 2024). We consider an edge-based version of this model, where each edge provides two bits of information, corresponding to predictions about whether each of its endpoints belongs to an optimal solution. Even with weak predictions where each bit is only $\varepsilon$-correlated with the true solution, this information allows us to break approximation barriers in the standard setting. We develop algorithms with improved approximation ratios for MaxCut, Vertex Cover, Set Cover, and Maximum Independent Set problems (among others).
Across these problems, our algorithms share a unifying theme, where we separately satisfy constraints related to high-degree vertices (using predictions) and low-degree vertices (without using predictions) and carefully combine the answers." Can Transformers Reason Logically? A Study in SAT Solving,Leyan Pan Vijay Ganesh Jacob Abernethy Chris Esposo Wenke Lee,https://icml.cc/virtual/2025/poster/46444,"We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than \textit{programming} a transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but have limited length generalization, which is consistent with the implications of our theoretical result."
To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities:Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents." A Machine Learning Approach to Duality in Statistical Physics,Prateek Gupta Andrea E. V. Ferrari Nabil Iqbal,https://icml.cc/virtual/2025/poster/44740,"The notion of duality -- that a given physical system can have two different mathematical descriptions -- is a key idea in modern theoretical physics. Establishing a duality in lattice statistical mechanics models requires the construction of a dual Hamiltonian and a map from the original to the dual observables. By using neural networks to parameterize these maps and introducing a loss function that penalises the difference between correlation functions in original and dual models, we formulate the process of duality discovery as an optimization problem. We numerically solve this problem and show that our framework can rediscover the celebrated Kramers-Wannier duality for the 2d Ising model, numerically reconstructing the known mapping of temperatures. We further investigate the 2d Ising model deformed by a plaquette coupling and find families of ``approximate duals''. We discuss future directions and prospects for discovering new dualities within this framework." -Continuous machine learning on Euclidean graphs with unordered vertices,Yury Elkin Vitaliy Kurlin,https://openreview.net/forum?id=DhgofViQyk, DragSolver: A Multi-Scale Transformer for Real-World Automotive Drag Coefficient Estimation,Ye Liu Yuntian Chen,https://icml.cc/virtual/2025/poster/44542,"Automotive drag coefficient ($C_d$) is pivotal to energy efficiency, fuel consumption, and aerodynamic performance. However, costly computational fluid dynamics (CFD) simulations and wind tunnel tests struggle to meet the rapid-iteration demands of automotive design. We present DragSolver, a Transformer-based framework for rapid and accurate $C_d$ estimation from large-scale, diverse 3D vehicle models. DragSolver tackles four key real-world challenges: (1) multi-scale feature extraction to capture both global shape and fine local geometry; (2) heterogeneous scale normalization to handle meshes with varying sizes and densities; (3) surface-guided gating to suppress internal structures irrelevant to external aerodynamics; and (4) epistemic uncertainty estimation via Monte Carlo dropout for risk-aware design.
Extensive evaluations on three industrial-scale datasets (DrivaerNet, DrivaerNet++, and DrivaerML) show that DragSolver outperforms existing approaches in accuracy and generalization, achieving an average reduction of relative $L_2$ error by 58.7% across real-world datasets. Crucially, DragSolver is the first to achieve reliable, real-time $C_d$ inference on production-level automotive geometries." Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages,Michael Sun Weize Yuan Gang Liu Wojciech Matusik Jie Chen,https://icml.cc/virtual/2025/poster/45739,"Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows. Code is available at https://github.com/shiningsunnyday/induction." Quadruple Attention in Many-body Systems for Accurate Molecular Property Predictions,Jiahua Rao Dahao Xu Wentao Wei Yicong Chen Mingjun Yang Yuedong Yang,https://icml.cc/virtual/2025/poster/44950,"While Graph Neural Networks and Transformers have shown promise in predicting molecular properties, they struggle with directly modeling complex many-body interactions. Current methods often approximate interactions like three- and four-body terms in message passing, while attention-based models, despite enabling direct atom communication, are typically limited to triplets, making higher-order interactions computationally demanding. To address the limitations, we introduce MABNet, a geometric attention framework designed to model four-body interactions by facilitating direct communication among atomic quartets. This approach bypasses the computational bottlenecks associated with traditional triplet-based attention mechanisms, allowing for the efficient handling of higher-order interactions. MABNet achieves state-of-the-art performance on benchmarks like MD22 and SPICE. These improvements underscore its capability to accurately capture intricate many-body interactions in large molecules. By unifying rigorous many-body physics with computational efficiency, MABNet advances molecular simulations for applications in drug design and materials discovery, while its extensible framework paves the way for modeling higher-order quantum effects." @@ -3105,7 +3018,6 @@ NeuroTree: Hierarchical Functional Brain Pathway Decoding for Mental Health Diso DexScale: Automating Data Scaling for Sim2Real Generalizable Robot Control,Guiliang Liu Yueci Deng Runyi Zhao Huayi Zhou Jian Chen Jietao Chen Ruiyan Xu Yunxin Tai Kui Jia,https://icml.cc/virtual/2025/poster/46168,"A critical prerequisite for achieving generalizable robot control is the availability of a large-scale robot training dataset. 
Due to the expense of collecting realistic robotic data, recent studies have explored simulating and recording robot skills in virtual environments. While simulated data can be generated at higher speeds, lower costs, and larger scales, the applicability of such simulated data remains questionable due to the gap between simulated and realistic environments. To advance the Sim2Real generalization, in this study, we present DexScale, a data engine designed to perform automatic skills simulation and scaling for learning deployable robot manipulation policies. Specifically, DexScale ensures the usability of simulated skills by integrating diverse forms of realistic data into the simulated environment, preserving semantic alignment with the target tasks. For each simulated skill in the environment, DexScale facilitates effective Sim2Real data scaling by automating the process of domain randomization and adaptation. Tuned by the scaled dataset, the control policy achieves zero-shot Sim2Real generalization across diverse tasks, multiple robot embodiments, and widely studied policy model architectures, highlighting its importance in advancing Sim2Real embodied intelligence." One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation,Zhendong Wang Max Li Ajay Mandlekar Zhenjia Xu Jiaojiao Fan Yashraj Narang Linxi Fan Yuke Zhu Yogesh Balaji Mingyuan Zhou Ming-Yu Liu Yu Zeng,https://icml.cc/virtual/2025/poster/45971,"Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process, stemming from iterative denoising steps, poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy (OneDP), a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only $2\%$-$10\%$ additional pre-training cost for convergence. We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided at our project page, and the code will be publicly available." Exploring Representations and Interventions in Time Series Foundation Models,Michał Wiliński Mononito Goswami Willa Potosnak Nina Żukowska Artur Dubrawski,https://icml.cc/virtual/2025/poster/44453,"Time series foundation models (TSFMs) promise to be powerful tools for a wide range of applications. However, their internal representations and learned concepts are still not well understood. In this study, we investigate the structure and redundancy of representations across various TSFMs, examining the self-similarity of model layers within and across different model sizes. This analysis reveals block-like redundancy in the representations, which can be utilized for informed pruning to improve inference speed and efficiency.
We also explore the concepts learned by these models, such as periodicity and trends. We demonstrate how conceptual priors can be derived from TSFM representations and leveraged to steer its outputs toward concept-informed predictions. Our work bridges representational analysis from language and vision models to TSFMs, offering new methods for building more computationally efficient and transparent TSFMs." -Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks,Alexandros Doumanoglou Kurt Driessens Dimitrios Zarpalas,https://openreview.net/forum?id=g7wLilhGhh, Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning,Run He Di Fang Yicheng Xu Yawen Cui Ming Li Cen Chen Ziqian Zeng Huiping Zhuang,https://icml.cc/virtual/2025/poster/43787,"Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual-projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate a classifier reconstruction process. This reconstruction exploits previous information in the covariance and prototype of each class after calibration with the estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, on various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods. Our code is available at https://github.com/RHe502/ICML25-DPCR." SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression,Mohammad Mozaffari Amir Yazdanbakhsh Maryam Mehri Dehnavi,https://icml.cc/virtual/2025/poster/46479,"Conventional model compression techniques for LLMs address high memory consumption and slow inference challenges but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate retraining cost, but struggle to achieve accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process. First, we formulate the quantization process using a probabilistic approach (SLIM-Quant) that enables us to apply uniform quantization. Then, we use an existing one-shot pruning method to apply semi-structured sparsity on top of the quantized weights. Finally, to compensate for the introduced aggregated quantization and sparsity error, we use a novel saliency function with unique invertible and additive features that enables us to mathematically compute the value of low-rank adapters.
SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. Models compressed with SLIM achieve up to 4.3× and 3.8× speedup on Nvidia RTX3060 and A100 GPUs, respectively. Additionally, they achieve up to 0.23× end-to-end memory reduction in comparison to their dense counterparts. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning." Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities,Yifang Chen Xiaoyu Li Yingyu Liang Zhenmei Shi Zhao Song,https://icml.cc/virtual/2025/poster/44146,"We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine ``next-scale prediction'' framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any word-to-image Lipschitz functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas."
Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses 2 key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset." VCT: Training Consistency Models with Variational Noise Coupling,Gianluigi Silvestri Luca Ambrogioni Chieh-Hsin Lai Yuhta Takida Yuki Mitsufuji,https://icml.cc/virtual/2025/poster/46068,"Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow matching. Its key innovation is a learned noise-data coupling scheme inspired by Variational Autoencoders, where a data-dependent encoder models noise emission. This enables VCT to adaptively learn noise-to-data pairings, reducing training variance relative to the fixed, unsorted pairings in classical CT. Experiments on multiple image datasets demonstrate significant improvements: our method surpasses baselines, achieves state-of-the-art FID among non-distillation CT approaches on CIFAR-10, and matches SoTA performance on ImageNet 64x64 with only two sampling steps. Code is available at https://github.com/sony/vct." -GraphFLEx: Structure Learning $\underline{\text{F}}$ramework for $\underline{\text{L}}$arge $\underline{\text{Ex}}$panding $\underline{\text{Graph}}$s,Mohit Kataria Nikita Malik Sandeep Kumar Jayadeva Jayadeva,https://openreview.net/forum?id=dwBBhbueYk, Pruning for GNNs: Lower Complexity with Comparable Expressiveness,Dun Ma Jianguo Chen Wenguo Yang Suixiang Gao Shengminjie Chen,https://icml.cc/virtual/2025/poster/46397,"In recent years, the pursuit of higher expressive power in graph neural networks (GNNs) has often led to more complex aggregation mechanisms and deeper architectures. To address these issues, we have identified redundant structures in GNNs, and by pruning them, we propose Pruned MP-GNNs, K-Path GNNs, and K-Hop GNNs based on their original architectures. We show that 1) Although some structures are pruned in Pruned MP-GNNs and Pruned K-Path GNNs, their expressive power has not been compromised. 2) K-Hop MP-GNNs and their pruned architecture exhibit equivalent expressiveness on regular and strongly regular graphs. 3) The complexity of pruned K-Path GNNs and pruned K-Hop GNNs is lower than that of MP-GNNs, yet their expressive power is higher. Experimental results validate our refinements, demonstrating competitive performance across benchmark datasets with improved efficiency." BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms,Yunlong Hou Fengzhuo Zhang Cunxiao Du Xuan Zhang Jiachun Pan Tianyu Pang Chao Du Vincent Y. F. 
Tan Zhuoran Yang,https://icml.cc/virtual/2025/poster/44460,"Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts." De-mark: Watermark Removal in Large Language Models,Ruibo Chen Yihan Wu Junfeng Guo Heng Huang,https://icml.cc/virtual/2025/poster/46417,"Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models (LMs). However, the robustness of the watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks." Enhancing Decision-Making of Large Language Models via Actor-Critic,Heng Dong Kefei Duan Chongjie Zhang,https://icml.cc/virtual/2025/poster/45712,"Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. 
Experiments across diverse environments -- including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) -- demonstrate the framework’s generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs’ intrinsic knowledge to advance decision-making capabilities in multi-step environments." Exploiting Presentative Feature Distributions for Parameter-Efficient Continual Learning of Large Language Models,Xin Cheng Jiabo Ye Haiyang Xu Ming Yan Ji Zhang Feng Liu Fei Huang Lei Feng,https://icml.cc/virtual/2025/poster/46354,"Endowing large language models (LLMs) with continual learning (CL) capacities is practically important, which enables them to dynamically acquire new knowledge over time. Although many effective methods have been proposed for CL of LLMs, they did not consider online scenarios, thereby sharing a common problem: information leakage (IL), where the task-related information of learned tasks is accessed or reused again. IL not only imposes potential risks on data privacy protection but also significantly hinders the deployment of LLMs in real-world scenarios. To avoid IL while maintaining outstanding CL performance, we propose a novel CL method for LLMs, which first characterizes a parameter-efficient fine-tuning (PEFT) block by a presentative feature distribution, and then dynamically selects the appropriate PEFT blocks for each instance based on its similarity with the presentative feature distributions. Extensive experiments validate the effectiveness of our method on the CL of LLM, showcasing its potential to enhance both privacy and adaptability in practical applications." From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?,Zhanke Zhou Xiao Feng Zhaocheng Zhu Jiangchao Yao Sanmi Koyejo Bo Han,https://icml.cc/virtual/2025/poster/45603,"While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning—where an LLM must interact with external systems to acquire missing evidence or data—has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM’s active reasoning skills. AR-Bench comprises three task families—detective cases, situation puzzles, and guessing numbers—that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. 
Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench." -Holes in Latent Space: Topological Signatures Under Adversarial Influence,Aideen Fay Inés García-Redondo Qiquan Wang Haim Dubossarsky Anthea Monod,https://openreview.net/forum?id=Q3yOTl0Ajo, KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference,Xing Li Zeyu XING Yiming Li Linping Qu Hui-Ling Zhen Yiwu Yao Wulong Liu Sinno Jialin Pan Mingxuan Yuan,https://icml.cc/virtual/2025/poster/43487,"KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner." MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving,Ruida WANG Rui Pan Yuxin Li Jipeng Zhang Yizhen Jia Shizhe Diao Renjie Pi Junjie Hu Tong Zhang,https://icml.cc/virtual/2025/poster/46138,"Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proof or perform tree search, but they fail to balance these tasks. We propose MA-LoT: Model-CollAboration Lean-based Long Chain-of-Thought, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel LoT-Transfer Learning training-inference pipeline, which enables the Long CoT thinking capability to LLMs without special data annotation. 
Extensive experiments show that our framework achieves a 61.07% accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective." Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity,Atefeh Sohrabizadeh Jialin Song Mingjie Liu Rajarshi Roy Chankyu Lee Jonathan Raiman Bryan Catanzaro,https://icml.cc/virtual/2025/poster/44274,"Large Language Models (LLMs) have demonstrated significant potential in code generation by following natural language instructions. Unfortunately, crucial real-world software engineering tasks, such as debugging or repository-level feature implementation, involve processing extensive contexts beyond current LLM context sizes and performing complex reasoning that is brittle using standard autoregressive decoding. Enhancing LLMs' performance in these scenarios requires careful consideration of the contextual information provided to the model, optimizing how the model leverages that, and identifying tools that enable more effective navigation of the development environment. To address these challenges, we introduce Nemotron-CORTEXA, an agentic system built on a predefined scaffold that enhances LLMs' ability to navigate and reason efficiently in complex software engineering contexts. Specifically, we develop a novel code embedding model that retrieves the most relevant files with greater precision, along with a localization agent that refines the granularity of the retrieval process. Additionally, we demonstrate that providing diverse contextual information and utilizing different prompt formats enable the model to identify and resolve issues more efficiently. We evaluate Nemotron-CORTEXA using SWE-bench, a benchmark derived from real-world GitHub issues. Compared to the widely used Agentless framework, Nemotron-CORTEXA achieves a higher issue resolution rate at a lower cost, highlighting its practical impact in addressing real-world software engineering challenges." @@ -3146,16 +3056,13 @@ CTBench: A Library and Benchmark for Certified Training,Yuhao Mao Stefan Balauca 
Specifically, we directly shift the perturbed space from pixel to feature space. Then, we perturb the features multiple times rather than just once in the feature space with the guidance of feature importance to enhance the efficiency of disrupting critical shared features. Finally, we invert the perturbed features to the pixels to generate more transferable adversarial examples. Numerous experimental results strongly demonstrate the superior transferability of P2FA over State-Of-The-Art (SOTA) attacks." You Always Recognize Me (YARM): Robust Texture Synthesis Against Multi-View Corruption,Weihang Ran Wei Yuan Yinqiang Zheng,https://icml.cc/virtual/2025/poster/44107,"Damage to imaging systems and complex external environments often introduce corruption, which can impair the performance of deep learning models pretrained on high-quality image data. Previous methods have focused on restoring degraded images or fine-tuning models to adapt to out-of-distribution data. However, these approaches struggle with complex, unknown corruptions and often reduce model accuracy on high-quality data. Inspired by the use of warning colors and camouflage in the real world, we propose designing a robust appearance that can enhance model recognition of low-quality image data. Furthermore, we demonstrate that certain universal features in radiance fields can be applied across objects of the same class with different geometries. We also examine the impact of different proxy models on the transferability of robust appearances. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms existing image restoration and model fine-tuning approaches across different experimental settings, and retains effectiveness when transferred to models with different architectures. Code will be available at https://github.com/SilverRAN/YARM." Autoencoder-Based Hybrid Replay for Class-Incremental Learning,Milad Khademi Nori IL MIN KIM Guanghui Wang,https://icml.cc/virtual/2025/poster/44510,"In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks $t$ increases. Current exemplar replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor to alleviate the requirement for large memory, achieving $\mathcal{O}(0.1 t)$ at the worst case with the computing complexity of $\mathcal{O}(t)$ while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication." 
-Big Cooperative Learning to Conquer Local Optima,Yulai Cong,https://openreview.net/forum?id=mg4ANJcd0d, A Recipe for Causal Graph Regression: Confounding Effects Revisited,Yujia Yin Tianyi Qu Zihao Wang Yifan Chen,https://icml.cc/virtual/2025/poster/45617,"Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided on https://github.com/causal-graph/CGR." Strategic A/B testing via Maximum Probability-driven Two-armed Bandit,Yu Zhang Shanshan Zhao Bokui Wan Jinjuan Wang Xiaodong Yan,https://icml.cc/virtual/2025/poster/46086,"Detecting a minor average treatment effect is a major challenge in large-scale applications, where even minimal improvements can have a significant economic impact. Traditional methods, reliant on normal distribution-based or expanded statistics, often fail to identify such minor effects because of their inability to handle small discrepancies with sufficient sensitivity. This work leverages a counterfactual outcome framework and proposes a maximum probability-driven two-armed bandit (TAB) process by weighting the mean volatility statistic, which controls Type I error. The implementation of permutation methods further enhances the robustness and efficacy. The established strategic central limit theorem (SCLT) demonstrates that our approach yields a more concentrated distribution under the null hypothesis and a less concentrated one under the alternative hypothesis, greatly improving statistical power. The experimental results indicate a significant improvement in the A/B testing, highlighting the potential to reduce experimental costs while maintaining high statistical power." Global-Local Dirichlet Processes for Clustering Grouped Data in the Presence of Group-Specific Idiosyncratic Variables,Arhit Chakrabarti Yang Ni Debdeep Pati Bani Mallick,https://icml.cc/virtual/2025/poster/43711,"We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is quite common; for example, in cancer genomic studies, molecular information is available for all cancers whereas cancer-specific clinical information may only be available for certain cancers. Existing grouped clustering methods only consider the shared variables but ignore valuable information from the group-specific variables. To allow for these group-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the ""global-local"" structure of the observations across groups. 
We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model. We theoretically quantify the approximation errors of the truncated prior, the corresponding finite mixture model, and the associated posterior distribution. We develop a fast variational Bayes algorithm for scalable posterior inference, which we illustrate with extensive simulations and a TCGA pan-gastrointestinal cancer dataset." G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration,Samuel Holt Max Ruiz Luyten Antonin Berthon Mihaela van der Schaar,https://icml.cc/virtual/2025/poster/45361,"Constructing robust simulators is essential for asking ""what if?"" questions and guiding policy in critical domains like healthcare and logistics. However, existing methods often struggle, either failing to generalize beyond historical data or, when using Large Language Models (LLMs), suffering from inaccuracies and poor empirical alignment. We introduce G-Sim, a hybrid framework that automates simulator construction by synergizing LLM-driven structural design with rigorous empirical calibration. G-Sim employs an LLM in an iterative loop to propose and refine a simulator's core components and causal relationships, guided by domain knowledge. This structure is then grounded in reality by estimating its parameters using flexible calibration techniques. Specifically, G-Sim can leverage methods that are both likelihood-free and gradient-free with respect to the simulator, such as gradient-free optimization for direct parameter estimation or simulation-based inference for obtaining a posterior distribution over parameters. This allows it to handle non-differentiable and stochastic simulators. By integrating domain priors with empirical evidence, G-Sim produces reliable, causally-informed simulators, mitigating data-inefficiency and enabling robust system-level interventions for complex decision-making." Online Detection of LLM-Generated Texts via Sequential Hypothesis Testing by Betting,Can Chen Jun-Kun Wang,https://icml.cc/virtual/2025/poster/44240,"Developing algorithms to differentiate between machine-generated texts and human-written texts has garnered substantial attention in recent years. Existing methods in this direction typically concern an offline setting where a dataset containing a mix of real and machine-generated texts is given upfront, and the task is to determine whether each sample in the dataset is from a large language model (LLM) or a human. However, in many practical scenarios, sources such as news websites, social media accounts, and online forums publish content in a streaming fashion. Therefore, in this online scenario, how to quickly and accurately determine whether the source is an LLM with strong statistical guarantees is crucial for these media or platforms to function effectively and prevent the spread of misinformation and other potential misuse of LLMs. To tackle the problem of online detection, we develop an algorithm based on the techniques of sequential hypothesis testing by betting that not only builds upon and complements existing offline detection techniques but also enjoys statistical guarantees, which include a controlled false positive rate and the expected time to correctly identify a source as an LLM. Experiments were conducted to demonstrate the effectiveness of our method." 
-Consistent Multigroup Low-rank Approximation,Antonis Matakos Martino Ciaperoni Heikki Mannila,https://openreview.net/forum?id=zLMOug6hHj, Efficient LiDAR Reflectance Compression via Scanning Serialization,Jiahao Zhu Kang You Dandan Ding Zhan Ma,https://icml.cc/virtual/2025/poster/46532,"Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, for effective dependencies exploration. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2$\times$ volume reduction against the original reflectance data, outperforming the state-of-the-art method by up to 22\% reduction of compressed bits while using only 2\% of its parameters. Moreover, a lightweight version of SerLiC achieves $\geq 10$ fps (frames per second) with just 111K parameters, which is attractive for real applications." Right Time to Learn: Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation,Guanglong Sun Hongwei Yan Liyuan Wang Qian Li Bo Lei Yi Zhong,https://icml.cc/virtual/2025/poster/44991,"Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). While it was originally proposed to train a more compact “student” model from a large “teacher” model, many recent efforts have focused on adapting it as an effective way to promote generalization of the model itself, such as online KD and self KD. Here, we propose an easy-to-use and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named spacing effect in the field of biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. We provide an in-depth theoretical and empirical analysis showing that the benefits of the proposed spacing effect in KD stem from seeking a flat minima during stochastic gradient descent (SGD). We perform extensive experiments to demonstrate the effectiveness of our Spaced KD in improving the learning performance of DNNs (e.g., the additional performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively). Our codes have been released on github~\url{https://github.com/SunGL001/Spaced-KD}." -The Choice of Normalization Influences Shrinkage in Regularized Regression,Johan Larsson Jonas Wallin,https://openreview.net/forum?id=XyudeZXHn3, Predicting the Susceptibility of Examples to Catastrophic Forgetting,Guy Hacohen Tinne Tuytelaars,https://icml.cc/virtual/2025/poster/43833,"Catastrophic forgetting -- the tendency of neural networks to forget previously learned data when learning new information -- remains a central challenge in continual learning. 
In this work, we adopt a behavioral approach, observing a connection between learning speed and forgetting: examples learned more quickly are less prone to forgetting. Focusing on replay-based continual learning, we show that the composition of the replay buffer -- specifically, whether it contains quickly or slowly learned examples -- has a significant effect on forgetting. Motivated by this insight, we introduce Speed-Based Sampling (SBS), a simple yet general strategy that selects replay examples based on their learning speed. SBS integrates easily into existing buffer-based methods and improves performance across a wide range of competitive continual learning benchmarks, advancing state-of-the-art results. Our findings underscore the value of accounting for the forgetting dynamics when designing continual learning algorithms." Structured Preconditioners in Adaptive Optimization: A Unified Analysis,Shuo Xie Tianhao Wang Sashank J. Reddi Sanjiv Kumar Zhiyuan Li,https://icml.cc/virtual/2025/poster/45803,"We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rate to several important structured preconditioned algorithms including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal Adagrad, AdaGrad-Norm which use less space and compute) are often presented as computationally efficient approximations to full-matrix Adagrad, aiming for improved optimization performance through better approximations. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is relatively much cheaper than full-matrix AdaGrad could outperform it both theoretically and experimentally." Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes,Erica Zhang Fangzhao Zhang Mert Pilanci,https://icml.cc/virtual/2025/poster/43666,"Active learning methods aim to improve sample complexity in machine learning. In this work, we investigate an active learning scheme via a novel gradient-free cutting-plane training method for ReLU networks of arbitrary depth and develop a convergence theory. We demonstrate, for the first time, that cutting-plane algorithms, traditionally used in linear models, can be extended to deep neural networks despite their nonconvexity and nonlinear decision boundaries. Moreover, this training method induces the first deep active learning scheme known to achieve convergence guarantees, revealing a geometric contraction rate of the feasible set. We exemplify the effectiveness of our proposed active learning method against popular deep active learning baselines via both synthetic data experiments and sentimental classification task on real datasets." 
@@ -3170,13 +3077,10 @@ Towards Attributions of Input Variables in a Coalition,Xinhao Zheng Huiqi Deng Q SPRI: Aligning Large Language Models with Context-Situated Principles,Hongli Zhan Muneeza Azmat Raya Horesh Junyi Jessy Li Mikhail Yurochkin,https://icml.cc/virtual/2025/poster/44235,"Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness." Approximate Differential Privacy of the $\ell_2$ Mechanism,Matthew Joseph Alex Kulesza Alexander Yu,https://icml.cc/virtual/2025/poster/43510,"We study the $\ell_2$ mechanism for computing a $d$-dimensional statistic with bounded $\ell_2$ sensitivity under approximate differential privacy. Across a range of privacy parameters, we find that the $\ell_2$ mechanism obtains error approaching that of the Laplace mechanism as $d \to 1$ and approaching that of the Gaussian mechanism as $d \to \infty$; however, it dominates both in between." "Differentially Private Analysis for Binary Response Models: Optimality, Estimation, and Inference",Ce Zhang Yixin Han Yafei Wang Xiaodong Yan Linglong Kong Ting Li Bei Jiang,https://icml.cc/virtual/2025/poster/46409,"Randomized response (RR) mechanisms constitute a fundamental and effective technique for ensuring label differential privacy (LabelDP). However, existing RR methods primarily focus on the response labels while overlooking the influence of covariates and often do not fully address optimality. To address these challenges, this paper explores optimal LabelDP procedures using RR mechanisms, focusing on achieving optimal estimation and inference in binary response models. We first analyze the asymptotic behaviors of RR binary response models and then optimize the procedure by maximizing the trace of the Fisher Information Matrix within the $\varepsilon$- and $(\varepsilon,\delta)$-LabelDP constraints. Our theoretical results indicate that the proposed methods achieve optimal LabelDP guarantees while maintaining statistical accuracy in binary response models under mild conditions. Furthermore, we develop private confidence intervals with nominal coverage for statistical inference. Extensive simulation studies and real-world applications confirm that our methods outperform existing approaches in terms of precise estimation, privacy protection, and reliable inference." 
-Hot PATE: Private Aggregation of Distributions for Diverse Tasks,Edith Cohen Benjamin Cohen-Wang Xin Lyu Jelani Nelson Tamas Sarlos Uri Stemmer,https://openreview.net/forum?id=T2ujAPFA4C, -Towards Efficient and Scalable Implementation of Differentially Private Deep Learning,Sebastian Rodriguez Beltran Marlon Tobaben Joonas Jälkö Niki Andreas Loppi Antti Honkela,https://openreview.net/forum?id=uiuwlZ3cPe, Improving Out-of-Distribution Detection with Markov Logic Networks,Konstantin Kirchheim Frank Ortmeier,https://icml.cc/virtual/2025/poster/46267,"Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models operating in open-world scenarios. Current OOD detectors mainly rely on statistical models to identify unusual patterns in the latent representations of a deep neural network. This work proposes to augment existing OOD detectors with probabilistic reasoning, utilizing Markov logic networks (MLNs). MLNs connect first-order logic with probabilistic reasoning to assign probabilities to inputs based on weighted logical constraints defined over human-understandable concepts, which offers improved explainability. Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. Furthermore, we introduce a simple algorithm for learning logical constraints for OOD detection from a dataset and showcase its effectiveness." Adversarial Inception Backdoor Attacks against Reinforcement Learning,Ethan Rathbun Alina Oprea Christopher Amato,https://icml.cc/virtual/2025/poster/43529,"Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent's rewards, inducing values far outside the environment's natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state of the art performance under strict reward constraints. These ``inception'' attacks manipulate the agent's training data -- inserting the trigger into prior observations and replacing high return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDP). Using this framework we devise an online inception attack which achieves an 100% attack success rate on multiple environments under constrained rewards while minimally impacting the agent's task performance." Convergence of Consistency Model with Multistep Sampling under General Data Assumptions,Yiding Chen Yiyi Zhang Owen Oertell Wen Sun,https://icml.cc/virtual/2025/poster/43651,"Diffusion models accomplish remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions to map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. 
In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumption and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling." A Rescaling-Invariant Lipschitz Bound Based on Path-Metrics for Modern ReLU Network Parameterizations,Antoine Gonon Nicolas Brisebarre Elisa Riccietti Rémi Gribonval,https://icml.cc/virtual/2025/poster/45188,"Robustness with respect to weight perturbations underpins guarantees for generalization, pruning and quantization. Existing guarantees rely on *Lipschitz bounds in parameter space*, cover only plain feed-forward MLPs, and break under the ubiquitous neuron-wise rescaling symmetry of ReLU networks. We prove a new Lipschitz inequality expressed through the $\ell^{1}$-*path-metric* of the weights. The bound is (i) *rescaling-invariant* by construction and (ii) applies to any ReLU-DAG architecture with any combination of convolutions, skip connections, pooling, and frozen (inference-time) batch-normalization—thus encompassing ResNets, U-Nets, VGG-style CNNs, and more. By respecting the network’s natural symmetries, the new bound strictly sharpens prior parameter-space bounds and can be computed in two forward passes. To illustrate its utility, we derive from it a symmetry-aware pruning criterion and show—through a proof-of-concept experiment on a ResNet-18 trained on ImageNet—that its pruning performance matches that of classical magnitude pruning, while becoming totally immune to arbitrary neuron-wise rescalings." 
Under this framework, we always learn one of the following two things: either we achieve high revenue compared to the best possible revenue of any regular distribution close to the input distribution, or the input distribution does not satisfy regularity. We then use this framework to provide: 1) a tester-learner pair for revenue optimal mechanisms, 2) a tester for whether the fundamental Bulow-Klemperer Theorem (Bulow and Klemperer 1996) is applicable to a given dataset, and 3) a tester to confirm the existence of an anonymous reserve price that results in the anonymous price auction securing a constant fraction of the optimal revenue." Graph Attention is Not Always Beneficial: A Theoretical Analysis of Graph Attention Mechanisms via Contextual Stochastic Block Models,Zhongtian Ma Qiaosheng Zhang Bocheng Zhou Yexin Zhang Shuyue Hu Zhen Wang,https://icml.cc/virtual/2025/poster/46423,"Despite the growing popularity of graph attention mechanisms, their theoretical understanding remains limited. This paper aims to explore the conditions under which these mechanisms are effective in node classification tasks through the lens of Contextual Stochastic Block Models (CSBMs). Our theoretical analysis reveals that incorporating graph attention mechanisms is *not universally beneficial*. Specifically, by appropriately defining *structure noise* and *feature noise* in graphs, we show that graph attention mechanisms can enhance classification performance when structure noise exceeds feature noise. Conversely, when feature noise predominates, simpler graph convolution operations are more effective. Furthermore, we examine the over-smoothing phenomenon and show that, in the high signal-to-noise ratio (SNR) regime, graph convolutional networks suffer from over-smoothing, whereas graph attention mechanisms can effectively resolve this issue. Building on these insights, we propose a novel multi-layer Graph Attention Network (GAT) architecture that significantly outperforms single-layer GATs in achieving *perfect node classification* in CSBMs, relaxing the SNR requirement from $\omega(\sqrt{\log n})$ to $\omega(\sqrt{\log n} / \sqrt[3]{n})$. To our knowledge, this is the first study to delineate the conditions for perfect node classification using multi-layer GATs. Our theoretical contributions are corroborated by extensive experiments on both synthetic and real-world datasets, highlighting the practical implications of our findings." Provable Efficiency of Guidance in Diffusion Models for General Data Distribution,Gen Li Yuchen Jiao,https://icml.cc/virtual/2025/poster/45240,"Diffusion models have emerged as a powerful framework for generative modeling, with guidance techniques playing a crucial role in enhancing sample quality. Despite their empirical success, a comprehensive theoretical understanding of the guidance effect remains limited. Existing studies only focus on case studies, where the distribution conditioned on each class is either isotropic Gaussian or supported on a one-dimensional interval with some extra conditions. How to analyze the guidance effect beyond these case studies remains an open question. Towards closing this gap, we make an attempt to analyze diffusion guidance under general data distributions. Rather than demonstrating uniform sample quality improvement, which does not hold in some distributions, we prove that guidance can improve the whole sample quality, in the sense that the ratio of bad samples (measured by the classifier probability) decreases in the presence of guidance. 
This aligns with the motivation of introducing guidance." @@ -3198,7 +3102,6 @@ Faster Stochastic Optimization with Arbitrary Delays via Adaptive Asynchronous M Flexibility-conditioned protein structure design with flow matching,Vsevolod Viliuga Leif Seute Nicolas Wolf Simon Wagner Arne Elofsson Jan Stühmer Frauke Gräter,https://icml.cc/virtual/2025/poster/46289,"Recent advances in geometric deep learning and generative modeling have enabled the design of novel proteins with a wide range of desired properties. However, current state-of-the-art approaches are typically restricted to generating proteins with only static target properties, such as motifs and symmetries. In this work, we take a step towards overcoming this limitation by proposing a framework to condition structure generation on flexibility, which is crucial for key functionalities such as catalysis or molecular recognition. We first introduce BackFlip, an equivariant neural network for predicting per-residue flexibility from an input backbone structure. Relying on BackFlip, we propose FliPS, an SE(3)-equivariant conditional flow matching model that solves the inverse problem, that is, generating backbones that display a target flexibility profile. In our experiments, we show that FliPS is able to generate novel and diverse protein backbones with the desired flexibility, verified by Molecular Dynamics (MD) simulations." Learn Singularly Perturbed Solutions via Homotopy Dynamics,Chuqi CHEN Yahong Yang Yang Xiang Wenrui Hao,https://icml.cc/virtual/2025/poster/45414,"Solving partial differential equations (PDEs) using neural networks has become a central focus in scientific machine learning. Training neural networks for singularly perturbed problems is particularly challenging due to certain parameters in the PDEs that introduce near-singularities in the loss function. In this study, we overcome this challenge by introducing a novel method based on homotopy dynamics to effectively manipulate these parameters. From a theoretical perspective, we analyze the effects of these parameters on training difficulty in these singularly perturbed problems and establish the convergence of the proposed homotopy dynamics method. Experimentally, we demonstrate that our approach significantly accelerates convergence and improves the accuracy of these singularly perturbed problems. These findings present an efficient optimization strategy leveraging homotopy dynamics, offering a robust framework to extend the applicability of neural networks for solving singularly perturbed differential equations." PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations,Benjamin Holzschuh Qiang Liu Georg Kohl Nils Thuerey,https://icml.cc/virtual/2025/poster/46546,"We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific for large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. 
This helps to maintain a consistent information density of tokens when learning multiple types of PDEs simultaneously. We demonstrate that our pre-trained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations. Our source code is available at https://github.com/tum-pbs/pde-transformer." -SatFlow: Generative model based framework for producing High Resolution Gap Free Remote Sensing Imagery.,Bharath Chandra Reddy Irigireddy varaprasad bandaru,https://openreview.net/forum?id=0ePi13LA5p, TCP-Diffusion: A Multi-modal Diffusion Model for Global Tropical Cyclone Precipitation Forecasting with Change Awareness,Cheng Huang Pan Mu Cong Bai Peter AG Watson,https://icml.cc/virtual/2025/poster/46214,"Deep learning methods have made significant progress in regular rainfall forecasting, yet the more hazardous tropical cyclone (TC) rainfall has not received the same attention. While regular rainfall models can offer valuable insights for designing TC rainfall forecasting models, most existing methods suffer from cumulative errors and lack physical consistency. Additionally, these methods overlook the importance of meteorological factors in TC rainfall and their integration with the numerical weather prediction (NWP) model. To address these issues, we propose Tropical Cyclone Precipitation Diffusion (TCP-Diffusion), a multi-modal model for forecasting of TC precipitation given an existing TC in any location globally. It forecasts rainfall around the TC center for the next 12 hours at 3 hourly resolution based on past rainfall observations and multi-modal environmental variables. Adjacent residual prediction (ARP) changes the training target from the absolute rainfall value to the rainfall trend and gives our model the capability of rainfall change awareness, reducing cumulative errors and ensuring physical consistency. Considering the influence of TC-related meteorological factors and the useful information from NWP model forecasts, we propose a multi-model framework with specialized encoders to extract richer information from environmental variables and results provided by NWP models. The results of extensive experiments show that our method outperforms other DL methods and the NWP method from the European Centre for Medium-Range Weather Forecasts (ECMWF)." WyckoffDiff -- A Generative Diffusion Model for Crystal Symmetry,Filip Ekström Kelvinius Oskar B. Andersson Abhijith S Parackal Dong Qian Rickard Armiento Fredrik Lindsten,https://icml.cc/virtual/2025/poster/45457,"Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation. 
As a proof-of-concept study, we use WyckoffDiff to find new materials below the convex hull of thermodynamical stability." HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration,Yushi Huang Zining Wang Ruihao Gong Jing Liu Xinjie Zhang Jinyang Guo Xianglong Liu Jun Zhang,https://icml.cc/virtual/2025/poster/45499,"Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives-*aligned predicted noise vs. high-quality images*-between training and inference. These two discrepancies compromise both performance and efficiency.To this end, we *harmonize* training and inference with a novel learning-based *caching* framework dubbed **HarmoniCa**. It first incorporates *Step-Wise Denoising Training* (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an *Image Error Proxy-Guided Objective* (IEPO) is applied to balance image quality against cache utilization through an efficient proxy to approximate the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate superior performance and speedup of our framework. For instance, it achieves over $40\\%$ latency reduction (*i.e.*, $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our *image-free* approach reduces training time by $25\\%$ compared with the previous method. Our code is available at https://github.com/ModelTC/HarmoniCa." @@ -3208,20 +3111,15 @@ Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search,Boyan Li Jiayi Zh DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts,Tobias Braun Mark Rothermel Marcus Rohrbach Anna Rohrbach,https://icml.cc/virtual/2025/poster/43719,"The proliferation of disinformation demands reliable and scalable fact-checking solutions. We presentDynamicEvidence-basedFAct-checking withMultimodalExperts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claimsandevidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVeriTeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new general state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking." Cross-Modal Alignment via Variational Copula Modelling,Feng Wu Tsai Hor Chan Fuying Wang Guosheng Yin Lequan Yu,https://icml.cc/virtual/2025/poster/46358,"Various data modalities are common in real-world applications. (e.g., EHR, medical images and clinical notes in healthcare). 
Thus, it is essential to develop multimodal learning methods to aggregate information from multiple modalities. The main challenge is appropriately aligning and fusing the representations of different modalities into a joint distribution. Existing methods mainly rely on concatenation or the Kronecker product, oversimplifying interactions structure between modalities and indicating a need to model more complex interactions. Additionally, the joint distribution of latent representations with higher-order interactions is underexplored. Copula is a powerful statistical structure in modelling the interactions between variables, as it bridges the joint distribution and marginal distributions of multiple variables. In this paper, we propose a novel copula modelling-driven multimodal learning framework, which focuses on learning the joint distribution of various modalities to capture the complex interaction among them. The key idea is interpreting the copula model as a tool to align the marginal distributions of the modalities efficiently. By assuming a Gaussian mixture distribution for each modality and a copula model on the joint distribution, our model can also generate accurate representations for missing modalities. Extensive experiments on public MIMIC datasets demonstrate the superior performance of our model over other competitors. The code is anonymously available at https://github.com/HKU-MedAI/CMCM." Staged and Physics-Grounded Learning Framework with Hyperintensity Prior for Pre-Contrast MRI Synthesis,Dayang Wang Srivathsa Pasumarthi Venkata Ajit Shankaranarayanan Greg Zaharchuk,https://icml.cc/virtual/2025/poster/45409,"Contrast-enhanced MRI enhances pathological visualization but often necessitates Pre-Contrast images for accurate quantitative analysis and comparative assessment. However, Pre-Contrast images are frequently unavailable due to time, cost, or safety constraints, or they may suffer from degradation, making alignment challenging. This limitation hinders clinical diagnostics and the performance of tools requiring combined image types. To address this challenge, we propose a novel staged, physics-grounded learning framework with a hyperintensity prior to synthesize Pre-Contrast images directly from Post-Contrast MRIs. The proposed method can generate high-quality Pre-Contrast images, thus, enabling comprehensive diagnostics while reducing the need for additional imaging sessions, costs, and patient risks. To the best of our knowledge, this is the first Pre-Contrast synthesis model capable of generating images that may be interchangeably used with standard-of-care Pre-Contrast images. Extensive evaluations across multiple datasets, sites, anatomies, and downstream tasks demonstrate the model’s robustness and clinical applicability, positioning it as a valuable tool for contrast-enhanced MRI workflows." -Canonic Signed Spike Coding for Efficient Spiking Neural Networks,Yiwen Gu Junchuan Gu Haibin Shen Kejie Huang,https://openreview.net/forum?id=8Y50Exn2YL, DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making,Ziru Wang Mengmeng Wang Jade Dai Teli Ma Guo-Jun Qi Yong Liu Guang Dai Jingdong Wang,https://icml.cc/virtual/2025/poster/43462,"Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. 
To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios." Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification,Flavio Petruzzellis Cristina Cornelio Pietro Lio,https://icml.cc/virtual/2025/poster/43660,"Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks, especially in specialized environments requiring external knowledge. While hierarchical planning and Retrieval-Augmented Generation (RAG) address some of these challenges, they remain insufficient on their own and a deeper integration is required for achieving more reliable systems. To this end, we propose a neuro-symbolic approach that enhances LLM-based planners with Knowledge Graph-based RAG for hierarchical plan generation. This method decomposes complex tasks into manageable subtasks, further expanded into executable atomic action sequences. To ensure formal correctness and proper decomposition, we integrate a Symbolic Validator, which also functions as a failure detector by aligning expected and observed world states. Our evaluation against baseline methods demonstrates the consistent and significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs' reasoning and compositional capabilities. Code available at https://github.com/corneliocristina/HVR." Harnessing Heterogeneous Statistical Strength for Personalized Federated Learning via Hierarchical Bayesian Inference,Mahendra Singh Thapa Rui Li,https://icml.cc/virtual/2025/poster/44831,"Personalized federated learning (PFL) based on the Bayesian approach tackles the challenges arising from the statistical heterogeneity of client data by computing a personalized posterior distribution over the parameters of each client's local model and constructing a global distribution by aggregating the parameters of these personalized posteriors. However, the heuristic aggregation methods introduce strong biases and result in global models with poor generalization. We thus propose a novel hierarchical Bayesian inference framework for PFL by specifying a conjugate hyper-prior over the parameters of the personalized posteriors. This allows us to jointly compute a global posterior distribution for aggregation and the personalized ones at the local level. This hierarchical Bayesian inference framework achieves an elegant balance between local personalization and global model robustness.
Extensive empirical study shows that by effectively sharing the heterogeneous statistical strength across the local models while retaining their distinctive characteristics, our framework yields state-of-the-art performance. We also show that existing Bayesian PFLs are special cases of our framework." How Effective Can Dropout Be in Multiple Instance Learning ?,Wenhui Zhu Peijie Qiu Xiwen Chen Zhangsihao Yang Aristeidis Sotiras Abolfazl Razi Yalin Wang,https://icml.cc/virtual/2025/poster/43917,"Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from ""noisy"" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. However, the most commonly used technique (i.e., dropout) for mitigating this issue has yet to be explored in MIL. In this paper, we empirically explore how effective the dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attack. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at \url{https://github.com/ChongQingNoSubway/MILDropout}." Multiobjective distribution matching,Xiaoyuan Zhang Peijie Li Yingying Yu Yichi Zhang Han Zhao Qingfu Zhang,https://icml.cc/virtual/2025/poster/45789,"Distribution matching is a key technique in machine learning, with applications in generative models, domain adaptation, and algorithmic fairness. A related but less explored challenge is generating a distribution that aligns with multiple underlying distributions, often with conflicting objectives, known as a Pareto optimal distribution.In this paper, we develop a general theory based on information geometry to construct the Pareto set and front for the entire exponential family under KL and inverse KL divergences. This formulation allows explicit derivation of the Pareto set and front for multivariate normal distributions, enabling applications like multiobjective variational autoencoders (MOVAEs) to generate interpolated image distributions.Experimental results on real-world images demonstrate that both algorithms can generate high-quality interpolated images across multiple distributions." Fast Inference with Kronecker-Sparse Matrices,Antoine Gonon Léon Zheng Pascal Carrivain TUNG QUOC LE,https://icml.cc/virtual/2025/poster/45258,"Kronecker-sparse (KS) matrices—whose supports are Kronecker products of identity and all-ones blocks—underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of time spent on memory-bound tensor permutations. 
We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves in FP32 a median speedup of 1.4× and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at github.com/PascalCarrivain/ksmm, including a PyTorch-compatible KSLinear layer, and demonstrate in FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium."
-Hyperflows: Pruning Reveals the Importance of Weights,Barbulescu Eugen Antonio Alexoaie,https://openreview.net/forum?id=vkltBcQgrL,
-Graph Transformers Get the GIST: Graph Invariant Structural Trait for Refined Graph Encoding,Hoang Anh Duy Le Shaochen Zhong Jerry Xiao Jiamu Zhang Yu-Neng Chuang Li Li Rui Chen Shuai Xu Zirui Liu Kaixiong Zhou Vipin Chaudhary Zhaozhuo Xu Xia Hu,https://openreview.net/forum?id=Ck6WljG6ZM,
-Toward Foundation Model for Multivariate Wearable Sensing of Physiological Signals,Yunfei Luo Yuliang Chen Asif Salekin Tauhidur Rahman,https://openreview.net/forum?id=4x83oH6Oy6,
Continuous Semi-Implicit Models,Longlin Yu Jiajun Zha Tong Yang Tianyu Xie Xiangyu Zhang S.-H. Chan Cheng Zhang,https://icml.cc/virtual/2025/poster/43572,"Semi-implicit distributions have shown great promise in variational inference and generative modeling. Hierarchical semi-implicit models, which stack multiple semi-implicit layers, enhance the expressiveness of semi-implicit distributions and can be used to accelerate diffusion models given pretrained score networks. However, their sequential training often suffers from slow convergence. In this paper, we introduce CoSIM, a continuous semi-implicit model that extends hierarchical semi-implicit models into a continuous framework. By incorporating a continuous transition kernel, CoSIM enables efficient, simulation-free training. Furthermore, we show that CoSIM achieves consistency with a carefully designed transition kernel, offering a novel approach for multistep distillation of generative models at the distributional level. Extensive experiments on image generation demonstrate that CoSIM performs on par or better than existing diffusion model acceleration methods, achieving superior performance on FD-DINOv2." Generative Data Mining with Longtail-Guided Diffusion,David S Hayden Mao Ye Timur Garipov Gregory P. Meyer Carl Vondrick Zhao Chen Yuning Chai Eric M Wolff Siddhartha Srinivasa,https://icml.cc/virtual/2025/poster/46120,"It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states.
Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model." Hessian Geometry of Latent Space in Generative Models,Alexander Lobashev Dmitry Guskov Maria Larchenko Mikhail Tamm,https://icml.cc/virtual/2025/poster/45794,"This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions.Our source code is available at \url{https://github.com/alobashev/hessian-geometry-of-diffusion-models}." -Mean-Shift Distillation for Diffusion Mode Seeking,Vikas Thamizharasan Nikitas Chatzis Iliyan Georgiev Matthew Fisher Difan Liu Nanxuan Zhao Evangelos Kalogerakis Michal Lukáč,https://openreview.net/forum?id=USAKSVAwIc, Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision,Marco Cipriano Moritz Feuerpfeil Gerard de Melo,https://icml.cc/virtual/2025/poster/43574,"Scalable Vector Graphics (SVG) is a popular format on the web and in the design industry. However, despite the great strides made in generative modeling, SVG has remained underexplored due to the discrete and complex nature of such data. We introduce GRIMOIRE, a text-guided SVG generative model that is comprised of two modules: A Visual Shape Quantizer (VSQ) learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an Auto-Regressive Transformer (ART) models the joint probability distribution over shape tokens, positions and textual descriptions, allowing us to generate vector graphics from natural language. Unlike existing models that require direct supervision from SVG data, GRIMOIRE learns shape image patches using only raster image supervision which opens up vector generative modeling to significantly more data. We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on the MNIST and Emoji, and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and vector-supervised approach in flexibility." A General Graph Spectral Wavelet Convolution via Chebyshev Order Decomposition,Nian Liu Xiaoxin He Thomas Laurent Francesco Di Giovanni Michael M. 
Bronstein Xavier Bresson,https://icml.cc/virtual/2025/poster/45116,"Spectral graph convolution, an important tool of data filtering on graphs, relies on two essential decisions: selecting spectral bases for signal transformation and parameterizing the kernel for frequency analysis. While recent techniques mainly focus on standard Fourier transform and vector-valued spectral functions, they fall short in flexibility to model signal distributions over large spatial ranges, and capacity of spectral function. In this paper, we present a novel wavelet-based graph convolution network, namely WaveGC, which integrates multi-resolution spectral bases and a matrix-valued filter kernel. Theoretically, we establish that WaveGC can effectively capture and decouple short-range and long-range information, providing superior filtering flexibility, surpassing existing graph wavelet neural networks. To instantiate WaveGC, we introduce a novel technique for learning general graph wavelets by separately combining odd and even terms of Chebyshev polynomials. This approach strictly satisfies wavelet admissibility criteria. Our numerical experiments showcase the consistent improvements in both short-range and long-range tasks. This underscores the effectiveness of the proposed model in handling different scenarios." From Theory to Practice: Rethinking Green and Martin Kernels for Unleashing Graph Transformers,Yoon Hyeok Lee Jaemin Park Taejin Paik Doyun Kim Bosun Hwang,https://icml.cc/virtual/2025/poster/44592,"Graph Transformers (GTs) have emerged as a powerful alternative to message-passing neural networks, yet their performance heavily depends on effectively embedding structural inductive biases. In this work, we introduce novel structural encodings (SEs) grounded in a rigorous analysis of random walks (RWs), leveraging Green and Martin kernels that we have carefully redefined for AI applications while preserving their mathematical essence.These kernels capture the long-term behavior of RWs on graphs and allow for enhanced representation of complex topologies, including non-aperiodic and directed acyclic substructures.Empirical evaluations across eight benchmark datasets demonstrate strong performance across diverse tasks, notably in molecular and circuit domains.We attribute this performance boost to the improved ability of our kernel-based SEs to encode intricate structural information, thereby strengthening the global attention and inductive bias within GTs.This work highlights the effectiveness of theoretically grounded kernel methods in advancing Transformer-based models for graph learning." 
@@ -3238,21 +3136,17 @@ ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Prefe Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training,Mozhi Zhang Howe Tissue Lu Wang Xipeng Qiu,https://icml.cc/virtual/2025/poster/44261,"We introduce *Domain2Vec*, a novel approach that decomposes any dataset into a linear combination of several *meta-domains*, a new concept designed to capture the key underlying features of datasets.*Domain2Vec* maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary.These domain vectors enable the identification of optimal data mixture for language model (LM) pretraining in a training-free manner under the ***D**istribution **A**lignment **A**ssumption* (DA$^{2}$), which suggests that when the data distribution of the training set and the validation set is more aligned, a lower validation loss is achieved.Moreover, *Domain2Vec* can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods.Extensive experiments demonstrate that *Domain2Vec* helps find the data mixture that enhances downstream task performance with minimal computational overhead.Specifically, *Domain2Vec* achieves the same validation loss on Pile-CC using only $51.5$\% of the compute required when training on the original mixture of The Pile Dataset.Under equivalent compute budget, *Domain2Vec* improves downstream performance by an average of $2.83$\%." EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning,Dong HUANG Guangtao Zeng Jianbo Dai Meng Luo Han Weng Yuhao QING Heming Cui Zhijiang Guo Jie Zhang,https://icml.cc/virtual/2025/poster/46272,"As large language models (LLMs) play an increasingly important role in code generation, enhancing both correctness and efficiency has become crucial. Current methods primarily focus on correctness, often overlooking efficiency. To address this gap, we introduce SWIFTCODE to improve both aspects by fine-tuning LLMs on a high-quality dataset comprising correct and efficient code samples. Our methodology involves leveraging multiple LLMs to generate diverse candidate code solutions for various tasks across different programming languages. We then evaluate these solutions by directly measuring their execution time and memory usage through local execution. The code solution with the lowest execution time and memory consumption is selected as the final output for each task. Experimental results demonstrate significant improvements when fine-tuning with SWIFTCODE. For instance, Qwen2.5-Coder-7B-Instruct's pass@1 score increases from 44.8\% to 57.7\%, while the average execution time for correct tasks decreases by 48.4\%. SWIFTCODE offers a scalable and effective solution for advancing AI-driven code generation, benefiting both software development and computational problem-solving." 
Improving Rationality in the Reasoning Process of Language Models through Self-playing Game,Pinzheng Wang Juntao Li Zecheng Tang Haijia Gui Min zhang,https://icml.cc/virtual/2025/poster/45387,"Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a $\textit{\textbf{C}ritic-\textbf{D}iscernment \textbf{G}ame}~(\textbf{CDG})$ in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process."
-Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE,Haiduo Huang Fuwei Yang Zhenhua Liu Yixing Xu Jinze Li Yang Liu Xuanwu Yin Dong Li Pengju Ren Emad Barsoum,https://openreview.net/forum?id=fNIsoDWzGk,
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws,Prasanna Mayilvahanan Thaddäus Wiedemer Sayak Mallick Matthias Bethge Wieland Brendel,https://icml.cc/virtual/2025/poster/45734,"Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency." Neutral residues: revisiting adapters for model extension,Franck SIGNE TALLA Edouard Grave Herve Jegou,https://icml.cc/virtual/2025/poster/44454,"We address the problem of extending a pre-trained large language model to a new domain that was not seen during training. Standard techniques, such as fine-tuning or low-rank adaptation (LoRA), are successful at domain adaptation, but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we propose to revisit and improve adapters to extend LLMs. Our paper analyzes this extension problem from three angles: data, architecture and training procedure, which are advantageously considered jointly.
The resulting method, called neutral residues, modifies adapters in a way that leads to each new residual block to output near-zeros on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperforms competing approaches such as fine-tuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English." PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data,Tingchen Fu Mrinank Sharma Philip Torr Shay B Cohen David Krueger Fazl Barez,https://icml.cc/virtual/2025/poster/46610,"Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks and the influence on model resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation." Structure-Guided Large Language Models for Text-to-SQL Generation,Qinggang Zhang Hao Chen Junnan Dong Shengyuan Chen Feiran Huang Xiao Huang,https://icml.cc/virtual/2025/poster/44477,"Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without the background of SQL. However, LLMs often struggle to fully exploit and comprehend the user intention and complex structures of databases. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework ( SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and recursively decomposes the complex generation task using syntax-based prompting to guide LLMs in incrementally constructing target SQLs. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL baselines." 
Synthetic Text Generation for Training Large Language Models via Gradient Matching,Dang Nguyen Zeman Li Mohammadhossein Bateni Vahab Mirrokni Meisam Razaviyayn Baharan Mirzasoleiman,https://icml.cc/virtual/2025/poster/44161,"Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available athttps://github.com/BigML-CS-UCLA/GRADMM." -Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs,Yue Wang Qiuzhi Liu Jiahao Xu Tian Liang Xingyu Chen Zhiwei He Linfeng Song Dian Yu Juntao Li Zhuosheng Zhang Rui Wang Zhaopeng Tu Haitao Mi Dong Yu,https://openreview.net/forum?id=bNJyu7JHUO, Approximating Latent Manifolds in Neural Networks via Vanishing Ideals,Nico Pelleriti Max Zimmer Elias Samuel Wirth Sebastian Pokutta,https://icml.cc/virtual/2025/poster/45012,"Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach." Self-supervised Adversarial Purification for Graph Neural Networks,Woohyun Lee Hogun Park,https://icml.cc/virtual/2025/poster/43540,"Defending Graph Neural Networks (GNNs) against adversarial attacks requires balancing accuracy and robustness, a trade-off often mishandled by traditional methods like adversarial training that intertwine these conflicting objectives within a single classifier. 
To overcome this limitation, we propose a self-supervised adversarial purification framework. We separate robustness from the classifier by introducing a dedicated purifier, which cleanses the input data before classification. In contrast to prior adversarial purification methods, we propose GPR-GAE, a novel graph auto-encoder (GAE), as a specialized purifier trained with a self-supervised strategy, adapting to diverse graph structures in a data-driven manner. Utilizing multiple Generalized PageRank (GPR) filters, GPR-GAE captures diverse structural representations for robust and effective purification. Our multi-step purification process further facilitates GPR-GAE to achieve precise graph recovery and robust defense against structural perturbations. Experiments across diverse datasets and attack scenarios demonstrate the state-of-the-art robustness of GPR-GAE, showcasing it as an independent plug-and-play purifier for GNN classifiers. Our code can be found in https://github.com/woodavid31/GPR-GAE." Structure-informed Risk Minimization for Robust Ensemble Learning,Fengchun Qiao Yanlin Chen Xi Peng,https://icml.cc/virtual/2025/poster/45805,"Ensemble learning is a powerful approach for improving generalization under distribution shifts, but its effectiveness heavily depends on how individual models are combined. Existing methods often optimize ensemble weights based on validation data, which may not represent unseen test distributions, leading to suboptimal performance in out-of-distribution (OoD) settings. Inspired by Distributionally Robust Optimization (DRO), we propose Structure-informed Risk Minimization (SRM), a principled framework that learns robust ensemble weights without access to test data. Unlike standard DRO, which defines uncertainty sets based on divergence metrics alone, SRM incorporates structural information of training distributions, ensuring that the uncertainty set aligns with plausible real-world shifts. This approach mitigates the over-pessimism of traditional worst-case optimization while maintaining robustness. We introduce a computationally efficient optimization algorithm with theoretical guarantees and demonstrate that SRM achieves superior OoD generalization compared to existing ensemble combination strategies across diverse benchmarks. Code is available at: https://github.com/deep-real/SRM." Self-Organizing Visual Prototypes for Non-Parametric Representation Learning,Thalles Silva Helio Pedrini Adín Ramírez Rivera,https://icml.cc/virtual/2025/poster/45509,"We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together better characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions that implement the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. 
We comprehensively evaluate the representations learned using the SOP strategy on a range of benchmarks, including retrieval, linear evaluation, fine-tuning, and object detection. Our pre-trained encoders achieve state-of-the-art performance on many retrieval benchmarks and demonstrate increasing performance gains with more complex encoders." SKOLR: Structured Koopman Operator Linear RNN for Time-Series Forecasting,Yitian Zhang Liheng Ma Antonios Valkanas Boris N. Oreshkin Mark Coates,https://icml.cc/virtual/2025/poster/44949,"Koopman operator theory provides a framework for nonlinear dynamical system analysis and time-series forecasting by mapping dynamics to a space of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-dimensional Koopman operator approximation. In this work, we establish a connection between Koopman operator approximation and linear Recurrent Neural Networks (RNNs), which have recently demonstrated remarkable success in sequence modeling. We show that by considering an extended state consisting of lagged observations, we can establish an equivalence between a structured Koopman operator and linear RNN updates. Building on this connection, we present SKOLR, which integrates a learnable spectral decomposition of the input signal with a multilayer perceptron (MLP) as the measurement functions and implements a structured Koopman operator via a highly parallel linear RNN stack. Numerical experiments on various forecasting benchmarks and dynamical systems show that this streamlined, Koopman-theory-based design delivers exceptional performance. Our code is available at: https://github.com/networkslab/SKOLR." Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training,Max Milkert David Hyde Forrest John Laine,https://icml.cc/virtual/2025/poster/46437,"In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures."
-Theoretical Analysis of Contrastive Learning in Vision-Language Model Pretraining: The Role of Synthetic Text Captions for Feature Alignment,Jiawei Sun Shuai Zhang Hongkang Li Meng Wang,https://openreview.net/forum?id=hgAAXdv8q8, -EBMaC: Empirical Bayes and Matrix Constraints for Label Shift,Mushan Li Kihyun Han YANYUAN MA,https://openreview.net/forum?id=GL6AJXT4R4, Reducing Confounding Bias without Data Splitting for Causal Inference via Optimal Transport,Yuguang Yan Zongyu Li Haolin Yang Zeqin Yang Hao Zhou Ruichu Cai Zhifeng Hao,https://icml.cc/virtual/2025/poster/44515,"Causal inference seeks to estimate the effect given a treatment such as a medicine or the dosage of a medication. To reduce the confounding bias caused by the non-randomized treatment assignment, most existing methods reduce the shift between subpopulations receiving different treatments. However, these methods split limited training samples into smaller groups, which cuts down the number of samples in each group, while precise distribution estimation and alignment highly rely on a sufficient number of training samples. In this paper, we propose a distribution alignment paradigm without data splitting, which can be naturally applied in the settings of binary and continuous treatments. To this end, we characterize the confounding bias by considering different probability measures of the same set including all the training samples, and exploit the optimal transport theory to analyze the confounding bias and outcome estimation error. Based on this, we propose to learn balanced representations by reducing the bias between the marginal distribution and the conditional distribution of a treatment. As a result, data reduction caused by splitting is avoided, and the outcome prediction model trained on one treatment group can be generalized to the entire population. The experiments on both binary and continuous treatment settings demonstrate the effectiveness of our method." Strong and Weak Identifiability of Optimization-based Causal Discovery in Non-linear Additive Noise Models,Mingjia Li Hong Qian Tian-Zuo Wang ShujunLi Min Zhang Aimin Zhou,https://icml.cc/virtual/2025/poster/43823,"Causal discovery aims to identify causal relationships from observational data. Recently, optimization-based causal discovery methods have attracted extensive attention in the literature due to their efficiency in handling high-dimensional problems. However, we observe that optimization-based methods often perform well on certain problems but struggle with others. This paper identifies a specific characteristic of causal structural equations that determines the difficulty of identification in causal discovery and, in turn, the performance of optimization-based methods. We conduct an in-depth study of the additive noise model (ANM) and propose to further divide identifiable problems into strongly and weakly identifiable types based on the difficulty of identification. We also provide a sufficient condition to distinguish the two categories. Inspired by these findings, this paper further proposes GENE, a generic method for addressing strongly and weakly identifiable problems in a unified way under the ANM assumption. GENE adopts an order-based search framework that incorporates conditional independence tests into order fitness evaluation, ensuring effectiveness on weakly identifiable problems. In addition, GENE restricts the dimensionality of the effect variables to ensure \emph{scale invariance}, a property crucial for practical applications. 
Experiments demonstrate that GENE is uniquely effective in addressing weakly identifiable problems while also remaining competitive with state-of-the-art causal discovery algorithms for strongly identifiable problems." RAGGED: Towards Informed Design of Scalable and Stable RAG Systems,Jennifer Hsia Afreen Shaikh Zora Zhiruo Wang Graham Neubig,https://icml.cc/virtual/2025/poster/46460,"Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness." @@ -3267,37 +3161,27 @@ Test-Time Selective Adaptation for Uni-Modal Distribution Shift in Multi-Modal D Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer,Yulun Wu Doron L Bergman,https://icml.cc/virtual/2025/poster/43960,"We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without using any real-world dataset to pre-train the model, extending on the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, who continue to shift their underlying data generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block model architecture that is able to handle classification tasks with arbitrary number of classes, addressing the class size limitation -- a crucial weakness of prior tabular zero-shot learning algorithms. In experiments, we show that our framework matches state-of-the-art performance on small tabular classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. In our analysis, we demonstrate that the adversarial synthetic data agents were able to generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerated pre-training." 
Verification Learning: Make Unsupervised Neuro-Symbolic System Feasible,Lin-Han Jia Wen-Chao Hu Jie-Jing Shao Lan-Zhe Guo Yu-Feng Li,https://icml.cc/virtual/2025/poster/44815,"The current Neuro-Symbolic (NeSy) Learning paradigm suffers from an over-reliance on labeled data, so if we completely disregard labels, it leads to less symbol information, a larger solution space, and more shortcuts—issues that current Nesy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in Nesy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs and introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in Nesy systems can be completed without labels and explains why rules can replace infinite labels for some tasks, while for others the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sort, match, and chess, each showing significant performance and efficiency improvements." An in depth look at the Procrustes-Wasserstein distance: properties and barycenters,Davide Adamo Marco Corneli Manon Vuillien Emmanuelle Vila,https://icml.cc/virtual/2025/poster/44717,"Due to its invariance to rigid transformations such as rotations and reflections, Procrustes-Wasserstein (PW) was introduced in the literature as an optimal transport (OT) distance, alternative to Wasserstein and more suited to tasks such as the alignment and comparison of point clouds. Having that application in mind, we carefully build a space of discrete probability measures and show that over that space PW actuallyisa distance. Algorithms to solve the PW problems already exist, however we extend the PW framework by discussing and testing several initialization strategies. We then introduce the notion of PW barycenter and detail an algorithm to estimate it from the data. The result is a new method to compute representative shapes from a collection of point clouds. We benchmark our method against existing OT approaches, demonstrating superior performance in scenarios requiring precise alignment and shape preservation. We finally show the usefulness of the PW barycenters in an archaeological context. Our results highlight the potential of PW in advancing 2D and 3D point cloud analysis for machine learning and computational geometry applications." -Constrained Optimization From a Control Perspective via Feedback Linearization,Runyu Zhang Arvind Raghunathan Jeff S Shamma Na Li,https://openreview.net/forum?id=bV8Nd99mek, Graph-Supported Dynamic Algorithm Configuration for Multi-Objective Combinatorial Optimization,Robbert Reijnen Yaoxin Wu Zaharah Bukhsh Yingqian Zhang,https://icml.cc/virtual/2025/poster/45652,"Deep reinforcement learning (DRL) has been widely used for dynamic algorithm configuration, particularly in evolutionary computation, which benefits from the adaptive update of parameters during the algorithmic execution. 
However, applying DRL to algorithm configuration for multi-objective combinatorial optimization (MOCO) problems remains relatively unexplored. This paper presents a novel graph neural network (GNN) based DRL to configure multi-objective evolutionary algorithms. We model the dynamic algorithm configuration as a Markov decision process, representing the convergence of solutions in the objective space by a graph, with their embeddings learned by a GNN to enhance the state representation. Experiments on diverse MOCO challenges indicate that our method outperforms traditional and DRL-based algorithm configuration methods in terms of efficacy and adaptability. It also exhibits advantageous generalizability across objective types and problem sizes, and applicability to different evolutionary computation methods." Optimization over Sparse Support-Preserving Sets: Two-Step Projection with Global Optimality Guarantees,William de Vazelhes Xiaotong Yuan Bin Gu,https://icml.cc/virtual/2025/poster/46136,"In sparse optimization, enforcing hard constraints using the $\ell_0$ pseudo-norm offers advantages like controlled sparsity compared to convex relaxations. However, many real-world applications demand not only sparsity constraints but also some extra constraints. While prior algorithms have been developed to address this complex scenario with mixed combinatorial and convex constraints, they typically require the closed form projection onto the mixed constraints which might not exist, and/or only provide local guarantees of convergence which is different from the global guarantees commonly sought in sparse optimization. To fill this gap, in this paper, we study the problem of sparse optimization with extra *support-preserving* constraints commonly encountered in the literature. We present a new variant of iterative hard-thresholding algorithm equipped with a two-step consecutive projection operator customized for these mixed constraints, serving as a simple alternative to the Euclidean projection onto the mixed constraint. By introducing a novel trade-off between sparsity relaxation and sub-optimality, we provide global guarantees in objective value for the output of our algorithm, in the deterministic, stochastic, and zeroth-order settings, under the conventional restricted strong-convexity/smoothness assumptions. As a fundamental contribution in proof techniques, we develop a novel extension of the classic three-point lemma to the considered two-step non-convex projection operator, which allows us to analyze the convergence in objective value in an elegant way that has not been possible with existing techniques. In the zeroth-order case, such technique also improves upon the state-of-the-art result from de Vazelhes et. al. (2022), even in the case without additional constraints, by allowing us to remove a non-vanishing system error present in their work." -Dynamic Range Reduction via Branch-and-Bound,Thore Gerlach Nico Piatkowski,https://openreview.net/forum?id=yaqjXxbtSB, Inverse Optimization via Learning Feasible Regions,Ke Ren Peyman Mohajerin Esfahani Angelos Georghiou,https://icml.cc/virtual/2025/poster/44209,"We study inverse optimization (IO), where the goal is to use a parametric optimization program as the hypothesis class to infer relationships between input-decision pairs. Most of the literature focuses on learning only the objective function, as learning the constraint function (i.e., feasible regions) leads to nonconvex training programs. 
Motivated by this, we focus on learning feasible regions for known linear objectives, and introduce two training losses along with a hypothesis class to parameterize the constraint function. Our hypothesis class surpasses the previous objective-only method by naturally capturing discontinuous behaviors in input-decision pairs. We introduce a customized block coordinate descent algorithm with a smoothing technique to solve the training problems, while for further restricted hypothesis classes, we reformulate the training optimization as a tractable convex program or mixed integer linear program. Synthetic experiments and two power system applications including comparisons with state-of-the-art approaches showcase and validate the proposed approach."
-Memory Efficient Block Coordinate Descent Method for Forward-Only Second-Order Finetuning of LLM Models,Zhiyuan Yu Yifei Cheng Liang Ding Xinmei Tian Li Shen Dacheng Tao,https://openreview.net/forum?id=rqT21nfAJ0,
-Conformal Prediction for Hierarchical Data,Guillaume Principato Gilles Stoltz Yvenn Amara-Ouali Yannig Goude Bachir Hamrouche Jean-Michel Poggi,https://openreview.net/forum?id=JtsxqKYOIC,
Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo,Filip Ekström Kelvinius Zheng Zhao Fredrik Lindsten,https://icml.cc/virtual/2025/poster/45342,"A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on ``decoupled diffusion"", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data." Learn to Vaccinate: Combining Structure Learning and Effective Vaccination for Epidemic and Outbreak Control,Sepehr Elahi Paula Mürmann Patrick Thiran,https://icml.cc/virtual/2025/poster/46404,"The Susceptible-Infected-Susceptible (SIS) model is a widely used model for the spread of information and infectious diseases, particularly non-immunizing ones, on a graph. Given a highly contagious disease, a natural question is how to best vaccinate individuals to minimize the disease's extinction time. While previous works showed that the problem of optimal vaccination is closely linked to the NP-hard Spectral Radius Minimization (SRM) problem, they assumed that the graph is known, which is often not the case in practice. In this work, we consider the problem of minimizing the extinction time of an outbreak modeled by an SIS model where the graph on which the disease spreads is unknown and only the infection states of the vertices are observed. To this end, we split the problem into two: learning the graph and determining effective vaccination strategies. We propose a novel inclusion-exclusion-based learning algorithm and, unlike previous approaches, establish its sample complexity for graph recovery. We then detail an optimal algorithm for the SRM problem and prove that its running time is polynomial in the number of vertices for graphs with bounded treewidth. This is complemented by an efficient and effective polynomial-time greedy heuristic for any graph.
Finally, we present experiments on synthetic and real-world data that numerically validate our learning and vaccination algorithms." Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics,Tyler Kastner Mark Rowland Yunhao Tang Murat A Erdogdu Amir-massoud Farahmand,https://icml.cc/virtual/2025/poster/44550,"We study the problem of distributional reinforcement learning using categorical parametrisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, and compare to that of classical reinforcement learning. We finally empirically validate our theoretical results and perform an empirical investigation into the relative strengths of using KL losses, and derive a number of actionable insights for practitioners." Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks,Luise Ge Michael Lanier Anindya Sarkar Bengisu Guresti Chongjie Zhang Yevgeniy Vorobeychik,https://icml.cc/virtual/2025/poster/44698,"Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time.Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embedding as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions.The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-, and task clustering baselines in training, generalization, and few-shot learning, often by a large margin. Our code is available at https://github.com/CERL-WUSTL/PACMAN." Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning,Seungho Baek taegeon park Jongchan Park Seungjun Oh Yusung Kim,https://icml.cc/virtual/2025/poster/46345,"Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. 
By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS." -Fisher-Guided Selective Forgetting: Mitigating Primacy Bias in Deep Reinforcement Learning,Massimiliano Falzari Matthia Sabatelli,https://openreview.net/forum?id=uOGK8zwgvc, Simple Policy Optimization,Zhengpeng Xie Qiang Zhang Fan Yang Marco Hutter Renjing Xu,https://icml.cc/virtual/2025/poster/45232,"Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO's approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. By slightly modifying the policy loss used in PPO, SPO can achieve the best of both worlds. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end." LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence,Zhuoling Li Xiaogang Xu Zhenhua Xu Ser-Nam Lim Hengshuang Zhao,https://icml.cc/virtual/2025/poster/43466,"Recent embodied agents are primarily built based on reinforcement learning (RL) or large language models (LLMs). Among them, RL agents are efficient for deployment but only perform very few tasks. By contrast, giant LLM agents (often more than 1000B parameters) present strong generalization while demanding enormous computing resources. In this work, we combine their advantages while avoiding the drawbacks by conducting the proposed referee RL on our developed large auto-regressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We mathematically reveal that classic RL feedbacks vanish in long-horizon embodied exploration and introduce a giant LLM based referee to handle this reward vanishment during training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. 
Especially, LARM successfully harvests enchanted diamond equipment in Minecraft, which demands significantly longer decision-making chains than the highest achievements of prior best methods." A New Approach to Backtracking Counterfactual Explanations: A Unified Causal Framework for Efficient Model Interpretability,Pouria Fatemi Ehsan Sharifian Mohammad Hossein Yassaee,https://icml.cc/virtual/2025/poster/46336,"Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method called BRACE based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs." Prediction via Shapley Value Regression,Amr Alkhatib Roman Bresson Henrik Boström Michalis Vazirgiannis,https://icml.cc/virtual/2025/poster/44871,"Shapley values have several desirable, theoretically well-supported, properties for explaining black-box model predictions. Traditionally, Shapley values are computed post-hoc, leading to additional computational cost at inference time. To overcome this, a novel method, called ViaSHAP, is proposed, that learns a function to compute Shapley values, from which the predictions can be derived directly by summation. Two approaches to implement the proposed method are explored; one based on the universal approximation theorem and the other on the Kolmogorov-Arnold representation theorem. Results from a large-scale empirical investigation are presented, showing that ViaSHAP using Kolmogorov-Arnold Networks performs on par with state-of-the-art algorithms for tabular data. It is also shown that the explanations of ViaSHAP are significantly more accurate than the popular approximator FastSHAP on both tabular data and images." Textural or Textual: How Vision-Language Models Read Text in Images,Hanzhang Wang Qingyuan Ma,https://icml.cc/virtual/2025/poster/44522,"Typographic attacks are often attributed to the ability of multimodal pre-trained models to fuse textual semantics into visual representations, yet the mechanisms and locus of such interference remain unclear. We examine whether such models genuinely encode textual semantics or primarily rely on texture-based visual features. To disentangle orthographic form from meaning, we introduce the ToT dataset, which includes controlled word pairs that either share semantics with distinct appearances (synonyms) or share appearance with differing semantics (paronyms). A layer-wise analysis of Intrinsic Dimension (ID) reveals that early layers exhibit competing dynamics between orthographic and semantic representations. In later layers, semantic accuracy increases as ID decreases, but this improvement largely stems from orthographic disambiguation. Notably, clear semantic differentiation emerges only in the final block, challenging the common assumption that semantic understanding is progressively constructed across depth. 
These findings reveal how current vision-language models construct text representations through texture-dependent processes, prompting a reconsideration of the gap between visual perception and semantic understanding. The code is available at: https://github.com/Ovsia/Textural-or-Textual" -Transforming Visual Classifiers for Zero-Shot Text-Based Interpretability,Fawaz Sammani Jonas Fischer Nikos Deligiannis,https://openreview.net/forum?id=VOTeRb9AU1, -Modification-Considering Value Learning for Reward Hacking Mitigation in RL,Evgenii Opryshko Umangi Jain Igor Gilitschenski,https://openreview.net/forum?id=OmYqzp8NO7, Accelerating Spectral Clustering under Fairness Constraints,Francesco Tonin Alex Lambert Johan Suykens Volkan Cevher,https://icml.cc/virtual/2025/poster/45681,"Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step forward towards the adoption of fair clustering in real-world applications." On the Alignment between Fairness and Accuracy: from the Perspective of Adversarial Robustness,Junyi Chai Taeuk Jang Jing Gao Xiaoqian Wang,https://icml.cc/virtual/2025/poster/44670,"While numerous work has been proposed to address fairness in machine learning, existing methods do not guarantee fair predictions under imperceptible feature perturbation, and a seemingly fair model can suffer from large group-wise disparities under such perturbation. Moreover, while adversarial training has been shown to be reliable in improving a model's robustness to defend against adversarial feature perturbation that deteriorates accuracy, it has not been properly studied in the context of adversarial perturbation against fairness. To tackle these challenges, in this paper, we study the problem of adversarial attack and adversarial robustness w.r.t. two terms: fairness and accuracy. From the adversarial attack perspective, we propose a unified structure for adversarial attacks against fairness which brings together common notions in group fairness, and we theoretically prove the equivalence of adversarial attacks against different fairness notions. Further, we derive the connections between adversarial attacks against fairness and those against accuracy. From the adversarial robustness perspective, we theoretically align robustness to adversarial attacks against fairness and accuracy, where robustness w.r.t. one term enhances robustness w.r.t. the other term. Our study suggests a novel way to unify adversarial training w.r.t. fairness and accuracy, and experiments show our proposed method achieves better robustness w.r.t.
both terms." The Disparate Benefits of Deep Ensembles,Kajetan Schweighofer Adrian Arnaiz-Rodriguez Sepp Hochreiter Nuria M Oliver,https://icml.cc/virtual/2025/poster/43767,"Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not well understood yet. Algorithmic fairness examines how a model's performance varies across socially relevant groups defined by protected attributes such as age, gender, or race. In this work, we explore the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups, a phenomenon that we term the disparate benefits effect. We empirically investigate this effect using popular facial analysis and medical imaging datasets with protected group attributes and find that it affects multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify that the per-group differences in predictive diversity of ensemble members can explain this effect. Finally, we demonstrate that the classical Hardt post-processing method is particularly effective at mitigating the disparate benefits effect of Deep Ensembles by leveraging their better-calibrated predictive distributions." Fast Exact Unlearning for In-Context Learning Data for LLMs,Andrei Ioan Muresanu Anvith Thudi Michael R. Zhang Nicolas Papernot,https://icml.cc/virtual/2025/poster/45148,"Modern machine learning models are expensive to train, and there is a growing concern about the challenge of retroactively removing specific training data. Achieving exact unlearning in deep learning pipelines—producing models as if certain data had never been included in training—remains an open problem. In this paper, we revisit exact unlearning in deep learning and show that for large language models (LLMs) we can efficiently exactly unlearn ``fine-tuning data"" (the data used to adapt a pre-trained model). This follows from two observations. First, we can use in-context learning to adapt the LLM to the fine-tuning dataset instead of SGD based algorithms. Second, we show that accurate in-context learning can be done with quantized k-means, which allows for effectively constant time unlearning operations. Our evaluation shows that this unlearning recipe has similar performance to fine-tuning alternatives, but vastly reduces the unlearning costs. Our study also highlights the need for new measures of unlearning cost when adapting the learning algorithm to have faster unlearn operations." -Locally Differentially Private Graph Clustering via the Power Iteration Method,Sayan Mukherjee Vorapong Suppakitpaisarn,https://openreview.net/forum?id=RKlnPO5bii, Ranked from Within: Ranking Large Multimodal Models Without Labels,Weijie Tu Weijian Deng Dylan Campbell Yu Yao Jiyang Zheng Tom Gedeon Tongliang Liu,https://icml.cc/virtual/2025/poster/45911,"Can the relative performance of a pre-trained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. 
Instead, we explore other signals elicited and ascertain how well the models know their own limits, evaluating the effectiveness of these signals at unsupervised model ranking. We evaluate 47 state-of-the-art LMMs (e.g., LLaVA) across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation." SCISSOR: Mitigating Semantic Bias through Cluster-Aware Siamese Networks for Robust Classification,Shuo Yang Bardh Prenkaj Gjergji Kasneci,https://icml.cc/virtual/2025/poster/45266,"Shortcut learning undermines model generalization to out-of-distribution data. While the literature attributes shortcuts to biases in superficial features, we show that imbalances in the semantic distribution of sample embeddings induce spurious semantic correlations, compromising model robustness. To address this issue, we propose SCISSOR (Semantic Cluster Intervention for Suppressing ShORtcut), a Siamese network-based debiasing approach that remaps the semantic space by discouraging latent clusters exploited as shortcuts. Unlike prior data-debiasing approaches, SCISSOR eliminates the need for data augmentation and rewriting. We evaluate SCISSOR on 6 models across 4 benchmarks: Chest-XRay and Not-MNIST in computer vision, and GYAFC and Yelp in NLP tasks. Compared to several baselines, SCISSOR reports +5.3 absolute points in F1 score on GYAFC, +7.3 on Yelp, +7.7 on Chest-XRay, and +1 on Not-MNIST. SCISSOR is also highly advantageous for lightweight models with ∼9.5% improvement on F1 for ViT on computer vision datasets and ∼11.9% for BERT on NLP. Our study redefines the landscape of model generalization by addressing overlooked semantic biases, establishing SCISSOR as a foundational framework for mitigating shortcut learning and fostering more robust, bias-resistant AI systems." MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents,Kaijie Zhu Xianjun Yang Jindong Wang Wenbo Guo William Yang Wang,https://icml.cc/virtual/2025/poster/44447,"Recent research has explored that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: either require essential model training resources, lack effectiveness against sophisticated attacks, or harm the normal utilities. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent’s next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent’s trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce the potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. 
Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs. Code is available at https://github.com/kaijiezhu11/MELON." Grokking Beyond the Euclidean Norm of Model Parameters,Tikeng Notsawo Pascal Junior Guillaume Dumas Guillaume Rabusseau,https://icml.cc/virtual/2025/poster/45887,"Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm is not a reliable proxy for generalization when the model is regularized toward a different property $P$, as the $\ell_2$ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified solely through data selection, with any other hyperparameter fixed." -On Bitrates of Very Sparse Superposition Codes,Christopher Neil Gadzinski Decebal Constantin Mocanu,https://openreview.net/forum?id=peR9HAdJnk, -Approximating Nash Equilibria in General-Sum Games via Meta-Learning,David Sychrovský Christopher Solinas Revan MacQueen Kevin A. Wang James R. Wright Nathan R. Sturtevant Michael Bowling,https://openreview.net/forum?id=OEKs42CZBJ, Adversarial Robust Generalization of Graph Neural Networks,Chang Cao Han Li Yulong Wang Rui Wu Hong Chen,https://icml.cc/virtual/2025/poster/46457,"While Graph Neural Networks (GNNs) have shown outstanding performance in node classification tasks, they are vulnerable to adversarial attacks, which are imperceptible changes to input samples. Adversarial training, as a widely used tool to enhance the adversarial robustness of GNNs, has presented remarkable effectiveness in node classification tasks. However, the generalization properties for explaining their behaviors remain not well understood from the theoretical viewpoint. To fill this gap, we develop a high probability generalization bound of general GNNs in adversarial learning through covering number analysis. We estimate the covering number of the GNN model class based on the entire perturbed feature matrix by constructing a cover for the perturbation set. Our results are generally applicable to a series of GNNs. We demonstrate their applicability by investigating the generalization performance of several popular GNN models under adversarial attacks, which reveal the architecture-related factors influencing the generalization gap. Our experimental results on benchmark datasets provide evidence that supports the established theoretical findings." Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension,Yijun Dong Yicheng Li Yunai Li Jason D. 
Lee Qi Lei,https://icml.cc/virtual/2025/poster/43961,"Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student-weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\mathrm{dim}(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Our analysis further casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported by experiments on synthetic regression problems, as well as real vision and NLP tasks." Limitations of measure-first protocols in quantum machine learning,Casper Gyurik Riccardo Molteni Vedran Dunjko,https://icml.cc/virtual/2025/poster/44218,"In recent times, there have been major developments in two distinct yet connected domains of quantum information. On the one hand, substantial progress has been made in so-called randomized measurement protocols. Here, a number of properties of unknown quantum states can be deduced from surprisingly few measurement outcomes, using schemes such as classical shadows. On the other hand, significant progress has been made in quantum machine learning. For example, exponential advantages have been proven when the data consists of quantum states and quantum algorithms can coherently measure multiple copies of input states. In this work, we aim to understand the implications and limitations of combining randomized measurement protocols with quantum machine learning, although the implications are broader. Specifically, we investigate quantum machine learning algorithms that, when dealing with quantum data, can either process it entirely using quantum methods or measure the input data through a fixed measurement scheme and utilize the resulting classical information. We prove limitations for the general class of quantum machine learning algorithms that use fixed measurement schemes on the input quantum states. Our results have several implications. From the perspective of randomized measurement procedures, we show limitations of measure-first protocols in the average case, improving on the state-of-the-art which only focuses on worst-case scenarios. Additionally, previous lower bounds were only known for physically unrealizable states. We improve upon this by employing quantum pseudorandom functions to prove that a learning separation also exists when dealing with physically realizable states, which may be encountered in experiments. From a machine learning perspective, our results are crucial for defining a physically meaningful task that shows fully quantum machine learning processing is not only more efficient but also necessary for solving certain problems.
The tasks at hand are also realistic, as the algorithms and proven separations hold when working with efficiently preparable states and remain robust in the presence of measurement and preparation errors." @@ -3307,13 +3191,11 @@ Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantizatio AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization,Junkang Wu Xue Wang Zhengyi Yang Jiancan Wu Jinyang Gao Bolin Ding Xiang Wang Xiangnan He,https://icml.cc/virtual/2025/poster/45946,"Aligning large language models (LLMs) with human preferences requires balancing policy optimization with computational stability. While recent offline methods like DPO and SimPO bypass reinforcement learning’s complexity, they face critical limitations: DPO relies on static reference models that degrade with policy updates, and SimPO assumes a uniform target reward margin that ignores instance-wise preference strength. We propose AlphaDPO, an adaptive preference optimization framework that dynamically reparameterizes the reference distribution to address these issues. Our key innovation lies in an implicit reference model ($\hat{\pi}_{\text{ref}} \propto U(y|x)(\pi_\theta/\pi_{\text{ref}})^\alpha$), which interpolates between policy-driven specialization and uniform exploration while enabling instance-adaptive reward margins. Theoretically, we prove AlphaDPO implicitly controls sequential KL divergence between iterative policy updates, ensuring stability even with poorly calibrated reference models. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7\% LC win rate) and Arena-Hard (35.7\% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. Our work establishes adaptive reference reparameterization as a principled mechanism for preference optimization." "One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs",Yinghui Li Jiayi Kuang Haojing Huang Zhikun Xu Xinnian Liang Yi Yu Wenlian Lu Yangning Li Xiaoyu Tan Chao Qu Ying Shen Hai-Tao Zheng Philip S. Yu,https://icml.cc/virtual/2025/poster/46191,"Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of ""proof by counterexamples"" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities.
We believe that our work offers new perspectives on the community of mathematical LLMs." "System-Aware Unlearning Algorithms: Use Lesser, Forget Faster",Linda Lu Ayush Sekhari Karthik Sridharan,https://icml.cc/virtual/2025/poster/45985,"Machine unlearning addresses the problem of updating a machine learning model/system trained on a dataset $S$ so that the influence of a set of deletion requests $U \subseteq S$ on the unlearned model is minimized. The gold standard definition of unlearning demands that the updated model, after deletion, be nearly identical to the model obtained by retraining. This definition is designed for a worst-case attacker (one who can recover not only the unlearned model but also the remaining data samples, i.e., $S \setminus U$). Such a stringent definition has made developing efficient unlearning algorithms challenging. However, such strong attackers are also unrealistic. In this work, we propose a new definition, *system-aware unlearning*, which aims to provide unlearning guarantees against an attacker that can at best only gain access to the data stored in the system for learning/unlearning requests and not all of $S\setminus U$. With this new definition, we use the simple intuition that if a system can store less to make its learning/unlearning updates, it can be more secure and update more efficiently against a system-aware attacker. Towards that end, we present an exact system-aware unlearning algorithm for linear classification using a selective sampling-based approach, and we generalize the method for classification with general function classes. We theoretically analyze the tradeoffs between deletion capacity, accuracy, memory, and computation time." -Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter,Hong Wang Jie Wang Jian Luo huanshuo dong Yeqiu Chen Runmin Jiang Zhen huang,https://openreview.net/forum?id=NslNp4sGmE, Closed-form Solutions: A New Perspective on Solving Differential Equations,Shu Wei Yanjie Li Lina Yu Weijun Li Min Wu Linjun Sun Jingyi Liu Hong Qin Yusong Deng Jufeng Han Yan Pang,https://icml.cc/virtual/2025/poster/46180,"The quest for analytical solutions to differential equations has traditionally been constrained by the need for extensive mathematical expertise. Machine learning methods like genetic algorithms have shown promise in this domain, but are hindered by significant computational time and the complexity of their derived solutions. This paper introduces SSDE (Symbolic Solver for Differential Equations), a novel reinforcement learning-based approach that derives symbolic closed-form solutions for various differential equations. Evaluations across a diverse set of ordinary and partial differential equations demonstrate that SSDE outperforms existing machine learning methods, delivering superior accuracy and efficiency in obtaining analytical solutions." CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization,Dasol Hong Wooju Lee Hyun Myung,https://icml.cc/virtual/2025/poster/44709,"Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization.
To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix" Complex Wavelet Mutual Information Loss: A Multi-Scale Loss Function for Semantic Segmentation,Renhao Lu,https://icml.cc/virtual/2025/poster/44329,"Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose the complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. The complex steerable pyramid captures features across multiple orientations and preserves structural similarity across scales. Meanwhile, mutual information is well-suited to capturing high-dimensional directional features and offers greater noise robustness. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead. Our code is available at https://github.com/lurenhaothu/CWMI" Few-Shot Learner Generalizes Across AI-Generated Image Detection,Shiyu Wu Jing Liu Jing Li Yequan Wang,https://icml.cc/virtual/2025/poster/43709,"Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, these detectors suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space for effectively distinguishing unseen fake images using very few samples. Experiments show that FSD achieves state-of-the-art performance by $+11.6\%$ average accuracy on the GenImage dataset with only $10$ additional samples. More importantly, our method is better capable of capturing the intra-category commonality in unseen images without further training. Our code is available at https://github.com/teheperinko541/Few-Shot-AIGI-Detector."
Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection,Peipeng Yu Jianwei Fei Hui Gao Xuan Feng Zhihua Xia Chip Hong Chang,https://icml.cc/virtual/2025/poster/43687,"Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment of their knowledge and forensics patterns. To this end, we present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection. Our framework includes a Knowledge-guided Forgery Detector (KFD), a Forgery Prompt Learner (FPL), and a Large Language Model (LLM). The KFD is used to calculate correlations between image features and pristine/deepfake image description embeddings, enabling forgery classification and localization. The outputs of the KFD are subsequently processed by the Forgery Prompt Learner to construct fine-grained forgery prompt embeddings. These embeddings, along with visual and question prompt embeddings, are fed into the LLM to generate textual detection responses. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, DFDC, and DF40, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities." -Non-invasive electromyographic speech neuroprosthesis: a geometric perspective,Harshavardhana T Gowda Ferdous Rahimi Lee M. Miller,https://openreview.net/forum?id=fk5a8tUCSJ, Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM,Xiong Wang Yangze Li Chaoyou Fu Yike Zhang Yunhang Shen Lei Xie Ke Li Xing Sun Long MA,https://icml.cc/virtual/2025/poster/43854,"The GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multimodal LLMs to achieve user-agent speech-to-speech conversations. In this paper, we propose a novel speech-text multimodal LLM architecture called Freeze-Omni, and our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We effectively ensure that the intelligence of the Freeze-Omni in the speech modality is at the same level as that in the text modality of its backbone LLM while achieving low latency in the end-to-end spoken response. In addition, we also designed a method to achieve duplex dialogue ability through multitask training, giving Freeze-Omni a more natural style of dialogue ability between users and agents. In summary, Freeze-Omni holds great potential to conduct speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources." AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models,Yaopei Zeng Yuanpu Cao Bochuan Cao Yurui Chang Jinghui Chen Lu Lin,https://icml.cc/virtual/2025/poster/45719,"Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. 
In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models." Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning,Fan Shi Bin Li Xiangyang Xue,https://icml.cc/virtual/2025/poster/43966,"Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios. Designing intelligent systems with human-like AVR abilities has been a long-standing topic in the artificial intelligence community. Deep AVR solvers have recently achieved remarkable success in various AVR tasks. However, they usually use task-specific designs or parameters in different tasks. In such a paradigm, solving new tasks often means retraining the model, and sometimes retuning the model architectures, which increases the cost of solving AVR problems. In contrast to task-specific approaches, this paper proposes a novel Unified Conditional Generative Solver (UCGS), aiming to address multiple AVR tasks in a unified framework. First, we prove that some well-known AVR tasks can be reformulated as the problem of estimating the predictability of target images in problem panels. Then, we illustrate that, under the proposed framework, training one conditional generative model can solve various AVR tasks. The experiments show that with a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks. Especially, UCGS exhibits the ability of zero-shot reasoning, enabling it to perform abstract reasoning on problems from unseen AVR tasks in the testing phase." @@ -3326,8 +3208,6 @@ Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM,Penghao Training Software Engineering Agents and Verifiers with SWE-Gym,Jiayi Pan Xingyao Wang Graham Neubig Navdeep Jaitly Heng Ji Alane Suhr Yizhe Zhang,https://icml.cc/virtual/2025/poster/46038,"We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories." 
Deep Principal Support Vector Machines for Nonlinear Sufficient Dimension Reduction,Yinfeng Chen Jin Liu Rui Qiu,https://icml.cc/virtual/2025/poster/43795,"The normal vectors obtained from the support vector machine (SVM) method offer the potential to achieve sufficient dimension reduction in both classification and regression scenarios. Motivated by it, we in this paper introduce a unified framework for nonlinear sufficient dimension reduction based on classification ensemble. Kernel principal SVM, which leverages the reproducing kernel Hilbert space, can almost be regarded as a special case of this framework, and we generalize it by using a neural network function class for more flexible deep nonlinear reduction. We theoretically prove its unbiasedness with respect to the central $\sigma$-field and provide a nonasymptotic upper bound for the estimation error. Simulations and real data analysis demonstrate the considerable competitiveness of the proposed method, especially under heavy data contamination, large sample sizes, and complex inputs." Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach,Xu Zhang Kaidi Xu Ziqing Hu Ren Wang,https://icml.cc/virtual/2025/poster/44667,"Mixture of Experts (MoE) have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy JTDMoE for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods." 
-How Classifiers Extract General Features for Downstream Tasks: An Asymptotic Analysis in Two-Layer Models,HEE BIN YOO Sungyoon Lee Cheongjae Jang Dong-Sig Han Jaein Kim Seunghyeon Lim Byoung-Tak Zhang,https://openreview.net/forum?id=vlRd3GeJEq, -Universal Approximation Theorem of Networks Activated by Normalization,Yunhao Ni Yuhe Liu WenXin Sun Yitong Tang Peilin Feng Yuxin Guo wenjun wu Lei Huang,https://openreview.net/forum?id=fGflKXfEP1, Distributionally Robust Policy Learning under Concept Drifts,Jingyuan Wang Zhimei Ren Ruohan Zhan Zhengyuan Zhou,https://icml.cc/virtual/2025/poster/45171,"Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case *joint* distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem --- robust policy learning under the *concept drift*, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks." Contrastive Learning with Simplicial Convolutional Networks for Short-Text Classification,Huang Liang Benedict Lee Daniel Hui Loong Ng Kelin Xia,https://icml.cc/virtual/2025/poster/45750,"Text classification is a fundamental task in Natural Language Processing (NLP). Short text classification has recently captured much attention due to its increased amount from various sources with limited labels and its inherent challenges for its sparsity in words and semantics. Recent studies have adopted self-supervised contrastive learning across different representations to improve performance. However, most of the current models face several challenges. Firstly, the augmentation step might not be able to generate positive and negative samples that are semantically similar and dissimilar to the anchor respectively. Secondly, the text data could be enhanced with external auxiliary information that might introduce noise to the sparse text data. In addition, they are limited in capturing higher-order information such as group-wise interactions. In this work, we propose a novel document simplicial complex construction based on text data for a higher-order message-passing mechanism. We enhance the short text classification performance by contrasting the structural representation with the sequential representation generated by the transformer mechanism for improved outcomes and mitigated issues. 
The proposed framework, Contrastive Learning with Simplicial Convolutional Networks (C-SCN), leverages the expressive power of graph neural networks, models higher-order information beyond pair-wise relations and enriches features through contrastive learning. Experimental results on four benchmark datasets demonstrate the capability of C-SCN to outperform existing models in analysing sequential and complex short-text data." Slimming the Fat-Tail: Morphing-Flow for Adaptive Time Series Modeling,Tianyu Liu Kai Sun Fuchun Sun Yu Luo Yuanlong Zhang,https://icml.cc/virtual/2025/poster/44444,"Temporal sequences, even after stationarization, often exhibit leptokurtic distributions with fat tails and persistent distribution shifts. These properties destabilize feature dynamics, amplify model variance, and hinder model convergence in time series forecasting. To address this, we propose Morphing-Flow (MoF), a framework that combines a spline-based transform layer (Flow) and a test-time-trained method (Morph), which adaptively normalizes non-stationary, fat-tailed distributions while preserving critical extreme features. MoF ensures that inputs remain within a network’s effective activation space—a structured, normal-like distribution—even under distributional drift. Experiments across eight datasets show that MoF achieves state-of-the-art performance: With a simple linear backbone architecture, it matches the performance of state-of-the-art models on datasets such as Electricity and ETTh2. When paired with a patch-based Mamba architecture, MoF outperforms its closest competitor by 6.3% on average and reduces forecasting errors in fat-tailed datasets such as Exchange by 21.7%. Moreover, MoF acts as a plug-and-play module, boosting performance in existing models without architectural changes." @@ -3336,9 +3216,7 @@ When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping O Hybrid Batch Normalisation: Resolving the Dilemma of Batch Normalisation in Federated Learning,Hongyao Chen Tianyang Xu Xiaojun Wu Josef Kittler,https://icml.cc/virtual/2025/poster/43475,"Batch Normalisation (BN) is widely used in conventional deep neural network training to harmonise the input-output distributions for each batch of data. However, federated learning, a distributed learning paradigm, faces the challenge of dealing with non-independent and identically distributed data among the client nodes. Due to the lack of a coherent methodology for updating BN statistical parameters, standard BN degrades the federated learning performance. To this end, it is urgent to explore an alternative normalisation solution for federated learning. In this work, we resolve the dilemma of the BN layer in federated learning by developing a customised normalisation approach, Hybrid Batch Normalisation (HBN). HBN separates the update of statistical parameters (i.e., means and variances used for evaluation) from that of learnable parameters (i.e., parameters that require gradient updates), obtaining unbiased estimates of global statistical parameters in distributed scenarios. In contrast with the existing solutions, we emphasise the supportive power of global statistics for federated learning. The HBN layer introduces a learnable hybrid distribution factor, allowing each computing node to adaptively mix the statistical parameters of the current batch with the global statistics.
Our HBN can serve as a powerful plugin to advance federated learning performance. It reflects promising merits across a wide range of federated learning settings, especially for small batch sizes and heterogeneous data. Code is available at https://github.com/Hongyao-Chen/HybridBN." Adaptive Partitioning Schemes for Optimistic Optimization,Raja Sunkara Ardhendu Tripathy,https://icml.cc/virtual/2025/poster/43904,"Applications such as engineering design often require us to optimize a black-box function, i.e., a system whose inner processing is not analytically known and whose gradients are not available. Practitioners often have a fixed budget for the number of function evaluations and the performance of an optimization algorithm is measured by its simple regret. In this paper, we study the class of ``Optimistic Optimization'' algorithms for black-box optimization that use a partitioning scheme for the domain. We develop algorithms that learn a good partitioning scheme and use flexible surrogate models such as neural networks in the optimization procedure. For multi-index functions on an $m$-dimensional subspace within $d$ dimensions, our algorithm attains $\tilde{O}(n^{-\beta / d})$ regret, where $\beta = 1 + \frac{d-m}{2m-1}$, as opposed to $\tilde{O}(n^{-1/d})$ for SequOOL, a state-of-the-art optimistic optimization algorithm. Our approach is competitive across a wide range of numerical benchmarks. Additionally, we introduce weight quantization in a large language model as a novel task for black-box optimization. Our approach improves the quality of Activation-aware Weight Quantization (AWQ) of the OPT-1.3B model, achieving an approximate 10\% improvement in performance relative to the best possible unquantized model." Optimal Sensor Scheduling and Selection for Continuous-Discrete Kalman Filtering with Auxiliary Dynamics,Mohamad Al Ahdab john leth Zheng-Hua Tan,https://icml.cc/virtual/2025/poster/46077,"We study the Continuous-Discrete Kalman Filter (CD-KF) for State-Space Models (SSMs) where continuous-time dynamics are observed via multiple sensors with discrete, irregularly timed measurements. Our focus extends to scenarios in which the measurement process is coupled with the states of an auxiliary SSM. For instance, higher measurement rates may increase energy consumption or heat generation, while a sensor’s accuracy can depend on its own spatial trajectory or that of the measured target. Each sensor thus carries distinct costs and constraints associated with its measurement rate and additional constraints and costs on the auxiliary state. We model measurement occurrences as independent Poisson processes with sensor-specific rates and derive an upper bound on the mean posterior covariance matrix of the CD-KF along the mean auxiliary state. The bound is continuously differentiable with respect to the measurement rates, which enables efficient gradient-based optimization. Exploiting this bound, we propose a finite-horizon optimal control framework to optimize measurement rates and auxiliary-state dynamics jointly. We further introduce a deterministic method for scheduling measurement times from the optimized rates. Empirical results in state-space filtering and dynamic temporal Gaussian process regression demonstrate that our approach achieves improved trade-offs between resource usage and estimation accuracy."
-Achieve Performatively Optimal Policy for Performative Reinforcement Learning,Ziyi Chen Heng Huang,https://openreview.net/forum?id=iHu3oAA5lJ, Optimizing Language Models for Inference Time Objectives using Reinforcement Learning,Yunhao Tang Kunhao Zheng Gabriel Synnaeve Remi Munos,https://icml.cc/virtual/2025/poster/44851,"In this work, we investigate the merits of explicitly optimizing for inference time algorithmic performance during model training. We show how optimizing for inference time performance can improve overall model efficacy. We consider generic inference time objectives with $k$ samples, with focus on pass@$k$ and majority voting as two main applications. With language model training on reasoning datasets, we showcase the performance trade-off enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method." -Fixing Value Function Decomposition for Multi-Agent Reinforcement Learning,Andrea Baisero Rupali Bhati Shuo Liu Aathira Sunil Pillai Christopher Amato,https://openreview.net/forum?id=qUtxbtsfwp, Learning Progress Driven Multi-Agent Curriculum,Wenshuai Zhao Zhiyuan Li Joni Pajarinen,https://icml.cc/virtual/2025/poster/46153,"The number of agents can be an effective curriculum variable for controlling the difficulty of multi-agent reinforcement learning (MARL) tasks. Existing work typically uses manually defined curricula such as linear schemes. We identify two potential flaws while applying existing reward-based automatic curriculum learning methods in MARL: (1) The expected episode return used to measure task difficulty has high variance; (2) Credit assignment difficulty can be exacerbated in tasks where increasing the number of agents yields higher returns which is common in many MARL tasks. To address these issues, we propose to control the curriculum by using a TD-error based learning progress measure and by letting the curriculum proceed from an initial context distribution to the final task specific one. Since our approach maintains a distribution over the number of agents and measures learning progress rather than absolute performance, which often increases with the number of agents, we alleviate problem (2). Moreover, the learning progress measure naturally alleviates problem (1) by aggregating returns. In three challenging sparse-reward MARL benchmarks, our approach outperforms state-of-the-art baselines." B-score: Detecting biases in large language models using response history,An Vo Mohammad Reza Taesiri Daeyoung Kim Anh Totti Nguyen,https://icml.cc/virtual/2025/poster/44236,"Large language models (LLMs) often exhibit strong biases, e.g, against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to ""de-bias"" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases in Subjective, Random, Easy, and Hard questions.
On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e, accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: b-score.github.io." What makes an Ensemble (Un) Interpretable?,Shahaf Bassan Guy Amir Meirav Zehavi Guy Katz,https://icml.cc/virtual/2025/poster/44253,"Ensemble models are widely recognized in the ML community for their limited interpretability. For instance, while a single decision tree is considered interpretable, ensembles of trees (e.g., boosted trees) are often treated as black-boxes. Despite this folklore recognition, there remains a lack of rigorous mathematical understanding of what particularly makes an ensemble (un)-interpretable, including how fundamental factors like the (1) *number*, (2) *size*, and (3) *type* of base models influence its interpretability. In this work, we seek to bridge this gap by applying concepts from computational complexity theory to study the challenges of generating explanations for various ensemble configurations. Our analysis uncovers nuanced complexity patterns influenced by various factors. For example, we demonstrate that under standard complexity assumptions like P$\neq$NP, interpreting ensembles remains intractable even when base models are of constant size. Surprisingly, the complexity changes drastically with the number of base models: small ensembles of decision trees are efficiently interpretable, whereas ensembles of linear models remain intractable, even with a constant number of models. We believe that our findings provide a more robust foundation for understanding the interpretability of ensembles, emphasizing the benefits of examining it through a computational complexity lens." @@ -3349,7 +3227,6 @@ Offline Learning for Combinatorial Multi-armed Bandits,Xutong Liu Xiangxiang Dai Tracking Most Significant Shifts in Infinite-Armed Bandits,Joe Suk Jung-hun Kim,https://icml.cc/virtual/2025/poster/46066,"We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from areservoir distribution. Most prior works in this setting focused on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015) with the more challenging adversarial/non-stationary variant only recently studied in the context of rotting/decreasing rewards (Kim et al., 2022; 2024). Furthermore, optimal regret upper bounds were only achieved using parameter knowledge of non-stationarity and only known for certain regimes of regularity of the reservoir. This work shows the first parameter-free optimal regret bounds while also relaxing these distributional assumptions. We also study a natural notion ofsignificant shiftfor this problem inspired by recent developments in finite-armed MAB (Suk & Kpotufe, 2022). We show that tighter regret bounds in terms of significant shifts can be adaptively attained. Our enhanced rates only depend on the rotting non-stationarity and thus exhibit an interesting phenomenon for this problem where rising non-stationarity does not factor into the difficulty of non-stationarity." 
Agent Workflow Memory,Zora Zhiruo Wang Jiayuan Mao Daniel Fried Graham Neubig,https://icml.cc/virtual/2025/poster/45496,"Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks — Mind2Web and WebArena — that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen." Calibrated Physics-Informed Uncertainty Quantification,Vignesh Gopakumar Ander Gray Lorenzo Zanisi Timothy Nunn Daniel Giles Matt Kusner Stanislas Pamela Marc Peter Deisenroth,https://icml.cc/virtual/2025/poster/44881,"Simulating complex physical systems is crucial for understanding and predicting phenomena across diverse fields, such as fluid dynamics and heat transfer, as well as plasma physics and structural mechanics. Traditional approaches rely on solving partial differential equations (PDEs) using numerical methods, which are computationally expensive and often prohibitively slow for real-time applications or large-scale simulations. Neural PDEs have emerged as efficient alternatives to these costly numerical solvers, offering significant computational speed-ups. However, their lack of robust uncertainty quantification (UQ) limits deployment in critical applications. We introduce a model-agnostic, physics-informed conformal prediction (CP) framework that provides guaranteed uncertainty estimates without requiring labelled data. By utilising a physics-based approach, we can quantify and calibrate the model's inconsistencies with the physics rather than the uncertainty arising from the data. Our approach utilises convolutional layers as finite-difference stencils and leverages physics residual errors as nonconformity scores, enabling data-free UQ with marginal and joint coverage guarantees across prediction domains for a range of complex PDEs. We further validate the efficacy of our method on neural PDE models for plasma modelling and shot design in fusion reactors." 
-It Takes Two to Tango: Directly Optimizing for Constrained Synthesizability in Generative Molecular Design,Jeff Guo Philippe Schwaller,https://openreview.net/forum?id=Im90Ziq4M1, PINNsAgent: Automated PDE Surrogation with Large Language Models,Qingpo Wuwu Chonghan Gao Tianyu Chen Yihang Huang Yuekai Zhang Jianing Wang Jianxin Li Haoyi Zhou Shanghang Zhang,https://icml.cc/virtual/2025/poster/45282,"Solving partial differential equations (PDEs) using neural methods has been a long-standing scientific and engineering research pursuit. Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative to traditional numerical methods for solving PDEs. However, the gap between domain-specific knowledge and deep learning expertise often limits the practical application of PINNs. Previous works typically involve manually conducting extensive PINNs experiments and summarizing heuristic rules for hyperparameter tuning. In this work, we introduce PINNsAgent, a novel surrogation framework that leverages large language models (LLMs) to bridge the gap between domain-specific knowledge and deep learning. PINNsAgent integrates Physics-Guided Knowledge Replay (PGKR) for efficient knowledge transfer from solved PDEs to similar problems, and Memory Tree Reasoning for exploring the search space of optimal PINNs architectures. We evaluate PINNsAgent on 14 benchmark PDEs, demonstrating its effectiveness in automating the surrogation process and significantly improving the accuracy of PINNs-based solutions." Symmetry-Driven Discovery of Dynamical Variables in Molecular Simulations,Jeet Mohapatra Nima Dehmamy Csaba Both Subhro Das Tommi Jaakkola,https://icml.cc/virtual/2025/poster/45068,"We introduce a novel approach for discovering effective degrees of freedom (DOF) in molecular dynamics simulations by mapping the DOF to approximate symmetries of the energy landscape. Unlike most existing methods, we do not require trajectory data but instead rely on knowledge of the forcefield (energy function) around the initial state. We present a scalable symmetry loss function compatible with existing force-field frameworks and a Hessian-based method efficient for smaller systems. Our approach enables systematic exploration of conformational space by connecting structural dynamics to energy landscape symmetries. We apply our method to two systems, Alanine dipeptide and Chignolin, recovering their known important conformations. Our approach can prove useful for efficient exploration in molecular simulations with potential applications in protein folding and drug discovery." Fine-Grained Captioning of Long Videos through Scene Graph Consolidation,Sanghyeok Chu Seonguk Seo Bohyung Han,https://icml.cc/virtual/2025/poster/44795,"Recent advances in vision-language models have led to impressive progress in caption generation for images and short video clips. However, these models remain constrained by their limited temporal receptive fields, making it difficult to produce coherent and comprehensive captions for long videos. While several methods have been proposed to aggregate information across video segments, they often rely on supervised fine-tuning or incur significant computational overhead. To address these challenges, we introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals, using off-the-shelf visual captioning models.
These captions are then parsed into individual scene graphs, which are subsequently consolidated into a unified graph representation that preserves both holistic context and fine-grained details throughout the video. A lightweight graph-to-text decoder then produces the final video-level caption. This framework effectively extends the temporal understanding capabilities of existing models without requiring any additional fine-tuning on long video datasets. Experimental results show that our method significantly outperforms existing LLM-based consolidation approaches, achieving strong zero-shot performance while substantially reducing computational costs." @@ -3357,21 +3234,14 @@ LineFlow: A Framework to Learn Active Control of Production Lines,Kai Müller Ma Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners,Manh Pham Hung Aaqib Saeed Dong Ma,https://icml.cc/virtual/2025/poster/44157,"The accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with accompanying textual reports further holds immense potential to enhance clinical diagnostics by combining physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose D-BETA, a novel framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture, uniquely combining generative and boosted discriminative capabilities for robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that D-BETA significantly outperforms existing methods, achieving an average AUC improvement of 15% in linear probing with only one percent of training data and 2% in zero-shot performance without requiring training data over state-of-the-art models. These results highlight the effectiveness of D-BETA, underscoring its potential to advance automated clinical diagnostics through multi-modal representations." DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression,Junjie Wen Yichen Zhu Minjie Zhu Zhibin Tang Jinming Li Zhongyi Zhou Xiaoyu Liu Chaomin Shen Yaxin Peng Feifei Feng,https://icml.cc/virtual/2025/poster/45061,"In this paper, we present DiffusionVLA, a novel framework that integrates autoregressive reasoning with diffusion policies to address the limitations of existing methods: while autoregressive Vision-Language-Action (VLA) models lack precise and robust action generation, diffusion-based policies inherently lack reasoning capabilities. Central to our approach is autoregressive reasoning — a task decomposition and explanation process enabled by a pre-trained VLM — to guide diffusion-based action policies. To tightly couple reasoning with action generation, we introduce a reasoning injection module that directly embeds self-generated reasoning phrases into the policy learning process. The framework is simple, flexible, and efficient, enabling seamless deployment across diverse robotic platforms.We conduct extensive experiments using multiple real robots to validate the effectiveness of DiVLA. 
Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model’s decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving \textbf{63.7\% accuracy on 102 previously unseen objects}. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiVLA can follow novel instructions and retain conversational ability. Notably, DiVLA is data-efficient and fast at inference; our smallest DiVLA-2B runs 82Hz on a single A6000 GPU. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size." Star Attention: Efficient LLM Inference over Long Sequences,Shantanu Acharya Fei Jia Boris Ginsburg,https://icml.cc/virtual/2025/poster/45335,"Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 97-100% of accuracy." -Beyond KL-Regularization: Achieving Unbiased Direct Alignment through Diffusion $f_{\chi^n}$-Preference Optimization,Xinjian Zhang Wei Xiang,https://openreview.net/forum?id=dZehp9p5Jt, -Conditional Lagrangian Wasserstein Flow for Time Series Imputation,Weizhu Qian Dalin Zhang Yan Zhao Yunyao Cheng,https://openreview.net/forum?id=yK6yb16vRe, -Wide & Deep Learning for Node Classification,Yancheng Chen Wenguo Yang Zhipeng Jiang,https://openreview.net/forum?id=MaP5KdgbID, A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?,Ibrahim Alabdulmohsin Andreas Peter Steiner,https://icml.cc/virtual/2025/poster/44028,"Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions-such as temperature setting and prompting method-under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. 
We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts." Collapse or Thrive: Perils and Promises of Synthetic Data in a Self-Generating World,Joshua Kazdan Rylan Schaeffer Apratim Dey Matthias Gerstgrasser Rafael Rafailov David L. Donoho Sanmi Koyejo,https://icml.cc/virtual/2025/poster/44713,"What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of “model collapse” as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of {\it replacing} all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of {\it accumulating} synthetic data alongside real data and training on all data combined and confirming that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data." DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination,Simin Chen Pranav Pusarla Baishakhi Ray,https://icml.cc/virtual/2025/poster/46547,"The rapid advancement of code large language models (Code LLMs) underscores the critical need for effective and transparent benchmarking methods. However, current benchmarking predominantly relies on publicly available, human-created datasets. The widespread use of these static benchmark datasets makes the evaluation process particularly susceptible to data contamination—an unavoidable consequence of the extensive data collection processes employed during LLM training. Existing methods for addressing data contamination typically face significant limitations, including reliance on substantial human effort and difficulty in managing class imbalances. To overcome these challenges, we propose DyCodeEval, a novel benchmarking suite specifically designed to evaluate Code LLMs under realistic contamination scenarios. Given an initial seed programming problem, DyCodeEval utilizes multiple agents to systematically extract and modify contextual information without changing the core logic, generating semantically equivalent variations. We introduce a dynamic data generation method and conduct extensive empirical studies on two seed datasets involving 18 Code LLMs. 
The results demonstrate that DyCodeEval effectively assesses the reasoning capabilities of Code LLMs under contamination conditions while producing diverse problem variants, thereby ensuring robust and consistent benchmarking outcomes." Hidden No More: Attacking and Defending Private Third-Party LLM Inference,Rahul Krishna Thomas Louai Zahran Erica Choi Akilesh Potti Micah Goldblum Arka Pal,https://icml.cc/virtual/2025/poster/45330,"Recent advances in Large Language Models (LLMs) have led to widespread adoption of third-party inference services, raising critical privacy concerns. In this work, we introduce a novel reconstruction technique that can recover original prompts from hidden states with nearly perfect accuracy across multiple state-of-the-art LLMs in the increasingly important open-weights setting. Although the attack is conceptually simple, it has not -- to the best of our knowledge -- previously been described nor shown to work practically. Furthermore, our attack remains effective against various permutation and noise-based defenses, challenging assumptions about the security of previously proposed schemes. To address these vulnerabilities, we propose Cascade, a multi-party inference scheme that leverages sharding in the sequence dimension to retain privacy of the user input. Through theoretical analysis and empirical evaluation, we demonstrate that Cascade is secure against both our attack as well as previous methods, while maintaining computational and communication efficiency. Our findings highlight the importance of rigorous security analysis in privacy-preserving LLM inference and offer practical solutions for secure deployment." On the Robustness of Reward Models for Language Model Alignment,Jiwoo Hong Noah Lee Eunki Kim Guijin Son Woojin Chung Aman Gupta Shao Tang James Thorne,https://icml.cc/virtual/2025/poster/45164,"The Bradley-Terry (BT) model is widely practiced in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss as one-way classifiers are prone to over-optimization, losing generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR) that enforces the reward sum for each batch to be zero-centered, constraining the rewards with abnormally large magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness on unseen inputs. Then, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, which surpasses state-of-the-art RMs in the 8B scale by adding more than 5\% in complex preference prediction tasks. Conducting RLOO training with 8B RMs reduces generation length by 40\% on AlpacaEval 2.0 while adding a 7\% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training."
-"Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment",Ziyi Chen Junyi Li Peiran Yu Heng Huang,https://openreview.net/forum?id=MJ2iChsvhs, RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic and Human-like Reasoning,Jason Chan Robert J. Gaizauskas Zhixue Zhao,https://icml.cc/virtual/2025/poster/43712,"Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterizes as ""rulebreaker"" scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a knowledge-informed and human-like manner. Evaluating seven LLMs, we find that most models achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules, unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning." Scaling Sparse Feature Circuits For Studying In-Context Learning,Dmitrii Kharlapenko Stepan Shabalin Arthur Conmy Neel Nanda,https://icml.cc/virtual/2025/poster/45531,"Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEsto deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot.This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we scale the sparse feature circuits methodology of Marks et al. (2024) to the Gemma 1 2B model for the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt, that detect when tasks have been performed. They are causally linked with task-execution features through the attention and MLP sublayers." -Temporally Sparse Attack for Fooling Large Language Models in Time Series Forecasting,Fuqiang Liu Sicong Jiang,https://openreview.net/forum?id=wEvAJFYXXN, The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models,Shishir G Patil Huanzhi Mao Fanjia Yan Charlie Cheng-Jie Ji Vishnu Suresh Ion Stoica Joseph E. 
Gonzalez,https://icml.cc/virtual/2025/poster/46593,"Function calling, also called tool use, refers to an LLM's ability to invoke external functions, APIs, or user-defined tools in response to user queries—an essential capability for agentic LLM applications. Despite its prominence, there did not exist a standard benchmark to evaluate function calling abilities, due to two reasons – the challenging nature of evaluating when a function call is valid, and the challenge of acquiring diverse, real-world functions. We present the Berkeley Function Calling Leaderboard (BFCL), a comprehensive benchmark designed to evaluate function calling capabilities in a wide range of real-world settings. The BFCL benchmark evaluates serial and parallel function calls across various programming languages, using a novel Abstract Syntax Tree (AST) evaluation method that can easily scale to thousands of functions. We construct the benchmark using a combination of expert-curated and user-contributed functions and associated prompts. Finally, the BFCL benchmark evaluates the ability of models to abstain and reason in a stateful multi-step agentic setting. Evaluating a wide range of models, we observe that while state-of-the-art LLMs excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain open challenges. Since its preview, BFCL has become the de facto standard for evaluating function calls, and can be accessed at gorilla.cs.berkeley.edu/leaderboard.html." 
-ASRC-SNN: Adaptive Skip Recurrent Connection Spiking Neural Network,Shang Xu Jiayu Zhang Ziming Wang Runhao Jiang Rui Yan Huajin Tang,https://openreview.net/forum?id=KsIhAcB84m, 
-Off-Policy Evaluation of Ranking Policies for Large Action Spaces via Embeddings and User Behavior Assumption,Tatsuki Takahashi Chihiro Maru Hiroko Shoji,https://openreview.net/forum?id=BJ0i8JfECt, 
Pfeife: Automatic Pipeline Parallelism for PyTorch,Ho Young Jhoo Chung-Kil Hur Nuno P. Lopes,https://icml.cc/virtual/2025/poster/45884,"The memory requirements of machine learning (ML) models have been growing quickly. However, the memory capacity of GPUs has not kept pace. Despite significant research on reducing the memory usage of ML models, larger models do not fit in a single device. A popular solution to the memory capacity issue is to use multiple devices in parallel. In this paper, we focus on a particular form of parallelism called pipelining, as it offers a good balance between cost and performance for many ML models. We present Pfeife, the first tool that integrates with PyTorch to provide automatic pipelining of ML models. Pfeife intercepts the execution of models and parallelizes them transparently, requiring no manual work. We show that Pfeife can execute large models that would otherwise not run due to not fitting in a single device. Moreover, Pfeife can pipeline non-sequential models such as Stable Diffusion, which are not supported by existing pipelining parallelism tools. Pfeife outperforms state-of-the-art tools by up to 22%." Progressive Tempering Sampler with Diffusion,Severi Rissanen RuiKang OuYang Jiajun He Wenlin Chen Markus Heinonen Arno Solin José Miguel Hernández-Lobato,https://icml.cc/virtual/2025/poster/43740,"Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), when it comes to the efficiency of target evaluations.
On the other hand, unlike a well-trained neural sampler, PT yields only dependent samples and needs to be rerun---at considerable computational cost---whenever new samples are required. To address these weaknesses, we propose the Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures, leveraging the advantages of PT to improve the training of neural samplers. We also introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. PTSD enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers." Efficient Skill Discovery via Regret-Aware Optimization,He Zhang Ming Zhou Shaopeng Zhai Ying Sun Hui Xiong,https://icml.cc/virtual/2025/poster/46465,"Unsupervised skill discovery aims to learn diverse and distinguishable behaviors in open-ended reinforcement learning. Existing methods focus on improving diversity via pure exploration, mutual information optimization, and learning temporal representations. Although they perform well on exploration, they remain limited in terms of efficiency, especially in high-dimensional settings. In this work, we frame skill discovery as a min-max game of skill generation and policy learning, proposing a regret-aware method on top of temporal representation learning that expands the discovered skill space along the direction of upgradable policy strength. The key insight behind the proposed method is that skill discovery is adversarial to policy learning, i.e., skills with weak strength should be explored further, while skills with converged strength require less exploration. As an implementation, we score the degree of strength convergence with regret, and guide the skill discovery with a learnable skill generator. To avoid degeneration, the skill generation comes from an upgradable population of skill generators. We conduct experiments on environments with varying complexities and dimension sizes. Empirical results show that our method outperforms baselines on both efficiency and diversity. Moreover, our method achieves 15\% zero-shot improvement on high-dimensional environments, compared to existing methods."
Numerical results demonstrate that our optimized distributions are consistently better, with significant improvements in $(\varepsilon, \delta)$-DP guarantees in the moderate composition regimes, compared to Gaussian and Laplace distributions with the same variance." Privacy-Preserving Federated Convex Optimization: Balancing Partial-Participation and Efficiency via Noise Cancellation,Roie Reshef Kfir Yehuda Levy,https://icml.cc/virtual/2025/poster/45123,"This paper addresses the challenge of achieving Differential Privacy (DP) in Federated Learning (FL) under the partial-participation setting, where each machine participates in only some of the training rounds. While earlier work achieved optimal performance and efficiency in full-participation scenarios, these methods could not extend effectively to cases with partial participation. Our approach addresses this gap by introducing a novel noise-cancellation mechanism that ensures privacy without compromising convergence rates or computational efficiency. We analyze our method within the Stochastic Convex Optimization (SCO) framework and demonstrate that it achieves optimal performance for both homogeneous and heterogeneous data distributions. This work broadens the applicability of DP in FL, providing a practical and efficient solution for privacy-preserving learning in distributed systems with partial participation." AdaPTS: Adapting Univariate Foundation Models to Probabilistic Multivariate Time Series Forecasting,Abdelhakim Benechehab Vasilii Feofanov Giuseppe Paolo Albert Thomas Maurizio Filippone Balázs Kégl,https://icml.cc/virtual/2025/poster/43518,"Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters—feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications. We release the code at https://github.com/abenechehab/AdaPTS." 
-Locality-Sensitive Hashing for Efficient Hard Negative Sampling in Contrastive Learning,Fabian Deuser Philipp Hausenblas Hannah Schieber Daniel Roth Martin Werner Norbert Oswald,https://openreview.net/forum?id=OmfdOTVfWI, KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies,Shih-Min Yang Martin Magnusson Johannes A. Stork Todor Stoyanov,https://icml.cc/virtual/2025/poster/44965,"Soft Actor-Critic (SAC) has achieved notable success in continuous control tasks but struggles in sparse reward settings, where infrequent rewards make efficient exploration challenging.
While novelty-based exploration methods address this issue by encouraging the agent to explore novel states, they are not trivial to apply to SAC. In particular, managing the interaction between novelty-based exploration and SAC’s stochastic policy can lead to inefficient exploration and redundant sample collection. In this paper, we propose KEA (Keeping Exploration Alive) which tackles the inefficiencies in balancing exploration strategies when combining SAC with novelty-based exploration. KEA integrates a novelty-augmented SAC with a standard SAC agent, proactively coordinated via a switching mechanism. This coordination allows the agent to maintain stochasticity in high-novelty regions, enhancing exploration efficiency and reducing repeated sample collection. We first analyze this potential issue in a 2D navigation task, and then evaluate KEA on the DeepSea hard-exploration benchmark as well as sparse reward control tasks from the DeepMind Control Suite. Compared to state-of-the-art novelty-based exploration baselines, our experiments show that KEA significantly improves learning efficiency and robustness in sparse reward setups." Understanding the Statistical Accuracy-Communication Trade-off in Personalized Federated Learning with Minimax Guarantees,Xin Yu Zelin He Ying Sun Lingzhou Xue Runze Li,https://icml.cc/virtual/2025/poster/45549,"Personalized federated learning (PFL) offers a flexible framework for aggregating information across distributed clients with heterogeneous data. This work considers a personalized federated learning setting that simultaneously learns global and local models. While purely local training has no communication cost, collaborative learning among the clients can leverage shared knowledge to improve statistical accuracy, presenting an accuracy-communication trade-off in personalized federated learning. However, the theoretical analysis of how personalization quantitatively influences sample and algorithmic efficiency and their inherent trade-off is largely unexplored. This paper makes a contribution towards filling this gap, by providing a quantitative characterization of the personalization degree on the tradeoff. The results further offer theoretical insights for choosing the personalization degree. As a side contribution, we establish the minimax optimality in terms of statistical accuracy for a widely studied PFL formulation. The theoretical result is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting." -Reinforcement Learning with learned gadgets to tackle hard quantum problems on real hardware,Akash Kundu Leopoldo Sarra,https://openreview.net/forum?id=NYlKnjmYJB, -Compression for Better: A General and Loss-Driven Compression Framework,Boyang Zhang Daning Cheng Yunquan Zhang Jiake Tian Fangming Liu,https://openreview.net/forum?id=b1g2jVGmjb, Simple and Critical Iterative Denoising: A Recasting of Discrete Diffusion in Graph Generation,Yoann Boget,https://icml.cc/virtual/2025/poster/44256,"Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the dependencies of the noisy distributions across time of these models lead to error accumulation and propagation during the reverse denoising process—a phenomenon known as \emph{compounding denoising errors}. 
To address this problem, we propose a novel framework called \emph{Simple Iterative Denoising}, which simplifies discrete diffusion and circumvents the issue by removing dependencies on previous intermediate states in the noising process. Additionally, we enhance our model by incorporating a \emph{Critic}, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks." -ALS: Attentive Long-Short-Range Message Passing,Yi Luo Xu Sun Guangchun Luo Aiguo Chen,https://openreview.net/forum?id=SvzO2rryxq, Can Classic GNNs Be Strong Baselines for Graph-level Tasks? Simple Architectures Meet Excellence,Yuankai Luo Lei Shi Xiao-Ming Wu,https://icml.cc/virtual/2025/poster/44865,"Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies. Conversely, Graph Transformers (GTs) are regarded as superior due to their employment of global attention mechanisms, which potentially mitigate these challenges. Literature frequently suggests that GTs outperform GNNs in graph-level tasks, especially for graph classification and regression on small molecular graphs. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic re-evaluation of three classic GNNs—GCN, GIN, and GatedGCN—enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results reveal that, contrary to prevailing beliefs, these classic GNNs consistently match or surpass the performance of GTs, securing top-three rankings across all datasets and achieving first place in eight. Furthermore, they demonstrate greater efficiency, running several times faster than GTs on many datasets. This highlights the potential of simple GNN architectures, challenging the notion that complex mechanisms in GTs are essential for superior graph-level performance. Our source code is available at https://github.com/LUOyk1999/GNNPlus." -Evaluating VLMs' General Ability on Next Location Prediction,Ruixing Zhang Yang Zhang Yuou Chen Tongyu Zhu Leilei Sun Weifeng Lv,https://openreview.net/forum?id=mQPwFSTn6b, -The Expressivity of Fixed-Precision Transformers without Positional Encoding,Naoki Negishi Masaya Taniguchi Keisuke Sakaguchi Kentaro Inui,https://openreview.net/forum?id=3TGUvHmZ2v, -☕ Decaf: A Deconfounding Causal Generative Model,Alejandro Almodóvar Adrián Javaloy Juan Parras Santiago Zazo Isabel Valera,https://openreview.net/forum?id=pMfpm26D73, SEAD: Unsupervised Ensemble of Streaming Anomaly Detectors,Saumya Gaurang Shah Abishek Sankararaman Balakrishnan Murali Narayanaswamy Vikramank Singh,https://icml.cc/virtual/2025/poster/46199,"Can we efficiently choose the best Anomaly Detection (AD) algorithm for a data-stream without requiring anomaly labels? Streaming anomaly detection is hard. SOTA AD algorithms are sensitive to their hyperparameters and no single method works well on all datasets. The best algorithm/hyper-parameter combination for a given data-stream can change over time with data drift. 'What is an anomaly?' 
is often application, context and dataset dependent. We propose SEAD (Streaming Ensemble of Anomaly Detectors), the first model selection algorithm for streaming, unsupervised AD. All prior AD model selection algorithms are either supervised, or only work in the offline setting when all data from the test set is available upfront. We show that SEAD is {\em(i)} unsupervised, i.e., requires no true anomaly labels, {\em(ii)} efficiently implementable in a streaming setting, {\em (iii)} agnostic to the choice of the base algorithms among which it chooses from, and {\em (iv)} adaptive to non-stationarity in the data-stream. Experiments on 14 non-trivial public datasets and an internal dataset corroborate our claims." -Deep Positive-Unlabeled Anomaly Detection for Contaminated Unlabeled Data,Hiroshi Takahashi Tomoharu Iwata Atsutoshi Kumagai Yuuki Yamanaka,https://openreview.net/forum?id=UEOLHWNswj, -Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks,Yanan Zhao Feng Ji Kai Zhao Xuhao Li Qiyu Kang Wenfei Liang Yahya Alkhatib Xingchao Jian Wee Peng Tay,https://openreview.net/forum?id=zgexeMtEWm, -Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol,Pai Liu LingfengZhao Shivangi Agarwal Jinghan Liu Audrey Huang Philip Amortila Nan Jiang,https://openreview.net/forum?id=oOaoNaoCJw, -ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts,Amelia Hardy Houjun Liu Bernard Lange Duncan Eddy Mykel Kochenderfer,https://openreview.net/forum?id=nlNFkVmuSK, Approximation to Smooth Functions by Low-Rank Swish Networks,Zimeng Li Hongjun LI Jingyuan Wang Ke Tang,https://icml.cc/virtual/2025/poster/44850,"While deep learning has witnessed remarkable achievements in a wide range of applications, its substantial computational cost imposes limitations on application scenarios of neural networks. To alleviate this problem, low-rank compression is proposed as a class of efficient and hardware-friendly network compression methods, which reduce computation by replacing large matrices in neural networks with products of two small ones. In this paper, we implement low-rank networks by inserting a sufficiently narrow linear layer without bias between each of two adjacent nonlinear layers. We prove that low-rank Swish networks with a fixed depth are capable of approximating any function from the Hölder ball $\mathcal{C}^{\beta, R}([0,1]^d)$ within an arbitrarily small error where $\beta$ is the smooth parameter and $R$ is the radius. Our proposed constructive approximation ensures that the width of linear hidden layers required for approximation is no more than one-third of the width of nonlinear layers, which implies that the computational cost can be decreased by at least one-third compared with a network with the same depth and width of nonlinear layers but without narrow linear hidden layers. Our theoretical finding can offer a theoretical basis for low-rank compression from the perspective of universal approximation theory." 
-Understanding Generalization in Physics Informed Models through Affine Variety Dimensions,Takeshi Koshizuka Issei Sato,https://openreview.net/forum?id=dY44CURN4v, -Meta-Learning in Self-Play Regret Minimization,David Sychrovský Martin Schmid Michal Sustr Michael Bowling,https://openreview.net/forum?id=d1uyU3JrpO, -Identifying key amino acid types that distinguish paralogous proteins using Shapley value based feature subset selection,Pranav Machingal Rakesh Busi Nandyala Hemachandra Petety V. Balaji,https://openreview.net/forum?id=qROnDTwgCr, -Lifelong Learning of Video Diffusion Models From a Single Video Stream,Jason Yoo Yingchen He Saeid Naderiparizi Dylan Green Gido M van de Ven Geoff Pleiss Frank Wood,https://openreview.net/forum?id=ReLY5VHNEZ, Secant Line Search for Frank-Wolfe Algorithms,Deborah Hendrych Sebastian Pokutta Mathieu Besançon David Martínez-Rubio,https://icml.cc/virtual/2025/poster/44401,"We present a new step-size strategy based on the secant method for Frank-Wolfe algorithms. This strategy, which requires mild assumptions about the function under consideration, can be applied to any Frank-Wolfe algorithm. It is as effective as full line search and, in particular, allows for adapting to the local smoothness of the function, such as in (Pedregosa et al., 2020), but comes with a significantly reduced computational cost, leading to higher effective rates of convergence. We provide theoretical guarantees and demonstrate the effectiveness of the strategy through numerical experiments." -Variational Learning Induces Adaptive Label Smoothing,Sin-Han Yang Zhedong Liu Gian Maria Marconi Mohammad Emtiyaz Khan,https://openreview.net/forum?id=NtxVmqPYJ8, Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes,Zhuocheng Gong Jian Guan Wei Wu Huishuai Zhang Dongyan Zhao,https://icml.cc/virtual/2025/poster/44849,"Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-Instruct-8B). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment algorithms against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for responsible deployment of powerful LLMs." 
-Open-Set Text Classification with Limited Labeling Budget,Amit Tulsidas Chaulwar,https://openreview.net/forum?id=0UStwLPcQi, -Maximum Noise Level as Third Optimality Criterion in Black-box Optimization Problem,Aleksandr Lobanov,https://openreview.net/forum?id=g9XyES5Vna, -Rectified Robust Policy Optimization for Robust Constrained Reinforcement Learning without Strong Duality,Shaocong Ma Ziyi Chen Yi Zhou Heng Huang,https://openreview.net/forum?id=4SzmiJv2ew, -Machine learning on rigid classes of Euclidean clouds of unordered points,Yury Elkin Vitaliy Kurlin,https://openreview.net/forum?id=1Ct7Y3jsBx, -A Comparison of LLM fine-tuning Methods & Evaluation Metrics with Travel Chatbot Use Case,Sonia Wei Meyer Shreya Singh Christopher Ton Bertha Tam Xingjian REN,https://openreview.net/forum?id=mNYsX2kcsl, -Topo-Miner: CRISPR-Enhanced DNA Computing for Accelerated Topological Feature Extraction,Soong Kyum LEE,https://openreview.net/forum?id=vMtmlTCYqn, -Towards Foundational Models for Dynamical System Reconstruction: Hierarchical Meta-Learning via Mixture of Experts,Roussel Desmond Nzoyem David A.W. Barton Tom Deakin,https://openreview.net/forum?id=VLA7Uv2IcR, -Scale Invariance of Graph Neural Network for Node Classification,Qin Jiang Chengjia Wang Michael Lones Wei Pang,https://openreview.net/forum?id=0aS8nvlxpD, -Reflection System for the Abstraction and Reasoning Corpus,Kiril Bikov Mikel Bober-Irizar Soumya Banerjee,https://openreview.net/forum?id=kRFwzuv0ze, -Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization,Jakub Kopál Michal Gregor Santiago de Leon-Martinez Jakub Simko,https://openreview.net/forum?id=kNpK6aeccu, -Losses for Deep Probabilistic Regression,Franco Marchesoni-Acland Andrés Herrera Rodrigo Alonso-Suárez Jean-michel Morel Josselin Kherroubi Gabriele Facciolo,https://openreview.net/forum?id=WOauk8tsKx, -Demystifying MPNNs: Message Passing as Merely Efficient Matrix Multiplication,Qin Jiang Chengjia Wang Michael Lones Wei Pang,https://openreview.net/forum?id=Z87hDhsU5X, -"EKM: An Exact, Polynomial-Time Divide-and-Conquer Algorithm for the K-Medoids Problem",Xi He Max A Little,https://openreview.net/forum?id=FgttKcRbQO, -An efficient implementation for solving the all pairs minimax path problem in an undirected dense graph,Gangli Liu,https://openreview.net/forum?id=qNfEkSuGKk, An Analysis of Quantile Temporal-Difference Learning,"Mark Rowland, Remi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney",https://icml.cc/virtual/2025/poster/46716,"We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis." An Entropy-Based Model for Hierarchical Learning,Amir R. 
Asadi,https://icml.cc/virtual/2025/poster/46717,"Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance’s norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds." Compressed and distributed least-squares regression: convergence rates with applications to federated learning,"Constantin Philippenko, Aymeric Dieuleveut",https://icml.cc/virtual/2025/poster/46714,"In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators, that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected H{{\""o}}lder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate despite the non-regularity of the stochastic field, that the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations) generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks."