Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeMulti-Fidelity Covariance Estimation in the Log-Euclidean Geometry
We introduce a multi-fidelity estimator of covariance matrices that employs the log-Euclidean geometry of the symmetric positive-definite manifold. The estimator fuses samples from a hierarchy of data sources of differing fidelities and costs for variance reduction while guaranteeing definiteness, in contrast with previous approaches. The new estimator makes covariance estimation tractable in applications where simulation or data collection is expensive; to that end, we develop an optimal sample allocation scheme that minimizes the mean-squared error of the estimator given a fixed budget. Guaranteed definiteness is crucial to metric learning, data assimilation, and other downstream tasks. Evaluations of our approach using data from physical applications (heat conduction, fluid dynamics) demonstrate more accurate metric learning and speedups of more than one order of magnitude compared to benchmarks.
Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner
A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integrating process, we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue, we propose a timestep aligner that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically, at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, which is obtained by aligning the sampling distribution to the real distribution. Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods, especially when there are few denoising steps. For example, when using 10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate set of timesteps. Code will be made publicly available.
Optimally-Weighted Estimators of the Maximum Mean Discrepancy for Likelihood-Free Inference
Likelihood-free inference methods typically make use of a distance between simulated and real data. A common example is the maximum mean discrepancy (MMD), which has previously been used for approximate Bayesian computation, minimum distance estimation, generalised Bayesian inference, and within the nonparametric learning framework. The MMD is commonly estimated at a root-m rate, where m is the number of simulated samples. This can lead to significant computational challenges since a large m is required to obtain an accurate estimate, which is crucial for parameter estimation. In this paper, we propose a novel estimator for the MMD with significantly improved sample complexity. The estimator is particularly well suited for computationally expensive smooth simulators with low- to mid-dimensional inputs. This claim is supported through both theoretical results and an extensive simulation study on benchmark simulators.
Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation
We consider the task of evaluating policies of algorithmic resource allocation through randomized controlled trials (RCTs). Such policies are tasked with optimizing the utilization of limited intervention resources, with the goal of maximizing the benefits derived. Evaluation of such allocation policies through RCTs proves difficult, notwithstanding the scale of the trial, because the individuals' outcomes are inextricably interlinked through resource constraints controlling the policy decisions. Our key contribution is to present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT. We identify conditions under which such reassignments are permissible and can be leveraged to construct counterfactual trials, whose outcomes can be accurately ascertained, for free. We prove theoretically that such an estimator is more accurate than common estimators based on sample means -- we show that it returns an unbiased estimate and simultaneously reduces variance. We demonstrate the value of our approach through empirical experiments on synthetic, semi-synthetic as well as real case study data and show improved estimation accuracy across the board.
A non-asymptotic approach for model selection via penalization in high-dimensional mixture of experts models
Mixture of experts (MoE) are a popular class of statistical and machine learning models that have gained attention over the years due to their flexibility and efficiency. In this work, we consider Gaussian-gated localized MoE (GLoME) and block-diagonal covariance localized MoE (BLoME) regression models to present nonlinear relationships in heterogeneous data with potential hidden graph-structured interactions between high-dimensional predictors. These models pose difficult statistical estimation and model selection questions, both from a computational and theoretical perspective. This paper is devoted to the study of the problem of model selection among a collection of GLoME or BLoME models characterized by the number of mixture components, the complexity of Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices, in a penalized maximum likelihood estimation framework. In particular, we establish non-asymptotic risk bounds that take the form of weak oracle inequalities, provided that lower bounds for the penalties hold. The good empirical behavior of our models is then demonstrated on synthetic and real datasets.
Denoising Diffusion Models for Plug-and-Play Image Restoration
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible and interpretable method for solving various inverse problems by utilizing any off-the-shelf denoiser as the implicit image prior. However, most existing methods focus on discriminative Gaussian denoisers. Although diffusion models have shown impressive performance for high-quality image synthesis, their potential to serve as a generative denoiser prior to the plug-and-play IR methods remains to be further explored. While several other attempts have been made to adopt diffusion models for image restoration, they either fail to achieve satisfactory results or typically require an unacceptable number of Neural Function Evaluations (NFEs) during inference. This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework. Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models. Experimental results on three representative IR tasks, including super-resolution, image deblurring, and inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and perceptual quality with no more than 100 NFEs. The source code is available at {https://github.com/yuanzhi-zhu/DiffPIR}
Optimal randomized multilevel Monte Carlo for repeatedly nested expectations
The estimation of repeatedly nested expectations is a challenging task that arises in many real-world systems. However, existing methods generally suffer from high computational costs when the number of nestings becomes large. Fix any non-negative integer D for the total number of nestings. Standard Monte Carlo methods typically cost at least O(varepsilon^{-(2+D)}) and sometimes O(varepsilon^{-2(1+D)}) to obtain an estimator up to varepsilon-error. More advanced methods, such as multilevel Monte Carlo, currently only exist for D = 1. In this paper, we propose a novel Monte Carlo estimator called READ, which stands for "Recursive Estimator for Arbitrary Depth.'' Our estimator has an optimal computational cost of O(varepsilon^{-2}) for every fixed D under suitable assumptions, and a nearly optimal computational cost of O(varepsilon^{-2(1 + delta)}) for any 0 < delta < frac12 under much more general assumptions. Our estimator is also unbiased, which makes it easy to parallelize. The key ingredients in our construction are an observation of the problem's recursive structure and the recursive use of the randomized multilevel Monte Carlo method.
USCD: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding
Large language models (LLMs) have shown remarkable capabilities in code generation. However, the effects of hallucinations (e.g., output noise) make it particularly challenging for LLMs to generate high-quality code in one pass. In this work, we propose a simple and effective uncertainty-aware selective contrastive decoding (USCD) mechanism to improve the quality of one-pass code generation in LLMs and reduce the impact of output noise. To be specific, we first elaborately designed a negative prompt (namely lame prompt) to output noise by removing input-output examples from the standard few-shot prompt. Our preliminary study shows that the Jensen-Shannon divergence (JS divergence) between token distribution uncertainty and the output noise is relatively low (approximately 0.25), indicating their high relevance. Then, we selectively eliminate output noise induced by lame prompts based on the uncertainty of the prediction distribution from the standard prompt. Notably, our proposed plug-and-play mechanism is an inference-only method, enjoying appealing flexibility. Extensive experiments on widely used benchmarks, e.g., HumanEval, MBPP, and MultiPL-E, upon several LLMs (i.e., Inocder-6b, CodeLlama-7b, WizardCoder-15b, StarCoder, and Llama2-7b), demonstrate that our proposed USCD significantly improves one-pass code generation, with an average pass@1 scores increase of 16.59\%. We will release code and data on GitHub.
Uncovering delayed patterns in noisy and irregularly sampled time series: an astronomy application
We study the problem of estimating the time delay between two signals representing delayed, irregularly sampled and noisy versions of the same underlying pattern. We propose and demonstrate an evolutionary algorithm for the (hyper)parameter estimation of a kernel-based technique in the context of an astronomical problem, namely estimating the time delay between two gravitationally lensed signals from a distant quasar. Mixed types (integer and real) are used to represent variables within the evolutionary algorithm. We test the algorithm on several artificial data sets, and also on real astronomical observations of quasar Q0957+561. By carrying out a statistical analysis of the results we present a detailed comparison of our method with the most popular methods for time delay estimation in astrophysics. Our method yields more accurate and more stable time delay estimates: for Q0957+561, we obtain 419.6 days for the time delay between images A and B. Our methodology can be readily applied to current state-of-the-art optical monitoring data in astronomy, but can also be applied in other disciplines involving similar time series data.
Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information
It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate for the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Via conducting a thorough set of experiments, we show that by employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks, the student's classification accuracy consistently increases, with the gain of up to 3.32\%. This suggests that the teacher's BCPD estimate provided by MCMI method is more accurate than that provided by MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases with the gain of up to 5.72\% when 5\% of the training samples are available to the student (few-shot), and increases from 0\% to as high as 84\% for an omitted class (zero-shot). The code is available at https://github.com/iclr2024mcmi/ICLRMCMI.
Bounds on the conditional and average treatment effect with unobserved confounding factors
For observational studies, we study the sensitivity of causal inference when treatment assignments may depend on unobserved confounders. We develop a loss minimization approach for estimating bounds on the conditional average treatment effect (CATE) when unobserved confounders have a bounded effect on the odds ratio of treatment selection. Our approach is scalable and allows flexible use of model classes in estimation, including nonparametric and black-box machine learning methods. Based on these bounds for the CATE, we propose a sensitivity analysis for the average treatment effect (ATE). Our semi-parametric estimator extends/bounds the augmented inverse propensity weighted (AIPW) estimator for the ATE under bounded unobserved confounding. By constructing a Neyman orthogonal score, our estimator of the bound for the ATE is a regular root-n estimator so long as the nuisance parameters are estimated at the o_p(n^{-1/4}) rate. We complement our methodology with optimality results showing that our proposed bounds are tight in certain cases. We demonstrate our method on simulated and real data examples, and show accurate coverage of our confidence intervals in practical finite sample regimes with rich covariate information.
Doubly Robust Proximal Causal Learning for Continuous Treatments
Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Equipped with its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications.
On the convergence of the MLE as an estimator of the learning rate in the Exp3 algorithm
When fitting the learning data of an individual to algorithm-like learning models, the observations are so dependent and non-stationary that one may wonder what the classical Maximum Likelihood Estimator (MLE) could do, even if it is the usual tool applied to experimental cognition. Our objective in this work is to show that the estimation of the learning rate cannot be efficient if the learning rate is constant in the classical Exp3 (Exponential weights for Exploration and Exploitation) algorithm. Secondly, we show that if the learning rate decreases polynomially with the sample size, then the prediction error and in some cases the estimation error of the MLE satisfy bounds in probability that decrease at a polynomial rate.
DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation
Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent predictor-corrector diffusion samplers have significantly reduced the required number of function evaluations (NFE), but inherently suffer from a misalignment issue caused by the extra corrector step, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024times1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1. Code is available at https://github.com/wl-zhao/DC-Solver
Proximal Causal Learning of Conditional Average Treatment Effects
Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R- and DR-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.
Preserving Statistical Validity in Adaptive Data Analysis
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
Comparison of meta-learners for estimating multi-valued treatment heterogeneous effects
Conditional Average Treatment Effects (CATE) estimation is one of the main challenges in causal inference with observational data. In addition to Machine Learning based-models, nonparametric estimators called meta-learners have been developed to estimate the CATE with the main advantage of not restraining the estimation to a specific supervised learning method. This task becomes, however, more complicated when the treatment is not binary as some limitations of the naive extensions emerge. This paper looks into meta-learners for estimating the heterogeneous effects of multi-valued treatments. We consider different meta-learners, and we carry out a theoretical analysis of their error upper bounds as functions of important parameters such as the number of treatment levels, showing that the naive extensions do not always provide satisfactory results. We introduce and discuss meta-learners that perform well as the number of treatments increases. We empirically confirm the strengths and weaknesses of those methods with synthetic and semi-synthetic datasets.
Adaptive Instrument Design for Indirect Experiments
Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.
Learning from Aggregate responses: Instance Level versus Bag Level Loss Functions
Due to the rise of privacy concerns, in many practical applications the training data is aggregated before being shared with the learner, in order to protect privacy of users' sensitive responses. In an aggregate learning framework, the dataset is grouped into bags of samples, where each bag is available only with an aggregate response, providing a summary of individuals' responses in that bag. In this paper, we study two natural loss functions for learning from aggregate responses: bag-level loss and the instance-level loss. In the former, the model is learnt by minimizing a loss between aggregate responses and aggregate model predictions, while in the latter the model aims to fit individual predictions to the aggregate responses. In this work, we show that the instance-level loss can be perceived as a regularized form of the bag-level loss. This observation lets us compare the two approaches with respect to bias and variance of the resulting estimators, and introduce a novel interpolating estimator which combines the two approaches. For linear regression tasks, we provide a precise characterization of the risk of the interpolating estimator in an asymptotic regime where the size of the training set grows in proportion to the features dimension. Our analysis allows us to theoretically understand the effect of different factors, such as bag size on the model prediction risk. In addition, we propose a mechanism for differentially private learning from aggregate responses and derive the optimal bag size in terms of prediction risk-privacy trade-off. We also carry out thorough experiments to corroborate our theory and show the efficacy of the interpolating estimator.
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.
Nonparametric extensions of randomized response for private confidence sets
This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations (X_1, dots, X_n) with mean mu^star that are privatized into (Z_1, dots, Z_n), we present confidence intervals (CI) and time-uniform confidence sequences (CS) for mu^star when only given access to the privatized data. To achieve this, we introduce a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests.
Generative Principal Component Analysis
In this paper, we study the problem of principal component analysis with generative modeling assumptions, adopting a general model for the observed matrix that encompasses notable special cases, including spiked matrix recovery and phase retrieval. The key assumption is that the underlying signal lies near the range of an L-Lipschitz continuous generative model with bounded k-dimensional inputs. We propose a quadratic estimator, and show that it enjoys a statistical rate of order frac{klog L{m}}, where m is the number of samples. We also provide a near-matching algorithm-independent lower bound. Moreover, we provide a variant of the classic power method, which projects the calculated data onto the range of the generative model during each iteration. We show that under suitable conditions, this method converges exponentially fast to a point achieving the above-mentioned statistical rate. We perform experiments on various image datasets for spiked matrix and phase retrieval models, and illustrate performance gains of our method to the classic power method and the truncated power method devised for sparse principal component analysis.
Querying Easily Flip-flopped Samples for Deep Active Learning
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data. One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is. The sample's distance to the decision boundary is a natural measure of predictive uncertainty, but it is often intractable to compute, especially for complex decision boundaries formed in multiclass classification tasks. To address this issue, this paper proposes the {\it least disagree metric} (LDM), defined as the smallest probability of disagreement of the predicted label, and an estimator for LDM proven to be asymptotically consistent under mild assumptions. The estimator is computationally efficient and can be easily implemented for deep learning models using parameter perturbation. The LDM-based active learning is performed by querying unlabeled data with the smallest LDM. Experimental results show that our LDM-based active learning algorithm obtains state-of-the-art overall performance on all considered datasets and deep architectures.
PYInfer: Deep Learning Semantic Type Inference for Python Variables
Python type inference is challenging in practice. Due to its dynamic properties and extensive dependencies on third-party libraries without type annotations, the performance of traditional static analysis techniques is limited. Although semantics in source code can help manifest intended usage for variables (thus help infer types), they are usually ignored by existing tools. In this paper, we propose PYInfer, an end-to-end learning-based type inference tool that automatically generates type annotations for Python variables. The key insight is that contextual code semantics is critical in inferring the type for a variable. For each use of a variable, we collect a few tokens within its contextual scope, and design a neural network to predict its type. One challenge is that it is difficult to collect a high-quality human-labeled training dataset for this purpose. To address this issue, we apply an existing static analyzer to generate the ground truth for variables in source code. Our main contribution is a novel approach to statically infer variable types effectively and efficiently. Formulating the type inference as a classification problem, we can handle user-defined types and predict type probabilities for each variable. Our model achieves 91.2% accuracy on classifying 11 basic types in Python and 81.2% accuracy on classifying 500 most common types. Our results substantially outperform the state-of-the-art type annotators. Moreover, PYInfer achieves 5.2X more code coverage and is 187X faster than a state-of-the-art learning-based tool. With similar time consumption, our model annotates 5X more variables than a state-of-the-art static analysis tool. Our model also outperforms a learning-based function-level annotator on annotating types for variables and function arguments. All our tools and datasets are publicly available to facilitate future research in this direction.
Sketched Ridgeless Linear Regression: The Role of Downsampling
Overparametrization often helps improve the generalization performance. This paper proposes a dual view of overparametrization suggesting that downsampling may also help generalize. Motivated by this dual view, we characterize two out-of-sample prediction risks of the sketched ridgeless least square estimator in the proportional regime masymp n asymp p, where m is the sketching size, n the sample size, and p the feature dimensionality. Our results reveal the statistical role of downsampling. Specifically, downsampling does not always hurt the generalization performance, and may actually help improve it in some cases. We identify the optimal sketching sizes that minimize the out-of-sample prediction risks, and find that the optimally sketched estimator has stabler risk curves that eliminates the peaks of those for the full-sample estimator. We then propose a practical procedure to empirically identify the optimal sketching size. Finally, we extend our results to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
Kernel regression estimates of time delays between gravitationally lensed fluxes
Strongly lensed variable quasars can serve as precise cosmological probes, provided that time delays between the image fluxes can be accurately measured. A number of methods have been proposed to address this problem. In this paper, we explore in detail a new approach based on kernel regression estimates, which is able to estimate a single time delay given several datasets for the same quasar. We develop realistic artificial data sets in order to carry out controlled experiments to test of performance of this new approach. We also test our method on real data from strongly lensed quasar Q0957+561 and compare our estimates against existing results.
Estimation Beyond Data Reweighting: Kernel Method of Moments
Moment restrictions and their conditional counterparts emerge in many areas of machine learning and statistics ranging from causal inference to reinforcement learning. Estimators for these tasks, generally called methods of moments, include the prominent generalized method of moments (GMM) which has recently gained attention in causal inference. GMM is a special case of the broader family of empirical likelihood estimators which are based on approximating a population distribution by means of minimizing a varphi-divergence to an empirical distribution. However, the use of varphi-divergences effectively limits the candidate distributions to reweightings of the data samples. We lift this long-standing limitation and provide a method of moments that goes beyond data reweighting. This is achieved by defining an empirical likelihood estimator based on maximum mean discrepancy which we term the kernel method of moments (KMM). We provide a variant of our estimator for conditional moment restrictions and show that it is asymptotically first-order optimal for such problems. Finally, we show that our method achieves competitive performance on several conditional moment restriction tasks.
Flat Minima in Linear Estimation and an Extended Gauss Markov Theorem
We consider the problem of linear estimation, and establish an extension of the Gauss-Markov theorem, in which the bias operator is allowed to be non-zero but bounded with respect to a matrix norm of Schatten type. We derive simple and explicit formulas for the optimal estimator in the cases of Nuclear and Spectral norms (with the Frobenius case recovering ridge regression). Additionally, we analytically derive the generalization error in multiple random matrix ensembles, and compare with Ridge regression. Finally, we conduct an extensive simulation study, in which we show that the cross-validated Nuclear and Spectral regressors can outperform Ridge in several circumstances.
Learning Invariant Representations with Missing Data
Spurious correlations allow flexible models to predict well during training but poorly on related test distributions. Recent work has shown that models that satisfy particular independencies involving correlation-inducing nuisance variables have guarantees on their test performance. Enforcing such independencies requires nuisances to be observed during training. However, nuisances, such as demographics or image background labels, are often missing. Enforcing independence on just the observed data does not imply independence on the entire population. Here we derive mmd estimators used for invariance objectives under missing nuisances. On simulations and clinical data, optimizing through these estimates achieves test performance similar to using estimators that make use of the full data.
Towards Practical Plug-and-Play Diffusion Models
Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to plug-and-play control the generation process for various tasks without finetuning the diffusion model. However, the direct use of publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. For that, the existing practice is to fine-tune the guidance models with labeled data corrupted with noises. In this paper, we argue that this practice has limitations in two aspects: (1) performing on inputs with extremely various noises is too hard for a single guidance model; (2) collecting labeled datasets hinders scaling up for various tasks. To tackle the limitations, we propose a novel strategy that leverages multiple experts where each expert is specialized in a particular noise range and guides the reverse process of the diffusion at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We exhaustively conduct ImageNet class conditional generation experiments to show that our method can successfully guide diffusion with small trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide publicly available GLIDE through our framework in a plug-and-play manner. Our code is available at https://github.com/riiid/PPAP.
A Meta-Learning Approach to Predicting Performance and Data Requirements
We propose an approach to estimate the number of samples required for a model to reach a target performance. We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset (e.g., 5 samples per class) for extrapolation. This is because the log-performance error against the log-dataset size follows a nonlinear progression in the few-shot regime followed by a linear progression in the high-shot regime. We introduce a novel piecewise power law (PPL) that handles the two data regimes differently. To estimate the parameters of the PPL, we introduce a random forest regressor trained via meta learning that generalizes across classification/detection tasks, ResNet/ViT based architectures, and random/pre-trained initializations. The PPL improves the performance estimation on average by 37% across 16 classification and 33% across 10 detection datasets, compared to the power law. We further extend the PPL to provide a confidence bound and use it to limit the prediction horizon that reduces over-estimation of data by 76% on classification and 91% on detection datasets.
Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
Mixture of experts (MoE) has a well-principled finite mixture model construction for prediction, allowing the gating network (mixture weights) to learn from the predictors (explanatory variables) together with the experts' network (mixture component densities). We investigate the estimation properties of MoEs in a high-dimensional setting, where the number of predictors is much larger than the sample size, for which the literature lacks computational and especially theoretical results. We consider the class of finite MoE models with softmax gating functions and Gaussian regression experts, and focus on the theoretical properties of their l_1-regularized estimation via the Lasso. We provide a lower bound on the regularization parameter of the Lasso penalty that ensures an l_1-oracle inequality is satisfied by the Lasso estimator according to the Kullback--Leibler loss. We further state an l_1-ball oracle inequality for the l_1-penalized maximum likelihood estimator from the model selection.
Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback
Recommender systems widely use implicit feedback such as click data because of its general availability. Although the presence of clicks signals the users' preference to some extent, the lack of such clicks does not necessarily indicate a negative response from the users, as it is possible that the users were not exposed to the items (positive-unlabeled problem). This leads to a difficulty in predicting the users' preferences from implicit feedback. Previous studies addressed the positive-unlabeled problem by uniformly upweighting the loss for the positive feedback data or estimating the confidence of each data having relevance information via the EM-algorithm. However, these methods failed to address the missing-not-at-random problem in which popular or frequently recommended items are more likely to be clicked than other items even if a user does not have a considerable interest in them. To overcome these limitations, we first define an ideal loss function to be optimized to realize recommendations that maximize the relevance and propose an unbiased estimator for the ideal loss. Subsequently, we analyze the variance of the proposed unbiased estimator and further propose a clipped estimator that includes the unbiased estimator as a special case. We demonstrate that the clipped estimator is expected to improve the performance of the recommender system, by considering the bias-variance trade-off. We conduct semi-synthetic and real-world experiments and demonstrate that the proposed method largely outperforms the baselines. In particular, the proposed method works better for rare items that are less frequently observed in the training data. The findings indicate that the proposed method can better achieve the objective of recommending items with the highest relevance.
Composable Text Controls in Latent Space with ODEs
Real-world text applications often involve composing a wide range of text control operations, such as editing the text w.r.t. an attribute, manipulating keywords and structure, and generating new text of desired properties. Prior work typically learns/finetunes a language model (LM) to perform individual or specific subsets of operations. Recent research has studied combining operations in a plug-and-play manner, often with costly search or optimization in the complex sequence space. This paper proposes a new efficient approach for composable text operations in the compact latent space of text. The low-dimensionality and differentiability of the text latent vector allow us to develop an efficient sampler based on ordinary differential equations (ODEs) given arbitrary plug-in operators (e.g., attribute classifiers). By connecting pretrained LMs (e.g., GPT2) to the latent space through efficient adaption, we then decode the sampled vectors into desired text sequences. The flexible approach permits diverse control operators (sentiment, tense, formality, keywords, etc.) acquired using any relevant data from different domains. Experiments show that composing those operators within our approach manages to generate or edit high-quality text, substantially improving over previous methods in terms of generation quality and efficiency.
Abstract Reward Processes: Leveraging State Abstraction for Consistent Off-Policy Evaluation
Evaluating policies using off-policy data is crucial for applying reinforcement learning to real-world problems such as healthcare and autonomous driving. Previous methods for off-policy evaluation (OPE) generally suffer from high variance or irreducible bias, leading to unacceptably high prediction errors. In this work, we introduce STAR, a framework for OPE that encompasses a broad range of estimators -- which include existing OPE methods as special cases -- that achieve lower mean squared prediction errors. STAR leverages state abstraction to distill complex, potentially continuous problems into compact, discrete models which we call abstract reward processes (ARPs). Predictions from ARPs estimated from off-policy data are provably consistent (asymptotically correct). Rather than proposing a specific estimator, we present a new framework for OPE and empirically demonstrate that estimators within STAR outperform existing methods. The best STAR estimator outperforms baselines in all twelve cases studied, and even the median STAR estimator surpasses the baselines in seven out of the twelve cases.
Distributed Learning of Mixtures of Experts
In modern machine learning problems we deal with datasets that are either distributed by nature or potentially large for which distributing the computations is usually a standard way to proceed, since centralized algorithms are in general ineffective. We propose a distributed learning approach for mixtures of experts (MoE) models with an aggregation strategy to construct a reduction estimator from local estimators fitted parallelly to distributed subsets of the data. The aggregation is based on an optimal minimization of an expected transportation divergence between the large MoE composed of local estimators and the unknown desired MoE model. We show that the provided reduction estimator is consistent as soon as the local estimators to be aggregated are consistent, and its construction is performed by a proposed majorization-minimization (MM) algorithm that is computationally effective. We study the statistical and numerical properties for the proposed reduction estimator on experiments that demonstrate its performance compared to namely the global estimator constructed in a centralized way from the full dataset. For some situations, the computation time is more than ten times faster, for a comparable performance. Our source codes are publicly available on Github.
Efficient estimation of multiple expectations with the same sample by adaptive importance sampling and control variates
Some classical uncertainty quantification problems require the estimation of multiple expectations. Estimating all of them accurately is crucial and can have a major impact on the analysis to perform, and standard existing Monte Carlo methods can be costly to do so. We propose here a new procedure based on importance sampling and control variates for estimating more efficiently multiple expectations with the same sample. We first show that there exists a family of optimal estimators combining both importance sampling and control variates, which however cannot be used in practice because they require the knowledge of the values of the expectations to estimate. Motivated by the form of these optimal estimators and some interesting properties, we therefore propose an adaptive algorithm. The general idea is to adaptively update the parameters of the estimators for approaching the optimal ones. We suggest then a quantitative stopping criterion that exploits the trade-off between approaching these optimal parameters and having a sufficient budget left. This left budget is then used to draw a new independent sample from the final sampling distribution, allowing to get unbiased estimators of the expectations. We show how to apply our procedure to sensitivity analysis, by estimating Sobol' indices and quantifying the impact of the input distributions. Finally, realistic test cases show the practical interest of the proposed algorithm, and its significant improvement over estimating the expectations separately.
Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift
A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complement the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.
Where to Diffuse, How to Diffuse, and How to Get Back: Automated Learning for Multivariate Diffusions
Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this process to generate samples. The choice of noising process, or inference diffusion process, affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower-bound on the MDMs likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; these two points together enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions as well as learn a diffusion process on the MNIST, CIFAR10, and ImageNet32 datasets. We show learned MDMs match or surpass bits-per-dims (BPDs) relative to fixed choices of diffusions for a given dataset and model architecture.
Selective Machine Learning of the Average Treatment Effect with an Invalid Instrumental Variable
Instrumental variable methods have been widely used to identify causal effects in the presence of unmeasured confounding. A key identification condition known as the exclusion restriction states that the instrument cannot have a direct effect on the outcome which is not mediated by the exposure in view. In the health and social sciences, such an assumption is often not credible. To address this concern, we consider identification conditions of the population average treatment effect with an invalid instrumental variable which does not satisfy the exclusion restriction, and derive the efficient influence function targeting the identifying functional under a nonparametric observed data model. We propose a novel multiply robust locally efficient estimator of the average treatment effect that is consistent in the union of multiple parametric nuisance models, as well as a multiply debiased machine learning estimator for which the nuisance parameters are estimated using generic machine learning methods, that effectively exploit various forms of linear or nonlinear structured sparsity in the nuisance parameter space. When one cannot be confident that any of these machine learners is consistent at sufficiently fast rates to ensure n-consistency for the average treatment effect, we introduce a new criteria for selective machine learning which leverages the multiple robustness property in order to ensure small bias. The proposed methods are illustrated through extensive simulations and a data analysis evaluating the causal effect of 401(k) participation on savings.
On the Identifiability and Estimation of Causal Location-Scale Noise Models
We study the class of location-scale or heteroscedastic noise models (LSNMs), in which the effect Y can be written as a function of the cause X and a noise source N independent of X, which may be scaled by a positive function g over the cause, i.e., Y = f(X) + g(X)N. Despite the generality of the model class, we show the causal direction is identifiable up to some pathological cases. To empirically validate these theoretical findings, we propose two estimators for LSNMs: an estimator based on (non-linear) feature maps, and one based on neural networks. Both model the conditional distribution of Y given X as a Gaussian parameterized by its natural parameters. When the feature maps are correctly specified, we prove that our estimator is jointly concave, and a consistent estimator for the cause-effect identification task. Although the the neural network does not inherit those guarantees, it can fit functions of arbitrary complexity, and reaches state-of-the-art performance across benchmarks.
On the Posterior Distribution in Denoising: Application to Uncertainty Quantification
Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (\ie the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory-efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Code and examples are available on the project webpage in https://hilamanor.github.io/GaussianDenoisingPosterior/ .
Multi-annotator Deep Learning: A Probabilistic Framework for Classification
Solving complex classification tasks using deep neural networks typically requires large amounts of annotated data. However, corresponding class labels are noisy when provided by error-prone annotators, e.g., crowd workers. Training standard deep neural networks leads to subpar performances in such multi-annotator supervised learning settings. We address this issue by presenting a probabilistic training framework named multi-annotator deep learning (MaDL). A ground truth and an annotator performance model are jointly trained in an end-to-end learning approach. The ground truth model learns to predict instances' true class labels, while the annotator performance model infers probabilistic estimates of annotators' performances. A modular network architecture enables us to make varying assumptions regarding annotators' performances, e.g., an optional class or instance dependency. Further, we learn annotator embeddings to estimate annotators' densities within a latent space as proxies of their potentially correlated annotations. Together with a weighted loss function, we improve the learning from correlated annotation patterns. In a comprehensive evaluation, we examine three research questions about multi-annotator supervised learning. Our findings indicate MaDL's state-of-the-art performance and robustness against many correlated, spamming annotators.
Model Averaging and Double Machine Learning
This paper discusses pairing double/debiased machine learning (DDML) with stacking, a model averaging method for combining multiple candidate learners, to estimate structural parameters. In addition to conventional stacking, we consider two stacking variants available for DDML: short-stacking exploits the cross-fitting step of DDML to substantially reduce the computational burden and pooled stacking enforces common stacking weights over cross-fitting folds. Using calibrated simulation studies and two applications estimating gender gaps in citations and wages, we show that DDML with stacking is more robust to partially unknown functional forms than common alternative approaches based on single pre-selected learners. We provide Stata and R software implementing our proposals.
Sequential Underspecified Instrument Selection for Cause-Effect Estimation
Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the treatment variable(s). Most IV applications focus on low-dimensional treatments and crucially require at least as many instruments as treatments. This assumption is restrictive: in the natural sciences we often seek to infer causal effects of high-dimensional treatments (e.g., the effect of gene expressions or microbiota on health and disease), but can only run few experiments with a limited number of instruments (e.g., drugs or antibiotics). In such underspecified problems, the full treatment effect is not identifiable in a single experiment even in the linear case. We show that one can still reliably recover the projection of the treatment effect onto the instrumented subspace and develop techniques to consistently combine such partial estimates from different sets of instruments. We then leverage our combined estimators in an algorithm that iteratively proposes the most informative instruments at each round of experimentation to maximize the overall information about the full causal effect.
Autoencoder-based General Purpose Representation Learning for Customer Embedding
In recent years, exploiting the domain-specific underlying structure of data and its generative factors for representation learning has shown success in various use-case agnostic applications. However, the diversity and complexity of tabular data have made it challenging to represent these structures in a latent space through multi-dimensional vectors. We design an autoencoder-based framework for building general purpose embeddings, we assess the performance of different autoencoder architectures, and show simpler models outperform complex ones in embedding highly complex tabular data. We apply our framework to produce plug-and-play, rich, and anonymized embeddings representing AWS customers for usage in any model, saving up to 45% of development time, and observe significant improvements in downstream models. Moreover, we propose a significant improvement to the calculation of reconstruction loss for multi-layer contractive autoencoders (CAE) by calculating the Jacobian of the entire encoder leading to a 15% improvement in reconstruction quality when compared to a stacked CAE.