Machine Learning
- [1] arXiv:2405.09584 [pdf, ps, html, other]
-
Title: Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical SystemSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
The stochastic multi-armed bandit problem studies decision-making under uncertainty. In the problem, the learner interacts with an environment by choosing an action at each round, where a round is an instance of an interaction. In response, the environment reveals a reward, which is sampled from a stochastic process, to the learner. The goal of the learner is to maximize cumulative reward. A specific variation of the stochastic multi-armed bandit problem is the restless bandit, where the reward for each action is sampled from a Markov chain. The restless bandit with a discrete state-space is a well-studied problem, but to the best of our knowledge, not many results exist for the continuous state-space version which has many applications such as hyperparameter optimization. In this work, we tackle the restless bandit with continuous state-space by assuming the rewards are the inner product of an action vector and a state vector generated by a linear Gaussian dynamical system. To predict the reward for each action, we propose a method that takes a linear combination of previously observed rewards for predicting each action's next reward. We show that, regardless of the sequence of previous actions chosen, the reward sampled for any previously chosen action can be used for predicting another action's future reward, i.e. the reward sampled for action 1 at round $t-1$ can be used for predicting the reward for action $2$ at round $t$. This is accomplished by designing a modified Kalman filter with a matrix representation that can be learned for reward prediction. Numerical evaluations are carried out on a set of linear Gaussian dynamical systems.
- [2] arXiv:2405.09831 [pdf, ps, other]
-
Title: Nearly Minimax Optimal Regret for Multinomial Logistic BanditComments: Preprint. Under reviewSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
In this paper, we investigate the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the MNL contextual bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
- [3] arXiv:2405.09841 [pdf, ps, html, other]
-
Title: Simultaneous Identification of Sparse Structures and Communities in Heterogeneous Graphical ModelsComments: 61 pages, 11 figures, 4 tablesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Exploring and detecting community structures hold significant importance in genetics, social sciences, neuroscience, and finance. Especially in graphical models, community detection can encourage the exploration of sets of variables with group-like properties. In this paper, within the framework of Gaussian graphical models, we introduce a novel decomposition of the underlying graphical structure into a sparse part and low-rank diagonal blocks (non-overlapped communities). We illustrate the significance of this decomposition through two modeling perspectives and propose a three-stage estimation procedure with a fast and efficient algorithm for the identification of the sparse structure and communities. Also on the theoretical front, we establish conditions for local identifiability and extend the traditional irrepresentability condition to an adaptive form by constructing an effective norm, which ensures the consistency of model selection for the adaptive $\ell_1$ penalized estimator in the second stage. Moreover, we also provide the clustering error bound for the K-means procedure in the third stage. Extensive numerical experiments are conducted to demonstrate the superiority of the proposed method over existing approaches in estimating graph structures. Furthermore, we apply our method to the stock return data, revealing its capability to accurately identify non-overlapped community structures.
- [4] arXiv:2405.10126 [pdf, ps, other]
-
Title: Estimating a Function and Its Derivatives Under a Smoothness ConditionComments: 27 pages. Mathematics of Operations Research 2024Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We consider the problem of estimating an unknown function f* and its partial derivatives from a noisy data set of n observations, where we make no assumptions about f* except that it is smooth in the sense that it has square integrable partial derivatives of order m. A natural candidate for the estimator of f* in such a case is the best fit to the data set that satisfies a certain smoothness condition. This estimator can be seen as a least squares estimator subject to an upper bound on some measure of smoothness. Another useful estimator is the one that minimizes the degree of smoothness subject to an upper bound on the average of squared errors. We prove that these two estimators are computable as solutions to quadratic programs, establish the consistency of these estimators and their partial derivatives, and study the convergence rate as n increases to infinity. The effectiveness of the estimators is illustrated numerically in a setting where the value of a stock option and its second derivative are estimated as functions of the underlying stock price.
- [5] arXiv:2405.10229 [pdf, ps, html, other]
-
Title: Random ReLU Neural Networks as Non-Gaussian ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
We consider a large class of shallow neural networks with randomly initialized parameters and rectified linear unit activation functions. We prove that these random neural networks are well-defined non-Gaussian processes. As a by-product, we demonstrate that these networks are solutions to stochastic differential equations driven by impulsive white noise (combinations of random Dirac measures). These processes are parameterized by the law of the weights and biases as well as the density of activation thresholds in each bounded region of the input domain. We prove that these processes are isotropic and wide-sense self-similar with Hurst exponent $3/2$. We also derive a remarkably simple closed-form expression for their autocovariance function. Our results are fundamentally different from prior work in that we consider a non-asymptotic viewpoint: The number of neurons in each bounded region of the input domain (i.e., the width) is itself a random variable with a Poisson law with mean proportional to the density parameter. Finally, we show that, under suitable hypotheses, as the expected width tends to infinity, these processes can converge in law not only to Gaussian processes, but also to non-Gaussian processes depending on the law of the weights. Our asymptotic results provide a new take on several classical results (wide networks converge to Gaussian processes) as well as some new ones (wide networks can converge to non-Gaussian processes).
- [6] arXiv:2405.10301 [pdf, ps, html, other]
-
Title: Conformal Alignment: Knowing When to Trust Foundation Models with GuaranteesSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
New submissions for Friday, 17 May 2024 (showing 6 of 6 entries )
- [7] arXiv:2405.09579 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Scalable Sparse Regression for Model Discovery: The Fast Lane to InsightComments: Scripts to reproduce all figures are located at this https URL Standalone sparse regression scripts can be found at this https URLSubjects: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
There exist endless examples of dynamical systems with vast available data and unsatisfying mathematical descriptions. Sparse regression applied to symbolic libraries has quickly emerged as a powerful tool for learning governing equations directly from data; these learned equations balance quantitative accuracy with qualitative simplicity and human interpretability. Here, I present a general purpose, model agnostic sparse regression algorithm that extends a recently proposed exhaustive search leveraging iterative Singular Value Decompositions (SVD). This accelerated scheme, Scalable Pruning for Rapid Identification of Null vecTors (SPRINT), uses bisection with analytic bounds to quickly identify optimal rank-1 modifications to null vectors. It is intended to maintain sensitivity to small coefficients and be of reasonable computational cost for large symbolic libraries. A calculation that would take the age of the universe with an exhaustive search but can be achieved in a day with SPRINT.
- [8] arXiv:2405.09610 (cross-list from math.GT) [pdf, ps, html, other]
-
Title: Learning 3-Manifold TriangulationsComments: 35 pages; 23 figures; 7 tablesSubjects: Geometric Topology (math.GT); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
Real 3-manifold triangulations can be uniquely represented by isomorphism signatures. Databases of these isomorphism signatures are generated for a variety of 3-manifolds and knot complements, using SnapPy and Regina, then these language-like inputs are used to train various machine learning architectures to differentiate the manifolds, as well as their Dehn surgeries, via their triangulations. Gradient saliency analysis then extracts key parts of this language-like encoding scheme from the trained models. The isomorphism signature databases are taken from the 3-manifolds' Pachner graphs, which are also generated in bulk for some selected manifolds of focus and for the subset of the SnapPy orientable cusped census with $<8$ initial tetrahedra. These Pachner graphs are further analysed through the lens of network science to identify new structure in the triangulation representation; in particular for the hyperbolic case, a relation between the length of the shortest geodesic (systole) and the size of the Pachner graph's ball is observed.
- [9] arXiv:2405.09676 (cross-list from math.ST) [pdf, ps, html, other]
-
Title: The radius of statistical efficiencySubjects: Statistics Theory (math.ST); Optimization and Control (math.OC); Machine Learning (stat.ML)
Classical results in asymptotic statistics show that the Fisher information matrix controls the difficulty of estimating a statistical model from observed data. In this work, we introduce a companion measure of robustness of an estimation problem: the radius of statistical efficiency (RSE) is the size of the smallest perturbation to the problem data that renders the Fisher information matrix singular. We compute RSE up to numerical constants for a variety of test bed problems, including principal component analysis, generalized linear models, phase retrieval, bilinear sensing, and matrix completion. In all cases, the RSE quantifies the compatibility between the covariance of the population data and the latent model parameter. Interestingly, we observe a precise reciprocal relationship between RSE and the intrinsic complexity/sensitivity of the problem instance, paralleling the classical Eckart-Young theorem in numerical analysis.
- [10] arXiv:2405.09784 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Online bipartite matching with imperfect adviceComments: Accepted into ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
We study the problem of online unweighted bipartite matching with $n$ offline vertices and $n$ online vertices where one wishes to be competitive against the optimal offline algorithm. While the classic RANKING algorithm of Karp et al. [1990] provably attains competitive ratio of $1-1/e > 1/2$, we show that no learning-augmented method can be both 1-consistent and strictly better than $1/2$-robust under the adversarial arrival model. Meanwhile, under the random arrival model, we show how one can utilize methods from distribution testing to design an algorithm that takes in external advice about the online vertices and provably achieves competitive ratio interpolating between any ratio attainable by advice-free methods and the optimal ratio of 1, depending on the advice quality.
- [11] arXiv:2405.09797 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Identification of Single-Treatment Effects in Factorial ExperimentsSubjects: Methodology (stat.ME); Machine Learning (stat.ML); Other Statistics (stat.OT)
Despite their cost, randomized controlled trials (RCTs) are widely regarded as gold-standard evidence in disciplines ranging from social science to medicine. In recent decades, researchers have increasingly sought to reduce the resource burden of repeated RCTs with factorial designs that simultaneously test multiple hypotheses, e.g. experiments that evaluate the effects of many medications or products simultaneously. Here I show that when multiple interventions are randomized in experiments, the effect any single intervention would have outside the experimental setting is not identified absent heroic assumptions, even if otherwise perfectly realistic conditions are achieved. This happens because single-treatment effects involve a counterfactual world with a single focal intervention, allowing other variables to take their natural values (which may be confounded or modified by the focal intervention). In contrast, observational studies and factorial experiments provide information about potential-outcome distributions with zero and multiple interventions, respectively. In this paper, I formalize sufficient conditions for the identifiability of those isolated quantities. I show that researchers who rely on this type of design have to justify either linearity of functional forms or -- in the nonparametric case -- specify with Directed Acyclic Graphs how variables are related in the real world. Finally, I develop nonparametric sharp bounds -- i.e., maximally informative best-/worst-case estimates consistent with limited RCT data -- that show when extrapolations about effect signs are empirically justified. These new results are illustrated with simulated data.
- [12] arXiv:2405.09878 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Hyperplane Arrangements and Fixed Points in Iterated PWL Neural NetworksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We leverage the framework of hyperplane arrangements to analyze potential regions of (stable) fixed points. We provide an upper bound on the number of fixed points for multi-layer neural networks equipped with piecewise linear (PWL) activation functions with arbitrary many linear pieces. The theoretical optimality of the exponential growth in the number of layers of the latter bound is shown. Specifically, we also derive a sharper upper bound on the number of stable fixed points for one-hidden-layer networks with hard tanh activation.
- [13] arXiv:2405.09989 (cross-list from stat.AP) [pdf, ps, html, other]
-
Title: A Gaussian Process Model for Ordinal Data with Applications to ChemoinformaticsSubjects: Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
With the proliferation of screening tools for chemical testing, it is now possible to create vast databases of chemicals easily. However, rigorous statistical methodologies employed to analyse these databases are in their infancy, and further development to facilitate chemical discovery is imperative. In this paper, we present conditional Gaussian process models to predict ordinal outcomes from chemical experiments, where the inputs are chemical compounds. We implement the Tanimoto distance, a metric on the chemical space, within the covariance of the Gaussian processes to capture correlated effects in the chemical space. A novel aspect of our model is that the kernel contains a scaling parameter, a feature not previously examined in the literature, that controls the strength of the correlation between elements of the chemical space. Using molecular fingerprints, a numerical representation of a compound's location within the chemical space, we show that accounting for correlation amongst chemical compounds improves predictive performance over the uncorrelated model, where effects are assumed to be independent. Moreover, we present a genetic algorithm for the facilitation of chemical discovery and identification of important features to the compound's efficacy. A simulation study is conducted to demonstrate the suitability of the proposed methods. Our proposed methods are demonstrated on a hazard classification problem of organic solvents.
- [14] arXiv:2405.10027 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: The Real Price of Bandit Information in Multiclass ClassificationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input classifies to one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry is with regard to the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}\left(\min \left\{|\mathcal{H}| + \sqrt{T}, \sqrt{KT \log |{\mathcal{H}|}} \right\} \right) }$, where $\mathcal{H}$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|\mathcal{H}|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log-factors) in all parameter regimes.
- [15] arXiv:2405.10093 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: LaT-PFN: A Joint Embedding Predictive Architecture for In-context Time-series ForecastingComments: 9 pages plus references and appendix, 2 tables, 11 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We introduce LatentTimePFN (LaT-PFN), a foundational Time Series model with a strong embedding space that enables zero-shot forecasting. To achieve this, we perform in-context learning in latent space utilizing a novel integration of the Prior-data Fitted Networks (PFN) and Joint Embedding Predictive Architecture (JEPA) frameworks. We leverage the JEPA framework to create a prediction-optimized latent representation of the underlying stochastic process that generates time series and combines it with contextual learning, using a PFN. Furthermore, we improve on preceding works by utilizing related time series as a context and introducing an abstract time axis. This drastically reduces training time and increases the versatility of the model by allowing any time granularity and forecast horizon. We show that this results in superior zero-shot predictions compared to established baselines. We also demonstrate our latent space produces informative embeddings of both individual time steps and fixed-length summaries of entire series. Finally, we observe the emergence of multi-step patch embeddings without explicit training, suggesting the model actively learns discrete tokens that encode local structures in the data, analogous to vision transformers.
- [16] arXiv:2405.10221 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: Scalarisation-based risk concepts for robust multi-objective optimisationComments: The code is available at: this https URLSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Robust optimisation is a well-established framework for optimising functions in the presence of uncertainty. The inherent goal of this problem is to identify a collection of inputs whose outputs are both desirable for the decision maker, whilst also being robust to the underlying uncertainties in the problem. In this work, we study the multi-objective extension of this problem from a computational standpoint. We identify that the majority of all robust multi-objective algorithms rely on two key operations: robustification and scalarisation. Robustification refers to the strategy that is used to marginalise over the uncertainty in the problem. Whilst scalarisation refers to the procedure that is used to encode the relative importance of each objective. As these operations are not necessarily commutative, the order that they are performed in has an impact on the resulting solutions that are identified and the final decisions that are made. This work aims to give an exposition on the philosophical differences between these two operations and highlight when one should opt for one ordering over the other. As part of our analysis, we showcase how many existing risk concepts can be easily integrated into the specification and solution of a robust multi-objective optimisation problem. Besides this, we also demonstrate how one can principally define the notion of a robust Pareto front and a robust performance metric based on our robustify and scalarise methodology. To illustrate the efficacy of these new ideas, we present two insightful numerical case studies which are based on real-world data sets.
- [17] arXiv:2405.10263 (cross-list from cs.LG) [pdf, ps, other]
-
Title: On Partially Unitary LearningComments: A working algorithm implementing Partially Unitary Learning arXiv:2212.14810 is developed and generalizedSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantum Physics (quant-ph); Machine Learning (stat.ML)
The problem of an optimal mapping between Hilbert spaces $IN$ of $\left|\psi\right\rangle$ and $OUT$ of $\left|\phi\right\rangle$ based on a set of wavefunction measurements (within a phase) $\psi_l \to \phi_l$, $l=1\dots M$, is formulated as an optimization problem maximizing the total fidelity $\sum_{l=1}^{M} \omega^{(l)} \left|\langle\phi_l|\mathcal{U}|\psi_l\rangle\right|^2$ subject to probability preservation constraints on $\mathcal{U}$ (partial unitarity). Constructed operator $\mathcal{U}$ can be considered as a $IN$ to $OUT$ quantum channel; it is a partially unitary rectangular matrix of the dimension $\dim(OUT) \times \dim(IN)$ transforming operators as $A^{OUT}=\mathcal{U} A^{IN} \mathcal{U}^{\dagger}$. An iteration algorithm finding the global maximum of this optimization problem is developed and it's application to a number of problems is demonstrated. A software product implementing the algorithm is available from the authors.
- [18] arXiv:2405.10289 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: Subgradient Convergence Implies Subdifferential Convergence on Weakly Convex Functions: With Uniform Rates GuaranteesSubjects: Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
In nonsmooth, nonconvex stochastic optimization, understanding the uniform convergence of subdifferential mappings is crucial for analyzing stationary points of sample average approximations of risk as they approach the population risk. Yet, characterizing this convergence remains a fundamental challenge.
This work introduces a novel perspective by connecting the uniform convergence of subdifferential mappings to that of subgradient mappings as empirical risk converges to the population risk. We prove that, for stochastic weakly-convex objectives, and within any open set, a uniform bound on the convergence of subgradients -- chosen arbitrarily from the corresponding subdifferential sets -- translates to a uniform bound on the convergence of the subdifferential sets itself, measured by the Hausdorff metric.
Using this technique, we derive uniform convergence rates for subdifferential sets of stochastic convex-composite objectives. Our results do not rely on key distributional assumptions in the literature, which require the population and finite sample subdifferentials to be continuous in the Hausdorff metric, yet still provide tight convergence rates. These guarantees lead to new insights into the nonsmooth landscapes of such objectives within finite samples. - [19] arXiv:2405.10302 (cross-list from stat.ME) [pdf, ps, html, other]
-
Title: Optimal Aggregation of Prediction Intervals under Unsupervised Domain ShiftSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
As machine learning models are increasingly deployed in dynamic environments, it becomes paramount to assess and quantify uncertainties associated with distribution shifts. A distribution shift occurs when the underlying data-generating process changes, leading to a deviation in the model's performance. The prediction interval, which captures the range of likely outcomes for a given prediction, serves as a crucial tool for characterizing uncertainties induced by their underlying distribution. In this paper, we propose methodologies for aggregating prediction intervals to obtain one with minimal width and adequate coverage on the target domain under unsupervised domain shift, under which we have labeled samples from a related source domain and unlabeled covariates from the target domain. Our analysis encompasses scenarios where the source and the target domain are related via i) a bounded density ratio, and ii) a measure-preserving transformation. Our proposed methodologies are computationally efficient and easy to implement. Beyond illustrating the performance of our method through a real-world dataset, we also delve into the theoretical details. This includes establishing rigorous theoretical guarantees, coupled with finite sample bounds, regarding the coverage and width of our prediction intervals. Our approach excels in practical applications and is underpinned by a solid theoretical framework, ensuring its reliability and effectiveness across diverse contexts.
- [20] arXiv:2405.10310 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Stochastic Q-learning for Large Discrete Action SpacesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Robotics (cs.RO); Machine Learning (stat.ML)
In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of $n$ actions, only consider a variable stochastic set of a sublinear number of actions, possibly as small as $\mathcal{O}(\log(n))$. The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
Cross submissions for Friday, 17 May 2024 (showing 14 of 14 entries )
- [21] arXiv:2203.15945 (replaced) [pdf, ps, html, other]
-
Title: A Framework for Improving the Reliability of Black-box Variational InferenceSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Black-box variational inference (BBVI) now sees widespread use in machine learning and statistics as a fast yet flexible alternative to Markov chain Monte Carlo methods for approximate Bayesian inference. However, stochastic optimization methods for BBVI remain unreliable and require substantial expertise and hand-tuning to apply effectively. In this paper, we propose Robust and Automated Black-box VI (RABVI), a framework for improving the reliability of BBVI optimization. RABVI is based on rigorously justified automation techniques, includes just a small number of intuitive tuning parameters, and detects inaccurate estimates of the optimal variational approximation. RABVI adaptively decreases the learning rate by detecting convergence of the fixed--learning-rate iterates, then estimates the symmetrized Kullback--Leibler (KL) divergence between the current variational approximation and the optimal one. It also employs a novel optimization termination criterion that enables the user to balance desired accuracy against computational cost by comparing (i) the predicted relative decrease in the symmetrized KL divergence if a smaller learning were used and (ii) the predicted computation required to converge with the smaller learning rate. We validate the robustness and accuracy of RABVI through carefully designed simulation studies and on a diverse set of real-world model and data examples.
- [22] arXiv:2312.07186 (replaced) [pdf, ps, other]
-
Title: Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares AlgorithmComments: arXiv admin note: text overlap with arXiv:2208.01711Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between $L_2$ and the hypothesis space, which we consider as a vector-valued reproducing kernel Hilbert space. These rates allow to treat the misspecified case in which the true regression function is not contained in the hypothesis space. We combine standard assumptions on the capacity of the hypothesis space with a novel tensor product construction of vector-valued interpolation spaces in order to characterize the smoothness of the regression function. Our upper bound not only attains the same rate as real-valued kernel ridge regression, but also removes the assumption that the target regression function is bounded. For the lower bound, we reduce the problem to the scalar setting using a projection argument. We show that these rates are optimal in most cases and independent of the dimension of the output space. We illustrate our results for the special case of vector-valued Sobolev spaces.
- [23] arXiv:2404.17483 (replaced) [pdf, ps, html, other]
-
Title: Differentiable Pareto-Smoothed Weighting for High-Dimensional Heterogeneous Treatment Effect EstimationComments: Accepted to the 40th Conference on Uncertainty in Artificial Intelligence (UAI2024). 14 pages, 4 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
There is a growing interest in estimating heterogeneous treatment effects across individuals using their high-dimensional feature attributes. Achieving high performance in such high-dimensional heterogeneous treatment effect estimation is challenging because in this setup, it is usual that some features induce sample selection bias while others do not but are predictive of potential outcomes. To avoid losing such predictive feature information, existing methods learn separate feature representations using inverse probability weighting (IPW). However, due to their numerically unstable IPW weights, these methods suffer from estimation bias under a finite sample setup. To develop a numerically robust estimator by weighted representation learning, we propose a differentiable Pareto-smoothed weighting framework that replaces extreme weight values in an end-to-end fashion. Our experimental results show that by effectively correcting the weight values, our proposed method outperforms the existing ones, including traditional weighting schemes. Our code is available at this https URL.
- [24] arXiv:2302.07200 (replaced) [pdf, ps, html, other]
-
Title: Neurosymbolic AI for Reasoning over Knowledge Graphs: A SurveyLauren Nicole DeLong, Ramon Fernández Mir, Jacques D. Fleuriot (The University of Edinburgh School of Informatics, Artificial Intelligence and its Applications Institute)Comments: 21 pages, 6 figures, 2 tables, currently under review. Corresponding GitHub page here: this https URL. Revised in February 2024 according to major revisions, again in May 2024 according to minor revisionsSubjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Machine Learning (stat.ML)
Neurosymbolic AI is an increasingly active area of research that combines symbolic reasoning methods with deep learning to leverage their complementary benefits. As knowledge graphs are becoming a popular way to represent heterogeneous and multi-relational data, methods for reasoning on graph structures have attempted to follow this neurosymbolic paradigm. Traditionally, such approaches have utilized either rule-based inference or generated representative numerical embeddings from which patterns could be extracted. However, several recent studies have attempted to bridge this dichotomy to generate models that facilitate interpretability, maintain competitive performance, and integrate expert knowledge. Therefore, we survey methods that perform neurosymbolic reasoning tasks on knowledge graphs and propose a novel taxonomy by which we can classify them. Specifically, we propose three major categories: (1) logically-informed embedding approaches, (2) embedding approaches with logical constraints, and (3) rule learning approaches. Alongside the taxonomy, we provide a tabular overview of the approaches and links to their source code, if available, for more direct comparison. Finally, we discuss the unique characteristics and limitations of these methods, then propose several prospective directions toward which this field of research could evolve.
- [25] arXiv:2303.07158 (replaced) [pdf, ps, html, other]
-
Title: Uniform Pessimistic Risk and Optimal PortfolioSubjects: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
The optimal allocation of assets has been widely discussed with the theoretical analysis of risk measures, and pessimism is one of the most attractive approaches beyond the conventional optimal portfolio model. The $\alpha$-risk plays a crucial role in deriving a broad class of pessimistic optimal portfolios. However, estimating an optimal portfolio assessed by a pessimistic risk is still challenging due to the absence of a computationally tractable model. In this study, we propose an integral of $\alpha$-risk called the \textit{uniform pessimistic risk} and the computational algorithm to obtain an optimal portfolio based on the risk. Further, we investigate the theoretical properties of the proposed risk in view of three different approaches: multiple quantile regression, the proper scoring rule, and distributionally robust optimization. Real data analysis of three stock datasets (S\&P500, CSI500, KOSPI200) demonstrates the usefulness of the proposed risk and portfolio model.
- [26] arXiv:2310.02278 (replaced) [pdf, ps, html, other]
-
Title: A Stable and Efficient Covariate-Balancing Estimator for Causal Survival EffectsKhiem Pham, David A. Hirshberg, Phuong-Mai Huynh-Pham, Michele Santacatterina, Ser-Nam Lim, Ramin ZabihComments: 32 pages, 5 figuresSubjects: Methodology (stat.ME); Machine Learning (stat.ML)
We propose an empirically stable and asymptotically efficient covariate-balancing approach to the problem of estimating survival causal effects in data with conditionally-independent censoring. This addresses a challenge often encountered in state-of-the-art nonparametric methods: the use of inverses of small estimated probabilities and the resulting amplification of estimation error. We validate our theoretical results in experiments on synthetic and semi-synthetic data.
- [27] arXiv:2311.14646 (replaced) [pdf, ps, html, other]
-
Title: More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is ObligatoryComments: Appeared in ICLR 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained.
Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models. - [28] arXiv:2312.12641 (replaced) [pdf, ps, html, other]
-
Title: Robust Point Matching with Distance ProfilesSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
While matching procedures based on pairwise distances are conceptually appealing and thus favored in practice, theoretical guarantees for such procedures are rarely found in the literature. We propose and analyze matching procedures based on distance profiles that are easily implementable in practice, showing these procedures are robust to outliers and noise. We demonstrate the performance of the proposed method using a real data example and provide simulation studies to complement the theoretical findings.
- [29] arXiv:2401.09787 (replaced) [pdf, ps, html, other]
-
Title: Querying Easily Flip-flopped Samples for Deep Active LearningComments: 34 pages, 17 figures, 5 tables. Accepted to the 12th International Conference on Learning Representations (ICLR 2024) (ver2: fixed some typos and improved some parts of the writing)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data. One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is. The sample's distance to the decision boundary is a natural measure of predictive uncertainty, but it is often intractable to compute, especially for complex decision boundaries formed in multiclass classification tasks. To address this issue, this paper proposes the {\it least disagree metric} (LDM), defined as the smallest probability of disagreement of the predicted label, and an estimator for LDM proven to be asymptotically consistent under mild assumptions. The estimator is computationally efficient and can be easily implemented for deep learning models using parameter perturbation. The LDM-based active learning is performed by querying unlabeled data with the smallest LDM. Experimental results show that our LDM-based active learning algorithm obtains state-of-the-art overall performance on all considered datasets and deep architectures.
- [30] arXiv:2405.08920 (replaced) [pdf, ps, html, other]
-
Title: Neural Collapse Meets Differential Privacy: Curious Behaviors of NoisyGD with Near-perfect Representation LearningComments: To appear in ICML 2024Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
A recent study by De et al. (2022) has reported that large-scale representation learning through pre-training on a public dataset significantly enhances differentially private (DP) learning in downstream tasks, despite the high dimensionality of the feature space. To theoretically explain this phenomenon, we consider the setting of a layer-peeled model in representation learning, which results in interesting phenomena related to learned features in deep learning and transfer learning, known as Neural Collapse (NC).
Within the framework of NC, we establish an error bound indicating that the misclassification error is independent of dimension when the distance between actual features and the ideal ones is smaller than a threshold. Additionally, the quality of the features in the last layer is empirically evaluated under different pre-trained models within the framework of NC, showing that a more powerful transformer leads to a better feature representation. Furthermore, we reveal that DP fine-tuning is less robust compared to fine-tuning without DP, particularly in the presence of perturbations. These observations are supported by both theoretical analyses and experimental evaluation. Moreover, to enhance the robustness of DP fine-tuning, we suggest several strategies, such as feature normalization or employing dimension reduction methods like Principal Component Analysis (PCA). Empirically, we demonstrate a significant improvement in testing accuracy by conducting PCA on the last-layer features.