Specific data compression techniques, formalized by the concept of coresets, have proved powerful for many optimization problems. While tightly controlling the approximation error, coresets can lead to significant speed-ups and hence make it possible to extend algorithms to much larger problem sizes. The present paper deals with a weight-balanced clustering problem and is specifically motivated by an application in materials science where a voxel-based image is to be processed into a diagram representation. Here, the class of desired coresets is naturally confined to those which can be viewed as lowering the resolution of the input data. While one might expect such resolution coresets to be inferior to unrestricted coresets, we prove bounds for resolution coresets which improve known bounds in the relevant dimensions and also lead to significantly faster algorithms in practice.
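As a minimal illustration of a resolution coreset in the sense described above (a sketch only, not the construction analyzed in the paper): a weighted voxel image is coarsened by an integer factor, each block of voxels being replaced by a single point at its weighted centroid carrying the block's total weight. The function name and the 2D setting are illustrative assumptions.

```python
import numpy as np

def resolution_coreset(weights: np.ndarray, factor: int):
    """Coarsen a 2D weighted voxel image by an integer factor.

    Each factor x factor block of voxels is replaced by one point at
    the block's weighted centroid, carrying the block's total weight,
    so total weight is preserved. (Illustrative sketch only.)
    """
    h, w = weights.shape
    points, agg = [], []
    for i in range(0, h, factor):
        for j in range(0, w, factor):
            block = weights[i:i + factor, j:j + factor]
            total = block.sum()
            if total == 0:
                continue
            ys, xs = np.mgrid[i:i + block.shape[0], j:j + block.shape[1]]
            # Weighted centroid of the block serves as the representative point.
            cy = (ys * block).sum() / total
            cx = (xs * block).sum() / total
            points.append((cy, cx))
            agg.append(total)
    return np.array(points), np.array(agg)

# Example: a 100x100 voxel image reduced by a factor of 5, i.e.
# 10,000 voxels compressed to at most 400 weighted points.
rng = np.random.default_rng(0)
img = rng.random((100, 100))
pts, wts = resolution_coreset(img, factor=5)
```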
We seek to extract a small number of representative scenarios from large panel data that are consistent with sample moments. We propose two novel algorithms: the first identifies scenarios that have not been observed before and comes with a scenario-based representation of covariance matrices; the second selects, from states of the world that have already been realized, important data points that are consistent with higher-order sample moment information. Both algorithms are efficient to compute and lend themselves to consistent scenario-based modeling and multi-dimensional numerical integration that can be used for interpretable decision-making under uncertainty. Extensive numerical benchmarking studies and an application in portfolio optimization favor the proposed algorithms.
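Neither of the two proposed algorithms is specified above; as a generic toy illustration of moment-consistent scenario selection, the sketch below reweights a random subset of realized data points so that their weighted first and second raw moments match those of the full sample. The function names and the nonnegative-least-squares formulation are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import nnls

def moment_matched_weights(X: np.ndarray, idx: np.ndarray):
    """Weight the candidate scenarios X[idx] so that their weighted
    first and second raw moments approximate those of the full sample,
    with nonnegative weights summing to one. (Toy illustration only.)"""
    S = X[idx]                                   # candidate scenarios, shape (k, d)

    def feats(Z):
        # Per-row features [1, z, vec(z z^T)] encode the moments to match.
        quad = np.einsum('ni,nj->nij', Z, Z).reshape(len(Z), -1)
        return np.hstack([np.ones((len(Z), 1)), Z, quad])

    A = feats(S).T                               # (1 + d + d^2, k) moment matrix
    b = feats(X).mean(axis=0)                    # target sample moments
    w, _ = nnls(A, b)                            # nonnegative least squares
    return w / w.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))                   # panel data: 5000 obs, 3 variables
idx = rng.choice(len(X), size=25, replace=False) # 25 realized states as scenarios
w = moment_matched_weights(X, idx)
```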
The quality of machine learning solutions, and of classifier models in general, depends largely on the performance of the chosen algorithm and on the intrinsic characteristics of the input data. Although work on the former aspect has been extensive, the latter has received comparably less attention. In this paper, we introduce the Multiscale Impurity Complexity Analysis (MICA) algorithm for quantifying the class separability and decision-boundary complexity of datasets. MICA is both model- and dimensionality-independent and provides a measure of separability based on regional impurity values, making it sensitive to both global and local data conditions. We show that MICA properly describes class separability on a comprehensive set of both synthetic and real datasets, and we compare it against other state-of-the-art methods. After establishing the robustness of the proposed method, alternative applications are discussed, including a streaming-data variant of MICA (MICA-S) that can be repurposed into a model-independent method for concept drift detection.
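The precise definition of MICA is not reproduced above; as a loose illustration of a multiscale regional-impurity score (an assumption, not the authors' method), one can average the Gini impurity of k-nearest-neighbor neighborhoods over several neighborhood sizes, so that values near zero indicate well-separated classes and higher values indicate boundary overlap.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def multiscale_impurity(X, y, scales=(5, 15, 45)):
    """Average Gini impurity of k-NN neighborhoods at several scales.
    Near 0 ~ well-separated classes; higher ~ more boundary overlap.
    (Loose illustration of a multiscale regional-impurity score,
    not the MICA algorithm itself.)"""
    scores = []
    for k in scales:
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)
        neigh_labels = y[idx[:, 1:]]             # drop each point itself
        gini = []
        for row in neigh_labels:
            _, counts = np.unique(row, return_counts=True)
            p = counts / counts.sum()
            gini.append(1.0 - np.sum(p ** 2))    # Gini impurity of the region
        scores.append(np.mean(gini))
    return float(np.mean(scores))

# Well-separated blobs should score near 0; overlapping blobs higher.
X, y = make_blobs(n_samples=600, centers=2, cluster_std=1.0, random_state=0)
print(multiscale_impurity(X, y))
```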
Optimization of parameters and hyperparameters is a routine part of any data analysis. Because not all models are mathematically well-behaved, stochastic optimization, which chooses parameters randomly in each iteration, can be useful in many analyses. Many such algorithms have been reported and applied in chemistry data analysis; the one reported here is a naïve algorithm that searches each parameter sequentially and randomly within its bounds and carries the best candidate forward to the next iteration. Thus, one can simply discard irrational solutions of the model, avoid computing its gradient in parameter space, and continue the optimization.
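A sketch of this kind of naïve coordinate-wise random search follows, with details the description above leaves open (acceptance rule, iteration budget, handling of invalid evaluations) filled in as assumptions.

```python
import numpy as np

def naive_random_search(loss, bounds, n_iter=200, seed=0):
    """Sequential coordinate-wise random search (a sketch of the kind
    of naive stochastic optimizer described above; details assumed).

    In each iteration, every parameter is perturbed in turn to a
    uniform random value within its bounds, and a candidate is kept
    only if it improves the loss. Invalid evaluations (NaN/inf) are
    simply rejected, so no gradients or well-behaved model outputs
    are required."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    x = rng.uniform(lo, hi)                      # random feasible start
    best = loss(x)
    if not np.isfinite(best):
        best = np.inf                            # treat invalid start as worst case
    for _ in range(n_iter):
        for j in range(len(x)):                  # one parameter at a time
            cand = x.copy()
            cand[j] = rng.uniform(lo[j], hi[j])  # random draw within its bounds
            f = loss(cand)
            if np.isfinite(f) and f < best:      # keep the best for the next iteration
                x, best = cand, f
    return x, best

# Example: works even for a non-smooth objective with no useful gradient.
f = lambda p: abs(p[0] - 1.0) + abs(p[1] + 2.0)
x_opt, f_opt = naive_random_search(f, bounds=[(-5, 5), (-5, 5)])
```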