End-to-End Training of Deep Visuomotor Policies
S. Levine, Chelsea Finn, Trevor Darrell, P. Abbeel
Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.
{"title":"End-to-End Training of Deep Visuomotor Policies","authors":"S. Levine, Chelsea Finn, Trevor Darrell, P. Abbeel","doi":"10.5555/2946645.2946684","DOIUrl":"https://doi.org/10.5555/2946645.2946684","url":null,"abstract":"Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"4 1 1","pages":"39:1-39:40"},"PeriodicalIF":0.0,"publicationDate":"2015-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90811761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Libra toolkit for probabilistic models
Daniel Lowd, Pedram Rooshenas
The Libra Toolkit is a collection of algorithms for learning and inference with discrete probabilistic models, including Bayesian networks, Markov networks, dependency networks, and sum-product networks. Compared to other toolkits, Libra places a greater emphasis on learning the structure of tractable models in which exact inference is efficient. It also includes a variety of algorithms for learning graphical models in which inference is potentially intractable, and for performing exact and approximate inference. Libra is released under a 2-clause BSD license to encourage broad use in academia and industry.
{"title":"The Libra toolkit for probabilistic models","authors":"Daniel Lowd, Pedram Rooshenas","doi":"10.5555/2789272.2912077","DOIUrl":"https://doi.org/10.5555/2789272.2912077","url":null,"abstract":"The Libra Toolkit is a collection of algorithms for learning and inference with discrete probabilistic models, including Bayesian networks, Markov networks, dependency networks, and sum-product networks. Compared to other toolkits, Libra places a greater emphasis on learning the structure of tractable models in which exact inference is efficient. It also includes a variety of algorithms for learning graphical models in which inference is potentially intractable, and for performing exact and approximate inference. Libra is released under a 2-clause BSD license to encourage broad use in academia and industry.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"26 1","pages":"2459-2463"},"PeriodicalIF":0.0,"publicationDate":"2015-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81443275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm
Pascal Germain, A. Lacasse, François Laviolette, M. Marchand, Jean-Francis Roy
We propose an extensive analysis of the behavior of majority votes in binary classification. In particular, we introduce a risk bound for majority votes, called the C-bound, that takes into account the average quality of the voters and their average disagreement. We also propose an extensive PAC-Bayesian analysis that shows how the C-bound can be estimated from various observations contained in the training data. The analysis is intended to be self-contained and can be used as introductory material to PAC-Bayesian statistical learning theory. It starts from a general PAC-Bayesian perspective and ends with uncommon PAC-Bayesian bounds. Some of these bounds contain no Kullback-Leibler divergence and others allow kernel functions to be used as voters (via the sample compression setting). Finally, out of this analysis, we propose the MinCq learning algorithm, which essentially minimizes the C-bound. MinCq reduces to a simple quadratic program. Aside from being theoretically grounded, MinCq achieves state-of-the-art performance, as shown in our extensive empirical comparison with both AdaBoost and the Support Vector Machine.
{"title":"Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm","authors":"Pascal Germain, A. Lacasse, François Laviolette, M. Marchand, Jean-Francis Roy","doi":"10.5555/2789272.2831140","DOIUrl":"https://doi.org/10.5555/2789272.2831140","url":null,"abstract":"We propose an extensive analysis of the behavior of majority votes in binary classification. In particular, we introduce a risk bound for majority votes, called the C-bound, that takes into account the average quality of the voters and their average disagreement. We also propose an extensive PAC-Bayesian analysis that shows how the C-bound can be estimated from various observations contained in the training data. The analysis intends to be self-contained and can be used as introductory material to PAC-Bayesian statistical learning theory. It starts from a general PAC-Bayesian perspective and ends with uncommon PAC-Bayesian bounds. Some of these bounds contain no Kullback-Leibler divergence and others allow kernel functions to be used as voters (via the sample compression setting). Finally, out of the analysis, we propose the MinCq learning algorithm that basically minimizes the C-bound. MinCq reduces to a simple quadratic program. Aside from being theoretically grounded, MinCq achieves state-of-the-art performance, as shown in our extensive empirical comparison with both AdaBoost and the Support Vector Machine.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"81 1","pages":"787-860"},"PeriodicalIF":0.0,"publicationDate":"2015-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85481769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existence and uniqueness of proper scoring rules
E. Ovcharov
To discuss the existence and uniqueness of proper scoring rules, one needs to extend the associated entropy functions as sublinear functions to the conic hull of the prediction set. In some natural function spaces, such as the Lebesgue $L^p$-spaces over $\mathbb{R}^d$, the positive cones have empty interior. Entropy functions defined on such cones have directional derivatives only, which typically exist on large subspaces and behave similarly to gradients. Certain entropies may be further extended continuously to open cones in normed spaces containing signed densities. The extended entropies are Gâteaux differentiable except on a negligible set and have everywhere continuous subgradients due to the supporting hyperplane theorem. We introduce the necessary framework from analysis and algebra that allows us to give an affirmative answer to the titular question of the paper. As a result, we give a formal sense in which entropy functions have uniquely associated proper scoring rules. We illustrate our framework by studying the derivatives and subgradients of the following three prototypical entropies: Shannon entropy, Hyvärinen entropy, and quadratic entropy.
{"title":"Existence and uniqueness of proper scoring rules","authors":"E. Ovcharov","doi":"10.5555/2789272.2886820","DOIUrl":"https://doi.org/10.5555/2789272.2886820","url":null,"abstract":"To discuss the existence and uniqueness of proper scoring rules one needs to extend the associated entropy functions as sublinear functions to the conic hull of the prediction set. In some natural function spaces, such as the Lebesgue Lp-spaces over Rd, the positive cones have empty interior. Entropy functions defined on such cones have directional derivatives only, which typically exist on large subspaces and behave similarly to gradients. Certain entropies may be further extended continuously to open cones in normed spaces containing signed densities. The extended densities are Gâteaux differentiable except on a negligible set and have everywhere continuous subgradients due to the supporting hyperplane theorem. We introduce the necessary framework from analysis and algebra that allows us to give an affirmative answer to the titular question of the paper. As a result of this, we give a formal sense in which entropy functions have uniquely associated proper scoring rules. We illustrate our framework by studying the derivatives and subgradients of the following three prototypical entropies: Shannon entropy, Hyvarinen entropy, and quadratic entropy.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"12 1","pages":"2207-2230"},"PeriodicalIF":0.0,"publicationDate":"2015-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76915820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Photonic delay systems as machine learning implementations
Michiel Hermans, M. C. Soriano, J. Dambre, P. Bienstman, Ingo Fischer
Nonlinear photonic delay systems present interesting implementation platforms for machine learning models. They can be extremely fast, offer a high degree of parallelism, and potentially consume far less power than digital processors. So far they have been successfully employed for signal processing using the Reservoir Computing paradigm. In this paper we show that their range of applicability can be greatly extended if we use gradient descent with backpropagation through time on a model of the system to optimize the input encoding. We perform physical experiments demonstrating that the obtained input encodings work well in reality, and we show that optimized systems perform significantly better than the common Reservoir Computing approach. The results presented here demonstrate that common gradient descent techniques from machine learning may well be applicable to physical neuro-inspired analog computers.
{"title":"Photonic delay systems as machine learning implementations","authors":"Michiel Hermans, M. C. Soriano, J. Dambre, P. Bienstman, Ingo Fischer","doi":"10.5555/2789272.2886817","DOIUrl":"https://doi.org/10.5555/2789272.2886817","url":null,"abstract":"Nonlinear photonic delay systems present interesting implementation platforms for machine learning models. They can be extremely fast, offer great degrees of parallelism and potentially consume far less power than digital processors. So far they have been successfully employed for signal processing using the Reservoir Computing paradigm. In this paper we show that their range of applicability can be greatly extended if we use gradient descent with backpropagation through time on a model of the system to optimize the input encoding of such systems. We perform physical experiments that demonstrate that the obtained input encodings work well in reality, and we show that optimized systems perform significantly better than the common Reservoir Computing approach. The results presented here demonstrate that common gradient descent techniques from machine learning may well be applicable on physical neuro-inspired analog computers.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"112 1","pages":"2081-2097"},"PeriodicalIF":0.0,"publicationDate":"2015-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76787368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrete reproducing kernel Hilbert spaces: sampling and distribution of Dirac-masses
P. Jorgensen, Feng Tian
We study reproducing kernels, and the associated reproducing kernel Hilbert spaces (RKHSs) $\mathscr{H}$, over infinite, discrete, and countable sets $V$. In this setting we analyze in detail the distributions of the corresponding Dirac point-masses of $V$. Illustrations include certain models from neural networks: an Extreme Learning Machine (ELM) is a neural-network configuration in which a hidden layer of weights is randomly sampled, and the object is then to compute the resulting output. For RKHSs $\mathscr{H}$ of functions defined on a prescribed countably infinite discrete set $V$, we characterize those which contain the Dirac masses $\delta_{x}$ for all points $x$ in $V$. Further examples and applications where this question plays an important role are: (i) discrete Brownian-motion Hilbert spaces, i.e., discrete versions of the Cameron-Martin Hilbert space; (ii) energy Hilbert spaces corresponding to graph Laplacians, where the set $V$ of vertices is then equipped with a resistance metric; and finally (iii) the study of Gaussian free fields.
{"title":"Discrete reproducing kernel Hilbert spaces: sampling and distribution of Dirac-masses","authors":"P. Jorgensen, Feng Tian","doi":"10.5555/2789272.2912098","DOIUrl":"https://doi.org/10.5555/2789272.2912098","url":null,"abstract":"We study reproducing kernels, and associated reproducing kernel Hilbert spaces (RKHSs) $mathscr{H}$ over infinite, discrete and countable sets $V$. In this setting we analyze in detail the distributions of the corresponding Dirac point-masses of $V$. Illustrations include certain models from neural networks: An Extreme Learning Machine (ELM) is a neural network-configuration in which a hidden layer of weights are randomly sampled, and where the object is then to compute resulting output. For RKHSs $mathscr{H}$ of functions defined on a prescribed countable infinite discrete set $V$, we characterize those which contain the Dirac masses $delta_{x}$ for all points $x$ in $V$. Further examples and applications where this question plays an important role are: (i) discrete Brownian motion-Hilbert spaces, i.e., discrete versions of the Cameron-Martin Hilbert space; (ii) energy-Hilbert spaces corresponding to graph-Laplacians where the set $V$ of vertices is then equipped with a resistance metric; and finally (iii) the study of Gaussian free fields.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"31 1","pages":"3079-3114"},"PeriodicalIF":0.0,"publicationDate":"2015-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77194683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework
Michael R. Thon, H. Jaeger
Stochastic multiplicity automata (SMA) are weighted finite automata that generalize probabilistic automata. They have been used in the context of probabilistic grammatical inference. Observable operator models (OOMs) are a generalization of hidden Markov models, which in turn are models for discrete-valued stochastic processes and are used ubiquitously in the context of speech recognition and bio-sequence modeling. Predictive state representations (PSRs) extend OOMs to stochastic input-output systems and are employed in the context of agent modeling and planning. We present SMA, OOMs, and PSRs under the common framework of sequential systems, which are an algebraic characterization of multiplicity automata, and examine the precise relationships between them. Furthermore, we establish a unified approach to learning such models from data. Many of the learning algorithms that have been proposed can be understood as variations of this basic learning scheme, and several turn out to be closely related to each other, or even equivalent.
{"title":"Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework","authors":"Michael R. Thon, H. Jaeger","doi":"10.5555/2789272.2789276","DOIUrl":"https://doi.org/10.5555/2789272.2789276","url":null,"abstract":"Stochastic multiplicity automata (SMA) are weighted finite automata that generalize probabilistic automata. They have been used in the context of probabilistic grammatical inference. Observable operator models (OOMs) are a generalization of hidden Markov models, which in turn are models for discrete-valued stochastic processes and are used ubiquitously in the context of speech recognition and bio-sequence modeling. Predictive state representations (PSRs) extend OOMs to stochastic input-output systems and are employed in the context of agent modeling and planning. We present SMA, OOMs, and PSRs under the common framework of sequential systems, which are an algebraic characterization of multiplicity automata, and examine the precise relationships between them. Furthermore, we establish a unified approach to learning such models from data. Many of the learning algorithms that have been proposed can be understood as variations of this basic learning scheme, and several turn out to be closely related to each other, or even equivalent.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"13 1","pages":"103-147"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75702845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Preface to this special issue
A. Gammerman, V. Vovk
This issue of JMLR is devoted to the memory of Alexey Chervonenkis. Over the period of a dozen years between 1962 and 1973, he and Vladimir Vapnik created a new discipline, statistical learning theory: the foundation on which all our modern understanding of pattern recognition is based. Alexey was 28 years old when they made their most famous and original discovery, the uniform law of large numbers. In that short period Vapnik and Chervonenkis also introduced the main concepts of statistical learning theory, such as VC dimension, capacity control, and the Structural Risk Minimization principle, and designed two powerful pattern recognition methods, Generalised Portrait and Optimal Separating Hyperplane, later transformed by Vladimir Vapnik into the Support Vector Machine, arguably one of the best tools for pattern recognition and regression estimation. Thereafter Alexey continued to publish original and important contributions to learning theory. He was also active in research in several applied fields, including geology, bioinformatics, medicine, and advertising. Alexey tragically died in September 2014 after getting lost during a hike in the Elk Island park on the outskirts of Moscow. Vladimir Vapnik suggested preparing an issue of JMLR to be published on the first anniversary of the death of his long-term collaborator and close friend. Vladimir and the editors contacted a few dozen leading researchers in the fields of machine learning related to Alexey's research interests and received many enthusiastic replies. In the end, eleven papers were accepted. This issue also contains a first attempt at a complete bibliography of Alexey Chervonenkis's publications. Alexey's Festschrift (Vovk et al., 2015) will appear simultaneously with this special issue; the reader is referred to it for information about Alexey's research, life, and death. The Festschrift is based in part on a symposium held in Paphos, Cyprus, in 2013 to celebrate Alexey's 75th birthday. Apart from research contributions, it contains Alexey's reminiscences about his early work on statistical learning with Vladimir Vapnik, a reprint of their seminal 1971 paper, a historical chapter by R. M. Dudley, reminiscences of Alexey's and Vladimir's close colleague Vasily Novoseltsev, and three reviews of various measures of complexity used in machine learning ("Measures of Complexity" is both the name of the symposium and the title of the book). Among Alexey's contributions to machine learning (mostly joint with Vladimir Vapnik) discussed in the book are:
{"title":"Preface to this special issue","authors":"A. Gammerman, V. Vovk","doi":"10.5555/2789272.2886803","DOIUrl":"https://doi.org/10.5555/2789272.2886803","url":null,"abstract":"This issue of JMLR is devoted to the memory of Alexey Chervonenkis. Over the period of a dozen years between 1962 and 1973 he and Vladimir Vapnik created a new discipline of statistical learning theory—the foundation on which all our modern understanding of pattern recognition is based. Alexey was 28 years old when they made their most famous and original discovery, the uniform law of large numbers. In that short period Vapnik and Chervonenkis also introduced the main concepts of statistical learning theory, such as VCdimension, capacity control, and the Structural Risk Minimization principle, and designed two powerful pattern recognition methods, Generalised Portrait and Optimal Separating Hyperplane, later transformed by Vladimir Vapnik into Support Vector Machine—arguably one of the best tools for pattern recognition and regression estimation. Thereafter Alexey continued to publish original and important contributions to learning theory. He was also active in research in several applied fields, including geology, bioinformatics, medicine, and advertising. Alexey tragically died in September 2014 after getting lost during a hike in the Elk Island park on the outskirts of Moscow. Vladimir Vapnik suggested to prepare an issue of JMLR to be published at the first anniversary of the death of his long-term collaborator and close friend. Vladimir and the editors contacted a few dozen leading researchers in the fields of machine learning related to Alexey’s research interests and had many enthusiastic replies. In the end eleven papers were accepted. This issue also contains a first attempt at a complete bibliography of Alexey Chervonenkis’s publications. Simultaneously with this special issue will appear Alexey’s Festschrift (Vovk et al., 2015), to which the reader is referred for information about Alexey’s research, life, and death. The Festschrift is based in part on a symposium held in Pathos, Cyprus, in 2013 to celebrate Alexey’s 75th anniversary. Apart from research contributions, it contains Alexey’s reminiscences about his early work on statistical learning with Vladimir Vapnik, a reprint of their seminal 1971 paper, a historical chapter by R. M. Dudley, reminiscences of Alexey’s and Vladimir’s close colleague Vasily Novoseltsev, and three reviews of various measures of complexity used in machine learning (“Measures of Complexity” is both the name of the symposium and the title of the book). Among Alexey’s contributions to machine learning (mostly joint with Vladimir Vapnik) discussed in the book are:","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"30 1","pages":"1677-1681"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73946949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-supervised interpolation in an anticausal learning scenario
D. Janzing, B. Schölkopf
According to a recently stated 'independence postulate', the distribution $P_{\mathrm{cause}}$ contains no information about the conditional $P_{\mathrm{effect}\mid\mathrm{cause}}$, while $P_{\mathrm{effect}}$ may contain information about $P_{\mathrm{cause}\mid\mathrm{effect}}$. Since semi-supervised learning (SSL) attempts to exploit information from $P_X$ to assist in predicting $Y$ from $X$, it should only work in the anticausal direction, i.e., when $Y$ is the cause and $X$ is the effect. In the causal direction, when $X$ is the cause and $Y$ the effect, unlabelled $x$-values should be useless. To shed light on this asymmetry, we study a deterministic causal relation $Y = f(X)$, as recently analyzed in Information-Geometric Causal Inference (IGCI). Within this model, we discuss two options for formalizing the independence of $P_X$ and $f$ as an orthogonality of vectors in appropriate inner product spaces. We prove that unlabelled data help for the problem of interpolating a monotonically increasing function if and only if the orthogonality conditions are violated, which we expect only for the anticausal direction. Here, the performance of SSL and its supervised baseline analogue is measured in terms of two different loss functions: first, the mean squared error, and second, the surprise in a Bayesian prediction scenario.
{"title":"Semi-supervised interpolation in an anticausal learning scenario","authors":"D. Janzing, B. Scholkopf","doi":"10.5555/2789272.2886811","DOIUrl":"https://doi.org/10.5555/2789272.2886811","url":null,"abstract":"According to a recently stated 'independence postulate', the distribution Pcause contains no information about the conditional Peffect|cause while Peffect may contain information about Pcause|effect. Since semi-supervised learning (SSL) attempts to exploit information from PX to assist in predicting Y from X, it should only work in anticausal direction, i.e., when Y is the cause and X is the effect. In causal direction, when X is the cause and Y the effect, unlabelled x-values should be useless. To shed light on this asymmetry, we study a deterministic causal relation Y = f(X) as recently assayed in Information-Geometric Causal Inference (IGCI). Within this model, we discuss two options to formalize the independence of PX and f as an orthogonality of vectors in appropriate inner product spaces. We prove that unlabelled data help for the problem of interpolating a monotonically increasing function if and only if the orthogonality conditions are violated - which we only expect for the anticausal direction. Here, performance of SSL and its supervised baseline analogue is measured in terms of two different loss functions: first, the mean squared error and second the surprise in a Bayesian prediction scenario.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"9 1","pages":"1923-1948"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84605148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A view of margin losses as regularizers of probability estimates
Hamed Masnadi-Shirazi, N. Vasconcelos
Regularization is commonly used in classifier design to ensure good generalization. Classical regularization enforces a cost on classifier complexity by constraining parameters. This is usually combined with a margin loss, which favors large-margin decision rules. A novel and unified view of this architecture is proposed, by showing that margin losses act as regularizers of posterior class probabilities, in a way that amplifies classical parameter regularization. The problem of controlling the regularization strength of a margin loss is considered, using a decomposition of the loss in terms of a link and a binding function. The link function is shown to be responsible for the regularization strength of the loss, while the binding function determines its outlier robustness. A large class of losses is then categorized into equivalence classes of identical regularization strength or outlier robustness. It is shown that losses in the same regularization class can be parameterized so as to have tunable regularization strength. This parameterization is finally used to derive boosting algorithms with loss regularization (BoostLR). Three classes of tunable regularization losses are considered in detail. Canonical losses can implement all regularization behaviors but have no flexibility in terms of outlier modeling. Shrinkage losses support equally parameterized link and binding functions, leading to boosting algorithms that implement the popular shrinkage procedure. This offers a new explanation for shrinkage as a special case of loss-based regularization. Finally, α-tunable losses enable the independent parameterization of link and binding functions, leading to boosting algorithms of great flexibility. This is illustrated by the derivation of an algorithm that generalizes both AdaBoost and LogitBoost, behaving as whichever best suits the data to be classified. Various experiments provide evidence of the benefits of probability regularization for both classification and estimation of posterior class probabilities.
{"title":"A view of margin losses as regularizers of probability estimates","authors":"Hamed Masnadi-Shirazi, N. Vasconcelos","doi":"10.5555/2789272.2912087","DOIUrl":"https://doi.org/10.5555/2789272.2912087","url":null,"abstract":"Regularization is commonly used in classifier design, to assure good generalization. Classical regularization enforces a cost on classifier complexity, by constraining parameters. This is usually combined with a margin loss, which favors large-margin decision rules. A novel and unified view of this architecture is proposed, by showing that margin losses act as regularizers of posterior class probabilities, in a way that amplifies classical parameter regularization. The problem of controlling the regularization strength of a margin loss is considered, using a decomposition of the loss in terms of a link and a binding function. The link function is shown to be responsible for the regularization strength of the loss, while the binding function determines its outlier robustness. A large class of losses is then categorized into equivalence classes of identical regularization strength or outlier robustness. It is shown that losses in the same regularization class can be parameterized so as to have tunable regularization strength. This parameterization is finally used to derive boosting algorithms with loss regularization (BoostLR). Three classes of tunable regularization losses are considered in detail. Canonical losses can implement all regularization behaviors but have no flexibility in terms of outlier modeling. Shrinkage losses support equally parameterized link and binding functions, leading to boosting algorithms that implement the popular shrinkage procedure. This offers a new explanation for shrinkage as a special case of loss-based regularization. Finally, α-tunable losses enable the independent parameterization of link and binding functions, leading to boosting algorithms of great exibility. This is illustrated by the derivation of an algorithm that generalizes both AdaBoost and LogitBoost, behaving as either one when that best suits the data to classify. Various experiments provide evidence of the benefits of probability regularization for both classification and estimation of posterior class probabilities.","PeriodicalId":14794,"journal":{"name":"J. Mach. Learn. Res.","volume":"17 1","pages":"2751-2795"},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80322330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}