{"title":"感知器","authors":"Volker Tresp","doi":"10.1142/9789811201233_0004","DOIUrl":null,"url":null,"abstract":"The perceptron implements a binary classifier f : R D → {+1, −1} with a linear decision surface through the origin: f (x) = step(θ x). (1) where step(z) = 1 if z ≥ 0 −1 otherwise. Using the zero-one loss L(y, f (x)) = 0 if y = f (x) 1 otherwise, the empirical risk of the perceptron on training data S = 1. The problem with this is that R emp (θ) is not differentiable in θ, so we cannot do gradient descent to learn θ. To circumvent this, we use the modified empirical loss R emp (θ) = i∈(1,2,...,N) : yi =step θ T xi −y i θ T x i. (2) This just says that correctly classified examples don't incur any loss at all, while incorrectly classified examples contribute θ T x i , which is some sort of measure of confidence in the (incorrect) labeling. 1 We can now use gradient descent to learn θ. Starting from an arbitrary θ (0) , we update our parameter vector according to θ (t+1) = θ (t) − η∇ θ R| θ (t) , where η, called the learning rate, is a parameter of our choosing. The gradient of (2) is again a sum over the misclassified examples: ∇ θ R emp (θ) = 1 A slightly more principled way to look at this is to derive this modified risk from the hinge loss L(y, θ T x) = max 0, −y θ T x .","PeriodicalId":188131,"journal":{"name":"Principles of Artificial Neural Networks","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":"{\"title\":\"The Perceptron\",\"authors\":\"Volker Tresp\",\"doi\":\"10.1142/9789811201233_0004\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The perceptron implements a binary classifier f : R D → {+1, −1} with a linear decision surface through the origin: f (x) = step(θ x). (1) where step(z) = 1 if z ≥ 0 −1 otherwise. Using the zero-one loss L(y, f (x)) = 0 if y = f (x) 1 otherwise, the empirical risk of the perceptron on training data S = 1. The problem with this is that R emp (θ) is not differentiable in θ, so we cannot do gradient descent to learn θ. To circumvent this, we use the modified empirical loss R emp (θ) = i∈(1,2,...,N) : yi =step θ T xi −y i θ T x i. (2) This just says that correctly classified examples don't incur any loss at all, while incorrectly classified examples contribute θ T x i , which is some sort of measure of confidence in the (incorrect) labeling. 1 We can now use gradient descent to learn θ. Starting from an arbitrary θ (0) , we update our parameter vector according to θ (t+1) = θ (t) − η∇ θ R| θ (t) , where η, called the learning rate, is a parameter of our choosing. 
The gradient of (2) is again a sum over the misclassified examples: ∇ θ R emp (θ) = 1 A slightly more principled way to look at this is to derive this modified risk from the hinge loss L(y, θ T x) = max 0, −y θ T x .\",\"PeriodicalId\":188131,\"journal\":{\"name\":\"Principles of Artificial Neural Networks\",\"volume\":\"14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"44\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Principles of Artificial Neural Networks\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1142/9789811201233_0004\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Principles of Artificial Neural Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/9789811201233_0004","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 44
Abstract
The perceptron implements a binary classifier $f : \mathbb{R}^D \to \{+1, -1\}$ with a linear decision surface through the origin:

$$f(x) = \mathrm{step}(\theta^\top x), \qquad (1)$$

where $\mathrm{step}(z) = +1$ if $z \ge 0$ and $-1$ otherwise. Using the zero-one loss $L(y, f(x)) = 0$ if $y = f(x)$ and $1$ otherwise, the empirical risk of the perceptron on training data $S = \{(x_i, y_i)\}_{i=1}^N$ is $R_{\mathrm{emp}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$. The problem with this is that $R_{\mathrm{emp}}(\theta)$ is not differentiable in $\theta$, so we cannot use gradient descent to learn $\theta$. To circumvent this, we use the modified empirical loss

$$R_{\mathrm{emp}}(\theta) = \sum_{i \in \{1, \dots, N\} \,:\, y_i \neq \mathrm{step}(\theta^\top x_i)} -y_i \, \theta^\top x_i. \qquad (2)$$

This just says that correctly classified examples incur no loss at all, while each incorrectly classified example contributes $-y_i \, \theta^\top x_i > 0$, which is a measure of confidence in the (incorrect) labeling. We can now use gradient descent to learn $\theta$: starting from an arbitrary $\theta^{(0)}$, we update the parameter vector according to $\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta R_{\mathrm{emp}} |_{\theta^{(t)}}$, where $\eta$, called the learning rate, is a parameter of our choosing. The gradient of (2) is again a sum over the misclassified examples: $\nabla_\theta R_{\mathrm{emp}}(\theta) = \sum_{i \,:\, y_i \neq \mathrm{step}(\theta^\top x_i)} -y_i x_i$. A slightly more principled way to look at this is to derive the modified risk from the hinge-type loss $L(y, \theta^\top x) = \max(0, -y \, \theta^\top x)$: this loss is zero on a correctly classified example and equals $-y \, \theta^\top x$ on a misclassified one, so summing it over $S$ recovers (2).
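To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent on the modified risk (2). The function names (`step`, `train_perceptron`), the learning rate `eta`, the epoch limit, and the toy data are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def step(z):
    """step(z) = +1 if z >= 0, -1 otherwise (elementwise), as in (1)."""
    return np.where(z >= 0, 1.0, -1.0)

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Batch gradient descent on the modified empirical risk (2).

    X: (N, D) array of inputs; y: (N,) array of labels in {+1, -1}.
    """
    theta = np.zeros(X.shape[1])         # arbitrary starting point theta^(0)
    for _ in range(epochs):
        mis = step(X @ theta) != y       # currently misclassified examples
        if not mis.any():                # zero-one risk is 0: stop early
            break
        grad = -(X[mis].T @ y[mis])      # gradient of (2): sum_i -y_i x_i
        theta -= eta * grad              # theta^(t+1) = theta^(t) - eta * grad
    return theta

# Toy usage on linearly separable data through the origin (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = step(X @ np.array([2.0, -1.0]))      # labels from a known separator
theta = train_perceptron(X, y)
print((step(X @ theta) == y).mean())     # reaches 1.0 once training converges
```

One design note: because $\mathrm{step}(\theta^\top x)$ is invariant to rescaling $\theta$, starting from $\theta^{(0)} = 0$ the choice of `eta` only rescales the iterates and does not change which examples are misclassified at each step.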