Discussion: Error function in the output layer of MLP
xinfan meng
2012-06-06 08:38:16 UTC
Hi, all. I am posting this question to the list, since it might be related to the
MLP being developed.

I found that two versions of the error term (delta) for the output layer of an MLP
are used in the literature:


1. \delta_o = (y - a) f'(z)
   http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
2. \delta_o = (y - a)
   http://www.idsia.ch/NNcourse/backprop.html

Given that both pages appear to use the same sigmoid activation function and the
same loss function, how can the error terms be different? Also note that the two
error terms will ultimately lead to different propagated errors in the hidden
layers.
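
For concreteness, here is a minimal NumPy sketch (the variable names are mine, not
from either page; I write t for the target that the pages call a) that evaluates
both expressions for a single sigmoid output unit. The two deltas only coincide
when f'(z) happens to equal 1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.3                      # pre-activation of the output unit
t = 1.0                      # target
y = sigmoid(z)               # output activation

# version 1: delta = (y - t) * f'(z), with f = sigmoid
delta_1 = (y - t) * y * (1.0 - y)
# version 2: delta = (y - t), no f'(z) factor
delta_2 = (y - t)

print(delta_1, delta_2)      # different values in general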
--
Best Wishes
--------------------------------------------
Meng Xinfan蒙新泛
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
David Marek
2012-06-06 10:53:43 UTC
Hi
I just skimmed through them and there are a few differences between those
two pages:

* \delta_o doesn't mean the same thing on those pages. On the second one, it is
just the derivative of the error function.
* The second page doesn't use a sigmoid as the output function. Look at the
examples on the next page and you'll see that y_o = a + f tanh(x) + g tanh(x).
The derivative of this output with respect to the weights is just y, as can be
seen in the matrix form \Delta W = \delta_l y_{l-1}.

I hope this answers your question. Sometimes it's possible to make the
computations simpler, because the error function and output function are
natural pairs, see
http://www.willamette.edu/~gorr/classes/cs449/classify.html
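
As a concrete illustration of that pairing (my own sketch, not code from either
page): if you differentiate the cross-entropy loss through a sigmoid output unit,
the f'(z) factor cancels and the output delta reduces to y - t, which a quick
finite-difference check confirms:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(z, t):
    # loss of a sigmoid output unit with pre-activation z against target t
    y = sigmoid(z)
    return -t * np.log(y) - (1.0 - t) * np.log(1.0 - y)

z, t, eps = 0.7, 1.0, 1e-6
numeric = (cross_entropy(z + eps, t) - cross_entropy(z - eps, t)) / (2 * eps)
analytic = sigmoid(z) - t    # the sigmoid derivative has cancelled out

print(numeric, analytic)     # agree to ~1e-9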

David
xinfan meng
2012-06-06 11:50:57 UTC
Thanks for your reply.

I think these two \delta_o have the same meaning. If you have "Pattern
Recognition and Machine Learning" by Bishop, you can find that Bishop uses
exactly the second formula in the backpropagation algorithm. I suspect
these two formulae lead to the same update iterations, but I can't see why
right now.

What formula do you adopt in your implementation?
David Marek
2012-06-06 12:59:44 UTC
Thanks for the idea. I read about neural networks there, and here is my
explanation. Bishop uses this forward step (page 245):

a_j = \sum_{i=0}^D w_{ji} x_i
z_j = tanh(a_j)
y_k = \sum_{j=0}^M w_{kj} z_j

He is using a linear activation function for the output layer, because it's easy
to compute its derivative. ;-) So in this case
∇w_{ij} = \delta_j * x_i
∇w_{ij} = (y_j - t_j) * x_i

If you used another activation function for the output, for example tanh, which
can be used for classification in the current implementation, you'd have
∇w = (y - t) * dtanh(w*x) * x

So both pages are correct, because each uses a different output activation
function.

The difference is that when you have f(x) = w*x, then
df/dw = x
while for f(x) = tanh(w*x)
df/dw = dtanh(w*x) * x
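
A quick way to check this is a finite-difference comparison on the one-weight
case (a sketch in my own notation, not code from the MLP branch):

import numpy as np

def loss_linear(w, x, t):
    return 0.5 * (w * x - t) ** 2

def loss_tanh(w, x, t):
    return 0.5 * (np.tanh(w * x) - t) ** 2

w, x, t, eps = 0.4, 1.5, 0.2, 1e-6

# linear output: gradient is (y - t) * x, no extra factor
g_lin = (w * x - t) * x
g_lin_num = (loss_linear(w + eps, x, t) - loss_linear(w - eps, x, t)) / (2 * eps)

# tanh output: gradient picks up the dtanh factor
y = np.tanh(w * x)
g_tanh = (y - t) * (1.0 - y ** 2) * x      # dtanh(a) = 1 - tanh(a)^2
g_tanh_num = (loss_tanh(w + eps, x, t) - loss_tanh(w - eps, x, t)) / (2 * eps)

print(g_lin, g_lin_num)     # match
print(g_tanh, g_tanh_num)   # match, and differ from the linear case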

David
xinfan meng
2012-06-06 13:39:22 UTC
Yes, I think your explanation is correct. Thanks.

Those notation differences really confuse me, given that an MLP is much
more complex than a Perceptron. :-(
David Warde-Farley
2012-06-06 18:27:24 UTC
If the output layer has no nonlinearity, then "f(z)" is the identity function
and f'(z) is just 1.

If you have a nonlinearity, you need to backpropagate through it, which is
where the f'(z) comes from.

Note that in both those examples, they are using squared error, which is only
really appropriate for real-valued targets. Cross-entropy is much more
appropriate for classification with softmax outputs. You can derive other
cross-entropy-based error functions if you're predicting a collection of
binary targets.
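
For what it's worth, here is a small sketch of the softmax + cross-entropy
pairing mentioned above (my own code and names, not anything from the MLP
branch); the output-layer delta again reduces to y - t, as a finite-difference
check shows:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, t):
    # cross-entropy of a one-hot target t against softmax(z)
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.2, -1.0, 0.5])
t = np.array([0.0, 1.0, 0.0])
eps = 1e-6

analytic = softmax(z) - t    # output delta: the softmax Jacobian cancels
numeric = np.array([(cross_entropy(z + eps * e, t) -
                     cross_entropy(z - eps * e, t)) / (2 * eps)
                    for e in np.eye(3)])

print(analytic)
print(numeric)               # the two agree to ~1e-9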

David
xinfan meng
2012-06-07 01:01:33 UTC
Thank you. I see the differences now. Your explanation should be put into
the MLP docs :-)
