C Board  

Old 02-08-2009, 09:14 PM   #1
Registered User
 
IdioticCreation's Avatar
 
Join Date: Nov 2006
Location: Lurking about
Posts: 212
Neural Networks - Calculating Error Gradients

Hi

I'm working on implementing a neural network, but I'm having trouble calculating error gradients on both output and hidden layers. I'm using the identity function as my activation function: f(x) = x. I am pretty clueless when it comes to calculus so I'm really having trouble with it.

I found this Web page that has a good explanation
http://www.willamette.edu/~gorr/clas...9/linear2.html

I just can't seem to figure out how to implement it.

I have an example of the gradient calculation of a network that is using a sigmoid activation function
Code:
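inline double trainer::getOutputErrorGradient( double desiredValue, double outputValue)
{
	// Error gradient for a sigmoid output neuron:
	// outputValue * ( 1 - outputValue ) is the sigmoid's derivative written in terms of its output,
	// ( desiredValue - outputValue ) is the output error.
	return outputValue * ( 1 - outputValue ) * ( desiredValue - outputValue );
}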
but this isn't all that helpful.

Any help would be greatly appreciated.
__________________
If you take something apart and put it back together enough times, you will eventually have enough parts left over to build a second one.
Old 02-09-2009, 11:17 AM   #2
Crazy Fool
 
Perspective's Avatar
 
Join Date: Jan 2003
Location: Canada
Posts: 2,588
[disclaimer: it's been a long time since I've studied NNs]. But anyway, a couple of points. First, a neural network with a linear function doesn't need any hidden layers. They can be collapsed since it's just a linear combination (you can directly compute the outputs from the inputs).

Second, the backprop algorithm propagates errors using the derivative of the activation function. The example you have is likely the closed form for the sigmoid (the most common function used, from what I remember). In your case the derivative is constant, since the function is linear.
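A minimal sketch of that point, assuming the same (desired, output) convention as the sigmoid snippet in the first post. The function names here are made up for illustration. Since the derivative of f(x) = x is the constant 1, the sigmoid's derivative term simply drops out:

```cpp
// Sigmoid output neuron: derivative (in terms of the output) times the error.
inline double sigmoidOutputGradient(double desiredValue, double outputValue)
{
    return outputValue * (1.0 - outputValue) * (desiredValue - outputValue);
}

// Identity output neuron, f(x) = x: the derivative is the constant 1,
// so the gradient reduces to the plain error. (Hypothetical name.)
inline double linearOutputGradient(double desiredValue, double outputValue)
{
    const double derivative = 1.0; // d/dx of f(x) = x
    return derivative * (desiredValue - outputValue);
}
```

So for a linear output, `linearOutputGradient(1.0, 0.25)` is just the error, 0.75.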

Why do you want to use f(x) = x? You should use the sigmoid.
Old 02-09-2009, 03:16 PM   #3
Rampaging 35 Stone Welsh
 
abachler's Avatar
 
Join Date: Apr 2007
Posts: 2,926
Neural networks more or less require a sigmoid to be neural networks; otherwise it's not a neural network. That said, there are a lot of sigmoid functions. The one I use the most is f(x) = sin(atan(x)) because it is easy to implement in hardware for feedforward networks. It takes a little longer to train than some other sigmoid functions, but once trained it executes a lot faster. Another sigmoid that is fast for feedforward networks is f(x) = x / (abs(x) + 1.0).
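For reference, here is a sketch of those two functions together with the derivatives backprop would need. The function names are invented for this example. It relies on the identity sin(atan(x)) = x / sqrt(1 + x*x), which avoids the trig calls in software; whether that matters in hardware is not something this sketch addresses.

```cpp
#include <cmath>

// f(x) = sin(atan(x)), written as the equivalent x / sqrt(1 + x*x). Range (-1, 1).
inline double sinAtanSigmoid(double x)
{
    return x / std::sqrt(1.0 + x * x);
}

// Derivative of sin(atan(x)): (1 + x*x)^(-3/2).
inline double sinAtanSigmoidDerivative(double x)
{
    const double d = 1.0 + x * x;
    return 1.0 / (d * std::sqrt(d));
}

// f(x) = x / (abs(x) + 1.0). Range (-1, 1).
inline double fastSigmoid(double x)
{
    return x / (std::fabs(x) + 1.0);
}

// Derivative of x / (abs(x) + 1.0): 1 / (abs(x) + 1)^2.
inline double fastSigmoidDerivative(double x)
{
    const double d = std::fabs(x) + 1.0;
    return 1.0 / (d * d);
}
```

Both functions are bounded by -1 and 1 rather than 0 and 1, so training targets would need to be scaled to match.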

However, for calculating the error gradient, it's just the difference between the actual output and the expected output. Most feedback functions use some algorithm for assigning a lesser error to each input based on that input's influence on the output.
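One common way to do that assignment is the standard backprop rule for a hidden neuron: sum the output gradients, weighted by the connections leading to them, and multiply by the hidden neuron's activation derivative. The sketch below assumes that rule; the function and parameter names are hypothetical and not taken from any code in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Error gradient for one hidden neuron.
//   weightsToOutputs[k]  : weight from this hidden neuron to output neuron k
//   outputGradients[k]   : error gradient already computed for output neuron k
//   activationDerivative : f'(net) at this neuron, e.g. 1.0 for f(x) = x,
//                          or out * (1 - out) for the logistic sigmoid
inline double hiddenErrorGradient(const std::vector<double>& weightsToOutputs,
                                  const std::vector<double>& outputGradients,
                                  double activationDerivative)
{
    assert(weightsToOutputs.size() == outputGradients.size());

    // Each output's error is shared out according to the connecting weight.
    double weightedSum = 0.0;
    for (std::size_t k = 0; k < weightsToOutputs.size(); ++k)
        weightedSum += weightsToOutputs[k] * outputGradients[k];

    return activationDerivative * weightedSum;
}
```

With the identity activation the last factor is 1, so a hidden neuron's gradient is just the weighted sum of the output errors.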
__________________
He is free, you say. Ah! That is his misfortune… These men… [have] the most terrible, the most imperious of masters, that is, need. … They must therefore find someone to hire them, or die of hunger. Is that to be free? - Simon Linguet

Last edited by abachler; 02-09-2009 at 03:22 PM.
Old 02-09-2009, 03:37 PM   #4
Registered User
 
IdioticCreation's Avatar
 
Join Date: Nov 2006
Location: Lurking about
Posts: 212
Oh, interesting. I only chose a linear function because I was/still am under the impression that the sigmoid function can only give outputs between 0 and 1 (or maybe some other range if the function is altered). I wanted analog outputs, and I don't see how you can do that with a sigmoid function.

It seems to me that with a sigmoid function, if the output is close to one, then the neuron fires; if it is close to zero, then it does not fire. What if your output was a decimal number? Would you have to have enough output nodes to get it in binary? Or is there some other way?

edit:
Ohhh, I was thinking. Could I just say f(x) = sin(atan(x))*9? Then it would return output between -9 and 9. Will it work like that? I was also thinking maybe some kind of step function, but that's just a guess.

Last edited by IdioticCreation; 02-09-2009 at 04:01 PM.
Old 02-09-2009, 04:32 PM   #5
Crazy Fool
 
Perspective's Avatar
 
Join Date: Jan 2003
Location: Canada
Posts: 2,588
A NN does not output the answer as a computation. The idea is that each output node represents some answer and the values are like a probability distribution over answers.

For example, if you are using an NN as a classifier, you'd have an output node for each output class. Let's say our inputs are a feature vector of a text document; the output nodes could represent SPORTS, POLITICS, and ENTERTAINMENT. The (normalized) output of running the NN on a particular document about a movie may be [0.2, 0.1, 0.7], which suggests that the document is most likely an ENTERTAINMENT document.
Old 02-09-2009, 05:22 PM   #6
Registered User
 
IdioticCreation's Avatar
 
Join Date: Nov 2006
Location: Lurking about
Posts: 212
Wow, I can't believe I never realized that. I knew that was how they seemed to be used most of the time, but I thought they would perform computations as well. So in order to output an 8 digit number, I would need 10 nodes for each number, each node representing 0-9. Whichever node activates would be the number for that position. That would mean 72 output nodes though.

I was hoping I might train a neural network to solve a problem like this:
1 ? 2 = 21
45 ? 65 = 5465
98 ? 8 = 898
32 ? 43 = 3342

The question mark is just an operator, and basically the output is just a rearrangement of the inputs. A pattern.

Is this at all possible with a neural network?
Old 02-09-2009, 05:41 PM   #7
Senior software engineer
 
brewbuck's Avatar
 
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,381
Quote:
Originally Posted by IdioticCreation View Post
Oh, interesting. I only chose a linear function because I was/still am under the impression that the sigmoid function can only give outputs between 0 and 1 (or maybe some other range if the function is altered). I wanted analog outputs, and I don't see how you can do that with a sigmoid function.
Not ALL the activation functions must be sigmoids. The connection between the final hidden layer and the output layer can be linear, if you want. But unless you have nonlinearity somewhere, the entire network collapses to a single-layer perceptron and you lose computational power.
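The collapse can be checked directly. The sketch below uses a toy 2-2-2 network with no bias terms and made-up names; it only illustrates that applying two linear layers in sequence gives the same result as applying their matrix product once.

```cpp
#include <array>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// One linear layer with identity activation: y = m * x.
inline Vec2 apply(const Mat2& m, const Vec2& x)
{
    return { m[0][0] * x[0] + m[0][1] * x[1],
             m[1][0] * x[0] + m[1][1] * x[1] };
}

// Matrix product c = a * b.
inline Mat2 multiply(const Mat2& a, const Mat2& b)
{
    Mat2 c{}; // value-initialized to all zeros
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Hidden layer followed by output layer, both linear.
inline Vec2 twoLayer(const Mat2& hidden, const Mat2& output, const Vec2& x)
{
    return apply(output, apply(hidden, x));
}

// The equivalent single layer: one matrix, one multiplication.
inline Vec2 collapsed(const Mat2& hidden, const Mat2& output, const Vec2& x)
{
    return apply(multiply(output, hidden), x);
}
```

For any weights, `twoLayer` and `collapsed` agree (up to floating-point rounding), so the hidden layer adds nothing until a nonlinear activation is placed between the two matrices.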

Quote:
It seems to me that with a sigmoid function, if the output is close to one, then the neuron fires; if it is close to zero, then it does not fire.
Not really. These kinds of networks don't "fire," they just combine values and forward them on. The sigmoid serves two purposes: it provides the necessary nonlinearity, and it range-limits the values so that you don't get numeric overflow. Any monotonic, nonlinear function could be used -- the sigmoid is most common out of tradition.
__________________
"Congratulations on your purchase. To begin using your quantum computer, set the power switch to both off and on simultaneously." -- raftpeople@slashdot
Old 02-09-2009, 05:52 PM   #8
Registered User
 
IdioticCreation's Avatar
 
Join Date: Nov 2006
Location: Lurking about
Posts: 212
Quote:
Originally Posted by brewbuck View Post
Not really. These kinds of networks don't "fire," they just combine values and forward them on. The sigmoid serves two purposes: it provides the necessary nonlinearity, and it range-limits the values so that you don't get numeric overflow. Any monotonic, nonlinear function could be used -- the sigmoid is most common out of tradition.
OK, I was thinking that because in the example I was looking at they had a clampOutput function, which clamped output to 1 or 0, or if it wasn't close to either then -1. I was thinking they were clamping each neuron, but now I see it was only used on the output neurons.

At any rate, do you think I can still salvage my project? Or is it not something that can be solved with neural networks?
Old 02-09-2009, 05:53 PM   #9
Senior software engineer
 
brewbuck's Avatar
 
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,381
Quote:
Originally Posted by IdioticCreation View Post
Ohhh, I was thinking. Could I just say f(x) = sin(atan(x))*9? Then it would return output between -9 and 9. Will it work like that? I was also thinking maybe some kind of step function, but that's just a guess.
If you want output between -9 and 9, just take the output of the sigmoid and multiply by 9. You can perform that final scaling outside of the NN. Or, as I said, you could have linear connections between the final hidden layer and the output layer, but why complicate it if you don't have to?

You could also select an activation function which is not bounded. The function sigmoid(x) + x is still nonlinear, but not bounded.
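Both options can be sketched briefly. This assumes "sigmoid" in sigmoid(x) + x means the logistic function 1 / (1 + exp(-x)); the function names are hypothetical. Note that scaling by 9 gives a range of -9 to 9 only if the sigmoid itself ranges over -1 to 1 (as sin(atan(x)) does); the logistic function scaled by 9 would cover 0 to 9.

```cpp
#include <cmath>

// Logistic sigmoid, range (0, 1).
inline double logistic(double x)
{
    return 1.0 / (1.0 + std::exp(-x));
}

// Option 1: leave the network alone and rescale its output afterwards.
inline double scaleOutput(double sigmoidOutput, double scale)
{
    return sigmoidOutput * scale;
}

// Option 2: an unbounded but still nonlinear activation, f(x) = sigmoid(x) + x.
inline double unboundedActivation(double x)
{
    return logistic(x) + x;
}

// Its derivative, needed by backprop: s * (1 - s) + 1, where s = sigmoid(x).
inline double unboundedActivationDerivative(double x)
{
    const double s = logistic(x);
    return s * (1.0 - s) + 1.0;
}
```

The derivative of the unbounded version never drops below 1, so its gradient does not flatten out for large inputs the way a plain sigmoid's does.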
Old 02-09-2009, 05:55 PM   #10
Senior software engineer
 
brewbuck's Avatar
 
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,381
Quote:
Originally Posted by IdioticCreation View Post
At any rate, do you think I can still salvage my project? Or is it not something that can be solved with neural networks?
A neural network can literally learn anything, if the network is large enough and you have enough training examples. The problem will be overfitting. Can a neural network make the kind of generalization you are looking for here? Maybe, depending on how you encode the inputs and outputs.

If you were designing the weights in the network by hand, I'm sure you could come up with a way to make the network do what you are asking. The question is whether the backprop algorithm can relax the network into the right set of weights. Who knows without trying.
Old 02-09-2009, 06:03 PM   #11
Registered User
 
IdioticCreation's Avatar
 
Join Date: Nov 2006
Location: Lurking about
Posts: 212
Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.

Thank you for the help everyone!
Old 02-09-2009, 06:38 PM   #12
Senior software engineer
 
brewbuck's Avatar
 
Join Date: Mar 2007
Location: Portland, OR
Posts: 5,381
Quote:
Originally Posted by IdioticCreation View Post
Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.

Thank you for the help everyone!
I don't think you need 72 output nodes. You should only need one node to encode the value of a single digit.

Assign:
0.0 -> 0
0.1 -> 1
0.2 -> 2
...
0.9 -> 9

For both the input layer and output layer. The network should be able to zero in on that. You could even distribute the values "sigmoidally" to be friendly to your activation function.
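A small sketch of that encoding, assuming the evenly spaced mapping listed above (not the "sigmoidal" spacing) and rounding to the nearest digit when decoding. `encodeDigit` and `decodeDigit` are made-up names.

```cpp
#include <cassert>
#include <cmath>

// Map a decimal digit 0..9 to a node value 0.0, 0.1, ..., 0.9.
inline double encodeDigit(int digit)
{
    assert(digit >= 0 && digit <= 9);
    return digit / 10.0;
}

// Map a node value back to the nearest digit, clamping values that fall
// outside the encoded range.
inline int decodeDigit(double activation)
{
    long nearest = std::lround(activation * 10.0);
    if (nearest < 0)
        nearest = 0;
    if (nearest > 9)
        nearest = 9;
    return static_cast<int>(nearest);
}
```

A trained network will rarely output exactly 0.7, so decoding by rounding means an output such as 0.68 still reads as the digit 7.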

EDIT: It's just my hunch from working with these in the past, that for this task you'll probably need multiple hidden layers. Don't ask me why, just a hunch.
Old 02-09-2009, 06:56 PM   #13
Rampaging 35 Stone Welsh
 
abachler's Avatar
 
Join Date: Apr 2007
Posts: 2,926
Quote:
Originally Posted by IdioticCreation View Post
Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.

Thank you for the help everyone!
The general requirement is at most one weight for every example, which places an upper bound on the complexity of the network for a given data set. If you cannot find a network architecture under those constraints that can learn the data set, then you need a larger data set. Of course, you also need to check networks smaller than that as well, since your data set may be larger than necessary. That said, there are many examples of minimal networks that violate that rule. The traditional XOR example contains 3 nodes, with a total of 9 weights, yet there are only 4 possible examples in the data set, indicating that the network has more learning potential than the data set requires. In practice, one weight per example is a ballpark figure. It is usually within an order of magnitude of the optimal size, though.
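To compare a candidate architecture against that rule of thumb, the weights have to be counted. The sketch below assumes a fully connected feedforward network with one bias weight per non-input node; under that assumption a 2-2-1 XOR network (2 inputs, 2 hidden nodes, 1 output node) comes to the 9 weights mentioned above. `countWeights` is a made-up helper name.

```cpp
#include <cstddef>
#include <vector>

// Number of trainable weights in a fully connected feedforward network.
// layerSizes lists the node count of each layer, input layer first.
// Each non-input node gets one weight per node in the previous layer,
// plus one bias weight.
inline std::size_t countWeights(const std::vector<std::size_t>& layerSizes)
{
    std::size_t total = 0;
    for (std::size_t i = 1; i < layerSizes.size(); ++i)
        total += (layerSizes[i - 1] + 1) * layerSizes[i];
    return total;
}
```

For XOR, `countWeights({2, 2, 1})` gives (2 + 1) * 2 + (2 + 1) * 1 = 9, against only 4 distinct training examples.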
Old 02-09-2009, 09:25 PM   #14
Registered User
 
IdioticCreation's Avatar
 
Join Date: Nov 2006
Location: Lurking about
Posts: 212
Quote:
Originally Posted by brewbuck View Post
I don't think you need 72 output nodes. You should only need one node to encode the value of a single digit.

Assign:
0.0 -> 0
0.1 -> 1
0.2 -> 2
...
0.9 -> 9

For both the input layer and output layer. The network should be able to zero in on that. You could even distribute the values "sigmoidally" to be friendly to your activation function.

EDIT: It's just my hunch from working with these in the past, that for this task you'll probably need multiple hidden layers. Don't ask me why, just a hunch.
Oh wow, that's exactly what I decided to try as a last shot, but then I started getting memory corruption errors and said screw it. Now that someone else thinks it could work (not just a hunch of mine) I'm going to work on it.

I also read some stuff about people saying someone did a proof that shows no backprop network would ever need more than one hidden layer. I can't find the link, but I'll mess with the structure stuff and just do what works.

Thanks for the info abachler, I don't understand some of that at the moment, but once I start adjusting training sets and the network structure I will come back and check it out more.

You guys are great, thank you for all the help.
Old 02-10-2009, 12:53 AM   #15
Rampaging 35 Stone Welsh
 
abachler's Avatar
 
Join Date: Apr 2007
Posts: 2,926
That is incorrect, the number of hidden layers depends on the particulars of the output manifold; e.g. a smooth multivariate manifold can be approximated with no fewer than 3 layers of weights (i.e. 2 hidden layers). For example, this image would require such a network to classify whether a given point is black or white. A circle would require 4 layers. This is assuming you use arbitrary precision mathematics. In practice a network that uses finite precision floating point, like doubles, may require more layers or nodes, or both.
Attached Images
 

Last edited by abachler; 02-10-2009 at 12:57 AM.