Thread: Neural Networks - Calculating Error Gradients

  1. #1
    Registered User IdioticCreation's Avatar
    Join Date
    Nov 2006
    Location
    Lurking about
    Posts
    229

    Neural Networks - Calculating Error Gradients

    Hi

    I'm working on implementing a neural network, but I'm having trouble calculating error gradients on both output and hidden layers. I'm using the identity function as my activation function: f(x) = x. I am pretty clueless when it comes to calculus so I'm really having trouble with it.

    I found this Web page that has a good explanation
    http://www.willamette.edu/~gorr/clas...9/linear2.html

    I just can't seem to figure out how to implement it.

    I have an example of the gradient calculation for a network that uses a sigmoid activation function:
    Code:
    inline double trainer::getOutputErrorGradient( double desiredValue, double outputValue)
    {
    	//return error gradient
    	return outputValue * ( 1 - outputValue ) * ( desiredValue - outputValue );
    }
    but this isn't all that helpful.

    Any help would be greatly appreciated.

  2. #2
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    [disclaimer: it's been a long time since I've studied NNs] But anyway, a couple of points. First, a neural network with a linear activation function doesn't need any hidden layers: they can be collapsed, since the whole thing is just a linear combination (you can compute the outputs directly from the inputs).

    Second, the backprop algorithm propagates errors using the derivative of the activation function; the example you have is the closed-form derivative for the sigmoid (the most common activation function, from what I remember). In your case the derivative is constant (equal to 1), since the function is linear.
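    Roughly, something like this is what it reduces to for f(x) = x (untested sketch; the function names are made up, modeled on the names in your snippet):
    Code:
    // Sketch only: output error gradient for a linear activation f(x) = x.
    // Since f'(x) = 1, the sigmoid's outputValue * (1 - outputValue) factor
    // simply becomes 1 and the gradient is just the raw error.
    inline double getOutputErrorGradientLinear(double desiredValue, double outputValue)
    {
    	return 1.0 * (desiredValue - outputValue);
    }
    
    // Hidden-layer case: with f'(x) = 1 the gradient of a hidden node is just
    // the sum, over the nodes it feeds, of (downstream gradient * connecting weight).
    inline double getHiddenErrorGradientLinear(double weightedSumOfDownstreamGradients)
    {
    	return 1.0 * weightedSumOfDownstreamGradients;
    }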

    Why do you want to use f(x) = x? You should use the sigmoid.

  3. #3
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Neural networks more or less require a sigmoid-like activation to be neural networks; otherwise it's not really a neural network. That said, there are a lot of sigmoid functions. The one I use the most is f(x) = sin(atan(x)), because it is easy to implement in hardware for feedforward networks. It takes a little longer to train than some other sigmoid functions, but once trained it executes a lot faster. Another sigmoid that is fast for feedforward networks is f(x) = x / (abs(x) + 1.0).

    However, for calculating the error gradient it's just the difference between the actual output and the expected output. Most feedback functions use some algorithm for assigning a lesser error to each input based on that input's influence on the output.
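    For reference, those two sigmoids and their derivatives look something like this (rough sketch, written as free functions with made-up names; the derivative is what backprop multiplies the error by):
    Code:
    #include <cmath>
    
    // f(x) = sin(atan(x)), which simplifies to x / sqrt(1 + x*x); maps all of R into (-1, 1)
    double sinAtanActivation(double x)
    {
    	return x / std::sqrt(1.0 + x * x);
    }
    
    // its derivative: 1 / (1 + x*x)^(3/2)
    double sinAtanDerivative(double x)
    {
    	return 1.0 / std::pow(1.0 + x * x, 1.5);
    }
    
    // f(x) = x / (abs(x) + 1.0), also maps R into (-1, 1)
    double fastSigmoid(double x)
    {
    	return x / (std::fabs(x) + 1.0);
    }
    
    // its derivative: 1 / (abs(x) + 1)^2
    double fastSigmoidDerivative(double x)
    {
    	double d = std::fabs(x) + 1.0;
    	return 1.0 / (d * d);
    }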
    Last edited by abachler; 02-09-2009 at 03:22 PM.

  4. #4
    Registered User IdioticCreation's Avatar
    Join Date
    Nov 2006
    Location
    Lurking about
    Posts
    229
    Oh, interesting. I only chose a linear function because I was/still am under the impression that the sigmoid function can only give outputs between 0 and 1 (or maybe some other range if the function is altered). I wanted analog outputs, and I don't see how you can do that with a sigmoid function.

    It seems to me that with a sigmoid function, if the output is close to one then the neuron fires, and if it is close to zero then it does not fire. What if your output was a decimal number? Would you have to have enough output nodes to get it in binary? Or is there some other way?

    edit:
    Ohhh, I was thinking. Could I just say f(x) = sin(atan(x)) * 9? Then it would return output between -9 and 9. Will it work like that? I was also thinking maybe some kind of step function, but that's just a guess.
    Last edited by IdioticCreation; 02-09-2009 at 04:01 PM.

  5. #5
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    An NN does not output the answer as a computation. The idea is that each output node represents some answer, and the values are like a probability distribution over answers.

    For example, if you are using an NN as a classifier, you'd have an output node for each output class. Let's say our inputs are a feature vector of a text document; the output nodes could represent SPORTS, POLITICS, and ENTERTAINMENT. The (normalized) output of running the NN on a particular document about a movie may be [0.2, 0.1, 0.7], which suggests that the document is most likely an ENTERTAINMENT document.
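    Reading off the answer is then just a matter of taking the largest output node. A minimal sketch (function name and array layout are just for illustration):
    Code:
    #include <cstddef>
    
    // Pick the most likely class from the raw output-node values.
    // Dividing each value by their sum gives the [0.2, 0.1, 0.7]-style
    // distribution above, but the largest value wins either way.
    std::size_t mostLikelyClass(const double outputs[], std::size_t count)
    {
    	std::size_t best = 0;
    	for (std::size_t i = 1; i < count; ++i)
    		if (outputs[i] > outputs[best])
    			best = i;
    	return best;
    }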

  6. #6
    Registered User IdioticCreation's Avatar
    Join Date
    Nov 2006
    Location
    Lurking about
    Posts
    229
    Wow, I can't believe I never realized that. I knew that was how they seemed to be used most of the time, but I thought they would perform computations as well. So in order to output an 8 digit number, I would need 10 nodes for each digit, each node representing 0-9. Whichever node activates would be the digit for that position. That would mean 72 output nodes though.

    I was hoping I might train a neural network to solve a problem like this:
    1 ? 2 = 21
    45 ? 65 = 5465
    98 ? 8 = 898
    32 ? 43 = 3342

    The question mark is just an operator, and basically the output is just a rearrangement of the inputs. A pattern.

    Is this at all possible with a neural network?

  7. #7
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by IdioticCreation View Post
    Oh, interesting. I only chose a linear function because I was/still am under the impression that the sigmoid function can only give outputs between 0 and 1 (or maybe some other range if the function is altered). I wanted analog outputs, and I don't see how you can do that with a sigmoid function.
    Not ALL the activation functions must be sigmoids. The connection between the final hidden layer and the output layer can be linear, if you want. But unless you have nonlinearity somewhere, the entire network collapses to a single-layer perceptron and you lose computational power.

    It seems to me that with a sigmoid function, if the output is close to one then the neuron fires, and if it is close to zero then it does not fire.
    Not really. These kinds of networks don't "fire," they just combine values and forward them on. The sigmoid serves two purposes: it provides the necessary nonlinearity, and it range-limits the values so that you don't get numeric overflow. Any monotonic, nonlinear function could be used -- the sigmoid is most common out of tradition.
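    For reference, the usual logistic sigmoid and its derivative look like this (just a sketch; the derivative written in terms of the output is where the outputValue * (1 - outputValue) factor in your gradient snippet comes from):
    Code:
    #include <cmath>
    
    // logistic sigmoid: squashes any real input into (0, 1)
    double sigmoid(double x)
    {
    	return 1.0 / (1.0 + std::exp(-x));
    }
    
    // derivative expressed in terms of the output y = sigmoid(x):
    // sigmoid'(x) = y * (1 - y)
    double sigmoidDerivativeFromOutput(double y)
    {
    	return y * (1.0 - y);
    }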
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  8. #8
    Registered User IdioticCreation's Avatar
    Join Date
    Nov 2006
    Location
    Lurking about
    Posts
    229
    Quote Originally Posted by brewbuck View Post
    Not really. These kinds of networks don't "fire," they just combine values and forward them on. The sigmoid serves two purposes: it provides the necessary nonlinearity, and it range-limits the values so that you don't get numeric overflow. Any monotonic, nonlinear function could be used -- the sigmoid is most common out of tradition.
    OK, I was thinking that because the example I was looking at had a clampOutput function, which clamped the output to 1 or 0 (or to -1 if it wasn't close to either). I was thinking they were clamping each neuron, but now I see it was only used on the output neurons.

    At any rate, do you think I can still salvage my project? Or is it not something that can be solved with neural networks?

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by IdioticCreation View Post
    Ohhh, I was thinking. Could I just say f(x) = sin(atan(x)) * 9? Then it would return output between -9 and 9. Will it work like that? I was also thinking maybe some kind of step function, but that's just a guess.
    If you want output between -9 and 9, just take the output of the sigmoid and multiply by 9. You can perform that final scaling outside of the NN. Or, as I said, you could have linear connections between the final hidden layer and the output layer, but why complicate it if you don't have to?

    You could also select an activation function which is not bounded. The function sigmoid(x) + x is still nonlinear, but not bounded.
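    As a sketch (untested, names made up), that unbounded option would look like:
    Code:
    #include <cmath>
    
    // sigmoid(x) + x: still nonlinear, but not bounded
    double unboundedActivation(double x)
    {
    	return 1.0 / (1.0 + std::exp(-x)) + x;
    }
    
    // its derivative: sigmoid(x) * (1 - sigmoid(x)) + 1
    double unboundedActivationDerivative(double x)
    {
    	double s = 1.0 / (1.0 + std::exp(-x));
    	return s * (1.0 - s) + 1.0;
    }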
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by IdioticCreation View Post
    At any rate, do you think I can still salvage my project? Or is it not something that can be solved with neural networks?
    A neural network can literally learn anything, if the network is large enough and you have enough training examples. The problem will be overfitting. Can a neural network make the kind of generalization you are looking for here? Maybe, depending on how you encode the inputs and outputs.

    If you were designing the weights in the network by hand, I'm sure you could come up with a way to make the network do what you are asking. The question is whether the backprop algorithm can relax the network into the right set of weights. Who knows without trying.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  11. #11
    Registered User IdioticCreation's Avatar
    Join Date
    Nov 2006
    Location
    Lurking about
    Posts
    229
    Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.

    Thank you for the help everyone!

  12. #12
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by IdioticCreation View Post
    Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.

    Thank you for the help everyone!
    I don't think you need 72 output nodes. You should only need one node to encode the value of a single digit.

    Assign:
    0.0 -> 0
    0.1 -> 1
    0.2 -> 2
    ...
    0.9 -> 9

    For both the input layer and output layer. The network should be able to zero in on that. You could even distribute the values "sigmoidally" to be friendly to your activation function.
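    A rough sketch of that encoding (the function names are just made up for illustration):
    Code:
    #include <cmath>
    
    // map a digit 0..9 to a node value in [0.0, 0.9]
    double encodeDigit(int digit)
    {
    	return digit / 10.0;
    }
    
    // map a node's output back to the nearest digit
    int decodeDigit(double value)
    {
    	int digit = static_cast<int>(std::floor(value * 10.0 + 0.5)); // round to nearest
    	if (digit < 0) digit = 0;
    	if (digit > 9) digit = 9;
    	return digit;
    }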

    EDIT: It's just my hunch from working with these in the past, that for this task you'll probably need multiple hidden layers. Don't ask me why, just a hunch.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  13. #13
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by IdioticCreation View Post
    Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.

    Thank you for the help everyone!
    The general rule of thumb is at most one weight for every training example; that places an upper bound on the complexity of the network for a given data set. If you cannot find a network architecture under those constraints that can learn the data set, then you need a larger data set. Of course you also need to check networks smaller than that, since your data set may be larger than necessary. Now, that said, there are many examples of minimal networks that violate that rule. The traditional XOR example contains 3 nodes with a total of 9 weights, yet there are only 4 possible examples in the data set, indicating that the network has more learning potential than the data set requires. In practice the one-weight-per-example rule is a ballpark figure, but it is usually within an order of magnitude of the optimal size.
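    To illustrate the XOR case, one hand-picked set of weights for a 2-2-1 network looks like this (sketch, using a hard threshold just to make the 9 weights, counting biases, easy to see):
    Code:
    // classic 2-2-1 XOR network with hand-picked weights:
    // 2 hidden nodes + 1 output node = 3 nodes, each with 2 weights + 1 bias = 9 weights
    static int step(double x) { return x > 0.0 ? 1 : 0; }
    
    int xorNet(int a, int b)
    {
    	int h1 = step(1.0 * a + 1.0 * b - 0.5);   // hidden node 1: OR
    	int h2 = step(1.0 * a + 1.0 * b - 1.5);   // hidden node 2: AND
    	return step(1.0 * h1 - 2.0 * h2 - 0.5);   // output: OR and not AND = XOR
    }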

  14. #14
    Registered User IdioticCreation's Avatar
    Join Date
    Nov 2006
    Location
    Lurking about
    Posts
    229
    Quote Originally Posted by brewbuck View Post
    I don't think you need 72 output nodes. You should only need one node to encode the value of a single digit.

    Assign:
    0.0 -> 0
    0.1 -> 1
    0.2 -> 2
    ...
    0.9 -> 9

    For both the input layer and output layer. The network should be able to zero in on that. You could even distribute the values "sigmoidally" to be friendly to your activation function.

    EDIT: It's just my hunch from working with these in the past, that for this task you'll probably need multiple hidden layers. Don't ask me why, just a hunch.
    Oh wow, that's exactly what I decided to try as a last shot, but then I started getting memory corruption errors and said screw it. Now that someone else thinks it could work (not just a hunch of mine) I'm going to work on it.

    I also read somewhere that someone did a proof showing that no backprop network would ever need more than one hidden layer. I can't find the link, but I'll mess with the structure stuff and just do what works.

    Thanks for the info abachler, I don't understand some of that at the moment, but once I start adjusting training sets and the network structure I will come back and check it out more.

    You guys are great, thank you for all the help.

  15. #15
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    That is incorrect; the number of hidden layers depends on the particulars of the output manifold. For example, a smooth multivariate manifold can be approximated with no fewer than 3 layers of weights (i.e. 2 hidden layers); the image I posted would require such a network to classify whether a given point is black or white, and a circle would require 4 layers. This assumes you use arbitrary-precision mathematics. In practice a network that uses finite-precision floating point, like doubles, may require more layers or nodes, or both.
    Last edited by abachler; 02-10-2009 at 12:57 AM.

