## Neural Networks - Calculating Error Gradients

1. Hi

I'm working on implementing a neural network, but I'm having trouble calculating error gradients on both output and hidden layers. I'm using the identity function as my activation function: f(x) = x. I am pretty clueless when it comes to calculus so I'm really having trouble with it.

I found this Web page that has a good explanation
http://www.willamette.edu/~gorr/clas...9/linear2.html

I just can't seem to figure out how to implement it.

I have an example of the gradient calculation for a network that uses a sigmoid activation function:
Code:
```cpp
// Closed form for the sigmoid: f'(x) = f(x) * ( 1 - f(x) ),
// evaluated at the node's output, multiplied by the output error.
inline double trainer::getOutputErrorGradient( double desiredValue, double outputValue )
{
    return outputValue * ( 1 - outputValue ) * ( desiredValue - outputValue );
}
```
but this isn't all that helpful.

Any help would be greatly appreciated.

2. [Disclaimer: it's been a long time since I've studied NNs.] Anyway, a couple of points. First, a neural network with a linear activation function doesn't need any hidden layers: they can be collapsed, since the whole network is just a linear combination (you can compute the outputs directly from the inputs).

Second, the backprop algorithm propagates errors using the derivative of the activation function. The example you have is likely the closed form for the sigmoid (the most common activation, from what I remember). In your case the derivative is constant, since the function is linear.
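Concretely, since f(x) = x has f'(x) = 1, the closed form above loses its outputValue * (1 - outputValue) factor and reduces to the raw error. A sketch in the style of your trainer code (the enclosing class is omitted here):

```cpp
// For a linear activation f(x) = x, the derivative is 1, so the
// sigmoid factor outputValue * (1 - outputValue) disappears and the
// gradient is just the difference between desired and actual output.
double getOutputErrorGradient( double desiredValue, double outputValue )
{
    return desiredValue - outputValue;
}
```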

Why do you want to use f(x) = x? You should use the sigmoid.

3. Neural networks more or less require a sigmoid to be neural networks; otherwise it's not really a neural network. That said, there are a lot of sigmoid functions. The one I use the most is f(x) = sin(atan(x)), because it is easy to implement in hardware for feedforward networks. It takes a little longer to train than some other sigmoid functions, but once trained it executes a lot faster. Another sigmoid that is fast for feedforward networks is f(x) = x / (abs(x) + 1.0).
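Both functions are straightforward to implement; note that sin(atan(x)) simplifies algebraically to x / sqrt(1 + x*x), which avoids the trig calls. A sketch (the function names here are made up):

```cpp
#include <cmath>

// f(x) = sin(atan(x)); algebraically equal to x / sqrt(1 + x*x).
double sinAtan(double x)
{
    return x / std::sqrt(1.0 + x * x);
}

// f(x) = x / (|x| + 1); sometimes called the "softsign" function.
double softsign(double x)
{
    return x / (std::fabs(x) + 1.0);
}
```

Both are monotonic and squash the whole real line into (-1, 1).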

However, for calculating the error gradient it's just the difference between the actual output and the expected output. Most feedback functions use some algorithm for assigning a lesser error to each input based on that input's influence on the output.

4. Oh, interesting. I only chose a linear function because I was/still am under the impression that the sigmoid function can only give outputs between 0 and 1 (or maybe some other range if the function is altered). I wanted analog outputs, and I don't see how you can do that with a sigmoid function.

It seems to me that with a sigmoid function if the output is close to one, then the neuron fires; if it is close to zero, then it does not fire. What if your output was a decimal number? Would you have to have enough output nodes to get it in binary? Or is there some other way?

edit:
Ohhh, I was thinking: could I just say f(x) = sin(atan(x)) * 9? Then it would return output between -9 and 9. Will it work like that? I was also thinking maybe some kind of step function, but that's just a guess.

5. A NN does not output the answer as a computation. The idea is that each output node represents some answer, and the values are like a probability distribution over answers.

For example, if you are using an NN as a classifier, you'd have an output node for each output class. Let's say our inputs are a feature vector of a text document; the output nodes could represent SPORTS, POLITICS, and ENTERTAINMENT. The (normalized) output of running the NN on a particular document about a movie might be [0.2, 0.1, 0.7], which suggests that the document is most likely an ENTERTAINMENT document.

6. Wow, I can't believe I never realized that. I knew that was how they seemed to be used most of the time, but I thought they would perform computations as well. So in order to output an 8 digit number, I would need 10 nodes for each digit, each node representing 0-9. Whichever node activates would be the digit for that position. That would mean 72 output nodes, though. I was hoping I might train a neural network to solve a problem like this:
1 ? 2 = 21
45 ? 65 = 5465
98 ? 8 = 898
32 ? 43 = 3342

The question mark is just an operator, and basically the output is just a rearrangement of the inputs. A pattern.

Is this at all possible with a neural network?

7. Originally Posted by IdioticCreation
> Oh, interesting. I only chose a linear function because I was/still am under the impression that the sigmoid function can only give outputs between 0 and 1 (or maybe some other range if the function is altered). I wanted analog outputs, I don't see how you can do that with a sigmoid function.
Not ALL the activation functions must be sigmoids. The connection between the final hidden layer and the output layer can be linear, if you want. But unless you have nonlinearity somewhere, the entire network collapses to a single-layer perceptron and you lose computational power.

> It seems to me that with a sigmoid function if the output is close to one, then the neuron fires; if it is close to zero then it does not fire.
Not really. These kinds of networks don't "fire"; they just combine values and forward them on. The sigmoid serves two purposes: it provides the necessary nonlinearity, and it range-limits the values so that you don't get numeric overflow. Any monotonic, nonlinear function could be used -- the sigmoid is just the most common out of tradition.

8. Originally Posted by brewbuck
> Not really. These kinds of networks don't "fire"; they just combine values and forward them on. The sigmoid serves two purposes: it provides the necessary nonlinearity, and it range-limits the values so that you don't get numeric overflow. Any monotonic, nonlinear function could be used -- the sigmoid is most common out of tradition.
OK, I was thinking that because in the example I was looking at they had a clampOutput function, which clamped output to 1 or 0, or to -1 if it wasn't close to either. I was thinking they were clamping each neuron, but now I see it was only used on the output neurons.

At any rate, do you think I can still salvage my project? Or is it not something that can be solved with neural networks?

9. Originally Posted by IdioticCreation
> Ohhh, I was thinking: could I just say f(x) = sin(atan(x)) * 9? Then it would return output between -9 and 9. Will it work like that? I was also thinking maybe some kind of step function, but that's just a guess.
If you want output between -9 and 9, just take the output of the sigmoid and multiply by 9. You can perform that final scaling outside of the NN. Or, as I said, you could have linear connections between the final hidden layer and the output layer, but why complicate it if you don't have to?
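For example, with a sigmoid whose range is (-1, 1), such as the sin(atan(x)) function mentioned earlier, the scaling into (-9, 9) is a single multiply applied after the network has run (a sketch; the names are made up for illustration):

```cpp
#include <cmath>

// Sigmoid with range (-1, 1): sin(atan(x)) == x / sqrt(1 + x*x).
double sigmoidOut(double x)
{
    return x / std::sqrt(1.0 + x * x);
}

// Post-processing step applied OUTSIDE the network:
// map the sigmoid's (-1, 1) range to (-9, 9).
double scaleToRange(double s)
{
    return s * 9.0;
}
```

The network itself never sees the scaled values, so training is unaffected.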

You could also select an activation function which is not bounded. The function sigmoid(x) + x is still nonlinear, but not bounded.

10. Originally Posted by IdioticCreation
> At any rate, do you think I can still salvage my project? Or is it not something that can be solved with neural networks?
A neural network can literally learn anything, if the network is large enough and you have enough training examples. The problem will be overfitting. Can a neural network make the kind of generalization you are looking for here? Maybe, depending on how you encode the inputs and outputs.

If you were designing the weights in the network by hand, I'm sure you could come up with a way to make the network do what you are asking. The question is whether the backprop algorithm can relax the network into the right set of weights. Who knows without trying.

11. Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.
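As a rough sanity check on that rule of thumb, the weight count of a fully connected network follows directly from the layer sizes (a sketch; it assumes one bias weight per non-input node):

```cpp
#include <vector>

// Total weights in a fully connected feedforward network, given the
// node count of each layer; counts one bias weight per non-input node.
int countWeights(const std::vector<int>& layerSizes)
{
    int total = 0;
    for (std::size_t i = 1; i < layerSizes.size(); ++i)
        total += layerSizes[i] * (layerSizes[i - 1] + 1);
    return total;
}
```

For instance, a 2-2-1 network comes out to 9 weights, so the 60x figure would call for roughly 540 training examples.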

Thank you for the help everyone!

12. Originally Posted by IdioticCreation
> Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.
>
> Thank you for the help everyone!
I don't think you need 72 output nodes. You should only need one node to encode the value of a single digit.

Assign:
0.0 -> 0
0.1 -> 1
0.2 -> 2
...
0.9 -> 9

For both the input layer and output layer. The network should be able to zero in on that. You could even distribute the values "sigmoidally" to be friendly to your activation function.
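That mapping is simple to code: encode a digit as a tenth, and decode by rounding back (a sketch with made-up helper names):

```cpp
// Encode a digit 0-9 as an activation value in [0.0, 0.9].
double encodeDigit(int digit)
{
    return digit / 10.0;
}

// Decode an activation back to the nearest digit.
int decodeDigit(double activation)
{
    return static_cast<int>(activation * 10.0 + 0.5);
}
```

The rounding in decodeDigit gives the network some slack: any output within 0.05 of the target value decodes to the right digit.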

EDIT: It's just a hunch from working with these in the past, but for this task you'll probably need multiple hidden layers. Don't ask me why, just a hunch.

13. Originally Posted by IdioticCreation
> Thanks brewbuck, I guess it's back to the drawing board for me. I read somewhere that the number of training examples needs to be about 60 times the number of weights in a network to avoid overfitting, so having 72 output nodes would make it very difficult.
>
> Thank you for the help everyone!
The general requirement is at most one weight for every example; that places an upper bound on the complexity of the network for a given data set. If you cannot find a network architecture under those constraints that can learn the data set, then you need a larger data set. Of course, you also need to check networks smaller than that as well, since your data set may be larger than necessary. Now, that said, there are many examples of minimal networks that violate that rule. The traditional XOR example contains 3 nodes with a total of 9 weights, yet there are only 4 possible examples in the data set, indicating that the network has more learning potential than the data set requires. In practice, one weight per example is a ballpark figure, but it is usually within an order of magnitude of the optimal size.

14. Originally Posted by brewbuck
> I don't think you need 72 output nodes. You should only need one node to encode the value of a single digit.
>
> Assign:
> 0.0 -> 0
> 0.1 -> 1
> 0.2 -> 2
> ...
> 0.9 -> 9
>
> For both the input layer and output layer. The network should be able to zero in on that. You could even distribute the values "sigmoidally" to be friendly to your activation function.
>
> EDIT: It's just my hunch from working with these in the past, that for this task you'll probably need multiple hidden layers. Don't ask me why, just a hunch.
Oh wow, that's exactly what I decided to try as a last shot, but then I started getting memory corruption errors and said screw it. Now that someone else thinks it could work (not just a hunch of mine) I'm going to work on it.

I also read that someone did a proof showing no backprop network would ever need more than one hidden layer. I can't find the link, but I'll mess with the structure stuff and just do what works.

Thanks for the info abachler, I don't understand some of that at the moment, but once I start adjusting training sets and the network structure I will come back and check it out more.

You guys are great, thank you for all the help.

15. That is incorrect; the number of hidden layers depends on the particulars of the output manifold. E.g. a smooth multivariate manifold can be approximated with no fewer than 3 layers of weights (i.e. 2 hidden layers). For example, this image would require such a network to classify whether a given point is black or white. A circle would require 4 layers. This is assuming you use arbitrary-precision mathematics. In practice, a network that uses finite-precision floating point, like doubles, may require more layers or nodes, or both.