### 7.4.1 Neural Networks

Neural networks are a popular target representation for learning.
These networks are inspired by the neurons in the brain but do not
actually simulate neurons. Artificial neural networks typically
contain many fewer than the approximately *10^{11}* neurons that are
in the human brain, and the artificial neurons, called **units**, are
much simpler than their biological counterparts.

Artificial neural networks are interesting to study for a number of reasons:

- As part of neuroscience, to understand real neural systems, researchers are simulating the neural systems of simple animals such as worms, which promises to lead to an understanding about which aspects of neural systems are necessary to explain the behavior of these animals.
- Some researchers seek to automate not only the functionality of intelligence (which is what the field of artificial intelligence is about) but also the mechanism of the brain, suitably abstracted. One hypothesis is that the only way to build the functionality of the brain is by using the mechanism of the brain. This hypothesis can be tested by attempting to build intelligence using the mechanism of the brain, as well as without using the mechanism of the brain. Experience with building other machines - such as flying machines, which use the same principles, but not the same mechanism, that birds use to fly - would indicate that this hypothesis may not be true. However, it is interesting to test the hypothesis.
- The brain inspires a new way to think about computation that contrasts with currently available computers. Unlike current computers, which have a few processors and a large but essentially inert memory, the brain consists of a huge number of asynchronous distributed processes, all running concurrently with no master controller. One should not think that the current computers are the only architecture available for computation.
- As far as learning is concerned, neural networks provide a different measure of simplicity as a learning bias than, for example, decision trees. Multilayer neural networks, like decision trees, can represent any function of a set of discrete features. However, the functions that correspond to simple neural networks do not necessarily correspond to simple decision trees. Neural network learning imposes a different bias than decision tree learning. Which is better, in practice, is an empirical question that can be tested on different domains.

There are many different types of neural networks. This book considers
one kind of neural network, the **feed-forward neural
network**. Feed-forward networks can be seen as cascaded squashed linear
functions. The inputs feed into a layer of hidden units, which can
feed into layers of more hidden units, which eventually feed into the
output layer. Each of the hidden units is a squashed linear function of its
inputs.

Neural networks of this type can have as inputs any real numbers, and they have a real number as output. For regression, it is typical for the output units to be a linear function of their inputs. For classification it is typical for the output to be a sigmoid function of its inputs (because there is no point in predicting a value outside of [0,1]). For the hidden layers, there is no point in having their output be a linear function of their inputs because a linear function of a linear function is a linear function; adding the extra layers gives no added functionality. The output of each hidden unit is thus a squashed linear function of its inputs.
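As a concrete sketch of a squashed linear function, the following shows a single unit that applies the logistic (sigmoid) function to a weighted sum of its inputs. The function names here are illustrative, not from the book; the convention of treating the first weight as a bias on a constant-1 input is one common choice:

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs):
    """Squashed linear function of the inputs.
    weights[0] is the bias weight on a constant-1 input;
    weights[1:] multiply the actual inputs."""
    linear = sum(w * v for w, v in zip(weights, [1.0] + list(inputs)))
    return sigmoid(linear)

# With bias 0 and all weights 0, the unit outputs sigmoid(0) = 0.5.
print(unit_output([0.0, 0.0, 0.0], [1.0, 1.0]))  # 0.5
```

Stacking such units gives the cascaded squashed linear functions described above: the outputs of one layer of units become the inputs of the next.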

Associated with a network are the parameters for all of the linear functions. These parameters can be tuned simultaneously to minimize the prediction error on the training examples.

**Example 7.15:** Figure 7.11 shows a neural network with one hidden layer for the classification data of Figure 7.9. As explained in Example 7.11, this data set is not linearly separable. In this example, five Boolean inputs correspond to whether there is culture, whether the person has to fly, whether the destination is hot, whether there is music, and whether there is nature, and a single output corresponds to whether the person likes the holiday. In this network, there is one hidden layer, which contains two hidden units that have no a priori meaning. The network represents the following equations:

*pval(e, Likes) = f(w_{0} + w_{1}×val(e, H1) + w_{2}×val(e, H2))*

*val(e, H1) = f(w_{3} + w_{4}×val(e, Culture) + w_{5}×val(e, Fly) + w_{6}×val(e, Hot) + w_{7}×val(e, Music) + w_{8}×val(e, Nature))*

*val(e, H2) = f(w_{9} + w_{10}×val(e, Culture) + w_{11}×val(e, Fly) + w_{12}×val(e, Hot) + w_{13}×val(e, Music) + w_{14}×val(e, Nature)) ,*

where *f(x)* is an activation function.
For this example, there are 15 real numbers to be learned
(*w_{0},...,w_{14}*). The hypothesis space is thus a 15-dimensional
real space. Each point in this 15-dimensional space corresponds to a
function that predicts a value for *Likes* for every example with
*Culture*, *Fly*, *Hot*, *Music*, and *Nature* given.
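The equations of Example 7.15 can be written directly as a forward pass. This is an illustrative sketch, not the book's code: `predict_likes` is an assumed name, the logistic function stands in for *f*, and the weights are passed as a list *w[0]..w[14]* in the order the equations use them:

```python
import math

def sigmoid(x):
    """Logistic activation, standing in for f."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_likes(w, culture, fly, hot, music, nature):
    """Forward pass for the Example 7.15 network:
    5 inputs, 2 hidden units, 1 output; w holds the 15 weights."""
    h1 = sigmoid(w[3] + w[4] * culture + w[5] * fly + w[6] * hot
                 + w[7] * music + w[8] * nature)
    h2 = sigmoid(w[9] + w[10] * culture + w[11] * fly + w[12] * hot
                 + w[13] * music + w[14] * nature)
    return sigmoid(w[0] + w[1] * h1 + w[2] * h2)

# With all 15 weights zero, the prediction is sigmoid(0) = 0.5
# whatever the inputs: this point in weight space has learned nothing.
print(predict_likes([0.0] * 15, 1, 0, 1, 0, 1))  # 0.5
```

Each choice of the 15 weights is one point in the 15-dimensional hypothesis space; learning is the search for a point with low error.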

Given particular values for the parameters, and given values for the inputs, a neural network
predicts a value for each target feature.
The aim of neural network learning is, given a set of examples, to
find parameter settings that minimize the error. If
there are *m* parameters, finding the parameter settings with minimum
error involves searching through an *m*-dimensional Euclidean space.

**Back-propagation learning** is
gradient descent search through the parameter space to minimize
the sum-of-squares error.

```
1:  procedure BackPropagationLearner(X, Y, E, nh, η)
2:    Inputs
3:      X: set of input features, X = {X1, ..., Xn}
4:      Y: set of target features, Y = {Y1, ..., Yk}
5:      E: set of examples from which to learn
6:      nh: number of hidden units
7:      η: learning rate
8:    Output
9:      hidden unit weights hw[0:n, 1:nh]
10:     output unit weights ow[0:nh, 1:k]
11:
12:   Local
13:     hw[0:n, 1:nh]: weights for hidden units
14:     ow[0:nh, 1:k]: weights for output units
15:     hid[0:nh]: values for hidden units
16:     hErr[0:nh]: errors for hidden units
17:     out[1:k]: predicted values for output units
18:     oErr[1:k]: errors for output units
19:
20:   initialize hw and ow randomly
21:   hid[0] ← 1
22:   repeat
23:     for each example e in E do
24:       for each h ∈ {1, ..., nh} do
25:         hid[h] ← f(∑_{i=0}^{n} hw[i, h] × val(e, Xi))
26:       end for
27:       for each o ∈ {1, ..., k} do
28:         out[o] ← f(∑_{h=0}^{nh} ow[h, o] × hid[h])
29:         oErr[o] ← out[o] × (1 - out[o]) × (val(e, Yo) - out[o])
30:       end for
31:       for each h ∈ {0, ..., nh} do
32:         hErr[h] ← hid[h] × (1 - hid[h]) × ∑_{o=1}^{k} ow[h, o] × oErr[o]
33:         for each i ∈ {0, ..., n} do      // hw update applies for h ≥ 1
34:           hw[i, h] ← hw[i, h] + η × hErr[h] × val(e, Xi)
35:         end for
36:         for each o ∈ {1, ..., k} do
37:           ow[h, o] ← ow[h, o] + η × oErr[o] × hid[h]
38:         end for
39:       end for
40:     end for
41:   until termination
42:   return hw, ow
```

**Figure 7.12:** Back-propagation for a neural network with one layer of hidden units

Figure 7.12 gives the incremental gradient
descent version of back-propagation for networks with one
layer of hidden units. This approach assumes *n* input features, *k*
output features, and *n_{h}* hidden units. Both *hw* and *ow* are
two-dimensional arrays of weights. Note that *0:n_{k}* means the index
ranges from *0* to *n_{k}* (inclusive) and *1:n_{k}* means the index
ranges from *1* to *n_{k}* (inclusive). This algorithm assumes that
*val(e,X_{0})=1* for all *e*.

The back-propagation algorithm is similar to the linear learner of Figure 7.6, but it takes into account multiple layers and the activation function. Intuitively, for each example it involves simulating the network on that example, determining first the value of the hidden units (line 25), then the value of the output units (line 28). It then passes the error back through the network, computing the error on the output nodes (line 29) and the error on the hidden nodes (line 32). Finally, it updates all of the weights based on the derivative of the error.

Gradient descent search involves repeated evaluation of the function to be minimized - in this case the error - and its derivative. An evaluation of the error involves iterating through all of the examples. Back-propagation learning thus involves repeatedly evaluating the network on all examples. Fortunately, with the logistic function the derivative is easy to determine given the value of the output for each unit.
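The procedure of Figure 7.12 can be sketched in Python as follows. This is an illustrative implementation, not the book's code: `backprop_learner` and `predict` are assumed names, the logistic function is the activation, weights are initialized uniformly in [-0.5, 0.5] with a fixed seed, and "termination" is simply a fixed number of passes through the examples:

```python
import math
import random

def sigmoid(x):
    """Logistic activation; its derivative is out * (1 - out)."""
    return 1.0 / (1.0 + math.exp(-x))

def backprop_learner(examples, n, k, nh, eta=0.05, steps=1000):
    """Incremental back-propagation for one hidden layer.
    Each example is (xs, ys): n input values and k target values.
    A constant-1 input and hid[0] = 1 play the roles of X0 and the
    hidden-layer bias."""
    rnd = random.Random(0)                  # fixed seed for repeatability
    hw = [[rnd.uniform(-0.5, 0.5) for _ in range(nh)] for _ in range(n + 1)]
    ow = [[rnd.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(nh + 1)]
    for _ in range(steps):
        for xs, ys in examples:
            x = [1.0] + list(xs)            # x[0] = 1 (bias input)
            # Forward pass: hidden values, then output values.
            hid = [1.0] + [sigmoid(sum(hw[i][h] * x[i] for i in range(n + 1)))
                           for h in range(nh)]
            out = [sigmoid(sum(ow[h][o] * hid[h] for h in range(nh + 1)))
                   for o in range(k)]
            # Backward pass: output errors, then hidden errors.
            oerr = [out[o] * (1 - out[o]) * (ys[o] - out[o])
                    for o in range(k)]
            herr = [hid[h] * (1 - hid[h])
                    * sum(ow[h][o] * oerr[o] for o in range(k))
                    for h in range(nh + 1)]
            # Weight updates in the direction that reduces the error.
            for h in range(nh + 1):
                for o in range(k):
                    ow[h][o] += eta * oerr[o] * hid[h]
            for h in range(1, nh + 1):
                for i in range(n + 1):
                    hw[i][h - 1] += eta * herr[h] * x[i]
    return hw, ow

def predict(hw, ow, xs):
    """Forward pass through a trained network."""
    n, nh = len(hw) - 1, len(ow) - 1
    x = [1.0] + list(xs)
    hid = [1.0] + [sigmoid(sum(hw[i][h] * x[i] for i in range(n + 1)))
                   for h in range(nh)]
    return [sigmoid(sum(ow[h][o] * hid[h] for h in range(nh + 1)))
            for o in range(len(ow[0]))]
```

As a usage sketch, training on the four examples of the Boolean "and" function (standing in for the holiday data, since Figure 7.9 is not reproduced here) with two hidden units drives the predictions toward the 0/1 targets.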

**Example 7.16:** The network of Figure 7.11, with one hidden layer containing two units, trained on the data of Figure 7.9, can perfectly fit this data. One run of back-propagation with the learning rate *η=0.05*, and taking 10,000 steps, gave weights that accurately predicted the training data:

*H1 = f(-2.0×Culture - 4.43×Fly + 2.5×Hot + 2.4×Music - 6.1×Nature + 1.63)*

*H2 = f(-0.7×Culture + 3.0×Fly + 5.8×Hot + 2.0×Music - 1.7×Nature - 5.0)*

*Likes = f(-8.5×H1 - 8.8×H2 + 4.36) .*
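These learned weights can be plugged into a forward pass to compute predictions. The sketch below assumes *f* is the logistic function; since the Figure 7.9 data itself is not reproduced here, this only shows how a prediction is computed, not a check against the labels:

```python
import math

def f(x):
    """Logistic activation, written f as in Example 7.16."""
    return 1.0 / (1.0 + math.exp(-x))

def likes(culture, fly, hot, music, nature):
    """Prediction of the Example 7.16 network with its learned weights."""
    h1 = f(-2.0 * culture - 4.43 * fly + 2.5 * hot
           + 2.4 * music - 6.1 * nature + 1.63)
    h2 = f(-0.7 * culture + 3.0 * fly + 5.8 * hot
           + 2.0 * music - 1.7 * nature - 5.0)
    return f(-8.5 * h1 - 8.8 * h2 + 4.36)
```

For instance, `likes(0, 0, 0, 0, 1)` (a nature-only holiday) evaluates to roughly 0.99, a confident positive prediction; every output lies strictly between 0 and 1 because of the final sigmoid.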

The use of neural networks may seem to challenge the
**physical symbol system
hypothesis**, which relies on
symbols having meaning.
Part
of the appeal of neural networks is that, although meaning is attached to
the input and output units, the designer does not associate a meaning
with the hidden units. What the hidden units
actually represent is something that is learned. After a neural network
has been trained, it is often possible to look inside the network to
determine what a particular hidden unit actually represents. Sometimes
it is easy to express concisely in language what it represents, but
often it is not. However, arguably, the computer has an internal meaning; it can
explain its internal meaning by showing how examples map into the
values of the hidden unit.