The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).
The previous methods tried to find a compact representation of the data that can be used for future prediction. In case-based reasoning, the training examples, the cases, are stored and accessed to solve a new problem. To get a prediction for a new example, the cases that are similar to, or close to, the new example are used to predict the value of the target features of the new example. This is at one extreme of the learning problem where, unlike decision trees and neural networks, relatively little work is done offline, and virtually all of the work is performed at query time.
Case-based reasoning is used for classification and for regression. It is also applicable when the cases are complicated, for example in law, where the cases are complex legal rulings, and in planning, where the cases are previous solutions to complex problems.
If the cases are simple, one algorithm that works well is to use the $k$-nearest neighbors for some given number $k$. Given a new example, the $k$ training examples that have the input features closest to that example are used to predict the target value for the new example. The prediction could be the mode, the average, or some interpolation of the predictions of these $k$ training examples, perhaps weighting closer examples more than more distant ones.
For this method to work, a distance metric is required that measures the closeness of two examples. First define a metric for the domain of each feature, in which the values of the features are converted to a numerical scale that is used to compare values. Suppose ${X}_{i}(e)$ is a numerical representation of the value of feature ${X}_{i}$ for the example $e$. Then $({X}_{i}({e}_{1})-{X}_{i}({e}_{2}))$ is the difference between example ${e}_{1}$ and ${e}_{2}$ on the dimension defined by feature ${X}_{i}$. The Euclidean distance, the square root of the sum of the squares of the dimension differences, could be used as the distance between two examples. One important issue is the relative scales of different dimensions; increasing the scale of one dimension increases the importance of that feature. Let ${w}_{i}$ be a non-negative real-valued parameter that specifies the weight of feature ${X}_{i}$. The distance between examples ${e}_{1}$ and ${e}_{2}$ is then
$$d({e}_{1},{e}_{2})=\sqrt{\sum _{i}{w}_{i}*{({X}_{i}({e}_{1})-{X}_{i}({e}_{2}))}^{2}}.$$
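As an illustration, the weighted distance and a $k$-nearest-neighbors prediction might be sketched as follows. This is a minimal version in Python, assuming feature values have already been converted to numbers; the mode of the neighbors' targets is used, as for classification.

```python
import math
from collections import Counter

def distance(e1, e2, weights):
    """Weighted Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum(w * (x1 - x2) ** 2
                         for w, x1, x2 in zip(weights, e1, e2)))

def knn_predict(query, examples, weights, k=3):
    """Predict the target of `query` as the mode of its k nearest neighbors.

    `examples` is a list of (features, target) pairs.
    """
    nearest = sorted(examples,
                     key=lambda ex: distance(query, ex[0], weights))[:k]
    return Counter(target for _, target in nearest).most_common(1)[0][0]
```

A distance-weighted variant would replace the mode with a vote weighted by, say, the inverse of each neighbor's distance, realizing the interpolation mentioned above.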
The feature weights could be provided as input. It is also possible to learn these weights. The learning agent would try to find weights that minimize the error in predicting the value of each element of the training set, based on every other instance in the training set. This is an instance of leave-one-out cross validation.
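One simple way to realize this, sketched below, is to score each candidate weight vector by its leave-one-out error and keep the best. The function names and the use of a fixed candidate set (rather than, say, gradient-based optimization) are illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(query, examples, weights, k):
    """Mode of the k nearest training examples under weighted distance."""
    d = lambda a, b: math.sqrt(sum(w * (x - y) ** 2
                                   for w, x, y in zip(weights, a, b)))
    nearest = sorted(examples, key=lambda ex: d(query, ex[0]))[:k]
    return Counter(t for _, t in nearest).most_common(1)[0][0]

def loo_error(examples, weights, k=1):
    """Fraction misclassified when each example is predicted from the rest."""
    wrong = sum(knn_predict(x, examples[:i] + examples[i + 1:], weights, k) != y
                for i, (x, y) in enumerate(examples))
    return wrong / len(examples)

def best_weights(examples, candidates, k=1):
    """Pick the candidate weight vector with the lowest leave-one-out error."""
    return min(candidates, key=lambda w: loo_error(examples, w, k))
```

In the test below, the first feature is informative and the second is noise, so the weight vector that ignores the second feature achieves zero leave-one-out error.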
Consider using case-based reasoning on the data of Figure 7.1. Rather than converting the data to a secondary representation as in decision tree or neural network learning, case-based reasoning uses the examples directly to predict the value for the user action in a new case.
Suppose a learning agent wants to classify example $e_{20}$, for which the author is unknown, the thread is a follow-up, the length is short, and it was read at home. First the learner tries to find similar cases. There is an exact match in example $e_{11}$, so it may want to predict that the user does the same action as for example $e_{11}$ and thus skips the article. It could also include other close examples.
Consider classifying example $e_{19}$, where the author is unknown, the thread is new, the length is long, and it was read at work. In this case there are no exact matches. Consider the close matches. Examples $e_{2}$, $e_{8}$, and $e_{18}$ agree on the features $\mathit{Author}$, $\mathit{Thread}$, and $\mathit{Where\_read}$. Examples $e_{10}$ and $e_{12}$ agree on $\mathit{Thread}$, $\mathit{Length}$, and $\mathit{Where\_read}$. Example $e_{3}$ agrees on $\mathit{Author}$, $\mathit{Length}$, and $\mathit{Where\_read}$. Examples $e_{2}$, $e_{8}$, and $e_{18}$ predict $\mathit{Reads}$, but the other examples predict $\mathit{Skips}$. So what should be predicted? The decision tree algorithm says that $\mathit{Length}$ is the best predictor, and so $e_{2}$, $e_{8}$, and $e_{18}$ should be ignored. For the sigmoid linear learning algorithm, the parameter values in Example 7.11 similarly predict that the reader skips the article.
A case-based reasoning algorithm to predict whether the user will or will not read this article must determine the relative importance of the dimensions.
One of the problems in case-based reasoning is accessing the relevant cases. A kd-tree is a way to index the training examples so that training examples close to a given example can be found quickly. Like a decision tree, a kd-tree splits on input features, but at the leaves are subsets of the training examples. In building a kd-tree from a set of examples, the learner tries to find an input feature that partitions the examples into sets of approximately equal size, and then builds kd-trees for the examples in each partition. This division stops when there are only a few examples at each leaf. A new example is filtered down the tree, as in a decision tree. The exact matches will be at the leaf found. However, the examples at the leaves of the kd-tree could be quite distant from the example to be classified; they agree on the values down the branch of the tree but could disagree on the values of all other features. The same tree can be used to search for examples that differ on one feature tested in the tree, by allowing one branch with a different value to be followed when filtering a new case down the tree. See Exercise 20.
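A simplified kd-tree might look as follows. This sketch assumes numeric features, splits on the features in a fixed order at the median value, and stops at small leaves; a fuller implementation would choose the feature giving the most balanced split and would also support the one-feature-different search described above.

```python
def build_kd_tree(examples, features, min_size=2):
    """Recursively split (features, target) examples into a kd-tree.

    Each internal node is (feature_index, threshold, left, right);
    each leaf is a list of examples. Splitting at the median value of
    the chosen feature gives partitions of approximately equal size.
    """
    if len(examples) <= min_size or not features:
        return examples
    f = features[0]  # simplification: features are used in a fixed order
    values = sorted(x[f] for x, _ in examples)
    threshold = values[len(values) // 2]
    left = [e for e in examples if e[0][f] < threshold]
    right = [e for e in examples if e[0][f] >= threshold]
    if not left or not right:  # split failed to separate anything
        return examples
    rest = features[1:]
    return (f, threshold,
            build_kd_tree(left, rest, min_size),
            build_kd_tree(right, rest, min_size))

def find_leaf(tree, query):
    """Filter a new example down to the leaf holding its exact matches."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if query[f] < threshold else right
    return tree
```

Filtering a query down the tree takes time proportional to the tree's depth, rather than requiring a scan of all training examples.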
Case-based reasoning is also applicable when the cases are more complicated, for example, when they are legal cases or previous solutions to planning problems. In this scenario, the cases are carefully chosen and edited to be useful. Case-based reasoning consists of a cycle of the following four steps:
Given a new case, retrieve similar cases from the case base.
Adapt the retrieved cases to fit to the new case.
Evaluate the solution and revise it based on how well it works.
Decide whether to retain this new case in the case base.
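The cycle above can be sketched as a single function, where `retrieve`, `adapt`, and `works` are hypothetical problem-specific procedures, and retention is simplified to keeping every case whose adapted solution succeeds:

```python
def cbr_solve(problem, case_base, retrieve, adapt, works, k=1):
    """One pass of the retrieve-adapt-evaluate-retain cycle.

    `retrieve(problem, case_base, k)` returns the k most similar cases,
    `adapt(case, problem)` fits a case's solution to the new problem, and
    `works(problem, solution)` evaluates the proposed solution.
    """
    for case in retrieve(problem, case_base, k):
        solution = adapt(case, problem)
        if works(problem, solution):               # evaluate / revise
            case_base.append((problem, solution))  # retain if useful
            return solution
    return None  # fall back to another solver or a human
```

Returning `None` corresponds to the situation below where none of the retrieved cases works and some other method must be attempted.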
If the case retrieved works for the current situation, it should be used. Otherwise, it may need to be adapted. The revision may involve other reasoning techniques, such as using the proposed solution as a starting point to search for a solution, or a human could do the adaptation in an interactive system. The new case and the solution can then be saved if retaining it will help in the future.
A common example of a case-based reasoning system is a help desk that users call with problems to be solved. Case-based reasoning could be used by the diagnostic assistant to help users diagnose problems on their computer systems. When users give a description of a problem, the closest cases in the case base are retrieved. The diagnostic assistant could recommend some of these to the user, adapting each case to the user’s particular situation. An example of adaptation is to change the recommendation based on what software the user has, what method they use to connect to the Internet, and what model of printer they have. If one of the adapted cases works, that case is added to the case base, to be used when another user asks a similar question. In this way, all of the common cases will eventually be in the case base.
If none of the cases found works, some other method is attempted to solve the problem, perhaps by adapting other cases or having a human help diagnose the problem. When the problem is finally solved, the solution is added to the case base.
A case can be removed from the case base if similar cases all give the same recommendation. In particular, $C$ is redundant and can be removed if there are cases $C_i$ such that any case that uses $C$ as the closest case would, if $C$ were not there, use one of the $C_i$, and these would give the same recommendation.
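Since this condition quantifies over all possible queries, an implementation can only approximate it; the sketch below checks the condition over a sample of queries, with `dist` and `recommendation` as illustrative problem-specific functions.

```python
def is_redundant(c, case_base, sample_queries, dist, recommendation):
    """Approximate redundancy test over a sample of queries.

    `c` is judged redundant if, for every sampled query whose closest
    case is `c`, the closest remaining case gives the same recommendation.
    """
    others = [case for case in case_base if case is not c]
    for q in sample_queries:
        closest = min(case_base, key=lambda case: dist(q, case))
        if closest is c:
            fallback = min(others, key=lambda case: dist(q, case))
            if recommendation(fallback) != recommendation(c):
                return False
    return True
```

The quality of the approximation depends on how well the sampled queries cover the region of the space where `c` is the closest case.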