7 Supervised Machine Learning

7.10 References and Further Reading

For good overviews of machine learning see Briscoe and Caelli [1996], Mitchell [1997], Duda et al. [2001], Bishop [2008], Hastie et al. [2009] and Murphy [2012]. Halevy et al. [2009] discuss big data. Domingos [2012] overviews issues in machine learning. The UCI machine learning repository [Lichman, 2013] is a collection of classic machine learning data sets.

The collection of papers by Shavlik and Dietterich [1990] contains many classic learning papers. Michie et al. [1994] give an empirical evaluation of many learning algorithms on multiple problems. Davis and Goadrich [2006] discuss precision, recall, and ROC curves. Settles [2012] overviews active learning.

The approach to combining expert knowledge and data was proposed by Spiegelhalter et al. [1990].

Decision tree learning is discussed by Breiman et al. [1984] and Quinlan [1986]. For an overview of a more mature decision tree learning tool see Quinlan [1993].

Ng [2004] compares L1 and L2 regularization for logistic regression.

Goodfellow et al. [2016] provide a modern overview of neural networks and deep learning. For classic overviews of neural networks see Hertz et al. [1991] and Bishop [1995]. McCulloch and Pitts [1943] define a formal neuron, and Minsky [1952] showed how such representations can be learned from data. Rosenblatt [1958] introduced the perceptron. Back-propagation is introduced in Rumelhart et al. [1986]. LeCun et al. [1998] describe how to effectively implement back-propagation. Minsky and Papert [1988] analyze the limitations of neural networks. LeCun et al. [2015] review how multilayer neural networks have been used for deep learning in many applications. Hinton et al. [2012] review neural networks for speech recognition, Goldberg [2016] for natural language processing, and Krizhevsky et al. [2012] for vision. Rectified linear units are discussed by Glorot et al. [2011]. Nocedal and Wright [2006] provide practical advice on gradient descent and related methods. Karimi et al. [2016] analyze how many iterations of stochastic gradient descent are needed.

Random forests were introduced by Breiman [2001], and are compared by Dietterich [2000a] and Denil et al. [2014]. For reviews of ensemble learning see Dietterich [2002]. Boosting is described in Schapire [2002] and Meir and Rätsch [2003].

For overviews of case-based reasoning see Kolodner and Leake [1996] and López [2013]. For a review of nearest-neighbor algorithms, see Duda et al. [2001] and Dasarathy [1991]. The dimension-weight learning nearest-neighbor algorithm is from Lowe [1995].

Version spaces were defined by Mitchell [1977]. PAC learning was introduced by Valiant [1984]. The analysis here is due to Haussler [1988]. Kearns and Vazirani [1994] give a good introduction to computational learning theory and PAC learning. For more details on version spaces and PAC learning, see Mitchell [1997].

For research results on machine learning, see the journals Journal of Machine Learning Research (JMLR) and Machine Learning, the proceedings of the annual International Conference on Machine Learning (ICML) and of the Conference on Neural Information Processing Systems (NIPS), general AI journals such as Artificial Intelligence and the Journal of Artificial Intelligence Research, and many specialized conferences and journals.