7.9 References and Further Reading

For overviews of machine learning, see Mitchell [1997], Duda et al. [2001], Bishop [2008], Hastie et al. [2009], and Murphy [2022]. Halevy et al. [2009] discuss the unreasonable effectiveness of big data.

The UCI machine learning repository [Dua and Graff, 2017] is a collection of classic machine learning datasets. Kaggle (https://www.kaggle.com) runs machine learning competitions and has many datasets available for testing algorithms.

The collection edited by Shavlik and Dietterich [1990] contains many classic learning papers. Michie et al. [1994] give an empirical evaluation of many early learning algorithms on multiple problems. Davis and Goadrich [2006] discuss precision, recall, and ROC curves. Settles [2012] overviews active learning.
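
As a concrete illustration of the metrics mentioned above, the following is a minimal sketch that computes precision, recall, and a ROC curve; the use of scikit-learn, the synthetic data, and the logistic regression model are illustrative assumptions, not the setup of the cited works.

```python
# Minimal sketch (assumes scikit-learn): precision, recall, and a ROC curve
# for a binary classifier trained on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
fpr, tpr, thresholds = roc_curve(y_test, y_prob)   # points on the ROC curve
print("AUC:      ", roc_auc_score(y_test, y_prob))
```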

The approach to combining expert knowledge and data was proposed by Spiegelhalter et al. [1990].

Logistic regression dates back to Verhulst in 1832 [see Cramer, 2002], and has a long history in many areas, including the work for which McFadden [2000] was awarded the Nobel Prize in economics. Decision tree learning is discussed by Breiman et al. [1984] and Quinlan [1993]. Gelman et al. [2020] provide theory and practice for linear and logistic regression, with many practical examples. Ng [2004] compares L1 and L2 regularization for logistic regression. Hierarchical softmax is due to Morin and Bengio [2005].
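
As a hedged illustration of the L1/L2 comparison mentioned above, the following minimal sketch fits logistic regression with each penalty; the use of scikit-learn, the synthetic data, and the regularization strength C are illustrative assumptions, not the experimental setup of Ng [2004].

```python
# Minimal sketch (assumes scikit-learn): logistic regression with L1 vs L2
# regularization; with L1, many learned weights are driven exactly to zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("nonzero weights with L1:", np.sum(l1.coef_ != 0))
print("nonzero weights with L2:", np.sum(l2.coef_ != 0))
```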

Feurer and Hutter [2019] overview automatic hyperparameter optimization, or hyperparameter tuning, part of autoML [Hutter et al., 2019], which involves search to choose algorithms and hyperparameters for machine learning. Shahriari et al. [2016] review Bayesian optimization, one of the most useful tools for hyperparameter optimization. Stanley et al. [2019] and Real et al. [2020] show how genetic algorithms can be successfully used for autoML.
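
As a minimal sketch of hyperparameter tuning (far simpler than the Bayesian optimization and autoML systems cited above), the following uses cross-validated grid search over a decision tree; the use of scikit-learn, the grid values, and the synthetic data are illustrative assumptions.

```python
# Minimal sketch (assumes scikit-learn): tuning the hyperparameters of a
# decision tree by cross-validated grid search; grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"max_depth": [2, 4, 8, None],
              "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)
```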

Random forests were introduced by Breiman [2001], and are compared by Dietterich [2000a] and Denil et al. [2014]. For a review of ensemble learning, see Dietterich [2002]. Boosting is described in Schapire [2002] and Meir and Rätsch [2003]. Gradient tree boosting is due to Friedman [2001]. The notation used in the section on gradient tree boosting follows Chen and Guestrin [2016], who develop the algorithm in much more generality, and discuss many efficiency refinements. XGBoost [Chen and Guestrin, 2016] and LightGBM [Ke et al., 2017] are modern implementations (see Appendix B).
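
As a minimal sketch contrasting the two ensemble methods above, the following fits a random forest and a gradient-boosted tree ensemble to the same data; the use of scikit-learn, the synthetic data, and the hyperparameter values are illustrative assumptions, and XGBoost and LightGBM provide their own, more efficient interfaces.

```python
# Minimal sketch (assumes scikit-learn): a random forest and a gradient-boosted
# tree ensemble trained and evaluated on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
boosted = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=0)

for name, model in [("random forest", forest), ("gradient boosting", boosted)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```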

The no-free-lunch theorem for machine learning is due to Wolpert [1996].

Rudin et al. [2022] review interpretable machine learning, which involves building models that people can understand.

For research results on machine learning, see the journals Journal of Machine Learning Research (JMLR) and Machine Learning, the annual International Conference on Machine Learning (ICML), the Conference on Neural Information Processing Systems (NeurIPS), general AI journals such as Artificial Intelligence and the Journal of Artificial Intelligence Research, and many specialized conferences and journals.