New Book Review: "Applied Predictive Analytics"
New book review for Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst, by Dean Abbott, Wiley, 2014, reposted here:
Copy provided by Amazon.
Abbott is sure to discuss his philosophy and approach at the outset. "This book describes the predictive modeling process from the perspective of a practitioner rather than a theoretician. Predictive analytics is both science and art. The science is relatively easy to describe, but to do the subject justice requires considerable knowledge of mathematics. I don't believe a good practitioner needs to understand the mathematics of the algorithms to be able to apply them successfully, any more than a graphic designer needs to understand the mathematics of image correction algorithms to apply sharpening filters well. However, the better practitioners understand the effects of changing modeling parameters, the implications of the assumptions of algorithms in predictions, and the limitations of the algorithms, the more successful they will be, especially in the most challenging modeling projects."
"Science is covered in this book, but not in the same depth you will find in academic treatments of the subject. Even though you won't find sections describing how to decompose matrices while building linear regression models, this book does not treat algorithms as black boxes. The book describes what I would be telling you if you were looking over my shoulder while I was solving a problem, especially when surprises occur in the data. How can algorithms fool us into thinking their performance is excellent when they are actually brittle? Why did I bin a variable rather than just transform it numerically? Why did I use logistic regression instead of a neural network, or vice versa? Why did I build a linear regression model to predict a binary outcome? These are the kinds of questions, the art of predictive modeling, that this book addresses directly and indirectly."
Amazon provided me a copy of this book to review after it was first published, and since I have a personal policy never to write a book review until I have actually read the book, this review is unfortunately being written a bit late. More than 4 years have passed since this book was published, and because the field has evolved considerably during that time, it is showing its age at least a little, although this is mainly confined to Chapter 12 ("Model Deployment"). Considerable strides have been made in model deployment, the namesake task of this chapter, although much the same can be said for artifact deployment in non-analytics areas of work. It is also in this chapter that Abbott covers various deployment options, and one of my only grievances about the content is his assertion that deploying models to databases is a viable option. While I realize that some commercial products now provide this capability, modelers should follow the software development best practice of not embedding logic in databases.
After providing an overview of predictive analytics, the author delves into setting up the problem, understanding data, and preparing data, followed by discussions of itemsets and association rules, descriptive modeling, interpreting descriptive models, predictive modeling, assessing predictive models, model ensembles, text mining, model deployment, and case studies. Potential readers might be interested in knowing that the author references CRISP-DM (Cross-Industry Standard Process for Data Mining) throughout, mainly because custom processes typically fall in line with the steps it outlines. It describes the most commonly applied steps and arguably provides needed structure for practitioners, reminding them of what needs to be accomplished, including documentation and reporting, which is especially valuable for new modelers.
The chapters that I especially appreciated are chapter 3 ("Data Understanding"), chapter 4 ("Data Preparation") and chapter 5 ("Itemsets and Association Rules"), which in my view cover topics often skipped or downplayed by other texts. As the author comments in the summary to chapter 3, the processes involved in understanding the data after it has been collected cannot be rushed: "problems missed during data understanding will come back to haunt the analyst during modeling". Chapter 4 also provides warnings. "Do not consider data preparation a process that concludes after the first pass. This stage is often revisited once problems or deficiencies are discovered while building models." Additionally, "Overfitting the data is perhaps the biggest reason predictive models fail when deployed. You should always take care to construct the sampling strategy well so that overfitting can be identified and models adjusted appropriately."
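To make the point about sampling concrete, here is a minimal sketch of my own, not taken from the book. It assumes synthetic data with deliberately noisy labels and a toy nearest-neighbor model that simply memorizes its training rows; the function names and the 300/100 split are illustrative choices only. The idea is that a held-out sample exposes overfitting that the training sample hides.

```python
import random


def make_data(n, rng, noise=0.3):
    """Synthetic rows (x, y): y is 1 when x > 0.5, with a fraction of labels flipped."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # label noise that a memorizing model will happily learn
        rows.append((x, y))
    return rows


def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training row."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    """Share of rows whose label matches the 1-nearest-neighbor prediction."""
    return sum(predict_1nn(train, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(400, rng)
rng.shuffle(data)
train, holdout = data[:300], data[300:]  # hold out rows the model never sees

train_acc = accuracy(train, train)      # scored on the rows it memorized
holdout_acc = accuracy(train, holdout)  # scored on unseen rows
gap = train_acc - holdout_acc

print(f"train accuracy:   {train_acc:.2f}")
print(f"holdout accuracy: {holdout_acc:.2f}")
print(f"gap:              {gap:.2f}")
```

A large gap between the two numbers is the warning sign. Without the held-out sample, the perfect score on the training rows would look like an excellent model.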
The bulk of the material, however, is concentrated in chapters 6 through 11, and while the material on descriptive modeling, predictive modeling, model ensembles, and text mining is presented well, these discussions largely provide a survey of the different approaches, and in 2019 more details can be found elsewhere. One aspect that potential readers should keep in mind is that while the author focuses on the practical, this book doesn't provide any code and doesn't discuss any open source frameworks or commercial products. In a sense the author rises above such specific implementations, so come here for practical approaches and lessons learned, but not hard details beyond the functional. However, the functional is often what data professionals need, as I have found in my own consulting experience that field practitioners often can't see the functional forest for the framework trees.