New Book Review: "Data Science for Business"
New book review for Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking, by Foster Provost and Tom Fawcett, O'Reilly Media, 2013, reposted here:
As the authors discuss in the preface to this text, the content presented here provides a conceptual foundation for many well-known data mining algorithms, and is therefore neither about algorithms nor a replacement for a book about algorithms. In their words, "We believe there is a relatively small set of fundamental concepts or principles that underlie techniques for extracting useful knowledge from data." While this book does contain significant technical content, the conceptual approach that the authors take revolves around (1) how data science fits in the organization and the competitive landscape, (2) ways of thinking data-analytically, which help identify appropriate data and consider appropriate methods, and (3) discussions on extracting the knowledge from data that undergirds the vast array of data science tasks and algorithms.
The content is broken down into 14 chapters, followed by two appendixes that outline factors to consider when assessing potential data mining projects and provide another sample proposal beyond the one presented in Chapter 13 ("Data Science and Business Strategy"). After an introduction to data-analytic thinking, the authors discuss the following topics: the data mining process; supervised versus unsupervised data mining; identifying informative attributes; segmenting data by progressive attribute selection; fitting a model to data; overfitting and its avoidance; similarity, neighbors, and clusters; model evaluation; model performance; evidence and probabilities; representing and mining text; and analytical engineering. These are followed by some additional tasks and techniques that build on the foundation presented in earlier chapters, and by a discussion of data science and strategy.
The content that the authors present will likely be weighty for many potential readers. While I agree with the authors that the math is kept to a minimum, this weightiness is likely due to the number of topics discussed as well as the detail of many of the discussions. In short, most will not find this book a short read, although I did find it curious how many book reviews were written so soon after the publication date. Readers who do not have time to read the entire text but are interested in this space would be well advised to read at least Chapter 1 ("Introduction: Data-Analytic Thinking"), Chapter 2 ("Business Problems and Data Science Solutions"), and Chapter 13 ("Data Science and Business Strategy"), followed by the appendixes, before moving on to the remaining chapters, where the bulk of the material is presented.
As a consultant architect, I especially appreciated the discussion in Chapter 1 ("Introduction: Data-Analytic Thinking") on where data science fits in the context of other data-related processes in the organization, as well as its relationship with Big Data (it is refreshing to read material that presents the correct definition of the term, unlike some other publications in this space). In addition, I enjoyed the discussions in Chapter 7 ("Decision Analytic Thinking I: What Is a Good Model?") and Chapter 11 ("Decision Analytic Thinking II: Toward Analytical Engineering"). The other chapters that I especially appreciated include Chapter 3 ("Introduction to Predictive Modeling: From Correlation to Supervised Segmentation"), Chapter 4 ("Fitting a Model to Data"), Chapter 5 ("Overfitting and Its Avoidance"), and Chapter 6 ("Similarity, Neighbors, and Clusters"). And now that I have listed these favorites, I realize that together they make up half the text.
While I understand that this book is based on an MBA course that Provost taught at NYU over the past ten years, and "The Wall Street Journal" reported just today that analytics is becoming increasingly prevalent in MBA coursework, I am impressed by the level of detail in some of the discussions that the authors present, even though the varying language used in some of the segments seems to point to several different authors. As a visual thinker, however, I found that it was the abundance of diagrams that continued to grab my attention and prompted me to understand them in light of the accompanying text. Recommended reading for anyone new to data science, or for anyone concentrating in one area of the field who seeks a better understanding of the big picture.