
















| Best Sellers Rank | #807,965 in Books; #242 in Data Modeling & Design (Books); #371 in Data Processing; #667 in Python Programming |
| Customer Reviews | 4.4 out of 5 stars 406 Reviews |
K**E
Outstanding Book Providing Exactly What The Title Says
I thoroughly enjoyed this book, one of my favorite books ever on programming. It does three things superbly: covers the basic low-level tools of a data scientist (the "from scratch" part), gives a great overview of useful Python programming examples for those new to Python, and gives an amazingly succinct yet high-level overview of the mathematics and statistics required for data science. At first I was very worried about this book based on the first few chapters, for the one reason that the author was cracking jokes throughout the text, and I thought that if it kept up for the rest of the book I was going to be very upset. But it did not happen, and it turns out to have been a very reasonable way to ease into this complicated subject. The author steps through the toolbox of the data scientist, chapter by chapter, giving useful, insightful, clear pieces of code and textual explanations of each topic. So, for those new to data science, it gives just enough to get the basic idea of a concept in terms of code and mathematical explanation, and then moves on to the next topic. It is often said that in writing, less is better, and this book gets things down to their essence. That is one of the great things about the book: each chapter is about 20 pages long (over 25 chapters), so each chapter can be read and the code even exercised in about an hour. Further, the references at the end of each chapter point the reader to expanded information at the level of one or more entire textbooks or references. Thus the book can be seen as boiling down a 25-volume set of highly technical subject matter into roughly 300 pages. The topics that were explored best seem to be the ones on probability, working with data, regression, clustering, and databases (SQL). Some of the small but dense code samples were tough to follow, but that is due to their algorithmic complexity, such as those for logistic regression and MapReduce.
Occasionally the author uses a term that is not defined or in the index (such as data munging, which I still haven't looked up to see what it means). There are only a small number of typos, which indicates good editing. While the Python crash course was pretty good, Python is a vast language and there could have been more to that section. I read this book from cover to cover and stepped logically through all the code (but did not actually run any of it), and I would wholeheartedly recommend this book for anyone wanting to work in the area of data science or its related fields, such as big data engineering or data analysis.
D**Y
A clear explanation of some of the concepts central to data science
The book begins with the basics of the Python language in a chapter entitled "A Crash Course in Python." Grus recommends the Anaconda distribution of Python 2.7, as do I. It is free, includes Python, NumPy, SciPy, matplotlib, and IPython, which are used in the book, and includes pandas, which we will use to handle financial data. This is not the book I would recommend for a person new to Python to learn the language, but it establishes the style and notation used for the remainder of the book. Chapters 4, 5, and 6 are quick reviews of linear algebra and the Python data structures used, frequentist statistics, and probability, respectively. Chapter 7 discusses hypothesis and inference, and has a nice discussion of the beta distribution and its use in describing the "prior" distribution for Bayesian analysis. Chapter 8 begins to get into the data science with a description of the gradient descent method of finding the set of parameter values that maximize (or minimize) the objective function. The "from scratch" approach shows all the details. Chapter 10, Working with Data, begins with methods for exploring the data: examining the distribution, plotting one-dimensional data, comparing multiple data series, normalizing, rescaling, and dimensionality reduction. Chapter 11 begins machine learning: models, overfitting, underfitting, the bias-variance tradeoff, and feature extraction. Chapter 12 continues with k-nearest neighbors and the curse of dimensionality. Chapter 13 illustrates naive Bayes by implementing a spam filter. Chapters 14 and 15 treat linear regression and multiple regression, fitting a model to data, and regularization to limit the tendency to overfit. Chapter 16 explains the logistic function and logistic regression. Examples look at measures of goodness of fit. The concept of the support vector machine is explained, although the mathematics is beyond the "from scratch" scope.
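To make the gradient descent idea mentioned for Chapter 8 concrete, here is a minimal sketch in plain Python. It is not taken from the book: the function names (`gradient_step`, `sum_of_squares_gradient`, `minimize`), the step size, and the choice of objective f(v) = sum of v_i squared are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move from v along the gradient; a negative step_size moves downhill."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]


def sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of the illustrative objective f(v) = sum(v_i ** 2), i.e. 2 * v."""
    return [2 * v_i for v_i in v]


def minimize(gradient_fn: Callable[[Vector], Vector],
             start: Vector,
             step_size: float = 0.1,
             steps: int = 1000) -> Vector:
    """Repeatedly step against the gradient for a fixed number of iterations."""
    v = list(start)
    for _ in range(steps):
        v = gradient_step(v, gradient_fn(v), -step_size)
    return v


# Hypothetical starting point; the minimum of this objective is the zero vector.
result = minimize(sum_of_squares_gradient, [3.0, -4.0, 5.0])
print(result)
```

To maximize an objective instead, the same loop can step with a positive `step_size`. A fixed step size and iteration count keep the sketch short; a real implementation would usually add a stopping tolerance.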
Chapter 17 has a nice explanation of decision trees (the models that result from rule-based trading system development, such as AmiBroker). Entropy, as it applies to information content, is well explained and used to partition data as the rules are created. Random forests, one of the ensemble techniques for machine learning, are described in surprisingly concise code. Neural networks are described in chapter 18, including code for a feed-forward, backpropagation network that identifies digits. The interpretation of the weights of each of the nodes gives insight into the workings of neural networks. The book continues with discussions of clustering, natural language processing, network analysis, recommender systems, and databases. While this is not the best book to learn Python, machine learning, or model development, it is valuable in explaining each of these topics with fully disclosed logic and computer code. This book gets five stars based on meeting its objectives: to clearly illustrate some of the central concepts of data science.
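As a rough illustration of how entropy can guide the partitioning of data in a decision tree, the sketch below computes the Shannon entropy of a list of class labels and the weighted entropy of a candidate split. It is a generic illustration, not the book's code; the function names and the "yes"/"no" labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List


def entropy(labels: List[Hashable]) -> float:
    """Shannon entropy (in bits) of the class labels in one subset."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def partition_entropy(subsets: List[List[Hashable]]) -> float:
    """Entropy of a split: each subset's entropy weighted by its share of the data."""
    total = sum(len(subset) for subset in subsets)
    return sum(len(subset) / total * entropy(subset) for subset in subsets)


# Hypothetical labels: a split that separates the classes perfectly has
# entropy 0, while a split that leaves both halves mixed 50/50 has entropy 1.
pure_split = partition_entropy([["yes", "yes"], ["no", "no"]])
mixed_split = partition_entropy([["yes", "no"], ["yes", "no"]])
print(pure_split, mixed_split)
```

A tree-building procedure of this kind would typically pick, at each node, the attribute whose split gives the lowest partition entropy.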
U**N
Jump Start for understanding Data Science.
Minus one star for using outdated Python 2.7. Essentially ALL data science tools you are likely to run across have been updated to Python 3.4+. I would have knocked off two stars, but this book is actually quite good and delivers on its title. This is a very basic book on Data Science, but it gives a broad overview which helps you get a perspective on the tools that are available. This book teaches methods by developing actual code for these methods. You will find in work situations that you will use library functions instead of "rolling your own," but this book helps bring the details together by having you actually code these techniques. I support this approach 100%. Once you have this overview, you can drill down into specifics with other materials like textbooks or cookbooks. I did flinch at some of the explanations in this book, but it really is a "from Scratch" approach and some things are simplified to avoid distractions. This book also teaches basic Python 2.7 with a quick start chapter, so it is self-contained for any scientist or engineer who wants to get started adding Data Science techniques to their repertoire.
T**F
Great introduction, Easy read. Buy it.
This is a great book-- well written, easy to digest and informative. I've been in Data Mining and Statistical Analysis for a little over a decade now; I was looking for a book to share with my team to ensure we were all up-to-speed on some foundational concepts: this book is it. EDIT: I also forgot to mention, it has probably the best get-up-and-running in Python introduction I've seen (see, e.g., Chapter 2, ~20pp.) It's the right size and correct coverage for the content and the author's sense of humor (indeed, that of a data scientist) resonates with the audience. Solid introduction, even better review or brief explanation of commonly encountered topics. One of the best O'Reilly books I've read in a long time-- in fact, a technical book at the level I used to expect from O'Reilly.
0**1
Great content delivered with humor and style
The phrase "from scratch" made me think this was a beginner book, but it's not really. It builds step-by-step from first principles to quite advanced algorithms and topics. For me, it's usually much easier to understand a mathematical or statistical concept if I can implement it or see it implemented in code, and so I found the approach taken here very effective. As a bonus, Joel has a very entertaining sense of humor and writes some seriously elegant Python code. I learned as much about coding as anything else. I swear I felt like I was in the movie Inception while trying to unpack some of his amazingly efficient list comprehensions. I'll be returning to this book again and again. Great job Joel!
W**D
Not quite what I was expecting...
As a Python programmer interested in learning applications of data science, I was really excited to come across this book. Unfortunately, what I expected to get out of this book and what I eventually got were not quite in line with one another. My expectation for this book was that it would give a good overview of how to apply statistical libraries in Python (NumPy, scikit, pandas) in the field of data science. Instead, the book shows you how to build a lot of formulas "from scratch" and apply them conceptually. As someone not very classically trained in statistics, I had a hard time grasping the notation used and was hoping the book would provide more "real world" applications to explain the theory. Instead, it often starts with the theory and applies it to hypothetical examples, a format which may be better suited to someone with a good statistics background. For me personally, I found "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking" to give a better introduction to a lot of the theories used in data science (albeit without any mention of Python) and found "Data Smart: Using Data Science to Transform Information into Insight" to provide better applications of data science techniques, granted it's mostly done in Excel with a very brief overview of R applications at the end.
J**Y
The Fundamentals - Crystal Clear
This is among the handful of very best technical books I have ever read. As the "from Scratch" in the title implies, the objective of this book is to teach the fundamental ideas and techniques of data science from first (or nearly first) principles. After working through this book, you'll be better able to meaningfully utilize the pre-packaged software (whether it's Matlab, R, scikit-learn, or whatever) that you will use in "real life". And although the knowledge you'll gain is largely independent of the programming language, you will as a bonus learn from the clear and elegant Python code included. Every key topic, from probability, statistics, and other mathematical subjects, to machine learning and databases, is covered in a crystal clear manner. In summary, this book is the bee's knees.
A**Y
Excellent Python book for Data Science, Data Analysis, and Data Visualization
Joel does an excellent job tying a "real-world" scenario into the flow of this book to teach Python for Data Science. I have experience with Python, but even those with little to no Python experience can learn from this book. The only reasons for 4/5 stars are, first, that I would like to see the code for all charts used in the book, such as those in the section on "Statistics". Second, the setup used in the book assumes you have placed the sample code in your working directory. Some code calls classes from other code and expects it to be in the working directory. I would just call these out clearly. Other than these minor thoughts, I think this is an excellently written book and I highly recommend it to anyone interested in data science, data analysis, or just wanting the skills to visualize clean and meaningful data. Well done!