"Python machine learning and Practice: from scratch to the road to the Kaggle race"

Source: Internet
Author: User
Tags: Jupyter Notebook, XGBoost, NLTK

"Python Machine learning and practice – from scratch to the road to Kaggle race" very basic
The main introduction of Scikit-learn, incidentally introduced pandas, NumPy, Matplotlib, scipy.
The code of this book is based on python2.x. But most can adapt to python3.5.x by modifying print ().
The provided code uses Jupyter Notebook by default, and it is recommended to install ANACONDA3.
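For example, the only mechanical change most listings need is rewriting the Python 2 print statement as a Python 3 function call (an illustrative snippet, not taken from the book):

# Python 2.x, as the listings in the book are written:
# print 'Hello, Kaggle!'

# Python 3.x equivalent: print is a function, so add parentheses.
print('Hello, Kaggle!')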

It is best to register an account at https://www.kaggle.com, run the code from Chapter 4, and get a feel for the platform.

Supervised Learning:
2.1.1 Classification Learning (Classifier)
2.1.1.1 Linear Classifier
2.1.1.2 Support Vector Machine (Support Vector Classifier)
2.1.1.3 Naive Bayes
2.1.1.4 K-Nearest Neighbors
2.1.1.5 Decision Trees
2.1.1.6 Ensemble Models: Random Forest Classifier; gradient-boosted decision trees (Gradient Tree Boosting).
2.1.2 Regression Prediction (Regressor)
2.1.2.1 Linear Regressor
2.1.2.2 Support Vector Machine (Regression)
2.1.2.3 K-Nearest Neighbors (Regression)
2.1.2.4 Regression Tree
2.1.2.5 Ensemble Models
2.2 Unsupervised Learning
2.2.1 Data Clustering
2.2.1.1 K-means Algorithm
2.2.2 Feature Dimensionality Reduction
2.2.2.1 Principal Component Analysis (PCA)
3.1 Model Usage Tips
3.1.1 Feature Enhancement
3.1.1.1 Feature Extraction
3.1.1.2 Feature Screening
3.1.2 Model Regularization
3.1.2.1 Underfitting and Overfitting
3.1.2.2 L1-Norm Regularization
3.1.2.3 L2-Norm Regularization
3.1.3 Model Validation
3.1.3.1 Leave-One-Out Validation
3.1.3.2 Cross-Validation
3.1.4 Hyperparameter Search
3.1.4.1 Grid Search
3.1.4.2 Parallel Search
3.2 Popular Libraries/Models in Practice
Natural Language Processing Toolkit: NLTK
Word Vectors: Word2Vec
XGBoost Model
TensorFlow

Editorial recommendations

"Python machine learning and Practice: from scratch to the Kaggle race" helps integrate popular Python-based libraries with readers interested in machine learning and data mining. such as Scikit-learn,pandas, Nltk,gensim, Xgboost,tensorflow, and so on, and for the actual data encountered, and even the Kaggle competition in the analysis of the task, quickly build an effective machine learning system.

At the same time, the author tries to reduce the reader's reliance on programming skills and mathematical background, thereby lowering the threshold for practicing machine learning models, so that more interested people can experience the pleasure of solving practical problems with classic models and newer, more efficient methods.

Content Introduction

This book is intended for all readers interested in machine learning and data mining practice and competitions. Starting from scratch and based on the Python programming language, it gradually leads the reader to become familiar with the currently popular machine learning, data mining, and natural language processing tools, such as scikit-learn, NLTK, pandas, Gensim, XGBoost, Google TensorFlow, and so on, without involving a large number of mathematical models or complex programming knowledge.

The book is divided into four chapters. Chapter 1 introduces machine learning concepts and Python programming knowledge; Chapter 2 shows how to use scikit-learn as a basic machine learning tool; Chapter 3 describes how to improve the performance of an existing machine learning system through advanced techniques and models; Chapter 4 is about competitions: using the Kaggle platform, it helps the reader apply the models and techniques introduced in the book to complete three representative competition tasks.

About the author

Fan Miao, Ph.D. in Computer Science, Tsinghua University; his research interests include machine learning and natural language processing. In March 2015 he was sponsored by the China Scholarship Council for joint training at the Computer Science Department of New York University. During his Ph.D. studies he published nearly 20 papers in several important international conferences and journals in his field. He has worked in the research and development departments of Hulu, MSRA (Microsoft Research Asia), Baidu's Natural Language Processing Department, and Bosch (Research and Technology Center North America, Silicon Valley), undertaking research tasks related to machine learning and natural language processing.


Li Chao, Ph.D., Associate Professor at Tsinghua University and Deputy Director of the Web and Software Technology Research Center, Research Institute of Information Technology. Member of the Information Storage Technology Committee of the China Computer Federation, senior member of the China Computer Federation, member of the National Technical Committee for Standardization of Document Imaging Technology (SAC/TC86/SC6), and IEEE member. Her research areas include mass data storage, organization, management, and analysis, and their applications in digital libraries/archives and the education, medical, and financial sectors. She has led and participated in a number of national vertical projects (973 Program, 863 Program, Science and Technology Support Program, Natural Science Foundation) as well as horizontal cooperation projects, has published more than 50 academic papers, and holds more than 10 authorized invention patents.

Great reviews

"Python machine learning and practice" is very practical, starting with the simple Python syntax and how to write machine learning models in Python language. Each chapter is linked together with code samples that are ideal for beginners who want to learn about the field of machine learning, even those with no programming foundation. I hope to see this new book promote the Popularization of machine learning.

--Li Lei, scientist at Toutiao (Today's Headlines) Lab; former "Young Marshal" scientist at Baidu's U.S. Deep Learning Lab

This is a good, strongly practical book on machine learning practice, suitable for Ph.D. and master's students, senior undergraduates, and engineering and technical personnel in industry who want to use machine learning methods to solve practical problems. It is an introductory text for quickly mastering such methods, and I believe readers will benefit from it.

--Ma Shaoping, Professor of Computer Science, Tsinghua University

Although there are many books on machine learning on the market, there are few practical textbooks that combine a development language with machine learning theory, use open-source technology, and adopt a "practical training" approach. The author fully integrates his own learning experience into the book; written in accessible language, it is a suitable quick-start guide to machine learning for students and for engineering and technical personnel.

--Wu Shi, Professor at the School of Software, University of Posts and Telecommunications, and research center director

Unlike most professional books, this book has a lower reading threshold. Even if you are not a computer science professional, you can follow this book and, with basic Python programming, quickly get started with the newest and most effective machine learning models.

--Qiang Yang, Professor and Head, Department of Computer Science and Engineering, Hong Kong University of Science and Technology; IEEE Fellow and AAAI Fellow; executive committee member of IJCAI and AAAI; Vice Chairman of the Chinese Association for Artificial Intelligence; Chairman of ACM KDD China (the Chinese branch of the ACM data mining committee)

From a beginner's point of view, the author leads the reader from zero background to being a hobbyist who can independently analyze data and participate in machine learning competitions. The book will benefit readers who are interested in understanding machine learning but do not want to be bothered by complex mathematical theory.

--Zhang Min, Vice Dean of the School of Computer Science and Technology, Soochow University; Director of the Institute of Human Language Technology; Distinguished Professor; winner of the National Science Fund for Distinguished Young Scholars

If machine learning will dominate the next wave of the information industry, shouldn't we get a glimpse of it before the wave arrives? I am very happy to see such a good book of zero-background, hands-on instruction serving the broad readership and making a modest contribution to popularizing this trend. Just as we have tirelessly popularized computers and the Internet over the past few decades, the core ideas of AI, and especially of machine learning, should come out of the ivory tower and embrace the general public, engaging as many interested enthusiasts as possible in practice.

--Zheng Fang, Professor and Director of the Center for Speech and Language Technologies, Tsinghua University

This is a good entry-level book that explains how to use Python for machine learning. It leads beginning readers from scratch through data analysis to mastering machine learning competition skills, and is suitable for students and researchers working on machine learning research and applications.

--Zhou Ming, Principal Researcher at Microsoft Research and senior expert in natural language processing

Contents

Chapter 1 Introduction ........ 1
1.1 Machine Learning Overview ........ 1
1.1.1 Task ........ 3
1.1.2 Experience ........ 5
1.1.3 Performance ........ 5
1.2 Python Programming Libraries ........ 8
1.2.1 Why Use Python? ........ 8
1.2.2 Python's Advantages for Machine Learning ........ 9
1.2.3 NumPy & SciPy ........ 10
1.2.4 Matplotlib ........ 11
1.2.5 Scikit-learn ........ 11
1.2.6 Pandas ........ 11
1.2.7 Anaconda ........ 12
1.3 Python Environment Configuration ........ 12
1.3.1 Windows System Environment ........ 12
1.3.2 Mac OS System Environment ........ 17
1.4 Python Programming Basics ........ 18
1.4.1 Python Basic Syntax ........ 19
1.4.2 Python Data Types ........ 20
1.4.3 Python Data Operations ........ 22
1.4.4 Python Flow Control ........ 26
1.4.5 Python Function (Module) Design ........ 28
1.4.6 Python Programming Library (Package) Import ........ 29
1.4.7 Python Basics: Comprehensive Practice ........ 30
1.5 Chapter Summary ........ 33

Chapter 2 Basics
2.1 Supervised Learning: Classic Models
2.1.1 Classification Learning
2.1.1.1 Linear Classifiers
2.1.1.2 Support Vector Machines (Classification)
2.1.1.3 Naive Bayes
2.1.1.4 K-Nearest Neighbors (Classification)
2.1.1.5 Decision Trees
2.1.1.6 Ensemble Models (Classification)
2.1.2 Regression Prediction
2.1.2.1 Linear Regressors
2.1.2.2 Support Vector Machines (Regression)
2.1.2.3 K-Nearest Neighbors (Regression)
2.1.2.4 Regression Trees
2.1.2.5 Ensemble Models (Regression)
2.2 Unsupervised Learning: Classic Models ........ 81
2.2.1 Data Clustering ........ 81
2.2.1.1 K-means Algorithm
2.2.2 Feature Dimensionality Reduction
2.2.2.1 Principal Component Analysis
2.3 Chapter Summary

Chapter 3 Advanced Topics ........ 98
3.1 Model Usage Tips ........ 98
3.1.1 Feature Enhancement
3.1.2 Model Regularization ........ 111
3.1.3 Model Validation ........ 121
3.1.4 Hyperparameter Search ........ 122
3.2 Popular Libraries/Models in Practice ........ 129
3.2.1 Natural Language Processing Toolkit (NLTK) ........ 131
3.2.2 Word Vector (Word2Vec) Technology ........ 133
3.2.3 XGBoost Model ........ 138
3.2.4 TensorFlow Framework
3.3 Chapter Summary

Chapter 4 Competitions ........ 153
4.1 Introduction to the Kaggle Platform ........ 153
4.2 Titanic Passenger Survival Prediction ........ 157
4.3 IMDB Movie Review Score Estimation ........ 165
4.4 MNIST Handwritten Digit Image Recognition ........ 174
4.5 Chapter Summary ........ 180

Postscript ........ 181

References ........ 182

Book Excerpts

Chapter 3: Advanced Topics
In Chapter 2 we introduced a large number of classic machine learning models and used the Python programming language to analyze their performance on many different real-world datasets. However, careful readers who dig into the data or consult scikit-learn's documentation will find that all of the data we used in Chapter 2 had already been standardized, and that most of the models were simply initialized with their default configurations. In other words, although we could use the preprocessed data to learn a set of parameters under the default configuration, and use those parameters and that configuration to achieve seemingly decent performance, we still cannot answer a few critical questions: Is the data encountered in actual research and work this well structured? Are the default configurations really the best? Is there room for improvement in our models' performance? Section 3.1, "Model Usage Tips," will help readers answer these questions. After reading it, I believe readers will learn how to improve the performance of classic models by extracting or screening data features and by optimizing model configurations.
However, with the rapid development of machine learning research and applications in recent years, classic models have become unable to meet ever-growing data volumes and increasingly complex data analysis needs. As a result, a growing number of more efficient and powerful learning models and corresponding libraries have been designed and written, and are gradually being widely accepted and adopted by the research community and industry. These models and libraries include the NLTK package for natural language processing, the word vector technique Word2Vec, the XGBoost model with its powerful predictive capability, and the TensorFlow framework that Google released for deep learning, among others. Even more exciting, these popular libraries and models not only provide Python programming interfaces (APIs); some have become toolkits of the Python programming language itself, which facilitates our subsequent learning and use. Section 3.2, "Popular Libraries/Models in Practice," will lead readers to explore the mysteries of these popular libraries and new models together.

3.1 Model Usage Tips
This section will teach readers a series of practical techniques for using models. Having sampled a number of classic machine learning models in Chapter 2, we know that once we decide to use a particular model, the libraries introduced in this book help us learn the parameters (Parameters) the model requires from standard training data, using the default configuration. We can then use this set of parameters to guide the model's predictions on the test dataset, and evaluate the model's performance from those predictions.
However, this scheme does not guarantee that: (1) all of the data features used for training are the best; (2) the learned parameters are optimal; or (3) the model under the default configuration is always the best. In other words, we can improve the performance of the models used earlier from multiple angles. This section introduces ways to improve model performance, including how to preprocess data, control parameter training, and optimize model configuration.
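As a minimal sketch of this default-configuration workflow, assuming scikit-learn and an illustrative dataset and model (not the book's own examples):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split a standard dataset into training and test portions.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# Learn the model parameters from the training data under the default configuration.
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Use the learned parameters to predict on the test set and evaluate performance.
print('Test accuracy:', clf.score(X_test, y_test))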
3.1.1 Feature Enhancement
Early machine learning research and applications were limited by the types of models available and by computing power. As a result, most developers put more effort into preprocessing the data, expecting to improve model performance by extracting or screening data features. So-called feature extraction transforms raw data into feature vectors, which involves quantifying the data's features; feature screening goes a step further and selects, from the high-dimensional, already-quantified feature vectors, a combination of features that is more effective for the given task, further enhancing model performance.
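As a quick, hedged illustration of feature screening (using scikit-learn's SelectKBest, one of several selection utilities; the dataset and the choice of k here are illustrative, not from the book):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Quantified feature vectors: 150 samples, 4 numeric features.
X, y = load_iris(return_X_y=True)

# Keep the 2 features that score highest on a chi-squared test
# against the class labels, discarding the rest.
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(X.shape, '->', X_selected.shape)  # (150, 4) -> (150, 2)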
3.1.1.1 Feature Extraction
Raw data comes in many forms: in addition to digitized signal data (voiceprints, images), there is a great deal of symbolic text. We cannot use symbolic text itself directly for computational tasks; rather, some preprocessing is needed to quantify the text into feature vectors.
Some data features that are represented symbolically are already relatively structured and are stored in a dictionary data structure. In this case we use DictVectorizer to extract and vectorize the features, as in Code 55 below.

Code 55: Using DictVectorizer to extract and vectorize features from data stored in dictionaries
>>> # Define a list of dictionaries representing multiple data samples (each dictionary is one sample).
>>> measurements = [{'city': 'Dubai', 'temperature': 33.}, {'city': 'London', 'temperature': 12.}, {'city': 'San Fransisco', 'temperature': 18.}]
>>> # Import DictVectorizer from sklearn.feature_extraction.
>>> from sklearn.feature_extraction import DictVectorizer
>>> # Initialize the DictVectorizer feature extractor.
>>> vec = DictVectorizer()
>>> # Output the converted feature matrix.
>>> print(vec.fit_transform(measurements).toarray())
[[  1.   0.   0.  33.]
 [  0.   1.   0.  12.]
 [  0.   0.   1.  18.]]
>>> # Output the meaning of each feature dimension.
>>> print(vec.get_feature_names())
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']

As the output of Code 55 shows, DictVectorizer treats categorical features and numerical features very differently during vectorization. Because categorical features cannot be digitized directly, each original feature name is combined with its category value to generate a new feature, which is quantified with binary 0/1 values; numerical features are converted much more simply, in general by just keeping the original value.
Other text data is more raw: there is almost no special data structure for storage, just a series of strings. To handle this kind of data, the most commonly used text feature representation is the bag-of-words model (Bag of Words): as the name implies, it ignores the order in which words appear and simply treats each word appearing in the training texts as a separate feature column. We call these non-repeating words the vocabulary, so each training text can be mapped to a feature vector over this high-dimensional vocabulary. There are two common ways of computing the feature values: CountVectorizer and TfidfVectorizer. For each training text, CountVectorizer considers only the frequency with which each word appears in that text (Term Frequency). TfidfVectorizer, in addition to considering term frequency, is also concerned with the reciprocal of the number of texts containing the word (Inverse Document Frequency). By contrast, the more texts in the training set, the more advantageous TfidfVectorizer's feature quantification becomes. This is because we compute term frequency in order to find the important words that contribute to the meaning of a text; but if a word appears in almost every text, it is a common word that does not help a model classify texts. When the amount of training text is larger, using TfidfVectorizer to suppress the interference of these common words in classification decisions can often improve model performance.
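To make the contrast concrete, here is a minimal sketch (the toy sentences are illustrative, not from the book) showing how the two vectorizers weight the same vocabulary differently:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three tiny "documents"; the word 'the' appears in every one of them.
texts = ['the cat sat on the mat',
         'the dog chased the cat',
         'the bird flew away']

# CountVectorizer: raw term frequencies.
count_vec = CountVectorizer()
print(count_vec.fit_transform(texts).toarray())
# (use get_feature_names() instead on scikit-learn versions before 1.0)
print(count_vec.get_feature_names_out())

# TfidfVectorizer: term frequency scaled by inverse document frequency,
# so a word occurring in every document (like 'the') is down-weighted.
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(texts).toarray())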
We often call these common words that appear in every text stop words (Stop Words), for example "the" and "a" in English. Stop words are often filtered out during text feature extraction, which helps improve model performance. The following code lets us re-analyze the "20 Newsgroups text classification" problem, this time focusing on the usage of the two text feature quantification models and comparing their performance differences.
......

Preface


Dear readers:

Welcome, and thank you for buying and reading "Python Machine Learning and Practice"!

This book was written to help readers who are interested in machine learning and data mining applications integrate and practice the most popular Python-based libraries (scikit-learn, NLTK, Gensim, XGBoost, TensorFlow, etc.), and to quickly build effective machine learning systems for real-world research problems, and even for competitions on Kaggle (currently the most popular machine learning competition platform).

After reading a few chapters, readers will notice what makes this book special. The author tries to reduce the reader's reliance on programming skills and mathematical knowledge, thereby lowering the threshold both for understanding the book and for practicing machine learning models, and tries to let more interested people experience the fun of solving practical problems with classic models and even newer, more efficient methods. At the same time, the author provides the standard English expression for each key term in the book, making it convenient for readers to quickly look up and understand the relevant English literature.

Since the book does not involve a large number of mathematical models or complex programming knowledge, its audience is very broad, including: research and development personnel working on machine learning and data mining tasks in the Internet and IT-related fields; Ph.D. and postgraduate students; senior undergraduates with a rudimentary knowledge of computer programming; and computer hobbyists interested in machine learning and data mining competitions.

Finally, I sincerely hope readers benefit from this book; that would be my greatest encouragement and support. The book's code is available at: http://pan.baidu.com/s/1bGp15G. For any errors in the book, you are welcome to send criticism and corrections by e-mail to [email protected]; we will record your important contributions on the book's errata page: https://coding.net/u/fanmiao_thu/p/Python_ML_and_Kaggle/topic.

Written in Central Park, New York, USA

December 25, 2015

Postscript

One night in December 2015, at my home in New York, I received a phone call from Li Chao of Tsinghua University. She said she had personally enjoyed the posts I had published on the web about how to use Python to quickly build machine learning systems and compete on the Kaggle platform, and she hoped I would organize them into a book for publication.

At first I was surprised, because the posts I had published online were just notes on everyday learning, without much logical structure, let alone material for a book. My original intention in publishing those posts was simply to keep other machine learning enthusiasts from repeating my mistakes in practice, and to help more students get started quickly and experience the fun of hands-on work.

However, once I took on the task of preparing the manuscript, I suddenly felt the weight of the burden. In particular, after learning that the book was likely to be adopted as a general textbook, I immediately found that almost none of the posts I had published online were usable. The reason is that a textbook must put itself in the reader's place, especially since the target audience of this one includes not only computer professionals but also hobbyists from other fields and undergraduates new to the subject. So I almost completely redesigned the book's outline, rewrote the second and third chapters with reference to the online posts, and, taking into account the needs of readers at different levels, added the first chapter on Python programming basics and the fourth chapter on Kaggle competitions.

Although the book was written in haste, I have striven to keep it clear and to serve the broad readership in an accessible way. Due to the limits of my ability, there are bound to be places where it falls short, and I hope readers will criticize, correct, and send errata promptly.

Finally, thank you again for purchasing "Python Machine Learning and Practice". Let me close with a famous saying of Steve Jobs that I often quote: "Stay hungry, stay foolish." I hope we can encourage one another on the road ahead.

Written in Tsinghua Park, Beijing, China

May 1, 2016

"Python machine learning and Practice: from scratch to the road to the Kaggle race"

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.