Chapter I: Fundamentals of machine learning

Source: Internet
Author: User



Part I: Classification



The first two parts of the book focus on supervised learning. In supervised learning, we only need to provide a set of input samples, and the machine can infer from it the likely value of a specified target variable. Supervised learning is relatively simple: the machine derives a suitable model from the input data and uses it to compute the value of the target variable.
Supervised learning generally uses two types of target variable: nominal and numerical. A nominal target variable takes its value from a finite set, such as true and false, or the classification set {reptiles, fishes, mammals, amphibians, plants, fungi}. A numerical target variable can take its value from an infinite set of numbers, such as 0.100 or 42.001. Numerical target variables are mainly used in regression analysis, which is studied in the second part of this book; the first part mainly introduces classification.



The first seven chapters of this book mainly study classification algorithms. Chapter 2 describes the simplest classification algorithm, the k-nearest neighbor algorithm, which classifies by using a distance metric. Chapter 3 introduces decision trees, which are intuitive and easy to understand but relatively difficult to implement. Chapter 4 discusses how to use probability theory to build a classifier. Chapter 5 discusses logistic regression and how to find the optimal parameters for correctly classifying the original data; several commonly used optimization algorithms are applied in the search for those parameters. Chapter 6 introduces the very popular support vector machine. Chapter 7, the last chapter of the first part, introduces the meta-algorithm AdaBoost, which is composed of a number of classifiers. It also discusses the imbalanced classification problem that arises when the classification algorithms of the first part are used in practice: when one category has many more training samples than the others, the classification problem becomes imbalanced.






Chapter I: Fundamentals of machine learning



Chapter Content
A brief overview of machine learning
The main tasks of machine learning
Why learn machine learning
The advantages of the Python language



Machine learning lets us gain insight from data sets. In other words, we use computers to reveal the real meaning behind the data, and that is what machine learning is really about. It is neither a robot that blindly imitates nor an android with human feelings.



Machine learning software is at work in all of the scenarios mentioned above. Many companies now use machine learning software to improve business decisions, raise productivity, detect disease, predict the weather, and more.






1.1 What is machine learning



Except in some trivial cases, it is difficult to obtain the information you need directly from the raw data itself. In spam detection, for example, checking whether a single word is present does not help much, but when a certain group of words appears together, and the length of the message and other factors are also taken into account, people can determine far more accurately whether the message is spam. Simply put, machine learning is the conversion of unordered data into useful information.
Machine learning spans many disciplines, such as computer science, engineering, and statistics, and requires multidisciplinary expertise. As you will see later, it can also serve as a practical tool for solving problems in many fields, from politics to geology. In fact, machine learning can be useful in any area where data needs to be interpreted and acted on.



Once we have decided to use a machine learning algorithm for classification, the first thing to do is train the algorithm, that is, have it learn how to classify. Usually we feed the algorithm a large amount of data as its training set. A training set is the set of data samples used to train a machine learning algorithm.



Table 1-1 is a training set with six training samples, each with four features and one target variable, as Figure 1-2 shows. The target variable is what the machine learning algorithm predicts. Its type is usually nominal in classification algorithms and usually continuous in regression algorithms. The value of the target variable must be known for every sample in the training set so that the machine learning algorithm can discover the relationship between the features and the target variable.



We usually refer to the target variable in a classification problem as the category, and we assume that a classification problem has only a finite number of categories.
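As a rough illustration of these terms, the sketch below stores a small training set as plain Python lists. The feature names, values, and category labels are all invented for this example; they are not the contents of Table 1-1.

```python
# Hypothetical training set: each row is one training sample with 4 features.
# Every name and value below is made up purely for illustration.
feature_names = ["length_cm", "weight_g", "has_wings", "color"]
training_samples = [
    [12.0, 30.5, 1, "brown"],
    [40.0, 910.0, 1, "gray"],
    [38.5, 870.0, 1, "gray"],
    [15.0, 42.0, 0, "green"],
    [14.2, 39.0, 0, "green"],
    [11.5, 28.0, 1, "brown"],
]
# Target variable (the category) of each sample, in the same order as the rows.
training_labels = ["A", "B", "B", "C", "C", "A"]

# The set of categories is finite, which is what makes this a classification problem.
categories = sorted(set(training_labels))
print(len(training_samples), "samples with", len(feature_names), "features each")
print("categories:", categories)
```

Keeping features and labels in two parallel lists is only one possible layout; the point is that every training sample carries a known value of the target variable.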



To test the effectiveness of a machine learning algorithm, two separate sets of samples are typically used: training data and test data. When the machine learning program starts running, the training samples are used as the input to the algorithm, and the test samples are fed in after training is completed. The test samples are supplied without their target variable, and the program decides which category each sample belongs to. Comparing the predicted target variable value of each test sample with its actual category gives the actual accuracy of the algorithm. Later chapters of this book introduce better ways of using the information in the test and training samples, which are not detailed here.
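The train-then-test procedure can be sketched as follows. This is a minimal illustration, not a method from the book: the data, the helper names `split_samples` and `accuracy`, and the threshold rule standing in for a trained model are all assumptions made for the example.

```python
def split_samples(samples, labels, n_train):
    """Use the first n_train samples for training and the rest for testing."""
    train = (samples[:n_train], labels[:n_train])
    test = (samples[n_train:], labels[n_train:])
    return train, test

def accuracy(predicted, actual):
    """Fraction of test samples whose predicted category matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical one-feature data set.
samples = [[1.0], [1.2], [3.1], [2.9], [1.1], [3.0]]
labels = ["small", "small", "large", "large", "small", "large"]
(train_x, train_y), (test_x, test_y) = split_samples(samples, labels, n_train=4)

# Stand-in "trained model": a threshold halfway between the two class means
# observed in the training data.
small = [x[0] for x, y in zip(train_x, train_y) if y == "small"]
large = [x[0] for x, y in zip(train_x, train_y) if y == "large"]
threshold = (sum(small) / len(small) + sum(large) / len(large)) / 2

# The program sees only the test features; the test labels are held back for scoring.
predicted = ["small" if x[0] < threshold else "large" for x in test_x]
print("accuracy:", accuracy(predicted, test_y))
```

Splitting by position is the simplest choice; in practice the samples are usually shuffled first so that both sets contain every category.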



1.3 Main tasks of machine learning



The example above describes how machine learning solves a classification problem, where the main task is to assign each instance to the appropriate category. Another task of machine learning is regression, which is primarily used to predict numerical data. Most people have probably seen an example of regression, the data-fitting curve: the best-fit curve through a given set of data points. Classification and regression both belong to supervised learning, so called because such algorithms must know what to predict, that is, the category information of the target variable.
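The best-fit curve can be made concrete with the simplest case, a straight line fitted by ordinary least squares. The sketch below is only an illustration under that assumption; the function name `fit_line` and the data points are invented for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of products of x and y deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)

# Regression predicts a numerical value for an unseen input.
prediction = slope * 4.0 + intercept
print(slope, intercept, prediction)
```

Because the target here is a number rather than a category, the output is a continuous prediction instead of a class label.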



In contrast to supervised learning, unsupervised learning applies when the data has no category information and no target value is given. In unsupervised learning, dividing a collection of data into multiple classes made up of similar objects is called clustering, and finding statistics that describe the data is called density estimation. In addition, unsupervised learning can reduce the dimensionality of the data features so that we can display the data more intuitively in two-dimensional or three-dimensional plots. Table 1-2 lists the main tasks of machine learning, together with algorithms for solving the corresponding problems.
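To give a feel for clustering, here is a minimal sketch of k-means on one-dimensional data. K-means is one common clustering method, chosen here only for illustration; the function name `kmeans_1d`, the data, and the starting centers are assumptions of this example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given starting centers (plain k-means)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data: no target variable is given.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print("centers:", centers)
print("clusters:", clusters)
```

No labels are involved at any point: the groups emerge only from how close the values are to one another.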





1.4 How to choose the right algorithm



To choose a usable algorithm from those listed in Table 1-2, you must consider two questions: first, the purpose of using the machine learning algorithm, that is, what task you want it to accomplish, such as predicting the probability of rain tomorrow or grouping voters by interest; second, the data involved. First consider the purpose. If you want to predict the value of a target variable, choose a supervised learning algorithm; otherwise, choose an unsupervised learning algorithm. After settling on supervised learning, determine the type of the target variable. If it is discrete, such as yes/no, 1/2/3, A/B/C, or red/yellow/black, choose a classification algorithm; if it is continuous, such as 0.0 to 100.00, -999 to 999, or -∞ to +∞, choose a regression algorithm.



If you do not want to predict the value of a target variable, you can choose an unsupervised learning algorithm. Then consider whether you need to divide the data into discrete groups. If that is the only requirement, use a clustering algorithm; if you also need to estimate how well the data fits each group, a density estimation algorithm is required. In most cases, the choices given above help readers pick the right machine learning algorithm, but they are not fixed rules. In Chapter 9 we use a classification algorithm to handle a regression problem, which clearly departs from the principle above that supervised learning handles regression problems with a regression algorithm.
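The selection rule described above can be written down as a small decision function. This is only a sketch of the rule of thumb in the text; the function name `choose_algorithm_family` and its parameter names are invented for the example.

```python
def choose_algorithm_family(predict_target, target_is_discrete=None,
                            need_fit_estimate=False):
    """Map the questions from the text to a family of algorithms.

    predict_target:     do we want to predict the value of a target variable?
    target_is_discrete: for supervised learning, is the target discrete?
    need_fit_estimate:  for unsupervised learning, do we also need to estimate
                        how well the data fits each group?
    """
    if predict_target:
        # Supervised learning: the target variable type decides the family.
        if target_is_discrete is None:
            raise ValueError("target_is_discrete is required for supervised learning")
        return "classification" if target_is_discrete else "regression"
    # Unsupervised learning: grouping only, or grouping plus fit estimation.
    return "density estimation" if need_fit_estimate else "clustering"

print(choose_algorithm_family(True, target_is_discrete=True))    # yes/no target
print(choose_algorithm_family(True, target_is_discrete=False))   # 0.0 to 100.00 target
print(choose_algorithm_family(False))                            # grouping only
print(choose_algorithm_family(False, need_fit_estimate=True))
```

As the text notes, this rule is a starting point rather than a law: it narrows the candidates but does not pick a single best algorithm.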



The second thing to consider is the data. We should understand the data thoroughly: the more we know about the actual data, the easier it is to build an application that meets real needs. The main things to find out are whether the feature values are discrete or continuous, whether any feature values are missing and what caused them to be missing, whether the data contains outliers, and how often a given feature occurs (is it as rare as a needle in a haystack?). A good understanding of these characteristics shortens the time needed to choose a machine learning algorithm.
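Some of these questions can be answered with a quick profile of each feature column. The sketch below is one hedged way to do that; the helper name `profile_feature`, the use of `None` to mark a missing value, and the rule that floats count as continuous are all assumptions of this example.

```python
from collections import Counter

def profile_feature(values):
    """Summarize one feature column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Rough guess: floats are treated as continuous, everything else as discrete.
        "kind": "continuous" if any(isinstance(v, float) for v in present) else "discrete",
        # Most frequent value and how often it occurs.
        "most_common": counts.most_common(1)[0] if counts else None,
    }

# Hypothetical feature columns.
color = ["red", "blue", None, "red", "red"]
weight = [1.5, 2.5, None, None]
print(profile_feature(color))
print(profile_feature(weight))
```

A summary like this does not explain why values are missing; that still requires knowing how the data was collected.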
These considerations only narrow the choice of algorithm to a certain extent. In general there is no single best algorithm, nor one guaranteed to give the best results, so we also need to try different algorithms and compare how they perform. For each algorithm selected, other machine learning techniques can be used to improve its performance, and the relative performance of two algorithms may change after the input data is processed. Later chapters discuss these issues further. In general, the key to finding the best algorithm is an iterative process of repeated trial and error. Although machine learning algorithms differ, the steps for building an application with them are largely the same; the next section describes these common steps.






1.5 Steps to develop a machine learning application



In this book, developing an application with a machine learning algorithm typically follows the steps below.
(1) Collect data. We can collect sample data in many ways, such as writing a web crawler to extract data from websites, getting information from RSS feeds or APIs, or receiving measurements sent by devices (wind speed, blood sugar, and so on). There are many ways to extract data, and to save time and effort you can use publicly available data sources.
(2) Prepare input data. Once you have the data, you must also make sure its format meets the requirements; the format used in this book is the Python list. With a standard data format, algorithms and data sources can be mixed and matched easily. This book builds its algorithm applications in Python; readers unfamiliar with the language can study Appendix A.



You also need to prepare the specific data format a given machine learning algorithm expects. Some algorithms require feature values in a particular format, some require the target variable and the feature values to be strings, and others require integers. We will return to this issue later, but formatting data for a particular algorithm is much simpler than collecting the data.


(3) Analyze input data. This step is primarily a manual analysis of the data obtained so far. To make sure the first two steps worked, the simplest method is to open the data file in a text editor and check whether it contains null values. You can also look further for recognizable patterns in the data and for obvious outliers, such as data points whose values differ markedly from the rest of the data set. Presenting the data in one-, two-, or three-dimensional plots is a good idea, but most of the time the data has more than three features and they cannot all be shown in a single plot. Later chapters of this book introduce methods for distilling data so that multidimensional data can be compressed to two or three dimensions and displayed graphically. The main purpose of this step is to make sure there is no garbage data in the data set. If the machine learning algorithm runs in a production system and can handle the data format that system produces, or if we trust the data source, step 3 can be skipped. This step requires manual intervention, and manual intervention in an automated system clearly reduces the value of that system.
(4) Train the algorithm. The machine learning algorithm only really begins to learn at this step. Depending on the algorithm, steps 4 and 5 are the core of machine learning. We feed the formatted data from the first two steps into the algorithm and extract knowledge or information from it. The knowledge obtained here needs to be stored in a format the computer can process so that later steps can use it. If an unsupervised learning algorithm is used, there is no target variable value and therefore no need to train the algorithm; all the algorithm-related work is concentrated in step 5.


(5) Test the algorithm. This step puts to use the knowledge learned in step 4. To evaluate the algorithm, its effectiveness must be tested. For supervised learning, the target variable values used to evaluate the algorithm must be known; for unsupervised learning, other evaluation methods must be used to verify the algorithm's success rate. In either case, if the output of the algorithm is unsatisfactory, you can go back to step 4, make corrections, and test again. Often the problem lies in data collection and preparation, and then you have to go back to step 1 and start over.
(6) Use the algorithm. Convert the machine learning algorithm into an application that performs real tasks, to verify that the steps above work in a real environment. If new data problems come up, the steps above must be repeated.
The next section discusses Python, the programming language used to implement the machine learning algorithms. It was chosen for its advantages over other programming languages, such as being easy to understand, having rich libraries (especially for matrix operations), and having an active developer community.
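The six steps can be sketched end to end in a few lines. The example below is an illustration only: the hard-coded data, the function names, and the nearest-centroid classifier used for training are all assumptions of this sketch, not an algorithm taken from the book.

```python
import math

def collect_data():
    """Step 1: stand-in for a crawler, API call, or sensor feed (hard-coded here)."""
    return ["1.0,1.1,A", "1.2,0.9,A", "5.0,5.2,B", "4.8,5.1,B", "1.1,1.0,A", "5.1,4.9,B"]

def prepare_data(lines):
    """Step 2: parse raw text into Python lists of features and labels."""
    features, labels = [], []
    for line in lines:
        *values, label = line.split(",")
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels

def analyze_data(features):
    """Step 3: minimal sanity check that every sample has the same number of features."""
    return len({len(row) for row in features}) == 1

def train(features, labels):
    """Step 4: learn one centroid (mean point) per category and keep it as the model."""
    grouped = {}
    for row, label in zip(features, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}

def classify(model, row):
    """Step 6: assign a sample to the category with the nearest centroid."""
    return min(model, key=lambda label: math.dist(row, model[label]))

def evaluate(model, features, labels):
    """Step 5: fraction of held-out samples classified correctly."""
    hits = sum(classify(model, row) == label for row, label in zip(features, labels))
    return hits / len(labels)

features, labels = prepare_data(collect_data())
if not analyze_data(features):
    raise ValueError("samples have inconsistent numbers of features")
model = train(features[:4], labels[:4])            # train on the first four samples
test_accuracy = evaluate(model, features[4:], labels[4:])
new_label = classify(model, [4.9, 5.0])            # use the model on new data
print("test accuracy:", test_accuracy, "new sample ->", new_label)
```

The model here is just a dictionary of centroids, which is one example of storing the learned knowledge in a format that later steps can process.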









1.6.1 Executable pseudo-code
Python has a clear syntax and is often called executable pseudo-code. A default Python installation already comes with many high-level data types, whose operations can be used without any further programming. These data types make it very easy to implement abstract mathematical concepts. In addition, readers can use whatever programming style they know, such as object-oriented, procedural, or functional programming. Readers unfamiliar with Python can refer to Appendix A, which describes the Python language, the data types Python uses, and the installation guide in detail.
Python makes processing and manipulating text files very simple, and it handles non-numeric data with ease. The language provides rich regular expression functions and many libraries for accessing web pages, which makes extracting data from HTML simple and intuitive.
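As a small illustration of both points, the sketch below uses the standard `re` module together with Python's built-in list, set, and dictionary types to pull links out of an HTML fragment. The fragment and its URLs are placeholders invented for the example, and the choice of these particular data types is ours.

```python
import re

# Hypothetical HTML fragment; the URLs are placeholders.
html = ('<a href="https://example.com/a">A</a> '
        '<a href="https://example.com/b">B</a> '
        '<a href="https://example.com/a">A again</a>')

# Regular expression: capture whatever sits between href=" and the next quote.
links = re.findall(r'href="([^"]+)"', html)   # list, in document order

unique_links = sorted(set(links))             # set removes duplicates
link_counts = {}                              # dictionary maps link -> occurrences
for link in links:
    link_counts[link] = link_counts.get(link, 0) + 1

print(unique_links)
print(link_counts)
```

A regular expression is enough for a quick extraction like this one; badly formed real-world pages are usually better handled with a proper HTML parser.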



1.6.2 Python is popular


Python is widely used and code examples are plentiful, so readers can learn and master it quickly. In addition, when developing practical applications, you can draw on a rich set of library modules to shorten the development cycle.
Python is widely used in science and finance. Many scientific libraries, such as SciPy and NumPy, implement vector and matrix operations. These libraries make code more readable, and anyone who has studied linear algebra can understand what the code does. In addition, SciPy and NumPy are written in lower-level languages (C and Fortran), which improves the computational performance of the related applications. This book makes heavy use of Python's NumPy.
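A short sketch shows why such code reads like linear algebra. It assumes NumPy is installed; the matrix and vector values are arbitrary examples.

```python
import numpy as np

# A 2x2 matrix and a 2-element vector with arbitrary example values.
weights = np.array([[2.0, 0.0],
                    [1.0, 3.0]])
x = np.array([1.0, 2.0])

y = weights @ x          # matrix-vector product
dot = x @ x              # inner product of x with itself
scaled = 2 * x + 1       # element-wise arithmetic, no explicit loop

print(y, dot, scaled)
```

Each line corresponds directly to one mathematical operation, and the loops that would otherwise be written by hand run inside the library's compiled code.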
Python's scientific tools work together with the plotting tool Matplotlib. Matplotlib can draw 2D and 3D graphics and can handle most of the plots commonly used in scientific research, so this book also makes heavy use of Matplotlib.
The Python development environment also provides an interactive shell that lets users inspect and test program content while they develop.


The Python development environment will also integrate the Pylab module in the future, which merges NumPy, SciPy, and Matplotlib into a single development environment. At the time this book was written, Pylab had not yet been integrated into the Python environment, but we expect to find it there in the near future.




















