Machine Learning Series: (c) Feature extraction and processing


Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission.


Disclaimer: All rights reserved. To reproduce, please contact the author and cite the source: http://blog.csdn.net/u013719780?viewmode=contents


About the blogger: "Snow Night to Son" (English name: Allen) is a machine learning algorithm engineer who loves digging into cutting-edge techniques and has a strong interest in deep learning and artificial intelligence. He closely follows the Kaggle data mining competition platform; readers interested in data, machine learning, and artificial intelligence are welcome to get in touch. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents


Feature extraction and processing

The explanatory variables in the previous chapter were all numeric, such as the pizza's diameter. Many machine learning problems, however, involve categorical variables, words, or even images. In this chapter we describe methods for extracting features from these kinds of data. These techniques are the prerequisite step of data processing, turning raw data into feature vectors, which is the foundation of machine learning and affects every chapter of this book.

Feature extraction of categorical variables

Many machine learning problems have categorical or nominal variables rather than continuous ones. For example, an application might use a categorical feature such as workplace to predict wage levels. Categorical variables are usually encoded with one-hot encoding (also called one-of-K encoding), which represents the explanatory variable with one binary feature per possible value.

For example, suppose the variable city can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents the variable with three binary features, one for each city.

Scikit-learn's DictVectorizer class can be used to one-hot encode categorical features:

from sklearn.feature_extraction import DictVectorizer

onehot_encoder = DictVectorizer()
instances = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
print(onehot_encoder.fit_transform(instances).toarray())
[[0.  1.  0.]
 [0.  0.  1.]
 [1.  0.  0.]]

Notice that the columns of the encoded vectors do not follow the order in which the cities were listed above: New York is encoded as [0. 1. 0.], with the second element set to 1. One-hot encoding may seem less direct than simply representing each category with a single integer, for example encoding New York, San Francisco, and Chapel Hill as 1, 2, and 3. But the magnitudes of such numbers have no real meaning, because the cities have no natural ordering.
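To check which column corresponds to which city, you can inspect the fitted vectorizer's feature names. This is a minimal sketch that reuses the onehot_encoder fitted above; the exact naming of the features may vary with the scikit-learn version:

print(onehot_encoder.feature_names_)
# Expected output (column order): ['city=Chapel Hill', 'city=New York', 'city=San Francisco']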

Text feature extraction

Many machine learning problems involve natural language processing (NLP), which inevitably means dealing with text. Text must be converted into feature vectors that can be quantified. Here we introduce the most commonly used text representation: the bag-of-words model.

The bag-of-words representation

The bag-of-words model is the most common way to model text. It ignores a document's word order, grammar, and syntax and treats the document simply as a collection of words, in which each word's appearance is independent of whether any other word appears; equivalently, the author's choice of a word at any position is assumed to be independent of the words before it. The bag-of-words model can be seen as an extension of one-hot encoding, setting a value for each word that appears in a document. It rests on the assumption that documents containing similar words have similar meanings. Despite the limited information it encodes, the bag-of-words model supports effective document classification and retrieval.

A collection of documents is called a corpus. Let's demonstrate the bag-of-words model with a corpus of two documents:

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

The corpus contains eight unique words: UNC, played, Duke, in, basketball, lost, the, and game. The set of a corpus's unique words is its vocabulary. The bag-of-words model represents each document as a feature vector defined over the corpus's vocabulary. Since our corpus has eight words, each document is represented by a vector with eight elements. The number of elements in the feature vector is called its dimension. A dictionary maps the vocabulary to indices in the feature vector.

In the most basic bag-of-words model, each element of the feature vector is a binary value indicating whether the corresponding vocabulary word appears in the document. For example, the first document contains the word UNC, so the element at UNC's index is set to 1; it does not contain the word game, so the element at game's index is 0. The CountVectorizer class first converts all documents to lowercase and then tokenizes them. Tokenization is the process of splitting a document into tokens, that is, meaningful sequences of characters. Tokens are usually words, but they can also be affixes, punctuation, and phrases. By default, CountVectorizer splits sentences using a regular expression and keeps only tokens of two or more alphanumeric characters. The scikit-learn code is as follows:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'game': 2, 'in': 3, 'basketball': 0, 'the': 6, 'duke': 1, 'lost': 4}
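To see the lowercasing and tokenization described above in isolation, the vectorizer's analyzer can be applied to a single sentence. This is a minimal sketch reusing the vectorizer defined above:

analyzer = vectorizer.build_analyzer()
print(analyzer('UNC played Duke in basketball'))
# Expected output: ['unc', 'played', 'duke', 'in', 'basketball']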

Let's add a third document to the corpus:

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'game': 3, 'in': 4, 'ate': 0, 'basketball': 1, 'the': 8, 'sandwich': 7, 'duke': 2, 'lost': 5}

These results are produced by the same CountVectorizer class. The vocabulary now contains ten words, but a and I are not among them because they are shorter than the two-character minimum of CountVectorizer's default token pattern.
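If single-character tokens need to be kept, this default behaviour can be overridden through the token_pattern parameter. A minimal sketch, using the scikit-learn default pattern with the two-character minimum removed:

single_char_vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
print(single_char_vectorizer.fit_transform(corpus).todense())
print(single_char_vectorizer.vocabulary_)
# 'a' and 'i' now appear in the vocabulary as well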

Comparing the feature vectors of the documents, you will see that the first two documents are more similar to each other than either is to the third. If you compute the Euclidean distances between their feature vectors, the first two vectors are closer to each other than either is to the third. The Euclidean distance between two vectors is the Euclidean norm, or L2 norm, of their difference:

d(x0, x1) = ||x0 - x1||

The Euclidean norm of a vector is the square root of the sum of the squares of its elements:

||x|| = sqrt(x1^2 + x2^2 + ... + xn^2)
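As a quick check of these two definitions, here is a minimal NumPy sketch using the 8-element feature vectors of the two documents from the first CountVectorizer example above:

import numpy as np

x0 = np.array([1, 1, 0, 1, 0, 1, 0, 1])  # 'UNC played Duke in basketball'
x1 = np.array([1, 1, 1, 0, 1, 0, 1, 0])  # 'Duke lost the basketball game'
print(np.sqrt(np.sum((x0 - x1) ** 2)))   # square root of the sum of squared element differences
print(np.linalg.norm(x0 - x1))           # the same value, via the L2 norm (about 2.449)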

The euclidean_distances function in scikit-learn computes the distance between vectors; the two most semantically similar documents are the ones whose vectors are closest in space:

from sklearn.metrics.pairwise import euclidean_distances

counts = vectorizer.fit_transform(corpus).todense()
for x, y in [[0, 1], [0, 2], [1, 2]]:
    dist = euclidean_distances(counts[x], counts[y])
    print('Distance from document {} to document {}: {}'.format(x, y, dist))
Distance from document 0 to document 1: [[2.44948974]]
Distance from document 0 to document 2: [[2.64575131]]
Distance from document 1 to document 2: [[2.64575131]]
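Instead of looping over pairs, euclidean_distances can also be given the whole count matrix at once; a minimal sketch:

print(euclidean_distances(counts))
# Returns a symmetric 3x3 matrix with zeros on the diagonal and the three
# distances printed above in the off-diagonal positions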

If we built our corpus from news articles, the vocabulary could contain thousands of words. Each article's feature vector would then have thousands of elements, many of which would be 0: sports news does not contain the jargon of financial news, and likewise culture news does not contain the jargon of financial news. High-dimensional feature vectors with many zero-valued elements are called sparse vectors.

Using high-dimensional data in machine learning tasks causes several problems, and not only in natural language processing. The first problem is that high-dimensional vectors occupy more memory than necessary. SciPy provides sparse matrix types that store only the non-zero elements of sparse vectors, which handles this problem effectively.
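In fact, CountVectorizer already returns such a sparse representation; the dense matrices above only appeared because of the calls to todense(). A minimal sketch reusing the vectorizer and corpus from above:

X = vectorizer.fit_transform(corpus)  # without .todense(), X stays sparse
print(type(X))  # a SciPy sparse matrix (CSR format)
print(X)        # only the non-zero entries are stored, as (row, column) pairs with their counts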

The second problem is the well-known curse of dimensionality (also called the Hughes effect): as the number of dimensions grows, a larger training set is needed for the model to learn adequately. With too few training samples, the algorithm may overfit and fail to generalize. Below, we introduce some methods for reducing the dimensionality of text features. In Chapter 7, on PCA, we will also cover numerical dimensionality reduction.

Stop-word filtering
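Common function words such as 'the', 'in', and 'a' carry little information about a document's meaning. As a minimal sketch of the idea, CountVectorizer can drop them through its stop_words parameter, here using scikit-learn's built-in English stop-word list:

stopword_vectorizer = CountVectorizer(stop_words='english')
print(stopword_vectorizer.fit_transform(corpus).todense())
print(stopword_vectorizer.vocabulary_)
# 'the', 'in', 'a', and 'i' are removed before counting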
