scikit-learn: 4.2.3. Text Feature Extraction


http://scikit-learn.org/stable/modules/feature_extraction.html

Section 4.2 covers a lot of material, so text feature extraction is split out into its own piece.


1. The bag of words representation

scikit-learn provides utilities to turn raw text documents into fixed-length numerical feature vectors through three steps:

Tokenizing: split strings into tokens (for instance words, using whitespace and punctuation as separators) and give each possible token an integer index (id).

Counting: count the number of times each token occurs in each document.

Normalizing: normalize and weight the token counts, giving diminishing importance to tokens that occur in the majority of samples/documents.


In this scheme, features and samples are defined as follows:

Each individual token occurrence frequency (normalized or not) is treated as a feature. The vector of all token frequencies for a given document is considered a multivariate sample.


Bag of Words or "bag of n-grams" representation:

The general process (tokenization, counting and normalization) of turning a collection of text documents into numerical feature vectors, while completely ignoring the relative position of the words in the document.
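To make the three steps concrete, here is a minimal sketch on a made-up two-sentence corpus (the corpus and its output are illustrative assumptions, not part of the original text):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['The cat sat on the mat.',
...           'The dog sat on the log.']
>>> vectorizer = CountVectorizer()           # tokenizing + counting
>>> X = vectorizer.fit_transform(corpus)     # one row per document, one column per token
>>> vectorizer.get_feature_names()           # get_feature_names_out() in newer scikit-learn versions
['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
>>> X.toarray()                              # raw token counts before any normalization
array([[1, 0, 0, 1, 1, 1, 2],
       [0, 1, 1, 0, 1, 1, 2]])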


2. Sparsity

The words used in any single document are only a small fraction of all the words in the corpus, so the resulting feature vectors are sparse (most values are 0). To keep storage and computation manageable, scikit-learn stores them using Python's scipy.sparse package.
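As a quick check on the toy corpus from the sketch above (still an illustrative example), the vectorizer output is a scipy.sparse matrix rather than a dense array:

>>> from scipy import sparse
>>> sparse.issparse(X)          # CountVectorizer returns a scipy.sparse matrix
True
>>> X.nnz, X.shape              # only the 10 non-zero counts of the 2x7 = 14 cells are stored
(10, (2, 7))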



3. Common Vectorizer usage

CountVectorizer implements both tokenizing and counting in a single class.

It has many parameters, but the defaults are reasonable and suitable for most cases; for details see: http://blog.csdn.net/mmc2015/article/details/46866537

>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Here's an example of how it's used:

http://blog.csdn.net/mmc2015/article/details/46857887

including fit_transform, transform, get_feature_names(), ngram_range=(min, max), vocabulary_.get(), etc.
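A short sketch of those calls on a made-up two-document corpus (the corpus is an assumption for illustration; the linked post has its own example):

>>> vectorizer = CountVectorizer(min_df=1)
>>> corpus = ['This is the first document.',
...           'This is the second document.']
>>> X = vectorizer.fit_transform(corpus)       # learn the vocabulary and count occurrences
>>> vectorizer.get_feature_names()             # get_feature_names_out() in newer versions
['document', 'first', 'is', 'second', 'the', 'this']
>>> vectorizer.vocabulary_.get('document')     # column index assigned to a token
0
>>> vectorizer.transform(['An unseen document.']).toarray()   # words not in the vocabulary are ignored
array([[1, 0, 0, 0, 0, 0]])
>>> bigram = CountVectorizer(ngram_range=(1, 2))              # extract unigrams and bigrams
>>> bigram.fit(corpus).get_feature_names()[:3]
['document', 'first', 'first document']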


4. TF-IDF term weighting

This addresses the problem that some words (e.g. "the", "a", "is" in English) appear very frequently yet carry little information about the content we actually care about.

The TfidfTransformer class in sklearn.feature_extraction.text implements this normalization:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer()
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()
array([[ 0.85...,  0.  ...,  0.52...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.55...,  0.83...,  0.  ...],
       [ 0.63...,  0.  ...,  0.77...]])
>>> transformer.idf_  # idf_ stores the idf weights learned during fit
array([ 1. ...,  2.25...,  1.84...])
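The idf_ values above can be reproduced by hand; a small sketch assuming the default settings (smooth_idf=True, norm='l2'):

>>> import numpy as np
>>> counts = np.array([[3, 0, 1], [2, 0, 0], [3, 0, 0],
...                    [4, 0, 0], [3, 2, 0], [3, 0, 2]])
>>> n = counts.shape[0]                    # 6 documents
>>> df = (counts > 0).sum(axis=0)          # document frequency: [6, 1, 2]
>>> np.log((1 + n) / (1 + df)) + 1         # smoothed idf, matches transformer.idf_
array([ 1.  ...,  2.25...,  1.84...])

Each row of the tfidf matrix is then just the raw counts multiplied by these idf weights and scaled to unit L2 norm.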

Another class called TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer in a single model:

If binary occurrence features work better for your task, set the binary parameter of CountVectorizer to True; in that case Bernoulli Naive Bayes is also a more suitable estimator.
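A brief sketch of both options on the toy corpus used earlier (the corpus itself is an illustrative assumption):

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
>>> corpus = ['The cat sat on the mat.',
...           'The dog sat on the log.']
>>> TfidfVectorizer().fit_transform(corpus)        # tokenize + count + tf-idf in one step
<2x7 sparse matrix of type '<... 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse ... format>
>>> CountVectorizer(binary=True).fit_transform(corpus).toarray()   # presence/absence instead of counts
array([[1, 0, 0, 1, 1, 1, 1],
       [0, 1, 1, 0, 1, 1, 1]])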


5. Decoding text files

Text is made of characters, but files are made of bytes, so for scikit-learn to work you first need to tell it the encoding of the files; CountVectorizer will then decode them automatically. The default encoding is UTF-8, and the decoded character set is Unicode. If the files you are loading are not actually encoded with UTF-8 and no encoding parameter is set, a UnicodeDecodeError will be raised.

If you are having trouble decoding text, here are some things to try:

Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.

You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.

You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.

Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix the errors.

If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.
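For the "disregard the errors" option, a minimal sketch (the byte string is a made-up example, not from the original text):

>>> raw = b'caf\xe9 au lait'                     # latin-1 bytes, not valid UTF-8
>>> raw.decode('utf-8', errors='replace')        # the bad byte becomes the U+FFFD replacement character
'caf� au lait'
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(decode_error='replace')  # let the vectorizer apply the same policy
>>> v.fit([raw]).get_feature_names()
['au', 'caf', 'lait']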

For example, the following snippet uses chardet (not shipped with scikit-learn, it must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"Holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c
