What does scikit-learn's CountVectorizer do when extracting TF?


http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer


class sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>) [source]

Role: Convert a collection of text documents to a matrix of token counts (the count of each word, i.e. TF); the result is a sparse representation using scipy.sparse.coo_matrix.
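For example, a minimal sketch of what this looks like in practice (the documents are made up; get_feature_names_out() is the accessor in recent scikit-learn, older versions use get_feature_names()):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]   # toy corpus
vec = CountVectorizer()
X = vec.fit_transform(docs)                        # sparse term-document matrix

print(vec.get_feature_names_out())  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(X.toarray())                  # row i, column j = count of token j in doc i
```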

Look at the parameters to see what CountVectorizer does when extracting TF:

strip_accents : {'ascii', 'unicode', None}: Remove accents from characters during preprocessing (e.g. 'é' becomes 'e'); None (the default) does nothing. Look: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==
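A small sketch of the effect (the accented words are invented examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents="unicode")
vec.fit(["café naïve"])
print(vec.get_feature_names_out())  # ['cafe', 'naive']: accents are stripped
```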


lowercase : boolean, True by default: Convert all characters to lowercase before TF is calculated. This argument is generally left as True.


preprocessor : callable or None (default): Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can supply your own callable here.


tokenizer : callable or None (default): Override the string tokenization step while preserving the preprocessing and n-grams generation steps. You can supply your own callable here, as in the sketch below.
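A hedged sketch with made-up callables: the preprocessor strips digits, and the tokenizer splits on whitespace in place of the default regex tokenizer, while the n-gram step still runs afterwards:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    preprocessor=lambda s: "".join(c for c in s.lower() if not c.isdigit()),
    tokenizer=str.split,  # replaces token_pattern; n-gram generation still applies
)
vec.fit(["Room 101 has 2 cats"])
print(vec.get_feature_names_out())  # ['cats', 'has', 'room']
```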


stop_words : string {'english'}, list, or None (default): If 'english', a built-in stop word list for English is used. If a list, the final tokens will have all stop words in the list removed. If None, no stop word filtering is done, but the parameter max_df can be set in [0.7, 1.0) to automatically detect and filter stop words based on the intra-corpus document frequency (DF) of terms. This parameter should be adjusted according to your own requirements.
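For example (toy corpus invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words="english")        # built-in English list
# vec = CountVectorizer(stop_words=["the", "on"])  # or a custom list
vec.fit(["the cat sat on the mat"])
print(vec.get_feature_names_out())  # ['cat', 'mat', 'sat']: 'the' and 'on' removed
```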


token_pattern : string: Regular expression denoting what constitutes a token; the default selects tokens of 2 or more alphanumeric characters. Only used when the parameter analyzer is set to 'word'.
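A sketch with a looser pattern that also keeps single-character tokens, which the default pattern would drop:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1-char tokens too
vec.fit(["a b ab"])
print(vec.get_feature_names_out())  # ['a', 'ab', 'b']
```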


ngram_range : tuple (min_n, max_n): The lower and upper bounds of the range of n-values; the default is ngram_range=(1, 1). All n-gram features within this range will be extracted. This parameter should be adjusted according to your own requirements.
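For instance, extracting unigrams and bigrams together (toy input):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec.fit(["the cat sat"])
print(vec.get_feature_names_out())
# ['cat', 'cat sat', 'sat', 'the', 'the cat']
```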


analyzer : string, {'word', 'char', 'char_wb'} or callable: Whether features are based on word n-grams or character n-grams. If a callable is passed, it is used to extract features from the raw, unprocessed input.
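A sketch of character n-grams with 'char_wb', which pads at word boundaries:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
vec.fit(["cat"])
print(vec.get_feature_names_out())
# [' c', 'at', 'ca', 't ']: character bigrams, padded at the word's edges
```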

max_df : float in range [0.0, 1.0] or int, default=1.0: When building the vocabulary, remove tokens whose DF is strictly higher than the given threshold (a proportion of documents if float, an absolute count if int). Only valid when the parameter vocabulary is set to None.

min_df : float in range [0.0, 1.0] or int, default=1: Remove tokens whose DF is strictly lower than the given threshold (proportion or absolute count, as above). Only valid when the parameter vocabulary is set to None.

max_features : int or None, default=None: Keep only the max_features features with the highest TF across the corpus. Only valid when the parameter vocabulary is set to None. A small sketch of DF-based filtering follows.
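A sketch of max_df/min_df filtering on an invented corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "apple cherry", "apple banana"]
# 'apple' occurs in all 3 documents (DF = 1.0 > max_df) and is dropped;
# 'cherry' occurs in only 1 document (DF < min_df) and is dropped too.
vec = CountVectorizer(max_df=0.9, min_df=2)
vec.fit(docs)
print(vec.get_feature_names_out())  # ['banana']
```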


vocabulary : Mapping or iterable, optional: Custom feature word tokens. If not None, only the TF of the words in vocabulary is computed; otherwise simply leave it as None and the vocabulary is learned from the input documents.
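For example, counting only a fixed set of tokens (toy input):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(vocabulary=["cat", "dog"])   # count only these tokens
X = vec.fit_transform(["the cat sat with the cat"])
print(X.toarray())  # [[2 0]]: 'cat' twice, 'dog' never, all other tokens ignored
```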


binary : boolean, default=False: If True, the TF values are only 0 and 1, meaning a token either appears or does not appear. Useful for discrete probabilistic models that model binary events rather than integer counts.
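A quick sketch of the binarized output:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True)
X = vec.fit_transform(["cat cat dog"])
print(X.toarray())  # [[1 1]]: presence/absence instead of raw counts
```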

dtype : type, optional: Type of the matrix returned by fit_transform() or transform().



Conclusion:

When extracting TF, CountVectorizer does all of the following: strips accents, lowercases, removes stop words, and extracts all n-gram features in ngram_range on a word basis (rather than character; this is also selectable via parameters), while deleting features according to max_df, min_df, and max_features. Of course, you can also make the TF binary.
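Putting it together, a sketch with illustrative (not recommended) parameter values:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    strip_accents="unicode",   # remove accents
    lowercase=True,            # lowercase before counting
    stop_words="english",      # drop stop words
    analyzer="word",           # word (not character) n-grams
    ngram_range=(1, 2),        # unigrams and bigrams
    max_df=0.95, min_df=1,     # DF-based filtering
    max_features=10000,        # keep at most 10000 features
    binary=False,              # raw counts, not presence/absence
)
X = vec.fit_transform(["Café cats sat on the mat", "Dogs sat quietly"])
print(X.shape)
```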

With this you should be able to tell whether CountVectorizer's results are what you want.... Wow, haha.


Finally, look at these two functions:

fit(raw_documents[, y])  Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])  Learn the vocabulary dictionary and return the term-document matrix.

fit(raw_documents, y=None) [source]

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters:

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns:

self :

fit_transform(raw_documents, y=None)
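A quick sketch of the difference: fit only learns the vocabulary, while fit_transform also returns the matrix for the same documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["cat dog", "dog bird"])        # learn the vocabulary only
X = vec.transform(["dog dog cat"])      # reuse it on new documents

# fit_transform does both steps in one call:
X2 = CountVectorizer().fit_transform(["cat dog", "dog bird"])
```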
