What does scikit-learn's CountVectorizer do when extracting TF?


http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer


class sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>) [source]

Role: Convert a collection of text documents to a matrix of token counts (the count of each word, i.e. TF); the result is a sparse representation using scipy.sparse.coo_matrix.
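For example, a minimal sketch of what this looks like in practice (the documents are made up; get_feature_names_out() is the accessor in recent scikit-learn, older versions use get_feature_names()):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]   # toy corpus
vec = CountVectorizer()
X = vec.fit_transform(docs)                        # sparse term-document matrix

print(vec.get_feature_names_out())  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(X.toarray())                  # row i, column j = count of token j in doc i
```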

Look at the parameters to see what CountVectorizer does when extracting TF:

strip_accents : {'ascii', 'unicode', None}: Remove accents from characters during preprocessing (e.g. 'é' becomes 'e'); None (the default) does nothing. Look: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==
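A small sketch of the effect (the accented words are invented examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents="unicode")
vec.fit(["café naïve"])
print(vec.get_feature_names_out())  # ['cafe', 'naive']: accents are stripped
```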


lowercase : boolean, True by default: Convert all characters to lowercase before TF is calculated. This argument is generally left as True.


preprocessor : callable or None (default): Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can supply your own callable here.


tokenizer : callable or None (default): Override the string tokenization step while preserving the preprocessing and n-grams generation steps. You can supply your own callable here, as in the sketch below.
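A hedged sketch with made-up callables: the preprocessor strips digits, and the tokenizer splits on whitespace in place of the default regex tokenizer, while the n-gram step still runs afterwards:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    preprocessor=lambda s: "".join(c for c in s.lower() if not c.isdigit()),
    tokenizer=str.split,  # replaces token_pattern; n-gram generation still applies
)
vec.fit(["Room 101 has 2 cats"])
print(vec.get_feature_names_out())  # ['cats', 'has', 'room']
```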


stop_words : string {'english'}, list, or None (default): If 'english', a built-in stop word list for English is used. If a list, the final tokens will have all stop words in the list removed. If None, no stop word filtering is done, but the parameter max_df can be set in [0.7, 1.0) to automatically detect and filter stop words based on the intra-corpus document frequency (DF) of terms. This parameter should be adjusted according to your own requirements.
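For example (toy corpus invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words="english")        # built-in English list
# vec = CountVectorizer(stop_words=["the", "on"])  # or a custom list
vec.fit(["the cat sat on the mat"])
print(vec.get_feature_names_out())  # ['cat', 'mat', 'sat']: 'the' and 'on' removed
```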


token_pattern : string: Regular expression denoting what constitutes a token; the default selects tokens of 2 or more alphanumeric characters. Only used when the parameter analyzer is set to 'word'.
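A sketch with a looser pattern that also keeps single-character tokens, which the default pattern would drop:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1-char tokens too
vec.fit(["a b ab"])
print(vec.get_feature_names_out())  # ['a', 'ab', 'b']
```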


ngram_range : tuple (min_n, max_n): The lower and upper bounds of the range of n-values; the default is ngram_range=(1, 1). All n-gram features within this range will be extracted. This parameter should be adjusted according to your own requirements.
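For instance, extracting unigrams and bigrams together (toy input):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec.fit(["the cat sat"])
print(vec.get_feature_names_out())
# ['cat', 'cat sat', 'sat', 'the', 'the cat']
```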


analyzer : string, {'word', 'char', 'char_wb'} or callable: Whether features are based on word n-grams or character n-grams. If a callable is passed, it is used to extract features from the raw, unprocessed input.
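A sketch of character n-grams with 'char_wb', which pads at word boundaries:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
vec.fit(["cat"])
print(vec.get_feature_names_out())
# [' c', 'at', 'ca', 't ']: character bigrams, padded at the word's edges
```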

max_df : float in range [0.0, 1.0] or int, default=1.0: When building the vocabulary, remove tokens whose DF is strictly higher than the given threshold (a proportion of documents if float, an absolute count if int). Only valid when the parameter vocabulary is set to None.

min_df : float in range [0.0, 1.0] or int, default=1: Remove tokens whose DF is strictly lower than the given threshold (proportion or absolute count, as above). Only valid when the parameter vocabulary is set to None.

max_features : int or None, default=None: Keep only the max_features features with the highest TF across the corpus. Only valid when the parameter vocabulary is set to None. A small sketch of DF-based filtering follows.
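A sketch of max_df/min_df filtering on an invented corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "apple cherry", "apple banana"]
# 'apple' occurs in all 3 documents (DF = 1.0 > max_df) and is dropped;
# 'cherry' occurs in only 1 document (DF < min_df) and is dropped too.
vec = CountVectorizer(max_df=0.9, min_df=2)
vec.fit(docs)
print(vec.get_feature_names_out())  # ['banana']
```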


vocabulary : Mapping or iterable, optional: Custom feature word tokens. If not None, only the TF of the words in vocabulary is computed; otherwise simply leave it as None and the vocabulary is learned from the input documents.
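For example, counting only a fixed set of tokens (toy input):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(vocabulary=["cat", "dog"])   # count only these tokens
X = vec.fit_transform(["the cat sat with the cat"])
print(X.toarray())  # [[2 0]]: 'cat' twice, 'dog' never, all other tokens ignored
```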


binary : boolean, default=False: If True, the TF values are only 0 and 1, meaning a token either appears or does not appear. Useful for discrete probabilistic models that model binary events rather than integer counts.
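A quick sketch of the binarized output:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True)
X = vec.fit_transform(["cat cat dog"])
print(X.toarray())  # [[1 1]]: presence/absence instead of raw counts
```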

dtype : type, optional: Type of the matrix returned by fit_transform() or transform().



Conclusion:

When extracting TF, CountVectorizer does all of the following: strips accents, lowercases, removes stop words, and extracts all n-gram features in ngram_range on a word basis (rather than character; this is also selectable via parameters), while deleting features according to max_df, min_df, and max_features. Of course, you can also make the TF binary.
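Putting it together, a sketch with illustrative (not recommended) parameter values:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    strip_accents="unicode",   # remove accents
    lowercase=True,            # lowercase before counting
    stop_words="english",      # drop stop words
    analyzer="word",           # word (not character) n-grams
    ngram_range=(1, 2),        # unigrams and bigrams
    max_df=0.95, min_df=1,     # DF-based filtering
    max_features=10000,        # keep at most 10000 features
    binary=False,              # raw counts, not presence/absence
)
X = vec.fit_transform(["Café cats sat on the mat", "Dogs sat quietly"])
print(X.shape)
```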

With this you should be able to tell whether CountVectorizer's results are what you want.... Wow, haha.


Finally, look at these two functions:

fit(raw_documents[, y])  Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])  Learn the vocabulary dictionary and return the term-document matrix.

fit(raw_documents, y=None) [source]

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters:

raw_documents : iterable

An iterable which yields either str, unicode or file objects.

Returns:

self :

fit_transform(raw_documents, y=None)
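A quick sketch of the difference: fit only learns the vocabulary, while fit_transform also returns the matrix for the same documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["cat dog", "dog bird"])        # learn the vocabulary only
X = vec.transform(["dog dog cat"])      # reuse it on new documents

# fit_transform does both steps in one call:
X2 = CountVectorizer().fit_transform(["cat dog", "dog bird"])
```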
