1. Converting dictionary-format data to features
The premise: the data is stored as a list of dictionaries. The DictVectorizer class converts it to a feature matrix; any variable with string values is automatically expanded into multiple feature variables, similar to the one-hot encoding mentioned earlier.
In [226]: measurements = [
     ...:     {'city': 'Dubai', 'temperature': 33.},
     ...:     {'city': 'London', 'temperature': 12.},
     ...:     {'city': 'San Fransisco', 'temperature': 18.},
     ...: ]

In [227]: from sklearn.feature_extraction import DictVectorizer

In [228]: vec = DictVectorizer()

In [229]: vec.fit_transform(measurements).toarray()
Out[229]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

In [230]: vec.get_feature_names()
Out[230]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
At the same time, features can be selected directly through the vectorizer's restrict function.
In [247]: from sklearn.feature_selection import SelectKBest, chi2

In [249]: Z = vec.fit_transform(measurements)
     ...: support = SelectKBest(chi2, k=2).fit(Z, [0, 0, 1])

In [250]: Z.toarray()
Out[250]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

In [251]: vec.get_feature_names()
Out[251]: ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']

In [252]: vec.restrict(support.get_support())
Out[252]: DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True)

In [253]: vec.get_feature_names()
Out[253]: ['city=San Fransisco', 'temperature']
You can also call the inverse_transform function to recover the original values.
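As a minimal sketch of that round trip (reusing the city/temperature data from above), inverse_transform returns one dict per row; note that the one-hot city columns come back as keys like 'city=Dubai' with value 1.0 rather than the original string value:

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
measurements = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London', 'temperature': 12.0},
]
X = vec.fit_transform(measurements)

# Each row maps back to a {feature_name: value} dict; the one-hot
# city columns reappear as 'city=Dubai': 1.0, not as the raw string.
original = vec.inverse_transform(X)
print(original)
```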
2. Feature hashing
When the set of possible feature values is large, one-hot encoding produces a huge feature matrix that is mostly zeros. Instead, a hash function can map each feature, based on its name and value, into a matrix of a specified dimension. Since a hash is a one-way function, FeatureHasher has no inverse_transform function.
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=10, input_type='dict')
D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(D)
f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
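FeatureHasher also accepts raw token sequences rather than dicts. A small sketch (the n_features value and tokens here are illustrative): with input_type='string', each sample is an iterable of tokens, each hashed with an implicit value of 1:

```python
from sklearn.feature_extraction import FeatureHasher

# input_type='string' hashes each token with an implicit value of 1,
# so repeated tokens accumulate in the same (signed) hash bucket.
h = FeatureHasher(n_features=8, input_type='string')
raw = [['dog', 'cat', 'cat'], ['dog', 'run']]
f = h.transform(raw)
print(f.toarray().shape)
```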
3. Text processing
(1) Counting
Each word in the corpus is a feature, and the number of occurrences of that word in a document is the feature value.
In [1]: from sklearn.feature_extraction.text import CountVectorizer

In [2]: vec = CountVectorizer()

In [3]: vec
Out[3]:
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]

In [5]: X = vec.fit_transform(corpus)

In [6]: X.toarray()
Out[6]:
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
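To see which column belongs to which word, the fitted vectorizer's vocabulary_ attribute maps each word to its column index; a quick sketch on the same corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)

# vocabulary_ maps each word to its column index in X;
# columns are assigned in sorted (alphabetical) order.
cols = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(cols)
```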
N-grams can also be used as the bag of words, specified by the ngram_range parameter.
In [21]: bigram_vec = CountVectorizer(ngram_range=(1, 3), token_pattern=r'\b\w+\b',
   ...:                               min_df=1)

In [22]: bigram_vec.fit_transform(corpus).toarray()
Out[22]:
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 1, 0, 0, 1, 1, 0,
        0, 0, 0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
        1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 1, 1]], dtype=int64)

In [23]: analyze = bigram_vec.build_analyzer()

In [24]: analyze('Hello a b c')
Out[24]:
[u'hello',
 u'a',
 u'b',
 u'c',
 u'hello a',
 u'a b',
 u'b c',
 u'hello a b',
 u'a b c']
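For contrast, a small sketch of the default analyzer on the same string: the default token_pattern u'(?u)\\b\\w\\w+\\b' only matches tokens of two or more characters, which is why the custom r'\b\w+\b' pattern above was needed to keep the single letters:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default token_pattern requires tokens of 2+ word characters,
# so the single letters 'a', 'b', 'c' are silently dropped.
default_analyze = CountVectorizer().build_analyzer()
print(default_analyze('Hello a b c'))
```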
To deal with character-encoding problems, see the chardet module.
(2) Sparse matrix conversion
HashingVectorizer = CountVectorizer + FeatureHasher: it combines the text preprocessing and tokenization of CountVectorizer with the hashing trick of FeatureHasher.
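A minimal sketch of that combination (n_features chosen arbitrarily here): HashingVectorizer tokenizes like CountVectorizer but hashes terms straight to columns, so it is stateless, builds no vocabulary, and, like FeatureHasher, has no inverse_transform:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary is built, so transform works without fit,
# but there is no inverse_transform and no list of feature names.
hv = HashingVectorizer(n_features=16)
X = hv.transform(['this is the first document', 'and the third one'])
print(X.shape)
```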
Reference:
http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction