Sklearn spectral clustering and text mining (i.)

Source: Internet
Author: User
Tags: generator, shuffle
A discussion of biclustering (also called co-clustering or double clustering).

Bicluster data can be generated with the function

    sklearn.datasets.make_biclusters(shape=(rows, cols), n_clusters, noise,
                                     shuffle, random_state)

n_clusters specifies the number of biclusters to generate, and noise specifies the standard deviation of the Gaussian noise added to the data. It returns a tuple: the generated data, the row cluster labels, and the column cluster labels (boolean membership masks, one row per bicluster).
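As a quick sketch of those return values (assuming scikit-learn is installed; the shape and cluster count below are arbitrary illustration values, not from the original text):

```python
from sklearn.datasets import make_biclusters

# Generate a 30x20 matrix containing 4 constant blocks plus Gaussian noise.
data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=4,
                                   noise=0.5, shuffle=True, random_state=0)

print(data.shape)   # (30, 20) -- the data matrix itself
print(rows.shape)   # (4, 30)  -- one boolean row-membership mask per bicluster
print(cols.shape)   # (4, 20)  -- one boolean column-membership mask per bicluster
```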


from sklearn.datasets import samples_generator as sg

sg._shuffle performs the shuffle, rearranging the rows and columns of the original data. As its implementation shows, the rearranging code amounts to the following:


from sklearn.utils import check_random_state

def _shuffle(data, random_state=None):
    generator = check_random_state(random_state)
    n_rows, n_cols = data.shape
    row_idx = generator.permutation(n_rows)
    col_idx = generator.permutation(n_cols)
    result = data[row_idx][:, col_idx]
    return result, row_idx, col_idx




sklearn.cluster.bicluster.SpectralCoclustering
implements the main biclustering algorithm. After calling fit, model.biclusters_ gives the subscripts (membership masks) of each bicluster's rows and columns.
model.row_labels_ and model.column_labels_ return the bicluster label of each row and column respectively; sorting the rows and columns by these labels (np.argsort) rearranges the data so that each bicluster appears as a contiguous block in the image.
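The rearranging step is pure NumPy; a minimal sketch with made-up labels (the label values below are hypothetical, standing in for what row_labels_ would provide):

```python
import numpy as np

# Hypothetical per-row cluster labels, as row_labels_ would provide.
row_labels = np.array([2, 0, 1, 0, 2, 1])
data = np.arange(6 * 3).reshape(6, 3)

# argsort returns the row order that sorts the labels, so rows with the
# same label become adjacent after fancy indexing.
order = np.argsort(row_labels)
rearranged = data[order]

print(row_labels[order])  # labels are now grouped: [0 0 1 1 2 2]
```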


Here is the sample code:
import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_biclusters
from sklearn.datasets import samples_generator as sg
from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.metrics import consensus_score

# shape as in the standard scikit-learn biclustering example
data, rows, columns = make_biclusters(shape=(300, 300), n_clusters=5, noise=5,
                                      shuffle=False, random_state=0)

plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Original dataset")

data, row_idx, col_idx = sg._shuffle(data, random_state=0)
plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Shuffled dataset")

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(data)
score = consensus_score(model.biclusters_,
                        (rows[:, row_idx], columns[:, col_idx]))

print("consensus score: {:.3f}".format(score))

fit_data = data[np.argsort(model.row_labels_)]
fit_data = fit_data[:, np.argsort(model.column_labels_)]

plt.matshow(fit_data, cmap=plt.cm.Blues)
plt.title("After biclustering; rearranged to show biclusters")

plt.show()




If n_clusters above is replaced with a two-tuple (n_row_clusters, n_col_clusters), the resulting block structure is no longer diagonal but checkerboard-like, a more general form of partitioning (though it still belongs to biclustering). The related, easier-to-use class for this is
SpectralBiclustering
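A sketch of the checkerboard case (note: in recent scikit-learn releases this class is imported from sklearn.cluster rather than the older sklearn.cluster.bicluster module used above; the shape and cluster counts are illustration values):

```python
import numpy as np
from sklearn.datasets import make_checkerboard
# Newer import path; older releases used sklearn.cluster.bicluster.
from sklearn.cluster import SpectralBiclustering

# Checkerboard data: 3 row clusters x 4 column clusters.
data, rows, cols = make_checkerboard(shape=(60, 40), n_clusters=(3, 4),
                                     noise=1.0, random_state=0)

model = SpectralBiclustering(n_clusters=(3, 4), random_state=0)
model.fit(data)

# Each row gets one of 3 row-cluster labels, each column one of 4.
print(model.row_labels_.shape)
print(model.column_labels_.shape)
```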




Python 2's default regular expressions match only characters in the ASCII set; extending them to the more general Unicode set requires a special declaration, such as re.compile(u'(?u)\\b\\w\\w+\\b').
The (?u) flag enlarges the matched character classes to the Unicode character set. This rule applies to Python 2 regular expressions; Python 3 matches Unicode by default.

\b matches the boundary between a word character \w, i.e. [a-zA-Z0-9_], and a non-word character \W.
This regular expression therefore matches tokens of at least two word characters, \\b\\w\\w+\\b; a lone \w (which a bare \w+ would also match) is excluded.
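The exclusion of single-character tokens is easy to verify directly:

```python
import re

# The (?u) flag is only meaningful on Python 2; Python 3 regexes are
# Unicode-aware by default, so it is a harmless no-op there.
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')

# Single-character tokens such as "a" are dropped; everything of
# two or more word characters survives.
print(token_pattern.findall("a bb ccc 42 x_1"))
```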


import re

def number_aware_tokenizer(doc):
    token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
    tokens = token_pattern.findall(doc)
    tokens = ['#NUMBER' if token[0] in '0123456789_' else token
              for token in tokens]
    return tokens




This function replaces every token in doc whose first character is a digit or '_' with the placeholder '#NUMBER', and leaves the other tokens unchanged.
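A quick check of that behavior (the function is restated here so the snippet is self-contained; the input sentence is made up):

```python
import re

def number_aware_tokenizer(doc):
    token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
    tokens = token_pattern.findall(doc)
    return ['#NUMBER' if token[0] in '0123456789_' else token
            for token in tokens]

# "911" and "15" start with a digit, so both collapse to '#NUMBER'.
print(number_aware_tokenizer("call 911 in 15 minutes"))
```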


There are some general concepts here, but basically a token can be seen as a unit extracted from a stream of characters; we won't delve into that further.


Here is a related note about stop_words. Stop words are words that carry little meaning on their own; the Python package stop-words can be used to obtain the English ones for removal, e.g.:

from stop_words import get_stop_words
get_stop_words("english")

returns the commonly used stop words in English.
This is part of data cleansing in the preprocessing of text-mining data.
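A minimal sketch of that cleansing step; the tiny stop-word set here is hard-coded for illustration, whereas in practice it would come from get_stop_words("english"):

```python
# Hard-coded stand-in for the list returned by get_stop_words("english").
stop_words = {"the", "is", "a", "of", "and"}

tokens = ["the", "cat", "is", "a", "friend", "of", "the", "dog"]

# Keep only the tokens that are not stop words.
cleaned = [t for t in tokens if t not in stop_words]

print(cleaned)  # ['cat', 'friend', 'dog']
```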
The application of the above algorithms to text mining will be considered in the sequel.
