Sklearn spectral clustering and text mining (i.)

Source: Internet
Author: User
Tags: generator, shuffle
A discussion of biclustering (also called co-clustering or double clustering).

Bicluster data can be generated with the function

    sklearn.datasets.make_biclusters(shape=(rows, cols), n_clusters, noise,
                                     shuffle, random_state)

n_clusters specifies the number of biclusters to generate, and noise specifies the standard deviation of the Gaussian noise added to the data. It returns a tuple: the generated data, the row cluster labels, and the column cluster labels (boolean membership masks, one row per bicluster).
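As a quick sketch of those return values (assuming scikit-learn is installed; the shape and cluster count below are arbitrary illustration values, not from the original text):

```python
from sklearn.datasets import make_biclusters

# Generate a 30x20 matrix containing 4 constant blocks plus Gaussian noise.
data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=4,
                                   noise=0.5, shuffle=True, random_state=0)

print(data.shape)   # (30, 20) -- the data matrix itself
print(rows.shape)   # (4, 30)  -- one boolean row-membership mask per bicluster
print(cols.shape)   # (4, 20)  -- one boolean column-membership mask per bicluster
```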


from sklearn.datasets import samples_generator as sg

sg._shuffle performs the shuffle, rearranging the rows and columns of the original data. As its implementation shows, the rearranging code amounts to the following:


from sklearn.utils import check_random_state

def _shuffle(data, random_state=None):
    generator = check_random_state(random_state)
    n_rows, n_cols = data.shape
    row_idx = generator.permutation(n_rows)
    col_idx = generator.permutation(n_cols)
    result = data[row_idx][:, col_idx]
    return result, row_idx, col_idx




sklearn.cluster.bicluster.SpectralCoclustering
implements the main biclustering algorithm. After calling fit, model.biclusters_ gives the subscripts (membership masks) of each bicluster's rows and columns.
model.row_labels_ and model.column_labels_ return the bicluster label of each row and column respectively; sorting the rows and columns by these labels (np.argsort) rearranges the data so that each bicluster appears as a contiguous block in the image.
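The rearranging step is pure NumPy; a minimal sketch with made-up labels (the label values below are hypothetical, standing in for what row_labels_ would provide):

```python
import numpy as np

# Hypothetical per-row cluster labels, as row_labels_ would provide.
row_labels = np.array([2, 0, 1, 0, 2, 1])
data = np.arange(6 * 3).reshape(6, 3)

# argsort returns the row order that sorts the labels, so rows with the
# same label become adjacent after fancy indexing.
order = np.argsort(row_labels)
rearranged = data[order]

print(row_labels[order])  # labels are now grouped: [0 0 1 1 2 2]
```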


Here is the sample code:
import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_biclusters
from sklearn.datasets import samples_generator as sg
from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.metrics import consensus_score

# shape as in the standard scikit-learn biclustering example
data, rows, columns = make_biclusters(shape=(300, 300), n_clusters=5, noise=5,
                                      shuffle=False, random_state=0)

plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Original dataset")

data, row_idx, col_idx = sg._shuffle(data, random_state=0)
plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Shuffled dataset")

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(data)
score = consensus_score(model.biclusters_,
                        (rows[:, row_idx], columns[:, col_idx]))

print("consensus score: {:.3f}".format(score))

fit_data = data[np.argsort(model.row_labels_)]
fit_data = fit_data[:, np.argsort(model.column_labels_)]

plt.matshow(fit_data, cmap=plt.cm.Blues)
plt.title("After biclustering; rearranged to show biclusters")

plt.show()




If n_clusters above is replaced with a two-tuple (n_row_clusters, n_col_clusters), the resulting block structure is no longer diagonal but checkerboard-like, a more general form of partitioning (though it still belongs to biclustering). The related, easier-to-use class for this is
SpectralBiclustering
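A sketch of the checkerboard case (note: in recent scikit-learn releases this class is imported from sklearn.cluster rather than the older sklearn.cluster.bicluster module used above; the shape and cluster counts are illustration values):

```python
import numpy as np
from sklearn.datasets import make_checkerboard
# Newer import path; older releases used sklearn.cluster.bicluster.
from sklearn.cluster import SpectralBiclustering

# Checkerboard data: 3 row clusters x 4 column clusters.
data, rows, cols = make_checkerboard(shape=(60, 40), n_clusters=(3, 4),
                                     noise=1.0, random_state=0)

model = SpectralBiclustering(n_clusters=(3, 4), random_state=0)
model.fit(data)

# Each row gets one of 3 row-cluster labels, each column one of 4.
print(model.row_labels_.shape)
print(model.column_labels_.shape)
```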




Python 2's default regular expressions match only characters in the ASCII set; extending them to the more general Unicode set requires a special declaration, such as re.compile(u'(?u)\\b\\w\\w+\\b').
The (?u) flag enlarges the matched character classes to the Unicode character set. This rule applies to Python 2 regular expressions; Python 3 matches Unicode by default.

\b matches the boundary between a word character \w, i.e. [a-zA-Z0-9_], and a non-word character \W.
This regular expression therefore matches tokens of at least two word characters, \\b\\w\\w+\\b; a lone \w (which a bare \w+ would also match) is excluded.
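The exclusion of single-character tokens is easy to verify directly:

```python
import re

# The (?u) flag is only meaningful on Python 2; Python 3 regexes are
# Unicode-aware by default, so it is a harmless no-op there.
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')

# Single-character tokens such as "a" are dropped; everything of
# two or more word characters survives.
print(token_pattern.findall("a bb ccc 42 x_1"))
```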


import re

def number_aware_tokenizer(doc):
    token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
    tokens = token_pattern.findall(doc)
    tokens = ['#NUMBER' if token[0] in '0123456789_' else token
              for token in tokens]
    return tokens




This function replaces every token in doc whose first character is a digit or '_' with the placeholder '#NUMBER', and leaves the other tokens unchanged.
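A quick check of that behavior (the function is restated here so the snippet is self-contained; the input sentence is made up):

```python
import re

def number_aware_tokenizer(doc):
    token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
    tokens = token_pattern.findall(doc)
    return ['#NUMBER' if token[0] in '0123456789_' else token
            for token in tokens]

# "911" and "15" start with a digit, so both collapse to '#NUMBER'.
print(number_aware_tokenizer("call 911 in 15 minutes"))
```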


There are some general concepts here, but basically a token can be seen as a unit extracted from a stream of characters; we won't delve into that further.


Here is a related note about stop_words. Stop words are words that carry little meaning on their own; the Python package stop-words can be used to obtain the English ones for removal, e.g.:

from stop_words import get_stop_words
get_stop_words("english")

returns the commonly used stop words in English.
This is part of data cleansing in the preprocessing of text-mining data.
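A minimal sketch of that cleansing step; the tiny stop-word set here is hard-coded for illustration, whereas in practice it would come from get_stop_words("english"):

```python
# Hard-coded stand-in for the list returned by get_stop_words("english").
stop_words = {"the", "is", "a", "of", "and"}

tokens = ["the", "cat", "is", "a", "friend", "of", "the", "dog"]

# Keep only the tokens that are not stop words.
cleaned = [t for t in tokens if t not in stop_words]

print(cleaned)  # ['cat', 'friend', 'dog']
```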
The application of the above algorithms to text mining will be considered in the sequel.
