File fingerprint-based Web Text Mining

Source: Internet
Author: User

The rapid growth of massive, heterogeneous Web information resources means that the Web holds a huge amount of potentially useful data. How to discover valuable knowledge in these vast resources has become an urgent problem. People need tools that can quickly and effectively discover resources and data on the Web, so as to improve the efficiency of information retrieval and use.

At present, most research on Web text mining is based on the bag-of-words or vector representation. This approach treats each word as an attribute of the document collection and views words in isolation from a purely statistical perspective, ignoring their position and context. One drawback of the bag-of-words method is that free text is rich in data and the vocabulary is very large, so processing is difficult. To address this, researchers have adopted various techniques that aim to reduce the number of attributes, such as information gain, cross entropy, and variance ratio. A more meaningful method is latent semantic indexing, which analyzes the words shared by different documents on the same topic, finds their common root, and uses that root in place of the individual words to reduce the dimensionality of the space. Other attribute representations include the position of words in the document, their hierarchy, phrases, terms, and named entities. So far, no research has shown that any one representation is significantly better than the others.
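As a minimal illustration of the bag-of-words representation described above, the following Python sketch maps documents to term-count vectors over a shared vocabulary. The tokenizer (lowercasing and splitting on whitespace) and the function name `bag_of_words` are assumptions made for illustration only; they are not taken from the article.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors): one term-count vector per document."""
    # Assumed tokenizer: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One attribute per vocabulary word; position and context are lost.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words(["web mining finds web data", "data mining"])
print(vocab)  # ['data', 'finds', 'mining', 'web']
print(vecs)   # [[1, 1, 1, 2], [1, 0, 1, 0]]
```

Every distinct word adds one dimension, which is why the vocabulary size becomes a problem on free text.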

Figure 1 text clustering model

The mining technique proposed in this article is based not on vocabulary attributes but on text blocks. Using the tag tree structure of the web page, we extract the title and text blocks and generate a sequence of SHA-1 fingerprints. If the proportion of identical fingerprint blocks shared by two pages falls within a preset range, the two pages are assigned to the same class. The resulting number of classes is used as the value of k for clustering. k-means is then used for text clustering to accomplish the text mining task [2] [3]. Figure 1 shows the text clustering model.

Text preprocessing

· Web page Purification

Because Web text contains a large amount of useless information, such as advertisements, HTML tags, and related links, the collected web pages must first be purified (also called web page de-noising) to improve the clustering result. We define the text that web designers add to help organize the website as "noise" and the text that the page is meant to convey as "theme content". Noise consists of areas and items unrelated to the page topic (that is, content the viewer does not care about), including advertisement bars, navigation bars, and decorative elements.

We therefore analyze the HTML source code, remove the noise based on separators, and extract the body of the web page [4].
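The purification step can be sketched roughly as follows, using Python's standard `html.parser` module. This is not the method of reference [4]: the sets `NOISE_TAGS` and `BLOCK_TAGS`, the class `BlockExtractor`, and the function `purify` are hypothetical names and heuristics chosen for illustration, on the assumption that noise lives in a few well-known container tags and that block-level tags act as separators.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}  # assumed noise containers
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}     # assumed block separators

class BlockExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []
        self._buffer = []
        self._noise_depth = 0
        self._in_title = False

    def _flush(self):
        # Close the current text block, collapsing runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._noise_depth:
            return  # text inside a noise container is discarded
        if self._in_title:
            self.title += data.strip()
        else:
            self._buffer.append(data)

def purify(html):
    """Return (title, text_blocks) with assumed noise regions removed."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.title, parser.blocks

page = ("<html><head><title>News</title><script>var a=1;</script></head>"
        "<body><nav><a href='/'>Home</a></nav><p>First  paragraph.</p>"
        "<div>Second block.</div><footer>Ad</footer></body></html>")
title, blocks = purify(page)
print(title, blocks)  # News ['First paragraph.', 'Second block.']
```

A production system would need site-specific rules or a learned model to recognize noise; the fixed tag list here only demonstrates the shape of the step.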

· Generate a SHA-1 fingerprint

SHA stands for Secure Hash Algorithm. It was published by the U.S. National Institute of Standards and Technology (NIST) in 1993 as a Federal Information Processing Standard (FIPS PUB 180). A revision, FIPS PUB 180-1, was released in 1995 and is generally called SHA-1. It was long regarded as one of the more secure hash algorithms and is widely used. The idea of the algorithm is to take a piece of plaintext and convert it irreversibly into a (usually shorter) digest. Equivalently, it takes an input string (called the pre-image or message) and converts it into a short, fixed-length output sequence, namely the hash value (also known as the message digest) [5].
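For illustration, a SHA-1 digest can be computed with Python's standard `hashlib` module. The helper name `sha1_fingerprint` and the choice of UTF-8 encoding are assumptions of this sketch; the article does not specify an implementation.

```python
import hashlib

def sha1_fingerprint(text):
    """Return the 40-character hexadecimal SHA-1 digest of a text block."""
    # Assumption: text is encoded as UTF-8 before hashing.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

a = sha1_fingerprint("abc")
b = sha1_fingerprint("abd")  # differs from "abc" in one character
print(a)           # a9993e364706816aba3e25717850c26c9cd0d89d
print(len(b))      # 40
print(a == b)      # False
```

Whatever the input length, the output is 160 bits (40 hexadecimal digits), and a small change in the input produces an unrelated digest.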

Because of the avalanche effect of the SHA-1 algorithm, even a one-character difference in the input yields a completely different digest, so invisible characters should be removed from a text block before it is hashed; sorting the text blocks serves to reduce the complexity of the algorithm. For a purified page, format analysis produces M text blocks B1, B2, ..., BM (sorted by importance). The first m (m ≤ M) text blocks are used to generate the SHA-1 fingerprints SHA-1_1, SHA-1_2, ..., SHA-1_m. For a pair of web pages (pi, pj), define STm(pi, pj) = m0/m, where m0 is the number of identical SHA-1 fingerprints of pi and pj. It is easy to see that 0 ≤ STm(pi, pj) ≤ 1. Given a range t, if STm(pi, pj) ∈ t, the two pages are assigned to the same class.
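The fingerprinting and comparison step above might look like the following sketch. Several details are assumptions not fixed by the article: `normalize` treats every non-printable character as invisible, identical fingerprints are counted regardless of their position in the sequence (each distinct fingerprint counted once), and the range t is modeled as a closed interval `[lo, hi]`.

```python
import hashlib

def normalize(block):
    """Drop non-printable characters and collapse whitespace before hashing."""
    printable = "".join(ch if ch.isprintable() else " " for ch in block)
    return " ".join(printable.split())

def fingerprints(blocks, m):
    """SHA-1 fingerprints of the first m blocks (assumed sorted by importance)."""
    return [hashlib.sha1(normalize(b).encode("utf-8")).hexdigest()
            for b in blocks[:m]]

def st_m(fp_i, fp_j, m):
    """ST_m(p_i, p_j) = m0 / m, with m0 the number of shared fingerprints."""
    m0 = len(set(fp_i) & set(fp_j))  # assumption: position-independent match
    return m0 / m

page_a = ["Title A", "Shared paragraph.", "Unique to A", "Footer text"]
page_b = ["Title B", "Shared\u200b paragraph.\n", "Footer text", "Unique to B"]

m = 4
similarity = st_m(fingerprints(page_a, m), fingerprints(page_b, m), m)
lo, hi = 0.5, 1.0                      # hypothetical range t
same_class = lo <= similarity <= hi
print(similarity, same_class)          # 0.5 True
```

Without `normalize`, the zero-width space and trailing newline in `page_b` would change the digest of the shared paragraph and the two pages would appear to share only one block.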

Text clustering

Many text clustering algorithms exist. Common approaches include hierarchical clustering and flat partitioning, the latter represented by k-means.

Hierarchical clustering can generate hierarchical, nested clusters with high accuracy. However, each merge requires a global comparison of the similarity between all clusters in order to select the best two. It therefore runs slowly and is not suitable for large document collections.

In recent years, various studies have shown that flat partitioning is better suited than hierarchical clustering to large-scale document collections, because it requires relatively little computation. For example, among hierarchical agglomerative methods, the single-link and group-average methods have time complexity O(n^2) and the complete-link method has time complexity O(n^3), where n is the number of documents. Among flat partitioning methods, k-means has time complexity O(nkT) and the single-pass method has time complexity O(nk), where n is the number of documents, k is the number of final clusters, and T is the number of iterations.

Therefore, this article selects the k-means algorithm for text clustering. The k-means algorithm takes k as input and divides n data objects into k clusters such that objects in the same cluster are highly similar, while objects in different clusters have low similarity. Cluster similarity is computed with respect to a "central object" (center of gravity) obtained as the mean of the objects in each cluster.

The k-means algorithm proceeds as follows. First, k objects are randomly selected from the n data objects as the initial cluster centers. Each remaining object is then assigned to the cluster whose center it is most similar to (that is, closest to). Next, the center of each new cluster (the mean of all objects in the cluster) is recalculated. This process is repeated until the criterion function converges. The mean squared error is generally used as the criterion function.

Although the k-means algorithm is sensitive to the choice of the initial cluster centers, in this article the value of k is not guessed: it is set to the number of classes produced by the fingerprint-based classification step. The number of identical fingerprints shared by two texts is then used as their similarity for clustering, which yields the final clustering result.
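The procedure just described can be sketched in plain Python. This is a generic k-means over numeric vectors with squared Euclidean distance, not the article's fingerprint-based variant; the `seed` and `max_iter` parameters are assumptions added for reproducibility, and the loop stops when assignments no longer change, which is a common stand-in for convergence of the criterion function.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centers) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # k random objects as initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:           # assignments stable: stop
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                    # keep the old center if a cluster empties
                centers[c] = mean(members)
    return labels, centers

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(data, k=2)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.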
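One way to derive k from the fingerprint comparison is sketched below. The article does not say how pairwise matches are combined into classes, so this sketch assumes that class membership is transitive and merges matching pages with a small union-find structure. The function name `count_classes`, the interval `[lo, hi]` standing for the range t, and the short placeholder strings used instead of real SHA-1 digests are all illustrative assumptions.

```python
def count_classes(page_fingerprints, m, lo, hi):
    """Group pages whose ST_m lies in [lo, hi]; return (k, class label per page)."""
    n = len(page_fingerprints)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(set(page_fingerprints[i]) & set(page_fingerprints[j]))
            if lo <= shared / m <= hi:
                parent[find(i)] = find(j)  # assumption: classes are transitive

    roots = [find(i) for i in range(n)]
    ids = {}
    labels = [ids.setdefault(root, len(ids)) for root in roots]
    return len(ids), labels

# Placeholder fingerprints (real ones would be 40-character SHA-1 digests).
pages = [
    ["a", "b", "c"],
    ["a", "b", "x"],
    ["y", "z", "w"],
    ["y", "z", "q"],
    ["u", "v", "t"],
]
k, classes = count_classes(pages, m=3, lo=0.5, hi=1.0)
print(k, classes)  # 3 [0, 0, 1, 1, 2]
```

The resulting `k` would then be passed to k-means as its input, in place of a value chosen by trial and error.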

Summary

This article sets aside the commonly used methods of extracting feature values and computing text similarity. Instead, it abstracts the purified text blocks into file fingerprints and classifies texts by comparing identical fingerprints. The number of classes serves as the initial value of k for the k-means algorithm, and the number of identical fingerprints shared by two texts serves as their similarity for text clustering.

References

[1] Xiao Xiangping, Gao Yubin. Web text mining [J]. Computer Knowledge and Technology, 2007.04

[2] Wang Jianyong, Xie Zhengmao, Lei Ming, Li Xiaoming. Research and evaluation of approximate image web page detection algorithms [J]. Journal of Electronics, 2000.05

[3] Zhang Minghui, Wang Chengyao, Song Wei. A new section-based segmented signature-based image approximation algorithm [J]. Intelligence Technology, 2005.01

[4] Zhang Zhigang, Chen Jing, Li Xiaoming. An HTML webpage purification method [J]. Journal of Intelligence, 2004.07

[5] Jie Hua. Security Hash Algorithm SHA [J]. South China Financial Computer, 2003.06
