XML Data Mining: Clustering XML documents to improve data mining


Part 3 of this XML data mining series explains several concepts behind clustering XML documents and describes the clustering tasks to perform when the content and structure of the documents change over time. In real-world applications, XML documents evolve from one version to the next, and the number of changes is unpredictable. The original clustering solution often becomes invalid once a change has been applied. To overcome this, this article describes a non-redundant methodology that recalculates the clusters of XML documents after a change. A detailed use-case example helps you understand the technique and how to apply it in practice.

Background concepts

Clustering is a data mining task (usually implemented using a distance metric) that looks for dense regions in a data set. In other words, clustering is the process of partitioning data by grouping similar data items into sets, called clusters.

Because of the hierarchical structure of XML, clustering XML documents differs from clustering other data sets. Several XML clustering methods have been introduced, such as structural XML document clustering, semantic XML clustering, schema-less XML document clustering, and linked XML document clustering. To read more about the different types of XML clustering, see Part 1 of this series, "Examine several XML data mining methods". This article focuses on hierarchical (distance-based) clustering of XML documents by structure.

In the distance-based clustering technique, each object from a given set is first assigned to its own cluster. Next, the distances between pairs of clusters are computed, and the nearest (most similar) clusters are merged to form a new, larger cluster. In other words, two XML documents that are more similar to each other than to the other documents, and therefore have a shorter distance between them, can be members of the same cluster.

To illustrate the concept of XML document similarity, Figure 1 shows three XML documents. Two of them, DA and DB, are highly similar, whereas document DC has no similarities to either DA or DB. Documents DA and DB list information about two students, including the school year, the subjects and exams, and the names of the students. Document DC lists information about a book, including the title, the ISBN, and the names of two authors.

Figure 1. Examples of similar and dissimilar XML documents

In Figure 1, queries about student details apply only to the relevant documents (DA and DB) and not to documents that contain different information, such as DC. Intuitively, documents DA and DB are grouped together in one cluster, and DC forms a separate cluster by itself.
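The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by the series: the function name agglomerative, the single-linkage rule, and the stopping threshold are assumptions, and the distances involving DC are made-up values chosen only to be larger than the DA/DB distance.

```python
from itertools import combinations


def agglomerative(labels, dist, threshold):
    """Merge the closest pair of clusters until no pair is within threshold."""
    # Start with one cluster per document.
    clusters = [frozenset([label]) for label in labels]

    def cluster_distance(a, b):
        # Single linkage (an assumption): distance of the closest members.
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        if cluster_distance(a, b) > threshold:
            break  # the remaining clusters are too far apart to merge
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical pairwise distances; only the relative order matters here.
distances = {
    frozenset(("DA", "DB")): 5,
    frozenset(("DA", "DC")): 12,
    frozenset(("DB", "DC")): 14,
}

clusters = agglomerative(["DA", "DB", "DC"], distances, threshold=6)
print(sorted(sorted(c) for c in clusters))
```

With these assumed values, DA and DB end up in one cluster and DC stays alone, which matches the intuition from Figure 1. A different linkage rule or threshold would change the result on less clear-cut data.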

The distance between XML documents and the XML delta

If you represent two XML documents (D1 and D2) as trees, the distance between the two documents, denoted d(D1, D2), is determined by the set of basic operations (that is, inserts, updates, and deletes) that converts one document into the other at the minimum total cost.

For example, to determine the distance between documents DA and DB in Figure 1, you first look for the set of basic operations that converts DA to DB (forward), and then for the set of basic operations that converts DB to DA (backward). You calculate the cost of both sets of operations and select the set with the minimum total cost:

d(DA --> DB) = {update(Student, John, Mary), update(Year, 2, 3), insert(Exams), insert(Subject, Drama), insert(Subject, Music)}

d(DB --> DA) = {update(Student, Mary, John), update(Year, 3, 2), delete(Exams), delete(Subject), delete(Subject)}

To calculate the cost of each set of operations, you use a cost model based on the location of the affected nodes in the XML document. The cost in this example is:

d(DA --> DB) = d(DB --> DA) = 5

In this example, you can select either set of operations because both have the same total cost.
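The selection of the cheaper operation set can be sketched as follows. The article's cost model depends on node location and is not spelled out here, so this sketch assumes a unit cost of 1 per operation. That assumption happens to give 5 for both directions, but it should not be read as the actual cost model. The names Op, script_cost, and distance are hypothetical.

```python
from typing import Callable, NamedTuple, Sequence


class Op(NamedTuple):
    kind: str          # "insert", "update" or "delete"
    node: str          # name of the affected node
    args: tuple = ()   # optional values, e.g. (old, new) for an update


def script_cost(script: Sequence[Op],
                cost_of: Callable[[Op], float] = lambda op: 1) -> float:
    """Total cost of an operation set; unit cost per operation is an assumption."""
    return sum(cost_of(op) for op in script)


def distance(forward: Sequence[Op], backward: Sequence[Op]) -> float:
    """Distance = cost of the cheaper of the two conversion directions."""
    return min(script_cost(forward), script_cost(backward))


# Forward set: converts DA into DB.
forward = [
    Op("update", "Student", ("John", "Mary")),
    Op("update", "Year", ("2", "3")),
    Op("insert", "Exams"),
    Op("insert", "Subject", ("Drama",)),
    Op("insert", "Subject", ("Music",)),
]

# Backward set: converts DB into DA, so the updates restore DA's values.
backward = [
    Op("update", "Student", ("Mary", "John")),
    Op("update", "Year", ("3", "2")),
    Op("delete", "Exams"),
    Op("delete", "Subject"),
    Op("delete", "Subject"),
]

print(script_cost(forward), script_cost(backward), distance(forward, backward))
```

Passing a different cost_of function, for example one that charges more for operations on nodes close to the root, is where a location-based model would plug in.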

In the case of a dynamic (also known as multi-version) XML document, each new version is created by applying some number of updates to the previous version. These updates are typically a mix of the basic operations (that is, inserts, updates, and deletes) applied to the previous version of the XML document. Taken together, these operations form a so-called delta. The XML delta therefore refers to the difference between two consecutive versions of an XML document. The cost of the delta is the total cost of the operations listed in the delta, calculated as described earlier.
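To make the idea of a delta concrete, the sketch below applies a small delta to one version of a document to obtain the next version. The document content, the tuple layout (operation, tag, value), the helper apply_delta, and the unit cost per operation are all assumptions made for illustration. Only xml.etree.ElementTree and copy from the Python standard library are used.

```python
import copy
import xml.etree.ElementTree as ET


def apply_delta(root, delta):
    """Return a new version of the document with the delta applied."""
    new_root = copy.deepcopy(root)  # keep the previous version untouched
    for kind, tag, value in delta:
        if kind == "update":
            # Replace the text of the first child with this tag.
            new_root.find(tag).text = value
        elif kind == "insert":
            # Append a new child element carrying the given text.
            ET.SubElement(new_root, tag).text = value
        elif kind == "delete":
            # Remove the first child with this tag and text.
            for child in new_root.findall(tag):
                if child.text == value:
                    new_root.remove(child)
                    break
        else:
            raise ValueError(f"unknown operation: {kind}")
    return new_root


# Hypothetical version 1 of a student document.
v1 = ET.fromstring(
    "<student><name>John</name><year>2</year>"
    "<subject>Math</subject></student>"
)

# Hypothetical delta between version 1 and version 2.
delta = [
    ("update", "year", "3"),
    ("insert", "subject", "Drama"),
    ("delete", "subject", "Math"),
]

v2 = apply_delta(v1, delta)
delta_cost = len(delta)  # unit cost per operation (an assumption)

print(ET.tostring(v2, encoding="unicode"))
print("delta cost:", delta_cost)
```

Because apply_delta works on a copy, both versions remain available, which is convenient when the delta cost is later used to adjust distances between document versions.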
