An Analysis of XML and Web-Oriented Data Mining Technology

Source: Internet
Author: User
Tags: xsl

Web-Oriented Data Mining

The Web holds a massive amount of data, and how to put that data to sophisticated use has become a hot research topic in database technology. Data mining discovers hidden regularities in large volumes of data and addresses the problem of data quality in applications: making full use of useful data while discarding false and useless data is its most important application. Data in a traditional database is strongly, that is fully, structured, whereas data on the Web is semi-structured, a term defined relative to the fully structured data of a traditional database. Web-oriented data mining is therefore much more complex than mining a single data warehouse.

1. Heterogeneous Database Environment

From the perspective of database research, the information on Web sites can itself be seen as a database, only a larger and more complex one. Each site is a data source, and the sources are heterogeneous: each site organizes its information differently. Together they form a huge heterogeneous database environment. To mine this data, you must first solve the integration of heterogeneous data across sites, because only by integrating the sites' data into a unified view for users is it possible to obtain the required information from such massive data resources. Second, you must solve the problem of querying data on the Web, because data that cannot be retrieved effectively cannot be analyzed, integrated, or processed.

2. Semi-Structured Data Structure

Data on the Web differs from data in traditional databases. A traditional database has a definite data model and describes specific data according to that model. Data on the Web is very complex and has no single model describing it. Each site designs its data independently, and the data itself is self-describing and changes dynamically. Web data therefore has some structure, but because the description is carried along with the data itself rather than fixed in a separate schema, it is not fully structured; it is called semi-structured data. Being semi-structured is the most distinctive feature of Web data.
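As a minimal illustration of what semi-structured means, the Python sketch below parses a few hypothetical records in which each record carries its own field names and not every record has the same fields. The element names (`catalog`, `item`, `price`, `color`) are invented for this example and do not come from any real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: each item names its own fields, and fields vary per item.
DOC = """
<catalog>
  <item><title>Book</title><price>12.50</price></item>
  <item><title>Lamp</title><price>30.00</price><color>red</color></item>
  <item><title>Poster</title></item>
</catalog>
"""

def to_records(xml_text):
    """Turn each <item> into a dict keyed by whatever child tags it contains."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in item} for item in root.findall("item")]

records = to_records(DOC)
for record in records:
    print(record)
```

No fixed schema is declared anywhere; the structure of each record is read from the record itself, which is the sense in which the data is self-describing.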

3. Solving the Problem of Semi-Structured Data Sources

Web data mining must be able to query and integrate semi-structured data sources through a semi-structured data model. To integrate and query heterogeneous Web data, you need a model that clearly describes that data, so finding a suitable semi-structured data model is the key to the problem. Beyond defining the model, you also need a model extraction technique that automatically derives the semi-structured model from existing data. Web-oriented data mining must therefore rest on both a semi-structured data model and a technique for extracting it.
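One very simple way to picture model extraction is to summarize which tag paths occur in existing data. The Python sketch below does only that; it is an assumption-laden toy, not the extraction technique the article has in mind, and the sample document and the `extract_paths` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def extract_paths(xml_text):
    """Count every root-to-element tag path that occurs in the document."""
    counts = Counter()

    def walk(element, prefix):
        path = prefix + "/" + element.tag
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

# Hypothetical data: the second page has no <link>, so the structure is irregular.
SAMPLE = (
    "<site>"
    "<page><title>A</title><link>x</link></page>"
    "<page><title>B</title></page>"
    "</site>"
)

model = extract_paths(SAMPLE)
for path, count in sorted(model.items()):
    print(path, count)
```

The resulting path counts show which elements are always present and which are optional, which is the kind of information a semi-structured model needs to capture.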

XML and Web Data Mining Technology

The XML-based next-generation WWW environment works directly with Web data. It is compatible with existing Web applications and makes sharing and exchanging information on the Web easier. XML can be seen as a semi-structured data model: the descriptions in an XML document can readily be matched to the attributes of a relational database, which allows precise queries and model extraction.
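The matching between XML descriptions and relational attributes can be sketched in a few lines of Python. The example assumes a hypothetical `<book>` record and a hypothetical `book` table, and uses an in-memory SQLite database purely for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML data; the tags line up with the columns of the table below.
DOC = """
<books>
  <book id="1"><title>XML Basics</title><price>20.0</price></book>
  <book id="2"><title>Data Mining</title><price>35.5</price></book>
</books>
"""

def load_books(xml_text, connection):
    """Map each <book> to one row: the id attribute and child elements become columns."""
    connection.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, price REAL)")
    for book in ET.fromstring(xml_text).findall("book"):
        row = (int(book.get("id")), book.findtext("title"), float(book.findtext("price")))
        connection.execute("INSERT INTO book VALUES (?, ?, ?)", row)

connection = sqlite3.connect(":memory:")
load_books(DOC, connection)

# A precise query over data that arrived as XML.
expensive = connection.execute("SELECT title FROM book WHERE price > 30").fetchall()
print(expensive)
```

Once the tags are mapped to columns, ordinary SQL can answer precise questions about data that was originally published as a document.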

1. Origin and Development of XML

XML (Extensible Markup Language) is a simplified subset of SGML (Standard Generalized Markup Language) designed by the W3C. In general terms, XML is a meta-markup language that provides a format for describing structured data. In detail, XML resembles HTML in syntax but is designed to describe data. It gives programs an application-independent way to share data, and as a standard language for self-describing information it lets computer communication extend the Internet from information transmission to many other human activities. XML consists of a set of rules for creating markup languages, and every language created this way can be processed by one compact program called a parser. Just as HTML gave early users a way to display and read Internet documents, XML aims to be a common language that anyone can read and write. XML addresses two Web problems that HTML cannot: the Internet is growing quickly yet access is slow, and a great deal of information is available yet the information you need is hard to find. XML adds structure and semantic information so that computers and servers can process many forms of information in real time. As a result, a client can download a large amount of information from the Web server once and work on it locally, which can greatly reduce network traffic.

Tags in XML are not predefined; users define the tags they need, which makes XML a customizable language. XML uses a DTD (Document Type Definition) to define the structure of the data. XSL (Extensible Stylesheet Language) is the mechanism that describes how these documents are displayed; it is XML's style sheet language. XSL is newer than the CSS (Cascading Style Sheets) used with HTML, and unlike CSS it can transform a document as well as style it. It consists of two parts: a method for transforming XML documents and a method for formatting them. XLL (Extensible Linking Language) is XML's linking language. It provides links in XML that resemble those in HTML but are more powerful: links can be multi-directional and can exist at the object level, not just the page level. Because XML can mark up more information, users can find what they need more easily. With XML, Web designers can create not only text and graphics but also multi-level, interdependent systems, data trees, metadata, hyperlink structures, style sheets, and document type definitions.

2. Main Features of XML

These characteristics are what make XML so effective. As a markup language, XML has the following main features:

· Simple. XML is well designed, and the entire specification is simple and clear. It consists of a set of rules for creating markup languages, and every language created this way can be processed by one compact program, usually called a parser. XML makes it possible to create a common language that anyone can read and write. This property is called uniformity: for example, every opening tag in XML must have a matching closing tag (or be written as a self-closing empty element), and XML relies on the Unicode character encoding standard.

· Open. XML is derived from SGML, so many mature tools are already available on the market to help write and manage it. As an open standard, XML builds on proven standard technology and is optimized for the network. Many leading companies work with W3C working groups to ensure interoperability, to support developers, authors, and users across systems and browsers, and to improve the XML standards. An XML parser can load an XML document programmatically; once the document is loaded, the XML Document Object Model gives access to the information in the whole document for reading and manipulation, which speeds up network operations.

· Efficient and extensible. Document fragments can be reused. Users can create and use their own tags and share them with others. XML is highly extensible: users can define an unlimited set of tags. XML provides a framework for marking up structured data, so an XML element can declare its content to be a retail price, a sales tax, a title, a quantity, or any other data element. As more organizations around the world adopt XML standards, more capabilities will emerge: once data has been located, it can be transmitted over the wire in any way and displayed in a browser, or handed to another application for further processing. XML provides an application-independent way to share data. Different groups of people can agree on a common DTD to exchange data. Your application can use this standard DTD to verify that the data it receives is valid, and you can also use a DTD to verify your own data.

· Internationalized. The standard is international and supports most of the world's writing systems. This follows from its reliance on Unicode, a character encoding standard that supports mixed text written in all the world's major languages. With HTML and most word processors, a document is generally written in one particular language, whether English, Japanese, or Arabic, and if your software cannot read the characters of that language, the document is unusable. Software that can read XML, however, can handle any combination of these character sets. XML can therefore exchange information not only between different computer systems but also across national and cultural boundaries.
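Two of the features listed above, paired tags and Unicode support, are easy to observe with a standard parser. The Python sketch below is a minimal illustration; the `greetings` document and its tags are invented for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# User-defined tags, with text in several scripts inside one document.
MIXED = "<greetings><en>Hello</en><ja>こんにちは</ja><ar>مرحبا</ar></greetings>"
BROKEN = "<greetings><en>Hello</greetings>"  # <en> is never closed

print(is_well_formed(MIXED))
print(is_well_formed(BROKEN))
print(ET.fromstring(MIXED).findtext("ja"))
```

The same small parser handles any tag vocabulary and any mix of scripts, and it rejects a document whose tags do not pair up.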

3. Application of XML in Web Data Mining

XML has become a formal specification, and developers can tag and exchange data in XML format. XML provides a good way to handle data in a three-tier architecture. With a scalable three-tier model, XML can be generated from existing data, and the XML-structured data can be kept separate from business rules and presentation. Data integration, delivery, processing, and display are the steps in this process.


What drives XML applications are Web applications that cannot be built with standard HTML. These fall into four categories: applications that require the Web client to mediate between two or more heterogeneous databases; applications that attempt to move much of the processing load from the Web server to the Web client; applications that require the Web client to present the same data to different users in different views; and applications in which intelligent Web agents tailor information content to the needs of individual users. These applications are closely related to Web data mining technology, and Web-based data mining depends on them.

XML gives Web-based applications power and flexibility, which benefits developers and users alike. For example, searches become more meaningful because XML can uniquely identify Web data. Without XML, search software must understand how each database is built, which is impractical because nearly every database describes its data in a different format. Because data from different sources is so hard to integrate, searching across a variety of incompatible databases is not feasible without a common format. XML makes it easy to combine structured data from different sources. Software agents in the middle tier can integrate data from back-end databases and other applications, and the data can then be sent to clients or to other servers for further aggregation, processing, and distribution. The extensibility and flexibility of XML let it describe data held in many kinds of applications, from descriptions of collected Web pages to data records, so that the data can be used by multiple applications. At the same time, because XML-based data is self-describing, it can be exchanged and processed without a built-in description of the incoming data.

XML also makes local computation and processing easy. Once XML data has been delivered to the client, an application there can parse it, edit it, and process it. Users can work with the data in different ways instead of merely viewing it. The XML Document Object Model (DOM) allows data to be processed with scripts or other programming languages, so computation can be performed without a round trip to the server. XML separates the data from the user interface that displays it. With a simple, flexible, and open format, you can build powerful Web applications of a kind that previously could only be built on high-end databases. In addition, once the data reaches the desktop, it can be displayed in multiple ways.
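Combining structured data from different sources in a middle tier can be sketched as follows. This Python example assumes two hypothetical sources that describe the same kind of record with different tags, plus a hand-written mapping table; a real integration layer would need far more than this.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources that describe the same kind of data with different tags.
SOURCE_A = "<products><product><name>Lamp</name><cost>30</cost></product></products>"
SOURCE_B = "<stock><article><label>Desk</label><price>120</price></article></stock>"

# Per-source mapping (an assumption): record tag, then {source tag: unified tag}.
MAPPINGS = {
    "A": ("product", {"name": "title", "cost": "price"}),
    "B": ("article", {"label": "title", "price": "price"}),
}

def integrate(sources):
    """Merge records from every source into one <catalog> with uniform tags."""
    catalog = ET.Element("catalog")
    for key, xml_text in sources.items():
        record_tag, field_map = MAPPINGS[key]
        for record in ET.fromstring(xml_text).iter(record_tag):
            item = ET.SubElement(catalog, "item", source=key)
            for source_tag, unified_tag in field_map.items():
                ET.SubElement(item, unified_tag).text = record.findtext(source_tag)
    return catalog

catalog = integrate({"A": SOURCE_A, "B": SOURCE_B})
titles = [item.findtext("title") for item in catalog.findall("item")]
print(titles)
print(ET.tostring(catalog, encoding="unicode"))
```

After integration, a search tool only needs to understand the unified `<item>` vocabulary rather than the internal layout of each source.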

XML can also describe structured data in a simple, open, and extensible way. XML complements HTML, which is widely used to describe user interfaces: HTML describes how data looks, while XML describes the data itself. Because display and content are separated, data defined in XML can be presented in different ways, whichever suits the data best. Local data can be rendered dynamically according to client configuration, user choice, or other criteria. CSS and XSL provide the publishing mechanism for displaying the data. With XML, data can be updated at a fine granularity. When part of the data changes, the whole structured data set does not need to be resent; only the changed elements must be sent from the server to the client, and the changed data can be displayed without refreshing the entire user interface. Without XML, by contrast, a change to a single piece of data means the whole page must be rebuilt, which severely limits server scalability. XML also allows other data to be added, such as a forecast temperature; the added information can flow into the existing page without the server sending a new page.

When XML is used to interact with different data sources, the data may come from different databases, each with its own complex format. The client, however, interacts with all of them through a single standard language, XML. Because XML is customizable and extensible, it is expressive enough for all kinds of data. After receiving the data, the client can process it or pass it between databases. In short, XML solves the problem of a unified data interface for such applications. Unlike other data transmission standards, XML does not prescribe a specific format for the data in a file; instead it adds tags to the data to express its logical structure and meaning. This makes XML a format that programs can interpret automatically.
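The granular update described above can be pictured with a small Python sketch. It assumes a hypothetical weather document held by the client and a hypothetical convention that the server sends a single changed `<city>` element identified by its `id` attribute; no real update protocol is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical document already held by the client.
page = ET.fromstring(
    "<weather>"
    "<city id='bj'><temp>21</temp></city>"
    "<city id='sh'><temp>25</temp></city>"
    "</weather>"
)

def apply_update(document, update_text):
    """Replace only the <city> whose id matches the one in the update fragment."""
    update = ET.fromstring(update_text)
    for index, city in enumerate(document):
        if city.get("id") == update.get("id"):
            document[index] = update
            return True
    document.append(update)  # unknown id: treat the fragment as new data
    return False

# The server sends just the changed element, here with an added forecast value.
apply_update(page, "<city id='sh'><temp>27</temp><forecast>29</forecast></city>")
print(ET.tostring(page, encoding="unicode"))
```

Only one small fragment crosses the network, and the untouched `<city>` element stays exactly as the client already had it.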

XML can also shift a large share of the computing load to the client: clients choose or build different applications to process the data according to their own needs, and the server only has to issue the same XML file to all of them. In the traditional client/server model, by contrast, clients send different requests and the server answers each one separately. This not only increases the load on the server but also requires network administrators to survey user needs in advance and develop a different program for each. When user needs are complex and changing, concentrating all the business logic on the server is inappropriate, because server-side programmers may be unable to satisfy so many application requirements or keep up with changes in them, and both sides end up in a passive position. Applying XML hands the initiative for processing data to the client. The server's job is to package the data into XML files as completely and accurately as possible, so that each side takes what it needs and does what it does best. Because XML is self-describing, the client understands the logical structure and meaning of the data as it receives it, which makes broad, general-purpose distributed computing possible.
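The idea that the server issues one XML file while each client processes it in its own way can be sketched like this. The order data and the two client functions are hypothetical and stand in for whatever applications real clients would build.

```python
import xml.etree.ElementTree as ET

# The single XML file the server issues to every client (hypothetical data).
ORDERS = """
<orders>
  <order region="north"><amount>100</amount></order>
  <order region="south"><amount>250</amount></order>
  <order region="north"><amount>50</amount></order>
</orders>
"""

def total_amount(xml_text):
    """Client 1: wants the overall total."""
    orders = ET.fromstring(xml_text).iter("order")
    return sum(float(order.findtext("amount")) for order in orders)

def totals_by_region(xml_text):
    """Client 2: wants totals grouped by region."""
    totals = {}
    for order in ET.fromstring(xml_text).iter("order"):
        region = order.get("region")
        totals[region] = totals.get(region, 0.0) + float(order.findtext("amount"))
    return totals

print(total_amount(ORDERS))
print(totals_by_region(ORDERS))
```

The server needs no knowledge of either computation; adding a third kind of client requires no server-side change.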

XML is also applied in network agents that edit, add to, and trim the information obtained to suit the needs of individual users. Some clients obtain data not to use it directly but to organize their own databases as needed. For example, an education department may build a huge question bank. For an examination, questions are drawn from the bank to form a number of examination papers, and the papers are packaged as XML files. Each school then passes a file through a filter that removes all the answers and sends the result to the candidates, while the unfiltered content can be sent directly to teachers. After the examination, a compilation of the answers can also be sent out. The XML file can also carry related information such as a difficulty coefficient and the error rate in previous years. With only a few small programs, the same XML file can be transformed into multiple files for different users.
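The examination-paper filter in this example could look roughly like the Python sketch below. The `<paper>`, `<question>`, and `<answer>` tags and the `difficulty` attribute are assumptions made for illustration; the article does not specify a format.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical examination paper as issued by the question bank.
PAPER = """
<paper>
  <question difficulty="0.4"><text>2 + 2 = ?</text><answer>4</answer></question>
  <question difficulty="0.7"><text>Capital of France?</text><answer>Paris</answer></question>
</paper>
"""

def strip_answers(paper):
    """Return a copy of the paper with every <answer> element removed."""
    candidate_copy = copy.deepcopy(paper)
    # findall returns lists, so removing elements here does not disturb iteration.
    for question in candidate_copy.findall(".//question"):
        for answer in question.findall("answer"):
            question.remove(answer)
    return candidate_copy

teacher_paper = ET.fromstring(PAPER)              # unfiltered, for teachers
candidate_paper = strip_answers(teacher_paper)    # filtered, for candidates
print(ET.tostring(candidate_paper, encoding="unicode"))
```

The same source file yields both versions, and the teacher's copy is left untouched because the filter works on a deep copy.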


Web-oriented data mining is a complex technology. Because it is much more complex than mining a single data warehouse, it has been a difficult problem to solve. The emergence of XML offers an opportunity to solve it. Because XML makes it easy to combine structured data from different sources, it becomes possible to search across a variety of incompatible databases. The extensibility and flexibility of XML let it describe data held in different kinds of applications, including the data records in collected Web pages. At the same time, because XML-based data is self-describing, it can be exchanged and processed without a built-in description of the incoming data. As an industry standard for representing structured data, XML offers many advantages to organizations, software developers, Web sites, and end users. I believe that as XML becomes the standard way of exchanging data on the Web, Web-oriented data mining will become much easier.
