Web-oriented data mining
The Web holds a vast amount of data, and how to put that data to work in complex applications has become an active research topic in database technology. Data mining means discovering hidden regularities in large volumes of data in order to improve the quality of applications; its central task is to make full use of the useful data and to discard false or useless data. Data in a traditional database is strongly, indeed fully, structured, whereas the defining feature of data on the Web is that it is semi-structured, a term that is defined relative to the fully structured data of a traditional database. Web-oriented data mining is therefore considerably more complex than data mining over a single data warehouse.
1. Heterogeneous database environment
From the standpoint of database research, the information on Web sites can itself be viewed as a database, only a much larger and more complex one. Each site is a data source, and because the sources are heterogeneous, every site organizes and presents its information differently; together they form a huge heterogeneous database environment. Mining this data raises two problems. First, the heterogeneous data of the individual sites must be integrated: only when the data is integrated and presented to users through a unified view can the required information be drawn from such an enormous resource. Second, the problem of querying data on the Web must be solved, because if the required data cannot be retrieved effectively, there is nothing to analyze, integrate, or process.
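The idea of a unified view over heterogeneous sources can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two "sites", their field names, and the wrapper functions are hypothetical and stand in for whatever real sources would be integrated.

```python
# Two hypothetical sites describe the same kind of item with different layouts.
site_a = [{"title": "XML Basics", "price_usd": "29.50"}]
site_b = [{"name": "Data Mining", "cost": {"amount": 41.0, "currency": "USD"}}]


def from_site_a(rec):
    """Wrapper for site A: rename fields and convert the price to a number."""
    return {"title": rec["title"], "price": float(rec["price_usd"])}


def from_site_b(rec):
    """Wrapper for site B: flatten the nested cost structure."""
    return {"title": rec["name"], "price": float(rec["cost"]["amount"])}


def unified_view(sources):
    """Apply each source's wrapper and yield records in one common shape."""
    for records, wrapper in sources:
        for rec in records:
            yield wrapper(rec)


def query(view, max_price):
    """A query written once against the unified view, not per site."""
    return [r["title"] for r in view if r["price"] <= max_price]


books = list(unified_view([(site_a, from_site_a), (site_b, from_site_b)]))
cheap = query(books, 30.0)
print(cheap)
```

The query function never sees the site-specific layouts; only the wrappers do, which is the point of integrating first and querying second.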
2. Semi-structured data structure
Data on the Web differs from data in a traditional database. A traditional database has a definite data model, and specific data can be described according to that model. Data on the Web is far more complex: there is no single model that describes it, each site designs its data independently, and the data itself is self-describing and changes dynamically. Web data therefore does have some structure, but because that structure is carried by a self-describing layer rather than by a fixed schema, it is not fully structured; such data is called semi-structured data. Being semi-structured is the most prominent feature of data on the Web.
3. Solving the semi-structured data source problem
Web data mining technology must chiefly solve the query and integration problems for a semi-structured data source model and a semi-structured data model. Integrating and querying heterogeneous data on the Web requires a model that describes that data clearly, and since Web data is semi-structured, finding a suitable semi-structured data model is the key to the problem. Defining the model is not enough, however: a semi-structured model extraction technique is also needed, that is, a technique that automatically extracts the semi-structured model from existing data. Web-oriented data mining must therefore be built on a semi-structured data model and on semi-structured model extraction.
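As a rough illustration of model extraction, the following Python sketch infers a simple structural summary from an existing document by counting every element path and attribute path it contains. This is only one possible notion of a "model", chosen for brevity; the sample document and the function name extract_paths are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<catalog>
  <book id="b1"><title>XML Basics</title><price>29.50</price></book>
  <book id="b2"><title>Data Mining</title><author>Li</author></book>
</catalog>"""


def extract_paths(xml_text):
    """Count every element path and attribute path seen in the document."""
    paths = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths[path] += 1
        for attr in elem.attrib:
            paths[f"{path}/@{attr}"] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


model = extract_paths(SAMPLE)
for path, count in sorted(model.items()):
    print(path, count)
```

Paths that occur in only some of the records (here the price and author of a book) show up with lower counts, which is exactly the irregularity that distinguishes semi-structured data from a fixed schema.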
XML and Web data mining technology
The new generation of the WWW environment, built on XML, works directly with Web data: it remains compatible with existing Web applications and makes sharing and exchanging information on the Web easier. XML can be regarded as a semi-structured data model. The descriptions in an XML document can readily be mapped to attributes in a relational database, which allows precise queries and model extraction.
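The correspondence between an XML description and relational attributes can be sketched in Python with the standard sqlite3 and xml.etree.ElementTree modules. The document, the table name book, and its columns are hypothetical; the sketch assumes every record carries the same three fields.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<catalog>
  <book id="b1"><title>XML Basics</title><price>29.50</price></book>
  <book id="b2"><title>Data Mining</title><price>41.00</price></book>
</catalog>"""


def load_books(conn, xml_text):
    """Map each <book> element to one row: attribute and child tags become columns."""
    conn.execute("CREATE TABLE book (id TEXT PRIMARY KEY, title TEXT, price REAL)")
    rows = [
        (b.get("id"), b.findtext("title"), float(b.findtext("price")))
        for b in ET.fromstring(xml_text).iter("book")
    ]
    conn.executemany("INSERT INTO book VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load_books(conn, DOC)

# Once the mapping is done, an exact relational query is possible.
titles = [t for (t,) in conn.execute(
    "SELECT title FROM book WHERE price < 30 ORDER BY id")]
print(titles)
```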
1. Origin and development of XML
XML (Extensible Markup Language) is a simplified subset of SGML (Standard Generalized Markup Language), designed by the World Wide Web Consortium (W3C) specifically for Web applications. Broadly speaking, XML is a meta-markup language that provides a format for describing structured data; more specifically, it is a language similar to HTML that is designed to describe data. XML gives programs an application-independent way to share data. It is a new standard language for describing information automatically, and it allows computer communication to extend the role of the Internet from information transfer to many other kinds of human activity. XML consists of a set of rules with which markup languages can be created, and any markup language created this way can be processed by a compact program called a parser. Just as HTML gave early computer users a common way to read Internet documents, XML creates a common language that anyone can read and write. XML addresses two Web problems that HTML cannot solve: access speeds that have not kept pace with the rapid growth of the Internet, and the difficulty of finding the part you need in the enormous amount of information available. XML adds structural and semantic information that lets computers and servers process many forms of information immediately. As a result, XML's extensibility not only allows large amounts of information to be downloaded from a Web server but can also greatly reduce network traffic.
XML tags are not predefined; users must define the tags they need, which makes XML a self-describing language. XML uses a DTD (Document Type Definition) to describe the structure of the data, while XSL (Extensible Stylesheet Language), the stylesheet description language for XML, is the mechanism that describes how documents are displayed. XSL goes further than the CSS (Cascading Style Sheets) used with HTML: it comprises two parts, one for transforming XML documents and one for formatting them. XLL (Extensible Link Language) is the linking language for XML. It provides links in XML documents that resemble HTML links but are more powerful: with XLL, links can run in multiple directions and can exist at the level of individual objects rather than only at the page level. Because XML can tag far more information, it makes it easier for users to find what they need. With XML, Web designers can create not only text and graphics but also multi-level, interdependent systems, data trees, metadata, hyperlink structures, and style sheets, all defined by document types.
2. Main features of XML
The characteristics of XML account for its outstanding performance. As a markup language, XML has the following main features:
(1) Simple. XML is carefully designed and its specification is short and straightforward. It consists of a set of rules with which markup languages can be created, and every markup language created this way can be handled by a compact program, usually called a parser. XML thus creates a common language that anyone can read and write, and providing this common language is its unifying function. Tags created in XML always appear in pairs, and XML relies on the character encoding standard Unicode.
(2) Open. XML is derived from SGML, so many mature software tools on the market can already be used to write and manage XML documents. As an open standard, XML is based on proven standard technology and is optimized for the network. Many leading companies in the industry work with the W3C working groups to ensure interoperability, to support developers, authors, and users across systems and browsers, and to improve the XML standard. An XML parser can load an XML document programmatically; once the document is loaded, the user can read and manipulate all of its information through the XML Document Object Model, which speeds up processing over the network.
(3) Efficient and extensible. XML supports the reuse of document fragments. Users can invent and use their own tags and share them with others; XML is extensible in the sense that an unlimited number of tag sets can be defined. XML provides a framework for marking up structured data: an XML element can declare the data associated with it to be a retail price, a sales tax, a book title, a quantity, or any other data element. As more organizations around the world adopt the XML standard, more related capabilities will appear: once data has been located, it can be transmitted over the wire in any form, presented in a browser, or forwarded to other applications for further processing. XML provides an application-independent way to share data. Groups from different organizations can agree on a common DTD for exchanging data; an application can use this standard DTD to verify that the data it receives is valid, and the same DTD can be used to validate one's own data.
(4) International. The standard is internationalized and supports most of the world's writing systems, because it relies on Unicode, a character encoding standard that supports mixed text in all of the world's major languages. With HTML, as with most word processing, a document is usually written in one particular language, whether English, Japanese, or Arabic, and a user whose software cannot read the characters of that language cannot use the document. Software that can read XML, by contrast, can handle any combination of characters from these languages. XML can therefore exchange information not only between different computer systems but also across national and cultural boundaries.
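The internationalization point can be checked with a small Python sketch: one document mixes English, Japanese, and Arabic text, and a standard parser reads all of it and survives a round trip through UTF-8 bytes. The element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<greetings>
  <greeting lang="en">Hello</greeting>
  <greeting lang="ja">こんにちは</greeting>
  <greeting lang="ar">مرحبا</greeting>
</greetings>"""

root = ET.fromstring(DOC)
by_lang = {g.get("lang"): g.text for g in root.iter("greeting")}

# Serialize to UTF-8 bytes, as if sending the document to another system,
# then parse the bytes again on the "receiving" side.
encoded = ET.tostring(root, encoding="utf-8")
round_trip = {g.get("lang"): g.text for g in ET.fromstring(encoded).iter("greeting")}

print(sorted(by_lang))
print(round_trip == by_lang)
```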
3. The application of XML in Web data mining
XML has become a formal specification, and developers can now mark up and exchange data in XML format. XML offers a good way to process data in a three-tier architecture. With a scalable three-tier model, XML can be generated from existing data, and structured data expressed in XML can be kept separate from business rules and from presentation. Integration, delivery, processing, and display of the data then form the successive steps of the process.
XML makes possible Web applications that cannot be built with standard HTML alone. These applications fall into four categories: applications that require a Web client to mediate between two or more heterogeneous databases; applications that try to move most of the processing load from the Web server to the Web client; applications that require a Web client to present the same data in different views to different users; and applications in which an intelligent Web agent tailors the information it finds to the needs of individual users. These applications are closely linked to Web data mining technology, and Web-based data mining depends on them for its implementation.
XML brings developers and users many benefits by giving Web-based applications powerful functionality and flexibility. One example is more meaningful search, since Web data can be uniquely identified by XML markup. Without XML, search software would have to understand how each database is built, which is impossible in practice because almost every database describes its data in a different format; for the same reason, searching across a variety of incompatible databases is currently not feasible. XML makes it easy to combine structured data from different sources. Software agents on a middle-tier server can integrate data from back-end databases and other applications, and the data can then be sent to clients or to other servers for further aggregation, processing, and distribution. The extensibility and flexibility of XML allow it to describe data from many kinds of applications, from collected Web pages to data records, so that the data can be used by a variety of applications. Moreover, since XML-based data is self-describing, it can be exchanged and processed without a built-in description of the data. XML also lets users compute on and process data conveniently: once XML data has been sent to the client, the client can parse and edit it with application software, so users can work with the data in different ways rather than merely display it. The XML Document Object Model (DOM) allows the data to be processed with scripts or other programming languages, and calculations on the data need not go back to the server. XML can be used to separate the user's view of the data from the data itself; with this simple, flexible, open format, powerful applications that previously could be built only on high-end databases can be created for the Web. Finally, once the data reaches the desktop, it can be displayed in a variety of ways.
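To make the DOM point concrete, here is a minimal Python sketch that uses the standard xml.dom.minidom module to compute a total on the receiving side, without asking the server to do the arithmetic. The order document and the helper names text_of and order_total are hypothetical.

```python
from xml.dom.minidom import parseString

ORDER = """<order>
  <item><name>pen</name><price>1.50</price><quantity>4</quantity></item>
  <item><name>notebook</name><price>3.00</price><quantity>2</quantity></item>
</order>"""


def text_of(node, tag):
    """Return the text content of the first <tag> element below node."""
    return node.getElementsByTagName(tag)[0].firstChild.data


def order_total(xml_text):
    """Walk the DOM tree and sum price * quantity over all items."""
    dom = parseString(xml_text)
    return sum(
        float(text_of(item, "price")) * int(text_of(item, "quantity"))
        for item in dom.getElementsByTagName("item")
    )


total = order_total(ORDER)
print(total)
```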
XML describes structured data in a simple, open, extensible way. It complements HTML, which is widely used to describe the user interface: HTML describes the appearance of the data, while XML describes the data itself. Because display is separated from content, data defined in XML can be given different display modes so that it is presented more appropriately. Local data can be rendered dynamically according to the client configuration, the user's choice, or other criteria, and CSS and XSL provide the mechanisms for displaying it. With XML, data can also be updated at a fine granularity. When part of the data changes, the entire structured data set does not have to be sent again; only the changed elements need to be sent from the server to the client, and the changed data can be displayed without refreshing the whole user interface. At present, by contrast, a change to a single piece of data forces the entire page to be rebuilt, which severely limits the scalability of the server. XML also allows further data to be added, such as a forecast temperature, and the added information can enter the existing page without the server having to send a new page.
When XML is used by clients that must interact with different data sources, the data may come from different databases, each with its own complex format, yet the client interacts with all of them through a single standard language, XML. Because XML is customizable and extensible, it is expressive enough for all of these kinds of data. Clients can process the data once they receive it, or pass it on between different databases. In short, XML solves the problem of a unified data interface for this kind of application. Unlike other data exchange standards, however, XML does not prescribe a specific format for the data in a file; instead, it attaches tags to the data to express its logical structure and meaning. This makes XML a specification that programs can interpret automatically.
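The granular update described here can be sketched as follows in Python. The weather page, the update message format, and the function apply_update are all hypothetical; the sketch assumes that records are matched by an id attribute and that the update carries only changed or newly added fields, such as a forecast temperature.

```python
import xml.etree.ElementTree as ET

PAGE = """<weather>
  <city id="bj"><temp>21</temp></city>
  <city id="sh"><temp>25</temp></city>
</weather>"""

# Hypothetical update message: it carries only what changed or was added.
UPDATE = """<update>
  <city id="sh"><temp>27</temp><forecast>29</forecast></city>
</update>"""


def apply_update(page, update):
    """Merge changed or new fields into the page held by the client."""
    by_id = {c.get("id"): c for c in page.iter("city")}
    for changed in update.iter("city"):
        target = by_id[changed.get("id")]
        for field in changed:
            existing = target.find(field.tag)
            if existing is None:
                # New information (e.g. a forecast) is added in place.
                existing = ET.SubElement(target, field.tag)
            existing.text = field.text
    return page


page = apply_update(ET.fromstring(PAGE), ET.fromstring(UPDATE))
print(ET.tostring(page, encoding="unicode"))
```

Only one city element travels over the network in this sketch; the record for the other city is left untouched on the client.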
XML is also used to shift a large share of the computing load to the client. Clients can choose or write different applications to process the data according to their own needs, while the server only has to emit the same XML file. In the traditional client/server model, clients send different requests and the server answers each one separately. This increases the load on the server, and it also obliges the network administrator to investigate the various user requirements in advance and to write a different program for each. When user requirements are varied and changeable, concentrating all the business logic on the server is inappropriate: server-side programmers may be unable to satisfy so many application requirements or to keep up with changes in demand, which leaves both sides in a passive position. With XML, the task of processing the data passes to the client, and the server only has to package the data in an XML file as completely and accurately as possible, so that each side does its own job. Because XML is self-describing, the client understands the logical structure and meaning of the data as soon as it receives it, which makes broad, general-purpose distributed computing possible.
XML is also used by network agents to edit and filter the information they obtain so that it suits the needs of individual users. Some clients obtain data not for direct use but to organize their own databases as needed. Suppose, for example, that an education department builds a large question bank. At examination time, a number of questions are drawn from the bank and assembled into an examination paper, and the paper is packaged as an XML file. At each school the file is passed through a filter that strips out all the answers before the paper is sent to the candidates, while the unfiltered content can go directly to the teachers; after the examination, a compiled copy of the answers can of course be sent out as well. The XML file can also carry other relevant information, such as the difficulty coefficient or the error rate in previous years. With only a few small programs, the same XML file can thus be turned into several files for different users.
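The examination example can be sketched as a small filter in Python. The structure of the paper (question, text, and answer elements with a difficulty attribute) is a hypothetical layout chosen for illustration; the function student_copy plays the role of the school-side filter.

```python
import xml.etree.ElementTree as ET

PAPER = """<paper>
  <question difficulty="0.4"><text>2 + 2 = ?</text><answer>4</answer></question>
  <question difficulty="0.7"><text>Capital of France?</text><answer>Paris</answer></question>
</paper>"""


def student_copy(xml_text):
    """Return a copy of the paper with every <answer> element removed."""
    root = ET.fromstring(xml_text)
    for question in list(root.iter("question")):
        for answer in question.findall("answer"):
            question.remove(answer)
    return root


student = student_copy(PAPER)      # filtered version for candidates
teacher = ET.fromstring(PAPER)     # unfiltered version for teachers

print(ET.tostring(student, encoding="unicode"))
```

The same source file yields both versions; a further small program could select questions by the difficulty attribute in the same way.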
Conclusion
Web-oriented data mining is a complex technology. Because mining Web data is more complex than mining a single data warehouse, it has been a difficult problem to solve, and the advent of XML offers an opportunity to solve it. XML makes it easy to combine structured data from different sources and so makes it possible to search across a variety of incompatible databases, which gives hope of meeting the challenges of Web data mining. The extensibility and flexibility of XML allow it to describe data from many kinds of applications, including the data records in collected Web pages. Moreover, since XML-based data is self-describing, it can be exchanged and processed without a built-in description. As an industry standard for structured data, XML offers many advantages to organizations, software developers, Web sites, and end users. As XML becomes a standard way of exchanging data on the Web, Web-oriented data mining can be expected to become much easier.