The Application of Search Engines in Network Information Mining


With the rapid growth of network information resources, people pay increasing attention to how to extract potentially valuable information from massive amounts of network information quickly and effectively, so that it can play a useful role in management and decision-making. Search engine technology addresses the difficulty users face in retrieving network information, and has become an object of research and development in computer science and the information industry. The purpose of this paper is to explore the application of search engine technology in network information mining.


I. The research status of data mining


Any discussion of network information mining should begin with traditional data mining.


1. What is data mining


According to the definition given by W. J. Frawley, G. Piatetsky-Shapiro, and others, data mining refers to the extraction of hidden, previously unknown, and potentially useful information of interest from large databases. The raw data can be structured, such as data in relational databases; semi-structured, such as text, graphics, and image data; or even heterogeneous data distributed across networks. The methods of data mining can be mathematical or non-mathematical, deductive or inductive. The discovered information can be used for information management, decision support, process control, and so on, as well as for the maintenance of the data itself. Data mining is therefore a broad interdisciplinary field, bringing together researchers from databases, artificial intelligence, mathematical statistics, visualization, parallel computing, and other areas, along with engineers and technicians.


2. The research status of data mining


At present, development trends and research in data mining abroad mainly include: further study of knowledge discovery methods, such as recent research on and improvement of Bayesian methods and the boosting method; the application of statistical regression methods in KDD; closer integration of KDD with databases; and research on methods for network information mining. Many foreign computer companies attach great importance to the development and application of data mining: IBM and Microsoft have each set up corresponding research centers, and related software from some companies, such as Platinum, BO, and IBM, has begun to be sold domestically.


Domestic data mining researchers are based mainly in universities, with some in research institutes or companies. The research fields involved are numerous, generally focused on the study of learning algorithms, practical applications of data mining, and data mining theory. Most ongoing research projects are funded by the government, for example through the National Natural Science Foundation, the 863 Program, and the Ninth Five-Year Plan.


It can be seen that the research and application of data mining has received growing attention from academia, business, and government departments.


3. Classification of data mining and its tools


1) By application type, data mining can be divided into the following categories. ① Classification model. Its main function is to assign data to different groups according to the properties of commercial data, and to find the attribute patterns of the data by analyzing the attributes of the data within each group. ② Association model. It mainly describes the degree of closeness, or the relationships, among a set of data items; association rules are derived by mining the data in order to understand customer behavior. ③ Sequence model. It is mainly used to analyze time-related data in a data warehouse and to discover correlation patterns in the data over a certain period. It is a specific kind of association model that adds a time attribute. ④ Cluster model. It is mainly used when the data to be analyzed lacks descriptive information or cannot be organized into any known classification scheme: user data is divided into different groups according to some measure of similarity, and then, using the clustering model and some discovery rule, a description of the whole data set is derived.
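To make the cluster model concrete, here is a minimal sketch (our own illustration, not from the original article) of k-means clustering on one-dimensional data: values are grouped by a similarity measure, here plain distance to a cluster centre.

```python
def kmeans_1d(values, k, iterations=20):
    """Group 1-D values into k clusters by distance to the cluster mean."""
    # Start with the first k distinct values as initial centres.
    centres = sorted(set(values))[:k]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest centre.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Two obvious groups: small values near 1-3 and large values near 100-102.
centres, clusters = kmeans_1d([1, 2, 3, 100, 101, 102], k=2)
```

Real tools use multidimensional distances and smarter initialization, but the assign-then-update loop is the same.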


2) Typical methods and tools used in data mining


For the above application types, the data mining field has proposed a variety of implementation methods and algorithms; only a few common, typical ones are discussed here. ① Neural networks. Based on mathematical models capable of self-learning, they can analyze large amounts of complex data and carry out very complicated pattern extraction and trend analysis. Neural networks are well suited to the classification model, but their conclusions are opaque: the output comes with no explanation, which affects the reliability and acceptability of the results. In addition, training takes a long time, so performance can become a problem when the data volume is large. ② Decision trees. These classify data by a series of rules. With a decision tree, the data rules can be visualized and the output is easy to understand. The decision tree method has high precision and a simple construction process, so it is widely used. Its disadvantages are that it is difficult to find rules based on combinations of multiple variables, and the splits between branches of different decision trees are not smooth. ③ Online analytical processing (OLAP). It mainly supports leadership decision-making through analysis, queries, and reports over users' current and historical data. ④ Data visualization. Data warehouses contain large amounts of data and rich data models, so visualizing that much data requires sophisticated data visualization tools.
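The heart of decision tree construction is choosing the rule that best separates the classes at each node. As a hedged sketch (our own toy example, not a full tree builder), the following finds the single best threshold on one numeric feature by information gain:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_split(values, labels):
    """Find the threshold t on one numeric feature that maximises
    information gain for the node rule 'value <= t'."""
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# The label flips cleanly between 25 and 31, so the best rule is 'value <= 25'.
t, gain = best_split([18, 22, 25, 31, 40, 52],
                     ["no", "no", "no", "yes", "yes", "yes"])
```

A full tree builder simply applies this split search recursively to each resulting subset.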


Currently, data mining technology is still developing. Data mining draws on mathematical statistics, fuzzy theory, neural networks, artificial intelligence, and so on. Combining data mining with visualization technology, geographic information systems, and statistical analysis systems can enrich the functionality and performance of data mining technologies and tools.


4. Network information mining and its classification


Network information mining is an extremely complicated process. It differs from traditional data warehouse technology and simple knowledge discovery in databases (KDD): the mass of information it faces is not all simply structured data, but is often semi-structured, such as text, graphics, and image data, or even heterogeneous. The methods for discovering knowledge can be mathematical or non-mathematical, deductive or inductive.


Network information mining can be roughly divided into four steps: ① resource discovery, that is, retrieving the required network documents; ② information selection and preprocessing, that is, automatically selecting and preprocessing the retrieved network resources to obtain specialized information; ③ generalization, that is, discovering common patterns across a single Web site or across multiple sites; ④ analysis, that is, confirming or explaining the mined patterns.
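The four steps can be sketched schematically. Each step below is a deliberately tiny stand-in for the real techniques (crawling, NLP preprocessing, pattern discovery), chosen only to make the data flow between the steps concrete:

```python
def mine(documents, query_terms):
    """Schematic of the four-step mining pipeline; each step is a toy."""
    # Step 1: resource discovery -- retrieve documents matching the query.
    found = [d for d in documents if any(t in d for t in query_terms)]
    # Step 2: selection and preprocessing -- normalise and tokenise.
    cleaned = [d.lower().split() for d in found]
    # Step 3: generalisation -- find terms common to every selected document.
    common = set(cleaned[0]).intersection(*cleaned[1:]) if cleaned else set()
    # Step 4: analysis -- report the pattern for a human to confirm or explain.
    return sorted(common)

docs = ["Python data mining", "Python web mining", "Cooking recipes"]
pattern = mine(docs, ["mining"])
```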


According to the mining object, network information mining can be divided into network content mining, network structure mining, and network usage mining. ① Network content mining: the process of discovering useful information from a network's content, data, and documents. Network information resources come in many types. From the viewpoint of information sources, a large portion of network resources can be crawled directly from the Internet, indexed, and made searchable, but some network information is "hidden", such as answers to user questions and dynamically generated results, data held in a DBMS, or private data; these cannot be indexed, so no efficient way to retrieve them is provided. From the viewpoint of resource form, network content consists of text, images, audio, video, metadata, and so on, so network content mining is a form of multimedia data mining. ② Network structure mining: mining the Web's latent link structure patterns. The idea stems from citation analysis: by analyzing a Web page's links, together with the number of links and the objects they point to, a model of the Web's own link structure is established. It can be used to categorize Web pages, to derive similarity and relevance information between pages, and to help users find authoritative sites on related topics. ③ Network usage mining: extracting meaningful data about users' network behavior. The objects of network content mining and network structure mining are primary data on the net, while network usage mining faces second-hand data extracted from the interaction between users and the network. This data includes Web server access records, proxy server logs, browser logs, user profiles, registration information, user sessions or transaction information, user questions, and so on.
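A first step in network usage mining is turning raw access records into per-page and per-user patterns. Here is a minimal sketch over a toy "user page" log format (the format and function names are our own assumptions, not a real server log schema):

```python
from collections import Counter, defaultdict

def usage_patterns(log_lines):
    """Toy usage-mining pass: count page popularity and rebuild
    each user's click path from 'user page' access records."""
    page_counts = Counter()
    user_paths = defaultdict(list)
    for line in log_lines:
        user, page = line.split()
        page_counts[page] += 1
        user_paths[user].append(page)
    return page_counts, dict(user_paths)

log = [
    "alice /home", "alice /search", "bob /home",
    "bob /search", "bob /results", "alice /results",
]
counts, paths = usage_patterns(log)
```

Real systems parse standard log formats, split paths into time-bounded sessions, and feed the results to the predictive analyses described later in this article.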


II. The main technologies of search engines, their application, and development trends


Narrowly speaking, network information retrieval is a kind of network information (content) mining. Therefore, to explore network information mining, it is also necessary to examine search engines.


1. What is a search engine


A search engine is a kind of Web site specializing in providing query services on the Internet. It collects the pages of a large number of Web sites through network search software (also known as network search robots) or through site registration, processes them, and builds a database that can respond to users' various queries and provide the information they need. Users' query methods mainly include free-word and full-text search, keyword search, category browsing, and other special-purpose retrieval (companies, personal names, telephone yellow pages, and so on).


2. The main technologies of search engines


A search engine generally consists of four parts: the searcher, the indexer, the retriever, and the user interface. ① Searcher: its function is to roam the Internet, discovering and collecting information. It should gather as much new information as possible and regularly update old information to avoid dead and invalid links; implementations often use distributed, parallel computing techniques to improve the speed of information discovery and updating. ② Indexer: its function is to understand the information gathered by the searcher, extract index entries from it to represent documents, and generate the index tables of the document library. Indexers can use either centralized or distributed indexing algorithms. ③ Retriever: its function is to quickly find documents in the index library according to the user's query, evaluate the relevance of documents to the query, sort the results to be output, and implement some mechanism for user relevance feedback. Commonly used information retrieval models fall into four kinds: set-theoretic, algebraic, probabilistic, and hybrid models. ④ User interface: its role is to accept user queries, display query results, and provide a mechanism for user relevance feedback. Interfaces can be simple or complex: a simple interface provides only a text box for the user to enter a query string, while a complex interface also lets the user constrain the query.
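The indexer/retriever pair described above can be illustrated with a minimal inverted index. This is a sketch of the general technique, not the implementation of any particular engine; the function names are our own:

```python
from collections import defaultdict

def build_index(pages):
    """Indexer: map each word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Retriever: return ids of pages containing every query word."""
    sets = [index.get(w.lower(), set()) for w in query.split()]
    return sorted(set.intersection(*sets)) if sets else []

pages = {
    1: "search engine technology",
    2: "data mining technology",
    3: "search engine and data mining",
}
hits = search(build_index(pages), "search technology")
```

Production indexers also record positions and weights per occurrence (as in Google's "hits", described below) so the retriever can rank, not just match.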


3. Search engine applications


Currently, several of the larger Chinese search engines are Yahoo China, Sohu, Sina, NetEase, Peking University's Tianwang (Skynet) search (http://e.pku.edu.cn), and so on.


When searching for information in the Internet's ocean of data: first, use more than one search engine, unless the first search yields perfect results. Second, through extensive practice, carefully learn the characteristics and functions of each search engine. Third, statistics show that many users enter only a single word as a query, and the results are often highly redundant; it is recommended to use multiple words at once to narrow the search. Fourth, if the initial search is unsuccessful, try synonyms. In addition, it pays to accumulate a list of excellent professional Web sites and database sites over time.


According to statistics published by ***ic on July 27, 2000, search engines accounted for 55.91% of network application use, making them the second largest Internet application in China after sending and receiving e-mail. Search capability has thus become an important function of Web site construction and the main means of network information mining.


4. Future development trends of search engines


With the exponential growth of WWW information, current search engines suffer from slow search speeds, too many dead links, and too much duplicate or irrelevant information, making it difficult to meet people's diverse information needs. Search engines will develop toward intelligence, precision, cross-language retrieval, multimedia retrieval, and specialization to adapt to different user needs. ① Intelligent search engines: the development direction of search engines. They use intelligent agent technology to infer the user's query plan, intent, and interests, automatically collect and filter information, and deliver to users the information that interests them. ② Attention to the accuracy of query results and improved search effectiveness. There are several ways to address the problem of excessive query results. A. Build content-based search engines; the more mature content-based solutions rely on information processing technologies such as semantic networks, Chinese word segmentation, syntactic analysis, and synonym handling to understand users' needs as fully as possible. B. Translate user questions into known questions and then answer the known questions, reducing reliance on natural-language understanding technology. C. Classify results using text classification techniques and display the classification structure with visualization techniques, so users can browse only the categories that interest them. D. Cluster by site or by content to reduce the total amount of information. E. Allow users to filter the returned results; a secondary, refining query is a very effective means. ③ Cross-language retrieval: retrieving from databases in multiple languages and returning documents in all languages that can answer the user's question. This technology is still at a preliminary research stage and is a development direction for search engines. ④ Multimedia search engines: since the future Internet will be a multimedia data network, developing search engines that query images, sounds, pictures, and movies is a new direction. ⑤ Specialized search engines: these collect information on a particular industry, subject, or region, and are characterized by strong focus and practicality, for example company lookups, business inquiries, personal name lookups, and professional information queries.
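Method C above (classifying results so users browse only interesting categories) can be sketched with a toy keyword-based classifier. The categories, keywords, and function name here are our own illustrative assumptions, not a real engine's taxonomy:

```python
def classify_results(results, categories):
    """Put each result title under every category whose keywords it mentions."""
    grouped = {name: [] for name in categories}
    for title in results:
        words = set(title.lower().split())
        for name, keywords in categories.items():
            if words & keywords:  # any shared keyword => assign to category
                grouped[name].append(title)
    return grouped

results = ["Java web frameworks", "Python data mining", "Hiking trails guide"]
categories = {
    "programming": {"java", "python"},
    "outdoors": {"hiking", "camping"},
}
groups = classify_results(results, categories)
```

Real text classification uses trained models rather than hand-picked keyword sets, but the user-facing effect, results grouped into browsable categories, is the same.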


III. The application of search engines in network information mining


1. An application example of a search engine in network information mining


Take the well-known foreign search engine Google (http://www.google.com) as an example to analyze the application of network information mining in network information retrieval. First, let us look at Google's architecture (see Figure 1).


Google's search mechanism is as follows. Several distributed crawlers (automatic search programs) work together, "crawling" the Web; a URL server supplies the crawlers with lists of URLs. The pages found by the crawlers are sent to a store server, which compresses them and stores them in a repository. Each page has an associated identifier, the doc ID, which is assigned whenever a new URL is parsed out of a Web page. The indexer and the sorter are responsible for indexing. The indexer reads records from the repository, decompresses and parses the documents, and converts each document into a set of word occurrences called hits. A hit records the word, its position in the document, its font size, capitalization, and so on. The indexer distributes these hits into a set of "barrels", producing a partially sorted index. At the same time, the indexer parses all the links in every Web page and stores the important information about them in an anchors file, which contains enough information to determine where each link points from and to.

The URL resolver reads the anchors file, converts relative URLs into absolute URLs and then into doc IDs, and puts the anchor text into the index, associated with the doc ID the anchor points to. It also generates a database of pairs of doc IDs. This links database is used to compute the page rank (PageRank) of all documents.
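The article notes that the links database is used to compute PageRank. As an illustration of the idea (a minimal power-iteration sketch, not Google's actual implementation), rank flows along links until it stabilizes:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new[q] += share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Page 'a' is linked to by both other pages, so it earns the highest rank.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

The key property is that a link acts like a citation: pages pointed to by many (or by highly ranked) pages accumulate more rank, which is exactly the citation-analysis idea behind network structure mining.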

The sorter reads the barrels and generates the inverted index, sorted by word ID. A program called DumpLexicon combines this list with the lexicon produced by the indexer to generate a new lexicon for use by the searcher. The searcher runs on a Web server and answers users' queries using the lexicon built by DumpLexicon together with the inverted index and the PageRanks.


From Google's architecture and search principles, it can be seen that the key points are the use of URL resolution to obtain link information, and the use of a particular algorithm to compute page rank information; this is network structure mining technology.


2. The application prospects of network information mining


Network information mining has been widely applied in finance, retail, telecommunications, government administration, manufacturing, medical services, sports, and other fields, and its application and research are becoming a hot topic. Its application prospects are mainly reflected in three areas. ① Electronic commerce. Network mining technology can automatically discover the patterns hidden in server and browser log records, understand the system's access patterns and users' behavior patterns, and make predictive analyses. For example, by evaluating the time a user spends browsing an information resource, one can gauge the user's interest in it; domain name data collected from log files can be analyzed by country or type (.com, .edu, .gov); and cluster analysis can be applied to identify users' access motivations and trends. ② Web site design. Mining a site's content makes it possible to organize the site's information effectively, for example using automatic classification to organize it hierarchically; mining users' log information helps grasp users' interests and supports the site's information push services. ③ Search engines. The greatest strength of network information mining as used by search engines lies in its techniques for mining Web link information. By mining Web page content, pages can be clustered and classified, enabling classified browsing and retrieval of network information; by analyzing users' query histories, queries can be effectively expanded, improving users' retrieval effectiveness (recall and precision); and network content mining can improve keyword weighting algorithms, raising the accuracy of network information indexing and improving retrieval effectiveness.
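The keyword weighting mentioned above is classically done with TF-IDF: a term is weighted highly in a document when it is frequent there but rare across the collection. A minimal sketch (our own illustration of the standard technique, not a specific engine's algorithm):

```python
from math import log

def tfidf(documents):
    """Per-document TF-IDF keyword weights for a small collection."""
    n = len(documents)
    tokenised = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word appear?
    df = {}
    for words in tokenised:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    # TF-IDF: term frequency in the document times log inverse document frequency.
    weights = []
    for words in tokenised:
        scores = {w: (words.count(w) / len(words)) * log(n / df[w])
                  for w in set(words)}
        weights.append(scores)
    return weights

docs = ["search engine search", "data mining", "search and mining"]
w = tfidf(docs)
```

In the first document, "engine" outscores "search" despite appearing less often, because "search" also occurs elsewhere in the collection; this is precisely how weighting sharpens indexing accuracy.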


The above lists only three areas of application for network information mining technology. Its applications are becoming ever more extensive, and users' demand for high-quality, personalized information will drive further research and development in academia and industry.


IV. Concluding remarks


Web-oriented data mining is a complex technology, since mining Web data is much more complex than mining a single data warehouse. We believe that with the adoption of XML as a standard way of exchanging data on the Web, the diversification of users' information needs, and the deepening of research on network information mining, "intelligent" search engines will emerge, and Web-oriented information mining will become much easier.

