Internet Information Mining Technology (Author: Zhang chengmin Zhang Chengzhi)

Author: Zhang chengmin Zhang Chengzhi

Abstract: This article introduces Internet information mining technology, describes the key technologies and system processes of network information mining, and, drawing on the development and application of an agricultural network information mining system, points out the application prospects of network information mining.

Keywords: data mining, Internet, web pages, information extraction

About the WDM Technology
Zhang Chengzhi
(Department of Information Management, Nanjing Agricultural University, Nanjing 210095)

Abstract: This paper introduces Web data mining (WDM), expounds its key technologies and system process, and then uses an agricultural Web data mining (AWDM) system as an example to show that WDM has good prospects in practice.
Keywords: data mining, Internet, web pages, information extraction

I. Overview

With the rapid development of the Internet, more and more information is presented to users, yet users find it increasingly difficult to obtain the information they need most. To address this problem, semi-automated search engines, represented by Yahoo, emerged. A network search engine consists of three parts: a network robot, an index database, and a query service [1]. The robot traverses Internet resources to discover and collect as much new information as possible. Full-text retrieval technology is used to index the collected information and store the index in the index database, which greatly improves retrieval speed. The query service receives and analyzes user queries, treats them as queries against the index database, searches the index database according to a matching strategy such as the Boolean model or the fuzzy Boolean model, and returns the matching results (title, short summary, and link address) to the user. Because artificial intelligence research has not yet reached a practical level, network robots cannot classify information accurately, so retrieval results are often unsatisfactory. For example, a user who searches for "cotton planting" may want information about the regional distribution of cotton planting, but the search engine mostly returns a large number of articles on cotton planting technology. The reason is that most existing search engines rely on simple keyword matching and cannot really understand the user's search intent. In addition, most search sites still process information manually, so information is organized far more slowly than network information grows.
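The keyword-matching behavior described above can be illustrated with a minimal sketch of an inverted index queried with a Boolean AND. The sample documents and the function names `build_index` and `boolean_and` are assumptions made for illustration and do not come from any particular search engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Illustrative documents (assumed, not from the article).
docs = {
    1: "cotton planting technology for dry regions",
    2: "regional distribution of cotton planting",
    3: "wheat planting technology",
}
index = build_index(docs)
hits = boolean_and(index, "cotton planting")
print(sorted(hits))
```

Both the technology article and the distribution article match the query, which shows why keyword matching alone cannot separate the two intents.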
To provide personalized, active information services, Web mining technology has become a new research topic in recent years. It is the application of data mining technology to network information processing [2]. Network information mining learns the internal features of data objects from a large number of training samples and then extracts targeted information on that basis. For example, when the information mining system finds that a user is interested in "cotton planting distribution", it automatically filters out irrelevant data such as cotton planting technology, which greatly reduces retrieval time and cost.
Network information mining and network information retrieval have much in common, but they also differ in essential ways. Network information mining inherits proven achievements of network information retrieval, such as robots and full-text retrieval, and also draws on techniques from artificial intelligence, pattern recognition, and neural networks. The biggest difference is that a network information mining system can capture the personalized information needs of a user and, based on the target feature information, search for targeted information on the network or in an information library. This article describes the overall process and implementation of network information mining technology and points out the feasibility and development prospects of applying it to agricultural information.

II. Key Technologies and System Processes of Network Information Mining

1. Key Technologies in Network Information Mining
(1) Feature Extraction of target samples
The network information mining system uses the vector space model (VSM) and represents the target information by feature entries (T1, T2, ..., Tn) and their weights Wi. During information matching, these feature items are used to evaluate the correlation between an unknown text and the target sample. Selecting the feature entries and their weights is called feature extraction of the target sample. The quality of the feature extraction algorithm directly affects how well the system performs. Entries have different frequency distributions in different documents, so feature extraction and weight evaluation can be based on the frequency characteristics of entries.
An effective feature item set should both reflect the target content and distinguish the target from other documents. Therefore, the weight of a word is proportional to the frequency of the entry within the document and inversely proportional to the document frequency of the entry in the training texts. This gives the following feature item weight evaluation function:
Weight(Word) = tf_ik * idf_i = tf_ik * log(N/n_k + 1)
Here tf_ik is the frequency of entry T_k in document D_i, idf_i is the inverse document frequency, N is the number of documents in the whole target sample set, and n_k is the number of documents that contain entry T_k. If length is taken into account, the weight can be normalized to obtain:
Weight(Word) = tf_ik * log(N/n_k + 1) / sqrt( sum_k [ tf_ik * log(N/n_k + 1) ]^2 )
Compared with ordinary text files, HTML documents have explicit tags, more visible structural information, and richer object attributes. When calculating feature entry weights, the system takes these properties of HTML documents into account and gives a higher weight to text that appears in titles and other feature-bearing elements. To improve running efficiency, the system reduces the dimensionality of the feature vectors: only the entries with higher weights are kept as the feature items of a document, forming a target feature vector of lower dimension.
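The weighting and dimensionality reduction described above can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: the function name `feature_weights`, the sample token lists, and the `top_k` cutoff are assumptions, and dividing by the Euclidean norm of the vector is one common way to normalize the weights.

```python
import math
from collections import Counter

def feature_weights(docs, top_k=3):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_k: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Weight = tf * log(N / n_k + 1), as in the evaluation function above.
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t] + 1) for t in tf}
        # Assumed normalization: divide by the Euclidean norm of the vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted = {t: w / norm for t, w in raw.items()}
        # Dimensionality reduction: keep only the top_k highest weights.
        top = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors.append(dict(top))
    return vectors

# Illustrative, already segmented documents (assumed).
docs = [
    ["cotton", "planting", "distribution", "cotton"],
    ["cotton", "planting", "technology"],
    ["wheat", "planting", "technology"],
]
vectors = feature_weights(docs, top_k=2)
print(vectors[0])
```

In the first document, "planting" is dropped because it occurs in every document and therefore has the lowest weight, while "cotton" and "distribution" are kept.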
(2) Chinese Word Segmentation
English sentences use spaces as fixed delimiters between words, but Chinese sentences do not. This is a major obstacle to Chinese information processing. For example, a computer cannot tell whether the same character string should be split into "racket" followed by "bought" or into "ball" followed by "auction". The text must therefore be segmented into entries before word frequency statistics are computed. A simple and effective approach is machine word segmentation based on a large-scale word library. A general dictionary contains a large number of frequently used words that never become feature items, so the system builds a specialized word segmentation table based on the mining objectives. This preserves the accuracy of feature extraction and significantly improves the operating efficiency of the system.
To segment a text into entries, the system first performs a rough segmentation at punctuation marks and then applies the forward and reverse maximum matching methods. To cope with the diversity of natural language, the system also builds and uses auxiliary dictionaries, such as a synonym dictionary and a related-word dictionary, to improve the accuracy of information matching.
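The forward and reverse maximum matching methods can be sketched as below. The Chinese sample string and the three-word dictionary are illustrative assumptions chosen to mirror the racket/auction ambiguity; a real system would use a large specialized word table, and the article does not say how it reconciles the two results.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]

# Assumed toy dictionary: "racket", "auction", "finished".
dictionary = {"球拍", "拍卖", "完了"}
text = "球拍卖完了"
forward = forward_max_match(text, dictionary)
reverse = reverse_max_match(text, dictionary)
print(forward)
print(reverse)
```

The forward scan finds "球拍" (racket) first, while the reverse scan finds "拍卖" (auction). A disagreement between the two directions is a useful signal that the string is ambiguous and may need dictionary or context support.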
(3) Obtaining dynamic information on the network
The robot is an important part of a traditional search engine. It reads Web pages over the HTTP protocol and automatically roams the WWW by following the hyperlinks in HTML documents. A robot is also called a spider, worm, or crawler. However, a robot can only obtain static Web pages, while valuable information is often stored in network databases. Users cannot reach this data through search engines; they have to log on to a professional information website and use the query interface provided by the site to submit a query and browse the dynamic pages that the system generates. The network information mining system traverses the information in a network database through the query interface provided by the website, automatically analyzes and organizes the traversal results based on a professional knowledge base, and finally imports the information into a local database.
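How a robot roams pages by following hyperlinks can be sketched as a breadth-first traversal. To keep the example self-contained, the `fetch` callable and the in-memory `PAGES` dictionary are hypothetical stand-ins for real HTTP requests. Traversing a site's query interface to reach database content is not covered here.

```python
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, fetch, limit=100):
    """Breadth-first traversal; fetch(url) returns HTML text or None."""
    seen, queue, order = {start}, deque([start]), []
    while queue and len(order) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # unreachable page: skip it
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical in-memory site standing in for HTTP requests.
PAGES = {
    "/index": '<a href="/crops">Crops</a> <a href="/forestry">Forestry</a>',
    "/crops": '<a href="/index">Home</a> <a href="/cotton">Cotton</a>',
    "/forestry": "<p>No links here.</p>",
    "/cotton": '<a href="/missing">Broken link</a>',
}
visited = crawl("/index", PAGES.get)
print(visited)
```

The `seen` set keeps the robot from looping when pages link back to each other, as "/crops" does to "/index".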

2. Implementation process of network information mining technology
Figure 1 shows the overall flowchart of the implementation of network information mining technology. Each step is explained as follows:
Step 1: Establish the target sample. The user selects target texts, from which the user's feature information is extracted.
Step 2: Extract the feature information. Based on the word frequency distribution of the target sample, extract the feature vector of the mining target from the statistical dictionary and calculate the corresponding weights.
Step 3: Obtain the network information. First use search engine sites to select the sites to be collected, then use the robot program to collect static Web pages, and finally obtain the dynamic information in the network databases of the visited sites to generate the WWW resource index library.
Step 4: Match the information features. Extract the feature vectors of the source information in the index library, match them against the feature vector of the target sample, and return the information that meets the threshold condition to the user.
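Step 4 can be sketched as a similarity comparison between sparse feature vectors. The article does not name the similarity measure, so cosine similarity is used here as an assumption. The threshold value, the page names, and the weights are also illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def match(target, candidates, threshold=0.5):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [(cosine(target, vec), name) for name, vec in candidates.items()]
    return sorted([(s, n) for s, n in scored if s >= threshold], reverse=True)

# Assumed target feature vector and candidate pages from the index library.
target = {"cotton": 0.8, "distribution": 0.6}
candidates = {
    "page_a": {"cotton": 0.7, "distribution": 0.7},
    "page_b": {"cotton": 0.6, "technology": 0.8},
    "page_c": {"wheat": 1.0},
}
results = match(target, candidates)
print(results)
```

Only the page about cotton distribution passes the threshold. The page about cotton technology shares one term with the target but scores below it, which is the filtering effect the mining system aims for.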

III. Application Prospects of Network Information Mining Technology

The Internet provides rich resources for users, but useful information is hard to obtain without a good information mining tool. The application of network information mining technology to agricultural information is described here as an example. With the further development of China's telecom industry, network information keeps growing. Agriculture is China's largest industry, and its informatization requires an information mining system for the agricultural field that meets the agricultural information needs of users at all levels. An agricultural network information mining system should be built on existing mature theory and completed step by step. Based on the distribution characteristics of current WWW agricultural information resources, the statistical dictionary can be subdivided into several specialized dictionaries, covering basic agricultural science, agricultural engineering, agronomy, plant protection, crops, gardening, forestry, animal husbandry, aquatic products, and fishery. In this way, the accuracy of both matching and retrieval is improved.
In the process of building the system, three key issues are involved:
1. Problems in determining the target sample
User feature information is extracted from the network resources the user has browsed (generally HTML text): the webpages the user has accessed are submitted to the server as that user's target samples. The number of target samples should be 50, because with too few samples the extracted keywords are too sparse to express the user's interests, while with too many the system overhead increases and a long operation time is required. In the user feature extraction algorithm, word weight is measured mainly by word frequency (tf_ik), inverse document frequency (idf_i), and location factors. To improve the expressive power of keywords, term length and word distribution can also be considered as weight factors. In general, longer words express more specific concepts. For example, "crop cultivation" is more specific than "crop", so "crop cultivation" should be given a higher weight. Word distribution refers to how a word is spread across a text. If word A (not a stop word) appears in every section of an article while word B appears in only one section, word A is considered more expressive than word B and is therefore given a higher weight.
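The term length and word distribution factors can be sketched as multipliers on a base weight. The article gives no formulas for them, so the functional forms below (a linear length bonus and one plus the fraction of sections containing the word) and all names are purely hypothetical.

```python
def distribution_factor(word, sections):
    """Fraction of sections (token lists) in which the word appears."""
    if not sections:
        return 0.0
    return sum(1 for tokens in sections if word in tokens) / len(sections)

def adjusted_weight(base_weight, word, sections):
    """Scale a base weight (for example tf * idf) by length and distribution."""
    # Assumed form: each character beyond the first adds 10 percent.
    length_factor = 1.0 + 0.1 * (len(word) - 1)
    # Assumed form: a word present in every section doubles its weight.
    spread_factor = 1.0 + distribution_factor(word, sections)
    return base_weight * length_factor * spread_factor

# Illustrative article split into three segmented sections (assumed).
sections = [
    ["crop", "cultivation", "soil"],
    ["crop", "yield"],
    ["crop", "pest"],
]
weight_a = adjusted_weight(1.0, "crop", sections)   # appears in every section
weight_b = adjusted_weight(1.0, "soil", sections)   # appears in one section
print(weight_a, weight_b)
```

With equal base weights and equal lengths, the word that appears in every section ends up with the higher weight, matching the word A and word B example above. For Chinese text, `len(word)` counts characters, which is the natural unit of term length.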
2. Construction of the statistical dictionary
Word segmentation is required both for extracting user feature information and for automatically indexing Internet information. The quality of word segmentation depends closely on the segmentation algorithm and on the statistical dictionary used. In this system, the Chinese word segmentation module uses the maximum matching (MM) method as its segmentation algorithm. The statistical dictionary consists mainly of a keyword dictionary, a synonym dictionary, and a related-word dictionary. The data in the keyword dictionary comes mainly from the Chinese Library Classification (Fourth Edition), category S data in the Chinese Classification subject table, the agricultural major classification table, Chinese MARC, and the Chinese Science and Technology Journal Database. The details of the data processing are omitted here for reasons of space. The synonym dictionary is built mainly from the above data resources and the Synonym Forest thesaurus; it is very useful for handling user queries and for text classification. The related-word dictionary consists of upper-level and lower-level words (for example, plant test and fruit test) and words linked by an implication relation (for example, grafting is linked to dwarfing rootstock, grafted seedling, scion, bridge grafting, interstock, rootstock, and graft compatibility). This dictionary can be built from the above data resources together with a statistical algorithm based on word co-occurrence.
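One simple statistical approach based on word co-occurrence is to count how many documents contain both words of a pair and to keep the pairs that pass a minimum count. The sketch below makes that assumption; the article does not specify its algorithm, and the sample documents, function names, and `min_count` value are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, for each unordered word pair, the documents containing both."""
    pairs = Counter()
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

def related_words(word, pairs, min_count=2):
    """Words that co-occur with `word` in at least min_count documents."""
    related = []
    for (a, b), count in pairs.items():
        if count >= min_count and word in (a, b):
            related.append((b if a == word else a, count))
    return sorted(related, key=lambda item: (-item[1], item[0]))

# Illustrative segmented documents (assumed).
docs = [
    ["grafting", "rootstock", "scion"],
    ["grafting", "rootstock", "seedling"],
    ["grafting", "scion"],
    ["wheat", "seedling"],
]
pairs = cooccurrence(docs)
related = related_words("grafting", pairs)
print(related)
```

Raw counts favor frequent words, so a practical system would probably normalize them, for example by the individual document frequencies of the two words.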
The design of the agricultural network information mining system should also take the mining of user interests into account. For example, if the feature vectors generated by a user's searches are found to contain "aloe vera" and "planting", then after learning, the mining system should increase the weights of the feature items "aloe vera" and "planting" and use the user feedback mechanism to push relevant data promptly. In addition, deeper knowledge can be mined from the interests of groups of users. For example, if the feature vectors generated during retrieval by many users in one region contain "aloe vera", it can be inferred that this region may have a demand for aloe vera. On this basis, the mining system can analyze regional demand in the aloe vera market and provide a scientific basis for the circulation of aloe vera.
At present, AI and related technologies are not yet mature. Building an agricultural network information mining system on statistical mathematical models offers some useful insight, but every part of the system still needs further improvement.
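The two ideas above, reinforcing a user's feature weights after a search and aggregating interests by region, can be sketched as follows. The additive update with a fixed `rate`, the function names, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def reinforce(profile, query_terms, rate=0.1):
    """Return a copy of the profile with the queried terms' weights increased."""
    updated = dict(profile)
    for term in query_terms:
        updated[term] = updated.get(term, 0.0) + rate
    return updated

def regional_interest(profiles_by_region, term):
    """Count, per region, the users whose feature vector contains the term."""
    return Counter({
        region: sum(1 for profile in profiles if term in profile)
        for region, profiles in profiles_by_region.items()
    })

# Illustrative user profile and regional data (assumed).
profile = {"aloe vera": 0.5}
updated = reinforce(profile, ["aloe vera", "planting"])

profiles_by_region = {
    "north": [{"aloe vera": 0.6}, {"aloe vera": 0.3, "planting": 0.2}, {"wheat": 0.9}],
    "south": [{"cotton": 0.7}],
}
demand = regional_interest(profiles_by_region, "aloe vera")
print(updated, demand.most_common(1))
```

A region where many user profiles contain the same term is a candidate for the kind of market demand analysis described above.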

References

1. Gudivada V N. Information Retrieval on the World Wide Web. IEEE Internet Computing, 1997, 1(5): 58-68
2. Li horizontal. Review of data mining technology. Small computer system, (4): 74-81
