Contents

Chapter 1 Introduction
1.1 Background
1.2 Organization of This Thesis
Chapter 2 Overview of Web Information Collection
2.1 Basic Principles of Web Information Collection Systems
2.2 Basic Structure of Web Information Collection Systems
2.3 Main Difficulties in Web Information Collection and Corresponding Technical Means
2.4 Examples of Collection Systems
Chapter 3 Research Status of Web Information Collection
3.1 Web-Based Information Collection
3.2 Incremental Web Information Collection
3.3 Topic-Based Web Information Collection
3.4 User-Customized Web Information Collection
3.5 Agent-Based Information Collection
3.6 Migrating Information Collection
3.7 Metasearch-Based Information Collection
3.8 Summary
Chapter 4 Basic Problems of Topic-Based Web Information Collection
4.1 Definition of Topic-Based Web Information Collection
4.2 Advantages of Topic-Based Web Information Collection
4.3 Categories of Topic-Based Web Information Collection
4.4 Distribution Characteristics of Topic Pages on the Web
4.5 Study of Relevance Determination Algorithms
Chapter 5 Topic-Based Web Information Collection System Model and Our Countermeasures
5.1 System Model
5.2 Key Issues in the Model and Our Strategies
Chapter 6 Topic Selection
6.1 Topic Definition
6.2 Topic Category Directories
6.3 Features of Topic Category Directories on the Web
6.4 Topic Selection Strategy
Chapter 7 Spider Collection
7.1 Spider System Model
7.2 Collection Algorithm and Implementation
Chapter 8 Page Analysis
8.1 HTML Syntax Analysis
8.2 Extraction of Text on a Page
8.3 Extraction of Links on a Page
8.4 Extraction of Titles on a Page
Chapter 9 Determining the Relevance among URLs, Pages, and Topics
9.1 Determining the Relevance between URLs and Topics: the IPageRank Algorithm
9.2 Determining the Relevance between Pages and Topics: the Vector Space Model Algorithm
Chapter 10 System Implementation and Summary
10.1 System Implementation
10.2 System Test Results
10.3 Further Work
10.4 Conclusion
References
Acknowledgments
Introduction
Chapter 1 Introduction

1.1 Background
With the rapid development of the Internet and intranets, the network is profoundly changing our lives. With its intuitive, convenient use and rich expressive capability, the World Wide Web (WWW), the fastest-growing technology on the Internet, has gradually become the most important means of publishing and transmitting information on the Internet. With the advent and development of the information age, information on the Web has grown explosively. By July 2000, the number of web pages on the Internet had already exceeded 2.1 billion, there were more than 300 million Internet users, and the number of web pages was increasing by 7 million every day [Xu Zeping 2001]. This provides rich resources for people's lives.
However, while the rapid expansion of Web information provides people with rich resources, it also confronts them with a huge challenge in using those resources effectively: the information on the Internet is diverse and abundant, yet users often cannot find the information they need. Therefore, WWW-based online information collection, publishing, and related information processing have increasingly become the focus of attention.
To this end, people have developed search services based on Web search engines. To solve the problem of information retrieval on the Internet, researchers have done a great deal of work in the information retrieval field and developed various search engines (such as Google and Yahoo). These search engines usually use one or more collectors to gather various types of data (such as WWW, FTP, email, and news) from the Internet and then index the data on local servers; when a user searches, the required information is quickly retrieved from the index database according to the query conditions the user submits [Bowman 1994]. As the foundation and a core component of these search engines, Web information collection plays an important role. As applications deepen and technology develops, it is also increasingly used in site structure analysis, page validity analysis, Web graph evolution, content security detection, user interest mining, personalized information acquisition, and other services and research. Simply put, Web information collection is the process of automatically obtaining page information from the Web by following the links between pages, continually extending from the pages already fetched to the pages required.
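To make this process concrete, the following is a minimal sketch of such a link-extending collector, written in Python purely for illustration; it is not the system described in this thesis, and the seed URL, page limit, and helper names are assumptions made for the example.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def collect(seed_url, max_pages=100):
        """Collect pages breadth-first: fetch a page, extract its links,
        and extend the collection along those links."""
        queue, seen, pages = deque([seed_url]), {seed_url}, {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                      # skip unreachable pages
            pages[url] = html                 # store the page for later indexing
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)  # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

Because the queue is first-in first-out and no topic filter is applied, this sketch collects pages in breadth-first order regardless of their content, which is exactly the behavior of the traditional collection described next.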
The goal of traditional Web information collection is to collect as many information pages as possible, even the entire Web, and in this process it does not care much about the order in which pages are collected or the topics they belong to. A major advantage of this approach is that the collector can concentrate on collection speed and quantity, and the implementation is relatively simple; for example, Google's collection system can reach speeds of up to 100 pages per second when four collectors run concurrently. Working together with search engines, such systems bring great convenience to network users. However, this traditional collection method also has many defects.
With the explosive growth of WWW information, collection speed is increasingly insufficient to meet the needs of practical applications. Recent experiments show that even a large information collection system covers only 30-40% of the Web. The direct solution is to upgrade the collector's hardware and adopt computer systems with higher processing capability, but this approach has limited scalability and poor cost performance. A better solution is to use distribution to improve parallelism. However, parallelism increases system overhead and design complexity, and the benefit of parallel collection falls off significantly as the number of parallel collectors grows. At present, large collection systems generally adopt parallel mechanisms, but the improvement from parallel processing is still far from meeting people's needs, and the current predicament must be relieved from other angles. For example, the Web can be divided into blocks, each block collected separately, and the collection results of the different blocks integrated to improve the collection coverage of the entire Web, as sketched below.
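As one way to realize such block-wise collection, the Web can be partitioned by hashing host names, so that each collector is responsible for a disjoint slice and the per-block results can later be merged. The sketch below is an illustrative assumption, not the partitioning scheme of any particular system.

    from hashlib import md5
    from urllib.parse import urlparse

    def assign_collector(url, num_collectors):
        """Map a URL to one of num_collectors blocks by hashing its host,
        so every host (and thus all of its pages) belongs to one block."""
        host = urlparse(url).netloc
        digest = md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_collectors

    # With 4 collectors, all pages of a given host land in the same block.
    for u in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
        print(u, "-> collector", assign_collector(u, 4))

Keeping all pages of one host in a single block avoids duplicated fetching across collectors and keeps the integration step a simple union of the per-block results.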
The scattered storage and management of Internet information and its dynamic change are also among the problems that plague information collection. Because an information source may change at any time, the collector must refresh its data periodically, yet stale pages still cannot be avoided. For traditional information collection, the huge number of pages to be refreshed means that many collection systems need several weeks to a month to complete one refresh cycle [Aggarwal et al. 2001] [Brin & Page 1998], so the freshness of the collected pages is very low. A 1995 survey by Selberg and Etzioni found that 14.9% of the target pages in the result URLs returned by some of the most commonly used search engines on the Internet had already expired [Selberg & Etzioni 1995]. Obviously, the solution is to reduce the number of pages to be collected, thereby shortening the refresh cycle and improving the freshness of the collected pages.
Traditional Web information collection gathers a huge number of pages and consumes a great deal of system and network resources, yet this consumption is not exchanged for a high utilization rate of the collected pages; in fact, a considerable portion of them are rarely used. This is because each user usually cares about only a very small number of pages, typically concentrated on one or a few topics, while most of the pages the collector gathers are useless to that user. Although the combined effect of many users raises the utilization of the page collection as a whole, the utilization rate is still low, which is obviously a huge waste of system and network resources. To improve utilization effectively, another path must be found.
For users' general information query and retrieval needs, search engines built on traditional information collectors can provide good service, but for users' more specific needs this traditional Web information collection is hard pressed to serve. Different users who enter the same query words may desire different results, while traditional information collection and search engines can only return the same results to everyone; this is unreasonable and needs further improvement.
These problems stem mainly from two causes: the number of collected pages is too large, and the content of the collected pages is too heterogeneous. Hence the idea of classifying the pages of the Web and collecting them by category, that is, by topic, came into being. It effectively reduces the number of pages to collect, makes the collected pages more uniform, and thus effectively alleviates the problems above. Research on topic-based Web information collection is therefore needed.
1.2 Organization of This Thesis
Chapter 2 summarizes the basic structure of Web information collection systems, together with the main difficulties and corresponding technical means. Chapter 3 discusses the research status and hot development directions of Web information collection and points out the urgency and necessity of topic-based Web information collection. Chapter 4 discusses the basic problems of topic-based Web information collection, focusing on the distribution of topic pages on the Web and on relevance determination algorithms. Chapter 5 presents the structural model of our topic-based Web information collection system and briefly describes the key problems the system faces and the corresponding countermeasures. The next four chapters (Chapters 6 through 9) follow the main parts of this model and describe in detail topic selection, spider collection, page analysis, and the determination of relevance between URLs, pages, and the topic, giving the design schemes and algorithms. Finally, Chapter 10 presents the experimental results of the system and the questions that remain for further study.