2. Breadth-first crawler and crawler with preference (4)

Source: Internet
Author: User

5 Java crawler with preference example

In the previous section, we pointed out that a priority queue (PriorityQueue) can be used to implement a crawler with preference. Before going into detail, we first introduce priority queues.

A priority queue is a special kind of queue. Elements in an ordinary queue leave in first-in, first-out (FIFO) order, whereas elements in a priority queue leave in order of their priority. For example, an operating system can use a priority queue to schedule processes by priority. There are two types of priority queue: the minimum priority queue and the maximum priority queue.
In theory, a priority queue can be built on any data structure, linear or non-linear, ordered or unordered. With an ordered structure, obtaining the smallest or largest value is very easy, but insertion is expensive. With an unordered structure, insertion is very easy, but finding the maximum or minimum value is expensive. As a compromise between the two, you can use the heap data structure to implement priority queues.
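To make the heap compromise concrete, here is a minimal sketch of a binary min-heap stored in a list. The class name IntMinHeap and its methods are hypothetical and exist only for illustration; this is not how any particular library implements its priority queue. Both insertion and removal of the smallest value take O(log n) steps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class: a binary min-heap of ints backed by a list.
class IntMinHeap {
    private final List<Integer> items = new ArrayList<>();

    // Insert: append at the end, then move the value up while it is smaller than its parent.
    void add(int value) {
        items.add(value);
        int i = items.size() - 1;
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (items.get(parent) <= items.get(i)) break;
            swap(parent, i);
            i = parent;
        }
    }

    // Remove and return the smallest value, or null if the heap is empty.
    Integer poll() {
        if (items.isEmpty()) return null;
        int min = items.get(0);
        int last = items.remove(items.size() - 1);
        if (!items.isEmpty()) {
            // Put the last value at the root, then move it down past smaller children.
            items.set(0, last);
            int i = 0;
            while (true) {
                int left = 2 * i + 1, right = left + 1, smallest = i;
                if (left < items.size() && items.get(left) < items.get(smallest)) smallest = left;
                if (right < items.size() && items.get(right) < items.get(smallest)) smallest = right;
                if (smallest == i) break;
                swap(i, smallest);
                i = smallest;
            }
        }
        return min;
    }

    boolean isEmpty() {
        return items.isEmpty();
    }

    private void swap(int a, int b) {
        int tmp = items.get(a);
        items.set(a, items.get(b));
        items.set(b, tmp);
    }

    public static void main(String[] args) {
        IntMinHeap heap = new IntMinHeap();
        heap.add(5);
        heap.add(1);
        heap.add(3);
        while (!heap.isEmpty()) {
            System.out.println(heap.poll()); // prints 1, then 3, then 5
        }
    }
}
```

Neither operation needs to keep the whole list sorted, which is why the heap sits between the ordered and unordered extremes described above.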
Since JDK 1.5, Java has provided a built-in data structure that supports priority queues: java.util.PriorityQueue.
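As a small illustration of the two types of priority queue mentioned earlier, the following sketch builds a minimum and a maximum priority queue with java.util.PriorityQueue. The class name PriorityQueueDemo and the helper drain are made up for this example. Without a Comparator the queue uses the natural ordering of its elements, so the smallest element leaves first; passing Collections.reverseOrder() reverses that.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical demo class, for illustration only.
class PriorityQueueDemo {
    // Empties the queue and returns its elements in the order poll() yields them.
    static String drain(PriorityQueue<Integer> queue) {
        StringBuilder out = new StringBuilder();
        while (!queue.isEmpty()) {
            if (out.length() > 0) out.append(' ');
            out.append(queue.poll());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Minimum priority queue: natural ordering, smallest element first.
        PriorityQueue<Integer> minQueue = new PriorityQueue<>();
        // Maximum priority queue: reversed natural ordering, largest element first.
        PriorityQueue<Integer> maxQueue = new PriorityQueue<>(11, Collections.reverseOrder());

        for (int value : new int[] {5, 1, 3}) {
            minQueue.add(value);
            maxQueue.add(value);
        }

        System.out.println(drain(minQueue)); // prints "1 3 5"
        System.out.println(drain(maxQueue)); // prints "5 3 1"
    }
}
```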
With java.util.PriorityQueue, the crawler can always take the URL with the highest priority from the URL queue first.

LinkQueue class:

 

import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

public class LinkQueue {
    // Set of URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // Queue of URLs still to be visited. With no Comparator supplied,
    // PriorityQueue orders Strings by their natural (lexicographic) order.
    private static Queue<String> unVisitedUrl = new PriorityQueue<String>();

    // Get the queue of unvisited URLs
    public static Queue<String> getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Take the next unvisited URL out of the queue (null if the queue is empty)
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.poll();
    }

    // Enqueue a URL only if it is non-empty and has not been seen before,
    // so that each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.add(url);
        }
    }

    // Get the number of visited URLs
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the queue of unvisited URLs is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}

In a crawler with preference, the priority of a queue element is determined by the priority of its URL. There are dedicated link-analysis methods for determining URL priority, such as Google's PageRank and the HITS algorithm.
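One way to let such a priority drive the queue is to store each URL together with a numeric score and give the PriorityQueue a Comparator on that score. The sketch below is only an assumption about how this could look: the classes ScoredUrl and ScoredFrontier, the score values, and the example.com URLs are all invented for illustration, and the scores stand in for whatever a link-analysis step would produce.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical pairing of a URL with a precomputed priority score.
class ScoredUrl {
    final String url;
    final double score; // assumed to come from some link-analysis step

    ScoredUrl(String url, double score) {
        this.url = url;
        this.score = score;
    }
}

// Hypothetical queue of URLs to visit that hands out the highest-scored URL first.
class ScoredFrontier {
    private final PriorityQueue<ScoredUrl> queue =
            new PriorityQueue<>(11, new Comparator<ScoredUrl>() {
                @Override
                public int compare(ScoredUrl a, ScoredUrl b) {
                    // Reverse the arguments so that a higher score means an earlier exit.
                    return Double.compare(b.score, a.score);
                }
            });

    void add(String url, double score) {
        queue.add(new ScoredUrl(url, score));
    }

    // Returns the URL with the highest score, or null if nothing is queued.
    String next() {
        ScoredUrl top = queue.poll();
        return top == null ? null : top.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        ScoredFrontier frontier = new ScoredFrontier();
        frontier.add("http://example.com/low", 0.1);
        frontier.add("http://example.com/high", 0.9);
        frontier.add("http://example.com/mid", 0.5);
        while (!frontier.isEmpty()) {
            System.out.println(frontier.next()); // high, then mid, then low
        }
    }
}
```

A queue of plain String values with no Comparator would instead hand out URLs in lexicographic order, so some explicit score or Comparator is needed for the priority to reflect link analysis.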
