5 Java example of a crawler with preference
In the previous section, we pointed out that a priority queue (PriorityQueue) can be used to implement a crawler with preference. Before going into detail, we first introduce priority queues.
A priority queue is a special kind of queue. Elements in an ordinary queue leave in first-in, first-out (FIFO) order, whereas elements in a priority queue leave in order of their priority. For example, an operating system can use a priority queue for priority-based process scheduling. There are two types of priority queue: the minimum priority queue, which dequeues the smallest element first, and the maximum priority queue, which dequeues the largest element first.
Theoretically, a priority queue can be built on any data structure, linear or non-linear, ordered or unordered. With an ordered structure, obtaining the smallest or largest value is very easy, but insertion is costly because the order must be maintained. With an unordered structure, insertion is very easy, but obtaining the maximum or minimum value requires scanning every element. Based on this analysis, the heap data structure, which keeps both operations reasonably cheap, is a good choice for implementing priority queues.
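To make the trade-off concrete, here is a minimal sketch of a binary min-heap stored in a list. The class name MinHeapSketch and its methods are hypothetical and exist only for illustration; this is not the implementation used by the JDK. Both insertion and removal of the smallest element touch only one path between the root and a leaf.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical minimal binary min-heap of ints, for illustration only.
class MinHeapSketch {
    // The children of the element at index i live at indices 2i+1 and 2i+2.
    private final List<Integer> items = new ArrayList<>();

    // Insert: append at the end, then sift up until the parent is not larger.
    void add(int value) {
        items.add(value);
        int child = items.size() - 1;
        while (child > 0) {
            int parent = (child - 1) / 2;
            if (items.get(parent) <= items.get(child)) {
                break;
            }
            swap(parent, child);
            child = parent;
        }
    }

    // Remove the smallest element: move the last leaf to the root, then sift down.
    int poll() {
        if (items.isEmpty()) {
            throw new NoSuchElementException("heap is empty");
        }
        int smallest = items.get(0);
        int last = items.remove(items.size() - 1);
        if (!items.isEmpty()) {
            items.set(0, last);
            int parent = 0;
            while (true) {
                int left = 2 * parent + 1;
                int right = left + 1;
                int min = parent;
                if (left < items.size() && items.get(left) < items.get(min)) {
                    min = left;
                }
                if (right < items.size() && items.get(right) < items.get(min)) {
                    min = right;
                }
                if (min == parent) {
                    break;
                }
                swap(parent, min);
                parent = min;
            }
        }
        return smallest;
    }

    boolean isEmpty() {
        return items.isEmpty();
    }

    private void swap(int i, int j) {
        Integer tmp = items.get(i);
        items.set(i, items.get(j));
        items.set(j, tmp);
    }
}
```

Adding values in any order and then calling poll() repeatedly returns them in ascending order, which is the behavior a minimum priority queue needs.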
Since JDK 1.5, Java has provided a built-in priority queue implementation: java.util.PriorityQueue.
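As a short illustration of the two queue types, the sketch below builds a minimum and a maximum priority queue from java.util.PriorityQueue. The class PriorityQueueDemo and its helper methods are hypothetical names, and the comparator-only constructor assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical demo class; only PriorityQueue itself comes from the JDK.
class PriorityQueueDemo {
    // With no comparator, the smallest element (natural order) is polled first.
    static PriorityQueue<Integer> minQueue() {
        return new PriorityQueue<>();
    }

    // With a reversed comparator, the largest element is polled first.
    static PriorityQueue<Integer> maxQueue() {
        return new PriorityQueue<>(Comparator.reverseOrder());
    }

    // Empties the queue and returns the elements in the order poll() yields them.
    static List<Integer> drain(PriorityQueue<Integer> queue) {
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            out.add(queue.poll());
        }
        return out;
    }
}
```

If 5, 1 and 3 are added to both queues, draining the minimum queue yields 1, 3, 5 and draining the maximum queue yields 5, 3, 1.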
With PriorityQueue, we can select the URLs with a higher priority from the URL queue first.
LinkQueue class:
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

public class LinkQueue {
    // Set of URLs that have already been visited
    private static Set<String> visitedUrl = new HashSet<String>();
    // Queue of URLs waiting to be visited.
    // With no Comparator supplied, PriorityQueue orders the strings
    // by their natural (lexicographic) order.
    private static Queue<String> unVisitedUrl = new PriorityQueue<String>();

    // Get the queue of unvisited URLs
    public static Queue<String> getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue the next unvisited URL (null if the queue is empty)
    public static String unVisitedUrlDeQueue() {
        return unVisitedUrl.poll();
    }

    // Enqueue a URL, ensuring that each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("")
                && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url)) {
            unVisitedUrl.add(url);
        }
    }

    // Get the number of visited URLs
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Determine whether the queue of unvisited URLs is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}
In a crawler with preference, the priority of each queue element is determined by the priority of its URL. There are dedicated link analysis methods for determining URL priority, such as Google's PageRank and the HITS algorithm.
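One way to connect such a score to the queue is sketched below. It assumes that a numeric score for each URL has already been computed elsewhere (for example by a link analysis method) and that a higher score means a higher priority. The classes ScoredUrl and ScoredUrlQueue are hypothetical and are not part of the LinkQueue class above; the comparator helpers assume Java 8 or later.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;

// Hypothetical pairing of a URL with a precomputed priority score.
class ScoredUrl {
    final String url;
    final double score;

    ScoredUrl(String url, double score) {
        this.url = url;
        this.score = score;
    }
}

// Hypothetical URL queue that dequeues the highest-scoring URL first.
class ScoredUrlQueue {
    // comparingDouble sorts ascending by score; reversed() puts the highest score first.
    private final Queue<ScoredUrl> queue = new PriorityQueue<>(
            Comparator.comparingDouble((ScoredUrl u) -> u.score).reversed());

    void add(String url, double score) {
        queue.add(new ScoredUrl(url, score));
    }

    // Returns the URL with the highest score, or null if the queue is empty.
    String poll() {
        ScoredUrl next = queue.poll();
        return next == null ? null : next.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

With this arrangement the crawler always fetches the URL that currently has the highest score, regardless of the order in which URLs were discovered.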