A crawler always relies on a queue to hold the URLs it still has to fetch. For a small example, a queue class written in Java is enough, but such a queue keeps its data only in memory: if the power fails, everything is lost and the crawl has to start over. An in-memory queue therefore does not solve the problem; we need a queue that persists its data.
Here is a queue I implemented with Berkeley DB. Berkeley DB is an embedded database that keeps data in an in-memory buffer and automatically persists it to disk when the data held in memory grows larger than the buffer size.
Berkeley DB stores data as key-value pairs, so I use a Java BigInteger as the key and the URL as the value. The key is incremented by 1 for each new entry. A BigInteger can hold a 1 followed by thousands of zeros, which is more than enough for a very large number of URLs.
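To make the choice of key type concrete, here is a minimal sketch of an incrementing BigInteger key. The class name BigIntegerKeyDemo and the helper nextKey are hypothetical names used only for illustration; they are not part of the queue code.

```java
import java.math.BigInteger;

class BigIntegerKeyDemo {
    // Returns the key that follows the given one.
    // BigInteger is immutable, so add() returns a new object.
    static BigInteger nextKey(BigInteger key) {
        return key.add(BigInteger.ONE);
    }

    public static void main(String[] args) {
        BigInteger key = BigInteger.valueOf(Long.MAX_VALUE);
        BigInteger next = nextKey(key);
        // A long would overflow at this point; BigInteger simply keeps growing.
        System.out.println(next);                    // 9223372036854775808
        System.out.println(next.compareTo(key) > 0); // true
    }
}
```

Because keys only ever grow, comparing two keys with compareTo is enough to tell which element was enqueued first.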
The queue maintains two BigInteger values, head and tail, which hold the keys of the first and last elements. Dequeuing deletes the element at the head and adds 1 to the head value; enqueuing adds 1 to the tail value and stores the element under that key. size returns the length of the queue. There are also several cursor operations, first, current, next, prev and last, for traversing the queue.
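The head/tail bookkeeping described above can be sketched without Berkeley DB at all. In the sketch below a HashMap stands in for the database, so nothing is persisted; the class name KeyValueQueueSketch is hypothetical, and the example URLs are placeholders.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a HashMap stands in for the Berkeley DB store, so nothing here is persisted.
class KeyValueQueueSketch {
    private final Map<BigInteger, String> store = new HashMap<>();
    private BigInteger head = BigInteger.ZERO;        // key of the next element to dequeue
    private BigInteger tail = BigInteger.valueOf(-1); // key of the last enqueued element
    private BigInteger current = head;                // cursor used for traversal

    void enQueue(String url) {
        if (url == null) return;
        tail = tail.add(BigInteger.ONE);              // advance the tail, then store under it
        store.put(tail, url);
    }

    String deQueue() {
        String url = store.remove(head);              // null when the queue is empty
        if (url != null) head = head.add(BigInteger.ONE);
        return url;
    }

    long size() { return store.size(); }

    String first() { current = head; return store.get(current); }

    String next() {
        if (current.compareTo(tail) < 0) {
            current = current.add(BigInteger.ONE);
            return store.get(current);
        }
        return null;                                  // already at the tail
    }

    String prev() {
        if (current.compareTo(head) > 0) {
            current = current.subtract(BigInteger.ONE);
            return store.get(current);
        }
        return null;                                  // already at the head
    }

    public static void main(String[] args) {
        KeyValueQueueSketch q = new KeyValueQueueSketch();
        q.enQueue("http://example.com/a");
        q.enQueue("http://example.com/b");
        System.out.println(q.first());   // http://example.com/a
        System.out.println(q.next());    // http://example.com/b
        System.out.println(q.deQueue()); // http://example.com/a
        System.out.println(q.size());    // 1
    }
}
```

Starting with head at 0 and tail at -1 means the first enqueue lands on key 0, and an empty queue is simply the state where tail is one less than head.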
The queue is built on the MyBerkeleyDB class I wrapped earlier, which exposes just a few simple, handy APIs; this is also a case of code reuse. Here's the code:
package com.nosql;

import java.math.BigInteger;

/*********************************
 * Uses BerkeleyDB to wrap some database operations,
 * including setting the buffer size, the encoding and the database path,
 * storing key-value pairs, looking up a value by key, and closing the
 * database.
 * @author Administrator
 *********************************/
public class MyBerkeleyDBQueue {
    private MyBerkeleyDB database; // database
    private static final BigInteger BigIntegerIncrement = BigInteger.valueOf(1); // increment applied to keys
    private BigInteger head;    // key of the queue head
    private BigInteger tail;    // key of the queue tail
    private BigInteger current; // cursor: current position while traversing
    private static final String headString = "Head";
    private static final String tailString = "Tail";

    public MyBerkeleyDBQueue() {
        database = new MyBerkeleyDB();
    }

    // Initialize the database
    public void open(String dbName) {
        database.setEnvironment(database.getPath(), database.getChacheSize());
        database.open(dbName); // open the database
        head = (BigInteger) database.get(headString);
        tail = (BigInteger) database.get(tailString);
        if (head == null || tail == null) {
            // First use: empty queue, so tail is one less than head
            head = BigInteger.valueOf(0);
            tail = BigInteger.valueOf(-1);
            database.put(headString, head);
            database.put(tailString, tail);
        }
        current = head; // BigInteger is immutable, so sharing the reference is safe
    }

    // Set the encoding
    public void setCharset(String charset) {
        database.setCharset(charset);
    }

    // Set the path
    public void setPath(String path) {
        database.setPath(path);
    }

    // Set the buffer size
    public boolean setChacheSize(long size) {
        return database.setChacheSize(size);
    }

    // Enqueue: the tail advances by 1 and the value is stored under the new tail key
    public void enQueue(Object value) {
        if (value == null)
            return;
        tail = tail.add(BigIntegerIncrement);
        database.put(tailString, tail);
        database.put(tail, value);
    }

    // Dequeue: fetch and delete the head element, then advance the head by 1
    public Object deQueue() {
        Object value = database.del(head);
        if (value != null) {
            head = head.add(BigIntegerIncrement);
            database.put(headString, head);
        }
        return value;
    }

    // Key of the queue head
    public Object head() {
        return head;
    }

    // Key of the queue tail
    public Object tail() {
        return tail;
    }

    // Close the database
    public void close() {
        this.database.close();
    }

    // Queue length: record count minus the two "Head" and "Tail" records
    public long size() {
        return database.size() - 2;
    }

    // Value at the current cursor position
    public Object current() {
        return database.get(current);
    }

    // Move the cursor to the first element and return its value
    public Object first() {
        current = head;
        return current();
    }

    // Move the cursor to the last element and return its value
    public Object last() {
        current = tail;
        return current();
    }

    // Move the cursor forward and return the value, or null at the tail
    public Object next() {
        if (current.compareTo(tail) < 0) {
            current = current.add(BigIntegerIncrement);
            return current();
        }
        return null;
    }

    // Move the cursor back and return the value, or null at the head
    public Object prev() {
        if (current.compareTo(head) > 0) {
            // subtract, not divide: dividing by 1 would leave the cursor where it is
            current = current.subtract(BigIntegerIncrement);
            return current();
        }
        return null;
    }
}
The head and tail values are stored as String/BigInteger pairs, while the URLs are stored as BigInteger/String pairs (for the sake of reuse the code uses Object throughout; the types are spelled out here only to make it easier to follow). The database therefore holds two records that are not queue elements, the head value and the tail value, so the size function returns the record count minus 2.
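The "minus 2" can be checked with a small sketch. A HashMap again stands in for the database; the class name QueueSizeDemo and its helper methods are hypothetical. As a cross-check, which is not something the original code does, the length is also computed from the two pointers as tail - head + 1.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a HashMap stands in for the database, so nothing is persisted.
class QueueSizeDemo {
    // Builds a store holding n URLs plus the two bookkeeping records.
    static Map<Object, Object> buildStore(int n) {
        Map<Object, Object> db = new HashMap<>();
        db.put("Head", BigInteger.ZERO);            // String key, BigInteger value
        db.put("Tail", BigInteger.valueOf(n - 1L)); // -1 when the queue is empty
        for (int i = 0; i < n; i++) {
            db.put(BigInteger.valueOf(i), "http://example.com/page" + i); // BigInteger key, String value
        }
        return db;
    }

    // Length from the record count: everything except "Head" and "Tail".
    static long sizeFromRecordCount(Map<Object, Object> db) {
        return db.size() - 2;
    }

    // Length from the pointers alone: tail - head + 1.
    static long sizeFromPointers(Map<Object, Object> db) {
        BigInteger head = (BigInteger) db.get("Head");
        BigInteger tail = (BigInteger) db.get("Tail");
        return tail.subtract(head).add(BigInteger.ONE).longValueExact();
    }

    public static void main(String[] args) {
        Map<Object, Object> db = buildStore(3);
        System.out.println(sizeFromRecordCount(db)); // 3
        System.out.println(sizeFromPointers(db));    // 3
        System.out.println(sizeFromRecordCount(buildStore(0))); // 0
    }
}
```

Both computations agree as long as every key between head and tail holds exactly one record, which is the invariant the enqueue and dequeue operations maintain.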
Below is a functional test file:
package com.test;

import com.nosql.MyBerkeleyDBQueue;

public class Test_MyBerkeleyDBQueue {
    public static void main(String[] args) {
        MyBerkeleyDBQueue queue = new MyBerkeleyDBQueue();
        queue.setPath("Webroot\\data\\db\\queue");
        queue.open("Queue");
        System.out.println("Head:" + queue.head());
        System.out.println("Tail:" + queue.tail());
        System.out.println("Size:" + queue.size());
        System.out.println("===================");

        // Enqueue test
        for (int i = 0; i < 10; i++) {
            queue.enQueue(i);
            System.out.println("Head:" + queue.head());
            System.out.println("Tail:" + queue.tail());
            System.out.println("Size:" + queue.size());
            System.out.println("===================");
        }

        // Cursor test
        System.out.println("first element:" + queue.first());
        System.out.println("last element:" + queue.last());
        long size1 = queue.size() - 1; // the first element is printed separately
        System.out.println(queue.first()); // resets the cursor to the head
        while (size1-- > 0) {
            System.out.println(queue.next());
        }
        System.out.println("===================");
        System.out.println("Size:" + queue.size());
        System.out.println("===================");

        // Dequeue more times than there are elements; the extra calls print null
        long size = queue.size() + 3;
        for (int i = 0; i < size; i++) {
            System.out.println("Delete:" + queue.deQueue());
            System.out.println("Head:" + queue.head());
            System.out.println("Tail:" + queue.tail());
            System.out.println("Size:" + queue.size());
            System.out.println("===================");
        }
        queue.close();
    }
}
"Search Engine" BerkeleyDB implementing the Queue database