Key technology - single-host crawler implementation (3) - where should URLs be stored? Memory fills up too fast, and database performance is too poor

This problem is really a space-versus-time trade-off. If you keep every URL in memory, memory fills up quickly; if you keep them in a file, every read and append must touch the disk, which is a significant performance cost. This is exactly why computers have caches, and it leads to my design: three levels of storage - memory, file, and database - which gives satisfactory results in practice. Note that the database level is intended mainly for the distributed crawler: when a crawler node finds a URL that belongs to its own task, it stores that URL in a local file rather than in the shared database, so URLs are not constantly shipped between nodes. That part is covered later. For a single-host crawler, we need to solve the following problems:

1. How does the crawler store a URL?

2. How does the crawler fetch the next URL?

3. Storage access must not disrupt the crawler's smooth operation.

To this end, we design three queues: a memory queue (cache), a file queue (finding), and a waiting queue (waiting). The finding queue holds the URLs parsed out of web pages and lives on disk. The waiting queue, also on disk, holds URLs written back from memory and has higher priority than the finding queue. The cache queue lives in memory and holds the URLs read from disk that are next in line for DNS resolution.

At first, every URL goes into the finding queue on disk. Each new URL read from disk needs a DNS lookup for its hostname, and DNS resolution is very time-consuming, so we keep a batch of URLs in memory: the cache queue. But memory is limited, and when the cache queue is full the excess URLs must be written back to disk. To preserve their original priority, they go into a separate on-disk queue, the waiting queue. If they were appended to the original finding queue instead, the FIFO nature of a queue would push them to the back and destroy their priority. So whenever the memory queue next has room, URLs are fetched from the waiting queue first; only when the waiting queue is empty do we continue fetching from the finding queue.
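
The manager below works on UrlInfo objects serialized one per line as url|depth|weight; that is the format the Split('|') calls in Dequeue parse. UrlInfo itself is not shown in this installment, so here is a minimal sketch of the shape the code assumes (the property names follow the code below; everything else is an assumption):

// Hypothetical sketch of the UrlInfo type assumed by UrlQueueManager.
// The pipe-separated layout matches the Split('|') parsing in Dequeue.
public class UrlInfo
{
    public string Url { get; set; }    // absolute URL to crawl
    public int Depth { get; set; }     // link distance from the seed page
    public int Weight { get; set; }    // scheduling priority

    // Serialize as "url|depth|weight", one record per line on disk.
    public override string ToString()
    {
        return Url + "|" + Depth + "|" + Weight;
    }
}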

We create a UrlQueueManager class. The following code does not yet handle the waiting queue:

using System;
using System.Collections.Generic;

public class UrlQueueManager
{
    public Queue<string> cachequeue = new Queue<string>();
    private FileIO finding = new FileIO();

    public void Init()
    {
        finding.CloseWriteFile();
        finding.CloseReadFile();
        finding.OpenWriteFile(Spider.root + "finding.txt");
        finding.OpenReadFile(Spider.root + "finding.txt");
    }

    // Number of URLs currently held in memory.
    public int Count
    {
        get { return cachequeue.Count; }
    }

    public void Clear()
    {
        cachequeue.Clear();
    }

    public void Enqueue(UrlInfo url)
    {
        // Keep the URL in memory while there is room; otherwise
        // append it to the finding file on disk.
        if (Count < Spider.urlqueuemaxcount)
        {
            cachequeue.Enqueue(url.ToString());
        }
        else
        {
            finding.WriteLine(url.ToString());
        }
    }

    public UrlInfo Dequeue()
    {
        UrlInfo url = new UrlInfo();
        if (Count > 0)
        {
            // When the memory queue drops below half capacity,
            // refill it in a batch from the finding file.
            if (Count < Spider.urlqueuemaxcount / 2)
            {
                for (int i = 0; i < Spider.urlqueuemaxcount / 2; i++)
                {
                    if (!finding.IsEof())
                    {
                        cachequeue.Enqueue(finding.ReadLine());
                    }
                    else
                    {
                        break;
                    }
                }
            }
            string cacheurl = cachequeue.Dequeue();
            string[] spiler = cacheurl.Split('|');
            url.Url = spiler[0];
            url.Depth = Convert.ToInt32(spiler[1]);
            url.Weight = Convert.ToInt32(spiler[2]);
        }
        else
        {
            // Memory queue is empty: read straight from the finding file.
            if (!finding.IsEof())
            {
                string findingurl = finding.ReadLine();
                string[] spiler = findingurl.Split('|');
                url.Url = spiler[0];
                url.Depth = Convert.ToInt32(spiler[1]);
                url.Weight = Convert.ToInt32(spiler[2]);
            }
        }
        // If both memory and disk are empty, an empty UrlInfo is returned.
        return url;
    }
}
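
The class above deliberately leaves out the waiting queue. As a rough sketch of how it might be wired in (the waiting.txt file name and the WriteBack and ReadNextFromDisk members are my assumptions, not the author's code): overflow written back from memory goes to a second FileIO, and the refill loop in Dequeue would call ReadNextFromDisk() instead of finding.ReadLine(), so written-back URLs are consumed first:

// Sketch only: a second on-disk queue with higher refill priority.
// Assumes a waiting.txt file next to finding.txt, opened in Init()
// the same way finding is.
private FileIO waiting = new FileIO();

// Write a URL from memory back to disk without losing its priority:
// it goes to the waiting file, not to the end of the finding file.
public void WriteBack(UrlInfo url)
{
    waiting.WriteLine(url.ToString());
}

// Used by the refill loop in Dequeue: drain the waiting file before
// touching the finding file, so written-back URLs keep their priority.
private string ReadNextFromDisk()
{
    if (!waiting.IsEof())
    {
        return waiting.ReadLine();
    }
    if (!finding.IsEof())
    {
        return finding.ReadLine();
    }
    return null; // both disk queues are empty
}

With this in place, Dequeue's refill loop would enqueue the result of ReadNextFromDisk() until it returns null or the memory queue is half full.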

Once the concept is clear, the implementation poses no problems. To be continued...
