Key technology - single-host crawler implementation (3) - where should URLs be stored? Memory fills up too fast, and database performance is too poor

This problem is really a space-versus-time trade-off. If you keep every URL in memory, memory fills up quickly; if you keep them in a file, every read and append must touch the disk, which is a significant performance cost. This is exactly why computers have caches, and it leads to my design: three levels of storage - memory, file, and database - which gives satisfactory results in practice. Note that the database level is intended mainly for the distributed crawler: when a crawler node finds a URL that belongs to its own task, it stores that URL in a local file rather than in the shared database, so URLs are not constantly shipped between nodes. That part is covered later. For a single-host crawler, we need to solve the following problems:

1. How does the crawler store a URL?

2. How does the crawler fetch the next URL?

3. Storage access must not disrupt the crawler's smooth operation.

To this end, we design three queues: a memory queue (cache), a file queue (finding), and a waiting queue (waiting). The finding queue holds the URLs parsed out of web pages and lives on disk. The waiting queue, also on disk, holds URLs written back from memory and has higher priority than the finding queue. The cache queue lives in memory and holds the URLs read from disk that are next in line for DNS resolution.

At first, every URL goes into the finding queue on disk. Each new URL read from disk needs a DNS lookup for its hostname, and DNS resolution is very time-consuming, so we keep a batch of URLs in memory: the cache queue. But memory is limited, and when the cache queue is full the excess URLs must be written back to disk. To preserve their original priority, they go into a separate on-disk queue, the waiting queue. If they were appended to the original finding queue instead, the FIFO nature of a queue would push them to the back and destroy their priority. So whenever the memory queue next has room, URLs are fetched from the waiting queue first; only when the waiting queue is empty do we continue fetching from the finding queue.
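
The manager below works on UrlInfo objects serialized one per line as url|depth|weight; that is the format the Split('|') calls in Dequeue parse. UrlInfo itself is not shown in this installment, so here is a minimal sketch of the shape the code assumes (the property names follow the code below; everything else is an assumption):

// Hypothetical sketch of the UrlInfo type assumed by UrlQueueManager.
// The pipe-separated layout matches the Split('|') parsing in Dequeue.
public class UrlInfo
{
    public string Url { get; set; }    // absolute URL to crawl
    public int Depth { get; set; }     // link distance from the seed page
    public int Weight { get; set; }    // scheduling priority

    // Serialize as "url|depth|weight", one record per line on disk.
    public override string ToString()
    {
        return Url + "|" + Depth + "|" + Weight;
    }
}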

We create a UrlQueueManager class. The following code does not yet handle the waiting queue:

using System;
using System.Collections.Generic;

public class UrlQueueManager
{
    public Queue<string> cachequeue = new Queue<string>();
    private FileIO finding = new FileIO();

    public void Init()
    {
        finding.CloseWriteFile();
        finding.CloseReadFile();
        finding.OpenWriteFile(Spider.root + "finding.txt");
        finding.OpenReadFile(Spider.root + "finding.txt");
    }

    // Number of URLs currently held in memory.
    public int Count
    {
        get { return cachequeue.Count; }
    }

    public void Clear()
    {
        cachequeue.Clear();
    }

    public void Enqueue(UrlInfo url)
    {
        // Keep the URL in memory while there is room; otherwise
        // append it to the finding file on disk.
        if (Count < Spider.urlqueuemaxcount)
        {
            cachequeue.Enqueue(url.ToString());
        }
        else
        {
            finding.WriteLine(url.ToString());
        }
    }

    public UrlInfo Dequeue()
    {
        UrlInfo url = new UrlInfo();
        if (Count > 0)
        {
            // When the memory queue drops below half capacity,
            // refill it in a batch from the finding file.
            if (Count < Spider.urlqueuemaxcount / 2)
            {
                for (int i = 0; i < Spider.urlqueuemaxcount / 2; i++)
                {
                    if (!finding.IsEof())
                    {
                        cachequeue.Enqueue(finding.ReadLine());
                    }
                    else
                    {
                        break;
                    }
                }
            }
            string cacheurl = cachequeue.Dequeue();
            string[] spiler = cacheurl.Split('|');
            url.Url = spiler[0];
            url.Depth = Convert.ToInt32(spiler[1]);
            url.Weight = Convert.ToInt32(spiler[2]);
        }
        else
        {
            // Memory queue is empty: read straight from the finding file.
            if (!finding.IsEof())
            {
                string findingurl = finding.ReadLine();
                string[] spiler = findingurl.Split('|');
                url.Url = spiler[0];
                url.Depth = Convert.ToInt32(spiler[1]);
                url.Weight = Convert.ToInt32(spiler[2]);
            }
        }
        // If both memory and disk are empty, an empty UrlInfo is returned.
        return url;
    }
}
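
The class above deliberately leaves out the waiting queue. As a rough sketch of how it might be wired in (the waiting.txt file name and the WriteBack and ReadNextFromDisk members are my assumptions, not the author's code): overflow written back from memory goes to a second FileIO, and the refill loop in Dequeue would call ReadNextFromDisk() instead of finding.ReadLine(), so written-back URLs are consumed first:

// Sketch only: a second on-disk queue with higher refill priority.
// Assumes a waiting.txt file next to finding.txt, opened in Init()
// the same way finding is.
private FileIO waiting = new FileIO();

// Write a URL from memory back to disk without losing its priority:
// it goes to the waiting file, not to the end of the finding file.
public void WriteBack(UrlInfo url)
{
    waiting.WriteLine(url.ToString());
}

// Used by the refill loop in Dequeue: drain the waiting file before
// touching the finding file, so written-back URLs keep their priority.
private string ReadNextFromDisk()
{
    if (!waiting.IsEof())
    {
        return waiting.ReadLine();
    }
    if (!finding.IsEof())
    {
        return finding.ReadLine();
    }
    return null; // both disk queues are empty
}

With this in place, Dequeue's refill loop would enqueue the result of ReadNextFromDisk() until it returns null or the memory queue is half full.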

Once the concept is clear, the implementation poses no problems. To be continued...
