Web Crawler (3): URL Index

Source: Internet
Author: User

The URL index is used to determine whether a URL has already been crawled. The algorithm is based mainly on the MD5 digital signature of the URL.

Suppose there are no more than 100 million URLs to crawl. If one binary bit indicates whether a URL has been crawled, at least 100 million bits are required. Each bit is called a "slot". MD5 signatures can collide (different URLs may produce the same MD5 value, although the probability is very small), and the fewer slots there are, the more frequent collisions become, so more slots are better. On the other hand, memory usage also matters: to keep crawling efficient, all the slots must be loaded into memory. Currently I use 2 to the 28th power slots, that is, 268435456 (about 268 million) bits, which occupy 32 MB.
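The figures above can be checked with a little arithmetic. The collision estimate in the second and third lines is not from the original text: it assumes that signatures spread URLs uniformly over the m slots and that n distinct URLs have already been marked, so treat it as a rough illustration only.

```latex
\[
m = 2^{28} = 268{,}435{,}456 \text{ slots}, \qquad
\frac{2^{28}\ \text{bits}}{8\ \text{bits per byte}} = 2^{25}\ \text{bytes} = 32\ \text{MB}
\]
\[
P(\text{a new URL lands on a slot that is already 1})
  = 1 - \left(1 - \frac{1}{m}\right)^{n}
  \approx 1 - e^{-n/m}
\]
\[
n = 10^{8}: \quad \frac{n}{m} \approx 0.373, \qquad
P \approx 1 - e^{-0.373} \approx 0.31
\]
```

Under these assumptions the chance of a false "already crawled" answer grows with n/m, which is why adding slots reduces collisions at the cost of memory.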

To determine whether a URL has been crawled, you only need to check whether the slot corresponding to the URL's MD5 signature value is set to 1. Likewise, after a URL has been crawled, its slot should be set to 1.
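The check-and-mark idea can be sketched on a single flat bitmap before the segmented layout is introduced. This is only an illustration: the names demo_signature, bitmap_new, bitmap_test and bitmap_set are hypothetical, the table is shrunk to 2^20 slots for the demo, and a 32-bit FNV-1a hash stands in for the MD5-based signature.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Demo size only: 2^20 slots (128 KB). The article uses 2^28 slots. */
#define SLOT_BITS  20u
#define SLOT_COUNT (1u << SLOT_BITS)

/* Stand-in for the MD5-based signature (assumption): 32-bit FNV-1a. */
uint32_t demo_signature(const char *url) {
    uint32_t h = 2166136261u;
    for (; *url != '\0'; ++url) {
        h ^= (unsigned char)*url;
        h *= 16777619u;
    }
    return h;
}

/* Allocate a zeroed bitmap with one bit per slot; NULL on failure. */
uint32_t *bitmap_new(void) {
    return calloc(SLOT_COUNT / 32u, sizeof(uint32_t));
}

/* Return 1 if the slot for this URL is set, 0 otherwise. */
int bitmap_test(const uint32_t *bits, const char *url) {
    uint32_t slot = demo_signature(url) & (SLOT_COUNT - 1u);
    return (int)((bits[slot >> 5] >> (slot & 31u)) & 1u);
}

/* Set the slot for this URL to 1. */
void bitmap_set(uint32_t *bits, const char *url) {
    uint32_t slot = demo_signature(url) & (SLOT_COUNT - 1u);
    bits[slot >> 5] |= (uint32_t)1 << (slot & 31u);
}
```

A crawler would call bitmap_test before fetching a URL and bitmap_set after fetching it. Because two different URLs can share a slot, a result of 1 means "probably crawled" rather than "certainly crawled".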

The 32 MB of slot storage is not contiguous in memory, because it can be difficult for the operating system to allocate 32 MB of contiguous memory. It is therefore divided into 4096 segments, each containing 2048 32-bit integers: 32 * 2048 * 4096 = 268435456. This is equivalent to a two-dimensional array of integers.
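One way such a segmented table might be allocated is sketched below. The article does not show its allocation code, so url_index_alloc and url_index_free are hypothetical names, and calloc is simply one reasonable choice for obtaining zeroed segments.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SEGMENT_COUNT  4096u  /* number of segments */
#define SEGMENT_LENGTH 2048u  /* 32-bit integers per segment */

/* Free a table whose first `count` segments were allocated. */
void url_index_free(uint32_t **table, size_t count) {
    if (table == NULL) return;
    for (size_t i = 0; i < count; ++i) free(table[i]);
    free(table);
}

/* Allocate SEGMENT_COUNT separate zeroed segments of 8 KB each.
   Returns NULL if any allocation fails. */
uint32_t **url_index_alloc(void) {
    uint32_t **table = calloc(SEGMENT_COUNT, sizeof *table);
    if (table == NULL) return NULL;
    for (size_t i = 0; i < SEGMENT_COUNT; ++i) {
        table[i] = calloc(SEGMENT_LENGTH, sizeof **table);
        if (table[i] == NULL) {
            url_index_free(table, i);  /* release what was allocated so far */
            return NULL;
        }
    }
    return table;
}
```

Each segment is only 8 KB, so no single request comes close to 32 MB, and the result can be indexed as table[segment][offset] just like a two-dimensional array.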

We use a 32-bit MD5-based signature, expressed as an integer. This integer is divided into three parts: the segment address, the segment offset, and the value address. Counting from the most significant bit (bit 1) to the least significant bit (bit 32), bits 5-16 give the segment address (12 bits, 0-4095), bits 17-27 give the segment offset (11 bits, 0-2047), and bits 28-32 (the last 5 bits, with 2 to the 5th power possible values, 0-31) give the bit position within the 32-bit integer, that is, the value address. The first 4 bits are not used.
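The split can be illustrated with plain shifts and masks. This is a sketch of the bit layout described above, not the article's own functions: the names segment_index_of, segment_offset_of and bit_position_of are hypothetical, and unlike byte copying the shifts do not depend on the machine's byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 5-16 (counted from the most significant bit): segment address, 0..4095. */
uint16_t segment_index_of(uint32_t sig) {
    return (uint16_t)((sig >> 16) & 0x0fffu);
}

/* Bits 17-27: segment offset, 0..2047. */
uint16_t segment_offset_of(uint32_t sig) {
    return (uint16_t)((sig >> 5) & 0x07ffu);
}

/* Bits 28-32: position of the slot's bit inside the 32-bit integer, 0..31. */
uint32_t bit_position_of(uint32_t sig) {
    return sig & 0x1fu;
}
```

For the sample value 0x12345678, these helpers give segment address 564 (0x234), segment offset 691, and bit position 24.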

When the MD5 value of a URL is given, the segment address is calculated using the following function:

    #include <string.h>  /* memcpy */

    unsigned short get_segment_index(unsigned int md5) {
        /* Bits 5-16 indicate the segment address. */
        unsigned short result = 0;
        /* Copy the upper two bytes (this assumes a little-endian machine). */
        memcpy(&result, ((char *)&md5) + 2, sizeof(unsigned short));
        /* Keep the low 12 bits of those two bytes. */
        return result & 0x0fff;
    }

Calculate the segment offset using the following function:

 
    unsigned short get_segment_offset(unsigned int md5) {
        /* Bits 17-27 indicate the segment offset. */
        unsigned short result = 0;
        /* Copy the lower two bytes (this assumes a little-endian machine). */
        memcpy(&result, (char *)&md5, sizeof(unsigned short));
        /* Drop the last 5 bits, leaving the 11-bit offset. */
        return result >> 5;
    }

Use the following function to calculate the value, that is, the bit mask selected by the value address:

 
    unsigned int get_value(unsigned int md5) {
        /* Bits 28-32 (the last 5 bits) select one bit within the integer. */
        unsigned int result = 1;
        return result << (md5 & 0x0000001f);
    }

After obtaining the segment address, segment offset, and value offset, you can use the following function to determine whether the URL has been crawled:

 
    #include <stdbool.h>

    bool is_url_crawled(char *url) {
        /* Compute the MD5 signature of the URL, then bitwise-AND its value
           with the stored integer. */
        unsigned int url_md5 = MD5(url);
        unsigned short segment_index = get_segment_index(url_md5);
        unsigned short segment_offset = get_segment_offset(url_md5);
        unsigned int value = get_value(url_md5);

        unsigned int result =
            (unsigned int)(url_index[segment_index][segment_offset] & value);

        /* A non-zero result means the slot is already set to 1. */
        return result != 0;
    }

If the URL has not been crawled yet, mark it as crawled after fetching it, using the following function:

    #include <stdio.h>

    int mark_url_as_crawled(char *url) {
        /* Obtain the segment address, segment offset, and value for the URL. */
        unsigned int url_md5 = MD5(url);
        unsigned short segment_index = get_segment_index(url_md5);
        unsigned short segment_offset = get_segment_offset(url_md5);
        unsigned int value = get_value(url_md5);

        /* Mark the URL as crawled by OR-ing its bit into the stored integer. */
        url_index[segment_index][segment_offset] |= value;

        /* Write the updated integer to the index file synchronously. */
        value = url_index[segment_index][segment_offset];
        long offset = (((long)segment_index) * segment_length + segment_offset)
                      * (long)sizeof(unsigned int);
        if (fseek(index_file, offset, SEEK_SET) != 0)
            return -1;

        if (fwrite(&value, sizeof(unsigned int), 1, index_file) != 1)
            return -1;

        fflush(index_file);
        return 0;
    }
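The file layout implied by the offset calculation above can be tried out in isolation. The sketch below is an illustration under assumptions, not the article's code: the helper names index_file_offset, index_file_write_word and index_file_read_word are hypothetical, and the segment length of 2048 is taken from the layout described earlier.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SEGMENT_LENGTH 2048L  /* 32-bit integers per segment, as in the article */

/* Byte offset in the index file of the integer at [segment_index][segment_offset]. */
long index_file_offset(unsigned short segment_index, unsigned short segment_offset) {
    return ((long)segment_index * SEGMENT_LENGTH + segment_offset)
           * (long)sizeof(uint32_t);
}

/* Write one integer of the index at its position; returns 0 on success. */
int index_file_write_word(FILE *f, unsigned short seg, unsigned short off,
                          uint32_t word) {
    if (fseek(f, index_file_offset(seg, off), SEEK_SET) != 0) return -1;
    if (fwrite(&word, sizeof word, 1, f) != 1) return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* Read one integer of the index back; returns 0 on success. */
int index_file_read_word(FILE *f, unsigned short seg, unsigned short off,
                         uint32_t *word) {
    if (fseek(f, index_file_offset(seg, off), SEEK_SET) != 0) return -1;
    return fread(word, sizeof *word, 1, f) == 1 ? 0 : -1;
}
```

With this layout the last integer, at segment 4095 and offset 2047, starts at byte 33554428, so a complete index file is 32 MB, the same size as the in-memory table. The same read helper could be used to reload the table at startup, although the article does not describe that step.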
