[WBIA 1.5] Calculating crawl coverage from the obtained link relationships

[WBIA 1.4] recorded how the links were obtained, so all URLs can easily be read from that file. The basic idea for evaluating crawl coverage is simple: if the number of webpages I crawled is X and the total number of webpages is N, the coverage rate is X/N. However, N cannot be obtained directly, so we estimate it by sampling: randomly draw n pages from the web collection, and if x of those n pages appear in our crawled set, the estimated coverage is x/n. The results of a different crawl can serve as the sample. I asked a classmate for his URL list and, after some simple processing, treated it as the random sample described above and computed the coverage rate. The code is as follows:

#include <iostream>
#include <fstream>
#include <time.h>
#include <set>
#include <string>
using namespace std;

set<string> myUrls;
set<string> otherUrls;

// Return the current time as a formatted string, for log output.
string nowTime() {
    char outTime[64];
    time_t t = time(0);
    strftime(outTime, sizeof(outTime), "%Y/%m/%d %X", localtime(&t));
    return outTime;
}

// Skip stylesheet, text, and script links; they are not webpages.
bool isUseless(const string& urlStr) {
    if (urlStr.size() < 4) return false;  // too short to carry these extensions
    return urlStr.substr(urlStr.size() - 4) == ".css"
        || urlStr.substr(urlStr.size() - 4) == ".txt"
        || urlStr.substr(urlStr.size() - 3) == ".js";
}

int main() {
    ifstream myIn("mylinkmap.txt");
    ifstream otherIn("otherlinkmap.txt");
    string urlStr;

    cout << nowTime() << ": Begin init my URLs." << endl;
    while (myIn >> urlStr) {
        if (isUseless(urlStr)) continue;
        myUrls.insert(urlStr);
    }

    cout << nowTime() << ": Begin init other's URLs." << endl;
    while (otherIn >> urlStr) {
        if (isUseless(urlStr)) continue;
        otherUrls.insert(urlStr);
    }

    int all = 0, exist = 0;
    cout << nowTime() << ": Begin calculate coverage." << endl;
    for (set<string>::iterator it = otherUrls.begin(); it != otherUrls.end(); ++it) {
        all++;
        if (myUrls.find(*it) != myUrls.end())
            exist++;
    }
    cout << "The coverage is " << double(exist) / all * 100 << "%" << endl;
    return 0;
}

After running the program, the coverage rate came out to 40.0702%, lower than we expected. After careful examination, we identified three possible causes.

1. The other crawl's results contain many images, DOC files, PPT files, and other resources, which my crawl cannot include because I restricted the allowed file types in the crawling module.

2. The crawling boundary is ill-defined. Since this is not a whole-web crawl, which webpages belong to the Shenzhen Research Institute and which do not is a matter of definition, and our definitions may differ. My classmate may have crawled sites that I do not consider part of the Shenzhen Research Institute.

3. The crawling time may have been a little short, so many webpages had not been fetched yet.

Given these factors, the coverage rate seems reasonable after all. The next section describes how to calculate PageRank.
