[WBIA 1.4] Modifying Heritrix code to obtain links between webpages


As mentioned in 1.2, you can modify the Heritrix code to obtain the links between webpages; this article explains how. Before modifying the Heritrix code, you must first set up the Heritrix source code as a project.

1. Heritrix code configuration

The following describes how to configure the Heritrix code; part of the content is adapted from http://www.ibm.com/developerworks/cn/opensource/os-cn-heritrix?S_tact=105agx52&s_cmp=reg-CCID

First, create a Java project named MyHeritrix in Eclipse. Then use the downloaded source code to configure the project according to the following steps.

1. Import the class library

The libraries used by Heritrix are all under the heritrix-1.14.4-src\lib directory and need to be imported into the MyHeritrix project.

1) Copy the lib folder under heritrix-1.14.4-src to the root directory of the MyHeritrix project;

2) Right-click the MyHeritrix project and choose "Build Path > Configure Build Path...", select the "Libraries" tab, and click "Add JARs...", as shown in Figure 1.

Figure 1. Import the class library (before import)

3) In the pop-up "JAR Selection" dialog box, select all the JAR files in the lib folder of the MyHeritrix project and click OK, as shown in Figure 2.

Figure 2. Select a class library

Figure 3 shows the result after the settings are complete:

Figure 3. Import the class library (after import)

2. Copy source code

1) Copy the com, org, and st folders under heritrix-1.14.4-src\src\java into the src directory of the MyHeritrix project. These three folders contain the core source code required to run Heritrix;

2) Copy the file tlds-alpha-by-domain.txt from heritrix-1.14.4-src\src\resources\org\archive\util to MyHeritrix\src\org\archive\util. This file is a list of top-level domain names and is read when Heritrix starts;

3) Copy the conf folder under heritrix-1.14.4-src\src to the MyHeritrix project root directory. It contains the configuration files required for Heritrix to run;

4) Copy the webapps folder under heritrix-1.14.4-src\src to the MyHeritrix project root directory. This folder provides the servlet engine and contains the Heritrix Web UI files. Note that it does not include the help documentation. If you want the help pages, copy the articles folder from heritrix-1.14.4.zip\docs into MyHeritrix\webapps\admin\docs (you need to create the docs folder first). Alternatively, you can replace the webapps folder taken from heritrix-1.14.4-src with the webapps folder in heritrix-1.14.4.zip, but that one contains a packaged .war file that cannot be modified.

The directory hierarchy of the MyHeritrix project after copying is shown in Figure 4. The source code required to run Heritrix is now ready; next, modify the configuration file and add the run parameters.

Figure 4. directory hierarchy of the myheritrix Project

3. Modify the configuration file

The conf folder provides the configuration files, and it contains a very important one: heritrix.properties. heritrix.properties holds a large number of parameters that are closely related to how Heritrix runs; they determine the default tools used at run time, the startup parameters of the Web UI, and the format of the Heritrix logs. When you run Heritrix for the first time, you only need to modify this file and add the user name and password for the Web UI, as shown in Figure 5: set heritrix.cmdline.admin = admin:admin, where "admin:admin" is the user name and password, and set the version parameter to 1.14.4.

Figure 5. Set the login user name and password
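
For reference, after this change the relevant lines in conf/heritrix.properties would look roughly like the following (heritrix.version is the property name used by the 1.14.4 source distribution; if your copy names the version key differently, adjust accordingly):

    heritrix.cmdline.admin = admin:admin
    heritrix.version = 1.14.4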

4. Configure the run configuration

Right-click the MyHeritrix project and choose "Run As > Run Configurations...", and make sure the Project and Main class entries on the Main tab are correct, as shown in Figure 6. The Name field can be set to any easily recognized name.

Figure 6. Configure the running file-set the project and class

On the Classpath tab, select "User Entries"; the Advanced button on the right becomes enabled. Click it, select "Add Folders" in the dialog that appears, and choose the conf folder under the MyHeritrix project, as shown in Figure 7.

Figure 7. Add a configuration file

So far, our MyHeritrix project is ready to run. Next, let's look at how to start Heritrix and set up a specific crawl task.

Create a webpage crawl task

Find the Heritrix.java file in the org.archive.crawler package; it is the entry point for starting the Heritrix crawler. Right-click it and choose "Run As > Java Application". If the configuration is correct, the startup information shown in Figure 8 is printed on the console.

Figure 8. Console output when running successfully
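
Equivalently, instead of using the Eclipse run configuration you could start the crawler from a small wrapper class of your own; a minimal sketch (MyHeritrixLauncher is a made-up name, and the conf folder still has to be on the classpath so heritrix.properties can be found):

    import org.archive.crawler.Heritrix;

    // Hypothetical wrapper; it simply delegates to the same entry point that
    // "Run As > Java Application" on Heritrix.java invokes.
    public class MyHeritrixLauncher {
        public static void main(String[] args) throws Exception {
            Heritrix.main(args);
        }
    }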

Enter http://localhost:8080 in your browser to open the Web UI login page shown in Figure 9.

Figure 9. heritrix logon page

2. Possible problems during configuration

The following content is adapted from http://hi.baidu.com/liuqiyuan/blog/item/d0dd42a74005b384d0435825.html

Error 1: "Access restriction: The type FileURLConnection is not accessible due to restriction on required library C:\Program Files\Java\jdk1.6.0_20\jre\lib\rt.jar", as shown in Figure 1.
Solution: This error is caused by JRE access restrictions. Right-click the MyHeritrix project and choose "Build Path > Configure Build Path...", select the "Libraries" tab, remove the "JRE System Library", and then re-add it to fix the problem. Alternatively, open "Window > Preferences > Java > Compiler > Errors/Warnings", find "Forbidden reference (access rules)" under "Deprecated and restricted API", and change it from the default "Error" to "Warning" or "Ignore".
Figure 1. Access restriction error

Error 2: A NullPointerException is reported, as shown in Figure 2. The error is caused by the missing tlds-alpha-by-domain.txt file, which can be found in heritrix-1.14.4-src\src\resources\org\archive\util and should be copied into the org.archive.util package (MyHeritrix\src\org\archive\util).

Figure 2. NullPointerException error

In addition, after configuration many options may not be visible in the job settings, such as the encoding option, because some expert settings are hidden. Click "Show expert settings" at the top of the Settings page to display these options.

 

3. Modify the Heritrix code to record links

To record the complete link graph, we need to record each candidate page before the URL duplicate check (the test that decides whether a URL has already been seen), so that every discovered link is captured. That means finding Heritrix's URI de-duplication module and modifying its code so that the candidate URL and the URL of the page it was found on are written to a file; each such pair is one edge of the link graph.

Heritrix has several implementations of this de-duplication; the corresponding classes include BdbUriUniqFilter, FpUriUniqFilter, and BloomUriUniqFilter. BdbUriUniqFilter uses Berkeley DB to record the URIs and queries the database to decide whether a URI has been seen before. FpUriUniqFilter ("fp" for fingerprint) hashes each URI into a 64-bit fingerprint kept in a hash set, and the class supports two hash algorithms, MD5 and SHA-1. BloomUriUniqFilter uses a Bloom filter for the membership test. For more information about the Bloom filter mechanism, see http://blog.csdn.net/jiaomeng/article/details/1495500.
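
As a rough self-contained illustration of the Bloom filter idea (a toy sketch, not Heritrix's BloomUriUniqFilter): each URI is mapped by several hash functions to bit positions in a bit array, and a URI is reported as already seen only if all of its bits are set, so the structure may produce occasional false positives but never misses a URI it has added.

    import java.util.BitSet;

    // Toy Bloom filter for URI strings; illustrative only.
    public class ToyBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        public ToyBloomFilter(int size, int hashCount) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        // Derive the i-th bit position from two base hashes (double hashing).
        private int position(String uri, int i) {
            int h1 = uri.hashCode();
            int h2 = uri.hashCode() * 31 + uri.length();
            return Math.abs((h1 + i * h2) % size);
        }

        // True means "probably seen before"; false means "definitely new".
        public boolean probablyContains(String uri) {
            for (int i = 0; i < hashCount; i++) {
                if (!bits.get(position(uri, i))) {
                    return false;
                }
            }
            return true;
        }

        public void add(String uri) {
            for (int i = 0; i < hashCount; i++) {
                bits.set(position(uri, i));
            }
        }
    }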

In Heritrix, these de-duplication classes all derive from a class named SetBasedUriUniqFilter, and it is SetBasedUriUniqFilter that we modify so that each link is recorded before the duplicate check. The class is defined in SetBasedUriUniqFilter.java. First, open a new file in the class constructor to record the links between URLs. The code is as follows:

    public SetBasedUriUniqFilter() {
        super();
        String profileLogFile = System.getProperty(
                SetBasedUriUniqFilter.class.getName() + ".profileLogFile");
        if (profileLogFile != null) {
            setProfileLog(new File(profileLogFile));
        }
        if (linkMap != null) {
            return;
        }
        try {
            linkMap = new FileWriter("linkmap.txt");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

As you can see, the add function is where the duplicate check happens, so it is reworked to log each link before that check. Because Heritrix is multi-threaded, a synchronized block on a mutex object provides simple mutual exclusion, preventing garbled output when several threads write the file at the same time. The code is as follows:

    public void add(String key, CandidateURI value) {
        synchronized (mutex) {
            // Record the "via URL -> discovered URL" edge before the duplicate check.
            if (linkMap != null) {
                String link = new String(value.flattenVia() + "\t"
                        + value.toString() + "\n");
                try {
                    linkMap.write(link, 0, link.length());
                    linkMap.flush();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
            // Original duplicate-check logic follows.
            profileLog(key);
            if (setAdd(key)) {
                this.receiver.receive(value);
                if (setCount() % 50000 == 0) {
                    LOGGER.log(Level.FINE, "count: " + setCount()
                            + " totalDups: " + duplicateCount
                            + " recentDups: "
                            + (duplicateCount - duplicatesAtLastSample));
                    duplicatesAtLastSample = duplicateCount;
                }
            } else {
                duplicateCount++;
            }
        }
    }
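
The two snippets above assume that two extra fields have been added to SetBasedUriUniqFilter alongside its existing ones; a minimal sketch of those declarations (making them static so all filter instances share one log file is an assumption about the author's setup, not part of stock Heritrix):

    // Assumed additions to SetBasedUriUniqFilter (plus imports for
    // java.io.FileWriter and java.io.IOException):
    protected static FileWriter linkMap = null;          // shared writer for linkmap.txt
    protected static final Object mutex = new Object();  // guards writes from concurrent threads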

After making these modifications, run the program as described earlier to start crawling again. During the crawl, a file named linkmap.txt appears in the Heritrix root directory. Our crawl lasted a day and a night, and by the time it finished Heritrix had fetched roughly 90% of the pages. The link relationships recorded in linkmap.txt look like this (each line holds the via URL, i.e. the page on which the link was found, a tab, and the discovered URL):

http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/statics/css/reset.css
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/statics/css/2011.css
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/statics/js/jquery.min.js
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/statics/js/jquery.sgallery.js
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/statics/js/png.js
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/statics/js/2011/banner.js
http://www.pkusz.edu.cn/    http://english.pkusz.edu.cn/
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/index.php
http://www.pkusz.edu.cn/    http://www.pkusz.edu.cn/special/vote
http://www.pkusz.edu.cn/    http://news.pkusz.edu.cn/index.php?m=content&c=index&a=show&catid=143&id=1277
...

We found that many of these records point to CSS and JS files, and there are many duplicate edges. Therefore, before computing PageRank or anything else, we need to remove the CSS and JS entries and deduplicate the edges; a rough sketch of this post-processing follows below. At this point we have a set of webpage URLs and a set of links between webpages; after removing CSS and JS and deduplicating the URLs and links, we are left with 161,153 URLs and 4,264,030 edges. With this link file in hand, we can compute the crawl coverage rate by comparing it against other people's crawl results, which will be discussed in the next article.
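
As a rough post-processing sketch (a minimal example under assumptions, not the author's actual script: the linkmap.txt name and the tab-separated "via URL, discovered URL" line format follow the add() code above, while the class name and output file name are made up for illustration):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: drop CSS/JS targets from linkmap.txt and remove duplicate edges.
    public class LinkMapCleaner {
        public static void main(String[] args) throws IOException {
            // Keeps all distinct edges in memory; acceptable for a few million edges.
            Set<String> edges = new HashSet<String>();
            BufferedReader in = new BufferedReader(new FileReader("linkmap.txt"));
            PrintWriter out = new PrintWriter("edges-clean.txt");
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length != 2) {
                    continue;                      // skip malformed records
                }
                String via = parts[0];
                String target = parts[1];
                String lower = target.toLowerCase();
                if (lower.endsWith(".css") || lower.endsWith(".js")) {
                    continue;                      // drop stylesheet and script links
                }
                String edge = via + "\t" + target;
                if (edges.add(edge)) {             // HashSet.add returns false for duplicates
                    out.println(edge);
                }
            }
            in.close();
            out.close();
        }
    }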
