Introduction to using Nutch (2) -- Internet crawling


 

I am also a beginner. If you find any errors, please point them out. Thank you!

Web crawling

1. Obtain the download list

A large list of entry URLs is required to truly crawl data across the Internet. Fortunately, Nutch took this problem into account during its design: the DmozParser tool supports the open DMOZ directory, whose data can be downloaded directly from the Internet. Currently, the latest compressed dump, content.rdf.u8.gz, is about 295 MB and expands to about 1.91 GB after decompression. You can use the DmozParser tool to randomly extract part of the data from this file and generate a URL list. The command is as follows:

 

Java code
  bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 3000 > dmozurls.txt

 

 

The command generates a dmozurls.txt text file under the Nutch root directory. This file can be added to the download library as the set of entry addresses. To build a search engine covering the whole Internet, you could add all of the data in content.rdf.u8 to the download library, but downloading data from the entire web is time-consuming and labor-intensive, and beyond the scope of my experiments. In addition, dmozurls.txt contains many foreign sites that are slow to access from here, so we will take another approach.

Another method is to collect the URLs of large Internet sites in China; these should be reasonably representative. The analysis principles and process are not described here. The result is a chinaurls.txt text file, part of which is shown below:

 

 

Java code
  http://www.baidu.com
  http://www.qq.com
  http://www.google.cn
  http://www.sina.com.cn
  http://www.163.com
  http://www.taobao.com
  http://www.soso.com
  http://www.sohu.com
  http://www.youku.com
  http://www.tianya.cn
  http://www.hao123.com
  http://www.kaixin001.com
  http://www.alibaba.com
  http://www.sogou.com
  http://www.ifeng.com
  http://www.cnzz.com
  http://www.chinaz.com
  http://www.xunlei.com
  http://www.soufun.com
  http://www.126.com

 

 

2. Download a large number of websites

After the entry URL list is ready, the next step is to import it into the Nutch system and complete the download. The procedure is as follows:

1) In the Nutch root directory, create new internetweb and urls directories.
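
A small sketch of this step, assuming everything is run from the Nutch root directory:

    # run these from the Nutch root directory
    mkdir internetweb
    mkdir urls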

2) Copy the chinaurls.txt file into the urls directory under the Nutch root directory, and use its contents to inject the initial entry URLs into the crawl database under the internetweb directory. The command is as follows:
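
A sketch of the copy and inject commands, assuming the crawl database is kept at internetweb/crawldb (inject creates it if it does not exist yet); the exact paths are my reading of the layout described above, and the original output screenshots are not reproduced:

    # copy the seed list into the urls directory and inject it into the crawl database
    cp chinaurls.txt urls/
    bin/nutch inject internetweb/crawldb urls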

3) Modify the conf/nutch-site.xml file in the Nutch root directory and set the value of the http.agent.name property. This value is carried in the HTTP request header when crawling web pages and identifies the web spider. The modification is as follows:
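
A sketch of the nutch-site.xml change; the agent name "MySpider" is only a placeholder of mine, so pick a name that identifies your own crawler:

    <configuration>
      <property>
        <name>http.agent.name</name>
        <!-- placeholder identity sent in the HTTP request header -->
        <value>MySpider</value>
      </property>
    </configuration>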

4) Use the URL list injected into crawldb to call the generate command, which creates a new data segment and stores it under the internetweb directory. The command is as follows:
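
A sketch of the generate step, assuming segments are written under internetweb/segments:

    bin/nutch generate internetweb/crawldb internetweb/segments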

5) Look in the segments directory for the most recently generated folder, which is named after the date; in this run it is 20100325211446. Next, fetch the page content based on the download list generated in that folder. The command is as follows:
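
A sketch of the fetch step for the segment named above; storing the segment path in a shell variable is just my own convenience:

    # fetch the pages listed in the newly generated segment
    s1=internetweb/segments/20100325211446
    bin/nutch fetch $s1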

6) Extract the URL links from the downloaded segment data and update the crawl database with them. The command is as follows:
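
A sketch of the update step, assuming the same crawl database and segment as above:

    # feed the newly discovered links back into the crawl database
    bin/nutch updatedb internetweb/crawldb internetweb/segments/20100325211446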

7) Repeat steps (4), (5), and (6) to download pages until the crawldb list is exhausted or the desired crawl depth is reached. Depth here is controlled simply by the number of cycles, as in the loop sketched below.
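
A minimal loop that repeats steps (4) to (6); the depth of 3 is an arbitrary example, and picking the newest segment with ls and tail is my own convenience:

    for i in 1 2 3; do                                  # each pass adds one level of depth
      bin/nutch generate internetweb/crawldb internetweb/segments
      s=`ls -d internetweb/segments/2* | tail -1`       # newest date-named segment
      bin/nutch fetch $s
      bin/nutch updatedb internetweb/crawldb $s
    done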

8) Call the invertlinks command to build the link database from all the downloaded segments. The command is as follows:
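
A sketch of the invertlinks step, assuming the link database is kept at internetweb/linkdb:

    bin/nutch invertlinks internetweb/linkdb -dir internetweb/segments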

9) Index the page content. The command is as follows:
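
A sketch of the indexing step for the Nutch 1.0 Lucene indexer, assuming the index is written to internetweb/indexes:

    bin/nutch index internetweb/indexes internetweb/crawldb internetweb/linkdb internetweb/segments/*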

In this way, a simplified copy of Internet data has been downloaded.

 

Now we can run searches over the downloaded data.

For more information about how to deploy the search page, see "Getting started with Nutch 1.0 (1)".

 

Thank you for your attention. I will publish more articles on this topic.
