Introduction to using Nutch (2) -- Internet crawling
I am a beginner myself, so if you find any mistakes, please point them out. Thank you!
Web crawling
1. Obtain the download list
Crawling data across the whole Internet requires a large list of entry URLs. Fortunately, Nutch's designers took this into account: the DmozParser tool supports the open DMOZ directory, whose data can be downloaded directly from the Internet. At the time of writing, the latest compressed data file, content.rdf.u8.gz, is 295 MB and grows to 1.91 GB after decompression. You can use the DmozParser tool to randomly extract part of the data from the file and generate a URL list. The command is as follows:
    bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 3000 > dmozurls.txt

The command produces a text file, dmozurls.txt, in the Nutch root directory; the -subset 3000 option keeps roughly one URL in every 3000. The URLs in this file can be added to the crawl database as entry addresses. To build a search engine for the whole Internet, you could add all the data in content.rdf.u8 to the crawl database. Downloading data from the entire web, however, takes a great deal of time and effort and is beyond the scope of my experiment. In addition, dmozurls.txt contains many sites outside China, which are slow to reach from here, so we will take another approach.
The other approach is to start from the large Internet sites in China, whose URLs should be reasonably representative. How the list was compiled is not described here. The result is a text file, chinaurls.txt. Part of it is shown below:
    http://www.baidu.com
    http://www.qq.com
    http://www.google.cn
    http://www.sina.com.cn
    http://www.163.com
    http://www.taobao.com
    http://www.soso.com
    http://www.sohu.com
    http://www.youku.com
    http://www.tianya.cn
    http://www.hao123.com
    http://www.kaixin001.com
    http://www.alibaba.com
    http://www.sogou.com
    http://www.ifeng.com
    http://www.cnzz.com
    http://www.chinaz.com
    http://www.xunlei.com
    http://www.soufun.com
    http://www.126.com
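Before a list like this is handed to Nutch, it can help to tidy it up. The helper below is only a small sketch of my own: clean_seeds is a hypothetical name, not a Nutch tool. It reads a list on standard input, trims whitespace, lowercases the scheme, drops lines that are not http or https URLs, and removes duplicates.

```shell
# Hypothetical helper (not part of Nutch): tidy a seed list read from stdin.
clean_seeds() {
    # Trim leading/trailing whitespace and lowercase the URL scheme.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e 's/^[Hh][Tt][Tt][Pp][Ss]:/https:/' \
        -e 's/^[Hh][Tt][Tt][Pp]:/http:/' |
    # Keep only lines that look like a single http(s) URL.
    grep -E '^https?://[^[:space:]]+$' |
    # Sort and remove duplicates; LC_ALL=C keeps the order locale-independent.
    LC_ALL=C sort -u
}
```

It could be used as clean_seeds < raw.txt > chinaurls.txt, where raw.txt stands for whatever unedited list you start from.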
2. Download a large number of websites
Once the entry URL list is ready, the next step is to import it into Nutch and run the download. The procedure is as follows:
1) In the Nutch root directory, create two new directories, internetweb and urls.
2) Copy the chinaurls.txt file to the urls directory under the Nutch root directory, then use the inject command to add its contents as the initial entry URLs to the crawl database (crawldb) under the internetweb directory.
3) Edit the conf/nutch-site.xml file under the Nutch root directory and set the value of the http.agent.name property. This value is sent in the HTTP request headers when a page is crawled and identifies the web spider.
4) Call the generate command to create a new segment from the injected URL list in the crawldb and store it under the internetweb directory.
5) Look in the segments directory for the most recently generated folder, which is named after the date and time it was created. In this run the folder is 20100325211446. Then use the fetch command to download the page content for the list stored in that folder.
6) Use the updatedb command to extract the URL links from the downloaded segment data and update the crawldb with them.
7) Repeat steps 4), 5), and 6) to download pages until the list in the crawldb is exhausted or the desired page depth is reached. Depth is controlled here by the number of loop iterations.
8) Call the invertlinks command to build the link database from all the links.
9) Index the page content with the index command.
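As a rough sketch, steps 2) and 4) through 9) can be strung together in one shell script. It assumes the Nutch 1.0 command-line syntax (inject, generate, fetch, updatedb, invertlinks, index) and that it is run from the Nutch root directory. The function names, the DEPTH variable, and the crawldb, segments, linkdb and indexes subdirectory names under internetweb are my own choices for illustration, not taken from the original run. Step 3) is not scripted: http.agent.name still has to be set by hand as a property element in conf/nutch-site.xml.

```shell
#!/bin/sh
# Sketch only: assumes Nutch 1.0 command syntax and the Nutch root as the
# working directory. Variable and function names are illustrative.
NUTCH="${NUTCH:-bin/nutch}"             # Nutch launcher script
CRAWL_DIR="${CRAWL_DIR:-internetweb}"   # crawl data directory from step 1
SEED_DIR="${SEED_DIR:-urls}"            # directory holding chinaurls.txt
DEPTH="${DEPTH:-3}"                     # maximum number of crawl rounds

# Print the newest segment directory, or an empty line if there is none.
# Segment names are timestamps, so the last one in sorted glob order is newest.
latest_segment() {
    latest=""
    for d in "$CRAWL_DIR"/segments/2*; do
        [ -d "$d" ] && latest=$d
    done
    printf '%s\n' "$latest"
}

run_crawl() {
    # Step 2: inject the seed URLs into the crawl database.
    $NUTCH inject "$CRAWL_DIR/crawldb" "$SEED_DIR" || return 1

    round=1
    while [ "$round" -le "$DEPTH" ]; do
        # Step 4: generate a new segment (fetch list) from the crawldb.
        before=$(latest_segment)
        $NUTCH generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"
        segment=$(latest_segment)
        # Stop early if no new segment appeared: nothing is left to fetch.
        if [ -z "$segment" ] || [ "$segment" = "$before" ]; then
            break
        fi
        # Step 5: fetch the pages listed in the new segment.
        $NUTCH fetch "$segment" || return 1
        # Step 6: feed the newly discovered links back into the crawldb.
        $NUTCH updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
        round=$((round + 1))
    done

    # Step 8: build the link database from all segments.
    $NUTCH invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1
    # Step 9: index the fetched content.
    $NUTCH index "$CRAWL_DIR/indexes" "$CRAWL_DIR/crawldb" "$CRAWL_DIR/linkdb" \
        "$CRAWL_DIR"/segments/*
}
```

After sourcing the file, call run_crawl; set DEPTH beforehand to change the number of rounds. Counting rounds in the loop is how step 7) controls depth, and comparing the newest segment before and after generate is one simple way to notice that the crawldb has nothing left to fetch.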
This completes a simplified download of Internet data.
The downloaded pages can now be searched.
For more information about how to deploy the search page, see Getting started with Nutch 1.0 (1).
Thank you for reading. I will publish more articles on this topic.