① Set the website crawling entry URL

[root@red-hat-9 nutch-0.9]# cd /zkl/IR/nutch-0.9/
[root@red-hat-9 nutch-0.9]# mkdir urls
[root@red-hat-9 nutch-0.9]# vi urls/urls_crawl.txt

You can create a urls directory and put the seed file urls_crawl.txt inside it, or simply create the file in the Nutch home directory; here we use the latter:

[root@red-hat-9 nutch-0.9]# vi urls_crawl.txt

Write the entry URL of the website to be crawled into this file; starting from this entry, any URL page under the current domain name is fetched. For example:

http://english.gu.cas.cn/ag/

② Specify the crawl filtering rules

Edit Nutch's URL filter rule file conf/crawl-urlfilter.txt:

[root@red-hat-9 nutch-0.9]# vi conf/crawl-urlfilter.txt

Modify the lines

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

so that MY.DOMAIN.NAME is the domain name of the website you want to crawl. This rule accepts all URL pages under the current website, starting from the entry URL given in ①.

③ Filter character settings

If the URLs of the site you crawl contain characters that are filtered out by default, such as ? and =, and you need those pages, change the filter line

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

to

-[*!@]

④ Modify conf/nutch-site.xml

Change it to:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>gucas.ac.cn</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>

The http.agent.name property sets the name the crawler reports for itself when fetching pages, and searcher.dir points to the root directory of the crawl, which is used by the Nutch search. If http.agent.name is not configured, an error complaining that the agent name is not configured appears during crawling.

⑤ Start crawling

Run the crawl command to fetch the website content:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

· -dir: the directory in which the crawled web pages are saved.
· -depth: the link depth to which web pages are fetched.
· -delay: the delay between accesses to different hosts, in seconds.
· -threads: the number of threads to start.
· -topN 1000: only the first N URLs of each level are fetched.

In the command above, urls_crawl.txt is the seed file created earlier (or the directory containing it); -dir specifies the directory where the fetched content is stored, here gucas; -depth is the crawl depth starting from the top-level entry URL; -threads is the number of concurrent threads; -topN means only the first N URLs of each level are fetched. The final >& logs/logs_crawl.log saves the output produced during the crawl to the file logs_crawl.log under the logs directory, so that the program's run can be analyzed afterwards.

After this command completes, a gucas directory is generated under the nutch-0.9 directory, containing the fetched files and the generated indexes; in addition, the logs directory under nutch-0.9 contains the crawl log file logs_crawl.log. If gucas already exists before the run, an error such as "gucas already exists" is reported; delete that directory or specify a different directory to store the crawled pages. After completing the preceding steps, the data has been fetched successfully.
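For reference, steps ① and ⑤ can be strung together in a small shell script. This is only a sketch using the paths and the seed URL from this example; it assumes conf/crawl-urlfilter.txt and conf/nutch-site.xml have already been edited as described in ②–④, and the directory names in the final comment are the ones a Nutch 0.9 crawl typically produces.

#!/bin/bash
# Sketch of steps ① and ⑤ (conf/ must already be edited as in ②-④).
NUTCH_HOME=/zkl/IR/nutch-0.9
cd "$NUTCH_HOME"

# ① write the seed URL
echo "http://english.gu.cas.cn/ag/" > urls_crawl.txt

# ⑤ run the crawl and keep the log
mkdir -p logs
rm -rf gucas        # avoids the "gucas already exists" error on a re-run
bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

# sanity check: a Nutch 0.9 crawl directory typically contains
# crawldb, linkdb, segments, indexes and index
ls gucas

The search test below is the real check that the generated index is usable.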
Test the index from the command line:

[root@red-hat-9 nutch-0.9]# bin/nutch org.apache.nutch.searcher.NutchBean <keyword>

where <keyword> is a query keyword that occurs on the crawled site.

The above only crawls a single website and does not show the advantage of a web crawler fetching data from multiple websites. The following example shows how to crawl data from multiple websites.

Create a new file multiurls.txt in the Nutch home directory and write into it the list of URLs to be downloaded:

http://www.pcauto.com.cn/
http://www.xcar.com.cn/
http://auto.sina.com.cn

Modify the filter rule file crawl-urlfilter.txt so that any site may be downloaded:

# accept hosts in MY.DOMAIN.NAME
+^

# skip everything else
-.

With the accept rule reduced to +^, all website links are allowed by default.

Run the crawl command:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl multiurls.txt -dir mutilweb -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

Change conf/nutch-site.xml to:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>*</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/mutilweb</value>
    <description>Path to root of crawl</description>
  </property>
</configuration>

Here http.agent.name is the name of the web crawler, and searcher.dir points to the directory in which this crawl was stored.

Test:

[root@red-hat-9 nutch-0.9]# bin/nutch org.apache.nutch.searcher.NutchBean suv

Query the keyword "suv".

---------------------------------------------------------------------

6. Deploy the web front end

Copy the nutch-0.9.war package from the Nutch home directory to the Tomcat webapps directory:

[root@red-hat-9 nutch-0.9]# cp nutch-0.9.war /zkl/Program/apache-tomcat-6.0.18/webapps/

Then open http://localhost:8080/nutch-0.9/ in a browser; the war package is unpacked automatically and a nutch-0.9 folder appears under the Tomcat web application directory webapps.

7. Modify the web configuration of Nutch in Tomcat

vi /zkl/Program/apache-tomcat-6.0.18/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml

Change the searcher.dir property value to the directory in which the index was generated:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/zkl/IR/nutch-0.9/gucas</value>
    <description>
      Path to root of crawl. This directory is searched (in order) for either
      the file search-servers.txt, containing a list of distributed search
      servers, or the directory "index" containing merged indexes, or the
      directory "segments" containing segment indexes.
    </description>
  </property>
</configuration>
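The web application reads nutch-site.xml when it starts, so in practice the new searcher.dir value only takes effect after Tomcat (or at least the nutch-0.9 web application) is restarted. The lines below are only a sketch of that restart-and-check step; the shutdown.sh/startup.sh scripts are the standard ones shipped with Tomcat, and the paths follow the ones used in this article.

# restart Tomcat so the nutch-0.9 webapp re-reads nutch-site.xml
TOMCAT_HOME=/zkl/Program/apache-tomcat-6.0.18
$TOMCAT_HOME/bin/shutdown.sh
$TOMCAT_HOME/bin/startup.sh

# command-line cross-check against the same index
# (searcher.dir in /zkl/IR/nutch-0.9/conf/nutch-site.xml must point to the same crawl directory)
cd /zkl/IR/nutch-0.9
bin/nutch org.apache.nutch.searcher.NutchBean suv

# then search for the same keyword at http://localhost:8080/nutch-0.9/
# and compare the hits with the command-line output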