Nutch: a simple introduction and simple usage



Nutch is a search engine. I only heard about it from a friend recently; I had touched Lucene a while ago, so I was eager to try search and experimented over the weekend. It felt quite fresh, but the examples online are mostly based on version 0.7 and will not run if you grab 0.8 or later, so after fiddling with it for a long time I am writing down a few notes ~~

System Environment: Tomcat 6.0.13/jdk1.6/nutch0.9/cygwin-cd-release-20060906.iso

Usage process:

1. Because Nutch needs a Unix environment to run, Windows users must first download Cygwin, a free piece of software that simulates a Unix environment on Windows. You can get the online installer from http://www.cygwin.com/, or download the complete installation from http://www-inst.eecs.berkeley.edu/~instcd/iso/ (mine was 1.27 GB, huh, make sure the hard disk space is large enough ~~). You can then install the tools one by one ~~~
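Once the installation finishes, you can launch the Cygwin shell and make sure the Unix environment responds (a trivial smoke test, nothing Nutch-specific):

uname -s
pwd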

2. Download Nutch 0.9 from http://apache.justdn.org/lucene/nutch/; I decompressed it to D:/nutch-0.9 after downloading.

 

3. Create a folder named urls in the nutch-0.9 directory and create a text file inside it (the file name is arbitrary). Add one line of content: http://lucene.apache.org/nutch/. This is the URL to crawl (the trailing "/" after nutch must be included).
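For example, from a Cygwin prompt (following the layout above; the seed file name seed.txt is arbitrary, as noted):

cd /cygdrive/d/nutch-0.9
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/seed.txt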

4. Open the conf directory under nutch-0.9, find crawl-urlfilter.txt, and locate these two lines:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The second line is a regular expression that the URLs you want to crawl must match. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/
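A quick, illustrative way to confirm the edit, run from the nutch-0.9 directory:

grep "apache.org" conf/crawl-urlfilter.txt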

Edit the nutch-site.xml file under the conf directory; it identifies the crawler to the websites being crawled, and the crawl will not run if you do not set it.

The default file is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

The following is an example of my modification:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>myfirsttest</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>myfirsttest</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>myfirsttest.com</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>test@test.com</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

</configuration>
The file above declares the crawler's name, description, home URL, contact email, and other information.

5. To run Nutch in Cygwin, you must add two lines to /etc/profile pointing to the JDK installation directory:

JAVA_HOME=/usr/java/jdk1.6.0_01
export JAVA_HOME
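To check that the variable is picked up, open a new Cygwin shell and run (illustrative):

echo $JAVA_HOME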
OK, now we can start indexing from the seed URL. Run Cygwin, which opens a command window, and enter "cd /cygdrive/d/nutch-0.9" to go to the nutch-0.9 directory.

6. Run "bin/nutch crawl URLs-Dir crawler-depth 3-topn 50-threads 10> & crawl. log"

The parameters mean the following (from the Apache website http://lucene.apache.org/nutch/tutorial8.html):

-dir dir names the directory to put the crawl in.

-threads threads determines the number of threads that will fetch in parallel.

-depth depth indicates the link depth from the root page that should be crawled.

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

crawl.log: the log file
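While the crawl runs you can follow the log, and afterwards inspect the output directory (a sketch, assuming the -dir crawler and crawl.log names used above):

tail -f crawl.log
ls crawler
# expected: crawldb  index  indexes  linkdb  segments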

After the run finishes, you can see a new crawler folder under nutch-0.9, with five directories inside:

①/② crawldb / linkdb: web link directories that store the URLs and the link relationships between them, used as the basis for crawling and re-crawling. Pages expire after 30 days by default (configurable in nutch-site.xml, as mentioned later).

③ segments: stores the fetched page data; its contents relate to the -depth setting above. If depth is set to 2, two subfolders named after the generation time appear under segments, for example "20061014163012". Open such a folder and you can see six subfolders below it (from the Apache website http://lucene.apache.org/nutch/tutorial8.html):

crawl_generate: names a set of URLs to be fetched

crawl_fetch: contains the status of fetching each URL

content: contains the content of each URL

parse_text: contains the parsed text of each URL

parse_data: contains outlinks and metadata parsed from each URL

crawl_parse: contains the outlink URLs, used to update the crawldb

④ indexes: the index directory; my run generated a "part-00000" folder here.

⑤ index: the Lucene index directory (Nutch is built on Lucene; you can see lucene-core-1.9.1.jar under nutch-0.9/lib, and there is a short introduction to the Luke tool at the end). It is the complete index obtained by merging all the indexes under indexes. Note that this index only indexes the page content without storing it, so queries must go back to the segments directory to retrieve the page content.
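To see the per-depth segments and their six subfolders, something like this works (the timestamped name 20061014163012 is just the example from above; yours will differ):

ls crawler/segments
ls crawler/segments/20061014163012
# expected: content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text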

7. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache". This calls the main method of NutchBean to search for the keyword "apache"; in Cygwin you will see "Total hits: 29" (hits are equivalent to a JDBC result set).

Note: if the search result is always 0, you need to configure the nutch-site.xml under nutch-0.9/conf with the same content as in step 9 below (also, if depth was set to 1 in step 6, the search result may be 0), and then re-run step 6.

8. Now let's test under Tomcat. There is a nutch-0.9.war under the nutch-0.9 directory; copy it to Tomcat/webapps. You can decompress it into that directory directly with WinRAR; I let Tomcat decompress it at startup, and the resulting folder is named nutch.
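If you prefer to copy the war from Cygwin, a minimal sketch (the Tomcat location D:/tomcat is an assumption; adjust to your install):

cp /cygdrive/d/nutch-0.9/nutch-0.9.war /cygdrive/d/tomcat/webapps/nutch.war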

9. Open the nutch-site.xml under nutch/WEB-INF/classes. The two <property> elements below are the content to be added; the rest is the original nutch-site.xml content.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>*</value>
  <description></description>
</property>

<!-- file properties -->

<property>
  <name>searcher.dir</name>
  <value>D:/nutch-0.9/crawler</value>
  <description></description>
</property>

</configuration>

http.agent.name: required. If this property is removed, the query result is always 0.

searcher.dir: specifies the path of the crawler directory generated in Cygwin.
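After editing the webapp's nutch-site.xml, restart Tomcat so the change is re-read (standard Tomcat scripts, run from a Windows command prompt; the D:/tomcat path is an assumption):

D:/tomcat/bin/shutdown.bat
D:/tomcat/bin/startup.bat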

We can also set the re-crawl timing here (as mentioned in step 6, pages expire after 30 days by default; strictly speaking, that 30-day expiry corresponds to the db.default.fetch.interval property in nutch-default.xml, measured in days):

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description></description>
</property>

There are also many other parameters; they can all be looked up in the nutch-default.xml under nutch-0.9/conf, where every property is documented with a comment. If you are interested, copy the ones you need into Tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml and experiment.
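To skim what is available, you can page through the file or pull out just the property names (illustrative, run from the nutch-0.9 directory):

less conf/nutch-default.xml
grep "<name>" conf/nutch-default.xml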

10. Open http://localhost:8081/nutch and enter "apache". We see "a total of 29 query results", consistent with the command-line test in step 7 above.

Luke introduction:

Luke is a graphical tool for querying Lucene index files. It shows intuitively how an index is built. It must be used together with the Lucene jar.

Usage process:

1. http://www.getopt.org/luke/ provides 3 download options:

Standalone full jar: lukeall.jar

Standalone minimal jar: lukemin.jar

Separate jars: luke.jar (~113 KB)

lucene-1.9-rc1-dev.jar (~380 KB)

analyzers-dev.jar (~348 KB)

snowball-1.1-dev.jar (~88 KB)

js.jar (~492 KB)

We only need luke.jar from the "separate jars" option.

2. Create a new folder, for example "luke_run", put luke.jar in it, and copy lucene-core-1.9.1.jar from nutch-0.9/lib into the same folder.
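In Cygwin this amounts to (a sketch using the paths from this article; the location of the downloaded luke.jar is an assumption):

mkdir /cygdrive/d/luke_run
cd /cygdrive/d/luke_run
cp /cygdrive/d/luke.jar .          # wherever you downloaded luke.jar to
cp /cygdrive/d/nutch-0.9/lib/lucene-core-1.9.1.jar .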

3. Go to the "luke_run" directory at the CMD command line and enter "java -classpath luke.jar;lucene-core-1.9.1.jar org.getopt.luke.Luke". The Luke GUI opens; from "File" => "Open Lucene Index", open the "nutch-0.9/crawler/index" folder (created in step 6 above), and you can then see the details of the index in Luke.

4. To close, a bit of gossip :) about a problem found in use (it does not occur with lucene-core-1.9.1.jar, so Luke itself does not throw this exception): clicking the "Reconstruct & Edit" button on the "Documents" tab throws an exception:

Exception in thread "Thread-12" java.lang.NoSuchMethodError: org.apache.lucene.document.Field.<init>(Ljava/lang/String;Ljava/lang/String;ZZZZ)V
        at org.getopt.luke.Luke$2.run(Unknown Source)

I was using lucene-core-2.0.0.jar, and it seems the error is caused by a method that was removed in that version; new releases always bring a few compatibility problems like this ~~

 

 

Reposted from: http://blog.csdn.net/xiajing12345/archive/2007/05/21/1619311.aspx
