Configuring Nutch 1.2 with Cygwin and Tomcat 6.0


1. Install Cygwin (a Linux-like environment for Windows) from http://www.cygwin.com/. You can run the installer online or download the packages for a local installation.

I used our school's download mirror for the installation; selecting the mirror http://mirrors.163.com/cygwin/ made the download surprisingly fast. Awesome!

After installation, remember to configure the environment variables. I skipped this at first, and compilation errors showed up in Eclipse.

The configuration is as follows:

Add a CYGWIN variable with the value ntsec.

Add E:\myself\lab\cygwin\bin to the PATH variable (that is, the bin folder under wherever Cygwin is installed). If no PATH variable exists, create one.
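
As a quick sanity check (a sketch; it assumes the install path above), you can confirm the variables took effect from a freshly opened Cygwin terminal:

# Run in a NEW Cygwin terminal so the updated environment is picked up.
echo $CYGWIN     # should print: ntsec
which gcc        # should resolve under the Cygwin bin directory
                 # (assuming the devel packages were selected during install)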

2. Download Nutch from http://labs.renren.com/apache-mirror/nutch/apache-nutch-1.2-bin.tar.gz (I found this address amusing; it turns out to be Renren's mirror...)

This mirror hosts plenty of other useful downloads too. I grabbed nutch-1.2, decompressed it, and put it under cygwin\home\happy (replace happy with your own username). That directory is Cygwin's default home, which makes entering commands easier.
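
For reference, the unpacking step from a Cygwin shell might look like this (a sketch; it assumes the tarball was saved into your home directory):

cd ~
tar xzf apache-nutch-1.2-bin.tar.gz   # unpacks to the nutch-1.2 directory referenced below
cd nutch-1.2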

3. The crawl process

This part draws on other people's blogs ~~~ T_T (the annotations are my own)

Create a folder named urls inside nutch-1.2, and inside it create a text file with an arbitrary name. Add one line of content: http://lucene.apache.org/nutch/. This is the URL to crawl (the trailing "/" must be included).
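
From a Cygwin shell, that step might look like this (a sketch; seed.txt is an arbitrary file name of my own choosing):

cd ~/nutch-1.2
mkdir urls
echo 'http://lucene.apache.org/nutch/' > urls/seed.txt   # note the trailing "/"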

Open conf under nutch-1.2, find crawl-urlfilter.txt, and locate these two lines:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

The MY.DOMAIN.NAME part is a regular expression that the URLs you want to crawl must match. Here I changed it to +^http://([a-z0-9]*\.)*apache.org/

If you want to crawl all web pages, you can simply use +^
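
If you prefer to edit from the shell, a one-liner like the following would make the substitution (a sketch; it assumes the stock filter file and the apache.org target used above):

cd ~/nutch-1.2
sed -i 's|MY.DOMAIN.NAME/|apache.org/|' conf/crawl-urlfilter.txt
grep 'apache.org' conf/crawl-urlfilter.txt   # confirm the rewritten filter line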

Edit the nutch-site.xml file under the conf directory. It identifies your crawler to the sites being crawled; the crawl will not run if you skip this.

The default file is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>

The following is an example of my modification:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>myfirsttest</value> <!-- the value cannot be blank -->
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.agent.description</name>
<value>myfirsttest</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>myfirsttest.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>test@test.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

</configuration>
This file declares the crawler's name, description, home URL, contact email, and other identifying information.

Back to my own notes below.

Then open Cygwin and cd into the nutch-1.2 folder.

Run the command "bin/nutch crawl urls -dir crawler -depth 3 -topN 50 -threads 10 >& crawl.log".

The parameters mean the following (from the Apache site http://lucene.apache.org/nutch/tutorial8.html):

-dir: the directory for the crawl results; it must not exist beforehand

-threads: the number of threads to run

-depth: the crawl depth

-topN: the number of URLs fetched at each level, starting from the top-ranked

crawl.log: the log file, where you can follow the crawl's progress
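
Putting the pieces together, a full run might look like this (a sketch; the directory names follow the steps above):

cd ~/nutch-1.2
bin/nutch crawl urls -dir crawler -depth 3 -topN 50 -threads 10 >& crawl.log &
tail -f crawl.log   # watch the crawl progress; Ctrl+C stops the tail, not the crawl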

After it finishes, a new crawler folder appears under nutch-1.2, containing five directories:

①/② crawldb / linkdb: the web link directories, storing URLs and the links between them; they serve as the basis for crawling and re-crawling. Pages expire after 30 days by default (configurable in nutch-site.xml, as mentioned later).

③ segments: stores the fetched page data, one subfolder per fetch round, tied to the -depth setting above. With depth set to 2, two subfolders named by timestamp appear under segments, for example "20061014163012". Opening one of them reveals six subfolders (from Apache http://lucene.apache.org/nutch/tutorial8.html):

crawl_generate: names a set of URLs to be fetched

crawl_fetch: contains the status of fetching each URL

content: contains the content of each URL

parse_text: contains the parsed text of each URL

parse_data: contains outlinks and metadata parsed from each URL

crawl_parse: contains the outlink URLs, used to update the crawldb

④ indexes: the index directory; in my run it generated a "part-00000" folder.

⑤ index: the Lucene index directory (Nutch is built on Lucene; under nutch-1.2\lib you can see lucene-core-1.9.1.jar, and the Luke tool is a simple way to inspect the index). It is the complete index merged from everything under indexes. Note that the index only indexes page content and does not store it, so queries must go back to the segments directory to retrieve the page content.

4. In Cygwin, enter "bin/nutch org.apache.nutch.searcher.NutchBean apache" to search for the keyword "apache" by calling NutchBean's main method. Cygwin prints "Total hits: 29" (hits are roughly the equivalent of a JDBC result set).
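
Spelled out as shell commands, the search step looks roughly like this (output hedged; your hit count will differ):

cd ~/nutch-1.2
bin/nutch org.apache.nutch.searcher.NutchBean apache
# prints something like:
# Total hits: 29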

Note: If the search result is always 0, edit nutch-1.2\conf\nutch-site.xml and try adding the following section. (Also note that the http.agent.name property described earlier must be present; without it the hit count stays 0.)

<!-- file properties -->

<property>

<name>searcher.dir</name>

<value>D:\nutch\crawler</value> <!-- searcher.dir: the crawler path generated in Cygwin, i.e. the directory holding the crawl results -->

<description></description>

</property>

We can also set the re-crawl interval (as mentioned above, pages expire after 30 days by default):

<property>

<name>fetcher.max.crawl.delay</name>

<value>30</value>

<description></description>

</property>

Now, the search result is no longer 0 ~ Happy ~~~

5. Tomcat installation. I had always used the installer version, but this time it kept failing to start; the installer window would just flash by and vanish......

I never got to the bottom of it, so I downloaded the Tomcat "green" version instead, i.e. the installation-free zip. Whichever version you use, configure the environment first.

CATALINA_BASE variable: E:\myself\lab\tomcat6.0 (the installation directory)

CATALINA_HOME variable: E:\myself\lab\tomcat6.0

TOMCAT_HOME variable: E:\myself\lab\tomcat6.0

Add to CLASSPATH: %CATALINA_HOME%\lib\servlet-api.jar (in older Tomcat versions, servlet-api.jar may not be in this directory; locate it yourself)

Add %TOMCAT_HOME%\bin to PATH

At this point, the environment variables are configured. Note that the installation directory name must not contain spaces (such as "Tomcat 6.0"), or errors will occur later...... T_T I found this mentioned in many places; it is a very well-hidden pitfall......
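
As a quick check that everything lines up (a sketch; cygpath converts the Windows path for use inside Cygwin):

echo $CATALINA_HOME                                    # expect: E:\myself\lab\tomcat6.0
ls "$(cygpath "$CATALINA_HOME")/lib/servlet-api.jar"   # the jar must exist here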

Double-click startup.bat under bin in the tomcat6.0 directory. If it runs successfully, congratulations. If it fails, run startup.bat from cmd in tomcat6.0\bin and check the log files under tomcat6.0/logs; they are named by date, so the day's log information is easy to find in a single file.

Search the web for the errors reported in the log and fix them accordingly ~
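
You can also start Tomcat and watch the log from Cygwin (a sketch; the log file name assumes Tomcat 6's date-stamped catalina log):

cd /cygdrive/e/myself/lab/tomcat6.0/bin
cmd /c startup.bat
tail -f ../logs/catalina.$(date +%Y-%m-%d).log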

When you enter localhost:8080 in the browser and the familiar Tomcat cat appears, congratulations ~

6. Deploy Nutch in Tomcat. Copy nutch-1.2.war from the nutch-1.2 folder into Tomcat's webapps; when Tomcat runs, it automatically unpacks nutch-1.2.war under tomcat6.0\webapps. Name the application nutch, then modify /nutch/WEB-INF/classes/nutch-site.xml:

Change

<nutch-conf>

</nutch-conf>

to

<nutch-conf>

<property>

<name>http.agent.name</name>

<value>*</value>

<description></description>

</property>

<property>

<name>searcher.dir</name>

<value>your_crawl_dir_path</value>

</property>

</nutch-conf>

your_crawl_dir_path is the folder where the crawl results were saved.
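
The deployment step from Cygwin might look like this (a sketch; copying the war as nutch.war is my assumption for getting the /nutch context path):

cp ~/nutch-1.2/nutch-1.2.war /cygdrive/e/myself/lab/tomcat6.0/webapps/nutch.war
# Tomcat unpacks it into webapps/nutch on the next start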

Finally, enter http://localhost:8080/nutch in the browser to see the Nutch search interface. Note that after every modification of the nutch-site.xml file, Tomcat must be restarted.

At this point, Nutch searches involving Chinese text may come back garbled; this is actually a Tomcat problem.

Solution: modify server.xml under the /tomcat/apache-tomcat-6.0.20/conf directory:

Change

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"/>

to

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"

URIEncoding="UTF-8"

useBodyEncodingForURI="true"/>

Restart tomcat.
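
From Cygwin, the restart could be (a sketch; paths as above):

cd /cygdrive/e/myself/lab/tomcat6.0/bin
cmd /c shutdown.bat
cmd /c startup.bat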
