Requirements (my choices):
- Java 1.5
- Apache Tomcat 5.x
- Win32 with Cygwin
- Nutch, of course
In my opinion, Nutch 0.7.2 is not very useful (some will say that such impressions are unreliable, but never mind). Instead, you can check out the source code from the svn repository and build it with ant package, which gives you a 0.8 version of Nutch. The directory structure after the build:
nutch
  bin -- scripts to execute
  build -- compiled classes, plug-in directories, and the war package for Tomcat
  conf -- the various configuration files
  ...
Here is the English tutorial for 0.8: http://lucene.apache.org/nutch/tutorial8.html. It is fairly detailed, but it contains a few small bugs that can trip people up badly. In general, a bug in the documentation is more irritating than a bug in the program. This document of mine is also riddled with bugs, so whoever reads it may get annoyed too. If you come across a Chinese tutorial, please post a link. There is also an unofficial wiki at
http://wiki.media-style.com/display/nutchdocu/home, where you can find an example of adding Nutch to Eclipse.
1. Run bin/nutch with no arguments to list all of its commands. One possible output is:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl         one-step crawler for intranets
  readdb        read/dump crawl DB
  mergedb       merge crawldb-s, with optional fi
  readlinkdb    read/dump link DB
  inject        inject new URLs into the database
  generate      generate new segments to fetch
  fetch         fetch a segment's pages
  parse         parse a segment's pages
  segread       read/dump segment data
  mergesegs     merge several segments, with opti
  updatedb      update crawl DB from segments aft
  invertlinks   create a linkdb from parsed segme
  mergelinkdb   merge linkdb-s, with optional fil
  index         run the indexer on parsed segment
  merge         merge several segment indexes
  dedup         remove duplicates from a set of s
  plugin        load a plugin and run one of its
  server        run a search server
 or
  CLASSNAME     run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
2. Run bin/nutch followed by a bare command (no parameters) to see how that command is used. For example,
bin/nutch crawl may print:
Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
3. You can use commands such as readdb, readlinkdb, and segread to check your data.
Intranet crawling:
Suited to cases where the expected total is around 1 million web pages spread over a limited number of sites. Here the one-step command bin/nutch crawl is the more convenient choice, and it is sufficient for many vertical search applications.
Whole-web crawling:
Captures massive amounts of WWW data. The work is divided into the following steps:
inject -- inject URLs
generate -- generate a fetch list
fetch -- fetch the web pages
updatedb -- update the crawldb
invertlinks -- invert the links
index -- create the index
dedup -- remove duplicates
merge -- merge the indexes
In fact, these two modes are basically the same and are interchangeable; the only difference lies in the configuration files (in my opinion!). With the crawl command, the configuration file crawl-urlfilter.txt and the suffix filter come into play. Note that the configuration files must be on the class search path; if you start the program with the bin/nutch script, they should be in the conf directory. Also note that whether these filter files take effect depends on your plug-in configuration. Yes, everything has to be configured!
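To make the role of a filter file such as crawl-urlfilter.txt more concrete, here is a minimal sketch of how a list of +/- regular-expression rules can be evaluated. It assumes the usual convention that each rule line starts with + (include) or - (exclude), that the first matching rule decides, and that a URL matching no rule is ignored. The class name UrlFilterSketch and the example.com rules are made up for illustration; this is not Nutch's own implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified illustration of a +/- regex rule list; not Nutch's own code. */
final class UrlFilterSketch {
  /** One rule: a sign ('+' accept, '-' reject) and a regular expression. */
  private static final class Rule {
    final boolean accept;
    final Pattern pattern;

    Rule(boolean accept, Pattern pattern) {
      this.accept = accept;
      this.pattern = pattern;
    }
  }

  private final List<Rule> rules = new ArrayList<Rule>();

  /** Parses rule lines; blank lines and lines starting with '#' are skipped. */
  UrlFilterSketch(String[] lines) {
    for (String raw : lines) {
      String line = raw.trim();
      if (line.length() == 0 || line.charAt(0) == '#') {
        continue;
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Rule must start with + or -: " + line);
      }
      rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
    }
  }

  /** The first rule whose pattern is found in the URL decides; no match means reject. */
  boolean accepts(String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.accept;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical rules for an imaginary site, example.com.
    UrlFilterSketch filter = new UrlFilterSketch(new String[] {
      "# skip non-http schemes and static resources",
      "-^(file|ftp|mailto):",
      "-\\.(gif|jpg|png|css)$",
      "+^http://([a-z0-9]*\\.)*example\\.com/",
      "-."
    });
    System.out.println(filter.accepts("http://www.example.com/page.html"));
    System.out.println(filter.accepts("http://www.example.com/logo.gif"));
    System.out.println(filter.accepts("http://other.org/"));
  }
}
```

Because the first match wins, the order of the rules matters: the exclusion of image suffixes has to come before the line that accepts the whole domain.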
To update, execute the following loop:
generate
fetch
updatedb
invertlinks
index
dedup
merge
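The relationship between the whole-web steps and this loop can be sketched as follows: inject runs once to seed the crawldb, and every later round repeats the remaining steps. The sketch only lists step names in order and does not call Nutch; the class name CrawlPlanSketch is made up for illustration, and the arguments of each command are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: lists step names in order; it does not invoke Nutch. */
final class CrawlPlanSketch {
  // Step names exactly as listed in the text above.
  static final List<String> UPDATE_LOOP = Arrays.asList(
      "generate", "fetch", "updatedb", "invertlinks", "index", "dedup", "merge");

  /** Returns the steps for a first crawl followed by the given number of update rounds. */
  static List<String> plan(int updateRounds) {
    List<String> steps = new ArrayList<String>();
    steps.add("inject");           // seed URLs are injected only once
    steps.addAll(UPDATE_LOOP);     // first download
    for (int i = 0; i < updateRounds; i++) {
      steps.addAll(UPDATE_LOOP);   // each update repeats the loop without inject
    }
    return steps;
  }

  public static void main(String[] args) {
    for (String step : plan(1)) {
      // Command arguments are omitted on purpose; see each command's usage output.
      System.out.println("bin/nutch " + step + " ...");
    }
  }
}
```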
I made a simple modification of org.apache.nutch.crawl.Crawl to produce a new class that performs the whole update in one step.
package org.apache.nutch.crawl;

// Import locations assume the 0.8 development trunk this class was written against.
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.logging.Logger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.LogFormatter;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.DeleteDuplicates;
import org.apache.nutch.indexer.IndexMerger;
import org.apache.nutch.indexer.Indexer;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class CrawlUpdate {
  public static final Logger LOG = LogFormatter
      .getLogger("org.apache.nutch.crawl.CrawlUpdate");

  // Timestamp used to name directories; MM = month, HH = 24-hour clock.
  private static String getDate() {
    return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(System
        .currentTimeMillis()));
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.out
          .println("Usage: CrawlUpdate [-dir d] [-threads n] [-topN N]");
      return;
    }
    Configuration conf = NutchConfiguration.create();
    conf.addDefaultResource("crawl-tool.xml");
    JobConf job = new NutchJob(conf);
    Path dir = new Path("crawl-" + getDate());
    int threads = job.getInt("fetcher.threads.fetch", 10);
    int topN = Integer.MAX_VALUE;
    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      }
    }
    FileSystem fs = FileSystem.get(job);
    // Unlike Crawl, an update needs an existing crawl directory.
    if (!fs.exists(dir)) {
      throw new RuntimeException(dir + " doesn't exist.");
    }
    LOG.info("crawl started in: " + dir);
    LOG.info("threads = " + threads);
    if (topN != Integer.MAX_VALUE)
      LOG.info("topN = " + topN);
    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    // Each run writes its own indexes<timestamp> directory.
    Path indexes = new Path(dir + "/indexes" + getDate());
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());
    Path segment = new Generator(job).generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());
    new Fetcher(job).fetch(segment, threads, Fetcher.isParsing(job)); // fetch
    if (!Fetcher.isParsing(job)) {
      new ParseSegment(job).parse(segment); // parse it, if needed
    }
    new CrawlDb(job).update(crawlDb, segment); // update crawldb
    new LinkDb(job).invert(linkDb, new Path[] { segment }); // invert links
    // index, dedup & merge
    new Indexer(job)
        .index(indexes, crawlDb, linkDb, new Path[] { segment });
    // Collect every indexes* directory, from this run and from earlier ones.
    Path[] indexesDirs = fs.listPaths(dir, new PathFilter() {
      public boolean accept(Path p) {
        return p.getName().startsWith("indexes");
      }
    });
    new DeleteDuplicates(job).dedup(indexesDirs);
    List<Path> indexesParts = new ArrayList<Path>();
    for (int i = 0; i < indexesDirs.length; i++) {
      indexesParts.addAll(Arrays.asList(fs.listPaths(indexesDirs[i])));
    }
    new IndexMerger(fs, indexesParts
        .toArray(new Path[indexesParts.size()]), index, tmpDir, job)
        .merge();
    LOG.info("crawl update finished: " + dir);
  }
}
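The directory names in this class come from a SimpleDateFormat pattern, and in such patterns letter case matters: MM is the month while mm is the minute, and HH is the hour on a 24-hour clock while hh is the hour on a 12-hour clock. The sketch below formats one fixed instant with both spellings to show the difference. The class name TimestampSketch and the sample date are made up for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class TimestampSketch {
  /** Formats a date in UTC with the given SimpleDateFormat pattern. */
  static String format(String pattern, Date date) {
    SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.US);
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return fmt.format(date);
  }

  /** A fixed sample instant: 2006-07-15 13:45:30 UTC (arbitrary, for illustration). */
  static Date sample() {
    Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.US);
    cal.clear();
    cal.set(2006, Calendar.JULY, 15, 13, 45, 30);
    return cal.getTime();
  }

  public static void main(String[] args) {
    // Year, month, day, 24-hour, minute, second.
    System.out.println(format("yyyyMMddHHmmss", sample())); // 20060715134530
    // Year, minute, day, 12-hour, minute, second: the month is lost.
    System.out.println(format("yyyymmddhhmmss", sample())); // 20064515014530
  }
}
```

With the all-lowercase pattern, two runs in different months but at the same day and time of day would produce the same name, so the upper-case MM and HH are the safer choice for directory timestamps.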
In this way, I can update my search data periodically, following this pattern:
Crawl urlsdir -dir crawl -topN 1000 -- first download
CrawlUpdate -dir crawl -topN 1000 -- update
CrawlUpdate -dir crawl -topN 1000 -- update again
...
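Since these runs are meant to be repeated unattended, it can help if a mistyped command line fails with a clear message. The following is a minimal sketch of stricter option parsing for the three options used above. It assumes the spellings -dir, -threads and -topN and a default of 10 threads; the class name UpdateArgsSketch is made up for illustration and is not part of Nutch.

```java
/** Illustration only: stricter parsing of the -dir, -threads and -topN options. */
final class UpdateArgsSketch {
  String dir = null;              // stays null when -dir is not given
  int threads = 10;               // assumed default number of fetcher threads
  int topN = Integer.MAX_VALUE;   // meaning "no limit"

  /** Parses the options, failing clearly on unknown flags or missing values. */
  static UpdateArgsSketch parse(String[] args) {
    UpdateArgsSketch out = new UpdateArgsSketch();
    for (int i = 0; i < args.length; i++) {
      String flag = args[i];
      boolean known = "-dir".equals(flag) || "-threads".equals(flag)
          || "-topN".equals(flag);
      if (!known) {
        throw new IllegalArgumentException("Unknown option: " + flag);
      }
      if (i + 1 >= args.length) {
        throw new IllegalArgumentException("Missing value after " + flag);
      }
      String value = args[++i];   // consume the option's value
      if ("-dir".equals(flag)) {
        out.dir = value;
      } else if ("-threads".equals(flag)) {
        out.threads = Integer.parseInt(value);
      } else {
        out.topN = Integer.parseInt(value);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    UpdateArgsSketch parsed = parse(new String[] {"-dir", "crawl", "-topN", "1000"});
    System.out.println(parsed.dir + " " + parsed.threads + " " + parsed.topN);
  }
}
```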
I haven't figured out whether this is a Lucene restriction or the result of some other consideration, but when updating (more precisely, when updating the index) I have to stop Tomcat first. That feels a little awkward.