Getting Started with Nutch Crawling

  • Environment Setup

Requirements (my choices):

  1. Java 1.5
  2. Apache Tomcat 5.x
  3. Win32 with Cygwin
  4. Nutch itself, of course

Personally, I do not find Nutch 0.7.2 very pleasant to use (some say that such gut feelings are unreliable, but never mind). Instead, you can check the source code out of the svn repository and compile it with ant package, which gives you a 0.8-dev version of Nutch. The compiled directory structure looks like this (a build sketch follows the listing):

nutch/
  bin/   - launch scripts
  build/ - compiled classes, plug-in directories, and the .war package for Tomcat
  conf/  - the various configuration files
  ...
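For reference, a rough build sketch from a Cygwin shell; the svn trunk URL is my assumption of where the 0.8-dev sources lived at the time, so adjust it if the repository has moved:

  # check out the 0.8-dev sources (trunk location assumed)
  svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch
  cd nutch

  # build classes, plug-ins, and the .war for Tomcat; results land in build/
  ant package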

There is an English tutorial for 0.8 at http://lucene.apache.org/nutch/tutorial8.html. It is fairly detailed, but it contains a few small bugs that can drive you crazy; in general, bugs in documentation are more irritating than bugs in code. The notes I am writing here surely have their own bugs too, so we are even. If you come across a Chinese tutorial, please post a link. There is also an unofficial wiki page:

http://wiki.media-style.com/display/nutchdocu/home, where you can find, among other things, an example of setting up Nutch in Eclipse.

  • Tips
  1. Run bin/nutch without arguments to list all available commands. Typical output:

Usage: nutch COMMAND
where COMMAND is one of:
  crawl         one-step crawler for intranets
  readdb        read / dump crawl db
  mergedb       merge crawldb-s, with optional filtering
  readlinkdb    read / dump link db
  inject        inject new urls into the database
  generate      generate new segments to fetch
  fetch         fetch a segment's pages
  parse         parse a segment's pages
  segread       read / dump segment data
  mergesegs     merge several segments, with optional filtering and slicing
  updatedb      update crawl db from segments after fetching
  invertlinks   create a linkdb from parsed segments
  mergelinkdb   merge linkdb-s, with optional filtering
  index         run the indexer on parsed segments and linkdb
  merge         merge several segment indexes
  dedup         remove duplicates from a set of segment indexes
  plugin        load a plugin and run one of its classes main()
  server        run a search server
 or
  CLASSNAME     run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

  2. Run bin/nutch followed by a bare command name (no parameters) to see that command's usage. For example, entering bin/nutch crawl prints something like:

Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
  3. Commands such as readdb, readlinkdb, and segread let you inspect the data you have collected; see the examples below.
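A few inspection commands I find handy, as a sketch: the crawl directory and the time-stamped segment name are placeholders, and each command prints its exact options when run without parameters (option sets vary a little between revisions):

  # summary statistics of the crawldb (URL counts by status)
  bin/nutch readdb crawl/crawldb -stats

  # dump the crawldb as plain text for eyeballing
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

  # dump the linkdb (incoming links per URL)
  bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump

  # dump the contents of one segment (check segread's usage in your build)
  bin/nutch segread -dump crawl/segments/20060801123456 segread-dump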

  • Crawl Modes

Intranet crawling:
Suited to cases where the expected total is on the order of a million pages spread over a limited number of sites. The one-step bin/nutch crawl command is the comfortable choice here and is sufficient for many vertical search applications (see the sketch below).
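A minimal sketch of the one-step mode; the seed URL, the domain, and the numbers are placeholders, and the filter line is the pattern the official tutorial suggests putting into conf/crawl-urlfilter.txt:

  # seed list: a directory containing flat files with one URL per line
  mkdir urls
  echo "http://MY.DOMAIN.NAME/" > urls/seed.txt

  # in conf/crawl-urlfilter.txt, keep the crawl inside your own domain, e.g.:
  #   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

  # one-step crawl: follow links 3 levels deep, at most 1000 pages per level
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000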
Whole-web crawling:
For capturing massive amounts of WWW data; the work is split into individual steps (a command-by-command sketch follows below):
inject - inject new URLs into the crawldb
generate - generate a fetch list
fetch - fetch the pages
updatedb - update the crawldb
invertlinks - build the link database
index - build the index
dedup - remove duplicates
merge - merge the indexes

In fact the two modes are essentially the same thing and can be interchanged; the difference lies mainly in the configuration files (personal opinion!). With the crawl command, configuration files such as crawl-urlfilter.txt and suffix-urlfilter.txt come into play. Note that these files must be on the class search path: if you start the program through the bin/nutch script, they should sit in the conf directory. Also note that whether these filter files take effect at all depends on your plug-in configuration (the plugin.includes property); nothing works until you configure it!
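For the step-by-step mode, here is a sketch of one full round; the directory layout mirrors what the crawl command produces, the seed directory and the numbers are placeholders, and each command prints its exact usage when run bare:

  # 1. inject seed URLs into the crawldb
  bin/nutch inject crawl/crawldb urls

  # 2. generate a fetch list; this creates a new time-stamped segment directory
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`

  # 3. fetch the pages of that segment
  bin/nutch fetch $s -threads 10

  # 4. fold the fetch results back into the crawldb
  bin/nutch updatedb crawl/crawldb $s

  # 5. build the link database from the parsed segments
  bin/nutch invertlinks crawl/linkdb crawl/segments/*

  # 6-8. index, remove duplicates, and merge the part indexes
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes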

  • Update

Execute the following loop:
generate
fetch
updatedb
invertlinks
index
dedup
merge

I made a small modification to org.apache.nutch.crawl.Crawl and produced a new class that performs this update loop in a single step:

package org.apache.nutch.crawl;

// Imports follow the 0.8-dev Crawl class this is derived from;
// exact package locations may differ slightly between svn revisions.
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.logging.Logger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.LogFormatter;

import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.indexer.DeleteDuplicates;
import org.apache.nutch.indexer.IndexMerger;
import org.apache.nutch.indexer.Indexer;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

public class CrawlUpdate {

  public static final Logger LOG =
      LogFormatter.getLogger("org.apache.nutch.crawl.CrawlUpdate");

  private static String getDate() {
    return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(System
        .currentTimeMillis()));
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.out.println("Usage: CrawlUpdate [-dir d] [-threads n] [-topN N]");
      return;
    }

    Configuration conf = NutchConfiguration.create();
    conf.addDefaultResource("crawl-tool.xml");
    JobConf job = new NutchJob(conf);

    // defaults mirror Crawl; normally -dir points at an existing crawl directory
    Path dir = new Path("crawl-" + getDate());
    int threads = job.getInt("fetcher.threads.fetch", 10);
    int topN = Integer.MAX_VALUE;

    for (int i = 0; i < args.length; i++) {
      if ("-dir".equals(args[i])) {
        dir = new Path(args[i + 1]);
        i++;
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-topN".equals(args[i])) {
        topN = Integer.parseInt(args[i + 1]);
        i++;
      }
    }

    FileSystem fs = FileSystem.get(job);
    if (!fs.exists(dir)) {
      throw new RuntimeException(dir + " doesn't exist.");
    }

    LOG.info("crawl update started in: " + dir);
    LOG.info("threads = " + threads);
    if (topN != Integer.MAX_VALUE)
      LOG.info("topN = " + topN);

    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    Path indexes = new Path(dir + "/indexes" + getDate());
    Path index = new Path(dir + "/index");

    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

    // one generate/fetch/parse/update round against the existing crawldb
    Path segment = new Generator(job).generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());
    new Fetcher(job).fetch(segment, threads, Fetcher.isParsing(job)); // fetch it
    if (!Fetcher.isParsing(job)) {
      new ParseSegment(job).parse(segment); // parse it, if needed
    }
    new CrawlDb(job).update(crawlDb, segment); // update crawldb

    new LinkDb(job).invert(linkDb, new Path[] { segment }); // invert links

    // index the new segment into its own indexes<timestamp> directory
    new Indexer(job).index(indexes, crawlDb, linkDb, new Path[] { segment });

    // dedup across all part indexes accumulated so far, then merge them
    Path[] indexesDirs = fs.listPaths(dir, new PathFilter() {
      public boolean accept(Path p) {
        return p.getName().startsWith("indexes");
      }
    });
    new DeleteDuplicates(job).dedup(indexesDirs);

    List indexesParts = new ArrayList();
    for (int i = 0; i < indexesDirs.length; i++) {
      indexesParts.addAll(Arrays.asList(fs.listPaths(indexesDirs[i])));
    }

    new IndexMerger(fs, (Path[]) indexesParts.toArray(new Path[indexesParts
        .size()]), index, tmpDir, job).merge();

    LOG.info("crawl update finished: " + dir);
  }
}
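To use this class I place it next to Crawl.java in the source tree, rebuild, and start it through the CLASSNAME form of bin/nutch shown in the command list above; a sketch, assuming the usual layout of a source checkout:

  # drop CrawlUpdate.java into src/java/org/apache/nutch/crawl/ and rebuild
  ant package

  # run it by class name; bin/nutch should pick up the rebuilt classes from the build area
  bin/nutch org.apache.nutch.crawl.CrawlUpdate -dir crawl -topN 1000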

In this way I can refresh my search data periodically in the following pattern:
crawl urlsdir -dir crawl -topN 1000 -- first the initial crawl
crawlupdate -dir crawl -topN 1000 -- update
crawlupdate -dir crawl -topN 1000 -- keep updating
...

I have not yet figured out whether it is a Lucene restriction or some other design consideration, but when updating (more precisely, when rebuilding the index) I have to stop Tomcat first, which feels a little awkward.
