After working through the inject and generate steps, I have learned about the preliminary work Nutch does before crawling, including URL filtering, normalization, score calculation, and how closely it is tied to MapReduce. The whole Nutch pipeline feels very meticulous, at least judging from the first two phases. A quick review: in the previous post we mainly covered the second step, generate, which obtains the list of URLs to be fetched and writes it to the segments directory; the details covered there include the input and output of each submitted job and the map and reduce classes that run. See the previous article for more. The fetch part that comes next should be the soul of Nutch, because Nutch was originally positioned as a search engine and has since evolved into a crawler tool. Over the past few days I was busy laying the groundwork for a project and did not read the code carefully; when I picked the fetch code up again partway through, I found it a hard nut to crack, and the material available online emphasizes different things, but to finish Nutch you have to get over this hurdle... Let's get started.
1. The fetch entry point is in the Crawl class, starting from the statement fetcher.fetch(segs[0], threads);, which passes the segment and the number of fetch threads into the fetch function. Inside fetch, a checkConfiguration function first checks whether http.agent.name and http.robot.name have values; if they are null, error messages are printed to the console. Some variables are then assigned and initialized, such as the timeout, the maximum crawl depth, and the maximum number of outlinks. After that a MapReduce job is initialized: the input is set to crawl_generate under the segment directory produced in the generate phase, the output goes back under the segment, the map runner is set with job.setMapRunnerClass(Fetcher.class);, and the job is submitted with JobClient.runJob(job);.
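Point 1 is mostly about wiring a Hadoop job together. As a rough illustration only (this is not the actual Fetcher.fetch() source), the setup described above might look roughly like the sketch below when written against the old Hadoop mapred API that Nutch 1.x uses; the property name fetcher.threads.fetch is taken from Nutch's configuration, while the omission of Nutch's custom output format and the class/method names of the sketch itself are my own assumptions:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.fetcher.Fetcher;

public class FetchJobSketch {
  // Wire up a fetch-style job the way point 1 describes: the input is the
  // crawl_generate directory of the segment, the output goes back under the
  // segment, and the Fetcher class (a MapRunnable) is registered as the map runner.
  public static void runFetchJob(Path segment, int threads) throws IOException {
    JobConf job = new JobConf(FetchJobSketch.class);
    job.setJobName("fetch " + segment);
    job.setInt("fetcher.threads.fetch", threads);                        // number of fetcher threads
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(segment, "crawl_generate")); // produced by generate
    FileOutputFormat.setOutputPath(job, segment);                        // results land under the segment
    job.setMapRunnerClass(Fetcher.class);                                // Fetcher drives the map side itself
    JobClient.runJob(job);                                               // blocks until the job finishes
  }
}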
2. After the job is submitted, Fetcher's run method executes:
public void run(RecordReader<Text, CrawlDatum> input,
    OutputCollector<Text, NutchWritable> output,
    Reporter reporter) throws IOException { ...... }
From the parameters of the run function we can see that the input is a RecordReader over Text and CrawlDatum, and the output is an OutputCollector over Text and NutchWritable. Various parameters and thresholds are of course set here as well. It is worth mentioning that fetching pages is handled here as the classic producer-consumer scheduling case we studied back in the operating-systems course. First, one line of code: feeder = new QueueFeeder(input, fetchQueues, threadCount * queueDepthMuliplier); defines the producer, where input is the input parameter, fetchQueues comes from this.fetchQueues = new FetchItemQueues(getConf()); (the byHost mode is used by default; byIP and byDomain are the other two modes), and the third parameter likewise comes from a configuration default. Once defined, the producer is mainly responsible for taking the CrawlDatum records produced by generate and adding them to the shared queues. (Supplement: on the relationship among FetchItemQueues, FetchItemQueue, and FetchItem, FetchItemQueues has the following fields:
public static final String DEFAULT_ID = "default";
Map<String, FetchItemQueue> queues = new HashMap<String, FetchItemQueue>();
AtomicInteger totalSize = new AtomicInteger(0);
int maxThreads;
long crawlDelay;
long minCrawlDelay;
long timelimit = -1;
int maxExceptionsPerQueue = -1;
Configuration conf;
public static final String QUEUE_MODE_HOST = "byHost";
public static final String QUEUE_MODE_DOMAIN = "byDomain";
public static final String QUEUE_MODE_IP = "byIP";
String queueMode;
It can be seen that the map collection queues is built from String keys and FetchItemQueue values. The main fields of FetchItemQueue are:
List<FetchItem> queue = Collections.synchronizedList(new LinkedList<FetchItem>());
Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>());
AtomicLong nextFetchTime = new AtomicLong();
AtomicInteger exceptionCounter = new AtomicInteger();
long crawlDelay;
long minCrawlDelay;
int maxThreads;
Configuration conf;
Likewise we can see that the queue field holds FetchItem objects, and FetchItem in turn mainly contains the following fields:
int outlinkDepth = 0;
String queueID;
Text url;
URL u;
CrawlDatum datum;
So far we have a rough picture of the encapsulation relationship FetchItem -> FetchItemQueue -> FetchItemQueues.) Now that the producer has produced the goods, there should be consumers to consume them (where there is demand there is a market, and where there is a market there are consumers).
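Before moving on to how Nutch spawns these threads, here is a minimal, self-contained sketch of the producer-consumer idea described above, with items partitioned into per-host queues the way FetchItemQueues' byHost mode does. All names here (ProducerConsumerSketch, perHostQueues, pollAnyQueue, and so on) are my own illustrations, not Nutch code; the real QueueFeeder and FetcherThread are far more involved (crawl delays, in-progress sets, politeness limits, and so on):

import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class ProducerConsumerSketch {
  // one bounded queue per host, mirroring the byHost queue mode
  static final Map<String, BlockingQueue<String>> perHostQueues = new ConcurrentHashMap<>();
  static final AtomicBoolean feederDone = new AtomicBoolean(false);

  public static void main(String[] args) throws Exception {
    List<String> generated = List.of(
        "http://example.com/a", "http://example.com/b", "http://example.org/c");

    // producer: plays the role of QueueFeeder, filling the shared queues
    Thread feeder = new Thread(() -> {
      for (String url : generated) {
        try {
          String host = new URL(url).getHost();            // the queue id in byHost mode
          perHostQueues
              .computeIfAbsent(host, h -> new LinkedBlockingQueue<>(50))
              .put(url);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
      feederDone.set(true);
    });
    feeder.start();

    // consumers: play the role of FetcherThread, taking items and "fetching" them
    int threadCount = 2;
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    for (int i = 0; i < threadCount; i++) {
      pool.execute(() -> {
        while (true) {
          String url = pollAnyQueue();
          if (url == null) {
            if (feederDone.get()) {
              url = pollAnyQueue();                        // one last look after the feeder is done
              if (url == null) break;                      // really nothing left: this consumer exits
            } else {
              try { Thread.sleep(100); } catch (InterruptedException e) { break; }
              continue;                                    // producer still running: wait and retry
            }
          }
          System.out.println(Thread.currentThread().getName() + " fetching " + url);
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }

  // take one item from whichever host queue currently has work
  static String pollAnyQueue() {
    for (BlockingQueue<String> q : perHostQueues.values()) {
      String url = q.poll();
      if (url != null) return url;
    }
    return null;
  }
}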
3. The consumers are created by this code:
for (int i = 0; i < threadCount; i++) { // spawn threads
  new FetcherThread(getConf()).start();
}
which spawns the number of consumers (threadCount) requested by the user. Before this there are some more parameter settings, such as timeouts and blocking. After the threads are started comes a loop that waits on each thread (consumer) and reports the number of pages each thread has fetched; it also checks whether the producer's fetch queues have been emptied and, if so, dumps the information in them, and there is a check for whether the fetching threads have timed out, in which case the waiting state is entered.
4. That is the whole producer-consumer model, which neatly captures and resolves the relationship between the fetch queues and the fetching threads. Next let's focus on how a consumer takes URLs out of the fetch queue and crawls them. The new FetcherThread(getConf()).start(); call above brings us into the run method of FetcherThread. It first executes:
fit = fetchQueues.getFetchItem();
which takes an item out of the fetch queues filled earlier, and then checks whether the item is null. If it is null, it further checks whether the producer is still alive and whether there is still data in the fetch queues; if so, it waits, and if not, every FetchItem has been processed, so this thread's (consumer's) crawling ends. If the fit obtained is not null, the following code runs:
Text reprUrlWritable =
    (Text) fit.datum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
if (reprUrlWritable == null) {
  reprUrl = fit.url.toString();
} else {
  reprUrl = reprUrlWritable.toString();
}
to obtain the URL. The protocol is then worked out from the URL (note: this is implemented through Nutch's plugin mechanism, using the ProtocolFactory class; exactly how that works still needs further study ^_^). Next comes a check of whether the URL complies with the robot rules; if it does not, fetchQueues.finishFetchItem(fit, true); is called, and likewise if the crawl delay is greater than the configured maximum delay the page is not fetched and it is removed from the fetchQueues. Then the three core lines of code run:
ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum); // fetch the response via the protocol
ProtocolStatus status = output.getStatus(); // get the status
Content content = output.getContent(); // get the content
5. What follows mainly handles the various response statuses:
(1) If the status is WOULDBLOCK, execute:
case ProtocolStatus.WOULDBLOCK:
  // retry ?
  fetchQueues.addFetchItem(fit);
  break;
that is, retry: the current URL is put back into the FetchItemQueues to be fetched again.
(2) If the status is SUCCESS, the page has been fetched, and the following runs: pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth); Stepping into the output method, we can see metadata being assigned, for example datum.setStatus(status); datum.setFetchTime(System.currentTimeMillis()); datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus); and so on. Because the STATUS_FETCH_SUCCESS flag is present, the fetch succeeded, so the source of the fetched page is then parsed: parseResult = this.parseUtil.parse(content); followed by
output.collect(key, new NutchWritable(datum));
output.collect(key, new NutchWritable(content));
output.collect(url, new NutchWritable(new ParseImpl(new ParseText(parse.getText()), parseData, parse.isCanonical())));
After the output method above has executed, pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth); returns the parse status, which indicates whether a redirect URL was parsed out of the page. If one was, it is marked STATUS_DB_UNFETCHED and its score is initialized, with code like:
newDatum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED,
    fit.datum.getFetchInterval(), fit.datum.getScore());
// transfer existing metadata to the redirect
newDatum.getMetaData().putAll(fit.datum.getMetaData());
scfilters.initialScore(redirUrl, newDatum);
Then a series of checks and operations are performed on redirUrl:
if (reprUrl != null) {
  newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY,
      new Text(reprUrl));
}
fit = FetchItem.create(redirUrl, newDatum, queueMode);
if (fit != null) {
  FetchItemQueue fiq =
      fetchQueues.getFetchItemQueue(fit.queueID);
  fiq.addInProgressFetchItem(fit);
} else {
  // stop redirecting
  redirecting = false;
  reporter.incrCounter("FetcherStatus", "FetchItem.notCreated.redirect", 1);
}
That is the series of steps taken for a URL whose returned status is SUCCESS.
(3) If the status is MOVED or TEMP_MOVED, the page has been redirected. The returned content is parsed, the corresponding files are generated, output(fit.url, fit.datum, content, status, code); is executed, and then
Text redirUrl = handleRedirect(fit.url, fit.datum,
    urlString, newUrl, temp, Fetcher.PROTOCOL_REDIR);
obtains the redirect URL; a new FetchItem is generated and, using its queueID, added to the inProgress set of the corresponding queue, and the redirected page is then crawled.
(4) If the status is EXCEPTION, the FetchItemQueue of the current URL is checked; if the number of failing pages in it exceeds the maximum, the whole queue is cleared, on the assumption that every page in that queue is problematic.
(5) If the status is RETRY or BLOCKED, the CrawlDatum is output with its status set to STATUS_FETCH_RETRY so that it is fetched again in the next round.
(6) If the status is GONE, NOTFOUND, ACCESS_DENIED, or ROBOTS_DENIED, the CrawlDatum is output with its status set to STATUS_FETCH_GONE, and it may not be fetched in the next round.
(7) If the status is NOTMODIFIED, the page is considered unchanged, so its CrawlDatum is output with status STATUS_FETCH_NOTMODIFIED.
(8) If none of the above statuses match, the CrawlDatum is output by default with status STATUS_FETCH_RETRY and retried in the next round. The number of redirects for the page is also checked; if it exceeds the maximum number of redirects, the CrawlDatum is output with its status set to STATUS_FETCH_GONE.
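To keep cases (1) through (8) straight, here is a condensed illustration of the mapping from protocol status to the CrawlDatum status the page ends up with. This is my own simplification, not the actual switch in Fetcher.java: it glosses over the WOULDBLOCK re-queue, the redirect and parse bookkeeping, and the per-queue exception limit, and only uses constants that the Nutch CrawlDatum and ProtocolStatus classes expose:

import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.ProtocolStatus;

public class StatusMappingSketch {
  /** A condensed view of cases (1)-(8): which fetch status is recorded per protocol status. */
  public static byte fetchStatusFor(int protocolStatusCode) {
    switch (protocolStatusCode) {
      case ProtocolStatus.WOULDBLOCK:        // (1) actually re-queued and tried again
      case ProtocolStatus.RETRY:             // (5) transient problem
      case ProtocolStatus.BLOCKED:
      case ProtocolStatus.EXCEPTION:         // (4) Fetcher also counts per-queue exceptions
        return CrawlDatum.STATUS_FETCH_RETRY;
      case ProtocolStatus.SUCCESS:           // (2) parse the page and collect the output
        return CrawlDatum.STATUS_FETCH_SUCCESS;
      case ProtocolStatus.MOVED:             // (3) permanent redirect
        return CrawlDatum.STATUS_FETCH_REDIR_PERM;
      case ProtocolStatus.TEMP_MOVED:        // (3) temporary redirect
        return CrawlDatum.STATUS_FETCH_REDIR_TEMP;
      case ProtocolStatus.GONE:              // (6) give up on this URL
      case ProtocolStatus.NOTFOUND:
      case ProtocolStatus.ACCESS_DENIED:
      case ProtocolStatus.ROBOTS_DENIED:
        return CrawlDatum.STATUS_FETCH_GONE;
      case ProtocolStatus.NOTMODIFIED:       // (7) unchanged since the last fetch
        return CrawlDatum.STATUS_FETCH_NOTMODIFIED;
      default:                               // (8) anything else: retry in the next round
        return CrawlDatum.STATUS_FETCH_RETRY;
    }
  }
}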
6. After each consumer finishes its "consumption" it has to be removed from the consumption queue; after all, once you have been somewhere you have to sign out. So at the end of FetcherThread's run method the finally block executes:
finally {
  if (fit != null) fetchQueues.finishFetchItem(fit);
  activeThreads.decrementAndGet(); // count threads
  LOG.info("-finishing thread " + getName() + ", activeThreads=" + activeThreads);
}
which indicates that the current thread has finished and decrements the count of active threads. The pattern activeThreads.decrementAndGet(); shows up frequently in Nutch's fetch phase; activeThreads is defined as: private AtomicInteger activeThreads = new AtomicInteger(0); (Supplement: the point here is that both decrementAndGet() and incrementAndGet() are thread-safe, one subtracting 1 and the other adding 1.)
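As a quick illustration of why an AtomicInteger is used for this counter rather than a plain int, the small example below (my own, unrelated to the Nutch source) has many tasks incrementing and decrementing the same counter safely:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterDemo {
  public static void main(String[] args) throws InterruptedException {
    AtomicInteger activeThreads = new AtomicInteger(0);
    ExecutorService pool = Executors.newFixedThreadPool(8);

    // each task "registers" itself on start and "signs out" when done,
    // the same way FetcherThread adjusts activeThreads
    for (int i = 0; i < 1000; i++) {
      pool.execute(() -> {
        activeThreads.incrementAndGet();   // atomic +1, safe across threads
        // ... pretend to fetch a page here ...
        activeThreads.decrementAndGet();   // atomic -1
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    // with a plain int and ++/-- this could end up non-zero due to lost updates
    System.out.println("activeThreads after all tasks: " + activeThreads.get()); // prints 0
  }
}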
What follows is the other consumers repeating steps 3, 4, 5, and 6. Jumping back to fetcher.fetch(segs[0], threads); in the Crawl class, we can see that it sits inside the overall loop:
for (i = 0; i < depth; i++) { // generate new segment
  Path[] segs = generator.generate(crawlDb, segments, -1, topN, System
      .currentTimeMillis());
  if (segs == null) {
    LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
    break;
  }
  fetcher.fetch(segs[0], threads); // fetch it; segs[0] == [crawl20140727/segments/20140727195735]
  if (!Fetcher.isParsing(job)) {
    parseSegment.parse(segs[0]); // parse it, if needed
  }
  crawlDbTool.update(crawlDb, segs, true, true); // update crawldb
}
that is, generate, fetch, parse, and update are executed in a cycle, and the collection ends when the depth set by the user (or the system default) is reached. At this point we have a general picture of how Nutch collects pages with its crawler. The hardest bone should now have been gnawed, even if not chewed completely clean... The overall thread of fetch runs roughly like this: first enter the fetch function of the Fetcher class; after a series of initializations a job is submitted, and from job.setMapRunnerClass(Fetcher.class); we can see that when the job runs, Fetcher's run function, public void run(RecordReader<Text, CrawlDatum> input, OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException, is executed. Inside run the groundwork is laid and the work is organized through the producer-consumer model; the actual crawling is done by the consumers, and from new FetcherThread(getConf()).start(); we can see that it is the run function of FetcherThread that performs the page fetching, parsing, and the other operations. (In addition, during debugging you can see the property configuration information, for example: {job.end.retry.interval=30000, ftp.keep.connection=false, io.bytes.per.checksum=512, mapred.job.tracker.retiredjobs.cache.size=1000, db.fetch.schedule.adaptive.dec_rate=0.2, mapred.task.profile.reduces=0-2, mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging, mapred.job.reuse.jvm.num.tasks=1, mapred.reduce.tasks.speculative.execution=true, moreIndexingFilter.indexMimeTypeParts=true, db.ignore.external.links=false, io.seqfile.sorter.recordlimit=1000000, generate.min.score=0, db.update.additions.allowed=true, mapred.task.tracker.http.address=0.0.0.0:50060, fetcher.queue.depth.multiplier=50, fs.ramfs.impl=org.apache.hadoop.fs.InMemoryFileSystem, mapred.system.dir=${hadoop.tmp.dir}/mapred/system, mapred.task.tracker.report.address=127.0.0.1:0, mapreduce.reduce.shuffle.connect.timeout=180000, db.fetch.schedule.adaptive.inc_rate=0.4, db.fetch.schedule.adaptive.sync_delta_rate=0.3, mapred.healthChecker.interval=60000, mapreduce.job.complete.cancel.delegation.tokens=true, generate.max.per.host=-1, fetcher.max.exceptions.per.queue=-1, fs.trash.interval=0, mapred.skip.map.auto.incr.proc.count=true, parser.fix.embeddedparams=true, ......, urlnormalizer.order=org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer, io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec, link.score.updater.clear.score=0.0f, parser.html.impl=neko, io.file.buffer.size=4096, parser.character.encoding.default=windows-1252, ftp.timeout=60000, mapred.map.tasks.speculative.execution=true, fetcher.timelimit.mins=-1, mapreduce.job.split.metainfo.maxsize=10000000, http.agent.name=Jack, mapred.map.max.attempts=4, mapred.job.shuffle.merge.percent=0.66, fs.har.impl=org.apache.hadoop.fs.HarFileSystem, hadoop.security.authentication=simple, fs.s3.buffer.dir=${hadoop.tmp.dir}/s3, lang.analyze.max.length=2048, mapred.skip.reduce.auto.incr.proc.count=true, mapred.job.tracker.jobhistory.lru.cache.size=5, fetcher.threads.timeout.divisor=2, db.fetch.schedule.class=org.apache.nutch.crawl.DefaultFetchSchedule, mapred.jobtracker.blacklist.fault-bucket-width=15, mapreduce.job.acl-view-job=, mapred.job.queue.name=default, fetcher.queue.mode=byHost, link.analyze.initial.score=1.0f, mapred.job.tracker.persist.jobstatus.hours=0, db.max.outlinks.per.page=100, fs.file.impl=org.apache.hadoop.fs.LocalFileSystem, db.fetch.schedule.adaptive.sync_delta=true, urlnormalizer.loop.count=1, ipc.client.kill.max=10, mapred.healthChecker.script.timeout=600000, mapred.tasktracker.map.tasks.maximum=2, http.max.delays=100, fetcher.follow.outlinks.depth.divisor=2, mapred.job.tracker.persist.jobstatus.dir=/jobtracker/jobsInfo, lang.identification.only.certain=false, http.useHttp11=false, lang.extraction.policy=detect,identify, mapred.reduce.slowstart.completed.maps=0.05, io.sort.mb=100, ipc.server.listen.queue.size=128, db.fetch.interval.default=2592000, [email protected], solr.auth=false, io.mapfile.bloom.size=1048576, ftp.follow.talk=false, fs.hsftp.impl=org.apache.hadoop.hdfs.HsftpFileSystem, fetcher.verbose=false, fetcher.throughput.threshold.check.after=5, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.StandardSocketFactory, fs.hftp.impl=org.apache.hadoop.hdfs.HftpFileSystem, db.fetch.interval.max=7776000, fs.kfs.impl=org.apache.hadoop.fs.kfs.KosmosFileSystem, mapred.map.tasks=2, mapred.local.dir.minspacekill=0, fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem, urlfilter.domain.file=domain-urlfilter.txt, mapred.job.map.memory.mb=-1, mapred.jobtracker.completeuserjobs.maximum=100, plugin.folders=./plugins, indexer.max.content.length=-1, fetcher.throughput.threshold.retries=5, link.analyze.damping.factor=0.85f, urlfilter.regex.file=regex-urlfilter.txt, mapred.min.split.size=0, http.robots.403.allow=true, ......} and so on.)
Reference blog: http://blog.csdn.net/amuseme_lu/article/details/6725561