Simplifying Daily Work Series, Part 2 ----- Collecting a Novel


2. Running the script task that collects the novel

The previous implementation of the novel collector was split into two parts:
The first part: Node.js crawls the directory page for the chapter URLs and writes them to a TXT file for storage.
The second part: PHP uses the encapsulated Curl class and the parsing class to fetch each chapter's title and content and writes them to HTML files.

This means that the physical machine or Docker container running the task needs a Node.js environment on top of the PHP environment. Since PHP is what I am best at, I rewrote the whole thing in PHP instead of keeping the two parts, which also reduces the dependencies. The complete collection code can be found in the collection category of my other blog posts mentioned above.

The beta version of the Curl wrapper class is recorded in this post: http://www.cnblogs.com/freephp/p/4962591.html.
The optimized version of the Curl wrapper class is recorded in this post: http://www.cnblogs.com/freephp/p/5112135.html.

If you are not familiar with it yet, you can read those two posts before reading this article.
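
For readers who only need a rough idea of the interface, below is a minimal sketch (not the actual class from those posts) of what MyCurl::send might look like, inferred from how it is called in the rest of this article: it sends an HTTP request and returns the response body as a string. The Analyzer class used later is likewise assumed to expose getLinks(), getTitle(), getContent() and storeToFile() methods.

<?php
// Minimal sketch of the Curl wrapper, inferred from its usage in this post.
// The real implementation lives in the two posts linked above.
class MyCurl
{
    // Send an HTTP request and return the response body as a string.
    public static function send($url, $method = 'get')
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
        if (strtolower($method) === 'post') {
            curl_setopt($ch, CURLOPT_POST, true);
        }
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
}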


Key parts of the code:

$menuUrl = 'http://www.zhuaji.org/read/2531/';
$menuContents = MyCurl::send($menuUrl, 'get');
$analyzer = new Analyzer();
$urls = $analyzer->getLinks($menuContents);

Then curl each chapter page, parse out the title and content, and write them to a file.
The code is simple and readable. Now let's consider efficiency and performance. This version downloads all the files in one go; the only check it performs is whether the target file already exists, and it does so after each chapter's content has been fetched, so a lot of useless, time-consuming network requests are still made. The novel currently has 578 chapters, plus one request for the directory page, so 578 + 1 GET requests are issued per run, and as the novel keeps adding chapters the execution time will only grow.
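
For comparison, the original download-everything loop looked roughly like the sketch below (a reconstruction from the description above, not the exact code from the earlier posts): every chapter page is fetched on every run, and the file-existence check only skips rewriting the file, not the network request.

// Rough sketch of the original, non-incremental loop (reconstructed from the
// description above): the existence check happens only after the chapter page
// has already been fetched over the network.
foreach ($urls as $url) {
    $res = MyCurl::send('http://www.zhuaji.org/read/2531/' . $url, 'get'); // always a network request
    $title = $analyzer->getTitle($res)[1];
    $content = $analyzer->getContent('div', 'content', $res)[0];
    $filePath = 'd:/www/tempscript/juewangjiaoshi/' . $title . '.html';
    if (!file_exists($filePath)) {
        $analyzer->storeToFile($filePath, $title . '<br/>' . $content);
    }
    echo 'Down the url: ', $url, "\r\n";
}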

The biggest bottleneck of this script is the network overhead.

This script is inefficient: every run re-crawls the pages of all chapters, which costs a lot of network traffic. That is fine the first time, when everything has to be downloaded, but if the script runs every day, what I actually want is to download only the chapters added since the previous run.

There are a couple of ideas to consider:
1. Record the ID of the last page saved by each execution; the next run then starts downloading from that ID.
2. This also lets a run that was interrupted partway through pick up where it left off (following on from point 1).

This way only the new pages are crawled, the number of network requests drops, and execution efficiency improves greatly.

The problem therefore reduces to finding a way to record the last chapter ID of the last successful run.
We could write this ID to a database or to a file. For simplicity and fewer dependencies, I decided to write it to a file.

Encapsulate a separate function that reads this maximum ID, then filter out the chapters that have already been downloaded. The complete code is as follows:

<?php
// Read the largest chapter ID recorded by the previous successful run.
function getMaxId() {
    $idLogFile = './biggestid.txt';
    $biggestId = 0;
    if (file_exists($idLogFile)) {
        $fp = fopen($idLogFile, 'r');
        $biggestId = trim(fread($fp, 1024));
        fclose($fp);
    }
    return $biggestId;
}

/**
 * Client code to run.
 */
set_time_limit(0);
require 'Analyzer.php';

$start = microtime(true);

$menuUrl = 'http://www.zhuaji.org/read/2531/';
$menuContents = MyCurl::send($menuUrl, 'get');
$biggestId = getMaxId() + 0;

$analyzer = new Analyzer();
$urls = $analyzer->getLinks($menuContents);

// Extract the numeric chapter IDs from URLs such as "123.html".
$ids = array();
foreach ($urls as $url) {
    $parts = explode('.', $url);
    array_push($ids, $parts[0]);
}
sort($ids, SORT_NUMERIC);

// Keep only the chapters that are newer than the last recorded ID.
$newIds = array();
foreach ($ids as $id) {
    if ((int)$id > $biggestId) array_push($newIds, $id);
}
if (empty($newIds)) exit('Nothing to download!');

foreach ($newIds as $id) {
    $url = $id . '.html';
    $res = MyCurl::send('http://www.zhuaji.org/read/2531/' . $url, 'get');
    $title = $analyzer->getTitle($res)[1];
    $content = $analyzer->getContent('div', 'content', $res)[0];
    $allContents = $title . "<br/>" . $content;
    $filePath = 'd:/www/tempscript/juewangjiaoshi/' . $title . '.html';
    if (!file_exists($filePath)) {
        $analyzer->storeToFile($filePath, $allContents);
        // Record the last successfully downloaded chapter ID.
        $idFp = fopen('./biggestid.txt', 'w');
        fwrite($idFp, $id);
        fclose($idFp);
    } else {
        continue;
    }
    echo 'Down the url: ', $url, "\r\n";
}

$end = microtime(true);
$cost = $end - $start;
echo "Total cost time: " . round($cost, 3) . " seconds\r\n";

Add the script as a Windows scheduled task or to cron under Linux and you can enjoy the novel every day without manually browsing the web and wasting traffic; the parsed, text-only HTML files are also more comfortable to read. Note, however, that this code will raise an error on older versions of PHP: the short array syntax such as [44, 3323, 443] only appeared in PHP 5.4.
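
If the script has to run on PHP 5.3 or earlier, the 5.4-only features have to be avoided: both the short array syntax mentioned above and dereferencing the array returned by a method call, which the code uses with getTitle() and getContent(). A compatible rewrite of those spots, as a sketch, looks like this:

// PHP 5.3-compatible equivalents of the PHP 5.4-only syntax.
$ids = array();                             // instead of the short array syntax $ids = [];

$titleMatches = $analyzer->getTitle($res);  // $analyzer->getTitle($res)[1] dereferences a returned
$title = $titleMatches[1];                  // array directly, which also requires PHP >= 5.4

$contentMatches = $analyzer->getContent('div', 'content', $res);
$content = $contentMatches[0];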

It used to take about 2 minutes to download the whole novel, so the improvement is significant.

I set the following in /etc/crontab:

0 3 * * * root /usr/bin/php /data/scripts/tempscript/mycurl.php >> /tmp/downnovel.log

The author of the novel works really hard; even though the later chapters drift into harem filler and the writing gets thin, he is often still posting updates at midnight, so I schedule the daily task at 3 o'clock in the morning to collect them.
