Simplifying Daily Work Series, Part 2 ----- Collecting a Novel


2. Running the script task that collects the novel

The previous implementation of the novel collector was split into two parts:
The first part: Node.js crawls the directory page for the chapter URLs and writes them to a TXT file for storage.
The second part: PHP uses the encapsulated Curl class and the parsing class to fetch each chapter's title and content and writes them to HTML files.

This means that the physical machine or Docker container running the task needs a Node.js environment on top of the PHP environment. Since PHP is what I am best at, I rewrote the whole thing in PHP instead of keeping the two parts, which also reduces the dependencies. The complete collection code can be found in the collection category of my other blog posts mentioned above.

The beta version of the Curl wrapper class is recorded in this post: http://www.cnblogs.com/freephp/p/4962591.html.
The optimized version of the Curl wrapper class is recorded in this post: http://www.cnblogs.com/freephp/p/5112135.html.

If you are not familiar with it yet, you can read those two posts before reading this article.
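
For readers who only need a rough idea of the interface, below is a minimal sketch (not the actual class from those posts) of what MyCurl::send might look like, inferred from how it is called in the rest of this article: it sends an HTTP request and returns the response body as a string. The Analyzer class used later is likewise assumed to expose getLinks(), getTitle(), getContent() and storeToFile() methods.

<?php
// Minimal sketch of the Curl wrapper, inferred from its usage in this post.
// The real implementation lives in the two posts linked above.
class MyCurl
{
    // Send an HTTP request and return the response body as a string.
    public static function send($url, $method = 'get')
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
        if (strtolower($method) === 'post') {
            curl_setopt($ch, CURLOPT_POST, true);
        }
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
}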


Key parts of the code:

$menuUrl = 'http://www.zhuaji.org/read/2531/';
$menuContents = MyCurl::send($menuUrl, 'get');
$analyzer = new Analyzer();
$urls = $analyzer->getLinks($menuContents);

Then curl each chapter page, parse out the title and content, and write them to a file.
The code is simple and readable. Now let's consider efficiency and performance. This version downloads all the files in one go; the only check it performs is whether the target file already exists, and it does so after each chapter's content has been fetched, so a lot of useless, time-consuming network requests are still made. The novel currently has 578 chapters, plus one request for the directory page, so 578 + 1 GET requests are issued per run, and as the novel keeps adding chapters the execution time will only grow.
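
For comparison, the original download-everything loop looked roughly like the sketch below (a reconstruction from the description above, not the exact code from the earlier posts): every chapter page is fetched on every run, and the file-existence check only skips rewriting the file, not the network request.

// Rough sketch of the original, non-incremental loop (reconstructed from the
// description above): the existence check happens only after the chapter page
// has already been fetched over the network.
foreach ($urls as $url) {
    $res = MyCurl::send('http://www.zhuaji.org/read/2531/' . $url, 'get'); // always a network request
    $title = $analyzer->getTitle($res)[1];
    $content = $analyzer->getContent('div', 'content', $res)[0];
    $filePath = 'd:/www/tempscript/juewangjiaoshi/' . $title . '.html';
    if (!file_exists($filePath)) {
        $analyzer->storeToFile($filePath, $title . '<br/>' . $content);
    }
    echo 'Down the url: ', $url, "\r\n";
}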

The biggest bottleneck of this script is the network overhead.

This script is inefficient: every run re-crawls the pages of all chapters, which costs a lot of network traffic. That is fine the first time, when everything has to be downloaded, but if the script runs every day, what I actually want is to download only the chapters added since the previous run.

There are a couple of ideas to consider:
1. Record the ID of the last page saved by each execution; the next run then starts downloading from that ID.
2. This also lets a run that was interrupted partway through pick up where it left off (following on from point 1).

This way only the new pages are crawled, the number of network requests drops, and execution efficiency improves greatly.

The problem therefore reduces to finding a way to record the last chapter ID of the last successful run.
We could write this ID to a database or to a file. For simplicity and fewer dependencies, I decided to write it to a file.

Encapsulate a separate function that reads this maximum ID, then filter out the chapters that have already been downloaded. The complete code is as follows:

<?php
// Read the largest chapter ID recorded by the previous successful run.
function getMaxId() {
    $idLogFile = './biggestid.txt';
    $biggestId = 0;
    if (file_exists($idLogFile)) {
        $fp = fopen($idLogFile, 'r');
        $biggestId = trim(fread($fp, 1024));
        fclose($fp);
    }
    return $biggestId;
}

/**
 * Client code to run.
 */
set_time_limit(0);
require 'Analyzer.php';

$start = microtime(true);

$menuUrl = 'http://www.zhuaji.org/read/2531/';
$menuContents = MyCurl::send($menuUrl, 'get');
$biggestId = getMaxId() + 0;

$analyzer = new Analyzer();
$urls = $analyzer->getLinks($menuContents);

// Extract the numeric chapter IDs from URLs such as "123.html".
$ids = array();
foreach ($urls as $url) {
    $parts = explode('.', $url);
    array_push($ids, $parts[0]);
}
sort($ids, SORT_NUMERIC);

// Keep only the chapters that are newer than the last recorded ID.
$newIds = array();
foreach ($ids as $id) {
    if ((int)$id > $biggestId) array_push($newIds, $id);
}
if (empty($newIds)) exit('Nothing to download!');

foreach ($newIds as $id) {
    $url = $id . '.html';
    $res = MyCurl::send('http://www.zhuaji.org/read/2531/' . $url, 'get');
    $title = $analyzer->getTitle($res)[1];
    $content = $analyzer->getContent('div', 'content', $res)[0];
    $allContents = $title . "<br/>" . $content;
    $filePath = 'd:/www/tempscript/juewangjiaoshi/' . $title . '.html';
    if (!file_exists($filePath)) {
        $analyzer->storeToFile($filePath, $allContents);
        // Record the last successfully downloaded chapter ID.
        $idFp = fopen('./biggestid.txt', 'w');
        fwrite($idFp, $id);
        fclose($idFp);
    } else {
        continue;
    }
    echo 'Down the url: ', $url, "\r\n";
}

$end = microtime(true);
$cost = $end - $start;
echo "Total cost time: " . round($cost, 3) . " seconds\r\n";

Add the script as a Windows scheduled task or to cron under Linux and you can enjoy the novel every day without manually browsing the web and wasting traffic; the parsed, text-only HTML files are also more comfortable to read. Note, however, that this code will raise an error on older versions of PHP: the short array syntax such as [44, 3323, 443] only appeared in PHP 5.4.
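
If the script has to run on PHP 5.3 or earlier, the 5.4-only features have to be avoided: both the short array syntax mentioned above and dereferencing the array returned by a method call, which the code uses with getTitle() and getContent(). A compatible rewrite of those spots, as a sketch, looks like this:

// PHP 5.3-compatible equivalents of the PHP 5.4-only syntax.
$ids = array();                             // instead of the short array syntax $ids = [];

$titleMatches = $analyzer->getTitle($res);  // $analyzer->getTitle($res)[1] dereferences a returned
$title = $titleMatches[1];                  // array directly, which also requires PHP >= 5.4

$contentMatches = $analyzer->getContent('div', 'content', $res);
$content = $contentMatches[0];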

It used to take about 2 minutes to download the whole novel, so the improvement is significant.

I set the following in /etc/crontab:

0 3 * * * root /usr/bin/php /data/scripts/tempscript/mycurl.php >> /tmp/downnovel.log

The author of the novel works really hard; even though the later chapters drift into harem filler and the writing gets thin, he is often still posting updates at midnight, so I schedule the daily task at 3 o'clock in the morning to collect them.
