How do I use crawlers to monitor updates to a range of websites?

Source: Internet
Author: User
The only method I can think of so far is to crawl each site automatically every day and compare the old and new HTML files to decide whether anything has been updated.

Reply content:

1. First, request the web page and save it locally, say as a.html. The file system now records a modification time for that file.

2. On the second visit to the page, if a.html already exists locally, send an If-Modified-Since request to the server (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html), putting the modification time of a.html in the request.

3. If the page has been updated, the server returns a 200 response with the new content; use it to update the local file.

4. If the page has not been updated, the server returns a 304 response, and the local file does not need to be updated.

This problem has already been turned into a product; take a look at:
http://sleepingspider.com
After registering as a user, you can choose which pages to follow, and you will receive an email alert when one of them is updated. There are some advanced settings that I have not used; you can take a look.

My undergraduate graduation project was exactly this.
At that time I built a set of monitoring for the "fruit Bank", "want to go", "flower market", and "warm island" services.
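
Returning to the If-Modified-Since steps (1 to 4) in the first reply, here is a minimal Python sketch of that exchange using only the standard library. The function names `if_modified_since_value` and `refresh` are made up for illustration, and the sketch assumes the server actually honours If-Modified-Since; it is not taken from any of the replies.

```python
import os
import urllib.error
import urllib.request
from email.utils import formatdate


def if_modified_since_value(mtime: float) -> str:
    """Format a file modification time (seconds since epoch) as an HTTP date in GMT."""
    return formatdate(mtime, usegmt=True)


def refresh(url: str, local_path: str) -> bool:
    """Fetch url into local_path; return True if the local copy was created or replaced."""
    request = urllib.request.Request(url)
    if os.path.exists(local_path):
        # Step 2: send the local file's modification time to the server.
        mtime = os.path.getmtime(local_path)
        request.add_header("If-Modified-Since", if_modified_since_value(mtime))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()  # Step 3: 200, first fetch or page changed.
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Step 4: not modified, keep the local file.
            return False
        raise
    with open(local_path, "wb") as fh:
        fh.write(body)
    return True
```

Calling `refresh(url, "a.html")` once a day would then report which pages changed. Servers that generate pages dynamically often ignore the header and always answer 200, so this approach is probably most useful for static pages.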

Implementation method:
1. crontab scheduled tasks.
2. Node reads the configuration and calls PhantomJS (a headless browser) to visit the links and take screenshots.
3. All screenshots are stored in sub-folders named by date, and a Bootstrap page displays them side by side for comparison.
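
The date sub-folder layout from step 3 could look roughly like the sketch below. The reply used Node and PhantomJS; this Python fragment only illustrates the naming side, not the screenshot capture, and the `shots` root folder, the file naming scheme, and the function names are all assumptions made for the example.

```python
import re
from datetime import date, timedelta
from pathlib import PurePosixPath


def slugify(url: str) -> str:
    """Turn a URL into a file-system-friendly name (hypothetical naming scheme)."""
    without_scheme = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9]+", "_", without_scheme).strip("_")


def screenshot_path(root: str, url: str, day: date) -> PurePosixPath:
    """One sub-folder per date, one PNG per monitored link."""
    return PurePosixPath(root) / day.isoformat() / f"{slugify(url)}.png"


def comparison_pair(root: str, url: str, day: date):
    """Paths of the previous day's and the given day's screenshot, for side-by-side display."""
    previous = screenshot_path(root, url, day - timedelta(days=1))
    current = screenshot_path(root, url, day)
    return previous, current
```

With this layout, the comparison page only needs the URL and a date to find the two images it should show next to each other.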

If a service like this existed, I would find it very useful.
But the proportion of users willing to pay may be a problem. Maybe pulling the pages down and putting them under version control with git would also work? Sorry, I am taking the thread off topic.
Chrome has a Page Monitor plugin.
Use an MD5 signature. Each time the page is downloaded, the data stream returned by the server (the response stream) is first held in a memory buffer, and an MD5 signature S1 is generated from it. On the next download, generate a signature S2 in the same way and compare S2 with S1: if they are the same, the page has not been updated; otherwise it has.
You can use the Website Information monitoring tool; it suits your requirements well.
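
The S1 versus S2 comparison from the MD5 reply can be sketched in a few lines of Python. The function names `md5_signature` and `has_changed` are hypothetical, and the sketch assumes the caller has already downloaded the response body as bytes and stored the previous signature somewhere.

```python
import hashlib
from typing import Optional, Tuple


def md5_signature(body: bytes) -> str:
    """Hex MD5 digest of a downloaded response body (change detection only, not security)."""
    return hashlib.md5(body).hexdigest()


def has_changed(previous_signature: Optional[str], body: bytes) -> Tuple[bool, str]:
    """Compare the new signature (S2) with the stored one (S1).

    Returns (changed, new_signature); a missing previous signature counts as changed.
    """
    current = md5_signature(body)
    return current != previous_signature, current
```

Store the returned signature after each download and pass it in on the next run. Pages that embed timestamps, ads, or session tokens will likely produce a different digest on every fetch, so it may be necessary to hash only the part of the page that matters.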