Example: Using Puppeteer in Headless Mode to Crawl JavaScript Web Pages

Source: Internet
Author: User

Puppeteer

The Google Chrome team's puppeteer is an automation and testing library that relies on Node.js and Chromium. Its biggest advantage is that it can execute the dynamic content of web pages, such as JavaScript, and therefore impersonate a real user far better than a plain HTTP client.
Some websites hide parts of their content behind JavaScript/Ajax requests as an anti-crawler measure, so fetching a tag directly from the raw HTML yields nothing. Some sites even set hidden-element "traps": elements a user can never see, so any script that triggers them is assumed to be a bot. In such cases, puppeteer's advantages stand out.
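To make the "trap" idea concrete, here is a toy sketch (not from the original article): `isHumanVisible` is a hypothetical helper that skips links a human could never see. In a real crawl you would run a check like this inside `page.evaluate`, against real computed styles and layout boxes.

```javascript
// Toy illustration of a hidden-element "trap": a link a human never sees
// (display:none, zero size) that only a naive script would follow.
// The element objects here are plain stubs, not real DOM nodes.
function isHumanVisible(el) {
  if (el.style.display === 'none' || el.style.visibility === 'hidden') return false;
  if (el.width === 0 || el.height === 0) return false;
  return true;
}

const links = [
  { href: '/real-article', style: {}, width: 120, height: 20 },
  { href: '/bot-trap',     style: { display: 'none' }, width: 0, height: 0 },
];

// A careful crawler only follows links a human could actually click
const safe = links.filter(isHumanVisible).map(l => l.href);
console.log(safe); // [ '/real-article' ]
```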
It enables the following functions:

    1. Generate screenshots and PDFs of pages.
    2. Crawl SPAs and generate pre-rendered content ("SSR").
    3. Automate form submission, UI testing, keyboard input, and more.
    4. Create an up-to-date automated testing environment, running tests directly in the latest version of Chrome with the latest JavaScript and browser features.
    5. Capture a timeline trace of your site to help diagnose performance issues.

Open source address: https://github.com/GoogleChrome/puppeteer/

Installation
npm i puppeteer

Note: install Node.js first, then run the command in your project's root directory (the same directory as package.json).
The installation downloads Chromium, about 120 MB.

After two days (about 10 hours) of exploration, sidestepping a good number of async pitfalls, I gained a working grasp of puppeteer and Node.js.
Below is a long screenshot of the crawled blog article list:

Crawl blog Posts

Take the CSDN blog as an example: the article content only appears after clicking "Read the full text", so a script that merely reads the static DOM fails.
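The expand-then-read flow can be sketched as follows. This is an illustration, not the article's code: `page` here is a tiny stub standing in for a puppeteer Page so the control flow runs without a browser, and the `.read-more` selector is made up. With real puppeteer, the same `page.click` / `page.evaluate` calls would act on the live tab.

```javascript
// Stub article: only a preview is "in the DOM" until it is expanded
const fakeArticle = {
  expanded: false,
  content: 'short preview…',
  fullContent: 'the whole article body',
};

// Stub standing in for a puppeteer Page
const page = {
  // stand-in for puppeteer's page.click(selector)
  async click(selector) {
    if (selector === '.read-more') fakeArticle.expanded = true;
  },
  // stand-in for puppeteer's page.evaluate(fn)
  async evaluate(fn) {
    return fn(fakeArticle);
  },
};

async function readArticle(page) {
  // trigger the "Read the full text" button first, then read the content
  await page.click('.read-more');
  return page.evaluate(a => (a.expanded ? a.fullContent : a.content));
}

readArticle(page).then(text => console.log(text)); // logs the full body once expanded
```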

/*
 * load blog.csdn.net articles to local files
 */
const puppeteer = require('puppeteer');

// emulate iPhone
const userAgent = 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1';
const workPath = './contents';
const fs = require('fs');
if (!fs.existsSync(workPath)) {
    fs.mkdirSync(workPath);
}
// base URL
const rootUrl = 'https://blog.csdn.net/';
// max wait milliseconds
const maxWait = 100;
// max loop scroll times (the exact value was lost in transcription)
const maxLoop = 10;

(async () => {
    let url;
    let countUrl = 0;
    const browser = await puppeteer.launch({ headless: false }); // set headless:true to hide the Chromium UI
    const page = await browser.newPage();
    await page.setUserAgent(userAgent);
    await page.setViewport({ width: 414, height: 736 });
    await page.setRequestInterception(true);
    // filter to block images
    page.on('request', request => {
        if (request.resourceType() === 'image')
            request.abort();
        else
            request.continue();
    });
    await page.goto(rootUrl);
    for (let i = 0; i < maxLoop; i++) {
        try {
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            await page.waitForNavigation({ timeout: maxWait, waitUntil: ['networkidle0'] });
        } catch (err) {
            console.log('scroll to bottom and then wait ' + maxWait + ' ms.');
        }
    }
    await page.screenshot({ path: workPath + '/screenshot.png', fullPage: true, quality: 100, type: 'jpeg' });
    const sel = '#feedlist_id li[data-type="blog"] h2 a';
    const hrefs = await page.evaluate(sel => {
        let elements = Array.from(document.querySelectorAll(sel));
        let links = elements.map(element => element.href);
        return links;
    }, sel);
    console.log('total links: ' + hrefs.length);
    process();

    async function process() {
        if (countUrl // ... the rest of the listing (the recursion that loads each article) is cut off in the source
Execution process

A screen recording can be viewed on my WeChat official account; see below:

Execution results

Article content list:

Article content:

Conclusion

I had previously assumed that since Node.js uses the JavaScript language, it would surely be able to handle the JavaScript content of web pages, yet I never found a suitable, efficient library for it. It was only when I found puppeteer that I decided to test the waters.
That said, Node.js's async model is a real headache: these hundred-odd lines of code took me a full 10 hours.
You can refactor the code in the process() method to use async.eachSeries; my recursive approach is not the optimal solution.
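As a sketch of that suggestion: instead of recursion (or async.eachSeries), a plain `for…of` loop with `await` already processes the collected hrefs strictly in series. `loadArticle` below is a stub standing in for the real "open tab, expand, save to disk" step, so the pattern runs without a browser.

```javascript
// Stub for the real per-article work (open a tab, expand, save the HTML)
async function loadArticle(href) {
  return `saved:${href}`; // pretend we fetched and saved the article
}

// Sequential alternative to the recursive process():
// each iteration awaits the previous one, so no recursion is needed
async function processAll(hrefs) {
  const results = [];
  for (const href of hrefs) {
    results.push(await loadArticle(href));
  }
  return results;
}

processAll(['a', 'b', 'c']).then(r => console.log(r));
// [ 'saved:a', 'saved:b', 'saved:c' ]
```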
In fact, a single serial pass is not efficient. I originally wrote an asynchronous method to close the browser:

let tryCloseBrowser = setInterval(function () {
    console.log("check if any process running...");
    if (countDown <= 0) {
        clearInterval(tryCloseBrowser);
        console.log("none process running, close.");
        browser.close();
    }
}, 3000);

Following this idea, the original version of the code opened multiple tabs at the same time. It was much faster, but its fault tolerance was low; you can try writing it yourself.
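One way to sketch the multi-tab idea (this is my illustration, not the article's original code): a small promise pool that runs at most `limit` workers at once and resolves when every URL is done, so `browser.close()` can simply follow the `await` instead of a polling timer. `runPool` and its `worker` callback are made-up names, not puppeteer APIs; in practice the worker would open a tab, scrape, and close it.

```javascript
// Minimal worker pool: at most `limit` concurrent workers drain a shared
// queue of URLs; the returned promise resolves when everything is done.
async function runPool(urls, limit, worker) {
  const queue = urls.slice();
  const results = [];

  async function next() {
    while (queue.length > 0) {
      const url = queue.shift();      // take the next pending URL
      results.push(await worker(url)); // run one job, then loop for more
    }
  }

  // start `limit` workers in parallel and wait for all of them to finish
  await Promise.all(Array.from({ length: limit }, next));
  return results;
}

// usage: the worker here just tags the URL; a real one would drive a tab
runPool(['u1', 'u2', 'u3', 'u4'], 2, async u => `done:${u}`)
  .then(r => console.log(r.length)); // 4
```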

Off Topic

Readers of my articles know that I write mostly about ways of approaching problems, to give you ideas.
I was completely unfamiliar with Node.js and puppeteer (beyond knowing what they are suited for, that's all). If you remember the concept of on-demand memory mentioned in "The 10x-Speed Programmer", you will understand why I deliberately avoid systematically studying a new technology up front.
Here is the complete chain of thinking I followed, from first touching puppeteer to a working feature:

    1. Understand puppeteer's functions and features, and judge whether they match the requirement.
    2. Quickly run through all the demos in Getting Started.
    3. Reassess puppeteer's characteristics and, from a designer's point of view, infer its architecture.
    4. Validate that inferred architecture.
    5. Read through the API to learn puppeteer's details.
    6. Search for the prerequisite knowledge (and the prerequisites of those prerequisites), organize it into a learning tree, and go back to step 1.
    7. Design/analyze/debug/...

May 9, 2018 02:13

