Node.js Crawler Notes (III)

Source: Internet
Author: User

Idea: With the proxy configured in Notes (II), the crawler can now fetch pages from YouTube, so the next step is to crawl the video information on the site. Looking at how YouTube is organized, a natural starting point is the subscription number (that is, a YouTube channel): pick several channels, crawl the video categories under each channel, then the list of videos under each category, and finally the information you need from each video. These notes use the channel YouTube Movies as the example.
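
The crawl order described above can be outlined as three nested levels. This is only a sketch of the control flow, not code from these notes: `crawl`, `getCategories`, `getVideos`, and `getVideoInfo` are hypothetical names, the step functions are supplied by the caller, and the stub data exists purely to show the call shape.

```javascript
// Hypothetical outline of the crawl order: channel -> categories -> videos -> video info.
// The functions on `steps` are placeholders supplied by the caller.
async function crawl(channelUrls, steps) {
  const results = [];
  for (const channelUrl of channelUrls) {
    // Level 1: channel page -> list of video categories
    const categories = await steps.getCategories(channelUrl);
    for (const category of categories) {
      // Level 2: category page -> list of videos
      const videos = await steps.getVideos(category.url);
      for (const video of videos) {
        // Level 3: video page -> the fields we want to keep
        const info = await steps.getVideoInfo(video.url);
        results.push({ channelUrl, category: category.name, video: video.name, info });
      }
    }
  }
  return results;
}

// Stub steps with made-up data, only to show how crawl() is called.
const stubSteps = {
  getCategories: async (channelUrl) => [{ name: 'Category A', url: channelUrl + '/a' }],
  getVideos: async (categoryUrl) => [
    { name: 'Video 1', url: categoryUrl + '/1' },
    { name: 'Video 2', url: categoryUrl + '/2' }
  ],
  getVideoInfo: async (videoUrl) => ({ source: videoUrl })
};

crawl(['https://example.invalid/channel'], stubSteps).then((results) => {
  console.log(results.length + ' videos collected');
});
```

The requests are awaited one at a time here, which keeps the load on the site low; whether to run them in parallel is a separate design decision.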

First, crawl the video category list of YouTube Movies

Open the channel page and you will see that it contains many video categories. The next step is to parse the channel page and extract the URL and name of each video category.

Right-click the page in the browser and choose Inspect to see how the categories are marked up. All of the video categories are li elements under a ul, and the ul can be located by its ID, so we can parse the page with the cheerio module to extract the category information.

First, wrap the code that fetches the category information in a function, function categoryList(url, callback) {}. The code is as follows:

    var request = require('superagent');
    require('superagent-proxy')(request);
    var cheerio = require('cheerio');
    var debug = require('debug')('youtube:test:category-list');

    var proxy = 'http://127.0.0.1:61481'; // proxy address

    // Request headers
    var header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        // The Cookie string copied from your own browser session goes here
        'Cookie': '',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36',
        'X-Chrome-UMA-Enabled': '1',
        'X-Client-Data': 'cja2yqeiorbjaqjbtskbckmdyge=',
        'Connection': 'keep-alive'
    };

    // Get the list of video categories under a channel
    function categoryList(url, callback) {
        debug('reading the video category list of the channel', url);

        request
            .get(url)       // the URL to fetch
            .set(header)    // pass the object itself so each field is sent as a header
            .proxy(proxy)   // route the request through the proxy
            .end(onresponse);

        function onresponse(err, res) {
            // res.setEncoding('utf-8'); // prevent garbled Chinese
            if (err) {
                return callback(err);
            }
            console.log('Status: ' + res.status); // print the status code
            // console.log(res.text);
            var $ = cheerio.load(res.text);

            // Get the channel ID from the link in the page header
            var channelHref = $('#c4-primary-header-contents .branded-page-header-title a').attr('href') || '';
            var channelMatch = channelHref.match(/channel\/([a-zA-Z0-9_-]+)/);
            var channelId = channelMatch ? channelMatch[1] : null;

            var categoryList = [];
            $('#browse-items-primary .branded-page-module-title').each(function () {
                var $category = $(this).find('a').first();
                var item = {
                    name: $category.text().trim(),
                    url: 'https://www.youtube.com' + $category.attr('href')
                };
                // Use the URL to tell a video category from a channel
                if (item.url.indexOf('list') !== -1) {
                    item.channelId = channelId;
                } else {
                    var s = item.url.match(/channel\/([a-zA-Z0-9_-]+)/);
                    if (s) {
                        item.id = s[1];
                    }
                }
                // Get the video categories under a YouTube channel
                if (item.name !== '') {
                    categoryList.push(item);
                }
            });
            callback(null, categoryList);
        }
    }

Then call the categoryList function:

    categoryList('https://www.youtube.com/channel/UClgRkhTL3_hImCAmdLfDE4g', function (err, categoryList) {
        if (err) {
            return console.log(err);
        }
        return console.log(categoryList);
    });

Run the script to get the video category information. You will notice that the result also contains some channels, whereas we only want the video categories.
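
The check that separates the two kinds of entries can be pulled into a small helper. The following is a sketch under stated assumptions rather than the code from these notes: `classifyLink` is a hypothetical name, the sample IDs are made up, and it reads the `list` query parameter with Node's built-in `URL` class instead of searching the whole URL for the substring 'list', so a channel ID that happens to contain those letters is not mistaken for a category.

```javascript
// Hypothetical helper (not part of the original notes): decide whether a link
// found on a channel page points to a video category or to another channel.
function classifyLink(href) {
  // Resolve relative hrefs such as "/playlist?list=..." against the site root.
  const url = new URL(href, 'https://www.youtube.com');

  // Category (playlist) links carry their ID in the "list" query parameter.
  const listId = url.searchParams.get('list');
  if (listId) {
    return { type: 'category', id: listId, url: url.href };
  }

  // Channel links look like /channel/<id>.
  const match = url.pathname.match(/^\/channel\/([a-zA-Z0-9_-]+)/);
  if (match) {
    return { type: 'channel', id: match[1], url: url.href };
  }

  // Anything else is left for the caller to ignore.
  return { type: 'unknown', id: null, url: url.href };
}

console.log(classifyLink('/playlist?list=PL_example_id'));
console.log(classifyLink('/channel/UClgRkhTL3_hImCAmdLfDE4g'));
```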

Remove the channel entries (alternatively, add them to the channel list and crawl their video categories later)

In categoryList, replace

    // Get the video categories under a YouTube channel
    if (item.name !== '') {
        categoryList.push(item);
    }

with

    // Get the video categories under a YouTube channel
    if (item.name !== '' && item.hasOwnProperty('channelId')) {
        categoryList.push(item);
    }

Run the script again and the channel entries are gone. Next we go one level down, from a video category to its list of videos.
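
If you prefer to keep the channels for a later pass instead of discarding them, the same property check can split the entries into two lists. This is a minimal sketch, assuming items have the shape built in categoryList (categories carry a `channelId` property, channels carry `id`); `splitItems` is a hypothetical name and the sample data is made up.

```javascript
// Minimal sketch: split scraped entries into video categories and channels.
function splitItems(items) {
  const categories = [];
  const channels = [];
  for (const item of items) {
    if (item.name === '') continue; // skip entries with no title
    if (Object.prototype.hasOwnProperty.call(item, 'channelId')) {
      categories.push(item); // belongs to the channel being crawled
    } else {
      channels.push(item);   // another channel, can be crawled later
    }
  }
  return { categories, channels };
}

// Made-up sample data for illustration only.
const sample = [
  { name: 'Category A', url: 'https://www.youtube.com/playlist?list=PL_a', channelId: 'UC_owner' },
  { name: 'Channel B', url: 'https://www.youtube.com/channel/UC_b', id: 'UC_b' },
  { name: '', url: 'https://www.youtube.com/playlist?list=PL_c', channelId: 'UC_owner' }
];

const result = splitItems(sample);
console.log(result.categories.length, result.channels.length); // prints "1 1"
```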

Second, get the video list

Take the best-selling movies category as an example and get the list of videos under it. Open the category page and you will see that all of the category's videos are listed there; again we only need the URL and name of each video.

Open the inspector (PS: I use Google Chrome), look at the structure of the page first, and then parse it with cheerio. Inspecting the page shows that each video can be reached through a tr inside the tbody.
