First Steps in Python: A Crawler (Scraping cnblogs News)

Source: Internet
Author: User
Tags: python, script

Objective

When people think of Python, their first association is often web crawlers.

Over the past couple of days I picked up a little bit of Python, and couldn't resist writing a simple crawler to practice on. Just do it.

Preparatory work

To build a crawler that scrapes data, you first need to analyze the structure of the source pages you will request. Only with a correct analysis can you extract the content you want quickly and reliably.

Open any news page on cnblogs, for example https://news.cnblogs.com/n/570973/. The idea is to start from this page and, by following its "previous" and "next" links, keep crawling other news items.

Visit https://news.cnblogs.com/n/570973/ in a browser and right-click "View Source". For now I only want some simple data (article title, author, publish time, etc.), so locate the relevant parts in the HTML source:

1) Title (and URL): <div id="news_title"><a href="//news.cnblogs.com/n/570973/">SpaceX's reused "Dragon" spacecraft successfully docked with the International Space Station</a></div>

2) Poster: <span class="news_poster">Posted by <a href="//home.cnblogs.com/u/34358/">itwriter</a></span>

3) Publish time: <span class="time">Posted at 2017-06-06 14:53</span>

4) Current news ID: <input type="hidden" value="570981" id="lbContentID">
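Assuming the snippets above are representative, a minimal sketch of pulling those four fields out with BeautifulSoup might look like this (the sample HTML below is reconstructed from the snippets, so treat the exact casing of the ids and classes as illustrative):

```python
from bs4 import BeautifulSoup

# Sample HTML reconstructed from the snippets above (illustrative only).
html = '''
<div id="news_title"><a href="//news.cnblogs.com/n/570973/">SpaceX reused "Dragon" spacecraft successfully docked with the ISS</a></div>
<span class="news_poster">Posted by <a href="//home.cnblogs.com/u/34358/">itwriter</a></span>
<span class="time">Posted at 2017-06-06 14:53</span>
<input type="hidden" value="570981" id="lbContentID">
'''

soup = BeautifulSoup(html, 'html.parser')
url = soup.select('div#news_title > a')[0].get('href')         # article URL
title = soup.select('div#news_title > a')[0].getText()         # article title
author = soup.select('span.news_poster > a')[0].getText()      # poster
time = soup.select('span.time')[0].getText()                   # publish time
content_id = soup.select('input#lbContentID')[0].get('value')  # current news ID
print(content_id, author, url)
```

The CSS selectors here are the same ones the full script below relies on.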

  Of course, the structure of the "previous" and "next" links matters too, but here I hit a problem: for those two <a> tags on the page, both the link and the text content are rendered by JavaScript. What now? I tried looking into it (having Python execute JS, and so on), but for a Python novice that is probably getting ahead of the plan, so I looked for another way.

Although the two links are rendered by JS, whatever JS renders must in theory come from the response to some request it made. So can we watch the page load and find something useful? In Chrome/Firefox, the developer tools' Network panel clearly shows the request and response for every resource.

  

Their request addresses are:

1) Previous news: https://news.cnblogs.com/NewsAjax/GetPreNewsById?contentId=570992

2) Next news: https://news.cnblogs.com/NewsAjax/GetNextNewsById?contentId=570992

The content of the response is JSON.

  

Here, ContentID is what we need: given this value, we know the URL of the previous or next news item, because news page addresses follow a fixed format: https://news.cnblogs.com/n/{ContentID}/ (where {ContentID} is the replaceable ID).
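To illustrate, here is a sketch of turning such a response into the next page URL. The JSON literal is a hand-made stand-in for the real response; the field names follow the description above:

```python
import json

# A stand-in for the GetPreNewsById/GetNextNewsById response described above.
resp_text = ('{"ContentID": 570971, '
             '"Title": "Mac supports external GPU: VR development kit for $599", '
             '"SubmitDateFormat": "2017-06-06 14:47"}')

data = json.loads(resp_text)
prev_id = data['ContentID']
# Plug the ID into the fixed news-page URL format.
prev_url = 'https://news.cnblogs.com/n/%s/' % prev_id
print(prev_url)  # https://news.cnblogs.com/n/570971/
```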

Tools

1) Python 3.6 (pip is installed alongside it; add both to the PATH environment variable)

2) Pycharm 2017.1.3

3) Third-party Python libraries (to install, open cmd and run: pip install <name>)

a) pyperclip: for reading and writing the clipboard

b) requests: an HTTP library built on urllib, released under the Apache2 license. It is more convenient than urllib and saves us a lot of work.

c) beautifulsoup4: Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolbox that extracts the data you need from a document by parsing it.

Source

Personally I find the code very basic and easy to follow (after all, a novice can't write advanced code). If you have questions or suggestions, please let me know.

#! python3
# coding=utf-8
# get_cnblogs_news.py
# Starting from any cnblogs news page, crawl every news item (title, publish time, poster).
# https://news.cnblogs.com/n/123456/
# Title format:   <div id="news_title"><a href="//news.cnblogs.com/n/570973/">SpaceX's reused "Dragon" spacecraft successfully docked with the International Space Station</a></div>
# Poster format:  <span class="news_poster">Posted by <a href="//home.cnblogs.com/u/34358/">itwriter</a></span>
# Time format:    <span class="time">Posted at 2017-06-06 14:53</span>
# Current news ID: <input type="hidden" value="570981" id="lbContentID">
# The "previous"/"next" links cannot be read from the page directly, because they are
# rendered after an AJAX request. Instead, request these endpoints and parse the JSON:
# Previous: https://news.cnblogs.com/NewsAjax/GetPreNewsById?contentId=570971
# Next:     https://news.cnblogs.com/NewsAjax/GetNextNewsById?contentId=570971
# Response content:
#   ContentID: 570971
#   Title: "Mac supports external GPU: VR development kit for $599"
#   SubmitDate: "/Date(1425445514)/"
#   SubmitDateFormat: "2017-06-06 14:47"

import sys
import json
import pyperclip
import requests
import bs4


# Parse and save (title, author, publish time, current ID).
# soup: the bs4 object built from the response HTML.
def get_info(soup):
    dict_info = {'curr_id': '', 'author': '', 'time': '', 'title': '', 'url': ''}

    titles = soup.select('div#news_title > a')
    if len(titles) > 0:
        dict_info['title'] = titles[0].getText()
        dict_info['url'] = titles[0].get('href')

    authors = soup.select('span.news_poster > a')
    if len(authors) > 0:
        dict_info['author'] = authors[0].getText()

    times = soup.select('span.time')
    if len(times) > 0:
        dict_info['time'] = times[0].getText()

    content_ids = soup.select('input#lbContentID')
    if len(content_ids) > 0:
        dict_info['curr_id'] = content_ids[0].get('value')

    # Append one line per news item to the CSV file.
    with open('d:/cnblognews.csv', 'a', encoding='utf-8') as f:
        text = '%s,%s,%s,%s\n' % (dict_info['curr_id'],
                                  dict_info['author'] + dict_info['time'],
                                  dict_info['url'],
                                  dict_info['title'])
        print(text)
        f.write(text)

    return dict_info['curr_id']


# Crawl backwards from the current news item.
# curr_id: news ID
# loop_count: how many items to walk back; 0 means keep going until the first item.
def get_prev_info(curr_id, loop_count=0):
    private_loop_count = 0
    try:
        while loop_count == 0 or private_loop_count < loop_count:
            res_prev = requests.get(
                'https://news.cnblogs.com/NewsAjax/GetPreNewsById?contentId=' + curr_id)
            res_prev.raise_for_status()
            res_prev_dict = json.loads(res_prev.text)
            prev_id = res_prev_dict['ContentID']
            res_prev = requests.get('https://news.cnblogs.com/n/%s/' % prev_id)
            res_prev.raise_for_status()
            soup_prev = bs4.BeautifulSoup(res_prev.text, 'html.parser')
            curr_id = get_info(soup_prev)
            private_loop_count += 1
    except:
        # Reaching the first item makes the request/parsing fail, which ends the loop.
        pass


# Crawl forwards from the current news item.
# curr_id: news ID
# loop_count: how many items to walk forward; 0 means keep going until the last item.
def get_next_info(curr_id, loop_count=0):
    private_loop_count = 0
    try:
        while loop_count == 0 or private_loop_count < loop_count:
            res_next = requests.get(
                'https://news.cnblogs.com/NewsAjax/GetNextNewsById?contentId=' + curr_id)
            res_next.raise_for_status()
            res_next_dict = json.loads(res_next.text)
            next_id = res_next_dict['ContentID']
            res_next = requests.get('https://news.cnblogs.com/n/%s/' % next_id)
            res_next.raise_for_status()
            soup_next = bs4.BeautifulSoup(res_next.text, 'html.parser')
            curr_id = get_info(soup_next)
            private_loop_count += 1
    except:
        # Reaching the last item makes the request/parsing fail, which ends the loop.
        pass


# Take the start URL from the command line first; if absent, fall back to the clipboard.
# Either way it should be the URL of a cnblogs news page.
if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    url = pyperclip.paste()

# No address at all: give up.
if not url:
    raise ValueError

# Start crawling from the source address.
res = requests.get(url)
res.raise_for_status()
if not res.text:
    raise ValueError

# Parse the HTML.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
curr_id = get_info(soup)
print('Backward...')
get_prev_info(curr_id)
print('Forward...')
get_next_info(curr_id)
print('Done')

Run

Save the source code above as d:/get_cnblogs_news.py, then open the cmd command-line tool on Windows:

Enter the command: py.exe d:/get_cnblogs_news.py https://news.cnblogs.com/n/570992/ and press Enter.

Explanation: py.exe needs no explanation; the second argument is the Python script file, and the third is the source page to crawl. (The code also allows for another case: if you copy the URL https://news.cnblogs.com/n/570992/ to the system clipboard, you can run it directly as: py.exe d:/get_cnblogs_news.py)

Command-line output (print):

  

Content saved to a CSV file
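One caveat on the CSV output: the script joins the fields with a plain '%s,%s,%s,%s' format string, so a title containing a comma would break a naive split when reading the file back. Python's csv module handles quoted fields correctly; as a sketch (the sample row below is hand-made, assuming properly quoted fields):

```python
import csv
import io

# One row in the layout the script writes: id, author+time, url, title.
# csv.reader copes with quoted fields, which a plain str.split(',') would not.
sample = ('570981,"itwriter Posted at 2017-06-06 14:53",'
          '//news.cnblogs.com/n/570973/,"SpaceX, reused Dragon docks"\n')
rows = list(csv.reader(io.StringIO(sample)))
curr_id, author_time, url, title = rows[0]
print(title)  # SpaceX, reused Dragon docks
```

Switching the script's write path to csv.writer would produce such quoting automatically.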

  

Recommended books and materials for Python beginners:

1) Liao Xuefeng's Python tutorial, very basic and easy to understand: http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000

2) Automate the Boring Stuff with Python (PDF)

  

This article is just a diary of my Python learning; if anything here is misleading, please point it out (no flames, please). If it helps you, I'm honored.

  
