Python web crawler for beginners (2)
Disclaimer: the content and code in this article are for personal study only and may not be used for commercial purposes. If you reprint it, please include a link to this article.
This article continues the Python web crawler for beginners series; the latest code has been pushed to https://github.com/octans/PythonPractice
1. Review of the previous article
In the previous article, I started from the popular recommendation page of Huajiao (huajiao.com), then crawled each broadcaster's personal information and the corresponding live broadcast history videos.
First, let's look at the broadcaster and video data crawled from huajiao.com in the previous article:
    # getUserCount
    10179
    # getLiveCount
    111574
So far, 10,179 broadcaster records and 111,574 video records belonging to those broadcasters have been collected. The data volume is small because I only crawled the broadcasters on Huajiao's popular recommendation page, which shows 60 system-recommended broadcasters at a time.
So far, I have done the following new things:
- Encapsulate the read and write operations of MySql
- The encoding style follows PEP8.
- Crawled the broadcaster information of the WoMi YouXuan network (http://video.51wom.com/)
- Crawled the broadcaster and video information of the Miaopai network (http://www.yixia.com/)
The MySQL wrapper code lives in its own module, mysql.py. The module is simple, but it implements select, insert, delete, and other operations. If you are interested in wrapping MySQL yourself, feel free to refer to it, but do not use it in a production environment; there I recommend using and reading the ORM library peewee.
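For a rough sense of what such a wrapper can look like, here is a minimal sketch using the pymysql driver; it is not the repository's mysql.py, and the connection details and table names are made up for illustration:

    import pymysql


    class MySql:
        # minimal MySQL helper: a sketch, not the repository's mysql.py
        def __init__(self, host, user, password, db):
            self.conn = pymysql.connect(host=host, user=user, password=password,
                                        db=db, charset='utf8mb4',
                                        cursorclass=pymysql.cursors.DictCursor)

        # run a SELECT and return all rows as dicts
        def select(self, sql, args=None):
            with self.conn.cursor() as cursor:
                cursor.execute(sql, args)
                return cursor.fetchall()

        # run an INSERT/UPDATE/DELETE and commit
        def execute(self, sql, args=None):
            with self.conn.cursor() as cursor:
                cursor.execute(sql, args)
            self.conn.commit()


    # hypothetical usage:
    # db = MySql('localhost', 'root', 'secret', 'crawler')
    # db.execute('INSERT INTO actor (uid, nickname) VALUES (%s, %s)', ('42', 'demo'))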
Next, I will continue to describe my development experience on data capture.
2. Crawled data source and logic
Ultimate Goal: Collect the broadcaster information and history playback records of major live broadcast platforms, and then aggregate and analyze the data.
Currently completed: data collection from Huajiao.
WoMi YouXuan (http://video.51wom.com/) is a data aggregation site for live streaming influencers; it collects broadcaster data from the major live broadcast platforms (Huajiao, Panda TV, Miaopai, Douyu, Inke, Yizhibo, Meipai). I therefore hope to obtain the popular broadcasters of each platform from it, and then use each broadcaster's id to crawl more detailed information on the corresponding live platform.
3. Crawl the broadcaster list page of the WoMi YouXuan network
The list page is http://video.51wom.com/, as shown below:
At first glance, this is a list page with pagination links at the bottom; clicking a page number submits a form.
3.1 Analysis conclusions and program logic
When you click a page number at the bottom, the Chrome developer tools show the XHR request as follows:
After some testing, the following conclusions can be drawn:
- A) To request data for page 2 and beyond, you need to submit the corresponding cookie and csrf values to the site;
- B) The request is a POST with Content-Type "multipart/form-data";
- C) The submitted parameters include _csrf, stage-name, platform, industry, and so on;
- D) The response is the html code of a table list;
The cookie is easy to obtain, but how do we get the _csrf value?
Looking at the page source, the site writes the csrf value into the form when it generates the list page, and the same csrf value can be reused in subsequent requests:
<input type="hidden" name="_csrf" value="aWF6ZGMzclc9EAwRK3Y4LhobNQo6eEAdWwA0IFd1ByUDNTgwClUEZw==">
From the analysis above, the program logic should be:
- A) First request the broadcaster list homepage to obtain the csrf value and cookie;
- B) Save the csrf value and cookie for the following requests;
- C) Request the second, third, and later pages of the broadcaster list;
- D) Parse the returned table html with BeautifulSoup, traversing each row and each column within the row;
- E) Write the extracted data to mysql.
3.2 Obtain the broadcaster information of the WoMi YouXuan network in Python
A) Build a base class Website, and create a subclass of Website for each site to crawl:
- Some requests return html code, so an html parser is set in the base class;
- Some requests return json strings, so a json parser is set in the base class;
- Each site needs its own request headers, so the headers also live in the base class;
- A function wraps POST requests with Content-Type: multipart/form-data;
- A function wraps POST requests with Content-Type: application/x-www-form-urlencoded;
- Here I deliberately write the different request methods as separate functions rather than one function with a type parameter, so that calls from subclasses read clearly.
Note: to save space, the code below is neither complete nor strictly PEP8-compliant.
    class Website:
        # requests.session() handles cookies for us automatically
        session = requests.session()

        # html parser
        htmlParser = BeautifulSoup

        # json parser
        jsonParser = json

        # request headers
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/54.0.2840.98 Safari/537.36'}

        # send a plain GET request
        def get(self, url, params=None):
            if params is None:
                params = {}
            return self.session.get(url, params=params, headers=self.headers)

        # send a GET request and return the parsed html object
        def get_html(self, url, params=None):
            r = self.get(url, params)
            return self.htmlParser(r.text, 'html.parser')

        # send a GET request and return the parsed json object
        def get_json(self, url, params=None):
            r = self.get(url, params)
            return self.jsonParser.loads(r.text)

        # send a POST request as Content-Type: multipart/form-data
        def post_multi_part(self, url, params):
            kwargs = dict()
            for (k, v) in params.items():
                kwargs.setdefault(k, (None, v))
            r = self.session.post(url, files=kwargs, headers=self.headers)
            return self.htmlParser(r.text, 'html.parser')
B) Build the class WoMiYouXuan to encapsulate requests to the WoMi YouXuan site:
- The first_kiss() method makes the first request to the site; the csrf value it obtains is saved in the attribute self.csrf;
- first_kiss() also obtains the cookie; there is no explicit handling for it because requests.session() takes care of the cookie, acquiring and submitting it automatically;
- Note that on one instance you only need to call first_kiss() once; after that you can call the other page request functions as many times as you like;
- The csrf value and the cookie belong together, and the site verifies both when they are submitted;
- parse_actor_list_page() parses the html of the broadcaster list;
- spider_actors() is a skeleton function that loops over each page and writes the results to mysql.
    class WoMiYouXuan(Website):
        # csrf value submitted back to the website
        csrf = ''

        def __init__(self):
            self.first_kiss()

        # request the site once to obtain the csrf value and save it in
        # self.csrf so that later POST requests can use it directly
        def first_kiss(self):
            url = 'http://video.51wom.com/'
            html = self.get_html(url)
            self.csrf = html.find('meta', {'name': 'csrf-token'}).attrs['content']

        # get the broadcaster information from one list page
        def parse_actor_list_page(self, page=1):
            # build the parameters and send the POST request
            url = 'http://video.51wom.com/media/' + str(page) + '.html'
            keys = ('_csrf', 'stage-name', 'platform', 'industry', 'price',
                    'follower_num', 'follower_region', 'page',
                    'is_video_platform', 'sort_by_price', 'type_by_price')
            params = dict()
            for key in keys:
                params.setdefault(key, '')
            params['_csrf'] = self.csrf
            params['page'] = str(page)
            html = self.post_multi_part(url, params)

            # parse the broadcaster list
            trs = html.find('div', {'id': 'table-list'}).table.findAll('tr')
            trs.pop(0)  # remove the header row
            actor_list = list()
            for tr in trs:
                # too much detail to show here; see the source code if interested
                pass

        # skeleton function: loop over each page and write the results to mysql
        def spider_actors(self):
            page = 1
            tbl_actor = WMYXActor()
            while True:
                ret = self.parse_actor_list_page(page)
                for actor in ret['items']:
                    actor['price_dict'] = json.dumps(actor['price_dict'])
                    tbl_actor.insert(actor, replace=True)
                if ret['items_count'] * ret['page'] < ret['total']:
                    page += 1
                else:
                    break
The detailed parsing of the broadcaster list html happens in the method parse_actor_list_page().
3.3 Knowledge Point Summary
A) Form submission POST method
application/x-www-form-urlencoded is generally used when only key-value data is submitted;
multipart/form-data is generally used for file uploads, but it can also carry key-value data; the broadcaster list data here, for example, is fetched this way.
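To make the difference concrete, here is a small sketch of both submission styles with Requests; httpbin.org is used purely as a neutral test endpoint and the field names are invented:

    import requests

    payload = {'_csrf': 'token-value', 'page': '2'}

    # application/x-www-form-urlencoded: pass the dict via `data`
    r1 = requests.post('https://httpbin.org/post', data=payload)

    # multipart/form-data: pass each value via `files` as (filename, value);
    # filename is None so the value is sent as a plain form field
    multipart = {k: (None, v) for k, v in payload.items()}
    r2 = requests.post('https://httpbin.org/post', files=multipart)

    print(r1.request.headers['Content-Type'])  # application/x-www-form-urlencoded
    print(r2.request.headers['Content-Type'])  # multipart/form-data; boundary=...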
B) Python network request library Requests
This library is a pleasure to use. It handles cookies automatically, as in the base class Website above, and it also makes it easy to build multipart/form-data POST requests, as in Website.post_multi_part().
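For example, a Session object keeps cookies across requests without any extra work (again using httpbin.org as a test endpoint):

    import requests

    session = requests.session()

    # the first response sets a cookie; the session stores it automatically
    session.get('https://httpbin.org/cookies/set/csrf_cookie/abc123')

    # the stored cookie is sent along with every later request from this session
    r = session.get('https://httpbin.org/cookies')
    print(r.json())  # {'cookies': {'csrf_cookie': 'abc123'}}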
C) Use regular expressions in Python to match integers in strings, as in the following code:
    avg_watched = tds[6].get_text(strip=True)  # average number of viewers
    mode = re.compile(r'\d+')
    tmp = mode.findall(avg_watched)
D) Use the try/except mechanism to implement something like PHP's isset(), as in the following code:
    # check whether the string contains a comma, e.g. '8,189'
    try:
        index = string.index(',')
        string = string.replace(',', '')
    except ValueError:
        string = string
E) Note that '1' and 1 are different things in Python; you need to convert between string and number types yourself.
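A quick illustration of the point:

    # '1' is a string, 1 is an integer; they never compare equal
    print('1' == 1)        # False
    print(int('1') == 1)   # True
    print(str(1) == '1')   # True

    # data scraped from html always arrives as strings, so convert before
    # doing arithmetic or numeric comparisons
    page = '2'
    print(int(page) + 1)   # 3; `page + 1` would raise a TypeError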
4. Crawl the broadcaster and video information of the Miaopai network
With the broadcaster ids for each live platform obtained from the WoMi YouXuan network, the first platform I implemented is Miaopai (http://www.yixia.com/), crawling the corresponding broadcaster and video information.
A broadcaster's personal homepage is at http://www.yixia.com/u/uid, where uid is the broadcaster's id, as shown below:
4.1 Analysis conclusions and program logic
- A) The broadcaster's personal homepage provides the broadcaster's personal information, such as avatar, nickname, and number of fans, as well as the broadcaster's video list;
- B) The video list loads as an infinite-scroll waterfall, which means an ajax interface is used;
- C) The video list interface returns html code, which still needs to be parsed with BeautifulSoup;
- D) Requests to the video list interface must carry an suid parameter, whose value is obtained from the broadcaster's personal page via the uid.
4.2 Obtain the broadcaster information and video list of the Miaopai network in Python
- Build the class YiXia(Website);
- The method parse_user_page() obtains the broadcaster's personal information from the uid;
- The method get_video_list() retrieves the video list data page by page.
    class YiXia(Website):
        # visit the broadcaster's page (which is also the video list page)
        # to get the suid and the broadcaster's personal information
        def parse_user_page(self, uid):
            print(self.__class__.__name__ + ':parse_user_page, uid=' + uid)
            user = dict()
            user['uid'] = uid
            url = 'http://www.yixia.com/u/' + uid
            bs = self.get_html(url)
            div = bs.find('div', {'class': 'box1'})
            user['nickname'] = div.h1.a.get_text(strip=True)  # nickname
            stat = div.ol.get_text(strip=True)
            stat = re.split(r'follow\|fans', stat)
            user['follow'] = stat[0].strip()    # number followed
            user['followed'] = stat[1].strip()  # number of fans
            # ------ a lot of code is omitted here ------
            return user

        # AJAX request for the video list
        def get_video_list(self, suid, page=1):
            url = 'http://www.yixia.com/gu/u'
            payload = {'page': page, 'suid': suid, 'fen_type': 'channel'}
            json_obj = self.get_json(url, params=payload)
            msg = json_obj['msg']
            msg = BeautifulSoup(msg, 'html.parser')

            # parse the video titles
            titles = list()
            ps = msg.findAll('p')
            for p in ps:
                titles.append(p.get_text(strip=True))  # video title

            # parse the like and comment counts
            stats = list()
            divs = msg.findAll('div', {'class': 'list clearfix'})
            for div in divs:
                tmp = div.ol.get_text(strip=True)
                tmp = re.split(r'like|\|comment', tmp)
                stats.append(tmp)

            # parse the remaining video fields
            videos = list()
            divs = msg.findAll('div', {'class': 'd_video'})
            for (k, div) in enumerate(divs):
                video = dict()
                video['scid'] = div.attrs['data-scid']
                # ------ a lot of code is omitted here ------
            return videos

        # skeleton function: fetch every page of a broadcaster's videos
        def spider_videos(self, suid, video_count):
            page = 1
            current = 0
            tbl_video = YiXiaVideo()
            while current < int(video_count):
                print('spider_videos: suid=' + suid + ', page=' + str(page))
                videos = self.get_video_list(suid, page)
                for video in videos:
                    tbl_video.insert(video, replace=True)
                current += len(videos)
                page += 1
            return True
4.3 knowledge point summary
Most of the knowledge points were already covered in 3.3. The one worth highlighting here is conversion between strings, integers, and floating-point numbers. For example, a fan count such as '2.3万' (2.3 ten-thousand) arrives as a string and needs to be converted to the float 2.3 and then to the integer 23000.
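A minimal sketch of that conversion, assuming the scraped count uses the Chinese unit '万' (ten thousand); the helper name is made up:

    def parse_count(text):
        # convert a scraped count such as '2.3万' or '8,189' to an int
        text = text.strip().replace(',', '')
        if text.endswith('万'):
            return int(float(text[:-1]) * 10000)
        return int(text)

    print(parse_count('2.3万'))  # 23000
    print(parse_count('8,189'))  # 8189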
5. Program Results
The following figure shows the collected video data:
6. Knowledge Point reference
Here are the reference links I recorded:
- Parse json: JSON encoder and decoder
- Parse html: BeautifulSoup
- Construct an http request: Requests
- File Operation: Reading and Writing Files
- Database ORM Library: peewee
- Operations on strings: Python string operations
- List Operation: More on Lists | Python List Operation Method
- Dict operations: Dictionaries | details about dict in Python
- Use of True and False: Boolean operations in Python
- POST submission method: four common POST data submission methods
- Python encoding style: PEP8