Using re.S to match multi-line HTML with regular expressions (scraping Douban movie short comments)
Drawing on the two articles listed under Reference, this post scrapes the short comments for "The Invisible Guest" from Douban Movies and saves them to a CSV file.
To match HTML that spans multiple lines with a regular expression, the re.S flag must be added to the original pattern, so that "." also matches newline characters.
With the flag in place, the end of each line shows up in the matched text as \n followed by spaces.
In practice, this can be filtered out directly inside the pattern with .*? so that it never reaches the captured groups.
For details, see line 13 of the code (the definition of ren).
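The effect of re.S can be checked offline on a small fragment. The sketch below is only an illustration: sample_html is a made-up fragment shaped like one comment entry, not real Douban markup, and the pattern is a shortened version of ren.

```python
import re

# Hypothetical fragment shaped like one multi-line comment entry.
sample_html = (
    '<span class="comment-info">\n'
    '    <a href="#" class="">some_user</a>\n'
    '</span>\n'
    '<p class="">A short comment\n'
    '    </p>\n'
)

pattern = r'<span class="comment-info">.*?class="">(.*?)</a>.*?<p class="">(.*?)\n'

# Without re.S, "." stops at every newline, so nothing matches across lines.
without_flag = re.findall(pattern, sample_html)

# With re.S, "." also matches "\n", so the pattern can span several lines.
with_flag = re.findall(pattern, sample_html, re.S)

# Capturing up to </p> instead drags the trailing "\n + spaces" into the group.
untrimmed = re.findall(r'<p class="">(.*?)</p>', sample_html, re.S)

print(without_flag)
print(with_flag)
print(untrimmed)
```

Ending the last group at \n, as the shortened pattern does, keeps the line-end whitespace out of the captured comment text.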
In addition, when writing the file with the to_csv method of a pandas DataFrame, an encoding has to be specified (utf_8_sig here) to avoid garbled characters.
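As a rough check of the encoding point, the sketch below writes one made-up row with utf_8_sig and inspects the bytes on disk. The row contents and the temporary path are assumptions for the illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical row standing in for one scraped (user, rating, time, comment) tuple.
rows = [('用户甲', '力荐', '2017-09-16 10:00:00', '反转很精彩')]
data = pd.DataFrame(rows)

# Write to a temporary file; the path is only for this illustration.
csv_path = os.path.join(tempfile.mkdtemp(), 'comments.csv')
data.to_csv(csv_path, header=False, index=False, mode='a', encoding='utf_8_sig')

# utf_8_sig prepends the UTF-8 byte order mark to the file.
with open(csv_path, 'rb') as f:
    raw = f.read()
has_bom = raw.startswith(b'\xef\xbb\xbf')
text = raw.decode('utf_8_sig')

print(has_bom)
print(text)
```

The byte order mark is what lets spreadsheet programs such as Excel recognise the file as UTF-8, which is the usual reason Chinese text appears garbled without it.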
# coding=utf-8
import requests
import re
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Host': 'movie.douban.com'
}
cookies = {'cookies': 'your cookies'}  # placeholder: fill in your own cookie
url = 'https://movie.douban.com/subject/26580232/comments?status=P'
html = requests.get(url, headers=headers, cookies=cookies)
reg = re.compile(r'<a href="(.*?)&amp;status=P".*?class="next">')  # link to the next page
ren = re.compile(r'<span class="comment-info">.*?class="">(.*?)</a>.*?<span>.*?title="(.*?)"></span>.*?<span.*?title="(.*?)">.*?<p class="">(.*?)\n', re.S)
while html.status_code == 200:
    # Extract the comment fields from the current page and append them to the CSV file.
    keren = re.findall(ren, html.text)
    data = pd.DataFrame(keren)
    print(data)
    data.to_csv('/Users/b1ancheng/Desktop/kerenduanping.csv', header=False, index=False, mode='a+', encoding="utf_8_sig")
    # Stop when the page has no "next" link; otherwise request the next page.
    next_links = re.findall(reg, html.text)
    if not next_links:
        break
    url_next = 'https://movie.douban.com/subject/26580232/comments' + next_links[0]
    print(url_next)
    html = requests.get(url_next, headers=headers, cookies=cookies)
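The next-page step can also be tried without any network access. The sketch below is a variation rather than the code above: sample_page is a made-up pager fragment, next_page_url is a hypothetical helper, and html.unescape plus urljoin are swapped in for plain string concatenation so that &amp; in the href becomes & in the final URL.

```python
import html
import re
from urllib.parse import urljoin

base_url = 'https://movie.douban.com/subject/26580232/comments'

# Hypothetical pager fragment; the real page markup may differ.
sample_page = '<a href="?start=20&amp;limit=20&amp;status=P" class="next">Next</a>'

next_link = re.compile(r'<a href="([^"]*)"[^>]*class="next"')


def next_page_url(page_text):
    """Return the absolute URL of the next page, or None on the last page."""
    match = next_link.search(page_text)
    if match is None:
        return None
    # Attribute values in HTML source escape "&" as "&amp;"; undo that first.
    return urljoin(base_url, html.unescape(match.group(1)))


print(next_page_url(sample_page))
print(next_page_url('<p>no pager here</p>'))
```

Returning None when no link is found gives the crawl loop a clean stopping condition on the last page.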
Reference: http://www.python(tab).com/html/2017/pythonhexinbiancheng_0904/1170.html (remove the parentheses)
http://blog.csdn.net/eastmount/article/details/51082253
Comments and suggestions are welcome, so that we can improve together.