Reference links: http://www.python(tab).com/html/2017/pythonhexinbiancheng_0904/1170.html (remove the brackets)
http://blog.csdn.net/eastmount/article/details/51082253
This article draws on the two articles above to crawl the short reviews of "The Invisible Guest" from Douban Movie and export them to a CSV file.
To match HTML that spans multiple lines with a regular expression, add the re.S flag to the original pattern so that "." also matches newline characters.
With the flag set, each line break in the matched text shows up as "\n" followed by spaces.
These can be skipped directly in the pattern with .*?
See the pattern compiled as ren in the code below for details.
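The effect of re.S can be seen in a small, self-contained sketch. The HTML fragment, the user name, the comment text, and the helper find_comments below are all made up for illustration and are not taken from the real Douban page. The pattern is a shortened variant of the one used in this article.

```python
import re

# Hypothetical multi-line HTML fragment, invented for illustration only.
SAMPLE_HTML = '''<span class="comment-info">
    <a href="https://example.com/people/1/" class="">alice</a>
    <span>watched</span>
</span>
<p class="">A tense thriller.
</p>'''

# Shortened variant of the article's pattern: capture the user name
# and the first line of the comment text.
PATTERN = r'<span class="comment-info">.*?class="">(.*?)</a>.*?<p class="">(.*?)\n'


def find_comments(html, flags=0):
    """Return (user, comment) tuples matched by PATTERN under the given flags."""
    return re.findall(PATTERN, html, flags)


# Without re.S, "." stops at the first newline, so nothing matches.
without_flag = find_comments(SAMPLE_HTML)

# With re.S, ".*?" can cross line breaks and skip the "\n + spaces" runs.
with_flag = find_comments(SAMPLE_HTML, re.S)

# The text between two tags on adjacent lines is a newline plus indentation.
gap = re.search(r'<span class="comment-info">(.*?)<a', SAMPLE_HTML, re.S).group(1)

print(without_flag)
print(with_flag)
print(repr(gap))
```

Keeping the quantifiers non-greedy (.*? rather than .*) matters here: with re.S a greedy .* would run to the end of the page and merge several comments into one match.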
Also, when exporting with the pandas DataFrame.to_csv method, you need to set the encoding (utf_8_sig in the code below) to avoid garbled characters.
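As a rough sketch of the encoding point, the snippet below writes a tiny DataFrame with encoding="utf_8_sig" and checks that the file starts with a UTF-8 byte order mark, which is what lets spreadsheet programs such as Excel detect the encoding. The rows, the temporary file name comments.csv, and the variable names are assumptions made up for this example.

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for (user, comment) tuples from re.findall.
rows = [("alice", "很好看"), ("bob", "A tense thriller.")]
data = pd.DataFrame(rows)

# Write to a throwaway temporary directory rather than a real desktop path.
csv_path = os.path.join(tempfile.mkdtemp(), "comments.csv")
data.to_csv(csv_path, header=False, index=False, encoding="utf_8_sig")

# utf_8_sig prepends the UTF-8 byte order mark to the file.
with open(csv_path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)

# Reading back with the same encoding strips the mark again.
round_trip = pd.read_csv(csv_path, header=None, encoding="utf_8_sig")

print(has_bom)
print(round_trip)
```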
# coding=utf-8
import re

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Host': 'movie.douban.com'
}
cookies = {'Cookies': 'your own cookie'}
url = 'https://movie.douban.com/subject/26580232/comments?status=P'
html = requests.get(url, headers=headers, cookies=cookies)
# Link to the next page of comments.
reg = re.compile(r'<a href="(.*?)&status=P".*?class="next">')
# User name, two title attributes, and comment text; re.S lets "." cross line breaks.
ren = re.compile(r'<span class="comment-info">.*?class="">(.*?)</a>.*?<span>.*?title="(.*?)"></span>.*?<span.*?title="(.*?)">.*?<p class="">(.*?)\n', re.S)
while html.status_code == 200:
    keren = re.findall(ren, html.text)
    data = pd.DataFrame(keren)
    print(data)
    # Append this page to the CSV; utf_8_sig avoids garbled characters.
    data.to_csv('/users/b1ancheng/desktop/kerenduanping.csv', header=False, index=False, mode='a+', encoding="utf_8_sig")
    # Stop on the last page instead of failing with an IndexError.
    next_links = re.findall(reg, html.text)
    if not next_links:
        break
    url_next = 'https://movie.douban.com/subject/26580232/comments' + next_links[0]
    print(url_next)
    html = requests.get(url_next, headers=headers, cookies=cookies)
I welcome any advice so that we can improve together.
Multi-line HTML matching with the regex flag re.S (crawling Douban Movie short reviews)