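Scraper project: scraping information on the Maoyan TOP100 movies
Project source: https://github.com/Germey/MaoYan/blob/master/spider.py
The scraper has to extract many fields per movie (title, poster image, cast, release date, and more), so the regular expression is fairly complex.
Calling print(html) inside parse_one_page(html) shows the HTML text being parsed. A single movie entry looks like this:
# The trailing # comments mark the content to be matched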
<dd>
    <i class="board-index board-index-1">1</i>    # rank (index)
    <a href="/films/1203" title="霸王別姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
        <img src="//ms0.meituan.net/mywww/image/Loading_2.e3d934bf.png" alt="" class="poster-default" />
        <img data-src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963cee19e68711.jpg@160w_220h_1e_1c" alt="霸王別姬" class="board-img" />    # poster (image)
    </a>
    <div class="board-item-main">
        <div class="board-item-content">
            <div class="movie-item-info">
                <p class="name"><a href="/films/1203" title="霸王別姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王別姬</a></p>    # title
                <p class="star">主演:張國榮,張豐毅,鞏俐</p>    # actor
                <p class="releasetime">發行日期:1993-01-01(中國香港)</p>    # time
            </div>
            <div class="movie-item-number score-num">
                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>    # score: integer and fraction parts
            </div>
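The regex in detail
The pattern is built from three string pieces, each responsible for a group of fields. The re.S flag lets . match newlines, so every .*? can skip across the line breaks inside one <dd> block:
pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'   # rank (index) and poster (image)
    + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'      # title, cast (actor), and release date (time)
    + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',          # integer and fraction parts of the score
    re.S)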
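To see why the pattern relies on non-greedy .*? together with re.S, here is a minimal sketch on a trimmed two-entry sample. The sample markup, the shortened pattern, and the film id 1297 are illustrative assumptions, not the real page or the project's full regex.

```python
import re

# Hypothetical, heavily trimmed sample: the real page has more markup per entry.
SAMPLE = """
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/1203">霸王別姬</a></p>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <p class="name"><a href="/films/1297">肖申克的救贖</a></p>
</dd>
"""

# Shortened version of the pattern: rank and title only.
LAZY = r'<dd>.*?board-index.*?>(\d+)</i>.*?name"><a.*?>(.*?)</a>.*?</dd>'
# Same pattern with every quantifier made greedy, for comparison.
GREEDY = r'<dd>.*board-index.*>(\d+)</i>.*name"><a.*>(.*)</a>.*</dd>'

# With re.S, "." also matches newlines, so each <dd> block is matched separately.
lazy_dotall = re.findall(LAZY, SAMPLE, re.S)

# Without re.S, "." stops at the newline right after <dd>, so nothing matches.
lazy_no_dotall = re.findall(LAZY, SAMPLE)

# Greedy quantifiers swallow both blocks as one match and keep only the last entry.
greedy_dotall = re.findall(GREEDY, SAMPLE, re.S)

print(lazy_dotall)
print(lazy_no_dotall)
print(greedy_dotall)
```

The lazy pattern with re.S is the only variant of the three that returns one tuple per movie.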
The complete parsing function:
import re

def parse_one_page(html):
    # One capture group per field; re.S lets "." match across line breaks.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # drop the 3-character "主演:" prefix
            'time': item[4].strip()[5:],    # drop the 5-character "發行日期:" prefix
            'score': item[5] + item[6]      # e.g. "9." + "6" -> "9.6"
        }
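The actor and time fields are cleaned by slicing off a fixed number of characters, which works only while the label text keeps its exact length. The sketch below compares that slicing with a helper that splits on the first colon instead. The helper name strip_label is hypothetical and not part of the original project.

```python
import re

def strip_label(text):
    """Drop a leading 'label:' prefix (ASCII or full-width colon), if present."""
    parts = re.split(r'[::]', text.strip(), maxsplit=1)
    return parts[1] if len(parts) == 2 else parts[0]

# Raw captures as they might come out of the regex, with surrounding whitespace.
raw_actor = '\n                主演:張國榮,張豐毅,鞏俐\n        '
raw_time = '發行日期:1993-01-01(中國香港)'

# Fixed-width slicing, as in parse_one_page.
sliced_actor = raw_actor.strip()[3:]
sliced_time = raw_time.strip()[5:]

# Splitting on the colon gives the same values without hard-coded lengths.
split_actor = strip_label(raw_actor)
split_time = strip_label(raw_time)

print(sliced_actor, split_actor)
print(sliced_time, split_time)
```

Both approaches agree on this sample. The colon-based version would keep working if the label wording changed, at the cost of one extra regex call per field.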
After a successful match, result.txt contains:
{"title": "霸王別姬", "image": "http://p1.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "張國榮,張豐毅,鞏俐", "time": "1993-01-01(中國香港)", "score": "9.6", "index": "1"}
{"title": "肖申克的救贖", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "蒂姆·羅賓斯,摩根·弗裡曼,鮑勃·岡頓", "time": "1994-10-14(美國)", "score": "9.5", "index": "2"}
{"title": "本傑明·巴頓奇事", "image": "http://p0.meituan.net/movie/48/[email protected]_220h_1e_1c", "actor": "布拉德·皮特,凱特·布蘭切特,塔拉吉·P·漢森", "time": "2008-12-25(美國)", "score": "8.8", "index": "71"}
{"title": "哈利·波特與死亡聖器(下)", "image": "http://p0.meituan.net/movie/76/[email protected]_220h_1e_1c", "actor": "丹尼爾·雷德克裡夫,魯伯特·格林特,艾瑪·沃森", "time": "2011-08-04", "score": "9.0", "index": "72"}
{"title": "這個殺手不太冷", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "讓·雷諾,加裡·奧德曼,娜塔莉·波特曼", "time": "1994-09-14(法國)", "score": "9.5", "index": "3"}
{"title": "大話西遊之大聖娶親", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "周星馳,朱茵,羅家英", "time": "2014-10-24", "score": "9.4", "index": "73"}
{"title": "致命魔術", "image": "http://p0.meituan.net/movie/12/[email protected]_220h_1e_1c", "actor": "休·傑克曼,克裡斯蒂安·貝爾,邁克爾·凱恩", "time": "2006-10-20(美國)", "score": "8.8", "index": "61"}
{"title": "羅馬假日", "image": "http://p0.meituan.net/movie/23/[email protected]_220h_1e_1c", "actor": "格利高利·派克,奧黛麗·赫本,埃迪·艾伯特", "time": "1953-09-02(美國)", "score": "9.1", "index": "4"}
{"title": "阿甘正傳", "image": "http://p0.meituan.net/movie/53/[email protected]_220h_1e_1c", "actor": "湯姆·漢克斯,羅賓·懷特,加裡·西尼斯", "time": "1994-07-06(美國)", "score": "9.4", "index": "5"}
{"title": "十二怒漢", "image": "http://p0.meituan.net/movie/86/[email protected]_220h_1e_1c", "actor": "亨利·方達,李·科布,馬丁·鮑爾薩姆", "time": "1957-04-13(美國)", "score": "9.1", "index": "62"}
{"title": "倩女幽魂", "image": "http://p0.meituan.net/movie/85/[email protected]_220h_1e_1c", "actor": "張國榮,王祖賢,午馬", "time": "2011-04-30", "score": "9.1", "index": "74"}
# (remaining entries omitted)
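The result.txt entries above are JSON objects, one per movie, with the Chinese text left unescaped. The code that writes them is not shown in this post, so the following is only a sketch of one way to produce such a file. The helper name write_to_file, its path parameter, the temporary directory, and the placeholder image value are all assumptions made for illustration.

```python
import json
import os
import tempfile

def write_to_file(item, path):
    """Append one parsed movie dict to `path` as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Sample item shaped like what parse_one_page yields (image value is a placeholder).
sample_item = {
    'index': '1',
    'image': 'http://example.invalid/poster.jpg',
    'title': '霸王別姬',
    'actor': '張國榮,張豐毅,鞏俐',
    'time': '1993-01-01(中國香港)',
    'score': '9.6',
}

# Write into a temporary directory so the demo does not touch a real result.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_to_file(sample_item, demo_path)

with open(demo_path, encoding='utf-8') as f:
    written = f.read()
print(written)
```

Writing one object per line means the file can later be read back line by line with json.loads, for example to sort the entries by their index field.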