Python: Matching Maoyan Movie HTML with Regex


A crawler project that scrapes movie information from the Maoyan TOP100 board.

The project code comes from: https://github.com/Germey/MaoYan/blob/master/spider.py

Because the scraper has to extract many fields (movie title, poster image, actors, release date, and so on), the regex is fairly complex.

Printing the fetched HTML with print(html) inside parse_one_page(html) gives the following markup for one entry:

The comments mark the parts that the regex captures.

<dd>
  <i class="board-index board-index-1">1</i>  <!-- ranking (index) -->
  <a href="/films/1203" title="霸王別姬" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
    <img src="//ms0.meituan.net/mywww/image/Loading_2.e3d934bf.png" alt="" class="poster-default" />
    <img data-src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963cee19e68711.jpg@160w_220h_1e_1c" alt="霸王別姬" class="board-img" />  <!-- poster (image) -->
  </a>
  <div class="board-item-main">
    <div class="board-item-content">
      <div class="movie-item-info">
        <p class="name"><a href="/films/1203" title="霸王別姬" data-act="boarditem-click" data-val="{movieId:1203}">霸王別姬</a></p>  <!-- title -->
        <p class="star">主演:張國榮,張豐毅,鞏俐</p>  <!-- actor -->
        <p class="releasetime">發行日期:1993-01-01(中國香港)</p>  <!-- time -->
      </div>
      <div class="movie-item-number score-num">
        <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>  <!-- integer and fraction parts of the score -->
      </div>
    </div>
  </div>
</dd>

 

The regex in detail

pattern = re.compile(
    # ranking (index) and poster image URL (image)
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    # movie title (name), starring actors (actor), and release date (time)
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    # integer and fraction parts of the score
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S)
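The pattern relies on two things: the re.S flag, which lets "." match newlines so a single match can span a multi-line <dd> block, and the non-greedy ".*?", which stops at the nearest following token instead of running to the last one in the page. A minimal sketch of both effects, using a shortened, made-up two-entry HTML string rather than the real page:

```python
import re

# Shortened stand-in for two board entries (an assumption, not the real page).
html = (
    '<dd>\n<i class="board-index board-index-1">1</i>\n</dd>\n'
    '<dd>\n<i class="board-index board-index-2">2</i>\n</dd>'
)

# Non-greedy with re.S: one match per <dd> block.
lazy = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?</dd>', re.S)

# Same pattern without re.S: "." cannot cross the newline after <dd>.
no_dotall = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?</dd>')

# Greedy with re.S: the first ".*" runs as far as it can, merging both blocks.
greedy = re.compile(r'<dd>.*board-index.*>(\d+)</i>.*</dd>', re.S)

ranks_lazy = lazy.findall(html)            # ['1', '2']
ranks_no_dotall = no_dotall.findall(html)  # []
ranks_greedy = greedy.findall(html)        # ['2']

print(ranks_lazy, ranks_no_dotall, ranks_greedy)
```

The greedy variant still matches, but it swallows the first entry and reports only the last ranking, which is why every gap in the real pattern is written as ".*?".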

 

The complete parsing function:

import re


def parse_one_page(html):
    # re.S lets "." match newlines, so one match can span a whole <dd> block.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # drop the 3-character "主演:" label
            'time': item[4].strip()[5:],    # drop the 5-character "發行日期:" label
            'score': item[5] + item[6]      # e.g. "9." + "6" -> "9.6"
        }
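The slices [3:] and [5:] work because the captured text starts with a fixed-length label: "主演:" is three characters and "發行日期:" is five. The sketch below shows that step in isolation, with values taken from the sample entry. The helper strip_label is hypothetical and not part of the original project; it is included only as one possible alternative that does not depend on the label length.

```python
import re

# Raw captures as they might come out of the regex groups,
# including surrounding whitespace from the page layout.
raw_actor = "\n    主演:張國榮,張豐毅,鞏俐\n  "
raw_time = "發行日期:1993-01-01(中國香港)"

# The approach used in parse_one_page: trim, then skip a fixed-length label.
actor = raw_actor.strip()[3:]
time_ = raw_time.strip()[5:]


def strip_label(text):
    """Hypothetical helper: drop everything up to the first colon.

    Accepts both the ASCII colon and the full-width colon, and returns the
    trimmed text unchanged when no colon is present.
    """
    return re.sub(r'^[^::]*[::]', '', text.strip(), count=1)


print(actor)
print(time_)
print(strip_label(raw_actor))
```

Fixed slicing is shorter, but it silently cuts into the data if the site ever changes a label; the colon-based variant trades that risk for a dependency on the label always ending in a colon.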

After a successful match, the results written to result.txt look like this:

{"title": "霸王別姬", "image": "http://p1.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "張國榮,張豐毅,鞏俐", "time": "1993-01-01(中國香港)", "score": "9.6", "index": "1"}
{"title": "肖申克的救贖", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "蒂姆·羅賓斯,摩根·弗裡曼,鮑勃·岡頓", "time": "1994-10-14(美國)", "score": "9.5", "index": "2"}
{"title": "本傑明·巴頓奇事", "image": "http://p0.meituan.net/movie/48/[email protected]_220h_1e_1c", "actor": "布拉德·皮特,凱特·布蘭切特,塔拉吉·P·漢森", "time": "2008-12-25(美國)", "score": "8.8", "index": "71"}
{"title": "哈利·波特與死亡聖器(下)", "image": "http://p0.meituan.net/movie/76/[email protected]_220h_1e_1c", "actor": "丹尼爾·雷德克裡夫,魯伯特·格林特,艾瑪·沃森", "time": "2011-08-04", "score": "9.0", "index": "72"}
{"title": "這個殺手不太冷", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "讓·雷諾,加裡·奧德曼,娜塔莉·波特曼", "time": "1994-09-14(法國)", "score": "9.5", "index": "3"}
{"title": "大話西遊之大聖娶親", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "周星馳,朱茵,羅家英", "time": "2014-10-24", "score": "9.4", "index": "73"}
{"title": "致命魔術", "image": "http://p0.meituan.net/movie/12/[email protected]_220h_1e_1c", "actor": "休·傑克曼,克裡斯蒂安·貝爾,邁克爾·凱恩", "time": "2006-10-20(美國)", "score": "8.8", "index": "61"}
{"title": "羅馬假日", "image": "http://p0.meituan.net/movie/23/[email protected]_220h_1e_1c", "actor": "格利高利·派克,奧黛麗·赫本,埃迪·艾伯特", "time": "1953-09-02(美國)", "score": "9.1", "index": "4"}
{"title": "阿甘正傳", "image": "http://p0.meituan.net/movie/53/[email protected]_220h_1e_1c", "actor": "湯姆·漢克斯,羅賓·懷特,加裡·西尼斯", "time": "1994-07-06(美國)", "score": "9.4", "index": "5"}
{"title": "十二怒漢", "image": "http://p0.meituan.net/movie/86/[email protected]_220h_1e_1c", "actor": "亨利·方達,李·科布,馬丁·鮑爾薩姆", "time": "1957-04-13(美國)", "score": "9.1", "index": "62"}
{"title": "倩女幽魂", "image": "http://p0.meituan.net/movie/85/[email protected]_220h_1e_1c", "actor": "張國榮,王祖賢,午馬", "time": "2011-04-30", "score": "9.1", "index": "74"}
# (remaining entries omitted)
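The listing above is a sequence of JSON objects with the Chinese text left unescaped. One plausible way to produce such a file is to serialize each parsed item with json.dumps and ensure_ascii=False, then append it to result.txt. The sketch below assumes that approach; the function name write_to_file, the default path, and the one-object-per-line layout are assumptions for illustration, not details confirmed by this article.

```python
import json


def write_to_file(item, path='result.txt'):
    """Append one parsed item to `path` as a single JSON line (assumed layout)."""
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    sample = {
        'index': '1',
        'title': '霸王別姬',
        'actor': '張國榮,張豐毅,鞏俐',
        'time': '1993-01-01(中國香港)',
        'score': '9.6',
    }
    write_to_file(sample)
```

Opening the file in append mode means each page's items are added to the end, so the file should be removed before a fresh run to avoid duplicate entries.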

 
