Python: Matching Maoyan Movie HTML Information with Regular Expressions

Source: Internet
Author: User

Crawler project: scraping movie information from the Maoyan TOP100 board

Project content from: https://github.com/Germey/MaoYan/blob/master/spider.py

The regular expression is fairly complex because several fields have to be scraped for each movie: the ranking, poster image, movie name, starring actors, release time, and score.

Calling print(html) on the HTML text passed to parse_one_page(html) shows markup like the following:

# Annotations after "#" mark the content to be matched (underlined in the original)
<dd>
  <i class="board-index board-index-1">1</i>  # movie ranking
  <a href="/films/1203" title="Farewell My Concubine" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
    <img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
    <img data-src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963cee19e68711.jpg@160w_220h_1e_1c" alt="Farewell My Concubine" class="board-img" />  # image
  </a>
  <div class="board-item-main">
    <div class="board-item-content">
      <div class="movie-item-info">
        <p class="name"><a href="/films/1203" title="Farewell My Concubine" data-act="boarditem-click" data-val="{movieId:1203}">Farewell My Concubine</a></p>  # movie name
        <p class="star">Starring: Leslie Cheung, Zhang Fengyi, Gong Li</p>  # starring actors
        <p class="releasetime">Release time: 1993-01-01(Hong Kong, China)</p>  # release time
      </div>
      <div class="movie-item-number score-num">
        <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>  # integer and fraction parts of the score
      </div>
    </div>
  </div>
</dd>

The regular expression in detail:

pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'  # ranking index and poster image
    + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'  # movie name, starring actors, and release time
    + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',  # integer and fraction parts of the score
    re.S)  # re.S lets "." match newlines as well
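The pattern relies on two features: the non-greedy qualifier `.*?` and the `re.S` flag. As a minimal sketch, the snippet below applies a shortened version of the pattern to a small multi-line string. The `snippet` string and the shortened `pattern_text` are illustrative assumptions, not the real page or the full expression.

```python
import re

# Hypothetical multi-line snippet, loosely modeled on one board entry.
snippet = (
    '<dd>\n'
    '<i class="board-index board-index-1">1</i>\n'
    '<p class="name"><a href="/films/1203">Farewell My Concubine</a></p>\n'
    '</dd>'
)

# Shortened pattern (an assumption for illustration): ranking and movie name only.
pattern_text = r'<dd>.*?board-index.*?>(\d+)</i>.*?name"><a.*?>(.*?)</a>.*?</dd>'

# Without re.S, "." cannot cross the newline after <dd>, so nothing matches.
without_flag = re.findall(pattern_text, snippet)

# With re.S, "." also matches newlines, and each ".*?" stops at the first
# position where the rest of the pattern can match.
with_flag = re.findall(pattern_text, snippet, re.S)

print(without_flag)
print(with_flag)
```

Because every gap between fields is bridged with `.*?` instead of `.*`, each capture group ends at the nearest closing tag rather than running on to the last one on the page.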

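The complete parsing function is:

import re

def parse_one_page(html):
    # Seven capture groups: index, image, title, actor, time, integer, fraction
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            # [3:] drops the 3-character Chinese label "主演：" (Starring:) used on the live page
            'actor': item[3].strip()[3:],
            # [5:] drops the 5-character Chinese label "上映时间：" (Release time:)
            'time': item[4].strip()[5:],
            # the score is split across two tags, e.g. "9." + "6"
            'score': item[5] + item[6]
        }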
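To see what the function yields, here is a hedged, self-contained sketch that runs it on one hand-written entry. The `sample_html` string and the example.com image URL are assumptions made for illustration. The Chinese label prefixes are kept because the `[3:]` and `[5:]` slices are sized for them.

```python
import re

def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    for item in re.findall(pattern, html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character label
            'time': item[4].strip()[5:],   # drop the 5-character label
            'score': item[5] + item[6],
        }

# Hypothetical entry; the image URL is a placeholder, not a real poster.
sample_html = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1203" class="image-link">
    <img data-src="http://example.com/poster.jpg" class="board-img" />
  </a>
  <div class="movie-item-info">
    <p class="name"><a href="/films/1203">Farewell My Concubine</a></p>
    <p class="star">
      主演：Leslie Cheung, Zhang Fengyi, Gong Li
    </p>
    <p class="releasetime">上映时间：1993-01-01(Hong Kong, China)</p>
  </div>
  <div class="movie-item-number score-num">
    <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
  </div>
</dd>
'''

movies = list(parse_one_page(sample_html))
print(movies[0])
```

If the labels on the page were in another language, the slice lengths would have to change. Stripping the prefix by splitting on the colon would be less fragile than slicing by a fixed length.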

After a successful match, the output written to result.txt looks like this:

{"title": "Farewell My Concubine", "image": "http://p1.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Leslie Cheung, Zhang Fengyi, Gong Li", "time": "1993-01-01(Hong Kong, China)", "score": "9.6", "index": "1"}
{"title": "The Shawshank Redemption", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Tim Robbins, Morgan Freeman, Bob Gunton", "time": "1994-10-14(US)", "score": "9.5", "index": "2"}
{"title": "The Curious Case of Benjamin Button", "image": "http://p0.meituan.net/movie/48/[email protected]_220h_1e_1c", "actor": "Brad Pitt, Cate Blanchett, Taraji P. Henson", "time": "2008-12-25(United States)", "score": "8.8", "index": "…"}
{"title": "Harry Potter and the Deathly Hallows", "image": "http://p0.meituan.net/movie/76/[email protected]_220h_1e_1c", "actor": "Daniel Radcliffe, Rupert Grint, Emma Watson", "time": "2011-08-04", "score": "9.0", "index": "…"}
{"title": "Léon: The Professional", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Jean Reno, Gary Oldman, Natalie Portman", "time": "1994-09-14(France)", "score": "9.5", "index": "3"}
{"title": "A Chinese Odyssey Part Two: Cinderella", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Stephen Chow, Athena Chu, …", "time": "2014-10-24", "score": "9.4", "index": "…"}
{"title": "The Prestige", "image": "http://p0.meituan.net/movie/12/[email protected]_220h_1e_1c", "actor": "Hugh Jackman, Christian Bale, Michael Caine", "time": "2006-10-20(United States)", "score": "8.8", "index": "…"}
{"title": "Roman Holiday", "image": "http://p0.meituan.net/movie/23/[email protected]_220h_1e_1c", "actor": "Gregory Peck, Audrey Hepburn, Eddie Albert", "time": "1953-09-02(US)", "score": "9.1", "index": "4"}
{"title": "Forrest Gump", "image": "http://p0.meituan.net/movie/53/[email protected]_220h_1e_1c", "actor": "Tom Hanks, Robin Wright, Gary Sinise", "time": "1994-07-06(United States)", "score": "9.4", "index": "5"}
{"title": "12 Angry Men", "image": "http://p0.meituan.net/movie/86/[email protected]_220h_1e_1c", "actor": "Henry Fonda, Lee J. Cobb, Martin Balsam", "time": "1957-04-13(US)", "score": "9.1", "index": "…"}
{"title": "A Chinese Ghost Story", "image": "http://p0.meituan.net/movie/85/[email protected]_220h_1e_1c", "actor": "Leslie Cheung, Joey Wong, Wu Ma", "time": "2011-04-30", "score": "9.1", "index": "74"}
# (remaining output omitted)
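The output above has one JSON object per record. A minimal sketch of how such a file could be produced follows. The helper name `write_records` and the two sample records are assumptions for illustration, and the demo writes to a temporary directory rather than the project's real result.txt.

```python
import json
import os
import tempfile

def write_records(path, records):
    """Append each record to `path` as one JSON object per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text (e.g. Chinese titles)
            # readable instead of escaping it as \uXXXX sequences.
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Hypothetical records shaped like the dictionaries the parser yields.
records = [
    {'index': '1', 'title': 'Farewell My Concubine', 'score': '9.6'},
    {'index': '2', 'title': 'The Shawshank Redemption', 'score': '9.5'},
]

out_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_records(out_path, records)

# Read the file back to confirm that each line is a valid JSON object.
with open(out_path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f]
print(loaded == records)
```

Writing one object per line keeps the file appendable, so results from several pages can be added without rewriting what is already there.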
