Crawler project: scraping the Maoyan (Cat's Eye) TOP100 movie information
Project content from: https://github.com/Germey/MaoYan/blob/master/spider.py
The regular expression is fairly complex because many fields have to be extracted from each entry: the movie title, poster image, actors, release time, and more.
After parse_one_page(html) receives the HTML text, print(html) shows the following for one entry:
# Underlined parts are the matched content
<dd>
<i class="board-index board-index-1">1</i>  # movie ranking
<a href="/films/1203" title="Farewell My Concubine" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
<img src="//ms0.meituan.net/mywww/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
<img data-src="http://p1.meituan.net/movie/20803f59291c47e1e116c11963cee19e68711.jpg@160w_220h_1e_1c" alt="Farewell My Concubine" class="board-img" />  # poster image
</a>
<div class="board-item-main">
<div class="board-item-content">
<div class="movie-item-info">
<p class="name"><a href="/films/1203" title="Farewell My Concubine" data-act="boarditem-click" data-val="{movieId:1203}">Farewell My Concubine</a></p>  # title
<p class="star">Starring: Leslie Cheung, Zhang Fengyi, Gong Li</p>  # actors
<p class="releasetime">Release time: 1993-01-01 (Hong Kong, China)</p>  # release time
</div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>  # integer and fraction parts of the score
</div>
</div>
</div>
</dd>
The regular expression in detail:
pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'   # matches the ranking (index) and the poster (image)
    + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'      # matches the title (name), the actors (star) and the release time
    + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',          # matches the integer and fraction parts of the score
    re.S)
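The pattern relies on two things: the non-greedy quantifier `.*?` and the `re.S` flag. The short sketch below illustrates both on a made-up two-paragraph snippet (the `html` string is hypothetical and is not taken from the real page).

```python
import re

# Hypothetical snippet: the first <p> spans several lines, as on the real page.
html = '<p class="star">\n  Leslie Cheung\n</p><p class="star">Gong Li</p>'

# Without re.S, "." does not match a newline, so the multi-line <p> is missed.
without_dotall = re.findall(r'star">(.*?)</p>', html)

# With re.S, "." also matches newlines; the lazy ".*?" stops at the first </p>.
with_dotall = re.findall(r'star">(.*?)</p>', html, re.S)

# A greedy ".*" runs on to the last </p> and merges both paragraphs.
greedy = re.findall(r'star">(.*)</p>', html, re.S)

print(without_dotall)
print(with_dotall)
print(greedy)
```

This is why every gap in the pattern is written as `.*?` and why `re.S` is passed to `re.compile`: each movie entry spans many lines, and each capture group should end at the nearest closing tag.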
The complete parsing function is:
import re

def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # drop the 3-character prefix "主演：" (Starring:)
            'time': item[4].strip()[5:],    # drop the 5-character prefix "上映时间：" (Release time:)
            'score': item[5] + item[6]      # integer part + fraction part, e.g. "9." + "6"
        }
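To see what the function yields, here is a minimal sketch that runs it on one hand-written entry. `SAMPLE_HTML` is an assumption: a simplified fragment modelled on the structure shown earlier, with a placeholder image URL, and with the Chinese prefixes the live page uses so that the `[3:]` and `[5:]` slices behave as intended.

```python
import re

def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    for item in re.findall(pattern, html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop "主演："
            'time': item[4].strip()[5:],   # drop "上映时间："
            'score': item[5] + item[6],
        }

# Hypothetical, simplified entry; the image URL is a placeholder.
SAMPLE_HTML = '''
<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1203" title="霸王别姬" class="image-link">
<img data-src="http://example.com/poster.jpg" alt="霸王别姬" class="board-img" />
</a>
<div class="movie-item-info">
<p class="name"><a href="/films/1203" title="霸王别姬">霸王别姬</a></p>
<p class="star">
    主演：张国荣,张丰毅,巩俐
</p>
<p class="releasetime">上映时间：1993-01-01(中国香港)</p>
</div>
<div class="movie-item-number score-num">
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</div>
</dd>
'''

items = list(parse_one_page(SAMPLE_HTML))
print(items[0])
```

Because the function is a generator, the caller can loop over it directly or, as here, collect the results with `list()`.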
After a successful match, the output written to result.txt looks like this:
{"title": "Farewell My Concubine", "image": "http://p1.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Leslie Cheung, Zhang Fengyi, Gong Li", "time": "1993-01-01 (Hong Kong, China)", "score": "9.6", "index": "1"}
{"title": "The Shawshank Redemption", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Tim Robbins, Morgan Freeman, Bob Gunton", "time": "1994-10-14 (United States)", "score": "9.5", "index": "2"}
{"title": "The Curious Case of Benjamin Button", "image": "http://p0.meituan.net/movie/48/[email protected]_220h_1e_1c", "actor": "Brad Pitt, Cate Blanchett, Taraji P. Henson", "time": "2008-12-25 (United States)", "score": "8.8", "index": "..."}
{"title": "Harry Potter and the Deathly Hallows", "image": "http://p0.meituan.net/movie/76/[email protected]_220h_1e_1c", "actor": "Daniel Radcliffe, Rupert Grint, Emma Watson", "time": "2011-08-04", "score": "9.0", "index": "..."}
{"title": "Léon: The Professional", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Jean Reno, Gary Oldman, Natalie Portman", "time": "1994-09-14 (France)", "score": "9.5", "index": "3"}
{"title": "A Chinese Odyssey Part Two: Cinderella", "image": "http://p0.meituan.net/movie/[email protected]_220h_1e_1c", "actor": "Stephen Chow, Athena Chu, ...", "time": "2014-10-24", "score": "9.4", "index": "..."}
{"title": "The Prestige", "image": "http://p0.meituan.net/movie/12/[email protected]_220h_1e_1c", "actor": "Hugh Jackman, Christian Bale, Michael Caine", "time": "2006-10-20 (United States)", "score": "8.8", "index": "..."}
{"title": "Roman Holiday", "image": "http://p0.meituan.net/movie/23/[email protected]_220h_1e_1c", "actor": "Gregory Peck, Audrey Hepburn, Eddie Albert", "time": "1953-09-02 (United States)", "score": "9.1", "index": "4"}
{"title": "Forrest Gump", "image": "http://p0.meituan.net/movie/53/[email protected]_220h_1e_1c", "actor": "Tom Hanks, Robin Wright, Gary Sinise", "time": "1994-07-06 (United States)", "score": "9.4", "index": "5"}
{"title": "12 Angry Men", "image": "http://p0.meituan.net/movie/86/[email protected]_220h_1e_1c", "actor": "Henry Fonda, Lee J. Cobb, Martin Balsam", "time": "1957-04-13 (United States)", "score": "9.1", "index": "..."}
{"title": "A Chinese Ghost Story", "image": "http://p0.meituan.net/movie/85/[email protected]_220h_1e_1c", "actor": "Leslie Cheung, Joey Wong, Wu Ma", "time": "2011-04-30", "score": "9.1", "index": "74"}
# (remaining output omitted)
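The output above has one JSON object per line. A minimal sketch of how such a file could be written and read back is shown below. The helper names `write_to_file` and `read_results` and the `path` parameter are assumptions for illustration, and the demo writes to a temporary directory rather than to the real result.txt.

```python
import json
import os
import tempfile

def write_to_file(content, path='result.txt'):
    """Append one record to the file as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def read_results(path='result.txt'):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a hypothetical record, written to a temporary location.
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
write_to_file({'index': '1', 'title': '霸王别姬', 'score': '9.6'}, demo_path)
write_to_file({'index': '2', 'title': '肖申克的救赎', 'score': '9.5'}, demo_path)

records = read_results(demo_path)
print(records)
```

Opening the file in append mode means each parsed entry can be written as soon as it is yielded, without holding the whole result set in memory.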