I have previously parsed page elements with XPath in a Scrapy crawling project; this attempt matches elements with BeautifulSoup's select method instead.
1. Entry page to crawl: http://www.ygdy8.com/index.html
2. Modules used: requests (downloads the page source) and BeautifulSoup4 (parses the page).
3. Idea: first crawl the entry page to get the menu columns at the top of the page and their corresponding URLs.
4. Build a list of the menu URLs, then loop over it, parsing each one again to crawl the movie titles and URLs under every top-level menu.
5. Problem: when the URL under each menu is parsed again, select also matches non-movie titles and links, because the page content differs between sections.
6. Next steps: consider building classes and functions that fetch and parse URLs in a recursive loop:
① parse each movie URL again to get the movie download link and write it to a local file;
② remove the non-movie titles that appear in step 5.
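The recursive structure described in step 6 could be sketched as below. This is only an illustration of the idea, not the author's final design: the class name, the injected fetch callable, and the depth limit are all my assumptions, and the fetch function is injected so the structure can be exercised without network access (shown in Python 3 syntax).

```python
# -*- coding: utf-8 -*-
# Sketch of step 6: wrap the crawl in a class whose crawl method calls
# itself on the URLs it discovers, up to a fixed depth. All names here
# are illustrative assumptions, not taken from the original project.
from bs4 import BeautifulSoup

class MenuCrawler(object):
    def __init__(self, fetch, max_depth=2):
        self.fetch = fetch          # callable: url -> html text
        self.max_depth = max_depth
        self.seen = set()           # avoid re-crawling the same URL

    def crawl(self, url, depth=0):
        """Recursively collect (title, url) pairs starting from a menu page."""
        if depth > self.max_depth or url in self.seen:
            return []
        self.seen.add(url)
        soup = BeautifulSoup(self.fetch(url), "html.parser")
        found = []
        for a in soup.find_all("a", href=True):
            found.append((a.text.strip(), a["href"]))
            # Recurse into each discovered URL one level deeper.
            found.extend(self.crawl(a["href"], depth + 1))
        return found

# Tiny in-memory "site" standing in for requests.get(url).text
pages = {
    "/index.html": '<a href="/list.html">Menu</a>',
    "/list.html": '<a href="/movie.html">Movie A</a>',
    "/movie.html": '<p>no links here</p>',
}
crawler = MenuCrawler(lambda u: pages.get(u, ""))
print(crawler.crawl("/index.html"))
```

In a real run, the injected fetch callable would wrap requests.get plus the gb2312 encoding fix, while the tests can use a plain dictionary.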
7. Python code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup as bs

# Crawl entry point
rooturl = "http://www.ygdy8.com/index.html"
# Download the page source
res = requests.get(rooturl)
# The site is encoded as gb2312
res.encoding = 'gb2312'
# Page source
html = res.text
soup = bs(html, 'html.parser')

cate_urls = []
for cateurl in soup.select('.contain ul li a'):
    # Site category title
    cate_name = cateurl.text
    # Category URL, to be crawled again
    cate_url = "http://www.ygdy8.com/" + cateurl['href']
    cate_urls.append(cate_url)
    print "Site top-level menu:", cate_name, "menu url:", cate_url

# Parse each menu URL
for i in range(len(cate_urls)):
    cate_listurl = cate_urls[i]
    res = requests.get(cate_listurl)
    res.encoding = 'gb2312'
    html = res.text
    soup = bs(html, 'html.parser')
    print "Parsing menu " + str(i + 1) + " links:", cate_urls[i]
    contenturls = []
    contents = soup.select('.co_content8 ul')[0].select('a')
    # print contents
    for title in contents:
        moivetitle = title.text
        moiveurl = title['href']
        contenturls.append(moiveurl)
        print moivetitle, moiveurl
    print contenturls
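The non-movie links mentioned in step 5 (and targeted for removal in step ②) could be filtered with a heuristic like the one below. The pattern used here, that movie detail URLs contain "/html/" and end in ".html" while pagination and ad links do not, is an assumption about the site's markup, not something verified against the live site; the sample HTML is made up for illustration (Python 3 syntax).

```python
# -*- coding: utf-8 -*-
# Sketch of step 5/②: keep only links that look like movie detail pages.
# The URL pattern below is an assumed heuristic, not confirmed markup.
from bs4 import BeautifulSoup

def extract_movie_links(html):
    """Return (title, href) pairs that look like movie detail pages."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a in soup.select(".co_content8 ul a"):
        href = a.get("href", "")
        title = a.text.strip()
        # Keep only anchors matching the assumed detail-page pattern
        # with a non-empty title; drop pagination/ad anchors.
        if "/html/" in href and href.endswith(".html") and title:
            results.append((title, href))
    return results

# Made-up sample mixing movie links with a pagination link
sample = """
<div class="co_content8"><ul>
  <li><a href="/html/gndy/dyzz/20161206/42633.html">2016 Movie A</a></li>
  <li><a href="list_23_2.html">Next page</a></li>
  <li><a href="/html/gndy/dyzz/20161206/42634.html">2016 Movie B</a></li>
</ul></div>
"""
print(extract_movie_links(sample))
```

A filter like this could replace the bare `soup.select('.co_content8 ul')[0].select('a')` loop in the code above once the real URL pattern is confirmed.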
8. Run result
Python 2.7, first try: crawling Sunshine Movie Net (ygdy8), 2016-12-06
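As a follow-up toward step ① (extracting each movie's download link and writing it to a local file), a minimal sketch could look like this. The assumption that the download anchor's href starts with "ftp://" is a guess about the detail-page markup, not verified against the site, and the sample HTML and file name are made up (Python 3 syntax).

```python
# -*- coding: utf-8 -*-
# Sketch of step ①: given a movie detail page, find the download link
# and append "title<TAB>link" to a local file. The "ftp://" prefix is an
# assumed pattern for the site's download anchors, not confirmed markup.
from bs4 import BeautifulSoup

def save_download_link(html, title, path):
    """Write the first ftp link found in html to path; return it or None."""
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("ftp://"):  # assumed download-link pattern
            with open(path, "a", encoding="utf-8") as f:
                f.write("%s\t%s\n" % (title, href))
            return href
    return None

# Made-up fragment of a detail page
detail = '<td><a href="ftp://example.com/Movie.A.2016.mkv">download</a></td>'
print(save_download_link(detail, "Movie A", "downloads.txt"))
```

The file is opened in append mode so one run over all menus accumulates every link in a single local file, matching the intent of step ①.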