Python 2.7 first try: Crawling Sunshine Movie Net (2016-12-06)

Source: Internet
Author: User

I previously built a project with the Scrapy framework that crawled pages and parsed elements with XPath; this attempt matches elements with BeautifulSoup's select method instead.

1. Entry page to crawl: http://www.ygdy8.com/index.html

2. Modules used: requests (downloads the page source) and BeautifulSoup4 (parses the page)

3. The idea: first crawl the entry page to get the columns (menus) at the top of the page and their corresponding URLs

4. Build a list of the menu URLs, then loop over it, parsing each one to crawl the specific movie titles and URLs under each menu

5. Problem: when the URL under each menu is parsed again, the page content differs between sections, so the select call also picks up links and titles that are not movies

6. Next steps: consider building classes and functions, and using a recursive loop to fetch and parse the URLs:

① Parse each movie URL again to get the movie's download link and write it to a local file

② Remove the non-movie titles that appeared in step 5

7. Python code

#coding: utf-8
import requests
from bs4 import BeautifulSoup as bs

# entry page to crawl
rooturl = "http://www.ygdy8.com/index.html"
# download the page source
res = requests.get(rooturl)
# the site is encoded as gb2312
res.encoding = 'gb2312'
html = res.text
soup = bs(html, 'html.parser')

cate_urls = []
for cateurl in soup.select('.contain ul li a'):
    # category (menu) title
    cate_name = cateurl.text
    # category url, kept so it can be crawled again
    cate_url = "http://www.ygdy8.com/" + cateurl['href']
    cate_urls.append(cate_url)
    print "site level menu:", cate_name, "menu url:", cate_url

# parse the page behind each menu url
for i in range(len(cate_urls)):
    cate_listurl = cate_urls[i]
    res = requests.get(cate_listurl)
    res.encoding = 'gb2312'
    html = res.text
    soup = bs(html, 'html.parser')
    print "parsing section " + str(i + 1) + " links", cate_urls[i]
    contenturls = []
    contents = soup.select('.co_content8 ul')[0].select('a')
    for title in contents:
        movie_title = title.text
        movie_url = title['href']
        contenturls.append(movie_url)
        print movie_title, movie_url
    print contenturls
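The non-movie links from step 5 could be filtered by the shape of their href. A minimal sketch (runs under both Python 2.7 and 3); the trailing `/<digits>/<digits>.html` pattern for movie detail pages is my assumption about the site's URL scheme, not something the article confirms:

```python
import re

# Assumed shape of a movie detail page, e.g. /html/gndy/dyzz/20161206/42769.html.
# The trailing /<digits>/<digits>.html part is a guess, not verified on the site.
MOVIE_HREF = re.compile(r'/\d+/\d+\.html$')

def filter_movie_links(links):
    """Keep only (title, href) pairs whose href looks like a movie detail page."""
    return [(t, h) for t, h in links if MOVIE_HREF.search(h)]

links = [
    ('Some Movie 2016', '/html/gndy/dyzz/20161206/42769.html'),
    ('Next page', 'list_23_2.html'),              # pagination, not a movie
    ('Home', 'http://www.ygdy8.com/index.html'),  # column link, not a movie
]
print(filter_movie_links(links))
```

This drops the pagination and column links while keeping the detail pages, so step 6-② reduces to one list comprehension once the real URL pattern is confirmed.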
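For step 6-①, the download address still has to be pulled out of each movie's detail page. A rough sketch that works on the raw page source without BeautifulSoup, assuming the site embeds its download addresses as plain ftp:// links (an assumption on my part; the sample URL below is made up for illustration):

```python
# -*- coding: utf-8 -*-
import re

def extract_download_links(html):
    # Grab every ftp:// address in the page source.
    # Assumption: download links are plain ftp URLs; the article does not confirm this.
    return re.findall(r'ftp://[^"\'<>\s]+', html)

# hypothetical snippet of a movie detail page
sample = '<td><a href="ftp://example.com/Some.Movie.2016.mkv">download</a></td>'
found = extract_download_links(sample)
print(found)

# write each link to a local file, one per line (step 6-1)
with open('movie_links.txt', 'w') as f:
    for url in found:
        f.write(url + '\n')
```

In the real crawler, `html` would be the `res.text` of each movie URL collected above, fetched with the same gb2312 handling as the listing pages.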

8. Run result

