Python Crawler Implementation for Complete Beginners (Crawling the Latest Movie Rankings)

Source: Internet
Author: User

Note: This tutorial is based on an article by Ehco; what follows are my notes after working through it.

Target site
http://dianying.2345.com/top/
Website structure

The part to crawl sits under a UL tag (which contains the LI tags); the idea is simply to iterate over the LI tags and output their contents.
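The UL/LI iteration can be sketched with BeautifulSoup against a tiny stand-in snippet (the markup below is illustrative sample HTML, not the site's exact page):

```python
import bs4

# A minimal stand-in for the ranking page's markup (hypothetical snippet).
html = """
<ul class="picList clearfix">
  <li><span class="sTit"><a>Movie A</a></span></li>
  <li><span class="sTit"><a>Movie B</a></span></li>
</ul>
"""

# html.parser is the stdlib parser; the article's full code uses lxml instead.
soup = bs4.BeautifulSoup(html, 'html.parser')
movies = soup.find('ul', class_='picList clearfix').find_all('li')
names = [li.find('span', class_='sTit').a.text for li in movies]
print(names)  # ['Movie A', 'Movie B']
```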

Problems encountered

The code is simple, but I ran into quite a few problems.

One: Encoding

The page is served as GBK, so GBK encoding is used uniformly here.
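A quick way to see why the encoding matters: when the response headers carry no charset, requests falls back to ISO-8859-1, which garbles GBK pages. A minimal sketch with plain bytes (the sample text is mine, not the page's):

```python
# Bytes as a GBK-encoded page would deliver them (sample text).
raw = '最新电影排行榜'.encode('gbk')  # "latest movie rankings"

# Decoding with the wrong codec produces mojibake:
garbled = raw.decode('iso-8859-1')
# Decoding with GBK recovers the original text:
recovered = raw.decode('gbk')

print(garbled)
print(recovered)  # 最新电影排行榜
```

The equivalent fix with requests is setting `r.encoding = 'gbk'` before reading `r.text`, as the full code below does.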

Two: Libraries

requests, bs4, idna, certifi, chardet, urllib3 and other libraries were missing, so they had to be added manually. Here is the method I used.

How to add a library (example: urllib3):

Search Baidu for urllib3 and download it via a link to your local machine (I downloaded the first result).

Unzip the urllib3 folder into the Lib directory of the Python installation directory.
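Unzipping packages into Lib by hand works, but for reference, a single pip command installs both main libraries along with their dependencies (idna, certifi, chardet, and urllib3 are pulled in automatically as dependencies of requests):

```shell
python -m pip install requests beautifulsoup4 lxml
```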

Three: Downloading the image link

This one is interesting. Here is what I wrote at first:

f.write(requests.get(img_url).content)

Error:

File "C:\Users\Shinelon\AppData\Local\Programs\Python\Python36\lib\requests\models.py", line 379, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg': No schema supplied. Perhaps you meant http:////imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg?

Process finished with exit code 1

The image URLs look like this (scheme-relative), so the download loop fails.

The fix: prepend http: to the link yourself:

img_url2 = 'http:' + img_url
f.write(requests.get(img_url2).content)
print(img_url2)
f.close()

And then it's normal.
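Prepending 'http:' works for this site, but for reference, the standard library's urljoin resolves scheme-relative //host/path links (and ordinary relative links) against the page URL more generally. A small sketch using the page URL from this article; the helper name is mine:

```python
from urllib.parse import urljoin

PAGE_URL = 'http://dianying.2345.com/top/'

def absolutize(link, base=PAGE_URL):
    """Resolve scheme-relative (//host/path) and relative links against the page URL."""
    return urljoin(base, link)

print(absolutize('//imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg'))
# http://imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg
```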

Attached code:

import requests
import bs4


def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'gbk'
        return r.text
    except:
        return "something wrong"


def get_content(url):
    html = get_html(url)
    soup = bs4.BeautifulSoup(html, 'lxml')
    movies_list = soup.find('ul', class_='picList clearfix')
    movies = movies_list.find_all('li')
    for top in movies:
        # scrape the poster image src
        img_url = top.find('img')['src']
        # scrape the movie title
        name = top.find('span', class_='sTit').a.text
        try:
            # scrape the release date
            time = top.find('span', class_='sIntro').text
        except:
            time = "No release time"
        # scrape the leading actors
        actors = top.find('p', class_='pActor')
        actor = ''
        for act in actors.contents:
            actor = actor + act.string + ' '
        # scrape the synopsis
        intro = top.find('p', class_='pTxt pIntroShow').text
        print("Title: {}\t{}\n{}\n{}\n\n".format(name, time, actor, intro))
        # download the poster to the specified directory
        with open('/users/shinelon/desktop/1212/' + name + '.png', 'wb+') as f:
            img_url2 = 'http:' + img_url
            f.write(requests.get(img_url2).content)
            print(img_url2)


def main():
    url = 'http://dianying.2345.com/top/'
    get_content(url)


if __name__ == "__main__":
    main()
Results
