Note: This post is based on an earlier article by Ehco; these are my notes from reproducing it.
Target site
http://dianying.2345.com/top/
Website structure
The part to crawl sits under a UL tag containing LI tags; we simply iterate over each LI and print out its contents.
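The structure described above can be sketched offline with a tiny HTML snippet. This is a minimal illustration, not the real page; the class names `picList clearfix` and `sTit` are taken from the full script later in the post.

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the real page: a UL whose LI children each
# hold one movie entry.
html = """
<ul class="picList clearfix">
  <li><span class="sTit"><a>Movie A</a></span></li>
  <li><span class="sTit"><a>Movie B</a></span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the UL, then iterate over its LI children, as the crawler does.
for li in soup.find("ul", class_="picList clearfix").find_all("li"):
    print(li.find("span", class_="sTit").a.text)
```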
Problems encountered
The code itself is simple, but I ran into quite a few problems along the way.
One: Encoding
The target site is served as GBK, so GBK is used uniformly here.
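A small offline sketch of why the encoding matters: the page body arrives as GBK bytes, and decoding with the wrong codec produces mojibake. With requests you would set `r.encoding = 'gbk'` before reading `r.text`, as the full script below does.

```python
# "Movie ranking" encoded as GBK bytes, standing in for the page body.
raw = "电影排行榜".encode("gbk")

# Wrong codec: the bytes still decode, but the text is unreadable.
garbled = raw.decode("latin-1")

# Right codec: the original Chinese text is recovered.
correct = raw.decode("gbk")

print(garbled)
print(correct)
```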
Two: Libraries
My environment was missing requests, bs4, idna, certifi, chardet, urllib3 and some other libraries, so they had to be added manually. Here is my method.
How to add a library (example: urllib3):
Search Baidu for urllib3 and download it to your local machine via the link (I downloaded the first result).
Unzip the urllib3 folder into the Lib directory under the Python installation directory.
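As an alternative to unzipping packages into Lib by hand, `pip install requests beautifulsoup4` would pull in idna, certifi, chardet and urllib3 automatically as dependencies of requests. The sketch below is a quick sanity check for which of the needed libraries are importable:

```python
import importlib.util

# The third-party libraries the crawler needs, by their import names.
required = ["requests", "bs4", "idna", "certifi", "chardet", "urllib3"]

# find_spec returns None for any package that is not installed.
missing = [name for name in required
           if importlib.util.find_spec(name) is None]

print("missing libraries:", missing or "none")
```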
Three: Downloading the image links
This one is interesting. At first I wrote:

```python
f.write(requests.get(img_url).content)
```
Error
```
File "C:\Users\Shinelon\AppData\Local\Programs\Python\Python36\lib\requests\models.py", line 379, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '//imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg': No schema supplied. Perhaps you meant http:////imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg?

Process finished with exit code 1
```
The scraped img src is a protocol-relative link like the one above (it starts with // and has no scheme), so it cannot be downloaded as-is inside the loop.
The fix is to prepend the http scheme to the link yourself:

```python
img_url2 = 'http:' + img_url
f.write(requests.get(img_url2).content)
print(img_url2)
f.close()
```

After that, everything works normally.
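The string concatenation above can be checked offline, and a more general fix exists: `urllib.parse.urljoin` resolves the scraped src against the page URL, which also handles fully relative paths. The sample URL below is the one from the error message:

```python
from urllib.parse import urljoin

# Protocol-relative src attribute as scraped from the page (no scheme).
img_url = "//imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg"

# Fix used in the post: prepend the scheme by string concatenation.
img_url2 = "http:" + img_url

# More general alternative: resolve the link against the page URL,
# which takes the scheme from the base.
resolved = urljoin("http://dianying.2345.com/top/", img_url)

print(img_url2)
print(resolved)
```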
Attached code
```python
import requests
import bs4


def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'gbk'
        return r.text
    except:
        return "something wrong"


def get_content(url):
    html = get_html(url)
    soup = bs4.BeautifulSoup(html, 'lxml')
    movieslist = soup.find('ul', class_='picList clearfix')
    movies = movieslist.find_all('li')
    for top in movies:
        # scrape the image src
        img_url = top.find('img')['src']
        # scrape the movie name
        name = top.find('span', class_='sTit').a.text
        try:
            # scrape the release date
            time = top.find('span', class_='sIntro').text
        except:
            time = "No release time"
        # scrape the leading actors
        actors = top.find('p', class_='pActor')
        actor = ''
        for act in actors.contents:
            actor = actor + act.string + ' '
        # scrape the synopsis
        intro = top.find('p', class_='pTxt pIntroShow').text
        print("Title: {}\t{}\n{}\n{} \n\n".format(name, time, actor, intro))
        # download the image to the given directory
        with open('/users/shinelon/desktop/1212/' + name + '.png', 'wb+') as f:
            img_url2 = 'http:' + img_url
            f.write(requests.get(img_url2).content)
            print(img_url2)
            f.close()


def main():
    url = 'http://dianying.2345.com/top/'
    get_content(url)


if __name__ == "__main__":
    main()
```
Results
A Python crawler implementation for beginners (scraping the latest movie rankings).