[Python learning] Simply crawling pictures from an image gallery

Source: Internet
Author: User
Tags: errno, socket error

Recently, for a class, I have been learning Python and web-crawling topics. Clicking "next" for every single picture in an online gallery is slow and tedious, so out of boredom I used Python to write a simple crawler for the pictures in the gallery. This post mainly shares how to crawl the HTML and how to download pictures with Python; I hope it helps you. The pictures on the site are quite beautiful, so I also recommend visiting the original site to download them and support it.
Browsing the site reveals its gallery URLs, where all the list pages run from 0_0_1 to 0_0_75:
http://pic.yxdown.com/list/0_0_1.html
http://pic.yxdown.com/list/0_0_75.html
Browsing further shows the site has 75 list pages; each page holds a number of topics, and each topic has a corresponding set of pictures.
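Since the 75 list pages share one URL pattern, they can all be generated with a short loop. A minimal sketch (Python 3 shown here; the article's script targets Python 2):

```python
# Build the URLs of all 75 list pages from the shared pattern.
pages = ['http://pic.yxdown.com/list/0_0_%d.html' % n for n in range(1, 76)]
print(len(pages))   # → 75
print(pages[0])     # → http://pic.yxdown.com/list/0_0_1.html
print(pages[-1])    # → http://pic.yxdown.com/list/0_0_75.html
```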
The source code is as follows:
(You need to create the E:\\picture3 folder locally, and a yxdown folder under the Python working directory.)
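As an alternative to creating the folders by hand, they could be created from code. A hypothetical helper sketch (Python 3; relative paths are used here so it runs anywhere, whereas the article uses E:\picture3 on Windows):

```python
import os

# Create the crawler's output folders if they do not already exist.
# Relative paths stand in for the article's Windows paths.
for folder in ('picture3', 'yxdown'):
    if not os.path.isdir(folder):
        os.makedirs(folder)

print(os.path.isdir('picture3') and os.path.isdir('yxdown'))  # → True
```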
# coding=utf-8
# Declares the source encoding; the default is ASCII.
# Reference: https://www.python.org/dev/peps/pep-0263/
import urllib
import time
import re
import os

'''
Python: download yxdown gallery pictures
by: eastmount
'''

'''
**************************************************
Step 1: traverse the list pages and get the topic URLs
http://pic.yxdown.com/list/0_0_1.html
http://pic.yxdown.com/list/0_0_75.html
**************************************************
'''
fileurl = open('yxdown_url.txt', 'w')
fileurl.write('**************** get the gallery picture URLs *************\n\n')
# It is recommended to traverse one page at a time: num=3 while num<=3,
# then num=4 while num<=4, rather than all of 1-75 at once
num = 3
while num <= 3:
    temp = 'http://pic.yxdown.com/list/0_0_' + str(num) + '.html'
    content = urllib.urlopen(temp).read()
    open('yxdown_' + str(num) + '.html', 'w+').write(content)
    print temp
    fileurl.write('**************** page ' + str(num) + ' *************\n\n')
    # Crawl the URL of each topic on the page:
    # <div class="cbmiddle"></div> <a target="_blank" href="/html/5533.html">
    count = 1  # counts the topics on the current page (pages 1-75)
    res_div = r'<div class="cbmiddle">(.*?)</div>'
    m_div = re.findall(res_div, content, re.S | re.M)
    for line in m_div:
        # fileurl.write(line + '\n')
        # Get the URL of every topic on the page and write it out
        if "_blank" in line:  # skip list links such as list/1_0_1.html, list/2_0_1.html
            # Get the topic title
            fileurl.write('\n\n********************************************\n')
            title_pat = r'<b class="imgname">(.*?)</b>'
            title_ex = re.compile(title_pat, re.M | re.S)
            title_obj = re.search(title_ex, line)
            title = title_obj.group()
            print unicode(title, 'utf-8')
            fileurl.write(title + '\n')
            # Get the topic URL
            res_href = r'<a target="_blank" href="(.*?)"'
            m_linklist = re.findall(res_href, line)
            # print unicode(str(m_linklist), 'utf-8')
            for link in m_linklist:
                fileurl.write(str(link) + '\n')  # e.g. "/html/5533.html"
                '''
                Step 2: go to the specific picture page and download its HTML
                http://pic.yxdown.com/html/5533.html#p=1
                Note: create the yxdown folder locally first, otherwise
                IOError: No such file or directory
                '''
                # Downloads the HTML page without the original pictures; appending
                # '#p=1' raises "HTTP Error 400. The request URL is invalid."
                html_url = 'http://pic.yxdown.com' + str(link)
                print html_url
                html_content = urllib.urlopen(html_url).read()  # page content
                # May be commented out to skip saving the static HTML
                open('yxdown/yxdown_html' + str(count) + '.html', 'w+').write(html_content)
                '''
                Step 3: go to the picture view and download the pictures
                The picture page address is http://pic.yxdown.com/html/5530.html#p=1  #p=2
                Clicking "View original" triggers the HTML below:
                <a href="javascript:;" onclick="return false;" id="photoOriginal">View original</a>
                It is driven by JavaScript, and the page stores all picture links
                inside <script></script>, e.g.
                "original":"http://i-2.yxdown.com/2015/3/18/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg"
                '''
                html_script = r'<script>(.*?)</script>'
                m_script = re.findall(html_script, html_content, re.S | re.M)
                for script in m_script:
                    res_original = r'"original":"(.*?)"'  # original picture
                    m_original = re.findall(res_original, script)
                    for pic_url in m_original:
                        print pic_url
                        fileurl.write(str(pic_url) + '\n')
                        '''
                        Step 4: download the pictures
                        If the site checks browser information, as Wikipedia does,
                        add the following code:
                            class AppURLopener(urllib.FancyURLopener):
                                version = "Mozilla/5.0"
                            urllib._urlopener = AppURLopener()
                        References: http://bbs.csdn.net/topics/380203601 and
                        http://www.lylinux.org/... (multithreaded picture download in Python)
                        '''
                        filename = os.path.basename(pic_url)  # strip the directory path, keep the file name
                        # "No such file or directory": create the picture3 folder first
                        urllib.urlretrieve(pic_url, 'E:\\picture3\\' + filename)
                        # http://pic.yxdown.com/html/5519.html
                        # IOError: [Errno socket error] [Errno 10060]
                break  # process only the first URL; otherwise the same URL appears twice
            count = count + 1  # one more topic handled on the current page
            time.sleep(0.1)
        else:
            print 'no URL about content'
    time.sleep(1)
    num = num + 1
else:
    print 'Download over!!!'
The pictures downloaded from http://pic.yxdown.com/list/0_0_1.html into the E:\\picture folder are as follows:

The pictures downloaded from http://pic.yxdown.com/list/0_0_3.html into the E:\\picture3 folder are as follows:

Since the code comments already contain detailed steps, the following is only a brief walkthrough of the procedure.
1. First, simply traverse the site to get the topic URLs on each page. Each page contains many topics, and each topic has the following format:
<!-- Step 1: the crawled HTML code is as follows -->
<div class="conbox">
  <div class="cbtop"></div>
  <div class="cbmiddle">
    <a target="_blank" href="/html/5533.html" class="proimg">
      <strong></strong>
      <p>
        <span><b></b>1836 people have seen</span>
        <em><b></b>10 pictures</em>
      </p>
      <b class="imgname">Miss big miss! The esports goddess of League of Legends</b>
    </a>
    <a target="_blank" href="/html/5533.html" class="pllink"><em>1</em> comments</a>
  </div>
  <div class="cbbottom"></div>
  <a target="_blank" class="plbtn" href="/html/5533.html"></a>
</div>
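The href extraction described next can be sketched on a trimmed copy of a block like the one above (Python 3 shown here; the article's script uses Python 2 and the same regular expressions):

```python
import re

# A trimmed copy of one crawled topic block.
content = '''<div class="cbmiddle">
<a target="_blank" href="/html/5533.html" class="proimg"></a>
<a target="_blank" href="/html/5533.html" class="pllink"><em>1</em> comments</a>
</div>'''

# Find each cbmiddle block, then pull out the first href and stitch the URL.
for block in re.findall(r'<div class="cbmiddle">(.*?)</div>', content, re.S | re.M):
    if "_blank" in block:  # skip list/1_0_1.html style navigation links
        links = re.findall(r'<a target="_blank" href="(.*?)"', block)
        topic_url = 'http://pic.yxdown.com' + links[0]
        print(topic_url)  # → http://pic.yxdown.com/html/5533.html
```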
The page is made up of many <div class="conbox">...</div> blocks; we only need to extract the href from <a target="_blank" href="/html/5533.html" class="proimg">, then stitch it onto the site domain to reach the specific topic page. 2. Next, go to the specific image page and download its HTML, for example:
http://pic.yxdown.com/html/5533.html#p=1
You can also comment out the code that saves the local HTML page. On the site you must click "View original" to download the full-size image; right-clicking can only save the page HTML. 3. I originally planned to analyze the "View original" URL to implement the download, the same way other sites are handled by analyzing the "next page" link. But I found it is driven by JavaScript, namely:
<a href="javascript:;" onclick="return false;" id="photoOriginal">View original</a>
Meanwhile, the page stores all of its picture links in the following <script></script> block:
<script>var images = [
{"big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
 "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
 "original":"http://i-2.yxdown.com/2015/3/18/6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg",
 "title":"", "descript":"", "id":75109},
{"big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/fec26de9-8727-424a-b272-f2827669a320.jpg",
 "thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/fec26de9-8727-424a-b272-f2827669a320.jpg",
 "original":"http://i-2.yxdown.com/2015/3/18/fec26de9-8727-424a-b272-f2827669a320.jpg",
 "title":"", "descript":"", "id":75110}, ...</script>
From it you can get the original image ("original"), the thumbnail ("thumb"), and the large image ("big"), and extract the download URL via a regular expression:
res_original = r'"original":"(.*?)"'  # original picture
m_original = re.findall(res_original, script)
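Run on a shortened stand-in for the script content above, the regex behaves like this (Python 3 sketch; "demo.jpg" replaces the longer real file names):

```python
import re

# Shortened stand-in for the <script> content shown above.
script = ('var images = [{"big":"http://i-2.yxdown.com/2015/3/18/KDkwMHgp/demo.jpg",'
          '"thumb":"http://i-2.yxdown.com/2015/3/18/KHgxMjAp/demo.jpg",'
          '"original":"http://i-2.yxdown.com/2015/3/18/demo.jpg",'
          '"title":"","descript":"","id":75109}]')

res_original = r'"original":"(.*?)"'  # non-greedy: stop at the first closing quote
m_original = re.findall(res_original, script)
print(m_original)  # → ['http://i-2.yxdown.com/2015/3/18/demo.jpg']
```

The non-greedy `(.*?)` matters: a greedy `(.*)` would run past the closing quote to the last `"` in the string.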
4. The final step is downloading the pictures. I am not yet comfortable using threads here, so I simply added a time.sleep(0.1) call between downloads. Downloads may also be blocked the way Wikipedia limits access by browser identification; in that case add the corresponding setting. The core code is as follows:
import os
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"

urllib._urlopener = AppURLopener()
url = "http://i-2.yxdown.com/2015/2/25/c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg"
filename = os.path.basename(url)
urllib.urlretrieve(url, filename)
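Note that FancyURLopener and the _urlopener hook belong to Python 2's urllib; in Python 3 the class was deprecated and later removed. A hedged Python 3 equivalent sets the User-Agent on a Request object instead. Since the 2015 image URL may no longer be online, this sketch only builds the request and leaves the actual download commented out:

```python
import os
import urllib.request

# Python 3 sketch of the same idea: send a browser-like User-Agent so sites
# that reject Python's default agent will respond. The URL is the article's
# example and may be dead, so the download itself is left commented out.
url = "http://i-2.yxdown.com/2015/2/25/c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
filename = os.path.basename(url)  # strip the directory path, keep the file name
print(filename)  # → c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg

# To actually download:
# with urllib.request.urlopen(req) as resp, open(filename, "wb") as f:
#     f.write(resp.read())
```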
I also created the picture3 folder locally, and a txt file records the obtained URLs, as shown below:
Finally, I hope this article is helpful. In short, it covers two things: how to analyze page source and extract specified URLs with regular expressions, and how to download pictures with Python. If the article has shortcomings, please forgive me!
(By: eastmount, 2015-3-20, 5 p.m., http://blog.csdn.net/eastmount/)

