[Python learning] A simple crawler for pictures in a gallery site

Source: Internet
Author: User
Tags: socket error

Recently my teacher has had me learning Python and Wikipedia-related knowledge, and out of boredom I used Python to write a simple crawler for the pictures in a gallery site, because clicking through each picture one by one is tedious and a waste of time. This post mainly shares how to crawl the HTML and how to download images with Python. I hope it helps. I also found that the site's pictures are really beautiful, so I suggest visiting the original site to download them and supporting the site rather than hurting it.
By browsing the site you can discover its gallery URLs. The list pages run from 0_0_1 to 0_0_75:
http://pic.yxdown.com/list/0_0_1.html
http://pic.yxdown.com/list/0_0_75.html
Each of the 1-75 list pages contains a number of topics, and each topic contains more than one picture.
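As a quick illustration (a minimal sketch, not part of the original post, written in the same Python 2 urllib style as the script below; only the 1-75 page range comes from the URLs above), the list pages can be enumerated like this:

# coding=utf-8
# Minimal sketch: enumerate the gallery list pages
import urllib

for num in range(1, 76):  # list pages 0_0_1.html .. 0_0_75.html
    list_url = 'http://pic.yxdown.com/list/0_0_' + str(num) + '.html'
    print list_url
    # content = urllib.urlopen(list_url).read()  # fetch the page HTML when needed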


        The source code is as follows
        (you need to create a local E:\Picture3 directory and a yxdown directory under the Python execution directory first; a small helper that creates them is sketched after the listing):

# coding=utf-8
# Declare the encoding (the default is ASCII), see https://www.python.org/dev/peps/pep-0263/
import urllib
import time
import re
import os

'''
Python: download images from the yxdown gallery  by:eastmount
'''

'''
**************************************************
#Step 1: traverse the list pages and get the URL of each topic
#http://pic.yxdown.com/list/0_0_1.html
#http://pic.yxdown.com/list/0_0_75.html
**************************************************
'''
fileurl = open('yxdown_url.txt', 'w')
fileurl.write('**************** gallery picture URLs *************\n\n')
# It is recommended to crawl one list page at a time: num=3 / while num<=3,
# then num=4 / while num<=4 next time, instead of 1-75 in one go
num = 3
while num <= 3:
    temp = 'http://pic.yxdown.com/list/0_0_' + str(num) + '.html'
    content = urllib.urlopen(temp).read()
    open('yxdown_' + str(num) + '.html', 'w+').write(content)
    print temp
    fileurl.write('**************** page ' + str(num) + ' *************\n\n')
    # Crawl the URLs of the topics on this page
    # <div class="cbmiddle"></div> <a target="_blank" href="/html/5533.html">
    count = 1  # counts the detail pages on each of the 1-75 list pages
    res_div = r'<div class="cbmiddle">(.*?)</div>'
    m_div = re.findall(res_div, content, re.S | re.M)
    for line in m_div:
        # fileurl.write(line + '\n')
        # Get the URL of every topic on this page and write it out
        if "_blank" in line:  # skip list links such as list/1_0_1.html, list/2_0_1.html
            # Get the topic title
            fileurl.write('\n\n********************************************\n')
            title_pat = r'<b class="imgname">(.*?)</b>'
            title_ex = re.compile(title_pat, re.M | re.S)
            title_obj = re.search(title_ex, line)
            title = title_obj.group()
            print unicode(title, 'utf-8')
            fileurl.write(title + '\n')
            # Get the URL
            res_href = r'<a target="_blank" href="(.*?)"'
            m_linklist = re.findall(res_href, line)
            # print unicode(str(m_linklist), 'utf-8')
            for link in m_linklist:
                fileurl.write(link + '\n')  # e.g. "/html/5533.html"

                '''
                **************************************************
                #Step 2: go to the detail page and download its HTML
                #http://pic.yxdown.com/html/5533.html#p=1
                #Note: create the yxdown folder locally first,
                #otherwise: No such file or directory
                **************************************************
                '''
                # Downloading the HTML page yields no original image; appending '#p=1'
                # raises: HTTP Error 400. The request URL is invalid.
                html_url = 'http://pic.yxdown.com' + str(link)
                print html_url
                html_content = urllib.urlopen(html_url).read()  # content of the detail page
                # The next line can be commented out to skip saving the static HTML
                open('yxdown/yxdown_html' + str(count) + '.html', 'w+').write(html_content)

                '''
                #Step 3: go to the picture view and download the images
                #The picture pages are http://pic.yxdown.com/html/5530.html#p=1  #p=2
                #Clicking "view original image" corresponds to this HTML:
                #<a href="javascript:;" style="" onclick="return false;">View original</a>
                #It is done in JavaScript, and the page stores all picture links in <script></script>
                #Extract "original":"http://i-2.yxdown.com/2015/3/18/...3158d6ad23e.jpg"
                '''
                html_script = r'<script>(.*?)</script>'
                m_script = re.findall(html_script, html_content, re.S | re.M)
                for script in m_script:
                    res_original = r'"original":"(.*?)"'  # original image
                    m_original = re.findall(res_original, script)
                    for pic_url in m_original:
                        print pic_url
                        fileurl.write(str(pic_url) + '\n')

                        '''
                        #Step 4: download the image
                        #If the site checks browser information (as Wikipedia does), add:
                        class AppURLopener(urllib.FancyURLopener):
                            version = "Mozilla/5.0"
                        urllib._urlopener = AppURLopener()
                        #See http://bbs.csdn.net/topics/380203601 and a post on
                        #www.lylinux.org about multithreaded image downloads in Python
                        '''
                        filename = os.path.basename(pic_url)  # strip the path, keep the file name
                        # No such file or directory: create E:\Picture3 first
                        urllib.urlretrieve(pic_url, 'E:\\Picture3\\' + filename)
                        # http://pic.yxdown.com/html/5519.html
                        # IOError: [Errno socket error] [Errno 10060]

                break  # take only one URL, otherwise two identical URLs are output
            count = count + 1  # one more detail page on the current list page
            time.sleep(0.1)
        else:
            print 'no url about content'
    time.sleep(1)
    num = num + 1
else:
    print 'Download Over!!!'

The pictures downloaded from http://pic.yxdown.com/list/0_0_1.html into the E:\Picture directory look like the following:

The pictures downloaded from http://pic.yxdown.com/list/0_0_3.html into the E:\Picture3 directory look like the following:

The specific steps are already explained in the code comments; the following only walks through the process.


1. First, traverse the site and get the URL of each topic on every list page. Each page contains many topics, and a topic's HTML has the following format:

<!--first step Crawled HTML code such as the following--><div class= "Conbox" > <div class= "cbtop" > </div> <div class= "Cbmiddle" > < ; a target= "_blank" href= "/html/5533.html" class= "proimg" >  

Count the heroes of the League of Legends "/> <strong></strong> <p> <span>b></b>1836 people have seen </ Span> <em><b></b>10 Zhang </em> </p> <b class= "Imgname" >miss big miss!

Count the heroes of the League of Legends </b> </a> <a target= "_blank" href= "/html/5533.html" class= "Pllink" ><em >1</em> comments </a> </div> <div class= "Cbbottom" > </div> <a target= "_blank" class= "PlB TN "href="/html/5533.html "></A></DIV>

The page is made up of many such <div class="conbox"></div> blocks (each one is a topic card in the layout). From every block we only need to extract the href in <a target="_blank" href="/html/5533.html" class="proimg">, and then reach the topic's detail page by concatenating it onto the site URL, as the sketch below shows.
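Here is a minimal, self-contained version of that extraction (it reuses the regular expressions from the full script above; sample_html is just a shortened stand-in for a crawled list page):

# coding=utf-8
# Minimal sketch: pull the topic hrefs out of the <div class="cbmiddle"> blocks
import re

sample_html = '''<div class="cbmiddle"> <a target="_blank" href="/html/5533.html" class="proimg">
<b class="imgname">Miss big ranking!</b> </a> </div>'''  # stand-in for a crawled list page

res_div = r'<div class="cbmiddle">(.*?)</div>'
for block in re.findall(res_div, sample_html, re.S | re.M):
    if "_blank" in block:  # skip list/x_0_1.html style links
        for link in re.findall(r'<a target="_blank" href="(.*?)"', block):
            print 'http://pic.yxdown.com' + link  # -> http://pic.yxdown.com/html/5533.html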
2. Go to the topic's detail page and download its HTML, for example:
http://pic.yxdown.com/html/5533.html#p=1
The line that saves the local HTML copy is commented in the code. On this page you have to click "View original" before the original image can be downloaded; right-clicking only lets you save the page as HTML.
3. I originally intended to analyse the URL behind "View original" to implement the download, the same way other sites are handled by analysing their "next page" link.

But I found that it is implemented in JavaScript, namely:
<a href="javascript:;" onclick="return false;" id="photoOriginal">View original</a>
At the same time, the page stores all of the picture links in a <script></script> block like the following:

<script>var images = [{"Big": "http://i-2.yxdown.com/2015/3/18/KDkwMHgp/ 6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg ",  " thumb ":" http://i-2.yxdown.com/2015/3/18/KHgxMjAp/ 6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg ",  " original ":" http://i-2.yxdown.com/2015/3/18/ 6381ccc0-ed65-4422-8671-b3158d6ad23e.jpg ",  " title ":" "," descript ":" "," id ": 75109},{" Big ":"/HTTP/ I-2.yxdown.com/2015/3/18/kdkwmhgp/fec26de9-8727-424a-b272-f2827669a320.jpg ",  " thumb ":"/http I-2.yxdown.com/2015/3/18/khgxmjap/fec26de9-8727-424a-b272-f2827669a320.jpg ",  " original ":"/http I-2.yxdown.com/2015/3/18/fec26de9-8727-424a-b272-f2827669a320.jpg ",  " title ":" "," descript ":" "," id ": 75110}, ...</script>
From this we can obtain the original image ("original"), the thumbnail ("thumb"), and the large image ("big"). The download URL is extracted with a regular expression:
res_original = r'"original":"(.*?)"'  # original image
m_original = re.findall(res_original, script)
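Putting steps 2 and 3 together, a minimal sketch that fetches one topic page and lists its original-size image URLs could look like this (the topic URL /html/5533.html is simply the example used throughout this post):

# coding=utf-8
# Minimal sketch: fetch one topic page and print its original-size image URLs
import re
import urllib

html_url = 'http://pic.yxdown.com/html/5533.html'  # example topic page, without the '#p=1' fragment
html_content = urllib.urlopen(html_url).read()

for script in re.findall(r'<script>(.*?)</script>', html_content, re.S | re.M):
    for pic_url in re.findall(r'"original":"(.*?)"', script):
        print pic_url  # e.g. http://i-2.yxdown.com/2015/3/18/6381ccc0-...jpg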
4. The final step is to download the images. I am not very good with threads, so I simply added time.sleep(0.1) between downloads. Downloads may also be blocked the way Wikipedia restricts access, which requires setting a browser user agent. The core code is as follows:
import os
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"

urllib._urlopener = AppURLopener()

url = "http://i-2.yxdown.com/2015/2/25/c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg"
filename = os.path.basename(url)
urllib.urlretrieve(url, filename)
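For readers on Python 3, where urllib was split into urllib.request and FancyURLopener is deprecated, a rough equivalent of the same user-agent trick (not from the original post) would be:

# Rough Python 3 equivalent of the snippet above
import os
import urllib.request

url = "http://i-2.yxdown.com/2015/2/25/c205972d-d858-4dcd-9c8b-8c0f876407f8.jpg"
filename = os.path.basename(url)

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})  # spoof a browser UA
with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
    out.write(resp.read())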
At the same time I also created the local directory Picture3 and recorded the crawled URLs in a txt file, as shown below:
        Finally, I hope this article is helpful to everyone. In brief, it covers two things: how to analyse page source code with regular expressions to extract specified URLs, and how to download images with Python. If the article has shortcomings, please forgive me!


      (By: eastmount, 2015-3-20, 5 pm  http://blog.csdn.net/eastmount/)
