Python Crawler (6): Scraping All the Girl Pictures on Jandan.net


In the previous article we scraped the Douban book data; if you ran it successfully, you saw the TXT file appear in your folder. It is a bit like the joy of printing your first Hello World! when you have just started programming. Unlike that exercise, this time we crawl all the girl pictures on jandan.net and save them to a specified folder. The crawl process: start from the first page of the picture section; read the last page number from the page's tags; build the URLs of all pages from that number; crawl every page and collect the URLs of all the pictures on it; finally, request each picture URL and save the image to the folder. Let's start.

From the crawl in the last article we already have a rough idea of how to scrape a site. A site has many pages, but on most sites the HTML tags have the same structure on every page, so once we can parse the content of one page we can parse them all. Before we start, let's analyze the URLs of the girl-picture pages on jandan.net.

First page: http://jandan.net/ooxx/page-1

Second page: http://jandan.net/ooxx/page-2

Last page: http://jandan.net/ooxx/page-93

It is not hard to see that jandan.net's URL scheme is simple: only the page number at the end changes, so we can generate all the page URLs with a loop, as sketched below. But remember that the site is updated every day: today there are 93 pages, tomorrow there may be 94. If we hard-coded the page count, we would have to edit the code before every crawl. That works, but it is a little silly.
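A minimal sketch, assuming the page count is still 93 (it changes daily), of how those URLs could be generated:

# Minimal sketch: build every page URL for an assumed fixed page count of 93
base_url = 'http://jandan.net/ooxx/'
page_urls = [base_url + 'page-' + str(n) for n in range(1, 93 + 1)]
print(page_urls[0])   # http://jandan.net/ooxx/page-1
print(page_urls[-1])  # http://jandan.net/ooxx/page-93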

So instead we let the program read the page count from the page's own tags. If we visit http://jandan.net/ooxx/, that is effectively the last (newest) page; you can try it yourself.

We can clearly see that the last page number is 94, so we only need to scrape it from that page. First we fetch the source code:

import requests
from bs4 import BeautifulSoup

url = 'http://jandan.net/ooxx/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
resp = requests.get(url, headers=headers)

soup = BeautifulSoup(resp.text, 'lxml')

Press F12 and find, in the page source, the tag that holds the last page number 94:

It turns out that 94 sits inside this span tag. The rest is simple:

# Get the highest page number
allpage = soup.find('span', class_='current-comment-page').get_text()[1:-1]

Because the text inside the span is wrapped in square brackets, it looks like a list at first glance, and you might expect to grab the number with [0]. Check it with type(), however, and you will see it is a string, so we slice off the first and last characters to get the page count; the small example below illustrates this.
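For illustration (the '[94]' value here is just an assumed example of what the span contains):

# Illustration only: assume the span's text is the string '[94]'
allpage_text = '[94]'
print(type(allpage_text))       # <class 'str'> -- a string, not a list
print(allpage_text[1:-1])       # '94' -- the surrounding brackets are sliced off
print(int(allpage_text[1:-1]))  # 94 -- an integer, ready to use as the page count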

After we have the page count, we use a loop to build the URL of every page:

base_url = 'http://jandan.net/ooxx/'
urllist = []
# Loop over every page number and build the URL for each page
for page in range(1, int(allpage) + 1):
    allurl = base_url + 'page-' + str(page)
    urllist.append(allurl)

We collect the URLs in a list. Now that we have the URL of every page, we can fetch the content of each one. Let's take the last page as an example.

Again we use the element inspector to find the tags that hold the picture URLs, then grab all the img tags on the page in the usual way:

# CSS selector
allimgs = soup.select('div.text > p > img')

One line of code and we have all the tags. A CSS selector is used here; do you remember this method? You can look back at the earlier articles or at the official BeautifulSoup documentation. If you are not familiar with CSS, it doesn't matter: the find_all() and find() methods can achieve the same thing, as the sketch below shows.
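If you would rather avoid CSS selectors, here is a rough find_all() equivalent of the selector used above (a sketch that assumes the same div.text > p > img nesting):

# Sketch: a find_all() equivalent of soup.select('div.text > p > img')
allimgs = []
for div in soup.find_all('div', class_='text'):
    for p in div.find_all('p', recursive=False):
        allimgs.extend(p.find_all('img', recursive=False))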

I also want to show you a simple trick for writing the CSS selector. Press F12 to open the browser developer tools, find the tag, and right-click it. You will see something like this:

Yes, we can copy the selector directly; what gets pasted is a string like this: #comment-3468457 > div > div > div.text > p > img

We just strip off the leading part; in most cases the portion after the parent tag is enough. That leaves: div.text > p > img

We put it into the code and see whether it works.

The result is a list of img tags, one for each picture on the page.

Clearly, all the picture URLs for this page are in there. The next step is to extract the src attribute of each img tag.

for img in allimgs:
    urls = img['src']
    # Check whether the URL is complete
    if urls[0:5] == 'http:':
        img_url = urls
    else:
        img_url = 'http:' + urls

Some of the URLs inside the tags are incomplete (they are protocol-relative, starting with //), so we check each one and, if needed, prepend the scheme.
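An alternative way to normalize such protocol-relative URLs (a sketch using the standard library rather than the manual check above; the image host in the example is made up):

from urllib.parse import urljoin

# Sketch: urljoin fills in the scheme for a protocol-relative URL
src = '//img.example.com/12345.jpg'               # hypothetical src value from an img tag
img_url = urljoin('http://jandan.net/ooxx/', src)
print(img_url)                                    # http://img.example.com/12345.jpg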

OK, now that we have the picture URLs, the next step is saving the pictures. Remember the demo that saved an image back when we introduced the requests module? Every picture and every video on the web has a unique URL pointing to it, so we only need to request that URL, get the picture's binary data, and save it locally.

imgs = requests.get(img_url, headers=headers)
filename = img_url.split('/')[-1]
# Save the picture
with open(filename, 'wb') as f:
    # Skip pictures that fail to save instead of terminating the program
    try:
        f.write(imgs.content)
        print('Successful image:', filename)
    except:
        print('Failed:', filename)

Note that the binary data of a picture comes from .content, not .text. We also wrap the write in a try/except so that a picture that fails to save is simply skipped and the program keeps running.
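A quick illustration of the difference between the two attributes (reusing the url and headers defined earlier):

resp = requests.get(url, headers=headers)
print(type(resp.text))     # <class 'str'>   -- decoded text, good for parsing HTML
print(type(resp.content))  # <class 'bytes'> -- raw bytes, what we write to an image file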

The crawler is basically done at this point. But if every picture lands in one folder, and that folder is the one the code lives in, it looks a bit messy. We can choose where the pictures are stored with Python's built-in os module; if you are not familiar with it, look it up in the documentation.

import os

# Function that creates a folder, saving everything under the D: drive
def mkdir(path):
    # os.path.exists(name) checks whether a path exists
    # os.path.join(path, name) joins a directory and a file name
    isexists = os.path.exists(os.path.join("D:\\jiandan", path))
    # If it does not exist, create it
    if not isexists:
        print('makedir', path)
        # Create the folder
        os.makedirs(os.path.join("D:\\jiandan", path))
        # Switch into the folder we just created
        os.chdir(os.path.join("D:\\jiandan", path))
        return True
    # If it already exists, return False
    else:
        print(path, 'already exists')
        return False

To create a folder we simply pass a path argument to the function. With that, every feature is implemented; if nothing went wrong, you will find the folder on the D: drive.
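For example, mkdir('93') creates the folder D:\jiandan\93 and switches into it (the page number 93 is just an illustration). On Python 3 you could also skip the manual existence check with exist_ok=True; a minimal sketch:

import os

# Sketch: create the folder without checking for existence first (Python 3)
folder = os.path.join("D:\\jiandan", "93")   # '93' stands in for a page number
os.makedirs(folder, exist_ok=True)
os.chdir(folder)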

If the program errors out, it may be that we accessed the site too often and it banned our IP. In that case we need a proxy. There are plenty of free proxies on the internet that you can find yourself; here is a simple demo of using one. Because free IPs do not survive long, do not rely on the one hard-coded in the code. Later on we can build a proxy pool of our own together.

proxies = {'http': '111.23.10.27:8080'}
try:
    # Normal GET request with the requests library
    resp = requests.get(url, headers=headers)
except:
    # If the request is blocked, retry through the proxy
    resp = requests.get(url, headers=headers, proxies=proxies)
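As a small taste of what that proxy pool might look like (a minimal sketch; the addresses below are placeholders, not working proxies):

import random
import requests

# Minimal sketch of a tiny proxy pool: placeholder addresses, not real proxies
proxy_pool = ['111.23.10.27:8080', '112.95.224.58:3128']

def get_with_proxy(url, headers):
    try:
        # Try a direct request first
        return requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        # If it fails, retry once through a randomly chosen proxy
        proxy = {'http': random.choice(proxy_pool)}
        return requests.get(url, headers=headers, proxies=proxy, timeout=10)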

Okay, finally, the complete code:

# -*- coding: utf-8 -*-
# Author: yukun
import requests
import os
import time
from bs4 import BeautifulSoup


# Request the HTML source
def get_html(url):
    # Specify a browser header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    # Proxy; a free proxy only survives for a while, replace it when it stops working
    proxies = {'http': '111.23.10.27:8080'}
    try:
        # Normal GET request with the requests library
        resp = requests.get(url, headers=headers)
    except:
        # If the request is blocked, retry through the proxy
        resp = requests.get(url, headers=headers, proxies=proxies)
    return resp


# Function that creates a folder, saving everything under the D: drive
def mkdir(path):
    # os.path.exists(name) checks whether a path exists
    # os.path.join(path, name) joins a directory and a file name
    isexists = os.path.exists(os.path.join("D:\\jiandan", path))
    # If it does not exist, create it
    if not isexists:
        print('makedir', path)
        # Create the folder
        os.makedirs(os.path.join("D:\\jiandan", path))
        # Switch into the folder we just created
        os.chdir(os.path.join("D:\\jiandan", path))
        return True
    # If it already exists, return False
    else:
        print(path, 'already exists')
        return False


# Get the picture addresses and call the download function
def get_imgs():
    # Call the function that returns every page URL
    for url in all_page():
        path = url.split('-')[-1]
        # Create a folder for this page
        mkdir(path)
        # Request the HTML source of the page
        html = get_html(url).text
        # Use the lxml parser; html.parser also works
        soup = BeautifulSoup(html, 'lxml')
        # CSS selector
        allimgs = soup.select('div.text > p > img')
        # Call the download function to save the pictures
        download(allimgs)
    # Done
    print('OK')


# Get every page URL
def all_page():
    base_url = 'http://jandan.net/ooxx/'
    # Parse the front page with BeautifulSoup to get the highest page number
    soup = BeautifulSoup(get_html(base_url).text, 'lxml')
    # Get the highest page number
    allpage = soup.find('span', class_='current-comment-page').get_text()[1:-1]
    urllist = []
    # Loop over every page number and build the URL for each page
    for page in range(1, int(allpage) + 1):
        allurl = base_url + 'page-' + str(page)
        urllist.append(allurl)
    return urllist


# Save-picture function; the argument is the list of all img tags on one page
def download(list):
    for img in list:
        urls = img['src']
        # Check whether the URL is complete
        if urls[0:5] == 'http:':
            img_url = urls
        else:
            img_url = 'http:' + urls
        filename = img_url.split('/')[-1]
        # Save the picture
        with open(filename, 'wb') as f:
            # Skip pictures that fail to save instead of terminating the program
            try:
                f.write(get_html(img_url).content)
                print('Successful image:', filename)
            except:
                print('Failed:', filename)


if __name__ == '__main__':
    # Time the run
    t1 = time.time()
    # Call the main function
    get_imgs()
    print(time.time() - t1)

According to the timer, it takes only 146 seconds to crawl every girl picture on the site. Everyone, take care of your health.

Reproduced from: https://www.yukunweb.com/2017/6/python-spider-jiandan-girls/ (will be removed immediately upon request in case of infringement)
