Using urllib2 to implement a simple web crawler (1)

Source: Internet
Author: User

Anyone who plays with Python will sooner or later write a crawler for fun, usually to scrape pictures of one sort or another, and I am no exception.

Let's start with the most basic approach, urllib2 plus regular expressions, and try requests later.
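Before getting into the concrete site, here is a minimal sketch of what the urllib2 + regular expression approach looks like (Python 2). The URL and the regex below are placeholders rather than the patterns used later, and login/cookie handling is omitted:

    # Minimal urllib2 + re sketch (placeholder URL and pattern, no login handling)
    import re
    import urllib2

    def fetch_page(url):
        """Download a page and return its HTML as a string."""
        request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        response = urllib2.urlopen(request, timeout=10)
        return response.read()

    def extract_links(html):
        """Extract (name, url) pairs from anchor tags with a simple regex."""
        pattern = re.compile(r'<a href="(?P<url>[^"]+)"[^>]*>(?P<name>[^<]+)</a>')
        return [(m.group('name'), m.group('url')) for m in pattern.finditer(html)]

    if __name__ == '__main__':
        html = fetch_page('http://example.com/artists?page=1')  # placeholder URL
        for name, url in extract_links(html):
            print name, url

The real patterns depend entirely on the HTML of the target site, so the regex above is only there to show the general shape of the technique.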

Background: suppose there is a website that introduces artists and their works. After logging in, each page shows the artists' avatars and a list of their names (there are many artists, so there are many such pages);

Clicking an artist's avatar or name takes you to the artist's homepage, which contains a detailed introduction of the artist and a list of works (each artist has many works, so this list may also span many pages);

Clicking one of the works takes you to the work's detail page, which contains the title (ID), a cover image, and the content pictures.

Goal: crawl every artist's information and works. First create a folder named after the artist and write the artist's information into an info.txt file; then create a folder for each work, named after its title (ID); finally save the work's cover and content pictures into that folder.
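To make the target layout concrete, here is a minimal sketch of the directory handling; the base path, artist name, and work ID are placeholders:

    # Target layout (all names are placeholders):
    #   <file_path>/<artist name>/info.txt
    #   <file_path>/<artist name>/<work ID>/<cover and content pictures>
    import os

    def prepare_dirs(file_path, artist_name, work_id, info_text):
        artist_dir = os.path.join(file_path, artist_name)
        if not os.path.exists(artist_dir):
            os.makedirs(artist_dir)                      # folder named after the artist
        with open(os.path.join(artist_dir, 'info.txt'), 'w') as f:
            f.write(info_text)                           # artist's basic information
        work_dir = os.path.join(artist_dir, work_id)
        if not os.path.exists(work_dir):
            os.makedirs(work_dir)                        # folder named after the work's ID
        return work_dir                                  # pictures go here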

Idea (a rough sketch of this loop in code follows the list):

1. Go to the artist list page, extract the names of all artists on the page and the URLs of their homepages, and store them in a dict with the artist's name as key and the homepage URL as value;

2. Traverse the dict from the previous step: first create a folder named after the artist, then go to the artist's homepage, get the artist's details, and write them to the info.txt file;

3. Get the URLs of all works on the artist's homepage and save them in a list;

4. Traverse the list from the previous step, enter each work's detail page, get the work's title (ID), and create a folder named after that ID;

5. At the same time, get the URLs of the work's cover and content pictures and save them in a list;

6. Traverse the list from the previous step and save all the pictures into the local folder;

7. If the artist's works span several pages, go to the next page of the works list and repeat steps 3-6;

8. After all artists on the current artist list page have been crawled, go to the next artist list page and repeat steps 1-7.
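Expressed as code, the loop might look roughly like this. It assumes the Spider class defined below, that __init__ stores its arguments as attributes, and it uses two hypothetical helpers, fetch() and next_page_url(), which are not part of the original skeleton:

    # Rough sketch of steps 1-8; every method used here is still a stub at this point.
    def start_spider(self):
        for page in range(self.start_page, self.end_page):           # step 8: each artist list page
            actresses = self.get_current_page_actresses_url(page)    # step 1: {name: homepage URL}
            for name, homepage_url in actresses.items():             # step 2
                path = self.mkdir_for_actress(name)
                data = self.fetch(homepage_url)                      # hypothetical helper: download HTML
                age = self.get_actress_info(path, data)
                if age < self.min_age or age > self.max_age:
                    continue                                         # outside the age range, skip
                while True:
                    for movie_url in self.get_actress_movies(data):  # steps 3-4
                        movie_data = self.fetch(movie_url)
                        movie_path = self.get_id_and_mkdir(path, movie_data)
                        pictures = self.get_pictures(movie_data)     # step 5
                        self.save_pictures(movie_path, pictures)     # step 6
                    if not self.has_next_page(data):                 # step 7
                        break
                    data = self.fetch(self.next_page_url(data))      # hypothetical helper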

Based on the above ideas, first define the required class and its methods:


class Spider(object):
    def __init__(self, base_url, start_page, end_page, file_path, min_age=20, max_age=35, movie_cnt=50):
        """
        :param base_url: base address of the artist list page
        :param start_page: start page number
        :param end_page: end page number, not included
        :param file_path: local path where files are saved
        :param min_age: the artist's minimum age; younger artists are not crawled
        :param max_age: the artist's maximum age; older artists are not crawled
        :param movie_cnt: maximum number of works to fetch per artist
        :return: None
        """
        pass

    def get_current_page_actresses_url(self, page):
        """
        Get the names and URLs of all artists on the specified page
        :param page: the specified page number
        :return: a dict; key is the artist's name, value is the corresponding URL
        """
        pass

    def mkdir_for_actress(self, name):
        """
        Create a directory named after the artist
        :param name: the artist's name
        :return: the path of the created directory
        """
        pass

    def get_actress_info(self, path, data):
        """
        Get the artist's basic information and save it in an info.txt file
        :param path: directory where the artist's information is stored, returned by mkdir_for_actress
        :param data: HTML of the artist's works list page
        :return: the artist's age
        """
        pass

    def has_next_page(self, data):
        """
        Determine whether the current artist has another page of works
        :param data: HTML of the artist's works list page
        :return: True if a next page exists, otherwise False
        """
        pass

    def get_actress_movies(self, data):
        """
        Get the URLs of all works on the artist's works page and save them in a list
        :param data: HTML of the artist's works list page
        :return: list of work URLs on the artist's current page
        """
        pass

    def get_id_and_mkdir(self, path, data):
        """
        Get a work's ID and create a folder named after that ID
        :param path: directory in which the new folder is created, returned by mkdir_for_actress
        :param data: HTML of the work's detail page
        :return: the path of the created folder
        """
        pass

    def get_pictures(self, data):
        """
        Get the URLs of a work's cover and sample pictures
        :param data: HTML of the work's detail page
        :return: list of cover and sample picture URLs
        """
        pass

    def save_pictures(self, path, url_list):
        """
        Save pictures into the specified local folder
        :param path: folder where the pictures are saved, returned by get_id_and_mkdir
        :param url_list: list of picture URLs to save, returned by get_pictures
        :return: None
        """
        pass

    def start_spider(self):
        """
        Start the crawler; the only method called externally
        :return: None
        """
        pass
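Once the stubs are filled in, the class would be driven like this. This is a minimal, hypothetical usage sketch; the list URL and the save path are placeholders:

    if __name__ == '__main__':
        spider = Spider(base_url='http://example.com/actresses',  # placeholder artist list URL
                        start_page=1,
                        end_page=5,               # pages 1-4 will be crawled
                        file_path='D:/crawl',     # placeholder local save path
                        min_age=20,
                        max_age=35,
                        movie_cnt=50)
        spider.start_spider()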
