Getting Started with the Python Crawling Framework Scrapy: Page Extraction

Preface

Scrapy is an excellent crawling framework. It not only provides basic, ready-to-use components out of the box, but also supports powerful customization to fit your own needs. This article describes page extraction with the Python crawling framework Scrapy and is shared for your reference.

Before getting started, you can refer to this article on the Scrapy framework: http://www.bkjia.com/article/87820.htm

Next, we will create a crawler project and use the Tuchong photo site (tuchong.com) as an example to capture images.

I. Content Analysis

Open tuchong.com; the "Discover" and "Tags" entries in the top menu classify images under various tags. Click a tag, for example "beauty"; the link of that page is https://tuchong.com/tags/beauty/. This is the page we will crawl.

After opening the page, you can click a gallery to view its images in full screen; scrolling down loads more galleries, and there is no pagination. In Chrome, right-click and choose "Inspect" to open the developer tools and examine the page source. The content section looks like this:

<div class="content"> <div class="widget-gallery"> <ul class="pagelist-wrapper">  <li class="gallery-item...

You can see that each li.gallery-item is the entry to a gallery; they sit under ul.pagelist-wrapper, with div.widget-gallery as the container. Selecting them with XPath would look like //div[@class="widget-gallery"]/ul/li. Following normal page logic, we would find the corresponding link address inside each li.gallery-item and then go one level deeper to capture the images.
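If the gallery list really were rendered in the server-side HTML, the extraction inside a Scrapy callback might look roughly like the sketch below (purely illustrative and not part of this project; it assumes import scrapy, that the method sits in a scrapy.Spider subclass, and that each li contains a link; parse_gallery is a hypothetical callback):

# Sketch only: how extraction might look if the gallery list were in the HTML.
# The link position inside each <li> and the parse_gallery callback are assumptions.
def parse(self, response):
    for li in response.xpath('//div[@class="widget-gallery"]/ul/li'):
        href = li.xpath('.//a/@href').extract_first()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_gallery)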

However, if you use an HTTP debugging tool similar to Postman to request this page, the obtained content is:

<div class="content"> <div class="widget-gallery"></div></div>

That is, there is no actual gallery content. We can conclude that the page uses Ajax: the gallery content is requested only while the browser loads the page and is then inserted into div.widget-gallery. With the developer tools you can see that the XHR request address is as follows:

https://tuchong.com/rest/tags/beauty/posts?page=1&count=20&order=weekly&before_timestamp=

The parameters are simple: page is the page number, count is the number of galleries per page, order is the sort order, and before_timestamp is empty. Because Tuchong is a content-oriented site, before_timestamp should be a time value so that different content is shown at different times; here we ignore it and simply capture content starting from the latest pages, without considering time.
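Before writing the spider, the endpoint can be probed quickly outside Scrapy, for example with the requests library (a minimal sketch using the same tag and parameters as above; not part of the project code):

# Quick probe of the Ajax endpoint (sketch only)
import requests

url = 'https://tuchong.com/rest/tags/beauty/posts'
params = {'page': 1, 'count': 20, 'order': 'weekly', 'before_timestamp': ''}
data = requests.get(url, params=params).json()
print(data.get('result'), len(data.get('postList', [])))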

The request result is in JSON format, which reduces the crawling difficulty. The result looks like this:

{"PostList": [{"post_id": "15624611", "type": "multi-photo", "url": "https://weishexi.tuchong.com/15624611/", "site_id ": "443122", "author_id": "443122", "published_at": "October 18 18:01:03", "excerpt": "", "favorites": 4052, "comments ": 353, "rewardable": true, "parent_comments": "165", "rewards": "2", "views": 52709, "title ": "breeze does not dry Autumn", "image_count": 15, "images": [{"img_id": 11585752, "user_id": 443122, "title ":"", "excerpt": "", "width": 5016, "height": 3840 },{ "img_id": 11585737, "user_id": 443122, "title ":"", "excerpt": "", "width": 3840, "height": 5760},...], "title_image": null, "tags": [{"tag_id": 131, "type": "subject", "tag_name": "portrait", "event_type ": "", "vote": "" },{ "tag_id": 564, "type": "subject", "tag_name": "beauty", "event_type ": "", "vote": ""}], "favorite_list_prefix": [], "reward_list_prefix": [], "comment_list_prefix": [], "cover_image_src ": "https://photo.tuchong.com/443122/g/11585752.webp", "is_favorite": false}], "siteList ":{...}, "following": false, "coverUrl": "https://photo.tuchong.com/443122/ft640/11585752.webp", "tag_name": "beauty", "tag_id": "564", "url": "https://tuchong.com/tags/%E7%BE%8E%E5%A5%B3 ", "more": true, "result": "SUCCESS "}

From the attribute names it is easy to work out what the corresponding content means. Here we only care about the postList attribute: each element of this array is a gallery, and we need several attributes of each gallery element:

  • url: the browsing address of a single gallery page
  • post_id: the gallery number, unique on the site, which can be used to determine whether the content has already been crawled
  • site_id: the author's site number, used to build the image source link
  • title: title
  • excerpt: summary text
  • type: gallery type. Two types have been found so far: multi-photo is a pure image gallery, and text is a mixed text-and-image article page. The two types have different content structures and require different capturing methods; in this example only the image type is captured, and the text type is simply discarded.
  • tags: gallery tags; a gallery can have several
  • image_count: number of images
  • images: the image list, an array of objects, each containing an img_id attribute that we need

From analysis of the image browsing page, the address of an image follows this format: https://photo.tuchong.com/{site_id}/f/{img_id}.jpg, which can easily be assembled from the information above.
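For example, assembling the image address in Python is a one-liner (a small sketch; the format string matches the one used in the spider code later):

# Build a full-size image URL from a gallery's site_id and an image's img_id
def image_url(site_id, img_id):
    return 'https://photo.tuchong.com/%s/f/%s.jpg' % (site_id, img_id)

# image_url('443122', 11585752) -> 'https://photo.tuchong.com/443122/f/11585752.jpg'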

II. Create a Project

  • Open the cmder command-line tool and enter workon scrapy to enter the virtual environment created earlier. The (scrapy) identifier now appears before the command-line prompt, indicating that you are inside the virtual environment and that the related paths have been added to the PATH environment variable for development use.
  • Enter scrapy startproject tuchong to create the project tuchong.
  • Enter the project's main directory and run scrapy genspider photo tuchong.com to create a crawler named photo (the crawler name cannot be the same as the project name) for the tuchong.com domain (this can be modified later; an approximate address is enough here). A project can contain multiple crawlers.

After the above steps, the project automatically creates some files and settings. The directory structure is as follows:

(PROJECT)
│ scrapy.cfg
│
└─tuchong
  │ items.py
  │ middlewares.py
  │ pipelines.py
  │ settings.py
  │ __init__.py
  │
  ├─spiders
  │ │ photo.py
  │ │ __init__.py
  │ │
  │ └─__pycache__
  │    __init__.cpython-36.pyc
  │
  └─__pycache__
     settings.cpython-36.pyc
     __init__.cpython-36.pyc
  • scrapy.cfg: basic settings
  • items.py: structure definition of the captured items
  • middlewares.py: middleware definitions, which do not need to be changed in this example
  • pipelines.py: pipeline definitions, used for processing the data after it is captured
  • settings.py: global settings
  • spiders\photo.py: the crawler itself, which defines how to capture the required data

III. Main Code

In items.py, create a TuchongItem class and define the required attributes. Attributes are scrapy.Field instances and can hold values such as strings, numbers, lists, and dictionaries:

import scrapy

class TuchongItem(scrapy.Item):
    post_id = scrapy.Field()
    site_id = scrapy.Field()
    title = scrapy.Field()
    type = scrapy.Field()
    url = scrapy.Field()
    image_count = scrapy.Field()
    images = scrapy.Field()
    tags = scrapy.Field()
    excerpt = scrapy.Field()
    ...

The values of these attributes are assigned in the crawler.
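A TuchongItem is then used much like a dictionary whose keys are the declared fields, for example (a small illustrative sketch; the sample values come from the JSON shown earlier, and the import assumes the project is named tuchong as above):

# Sketch: filling in a TuchongItem like a dictionary
from tuchong.items import TuchongItem

item = TuchongItem()
item['post_id'] = '15624611'
item['site_id'] = '443122'
item['title'] = 'breeze does not dry Autumn'
item['images'] = {11585752: 'https://photo.tuchong.com/443122/f/11585752.jpg'}
print(item['title'])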

The spiders\photo.py file was created automatically by the earlier command scrapy genspider photo tuchong.com. Its initial content is as follows:

import scrapy

class PhotoSpider(scrapy.Spider):
    name = 'photo'
    allowed_domains = ['tuchong.com']
    start_urls = ['http://tuchong.com/']

    def parse(self, response):
        pass

Here name is the crawler name; allowed_domains lists the allowed domain names (links outside these domains are discarded; multiple values are allowed); start_urls lists the starting addresses from which crawling begins (multiple values are allowed).

The parse function is the default callback for handling the response of a request; the response parameter is the response object, and the page text is available in response.body. We need to modify the default code slightly so that it sends requests for multiple pages in a loop. This means overriding the start_requests function and building the multi-page requests with a loop. The modified code is as follows:

import scrapy, json
from ..items import TuchongItem

class PhotoSpider(scrapy.Spider):
    name = 'photo'
    # allowed_domains = ['tuchong.com']
    # start_urls = ['http://tuchong.com/']

    def start_requests(self):
        url = 'https://tuchong.com/rest/tags/%s/posts?page=%d&count=20&order=weekly'
        # Capture 10 pages, 20 galleries per page
        # Specify parse as the callback function and return Request objects
        for page in range(1, 11):
            yield scrapy.Request(url=url % ('beauty', page), callback=self.parse)

    # Callback function: process the captured content and fill in the TuchongItem attributes
    def parse(self, response):
        body = json.loads(response.body_as_unicode())
        items = []
        for post in body['postList']:
            item = TuchongItem()
            item['type'] = post['type']
            item['post_id'] = post['post_id']
            item['site_id'] = post['site_id']
            item['title'] = post['title']
            item['url'] = post['url']
            item['excerpt'] = post['excerpt']
            item['image_count'] = int(post['image_count'])
            # Process images into a dict of {img_id: img_url}
            item['images'] = {}
            for img in post.get('images', ''):
                img_id = img['img_id']
                url = 'https://photo.tuchong.com/%s/f/%s.jpg' % (item['site_id'], img_id)
                item['images'][img_id] = url
            # Process tags into an array of tag_name values
            item['tags'] = []
            for tag in post.get('tags', ''):
                item['tags'].append(tag['tag_name'])
            items.append(item)
        return items

After these steps, the captured data is stored in TuchongItem objects and can easily be processed further and saved as structured data.

As mentioned above, not all captured items are needed. In this example we only want galleries of type "multi-photo", and galleries with too few images are not wanted either. Filtering the captured items, and deciding how to save them, is done in pipelines.py. By default a TuchongPipeline class has already been created in this file with a process_item function defined; by modifying this function, only qualifying items are returned. The code is as follows:

from scrapy.exceptions import DropItem
...
    def process_item(self, item, spider):
        # Raise scrapy.exceptions.DropItem to discard items that do not meet the conditions
        if int(item['image_count']) < 3:
            raise DropItem("too few girls: " + item['url'])
        elif item['type'] != 'multi-photo':
            raise DropItem("format incorrect: " + item['url'])
        else:
            print(item['url'])
            return item
...

Of course, you could skip the pipeline and process the data directly in parse; the pipeline just makes the structure clearer, and the more powerful FilesPipeline and ImagesPipeline are also available. process_item is triggered for each captured item, and the open_spider and close_spider functions can also be overridden to handle actions when the crawler opens or closes.
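For example, a pipeline that opens a file when the crawler starts and closes it when the crawler finishes might look like this (an illustrative sketch, not part of this project; the output file name is arbitrary):

# Sketch: using the open_spider / close_spider hooks in a pipeline
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # Called once when the crawler is opened
        self.file = open('galleries.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the crawler is closed
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item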

Note: a pipeline must be registered in the project before it can be used. Add the following to settings.py:

ITEM_PIPELINES = {
    'tuchong.pipelines.TuchongPipeline': 300,  # pipeline name: run priority (smaller numbers run first)
}

In addition, most websites have a Robots.txt exclusion protocol aimed at crawlers. Setting ROBOTSTXT_OBEY = False tells Scrapy to ignore these rules; after all, it is only a gentlemen's agreement. If the website uses User-Agent or IP detection against crawlers, more advanced Scrapy features are needed, which are not covered in this article.
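The relevant entries in settings.py might look like this (a sketch; the User-Agent string is only an example value, not something required by this project):

# settings.py (sketch)
ROBOTSTXT_OBEY = False  # do not obey robots.txt
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0 Safari/537.36'  # example browser UA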

IV. Run

Return to the cmder command line, go to the project directory, and enter the command:

scrapy crawl photo

The crawler outputs all crawling results and debugging information, and at the end lists statistics for the run, for example:

[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 491,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10224,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 27, 7, 20, 24, 414201),
 'item_dropped_count': 5,
 'item_dropped_reasons_count/DropItem': 5,
 'item_scraped_count': 15,
 'log_count/DEBUG': 18,
 'log_count/INFO': 8,
 'log_count/WARNING': 5,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 11, 27, 7, 20, 23, 867300)}

We mainly need to pay attention to ERROR and WARNING entries; here the WARNING lines are actually the DropItem exceptions raised for items that did not meet the conditions.

V. Save the Results

In most cases the captured results need to be saved. By default the properties defined in items.py can be saved to a file; you only need to add the -o {filename} parameter on the command line:

scrapy crawl photo -o output.json  # output as a JSON file
scrapy crawl photo -o output.csv   # output as a CSV file

Note: items are written to the output file after they have passed through the item pipelines, so entries dropped by TuchongPipeline do not appear in it. You can also filter items directly in the parse function and return only the ones you want.
To save the data to a database, extra code is needed. For example, process_item in pipelines.py can be extended as follows:

...
    def process_item(self, item, spider):
        ...
        else:
            print(item['url'])
            self.myblog.add_post(item)  # myblog is a database class that handles database operations
            return item
...

To avoid inserting duplicate content into the database, you can check item['post_id'] beforehand and skip the item if it already exists, for example:
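A minimal sketch of such a check (the exists method on the myblog helper is hypothetical; adapt it to your own database layer, and DropItem reuses the import shown earlier in pipelines.py):

    def process_item(self, item, spider):
        if self.myblog.exists(item['post_id']):  # hypothetical duplicate check
            raise DropItem("duplicate gallery: " + item['url'])
        self.myblog.add_post(item)
        return item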
The content captured in this project only involves text and image links; the image files themselves are not downloaded. To download images, you can use either of the following methods:

Install the Requests module, download the image content in the process_item function, and save the local image path to the database instead of the URL.
Use the ImagesPipeline to download the images (see the sketch below).
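A rough sketch of the second approach with Scrapy's built-in ImagesPipeline (illustrative only, not part of this project; it assumes the images field holds the {img_id: img_url} mapping built in parse, requires the Pillow library, and IMAGES_STORE must also be set in settings.py):

# Sketch: download gallery images with Scrapy's ImagesPipeline
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class TuchongImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # item['images'] is the {img_id: img_url} dict built in parse
        for img_url in item['images'].values():
            yield scrapy.Request(img_url)

# settings.py (sketch):
# ITEM_PIPELINES = {'tuchong.pipelines.TuchongImagesPipeline': 200}
# IMAGES_STORE = 'images'  # local directory for the downloaded files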

Summary

That is all the content of this article. I hope it has some reference and learning value for your study or work. If you have any questions, please leave a message. Thank you for your support.
