A quick guide to writing a Python crawler from scratch

Goal of this article: write the simplest possible crawler in the shortest time, to capture the post titles and contents of a forum.

Audience: people who have never written a crawler before.

Getting started

0. Preparations

What you need: Python, Scrapy, and an IDE or any text editor.

1. The technical department has decided: you will write the crawler.

Create a working directory, then create a project from the command line. The project here is named miao; replace it with any name you like.

scrapy startproject miao

Then scrapy creates a directory structure for the project.
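A typical layout looks roughly like this (the exact file list varies slightly between Scrapy versions); the crawler scripts go under spiders/:

miao/
    scrapy.cfg        # project configuration file
    miao/             # the project's Python module
        __init__.py
        items.py      # Item definitions go here
        pipelines.py  # pipelines go here
        settings.py   # project settings
        spiders/      # crawler scripts go here
            __init__.py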

Create a Python file in the spiders folder, for example miao.py, as the crawler script. Its content is as follows:

import scrapy

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # start_urls is the list of initial pages we want to crawl.
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # This is the parse function; unless told otherwise, scrapy hands every
    # captured page to this function.
    # Process and analyze the page here. In this example we simply print the page body.
    def parse(self, response):
        print response.body

2. How do I run it?

If you use the command line:

cd miao
scrapy crawl NgaSpider

You will see that the crawler prints out the first page of that forum board. Of course, since there is no processing yet, the HTML tags and JS scripts are printed out together.

Analysis

Next we parse the captured page and dig the post titles out of that pile of HTML and JS. Parsing pages is mostly grunt work, and there are many ways to do it; here we only introduce XPath.

0. Why not try the magic of XPath?

Take a look at what we just captured, or open the page in Chrome and press F12 to inspect its structure. Each title is actually wrapped in an HTML tag like this:

<a href='/read.php?tid=xxxxxx' class='topic'>[Cooperation mode] ideas for changing cooperation mode</a>

We can see that href is the address of the post (we still need to prepend the forum's base address, of course), and the text wrapped by this tag is the title of the post.
So we use XPath to extract every element whose class is 'topic'.
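Before editing the spider, you can also try the XPath interactively in scrapy's built-in shell; a quick sketch (the expression targets the class='topic' tags described above):

scrapy shell "http://bbs.ngacn.cc/thread.php?fid=406"

Then, inside the shell, response holds the downloaded page:

from scrapy import Selector
Selector(response).xpath("//*[@class='topic']").extract_first()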

1. See the XPath in action.

Add this import at the top:

from scrapy import Selector

Change the parse function:

def parse(self, response):
    selector = Selector(response)
    # Use xpath to extract every tag with class='topic'. The result is a list,
    # and every element of the list is the html of one such tag.
    content_list = selector.xpath("//*[@class='topic']")
    # Traverse the list and process each tag.
    for content in content_list:
        # Parse the tag and extract the post title we need.
        topic = content.xpath('string(.)').extract_first()
        print topic
        # Extract the url of the post.
        url = self.host + content.xpath('@href').extract_first()
        print url

Run it again and you will see the titles and urls of all the posts on the first page of the board.

Recursion

Next we capture the content of each post, using Python's yield:

yield Request(url=url, callback=self.parse_topic)

This tells scrapy to crawl that url and then parse the fetched page with the specified parse_topic function.

At this point we also need to define a new function, parse_topic, to parse the content inside a post.

The complete code is as follows:

import scrapy
from scrapy import Selector
from scrapy import Request

class NgaSpider(scrapy.Spider):
    name = "NgaSpider"
    host = "http://bbs.ngacn.cc/"
    # In this example only one page is given as the starting url.
    # Of course, the starting urls could also be read from a database, a file, or anywhere else.
    start_urls = [
        "http://bbs.ngacn.cc/thread.php?fid=406",
    ]

    # Crawler entry point. Initialization, such as reading the starting urls, can be done here.
    def start_requests(self):
        for url in self.start_urls:
            # Add the starting url to scrapy's crawl queue and specify the parse function.
            # scrapy schedules it on its own, visits the url and brings the content back.
            yield Request(url=url, callback=self.parse_page)

    # Board-level parse function: parses the titles and addresses of the posts on a board page.
    def parse_page(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='topic']")
        for content in content_list:
            topic = content.xpath('string(.)').extract_first()
            print topic
            url = self.host + content.xpath('@href').extract_first()
            print url
            # Add the parsed post address to the crawl queue and specify the parse function.
            yield Request(url=url, callback=self.parse_topic)
        # Paging information could be parsed here to crawl multiple pages of the board.

    # Post-level parse function: parses the content of each floor in a post.
    def parse_topic(self, response):
        selector = Selector(response)
        content_list = selector.xpath("//*[@class='postcontent ubbcode']")
        for content in content_list:
            content = content.xpath('string(.)').extract_first()
            print content
        # Paging information could be parsed here to crawl multiple pages of a post.

At this point, the crawler can capture the titles of all posts on the first page of the board, and the content of every floor on the first page of each post. Crawling multiple pages works on the same principle: parse the url of the next page, set a termination condition, and hand it to the appropriate parse function; a sketch follows.
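A rough sketch of that, assuming the board page exposes a next-page link that some XPath can locate (the pager_next expression below is a placeholder; check the real page structure with F12 first):

def parse_page(self, response):
    selector = Selector(response)
    # ... parse the topics on the current page exactly as shown above ...

    # Placeholder XPath for the next-page link; adjust it to the real page.
    next_page = selector.xpath("//a[@class='pager_next']/@href").extract_first()
    if next_page:  # termination condition: stop when there is no next-page link
        yield Request(url=self.host + next_page, callback=self.parse_page)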

Pipelines

This section is about processing the captured and parsed content; with pipelines you can write it to local files or a database.

0. Define an Item

Create an items.py file in the miao folder (scrapy may already have generated one; if so, just edit it).

from scrapy import Item, Field

class TopicItem(Item):
    url = Field()
    title = Field()
    author = Field()

class ContentItem(Item):
    url = Field()
    content = Field()
    author = Field()

Here we define two simple classes to describe the crawler results.

1. Write a processing method

Find the pipelines.py file under the miao folder; scrapy should already have generated it automatically.

We can write our processing class here:

from miao.items import TopicItem, ContentItem

class FilePipeline(object):

    # Every result parsed by the crawler is handed by scrapy to this function for processing.
    def process_item(self, item, spider):
        if isinstance(item, TopicItem):
            # Write to a file or a database here.
            pass
        if isinstance(item, ContentItem):
            # Write to a file or a database here.
            pass
        ## ...
        return item
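As a concrete example of what those branches might do, here is a minimal sketch that appends each item as one line of JSON to a local file (the class name JsonFilePipeline and the file names are arbitrary; to use it, register it in ITEM_PIPELINES just like FilePipeline below):

import json

from miao.items import TopicItem, ContentItem

class JsonFilePipeline(object):

    def process_item(self, item, spider):
        # Append each item as a JSON line to a local file.
        if isinstance(item, TopicItem):
            with open("topics.jl", "a") as f:
                f.write(json.dumps(dict(item)) + "\n")
        if isinstance(item, ContentItem):
            with open("contents.jl", "a") as f:
                f.write(json.dumps(dict(item)) + "\n")
        return item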

2. Call this pipeline from the crawler.

To get items into the pipeline, we only need to yield them from the crawler. For example, the content-parsing function above can be changed to:

def parse_topic(self, response):
    selector = Selector(response)
    content_list = selector.xpath("//*[@class='postcontent ubbcode']")
    for content in content_list:
        content = content.xpath('string(.)').extract_first()
        # The above is the original parsing code.
        # Create a ContentItem and put what the crawler captured into it
        # (requires "from miao.items import ContentItem" at the top of the spider file).
        item = ContentItem()
        item["url"] = response.url
        item["content"] = content
        item["author"] = ""  ## ...
        # This is the call:
        # scrapy hands this item to the FilePipeline we just wrote.
        yield item
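parse_page can be changed in the same way, yielding a TopicItem for each title instead of printing it; a sketch (again assuming "from miao.items import TopicItem" at the top of the spider):

def parse_page(self, response):
    selector = Selector(response)
    content_list = selector.xpath("//*[@class='topic']")
    for content in content_list:
        topic = content.xpath('string(.)').extract_first()
        url = self.host + content.xpath('@href').extract_first()
        # Package the title and url into a TopicItem for the pipelines.
        item = TopicItem()
        item["url"] = url
        item["title"] = topic
        item["author"] = ""
        yield item
        # Still queue the post itself for crawling.
        yield Request(url=url, callback=self.parse_topic)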

3. Specify this pipeline in the configuration file.

Find the settings.py file and add the following to it:

ITEM_PIPELINES = {
    'miao.pipelines.FilePipeline': 400,
}

From now on, whenever the crawler calls

yield item

the item is handed to this FilePipeline for processing. The number 400 indicates its priority.
Multiple pipelines can be configured; scrapy hands each item to every pipeline in order of priority, and the output of each pipeline is passed on to the next. For example:

ITEM_PIPELINES = {
    'miao.pipelines.Pipeline00': 400,
    'miao.pipelines.Pipeline01': 401,
    'miao.pipelines.Pipeline02': 402,
    'miao.pipelines.Pipeline03': 403,
    ## ...
}

Middleware

Through middleware we can modify the request before it is sent; common settings such as the UA, the proxy, and login information can all be configured through middleware.

0. Middleware configuration

Similar to configuring pipelines, add the middleware names to settings.py, for example:

DOWNLOADER_MIDDLEWARES = {
    "miao.middleware.UserAgentMiddleware": 401,
    "miao.middleware.ProxyMiddleware": 402,
}

1. The website checks the UA, so I want to change the UA.

Some websites refuse access without a UA. Create a middleware.py file under the miao folder:

import random

agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
]

class UserAgentMiddleware(object):

    def process_request(self, request, spider):
        # Pick a random user agent for every request.
        agent = random.choice(agents)
        request.headers["User-Agent"] = agent

This is a simple middleware that swaps in a random UA; you can extend the agents list yourself.

2. The website blocks my IP, so I want to use a proxy.

For example, if a proxy is listening locally on 127.0.0.1, port 8123, a middleware can make the crawler reach the target website through that proxy. Add the following to middleware.py:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Fill in your own proxy here.
        # If you bought proxies, you can fetch the proxy list through the provider's API
        # and pick one at random.
        proxy = "http://127.0.0.1:8123"
        request.meta["proxy"] = proxy

Many websites limit the number of requests and temporarily ban IPs that visit too frequently. If necessary, you can buy proxy IPs online; providers usually offer an API that returns the currently available IP pool. Just pick one and fill it in here, as in the sketch below.
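A minimal sketch of picking a random proxy from such a pool, assuming you have already loaded the list (for example from the provider's API; the entries below are placeholders):

import random

# Placeholder proxies -- replace them with the list returned by your provider's API.
PROXY_POOL = [
    "http://123.123.123.123:8000",
    "http://124.124.124.124:8000",
]

class RandomProxyMiddleware(object):

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy.
        request.meta["proxy"] = random.choice(PROXY_POOL)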

Some common configurations

Some common configurations in settings.py:

# Interval, in seconds, between two scrapy requests.
DOWNLOAD_DELAY = 5

# Retry when a request fails.
RETRY_ENABLED = True
# HTTP status codes that trigger a retry.
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]
# Number of retries.
RETRY_TIMES = 5

# Pipeline concurrency: at most this many items are processed by the pipelines at the same time.
CONCURRENT_ITEMS = 200
# Maximum number of concurrent requests.
CONCURRENT_REQUESTS = 100
# Maximum number of concurrent requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 50
# Maximum number of concurrent requests to a single IP.
CONCURRENT_REQUESTS_PER_IP = 50

I just want to use PyCharm

If you want to use PyCharm as the development and debugging tool, set up a run configuration as follows.
On the configuration page, in the Script field, enter the path to scrapy's cmdline.py, for example:

/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py

In Script parameters, enter the crawl command followed by the crawler name. In this example:

crawl NgaSpider

Finally, for Working directory, find your settings.py file and fill in the directory that contains it.

Now you can run and debug the crawler with the little green arrow.
