Capturing Data with Python Scrapy
We will use the dmoz.org website as our example. The workflow has four steps:
Create a project: create a new crawler project.
Define the target (Item): define the data you want to capture.
Make a crawler (Spider): write a crawler that starts crawling webpages.
Store content (Pipeline): design a pipeline to store the crawled content.
1. Create a Project
scrapy startproject tutorial
Use the tree command to display the project structure:
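The structure looks roughly like this (reconstructed from the file list below; scrapy startproject also creates an __init__.py in the module and spiders directories):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py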
The following describes the functions of each file:
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; the code you write will be referenced here.
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory for storing crawlers
2. Define the target (Item)
In Scrapy, an Item is the container used to hold the captured content. It is a bit like a dict in Python, that is, a dictionary, but it provides some additional protection to reduce errors.
Generally, Items are created by subclassing the scrapy.item.Item class, and attributes are defined with scrapy.item.Field objects (which can be understood as an ORM-like mapping).
Next, we build our Item model.
First, we want:
Title (title)
Link (link)
Description (desc)
Modify the items.py file under the tutorial directory and add our own class after the original class.
Because we want to capture the content of the dmoz.org website, we can name it DmozItem:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
At first this may seem a little pointless, but defining these Items lets the other components know exactly what data your item carries.
You can simply think of an Item as an encapsulated class object.
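A minimal sketch (not from the original tutorial) of what that "extra protection" means in practice, assuming the DmozItem defined above:

item = DmozItem(title='Example Python book')   # fields can be set at construction time
item['link'] = 'http://example.com/book'       # or with ordinary dictionary syntax
print(item['title'])                            # read values back like a dict
print(item.keys())                              # the fields that have been set so far
# item['author'] = 'unknown'                    # raises KeyError: DmozItem declares no 'author' Field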
3. Make a Crawler (Spider)
Making a crawler takes two steps: first crawl, then extract.
That is, first fetch the entire content of each webpage, then pull out the useful parts.
3.1 Crawling
A Spider is a class you write yourself, used to scrape information from a domain (or group of domains).
It defines a list of URLs to download, a scheme for following links, and a method for parsing webpage content to extract Items.
To create a Spider, you must subclass scrapy.spider.BaseSpider (imported simply as Spider in the code below) and define three mandatory attributes:
name: identifies the crawler. It must be unique; different crawlers must be given different names.
start_urls: the list of URLs to crawl. The crawler starts here, so the first downloads are these URLs; other URLs are discovered from these starting pages.
parse(): the parsing method. It is called with the Response object downloaded from each URL as its only argument, and it is responsible for parsing and matching the captured data (extracting it into Items) and following further URLs.
In other words, the crawler starts from the stored URLs and gradually spreads out from there, saving every qualifying webpage it finds and continuing to crawl.
Next we will write the first crawler, named dmoz_spider.py, and save it in the tutorial/spiders directory.
The dmoz_spider.py code is as follows:
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
allowed_domains is the search domain scope, i.e. the crawler's restricted area: the crawler is only allowed to crawl webpages under this domain name.
In the parse function, the second-to-last segment of each URL is extracted and used as the filename when saving the page.
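To see why the second-to-last segment becomes the filename, here is what the split looks like for one of the start URLs (a quick illustration, not part of the original code):

url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
parts = url.split("/")
# parts == ['http:', '', 'www.dmoz.org', 'Computers', 'Programming',
#           'Languages', 'Python', 'Books', '']
print(parts[-2])   # prints: Books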
Run the following command in the top-level directory:
scrapy crawl dmoz
Running result
Do you still remember our start_urls?
http://www.dmoz.org/Computers/Programming/Languages/Python/Books
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources
Because these URLs are the starting pages, they have no referrer, so at the end of each log line you will see (referer: None).
Driven by the parse method, two files are created, Books and Resources, containing the bodies of the corresponding pages.
First, Scrapy creates a scrapy.http.Request object for each URL in the crawler's start_urls attribute and assigns the crawler's parse method as the callback function.
These Requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the crawler through the parse() method.
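As a hedged sketch (not part of the original tutorial) of how this Request/callback cycle can also be used explicitly, a spider can create its own scrapy.http.Request objects inside parse() to schedule further pages; the spider name below is illustrative:

from scrapy.http import Request
from scrapy.spider import Spider

class FollowSpider(Spider):
    name = "follow_example"   # hypothetical name, not part of the tutorial project
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        # save the page body, just like DmozSpider above
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        # schedule one more page; Scrapy will call self.parse with its Response
        yield Request("http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
                      callback=self.parse)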
3.2 Extracting
After crawling the whole webpage, the next step is extraction. Merely storing an entire HTML page is not enough. In a basic crawler this step could be handled with regular expressions; in Scrapy, a mechanism called XPath selectors is used instead, which is based on XPath expressions.
Here are some examples of XPath expressions and their meanings.
/html/head/title: selects the <title> element inside the <head> of the HTML document
3.3 XPath Experiments
Next we will try Selector usage in Shell.
Lab URL: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
Once we are familiar with it, we will use the Shell to fetch and experiment with this webpage.
Go to the top-level directory of the project, i.e. the first-level tutorial folder, and run:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
After you press enter, you can see the following content:
Now it is as if we were holding a handful of sand that hides the gold we want: the next step is to shake it through a sieve a couple of times, discard the impurities, and pick out the key content.
Selector is such a sieve.
In older versions, the Shell instantiated two selectors: an hxs variable for parsing HTML and an xxs variable for parsing XML.
The current Shell prepares a single selector object, sel, which automatically chooses the best parsing scheme (XML or HTML) based on the type of the data returned.
Then let's try it out!
To thoroughly understand this problem, first of all, we need to know what the captured page looks like.
For example, suppose we want to capture the webpage title, that is, the <title> tag.
Note: a quick reference of common XPath expressions (a short illustrative snippet follows the table):
Expression | Description
nodename   | Selects all child nodes of the named node.
/          | Selects from the root node.
//         | Selects matching nodes anywhere in the document, starting from the current node, regardless of their position.
.          | Selects the current node.
..         | Selects the parent of the current node.
@          | Selects attributes.
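As a small illustration of these expressions (this snippet is my own sketch, not part of the original shell session), a Scrapy Selector can be built directly from a string of HTML:

from scrapy.selector import Selector

html = """<html><head><title>Example</title></head>
<body><ul class="directory-url">
<li class="entry"><a href="http://example.com/">Example site</a> - a short description</li>
</ul></body></html>"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title/text()').extract())  # /  : path from the root -> ['Example']
print(sel.xpath('//a/@href').extract())                 # //, @ : any <a>, its href -> ['http://example.com/']
li = sel.xpath('//li')[0]
print(li.xpath('./a/text()').extract())                 # .  : relative to the current <li> -> ['Example site']
print(li.xpath('..').xpath('@class').extract())         # .. : the parent <ul> -> ['directory-url']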
All the experiment results are shown below; In [i] is the input of the i-th step and Out [i] is the corresponding output:
Of course, the title tag does not have much value for us. Next we will capture something meaningful.
Using Firefox's Inspect Element, we can clearly see that what we need looks like this:
We can capture these tags with the following code:
sel.xpath('//ul/li')
From the <li> tags, the website descriptions can be obtained as follows:
sel.xpath('//ul/li/text()').extract()
The website titles can be obtained as follows:
sel.xpath('//ul/li/a/text()').extract()
And the website URLs can be obtained as follows:
sel.xpath('//ul/li/a/@href').extract()
Of course, the previous examples show how to directly obtain attributes.
Notice that xpath() returns a list of selector objects,
so we can call xpath() on the objects in this list directly to dig into deeper nodes.
(See Nesting selectors and Working with relative XPaths in the Selectors documentation):
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
3.4 XPath in Practice
We have been in the shell for quite a while; now we can finally apply what we have learned to the dmoz_spider crawler.
Make the following changes in the original crawler's parse function:
qixuan@ubuntu:~/qixuan02/tutorial/tutorial/spiders$ cat dmoz_spider.py
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title
Note: we imported the Selector class from scrapy.selector and instantiated a new Selector object, so we can work with XPath here just as we did in the Shell.
Let's try to enter the command to run the crawler (in the tutorial root directory ):
scrapy crawl dmoz
The running result is as follows:
Sure enough, all the titles are captured. But something does not look right: why is "Top" there, and why has the Python navigation bar been captured as well?
We only need the content in the red circle, i.e. the actual directory entries:
It seems our xpath expression has a problem: it did not capture only the project names we need, but also some innocent elements that happen to match the same xpath syntax.
Inspecting the elements, we find that the list we need has the class="directory-url" attribute,
so change the xpath expression to sel.xpath('//ul[@class="directory-url"]/li').
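A quick sanity check in the Shell (a sketch; the exact count depends on what the page contained at crawl time):

sites = sel.xpath('//ul[@class="directory-url"]/li')
print(len(sites))   # now only the directory entries are matched, not the navigation links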
3.5 Using Item
Next, let's take a look at how to use Item.
As mentioned earlier, an Item object is a custom Python dictionary; you can use standard dictionary syntax to get the value of a field.
The Spider stores the captured data in Item objects. To return the captured data, the final version of the spider should look like this:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
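As a side note (an alternative not shown in the original tutorial), parse() can also be written as a generator that yields each item as soon as it is built, instead of collecting them all in a list; the same imports and class definition as above are assumed:

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('//ul[@class="directory-url"]/li'):
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            yield item   # Scrapy accepts yielded items just like a returned list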
4. Store Content (Pipeline)
The simplest way to save the captured information is through Feed exports. There are four main formats: JSON, JSON lines, CSV, and XML.
Export the results in JSON format. The command is as follows:
scrapy crawl dmoz -o items.json -t json
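The other formats work the same way; for example (command sketches using the same old-style -t flag as above):

scrapy crawl dmoz -o items.jl -t jsonlines
scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml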
Next, let's look at the export result. Open the JSON file in a text editor (to make the display easier, all item fields except title were removed):
This is just a small example, so this simple handling is enough.
The following files are generated in the final directory:
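If you need more control than the feed exports give you, this is where the pipelines.py file from the project layout comes in. A minimal, hypothetical sketch (the class name and output filename are my own, not part of the original project) that writes each item to a JSON-lines file:

# tutorial/pipelines.py -- an illustrative item pipeline
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items_pipeline.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider returns; must return the item
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

To enable it, you would add its path (e.g. 'tutorial.pipelines.JsonWriterPipeline') to the ITEM_PIPELINES setting in settings.py (a list in older Scrapy versions, a dict with an order number in newer ones).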