1. What is Scrapy?
Scrapy is an open-source Python crawler framework built on Twisted that lets you crawl web data by customizing just a few simple modules.
2. Scrapy's overall architecture
A brief explanation of the architecture diagram above:
The raw material the crawler processes is one or more URLs. During the crawl, the scheduler assigns a URL to the downloader, which makes the network request; when the request completes, the downloader hands the response to the spiders. If the returned data is what we need, it is packed into the corresponding item and passed on to the item pipeline for storage and further processing. If the result is yet another URL to be processed, it is handed back to the scheduler and goes through the same process once more.
3. Installation of Scrapy
sudo pip install scrapy
sudo easy_install scrapy
Enter your password to complete the installation. Then type scrapy at the command line; if you do not get command not found, the installation succeeded.
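For example, a quick check (the version number printed will vary with your installation):

scrapy version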
4. Create a project
scrapy startproject project_name
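Since the project in this tutorial is named appscrapy (as the directory listing below shows), the actual command would be:

scrapy startproject appscrapy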
If the command prints a message that the new project was created, our project was created successfully. Switch into the project directory and we'll see the following directory structure:
appscrapy/
    scrapy.cfg          configuration file for the whole project
    appscrapy/          folder that stores all of the project's Python modules
        __init__.py
        items.py        data structures that store the crawled data
        pipelines.py    data-stream processing file, used to process the crawled data streams
        settings.py     settings file; database configuration and the like go here
        spiders/        folder that stores our custom crawlers
            __init__.py
            ...
Let's look at our overall goal: the App Store entertainment chart, https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8
The data we want to crawl are the app names in the list and the URLs to their corresponding detail pages.
First we customize items.py, which defines the data types used to save the crawled data. Open items.py and add the following code:
import scrapy

class AppscrapyItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()
    url = scrapy.Field()
Simply put, every item inherits from scrapy.Item, the fields inside it are of type scrapy.Field(), and scrapy.Field() can accept any data type.
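An item then behaves much like a dictionary; for illustration (the values here are made up):

item = AppscrapyItem()
item["name"] = "Some App"                            # scrapy.Field() accepts any type
item["url"] = "https://itunes.apple.com/cn/app/example"  # hypothetical URL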
Now it's time to customize our crawlers.
Create an appscrapy.py under the spiders folder, then open it and add the following code:
from scrapy.spider import BaseSpider
from appscrapy.items import AppscrapyItem

class AppScrapy(BaseSpider):
    name = 'app_scrapy'
    start_urls = ["https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8"]

    def parse(self, response):
        result = []
        lis = response.xpath("//div[@class='grid3-column']/div")
        for li in lis:
            array = li.xpath("./ul/li")
            for node in array:
                item = AppscrapyItem()
                item["name"] = node.xpath("./a/text()").extract()
                item["url"] = node.xpath("./a/@href").extract()
                result.append(item)
        return result
Every crawler must inherit from BaseSpider, and it must define a name, because we start the crawler by that name. It must also define start_urls, the list of URLs the crawler uses to know where to start crawling, and it must implement the parse method, where the data crawled back is filtered down to what we actually want.
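To run the spider, use the name defined above from the project root. If you also want to dump what it scrapes to a file, Scrapy's feed export can do that (older versions may additionally need -t json to pick the format):

scrapy crawl app_scrapy -o items.json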
When we start the crawler (scrapy crawl app_scrapy), Scrapy takes the first URL from start_urls, issues a request for it, and registers parse as the callback for that request. The response argument in the callback is the response to that request.
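The callback does not have to return items only; it can also return new requests, which go back to the scheduler as described in the architecture section. A minimal sketch (parse_detail is a hypothetical second callback, not part of this tutorial's spider):

from scrapy.http import Request

def parse(self, response):
    for href in response.xpath("//a/@href").extract():
        # each yielded Request is handed back to the scheduler for crawling
        yield Request(href, callback=self.parse_detail)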
For selecting content we use the xpath method: you pass it a path expression and it returns an array of selectors.
For the path itself we can use Chrome's developer tools: under the Elements tab, select the content you want, right-click, and choose Copy XPath.
lis = response.xpath("//div[@class='grid3-column']/div")
First we use XPath to get all the divs inside the div with class='grid3-column'; the return value is an array. From the screenshot above we can see that this array should contain 3 selectors, one for each div.
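If you want to check this yourself, the Scrapy shell is a handy way to try XPath expressions interactively against the live page:

scrapy shell "https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8"
>>> lis = response.xpath("//div[@class='grid3-column']/div")
>>> len(lis)   # should be 3 while the page keeps its three-column layout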
The contents of each div are as shown above. We take each div in turn and parse its contents further.
for li in lis:
    array = li.xpath("./ul/li")
    for node in array:
        item = AppscrapyItem()
        item["name"] = node.xpath("./a/text()").extract()
        item["url"] = node.xpath("./a/@href").extract()
        result.append(item)
First we use a for loop to take out each div, and then, under the current div, we get all the li elements inside its ul. As shown, this gives us a selector array representing the li elements. Let's look at the structure of an li next.
The text inside the <a> tag is obtained through text(), so relative to the current li the path is "./a/text()", where "." means the path starts from the current selector. An xpath() call by itself returns selectors; to get the actual values we call extract(), which returns an array of the literal strings.
To get an attribute's value, you use @, as with @href above; then we simply assign these values to the item we defined earlier and we're done.
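Concretely, the two expressions return something like the following (the values are illustrative, not actual output from the page):

node.xpath("./a/text()").extract()   # e.g. [u'Some App Name']
node.xpath("./a/@href").extract()    # e.g. [u'https://itunes.apple.com/cn/app/example']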
Of course, the data still needs to be saved. To be continued: next time I'll write about how to save it to a database.