A web crawler, or spider, is a robot that crawls across a network. Of course, it is not usually a physical robot: the network itself is a virtual thing, so this "robot" is really a program, and it does not so much crawl as wander with a purpose, collecting information along the way. For example, Google runs a huge number of crawlers that gather page content and the links between pages, while some malicious crawlers roam the Internet harvesting things like foo@bar.com or Foo [at] bar [dot] com. There are also custom crawlers aimed at particular sites; a few years ago Robbin of JavaEye wrote several blog posts about dealing with malicious crawlers (the original links seem to be dead, so I won't give them), and sites like 小众软件 or LinuxTOY are frequently scraped wholesale and republished under someone else's name. In principle a crawler is very simple: as long as you can access the network and analyze web pages you can build one. Most languages now ship a convenient HTTP client library for fetching pages, and the crudest HTML analysis can be done with regular expressions alone, so writing a rudimentary crawler is actually trivial. Building a high-quality spider, however, is very hard.
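To make "rudimentary" concrete, here is a minimal sketch, not from the original post, written in the same Python 2 style as the rest of the code below: it fetches one page with the standard urllib2 module and pulls absolute links out with a regular expression.

import re
import urllib2

def fetch_links(url):
    # Fetch the page with the standard-library HTTP client.
    html = urllib2.urlopen(url).read()
    # A crude regular expression is enough for a toy crawler, though it
    # will stumble over much of the broken HTML found in the wild.
    return re.findall(r'href="(http://[^"]+)"', html)

if __name__ == '__main__':
    for link in fetch_links('http://mindhacks.cn/'):
        print link

Anything beyond this toy, of course, runs straight into the problems described next.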
A crawler has two main parts. One is downloading pages, where there are many issues to consider: how to make the most of local bandwidth, how to schedule requests to different sites so as not to overburden the other side, and so on. In a high-performance crawler, even DNS lookups become a bottleneck worth optimizing, and there are "rules of the road" to respect, such as robots.txt. The other part, analyzing the pages you have fetched, is just as messy: the Internet is a strange place, HTML pages come with every kind of error imaginable, and parsing them all correctly is nearly impossible. On top of that, with the spread of AJAX, extracting content generated dynamically by Javascript has become a real problem, and the Web is full of spider traps, intentional or not; blindly following hyperlinks will get you stuck in one. For example, there is a site that reportedly prompted Google to announce that the number of unique URLs had reached 1 trillion, so its owner proudly claims to own the second trillion. :D
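As an aside on those "rules of the road": Python's standard library already includes a robots.txt parser, so honoring it costs only a few lines. A minimal sketch using Python 2's robotparser module (this is only an illustration; it is not how Scrapy itself handles robots.txt):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://mindhacks.cn/robots.txt')
rp.read()

# Ask whether our (hypothetical) user agent may fetch a given URL
# before actually downloading it.
if rp.can_fetch('MyCrawler', 'http://mindhacks.cn/page/2/'):
    print 'allowed to fetch'
else:
    print 'disallowed by robots.txt'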
In practice, though, few people need a general-purpose crawler like Google's. Usually we write a crawler for one site, or one kind of site, and as the saying goes, know your enemy and you will win every battle: we can analyze the structure of the target site in advance, which makes things much easier. By analyzing it and following only the valuable links, we can avoid a great many unnecessary links and spider traps, and if the site's structure permits a suitable crawl path, we can visit the things we care about in a fixed order, so that even the bookkeeping for duplicate URLs can be skipped.
For example, suppose we want to crawl the full text of the blog posts on Pongba's blog, mindhacks.cn. A quick look shows that there are two kinds of pages we are interested in:
Article list pages, such as the homepage or URLs of the form /page/\d+/. Firebug shows that each article link sits in an a tag under an h1. (Note that the HTML you see in Firebug's HTML panel may differ from what View Source shows: if the page has Javascript that modifies the DOM tree, the former is the modified version, and Firebug normalizes it, e.g. attributes are always quoted, whereas the latter is usually the raw content your spider will actually fetch. If you analyze pages with regular expressions, or if your HTML parser behaves differently from Firefox, pay special attention to this.) In addition, a p element with class wp-pagenavi contains links to the other list pages.
Article content pages: every blog post has one, for example /2008/09/11/machine-learning-and-ai-resources/, containing the full text of the article. This is the content we actually want.
So we start from the homepage and follow the links in wp-pagenavi to reach the other article list pages. In fact we define an even simpler path: follow only the "next page" link. That way we walk through the list pages in order exactly once, and the bookkeeping for duplicate crawling can be dropped. The article links found on the list pages lead to the content pages, which are the pages whose data we really want to save.
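To make the path concrete, a hypothetical helper like the one below could classify a URL before deciding what to do with it. The patterns are only my guess based on the URL examples above, not something taken from the original post:

import re

LIST_PAGE = re.compile(r'^http://mindhacks\.cn/(page/\d+/)?$')
POST_PAGE = re.compile(r'^http://mindhacks\.cn/\d{4}/\d{2}/\d{2}/[^/]+/$')

def classify(url):
    # List pages (the homepage and /page/N/) yield further links to follow;
    # post pages are the ones whose content we actually want to save.
    if LIST_PAGE.match(url):
        return 'list'
    if POST_PAGE.match(url):
        return 'post'
    return 'ignore'

In the Scrapy version below, this distinction is expressed through callbacks rather than explicit URL matching.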
For a task like this, an ad hoc crawler written in any scripting language would do the job without much trouble, but today's protagonist is Scrapy, a crawler framework written in Python: simple, lightweight, and very convenient. The official site says it is used in actual production, so it is not a toy. There is no released version yet, however; you install it by pulling the source from their Mercurial repository. It can also be used without installing, which makes it easy to update at any time. The documentation is very thorough, so I won't repeat it here.
Scrapy uses Twisted, an asynchronous networking library, to handle network communication. Its structure is clean, and it provides various middleware interfaces so that all kinds of requirements can be met flexibly. The overall architecture looks like this:
The green lines are the data flow. Starting from the initial URLs, the Scheduler hands them to the Downloader to fetch; once downloaded, a page is passed to the Spider for analysis. The Spider produces two kinds of results: links that need further crawling, such as the "next page" links analyzed earlier, which are sent back to the Scheduler; and data that needs to be saved, which is delivered to the Item Pipeline, the place where data is post-processed (detailed analysis, filtering, storage, and so on). In addition, various middleware can be installed along the data flow to do whatever processing is necessary.
The individual components are described in more detail in the appendix at the end.
It may look complicated, but it is simple to use. As with Rails, start by creating a new project:
scrapy-admin.py startproject blog_crawl
This creates a blog_crawl directory containing scrapy-ctl.py, the control script for the whole project; the code lives in the subdirectory blog_crawl. To crawl mindhacks.cn, we create a new mindhacks_spider.py in the spiders directory and define our spider as follows:
from scrapy.spider import BaseSpider

class MindhacksSpider(BaseSpider):
    domain_name = "mindhacks.cn"
    start_urls = ["http://mindhacks.cn/"]

    def parse(self, response):
        return []

SPIDER = MindhacksSpider()
Our MindhacksSpider inherits from BaseSpider (usually one inherits directly from the richer scrapy.contrib.spiders.CrawlSpider, which is more convenient, but I use BaseSpider here to show how the data gets parsed). The variables domain_name and start_urls are self-explanatory, and parse is the callback we need to define: it is called by default when a request gets its response, and it must parse the page and return two kinds of results (links to crawl further and data to save). What strikes me as a little odd is that, in the interface, these two results are returned mixed together in one list; it is not clear why it was designed this way, since they only have to be painstakingly separated again later. Anyway, for now we write an empty function that returns an empty list. We also define a "global" variable SPIDER, which is instantiated when Scrapy imports this module and is found automatically by the Scrapy engine. Now we can run the crawler and give it a try:
./scrapy-ctl.py crawl mindhacks.cn
There will be a pile of output, in which you can see that http://mindhacks.cn was fetched, since it is the initial URL. But because our parse function returns no URLs to crawl further, the whole run grabs only the homepage and ends. The next step is to analyze the page, and Scrapy provides a handy shell for this (it requires IPython) that lets us experiment interactively. Start it with the following command:
./scrapy-ctl.py shell http://mindhacks.cn
It starts the crawler, fetches the page given on the command line, and then drops into a shell. As the prompt tells us, a number of ready-made variables are available; one of them is hxs, an HtmlXPathSelector. The HTML of mindhacks is fairly well-formed, so it can be analyzed conveniently with XPath directly. As Firebug shows, the link to each blog post sits under an h1, so we test this XPath expression in the shell:
In [1]: hxs.x('//h1/a/@href').extract()
Out[1]:
[u'http://mindhacks.cn/2009/07/06/why-you-should-do-it-yourself/',
 u'http://mindhacks.cn/2009/05/17/seven-years-in-nju/',
 u'http://mindhacks.cn/2009/03/28/effective-learning-and-memorization/',
 u'http://mindhacks.cn/2009/03/15/preconception-explained/',
 u'http://mindhacks.cn/2009/03/09/first-principles-of-programming/',
 u'http://mindhacks.cn/2009/02/15/why-you-should-start-blogging-now/',
 u'http://mindhacks.cn/2009/02/09/writing-is-better-thinking/',
 u'http://mindhacks.cn/2009/02/07/better-explained-conflicts-in-intimate-relationship/',
 u'http://mindhacks.cn/2009/02/07/independence-day/',
 u'http://mindhacks.cn/2009/01/18/escape-from-your-shawshank-part1/']
These are exactly the URLs we need. In addition, we can find the "next page" link, which sits in a p together with links to several other pages; but the "next page" link has no title attribute, so the XPath can be written as
//p[@class="wp-pagenavi"]/a[not(@title)]
However, if you page back, you will find that the "previous page" link has no title attribute either, so we also need to check that the text on the link is the "next page" arrow, u'\xbb'. This could have been written into the XPath as well, but it seems to be a Unicode escape character and, for encoding reasons I haven't untangled, it didn't work there, so I simply moved the check outside. The final parse function looks like this:
def parse(self, response):
    items = []
    hxs = HtmlXPathSelector(response)
    posts = hxs.x('//h1/a/@href').extract()
    items.extend([self.make_requests_from_url(url).replace(callback=self.parse_post)
                  for url in posts])

    page_links = hxs.x('//p[@class="wp-pagenavi"]/a[not(@title)]')
    for link in page_links:
        if link.x('text()').extract()[0] == u'\xbb':
            url = link.x('@href').extract()[0]
            items.append(self.make_requests_from_url(url))

    return items
The first half extracts the links to the blog posts whose text we want to crawl, and the second half extracts the link to the next page. Note that the returned list does not simply contain URL strings: Scrapy expects Request objects, which can carry more than a bare URL, such as cookies or a callback function (a sketch of building such a Request by hand appears a little further below). You can see that when creating the Requests for the post bodies we replace the callback, because the default callback, parse, is dedicated to parsing pages like the article list, while parse_post is defined as follows:
def parse_post(self, response):
    item = BlogCrawlItem()
    item.url = unicode(response.url)
    item.raw = response.body_as_unicode()
    return [item]
Very simple: it returns a BlogCrawlItem with the captured data inside. We could do some parsing here, for example extracting the body and title with XPath, but I prefer to do that later, in an Item Pipeline or in an even later offline stage. BlogCrawlItem is an empty class, inheriting from ScrapedItem, that Scrapy automatically defines for us in items.py; here I add a little to it:
from scrapy.item import ScrapedItem

class BlogCrawlItem(ScrapedItem):
    def __init__(self):
        ScrapedItem.__init__(self)
        self.url = ''

    def __str__(self):
        return 'BlogCrawlItem(url: %s)' % self.url
I define a __str__ that prints only the URL, because the default __str__ shows all the data, so during a crawl you would see the console log madly dumping the entire content of every fetched page. -.-bb
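A side note on the Request objects mentioned above: make_requests_from_url(url).replace(callback=...) is essentially shorthand for building the Request yourself. A rough sketch of the by-hand version follows; it assumes scrapy.http.Request accepts a callback argument, as in later Scrapy releases, so the old snapshot used in this post may differ in detail:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class RequestDemoSpider(BaseSpider):
    domain_name = "mindhacks.cn"
    start_urls = ["http://mindhacks.cn/"]

    def parse(self, response):
        # Build the Request directly and attach the callback that should
        # receive the downloaded article page.
        url = "http://mindhacks.cn/2008/09/11/machine-learning-and-ai-resources/"
        return [Request(url, callback=self.parse_post)]

    def parse_post(self, response):
        return []

SPIDER = RequestDemoSpider()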
With that, the data is captured; all that remains is storing it, which we implement by adding a pipeline. Since Python ships sqlite3 in the standard library, I use an SQLite database to store the data. Replace the contents of pipelines.py with the following code:
import sqlite3
from os import path

from scrapy.core import signals
from scrapy.xlib.pydispatch import dispatcher

class SQLiteStorePipeline(object):
    filename = 'data.sqlite'

    def __init__(self):
        self.conn = None
        dispatcher.connect(self.initialize, signals.engine_started)
        dispatcher.connect(self.finalize, signals.engine_stopped)

    def process_item(self, domain, item):
        self.conn.execute('insert into blog values(?,?,?)',
                          (item.url, item.raw, unicode(domain)))
        return item

    def initialize(self):
        if path.exists(self.filename):
            self.conn = sqlite3.connect(self.filename)
        else:
            self.conn = self.create_table(self.filename)

    def finalize(self):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()
            self.conn = None

    def create_table(self, filename):
        conn = sqlite3.connect(filename)
        conn.execute("""create table blog
                     (url text primary key, raw text, domain text)""")
        conn.commit()
        return conn
In __init__, we use dispatcher to connect two signals to the given functions, which initialize and close the database connection respectively (remember to commit before close: it does not seem to commit automatically, and just calling close appears to lose all the data -.-). When data flows through the pipeline, process_item is called; here we simply store the raw data in the database without any processing. If needed, you can add further pipelines to extract data, filter it, and so on; I won't go into that here.
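As for the "parse it later, offline" approach mentioned earlier: once the raw pages are sitting in data.sqlite, a separate script can extract fields at leisure. A hypothetical sketch follows; the <title> regular expression is just a stand-in for real parsing (XPath or an HTML parser would be the sane choice), not code from the original post:

import re
import sqlite3

conn = sqlite3.connect('data.sqlite')
for url, raw in conn.execute('select url, raw from blog'):
    # Crude offline pass over the stored raw HTML.
    m = re.search(r'<title>(.*?)</title>', raw, re.S)
    title = m.group(1).strip() if m else '(no title)'
    print url, '->', title
conn.close()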
Finally, list our pipeline in settings.py:
ITEM_PIPELINES = ['blog_crawl.pipelines.SQLiteStorePipeline']
Run the crawler again, and we're done!
PS1: The components of Scrapy
1. Scrapy Engine
The Scrapy engine controls the data processing flow of the whole system and triggers transactions. See the data processing flow below for more detail.
2. Scheduler
The scheduler accepts requests from the Scrapy engine, puts them in a queue, and hands them back when the engine asks for them.
3. Downloader
The downloader's main job is to fetch web pages and hand their content back to the spiders.
4. Spiders
Spiders are user-defined classes that Scrapy uses to parse web pages and extract the content returned from the crawled URLs; each spider can handle one domain or a group of domains. In other words, a spider defines the crawling and parsing rules for a particular site.
5. Item Pipeline
The item pipeline's main responsibility is to process the items that the spiders extract from web pages; its main tasks are cleaning, validating, and storing the data. When a page has been parsed by a spider, its items are sent to the item pipeline and processed through several components in a specific order. Each item pipeline component is a plain Python class with a simple method; it receives an item, runs its method on it, and decides whether the item continues to the next stage of the pipeline or is dropped.
The steps an item pipeline typically performs are:
cleaning the HTML data;
validating the parsed data (checking that the item contains the required fields);
checking for duplicates (and dropping them);
storing the parsed data in the database.
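For instance, the duplicate check could live in a tiny pipeline of its own. A rough sketch is below, following the same process_item(self, domain, item) interface as the SQLite pipeline above; note that the import path of DropItem (scrapy.core.exceptions in old releases, scrapy.exceptions in later ones) depends on the Scrapy version, so treat this as illustrative rather than canonical:

from scrapy.core.exceptions import DropItem

class DuplicateFilterPipeline(object):
    """Drop items whose URL has already passed through this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, domain, item):
        if item.url in self.seen:
            # Raising DropItem keeps the duplicate from reaching later
            # pipelines such as SQLiteStorePipeline.
            raise DropItem('duplicate item: %s' % item.url)
        self.seen.add(item.url)
        return item

To use it, it would be listed in ITEM_PIPELINES ahead of the SQLite pipeline.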
6. Middlewares
Middlewares are hooks sitting between the Scrapy engine and the other components, providing a place for custom code that extends Scrapy's functionality.
PS2: Scrapy's data processing flow
The entire data processing flow is controlled by the Scrapy engine and mainly runs as follows:
1. The engine opens a domain, locates the spider that handles it, and asks the spider for the first URLs to crawl.
2. The engine gets the first URL to crawl from the spider and schedules it as a request with the scheduler.
3. The engine asks the scheduler for the next page to crawl.
4. The scheduler returns the next URL to crawl, and the engine sends it to the downloader through the downloader middleware.
5. When the downloader has fetched the page, the response is sent back to the engine through the downloader middleware.
6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.
7. The spider processes the response and returns scraped items along with new requests to the engine.
8. The engine sends the scraped items to the item pipeline and the new requests to the scheduler.
9. The process repeats from the second step until there are no more requests in the scheduler, at which point the engine closes the domain.