A novel crawler made with Scrapy

Source: Internet
Author: User
Tags: django, website


The matching Django website for this crawler: https://www.zybuluo.com/xuemy268/note/63660

First, install Scrapy. Installation on Windows is troublesome, so please search for instructions yourself; I won't go into it here. On Ubuntu the installation is:

apt-get install python-dev
apt-get install python-lxml
apt-get install libffi-dev
pip install scrapy

Crawling a novel really comes down to crawling two kinds of pages: the novel's introduction page and its chapter pages. The introduction pages fall into two cases:

      1. The novel's introduction page contains the chapter list directly

      2. The introduction page does not contain the chapter list itself, but contains a URL pointing to a separate chapter list page

For case 1:

def parse(self, response):
    # Use XPath to get the novel's name, author, category and introduction,
    # plus the chapter list URLs.
    # With the extractor below you get the chapter list URLs (ready to be passed
    # straight to Request()) and the chapter names at the same time.
    # http://www.ydzww.com
    SgmlLinkExtractor(restrict_xpaths=(config.get("novelchapterlist_xpath"),)).extract_links(response)
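To make case 1 concrete, here is a minimal sketch of a full spider built around that idea, assuming the older Scrapy API this post appears to use (SgmlLinkExtractor and friends); the spider name, XPath expressions and URLs are placeholders of mine, not the original project's:

# Minimal sketch for case 1; names, XPaths and URLs are placeholders.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NovelCase1Spider(Spider):
    name = "novel_case1"
    start_urls = ["http://www.example.com/novel/12345/"]  # placeholder introduction page

    def parse(self, response):
        sel = Selector(response)
        # Novel metadata from the introduction page (placeholder XPaths);
        # these would normally be put into an item and stored.
        name = sel.xpath("//h1/text()").extract()
        author = sel.xpath("//p[@class='author']/text()").extract()

        # The chapter list is on this same page: extract the links and follow them.
        links = SgmlLinkExtractor(
            restrict_xpaths=("//div[@id='chapter-list']",)  # placeholder XPath
        ).extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse_chapter,
                          meta={"chapter_name": link.text})

    def parse_chapter(self, response):
        # Chapter body, to be cleaned and stored later (placeholder XPath).
        content = Selector(response).xpath("//div[@id='content']").extract()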

For case 2:

# Use XPath to get the URL that points to the chapter list page. You can use
# get_base_url(response) to get the base URL of the page, and then use
# moves.urllib.parse.urljoin() to join the two.
# After that, issue a Request(); the remaining steps are basically the same as case 1.
# http://www.ydzww.com
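For case 2, a minimal sketch of the same idea might look like this (again, the class name, XPath expressions and URL are placeholder assumptions of mine):

# Minimal sketch for case 2; names, XPaths and URLs are placeholders.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from six.moves.urllib.parse import urljoin

class NovelCase2Spider(Spider):
    name = "novel_case2"
    start_urls = ["http://www.example.com/novel/67890/"]  # placeholder introduction page

    def parse(self, response):
        # The introduction page only links to a separate chapter-list page.
        href = Selector(response).xpath("//a[@class='chapter-list']/@href").extract()[0]  # placeholder XPath
        # The href may be relative, so join it with the page's base URL.
        chapter_list_url = urljoin(get_base_url(response), href)
        yield Request(chapter_list_url, callback=self.parse_chapter_list)

    def parse_chapter_list(self, response):
        # From here on the processing is basically the same as case 1:
        # extract the chapter links from this page and follow each of them.
        pass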

As for inserting into the database: after some googling I went with Twisted's database interface (adbapi), which is asynchronous and should fit Scrapy well. Using something else is fine too; I used the Django models and ran into no problems.

Here is a snippet found online:

# Cannot use this to create the table; the table must already exist.
from twisted.enterprise import adbapi
import datetime
import MySQLdb.cursors

from scrapy import log


class SQLStorePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb', db='mydb',
                                            user='myuser', passwd='mypass',
                                            cursorclass=MySQLdb.cursors.DictCursor,
                                            charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        # run db query in thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # create record if it doesn't exist
        # all of this block runs in its own thread
        tx.execute("select * from websites where link = %s", (item['link'][0],))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            tx.execute(
                "insert into websites (link, created) values (%s, %s)",
                (item['link'][0], datetime.datetime.now())
            )
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)

# This code snippet comes from: http://www.sharejs.com/codes/python/8392
# http://www.ydzww.com
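Since writing through the Django models also worked fine for me, here is a minimal sketch of that alternative; the Novel model and its fields are hypothetical names of mine, and the blocking ORM call is pushed into a thread so it does not stall Scrapy's Twisted reactor:

# Sketch of a pipeline that stores items through the Django ORM instead of adbapi.
# "myapp.models.Novel" and its fields are hypothetical; adapt to your own models.
from twisted.internet.threads import deferToThread

class DjangoStorePipeline(object):
    def process_item(self, item, spider):
        # The Django ORM is blocking, so run the save in Twisted's thread pool;
        # returning the Deferred lets Scrapy wait for the result.
        return deferToThread(self._save, item)

    def _save(self, item):
        from myapp.models import Novel  # hypothetical Django model
        Novel.objects.get_or_create(
            link=item['link'][0],
            defaults={'name': item.get('name'), 'author': item.get('author')},
        )
        return item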

The other piece is crawl throttling. With the default settings the crawler runs too fast, which risks getting banned by the target site; besides, if it crawled that fast and brought the source site down, whom would I collect from afterwards?

# simultaneous download number
CONCURRENT_REQUESTS = 5
CONCURRENT_REQUESTS_PER_SPIDER = 5
CLOSESPIDER_PAGECOUNT = 100000
CLOSESPIDER_TIMEOUT = 36000
DOWNLOAD_DELAY = 1.5
RETRY_ENABLED = False
COOKIES_ENABLED = False
# http://www.ydzww.com

This is my configuration. Having run it for quite a few days now, it collects about 40 pages per minute, which is about right.

Filtering of content

The content is basically extracted with XPath, and the chapter text is additionally cleaned with a few regular expressions to remove URLs and other bits of information about the source site embedded in it.

(http(s)?://.)?(www\.)?[email protected]:!$^&\*%.()_\+~#=\uff10-\uff40{}\[\]]{2,256}[\[\]{}!$^\*&@:%._\+~#=()][\[\]{}a-z!$^\*&@:%._\uff10-\uff40\s]{2,6}\b([\[\]-a-zA-Z0-9()@:%_\+.~#?&//=]*) # www.ydzww.com

This is the regular expression I use to handle URLs in the chapter content. So far I have not run into a URL it cannot handle while collecting novels. If you find one it fails on, leave a comment and I will gladly fix it so it stays convenient for everyone to use.
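As an illustration of how this kind of cleanup is applied (using a deliberately simplified URL pattern as a stand-in, not the full regex above), the chapter text can be scrubbed like this:

# Sketch of stripping URLs / source-site notices from chapter text.
# URL_RE is a simplified stand-in for the full pattern above.
import re

URL_RE = re.compile(r'(https?://)?(www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}\b[-a-zA-Z0-9()@:%_+.~#?&/=]*')

def clean_chapter(text):
    # Remove anything that looks like a URL, then tidy up leftover spaces.
    text = URL_RE.sub('', text)
    return re.sub(r'[ \t]{2,}', ' ', text).strip()

print(clean_chapter(u"Chapter 1 ... collected from www.example.com, read more at http://www.example.com/novel"))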

Compared with existing novel crawlers, this one has the following advantages:
    1. Runs flawlessly on Linux; it also runs on Windows, but the log file may occasionally be garbled

    2. Through the database configuration, each novel is mapped to a source site, and every single novel is monitored on a 3-minute cycle, so each novel is collected as quickly as possible

    3. Fast and stable operation; Scrapy's stability is definitely worth affirming

I have already used this crawler to build a novel site, Easy Read Chinese Web.

