A novel crawler made with Scrapy

Source: Internet
Author: User
Tags: django, website


The matching Django website for this crawler: https://www.zybuluo.com/xuemy268/note/63660

First, install Scrapy. Installation on Windows is troublesome, so please search for instructions yourself; I won't go into it here. On Ubuntu the installation is:

apt-get install python-dev
apt-get install python-lxml
apt-get install libffi-dev
pip install scrapy

Crawling a novel really comes down to crawling two kinds of pages: the novel's introduction page and its chapter pages. The introduction pages fall into two cases:

      1. The novel's introduction page contains the chapter list directly

      2. The introduction page does not contain the chapter list itself, but contains a URL pointing to a separate chapter list page

For case 1:

def parse(self, response):
    # Use XPath to get the novel's name, author, category and introduction,
    # plus the chapter list URLs.
    # With the extractor below you get the chapter list URLs (ready to be passed
    # straight to Request()) and the chapter names at the same time.
    # http://www.ydzww.com
    SgmlLinkExtractor(restrict_xpaths=(config.get("novelchapterlist_xpath"),)).extract_links(response)
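To make case 1 concrete, here is a minimal sketch of a full spider built around that idea, assuming the older Scrapy API this post appears to use (SgmlLinkExtractor and friends); the spider name, XPath expressions and URLs are placeholders of mine, not the original project's:

# Minimal sketch for case 1; names, XPaths and URLs are placeholders.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NovelCase1Spider(Spider):
    name = "novel_case1"
    start_urls = ["http://www.example.com/novel/12345/"]  # placeholder introduction page

    def parse(self, response):
        sel = Selector(response)
        # Novel metadata from the introduction page (placeholder XPaths);
        # these would normally be put into an item and stored.
        name = sel.xpath("//h1/text()").extract()
        author = sel.xpath("//p[@class='author']/text()").extract()

        # The chapter list is on this same page: extract the links and follow them.
        links = SgmlLinkExtractor(
            restrict_xpaths=("//div[@id='chapter-list']",)  # placeholder XPath
        ).extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse_chapter,
                          meta={"chapter_name": link.text})

    def parse_chapter(self, response):
        # Chapter body, to be cleaned and stored later (placeholder XPath).
        content = Selector(response).xpath("//div[@id='content']").extract()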

For case 2:

# Use XPath to get the URL that points to the chapter list page. You can use
# get_base_url(response) to get the base URL of the page, and then use
# moves.urllib.parse.urljoin() to join the two.
# After that, issue a Request(); the remaining steps are basically the same as case 1.
# http://www.ydzww.com
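For case 2, a minimal sketch of the same idea might look like this (again, the class name, XPath expressions and URL are placeholder assumptions of mine):

# Minimal sketch for case 2; names, XPaths and URLs are placeholders.
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from six.moves.urllib.parse import urljoin

class NovelCase2Spider(Spider):
    name = "novel_case2"
    start_urls = ["http://www.example.com/novel/67890/"]  # placeholder introduction page

    def parse(self, response):
        # The introduction page only links to a separate chapter-list page.
        href = Selector(response).xpath("//a[@class='chapter-list']/@href").extract()[0]  # placeholder XPath
        # The href may be relative, so join it with the page's base URL.
        chapter_list_url = urljoin(get_base_url(response), href)
        yield Request(chapter_list_url, callback=self.parse_chapter_list)

    def parse_chapter_list(self, response):
        # From here on the processing is basically the same as case 1:
        # extract the chapter links from this page and follow each of them.
        pass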

As for inserting into the database: after some googling I went with Twisted's database interface (adbapi), which is asynchronous and should fit Scrapy well. Using something else is fine too; I used the Django models and ran into no problems.

Here is a snippet found online:

# Cannot use this to create the table; the table must already exist.
from twisted.enterprise import adbapi
import datetime
import MySQLdb.cursors

from scrapy import log


class SQLStorePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb', db='mydb',
                                            user='myuser', passwd='mypass',
                                            cursorclass=MySQLdb.cursors.DictCursor,
                                            charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        # run db query in thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # create record if it doesn't exist
        # all of this block runs in its own thread
        tx.execute("select * from websites where link = %s", (item['link'][0],))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            tx.execute(
                "insert into websites (link, created) values (%s, %s)",
                (item['link'][0], datetime.datetime.now())
            )
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)

# This code snippet comes from: http://www.sharejs.com/codes/python/8392
# http://www.ydzww.com
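Since writing through the Django models also worked fine for me, here is a minimal sketch of that alternative; the Novel model and its fields are hypothetical names of mine, and the blocking ORM call is pushed into a thread so it does not stall Scrapy's Twisted reactor:

# Sketch of a pipeline that stores items through the Django ORM instead of adbapi.
# "myapp.models.Novel" and its fields are hypothetical; adapt to your own models.
from twisted.internet.threads import deferToThread

class DjangoStorePipeline(object):
    def process_item(self, item, spider):
        # The Django ORM is blocking, so run the save in Twisted's thread pool;
        # returning the Deferred lets Scrapy wait for the result.
        return deferToThread(self._save, item)

    def _save(self, item):
        from myapp.models import Novel  # hypothetical Django model
        Novel.objects.get_or_create(
            link=item['link'][0],
            defaults={'name': item.get('name'), 'author': item.get('author')},
        )
        return item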

The other piece is crawl throttling. With the default settings the crawler runs too fast, which risks getting banned by the target site; besides, if it crawled that fast and brought the source site down, whom would I collect from afterwards?

# simultaneous download number
CONCURRENT_REQUESTS = 5
CONCURRENT_REQUESTS_PER_SPIDER = 5
CLOSESPIDER_PAGECOUNT = 100000
CLOSESPIDER_TIMEOUT = 36000
DOWNLOAD_DELAY = 1.5
RETRY_ENABLED = False
COOKIES_ENABLED = False
# http://www.ydzww.com

This is my configuration. Having run it for quite a few days now, it collects about 40 pages per minute, which is about right.

Filtering of content

The content is basically extracted with XPath, and the chapter text is additionally cleaned with a few regular expressions to remove URLs and other bits of information about the source site embedded in it.

(http(s)?://.)?(www\.)?[email protected]:!$^&\*%.()_\+~#=\uff10-\uff40{}\[\]]{2,256}[\[\]{}!$^\*&@:%._\+~#=()][\[\]{}a-z!$^\*&@:%._\uff10-\uff40\s]{2,6}\b([\[\]-a-zA-Z0-9()@:%_\+.~#?&//=]*) # www.ydzww.com

This is the regular expression I use to handle URLs in the chapter content. So far I have not run into a URL it cannot handle while collecting novels. If you find one it fails on, leave a comment and I will gladly fix it so it stays convenient for everyone to use.
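As an illustration of how this kind of cleanup is applied (using a deliberately simplified URL pattern as a stand-in, not the full regex above), the chapter text can be scrubbed like this:

# Sketch of stripping URLs / source-site notices from chapter text.
# URL_RE is a simplified stand-in for the full pattern above.
import re

URL_RE = re.compile(r'(https?://)?(www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}\b[-a-zA-Z0-9()@:%_+.~#?&/=]*')

def clean_chapter(text):
    # Remove anything that looks like a URL, then tidy up leftover spaces.
    text = URL_RE.sub('', text)
    return re.sub(r'[ \t]{2,}', ' ', text).strip()

print(clean_chapter(u"Chapter 1 ... collected from www.example.com, read more at http://www.example.com/novel"))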

Compared with existing novel crawlers, this one has the following advantages:
    1. Runs flawlessly on Linux; it also runs on Windows, but the log file may occasionally be garbled

    2. Through the database configuration, each novel is mapped to a source site, and every single novel is monitored on a 3-minute cycle, so each novel is collected as quickly as possible

    3. Fast and stable operation; Scrapy's stability is definitely worth affirming

I have already used this crawler to build a novel site, Easy Read Chinese Web.

