Scrapy crawler vs. a hand-written crawler: crawling Jobbole articles


A few days ago I wrote a crawler to fetch the articles in the Python section of Jobbole. That crawler simply saves each page as-is, because Jobbole articles contain both images and code: if you extract only the text, the layout becomes hard to read, so saving the whole page is the simpler choice.
Over the last two days I have been looking at Scrapy, Python's lightweight crawler framework, and trying to write a crawler with it. At first I had no idea where to start, but it gradually began to feel quite good. Still, writing it is one thing; I did not know how it would actually perform. So I used Scrapy to write a crawler for the same Jobbole articles and compared it with the one I had written by hand, jobbole.py.

When I first started writing crawlers I came across multithreading: I saw someone on a forum using multiprocessing.dummy.Pool in a crawler, tried it myself, and found this Pool really convenient, so by now almost every crawler I write uses it. Its basic usage is:

from multiprocessing.dummy import Pool

pool = Pool(10)   # 10 worker threads

# define a function to run in the pool
def run(num):
    print num ** 2

num_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pool.map(run, num_list)   # run the jobs
# close the pool and wait for the threads to finish
pool.close()
pool.join()

That is the basic usage: simple and efficient.
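As a quick sanity check of that "simple and efficient" claim, here is a minimal sketch of my own (not from the original crawler) that simulates ten one-second downloads: with a pool of 10 threads they finish in roughly one second instead of ten.

from multiprocessing.dummy import Pool
import time

def fake_download(num):
    time.sleep(1)            # stand-in for a slow network request
    return num ** 2

pool = Pool(10)              # 10 worker threads
start = time.time()
results = pool.map(fake_download, range(1, 11))
pool.close()
pool.join()
print('results: %s, cost time: %.2fs' % (results, time.time() - start))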
Here is the crawler written in this style.

The crawler works in two steps.

Step 1: parse each listing page, such as http://python.jobbole.com/all-posts/page/2/, and extract the article links from it.

Step 2: fetch each article page (no parsing is done here, since the page is saved directly) and write the content out as an .html file.

The detailed code is as follows:

import re
import sys
import time
import requests as req
from multiprocessing.dummy import Pool

type_ = sys.getfilesystemencoding()   # system encoding used for the saved file names

class DownloadArticle():
    def __init__(self):
        self.url = 'http://python.jobbole.com/all-posts/page/'
        self.article_list = []
        self.savepath = 'd:/python/jobbole/articles/'
        self.errorurl = {}

    # step 1: parse a listing page and collect the article links on it
    def parseArticleUrl(self, page):
        url = self.url + str(page)
        s = req.get(url)
        # matches e.g. <a target="_blank" href="http://python.jobbole.com/81896/" title="...
        art_re = re.compile('<a target=.*?href="(http://python.*?\d{3,5}/)" title=')
        au = art_re.findall(s.content)
        s.close()
        self.article_list.extend(au)

    # step 2: save an article page as an .html file
    def saveArticle(self, url):
        s = req.get(url)
        title = re.findall('<title>(.*?)-.*?</title>', s.content)[0].decode('utf-8').encode(type_)
        try:
            # strip characters that are not allowed in file names
            for item in ['/', ':', '?', '<', '>']:
                title = title.replace(item, '')
            f = open(r'%s%s.html' % (self.savepath, title), 'w')
            f.write(s.content)
            f.close()
        except:
            self.errorurl[title] = url
            print 'error'

    def run(self):
        pool = Pool(10)
        st = time.time()
        # step 1
        pages = range(1, 21)
        pool.map(self.parseArticleUrl, pages)
        print 'step 1 finished! cost time: %fs' % (time.time() - st)
        st1 = time.time()
        # step 2
        pool.map(self.saveArticle, self.article_list)
        print 'step 2 finished! cost time: %fs' % (time.time() - st1)
        # finish
        pool.close()
        pool.join()
        print 'all articles downloaded! cost time: %fs' % (time.time() - st)

if __name__ == '__main__':
    print 'articles begin download!'
    d = DownloadArticle()
    d.run()

Run Result:

Crawling the 20 pages of articles took 8.65 seconds in total, which is pretty fast.

A crawler written with Scrapy

I will not introduce the basics of Scrapy here; see the official documentation for that.
First open cmd, use the cd command to switch to your working directory, then enter:

scrapy startproject jobbole

Here jobbole is the name you give the project.
A directory is then created in the working directory, and the structure is this:

jobbole
|–jobbole
| |–spiders
| | |–__init__.py
| |
| |–__init__.py
| |–items.py
| |–pipelines.py
| |–settings.py
|
|–scrapy.cfg

items.py is used to define what to crawl. pipelines.py is used to process the crawled data, for example deduplication, cleaning, and storage. settings.py holds the project settings. You also create your own spider file (here JobSpider.py) inside the spiders folder.

Let's go through how to set up each of these files, one by one.

items.py

import scrapy

class JobboleItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

Very simple: just these few lines. The first two lines are boilerplate; title and content are what I want to crawl: one is the article's title, the other is the article content, which here is the whole web page.
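Just to show how such an item behaves (a hypothetical snippet of my own, not one of the project files): a scrapy.Item is filled in and read like a dictionary.

from jobbole.items import JobboleItem

item = JobboleItem()
item['title'] = u'Some article title'     # in the spider this comes from the <title> tag
item['content'] = '<html>...</html>'      # in the spider this is the raw page body
print(item['title'])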

The spider: JobSpider.py, under the spiders folder

# -*- coding: utf-8 -*-
"""
Created on %(date)s

@author: c.yingxian
"""
import sys
type_ = sys.getfilesystemencoding()
reload(sys)
sys.setdefaultencoding('utf-8')

from scrapy.spiders import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from jobbole.items import JobboleItem
from scrapy.selector import Selector

class JobSpider(Spider):
    page_link = set()        # these two sets are used below to skip duplicate links
    content_link = set()
    name = 'jobbole'         # the name matters: you need it to run the spider
    allowed_domains = ['jobbole.com']   # only crawl inside this domain
    # where the crawl starts
    start_urls = ['http://python.jobbole.com/all-posts/']
    # link-extraction rules: 'page' is for paging through the listings,
    # 'content' finds the links to the articles themselves
    rules = {'page': LinkExtractor(allow=('http://python.jobbole.com/all-posts/page/\d+/')),
             'content': LinkExtractor(allow=('http://python.jobbole.com/\d+/'))}

    # the URLs in start_urls are passed to parse() first
    def parse(self, response):
        # for every listing-page link found, follow it if it is not already in the set
        for link in self.rules['page'].extract_links(response):
            if link.url not in self.page_link:
                self.page_link.add(link.url)
                yield Request(link.url, callback=self.parse_page)
        # for every article link, follow it if it is not already in content_link
        for link in self.rules['content'].extract_links(response):
            if link.url not in self.content_link:
                self.content_link.add(link.url)
                yield Request(link.url, callback=self.parse_content)

    # parses a listing page and finds further listing pages and article links
    def parse_page(self, response):
        for link in self.rules['page'].extract_links(response):
            if link.url not in self.page_link:
                self.page_link.add(link.url)
                yield Request(link.url, callback=self.parse_page)
        for link in self.rules['content'].extract_links(response):
            if link.url not in self.content_link:
                self.content_link.add(link.url)
                yield Request(link.url, callback=self.parse_content)

    # parses an article page: grabs the title and the content
    def parse_content(self, response):
        item = JobboleItem()
        sel = Selector(response)
        title = sel.xpath('//title/text()').extract()[0]
        item['title'] = title
        item['content'] = response.body
        return item          # the item must be returned
pipelines.py
from jobbole.items import JobboleItem

class JobbolePipeline(object):
    path = 'd:/python/scrapy/jobbole/art/'

    def process_item(self, item, spider):
        title = item['title']
        # strip characters that are not allowed in file names
        for it in ['/', ':', '?', '<', '>']:
            title = title.replace(it, '')
        f = open(self.path + title + '.html', 'w')
        f.write(item['content'])
        f.close()
        return item

There is not much to say about pipelines.py: it mainly writes out the data it receives.
After setting up pipelines.py there is one very important step left: registering the pipeline in settings.py.

settings.py

BOT_NAME = 'jobbole'

SPIDER_MODULES = ['jobbole.spiders']
NEWSPIDER_MODULE = 'jobbole.spiders'

ITEM_PIPELINES = {'jobbole.pipelines.JobbolePipeline': 1000}

The first three lines are generated when the project is created; the key change is adding the ITEM_PIPELINES line.

At this point, the Scrapy crawler is set up.
Now go to the project root directory, which is here:
jobbole   <- this is the project root directory (the one that contains scrapy.cfg)
|–jobbole
| |–spiders
| | |–__init__.py
| |
| |–__init__.py
| |–items.py
| |–pipelines.py
| |–settings.py
|
|–scrapy.cfg
Open cmd to run the project; in cmd, enter:

scrapy crawl jobbole

Here jobbole is the name attribute set in the spider; remember it.
Run Result:

See the running time?
It comes to 13:04:47 - 12:54:23 = 624 seconds.
The start_time in the log is the start time, but it is reported in GMT+0 (Greenwich Mean Time), and we are in the GMT+8 time zone, so the hour 4 in the log corresponds to 12 local time (4 + 8 = 12). Did you see it? One crawler took 8.65 seconds, the other 624 seconds.
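For anyone who wants to double-check the subtraction, a quick throwaway snippet of my own (not part of either crawler) gives the same 624 seconds:

from datetime import datetime

start = datetime.strptime('12:54:23', '%H:%M:%S')
finish = datetime.strptime('13:04:47', '%H:%M:%S')
print((finish - start).total_seconds())   # 624.0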

Well, I think the reason for this gap is probably that my Scrapy setup is far from perfect; after all, I have only been using it for two days.
If that is the case, this is not really a fair timing comparison yet.
But the benefit of Scrapy is obvious: it is easy to write (once you understand the principles, of course, and honestly I am not that familiar with it yet).
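If you want to chase the gap further, one place to look (my guess, not something the original project tried) is the concurrency options in settings.py; CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY are standard Scrapy settings.

# hypothetical additions to settings.py, not from the original project
CONCURRENT_REQUESTS = 32             # total concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests per domain (default is 8)
DOWNLOAD_DELAY = 0                   # delay between requests to the same site (default is 0)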
