Python crawler: crawling all articles of a specified blog

Since the previous article, Z Story: Using Django with GAE Python to crawl the full text of multiple websites in the background, the overall progress is as follows:
1. Added cron: it tells the program to wake up a task every 30 minutes and go to the designated blogs to crawl the latest updates.
2. Used Google's Datastore to store the content of each crawl. Only new content is stored.

As I said last time, performance has improved significantly: previously the crawler was woken up on every request and it took about 17 seconds to return output from the backend to the frontend; now it takes only about 2 seconds.

3. Optimized the crawler.

1. cron.yaml: scheduling when each task wakes up

After flipping through the docs and asking around, I finally figured out how Google's cron works: at each scheduled time it simply makes a virtual visit to a URL that we designate ourselves...
So under Django there is no need to write a standalone Python script with an
if __name__ == "__main__":
entry point; just configure a URL yourself and put the handler in views.py:

from django.http import HttpResponse

def updatePostsDB(request):
    #deleteAll()
    siteInfos = []

    siteInfo = {}
    siteInfo['PostSite'] = "L2ZStory"
    siteInfo['feedurl'] = "feed://l2zstory.wordpress.com/feed/"
    siteInfo['blog_type'] = "WordPress"
    siteInfos.append(siteInfo)

    siteInfo = {}
    siteInfo['PostSite'] = "YukiLife"
    siteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1583902832.xml"
    siteInfo['blog_type'] = "Sina"
    siteInfos.append(siteInfo)

    siteInfo = {}
    siteInfo['PostSite'] = "ZLife"
    siteInfo['feedurl'] = "feed://ireallife.wordpress.com/feed/"
    siteInfo['blog_type'] = "WordPress"
    siteInfos.append(siteInfo)

    siteInfo = {}
    siteInfo['PostSite'] = "ZLife_Sina"
    siteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1650910587.xml"
    siteInfo['blog_type'] = "Sina"
    siteInfos.append(siteInfo)

    try:
        for site in siteInfos:
            feedurl = site['feedurl']
            blog_type = site['blog_type']
            PostSite = site['PostSite']
            PostInfos = getPostInfosFromWeb(feedurl, blog_type)
            recordToDB(PostSite, PostInfos)
        msg = "Cron Job Done..."
    except Exception, e:
        msg = str(e)
    return HttpResponse(msg)

cron.yaml is placed at the same level as app.yaml:

cron:
- description: retrieve newest posts
  url: /task_updatePosts/
  schedule: every 30 minutes

In urls.py, just point the URL /task_updatePosts/ at the updatePostsDB view.
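The post does not show the urls.py entry itself; under the old Django bundled with appengine_django, a minimal sketch might look like this (the module path to views.py is my assumption):

from django.conf.urls.defaults import *

urlpatterns = patterns('',
    # cron "visits" this URL, which runs the crawler view above
    (r'^task_updatePosts/$', 'views.updatePostsDB'),
)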

Debugging this cron process can only be described as tragic... plenty of people on StackOverflow ask why their cron won't run. At first I was sweating too and couldn't make head or tail of it. Fortunately, although the documentation is vague, the steps turned out to be very simple:
First, make sure your program has no syntax errors... Then try visiting the URL manually; if cron is working, the task should already have executed by that point. If it still fails, the only thing left is to go read the logs...

2. Configuring and using the Datastore: using Datastore with Django

My needs here are simple: no joins... so I just used the humble django-helper (appengine_django) directly.
This models.py is the key point:

The code is as follows:


from appengine_django.models import BaseModel
from google.appengine.ext import db

class PostsDB(BaseModel):
    link = db.LinkProperty()
    title = db.StringProperty()
    author = db.StringProperty()
    date = db.DateTimeProperty()
    description = db.TextProperty()
    postSite = db.StringProperty()

The first two import lines are the crucial part... I left out the second one at first, and it took me two hours to work out what was going on. Not worth it...
When writing to the Datastore, don't forget to call put() on the entity, as in the sketch a little further below.

In the beginning, to save time, I simply deleted all the data every time cron woke up and then rewrote whatever had just been crawled...
The result... after one day... there were 40,000 read/write operations... and the free quota is only 50,000 per day...
So I changed it to check, before inserting, whether a post already exists: write it if it is new, skip it otherwise. That finally sorted out the database side of things...
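recordToDB itself is not shown in the post; a rough sketch of this check-before-write logic, assuming the PostsDB model above and post dicts carrying link / title / date / desc with the date already parsed into a datetime, might look like this:

from google.appengine.ext import db
from models import PostsDB  # the model defined above; the import path is an assumption

def recordToDB(PostSite, PostInfos):
    for info in PostInfos:
        # only write posts we have not stored yet, to stay inside the free quota
        query = PostsDB.all()
        query.filter('postSite =', PostSite)
        query.filter('link =', db.Link(info['link']))
        if query.get():
            continue
        post = PostsDB(link=db.Link(info['link']),
                       title=info['title'],
                       author=info.get('author', ''),
                       date=info['date'],        # must be a datetime.datetime
                       description=info['desc'],
                       postSite=PostSite)
        post.put()  # don't forget put(), otherwise nothing is persisted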

3. Crawler improvements:
In the beginning, the crawler only read the articles exposed in the feed, so if a blog has 24*30 articles you can fetch at most about 10 of them...
This time, the improved version can crawl all of the articles. I experimented on the blogs of Gu Du Chuan Ling, Han Han, Yuki and Z, and it worked very well: Gu Du Chuan Ling alone has 720+ articles, and every one of them was crawled down, nothing missing.
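getPostInfosFromWeb, which the cron view calls for the feed-only path, is also not shown in the post. A minimal sketch using the feedparser library (my assumption; the original may parse the RSS differently) could look like the following, returning dicts in the shape recordToDB expects; the improved Sina crawler that walks every article-list page comes right after it.

import datetime
import feedparser  # assumption: any RSS parser exposing entries would do

def getPostInfosFromWeb(feedurl, blog_type):
    # feedparser does not understand the feed:// scheme, so swap it for http://
    d = feedparser.parse(feedurl.replace('feed://', 'http://'))
    PostInfos = []
    for entry in d.entries:  # a feed usually exposes only the ~10 newest posts
        info = {}
        info['title'] = entry.title
        info['link'] = entry.link
        info['author'] = entry.get('author', '')
        if entry.get('published_parsed'):
            info['date'] = datetime.datetime(*entry.published_parsed[:6])
        else:
            info['date'] = datetime.datetime.now()
        info['desc'] = entry.get('description', '')
        PostInfos.append(info)
    # blog_type (WordPress vs. Sina) is unused in this sketch; the original
    # presumably branches on it to handle the two feed formats differently.
    return PostInfos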

import urllib
#from BeautifulSoup import BeautifulSoup
from pyquery import PyQuery as pq

def getArticleList(url):
    lstArticles = []
    url_prefix = url[:-6]
    cnt = 1
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    try:
        # the pager ("ul.SG_pages") tells us how many article-list pages the blog has
        pageCnt = d("ul.SG_pages").find('span')
        pageCnt = int(d(pageCnt).text()[1:-1])
    except:
        pageCnt = 1
    for i in range(1, pageCnt + 1):
        url = url_prefix + str(i) + ".html"
        #print url
        response = urllib.urlopen(url)
        html = response.read()
        d = pq(html)
        title_spans = d(".atc_title").find('a')
        date_spans = d('.atc_tm')
        for j in range(0, len(title_spans)):
            titleObj = title_spans[j]
            dateObj = date_spans[j]
            article = {}
            article['link'] = d(titleObj).attr('href')
            article['title'] = d(titleObj).text()
            article['date'] = d(dateObj).text()
            article['desc'] = getPageContent(article['link'])
            lstArticles.append(article)
    return lstArticles

def getPageContent(url):
    #get page content
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    pageContent = d("div.articalContent").text()
    #print pageContent
    return pageContent

def main():
    url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'  #Han Han
    url = "http://blog.sina.com.cn/s/articlelist_1225833283_0_1.html"  #Gu Du Chuan Ling
    url = "http://blog.sina.com.cn/s/articlelist_1650910587_0_1.html"  #Feng
    url = "http://blog.sina.com.cn/s/articlelist_1583902832_0_1.html"  #Yuki
    lstArticles = getArticleList(url)
    for article in lstArticles:
        f = open("blogs/" + article['date'] + "_" + article['title'] + ".txt", 'w')
        f.write(article['desc'].encode('utf-8'))  # pay special attention to the handling of Chinese text
        f.close()
        #print article['desc']

if __name__ == '__main__':
    main()

PyQuery is highly recommended.
I'm sorry to say that BeautifulSoup left me deeply disappointed... When I wrote the previous article there was a small bug whose cause I could never find. After I got home I spent a lot of time trying to work out why BeautifulSoup couldn't grab what I wanted... Later I skimmed the selector part of its source code and I think the problem is probably with how it handles a lot of ...
