Because Google App Engine is blocked here, I cannot keep improving my Moven project, and I still have 20+ days before I go back. I'm afraid I will forget the progress and details of the project and have to pick it up cold, so here is a record of the general progress:
1. Added cron: it wakes up a task every 30 minutes that visits the designated blogs and crawls the latest updates.
2. Used Google's Datastore to store the content fetched by each crawler... only new content is stored...
As mentioned last time, this greatly improved performance: originally the crawlers were woken up on every request, so it took about 17 seconds to get data from the backend to the frontend; now it takes less than 2 seconds.
3. Optimized the crawler.
1. Using cron.yaml to schedule when each task wakes up
After reading the documentation, I finally figured out how Google's cron works: Google simply tries to access a specified URL at the specified times...
Therefore, under Django there is no need to write a standalone Python script with an entry point like:
if __name__ == "__main__":
You only need to configure a URL and put the handler in views.py:
from django.http import HttpResponse

def updatePostsDB(request):
    # deleteAll()
    SiteInfos = []
    SiteInfo = {}
    SiteInfo['PostSite'] = "L2ZStory"
    SiteInfo['feedurl'] = "feed://l2zstory.wordpress.com/feed/"
    SiteInfo['blog_type'] = "wordpress"
    SiteInfos.append(SiteInfo)

    SiteInfo = {}
    SiteInfo['PostSite'] = "YukiLife"
    SiteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1583902832.xml"
    SiteInfo['blog_type'] = "sina"
    SiteInfos.append(SiteInfo)

    SiteInfo = {}
    SiteInfo['PostSite'] = "ZLife"
    SiteInfo['feedurl'] = "feed://ireallife.wordpress.com/feed/"
    SiteInfo['blog_type'] = "wordpress"
    SiteInfos.append(SiteInfo)

    SiteInfo = {}
    SiteInfo['PostSite'] = "ZLife_Sina"
    SiteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1650910587.xml"
    SiteInfo['blog_type'] = "sina"
    SiteInfos.append(SiteInfo)

    try:
        for site in SiteInfos:
            feedurl = site['feedurl']
            blog_type = site['blog_type']
            PostSite = site['PostSite']
            PostInfos = getPostInfosFromWeb(feedurl, blog_type)
            recordToDB(PostSite, PostInfos)
        Msg = "Cron Job Done..."
    except Exception, e:
        Msg = str(e)
    return HttpResponse(Msg)
cron.yaml should be placed at the same level as app.yaml:
cron:
- description: retrieve newest posts
  url: /task_updatePosts/
  schedule: every 30 minutes
In urls.py, just map the URL to the view: point /task_updatePosts/ to updatePostsDB, as shown below.
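For reference, the mapping might look roughly like this (a minimal sketch, assuming the old Django 1.x patterns()/url() style that shipped with the App Engine django-helper; 'myapp' is just a placeholder for the actual application name):

    from django.conf.urls.defaults import patterns, url

    urlpatterns = patterns('',
        # cron hits this path every 30 minutes, which runs the crawler view
        url(r'^task_updatePosts/$', 'myapp.views.updatePostsDB'),
    )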
The process of debugging this cron can only be described as tragic... On Stack Overflow, plenty of people ask why their cron won't run... At first I was sweating and completely at a loss... In the end I got lucky and worked out the general steps, which are actually very simple:
First, make sure your program has no syntax errors... Then try accessing the URL manually; if that succeeds, cron should be able to execute the task. If it still doesn't, check the logs...
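A quick way to do that manual check (just a sketch; replace the host with your own app's domain, or the local dev server address):

    import urllib
    # hit the cron URL by hand; on success the view returns "Cron Job Done..."
    print urllib.urlopen("http://localhost:8080/task_updatePosts/").read()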
2. Configuring and using the Datastore -- using the Datastore with Django
My requirements here are simple -- no joins... so I just used the most basic django-helper...
The models.py file is the important part. The code is as follows:
from appengine_django.models import BaseModel
from google.appengine.ext import db

class PostsDB(BaseModel):
    link = db.LinkProperty()
    title = db.StringProperty()
    author = db.StringProperty()
    date = db.DateTimeProperty()
    description = db.TextProperty()
    postSite = db.StringProperty()
The first two lines are the key part... At first I didn't write the second one, and it took me more than two hours to figure out what was going on... Definitely not worth it...
When reading and writing data, never forget... PostsDB.put()
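For instance, writing and reading with the model above looks roughly like this (just a sketch with placeholder values, not the real crawler data):

    import datetime

    # writing: build an entity and remember put(), otherwise nothing is saved
    post = PostsDB(link="http://example.com/post-1",
                   title="Hello",
                   author="someone",
                   date=datetime.datetime.now(),
                   description="first post",
                   postSite="L2ZStory")
    post.put()

    # reading: fetch the 10 newest posts, newest first
    latest = PostsDB.all().order('-date').fetch(10)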
At the beginning, to save trouble, every time cron woke up I simply deleted all the data and rewrote the freshly crawled data...
The result... one day later... 40,000 read/write operations... and the free quota is only 50,000 operations per day...
So now, before inserting, I first check whether a post is new: new posts get written, everything else is skipped... That finally got this part of the database handled properly...
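A minimal sketch of that check (not the exact original recordToDB; it assumes getPostInfosFromWeb returns dicts with 'link', 'title', 'date' (already a datetime) and 'desc' keys, and deduplicates by link using the models.py fields above):

    def recordToDB(PostSite, PostInfos):
        for info in PostInfos:
            # gql() is the old google.appengine.ext.db query API;
            # only write the post if its link is not already in the Datastore
            existing = PostsDB.gql("WHERE link = :1", info['link']).get()
            if existing is None:
                PostsDB(link=info['link'],
                        title=info['title'],
                        date=info['date'],
                        description=info['desc'],
                        postSite=PostSite).put()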
3. Crawler improvements:
At first, the crawler only fetched the articles listed in the feed, so even if a blog had 24*30 articles, you could get at most 10 of them...
This upgraded version can crawl all of the articles. I ran experiments against the blogs of Gu Du Chuan Ling, Han, Yuki, and Z respectively... all succeeded. Gu Du Chuan Ling alone has more than 720 articles... the ones that had been missed were all crawled down...
import urllib
# from BeautifulSoup import BeautifulSoup
from pyquery import PyQuery as pq

def getArticleList(url):
    lstArticles = []
    url_prefix = url[:-6]
    Cnt = 1
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    try:
        pageCnt = d("ul.SG_pages").find('span')
        pageCnt = int(d(pageCnt).text()[1:-1])
    except:
        pageCnt = 1
    for i in range(1, pageCnt + 1):
        url = url_prefix + str(i) + ".html"
        # print url
        response = urllib.urlopen(url)
        html = response.read()
        d = pq(html)
        title_spans = d(".atc_title").find('a')
        date_spans = d('.atc_tm')
        for j in range(0, len(title_spans)):
            titleObj = title_spans[j]
            dateObj = date_spans[j]
            article = {}
            article['link'] = d(titleObj).attr('href')
            article['title'] = d(titleObj).text()
            article['date'] = d(dateObj).text()
            article['desc'] = getPageContent(article['link'])
            lstArticles.append(article)
    return lstArticles

def getPageContent(url):
    # get Page Content
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    pageContent = d("p.articalContent").text()
    # print pageContent
    return pageContent

def main():
    url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'  # Han
    url = "http://blog.sina.com.cn/s/articlelist_1225833283_0_1.html"  # Gu Du Chuan Ling
    url = "http://blog.sina.com.cn/s/articlelist_1650910587_0_1.html"  # Feng
    url = "http://blog.sina.com.cn/s/articlelist_1583902832_0_1.html"  # Yuki
    lstArticles = getArticleList(url)
    for article in lstArticles:
        f = open("blogs/" + article['date'] + "_" + article['title'] + ".txt", 'w')
        f.write(article['desc'].encode('utf-8'))  # pay special attention to handling Chinese characters
        f.close()
        # print article['desc']

if __name__ == '__main__':
    main()
A recommendation for PyQuery...
Sorry to say, I am deeply disappointed with BeautifulSoup... While writing the previous article there was a small bug I just could not find. After I got home, I spent a lot of time trying to figure out why BeautifulSoup never captured what I wanted... Later I looked at the source code of its selector and concluded that it is simply not accurate when parsing the many nonstandard HTML pages that are full of script tags...
I gave up on that library and tried lxml, which is based on XPath... but I kept having to look up XPath documentation... So I found yet another library, PyQuery, which lets you use jQuery-style selectors... Very, very, very handy... For concrete usage see the code above... This library looks promising...
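Just to show what those jQuery-style selectors feel like, here is a toy snippet (not part of the crawler):

    from pyquery import PyQuery as pq

    d = pq("<div><p class='articalContent'>hello</p>"
           "<a href='http://example.com'>link</a></div>")
    print d("p.articalContent").text()   # -> hello
    print d("a").attr("href")            # -> http://example.com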
Worry
Because PyQuery is built on lxml, and lxml is implemented in C underneath... I suspect it cannot be used on GAE... So for now my crawler can only crawl things on my own computer and then push them to the server...
Summary
In a word, I love Python.
I love Python, and I love Django.
I love Python, Django, jQuery, and so on...
I love Python, I love Django, I love jQuery, and I love PyQuery...