Python crawler: crawl all the articles of a specified blog


Following up on the previous article, Z Story: Using Django with GAE Python, which fetched the full text of pages from several websites in the background, the general progress is as follows:
1. Added cron: it tells the program to wake up a task every 30 minutes and crawl the designated blogs for the latest updates.
2. Used Google's Datastore to store the content fetched by each crawler, storing only new content.

As mentioned last time, performance has improved greatly: originally, every request woke the crawlers, so it took about 17 seconds to get data from the back end to the front end; now it takes less than 2 seconds.

3. Optimized the crawler.

1. Using cron.yaml to schedule when each task wakes up

After reading the documentation, I finally figured out how Google's cron works: Google simply tries to access a specified URL at the specified times...
So with Django you do not need to write a standalone Python program with
if __name__ == "__main__":
You only need to configure a URL and put the handler in views.py:

def updatePostsDB(request):
    # deleteAll()
    SiteInfos = []

    SiteInfo = {}
    SiteInfo['PostSite'] = "L2ZStory"
    SiteInfo['feedurl'] = "feed://l2zstory.wordpress.com/feed/"
    SiteInfo['blog_type'] = "wordpress"
    SiteInfos.append(SiteInfo)

    SiteInfo = {}
    SiteInfo['PostSite'] = "YukiLife"
    SiteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1583902832.xml"
    SiteInfo['blog_type'] = "sina"
    SiteInfos.append(SiteInfo)

    SiteInfo = {}
    SiteInfo['PostSite'] = "ZLife"
    SiteInfo['feedurl'] = "feed://ireallife.wordpress.com/feed/"
    SiteInfo['blog_type'] = "wordpress"
    SiteInfos.append(SiteInfo)

    SiteInfo = {}
    SiteInfo['PostSite'] = "ZLife_Sina"
    SiteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1650910587.xml"
    SiteInfo['blog_type'] = "sina"
    SiteInfos.append(SiteInfo)

    try:
        for site in SiteInfos:
            feedurl = site['feedurl']
            blog_type = site['blog_type']
            PostSite = site['PostSite']
            PostInfos = getPostInfosFromWeb(feedurl, blog_type)
            recordToDB(PostSite, PostInfos)
        Msg = "Cron Job Done..."
    except Exception, e:
        Msg = str(e)
    return HttpResponse(Msg)
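The view above calls two helpers that are not listed in this article, getPostInfosFromWeb and recordToDB. Just to illustrate the idea, a minimal sketch of the feed-parsing half could look like the following, assuming the feedparser library is available; the field names are chosen to line up with the model introduced below, and this is not the original code:

import feedparser

def getPostInfosFromWeb(feedurl, blog_type):
    # blog_type could be used to branch between wordpress and sina feeds; ignored in this sketch
    # feedparser only speaks http(s), so rewrite the feed:// scheme first
    feedurl = feedurl.replace("feed://", "http://")
    parsed = feedparser.parse(feedurl)
    PostInfos = []
    for entry in parsed.entries:
        post = {}
        post['title'] = entry.get('title', '')
        post['link'] = entry.get('link', '')
        post['author'] = entry.get('author', '')
        post['date'] = entry.get('published', '')
        post['description'] = entry.get('description', '')
        PostInfos.append(post)
    return PostInfos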

cron.yaml should be placed at the same level as app.yaml:

cron:
- description: retrieve newest posts
  url: /task_updatePosts/
  schedule: every 30 minutes

In urls.py, just map this URL, /task_updatePosts/, to the updatePostsDB view.
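For the old GAE-era Django, the urls.py entry might look roughly like this; the module path of the view is an assumption, and the only thing that matters is that /task_updatePosts/ reaches updatePostsDB:

from django.conf.urls.defaults import patterns

urlpatterns = patterns('',
    (r'^task_updatePosts/$', 'views.updatePostsDB'),  # assumed module path
)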

The process of debugging this cron can only be described as tragic... On Stack Overflow, many people ask why their cron will not run... At first I was sweating and had no idea where to start... In the end I was lucky enough to get it working, and the general approach is actually very simple:
First, make sure your program has no syntax errors... Then access the URL manually; if cron is working, the task should already have been executed at the scheduled time. If it has not, check the logs...

2. Configuring and using the Datastore -- using Datastore with Django

My requirements here are simple -- no joins... so I directly used the most primitive django-helper...
This models.py is important:

The code is as follows:

from appengine_django.models import BaseModel
from google.appengine.ext import db

class PostsDB(BaseModel):
    Link = db.LinkProperty()
    Title = db.StringProperty()
    Author = db.StringProperty()
    Date = db.DateTimeProperty()
    Description = db.TextProperty()
    PostSite = db.StringProperty()

The first two lines are the key... I did not write the second import at first... and it took me more than two hours to figure out what was going on... Not worth it...
Do not forget the read and write calls either... PostsDB.put()
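For reference, reading and writing with this model is just the plain google.appengine.ext.db API; a quick sketch with made-up example values:

import datetime

# write: build an entity and call put()
post = PostsDB(Link="http://example.com/some-post",
               Title="Some post",
               Author="someone",
               Date=datetime.datetime.now(),
               Description="full text goes here",
               PostSite="L2ZStory")
post.put()

# read: filter by site, newest first
recent = PostsDB.all().filter("PostSite =", "L2ZStory").order("-Date").fetch(10)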

At first, to save trouble, I simply had the cron wake up, delete all the data, and rewrite everything that had just been crawled...
The result... one day later... 40,000 read/write operations... and the free quota is only 50,000 per day....
So now we first check whether a post is already stored before inserting it: write it only if it is new, otherwise skip it. That finally got this part of the database under control, as sketched below...
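A sketch of what recordToDB can look like with that check in place; the query-by-Link approach is my assumption of how to detect an already-stored post, and parsing the feed's own date string is skipped here:

import datetime

def recordToDB(PostSite, PostInfos):
    for post in PostInfos:
        # skip posts whose link is already in the Datastore
        if PostsDB.all().filter("Link =", post['link']).get():
            continue
        entry = PostsDB(Link=post['link'],
                        Title=post['title'],
                        Author=post['author'],
                        Date=datetime.datetime.now(),  # the feed's own date could be parsed instead
                        Description=post['description'],
                        PostSite=PostSite)
        entry.put()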

3. Crawler improvements:
At first the crawler only fetched the articles in the RSS feed... that way, even if a blog has 24*30 articles, you can get at most 10 of them....
This new version can crawl all the articles. I experimented with the blogs of Gu Du Chuan Ling, Han, Yuki, and Z... all succeeded... Gu Du Chuan Ling alone has more than 720 articles... and the ones that were missing before were all crawled down...

import urllib
# from BeautifulSoup import BeautifulSoup
from pyquery import PyQuery as pq

def getArticleList(url):
    lstArticles = []
    url_prefix = url[:-6]
    Cnt = 1
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    try:
        pageCnt = d("ul.SG_pages").find('span')
        pageCnt = int(d(pageCnt).text()[1:-1])
    except:
        pageCnt = 1
    for i in range(1, pageCnt + 1):
        url = url_prefix + str(i) + ".html"
        # print url
        response = urllib.urlopen(url)
        html = response.read()
        d = pq(html)
        title_spans = d(".atc_title").find('a')
        date_spans = d('.atc_tm')
        for j in range(0, len(title_spans)):
            titleObj = title_spans[j]
            dateObj = date_spans[j]
            article = {}
            article['link'] = d(titleObj).attr('href')
            article['title'] = d(titleObj).text()
            article['date'] = d(dateObj).text()
            article['desc'] = getPageContent(article['link'])
            lstArticles.append(article)
    return lstArticles

def getPageContent(url):
    # get page content
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    pageContent = d("div.articalContent").text()
    # print pageContent
    return pageContent

def main():
    url = "http://"  # Han (URL garbled in the source)
    url = "http://blog.sina.com.cn/s/articlelist_1225833283_0_1.html"  # Gu Du Chuan Ling
    url = "http://blog.sina.com.cn/s/articlelist_1650910587_0_1.html"  # Feng
    url = "http://blog.sina.com.cn/s/articlelist_1583902832_0_1.html"  # Yuki
    lstArticles = getArticleList(url)
    for article in lstArticles:
        f = open("blogs/" + article['date'] + "_" + article['title'] + ".txt", 'w')
        f.write(article['desc'].encode('utf-8'))  # pay special attention to the handling of Chinese characters
        f.close()
        # print article['desc']

if __name__ == '__main__':
    main()

A recommendation for PyQuery...
I am sorry to say I was deeply disappointed with BeautifulSoup... When I wrote the previous article there was a small bug I just could not find... After I got home, I spent a lot of time trying to work out why BeautifulSoup never caught what I wanted... Later I looked at the source code of its selector and concluded that its parsing is unreliable for the many non-standard HTML pages full of <script> tags...

I gave up on that library and tried lxml, which is based on XPath... but I kept having to look up the XPath documentation... so I found yet another library, PyQuery... it lets you use jQuery-style selectors... very, very useful... For concrete usage, see the code above... This library looks promising...
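A tiny, self-contained example of the jQuery-style selectors (made-up HTML, just to show the flavor):

from pyquery import PyQuery as pq

html = "<div><p class='atc_title'><a href='/a'>First</a></p>" \
       "<p class='atc_title'><a href='/b'>Second</a></p></div>"
d = pq(html)
for a in d("p.atc_title").find("a"):      # CSS selector, then descend, jQuery style
    print d(a).text(), d(a).attr("href")  # -> First /a, Second /b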

A worry
Because PyQuery is based on lxml, and lxml is implemented in C underneath... it presumably cannot be used on GAE... So for now my crawler can only fetch the good stuff on my own machine and then push it to the server...

Summary

In a word, I love Python.
I love Python, and I love Django.
I love Python, Django, jQuery, and so on...
And finally: I love Python, I love Django, I love jQuery, and I love PyQuery...
