Python crawler: crawling all articles of a specified blog

Since the previous article, Z Story: Using Django with GAE Python to crawl the full text of multiple websites in the background, the overall progress is as follows:
1. Added cron: it tells the program to wake up a task every 30 minutes and go to the designated blogs to crawl the latest updates.
2. Used Google's Datastore to store the content of each crawl. Only new content is stored.

As I said last time, performance has improved significantly: previously the crawler was woken up on every request and it took about 17 seconds to return output from the backend to the frontend; now it takes only about 2 seconds.

3. Optimized the crawler.

1. cron.yaml: scheduling when each task wakes up

After flipping through the docs and asking around, I finally figured out how Google's cron works: at each scheduled time it simply makes a virtual visit to a URL that we designate ourselves...
So under Django there is no need to write a standalone Python script with an
if __name__ == "__main__":
entry point; just configure a URL yourself and put the handler in views.py:

from django.http import HttpResponse

def updatePostsDB(request):
    #deleteAll()
    siteInfos = []

    siteInfo = {}
    siteInfo['PostSite'] = "L2ZStory"
    siteInfo['feedurl'] = "feed://l2zstory.wordpress.com/feed/"
    siteInfo['blog_type'] = "WordPress"
    siteInfos.append(siteInfo)

    siteInfo = {}
    siteInfo['PostSite'] = "YukiLife"
    siteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1583902832.xml"
    siteInfo['blog_type'] = "Sina"
    siteInfos.append(siteInfo)

    siteInfo = {}
    siteInfo['PostSite'] = "ZLife"
    siteInfo['feedurl'] = "feed://ireallife.wordpress.com/feed/"
    siteInfo['blog_type'] = "WordPress"
    siteInfos.append(siteInfo)

    siteInfo = {}
    siteInfo['PostSite'] = "ZLife_Sina"
    siteInfo['feedurl'] = "feed://blog.sina.com.cn/rss/1650910587.xml"
    siteInfo['blog_type'] = "Sina"
    siteInfos.append(siteInfo)

    try:
        for site in siteInfos:
            feedurl = site['feedurl']
            blog_type = site['blog_type']
            PostSite = site['PostSite']
            PostInfos = getPostInfosFromWeb(feedurl, blog_type)
            recordToDB(PostSite, PostInfos)
        msg = "Cron Job Done..."
    except Exception, e:
        msg = str(e)
    return HttpResponse(msg)

cron.yaml is placed at the same level as app.yaml:

cron:
- description: retrieve newest posts
  url: /task_updatePosts/
  schedule: every 30 minutes

In urls.py, just point the URL /task_updatePosts/ at the updatePostsDB view.
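The post does not show the urls.py entry itself; under the old Django bundled with appengine_django, a minimal sketch might look like this (the module path to views.py is my assumption):

from django.conf.urls.defaults import *

urlpatterns = patterns('',
    # cron "visits" this URL, which runs the crawler view above
    (r'^task_updatePosts/$', 'views.updatePostsDB'),
)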

Debugging this cron process can only be described as tragic... plenty of people on StackOverflow ask why their cron won't run. At first I was sweating too and couldn't make head or tail of it. Fortunately, although the documentation is vague, the steps turned out to be very simple:
First, make sure your program has no syntax errors... Then try visiting the URL manually; if cron is working, the task should already have executed by that point. If it still fails, the only thing left is to go read the logs...

2. Configuring and using the Datastore: using Datastore with Django

My needs here are simple: no joins... so I just used the humble django-helper (appengine_django) directly.
This models.py is the key point:

The code is as follows:


from appengine_django.models import BaseModel
from google.appengine.ext import db

class PostsDB(BaseModel):
    link = db.LinkProperty()
    title = db.StringProperty()
    author = db.StringProperty()
    date = db.DateTimeProperty()
    description = db.TextProperty()
    postSite = db.StringProperty()

The first two import lines are the crucial part... I left out the second one at first, and it took me two hours to work out what was going on. Not worth it...
When writing to the Datastore, don't forget to call put() on the entity, as in the sketch a little further below.

In the beginning, to save time, I simply deleted all the data every time cron woke up and then rewrote whatever had just been crawled...
The result... after one day... there were 40,000 read/write operations... and the free quota is only 50,000 per day...
So I changed it to check, before inserting, whether a post already exists: write it if it is new, skip it otherwise. That finally sorted out the database side of things...
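recordToDB itself is not shown in the post; a rough sketch of this check-before-write logic, assuming the PostsDB model above and post dicts carrying link / title / date / desc with the date already parsed into a datetime, might look like this:

from google.appengine.ext import db
from models import PostsDB  # the model defined above; the import path is an assumption

def recordToDB(PostSite, PostInfos):
    for info in PostInfos:
        # only write posts we have not stored yet, to stay inside the free quota
        query = PostsDB.all()
        query.filter('postSite =', PostSite)
        query.filter('link =', db.Link(info['link']))
        if query.get():
            continue
        post = PostsDB(link=db.Link(info['link']),
                       title=info['title'],
                       author=info.get('author', ''),
                       date=info['date'],        # must be a datetime.datetime
                       description=info['desc'],
                       postSite=PostSite)
        post.put()  # don't forget put(), otherwise nothing is persisted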

3. Crawler improvements:
In the beginning, the crawler only read the articles exposed in the feed, so if a blog has 24*30 articles you can fetch at most about 10 of them...
This time, the improved version can crawl all of the articles. I experimented on the blogs of Gu Du Chuan Ling, Han Han, Yuki and Z, and it worked very well: Gu Du Chuan Ling alone has 720+ articles, and every one of them was crawled down, nothing missing.
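getPostInfosFromWeb, which the cron view calls for the feed-only path, is also not shown in the post. A minimal sketch using the feedparser library (my assumption; the original may parse the RSS differently) could look like the following, returning dicts in the shape recordToDB expects; the improved Sina crawler that walks every article-list page comes right after it.

import datetime
import feedparser  # assumption: any RSS parser exposing entries would do

def getPostInfosFromWeb(feedurl, blog_type):
    # feedparser does not understand the feed:// scheme, so swap it for http://
    d = feedparser.parse(feedurl.replace('feed://', 'http://'))
    PostInfos = []
    for entry in d.entries:  # a feed usually exposes only the ~10 newest posts
        info = {}
        info['title'] = entry.title
        info['link'] = entry.link
        info['author'] = entry.get('author', '')
        if entry.get('published_parsed'):
            info['date'] = datetime.datetime(*entry.published_parsed[:6])
        else:
            info['date'] = datetime.datetime.now()
        info['desc'] = entry.get('description', '')
        PostInfos.append(info)
    # blog_type (WordPress vs. Sina) is unused in this sketch; the original
    # presumably branches on it to handle the two feed formats differently.
    return PostInfos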

import urllib
#from BeautifulSoup import BeautifulSoup
from pyquery import PyQuery as pq

def getArticleList(url):
    lstArticles = []
    url_prefix = url[:-6]
    cnt = 1
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    try:
        # the pager ("ul.SG_pages") tells us how many article-list pages the blog has
        pageCnt = d("ul.SG_pages").find('span')
        pageCnt = int(d(pageCnt).text()[1:-1])
    except:
        pageCnt = 1
    for i in range(1, pageCnt + 1):
        url = url_prefix + str(i) + ".html"
        #print url
        response = urllib.urlopen(url)
        html = response.read()
        d = pq(html)
        title_spans = d(".atc_title").find('a')
        date_spans = d('.atc_tm')
        for j in range(0, len(title_spans)):
            titleObj = title_spans[j]
            dateObj = date_spans[j]
            article = {}
            article['link'] = d(titleObj).attr('href')
            article['title'] = d(titleObj).text()
            article['date'] = d(dateObj).text()
            article['desc'] = getPageContent(article['link'])
            lstArticles.append(article)
    return lstArticles

def getPageContent(url):
    #get page content
    response = urllib.urlopen(url)
    html = response.read()
    d = pq(html)
    pageContent = d("div.articalContent").text()
    #print pageContent
    return pageContent

def main():
    url = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'  #Han Han
    url = "http://blog.sina.com.cn/s/articlelist_1225833283_0_1.html"  #Gu Du Chuan Ling
    url = "http://blog.sina.com.cn/s/articlelist_1650910587_0_1.html"  #Feng
    url = "http://blog.sina.com.cn/s/articlelist_1583902832_0_1.html"  #Yuki
    lstArticles = getArticleList(url)
    for article in lstArticles:
        f = open("blogs/" + article['date'] + "_" + article['title'] + ".txt", 'w')
        f.write(article['desc'].encode('utf-8'))  # pay special attention to the handling of Chinese text
        f.close()
        #print article['desc']

if __name__ == '__main__':
    main()

PyQuery is highly recommended.
I'm sorry to say that BeautifulSoup left me deeply disappointed... When I wrote the previous article there was a small bug whose cause I could never find. After I got home I spent a lot of time trying to work out why BeautifulSoup couldn't grab what I wanted... Later I skimmed the selector part of its source code and I think the problem is probably with how it handles a lot of ...
