This article introduces using Django with GAE Python to capture the full text of pages from multiple websites in the background; the larger goal is a platform, named Moven, for filtering out high-quality articles and blogs. The implementation is divided into three stages:
1. Downloader: downloads the specified URL and passes the obtained content to the Analyser. This is the simplest part to start with.
2. Analyser: uses regular expressions, XPath, BeautifulSoup, and lxml to filter and simplify the received content. This part is not too difficult. (A rough sketch of these first two stages follows this list.)
3. Smart Crawler: captures links to high-quality articles. This is the hardest part:
The crawler itself can be built quickly on the Scrapy framework.
However, judging whether the article behind a link is of high quality requires a complicated algorithm.
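As a rough illustration of how the first two stages fit together (the function names below are invented for this sketch, not taken from the real app), the Downloader fetches the raw HTML and hands it to the Analyser, which uses BeautifulSoup to keep only the interesting parts:

import urllib2
from BeautifulSoup import BeautifulSoup

def download(url):
    # Downloader: fetch the specified URL and return the raw HTML
    return urllib2.urlopen(url).read()

def analyse(html):
    # Analyser: filter and simplify the received content
    soup = BeautifulSoup(html)
    title = soup.find('title')
    paragraphs = soup.findAll('p')
    return title, paragraphs

title, paragraphs = analyse(download("http://l2zstory.wordpress.com"))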
Starting with the Downloader and Analyser: I recently set up L2Z Story, and there are also Yuki Life, Z Life, and Z Life @ Sina. As a Downloader and Analyser exercise, I wrote this app to watch these four sites and synchronize all their content to this site:
http://l2zstory.appspot.com
App features
Except for the black navigation bar at the top and the About This Site section on the far right, all other content is fetched automatically from the other sites.
In principle, you can add any blog or website address to it... Of course, this is L2Z Story, so only these four sites are included.
Feature: as long as the site owners keep updating, this thing will keep working; such is the power of lazy people.
It is worth mentioning that the Content menu is generated automatically on the client side with JavaScript, which saves resources on the server.
The whole page is captured as HTML here, so for websites that do not output a full-text feed, this app can still grab the text they hide.
Loading takes quite a while, because for pages without full-text output the program crawls the article list, author information, update time, and full text of every article on the spot. So please be patient when opening it... The next step is to add the data storage part, which will make it much faster.
Technical preparation
Front end:
1. For CSS I stick to the principle of simplicity. Twitter's bootstrap.css meets most of my needs, and I really like its grid system.
2. For JavaScript, jQuery was the natural choice. Ever since I used jQuery in my first small project I have loved it; the dynamic table of contents is generated quickly with jQuery.
To go with bootstrap.css, bootstrap-dropdown.js is also used.
Server:
This app has two versions:
One runs on my own Apache server, but since my connection is ADSL, the IP address is basically only good for self-testing inside my so-called LAN... This version is pure Django.
The other runs on Google App Engine at http://l2zstory.appspot.com. It took quite a bit of effort to set up the framework when configuring Django for GAE.
For details, see Using Django with Google App Engine GAE: L2Z Story Setup - Step 1: http://blog.sina.com.cn/s/blog_6266e57b01011mjk.html
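For reference, the usual way to hook Django into the old GAE Python runtime is a small main.py (the script referenced by app.yaml below) that wraps Django's WSGI handler. This is only a sketch of that common pattern, not necessarily the exact file used here:

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'  # point Django at the project settings

from google.appengine.ext.webapp import util
import django.core.handlers.wsgi

def main():
    # Wrap the Django project in a WSGI application and let GAE serve it
    application = django.core.handlers.wsgi.WSGIHandler()
    util.run_wsgi_app(application)

if __name__ == '__main__':
    main()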
Background:
The main language is Python; no explanation needed. I have never left it since I first got to know it.
The main modules used are:
1. BeautifulSoup.py for HTML parsing; no explanation needed.
2. feedparser.py for parsing the feed XML. Many people on the Internet say GAE does not support feedparser; here is your answer: yes, it can be used! It took me a long time to figure out what was going on. In short, feedparser.py must be placed in the same directory as app.yaml, otherwise feedparser may fail to import. (A quick check is sketched after this list.)
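As a quick, illustrative sanity check (run from the directory that contains app.yaml), both modules can be exercised like this:

import feedparser                      # feedparser.py lives next to app.yaml
from BeautifulSoup import BeautifulSoup

feed = feedparser.parse("http://l2zstory.wordpress.com/feed/")
print feed.feed.title                  # feed-level metadata
print len(feed.entries)                # number of posts found in the feed

soup = BeautifulSoup('<div class="post"><p>Hello</p></div>')
print soup.find('p').string            # -> Hello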
Database:
Google Datastore: as the next step, this program will wake up every 30 minutes, check whether each site has been updated, and capture any updated articles and store them in the Google Datastore.
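As a sketch of what that storage step might look like (the model and field names below are hypothetical, not the app's actual schema), each captured article could be saved as an entity with google.appengine.ext.db, and a cron.yaml entry along the lines of "schedule: every 30 minutes" would trigger the refresh handler:

from google.appengine.ext import db

class CachedPost(db.Model):
    # One entity per captured article (hypothetical schema)
    site = db.StringProperty(required=True)      # e.g. "wordpress" or "sina"
    title = db.StringProperty()
    author = db.StringProperty()
    link = db.LinkProperty()
    date = db.StringProperty()                   # keep the feed's date string as-is
    description = db.TextProperty()              # cleaned full text
    fetched_at = db.DateTimeProperty(auto_now=True)

def save_posts(site, post_infos):
    # Store the dictionaries produced by getPostInfos in the Datastore
    for info in post_infos:
        CachedPost(site=site,
                   title=info['title'],
                   author=info['author'],
                   link=info['link'],
                   date=info['date'],
                   description=info['description']).put()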
App configuration
Following Google's rules, the configuration file app.yaml is as follows:
The locations of the static directories (css and javascript) are defined here.
The code is as follows:
application: l2zstory
version: 1
runtime: python
api_version: 1

handlers:
- url: /images
  static_dir: l2zstory/templates/template2/images
- url: /css
  static_dir: l2zstory/templates/template2/css
- url: /js
  static_dir: l2zstory/templates/template2/js
- url: /.*
  script: main.py
URL configuration
Django's regular-expression based URL dispatching is used here.
The code is as follows:
from django.conf.urls.defaults import *

# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()

urlpatterns = patterns('',
    # Example:
    # (r'^l2zstory/', include('l2zstory.foo.urls')),

    # Uncomment the admin/doc line below and add 'django.contrib.admindocs'
    # to INSTALLED_APPS to enable admin documentation:
    # (r'^admin/doc/', include('django.contrib.admindocs.urls')),

    # Uncomment the next line to enable the admin:
    # (r'^admin/(.*)', admin.site.root),
    (r'^$', 'l2zstory.stories.views.L2ZStory'),
    (r'^YukiLife/', 'l2zstory.stories.views.YukiLife'),
    (r'^ZLife_Sina/', 'l2zstory.stories.views.ZLife_Sina'),
    (r'^ZLife/', 'l2zstory.stories.views.ZLife')
)
View details
Anyone familiar with Django can tell the view names from the URL configuration above. I will only paste the L2ZStory view, because the other views are structured almost identically.
The code is as follows:
# from BeautifulSoup import BeautifulSoup
from django.shortcuts import render_to_response  # needed for render_to_response below
from PyUtils import getAboutPage
from PyUtils import getPostInfos

def L2ZStory(request):
    url = "feed://l2zstory.wordpress.com/feed/"
    about_url = "http://l2zstory.wordpress.com/about"
    blog_type = "wordpress"
    htmlpages = {}
    aboutContent = getAboutPage(about_url, blog_type)
    if aboutContent == "Not Found":
        aboutContent = "We use this to tell those past stories..."
    htmlpages['about'] = {}
    htmlpages['about']['content'] = aboutContent
    htmlpages['about']['title'] = "About This Story"
    htmlpages['about']['url'] = about_url
    PostInfos = getPostInfos(url, blog_type, order_desc=True)
    return render_to_response('l2zstory.html',
                              {'PostInfos': PostInfos,
                               'htmlpages': htmlpages
                              })
Here a dictionary of dictionaries, htmlpages, and a list of dictionaries, PostInfos, are constructed.
htmlpages stores pages such as the About and Contact Us pages of a site.
PostInfos stores the content of every article along with metadata such as the author and publication time.
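Purely as an illustration (the values below are made up), the two structures have roughly this shape:

htmlpages = {
    'about': {
        'title': "About This Story",
        'content': "<div>...</div>",    # HTML fetched from the About page
        'url': "http://l2zstory.wordpress.com/about",
    }
}

PostInfos = [
    {
        'title': "A post title",
        'author': "L2Z",
        'date': "Mon, 05 Dec 2011 00:00:00 +0000",
        'link': "http://l2zstory.wordpress.com/...",
        'description': "<div>full text or feed summary</div>",
    },
    # ... one dictionary per article
]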
The most important piece is PyUtils; it is the core of this app.
PyUtils details
I have emphasized the details I consider important and added comments.
The code is as follows:
import feedparser
import urllib2
import re
from BeautifulSoup import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
}
# This is used to fool the website's backend... Sites like Sina are very unfriendly to apps like ours... I wish they would learn from WordPress, even though WordPress is blocked by the wall...
The code is as follows:
timeoutMsg = """
The Robot cannot connect to the desired page due to either of these reasons:
1. Great Fire Wall
2. The Blog Site has blocked connections made by Robots.
"""

def getPageContent(url, blog_type):
    try:
        req = urllib2.Request(url, None, header)
        response = urllib2.urlopen(req)
        html = response.read()
        html = BeautifulSoup(html).prettify()
        soup = BeautifulSoup(html)
        Content = ""
        # Apply different filters to different website types
        if blog_type == "wordpress":
            try:
                # Strip the "share/like" widgets before collecting the post body
                for section in soup.findAll('div', {'class': 'sharedaddy sd-like-enabled sd-sharing-enabled'}):
                    section.extract()
                for item in soup.findAll('div', {'class': 'post-content'}):
                    Content += unicode(item)
            except:
                Content = "No Post Content Found"
        elif blog_type == "sina":
            try:
                for item in soup.findAll('div', {'class': 'articalContent'}):
                    Content += unicode(item)
            except:
                Content = "No Post Content Found"
    except:
        Content = timeoutMsg
    return removeStyle(Content)

def removeStyle(Content):
    # Use a regular expression to strip all formatting attributes from the captured content.
    # To also remove all the img tags, add: (<img.*?>)|(src=\".*?\")
    patn = re.compile(r"(align=\".*?\")|(id=\".*?\")|(class=\".*?\")|(style=\".*?\")")
    replacepatn = ""
    Content = re.sub(patn, replacepatn, Content)
    return Content
def getPostInfos(url, blog_type, order_desc=False):
    feeds = feedparser.parse(url)
    PostInfos = []
    if order_desc:
        items = feeds.entries[::-1]   # reverse the feed order
    else:
        items = feeds.entries
    Cnt = 0
    for item in items:
        PostInfo = {}
        PostInfo['title'] = item.title
        PostInfo['author'] = item.author
        PostInfo['date'] = item.date
        PostInfo['link'] = item.link
        if blog_type == "wordpress":
            Cnt += 1
            if Cnt <= 8:
                # For the first 8 posts, fetch and clean the full text from the page itself
                PostInfo['description'] = getPageContent(item.link, blog_type)
            else:
                PostInfo['description'] = removeStyle(item.description)
        elif blog_type == "sina":
            PostInfo['description'] = removeStyle(item.description)
        PostInfos.append(PostInfo)
    return PostInfos
Template overview
In the same spirit as above, all sites share a single template, which accepts only the two variables mentioned earlier: htmlpages and PostInfos.
The important parts are:
The code is as follows:
{{ htmlpages.about.title }}
{{ htmlpages.about.content }}
{% for item in PostInfos %}
{{ item.title }}
Author: {{ item.author }}  Date: {{ item.date }}
{{ item.description }}
{% endfor %}
Summary
In a word, I love Python.
I love Python, and I love Django.
I love Python, Django, jQuery, and so on...