Using Django with GAE Python to crawl the full text of pages on multiple websites in the background


I have always wanted to build a platform that could help me filter out high-quality articles and blogs; I named it Moven. Implementing it is divided into three stages:
1. Downloader: download the specified url and pass the obtained content to the Analyser. This is the simplest part to start with.
2. Analyser: use regular expressions, XPath, BeautifulSoup, and lxml to filter and simplify the received content. This part is not too difficult.
3. Smart Crawler: capture links to high-quality articles. This part is the most difficult:

A crawler can be built quickly on the Scrapy framework.
However, judging whether the article behind a link is high quality requires a complicated algorithm.
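The first two stages can be sketched as a pair of plain functions (Python 3 here for brevity; the function names are hypothetical, and only the Analyser is exercised below):

```python
import re
import urllib.request  # the Python 2 code later in this post uses urllib2


def downloader(url):
    """Stage 1: fetch the raw HTML for a url (not run here)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def analyser(html):
    """Stage 2: filter and simplify the content; a crude tag-stripper
    stands in for the regular-expression / BeautifulSoup / lxml work."""
    text = re.sub(r"<[^>]+>", " ", html)      # drop tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


# Stage 3 (the Smart Crawler) would decide *which* urls to feed in.
print(analyser("<p>High-quality <b>article</b> text</p>"))
```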

Starting with the Downloader and Analyser: I recently set up L2Z Story, Yuki Life, Z Life, and Z Life @ Sina, so as a Downloader and Analyser exercise I wrote this app to listen to those four sites and synchronize all their content to this site:

http://l2zstory.appspot.com

App features
Except for the black navigation bar at the top and the About This Site section on the far right, all content is obtained automatically from the other sites.
In principle, you can add any blog or website address to this app... Of course, this is L2Z Story, so only four sites are included.
A nice property: as long as the site owners keep updating, this thing will keep working. This is the power of lazy people.


It is worth mentioning that the Content menu is generated automatically on the client with JavaScript, which saves resources on the server.

Here the whole page is captured as HTML, so for websites that do not output a full-text feed, this app can still capture the text they hide.
Loading takes a long time because the program crawls the full article list, author information, update time, and the full text of every page whose feed has no full-text output, so please be patient when opening the site... The next step is to add the data storage part, which will make it faster.

Technical preparation

Front end:

1. CSS adheres to the principle of simplicity. In principle, Twitter's bootstrap.css satisfies most of my requirements, and I like its grid system.
2. For JavaScript, jQuery was the obvious choice. Since using jQuery in my first small project, I have fallen in love with it. The dynamic directory system is generated quickly with jQuery.
To go with bootstrap.css, bootstrap-dropdown.js is also used.

Server:

This app has two versions:
One runs on my Apache box, but because my network is ADSL, the IP address basically only works for self-testing in my so-called LAN... This version is pure Django.
The other runs on Google App Engine at http://l2zstory.appspot.com. It took me a lot of effort to set up the framework when configuring Django for GAE.

For details, see Using Django with Google App Engine GAE: L2Z Story Setup - Step 1: http://blog.sina.com.cn/s/blog_6266e57b01011mjk.html

Backend:

The main language is Python. No explanation needed: I haven't left it since I first got to know it.

The main modules used are:

1. BeautifulSoup.py, used for HTML parsing.
2. feedparser.py, used to parse feed XML. Many people on the Internet say GAE does not support feedparser; here is your answer: it can be used! It took me a long time to figure out what was going on. However, feedparser.py must be placed in the same directory as app.yaml; otherwise the app may fail to import feedparser.
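The placement rule can be reproduced outside GAE: Python resolves a bare import against sys.path, and GAE puts the app root (the directory containing app.yaml) on that path. A small sketch, using a hypothetical stub module in place of feedparser.py:

```python
import os
import sys
import tempfile

# Simulate the GAE app root (the directory holding app.yaml) with a temp
# directory containing a stub module standing in for feedparser.py.
app_root = tempfile.mkdtemp()
with open(os.path.join(app_root, "feedparser_stub.py"), "w") as f:
    f.write("NAME = 'feedparser stub'\n")

# GAE effectively does this for the app root; a module next to app.yaml
# is importable by its bare name, one buried elsewhere is not.
sys.path.insert(0, app_root)
import feedparser_stub

print(feedparser_stub.NAME)
```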

Database:
Google Datastore: in the next step, this program will wake up every 30 minutes, check whether each site has updated, and store any updated articles in Google Datastore.
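The planned 30-minute wake-up maps naturally onto GAE's cron service; a sketch of the corresponding cron.yaml, with a hypothetical /tasks/update handler URL:

```yaml
cron:
- description: check each site for updates and refresh the Datastore
  url: /tasks/update        # hypothetical handler, not part of the app yet
  schedule: every 30 minutes
```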

App Configuration

Following Google's rules, the configuration file app.yaml is as follows.
The locations of the static directories (css and javascript) are defined here.

The code is as follows:

application: l2zstory
version: 1
runtime: python
api_version: 1

handlers:

- url: /images
  static_dir: l2zstory/templates/template2/images
- url: /css
  static_dir: l2zstory/templates/template2/css
- url: /js
  static_dir: l2zstory/templates/template2/js
- url: /.*
  script: main.py

URL Configuration


The URL configuration uses Django's regular-expression routing.

The code is as follows:

from django.conf.urls.defaults import *

# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()

urlpatterns = patterns('',
    # Example:
    # (r'^l2zstory/', include('l2zstory.foo.urls')),

    # Uncomment the admin/doc line below and add 'django.contrib.admindocs'
    # to INSTALLED_APPS to enable admin documentation:
    # (r'^admin/doc/', include('django.contrib.admindocs.urls')),

    # Uncomment the next line to enable the admin:
    # (r'^admin/(.*)', admin.site.root),
    (r'^$', 'l2zstory.stories.views.L2ZStory'),
    (r'^YukiLife/', 'l2zstory.stories.views.YukiLife'),
    (r'^ZLife_Sina/', 'l2zstory.stories.views.ZLife_Sina'),
    (r'^ZLife/', 'l2zstory.stories.views.ZLife')
)
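These patterns are ordinary Python regexes tried in order against the request path (with the leading slash stripped); a minimal sketch that mimics the dispatcher:

```python
import re

# The same patterns as in urls.py, paired with short view labels.
patterns_list = [
    (r'^$', 'L2ZStory'),
    (r'^YukiLife/', 'YukiLife'),
    (r'^ZLife_Sina/', 'ZLife_Sina'),
    (r'^ZLife/', 'ZLife'),
]


def resolve(path):
    """Return the label of the first pattern matching the path, like
    Django's dispatcher does (Django strips the leading '/')."""
    for patn, view in patterns_list:
        if re.match(patn, path):
            return view
    return None


print(resolve(''))             # empty path hits ^$ -> the L2ZStory view
print(resolve('ZLife_Sina/'))
print(resolve('ZLife/'))
```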

View Details


Those familiar with Django will recognize the view names from the URL configuration. I only paste the L2ZStory view, because the other views are structurally almost identical.
The code is as follows:

# from BeautifulSoup import BeautifulSoup
from django.shortcuts import render_to_response
from PyUtils import getAboutPage
from PyUtils import getPostInfos

def L2ZStory(request):
    url = "feed://l2zstory.wordpress.com/feed/"
    about_url = "http://l2zstory.wordpress.com/about"
    blog_type = "wordpress"
    htmlpages = {}
    aboutContent = getAboutPage(about_url, blog_type)
    if aboutContent == "Not Found":
        aboutContent = "We use this to tell those past stories..."
    htmlpages['about'] = {}
    htmlpages['about']['content'] = aboutContent
    htmlpages['about']['title'] = "About This Story"
    htmlpages['about']['url'] = about_url
    PostInfos = getPostInfos(url, blog_type, order_desc=True)
    return render_to_response('l2zstory.html',
                              {'PostInfos': PostInfos,
                               'htmlpages': htmlpages})

Here we mainly construct a dictionary of dictionaries, htmlpages, and a list of dictionaries, PostInfos.
htmlpages stores pages such as About and Contact Us for a site.
PostInfos stores the content of all articles, together with metadata such as the author and release time.
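For illustration only (the field values below are made up), the two structures end up shaped like this:

```python
# Illustrative shapes only; the real values come from getAboutPage/getPostInfos.
htmlpages = {
    'about': {
        'title': "About This Story",
        'url': "http://l2zstory.wordpress.com/about",
        'content': "We use this to tell those past stories...",
    }
}

PostInfos = [
    {
        'title': "A sample post",                      # hypothetical entry
        'author': "L2Z",
        'date': "2011-11-01",
        'link': "http://l2zstory.wordpress.com/sample",
        'description': "<div>cleaned post body</div>",
    },
]

print(sorted(PostInfos[0]))  # the five keys every entry carries
```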

The most important piece here is PyUtils: it is the core of this app.

PyUtils Details

I have expanded on some details I consider important and added comments.

The code is as follows:

import feedparser
import urllib2
import re
from BeautifulSoup import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
}
# A browser User-Agent is used to fool the target site's backend... Sites like
# Sina are very unfriendly to apps like this one... I hope they learn from
# WordPress, which is blocked by the wall...

timeoutMsg = """
The Robot cannot connect to the desired page due to either of these reasons:
1. Great Fire Wall
2. The Blog Site has blocked connections made by Robots.
"""

def getPageContent(url, blog_type):
    try:
        req = urllib2.Request(url, None, header)
        response = urllib2.urlopen(req)
        html = response.read()
        html = BeautifulSoup(html).prettify()
        soup = BeautifulSoup(html)
        content = ""
        if blog_type == "wordpress":
            try:
                # Remove the sharing widgets before extracting the post body
                for section in soup.findAll('div', {'class': 'sharedaddy sd-like-enabled sd-sharing-enabled'}):
                    section.extract()
                for item in soup.findAll('div', {'class': 'post-content'}):
                    content += unicode(item)
            except:
                content = "No Post Content Found"
        elif blog_type == "sina":
            try:
                for item in soup.findAll('div', {'class': 'articalContent'}):
                    content += unicode(item)
            except:
                content = "No Post Content Found"
        # Apply different filters to different website types
    except:
        content = timeoutMsg
    return removeStyle(content)


def removeStyle(content):
    # Add this to remove all the img tags: ()|(</img>)|(src=".*")|
    patn = re.compile(r'(align=".*")|(id=".*")|(class=".*")|(style=".*")|(</font>)|(<font.*">)|(<embed +(\w*=".*")>)|(</embed>)')
    replacepatn = ""
    # Use a regular expression to remove all the formatting in the captured content
    content = re.sub(patn, replacepatn, content)
    return content
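As a quick check of the stripping idea (using a simplified pattern, not the exact one above):

```python
import re

# Simplified version of removeStyle's idea: delete inline presentation
# attributes and <font> tags from a captured HTML fragment.
patn = re.compile(r'( align="[^"]*")|( style="[^"]*")|(</?font[^>]*>)')

sample = '<p align="center" style="color:red"><font size="3">Hello</font></p>'
cleaned = re.sub(patn, '', sample)
print(cleaned)  # <p>Hello</p>
```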

def getPostInfos(url, blog_type, order_desc=False):
    feeds = feedparser.parse(url)
    PostInfos = []
    if order_desc:
        items = feeds.entries[::-1]  # reverse into descending order
    else:
        items = feeds.entries
    cnt = 0
    for item in items:
        PostInfo = {}
        PostInfo['title'] = item.title
        PostInfo['author'] = item.author
        PostInfo['date'] = item.date
        PostInfo['link'] = item.link

        if blog_type == "wordpress":
            # Fetch the full page content for the first 8 entries only;
            # fall back to the (cleaned) feed description for the rest
            cnt += 1
            if cnt <= 8:
                PostInfo['description'] = getPageContent(item.link, blog_type)
            else:
                PostInfo['description'] = removeStyle(item.description)
        elif blog_type == "sina":
            PostInfo['description'] = removeStyle(item.description)

        PostInfos.append(PostInfo)

    return PostInfos

Template Overview

Following the principles above, all sites share one template, which accepts only the two variables mentioned earlier: htmlpages and PostInfos.
The important parts are:
The code is as follows:
<div class="page-header">
<a href="{{ htmlpages.about.url }}" name="{{ htmlpages.about.title }}">
{{ htmlpages.about.title }}
</a>
</div>
<p>
{{ htmlpages.about.content }}
</p>
{% for item in PostInfos %}
<div class="page-header">
<a href="{{ item.link }}" name="{{ item.title }}">
{{ item.title }}
</a>
</div>
<p><i>author: {{ item.author }} date: {{ item.date }}</i></p>
<p>{{ item.description }}</p>
{% endfor %}
</div>

Summary

In a word, I love Python.
I love Python, And I love Django.
I love Python, Django, jQuery, and so on...

