Using Django with GAE Python to crawl the full text of pages on multiple websites in the background

Last Update:2016-02-20 Source: Internet

Author: User

I have always wanted to build a platform that helps me pick out high-quality articles and blogs, and to name it Moven. Implementing it is divided into three stages:
1. Downloader: download the specified URL and pass the retrieved content to the Analyser. This is the simplest place to start.
2. Analyser: use regular expressions, XPath, BeautifulSoup, and lxml to filter and simplify the received content. This part is not too difficult.
3. Smart Crawler: capture links to high-quality articles. This part is the most difficult:

A crawler can be built quickly on the Scrapy framework.
However, judging whether the article behind a link is of high quality requires a complicated algorithm.
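
The first two stages can be sketched as a tiny pipeline. This is only an illustration, not the code behind Moven: it is written for Python 3 with the standard library only (unlike the Python 2 code later in this post), and the names download, analyse, LinkAndTextParser, and fake_fetch are hypothetical. The fetch function is passed in so that the example runs without network access.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Analyser helper: collects visible text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def download(url, fetch):
    """Downloader: fetch one URL and hand the raw HTML to the next stage."""
    return fetch(url)


def analyse(html):
    """Analyser: reduce raw HTML to plain text plus the links it contains."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts), "links": parser.links}


def fake_fetch(url):
    # Stand-in for a real HTTP request (hypothetical page content).
    return '<html><body><p>Hello <a href="http://example.com/post">a post</a></p></body></html>'


result = analyse(download("http://example.com/", fake_fetch))
```

The links collected by the Analyser are what a Smart Crawler would later have to rank; that ranking step is deliberately left out here.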

I started on the Downloader and Analyser recently. I had recently set up L2Z Story, Z Life, and Z Life @ Sina, and I wrote this app as a Downloader and Analyser exercise: it watches the four sites in question and synchronizes all their content to this site:

http://l2zstory.appspot.com

App features
Except for the black navigation bar at the top and the About This Site panel on the far right, all content is obtained automatically from other sites.
In principle, any blog or website address can be added to it. Of course, this is L2Z Story, so only four sites are included.
Feature: as long as the site owners keep updating, this thing will keep going. This is the power of lazy people.

It is worth mentioning that the Content menu is generated automatically on the client using JavaScript, which saves resources on the server.

The app fetches the HTML of the whole page. For websites whose feeds do not output full text, it can therefore capture the text that the feed leaves out.
Loading takes a long time because, for a site without full-text feed output, the program automatically crawls the whole article list, the author information, the update time, and the full text. So please be patient when opening it. The next step is to add data storage, which will make it faster.

Technical preparation

Front end:

1. CSS: I stick to the principle of simplicity. Twitter's bootstrap.css satisfies most of my requirements, and I like its grid system.
2. JavaScript: jQuery, of course. I fell in love with it when I used it in my first small project. The dynamic content menu was built quickly with jQuery.
To go with bootstrap.css, bootstrap-dropdown.js is also used.

Server:

This app has two versions:
One runs on my Apache server, but because my connection is ADSL, that IP address is basically only used for self-testing on my so-called LAN. This version is pure Django.
The other runs on Google App Engine at http://l2zstory.appspot.com. Configuring Django for GAE and getting the framework set up took me a lot of effort.

For details, see Using Django with Google App Engine GAE: l2Z Story Setup-Step 1: http://blog.sina.com.cn/s/blog_6266e57b01011mjk.html

Background:

The main language is Python. No explanation needed: I have not left it since I got to know it.

The main modules used are:

1. BeautifulSoup.py, used for HTML parsing. No explanation needed.
2. feedparser.py, used to parse the feed XML. Many people online say GAE does not support feedparser. Here is the answer: it does. It took me a long time to figure out what was going on, but in short, it can be used. However, feedparser.py must be placed in the same directory as app.yaml; otherwise importing feedparser may fail.
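
If keeping feedparser.py next to app.yaml becomes untidy, one common alternative is to add a subdirectory to Python's import path before the import runs. The sketch below is an assumption, not what this app does: the lib folder name and the add_vendor_dir helper are hypothetical, and it has not been checked against the GAE runtime used here.

```python
import os
import sys


def add_vendor_dir(base_dir, folder="lib"):
    """Put base_dir/folder at the front of sys.path (once) and return the path."""
    vendor_path = os.path.join(base_dir, folder)
    if vendor_path not in sys.path:
        sys.path.insert(0, vendor_path)
    return vendor_path


# In main.py this would run before the line that imports feedparser.
# The directory containing app.yaml is assumed to be the current directory.
vendor_path = add_vendor_dir(os.path.dirname(os.path.abspath("app.yaml")))
```

Python searches sys.path in order, so a module placed in that folder is found the same way as one sitting beside app.yaml.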

Database:
Google Datastore: in the next step, this program will wake up every 30 minutes, check whether each site has been updated, fetch the updated articles, and store them in Google Datastore.
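
The planned update check could look roughly like the sketch below. It is an illustration under assumptions rather than finished code: a plain dictionary stands in for Google Datastore, posts are identified by their link, and the names is_check_due and store_new_posts are hypothetical. It is written for Python 3.

```python
import time

CHECK_INTERVAL_SECONDS = 30 * 60  # the 30-minute interval mentioned above


def is_check_due(last_checked, now=None):
    """Return True when at least 30 minutes have passed since the last check."""
    now = time.time() if now is None else now
    return now - last_checked >= CHECK_INTERVAL_SECONDS


def store_new_posts(store, posts):
    """Save posts whose link is not in the store yet; return the new links."""
    new_links = []
    for post in posts:
        if post["link"] not in store:
            store[post["link"]] = post
            new_links.append(post["link"])
    return new_links


store = {}  # stands in for Google Datastore in this sketch
first = store_new_posts(store, [{"link": "http://example.com/1", "title": "One"}])
second = store_new_posts(store, [
    {"link": "http://example.com/1", "title": "One"},
    {"link": "http://example.com/2", "title": "Two"},
])
```

On the second run only the unseen link is stored, so pages already saved would not need to be crawled again.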

App Configuration

Following Google's rules, the configuration file app.yaml is as follows.
The locations of the static directories (images, CSS, and JavaScript) are defined here.

The code is as follows:
application: l2zstory
version: 1
runtime: python
api_version: 1

handlers:

- url: /images
  static_dir: l2zstory/templates/template2/images
- url: /css
  static_dir: l2zstory/templates/template2/css
- url: /js
  static_dir: l2zstory/templates/template2/js
- url: /.*
  script: main.py
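
To see what these handlers do, here is a simplified Python 3 model of the routing. It is a sketch, not App Engine's actual matching code: handlers are tried from top to bottom, a static_dir handler maps the rest of the URL onto a path inside its directory, and anything else falls through to the catch-all script handler. The route function and STATIC_HANDLERS list are hypothetical names.

```python
STATIC_HANDLERS = [
    ("/images", "l2zstory/templates/template2/images"),
    ("/css", "l2zstory/templates/template2/css"),
    ("/js", "l2zstory/templates/template2/js"),
]


def route(path):
    """Return ('static', file_path) or ('script', 'main.py') for a request path."""
    for url_prefix, static_dir in STATIC_HANDLERS:
        if path.startswith(url_prefix + "/"):
            # The part of the URL after the prefix is appended to static_dir.
            return ("static", static_dir + path[len(url_prefix):])
    # Nothing above matched, so the catch-all /.* handler takes the request.
    return ("script", "main.py")
```

For example, route("/css/bootstrap.css") points at l2zstory/templates/template2/css/bootstrap.css, while route("/ZLife/") is handed to main.py. This is also why the /.* handler has to come last.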

URL Configuration

Here we use Django's regular-expression URL patterns.

The code is as follows:
from django.conf.urls.defaults import *

# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()

urlpatterns = patterns('',
    # Example:
    # (r'^l2zstory/', include('l2zstory.foo.urls')),

    # Uncomment the admin/doc line below and add 'django.contrib.admindocs'
    # to INSTALLED_APPS to enable admin documentation:
    # (r'^admin/doc/', include('django.contrib.admindocs.urls')),

    # Uncomment the next line to enable the admin:
    # (r'^admin/(.*)', admin.site.root),
    (r'^$', 'l2zstory.stories.views.L2ZStory'),
    (r'^YukiLife/', 'l2zstory.stories.views.YukiLife'),
    (r'^ZLife_Sina/', 'l2zstory.stories.views.ZLife_Sina'),
    (r'^ZLife/', 'l2zstory.stories.views.zlife')
)
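
How these patterns pick a view can be sketched in a few lines of Python 3. This is a simplified model for illustration, not Django's resolver: the leading slash is dropped, the patterns are tried in order, and the first match wins. The resolve function is hypothetical, and the view names are shortened to their last component.

```python
import re

URL_PATTERNS = [
    (r'^$', 'L2ZStory'),
    (r'^YukiLife/', 'YukiLife'),
    (r'^ZLife_Sina/', 'ZLife_Sina'),
    (r'^ZLife/', 'zlife'),
]


def resolve(request_path):
    """Return the view name of the first pattern that matches, or None."""
    # The patterns are written without the leading slash, so drop it first.
    path = request_path[1:] if request_path.startswith('/') else request_path
    for pattern, view_name in URL_PATTERNS:
        if re.search(pattern, path):
            return view_name
    return None
```

Because each pattern ends in a slash, ^ZLife/ does not swallow requests for ZLife_Sina/, so the two blogs never collide.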

View Details

Those familiar with Django can read the view names from the URL configuration. I paste only the L2ZStory view, because the other views are structured in much the same way.
The code is as follows:
# from BeautifulSoup import BeautifulSoup
from django.shortcuts import render_to_response
from PyUtils import getAboutPage
from PyUtils import getPostInfos

def L2ZStory(request):
    url = "feed://l2zstory.wordpress.com/feed/"
    about_url = "http://l2zstory.wordpress.com/about"
    blog_type = "wordpress"
    htmlpages = {}
    aboutContent = getAboutPage(about_url, blog_type)
    # Fall back to a fixed text when the About page cannot be retrieved.
    if aboutContent == "Not Found":
        aboutContent = "We use this to tell those past stories..."
    htmlpages['about'] = {}
    htmlpages['about']['content'] = aboutContent
    htmlpages['about']['title'] = "About This Story"
    htmlpages['about']['url'] = about_url
    PostInfos = getPostInfos(url, blog_type, order_desc=True)
    # The keys match the variable names used in the template.
    return render_to_response('l2zstory.html',
                              {'PostInfos': PostInfos,
                               'htmlpages': htmlpages
                               })

Here we mainly construct a dictionary of dictionaries, htmlpages, and a list of dictionaries, PostInfos.
htmlpages stores a site's pages such as About and Contact Us.
PostInfos stores the content of every article, along with details such as the author and release time.
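
As a concrete picture of these two structures, here is what they might hold for one site. The keys come from the code in this post; every value marked as a placeholder is made up for illustration.

```python
htmlpages = {
    'about': {
        'content': "We use this to tell those past stories...",
        'title': "About This Story",
        'url': "http://l2zstory.wordpress.com/about",
    },
}

PostInfos = [
    {
        'title': "A sample post",                      # placeholder
        'author': "someone",                           # placeholder
        'date': "2012-01-01",                          # placeholder
        'link': "http://example.com/a-sample-post",    # placeholder
        'description': "<p>Post body goes here.</p>",  # placeholder
    },
]
```

A Contact Us page would simply be another key next to 'about', and each further article is one more dictionary in the list.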

The most important piece here is PyUtils. It is the core of this app.

PyUtils details

I have highlighted some details that I think are important and added comments.

The code is as follows:
import feedparser
import urllib2
import re
from BeautifulSoup import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
}

# This header is used to fool the site's back end. Sites like Sina are very unfriendly to apps like ours. I hope they can learn from WordPress, which is blocked by the firewall.

The code is as follows:
timeoutMsg = """
The Robot cannot connect to the desired page due to either of these reasons:
1. Great Fire Wall
2. The Blog Site has blocked connections made by Robots.
"""

def getPageContent(url, blog_type):
    try:
        req = urllib2.Request(url, None, header)
        response = urllib2.urlopen(req)
        html = response.read()
        html = BeautifulSoup(html).prettify()
        soup = BeautifulSoup(html)
        Content = ""
        if blog_type == "wordpress":
            try:
                # Drop the sharing widgets before collecting the post body.
                for section in soup.findAll('div', {'class': 'sharedaddy sd-like-enabled sd-sharing-enabled'}):
                    section.extract()
                for item in soup.findAll('div', {'class': 'post-content'}):
                    Content += unicode(item)
            except Exception:
                Content = "No Post Content Found"
        elif blog_type == "sina":
            try:
                for item in soup.findAll('div', {'class': 'articalcontent'}):
                    Content += unicode(item)
            except Exception:
                Content = "No Post Content Found"

        # Apply different filters to different website types

    except Exception:
        Content = timeoutMsg
    return removeStyle(Content)

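def removeStyle(Content):
    # patn and replacepatn (the pattern and its replacement) are not shown in this excerpt.
    Content = re.sub(patn, replacepatn, Content)
    # Use a regular expression to remove all the formatting from the captured content.
    return Content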

def getPostInfos(url, blog_type, order_desc=False):
    feeds = feedparser.parse(url)
    PostInfos = []
    if order_desc:
        # Reverse the order of the feed entries.
        items = feeds.entries[::-1]
    else:
        items = feeds.entries
    Cnt = 0
    for item in items:
        PostInfo = {}
        PostInfo['title'] = item.title
        PostInfo['author'] = item.author
        PostInfo['date'] = item.date
        PostInfo['link'] = item.link

        if blog_type == "wordpress":
            Cnt += 1
            # Crawl the full text for the first 8 posts only; use the feed description for the rest.
            if Cnt <= 8:
                PostInfo['description'] = getPageContent(item.link, blog_type)
            else:
                PostInfo['description'] = removeStyle(item.description)
        elif blog_type == "sina":
            PostInfo['description'] = removeStyle(item.description)

        PostInfos.append(PostInfo)

    return PostInfos
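
The pattern used by removeStyle is not shown above, so the following is only a guess at the kind of regular expression that fits the description. It is a Python 3 sketch that strips inline style attributes; the names STYLE_ATTR and remove_style are hypothetical and the real patn and replacepatn may well differ.

```python
import re

# Assumed pattern: a style="..." or style='...' attribute plus the whitespace before it.
STYLE_ATTR = re.compile(r'\s+style\s*=\s*("[^"]*"|\'[^\']*\')', re.IGNORECASE)


def remove_style(content):
    """Return the HTML with every inline style attribute removed."""
    return STYLE_ATTR.sub('', content)


cleaned = remove_style('<p style="color: red; font-size: 20px">Hello</p>')
```

Here cleaned is '<p>Hello</p>': the tag and its text survive, and only the formatting copied from the source site is dropped, so the shared template's CSS decides how the post looks.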

Template Overview

Following the principles above, all sites use a single template that accepts only two variables: the htmlpages and PostInfos mentioned earlier.
The important parts are:
The code is as follows:
<div class="page-header">
<a href="{{ htmlpages.about.url }}" name="{{ htmlpages.about.title }}">

</div>
<p>
{{ htmlpages.about.content }}
</p>
{% for item in PostInfos %}
<div class="page-header">
<a href="{{ item.link }}" name="{{ item.title }}">

</div>
<p><i>author: {{ item.author }} date: {{ item.date }}</i></p>
<p>{{ item.description }}</p>
{% endfor %}
</div>
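
The dotted names in the template, such as htmlpages.about.url, are resolved against the two variables passed in by the view. The Python 3 sketch below imitates that lookup in a simplified way (dictionary key first, then attribute, then list index); it is not Django's implementation, and resolve_variable is a hypothetical helper.

```python
def resolve_variable(context, dotted_name):
    """Resolve a name such as 'htmlpages.about.url' against a context dict."""
    current = context
    for part in dotted_name.split('.'):
        if isinstance(current, dict) and part in current:
            current = current[part]            # dictionary lookup
        elif hasattr(current, part):
            current = getattr(current, part)   # attribute lookup
        elif isinstance(current, (list, tuple)) and part.isdigit():
            current = current[int(part)]       # list-index lookup
        else:
            return ''  # simplification: unresolved names render as an empty string
    return current


context = {
    'htmlpages': {'about': {'url': 'http://l2zstory.wordpress.com/about'}},
    'PostInfos': [{'title': 'A sample post'}],  # placeholder entry
}
about_url = resolve_variable(context, 'htmlpages.about.url')
```

This is why nested dictionaries are enough for the template: no model classes are needed for htmlpages or PostInfos.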

Summary

In a word, I love Python.
I love Python, and I love Django.
I love Python, Django, jQuery, and so on...

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
