[Python] Flask

Source: Internet
Author: User
Tags: python, sqlite, virtualenv

Analyze request/response processes

The analytics service we will build is somewhat similar to Google Analytics (more like a very simplified version of it).

Workflow:

Each page to be tracked includes a JavaScript file via a <script> tag (for example, in the base template of the site being analyzed). When someone visits the site, their browser executes that JavaScript, which reads the title, URL, and other interesting elements of the current page. The coolest part is that the script then dynamically creates an <img> tag (a blank 1x1-pixel image) whose src attribute points at our analytics application, with the collected page information encoded into the URL's query string. When the browser requests that image, our analytics server receives and parses the information, stores it in the database, and returns the 1-pixel GIF.
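Purely for illustration, the beacon URL the script assembles can be sketched in Python (the helper name build_beacon_url is hypothetical; in the real application this work happens in client-side JavaScript):

```python
# Illustrative only: mimic how the client-side script builds the beacon URL.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def build_beacon_url(domain, page_url, title, referrer):
    # Each value is percent-encoded, like encodeURIComponent in JavaScript.
    return '%s/a.gif?url=%s&t=%s&ref=%s' % (
        domain,
        quote(page_url, safe=''),
        quote(title, safe=''),
        quote(referrer, safe=''))

url = build_beacon_url('http://127.0.0.1:5000',
                       'http://example.com/post?x=1',
                       'My Post',
                       'http://google.com/')
print(url)
```

Encoding every value is what lets a full URL (with its own `?` and `&`) travel safely inside another URL's query string.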

The interaction diagram is as follows:

Design Considerations

Because I want to run this on a VPS with limited resources (essentially a very low-spec server), and my blog does not get much traffic, the service should be lightweight and flexible. I like Flask for any kind of project, and it is especially well suited to this one. We will use the peewee ORM to store page views (PV data) and to query and analyze the data. Believe it or not, the whole application will come in at under 100 lines of code (including comments).

Storing page views

To be able to query the data, we will store page-view records in a relational database. I chose BerkeleyDB's SQLite-compatible interface because it is a lightweight embedded database that does not use much memory. I considered plain SQLite, but BerkeleyDB performs much better under concurrent access than SQLite, so the analytics service should stay responsive even when it is being hammered.
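If you do stick with plain SQLite, one common mitigation for its weaker concurrency is write-ahead logging (WAL) mode, which lets readers proceed while a writer is active. A minimal sketch using only the standard-library sqlite3 module (this is the underlying SQLite mechanism, independent of peewee):

```python
import os
import sqlite3
import tempfile

# WAL mode requires a file-backed database (it has no effect on in-memory DBs).
path = os.path.join(tempfile.mkdtemp(), 'analytics.db')
conn = sqlite3.connect(path)

# Switch the journal to write-ahead logging; readers no longer block on writers.
mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
print(mode)
conn.close()
```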

If you already have PostgreSQL or MySQL installed, feel free to use that instead.

WSGI Server

Although there are many options, I prefer gevent. Gevent is a coroutine-based networking library that implements greenlets on top of the libev event loop. Through monkey-patching, gevent turns normal blocking Python code into non-blocking code, without any special APIs or syntax. Gevent's WSGI server, while very basic, offers high performance with very low resource consumption. As with the database, if you prefer a different WSGI server, use whatever suits you.

Create virtualenv

We will first create an isolated environment for the analytics application and install flask and peewee into it (and, optionally, gevent).

$ virtualenv analytics
New python executable in analytics/bin/python2
Also creating executable in analytics/bin/python
Installing setuptools, pip...done.
$ cd analytics/
$ source bin/activate
$ pip install flask peewee
...
Successfully installed flask peewee Werkzeug Jinja2 itsdangerous markupsafe
Cleaning up...
$ pip install gevent  # Optional.

If you want to compile the Python SQLite driver with BerkeleyDB support, check out the berkeley_build.sh script in peewee's playhouse module (lib/python2.7/site-packages/playhouse/berkeley_build.sh). The script downloads and compiles BerkeleyDB, then compiles pysqlite against it. For detailed instructions, see this article. You can also skip this step and simply use peewee's SqliteDatabase class.

Implement the Flask Application

Let's start with the overall code framework. As discussed earlier, we will create two views: one that serves the JavaScript file, and one that serves the 1-pixel GIF. In the analytics directory (the virtualenv directory created above), create a file named analytics.py with the code below. It contains the application's structure and basic configuration.

# coding: utf-8
from base64 import b64decode
import datetime
import json
import os
from urlparse import parse_qsl, urlparse

from flask import Flask, Response, abort, request
from peewee import *
from playhouse.berkeleydb import BerkeleyDatabase  # Optional.

# 1 pixel GIF, base64-encoded.
BEACON = b64decode('R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')

# Store the database file in the app directory.
APP_DIR = os.path.dirname(__file__)
DATABASE_NAME = os.path.join(APP_DIR, 'analytics.db')
DOMAIN = 'http://127.0.0.1:5000'  # TODO: change me.

# Simple JavaScript which will be included and executed on the client-side.
JAVASCRIPT = ''  # TODO: add javascript implementation.

# Flask application settings.
DEBUG = bool(os.environ.get('DEBUG'))
SECRET_KEY = 'secret - change me'  # TODO: change me.

app = Flask(__name__)
app.config.from_object(__name__)

database = BerkeleyDatabase(DATABASE_NAME)  # or SqliteDatabase(DATABASE_NAME)

class PageView(Model):
    # TODO: add model definition.
    class Meta:
        database = database

@app.route('/a.gif')
def analyze():
    # TODO: implement 1-pixel gif view.
    pass

@app.route('/a.js')
def script():
    # TODO: implement javascript view.
    pass

@app.errorhandler(404)
def not_found(e):
    return Response('Not found.')

if __name__ == '__main__':
    database.create_tables([PageView], safe=True)
    app.run()
Obtain information from a browser

Now let's write the JavaScript file that collects client information. We mainly want to extract some basic page data:

* The page URL, including any query string (document.location.href)
* The page title (document.title)
* The referring page, if there is one (document.referrer)

Other attributes can be extracted too, if you are interested (there are more than those listed here, especially with HTML5), for example:
* Cookie information (document.cookie)
* Last modification time of the document (document.lastModified)
* More

After the information is extracted, we pass it to the analyze view via the URL query string. For simplicity, the JavaScript fires as soon as the page loads, and we wrap all the code in an anonymous function. Finally, we escape every parameter with encodeURIComponent so they are safe to transmit:

(function() {
  var img = new Image,
      url = encodeURIComponent(document.location.href),
      title = encodeURIComponent(document.title),
      ref = encodeURIComponent(document.referrer);
  img.src = '%s/a.gif?url=' + url + '&t=' + title + '&ref=' + ref;
})();

The %s placeholder is filled in by Python with the DOMAIN setting, producing the complete JavaScript.

In analytics.py, we define the JAVASCRIPT variable to hold a minified version of this code:

# Simple JavaScript which will be included and executed on the client-side.
JAVASCRIPT = """(function(){
    var d=document,i=new Image,e=encodeURIComponent;
    i.src='%s/a.gif?url='+e(d.location.href)+'&ref='+e(d.referrer)+'&t='+e(d.title);
    })()""".replace('\n', '')

And the view that serves it:

@app.route('/a.js')
def script():
    return Response(
        app.config['JAVASCRIPT'] % (app.config['DOMAIN']),
        mimetype='text/javascript')
Save PV Information

The script above passes three values to the analyze view: the page's URL, title, and referrer. Now let's define a PageView model to store this data.

On the server side, we can also read the visitor's IP address and request headers, so we create fields for those as well, plus a timestamp field recording when the request arrived.

Because every browser sends different request headers, and every page request carries different query parameters, we store both as JSON in a TextField. If you use PostgreSQL, you could use HStore or the native JSON data type instead.

The following is the definition of the PageView model. A JSONField is also defined to store query parameters and request header information:

class JSONField(TextField):
    """Store JSON data in a TextField."""
    def python_value(self, value):
        if value is not None:
            return json.loads(value)

    def db_value(self, value):
        if value is not None:
            return json.dumps(value)

class PageView(Model):
    domain = CharField()
    url = TextField()
    timestamp = DateTimeField(default=datetime.datetime.now, index=True)
    title = TextField(default='')
    ip = CharField(default='')
    referrer = TextField(default='')
    headers = JSONField()
    params = JSONField()

    class Meta:
        database = database
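The round-trip behavior of JSONField can be checked in isolation with the json module; the dumps/loads pair below is exactly what the field's db_value and python_value methods call:

```python
import json

headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html'}
stored = json.dumps(headers)    # what db_value writes into the TextField
restored = json.loads(stored)   # what python_value hands back to the app
print(restored == headers)
```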

Next, we add a classmethod to PageView that extracts all the relevant values from the request and saves them to the database. The urlparse module provides several helpers for picking apart request information; we use them to get the visited URL and its query parameters:

class PageView(Model):
    # ... field definitions ...

    @classmethod
    def create_from_request(cls):
        parsed = urlparse(request.args['url'])
        params = dict(parse_qsl(parsed.query))
        return PageView.create(
            domain=parsed.netloc,
            url=parsed.path,
            title=request.args.get('t') or '',
            ip=request.headers.get('X-Forwarded-For', request.remote_addr),
            referrer=request.args.get('ref') or '',
            headers=dict(request.headers),
            params=params)
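As a standalone illustration of what urlparse and parse_qsl return for a typical tracked URL (the example URL and parameters here are made up):

```python
try:
    from urllib.parse import urlparse, parse_qsl  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qsl  # Python 2

parsed = urlparse('http://example.com/blog/post/?utm_source=feed&page=2')
print(parsed.netloc)                  # hostname -> stored in PageView.domain
print(parsed.path)                    # path -> stored in PageView.url
print(dict(parse_qsl(parsed.query)))  # query params -> stored in PageView.params
```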

The last step in the analyze view is to return the 1-pixel GIF. As a safety measure, we check that a URL was actually supplied, ensuring no blank records get inserted into the database.

@app.route('/a.gif')
def analyze():
    if not request.args.get('url'):
        abort(404)
    with database.transaction():
        PageView.create_from_request()
    response = Response(app.config['BEACON'], mimetype='image/gif')
    response.headers['Cache-Control'] = 'private, no-cache'
    return response
Start the application

To test the application, set DEBUG=1 on the command line to start it in debug mode:

(analytics) $ DEBUG=1 python analytics.py
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with reloader

Visit http://127.0.0.1:5000/a.js to see the JavaScript file. If you have another web application running locally, you can embed the script in its pages to test the analytics app:

<script src="http://127.0.0.1:5000/a.js" type="text/javascript"></script>

For production deployments, use a dedicated WSGI server. I like gevent, which is very lightweight and high-performance. Modify analytics.py to replace Flask's built-in server with gevent; the following code runs the application on port 5000 using gevent:

if __name__ == '__main__':
    from gevent.wsgi import WSGIServer
    WSGIServer(('', 5000), app).serve_forever()

Because gevent relies on monkey-patching for high concurrency, you also need to add the following line at the top of analytics.py:

from gevent import monkey; monkey.patch_all()
Query data

The really fun part comes after you have collected a few days of data and can start querying it. This section shows how to pull some interesting information out of the collected data.

The examples below use data from my blog, looking at the last seven days:

>>> from analytics import *
>>> import datetime
>>> week_ago = datetime.date.today() - datetime.timedelta(days=7)
>>> base = PageView.select().where(PageView.timestamp >= week_ago)

First, let's look at the number of page views over the past week:

>>> base.count()
1133

How many different IP addresses have accessed my website?

>>> base.select(PageView.ip).group_by(PageView.ip).count()
850

What are the 10 most visited pages?

print (base
       .select(PageView.title, fn.Count(PageView.id))
       .group_by(PageView.title)
       .order_by(fn.Count(PageView.id).desc())
       .tuples())[:10]

# Prints...
[('Postgresql HStore, JSON data-type and Arrays with Peewee ORM', 88),
 ("Describing Relationships: Django's ManyToMany Through", 73),
 ('Using python and k-means to find the dominant colors in images', 66),
 ('SQLite: Small. Fast. Reliable. Choose any three.', 58),
 ('Using python to generate awesome linux desktop themes', 54),
 ("Don't sweat the small stuff - use flask blueprints", 51),
 ('Using SQLite Full-Text Search with Python', 48),
 ('Home', 47),
 ('Blog Entries', 46),
 ('Django Patterns: Model Inheritance', 44)]

Which times of day get the most visits, bucketed into 4-hour blocks?

hour = fn.date_part('hour', PageView.timestamp) / 4
id_count = fn.Count(PageView.id)
print (base
       .select(hour, id_count)
       .group_by(hour)
       .order_by(id_count.desc())
       .tuples())[:]

[(3, 208), (2, 201), (0, 194), (1, 183), (4, 178), (5, 169)]

Based on this data, the most visits come around midday, the fewest in the hours before midnight, and overall traffic is fairly evenly distributed.
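To make the bucket numbers above easier to read, each bucket index can be converted back to an hour range (a small illustrative helper, not part of the original application):

```python
def bucket_to_hours(bucket, width=4):
    # Bucket N covers hours [N*width, N*width + width - 1].
    start = bucket * width
    return '%02d:00-%02d:59' % (start, start + width - 1)

# The busiest bucket in the output above was 3, i.e. midday:
print(bucket_to_hours(3))
# The quietest was 5, i.e. the hours before midnight:
print(bucket_to_hours(5))
```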

Which user-agents are most popular?

from collections import Counter

c = Counter(pv.headers.get('User-Agent') for pv in base)
print c.most_common(5)

[(u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 81),
 (u'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 70),
 (u'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0', 50),
 (u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2', 37),
 (u'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0', 37)]

What data you extract is up to you. One interesting query is the ordered list of pages each IP address visited, which shows the paths people take through your site:

inner = base.select(PageView.ip, PageView.url).order_by(PageView.timestamp)
query = (PageView
         .select(PageView.ip, fn.GROUP_CONCAT(PageView.url).alias('urls'))
         .from_(inner.alias('t1'))
         .group_by(PageView.ip)
         .order_by(fn.Count(PageView.url).desc()))

print {pv.ip: pv.urls.split(',') for pv in query[:10]}

# Prints something like the following:
{
  u'xxx.xxx.xxx.xxx': [
    u'/blog/peewee-was-baroque-so-i-rewrote-it/',
    u'/blog/peewee-was-baroque-so-i-rewrote-it/',
    u'/blog/',
    u'/blog/postgresql-hstore-json-data-type-and-arrays-with-peewee-orm/',
    u'/blog/search/',
    u'/blog/the-search-for-the-missing-link-what-lies-between-sql-and-django-s-orm-/',
    u'/blog/how-do-you-use-peewee-/'],
  u'xxx.xxx.xxx.xxx': [
    u'/blog/dont-sweat-small-stuff-use-flask-blueprints/',
    u'/',
    u'/blog/',
    u'/blog/migrating-to-sqlite/',
    u'/blog/',
    u'/blog/saturday-morning-hacks-revisiting-the-notes-app/'],
  u'xxx.xxx.xxx.xxx': [
    u'/blog/using-python-to-generate-awesome-linux-desktop-themes/',
    u'/',
    u'/blog/',
    u'/blog/customizing-google-chrome-s-new-tab-page/',
    u'/blog/-wallfix-using-python-to-set-my-wallpaper/',
    u'/blog/simple-botnet-written-python/'],
  # etc...
}
Some ideas for improving the application:

* Build a web interface or API for querying the page-view data
* Normalize the request-header data, using separate tables or something like PostgreSQL's HStore
* Use cookies to track users' paths through the site
* Use GeoIP to determine visitors' geographic locations
* Use canvas fingerprinting to better identify unique visitors
* Come up with more and cooler queries to explore the data
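For example, the cookie-tracking idea could start by issuing each new visitor a random identifier. A hypothetical sketch (the function and cookie names are made up; in Flask you would read request.cookies and set the cookie on the response):

```python
import uuid

def get_or_create_visitor_id(cookies):
    # `cookies` stands in for Flask's request.cookies dict.
    # Reuse the visitor's existing id, or mint a fresh random one.
    visitor_id = cookies.get('_visitor')
    if visitor_id is None:
        visitor_id = uuid.uuid4().hex  # 32 hex characters
    return visitor_id

returning = get_or_create_visitor_id({'_visitor': 'abc123'})  # keeps its id
fresh = get_or_create_visitor_id({})                          # gets a new id
print(returning)
print(len(fresh))
```

Storing that identifier as a field on PageView would let the path-per-IP query above group by visitor instead of by IP address.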
