[Python] Flask
Analyzing request/response traffic
The analytics service we will build is a bit like Google Analytics (or rather, a much-simplified version of it).
Workflow:
Each page to be tracked includes a <script> tag referencing a JavaScript file served by our application (for example, in the base template of the website being analyzed). When someone visits the site, their browser executes the code in this JavaScript file, which reads the title, URL, and other interesting attributes of the current page. The clever part is that the script then dynamically creates an <img> tag (a blank 1x1-pixel image) whose src attribute points at our analytics server, with the information collected from the current page encoded into the URL. Our analytics server receives the request, parses and stores the information in the database, and returns the 1-pixel GIF.
The interaction flow: the page loads -> the browser fetches a.js -> the script requests a.gif carrying the encoded page data -> the server stores the page view and returns the GIF.
Design Considerations
Because this will run on a VPS with very limited resources (a very low-spec server) and my blog does not get much traffic, the service should be lightweight and flexible. I like Flask for just about any kind of project, and it is a particularly good fit here. We will use the peewee ORM to store page views (PVs) and to query and analyze the data. As we will see, the whole application comes in at under 100 lines of code (including comments).
Database
To be able to query the data, we will store page views in a relational database. I chose BerkeleyDB's SQLite interface because it is a lightweight embedded database that does not use much memory. I considered SQLite itself, but BerkeleyDB handles concurrent access much better than SQLite, so the analytics app can keep running smoothly even under heavy load.
If you already have PostgreSQL or MySQL installed, feel free to use one of those instead.
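If you go that route, swapping backends in peewee is a one-line change. A minimal sketch, assuming a local database named analytics (the database name and credentials are placeholders):

```python
# Hypothetical backend swap; database name and credentials are placeholders.
from peewee import PostgresqlDatabase, MySQLDatabase

database = PostgresqlDatabase('analytics', user='postgres')
# or, for MySQL:
# database = MySQLDatabase('analytics', user='root')
```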
WSGI Server
Although there are many options, I prefer gevent. Gevent is a coroutine-based networking library that implements greenlets on top of the libev event loop. Through monkey-patching, gevent turns ordinary blocking Python code into non-blocking code without any special APIs or syntax. Gevent's WSGI server, though very basic, offers high performance with very low resource consumption. As with the database, if you prefer a different WSGI server, use whichever suits you.
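To make the monkey-patching claim concrete, here is a small sketch (not from the original article): after patch_all(), standard-library sockets cooperate with gevent's event loop, so the three downloads below run concurrently in greenlets.

```python
# Illustrative only: monkey-patching makes blocking stdlib calls cooperative.
from gevent import monkey; monkey.patch_all()

import gevent
import urllib2  # Python 2, matching the rest of this article.

def fetch(url):
    return len(urllib2.urlopen(url).read())

# The patched sockets yield to the event loop, so these run concurrently.
jobs = [gevent.spawn(fetch, 'http://example.com/') for _ in range(3)]
gevent.joinall(jobs)
print [job.value for job in jobs]
```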
Create a virtualenv
We will first create an isolated environment for the analytics application and install flask and peewee into it (and, optionally, gevent).
```shell
$ virtualenv analytics
New python executable in analytics/bin/python2
Also creating executable in analytics/bin/python
Installing setuptools, pip...done.
$ cd analytics/
$ source bin/activate
$ pip install flask peewee
...
Successfully installed flask peewee Werkzeug Jinja2 itsdangerous markupsafe
Cleaning up...
$ pip install gevent  # Optional.
```
If you want to compile the Python SQLite driver with BerkeleyDB support, check out the berkeley_build.sh script in the playhouse module (lib/python2.7/site-packages/playhouse/berkeley_build.sh). The script downloads and compiles BerkeleyDB, then compiles pysqlite against it; the detailed procedure is described in this article. You can skip this step and use peewee's SqliteDatabase class directly.
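If you do skip the BerkeleyDB build, the only change to the code below is the database class. A minimal sketch (DATABASE_NAME is defined in analytics.py in the next section):

```python
# Plain SQLite via peewee, in place of the BerkeleyDatabase class.
from peewee import SqliteDatabase

database = SqliteDatabase(DATABASE_NAME)
```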
Implement the Flask Application
Let's start with the overall code skeleton. As discussed above, we will create two views: one that returns the JavaScript file and one that returns the 1-pixel GIF. Create an analytics directory (the working directory used by the author; create it if it does not exist), and inside it create a file named analytics.py with the code below. It contains the application's structure and basic configuration.
```python
# coding: utf-8
from base64 import b64decode
import datetime
import json
import os
from urlparse import parse_qsl, urlparse

from flask import Flask, Response, abort, request
from peewee import *
from playhouse.berkeleydb import BerkeleyDatabase  # Optional.

# 1 pixel GIF, base64-encoded.
BEACON = b64decode('R0lGODlhAQABAIAAANvf7wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==')

# Store the database file in the app directory.
APP_DIR = os.path.dirname(__file__)
DATABASE_NAME = os.path.join(APP_DIR, 'analytics.db')
DOMAIN = 'http://127.0.0.1:5000'  # TODO: change me.

# Simple JavaScript which will be included and executed on the client-side.
JAVASCRIPT = ''  # TODO: add javascript implementation.

# Flask application settings.
DEBUG = bool(os.environ.get('DEBUG'))
SECRET_KEY = 'secret - change me'  # TODO: change me.

app = Flask(__name__)
app.config.from_object(__name__)

database = BerkeleyDatabase(DATABASE_NAME)  # or SqliteDatabase(DATABASE_NAME)

class PageView(Model):
    # TODO: add model definition.
    class Meta:
        database = database

@app.route('/a.gif')
def analyze():
    # TODO: implement 1-pixel gif view.
    pass

@app.route('/a.js')
def script():
    # TODO: implement javascript view.
    pass

@app.errorhandler(404)
def not_found(e):
    return Response('Not found.')

if __name__ == '__main__':
    database.create_tables([PageView], safe=True)
    app.run()
```

Note that app.config.from_object(__name__) copies the module's uppercase names (BEACON, DOMAIN, JAVASCRIPT, DEBUG, and so on) into app.config, which is why the views below read these settings through app.config.
Obtain information from a browser
We start by writing the JavaScript file that collects client-side information, mainly a few basic attributes of the page:
* the URL of the current page, including any query string (document.location.href)
* the page title (document.title)
* the referring page, if any (document.referrer)
If you are interested, further attributes can be extracted as well (likely more than those listed here, especially with HTML5), for example:
* cookie information (document.cookie)
* the document's last-modified time (document.lastModified)
* and more
After extracting this information, we pass it to the analyze view via the URL query string. To keep things simple, the script fires as soon as the page loads. We wrap all of the code in an anonymous function and escape each parameter with encodeURIComponent, so the values are safe to place in a URL:
```javascript
(function() {
  var img = new Image(),
      url = encodeURIComponent(document.location.href),
      title = encodeURIComponent(document.title),
      ref = encodeURIComponent(document.referrer);
  img.src = '%s/a.gif?url=' + url + '&t=' + title + '&ref=' + ref;
})();
```
Note the %s placeholder: on the Python side we read the DOMAIN setting and interpolate it into the script to produce the complete JavaScript.
In the .py file, we define a JAVASCRIPT variable holding a minified version of the script above:
```python
# Simple JavaScript which will be included and executed on the client-side.
JAVASCRIPT = """(function(){
    var d=document,i=new Image,e=encodeURIComponent;
    i.src='%s/a.gif?url='+e(d.location.href)+'&ref='+e(d.referrer)+'&t='+e(d.title);
    })()""".replace('\n', '')
```
And the corresponding view:
```python
@app.route('/a.js')
def script():
    return Response(
        app.config['JAVASCRIPT'] % (app.config['DOMAIN']),
        mimetype='text/javascript')
```
Save PV Information
The script above passes three values to the analyze view: the page's URL, its title, and the referring page. Now we define a PageView model to store this data.
On the server side we can also read the visitor's IP address and the request headers, so we create fields for those as well, plus a timestamp field recording when the request arrived.
Because every browser sends different request headers and every page request can carry different query parameters, we will store both as JSON in TextField columns. If you use PostgreSQL, you could use HStore or the native JSON data type instead.
The following is the definition of the PageView model. A JSONField is also defined to store query parameters and request header information:
```python
class JSONField(TextField):
    """Store JSON data in a TextField."""
    def python_value(self, value):
        if value is not None:
            return json.loads(value)

    def db_value(self, value):
        if value is not None:
            return json.dumps(value)


class PageView(Model):
    domain = CharField()
    url = TextField()
    timestamp = DateTimeField(default=datetime.datetime.now, index=True)
    title = TextField(default='')
    ip = CharField(default='')
    referrer = TextField(default='')
    headers = JSONField()
    params = JSONField()

    class Meta:
        database = database
```
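As a quick illustration of how JSONField round-trips data (a sketch, not from the original article): values go in as Python dicts, are serialized to JSON text on write, and come back as dicts on read.

```python
# Hypothetical round-trip: headers/params are dicts in Python, JSON text in the DB.
pv = PageView.create(domain='example.com', url='/',
                     headers={'User-Agent': 'test'}, params={'q': 'flask'})
fetched = PageView.get(PageView.id == pv.id)
print fetched.headers['User-Agent']  # -> 'test'
```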
Next, we add a classmethod to PageView that extracts all the required values from the current request and persists them to the database. The urlparse module provides helpers for picking apart a URL; we use them to get the visited URL and its query parameters:
```python
class PageView(Model):
    # ... field definitions ...

    @classmethod
    def create_from_request(cls):
        parsed = urlparse(request.args['url'])
        params = dict(parse_qsl(parsed.query))

        return PageView.create(
            domain=parsed.netloc,
            url=parsed.path,
            title=request.args.get('t') or '',
            ip=request.headers.get('X-Forwarded-For', request.remote_addr),
            referrer=request.args.get('ref') or '',
            headers=dict(request.headers),
            params=params)
```
The last step is the analyze view itself, which returns the 1-pixel GIF. As a sanity check, we verify that a url parameter is present, so no blank records end up in the database.
```python
@app.route('/a.gif')
def analyze():
    if not request.args.get('url'):
        abort(404)

    with database.transaction():
        PageView.create_from_request()

    response = Response(app.config['BEACON'], mimetype='image/gif')
    response.headers['Cache-Control'] = 'private, no-cache'
    return response
```
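A quick way to exercise the whole pipeline without a browser is Flask's test client. A minimal sketch (the page URL and title here are made up, and the tables must already exist):

```python
# Hypothetical smoke test: hit the beacon and confirm a row was written.
with app.test_client() as client:
    resp = client.get('/a.gif?url=http%3A%2F%2Fexample.com%2Fpage%3Fq%3Dflask'
                      '&t=Example&ref=')
    assert resp.status_code == 200
    print PageView.select().count()  # Should have grown by one.
```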
Start the application
To test the application at this point, set DEBUG=1 on the command line to start the server in debug mode:
```shell
(analytics) $ DEBUG=1 python analytics.py
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with reloader
```
Visit http://127.0.0.1:5000/a.js to see the JavaScript file. If you have another web application running locally, you can embed this script in its pages to test the analytics app:
```html
<script src="http://127.0.0.1:5000/a.js" type="text/javascript"></script>
```
For production, we recommend deploying behind a dedicated WSGI server. I like gevent: it is very lightweight and performs well. Modify analytics.py to replace Flask's built-in server with gevent. The following code runs the application on port 5000 (note: on newer gevent releases this module has moved, and the import is from gevent.pywsgi import WSGIServer):
```python
if __name__ == '__main__':
    from gevent.wsgi import WSGIServer
    WSGIServer(('', 5000), app).serve_forever()
```
Because gevent relies on monkey-patching for high concurrency, also add the following line at the top of analytics.py:
```python
from gevent import monkey; monkey.patch_all()
```
Query data
The really rewarding part comes once you have collected a few days of data and can start querying it. This section shows how to pull some interesting information out of the collected page views.
The examples below use data from my blog, restricted to the last seven days.
```python
>>> from analytics import *
>>> import datetime
>>> week_ago = datetime.date.today() - datetime.timedelta(days=7)
>>> base = PageView.select().where(PageView.timestamp >= week_ago)
```
First, let's look at the page-view count for the past week:
```python
>>> base.count()
1133
```
How many different IP addresses have accessed my website?
```python
>>> base.select(PageView.ip).group_by(PageView.ip).count()
850
```
What are the 10 most visited pages?
```python
print (base
       .select(PageView.title, fn.Count(PageView.id))
       .group_by(PageView.title)
       .order_by(fn.Count(PageView.id).desc())
       .tuples())[:10]

# Prints...
[('Postgresql HStore, JSON data-type and Arrays with Peewee ORM', 88),
 ("Describing Relationships: Django's ManyToMany Through", 73),
 ('Using python and k-means to find the dominant colors in images', 66),
 ('SQLite: Small. Fast. Reliable. Choose any three.', 58),
 ('Using python to generate awesome linux desktop themes', 54),
 ("Don't sweat the small stuff - use flask blueprints", 51),
 ('Using SQLite Full-Text Search with Python', 48),
 ('Home', 47),
 ('Blog Entries', 46),
 ('Django Patterns: Model Inheritance', 44)]
```
At what times of day do most visits occur? Grouping into 4-hour buckets:
```python
hour = fn.date_part('hour', PageView.timestamp) / 4
id_count = fn.Count(PageView.id)

print (base
       .select(hour, id_count)
       .group_by(hour)
       .order_by(id_count.desc())
       .tuples())[:]

# Prints...
[(3, 208), (2, 201), (0, 194), (1, 183), (4, 178), (5, 169)]
```
Judging by the data, most visits happen around lunchtime, the fewest come just before midnight, and overall the traffic is spread fairly evenly.
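One caveat: date_part() is a PostgreSQL function, so the query above assumes a Postgres backend. On SQLite or the BerkeleyDB SQLite interface, an untested sketch of the same bucketing would use strftime (SQLite coerces the text result to a number in arithmetic):

```python
# Hypothetical SQLite equivalent of the 4-hour bucketing above.
hour = fn.strftime('%H', PageView.timestamp) / 4
```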
Which user-agents are most popular?
```python
from collections import Counter

c = Counter(pv.headers.get('User-Agent') for pv in base)
print c.most_common(5)

[(u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 81),
 (u'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36', 70),
 (u'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0', 50),
 (u'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2', 37),
 (u'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0', 37)]
```
What you query is entirely up to you. One interesting query is the ordered list of pages each IP address visited, which shows how people move from page to page through your site.
```python
inner = (base
         .select(PageView.ip, PageView.url)
         .order_by(PageView.timestamp))

query = (PageView
         .select(PageView.ip, fn.GROUP_CONCAT(PageView.url).alias('urls'))
         .from_(inner.alias('t1'))
         .group_by(PageView.ip)
         .order_by(fn.Count(PageView.url).desc()))

print {pv.ip: pv.urls.split(',') for pv in query[:10]}

# Prints something like the following:
{
  u'xxx.xxx.xxx.xxx': [
    u'/blog/peewee-was-baroque-so-i-rewrote-it/',
    u'/blog/peewee-was-baroque-so-i-rewrote-it/',
    u'/blog/',
    u'/blog/postgresql-hstore-json-data-type-and-arrays-with-peewee-orm/',
    u'/blog/search/',
    u'/blog/the-search-for-the-missing-link-what-lies-between-sql-and-django-s-orm-/',
    u'/blog/how-do-you-use-peewee-/'],
  u'xxx.xxx.xxx.xxx': [
    u'/blog/dont-sweat-small-stuff-use-flask-blueprints/',
    u'/',
    u'/blog/',
    u'/blog/migrating-to-sqlite/',
    u'/blog/',
    u'/blog/saturday-morning-hacks-revisiting-the-notes-app/'],
  u'xxx.xxx.xxx.xxx': [
    u'/blog/using-python-to-generate-awesome-linux-desktop-themes/',
    u'/',
    u'/blog/',
    u'/blog/customizing-google-chrome-s-new-tab-page/',
    u'/blog/-wallfix-using-python-to-set-my-wallpaper/',
    u'/blog/simple-botnet-written-python/'],
  # etc...
}
```
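Note that GROUP_CONCAT is a SQLite/MySQL aggregate. If you run this query against PostgreSQL, a hedged equivalent uses string_agg instead:

```python
# Hypothetical Postgres variant of the aggregation above.
urls = fn.string_agg(PageView.url, ',').alias('urls')
```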
Some ideas for improving the application (the first one is sketched below):
* Build a web interface or API for querying the page-view data.
* Use a table or something like PostgreSQL HStore to normalize the request-header data.
* Collect user cookies to track each visitor's path through the site.
* Use GeoIP to determine visitors' geographic locations.
* Use canvas fingerprinting to better identify unique visitors.
* Write more and cooler queries to explore the data.
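As an example of the first idea, here is a minimal sketch (not part of the original application; the route name is made up) of a JSON endpoint exposing the ten most-viewed pages, built on the PageView model defined earlier:

```python
# Hypothetical reporting endpoint; add to analytics.py after the model definition.
@app.route('/api/top-pages/')
def top_pages():
    query = (PageView
             .select(PageView.title, fn.Count(PageView.id).alias('count'))
             .group_by(PageView.title)
             .order_by(fn.Count(PageView.id).desc())
             .limit(10))
    return Response(
        json.dumps([(pv.title, pv.count) for pv in query]),
        mimetype='application/json')
```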