Common database and network service connections for data engineers (Python code)
As a data engineer or data analyst, you often deal with many kinds of data, so connecting to data sources is unavoidable. Below I share the data connection configurations that I use most often in my work, for your reference.
MySQL database
MySQL is currently one of the most widely used databases. Here I implement read and write interfaces only; deletion and update operations are generally not something analysts need, so I did not implement them.
import MySQLdb
import pandas as pd
from sqlalchemy import create_engine


class con_analyze:
    """Data analysis platform connection"""

    def __init__(self, database='myanalyze'):
        self.database = database
        self.conn = None

    def connect(self):
        self.conn = MySQLdb.connect(host='***', user='root', passwd='***',
                                    db=self.database, charset='utf8')

    def query(self, sql):
        try:
            self.connect()
            data = pd.read_sql(sql, self.conn)
        except (AttributeError, MySQLdb.OperationalError):
            self.connect()  # an error occurred while reading, so reconnect
            data = pd.read_sql(sql, self.conn)
        return data

    def store(self, mydataframe, table_name, if_exists='replace'):
        conn2 = "mysql+mysqldb://root:***@***:3306/%s" % self.database
        local_engine = create_engine(conn2)
        mydataframe.to_sql(table_name, local_engine, if_exists=if_exists,
                           index=False, chunksize=10000)

    # You could also add a function that executes a single SQL statement,
    # so you can not only read data but also update, create tables, and so on.
This is used as a connection class: conn is set to None at initialization, and the actual connection is only created when the query function is executed. (I have concealed my host information with ***; fill in your own host.)
A try statement wraps the query: if the connection or the query fails, an exception is raised, and the except block reconnects and retries. There is also a library dedicated to retrying code: tenacity lets you implement this kind of retry logic more elegantly (more pythonically).
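To illustrate the retry idea, here is a minimal hand-rolled sketch of what a retry decorator does (tenacity provides this, and much more, out of the box; flaky_query and its failure counter are invented for the demo):

```python
import functools
import time

def retry(times=3, delay=0.1):
    """Minimal retry decorator: re-run the function on failure, up to `times` attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise          # out of attempts: re-raise the last error
                    time.sleep(delay)  # brief pause before retrying
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky_query():
    """Simulated query that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("lost connection")
    return "rows"

result = flaky_query()
```

With tenacity, the rough equivalent is decorating the function with @retry(stop=stop_after_attempt(3)).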
pandas' read_sql function is used to read the data. It converts the query result directly into a DataFrame, which is convenient for subsequent analysis.
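As a quick self-contained illustration (using the stdlib sqlite3 instead of MySQL, with a made-up users table), read_sql turns a query result straight into a DataFrame:

```python
import sqlite3
import pandas as pd

# in-memory sqlite database stands in for the MySQL connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

df = pd.read_sql("SELECT * FROM users", conn)  # query result as a DataFrame
```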
Storage uses the DataFrame method to_sql, which writes a df directly to the database as a SQL table. If table_name already exists, you can choose replace or append; and if the df is large and long, you should set the chunksize parameter. With chunksize set, the program stores your df iteratively, for example 100,000 rows saved 10,000 rows at a time (that number is my own setting; you can change it).
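A small sketch of the chunked write (again with sqlite3 standing in for MySQL; the table name metrics and the row counts are invented):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": range(25000), "value": range(25000)})

# chunksize=10000 writes the 25,000 rows in three batches: 10k, 10k, 5k
df.to_sql("metrics", conn, if_exists="replace", index=False, chunksize=10000)

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```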
Here you may wonder why the read and write connections are different: one is created with MySQLdb.connect and the other with create_engine. In fact, conn2 (the SQLAlchemy engine) can also serve as the read connection, but the connection created by MySQLdb.connect may not work for writing; I ran into errors with it many times in practice, so I switched.
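To show that a single SQLAlchemy engine can serve both directions, here is a sketch with an in-memory sqlite engine standing in for the MySQL connection string above (the data is made up):

```python
import pandas as pd
from sqlalchemy import create_engine

# in-memory sqlite engine stands in for
# create_engine("mysql+mysqldb://root:***@***:3306/mydb")
engine = create_engine("sqlite://")

df = pd.DataFrame({"id": [1, 2], "name": ["ann", "bob"]})
df.to_sql("users", engine, index=False)            # write through the engine
back = pd.read_sql("SELECT * FROM users", engine)  # read through the same engine
```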
In fact, you can configure a connection class like this for any database in your own project. It is used like this: first, put the code in a separate configuration file, such as config.py; then import the configuration file wherever you need it.
from config import con_analyze


class AnalyzeData:
    def __init__(self):
        # Initialize here; you can pass a database parameter (the default is myanalyze)
        self.conn = con_analyze()
        # self.conn2 = con_analyze("myanalyze_2")

    def get_data(self, sql):
        # Save the SQL query result to df
        df = self.conn.query(sql=sql)

    def store_data(self, df):
        # Save the DataFrame df to the table named db_name
        self.conn.store(df, 'db_name')
MongoDB
MongoDB is an unstructured (NoSQL) database; the data stored in it consists of JSON-like key-value documents. Below I will briefly introduce how to query data in MongoDB. As before, we create a class to standardize access.
import pymongo
import pandas as pd


class Conn_Mongo:
    """Mongo database connection"""

    def __init__(self):
        self.pai_utoken = pymongo.MongoClient('mongodb://***:100').utoken  # user database

    def get_user_data_mongo(self, list_id):
        """Query user info by connecting to mongo."""
        user_data = pd.DataFrame(
            list(self.pai_utoken.userinfo.find({'ftoken': {'$in': list(list_id)}})))
        return user_data
This one is simple: it is just a query. I pass in a batch of IDs and look up the corresponding information by ID. MongoDB databases are generally fairly large, so I query for specific information rather than pulling everything.
Here, the pymongo library is used to create a connection to the given address (hidden with ***); utoken is the name of the database, and you could also make it a parameter passed in at initialization. The find function performs the query: userinfo before find is the collection (table) name, and find's argument is also a key-value document. Here the key is 'ftoken', and the value {'$in': list(list_id)} means: match documents whose ftoken is in the given list. The IDs are made into a list (named list_id for clarity); see the related MongoDB query syntax for details.
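The find filter is just a Python dict; here is a quick local simulation of what the $in operator matches (no Mongo server needed, and the documents are invented):

```python
list_id = ["tok1", "tok2", "tok3"]
query = {"ftoken": {"$in": list(list_id)}}  # same filter document passed to find()

# fake collection contents, used only to show the matching rule
docs = [{"ftoken": "tok1", "name": "a"},
        {"ftoken": "zzz", "name": "b"}]

# mongo returns the documents whose ftoken appears in the $in list
matched = [d for d in docs if d["ftoken"] in query["ftoken"]["$in"]]
```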
Flurry
If your work involves app data, you will often use Flurry to obtain data.
Flurry is a mobile analytics platform. Although it is based abroad, it is still usable in China (unlike Google Analytics). Operating data for iOS and Android apps can be queried in its statistics. If you don't have an account yet, sign up here: Flurry
Its commonly used features are user metrics and click-event statistics.
However, that is not the point here; the above just shows what Flurry looks like. What I want to do now is write a Python interface to retrieve the data. The Flurry API documentation is here: Flurry API
This is an api for creating analysis reports.
First, we need to apply for an app token to obtain the connection permission. For how to apply, see: app access token.
It is a string of letters.
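The token is used to build the Authorization header that accompanies every request; a minimal sketch (the token value here is obviously fake):

```python
api_token = "abc.def.ghi"  # fake placeholder; use your real Flurry app token
headers = {'Authorization': 'Bearer {}'.format(api_token)}
```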
After obtaining the token, we can create a url to obtain the data in Flurry. For details, see the following code:
import json

import requests
import pandas as pd


class Conn_Flurry:
    """Flurry api data"""
    api_token = "******.****.****"
    headers = {'Authorization': 'Bearer {}'.format(api_token)}
    url = ("https://api-metrics.flurry.com/public/v1/data/appEvent/day/app"
           "?metrics=activeDevices,newDevices,averageTimePerDevice"
           "&dateTime=2017-05-23/2017-05-24")

    def get_results(self, url=url):
        """
        The url used here is an example; you can also build the url you need
        with get_url and pass it to this function as the parameter.
        """
        data = requests.get(url, headers=self.headers)
        cleaned = json.loads(data.text)
        cleaned = pd.DataFrame(cleaned['rows'])
        return cleaned

    def get_url(self, table='appEvent', timegrain='day', dimensions='app/event',
                metrics='occurrences', dateTime='2017-05-23/2017-05-24', filters=""):
        """
        If filters is empty, the result is not affected.
        Standard url:
            endpoint + '/table/timeGrain/dimension1/dimension2;show=all/dimension3{...}?metrics=[comma-separated-metrics]&dateTime=[..]&filters=[..]&topN=[..]&sort=[..]&having=[..]&format=[..]&timeZone=[..]'
        App usage url:
            endpoint + '/appUsage/day?metrics=sessions,activeDevices,newDevices&dateTime=2016-06-01/2016-08-01&filters=app|name-in[appname]'
        App event url:
            endpoint + '/appEvent/day/app/appVersion/event?metrics=occurrences&dateTime=2016-07-01/2016-07-03&filters=app|name-in[foo],event|name-in[login,register]'
        App event url 2:
            endpoint + '/appEvent/day/app/country?metrics=activeDevices,newDevices&dateTime=2016-07-01/2016-07-03&filters=app|name-in[foo],event|name-in[login]&topN=5&sort=activeDevices|desc'
        Event parameters:
            endpoint + '/eventParams/day/app;show=all/event/paramName/paramValue?metrics=count&dateTime=2016-11-07/2016-11-08&filters=app|name-in[foo],event|name-in[level_complete]'
        Note how dimensions changes: to see the details of a specific event,
        use app;show=all/event/paramName/paramValue, i.e. add show=all.
        Note the filters format: you can filter by app name and event name.
        Note the relationship between timegrain and dateTime: common values are
        day and month, and the dateTime format must change to match.
        """
        endpoint = 'https://api-metrics.flurry.com/public/v1/data'
        url = "{}/{}/{}/{}?metrics={}&dateTime={}&filters={}".format(
            endpoint, table, timegrain, dimensions, metrics, dateTime, filters)
        return url
The code is a little long, with many comments in the middle, but overall there are just two steps:
1. Construct a url
2. Get the result of the url
In detail, though, there is a lot involved, such as why the url takes this format, why the headers are constructed that way, and what form the results take. What I want to say is that these details are already covered in the official API documentation, so I won't restate them. If you have questions, leave a message in the comments and I will answer as best I can.
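The URL-building step can be tried standalone; this sketch mirrors the get_url logic above (the parameter values come from the usage example in this article):

```python
endpoint = 'https://api-metrics.flurry.com/public/v1/data'

def get_url(table, timegrain, dimensions, metrics, dateTime, filters=""):
    # same string template as the class method above
    return "{}/{}/{}/{}?metrics={}&dateTime={}&filters={}".format(
        endpoint, table, timegrain, dimensions, metrics, dateTime, filters)

url = get_url('appUsage', 'month', 'app',
              'averageTimePerSession,activeDevices,newDevices,sessions',
              '2017-09/2017-10')
```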
url = self.conn_flurry.get_url('appUsage', 'month', 'app',
                               'averageTimePerSession,activeDevices,newDevices,sessions',
                               self.time_range)
user_mobile = self.conn_flurry.get_results(url)
The above is a simple application, where time_range should be in this format
self.time_range = '2017-09/2017-10'
For such a time range, Flurry defaults to a left-closed, right-open interval: that is, it includes September but not October.
Similarly,
'2017-09-23/2017-10-24'
means results from September 23 up to, but not including, October 24. If you pull data for a period of time, it is easy to overlook this and end up with one day less data than you expected.
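The half-open convention is easy to check with a date loop (dates taken from the example above):

```python
from datetime import date, timedelta

start = date(2017, 9, 23)
end = date(2017, 10, 24)   # exclusive: Flurry's range is left-closed, right-open

days = []
d = start
while d < end:             # strictly less than, so the end date is excluded
    days.append(d)
    d += timedelta(days=1)

n_days = len(days)         # last included day is October 23
```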
If you fetch data by day, the date dimension in the results will show exactly which days were included.