Upload data containing Chinese characters to Google App Engine

Source: Internet
Author: User

As mentioned in a previous article, I have started using Google App Engine in earnest. I am actually a newbie in this area, so I will not write many GAE-based articles; but whenever I run into a problem that troubles me, I still think it is worth sharing the experience with you.

Data upload is a thorny problem. As mentioned above, because the Python runtime used by Google App Engine does not support remote database access (presumably for performance reasons), the only way to get your data in is to upload it into GAE's datastore through the Google Datastore API. Google's documentation describes the Datastore API as follows:

The Google Datastore API provides an interface for querying and storing data. It is built entirely on Google's scalable architecture and offers a set of data model interfaces in Python, along with a SQL-like query language called GQL, making the development of scalable data-driven applications as easy as other simple web applications.

After reading this introduction, let me air a few complaints about the Google Datastore API. Perhaps because I have long been spoiled by relational databases and their tooling, in my opinion the Datastore API brings the following troubles:

1. Multi-table join queries are not supported. This demand may be a bit much, since the datastore is said not to be built on a relational database, but it does cause a lot of inconvenience;

2. The number of results returned per query is capped at 1000. I rarely need that many, but when I want to know how many entries the entire datastore holds, this cap gets in the way. There is a workaround: sort the results by key in each query, then use the key of the last result as the lower bound for the next batch of 1000 results greater than that key. In theory all results can be retrieved this way, though I have not actually used it;

3. Query conditions are limited. Although many data types are provided, such as GeoPt, you cannot, for instance, query latitude/longitude by distance, so that type ends up serving only as a two-element floating-point array;

4. Uploading and managing data is quite difficult. The data management view Google provides is very basic and only supports simple queries. For example, if I want to delete all records matching a condition, I cannot, because GQL has no DELETE statement;
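The key-sorting workaround from complaint 2 above can be sketched in plain Python. This simulates the per-query result cap over an in-memory list of keys rather than calling the real Datastore API; `query_page` and `fetch_all` are illustrative names, and the cap is shrunk to 3 to keep the demo small:

```python
LIMIT = 3  # stand-in for the real 1000-result cap

def query_page(keys, after_key=None, limit=LIMIT):
    """Stand-in for a query ordered by key with a result cap."""
    matching = [k for k in sorted(keys) if after_key is None or k > after_key]
    return matching[:limit]

def fetch_all(keys):
    """Page through all results by passing the last key seen as the lower bound."""
    results = []
    last = None
    while True:
        page = query_page(keys, after_key=last)
        if not page:
            break
        results.extend(page)
        last = page[-1]  # next query asks for keys greater than this one
    return results

print(fetch_all([5, 1, 4, 2, 3, 7, 6]))  # [1, 2, 3, 4, 5, 6, 7]
```

Counting all entries is then just `len(fetch_all(...))`, at the cost of one request per 1000 records.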

In this article I will talk about data uploading. As you can see, although the datastore, for architectural reasons, lacks many functions of the relational databases I have been using, there are still solutions to these problems; one of them is the bulk data uploader.

The principle of the bulk data uploader is to use a local Python script to POST CSV data to a server page, 10 records at a time (not more, because server-side processing easily times out), and then save the data to the datastore according to a simple data-type configuration on the server.

This process is very slow. I am currently sending over a hundred thousand records to the datastore; I wrote this blog post while the machine was grinding through it. Although each record is only four int values, I initially estimated that only about 0.7 requests can be sent per second, i.e. 7 records per second on average, so it takes about 7 hours to send everything. Of course, this also depends on network speed: on my connection, file downloads usually run at about 60 KB/s. In fact my CSV data file is less than 4 MB; with a more efficient mechanism, the whole job could be done within a minute.
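The time estimate above works out as follows (a quick sanity check; the 0.7 requests per second and 10 records per request are the figures from the text):

```python
requests_per_second = 0.7   # observed client throughput (from the text)
records_per_request = 10    # CSV records posted per request
records_per_second = requests_per_second * records_per_request

seconds_needed = 7 * 3600                       # the quoted 7 hours
total_records = records_per_second * seconds_needed
print(round(total_records))  # 176400 records moved in 7 hours at 7/s
```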

Now let's talk about how to solve the problem of uploading Chinese characters, and then walk through the process of using the bulk uploader. If you upload data the default way Google describes, you will keep hitting one problem or another as soon as Chinese characters appear: either "'ascii' codec can't encode character u'\ufeff' in position 0", or "three columns should exist, and only two columns are received". After searching for a while, I found that Google has a fix under "issue 157: non-ascii csv data not handled by google.appengine.ext.bulkload (unicode errors)". In general, the upload process is as follows:
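The first error above is the classic symptom of a UTF-8 byte-order mark (BOM) at the start of the file: U+FEFF gets glued onto the first field and later chokes an ASCII-only code path. A minimal illustration with made-up data (in modern Python, decoding with the `utf-8-sig` codec strips the BOM before parsing):

```python
import codecs
import csv
import io

# A CSV payload as saved by tools that prepend a UTF-8 BOM (e.g. Excel).
raw = codecs.BOM_UTF8 + "北京,39.9042\n上海,31.2304\n".encode("utf-8")

# Decoding with plain utf-8 leaves U+FEFF glued to the first field --
# the character the "'ascii' codec can't encode" error complains about.
naive_first_field = raw.decode("utf-8").split(",")[0]
print(naive_first_field.startswith("\ufeff"))  # True

# Decoding with utf-8-sig strips the BOM, so the CSV parses cleanly.
rows = list(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
print(rows[0][0])  # 北京
```

The patched bulkload module from issue 157 handles this decoding on the server side, which is why the steps below start by swapping it in.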

1. Take the replacement __init__.py attached to issue 157, save it into the root folder of your project, and rename it "bulkload.py". This file contains the modifications that fix uploading Chinese characters;

2. Create a server page to receive the data and save it to the datastore. You need to specify the data types in this file:

import bulkload   # note: not "from google.appengine.ext import bulkload" as in
                  # Google's documentation, but the bulkload.py in our project
                  # root; this is what fixes Chinese-character uploads
from google.appengine.api import datastore_types
from google.appengine.ext import search

class InitData(bulkload.Loader):
    def __init__(self):
        # Define the data format: 'datatable1' is the kind name of the
        # uploaded data; each tuple gives a column's name and type.
        # GeoPt values must be quoted "lat, lon" pairs in the CSV file,
        # e.g. "2.99629, 73.00207".
        bulkload.Loader.__init__(self, 'datatable1',
                                 [('field1', int),
                                  ('field2', datastore_types.GeoPt),
                                  ('field3', unicode),
                                  ('field4', str),
                                  ])

if __name__ == '__main__':
    bulkload.main(InitData())
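What the loader above does per row can be sketched in plain Python: each CSV cell is passed through the converter paired with its column name. This is a simplified stand-in for what `bulkload.Loader` does internally, not the real implementation, and `parse_geopt` is a toy substitute for `datastore_types.GeoPt`:

```python
def parse_geopt(value):
    """Toy stand-in for datastore_types.GeoPt: parses a "lat, lon" string."""
    lat, lon = (float(part) for part in value.split(","))
    return (lat, lon)

# Same shape as the (name, type) pairs handed to bulkload.Loader above.
COLUMNS = [("field1", int),
           ("field2", parse_geopt),
           ("field3", str),
           ("field4", str)]

def row_to_entity(csv_row):
    """Convert one CSV row into a dict of typed property values."""
    return {name: convert(cell)
            for (name, convert), cell in zip(COLUMNS, csv_row)}

entity = row_to_entity(["42", "2.99629, 73.00207", "马尔代夫", "note"])
print(entity["field2"])  # (2.99629, 73.00207)
```

If any converter raises (e.g. `int` on a non-numeric cell), the row is rejected, which is exactly why the column types in the loader must match the CSV.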

3. After defining the page above, register it in the handlers section of app.yaml:

- url: /datainit
  script: datainit.py

For security reasons, you should also configure this URL so that only the administrator can access it (login: admin). However, I do not quite understand how to supply the username and password in the bulkload client once admin access is required, so I have left it open for now;

4. On the client, run tools\bulkload_client.py from the google_appengine installation directory, for example:

bulkload_client.py --filename data/arealist.csv --kind arealist --url http://aaa.appspot.com/datainit

The filename parameter is the path to the local CSV file, the kind parameter is the name of the data kind, and the url parameter is the address of the server page that processes the upload.

Once it is running, you can watch the client upload data continuously. This process is very slow, and if your data is even more complex than mine, I recommend batching the uploads sensibly, because the Google servers impose daily limits on the number of requests and on CPU usage. Although those limits are hard to reach, you still need to be careful when uploading continuously, because writing data is itself expensive.
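The batching the client performs can be sketched as follows. The `send` callback here is an illustrative placeholder for the HTTP POST to the upload handler; the real bulkload client already posts 10 rows per request:

```python
def chunked(rows, size=10):
    """Yield successive fixed-size batches from a list of rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def upload(rows, send, batch_size=10):
    """Send rows in batches; returns the number of requests made."""
    count = 0
    for batch in chunked(rows, batch_size):
        send(batch)   # e.g. an HTTP POST to the /datainit handler
        count += 1
    return count

sent = []
n = upload(list(range(25)), sent.append, batch_size=10)
print(n)  # 3 requests: batches of 10, 10 and 5 rows
```

Keeping batches small avoids server-side timeouts; pausing between batches (or spreading the job over several days) keeps you clear of the daily quotas mentioned above.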

 
