Preface: Recently I helped my brother process some JSON files; he needed to read them into a database so that he could later query the data from it. The data comes from the Yelp dataset: https://github.com/Yelp/dataset-examples, http://www.yelp.com/dataset_challenge/. Some of the issues involving JSON and SQL are recorded here.
First, installing a Python MySQL driver
Python ships with the lightweight sqlite database, but it just wasn't enough here; MySQL was needed. Installing the MySQL driver with pip failed, and easy_install failed too. Totally unscientific. After some help from a colleague, the install finally succeeded with conda, of all things. It turns out conda is the package manager that ships with the Anaconda Python distribution.
pip install MySQLdb       # failed
easy_install MySQLdb      # failed
pip install MySQL         # failed
easy_install MySQL        # failed
ipython                   # check which Python environment is actually in use
which python
sudo conda search mysql
conda search mysql
conda install mysql-python  # this one finally worked
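Just to double-check the install (my own addition, not part of the original workflow), the driver can be imported from the conda-managed Python:
import MySQLdb                # if this import succeeds, the conda-installed driver is visible
print MySQLdb.__version__     # prints the driver version string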
Second, processing JSON data
Python has a built-in package for parsing JSON, just as BeautifulSoup handles HTML and the xml package handles XML. Everything can be handled with the json.loads() function; the few lines of code below are enough.
import json
import codecs

f = codecs.open(file_name, encoding="utf-8")
for line in f:
    line = line.strip("\n")
    line_dict = json.loads(line)
It should be noted that:
1. I use codecs to read the file. I used to write
with codecs.open(file_name, encoding="utf-8") as f:
    text = f.readlines()
and work with the list returned by readlines(), which reads the whole file into memory at once. But this time I ran into a 1.4G JSON file and readlines() ran out of memory. So instead of readlines(), iterate over the file object line by line as shown above.
2. json.loads() expects a JSON string as its argument. Reading the file line by line, each line is passed in as a JSON string and parsed into a dictionary, which can then be analyzed however the task requires, as in the sketch below.
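For instance, with the Yelp check-in file each parsed line is a dictionary and the fields can be pulled out directly. A small sketch; the key names here are the same ones used in the insert statement in step three and are assumed to exist in the file:
# Sketch: each parsed line of the check-in file is a dict; pull out the
# fields that will later be written to MySQL (key names assumed).
line_dict = json.loads(line)
record_type = line_dict["type"]
business_id = line_dict["business_id"]
checkin_info = line_dict["checkin_info"]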
# ==============================
Method 2: pass the whole JSON file object to json.load():
f = file(file_name)
s = json.load(f)
But this runs into a ValueError: Extra data error. Looking it up, the explanation is that the file contains multiple JSON objects. Isn't that stating the obvious? Of course there are multiple JSON objects in one file. The Stack Overflow answer explains it very concretely: http://stackoverflow.com/questions/21058935/python-json-loads-shows-valueerror-extra-data.
>>> json.loads('{}')
{}
>>> json.loads('{}{}')  # == json.loads(json.dumps({}) + json.dumps({}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 368, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 3 - line 1 column 5 (char 2 - 4)
>>> dict1 = {}
>>> dict2 = {}
>>> json.dumps([dict1, dict2])
'[{}, {}]'
>>> json.loads(json.dumps([dict1, dict2]))
[{}, {}]
I didn't end up using method 2, so I didn't dig into it any further.
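If the whole file really is wanted as one Python object, a workaround (not what I did, since a 1.4G file would not fit in memory anyway) is to parse one JSON object per line and collect the results into a list:
# Workaround sketch: one json.loads() per line, collected into a list.
import json
import codecs

records = []
with codecs.open(file_name, encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line.strip("\n")))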
Third, saving to MySQL
I had never tried this part before and kept putting it off, but after finding a blog post to follow, writing the code yourself is easier than you think. The code is given directly below.
import MySQLdb as mdb

# The database yelp_dataset_challenge_academic_daaset needs to be created beforehand.
conn = mdb.connect(host='XXX.XX.XX.XX', user='XXX', passwd='XX', db='yelp_dataset_challenge_academic_daaset')
cur = conn.cursor()  # get a cursor
# conn.set_character_set("utf-8")
cur.execute('SET NAMES utf8;')
cur.execute('SET CHARACTER SET utf8;')
cur.execute('SET character_set_connection=utf8;')
# ================= The table must already exist; clear out old records first.
# This only deletes the rows, not the table itself: DELETE, not DROP.
table_name = "yelp_academic_dataset_checkin"
delete_table = "delete from " + table_name
cur.execute(delete_table)
# The table yelp_academic_dataset_checkin, with its fields and field types,
# needs to be created in the database first; write that CREATE TABLE statement
# yourself, it is not shown here.
insert_sql = "insert into yelp_academic_dataset_checkin (type, business_id, checkin_info) values (%s, %s, %s)"
# ===== The steps that pull the values out of the parsed JSON are omitted here. =====
values_tuple = (str(temp_values[0]), str(temp_values[1]), str(temp_values[2]))
cur.execute(insert_sql, values_tuple)
# When the run is finished, commit and close the connection.
conn.commit()
conn.close()
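As a variation (my own sketch, not in the original code), the DB-API executemany() call can insert many rows in one go, assuming the rows have already been collected into a list of tuples while parsing the JSON file:
# Sketch: batch the inserts with executemany(); "rows" is assumed to be a
# list of (type, business_id, checkin_info) tuples built from the JSON lines.
insert_sql = "insert into yelp_academic_dataset_checkin (type, business_id, checkin_info) values (%s, %s, %s)"
cur.executemany(insert_sql, rows)
conn.commit()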
Fourth, writing date data
The requirements changed, so the code had to change too. Ugh. To break a programmer, just change the requirements on them three times.
One field in the database is of type date rather than varchar(), so the Python string has to end up as a date type in MySQL. There are two ways to handle it. First, convert the Python string to a date type in Python and then write it to the database. Second, pass the Python string through as-is and have MySQL convert it to a date by calling a function inside the INSERT statement. I used the second method, and it nearly made me vomit blood before it worked. MySQL has a function str_to_date() that converts string data to date data, but pay attention to how its arguments are written.
date_string = "2015-07"
name = "shifeng"
values_tuple = (name, date_string)
insert_sql = 'insert into table_name (name_field, date_field) values (%s, str_to_date(%s, "%%Y-%%m"))'
cur.execute(insert_sql, values_tuple)
Storing to SQL is the same as in step three; the main point here is the str_to_date() function. Its first argument is the string and its second argument is the date format, so pay attention to the format, pay attention to the format, pay attention to the format (important things get said three times). The string here is a four-digit year, a dash, and a two-digit month, so the second argument of str_to_date() must be percent sign, Y, dash, percent sign, m, i.e. %Y-%m. If the separator in the string were a comma instead of a dash, the format string would need a comma in the same place. The doubled percent signs (%%) are needed because the query string also goes through Python's parameter substitution, so the percent signs must be escaped. Run directly in MySQL it would simply be:
insert into table_name (name_field, date_field) values ("shifeng", str_to_date("2015-07", "%Y-%m"))
and that works as-is, with single percent signs.
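For completeness, the first method mentioned above (convert the string to a date on the Python side and let the driver pass it through) might look like the sketch below, using the same placeholder table_name / name_field / date_field:
# Sketch of method one: build a datetime.date in Python, then insert it directly.
import datetime

date_obj = datetime.datetime.strptime("2015-07", "%Y-%m").date()  # becomes 2015-07-01
insert_sql = "insert into table_name (name_field, date_field) values (%s, %s)"
cur.execute(insert_sql, ("shifeng", date_obj))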
Fifth, manipulating MySQL from the terminal
First install the MySQL client:
sudo apt-get install mysql-client
Then connect to the database (the -p flag prompts for the password):
mysql -h XX.XX.XX.XX -u user_name -p database_name
After connecting you can run ordinary SQL commands against the database, for example:
show tables;
select * from table_name;
insert into table_name (name_field, date_field) values ("shifeng", str_to_date("2015-07", "%Y-%m"));
delete from table_name;
In addition, it seems possible to go json ---> dataframe ---> sql using the pandas.io.json related functionality. I haven't tried it yet; I'll give it a shot when I get a chance.
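A rough, untested sketch of that json ---> dataframe ---> sql route is below; read_json(lines=True) and to_sql() with a SQLAlchemy engine are the pieces I would expect to use, and the connection string is made up:
# Untested sketch: json ---> DataFrame ---> MySQL with pandas and SQLAlchemy.
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_json("yelp_academic_dataset_checkin.json", lines=True)  # one JSON object per line
engine = create_engine("mysql://user:password@XXX.XX.XX.XX/yelp_dataset_challenge_academic_daaset")  # placeholder credentials
df.to_sql("yelp_academic_dataset_checkin", engine, if_exists="append", index=False)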
References:
1. https://github.com/Yelp/dataset-examples
2. http://www.yelp.com/dataset_challenge/
3. http://stackoverflow.com/questions/21058935/python-json-loads-shows-valueerror-extra-data