Analysis of the problems and solutions encountered when importing a large amount of data into mysql
In projects, we often need to import a large amount of data into the database so that it can be analyzed with SQL, and some problems have to be solved along the way. Here I use the practice of importing a 4 GB txt file to present the problems and their solutions: on the one hand as a summary record for myself, and on the other hand as a reference for anyone who runs into the same problems.
The imported data is a txt file from an encyclopedia. The file is 4 GB in size and contains more than 65 million records, one per line, with each record holding three tab-separated fields. I use a TripleData class to hold the three fields (all strings), collect multiple records into a List<TripleData>, and then write the list to the mysql database, storing all the data in batches.
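As a minimal sketch of the reading step (assuming the file is UTF-8 encoded; the field names mirror the getters used in the insert code later, but the exact class layout is my own illustration, not necessarily the one used in the project):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class TripleData {
    private final String subject;
    private final String predicate;
    private final String object;

    public TripleData(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    public String getSubject() { return subject; }
    public String getPredicate() { return predicate; }
    public String getObject() { return object; }

    // Read one tab-separated file (one record per line) into a list
    public static List<TripleData> readFile(String path) throws Exception {
        List<TripleData> list = new ArrayList<TripleData>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length == 3) { // skip malformed lines
                list.add(new TripleData(fields[0], fields[1], fields[2]));
            }
        }
        reader.close();
        return list;
    }
}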
The above is a rough idea. The following are the problems encountered during the import process.
1. Database connection garbled characters and compatibility issues
If the data contains Chinese characters, you must set the encoding parameters in the database url, using the following format.
URL="jdbc:mysql://"+IP+":"+PORT+"/"+DB_NAME+"?useSSL=false&useUnicode=true&characterEncoding=utf-8";
Setting the encoding to UTF-8 solves the garbled character problem, and setting useSSL solves a compatibility problem between JDBC and mysql. If useSSL is not set, an error similar to the following is returned:
Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
This error message appears mainly because the mysql version is relatively new while the JDBC driver version is relatively old, so the two need to be made compatible.
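As a minimal sketch of opening a connection with such a url (the host, port, database name, and credentials below are placeholders for your own configuration, not values from this project):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class MysqlConnectionDemo {
    // Placeholder values; substitute your own host, port, database, and credentials
    private static final String URL = "jdbc:mysql://127.0.0.1:3306/mydb"
            + "?useSSL=false&useUnicode=true&characterEncoding=utf-8";

    public static Connection open(String user, String password) throws SQLException {
        // With older Connector/J versions you may first need Class.forName("com.mysql.jdbc.Driver")
        return DriverManager.getConnection(URL, user, password);
    }
}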
2. utf8mb4 Encoding Problems
During data import, the following error may occur:
SQLException: Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name'
This error occurs because the UTF-8 configured in mysql is 3 bytes per character by default, which is fine for ordinary data; a large data set, however, will inevitably contain emoji or other special characters that take 4 bytes, which this UTF-8 cannot handle, so an error is returned. The solution is the 4-byte UTF-8 encoding, utf8mb4, that mysql introduced in versions after 5.5.3; you need to reset the mysql encoding to it.
You can follow the steps below. The first step is to back up the database to be modified; although utf8mb4 is backward compatible with utf8, it is still better to guard against mistakes and back up the data. The second step is to change the character set of the database to utf8mb4 -- UTF-8 Unicode, with the collation utf8mb4_general_ci; I made this change with navicat, and you can look up how to do the same from the command line (a sketch is given after the restart note below). The third step is to modify the configuration file my.ini in the root directory of the mysql installation and add the following settings.
[client]
default-character-set = utf8mb4
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci
[mysql]
default-character-set = utf8mb4
After the modification, you need to restart mysql for it to take effect.
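If you prefer the command line to navicat for the second step, statements along these lines should work (mydb and tb are placeholders for your own database and table names):

ALTER DATABASE mydb CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci;
ALTER TABLE tb CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;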
Then, you can import the data normally.
3. Time Efficiency of mass import
Because of the large data volume, we split the data: I divided the 65 million records into 500 files, each containing about 110,000 records, put those 110,000 records into an ArrayList<TripleObject>, and imported them in batches. The general idea is to build a single statement of the form "insert into tb (...) values (...), (...), ...;" so that each batch is inserted in one go, which saves a lot of time. The following is an example.
public static void insertSQL(String SQL, List<TripleObject> tripleObjectList) throws SQLException {
    Connection conn = null;
    PreparedStatement psts = null;
    try {
        conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
        conn.setAutoCommit(false); // commit manually
        // Holds the SQL suffix (the list of value tuples)
        StringBuffer suffix = new StringBuffer();
        int count = 0;
        psts = conn.prepareStatement("");
        String s = "";
        String p = "";
        String o = "";
        while (count < tripleObjectList.size()) {
            s = tripleObjectList.get(count).getSubject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("\'", "").replaceAll("\\\\", "");
            p = tripleObjectList.get(count).getPredicate().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("\'", "").replaceAll("\\\\", "");
            o = tripleObjectList.get(count).getObject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("\'", "").replaceAll("\\\\", "");
            suffix.append("('" + s + "', '" + p + "', '" + o + "'),");
            count++;
        }
        // Build the complete SQL statement
        String allsql = SQL + suffix.substring(0, suffix.length() - 1);
        // Add and execute the SQL
        psts.addBatch(allsql);
        psts.executeBatch(); // execute the batch
        conn.commit(); // commit
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (psts != null) {
            psts.close();
        }
        if (conn != null) {
            conn.close();
        }
    }
}
The advantage of this method is that the import takes very little time: it took just one hour to import the 65 million records. The disadvantage is that if the data contains long sentences, the commas, parentheses, and backslashes in them have to be processed first, so you need to weigh whether this method is suitable.
If the data is inserted the normal way, as in "insert into tb (...) values (...); insert into tb (...) values (...); ...", it takes a long time. In my test it took about 12 minutes to insert 110,000 records, which works out to roughly 100 hours for all 65 million records.
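For comparison, here is a minimal sketch of this row-by-row style using a parameterized PreparedStatement instead of string concatenation (tb and the column names are placeholders, and the same Common constants and imports as above are assumed); it avoids having to strip commas, parentheses, and backslashes, at the cost of issuing one insert per row:

public static void insertOneByOne(List<TripleObject> tripleObjectList) throws SQLException {
    Connection conn = null;
    PreparedStatement psts = null;
    try {
        conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
        conn.setAutoCommit(false);
        // Placeholders let the driver escape quotes, commas, parentheses, and backslashes
        psts = conn.prepareStatement("insert into tb (subject, predicate, object) values (?, ?, ?)");
        for (TripleObject t : tripleObjectList) {
            psts.setString(1, t.getSubject());
            psts.setString(2, t.getPredicate());
            psts.setString(3, t.getObject());
            psts.addBatch();
        }
        psts.executeBatch(); // still one insert statement per row unless the driver rewrites the batch
        conn.commit();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (psts != null) {
            psts.close();
        }
        if (conn != null) {
            conn.close();
        }
    }
}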
I used the first method: the data can still be viewed and queried, the requirements on the data here are not that strict, and it saves a lot of time.
The above are the problems I encountered when importing a large volume of data into mysql and the solutions I came up with. If you have better solutions or run into other problems, I hope we can discuss them together.