How to write millions of data records to the database

Source: Internet
Author: User
Tags: php, cli, array
There are 1 million records in a text file, one record per row. I need to write each row that meets certain conditions into the database.

My previous practice was to read the file, store the data in an array, and then foreach over the array to process the records one by one (writing each record to the database according to the conditions).

However, facing millions of records, carrying on like this seems like asking for trouble, and this is honestly my first time handling data at this scale, so I have no experience at all.

Searching online, the suggestion is to solve it with PHP processes/threads, but I know next to nothing about processes and threads. Please share how you handle large data sets: how is this actually implemented with processes/threads?

Reply content:


Millions... Not big...

The bottleneck in this kind of data processing is basically I/O. You can read and write the data directly in a single thread (note that if you are inserting into a database, indexes added casually can themselves become the bottleneck).

But why store it all in an array? Are you planning to load the whole file into memory? Read one record, check it once: if it passes, save it to the database; if not, throw it away. It's that simple. What do you need an array for...
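For example, a minimal sketch of that read-one-row-at-a-time approach in PHP, assuming a tab-separated /data/data.txt, a hypothetical records(id, add_time) table, and a passes_condition() check you would write yourself:

    <?php
    // Sketch: stream the file line by line instead of loading it all into an array.
    // The file path, table/column names and passes_condition() are assumptions.
    $pdo  = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8', 'user', 'pwd');
    $stmt = $pdo->prepare('INSERT INTO records (id, add_time) VALUES (?, ?)');

    function passes_condition(array $fields) {
        return count($fields) >= 2 && $fields[0] !== '';   // replace with the real rule
    }

    $fh = fopen('/data/data.txt', 'r');
    while (($line = fgets($fh)) !== false) {
        $fields = explode("\t", rtrim($line, "\r\n"));
        if (passes_condition($fields)) {
            $stmt->execute([$fields[0], $fields[1]]);      // keep the rows that pass
        }                                                  // everything else is discarded
    }
    fclose($fh);

Memory use stays flat no matter how large the file is, because only one line is held at a time.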

Are you sure you want to use PHP for data processing...

Assuming 10 million rows of data, that PHP is what you are most familiar with so development is fastest, and that the data is going into MySQL:

  1. Use the shell to split the 10-million-line file into 100 files of 100,000 lines each. See man split for the details.

  2. Write a PHP script that reads one file and outputs the valid rows (a sketch appears after this list). The output format must strictly follow the order of the fields in the table, with fields separated by semicolons and rows separated by \n; for the format options, see the parameters of MySQL's LOAD DATA command. Run it in PHP CLI mode, not under Apache or another web server. If you do not know how to read line by line, you can simply use PHP's file() function, build each output line, and write it to the file with error_log($sql, 3, "/path/to/dataa"). You can also echo some debugging information for later checks.

  3. Write a shell script that calls PHP to process the files. It can look something like this:

    /path/to/php/bin/php -f genMySQLdata.php source=loga out=dataa > /errora.log &
    /path/to/php/bin/php -f genMySQLdata.php source=logb out=datab > /errorb.log &
    /path/to/php/bin/php -f genMySQLdata.php source=logc out=datac > /errorc.log &
    .... (repeat for all one hundred files)

If the machine configuration is low, you can run the jobs in batches, say 10 at a time. The content of this script is very regular, so it can itself be generated with PHP, saving more time. Executing this shell script starts multiple PHP processes that generate data in parallel; if the configuration is good enough, that means 100 PHP processes working at once, which is faster.

  4. Continue with another shell script whose job is to open MySQL and run LOAD DATA to import the generated files:

    mysql -h127.0.0.1 -uUser -ppwd -P3306 -D dbname -e 'LOAD DATA INFILE "/path/to/dataa" INTO TABLE TableName (Field1, Field2, Field3);'

The field list (Field1, Field2, Field3) must correspond to the order in which the data was generated. This command can be executed directly, or you can repeat it N times in the shell script and then run that script.
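A rough sketch of the genMySQLdata.php script described in step 2, run in CLI mode; the source=/out= argument handling, the semicolon delimiter and the is_valid() check are assumptions for illustration:

    <?php
    // genMySQLdata.php (sketch) -- run as: php -f genMySQLdata.php source=loga out=dataa
    // Turn the "key=value" CLI arguments into an array: ['source' => 'loga', 'out' => 'dataa'].
    parse_str(implode('&', array_slice($argv, 1)), $args);

    function is_valid(array $fields) {
        return count($fields) >= 3;                 // replace with the real validity rule
    }

    $out = fopen($args['out'], 'w');
    foreach (file($args['source'], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $fields = explode("\t", $line);
        if (!is_valid($fields)) {
            echo "skipped: $line\n";                // debugging output for later checks
            continue;
        }
        // One output row per line, fields in table order, separated by semicolons.
        fwrite($out, implode(';', $fields) . "\n");
    }
    fclose($out);

If you use a delimiter other than the tab that LOAD DATA expects by default, remember to add FIELDS TERMINATED BY ';' to the LOAD DATA command in step 4.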

PS: Pay attention to the character encoding.

If this is a one-off import of 1 million rows into MySQL, you can use MySQL's LOAD DATA. I once used LOAD DATA to import the roughly 6.5 million leaked CSDN account/password records; it took a little over two minutes...
If the program will be reused multiple times and you want to build a script, read, say, 100,000 records at a time and wrap the foreach insert loop in an explicit transaction (ps: looped inserts should always open a transaction explicitly, it performs far better), then commit once after the whole batch has been written. For a 100,000-row batch this raises the speed by hundreds or even thousands of times and reduces disk I/O. Remember to unset variables once you are done with them; after that, inserting the data is a small matter.

You can also splice rows together with INSERT INTO ... VALUES (...), (...), ..., which gives the best performance of all. Note that the length of a single SQL statement is limited, so insert something like 1,000 rows per statement. A plain insert inside foreach without an explicit transaction is the worst option: the slowest, and the heaviest on I/O, because every insert triggers an expensive fsync() system call, so looping 1 million times means 1 million fsync calls. With an explicit transaction and a commit every 100,000 rows, fsync is only called at each commit, i.e. about 10 times for 1 million rows.
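A minimal sketch of that batched-transaction approach with PDO; the file path, the records(id, add_time) table and the 100,000-row batch size are assumptions:

    <?php
    // Sketch: explicit transaction around looped inserts, committing every 100,000 rows.
    $pdo  = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8', 'user', 'pwd');
    $stmt = $pdo->prepare('INSERT INTO records (id, add_time) VALUES (?, ?)');

    $batchSize = 100000;
    $count = 0;
    $fh = fopen('/data/data.txt', 'r');

    $pdo->beginTransaction();
    while (($line = fgets($fh)) !== false) {
        $fields = explode("\t", rtrim($line, "\r\n"));
        $stmt->execute([$fields[0], $fields[1]]);
        if (++$count % $batchSize === 0) {
            $pdo->commit();                // one fsync per commit instead of one per insert
            $pdo->beginTransaction();
        }
        unset($fields);                    // free the variable as soon as it is used up
    }
    $pdo->commit();                        // commit the final partial batch
    fclose($fh);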

In this scenario the bottleneck is no longer PHP; inserting 1 million rows does not take long.
You do not need to read the whole file into memory at once: read one row at a time and build up SQL from the rows that meet the conditions. Do not rush to insert into MySQL row by row;
insert a few hundred rows per statement instead, which is much faster. Just make sure the final SQL statement does not exceed max_allowed_packet.
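A sketch of that approach, accumulating a few hundred rows and inserting them in one statement; the 500-row batch size, the table name and the condition check are assumptions:

    <?php
    // Sketch: collect rows that meet the conditions and insert several hundred per statement.
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8', 'user', 'pwd');

    $values = [];
    $flush  = function () use ($pdo, &$values) {
        if (!$values) return;
        // Keep each statement comfortably below max_allowed_packet.
        $pdo->exec('INSERT INTO records (id, add_time) VALUES ' . implode(',', $values));
        $values = [];
    };

    $fh = fopen('/data/data.txt', 'r');
    while (($line = fgets($fh)) !== false) {
        $f = explode("\t", rtrim($line, "\r\n"));
        if (count($f) < 2) continue;                 // the "meets the conditions" check (assumed)
        $values[] = sprintf('(%s, %s)', $pdo->quote($f[0]), $pdo->quote($f[1]));
        if (count($values) >= 500) $flush();         // a few hundred rows per INSERT
    }
    $flush();                                        // insert whatever is left over
    fclose($fh);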

Use Python: read the text file with Python and generate SQL statements, then import them with MySQL's source command. Done!

    # Read the text file line by line and write one INSERT statement per row.
    f = open("/data/data.txt")
    t = open("/data/sql.txt", 'w+')
    line = f.readline()
    while line:
        a = line.rstrip("\n").split("\t")
        t.write("INSERT INTO `table`(`id`, `add_time`) VALUES ('%s', '%s');\n" % (a[0], a[2]))
        line = f.readline()
    f.close()
    t.close()

In addition, remember to drop all the indexes first, then LOAD DATA INFILE, and add the indexes back afterwards. That is much faster.

In fact, why not try writing a script or thread to handle the import? Before importing, I suggest opening a transaction, and I generally prefer to open the transaction in the program itself.

Generally, databases provide a bulk insert feature ~

LOAD DATA INFILE: http://dev.mysql.com/doc/refman/5.7/en/load-data.html

Rather than filtering the data while reading the file and then inserting, it is much easier to insert the whole file into the database first and do the filtering inside the database.

Isn't the database the same?
