.NET fast write/read of large data with PostgreSQL



Original article: PostgreSQL quickly writes/reads large amounts of data — http://www.cnblogs.com/podolski/p/7152144.html


Environment and testing
The .NET driver Npgsql is used to connect to the PostgreSQL database. Test machine: Win10 x64, i5-4590, 16 GB DDR3, Samsung 850 EVO SSD.



PostgreSQL 9.6.3; the database and data live on the SSD, default configuration, no extensions.



CREATE TABLE public.mesh
(
    x integer NOT NULL,
    y integer NOT NULL,
    z integer,
    CONSTRAINT prim PRIMARY KEY (x, y)
);


    1. Import
       The data comes from a CSV backup located on a mechanical hard disk, about 480 MB and 25 million+ rows.


Using COPY
COPY mesh FROM 'd:/user.csv' CSV;
Run time: 107 s.
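
From the program this is just a single non-query command; a minimal sketch using the same SqlManipulation wrapper as in the later examples (the server must be able to read the file path, which is the case here since the database runs locally):

// Sketch: issue the server-side COPY from the application.
sm.ExcuteNonQuery("COPY mesh FROM 'd:/user.csv' CSV");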



Using INSERT
Single connection, C# Release build (Any CPU), not run under the debugger.



class Program
{
    static void Main(string[] args)
    {
        var list = GetData(@"D:\user.csv");
        TimeCalc.LogStartTime();
        using (var sm = new SqlManipulation(@"string", SqlType.PostgresQL))
        {
            sm.Init();
            foreach (var n in list)
            {
                sm.ExcuteNonQuery($"INSERT INTO mesh(x, y, z) VALUES ({n.x}, {n.y}, {n.z})");
            }
        }
        TimeCalc.ShowTotalDuration();

        Console.ReadKey();
    }

    static List<(int x, int y, int z)> GetData(string filepath)
    {
        List<ValueTuple<int, int, int>> list = new List<(int, int, int)>();
        foreach (var n in File.ReadLines(filepath))
        {
            string[] x = n.Split(',');
            list.Add((Convert.ToInt32(x[0]), Convert.ToInt32(x[1]), Convert.ToInt32(x[2])));
        }
        return list;
    }
}
PostgreSQL CPU usage stays very low, and the program runs for what feels like forever without finishing; I ran out of patience and stopped it, so single-row INSERT is not an option.



Multi-row INSERT
With multi-row INSERT, each statement inserts about 100 rows.



var bag = GetData(@"D:\user.csv");
// When executing, call ToString() on each StringBuilder to get the SQL text.
List<StringBuilder> listbuilder = new List<StringBuilder>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < bag.Count; i++)
{
    if (i % 100 == 0)
    {
        sb = new StringBuilder();
        listbuilder.Add(sb);
        sb.Append("INSERT INTO mesh(x, y, z) VALUES ");
        sb.Append($"({bag[i].x}, {bag[i].y}, {bag[i].z})");
    }
    else
        sb.Append($", ({bag[i].x}, {bag[i].y}, {bag[i].z})");
}
PostgreSQL CPU usage sits around 27% with disk writes of about 45 MB/s, so it is clearly doing work; the final time is 217.36 s.
Raising the batch to 1000 rows per statement increases CPU usage and lowers the average disk write rate, and brings the time down to 160.58 s.



PREPARE syntax
PREPARE lets PostgreSQL plan the SQL statement ahead of time, which can improve performance.



With single-row insertion, CPU usage stays below 25% and disk writes hover around 63 MB/s; however, efficiency does not improve noticeably, and the run still takes too long to wait for.



With multi-row insertion, CPU utilization is about 30% and disk writes about 50 MB/s, finishing in 163.02 s. The last batch throws an exception because it does not contain the full number of rows expected by the prepared plan; this is harmless for the measurement.


static void Main(string[] args)
{
    var bag = GetData(@"D:\user.csv");
    List<StringBuilder> listbuilder = new List<StringBuilder>();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bag.Count; i++)
    {
        if (i % 1000 == 0)
        {
            sb = new StringBuilder();
            listbuilder.Add(sb);
            //sb.Append("INSERT INTO mesh(x, y, z) VALUES ");
            sb.Append($"{bag[i].x}, {bag[i].y}, {bag[i].z}");
        }
        else
            sb.Append($", {bag[i].x}, {bag[i].y}, {bag[i].z}");
    }
    // Build the PREPARE statement: 1000 rows per plan, 3 int parameters per row.
    StringBuilder sbp = new StringBuilder();
    sbp.Append("PREPARE insertplan (");
    for (int i = 0; i < 1000; i++)
    {
        sbp.Append("int,int,int,");
    }
    sbp.Remove(sbp.Length - 1, 1);
    sbp.Append(") AS INSERT INTO mesh(x, y, z) VALUES ");
    for (int i = 0; i < 1000; i++)
    {
        sbp.Append($"(${i * 3 + 1},${i * 3 + 2},${i * 3 + 3}),");
    }
    sbp.Remove(sbp.Length - 1, 1);
    TimeCalc.LogStartTime();

    using (var sm = new SqlManipulation(@"string", SqlType.PostgresQL))
    {
        sm.Init();
        sm.ExcuteNonQuery(sbp.ToString());
        foreach (var n in listbuilder)
        {
            sm.ExcuteNonQuery($"EXECUTE insertplan({n.ToString()})");
        }
    }
    TimeCalc.ShowTotalDuration();

    Console.ReadKey();
}
Using transactions
Building on the above, the statements are wrapped in transactions: each statement still inserts 1000 rows, and each statement runs as one transaction. CPU is about 30%, disk about 34 MB/s, and the run takes 170.16 s.
Changing to 100 per transaction takes 167.78 s.
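
A minimal sketch of the transaction variant, using Npgsql directly since the SqlManipulation wrapper's source is not shown here; connString is a placeholder and listbuilder holds the multi-row INSERT statements built earlier:

// Sketch only: assumes using Npgsql; and using System.Linq;
const int statementsPerTransaction = 100;   // illustrative grouping
using (var conn = new NpgsqlConnection(connString))
{
    conn.Open();
    for (int i = 0; i < listbuilder.Count; i += statementsPerTransaction)
    {
        using (var tx = conn.BeginTransaction())
        {
            foreach (var stmt in listbuilder.Skip(i).Take(statementsPerTransaction))
            {
                using (var cmd = new NpgsqlCommand(stmt.ToString(), conn, tx))
                    cmd.ExecuteNonQuery();
            }
            tx.Commit();   // the whole group succeeds or fails together
        }
    }
}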



Using multithreading
Still building on the previous approach, multiple threads are used: each thread opens its own connection, each connection processes 100 SQL statements, and each SQL statement inserts 1000 rows. Note that MaxPoolSize in the connection string should be set to a larger value; on my machine, leaving it unset produced connection timeout errors.



CPU usage reaches about 80%. Disk throughput is hard to measure because many PostgreSQL server processes are spawned, but the cumulative figure should be just under 100 MB/s. Final time: 98.18 s.



With TPL, the results returned by Parallel.ForEach are not checked for completion, so the measured time may be slightly too small.



var lists = new List<List<string>>();
var listt = new List<string>();
for (int i = 0; i < listbuilder.Count; i++)
{
    if (i % 100 == 0)
    {
        listt = new List<string>();
        lists.Add(listt);
    }
    listt.Add(listbuilder[i].ToString());
}
TimeCalc.LogStartTime();
Parallel.ForEach(lists, (x) =>
{
    using (var sm = new SqlManipulation(@"string;MaxPoolSize=1000;", SqlType.PostgresQL))
    {
        sm.Init();
        foreach (var n in x)
        {
            sm.ExcuteNonQuery(n);
        }
    }
});
TimeCalc.ShowTotalDuration();
Write method (1000 rows per statement)      Time
COPY                                        107 s
INSERT (single row)                         N/A
Multi-row INSERT                            160.58 s
PREPARE multi-row INSERT                    163.02 s
Transaction multi-row INSERT                170.16 s
Multi-connection multi-row INSERT           98.18 s


    2. Write updates
       The data is updated in real time and the row count keeps growing, so a plain INSERT or UPDATE alone is not enough; instead, the upsert syntax introduced in PostgreSQL 9.5 is used.


INSERT INTO mesh (x, y, z) VALUES ... ON CONFLICT (x, y) DO UPDATE SET z = EXCLUDED.z;
(A gripe: ON CONFLICT only arrived in PostgreSQL 9.5, so late compared with MySQL's long-standing equivalent.)
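
Concretely, the batched multi-row statements from the insert test only need this clause appended before execution (a sketch reusing the listbuilder batches built earlier):

// Sketch: turn each multi-row INSERT built earlier into an upsert.
foreach (var sb in listbuilder)
{
    sb.Append(" ON CONFLICT (x, y) DO UPDATE SET z = EXCLUDED.z");
}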



The test re-inserts the same 25 million+ rows that are already in the table, so every row takes the update path. Only the multi-row insert/update variant was timed; the other methods should behave similarly in relative terms.



Plain multi-row insertion takes 272.15 s.
Multi-threaded insertion takes 362.26 s with CPU usage at 100%; my guess is that with multiple connections, lock contention on the updated rows degrades performance.


    3. Read
       SELECT method
       The standard way to read is an ordinary SELECT query, executed directly through ADO.NET.


Using a DataAdapter (Fill) takes 135.39 s; using a DbDataReader takes 71.62 s.
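
For reference, the two read paths look roughly like this (a sketch with an illustrative connString; assumes the usual using directives for System.Data and Npgsql):

// 1) DataAdapter / Fill: materializes everything into a DataTable first.
var table = new DataTable();
using (var conn = new NpgsqlConnection(connString))
using (var adapter = new NpgsqlDataAdapter("SELECT x, y, z FROM mesh", conn))
{
    adapter.Fill(table);
}

// 2) DbDataReader: streams rows and avoids the Fill step.
var rows = new List<(int x, int y, int z)>();
using (var conn = new NpgsqlConnection(connString))
{
    conn.Open();
    using (var cmd = new NpgsqlCommand("SELECT x, y, z FROM mesh", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            rows.Add((reader.GetInt32(0), reader.GetInt32(1), reader.GetInt32(2)));
    }
}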



COPY method
PostgreSQL's COPY supports binary output to STDOUT and lets you specify a query as the source; reading this way takes 53.20 s.



public List<(int x, int y, int z)> BulkIQueryNpg()
{
    List<(int, int, int)> dict = new List<(int, int, int)>();
    using (var reader = ((NpgsqlConnection)_conn).BeginBinaryExport("COPY (SELECT x, y, z FROM mesh) TO STDOUT (FORMAT BINARY)"))
    {
        while (reader.StartRow() != -1)
        {
            var x = reader.Read<int>(NpgsqlDbType.Integer);
            var y = reader.Read<int>(NpgsqlDbType.Integer);
            var z = reader.Read<int>(NpgsqlDbType.Integer);
            dict.Add((x, y, z));
        }
    }
    return dict;
}
Conclusion
Summarizing the test results, the following conclusions can be drawn for larger data volumes:



When importing into an empty table, or one with no conflicting data, prefer the COPY statement (the reason for this precondition is explained in the P.S.);
Inserting many rows with a single statement greatly improves insert performance, and the optimal number of rows per statement can be determined experimentally;
Using transactions or PREPARE brings no obvious improvement in this scenario;
Multi-connection/multi-threaded operation has a speed advantage, but it is harder to control: resource usage is high, and too many connections can easily affect other applications;
Upsert (INSERT ... ON CONFLICT) is a newer PostgreSQL feature and carries some performance cost compared with plain insertion;
When reading data, the COPY statement achieves the best performance;
The ADO.NET DbDataReader is also a good choice because it does not require a Fill step and reads faster (though it still cannot match COPY).
P.S.
Why not MySQL?
There is no best tool, only the most suitable one; for what it's worth, I also like using MySQL. The main reasons for using PostgreSQL here are:



PostgreSQL's import/export SQL command COPY directly supports binary mode to STDIN and STDOUT, which is convenient when a program wants to integrate imports and exports directly. By comparison, MySQL's SQL syntax (LOAD DATA INFILE) does not support STDIN or STDOUT; export can be done with mysqldump.exe, but there is no particularly good way to import from a program (mysqlimport, perhaps).
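
For illustration, a binary COPY import through Npgsql looks roughly like this (a sketch, the counterpart of the export method shown earlier; conn is an open NpgsqlConnection and bag holds the parsed CSV rows):

// Sketch of COPY ... FROM STDIN (FORMAT BINARY) via NpgsqlBinaryImporter.
using (var writer = conn.BeginBinaryImport("COPY mesh (x, y, z) FROM STDIN (FORMAT BINARY)"))
{
    foreach (var n in bag)
    {
        writer.StartRow();
        writer.Write(n.x, NpgsqlDbType.Integer);
        writer.Write(n.y, NpgsqlDbType.Integer);
        writer.Write(n.z, NpgsqlDbType.Integer);
    }
    writer.Complete();   // Npgsql 4.0+; on older 3.x versions the import commits on Dispose
}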



Disadvantages compared with MySQL
When PostgreSQL imports with COPY and the target table already contains data, COPY aborts as soon as it hits an error on a table with a PRIMARY KEY constraint, which can leave the import incomplete; in other words, the import process does not support updating existing rows. MySQL's LOAD syntax can explicitly specify what to do on error (IGNORE/REPLACE) without interrupting the import.



Other
If you need to import data into MySQL from a program, consider having the program write a file first and then importing from that file; this is said to be much more efficient than INSERT.


