Quickly writing and reading large amounts of data with ASP.NET and PostgreSQL
Recently a project of mine needed to insert a large amount of data, so I looked into how to write to and read from PostgreSQL quickly from .NET; this post is my notes on it.
Environment and Test
The .NET driver Npgsql is used to connect to the PostgreSQL database. Test machine: Windows 10 x64, i5-4590, 16 GB DDR3, Samsung 850 EVO SSD.
PostgreSQL 9.6.3; the database and data live on the SSD, with the default configuration and no extensions.
CREATE TABLE public.mesh
(
    x integer NOT NULL,
    y integer NOT NULL,
    z integer,
    CONSTRAINT prim PRIMARY KEY (x, y)
)
1. Import
The data is backed up and imported in CSV format. The file sits on an HDD, is about 480 MB, and holds a bit more than 25 million rows.
Use COPY
copy mesh from 'd:/user.csv' csv
Running time 107 s
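The COPY above assumes the CSV file is readable by the database server process. If it is not, roughly the same path can be driven from the client through Npgsql's COPY ... FROM STDIN support. A minimal sketch, not part of the original test; the connection string is a placeholder and the usual using Npgsql / using System.IO directives are assumed:

using (var conn = new NpgsqlConnection("connection string here"))   // placeholder connection string
{
    conn.Open();
    // stream the CSV lines straight into the table over COPY FROM STDIN
    using (var writer = conn.BeginTextImport("COPY mesh (x, y, z) FROM STDIN (FORMAT CSV)"))
    {
        foreach (var line in File.ReadLines("D:\\user.csv"))
            writer.WriteLine(line);
    }
}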
Use insert
Single connection, C# Release build, Any CPU, run without the debugger attached.
class Program
{
    static void Main(string[] args)
    {
        var list = GetData("D:\\user.csv");
        TimeCalc.LogStartTime();
        using (var sm = new SqlManipulation(@"Strings", SqlType.PostgresQL))
        {
            sm.Init();
            foreach (var n in list)
            {
                sm.ExcuteNonQuery($"insert into mesh(x,y,z) values({n.x},{n.y},{n.z})");
            }
        }
        TimeCalc.ShowTotalDuration();
        Console.ReadKey();
    }

    static List<(int x, int y, int z)> GetData(string filepath)
    {
        List<ValueTuple<int, int, int>> list = new List<(int, int, int)>();
        foreach (var n in File.ReadLines(filepath))
        {
            string[] x = n.Split(',');
            list.Add((Convert.ToInt32(x[0]), Convert.ToInt32(x[1]), Convert.ToInt32(x[2])));
        }
        return list;
    }
}
PostgreSQL's CPU usage stays very low, and after running for a long while the program still had not finished; I ran out of patience, so single-row insertion is not viable.
Multi-row insert
With multi-row insert, roughly 100 rows are inserted per statement.
var bag = GetData("D:\\user.csv");
// when executing, call the StringBuilder's ToString method directly
List<StringBuilder> listbuilder = new List<StringBuilder>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < bag.Count; i++)
{
    if (i % 100 == 0)
    {
        sb = new StringBuilder();
        listbuilder.Add(sb);
        sb.Append("insert into mesh(x,y,z) values");
        sb.Append($"({bag[i].x}, {bag[i].y}, {bag[i].z})");
    }
    else
        sb.Append($",({bag[i].x}, {bag[i].y}, {bag[i].z})");
}
PostgreSQL's CPU usage is around 27% and the disk write rate about 45 MB/s; it now feels like it is actually doing work. The total time is 217.36 s.
Changing to 1000 rows per statement raises CPU usage but lowers the average disk write rate; the total time drops to 160.58 s.
Prepare syntax
The PREPARE syntax lets PostgreSQL plan the SQL statement in advance, which can improve performance.
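For reference, the basic pattern looks like this for a single row; this is only a sketch, with sm being the same SqlManipulation wrapper used earlier, and the full code further below expands the same idea to 1000 parameter slots per EXECUTE:

// PREPARE once with typed parameter slots, then EXECUTE repeatedly with concrete values
sm.ExcuteNonQuery("PREPARE insertplan (int, int, int) AS INSERT INTO mesh(x, y, z) VALUES ($1, $2, $3)");
sm.ExcuteNonQuery("EXECUTE insertplan(1, 2, 3)");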
With single-row prepared inserts, CPU usage stays below 25% and the disk write rate is about 63 MB/s; however, single-row insertion is still not noticeably faster and takes too long to wait for.
With multi-row prepared inserts, CPU usage is about 30% and the disk write rate about 50 MB/s; the final time is 163.02 s. An exception occurs at the end because the last batch has fewer rows than the prepared plan expects; it is harmless.
static void Main(string[] args)
{
    var bag = GetData("D:\\user.csv");
    List<StringBuilder> listbuilder = new List<StringBuilder>();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bag.Count; i++)
    {
        if (i % 1000 == 0)
        {
            sb = new StringBuilder();
            listbuilder.Add(sb);
            //sb.Append("insert into mesh(x,y,z) values");
            sb.Append($"{bag[i].x}, {bag[i].y}, {bag[i].z}");
        }
        else
            sb.Append($",{bag[i].x}, {bag[i].y}, {bag[i].z}");
    }
    StringBuilder sbp = new StringBuilder();
    sbp.Append("PREPARE insertplan (");
    for (int i = 0; i < 1000; i++)
    {
        sbp.Append("int,int,int,");
    }
    sbp.Remove(sbp.Length - 1, 1);
    sbp.Append(") AS INSERT INTO mesh(x, y, z) values");
    for (int i = 0; i < 1000; i++)
    {
        sbp.Append($"(${i * 3 + 1},${i * 3 + 2},${i * 3 + 3}),");
    }
    sbp.Remove(sbp.Length - 1, 1);
    TimeCalc.LogStartTime();

    using (var sm = new SqlManipulation(@"string", SqlType.PostgresQL))
    {
        sm.Init();
        sm.ExcuteNonQuery(sbp.ToString());
        foreach (var n in listbuilder)
        {
            sm.ExcuteNonQuery($"EXECUTE insertplan({n.ToString()})");
        }
    }
    TimeCalc.ShowTotalDuration();

    Console.ReadKey();
}
Use Transaction
Building on the above, transactions are added: each statement still inserts 1000 rows, and each batch of 1000 rows is committed as one transaction. CPU usage is about 30% and the disk write rate about 34 MB/s; it takes 170.16 s.
With the transaction size changed to 100, it takes 167.78 s.
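A minimal sketch of the transaction variant, written against Npgsql directly rather than the SqlManipulation wrapper; the connection string and the batches list (each batch holding the prepared multi-row INSERT statements of one transaction) are placeholders:

using (var conn = new NpgsqlConnection("connection string here"))   // placeholder
{
    conn.Open();
    foreach (var batch in batches)                // each batch = the statements of one transaction
    {
        using (var tx = conn.BeginTransaction())
        {
            foreach (var sql in batch)            // each sql inserts one block of rows
            {
                using (var cmd = new NpgsqlCommand(sql, conn, tx))
                    cmd.ExecuteNonQuery();
            }
            tx.Commit();                          // one commit per transaction
        }
    }
}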
Multithreading
Building on the preceding steps, multiple threads are used. Each thread opens its own connection and handles 100 SQL statements, each statement inserting 1000 rows. Note: raise MaxPoolSize in the connection string; if this parameter is left at its default, a connection timeout error occurs.
CPU usage reaches 80%. Because many PostgreSQL server processes are spawned, the per-process disk figures are hard to read; the cumulative write rate should be around 100 MB/s. The final time is 98.18 s.
TPL's Parallel.ForEach is used; since its return value is not checked, the measured time may not be entirely accurate (likely on the low side).
var lists = new List<List<string>>();
var listt = new List<string>();
for (int i = 0; i < listbuilder.Count; i++)
{
    if (i % 1000 == 0)
    {
        listt = new List<string>();
        lists.Add(listt);
    }
    listt.Add(listbuilder[i].ToString());
}
TimeCalc.LogStartTime();
Parallel.ForEach(lists, (x) =>
{
    using (var sm = new SqlManipulation(@";string;MaxPoolSize=1000;", SqlType.PostgresQL))
    {
        sm.Init();
        foreach (var n in x)
        {
            sm.ExcuteNonQuery(n);
        }
    }
});
TimeCalc.ShowTotalDuration();
| Write method | Time consumed (1000 rows) |
| --- | --- |
| COPY | 107 s |
| Insert | N/A |
| Multi-row insert | 160.58 s |
| Prepare multi-row insert | 163.02 s |
| Multi-row insert in transaction | 170.16 s |
| Multi-connection multi-row insert | 98.18 s |
2. Write updates
The data is updated in real time and the row count keeps growing, so plain INSERT or UPDATE statements alone do not cover this case. PostgreSQL 9.5 and later support the upsert syntax:
insert into mesh(x, y, z) values (...) on conflict (x, y) do update set z = excluded.z
PostgreSQL only gained ON CONFLICT rather late; MySQL has had its equivalent (ON DUPLICATE KEY UPDATE) for a long time...
With the table already holding the 25+ million rows, the same data is written into the database again. Only the multi-row insert variant is tested for the update case here (as sketched below); the other approaches behave similarly.
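Concretely, the only change to the multi-row insert from section 1 is an ON CONFLICT clause appended after the VALUES list. A sketch with made-up rows, reusing the StringBuilder sb from the earlier batching code:

sb.Append("insert into mesh(x, y, z) values");
sb.Append("(1, 2, 10),(1, 3, 11)");                                   // illustrative rows only
sb.Append(" on conflict (x, y) do update set z = excluded.z");        // update z when (x, y) already exists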
The plain multi-row upsert takes 272.15 s.
The multithreaded version takes 362.26 s, with CPU usage reaching 100%. The guess is that with multiple connections, locking on the updated rows causes the performance drop.
3. Read
Select Method
Standard reading still uses a SELECT statement, read directly through ADO.NET.
Using a DataAdapter (Fill) takes 135.39 s; using a DbDataReader takes 71.62 s.
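A minimal sketch of the DbDataReader path (NpgsqlDataReader is the concrete reader here; the connection string is a placeholder):

var result = new List<(int x, int y, int z)>();
using (var conn = new NpgsqlConnection("connection string here"))
{
    conn.Open();
    using (var cmd = new NpgsqlCommand("select x, y, z from mesh", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())                     // stream rows without a DataTable/Fill step
            result.Add((reader.GetInt32(0), reader.GetInt32(1), reader.GetInt32(2)));
    }
}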
Copy Method
PostgreSQL's COPY also supports binary output to STDOUT, and a query can be specified for the export. Reading this way takes 53.20 s.

public List<(int x, int y, int z)> BulkIQueryNpg()
{
    List<(int, int, int)> dict = new List<(int, int, int)>();
    using (var reader = ((NpgsqlConnection)_conn).BeginBinaryExport("COPY (select x, y, z from mesh) to stdout (format binary)"))
    {
        while (reader.StartRow() != -1)
        {
            var x = reader.Read<int>(NpgsqlDbType.Integer);
            var y = reader.Read<int>(NpgsqlDbType.Integer);
            var z = reader.Read<int>(NpgsqlDbType.Integer);
            dict.Add((x, y, z));
        }
    }
    return dict;
}
Conclusion
- To import data into an empty table, or into a table where no duplicates can occur, prefer the COPY statement (see the P.S. for the caveat behind this premise);
- Inserting multiple rows with a single statement greatly improves insert performance; the optimal batch size should be found by testing;
- In this scenario, wrapping the inserts in a transaction or using PREPARE brings no obvious improvement;
- Multiple connections/threads win on speed, but use them with care: resource usage is high, and too many connections may affect other applications;
- Upsert (write with update) is a newer PostgreSQL feature and costs some performance compared with a plain insert;
- For reading, the COPY statement gives better performance;
- The ADO.NET DbDataReader needs no Fill step, so it is also fast (though it cannot catch up with COPY) and is worth preferring as well.
P.S.
Why not MySQL
There is no "best", only the most suitable; to be honest, I like MySQL too. PostgreSQL was chosen here mainly for the following reason:
PostgreSQL's COPY command for import and export directly supports binary mode over STDIN and STDOUT, which is convenient when a program needs to integrate with it directly. MySQL's LOAD DATA INFILE syntax does not support STDIN/STDOUT; export can be done through mysqldump.exe, but there is no equally direct way to import (mysqlimport might work).
Disadvantages compared with MySQL
When PostgreSQL imports with COPY into a table that already contains data, an error such as a primary-key conflict terminates the COPY automatically and may leave the import incomplete. MySQL's LOAD syntax, by contrast, can explicitly specify the action on error (IGNORE/REPLACE) without interrupting the import.
Others
If you want to import data from a program into MySQL, you can have the program write the data to a file first and then import it from that file; this is said to be far more efficient than INSERT.
That is all for this article. I hope it is helpful for your learning.