How do I solve big data storage problems with a MySQL database?


Question: How do I design or optimize a large table with tens of millions of rows? No other information was given, so the topic is rather broad. To put it briefly: for a storage design you must consider the business characteristics, and the information to collect is as follows:
1. Data capacity: roughly how many rows will accumulate over 1-3 years, and roughly how many bytes per row;
2. Data items: whether there are many fields, and which fields are updated frequently;
3. Query SQL conditions: which columns often appear in WHERE, GROUP BY, and ORDER BY clauses;
4. Update SQL conditions: how many columns often appear in the WHERE clause of UPDATE or DELETE statements;
5. Ratio of statement types, e.g. SELECT : UPDATE+DELETE : INSERT = ?
6. What is the total daily volume of SQL executed against the large table and its related tables?
7. Is the data in the table mostly updated or mostly queried?
8. What physical database server do you plan to use, and what is the database server architecture?
9. What about concurrency?
10. Will the storage engine be InnoDB or MyISAM?

Once you understand the above 10 questions, how to design such a large table should become clear.

As far as optimization goes, assuming the table has already been created and the table structure cannot be changed, I recommend the InnoDB engine and using more memory to reduce the disk I/O load, because I/O is usually the bottleneck of a database server.

Besides relying on the index structure to solve performance problems, it is advisable to prioritize rewriting the relevant SQL statements so that they run faster, rather than depending only on the way the indexes are organized; the premise, of course, is that the indexes are already well designed. If the workload is mostly reads, you can consider enabling the query cache (query_cache) and tuning parameters such as sort_buffer_size, read_buffer_size, read_rnd_buffer_size, and join_buffer_size.
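For instance, a minimal sketch of adjusting these per-session buffers; the specific values are illustrative assumptions, not recommendations:

-- Illustrative values only; tune to your workload and available memory.
SET SESSION sort_buffer_size     = 4194304;   -- 4 MB per in-memory sort
SET SESSION read_buffer_size     = 1048576;   -- buffer for sequential scans
SET SESSION read_rnd_buffer_size = 2097152;   -- buffer for random reads after sorts
SET SESSION join_buffer_size     = 2097152;   -- used for joins without usable indexes

The same names can also go into the server option file so they apply globally.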

For more information, see:
MySQL database server-side core parameters detailed and recommended configuration
MYSQLOPS.COM/2011/10/26
Hello, the main requirement is to retrieve analog values for a certain period of time (SELECT * FROM table WHERE datatime BETWEEN t1 AND t2); at the moment I intend to solve it with sub-tables and partitioning.
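As a rough illustration of that plan, here is a minimal sketch of RANGE partitioning on the time column; apart from datatime, the table and column names are hypothetical, and the monthly boundaries are just examples:

CREATE TABLE sensor_values (
    id           BIGINT NOT NULL AUTO_INCREMENT,
    datatime     DATETIME NOT NULL,
    analog_value DOUBLE,
    PRIMARY KEY (id, datatime)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(datatime)) (
    PARTITION p201701 VALUES LESS THAN (TO_DAYS('2017-02-01')),
    PARTITION p201702 VALUES LESS THAN (TO_DAYS('2017-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Range queries like the one above can then be pruned to the relevant partitions
-- (EXPLAIN PARTITIONS on older servers; plain EXPLAIN shows partitions on newer ones):
EXPLAIN SELECT * FROM sensor_values
WHERE datatime BETWEEN '2017-01-05' AND '2017-01-20';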

No problem, let me share my ideas and my solution; I have been working on exactly this recently.
My current company has three tables of about 500 million rows each, and the daily increment per table is about 1 million rows.
Each table has about 10 columns.
Here are the tests and comparisons I made:
1. First, the engine. With a large data volume and no partitioning, MyISAM was about 13% faster than InnoDB for read-only workloads.
2. After partitioning, read the official MySQL documentation: partitioning is really optimized for MyISAM; for InnoDB, all the data lives in ibdata, so even though you can see the schema change, there is no essential difference.
With the partitions on the same physical disk, the improvement was about 1%.
With the partitions on different physical disks (I split them across three disks), the improvement was about 3%. So-called throughput is actually determined by many factors; for example, EXPLAIN PARTITIONS shows which partition a record falls in, and if every partition is hit, partitioning does not solve the read problem at all, it only improves write efficiency.
Another issue is how to partition. If a table has three columns that are all frequently used as query conditions, that is actually a sad situation, because there is no way to partition in a manner that targets all the SQL. If, like the official MySQL documentation, you only partition by time, and you only ever query by time, then congratulations.
3. Is the table mainly used for reading or for writing? Actually that question alone is not enough; you should ask: how much concurrent querying happens while you are writing? My problem is relatively simple: MongoDB's sharding could not support it, and after it crashed we came back to MySQL, so under normal circumstances, between 9am and 9pm, there is a lot of writing. During that time I build a view based on recently inserted or frequently queried data and separate the reads through it: writes go to the table, while reads are served by the view after a logical check before the operation (a small sketch follows this list).
4. Build some archive tables: for example, first run the heavy statistical analysis on these large tables, and then answer later questions from the existing analysis plus the increment.
5. If you use MyISAM, there is one more thing to note: if you configured a maximum index length parameter, then once the number of records exceeds the configured limit, the index will be disabled.
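Regarding the view idea in point 3, here is a minimal sketch under assumed names (an events table with a created_at column); it only illustrates routing hot reads through a view over recent rows, not the author's exact setup:

-- Hypothetical base table that receives all writes; hot reads go through the view.
CREATE OR REPLACE VIEW events_recent AS
SELECT *
FROM   events
WHERE  created_at >= NOW() - INTERVAL 7 DAY;

-- Writes still target the base table:
--   INSERT INTO events (user_id, payload, created_at) VALUES (...);
-- Recent reads use the view:
--   SELECT * FROM events_recent WHERE user_id = 42 ORDER BY created_at DESC LIMIT 20;

The application decides, before running a query, whether it can be answered from the "recent" view or must hit the full table.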

Based on your needs, there are two approaches: one is sub-tables (splitting into multiple tables), the other is partitioning.
First, sub-tables: as you said, you can split the table by month, by user ID, and so on; which way to split depends on your business logic. The downside of sub-tables is that queries sometimes need to span multiple tables (see the sketch below).
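As a rough illustration (table names are hypothetical), splitting by month or by user ID might look like this, with cross-table queries handled by UNION ALL:

-- Splitting by month: one table per month with an identical structure.
CREATE TABLE orders_201701 LIKE orders_template;
CREATE TABLE orders_201702 LIKE orders_template;

-- Splitting by user ID: the application picks the table, e.g. orders_0 .. orders_9
-- based on user_id % 10, so all rows for one user live in one table.

-- A query that spans months then needs a UNION ALL over the involved tables:
SELECT * FROM orders_201701 WHERE user_id = 42
UNION ALL
SELECT * FROM orders_201702 WHERE user_id = 42;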

Then partitioning: partitioning can spread a table across several different tablespaces, using divide and conquer to support near-unlimited growth of large tables and to give them manageability at the physical level. Splitting a large table into smaller partitions can improve maintenance, backup, recovery, transactions, and query performance. The benefits of partitioning are (a small maintenance sketch follows the list):

1. Enhanced availability: if one partition of the table cannot be used because of a system failure, the remaining healthy partitions can still be used;

2. Reduced downtime: if a system failure affects only some partitions, only those partitions need to be repaired, which takes less time than repairing the whole large table;

3. Easier maintenance: if you need to rebuild a table, managing each partition independently is much easier than managing a single large table;

4. Balanced I/O: different partitions of the table can be placed on different disks to balance I/O and improve performance;

5. Improved performance: queries, inserts, and updates against the large table can be decomposed to run in parallel on different partitions, which makes the operations faster;

6. Partitions are transparent to the user; the end user does not notice that partitions exist.
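As a small maintenance sketch, reusing the hypothetical sensor_values table from the earlier partitioning example, old periods can be dropped and new ones added per partition; this is an illustration, not a prescribed procedure:

-- Dropping an old partition removes that month's data almost instantly,
-- without a long DELETE and without touching the other partitions:
ALTER TABLE sensor_values DROP PARTITION p201701;

-- New periods are added by splitting the catch-all partition:
ALTER TABLE sensor_values REORGANIZE PARTITION pmax INTO (
    PARTITION p201703 VALUES LESS THAN (TO_DAYS('2017-04-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);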

Possible causes of very slow insertion times:

1. Database connections: connections are established and torn down too many times, resulting in frequent I/O and slower access.

2. Use bulk INSERT and batch UPDATE statements instead of inserting and modifying one row at a time, which makes database access extremely slow (see the sketch after this list).

3. When designing the schema, create appropriate indexes, such as primary keys, foreign keys, and unique constraints, to optimize query efficiency.
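For point 2, a minimal sketch of a multi-row INSERT against a hypothetical table t:

-- Instead of one statement per row:
--   INSERT INTO t (a, b) VALUES (1, 'x');
--   INSERT INTO t (a, b) VALUES (2, 'y');
-- send one multi-row statement (and group several such statements per transaction):
INSERT INTO t (a, b) VALUES
    (1, 'x'),
    (2, 'y'),
    (3, 'z');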

First, when the data volume is large, try to avoid full table scans: consider building indexes on the columns involved in WHERE and ORDER BY, since indexes can greatly speed up data retrieval. However, there are cases where an index is not used effectively:

1. Try to avoid using the != or <> operators in the WHERE clause, otherwise the engine will abandon the index and perform a full table scan.

2. Try to avoid NULL checks on a column in the WHERE clause, otherwise the engine will abandon the index and perform a full table scan, for example:
SELECT id FROM t WHERE num IS NULL
You can set a default value of 0 on num, make sure the num column contains no NULL values, and query like this instead:
SELECT id FROM t WHERE num = 0

3. Try to avoid using OR to join conditions in the WHERE clause, otherwise the engine will abandon the index and perform a full table scan, for example:
SELECT id FROM t WHERE num = 10 OR num = 20
You can query like this instead:
SELECT id FROM t WHERE num = 10
UNION ALL
SELECT id FROM t WHERE num = 20

4. The following query also causes a full table scan:

SELECT id FROM t WHERE name LIKE '%abc%'

To be more efficient, consider a full-text index.

5. IN and NOT IN should also be used with caution, otherwise they can cause a full table scan, for example:
SELECT id FROM t WHERE num IN (1, 2, 3)
For consecutive values, use BETWEEN instead of IN:
SELECT id FROM t WHERE num BETWEEN 1 AND 3

6. Using a parameter in the WHERE clause can also cause a full table scan. Because SQL resolves local variables only at run time, the optimizer cannot defer the choice of access plan to run time; it must choose at compile time. But if the access plan is built at compile time, the value of the variable is still unknown and cannot be used as an input for index selection. The following statement performs a full table scan:
SELECT id FROM t WHERE num = @num
You can force the query to use the index instead:
SELECT id FROM t WITH (INDEX(index_name)) WHERE num = @num

7. Try to avoid expressions on a column in the WHERE clause, which cause the engine to abandon the index and perform a full table scan. For example:
SELECT id FROM t WHERE num / 2 = 100
should be written as:
SELECT id FROM t WHERE num = 100 * 2

8. Try to avoid applying functions to a column in the WHERE clause, which causes the engine to abandon the index and perform a full table scan. For example:
SELECT id FROM t WHERE SUBSTRING(name, 1, 3) = 'abc'                 -- ids whose name starts with 'abc'
SELECT id FROM t WHERE DATEDIFF(day, createdate, '2005-11-30') = 0   -- ids created on '2005-11-30'
should be written as:
SELECT id FROM t WHERE name LIKE 'abc%'
SELECT id FROM t WHERE createdate >= '2005-11-30' AND createdate < '2005-12-01'

9. Do not apply functions, arithmetic, or other expressions to the left side of the "=" in the WHERE clause, otherwise the system may not be able to use the index correctly.

10. When using an indexed column as a condition, if the index is a composite index, the first column of the index must appear in the condition for the system to use the index; otherwise the index will not be used. The column order in the condition should also match the index order as much as possible (see the sketch below).
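A minimal sketch of the leftmost-prefix rule, assuming a hypothetical table t and a composite index on (col1, col2):

CREATE INDEX idx_col1_col2 ON t (col1, col2);

-- Can use the index (the leftmost column is constrained):
SELECT id FROM t WHERE col1 = 10 AND col2 = 20;
SELECT id FROM t WHERE col1 = 10;

-- Cannot seek through the index (the leftmost column is missing):
SELECT id FROM t WHERE col2 = 20;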

11. Do not write meaningless queries, for example, generating an empty table structure:
SELECT col1, col2 INTO #t FROM t WHERE 1 = 0
This kind of code returns no result set but still consumes system resources; it should be changed to:
CREATE TABLE #t (...)

12. In many cases, EXISTS is a better choice than IN:
SELECT num FROM a WHERE num IN (SELECT num FROM b)
can be replaced with:
SELECT num FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.num = a.num)

What to look for when building an index:

1. Not all indexes are effective for queries. SQL optimizes queries based on the data in the table; when an indexed column contains a large amount of duplicated data, the query may not use the index. For example, if a table has a sex column that is roughly half 'male' and half 'female', building an index on sex does nothing for query efficiency (see the selectivity check below).
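A rough way to check this kind of selectivity before indexing, assuming the sex column lives in a hypothetical table t:

-- Values close to 1 benefit most from an index; values near 0 (like sex) barely benefit:
SELECT COUNT(DISTINCT sex) / COUNT(*) AS selectivity FROM t;

-- Cardinality as seen by the optimizer for existing indexes:
SHOW INDEX FROM t;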

2. More indexes are not always better. Indexes improve the efficiency of the corresponding SELECTs, but they reduce the efficiency of INSERT and UPDATE, because the indexes may have to be rebuilt on INSERT or UPDATE. How to build indexes therefore needs careful consideration, depending on the situation. A table should preferably not have more than 6 indexes; if there are more, consider whether the rarely used ones are really necessary.

3. Avoid updating clustered index columns as much as possible, because the order of the clustered index columns is the physical storage order of the table records; once such a value changes, the order of the entire table's records has to be adjusted, which consumes considerable resources. If the application needs to update clustered index columns frequently, consider whether the index should be clustered at all.

Other things to be aware of:

1. Use numeric columns where possible. Columns that contain only numeric values should not be designed as character types; that reduces query and join performance and increases storage overhead, because the engine compares each character of a string one at a time when processing queries and joins, whereas a numeric type needs only a single comparison.

2. Do not use SELECT * FROM t anywhere; replace "*" with a specific column list and do not return columns you do not need.

3. Try to use table variables instead of temporary tables. If a table variable contains a large amount of data, be aware that its indexes are very limited (only the primary key index).

4. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.

5. Temporary tables are not unusable; using them appropriately can make certain routines more efficient, for example when you need to repeatedly reference a data set from a large table or a frequently used table. For one-off operations, however, an export table is better.

6. When creating a temporary table, if a large amount of data is inserted at once, use SELECT INTO instead of CREATE TABLE to avoid generating a large amount of log and to increase speed; if the amount of data is small, use CREATE TABLE followed by INSERT to reduce pressure on the system tables.

7. If temporary tables are used, be sure to explicitly delete all of them at the end of the stored procedure: TRUNCATE TABLE first, then DROP TABLE, which avoids holding locks on the system tables for a long time.

8. Avoid cursors as much as possible, because cursors are inefficient; if a cursor operates on more than 10,000 rows, you should consider rewriting it.

9. Before using a cursor-based or temporary-table-based method, first look for a set-based solution to the problem; set-based approaches are usually more efficient.

10. Like temporary tables, cursors are not unusable. Using FAST_FORWARD cursors on small data sets is often preferable to other row-by-row processing methods, especially when several tables must be referenced to obtain the required data. Routines that include "totals" in the result set are usually faster than the cursor equivalent. If development time permits, try both the cursor-based and the set-based approach and see which works better.

11. SET NOCOUNT ON at the beginning of all stored procedures and triggers and SET NOCOUNT OFF at the end, so that the server does not need to send a DONE_IN_PROC message to the client after each statement of a stored procedure or trigger.

12. Try to avoid returning large amounts of data to the client; if the data volume is very large, consider whether the corresponding requirement is reasonable.

13. Try to avoid large transactions, to improve the system's concurrency.

Now, with the development of the Internet, data volumes grow exponentially, from GB to TB to PB, and operating on the data becomes harder and harder; traditional relational databases cannot meet the need for fast queries and inserts. The advent of NoSQL temporarily resolved that crisis: it gains performance by reducing data safety, reducing transaction support, and reducing support for complex queries. However, in some cases those NoSQL trade-offs are not acceptable, for example in scenarios with absolute requirements on transactions and security. NoSQL is not sufficient there, so a relational database is still needed.

Although relational databases are inferior to NoSQL databases at massive scale, their performance will meet your needs if you operate them correctly. For different operations on the data, the optimization directions differ; data migration, queries, and inserts can each be considered separately. When optimizing, you also need to consider whether other related operations are affected. For example, you can improve query performance by creating an index, but that hurts inserts, because insert performance degrades when the index must be updated; you have to decide whether that reduction is acceptable. So database optimization means weighing several directions and looking for the best compromise.

One: Query optimization

1: Create an index.

The simplest and most common optimization target is the query. Since reads account for the majority of CRUD operations, read performance basically determines the perceived performance of the application. The most common tool for query performance is the index. In my test with 20 million records of about 200 bytes each (two varchar columns), querying a record without an index took about a minute, while with an index the query time was negligible. However, adding an index to existing data takes a very long time: after inserting the 20 million records, creating the index took roughly ten minutes.

Drawbacks of indexes and when to use them. Although creating an index greatly speeds up queries, the drawbacks are obvious. One is that maintaining the index takes time during inserts, which reduces insert performance to some extent; the other, obviously, is that the data files get larger. When you create an index on a column, the length of each index entry matches the length you declared when creating the column. For example, for a varchar(100) column the index entry length is 102 bytes, because a declared length of more than 64 bytes adds an extra 2 bytes to each index entry.

From the ycsb_key column (length 100) you can see that I created an index named index_ycsb_key, each entry with a length of 102; imagine how large the index gets when the data becomes huge. You can also see that the index length differs from the column's actual data length: varchar is a variable-length character type (see analyses of MySQL data types), so the actual stored length is the actual character size, but the index uses the declared length. If you declared 100 bytes when creating the column, the index entry length is that plus 2, regardless of how large the actual stored value is.
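A minimal sketch of adding such an index after the data is loaded; the table name t is hypothetical, while the column and index names follow the paragraph above:

ALTER TABLE t ADD INDEX index_ycsb_key (ycsb_key);

-- Check that the index is actually used:
EXPLAIN SELECT * FROM t WHERE ycsb_key = 'user12345';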

Besides the time it takes to create an index and the growing size of the index file, creating an index also requires looking at the characteristics of the stored data. When a large portion of the stored data consists of duplicate values, creating an index does more harm than good (please review an introduction to MySQL indexes first). So when a lot of data is duplicated, the index's benefit for queries can basically be ignored, yet you still pay the index maintenance cost on every insert.

2: Cache configuration.

MySQL has several kinds of caches: some cache query statements, others cache query data. Clients cannot manipulate these caches directly; they are maintained by the server and updated along with your queries and modifications. The available cache settings can be seen in the configuration file.

Here we mainly look at the query cache, which caches query result data. To use it, query_cache_size must be set to a non-zero value. When the size is non-zero, the server caches the result set returned by each query; the next time the same query arrives, the server fetches the data directly from the cache instead of executing the query. The amount of data that can be cached depends on the size you set, so if you set it large enough the data can be fully cached in memory and responses will be very fast.

However, the query cache also has drawbacks. Whenever you perform any update operation (UPDATE/INSERT/DELETE) on a table, the server flushes the cached entries for that table to keep the cache consistent with the database, invalidating the cached data. So for a table that is updated very frequently, the query cache cannot improve performance and also affects the performance of other operations.
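For reference, a hedged sketch of inspecting and sizing the query cache; this only applies to MySQL versions that still ship the query cache (it was removed in 8.0), and the size shown is an illustrative value:

SHOW VARIABLES LIKE 'query_cache%';

-- query_cache_type usually has to be enabled in the option file at startup;
-- the size can then be adjusted at runtime (64 MB here, purely illustrative):
SET GLOBAL query_cache_size = 67108864;

-- Hit/insert/prune counters for judging whether the cache actually helps:
SHOW STATUS LIKE 'Qcache%';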

3: slow_query_log analysis.

In fact, the most important and most fundamental means of improving query performance is the slow_query setup.

When slow_query_log is set to ON, the server logs every query whose execution time exceeds the slow-query threshold you set (long_query_time). When optimizing performance, you can analyze the slow query log and target the slow queries specifically, for example by creating suitable indexes or by splitting tables. As for why tables need to be split: when they are not split, at a certain point that becomes the performance limit. A brief introduction follows.
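A minimal sketch of enabling the slow query log at runtime; the threshold and file path are illustrative assumptions:

SET GLOBAL slow_query_log      = 'ON';
SET GLOBAL long_query_time     = 1;                            -- log anything slower than 1 second
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';    -- illustrative path

-- The log can then be summarized with mysqldumpslow, e.g.:
--   mysqldumpslow -s t -t 10 /var/lib/mysql/slow.log

Putting the same settings in the option file makes them survive a server restart.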

4: Sub-databases and sub-tables

Sub-databases and sub-tables are the killer move of query optimization. Once the data volume reaches a certain level, the measures above no longer have an obvious effect; at that point the data volume must be spread out. There are two kinds of measures, sub-databases and sub-tables, and two ways of splitting a table: vertical splitting and horizontal splitting. Here is a brief introduction to each approach.

For MySQL, data is stored as files on disk. When a data file is too large, operating-system operations on the large file become cumbersome and time-consuming, and some operating systems do not support very large files at all, so at that point the table must be split. In addition, the common MySQL storage engine is InnoDB, whose underlying data structure is a B+ tree. When the data file is too large, the B+ tree gains more levels and nodes; querying a row may traverse several levels, which inevitably causes multiple I/O operations to load pages into memory and is certainly time-consuming. There is also InnoDB's locking around the B+ tree: nodes get locked, and when you change the table structure the tree is locked, which for a huge table file can be considered infeasible.

So we have to split the tables and the databases.

Two: Data migration

When the data volume reaches a certain level, migrating the database becomes a very careful and risky job. Ensuring data consistency, handling all kinds of emergencies, and coping with data changes during the migration are very difficult problems.

2.1: Inserting data. During a data migration there will be a re-import of large amounts of data; you can choose to load a file directly, and sometimes you may need to insert from code. In that case you need to optimize the INSERT statements. You can use the INSERT DELAYED statement: when you issue such an INSERT, the server does not insert into the database immediately but places the row in a cache and waits until the time is right to insert it.
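As a sketch of the bulk-load route (note that INSERT DELAYED only applies to MyISAM-style engines and is ignored by newer MySQL versions): the file name, table, and columns below are hypothetical, and LOAD DATA INFILE is subject to the server's secure_file_priv setting.

-- Bulk import from a file is usually much faster than row-by-row INSERTs:
LOAD DATA INFILE '/tmp/export.csv'
INTO TABLE t
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(col1, col2, col3);

-- When inserting from code, batch many rows per statement and per transaction:
INSERT INTO t (col1, col2, col3) VALUES
    (1, 'a', NOW()),
    (2, 'b', NOW());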

I want to add ...


MySQL big data processing

First, overview

Sub-tables are a fairly popular concept, especially under heavy load, where splitting tables is a good way to spread database pressure.

First, understand why we split tables and what the benefits are. Consider the process a database goes through to execute a SQL statement: receive the SQL, put it in the execution queue, parse it, fetch or modify data according to the parse result, and return the result. Of course this flow is not necessarily exact; it is just my own subjective picture. So where in this process is a problem most likely to occur? If the previous SQL has not finished, the next one cannot execute, because to guarantee data integrity the data file must be locked. There are two kinds of locks, shared locks and exclusive locks. With a shared lock, other threads can still read the data file during the lock but may not modify it; correspondingly, with an exclusive lock the whole file is owned by one thread and other threads cannot access it at all. MyISAM, generally the fastest MySQL storage engine, locks at the table level: once it locks, the whole data file cannot be accessed from outside, and the next operation must wait until the previous one completes. When the previous operation has not finished and later operations wait in the queue, unable to run, that situation is called blocking, or colloquially "locking the table".

What is the immediate consequence of a locked table? A large amount of SQL cannot execute right away and must wait until the SQL ahead of it in the queue has fully finished. SQL that cannot execute leads to missing results or severe delays that hurt the user experience.

This is especially true for heavily used tables, such as the user information table of an SNS system or the posts table of a forum system. To make sure data can be extracted and returned to the user quickly, some processing technique must be used, and that is the sub-table technique I want to talk about today.

Sub-tables, as the name implies, store data of the same type across several tables. When extracting data, different users access different tables, they do not conflict, and the probability of locking a table is reduced. For example, suppose user data is currently split across two tables, user_1 and user_2, which hold different users' information: user_1 holds the first 100,000 users and user_2 holds the next 100,000. If you now query the two users heiyeluren1 and heiyeluren2, they are extracted from different tables, reducing the chance of locking a table.

I am going to describe two sub-table methods that I have not experimented with myself, so I cannot guarantee they are accurate to use, but they provide a design idea. The sub-table examples below assume a forum ("post bar") system as the basis for the discussion and construction. (If you have never used a post bar, Google it quickly.)

Second, sub-table processing based on a base table

The general idea of base-table-based sub-table processing is: there is a main (base) table that saves all the basic information. If an item needs to find the sub-table it is stored in, it must look up the table name and similar data from this base table and can then access that sub-table directly. If you feel the base table is not fast enough, you can keep the whole base table in a cache or in memory for efficient queries.

Based on the post-bar scenario, we construct the following three tables:

1. Bar section table: saves the information of the bar's sections
2. Bar topic table: saves the topic information of each section, for browsing
3. Bar reply table: saves the original content of a topic and the replies

The "bar section table" contains the following fields:
Section ID      board_id      int(10)
Section name    board_name    char(50)
Sub-table ID    table_id      smallint(5)
Creation time   created       datetime

The "bar topic table" contains the following fields:
Topic ID        topic_id      int(10)
Topic name      topic_name    char(255)
Section ID      board_id      int(10)
Creation time   created       datetime

The fields of the "bar reply table" are as follows:
Reply ID        reply_id      int(10)
Reply content   reply_text    text
Topic ID        topic_id      int(10)
Section ID      board_id      int(10)
Creation time   created       datetime

This gives us the whole table structure. The relationship between the three tables is:

section --> multiple topics --> multiple replies

which means the relationship between the table file sizes is:

section table file < topic table file < reply table file

So it is almost certain that the topic table and the reply table need to be split, which improves the speed and performance of our data retrieval and modification.

Looking at the table structure above, you will notice that the "section table" contains a "table_id" field. This field records in which sub-table the topics and replies of a section are stored.

For example, suppose we have a bar called "PHP" with board_id 1 and sub-table ID 1; its record would be:

board_id | board_name | table_id | created
1        | PHP        | 1        | 2007-01-19 00:30:12

Accordingly, if I need to extract all the topics of the "PHP" bar, I have to combine the table_id saved in the section table with the prefix of the topic tables. If our topic table prefix is "topic_", the topic table for the "PHP" bar is "topic_1", so we run:

SELECT * FROM topic_1 WHERE board_id = 1 ORDER BY topic_id DESC LIMIT 10

This returns the section's topic list for browsing. If we need to see the replies under a topic, we can again use the "table_id" stored in the section table. For example, if our reply table prefix is "reply_", we can build the reply table of the "PHP" bar (ID 1) and query:

SELECT * FROM reply_1 WHERE topic_id = 1 ORDER BY reply_id DESC LIMIT 10

Here we can see clearly that we are using the base table; the base table is our section table. Naturally you will ask: how do we keep the base table fast and efficient as its data volume grows?

Of course we have to give this base table the best possible speed and performance; for example, it can be stored in a MySQL MEMORY table or kept in a memory cache such as memcached, adjusted according to the actual situation.

In general, the base-table-based sub-table mechanism is a fairly good solution for SNS, friend, forum, and other Web 2.0 sites, where a single table can record the basic identities and the mapping to the target tables. The advantage of using a table to save the correspondence is that later expansion only requires adding a record to the base table.

Advantages: adding and removing nodes is very convenient, which brings great convenience for later upgrades and maintenance.
Disadvantages: an extra table (or an extra lookup) is needed, you cannot leave the database, and the base table can become a bottleneck.

Third, sub-table processing based on a hash algorithm

We know that a hash is a value computed by a particular hash algorithm; the value must be unique, and the computed value can be used to find the desired entry, as in a hash table. The hash idea in sub-tables is similar: compute the name of the table where the data is stored from an original target ID or name via some hash algorithm, then access that table.

Continuing with the bar example: each bar has a section name and a section ID; the two values are fixed and unique, so we can derive a target table name by performing some computation on one of them.

Now suppose our post-bar system allows at most 100 million rows and we plan to save 1 million records per table, so the whole system needs no more than 100 tables. By this standard, we hash the bar's section ID to obtain a key value; this value is our table name, and then we access the corresponding table.

We construct a simple hash algorithm:

function get_hash($id) {
    $str = bin2hex($id);
    $hash = substr($str, 0, 4);
    if (strlen($hash) < 4) {
        $hash = str_pad($hash, 4, "0");
    }
    return $hash;
}

The algorithm basically takes a section ID and returns a 4-character string; if the string is not long enough, it is padded with "0". For example, get_hash(1) outputs "3100", and get_hash(23819) returns "3233". Combining this with the table prefix gives us the table to access: when we need the content with ID 1, the tables are topic_3100 and reply_3100, which we can access directly.

Of course, after applying the hash algorithm, some data will probably end up in the same table. This differs from a hash table, where collisions are resolved as far as possible; we do not need that here, but we do need to predict and analyze which table names the data may end up in.

If you need to store even more data, you can likewise hash the section name, for example converting it to hex as above. Since Chinese characters are far more numerous than digits and letters, the probability of repetition is smaller, but more tables may be produced, and some other issues must be considered accordingly.

In the final analysis, using the hash approach requires choosing a good hash algorithm, so that more tables can be generated while data lookup stays fast.

Advantages: the hash algorithm yields the target table name directly, so it is efficient.
Disadvantages: extensibility is poor; once a hash algorithm is chosen and a data volume is defined, the system can only run at that data volume and cannot exceed it.

Other issues

1. Searching. Now that the data is split across tables, we cannot search directly on one table, because you cannot search across the dozens or hundreds of tables that already exist in the system, so searching must be done with a third-party component; for example, Lucene as the site search engine is a good choice.

2. Table files. We know that for each table the MySQL MyISAM engine generates three files, *.frm, *.MYD, and *.MYI, which store the table structure, the table data, and the table indexes. On Linux, the number of files in each directory is best kept under 1,000, otherwise retrieving data becomes slower. Since each table produces three files, once there are more than roughly 300 tables retrieval becomes very slow, so at that point the tables must also be split, for example across databases. With the base-table approach, we can add a new field recording which database holds the data. With the hash approach, we can take the first few characters of the hash value as the database name. This solves the problem reasonably well.

Summary

In heavily loaded applications the database has always been a very important bottleneck that must be broken through. This article explains two kinds of sub-table approaches, and I hope it can be enlightening to many people. Of course, the code and ideas in this article have not been verified by any running code, so the design cannot be guaranteed to be completely accurate and practical; readers still need to analyze and implement carefully during use. The article was written in a hurry and the quality may not be guaranteed; if you run into errors, please do not take offense. Criticism and advice are welcome, thank you!

Article source: http://blog.csdn.net/likika2012/article/details/38816037
