About the design of the Forum database
Article Category: Database
A simple forum system
1: Contains the following information:
2: Daily forum visits about 3 million, update posts 100,000 or so.
Please give the structure design of database table, and combine the paradigm to explain the design idea briefly.
I. Posting the subject and reply information in a table and adding the User_name field to the table
For the operation of the database, the performance of retrieving data is largely unaffected by the data (in the case of exact lookups), and the connection between tables and tables can have a huge impact, especially between tables with huge amounts of data, so the positioning of the problem is basically deterministic: when displaying and retrieving data, Minimize database connection and table-to-table connections;
Reference
1:user: User basic information table
Fields are: User_id,user_name,email,homepage,tel,add ...
2:forum_item: Theme and reply hybrid table
Fields are: Id,parent_id,user_id,user_name,title,content,....
Parent_id=0 or null is a subject, otherwise =n is a reply to id=n that post
The username field is redundant, so there is a problem of synchronizing the data when the user modifies the username, which needs to be remedied by the program.
Two. topic list and theme replies are saved separately
Reference
1:user: User basic information table
Fields are: User_id,user_name,email,homepage,tel,add ...
2:forum_topic: Topic table
Fields are: Id,user_id,title,content,....
3:forum_topic_back: Topic Reply table
Fields are: Id,topic_id,user_id,title,content,....
Three. The content of the subject table is individually designed as a table
Reference
1:user: User basic information table
Fields are: User_id,user_name,email,homepage,tel,add ...
2:forum_topic: Topic table
Fields are: Id,user_id,title,....
3:forum_topic_content: Topic content Table
Fields are: id,topic_id,content
4:forum_topic_back: Topic Reply table
Fields are: Id,topic_id,user_id,title,content,....
Four. The user information is saved in 2 tables, and the related table is divided into table processing
Reference
1: Simple user table Tb_user:
ID, username
2: User Details table Tb_userinfo
Id,userid, email, homepage, phone, address ...
3: Forum Theme Table Tb_bbs
ID, userid, title, IP, Repleycount, Replyuserid, Createtime, Lastreplytime
4: Forum Content standard Tb_bbs_content (this table can be stored according to the BBSID table)
Id,bbsid, content;
5: Forum Reply table tb_bbs_reply (This table can be stored according to the BBSID table)
ID, BBSID, userid, Content, Replytime, IP
Five. Add a theme cache table to fetch the first 100 records of each zone
Reference
1: Simple user table Tb_user:
Fields are: ID, username
2: User Details table Tb_userinfo
Fields are: Id,userid, email, homepage, phone, address ...
3: Forum Theme Table Tb_bbs
Fields are: ID, userid, title, IP, Repleycount, Replyuserid, Createtime, Lastreplytime
4: Forum Content standard Tb_bbs_content (this table can be stored according to the BBSID table)
Fields are: id,bbsid, content;
5: Forum Reply table tb_bbs_reply (This table can be stored according to the BBSID table)
Fields are: ID, bbsid, userid, Content, Replytime, IP
6: Theme Cache Table Tb_bbs_cache
Fields are: ID, userid, title, IP, Repleycount, Replyuserid, Createtime, Lastreplytime
------------------------------------------------------------------------------
The following is a discussion of the above scenario:
1: Scenario one seems to look less like a table, but because of redundancy, because the number of posts is huge, it takes up a lot of space. This is a large amount of data, but for real-time and data absolute security requirements of low-volume applications, a large number of use of caching can greatly improve processing power.
2: Plan one so you design, index how to build better, there is will not cause this table overheating, there is ... I think a system like a forum that uses caching can significantly reduce the load on a database
3: Everyone means to be divided into the topic table, Reply table and so on multiple tables? or synthesize a table and do the physical partitioning? Which is better?
4: So high Insert update frequency under the index is somewhat impractical, the index will reduce the speed of inserting updates and the traffic is so large, the index is not recommended to use
5: On such a forum, in real-time updates, posts, replies. I don't think it's good to have indexes on a database, but how do you improve the performance of queries if you don't build indexes?
6: It's all a distributed database. Put it in more than one table, it's okay to associate directly. Important is horizontal slicing
7: Recognition of sub-tables, sub-libraries, caching practices
Reference
Problem Analysis:
Daily forum visits about 3 million, update posts around 100,000.
1. Read and write ratio of about 30:1, should be inclined to read efficiency. Indexing is required to refer to the primary key that is commonly read.
2. Monthly data is 10w*30=300w. Can be divided by the monthly table
3. Annual posts in 300w*12=3600w, the estimated data will not be less than 30T. Can be divided into the library by year
Structure:
User information: Stand-alone table, UserID primary key
Post, reply: Monthly table storage, post unique ID primary key, date index.
Post Content details: Monthly storage, post unique ID primary key
8: Take a 5 million of the table to say things
Reference
If there's no index at the time of the update,
Update time about 30 seconds or so refers to the full table update ~ ~
It takes about 10 seconds to query a single line of records ~ ~
and add the index.
The update time is almost a little more than
And the query record is reduced to the millisecond level ~ ~
Faster than a Hundredfold ~ ~
What's the weight of the choice
9: A bit of your own experience:
Reference
1. sub-table storage;
2. Establish an index; SQL Press so the speed of the query is still very fast;
3. Avoid the whole table scan, read the topic first, read the reply according to the subject ID, and then read the user according to the user ID, instead of using the association;
4. Use the cache;
10: Need to divide 3 tables, and build index ...
The reasons are as follows:
Reference
1: Set up 3 tables can avoid redundant data, maintenance convenient ...
2: Daily Forum visit volume of 3 million or so, the main pressure is from the query, SQL query efficiency is to avoid full table scan, it is necessary to establish index ...
3: About creating an index reduces the speed of inserting updates this problem does not exist ... Because indexes degrade the speed of updates, the index of the corresponding field needs to be updated after the corresponding field has been updated.
4: See update post about 100,000, this sentence is said, we may post the title, post content, reply title, reply to the content of the 4 fields do update ... It is important to note that these four fields are not used to establish the table join field, in order to optimize the query speed we will not be indexed on these four fields, so from this topic, we set up the index does not affect the performance of the update post ...
So, I think the final answer is to set up 3 tables to index The fields used in the connection ...
11:
Reference
Is it possible to build a visual with two tables?
The view is also very slow.
12: Every day to update 100,000 posts, every day to visit so much, must not put all the main stickers in a table, large table sub-table, establish the index of the characters commonly used segment, and then configure the cache. Cascading relationships are best not configured, and then queried when needed.
13: Although the topic is not described, but in practice, the review post is usually only paginated display, and a page will show up to dozens of posts, then in fact, as long as the SQL statement is well constructed, the T_user table is simply a subquery with only dozens of rows of result sets to connect, There should be a basic need not to worry about performance issues.
And in fact, a universal table is simply related to the million-row table (in fact, the number of soul song in my opinion is not really a big table), in the database has a lot of optimization, and even can improve the hardware configuration to improve performance, there is no significant need for structural redundancy. Once the structure has redundancy, in order to ensure data consistency, often you also need to consume more resources, instead of the outweigh the costs.
14: Sub-table with vertical and horizontal sub-table
Reference
1: No matter how many records you take (even 1), if two large table associations can produce very large intermediate values, if you sort (the sort field is not used to index), you can cause the database to be calculated in a variety of ways.
2: Index will cause insert, update record is slow, everybody knows.
3: Horizontal sub-table can solve this problem, as long as you can ensure that each table only the appropriate number of records (such as 100W a table) (horizontal partitioning can also solve some problems of IO)
4: There is also study write separation, master is written, slave is read (plus cache, general problems are OK)
The above is a relatively large workload, it is best to ensure that your database design is reasonable (the paradigm is the first step, and then consider the inverse paradigm), and basically can meet a lot of problems.
15: Program four content and other information to separate the advantage is that each table can be minimized, the database operation pressure will be reduced, the operation will be fast, but also with the cache, the content is cached according to the situation, you can access the table data as little as possible.
Reference
1: For the above table can also be applied to the sub-Library operation, which reduces the pressure of the database library, the pressure spread to each machine
2: My approach is to avoid table associations as much as possible
3: Then for the SQL statement as far as possible to ensure that the index is valid, not indexed SQL, as far as possible to use an efficient way to solve the index
16: The Peripheral scenario:
Reference
1 reading, generate static pages, or cache the latest hottest posts.
2 Writing, the main estimate is the Insert bar, which can be operated asynchronously. All of the paste operations are put into a queue and then the insert database operation is performed in bulk.
17: Plan four more reliable, plus regular dump, massive cache, large-scale forum to do this.
18: I think it would be better to use 3 sheets.
Reference
1: Business, it is likely that the main post with the reply paste has a different extension, such as attachments, and so on a table, if the main post and reply to the existence of personality needs, how to do? Unlimited add field?
2: The main stick with the reply in the same table, will increase the probability of the lock table.
3: The index does reduce the speed of the table update, but the query efficiency is also very impressive, so I think, the index can not be used, but less.
4: The establishment of the table, it is true that a person upstairs can reply to say, with the level of the table, in fact, the principle is to use the first count and then check.
5: In the front-end performance, you can use Ajax, such as the steps to take data, such as the content of the main paste first to take out, and then gradually load back information and so on.
19: The key to speed increase:
Reference
1. Build the index and make full use of it when querying;
2. Avoid the use of associations, so as to avoid the whole table scan; using correlation is not as fast as using primary key queries more than once;
3. Some of the processing functions can be put into memory as much as possible, such as organizing themes and responses;
4. The use of static pages is also a good practice;
20: Scenario Three is the continuation of the Hibernate two cache idea, for frequently updated data are designed as separate tables, so as to maximize the use of Hibernate cache
21: No fast=true settings, some people say or better than in, exists better than in, the index is better than the full table scan, partition can improve query efficiency, but the partition to reduce the efficiency of insertion
What I want to say is that without the option of fast=true, if one can find a step, or a few steps to formulate a method to improve efficiency, then the optimizer will do it by itself, without the user's fear at all.
Assuming that the or is better than in, the database optimizer will take the in syntax and or syntax to walk the execution plan, so why torture the user.
Say the actual, many people open mouth said, SQL optimization is to avoid the full table scan, do not know that you have not known the principle of Index lookup. Index lookup data, there are two steps to do, the first step is to quickly query the index, the index only stores the corresponding table data rowid, so there is a second step, According to rowID to get all the data, so need a disk I/O, do not underestimate the disk I/O, the results through the index query more time, disk I/O is very large, this time is much slower than full table scan, in fact, the Oracle 10g cost-based optimizer (CBO), An index that is not highly selective, the optimizer does not use it at all, and automatically uses a full-table scan.
22: This level of BBS I designed, at that time to do so (Programme V):
Reference
A total of four tables:
1. User tables
2. Subject list (including final reply, last reply, last reply ID, etc.)
3. Reply form
4. The theme cache table (this takes the first 100 records of each area), generally the most load is the theme of the page, so the cache table is a small table.
A total of 3 app clusters, 1 web,2 set Oracle one master, run down speed is acceptable.
23: Do not recommend the design of the table redundancy, feel like repeating the code, there is a bad taste
Reference
1: Cache frequently used pages and data
2: Read-write tables or libraries separated (based on vertical separation)
3: The database can be vertically delimited (fields are divided into multiple tables) and then horizontally separated (data is divided into multiple tables)
4: Forum functions can be separated, different servers are responsible for different functions, chip servers, Web servers, mail servers, etc.
In short, it is to refine the division
24: Design support for Scenario three
Read operation:
Reference
1: Display the Post list interface, if the main paste content in the Forum_topic table, then this is redundant, assuming that you want to get 100 posts, a row of data length, the database needs to scan more data blocks, performance is worse.
2: When you open a post, the read operation is associated to two tables (Forum_topic and forum_topic_content) through the index, and the performance consumption is not much for the entire database.
How to write a post:
Reference
Post the title table and the table of contents into a separate insert
Updating a non-indexed column does not cause an index update:
Reference
As long as the indexed column (for example, the header ID of the reply table) is not updated frequently, the index is not updated to produce performance consumption, even if the other columns of the index's row are frequently update, and a table is updated 300,000 times a day. Why would anyone have such a big misunderstanding about the index, because it is so small that it can be easily assumed even if the database is installed on the Pentium 3 single-core CPU? For a forum (or most of the system), retrieving (SELECT) data consumes much more system resources than the update data (insert/update) itself, and indexes are specifically for retrieving data services, in order to save the small performance cost of updating data, Pay to retrieve 100 data when the database scan tens of millions of billions of data to match the cost of data? If this is the case, even a database with a 32-core top-level CPU may not be able to be used for parallel queries.
Do database design, or more understanding of the principles of the database is good.
25: Database slicing is a must.
Reference
1: Vertical segmentation: User table, user Information table, subject table, topic Content table, reply form
2: Horizontal Slicing: Theme 1, Theme 2, Theme 3 、...、 topic n
3: Cache: Cache Routing table
4: Re-match the database read and write separation and cluster bar
Another: Actually the forum changes the title, the content probability is very small. Most of them are new