When I first came into contact with PHP and MySQL, many people told me something like: "MySQL is fine for a small site with hundreds of thousands of IPs a day, but at a few million IPs it won't hold up." I don't remember the exact words, but that was the gist. I never had a good chance to verify the claim: for one thing, I had never taken over a site with that much traffic; for another, I was busy with work and too lazy to pursue it. So I shelved the question and kept it at the back of my mind.
A few days ago, a website owned by our company (it is commercial, so I won't name it; hereafter I'll just call it "the site") saw its traffic soar this year from 700,000-800,000 IPs a day to millions of IPs, with nearly 10 million PV per day. The whole server came under heavy pressure, and service outages occurred from time to time.
The first symptom showed up in the traffic statistics: an entire day's data went uncounted. It turned out MySQL had stopped working.
This article focuses on the solutions proposed by several technicians at our company.
1. MySQL Server Cluster
Because small companies have limited funds, and few of our technicians had any clustering experience, this option was naturally rejected by everyone.
Here is my personal understanding: building a cluster not only adds extra cost but also brings considerable technical challenges, which is unrealistic for our company's current situation. A cluster is nothing more than shifting the pressure from one server onto two or more servers; that is how I understand it. I may be wrong, so please correct me.
2. Divide and conquer
This method is similar to clustering, except that the statistics code runs on different servers. Since the company has many low-spec servers that can each handle tens of thousands to hundreds of thousands of IPs without trouble, we could split the millions of daily visits into chunks of tens or hundreds of thousands across those machines (a sketch of the idea follows this list).
Advantage: makes full use of existing resources to solve the immediate problem.
Disadvantage: not a long-term solution; problems will resurface sooner or later, and aggregating the statistics afterwards is troublesome.
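As an illustration only, here is a minimal sketch of how the split might work, assuming a hypothetical pool of stats servers and a tracking endpoint named count.php (both names are my inventions, not the company's actual setup):

<?php
// Hypothetical pool of low-spec servers, each running a copy of the
// statistics code.
$statsServers = array('stat1.example.com', 'stat2.example.com', 'stat3.example.com');

// Pick a server deterministically from the visitor's IP, so the same
// visitor always lands on the same machine.
$ip = $_SERVER['REMOTE_ADDR'];
$index = (crc32($ip) & 0x7fffffff) % count($statsServers);

// Hand the tracking request off to the chosen server.
header('Location: http://' . $statsServers[$index] . '/count.php?ip=' . urlencode($ip));
exit;

The aggregation pain mentioned in the disadvantage comes from having to merge the counts from every machine afterwards.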
3. Modify the Statistical Code
Previously, before inserting a record we would check whether the IP address already existed, process the referer, and so on, which quietly added load to the server. So the statistics code was changed to do a single unconditional insert into the database, leaving all processing for later.
This method basically preserved the day's data. However, because of the sheer data volume, the follow-up processing still brought the server to its knees; on top of that, the indexes defined in the original table design made every insert consume a great deal of server resources.
Dropping the indexes before insertion and rebuilding them for the final processing proved too slow, so the game was not worth the candle.
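For context, a minimal sketch of what the "insert first, process later" code might look like; the connection details, table, and column names are my assumptions, not the actual schema:

<?php
// Assumed connection details and table layout, for illustration only.
$db = new mysqli('localhost', 'user', 'pass', 'stats');

// No duplicate check, no referer parsing: one bare INSERT per hit.
$stmt = $db->prepare(
    'INSERT INTO raw_hits (ip, url, ua, hit_time) VALUES (?, ?, ?, NOW())'
);
$stmt->bind_param(
    'sss',
    $_SERVER['REMOTE_ADDR'],
    $_SERVER['REQUEST_URI'],
    $_SERVER['HTTP_USER_AGENT']
);
$stmt->execute();

Even an insert this bare still pays the index-maintenance cost described above on every single row.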
4. Modify the statistical method
The last method proved very effective. What is it?
This is the method I will mainly introduce here:
A. Keep the original data structure unchanged, and write all data, in a fixed structure, to a file instead of the database.
Structure: it can be XML, JSON, or any regular data format you like.
For example:
221.2.70.52,http://www.baidu.com,windowsxp\r\n
221.2.70.52,http://www.baidu.com,windowsxp\r\n
To write a file you would normally use fopen and fwrite; I won't describe those here. Instead, here is a very useful trick. Maybe everyone already knows it, but a rookie like me didn't: use error_log. Isn't that for writing error logs? Yes, and it is very convenient here.
Format:

error_log("content", 3, "date.dat");
I will not go over its usage here; if anything is unclear, check the manual.
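To make the trick concrete, here is a side-by-side sketch using the sample record from above; message type 3 is what tells error_log to append to an arbitrary file:

<?php
$record = "221.2.70.52,http://www.baidu.com,windowsxp\r\n";

// The usual way: open, append, close.
$fp = fopen('date.dat', 'a');
fwrite($fp, $record);
fclose($fp);

// The one-line trick: error_log() with message type 3 appends the
// message to the file named in the third argument.
error_log($record, 3, 'date.dat');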
B. Data File Naming
Why talk about file naming at all? If we simply wrote the data into a file without any organization, how would that differ from inserting it straight into the database? The work would be useless.
First, use the time: date('YmdH') yields a value such as 2008121112 (11 December 2008, hour 12). With this in the file name, each file holds only one hour of data, so no single file grows too large, and a new file is started automatically each hour with no extra checks needed.
Second, use the IP address: since much of the later processing groups records by IP, it is very convenient to put records from similar IPs into the same file and handle them together (see the processing step below). Here we take the first segment of the IP address, up to three digits, as part of the file name.
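Putting the naming scheme and the error_log trick together, a minimal sketch of the logging step (the field layout and the .dat suffix are my own assumptions):

<?php
// File name from the current hour plus the first segment of the
// visitor's IP, e.g. "2008121112_221.dat".
$ip = $_SERVER['REMOTE_ADDR'];
$prefix = substr($ip, 0, strpos($ip, '.'));  // first octet, 1-3 digits
$file = date('YmdH') . '_' . $prefix . '.dat';

// One comma-separated record per hit; the referer may be absent.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$record = $ip . ',' . $referer . ',' . $_SERVER['HTTP_USER_AGENT'] . "\r\n";

error_log($record, 3, $file);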
C. Data Import and Processing
With the two steps above, a full day's data is reliably captured in files. What happens next is far less time-critical, but here are my thoughts.
Pre-processing before import:
As mentioned above, records whose IPs share the same leading digits end up in the same file, so they can be pre-processed together before going into the database. The most straightforward (if crude) method is to read a file, split it into its individual records, and delete the duplicates.
If one IP address browsed several pages, PV can be computed at this stage, and the pages it visited can be merged into new, condensed records (see the sketch below).
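A minimal sketch of that pre-processing pass, under the file layout assumed earlier (the .clean suffix for the output file is my invention):

<?php
// Read one hour/IP-prefix file and collapse duplicates,
// counting page views per (ip, url) pair.
$pv = array();
foreach (file('2008121112_221.dat') as $line) {
    $line = rtrim($line, "\r\n");
    if ($line === '') {
        continue;
    }
    list($ip, $url) = explode(',', $line);
    $key = $ip . ',' . $url;
    $pv[$key] = isset($pv[$key]) ? $pv[$key] + 1 : 1;
}

// Write condensed records in the form: ip,url,pv
$out = '';
foreach ($pv as $key => $count) {
    $out .= $key . ',' . $count . "\r\n";
}
file_put_contents('2008121112_221.clean', $out);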
Import method:
MySQL's LOAD DATA INFILE statement does the heavy lifting here.
$sql = "LOAD DATA INFILE '" . $file . "' INTO TABLE `test` FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' (`time`, `md5`)";
This statement is remarkably fast even when importing hundreds of thousands of rows. I won't cover its usage here either; refer to the manual if it is unfamiliar. The takeaway: a database bottleneck does not always call for more hardware or a different database. Often, simply changing your approach solves the problem.
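For completeness, a sketch of running the import from PHP; the connection details are assumptions, and the column list simply follows the statement above (the real table would need whatever columns the condensed records actually contain):

<?php
$db = new mysqli('localhost', 'user', 'pass', 'stats');

$file = '2008121112_221.clean';  // pre-processed file from step C
$sql = "LOAD DATA INFILE '" . $db->real_escape_string($file) . "' " .
       "INTO TABLE `test` FIELDS TERMINATED BY ',' " .
       "LINES TERMINATED BY '\\r\\n' (`time`, `md5`)";

if (!$db->query($sql)) {
    // Note: the MySQL user needs the FILE privilege, and the file
    // must be readable by the MySQL server process itself.
    error_log('import failed: ' . $db->error . "\n", 3, 'import_errors.log');
}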