The queries in MySQL, Sphinx, and many databases and search engines are single-threaded. For example, to execute a query on a 32-CPU-Core, 16-disk R910 server, it uses at most one core and one disk. Yes, only one is used.
If the query is CPU intensive, it uses about 3% of the machine's CPU capacity (in the case of the 32 nuclear machines mentioned above). In the case of disk-intensive, approximately 6% of the machine IO capability is used (the same configuration as the previous example, with 16 disks consisting of RAID10 or RAID0).
Let me put it another way. If you run a query on a single core disk and take 10 seconds, it takes 10 seconds for the same query to be on a 32-core 16-disk machine, without any improvement.
You already knew that, didn't you? So, my question is-is there any way to improve it?
If it is Sphinx, great, the answer is YES! And it doesn't take too much effort. You don't even need to modify the application and the database, just slightly change the Sphinx configuration.
Plan
First of all, let me explain our goal.
The Sphinx itself supports distributed search, which has long been designed to scale horizontally. If the index is not placed on a single machine, you can have more than one machine to index different parts, set up an aggregation node, responsible for receiving requests from the application, and then send the request to all the data nodes at the same time, and finally the results of their return merged to return to the application. In the application it looks as if only one server is serving it.
Well, guess what? Ha, we can apply this function to a single machine, so that our query faster than n times. And now that Sphinx has supported this approach, we don't have to pretend to query which remote nodes.
Another benefit is that, after configuring a distributed search, indexes can be built in parallel!
It's still a point to note that while this approach can speed up the vast majority of queries, there are some exceptions. Because the results of parallel queries still need to be merged, the merge process is single-threaded. Also, merging includes some CPU-intensive operations, such as rating, sorting, or even count with group by, and if the volume of data is large, the merge process becomes a bottleneck.
To confirm this is also very simple, just look at the Sphinx query log, and see how many records each query matches, we have a number of minds.
Perform
Assume that the previous index configuration on the server is as follows (many of the details are omitted):
Copy Code code as follows:
SOURCE Src1
{
Type = MySQL
Sql_query = SELECT ID, text from table
}
Index idx1
{
Type = Plain
Source = Src1
}
Searchd
{
dist_threads = 0 # Default
}
Now we're using a machine with 3 CPU cores and disks to do the index--that's the idx1. Here is the configuration file that we changed:
Copy Code code as follows:
SOURCE Src1
{
Type = MySQL
Sql_query = SELECT ID, text from table
}
SOURCE Src1p0:src1
{
Sql_query = SELECT ID, text from table WHERE ID% 3 = 0;
}
SOURCE Src1p1:src1
{
Sql_query = SELECT ID, text from table WHERE ID% 3 = 1;
}
SOURCE Src1p2:src1
{
Sql_query = SELECT ID, text from table WHERE ID% 3 = 2;
}
Index Idx1_template
{
Type = Plain
Source = Src1
}
Index Idx1p0:idx1_template
{
Source = src0
}
Index Idx1p1:idx1_template
{
Source = Src1
}
Index Idx1p2:idx1_template
{
Source = Src2
}
Index idx1
{
Type = Distributed
Local = Idx1p0
Local = IDX1P1
Local = IDX1P2
}
Searchd
{
Dist_threads = 3
}
After you have done this, you need to rebuild the index. But now the index indexer command idx1p0 to IDX1P2 can be synchronized.
In addition, the use of different operations to separate data is not the best way, you can use a secondary table in MySQL to distinguish their scope, with sql_query_range use or whatever, depending on your data to determine.
Written in the last
I've always loved that Sphinx,sphinx can be so easily extended to the machines you need so much that it was used many years ago. Then, I thought, I didn't use this feature to make queries on a single machine go faster, as I always did. Well, that's not to say it's slow or what it is, it's just that the query is never too fast, is it?