High-concurrency generic search engine architecture design for hundreds of millions of records


[Author: Zhang Yan. Version: V1.0, last modified: 2008.12.09. When reprinting, please credit the original article link: http://blog.zyan.cc/post/#/]

In July I wrote an article titled "Architecture Design for Full-Text Retrieval (Search Engine) of Tens of Millions of Records Based on Sphinx + MySQL". My former company's classified-information search was built on that architecture with obvious results, and even a large share of MySQL SQL queries with WHERE conditions were switched over to Sphinx + MySQL search. However, that architecture still has limitations. First, MySQL's concurrency capacity is limited: at 200~300 concurrent connections, queries and updates become slow. Second, because the primary key of each MySQL table corresponds one-to-one with the ID of the Sphinx index, a site-wide query cannot be built across multiple tables, and adding a new category requires tedious changes to the configuration file. Third, the tight integration with MySQL makes it difficult to exploit Sphinx's full strengths.

Recently I designed the latest search engine architecture below, and have now written beta versions of the "search query interface" and the "index update interface". In testing, an ordinary PC with a 3.6 GHz dual-core CPU and 2 GB of memory, loaded with 70 million index records, gives the "search query interface" an average query time of 0.0xx seconds (query speed on a par with search engines such as Baidu, Google, Sogou, and Yahoo China; see "Appendix 2" at the end of this article), while supporting up to 5,000 concurrent connections. The "index update interface", which parses the data, enqueues it, and returns a response to the user, handles up to 1,500 requests/sec for that entire process.

The "queue controller" is the core of the whole architecture. It must control queue reads, update the MySQL master and incremental tables, update Tokyo Tyrant (the search engine's data storage layer), update the Sphinx incremental index in quasi-real time (within 1 minute), and periodically merge the Sphinx indexes. I expect to finish a beta version of it this week.

  1. Search query interface:
① The web application server passes the search keywords and other conditions via HTTP POST/GET to the search.php interface on the search engine server;
②③ search.php queries the Sphinx index service through the Sphinx API (based on the latest Sphinx 0.9.9-RC1 API, which I rewrote as a C-language PHP extension, sphinx.so) and obtains the list of unique search engine IDs matching the query conditions (a 15-digit unique search ID: the first 5 digits are the category ID, the last 10 digits the primary key ID of the original data table);
④ search.php uses these IDs as keys and fetches the text data corresponding to them from Tokyo Tyrant in a single mget call over the memcached protocol;
⑤ search.php highlights the result set according to the query conditions and returns it to the web application server in JSON or XML format.
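The 15-digit unique ID scheme in step ②③ can be sketched as follows; the function names are my own, not taken from search.php. The category ID occupies the top 5 decimal digits and the original table's primary key the bottom 10.

```python
# Sketch of the 15-digit search ID described above: first 5 digits are
# the category ID, last 10 digits the primary key of the original data
# table.  Function names are illustrative, not from search.php.

def make_search_id(category_id: int, row_id: int) -> int:
    """Pack a category ID (< 10**5) and a row ID (< 10**10) into one 15-digit ID."""
    assert 0 < category_id < 10**5 and 0 < row_id < 10**10
    return category_id * 10**10 + row_id

def split_search_id(search_id: int) -> tuple:
    """Recover (category_id, row_id) from a packed search ID."""
    return divmod(search_id, 10**10)
```

Because the category lives in the high digits, a whole-site query simply ignores the prefix, while a per-category query can filter on an ID range.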

  2. Index update interface:
(1) The web application server notifies the update.php interface on the search server of the content to be added, deleted, or updated via HTTP POST/GET;
(2) update.php writes the received information into the TT high-speed queue (a queue system I built on top of Tokyo Tyrant);
Note: these two steps can sustain more than 1,500 requests/second, enough to handle the search index update calls of a site with 60 million PV.
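A rough Python stand-in for the update.php flow above; in the real system the queue is Tokyo Tyrant written over the memcached protocol, so an in-memory deque takes its place here to keep the sketch runnable, and all field and function names are assumptions.

```python
import json
from collections import deque

# A deque stands in for the Tokyo Tyrant-backed high-speed queue so the
# sketch is self-contained.  All names are illustrative.
tt_queue = deque()

def update_endpoint(action: str, category_id: int, row_id: int, doc: dict) -> dict:
    """Handle one add/update/delete notification; no indexing happens here."""
    if action not in ("add", "update", "delete"):
        return {"ok": False, "error": "unknown action"}
    tt_queue.append(json.dumps({"action": action, "cat": category_id,
                                "id": row_id, "doc": doc}))
    # The caller gets a response as soon as the message is queued; deferring
    # all storage and index work is what makes the high request rate possible.
    return {"ok": True, "queued": len(tt_queue)}
```

The key design point is that the interface does nothing but validate and enqueue; everything expensive is left to the queue controller.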

  3. Search index and data storage control:
(1) The "queue controller" daemon reads messages from the TT high-speed queue in a loop (50 messages at a time, until the queue is empty);
(2) The "queue controller" writes the messages it has read into Tokyo Tyrant, the search engine's data storage layer;
(3) The "queue controller" asynchronously writes the messages into the MySQL master table (this master table is partitioned every 5 million records and is used only for permanent data backup);
(4) The "queue controller" writes the messages into the MySQL incremental table;
(5) Within one minute, the "queue controller" triggers Sphinx to update the incremental index; Sphinx's indexer uses the MySQL incremental table as its data source to build the incremental index, so the Sphinx incremental index corresponds to the MySQL incremental table that feeds it;
(6) Every 3 hours, the "queue controller" briefly stops reading from the TT high-speed queue and triggers Sphinx to merge the incremental index into the main index (this process is very fast), clearing the MySQL incremental table at the same time (this keeps the incremental table at a few thousand to a few hundred thousand records, greatly speeding up Sphinx's incremental index updates); it then resumes reading from the TT high-speed queue and writing into the MySQL incremental table.

  Open-source software used in this architecture:
1. Sphinx 0.9.9-RC1
2. Tokyo Tyrant 1.1.9
3. MySQL 5.1.30
4. Nginx 0.7.22
5. PHP 5.2.6

  Programs developed in-house for this architecture:
1. Search query interface (search.php)
2. Index update interface (update.php)
3. Queue controller
4. PHP extension for the Sphinx 0.9.9-RC1 API (sphinx.so)
5. High-speed queue system based on Tokyo Tyrant

  Appendix 1: A third party's comparison of MySQL FULLTEXT, Lucene, and Sphinx search:
  1. Query speed:
MySQL FULLTEXT is the slowest; Lucene and Sphinx are roughly equal in query speed, with Sphinx slightly ahead.

  2. Indexing speed:
Sphinx builds indexes the fastest, about 9 times faster than Lucene, which makes Sphinx well suited to quasi-real-time search engines.

  3. For detailed comparison data, see the PDF document attached to the original post.

  Appendix 2: Search speed analysis of major Chinese search engines:
Using "apmserv Zhang Yan" as the keyword, I compared the search speed of large and medium-sized search engines:
  1. Baidu:
(screenshots: first and second search)
Analysis: Baidu caches the results of the first search, so the second query is very fast.

  2. Google:
(screenshots: first and second search)
Analysis: Google also caches the results of the first search, but both of its queries are slower than Baidu's.

  3. Sogou:
(screenshots: first, second, and third search)
Analysis: Sogou appears to cache the first search's results; the second search is very fast, while the third is slower than the second. Sogou's first-search speed is similar to Baidu's.

  4. Yahoo China:
(screenshots: first and second search)
Analysis: Yahoo China does not cache search results; its speed is similar to Baidu's first search.

  5. NetEase Youdao:
(screenshots: first and second search)
Analysis: Youdao caches the first search's results; however, like Google, both of its searches are slower than Baidu, Sogou, and Yahoo.

Tags: Linux, PHP, Sphinx, search, tokyotyrant, ttserver, tokyocabinet, MySQL, Google, Baidu, sogou, Yahoo

  Comments:

linvo (2008-12-09): Impressive! I wonder what the architecture of the big search portals looks like.

jk (2008-12-09): I suspect the time Baidu displays is not the time actually spent on retrieval.

Big Pineapple: Baidu feels as fast as Google. Checking page load time with Firebug: Baidu 165 ms, Google 154 ms. [Zhang Yan replied: the time a search engine displays is only the index query time; page load time is not included.]

outrace (2008-12-09 13:00): Sphinx does not support non-integer IDs, which is annoying. ttserver cannot deserialize PHP content and does not support compression, which is also annoying. It would be nice if there were no such issues.

ptubuntu: I'm not familiar with development, but I can learn something here. Google is complicated, as you can imagine.

cyt (2008-12-09): This setup is still a long way from Google and Baidu.

fei (2008-12-09): Is it really that impressive?

dugu (2008-12-09): I'd like to use it in practice.

cncaiker (2008-12-09): Will it be released? We're very much looking forward to it.

jeck (2008-12-09): Top-notch, much admired! Looking forward to data from a high-load production environment.

dd_macle (2008-12-10): Learning...

gs (2008-12-10): Sina wants to develop its own search engine.

ttplay (2008-12-10): A one-sentence summary: one side writes the record to the cache and the database and updates the index; the other side reads records from the cache or database through the index.

plantegg (2008-12-11): In principle, you can't catch up with Google/Baidu/Sogou, although it looks like Lucene (written in Java; you haven't done any special optimization). Lucene's search results also have to consider relevance, ranking, and so on (so keyword hits are counted in every document, not abandoned once a match is found). Does Sphinx do that? If not, the comparison is unfair to Lucene :) A search engine's cache hit rate is generally a little above 60%, and the memory used for indexing runs to hundreds of GB. Also, you only handle incremental data; if data is deleted, can't you update the index? Still, a thumbs up!

[Zhang Yan replied:] Of course, Google/Baidu/Sogou's weighting algorithms are more complex, their indexes run from hundreds of millions to billions of pages, and there is much more besides. As for why Google was slower in this article's test: Google's web index has reached an astonishing 1 trillion pages (see: http://www.readwriteweb.com/archives/google_hits_one_trillion_pages.php), and at that order of magnitude it is normal for it to be slower than Baidu. So, relatively speaking, a single machine of mine supporting 0.1 billion records is a good query speed when set against Google, Baidu, and Sogou.

Compared with Lucene, Sphinx has the following ranking modes corresponding to Lucene's weighting; the PDF document compares the two in detail:
- SPH_RANK_PROXIMITY_BM25: default ranking mode, which uses and combines both phrase proximity and BM25 ranking.
- SPH_RANK_BM25: statistical ranking mode, which uses BM25 ranking only (similar to most other full-text engines). This mode is faster but may result in worse quality on queries which contain more than one keyword.
- SPH_RANK_NONE: disabled ranking mode. This mode is the fastest; it is essentially equivalent to boolean searching. A weight of 1 is assigned to all matches.
- SPH_RANK_WORDCOUNT: ranking by keyword occurrence count. This ranker computes the number of per-field keyword occurrences, multiplies the counts by field weights, and sums the resulting values for the final result.
- SPH_RANK_PROXIMITY: added in version 0.9.9, returns the raw phrase proximity value as the result. This mode is used internally to emulate SPH_MATCH_ALL queries.
- SPH_RANK_MATCHANY: added in version 0.9.9, returns the rank as it was computed in SPH_MATCH_ANY mode earlier, and is used internally to emulate SPH_MATCH_ANY queries.

The incremental index can add and update index entries, and deleting is even simpler: Sphinx supports attribute tagging. A record in its normal state has is_delete = 0; to delete it, the is_delete attribute is marked 1. Attribute tags live in memory and Sphinx automatically writes them to disk when it stops, so this is very fast; index deletion can therefore be considered real-time. When merging indexes, the --merge-dst-range parameter can exclude the entries marked as deleted.

dd (2008-12-12 11:36): Awesome. Learning from you, though some of it I still don't understand.

EOOD (2008-12-12): I have to say Sphinx isn't comparable to Google or Baidu, or even Lucene.

Syu (2008-12-12): Has Zhang Yan run into this problem? I found that any query containing a [] seriously interferes with the search results.

bluepower (2008-12-22): May I ask what software you used to draw your diagram?

rrddd (2008-12-26 16:53): OK

1111 (2008-12-27 17:39): What software was your diagram drawn with? I'd also like to know. [Zhang Yan replied: 4.1]
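The delta index and is_delete scheme described in the replies above can be expressed as a sphinx.conf fragment roughly like this; it is a sketch only, and every table, field, index, and path name here is an assumption, not taken from the original configuration.

```
# Hypothetical sphinx.conf fragment; all names are illustrative.
source src_delta : src_main
{
    # The MySQL incremental table is the data source for the delta index.
    sql_query     = SELECT id, cat_id, is_delete, title, content FROM delta_table
    sql_attr_uint = cat_id
    sql_attr_uint = is_delete      # 0 = live, 1 = marked deleted
}

index idx_delta : idx_main
{
    source = src_delta
    path   = /var/data/sphinx/idx_delta
}
```

Queries would then filter on is_delete = 0, and the periodic merge could skip rows marked as deleted with an invocation along the lines of `indexer --merge idx_main idx_delta --merge-dst-range is_delete 0 0 --rotate`.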
