Hong Qiangning, chief architect of Douban, talks about Douban's technical architecture

Source: Internet
Author: User

Summary
How do you handle high concurrency and heavy traffic? How do you ensure data safety and database throughput? How do you change table structures when the tables hold massive amounts of data? What are the characteristics of DoubanFS and DoubanDB, and how are they implemented? During QCon Beijing 2009, InfoQ China had the opportunity to interview Hong Qiangning and discuss these topics.

Personal profile
Hong Qiangning graduated from Tsinghua University in 2002 and is currently the chief architect of Beijing Douban Interactive Technology Co., Ltd. He and his technical team are committed to using technology to improve people's cultural life and quality of life, and have studied website architecture, performance, and scalability in depth. Douban was named the best technology application website in China for 2006.

About the conference
QCon (the QCon Enterprise Software Development Conference) is a top-tier global technology event hosted by InfoQ, the website of the C4Media group, and held annually in London, San Francisco, Beijing, and Tokyo. Since it was first held in London in March 2007, nearly ten thousand architects, project managers, team leaders, and senior developers from finance, telecommunications, the Internet, aerospace, and other fields have attended QCon.

InfoQ: Hello everyone, this is Rai Xiang from InfoQ China, at the first QCon Beijing conference. Sitting next to me is Hong Qiangning from Douban. Hello Qiangning, please introduce yourself and how you came to Douban.

Hong Qiangning: I joined Douban in 2006, probably in March. I was programmer No. 02 at Douban; No. 01 is A Bei. I am now Douban's chief architect, responsible for Douban's technical development.

InfoQ: I remember the community once discussed Douban's ability to handle high concurrency. How many users and how much traffic does Douban have now, and how long did it take to reach this level?

Hong Qiangning: I have not been online today, so I don't know whether we have reached 3 million users yet. If not, we soon will, maybe today, maybe tomorrow. The 3 million refers to registered users; beyond that there are unregistered users in the tens of millions. Visits should now be about 20 million per day.

InfoQ: With traffic like that, Douban's high-concurrency capability is clearly strong. Could you introduce Douban's architecture from a technical perspective?

Hong Qiangning: That is a rather big topic, and I already covered it in my talk earlier. Put most simply, Douban can be divided into two large pieces. One is the front-end web: a user's actions in the browser trigger a series of operations, data is fetched from the database, and an HTML page is rendered and returned to the user. The other is the back end: Douban has a strong data mining team that analyzes and combines the data users generate every day, produces recommendations for each user, and stores them in the database. The front end then fetches that data in real time and displays it to the user.

InfoQ: If you were to redesign it, which parts do you think would need improvement?

Hong Qiangning: Douban's architecture on the web side uses several technologies: the front end is Nginx and lighttpd, in the middle is the Quixote web framework, and behind that are MySQL and DoubanDB, which we developed ourselves. Apart from Quixote, these are all among the most popular and cutting-edge technologies. Quixote is a bit old, so if we were to redesign, we might reconsider that part. Django, Pylons, and others from the Python community could be considered. Inside Douban we generally use web2py, a very lightweight web framework, which is also a very good choice, though it may require doing a bit more work ourselves. But a complete redesign is unlikely.

InfoQ: Caching is certainly a very effective way to relieve the pressure of high concurrency. What is the cache hit rate at Douban, and what is the strategy?

Hong Qiangning: The memcached hit rate is generally around 97%, which should be fairly high. The strategy is actually fairly simple: whenever we perform an operation that is expensive in time or resources, such as a database query, we store the result in memcached as a Python object, and the next time that data is needed it is taken directly from the cache. In choosing what to cache we follow one guideline as far as possible: the operation must be time-consuming or resource-consuming, and the result must be reused. Something that is expensive but used only once is pointless to cache. That is roughly how we make sure that what is cached is really useful, and it also improves the hit rate.

InfoQ: Another effective measure against heavy traffic is partitioning the database. How does Douban do this?

Hong Qiangning: Douban has not yet reached the point where it needs to shard its databases. Our most common approach now is partitioning by function. We divide the tables among separate databases, and there are now 4 of them. Each table belongs to one database, and each database has two servers, a master and a slave.
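
As a rough illustration of partitioning by function, the sketch below maps each table to exactly one named database, each with a master and a slave, behind a single lookup function. Every name in it (`TABLE_TO_DB`, `DATABASES`, `pick_server`, and the table and host names) is hypothetical and not taken from Douban's code. Sending writes to the master and reads to the slave is also an assumption; the interview does not say how the two servers share the load.

```python
# Hypothetical table-to-database map: each table lives in exactly one database.
TABLE_TO_DB = {
    "user": "db_user",
    "entry": "db_entry",
    "collection": "db_collection",
    "comment": "db_misc",
}

# Hypothetical server list: every database has one master and one slave.
DATABASES = {
    "db_user": {"master": "user-m:3306", "slave": "user-s:3306"},
    "db_entry": {"master": "entry-m:3306", "slave": "entry-s:3306"},
    "db_collection": {"master": "coll-m:3306", "slave": "coll-s:3306"},
    "db_misc": {"master": "misc-m:3306", "slave": "misc-s:3306"},
}


def pick_server(table, for_write=False):
    """Return the server address that should handle a query on `table`."""
    db_name = TABLE_TO_DB[table]  # raises KeyError for an unknown table
    role = "master" if for_write else "slave"
    return DATABASES[db_name][role]


if __name__ == "__main__":
    print(pick_server("entry"))                  # read goes to the slave
    print(pick_server("entry", for_write=True))  # write goes to the master
```

Because callers only ever go through one function, a change in where a table lives touches the map and not the query sites.
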
Partitioning by function in this way reduces the pressure on the database. Of course, that is only the current scheme. As the tables grow and their row counts reach a certain level, we will also have to partition horizontally; that is certain. On the technical side, before operating on the database, our code first obtains a database cursor through a single method that everything goes through, and in the future we will decide inside that method which database to fetch from. The framework for this is already in place, but the partitioning itself has not been done yet.

InfoQ: What is your main database solution?

Hong Qiangning: On the database side we mainly use MySQL. MySQL has a problem: large text fields hurt its performance. If the amount of data is too large, it squeezes the memory available for indexes. Our remedy was to build a scalable key-value database called DoubanDB. We put large text fields that don't need indexing into DoubanDB, and MySQL holds only the relational information that needs indexes. This reduces the pressure on MySQL and protects its performance.

InfoQ: What is Douban's strategy for data safety and database throughput?

Hong Qiangning: First, DoubanDB keeps a copy of every piece of data on three nodes, so the failure of any one of them does not affect requests for the data. MySQL uses a dual-master scheme with 1 to 2 slaves, so there are three to four copies in MySQL. That gives us assurance.

InfoQ: You mentioned MySQL's dual-master scheme. Does it cause any problems, for example with synchronization?

Hong Qiangning: The dual-master scheme is a classic MySQL solution, and we use it largely to deal with synchronization delay. When switching over there will be some synchronization delay, but MySQL replication is actually reasonably fast, so during a switch we tolerate waiting a few seconds for synchronization. Our switch script waits a little while before switching.

InfoQ: How big are Douban's tables?

Hong Qiangning: That is hard to say, because the tables differ. Our biggest table is the Entry table of "Nine O'Clock", which holds all the articles fetched by the "Nine O'Clock" crawler and should now have about 40 million rows. There are many other tables with millions of rows, and some, including the collection table, are in the tens of millions.

InfoQ: With that much data, changing a table's structure must be troublesome. A common case, such as adding a new index, can take several hours. Has Douban run into this, and how do you solve it?

Hong Qiangning: This problem has been very painful for us. We once altered a table without thinking it through, and it took a very long time. After that we learned: for any change to a table, we first try it on a test database to see whether the time it takes is within an acceptable range. If it is, say a few minutes, we set it up as a scheduled task and run it in the middle of the night. If the time is intolerable, we have to use other technical means. Our current approach is generally to create a new table, synchronize data into it from the old table, and write incoming data to both tables until the two are exactly the same, then drop the old table. That is roughly how it works.

InfoQ: You mentioned that you designed DoubanDB yourselves, and there is also DoubanFS. What is the relationship between the two?

Hong Qiangning: DoubanFS came first. We started with MogileFS to solve our scalable image storage problem, but MogileFS depends on a heavy database, which became its performance bottleneck. To solve this, we developed DoubanFS, which finds nodes by hashing. After that we found a new problem: large text fields in the database were also hurting performance.
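
Finding nodes by hashing, rather than by looking them up in a central database, can be sketched with a small consistent-hash ring. This is a minimal illustration and not DoubanFS or DoubanDB code: the `HashRing` class, the `read` helper, the node names, the use of MD5, and the number of virtual points per node are all assumptions. The replica count of three and the fall-through to the next node on a failed read follow the description given elsewhere in this interview.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, points_per_node=64):
        # Place several virtual points per node to spread keys more evenly.
        self._ring = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]
        self._node_count = len(set(nodes))

    def nodes_for(self, key, replicas=3):
        """Return up to `replicas` distinct nodes, walking the ring in order."""
        wanted = min(replicas, self._node_count)
        i = bisect.bisect(self._positions, _hash(key))
        found = []
        while len(found) < wanted:
            node = self._ring[i % len(self._ring)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found


def read(key, ring, fetch):
    """Try each replica in turn; `fetch(node, key)` is supplied by the caller."""
    for node in ring.nodes_for(key):
        try:
            return fetch(node, key)
        except IOError:
            continue  # this replica failed, so try the next one
    raise IOError("all replicas failed for key %r" % key)
```

A key always maps to the same ordered list of nodes as long as the node set is unchanged, so no lookup table has to be stored anywhere.
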
So, building on DoubanFS, we replaced the bottom layer, made some adjustments, borrowed ideas from Amazon's Dynamo, and built DoubanDB, and we put the text fields into DoubanDB. After that we in turn reimplemented the FS on top of DoubanDB. That is roughly the process.

InfoQ: In the DoubanFS and DoubanDB implementations, for data safety, or content redundancy...

Hong Qiangning: Three copies. This is configurable, and it is currently configured for 3 copies.

InfoQ: What mechanism does DoubanDB use?

Hong Qiangning: Put simply, DoubanDB works like this: it is a key-value database, so whether you write or read, you supply a key and use it to find the value. The key is hashed, consistent hashing determines which node it belongs to, and the write or read goes to that node. From that node we follow the hash ring in order to find the second and third nodes. A write makes sure all three nodes are written. A read can use any one of them, and if a read from one fails, it automatically switches to the next.

InfoQ: What technology underlies DoubanDB?

Hong Qiangning: DoubanDB's underlying storage is Tokyo Cabinet, a lightweight and efficient key-value database. We built the distributed layer on top of it.

InfoQ: There are other options, such as Berkeley DB (BDB) and CouchDB. Why did you choose Tokyo Cabinet?

Hong Qiangning: The simplest reason is that it is fast enough. BDB is actually similar to it, and BDB is more powerful. What we need here is a reliable, efficient key-value store; either could be swapped in for the other as long as the interface is unified. CouchDB is a different thing: it is a document database and does more than key-value work. For example, it has the concept of views, which you can query. Tokyo Cabinet has none of this, and we don't need those features for the time being. CouchDB is a very interesting database that we may use in other areas, and we are studying it as well.

InfoQ: From what we have just discussed, your web front end uses Nginx and lighttpd.
Both are very popular front ends, and the two camps often argue. Why does Douban mix them together?

Hong Qiangning: That is for historical reasons. We did not deliberately favor either one. Both are very good web servers, lightweight and efficient. At the beginning we used lighttpd. Then we ran into some problems, which in fact were not lighttpd's fault, but at the time we suspected lighttpd, so we tried Nginx, found it good as well, and the structure was kept. Nginx is better for both developers and users. Let me give an example: restarts. Douban's web servers are actually restarted fairly often. We have a health check script that regularly checks whether the site is working normally, and if it is not, takes protective measures, including a restart. lighttpd's restart is a very crude kill. Nginx has a reload mechanism that finishes what it is doing before restarting, which is much better. So we are using Nginx more and more. Nginx configuration files are also more comfortable to write than lighttpd's.

InfoQ: Douban now has a large user base, and mining such massive data well is certainly not easy. Could you talk about how the mining is implemented, from a technical perspective?

Hong Qiangning: Douban has a dedicated algorithm team whose main job is data mining. I may not be able to go into the technical implementation. I can only describe roughly how data mining is combined with the front end so that users see the results. Every day users' activity on Douban generates a lot of data: what they viewed and what they collected ends up in the database or the access logs. Every day this information is transferred to the algorithm team's machines, and a matrix is built from it: what you have viewed, what you have done.
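
Turning a day's activity records into a user-by-item matrix could look roughly like the following. The record format `(user_id, item_id, action)`, the action weights, the function name `build_matrix`, and the dict-of-dicts layout are all assumptions made for illustration; the interview does not describe the algorithm team's actual data structures.

```python
from collections import defaultdict

# Hypothetical weights: collecting an item counts more than viewing it.
ACTION_WEIGHT = {"view": 1.0, "collect": 3.0}


def build_matrix(records):
    """Build a user-by-item matrix as {user: {item: score}}.

    Only user/item pairs that actually occur are stored.
    """
    matrix = defaultdict(lambda: defaultdict(float))
    for user_id, item_id, action in records:
        matrix[user_id][item_id] += ACTION_WEIGHT.get(action, 0.0)
    # Convert to plain dicts so later lookups do not create empty entries.
    return {user: dict(items) for user, items in matrix.items()}


if __name__ == "__main__":
    log = [
        ("u1", "book:1", "view"),
        ("u1", "book:1", "collect"),
        ("u2", "book:2", "view"),
    ]
    print(build_matrix(log))
```

Storing only the pairs that occur keeps memory proportional to the amount of activity, which matters because a typical user touches only a tiny fraction of all items.
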
The algorithm team maintains a very efficient sparse-matrix library and uses it to try various approaches and see whether the results are good. Once a result is found to be good, it is written into the database. Then, when a user visits, the front end takes the recommendation data out of the database, filters it (for example, things you have already read are no longer shown to you), adjusts it, and finally shows it to the user. That is basically the logic.

InfoQ: From what you have described, Douban uses a great many components. Are almost all of them open source frameworks?

Hong Qiangning: All of them are open source.

InfoQ: I am sure you have gained a lot from the wisdom of the community. Has Douban given anything back to the open source community?

Hong Qiangning: Yes. Our main form of contribution is patches. We use a lot of open source software, which inevitably has problems of various kinds. We try to solve them ourselves and send our solutions back to the developers. A typical example is libmemcached, a C client for memcached. It is very popular, basically the official C client. It actually had quite a few bugs, which we found in use and fixed. Some of our team members are now direct developers on that project. Another example is the Mako template library, which many people in the Python community use. We use it too. We found its performance somewhat weak and spent a lot of time optimizing it, and those optimizations have been accepted and released in later versions of Mako. Douban also has some open source projects of its own. The most important is the Douban API client, which is hosted on Google Code, where many volunteers participate and help us improve it. From another angle, Douban also has close ties with the domestic open source community.
The announcement of Douban's launch was posted on the mailing list of the open source organization CPUG, and many Douban members are also CPUG members who help answer questions and discuss issues on the list. That is another way of giving back.

InfoQ: What is Douban's development team like?

Hong Qiangning: Our development team is now 11 people, both full-time and part-time, and it is fairly relaxed. We use agile methods, but not strictly by the book. Inside Douban we try to bring out everyone's creativity. For example, working hours at Douban are free: you decide when to come in and when to leave. If you want to stay home and write code, you can send a message to the mailing list saying you are not coming in today, and work from home. There is a lot of discussion every day. In the Douban office we have our own separate area with a whiteboard, where we can discuss things at any time. Every week we also hold a technical exchange meeting, where we take turns presenting what we have been looking at recently and what we have learned. These things are very useful for the team's communication and growth.

InfoQ: It seems Douban is a fairly open team, driven by technology and interest.

Hong Qiangning: We want to keep it that way.

InfoQ: Does anyone in the audience have a question?

Another reporter: Following up on the community topic: Douban has accumulated a lot, and many things have taken shape. Have you considered opening up some projects?

Hong Qiangning: We do have that plan. Take DoubanDB: when we started the project, the intention was to open source it once it was done. It is not open source yet because the project is still changing. Because of development time constraints, it is currently tied too tightly to Douban itself. We are constantly adjusting it and are still in that process.
When a suitable time comes, we will separate it from Douban's data so that it becomes an application that can be installed and run independently, and then release it. I think we should be able to do this fairly soon.

InfoQ: Thank you, Qiangning, for accepting our interview, and congratulations on your very successful talk at the conference today.

Hong Qiangning: Thank you.

Original link and video address: HTTP://WWW.INFOQ.COM/CN/INTERVIEWS/DOUBAN-HQN
