Original article: http://www.csdn.net/article/2010-07-26/277273
Following the success of the first TUP event, the second TUP event, jointly organized by CSDN and Programmer magazine, took place as scheduled in Hongyun Hall 2 of the Beijing Liting Huayuan Hotel. With Web 2.0 technology as its theme, the event focused on the hot fields of social networking, microblog architecture, and real-time search, providing attendees with a fully open platform for exchange on technology, product design, user experience, and product R&D in these fields. Even as a paid salon, attendance kept climbing, and more than 300 people came to the event.
Zhang Tie'an, Technical Manager at Renren
The following is the speech by Zhang Tie'an, Technical Manager at Renren:
Zhang Tie'an: What I will share with you today is Renren's feed system architecture. We will cover some new technologies and open-source projects, which I hope will help you in your future work. First, let me describe the main functions this system serves in an SNS. When I post a log (blog entry) on Renren, my friends and fans should see it in a timely and efficient manner, my friends should be able to reply quickly, and we should be able to interact, so I must ensure the system is efficient and stable. For an SNS or microblog site, handling the explosive effect of a special event is very important. I am not a football fan, but during the World Cup a few days ago I had gone to bed when, at two o'clock, my phone kept ringing with alerts. I wondered what was going on; I thought a colleague had updated a service, then realized there might be a very popular match that night. The next morning I learned the German team had scored, and the system had hit exactly that kind of trigger. The same thing happened at the start of Zhao Benshan's sketch in last year's Spring Festival Gala: the whole system fired an explosive alarm. So our system must absorb the pressure of many such emergencies and remain sufficiently stable.
In addition, a large part of the data in our system comes from the various services of the website, some from portal sites, and some from other websites that cooperate with us; the Open Platform also lets many third-party applications and connected sites send data into our feed system. Our input is therefore very heterogeneous, and we need a good data specification to ensure the system can accept different types of data. We also provide several outputs: the newsfeed list on the Renren home page, the personal home page, a PC client called Renren Desktop, and a mobile client. Different businesses have different presentation requirements; the mobile display, for instance, does not want everything, only selected parts, with some options. From every angle, the current design of this system and its connections with the various businesses are complicated, resulting in high overall system complexity.
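As a hypothetical illustration of such a data specification (all field names are invented for this example), items arriving from any source might be normalized into one canonical record before entering the feed system:

```python
from dataclasses import dataclass, field

@dataclass
class FeedRecord:
    """One canonical shape for items from blogs, partners, or open-platform apps."""
    feed_id: int
    feed_type: str          # e.g. "log", "status", "photo", "third_party"
    author_id: int
    time: int               # unix timestamp
    summary: str            # short text shown in the feed
    source: str = "web"     # originating service or partner site
    extra: dict = field(default_factory=dict)   # per-type payload

def normalize_blog_post(raw: dict) -> FeedRecord:
    """Adapter from one source's format into the canonical record."""
    return FeedRecord(feed_id=raw["id"], feed_type="log",
                      author_id=raw["uid"], time=raw["created"],
                      summary=raw["title"][:60])

print(normalize_blog_post({"id": 42, "uid": 7, "created": 1280102400,
                           "title": "My World Cup night"}))
```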
Next I would like to talk about the challenges our system faces. A website like Renren has many active users, tens of millions in a day. We can estimate (of course these numbers may not be exact) that around one thousand users generate content every second, so the system may need to process billions of pieces of raw feed data. Now consider the characteristics of a feed. When I change my status, delivering that information to everyone who should receive it is a fan-out problem, so a feed spreads widely. If I have 100 friends, it spreads to 100 people; if I am a celebrity, far more people will see it.
Feed items have another characteristic: suppose I post a log and two of my friends find it interesting and share it. For someone who is a friend of both, two nearly identical shares would appear on the same page, which is a problem. So we adopt a strategy of merging related feed items, or replacing, sorting, and combining them. In addition, the largest source of requests to Renren is login requests. Finally, as I mentioned, feeds must be filtered according to different business requirements.
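As a rough illustration of that merge idea (not Renren's actual code; the item fields are hypothetical), here is a minimal sketch that groups feed items referring to the same original log and collapses them into one entry listing all sharers:

```python
from collections import OrderedDict

def merge_feed_items(items):
    """Collapse feed items that share the same origin_id into one entry.

    Items with the same origin_id (e.g. two friends sharing one log)
    are merged; the merged entry keeps the earliest time and all actors.
    """
    merged = OrderedDict()
    for item in items:
        key = item["origin_id"]
        if key not in merged:
            merged[key] = {"origin_id": key,
                           "actors": [item["actor"]],
                           "time": item["time"]}
        else:
            merged[key]["actors"].append(item["actor"])
            merged[key]["time"] = min(merged[key]["time"], item["time"])
    return list(merged.values())

feed = [
    {"origin_id": 42, "actor": "Alice", "time": 100},
    {"origin_id": 42, "actor": "Bob",   "time": 105},  # same log, shared again
    {"origin_id": 43, "actor": "Carol", "time": 101},
]
print(merge_feed_items(feed))  # log 42 appears once, shared by Alice and Bob
```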
Next, let's talk about two options in the system design: push mode and pull mode. What is the difference? In push mode, when an event is generated, I copy it n times and deliver it to everyone who should receive it. Pull is the other approach: when a user logs in, the home page must display the interesting new items from all friends, so I look up the user's friend list, fetch the feed list of each of those people, and sort and merge them. Push makes reads very fast: after the push, the data is immediately available, the list is ready at login, and retrieval is quick. But there is a problem. Suppose we have several hundred million users but only tens of millions are active; the remaining hundreds of millions may come back once every two weeks, or once every six months. Data pushed to them may never be seen at all, which wastes a lot of resources. Pull mode does not have this problem, but it has another: request volume is large, and when a user logs in the data must be returned quickly, yet the computation at that moment is heavy. Weighing everything, since we need a system with high real-time performance, we chose push mode, with some trade-offs in places to remove unnecessary system overhead.
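To make the contrast concrete, here is a minimal, illustrative sketch of the two modes (the data structures are assumed for the example, not Renren's implementation): push fans an item out to every follower's inbox at write time, while pull assembles the timeline from each friend's outbox at read time.

```python
from collections import defaultdict

followers = {"star": ["u1", "u2", "u3"]}   # who follows whom
friends   = {"u1": ["star"]}               # whose feeds u1 reads

inbox  = defaultdict(list)   # push mode: per-reader list, filled at write time
outbox = defaultdict(list)   # pull mode: per-author list, read at login time

def publish_push(author, item):
    # Write-time fan-out: O(#followers) writes, O(1) read later.
    for reader in followers.get(author, []):
        inbox[reader].append(item)

def publish_pull(author, item):
    # Just record the item; the cost is paid at read time.
    outbox[author].append(item)

def read_push(user):
    return inbox[user]

def read_pull(user):
    # Read-time merge: gather every friend's outbox and sort by time.
    merged = [it for f in friends.get(user, []) for it in outbox[f]]
    return sorted(merged, key=lambda it: it["time"])

publish_push("star", {"text": "hello", "time": 1})
publish_pull("star", {"text": "hello", "time": 1})
print(read_push("u1"), read_pull("u1"))
```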
This is the current shape of the feed system. The first part is feed dispatch: after I post something, everyone related to me must be told about it, and this is done by the dispatch service. Then there is the newsfeed index service, which holds everything related to feed indexing, including user feedback, sorting methods, and friend relationships, the things tied to the circle of friends in an SNS. For example, some friends are very close to you; your relationship with your wife is presumably very close, and there are also friends you often play with, whose relationship with you is relatively close. When sorting feeds we take this into account and weight your wife's status updates toward the top; after all, instructions from the leadership must get an immediate response.
Also related to feed sorting are the ranking algorithms and the social graph. We classify content by topic, a bit like Baidu Baike: we know which users are interested in music and which in technology or politics, and these signals feed into the ranking through some offline computation. Next is the minifeed, the list of items a single user has posted. Then there is the feed content itself: when I post a log, the feed shows a brief summary of a few dozen characters. The index data for feeds is huge, and it matters greatly to us: when a user logs in, we query his index first and then fetch the content. If the in-memory cache were lost, nothing could be found for the user's page, so we need persistence: IndexDB writes the index lists to disk. Finally there is our rendering engine. We have many inputs and outputs, and different outputs have different requirements; for example, the output format for the mobile phone is completely different from the client format, yet both are produced by this one system.
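As a toy illustration of affinity-weighted ranking (the weights and the formula here are invented for the example, not Renren's actual algorithm), one might score each item by combining recency with the closeness of the relationship to its author:

```python
import time

# Hypothetical closeness scores per friend; a spouse ranks highest.
affinity = {"wife": 1.0, "close_friend": 0.7, "classmate": 0.3}

def score(item, now=None):
    """Combine author affinity with recency decay (invented formula)."""
    now = now or time.time()
    age_hours = (now - item["time"]) / 3600.0
    recency = 1.0 / (1.0 + age_hours)          # newer items score higher
    return affinity.get(item["author"], 0.1) * recency

def rank_feed(items):
    return sorted(items, key=score, reverse=True)

now = time.time()
feed = [
    {"author": "classmate",    "text": "lunch photo", "time": now - 600},
    {"author": "wife",         "text": "new status",  "time": now - 7200},
    {"author": "close_friend", "text": "shared log",  "time": now - 60},
]
for it in rank_feed(feed):
    print(it["author"], it["text"])   # the wife outranks the classmate
```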
This is a simplified diagram of the feed system; it is not the whole of our current online system, only the most important part. The smiling face is a happy user: he posts a log, and the log content goes through the log module, which ties into the website's blog service, into the feed system. First the dispatch service processes the data, then delivers the content to three different places: first the newsfeed indexes, second the minifeed indexes, and third the content cache, where the feed content is sent to a cluster. On the persistence side, the minifeed volume is small enough to keep in database tables: we hash it into 100 tables by the last two digits of the user ID, and this hash strategy spreads the tables across machines to share the load. Now for the logic of fetching feeds when a user logs in to Renren: after login the browser hits our web servers for new items. These are not web servers in the traditional sense; their defining feature is that they support high user concurrency and respond very fast. Our entire site has only four such feed web servers, and they serve both the push and the pull paths; pulling the website data is the pull mode. The work done here is essentially matching the feed data against templates and merging them into HTML fragments for the page.
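A minimal sketch of that table-sharding idea (the names are illustrative): route each user's minifeed rows to one of 100 tables chosen from the last two digits of the user ID, then spread the tables across database machines.

```python
NUM_SHARDS = 100

def minifeed_table(user_id: int) -> str:
    """Pick a shard table from the last two digits of the user ID."""
    return f"minifeed_{user_id % NUM_SHARDS:02d}"

def shard_host(user_id: int, hosts: list) -> str:
    """Spread the 100 tables across database machines."""
    return hosts[(user_id % NUM_SHARDS) % len(hosts)]

hosts = ["db1", "db2", "db3", "db4"]
for uid in (12345, 67890, 424242):
    print(uid, "->", minifeed_table(uid), "on", shard_host(uid, hosts))
```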
Now for some technical details: first the distribution system, second the cache, third persistent storage, and finally rendering.
Almost everything in the system as we have designed it is related to open source. The first piece is ICE, the communication framework our team relies on most heavily; it gives us a good cache cluster and good network communication for data interaction, and in fact many of our systems are built on it. Then there is memcached; an SNS company that didn't use it could hardly call itself Web 2.0, and we use it too, with a proxy layer between it and the application layer, a bit like a proxy server, implementing load-balancing policies and so on. Next is Google Protocol Buffers for object serialization and deserialization, which is excellent; Google uses it internally, and I think it is very good. Then come compression of binary data, multi-index structures, and the storage engine for massive data, all of which I will come back to.
Next, the feed distribution system. When a user posts a feed item, the data sent into our system includes the item itself, some merge policies, and some weighting data. Such an object is large, perhaps several hundred bytes to several KB. The first thing we do is split it: one part is the content data used for display, the other is the index data used for sorting and locating the item. Our internal index structure is only about 32 bytes, while the content is much larger, and the two are sent to different places: the index data goes to the newsfeed and minifeed index services. Consider sending one item in push mode: say one of my 100 friends posts a log. The index structure tells the system to append this new index to each friend's list, so I need to know which friends to deliver it to, and at what magnitude this happens. At one thousand posts per second, that is one thousand friend-list lookups per second, each fanning out to around 100 friends; a celebrity may have millions of fans. We fetch the friend list from the database only on the first lookup; subsequent queries go to memory and must not hit the database. More than a year ago, the earliest version of the system had no memory cache, and the DBAs told us we were running their databases red-hot every day; after we added the cache, things became very comfortable, with essentially no load on those machines. The third piece is the asynchronous thread pool. There are occasional pulse-like bursts, so we need some flow control: if one pulse produces ten thousand requests in a second, rather than handling them all at once I let them trickle through slowly. It does not matter if a user sees your status change a few seconds late; the two of you are not in front of the same computer, so he will not notice. Turning pulsed data into a smooth curve greatly improves the system's load capacity, so sizing the thread pool in the distribution layer is very important.
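A minimal sketch of that smoothing idea, assuming a bounded queue in front of a small worker pool (illustrative only): the burst is absorbed by the queue and drained at a steady rate instead of hitting the backend all at once.

```python
import queue
import threading
import time

tasks = queue.Queue(maxsize=10000)   # absorbs the pulse

def worker():
    while True:
        feed_item = tasks.get()
        # Deliver one index entry; the sleep stands in for real work
        # and caps the drain rate so the burst becomes a smooth curve.
        time.sleep(0.001)
        tasks.task_done()

# A small, fixed pool drains the queue at a bounded rate.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# Simulate a pulse: 10,000 requests arriving "at once".
start = time.time()
for i in range(10000):
    tasks.put({"feed_id": i})
tasks.join()
print(f"burst drained in {time.time() - start:.1f}s")
```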
Now let's talk about memory optimization in the feed cache, using the design pattern called Flyweight. The classic example in the book is text editing, as in WPS: each character has attributes such as font and size, but the same character may appear n times in one article, so globally we keep only one copy of the heavyweight description object, and each location that uses the character stores only a reference to it. In our system the probability of an object occurring repeatedly is very high, so this pattern is of great help to us, and after running into some performance problems we set out to apply it.
We divide a feed item into two kinds of data: content data and index data. Index data is relatively small, and we store it in a separate cache; from a macro perspective this already matches the Flyweight idea. If an item fans out to 100 or 500 people, each of them gets only an index object, and all those indexes point to the same underlying content. We use the same idea inside each index cache as well. For example, we deploy the index service across ten machines; if I have 100 friends, then on average each machine will index the same feed item about ten times, so we make the 32-byte index structure the smallest possible thing that points to the content, which saves further memory overhead.
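Here is a small, hypothetical sketch of that flyweight split: many per-user index entries share one content object, so the fan-out cost per recipient stays near the size of the index record.

```python
class FeedContent:
    """Heavyweight shared object: stored once per feed item."""
    def __init__(self, feed_id, summary):
        self.feed_id = feed_id
        self.summary = summary

content_cache = {}   # feed_id -> the single FeedContent instance

class FeedIndex:
    """Lightweight per-recipient record, analogous to the 32-byte index."""
    __slots__ = ("feed_id", "time", "flags")   # keep the footprint tiny
    def __init__(self, feed_id, time, flags=0):
        self.feed_id = feed_id
        self.time = time
        self.flags = flags
    def resolve(self):
        return content_cache[self.feed_id]     # all indexes share one content

content_cache[42] = FeedContent(42, "a few dozen characters of summary ...")

# Fan-out to 500 recipients costs 500 small indexes, one shared content.
inboxes = {uid: [FeedIndex(42, time=1000)] for uid in range(500)}
print(inboxes[7][0].resolve().summary)
```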
Then the index must support different business selection conditions. Some people ask how to build an index in memory. A database supports what is called multi-index: one table can have N different indexes, even composite indexes, but such a structure is rarely implemented in memory. What do we do if we need to index feed items along different dimensions? There is in fact a ready-made data structure for this (in C++, Boost.MultiIndex is one such library): the same set of objects can be indexed along different dimensions, and when an object is updated, every index over it adjusts automatically. Picture four objects of different shapes in a cloud: one index orders them by shape, another by size, yet they remain the same four objects. With this, the same objects can support different indexes, and we can conveniently implement multi-index structures.
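A toy sketch of the multi-index idea in Python (Boost.MultiIndex does this natively and incrementally in C++; here the bookkeeping is deliberately naive and purely illustrative): one object set, several orderings over it.

```python
class MultiIndex:
    """Keep one set of objects queryable along several dimensions."""
    def __init__(self):
        self.items = []

    def insert(self, obj):
        self.items.append(obj)

    def by(self, key):
        # Each "index" is just a different ordering over the same objects.
        # A real container would maintain these orderings incrementally.
        return sorted(self.items, key=lambda o: o[key])

feeds = MultiIndex()
feeds.insert({"time": 3, "author_id": 7, "type": "log"})
feeds.insert({"time": 1, "author_id": 9, "type": "photo"})
feeds.insert({"time": 2, "author_id": 2, "type": "status"})

print([f["time"] for f in feeds.by("time")])            # ordered by time
print([f["author_id"] for f in feeds.by("author_id")])  # same objects, new order
```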
Compressed in-memory storage saves significant memory. The chart on the right is a QuickLZ comparison; its compression and decompression speed is very good. Our method is to serialize the object and then compress it, which saves 30%-35% of memory in our system.
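A minimal sketch of serialize-then-compress (using Python's built-in zlib as a stand-in for QuickLZ and pickle as a stand-in for the real serializer; the ratios will differ from the 30%-35% quoted):

```python
import pickle
import zlib

def pack(obj) -> bytes:
    """Serialize, then compress, before placing the blob in the cache."""
    return zlib.compress(pickle.dumps(obj), level=1)  # favor speed over ratio

def unpack(blob: bytes):
    return pickle.loads(zlib.decompress(blob))

item = {"feed_id": 42, "summary": "a few dozen characters of summary " * 3,
        "author_id": 7, "time": 1280102400}
blob = pack(item)
print(len(pickle.dumps(item)), "->", len(blob), "bytes")
assert unpack(blob) == item
```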
Now let me explain why we built on memcached. First, we must support high concurrency: a single user page shows 30 feed items, which means fetching 30 objects and sending them to the front end for display, and at Renren's scale the page views may be tens of thousands per second, all of which the memory cache has to handle. The second reason is data volume. Server memory keeps growing: we once thought the company's 16 GB machines were huge compared with our PCs, a year later it was 32 GB and 36 GB, and now servers have 72 GB. As the amount cached in memory grows, we must keep query performance up; it may degrade somewhat with data volume, but not by much. On the other hand, when the cache no longer fits on one machine, or when some servers restart, we need to grow capacity by adding machines, so the whole cache must be resizable, machines must be easy to remove, and there must be some redundancy between the caches. Finally, we want manageable cache policies at scale: we have over a dozen cache servers now, and when we run caches across a hundred or more machines, management must stay convenient rather than requiring a full redeployment each time. We learned from Facebook and wanted to do something similar, which is the memcached proxy I mentioned. There are two open-source projects for this, but after investigating both we found them not particularly ideal. What motivated us to build our own was, first, that I used to work on client-server communication and had confidence in that kind of thing, and second, that the memcached protocol is very simple, so I was sure we could do it well. So we did, and the result was delivered.
The basic function is that we put all cache management at this proxy layer, including the policies, and we also have a CacheLoader. Our feed operations are keyed by ID, the same ID as in the database, so the CacheLoader can work well with memcached: when I want to cache something new, I only need to configure it in the CacheLoader, which saves developers from repeatedly writing loading services. As for why the CacheProxy is needed: without it, all the scattered cache servers would have to be wired into every client, which makes the cluster inconvenient for developers to use; and as our clients and other businesses keep growing, more and more people use this cluster, compounding the problem. With this layer in between, we avoid it.
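A minimal sketch of a proxy layer of this kind (illustrative only; real proxies such as Facebook's mcrouter do much more): the proxy owns server selection and falls back to a registered loader on a miss, so clients never see the server list.

```python
import hashlib

class CacheProxy:
    """Route keys to cache nodes and fall back to a loader on a miss."""
    def __init__(self, nodes):
        self.nodes = nodes                 # in-process dicts stand in
        self.loaders = {}                  # for memcached instances here

    def register_loader(self, prefix, loader):
        # CacheLoader idea: declare once how to load "feed:*" keys, etc.
        self.loaders[prefix] = loader

    def _node(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def get(self, key):
        node = self._node(key)
        if key in node:
            return node[key]
        prefix = key.split(":")[0]
        value = self.loaders[prefix](key)  # miss: load from the database
        node[key] = value
        return value

proxy = CacheProxy(nodes=[{} for _ in range(4)])
proxy.register_loader("feed", lambda key: {"id": key, "from": "db"})
print(proxy.get("feed:42"))   # first call loads, second call hits the cache
print(proxy.get("feed:42"))
```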
Next is the index persistence system. Why do we need it? Because a year ago, before we had it, we often ran into problems: whenever there was a major change to the feed system, we had to restart the index cache, and we could not simply store those indexes in a database, because the volume is huge. As I just said, if we produce more than 100 million feed items every day, and each person has some average number of friends, then the feed indexes generated each day number in the billions. How many database machines would storing them take? Perhaps hundreds. So if we want to handle on the order of tens of thousands to 100,000 index writes per second, we have to solve this problem ourselves. And if we do not, then whenever the in-memory index cache is lost, we must rebuild all the feed indexes from scratch.
Of the legendary solutions: MySQL is not up to it; the open-source alternatives are not fast enough; third, GFS could probably solve it, but that is not a system we can buy. When we set out we did some research, including what Sina and Baidu could support. Overall we needed to handle roughly 100,000 operations per second, and we generate gigabytes of index data every day. What is the approach? For ordinary machines, the immediately available random read/write capacity, the IOPS, is only about 800. Since the hard disk can only do that much, how do we cope? Disk reads fetch a block at a time, and our indexes are very small; if every index change were an individual write, we would waste enormous write capacity, so we must turn the large volume of random writes into sequential file writes. If we can turn all this randomness into ordered operations, ordinary machines can solve the problem.
In other words, to do this we had to work at the I/O level. We use asynchronous I/O and operate the hard disk directly, and we also did research with Intel on using SSDs to improve write and read performance.
We merge all writes, converting a large number of random writes into sequential ones. To merge, we first hold the random index writes in memory for a short lag, then write them out together and read from there. All writes and reads are recorded in log files, so when a machine goes down we can recover the data by replaying the logs. We use TT (Tokyo Tyrant) to hold the index. Why is it so fast? Because all of its data runs in memory, so access is as fast as a memory operation. TT supports a simple storage model, and we built some things on it. For the I/O model on the data nodes we chose asynchronous I/O, using direct I/O to bypass the OS page-cache policy, and finally we use SSDs to absorb the large volume of concurrent reads.
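A minimal sketch of the write-merging and log-replay idea (the file format and names are invented for illustration): buffer small random index writes in memory, flush them as one sequential append, and rebuild state by replaying the log after a crash.

```python
import json
import os

LOG_PATH = "index_writes.log"   # hypothetical append-only log file

class IndexWriter:
    """Buffer random index writes; flush them as one sequential append."""
    def __init__(self, path, batch_size=1000):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []

    def put(self, user_id, feed_id):
        self.buffer.append({"user_id": user_id, "feed_id": feed_id})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # One sequential write replaces many small random ones.
        with open(self.path, "a") as f:
            for rec in self.buffer:
                f.write(json.dumps(rec) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.buffer.clear()

def replay(path):
    """Rebuild the in-memory index by replaying the log after a crash."""
    index = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                rec = json.loads(line)
                index.setdefault(rec["user_id"], []).append(rec["feed_id"])
    return index

w = IndexWriter(LOG_PATH, batch_size=3)
for uid, fid in [(1, 100), (2, 100), (1, 101)]:
    w.put(uid, fid)
w.flush()
print(replay(LOG_PATH))   # {1: [100, 101], 2: [100]}
```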
This is the overall node layout. The IndexNode stores the mapping from user ID to the location of that user's latest data block. We keep the user-ID-to-index-block mapping for 500 million users in memory, and it takes less than 10 GB; TT guarantees that the whole file at least fits in memory, and we use a 32 GB machine to store it. In addition, the TT implementation uses shared memory: as long as the machine itself does not die, even if I kill the node service, the operating system keeps running, the memory is still there, and the system flushes the data back to disk. Below that are the data files; the diagram shows the structure of a data file, with file1 on the left and file2 on the right.
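A rough sketch of that mapping layer (sizes and layout invented for the example): a fixed-width array of (file id, offset, length) slots indexed by user ID keeps per-user lookups O(1) and the whole table small enough for RAM.

```python
import struct

# 16 bytes per user: file id (4) + block offset (8) + block length (4).
SLOT = struct.Struct("<IQI")

class IndexNode:
    """Map user_id -> location of that user's latest index block."""
    def __init__(self, max_users):
        self.table = bytearray(SLOT.size * max_users)

    def put(self, user_id, file_id, offset, length):
        SLOT.pack_into(self.table, user_id * SLOT.size, file_id, offset, length)

    def get(self, user_id):
        return SLOT.unpack_from(self.table, user_id * SLOT.size)

node = IndexNode(max_users=1_000_000)   # 500M users would be ~8 GB at 16 B each
node.put(12345, file_id=2, offset=4096, length=320)
print(node.get(12345))   # (2, 4096, 320)
```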
Finally, let's talk about template rendering. For data format consistency, feed input from many different businesses must arrive in a single uniform shape, and the rendering engine then turns that data into the different views each business needs. The technical solution is ctemplate, which provides efficient template expansion; it is Google's approach as well.
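A toy sketch of one-format-in, many-views-out rendering (using Python's built-in string.Template in place of ctemplate; the templates are invented for illustration):

```python
from string import Template

# One canonical feed record, rendered differently per output channel.
templates = {
    "web":    Template('<li><a href="/u/$author_id">$author</a>: $summary</li>'),
    "mobile": Template("$author: $summary"),
}

def render(item, view):
    return templates[view].substitute(item)

item = {"author": "Zhang San", "author_id": 7, "summary": "posted a new log"}
print(render(item, "web"))
print(render(item, "mobile"))
```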
I have covered a lot today. You are welcome to join our team to learn more. Thank you.
Question: The slides went by quickly. Among the common databases, why not MemcacheDB?
Zhang Tie'an: We do not understand it deeply enough. We could not push writes beyond about 200 entries, and we did not know what the problem was. We did not consider it reliable enough, because when a problem occurs we must be able to guarantee a fix, and we must know how to solve it ourselves.
Question: You said there are users with millions of fans, and you chose push mode. To push information to each of those millions of fans so they can obtain it quickly, the fan list must be kept in memory. So I want to know: if you put millions of fans in memory, how is that list organized?
Zhang Tie'an: It is a list, but when we materialize it we do not put everything in; we actually keep only a queue of tens of thousands. Why? Because we only care about delivering feeds to users who will actually read them. For millions of fans there is no way to apply a precise policy like the one for friends, and the list grows in real time, so we build a queue that should not be too long: I make it tens of thousands long, ordered by user login. After a user finishes browsing the site, he goes offline for dozens of minutes or more, and keeping him in the cache makes little sense, so a lot of data gets evicted.
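A small sketch of a bounded, recency-ordered fan queue (one plausible reading of this answer, not Renren's code): recently active fans stay in the queue, and the least recently seen are evicted once the cap is reached.

```python
from collections import OrderedDict

class ActiveFanQueue:
    """Keep only the most recently active fans, capped at max_size."""
    def __init__(self, max_size=50000):
        self.max_size = max_size
        self.fans = OrderedDict()          # fan_id -> last login time

    def on_login(self, fan_id, ts):
        self.fans.pop(fan_id, None)        # refresh position on each login
        self.fans[fan_id] = ts
        while len(self.fans) > self.max_size:
            self.fans.popitem(last=False)  # evict the least recently seen

    def push_targets(self):
        return list(self.fans)             # fans who should receive the push

q = ActiveFanQueue(max_size=3)
for fan, ts in [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("d", 5)]:
    q.on_login(fan, ts)
print(q.push_targets())   # ['c', 'a', 'd'] -- 'b' was evicted
```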
Question: You said the user data is kept in memory and the index is queried by ID. In your database, are the IDs and indexes also loaded into the cache by ID?
Zhang Tie'an: We use ICE for both memory caches.
Question: That is to say, there are disk queries, many of them by ID; are those disk queries also served?
Zhang Tie'an: There are not enough memcached machines to hold every feed item. When we need to fetch a thousand-item list, fewer than a thousand come back. Feeds have a long-tail effect: most hotspot data needs to be cached, but there is no need to cache data from very long ago.