With the rise of Web 2.0 websites, non-relational databases have become an extremely hot new field, and non-relational database products are developing very rapidly. Traditional relational databases struggle to cope with Web 2.0 sites, especially very large, highly concurrent, purely dynamic SNS-type sites, and have exposed many problems that are hard to overcome, for example:
1. High performance - the need for highly concurrent database reads and writes
Web 2.0 sites generate dynamic pages and serve dynamic information in real time based on each user's personalized data, so they basically cannot use static page generation. As a result the concurrent load on the database is very high, often reaching tens of thousands of read and write requests per second. A relational database can just about cope with tens of thousands of SQL queries per second, but with tens of thousands of SQL writes per second the disk I/O can no longer keep up. In fact, even ordinary BBS sites often have highly concurrent write requests, for example JavaEye's real-time tracking of online user status, hit counts for popular posts, and vote counting, so this is a fairly common requirement.
2. Huge storage - the need for efficient storage of and access to massive data
SNS sites such as Facebook, Twitter, and FriendFeed produce a massive volume of user activity every day. Take FriendFeed as an example: it reached 250 million user activity entries in one month. For a relational database, running SQL queries against a table with 250 million records is extremely inefficient, to the point of being unbearable. Another example is the user login system of a large site such as Tencent or Shanda, with hundreds of millions of accounts, which a relational database also finds difficult to cope with.
3. High scalability && high availability - the need for highly scalable and highly available databases
In a web-based architecture, the database is the hardest part to scale horizontally. When the number of users and visits to an application keeps growing, your database cannot extend its performance and load capacity simply by adding more hardware and service nodes, the way web servers and app servers can. For many sites that must provide uninterrupted 24-hour service, upgrading and expanding the database system is very painful and often requires downtime for maintenance and data migration. Why can't a database scale by continually adding server nodes?
Faced with the "three highs" described above, relational databases run into obstacles that are hard to overcome, and for Web 2.0 sites many of the main features of relational databases often have no use, for example:
1. Database Transaction Consistency Requirements
Many real-time web systems do not require strict database transactions. Their requirements for read consistency are low, and in some cases the requirement for write consistency is not high either. Transaction management therefore becomes a heavy burden when the database is under high load.
2. Real-time write and read requirements
In a relational database, a query issued immediately after inserting a record is guaranteed to read that record. Many web applications do not need such strict real-time behavior. For example, after I (Robbin of JavaEye) post a message, it is completely acceptable for my subscribers to see it only a few seconds, or even more than ten seconds, later.
3. Complex SQL queries, especially multi-table joins
Any web system with a large volume of data avoids joins across several large tables, as well as complex SQL report queries of the data-analysis kind. SNS sites in particular steer clear of these from the requirements and product design stage. More often there are only primary key lookups on a single table and simple conditional paging queries on a single table, so the power of SQL goes largely unused.
Relational databases are therefore not such a good fit for these increasingly common scenarios, and non-relational databases came into being to solve such problems. In the last two years all kinds of non-relational databases, especially key-value stores, have sprung up in dazzling numbers. A NoSQL conference was held abroad not long ago, where many NoSQL databases were presented. Counting those that were not presented but are already well known, there are at least 10 open source NoSQL databases, such as:
Redis, Tokyo Cabinet, Cassandra, Voldemort, MongoDB, Dynomite, HBase, CouchDB, Hypertable, Riak, Tin, Flare, Lightcloud, KiokuDB, Scalaris, Kai, ThruDB, ...
Some of these NoSQL databases are written in C/C++, some in Java, and some in Erlang. Each has its own strengths, and there are too many to look at them all, so I can only pick out the more distinctive and promising products to study. These NoSQL databases can be broadly grouped into three categories:
I. Key-value databases for extremely high read and write performance: Redis, Tokyo Cabinet, Flare
The main feature of high-performance key-value databases is their highly concurrent read and write performance. Redis, Tokyo Cabinet, and Flare are all key-value databases written in C, and all three perform very well. Beyond performance, each also has its own distinctive features:
1. Redis
Redis is a very new project that has only just released version 1.0. It is essentially an in-memory key-value database, much like memcached: the whole database is loaded into memory and operated on there, and the data is periodically flushed to disk by an asynchronous operation. Because it works purely in memory, Redis performs extremely well, handling more than 100,000 read and write operations per second. It is the fastest key-value DB I know of.
Performance is not Redis's only strength. Its greatest appeal is that it can store list and set data structures and supports a range of operations on them: pushing and popping data at both ends of a list, taking a range of a list, sorting, and so on, plus set operations such as union and intersection. In addition, the maximum size of a single value is 1GB, unlike memcached, which can only store 1MB per value. Redis can therefore be used to build many useful features, for example using its lists as FIFO doubly linked lists to implement a lightweight, high-performance message queue service, or using its sets to build a high-performance tag system. Redis can also set an expire time on stored key-values, so it can serve as an enhanced version of memcached as well.
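As a rough illustration of the two uses mentioned above, the sketch below models a list-backed FIFO queue and a set-backed tag lookup in plain Python. It does not talk to Redis at all: `deque` and `set` merely stand in for Redis lists and sets, and the names `enqueue`, `dequeue`, `tag`, and `posts_with_all_tags` are hypothetical helpers, not Redis commands.

```python
from collections import deque

# Stand-in for a Redis list: push on one end, pop from the other gives FIFO order.
queue = deque()

def enqueue(message):
    """Push a message onto the left end of the list."""
    queue.appendleft(message)

def dequeue():
    """Pop the oldest message from the right end, or None if the queue is empty."""
    return queue.pop() if queue else None

# Stand-in for Redis sets: one set of post ids per tag name.
tags = {}

def tag(post_id, *names):
    """Add post_id to the set kept for each tag name."""
    for name in names:
        tags.setdefault(name, set()).add(post_id)

def posts_with_all_tags(*names):
    """Intersect the per-tag sets to find posts carrying every given tag."""
    sets = [tags.get(name, set()) for name in names]
    return set.intersection(*sets) if sets else set()

enqueue("job-1")
enqueue("job-2")
first = dequeue()  # the oldest message comes out first

tag(101, "nosql", "redis")
tag(102, "nosql", "mongodb")
tag(103, "redis")
both = posts_with_all_tags("nosql", "redis")
```

The point of the sketch is only that both features reduce to a handful of list and set operations, which is why a store that offers those structures natively makes them cheap to build.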
The main disadvantage of Redis is that database capacity is limited by physical memory, so it cannot be used for high-performance reads and writes over massive data. It also has no native scaling mechanism and relies on the client to distribute reads and writes. The scenarios Redis suits are therefore mainly high-performance operations and computation on relatively small volumes of data. Sites currently using Redis include GitHub and Engine Yard.
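To make "relies on the client to distribute reads and writes" concrete, here is a minimal sketch of client-side sharding, assuming the simplest possible scheme: hash the key and take it modulo the number of nodes. The node addresses and the `node_for`, `put`, and `get` helpers are made up for illustration, and in-memory dicts stand in for the servers.

```python
import hashlib

# Hypothetical node addresses; in practice these would be real servers.
NODES = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]

def node_for(key, nodes=NODES):
    """Pick a node by hashing the key; the same key always maps to the same node."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# In-memory dicts stand in for the per-node databases.
stores = {node: {} for node in NODES}

def put(key, value):
    """Write the value to whichever node owns the key."""
    stores[node_for(key)][key] = value

def get(key):
    """Read the value back from the node that owns the key."""
    return stores[node_for(key)].get(key)

put("user:1:name", "robbin")
put("post:42:hits", 7)
```

With modulo hashing, changing the number of nodes remaps most keys, so growing the cluster means moving data around by hand. That is one way to see why the lack of a native scaling mechanism matters.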
2. Tokyo Cabinet and Tokyo Tyrant
TC and TT were developed by Mikio Hirabayashi of Japan and are mainly used at mixi.jp, Japan's largest SNS site. TC has been in development the longest and is now a very mature project. It is also the biggest hot spot in the key-value database field and is widely used on a great many websites. TC is a high-performance storage engine, while TT provides a multi-threaded, highly concurrent server. Its performance is also excellent: it can handle 40,000 to 50,000 read and write operations per second.
Besides key-value storage, TC also supports storing hashtable data types, which makes it look much like a simple database table. It supports column-based conditional queries, paging, and sorting, which amounts to the basic query functionality of a single table, so it can simply replace many relational database operations. This is one of the main reasons TC is so popular. There is a Ruby project, miyazakiresistance, that wraps TT's hashtable operations in an ActiveRecord-like interface, which is very pleasant to use.
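The kind of single-table query described above can be sketched in a few lines of plain Python. This is only a model of the idea (records as hashes of column to value, filtered by one column, then sorted and paged); it does not use the TC API, and the `query` function and its parameters are hypothetical.

```python
# Each record is a primary key mapped to a hash of column name -> value,
# which is roughly the shape of data the paragraph above describes.
table = {
    "u1": {"name": "alice", "age": 31, "city": "Tokyo"},
    "u2": {"name": "bob", "age": 25, "city": "Osaka"},
    "u3": {"name": "carol", "age": 28, "city": "Tokyo"},
    "u4": {"name": "dave", "age": 35, "city": "Tokyo"},
}

def query(table, column, value, order_by, page, page_size):
    """Return primary keys of rows where row[column] == value, sorted and paged."""
    matches = [pk for pk, row in table.items() if row.get(column) == value]
    matches.sort(key=lambda pk: table[pk][order_by])
    start = (page - 1) * page_size
    return matches[start:start + page_size]

page_one = query(table, "city", "Tokyo", order_by="age", page=1, page_size=2)
page_two = query(table, "city", "Tokyo", order_by="age", page=2, page_size=2)
```

A condition, a sort order, and a page window cover a large share of what typical web pages ask of a database, which is the sense in which such a store can replace many relational operations.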
In production at mixi, TC/TT stores more than 20 million records while supporting tens of thousands of concurrent connections, so it is a battle-tested project. TC delivers very high concurrent read and write performance with a reliable data persistence mechanism, and it also supports hashtable structures that resemble relational tables, along with simple conditional, paging, and sorting operations. It is an excellent NoSQL database.
The main disadvantage of TC is that once the data volume reaches the hundreds of millions of records, concurrent write performance drops sharply. The article "NoSQL: If Only It Was That Easy" reports that when 160 million records of 2-20KB each were inserted into TC, write performance began to fall off dramatically. It seems TC's performance starts to degrade significantly once there are more than a hundred million records. Judging from the mixi figures provided by TC's author, no such obvious write bottleneck had appeared at tens of millions of records.
Here is a simple performance evaluation of memcached, Redis, and Tokyo Tyrant by Tim Yang, for reference only.
3. Flare
TC was developed at mixi, Japan's largest SNS site, and Flare was developed at GREE (gree.jp), Japan's second largest SNS site, which is rather interesting. Put simply, Flare adds scaling to TC. It replaces the TT part with its own network server written for TC. Flare's main feature is its support for scaling: it places a node server in front of the network servers to manage multiple backend server nodes, so database service nodes can be added and removed dynamically, and failover is supported as well. If your scenario has to scale, Flare is worth considering.
Flare's only drawback is that it supports only the memcached protocol, so when you use Flare you cannot use TC's table data structure and can only store data in TC's key-value structure.
II. Document-oriented databases for massive storage and access: MongoDB, CouchDB
The main problem that document-oriented non-relational databases solve is not high-performance concurrent reads and writes, but good query performance while guaranteeing massive data storage. MongoDB is developed in C++ and CouchDB in Erlang:
1. MongoDB
MongoDB sits between relational and non-relational databases. Of the non-relational databases it is the most feature-rich and the most like a relational database. The data structure it supports is very loose, a JSON-like format called BSON, so it can store fairly complex data types. Mongo's biggest feature is its very powerful query language, whose syntax somewhat resembles an object-oriented query language. It can do almost everything a single-table query in a relational database can, and it also supports indexing the data.
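To give a feel for querying loosely structured documents, the sketch below implements a toy matcher in plain Python. It is not MongoDB's implementation and does not use a MongoDB driver. The query shape (a dict of field conditions, with `$gt` for "greater than") imitates MongoDB's query documents, but only equality and `$gt` are handled here, and `matches` and `find` are hypothetical helper names.

```python
# Documents are loosely structured: fields can differ from one document to the next.
posts = [
    {"_id": 1, "author": "robbin", "hits": 120, "tags": ["nosql", "redis"]},
    {"_id": 2, "author": "robbin", "hits": 15},
    {"_id": 3, "author": "tim", "hits": 300, "tags": ["memcached"]},
]

def matches(doc, query):
    """True if doc satisfies every field condition in query."""
    for field, cond in query.items():
        if field not in doc:
            return False
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 100}
            if "$gt" in cond and not doc[field] > cond["$gt"]:
                return False
        elif doc[field] != cond:    # plain equality
            return False
    return True

def find(collection, query):
    """Return every document in the collection that matches the query."""
    return [doc for doc in collection if matches(doc, query)]

popular = find(posts, {"author": "robbin", "hits": {"$gt": 100}})
```

Because the query is itself a data structure rather than a SQL string, it composes naturally in application code, which is part of what makes this style of query language feel object-oriented.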
Mongo mainly addresses the efficiency of access to massive data. According to the official documentation, once the data volume reaches 50GB or more, Mongo's access speed is more than 10 times that of MySQL. Mongo's concurrent read and write efficiency is not especially outstanding: according to the official performance tests, it can process roughly 5,000 to 15,000 read and write requests per second. I (Robbin) also plan to test Mongo's concurrent read and write performance properly when I have time.
Because Mongo is mainly meant to support massive data storage, it also comes with an excellent distributed file system, GridFS, which can store huge amounts of data. I have seen some comments saying that GridFS performance is not that good, though, so this still needs testing to verify.
Finally, because Mongo supports complex data structures and has powerful query capabilities, it is very popular, and many projects are considering MongoDB as a replacement for MySQL in web applications that are not especially complex. For example, "Why we migrated from MySQL to MongoDB" describes a real migration from MySQL to MongoDB: the data volume had become too large, and after moving to Mongo the speed of data queries improved significantly.
MongoDB also has a Ruby project, MongoMapper, a MongoDB interface modeled on Merb's DataMapper. It is very simple to use, almost identical to DataMapper, and very powerful.
2. CouchDB
CouchDB is by now a very well-known project and hardly needs an introduction. I have no interest in it, though, mainly because CouchDB only provides an HTTP REST-based interface. Judged purely on concurrent read and write performance, CouchDB is therefore very poor, which made me lose interest in it immediately.
III. Distributed-computing-oriented databases for high scalability and availability: Cassandra, Voldemort
Scale-oriented databases actually address a different problem area from the two kinds of database above. Such a database must first of all be a distributed database system, in which databases spread across different nodes together form one database service, and which uses this distributed architecture to provide online, elastic scalability, for example adding or removing data nodes without downtime. That is why Cassandra is often seen as an open-source alternative to Google BigTable. Cassandra and Voldemort are both developed in Java:
1. Cassandra
The Cassandra project was open-sourced by Facebook in 2008. Facebook itself then went on to use a separate branch of Cassandra that is not open source, while the open-source Cassandra is reportedly maintained mainly by Amazon's Dynamo team, and Cassandra is regarded as a kind of Dynamo 2.0. Besides Facebook, Twitter and digg.com are also using Cassandra.
The main characteristic of Cassandra is that it is not a single database but a distributed network service made up of many database nodes. A write to Cassandra is replicated to other nodes, and a read from Cassandra is routed to one of the nodes to be served. Scaling the performance of a Cassandra cluster is relatively simple: just add nodes to the cluster. I have read an article saying that Facebook's Cassandra cluster has more than 100 servers.
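The replicate-on-write, route-on-read behavior described above can be sketched with a small hash ring. This is an assumption-laden toy, not Cassandra's actual code or API: a consistent-hashing ring is one common way to decide which nodes hold a key, and the `Cluster` class, its `replicas` parameter, and the node names are all hypothetical.

```python
import bisect
import hashlib

def ring_hash(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Cluster:
    def __init__(self, node_names, replicas=2):
        self.replicas = replicas
        self.data = {name: {} for name in node_names}  # per-node storage
        self.ring = sorted((ring_hash(n), n) for n in node_names)

    def nodes_for(self, key):
        """Walk clockwise from the key's position and take `replicas` nodes."""
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect(positions, ring_hash(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def write(self, key, value):
        """Copy the value to every replica node for this key."""
        for node in self.nodes_for(key):
            self.data[node][key] = value

    def read(self, key):
        """Route the read to the first replica that holds the key."""
        for node in self.nodes_for(key):
            if key in self.data[node]:
                return self.data[node][key]
        return None

cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
cluster.write("user:1", "robbin")
holders = [n for n, store in cluster.data.items() if "user:1" in store]
```

The sketch leaves out everything that makes a real system hard, such as moving data when nodes join or leave, failure detection, and consistency between replicas. It only shows why a client can treat the whole cluster as one service.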
Cassandra also supports fairly rich data structures and a powerful query language, although compared with MongoDB its query capabilities are slightly weaker. Evan Weaver, head of Twitter's platform architecture team, has written a very detailed introduction to Cassandra: http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/
Taken as a single node, Cassandra's concurrent read and write performance is not especially good. Some articles report measuring fewer than 10,000 read and write requests per second, and I have seen some comments questioning Cassandra on this point. But evaluating the performance of a single Cassandra node is not very meaningful. A real distributed database is a system made up of many nodes, and its concurrent performance depends on the number of nodes in the whole system and on routing efficiency, not just on the load capacity of a single node.
2. Voldemort
Voldemort is a distributed database system similar to Cassandra that targets the scaling problem. Cassandra comes from the SNS site Facebook, and Voldemort comes from LinkedIn. SNS sites really have contributed a great many NoSQL databases: Cassandra, Voldemort, Tokyo Cabinet, Flare, and so on. There is not much information available about Voldemort, so I have not studied it in much depth. According to the official figures, Voldemort's concurrent read and write performance is also quite good, at more than 15,000 reads and writes per second.
From Facebook developing Cassandra and LinkedIn developing Voldemort, we can see how strongly large SNS sites abroad need distributed databases, and database scalability in particular. As I (Robbin) mentioned earlier, the web and app layers are relatively easy to scale horizontally, and only the database is a single point that is hard to scale. Facebook and LinkedIn have now explored a promising direction in distributing non-relational databases, which is the main reason Cassandra is so popular right now.
NoSQL databases are an exciting field today. New technologies and products keep appearing and overturning the technical assumptions we have grown used to. Having learned a little about it, I (Robbin) find myself deeply absorbed. The NoSQL field is broad and deep, and I (Robbin) have only scratched the surface. I (Robbin) wrote this article partly to share what I have learned from my own research and partly in the hope that friends with experience in this area will join the discussion.
As far as my (Robbin's) personal interests go, distributed database systems are not technology I can actually put to use, so I do not plan to spend time studying them in depth. The other two areas (high-performance NoSQL DBs and mass-storage NoSQL DBs) interest me a great deal, especially Redis, TT/TC, and MongoDB, so I will write three articles covering these three databases in detail.