By big-data processing I mean the situation where data has to be searched while it is also receiving highly concurrent inserts, deletes, and updates. I remember working on a project at XX with millions of rows, where a single search query could make you wait for minutes. That got me wondering how companies like Tencent and Shanda, with hundreds of millions of accounts, can respond so quickly, so I looked into how large-scale data processing has evolved on the internet.
For large volumes of data on the internet, processing generally evolves through the following stages:

Stage 1: all of the data is loaded into a single database. Once the volume grows, this runs into exactly the query problem described above, so you look for a way out.

Stage 2: add a caching layer, for example memcached. Caching helps, but it treats the symptom rather than the cause; the data volume is still too large, which leads to the next step.

Stage 3: master-slave replication, where the master handles writes and the slaves handle reads. This suits a database that is bottlenecked on reads, but it is still not enough, so...

Stage 4: vertical partitioning, splitting the data into separate databases by business domain. Each remaining data set is still far from small, so...

Stage 5: horizontal partitioning (sharding). This works well; I remember that a previous project at Hing also sharded horizontally along these lines, by time, and splitting even more finely would probably have worked better (see the sketch after this list).

Stage 6: use NoSQL. For how NoSQL does this, refer to Google's Bigtable.
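As an aside on stage 5, here is a minimal sketch of routing rows to horizontal shards; the shard count and connection strings are illustrative assumptions, not details from any of the systems mentioned above.

```python
# Minimal sketch of horizontal partitioning (sharding) by user id.
# The shard count and connection strings below are hypothetical examples.

SHARD_COUNT = 4
SHARD_DSNS = [f"mysql://db{i}.example.com/user_db" for i in range(SHARD_COUNT)]

def shard_for(user_id: int) -> str:
    """Route a user to one of the horizontal shards by simple modulo."""
    return SHARD_DSNS[user_id % SHARD_COUNT]

# Usage: every read and write for this user goes to the same shard.
print(shard_for(123456789))   # -> mysql://db1.example.com/user_db
```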
In fact, the main purpose of this article is to explore how NoSQL handles large volumes of data:
NoSQL performs writes in memory and flushes the in-memory data to disk periodically or when certain conditions are met. On that basis it addresses several needs: highly concurrent reads and writes, access to massive amounts of data, and horizontal scalability of the database.
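To make the write-in-memory, flush-to-disk-later idea concrete, here is a minimal sketch; the file name and flush threshold are illustrative assumptions, and a real store would also flush on a timer and on shutdown.

```python
import json

# Minimal sketch of "write to memory, persist to disk later".
# The file name and flush threshold are illustrative assumptions.

class WriteBehindStore:
    def __init__(self, path="data.json", flush_every=100):
        self.path = path
        self.flush_every = flush_every   # flush once this many writes accumulate
        self.data = {}                   # all writes land here first
        self.dirty = 0

    def set(self, key, value):
        self.data[key] = value           # memory-only write: fast, but volatile
        self.dirty += 1
        if self.dirty >= self.flush_every:
            self.flush()                 # a real system would also flush on a timer

    def flush(self):
        with open(self.path, "w") as f:  # persist the current in-memory state
            json.dump(self.data, f)
        self.dirty = 0
```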
In terms of the CAP theorem, NoSQL sacrifices consistency and opts for AP, guaranteeing only eventual consistency.
The disadvantages are also obvious:
1. When a machine crashes, the data in memory is lost; a shared-memory scheme is one way to mitigate this.
Addendum: this point can be expanded on. One way to achieve it is through shared cluster memory.
Cluster memory: based on the quorum NRW idea. With N machines in the cluster, every write must be acknowledged by at least W nodes to succeed, and every read must consult at least R nodes; as long as W + R > N, the read set overlaps the most recent write and you get the correct value. The data is redundantly replicated, but it is replicated in memory, and memory operations over the network are still faster than direct disk operations.
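A minimal sketch of that quorum idea, using the usual W + R > N formulation; the node count and the key/value are illustrative assumptions, and a real system would stop after collecting enough acknowledgements rather than writing to every replica.

```python
# Minimal sketch of quorum (NRW) reads and writes: with N replicas, a write
# must reach W nodes and a read must consult R nodes; choosing W + R > N
# guarantees the read set overlaps the latest write.
# Replicas are simulated as dicts; values carry a version number.

N, W, R = 3, 2, 2                      # illustrative choice with W + R > N
nodes = [dict() for _ in range(N)]     # each dict stands in for one replica's memory

def write(key, value, version):
    acks = 0
    for node in nodes:                 # sketch: write everywhere, count acks
        node[key] = (version, value)
        acks += 1
    return acks >= W                   # success only if at least W replicas took the write

def read(key):
    replies = [node.get(key) for node in nodes[:R]]   # consult R replicas
    replies = [r for r in replies if r is not None]
    return max(replies)[1] if replies else None       # newest version wins

write("user:1", "Alice", version=1)
print(read("user:1"))                  # -> Alice
```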
The other way to guarantee data consistency is a log: every write operation is recorded in a log before it is applied in memory. Many databases work this way, Redis for example.
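A minimal sketch of that logging idea, in the spirit of an append-only file like the one Redis offers; the log file name and record format are illustrative assumptions.

```python
import json

# Minimal sketch of write-ahead logging: every write is appended to a log on
# disk before the in-memory structure is updated, so the memory state can be
# rebuilt by replaying the log after a crash.
# The log file name and record format are illustrative assumptions.

LOG_PATH = "writes.log"

def logged_set(store: dict, key, value):
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"op": "set", "key": key, "value": value}) + "\n")
        log.flush()                      # durable first...
    store[key] = value                   # ...then apply to memory

def replay(store: dict):
    """Rebuild the in-memory store from the log after a restart."""
    try:
        with open(LOG_PATH) as log:
            for line in log:
                entry = json.loads(line)
                store[entry["key"]] = entry["value"]
    except FileNotFoundError:
        pass                             # no log yet: nothing to replay
```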
2. Memory is limited: if the volume of write operations is too large, the limited memory will eventually be exhausted.
Workaround: Bigtable's approach is to merge operations on the same key in memory, so that, for example, update a='A' followed by update a='B' can be merged directly into a single write, with Bloom filters used to avoid unnecessary lookups.

Basic theoretical background
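A minimal sketch of merging repeated updates to the same key in an in-memory write buffer, so only the latest value is ever flushed; the class and key names are illustrative assumptions, and the Bloom-filter side (skipping needless lookups) is not shown here.

```python
# Minimal sketch of coalescing writes before they reach disk: repeated
# updates to the same key collapse to the last value, so only one entry
# per key is ever flushed. Names here are illustrative.

class WriteBuffer:
    def __init__(self):
        self.pending = {}                # key -> latest pending value only

    def update(self, key, value):
        self.pending[key] = value        # overwrites any earlier pending update

    def flush(self):
        merged = self.pending            # e.g. {'a': 'B'} after two updates to 'a'
        self.pending = {}
        return merged                    # a real store would write these to disk

buf = WriteBuffer()
buf.update("a", "A")
buf.update("a", "B")                     # merged with the previous update
print(buf.flush())                       # -> {'a': 'B'}
```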
NoSQL rationale: memory is the new disk, and disk is the new tape.
Relational databases implement ACID transactions: atomicity, consistency, isolation, durability.
CAP theorem: consistency, availability, partition tolerance.
Most NoSQL databases do not support transactions, SQL, and so on, so relational databases are still needed. In-memory databases also come up in this context; overall, for simple workloads NoSQL is faster than an in-memory database, but NoSQL's biggest drawbacks remain the lack of transactions and the lack of SQL queries.