Preface
Cloud computing has become a very popular concept recently. When I first heard the term "cloud", it seemed too mysterious: I did not know where it was used or how it was implemented, and I was always confused.
Fortunately, because of recent work, I studied and developed an internal system similar to a "cloud computing" infrastructure. Looking back afterwards at the cloud computing services offered by the two industry giants (Google and Amazon), I finally understood, from an implementation perspective, what the "cloud" really is. Based on my personal understanding, this article analyzes and compares the concept, architecture, and applicability of cloud computing in some depth, in the hope of helping you understand cloud computing architecture.
Part 1 What is cloud computing
Let's leave the standard definition of cloud computing to a Google search. Here is my own simplified understanding of what cloud computing is, starting with its causes. The primary cause is the conflict between the explosive growth of the data to be processed and the limited storage and computing capacity of individual machines (to borrow the phrasing used for China's current principal contradiction :). The volume of data keeps growing beyond anything a fixed bound can describe. Imagine storing and computing over the access logs of tens of millions of users, or calculating the PageRank of hundreds of millions of web pages: the data cannot be stored on one or a handful of storage servers, and one or a handful of computing servers cannot process it.
Of course, you may think of buying a handsome EMC storage array and a minicomputer such as HP's Superdome to get everything done, but they are expensive; only large banks, telecom companies, or the National Meteorological Administration can afford such money burners. So it is necessary to be able to use ordinary machines (such as cheap PCs assembled in Zhongguancun) to store and compute the data in a distributed manner.
You will surely say: isn't this just distributed computing? Yes. Cloud computing follows the same technical lineage as distributed computing, parallel computing, and grid computing, and even shares the same genes. They only look different on the surface, because cloud computing is a commercially packaged term: in essence, distributed storage and distributed computing technology has found a profit model, namely selling storage and computing capacity to third-party enterprises. The third party does not need to know which machine its data lives on or which machine is processing it. For them, the data is in the cloud and the computing is in the cloud, hence cloud computing.
Currently, two industry leaders, Amazon and Google, sell cloud computing services (I hear that Oracle and Apple are starting too, and EMC seems to have plans). The services they sell are basically the same, and the profit models are similar (for details, see the S3, EC2, or Google App Engine services on their websites). Although their technical architectures differ, cloud computing can be regarded as an organic combination of a "storage cloud" and a "computing cloud", that is, "cloud computing = storage cloud + computing cloud".
Part 2 Introduction to the storage cloud architecture
Storage cloud Concept
In my opinion, a storage cloud is a commercially packaged distributed storage system: it exposes only storage interfaces to third-party users, who buy capacity and bandwidth, while a large distributed storage system runs behind it. We will not say much about the business model; you can read about it on the providers' websites. My focus here is to compare and analyze the distributed storage systems used to build storage clouds. (If you know nothing about storage clouds, I suggest reading the introductions of the relevant papers before continuing, so that our discussion gets twice the result with half the effort.)
Comparison of storage cloud architectures: Dynamo vs Bigtable
Typical foundational storage cloud systems include Amazon's Dynamo and Google's Bigtable. Both are in commercial use (see S3 and Google App Engine), and both companies have published detailed implementation papers (the Dynamo paper in particular, which shows Amazon's generosity and self-confidence). Their architectures differ and their storage characteristics differ, but both are beautifully structured and technically admirable. Each has its own strengths, yet they arrive at the same destination.
Next, we analyze and compare them on several important aspects, including the requirements on stored data, architecture, expansion, load balancing, fault tolerance, and data access and query, to find out what lies behind the differences.
Data structuring
The first thing to mention is the difference in the kind of data the two systems store. Although both store data as key/value pairs, Dynamo tends to store raw data: the stored data is unstructured, and parsing the value is entirely up to the user's program. Dynamo does not recognize any structure in the data and treats it as opaque binary. Bigtable stores structured or semi-structured data (web data mixes structured and unstructured content and is therefore called semi-structured; I will not go into it here, so please search for it if the term is unfamiliar). Its value is structured, much like columns in a relational database, so it supports a certain degree of querying (for example, queries on a single column). In this respect Bigtable is closer to a database (similar, not equivalent; there are many discussions online about exactly how it differs from a relational database).
In addition, Bigtable stores data as strings, so rows are sorted by primary key or column (plus the automatically added timestamp) in string collation order. Dynamo does not store keys as strings: each key is converted by the MD5 algorithm into a 16-byte md5_key. You must therefore know the key to access the data, and you cannot scan a table (with a cursor) or run queries. Of course, it is not impossible to implement queries on top of Dynamo; some specific methods are discussed later.
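To make the point about hashed keys concrete, here is a minimal Python sketch. It only illustrates that an MD5 digest is 16 bytes long and carries no trace of the original key order; the key names and the helper `md5_key` are made up for illustration and are not taken from the Dynamo paper.

```python
import hashlib


def md5_key(key: str) -> bytes:
    """Return the 16-byte MD5 digest that stands in for the original key."""
    return hashlib.md5(key.encode("utf-8")).digest()


# Hypothetical keys that are adjacent when compared as strings.
keys = ["user:0001", "user:0002", "user:0003", "user:0004"]

# String order (what a Bigtable-style sorted store would scan).
by_string = sorted(keys)

# Digest order (what a store keyed by md5_key would see); it is unrelated to
# the string order, so a range scan such as "user:0001".."user:0003" has no
# contiguous region to walk.
by_digest = sorted(keys, key=md5_key)
```

A point lookup still works, because the client can recompute `md5_key(key)` for a key it already knows; only ordered scans and range queries are lost.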
Comparison of control and storage architecture
Dynamo uses a DHT (distributed hash table; see the relevant literature for details) as its basic storage architecture and concept. This architecture spreads data as evenly as possible around the ring, and every storage node is aware of the others (data must be forwarded within the ring and nodes detect each other's failures, so the nodes have to communicate). The system is highly self-managing: because no master node is needed for control, it has no hot spots and no single point of failure. As an aside, Sina's memcachedb (a modification of memcached that adds persistence) can be seen as the simplest embodiment of this architecture: after data enters the system, a DHT algorithm distributes it evenly to the storage nodes, and the storage engine, Berkeley DB, persists it to local disk.
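The ring idea can be sketched in a few lines of Python. This is a simplified consistent-hash ring, not Dynamo's actual implementation: the class name `HashRing`, the node names, and the choice to place each node at the MD5 of its name are assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on the ring using its 128-bit MD5 digest."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


class HashRing:
    """A toy DHT ring: each node owns the arc that ends at its position."""

    def __init__(self, nodes):
        self._points = sorted((ring_position(name), name) for name in nodes)
        self._positions = [pos for pos, _ in self._points]

    def preference_list(self, key: str, n: int):
        """Return the first n nodes met when walking clockwise from the key."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(n, len(self._points))
        return [self._points[(start + i) % len(self._points)][1]
                for i in range(count)]

    def node_for(self, key: str) -> str:
        """Return the node responsible for the key."""
        return self.preference_list(key, 1)[0]
```

Any client that knows the node list can compute `node_for(key)` on its own, which is why no master is needed to route a request.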
Bigtable's control plane adopts the traditional server farm model: it consists of one master server and multiple tablet (sub-table) servers. Data is stored as a sparse, multi-dimensional map, which can be viewed as a table made up of many columns; "sparse" means that a record does not need a value in every column. Its data (including indexes, logs, and record data) is stored on a distributed file system (DFS), where it lives on the individual nodes in the DFS's own file format. Compared with the self-managing DHT storage ring, Bigtable needs a master server to supervise the storage nodes (assigning tablets, detecting failures, and balancing load). In addition, the root of the index is stored centrally, so a client must read the index first (prefetching and caching can reduce the number of index reads). One drawback of this centralized control is that the system has a single point of failure, so that point must be made highly available, for example by writing recovery logs or keeping a standby machine. The advantage is that the system is more controllable and easier to maintain, and synchronizing data under centralized management is easy: updating metadata (such as data indexes or node routes) held in one central place is clearly much simpler than propagating metadata (such as membership, that is, the routing relationships between nodes) step by step through a gossip mechanism.
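As a rough illustration of the sparse, multi-dimensional map, the following Python sketch keeps values under (row, column, timestamp) and stores only the cells that exist. The class `SparseTable` and its methods are hypothetical names for this article; they are not Bigtable's API.

```python
class SparseTable:
    """A toy sparse map: row key -> column -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value, timestamp):
        """Store one cell version; absent columns simply take no space."""
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, end_row):
        """Yield rows with start_row <= key < end_row in string order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, {c: v[0][1] for c, v in self._rows[row].items()}
```

Because row keys are kept in string order, `scan` can walk a key range, which is exactly what a store keyed by MD5 digests cannot do.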
Fault Tolerance
Neither Dynamo nor Bigtable is a laboratory prototype or a demo meant to show off technology; both are products that carry real business. The first thing to consider is therefore machine cost. The most economical way is to use ordinary PC servers as storage machines (the current market price is about 2/3 yuan for a machine that stores 1 TB of data, naturally without a monitor, sound card, or other peripherals). However, anyone who processes big data knows that the stability and service life of IDE/SATA hard disks cannot match those of the SCSI disks in real servers (the other components are also far less stable and durable than server parts), and failure under load is common: according to Google, a cluster of 1,000 machines loses about one machine per day on average. Hardware failure was therefore treated as the normal case from the very beginning of the design; in other words, fault tolerance is a top design priority.
For these reasons, both Dynamo and Bigtable store data redundantly: each piece of data is copied into several replicas (the number can be set according to the importance of the data), and the replicas are placed on different machines so that when a machine goes down, other replicas remain available to keep serving requests. (An unexpected crash or a network failure is a temporary fault, while a hard disk failure is a permanent fault; a permanent fault requires recovery, that is, restoring the data from a replica.) Generally, storing three replicas is enough to be free of worry, because the probability of all three being damaged in the same period is no more than about 1/(1000*1000*1000), assuming each machine fails independently with a probability of about 1/1000 in that period.
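The three-replica estimate can be checked with a line of arithmetic. The sketch below assumes, as a simplification that is not stated in either paper, that machines fail independently and that each machine fails with probability 1/1000 in the period considered (roughly the "one machine per day in a cluster of 1,000" figure).

```python
from fractions import Fraction


def all_replicas_lost(p_single: Fraction, replicas: int) -> Fraction:
    """Probability that every replica is lost, assuming independent failures."""
    return p_single ** replicas


# Assumed per-machine failure probability in one period.
p = Fraction(1, 1000)

one_copy = all_replicas_lost(p, 1)      # 1/1000
three_copies = all_replicas_lost(p, 3)  # 1/1000000000
```

Correlated failures (a shared rack, switch, or power supply) break the independence assumption, which is why replicas are placed on different machines.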
Dynamo's read/write policy for redundant replicas is interesting. It defines three parameters: N, W, and R. N is the number of replicas of each record in the system, W is the number of replicas that must be written for a write to succeed, and R is the minimum number of replicas that must be read for each read request. Data consistency is guaranteed as long as W + R > N: the read set and the write set then always intersect, because at least W + R - N of the replicas read must also have been written, so a read inevitably sees the "last" updated replica. (Which replica is "last" is decided with timestamps or vector clocks: when the vector clock establishes a logical order it is used, otherwise the timestamp decides. See the Dynamo paper for details.)
Compared with our naive approach, this is both safer and more flexible. Our intuitive idea would be: if the system requires N redundant replicas, write all N every time, and serve a read from any one available replica. Dynamo's scheme is safer because it guarantees consistency. For example, suppose a client writes a record that has three replicas on three different nodes, but one node has a temporary fault, so the record is not written or updated there. When the record is read later with R = 2, at least one correct value is read. (By then the faulty node may have recovered, in which case the value on it is missing or stale; if it has not recovered, the read request simply cannot reach the replica on it.) With the traditional method we might read from the node that had the temporary fault and get a wrong record (old or nonexistent). So increasing W and R improves the safety of the system.
It is more flexible because N, W, and R can be configured to suit scenarios with different access patterns, speed requirements, and data safety needs: for write-heavy, read-light workloads, W can be set low and R high; for write-light, read-heavy workloads, W can be set high and R low.
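The overlap argument behind W + R > N can be verified by brute force for small N. The following Python sketch enumerates every possible write set and read set; the function names are illustrative only.

```python
from itertools import combinations


def min_overlap(n: int, w: int, r: int) -> int:
    """Smallest number of replicas shared by any write set and any read set."""
    replicas = range(n)
    return min(len(set(ws) & set(rs))
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every read set contains at least one freshly written replica."""
    return min_overlap(n, w, r) >= 1
```

For N = 3, W = 2, R = 2 the minimum overlap is 1, which equals W + R - N; with W = 1, R = 1 a read can miss the written replica entirely.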
Bigtable's fault tolerance is not described in detail in its paper. I think it hands the job over to the DFS: when a file chunk (64 MB) is written to a chunk server, the DFS propagates the chunk to the N-1 nearest chunk servers, ensuring that every chunk in the system has multiple replicas. The chunk locations are recorded in the file metadata on the master server. To access a file, the client first obtains the metadata and then fetches the data from an available chunk server, so the failure of one chunk server does not affect data integrity and the data can still be read. DFS fault recovery is also supervised by the master server, which copies replica chunks to restore the replicas that were on the failed machine.
It is worth mentioning Dynamo's method for handling temporary faults: it finds an available machine and temporarily writes the data into a temporary table on it; once the temporary fault has been repaired, the data in the temporary table is automatically written back to its original destination. The goal is that the system is always writable (even if only one machine in the cloud is available, a write request is not lost). Bigtable does not mention this requirement. Judging from its architecture, however, writes go through the DFS, so it should be able to come close to Dynamo's always-writable behaviour: the master picks a writable chunk server to receive the write request, so writes succeed as long as the master and at least one chunk server are available.
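The temporary-table mechanism can be sketched as follows. This is a deliberately simplified model: the `Node` class, the `hints` list standing in for the temporary table, and the explicit `deliver_hints` call are assumptions for illustration, not the real Dynamo protocol.

```python
class Node:
    """A toy storage node with a data table and a temporary (hint) table."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # entries of (intended_owner_name, key, value)


def write(key, value, owner, fallbacks):
    """Write to the owner, or park the write on the first live fallback."""
    if owner.up:
        owner.data[key] = value
        return owner
    for node in fallbacks:
        if node.up:
            node.hints.append((owner.name, key, value))
            return node
    raise RuntimeError("no node available to accept the write")


def deliver_hints(holder, nodes_by_name):
    """Write parked data back to owners that have recovered."""
    remaining = []
    for owner_name, key, value in holder.hints:
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.data[key] = value
        else:
            remaining.append((owner_name, key, value))
    holder.hints = remaining
```

The write is accepted as long as any node in the fallback list is alive, which is the "always writable" goal described above.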
Expansion Problems
For a storage system that is meant to appear unlimited in capacity, the need to expand cannot be avoided (expansion is needed not only when storage capacity runs out but also when the storage nodes cannot handle the concurrent load). Moreover, an online service must not stop during expansion, or must stop only as briefly as possible. An elegant expansion scheme is therefore one of the most noteworthy points of a storage cloud.
Let's start with the expansion strategy and implementation of the Dynamo system. Imagine how you would scale out a given data table on one machine: first split the table into two, then move one of the two to another machine. Splitting a table sounds simple but is laborious in practice. Whether the table is ordered by key range (as in the DHT ring) or by the key itself (as in Bigtable), you inevitably have to scan the whole table (a particularly resource-hungry operation that will certainly affect other services) in order to select an ordered portion of the data and move it to the new table, so that both resulting tables stay ordered.
To avoid this clumsy table scan, Dynamo divides the ring formed by the MD5 keys into intervals that are as fine-grained as possible, that is, into small segments (each segment corresponds to one data table on disk). A physical machine is not limited to one segment; it stores a group of segment tables covering consecutive intervals, so no table has to be split during expansion. For example, divide the ring into 1024 segments (a real deployment would use far more) and let each storage node maintain 64 segment tables; all the data can then initially be deployed on a storage ring of 16 machines. If a machine can no longer hold its 64 segment tables (or cannot sustain the current number of concurrent requests), some of its segment tables can be moved to a newly added machine; for example, moving 32 segment tables from the original machine to the new one completes the expansion. Migrating small tables like this avoids splitting tables altogether.
Of course, you will say that this kind of expansion has a limit: a machine's share can be halved only six times. Yes, so when the storage ring is first set up you have to estimate the total data volume, the number of expansions, and so on, but this is well worth it.
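The numbers in the example (1024 segments, 64 per machine, 16 machines, at most six halvings) can be reproduced with a short Python sketch. The machine names and the helper functions are hypothetical; only the arithmetic comes from the text.

```python
TOTAL_SEGMENTS = 1024
SEGMENTS_PER_MACHINE = 64


def initial_assignment():
    """Give each machine a run of consecutive segments (16 machines here)."""
    machines = TOTAL_SEGMENTS // SEGMENTS_PER_MACHINE
    return {
        f"m{i}": list(range(i * SEGMENTS_PER_MACHINE,
                            (i + 1) * SEGMENTS_PER_MACHINE))
        for i in range(machines)
    }


def split_machine(assignment, old, new):
    """Move the upper half of old's segment tables to a new machine.

    Whole segment tables are moved; no table is scanned or split.
    """
    segments = assignment[old]
    half = len(segments) // 2
    if half == 0:
        raise ValueError("a machine holding one segment cannot be split")
    assignment[old], assignment[new] = segments[:half], segments[half:]


def max_halvings(segments_per_machine: int) -> int:
    """How many times a machine's share can be halved before one segment is left."""
    count = 0
    while segments_per_machine > 1:
        segments_per_machine //= 2
        count += 1
    return count
```

With 64 segments per machine, `max_halvings(64)` gives 6, which is the expansion limit mentioned above; choosing more, smaller segments up front raises that limit.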
Besides the segment scheme, another thing worth learning from Dynamo is its requirement that service must not stop during expansion. We have also tried this kind of high-availability expansion design. The main task is to work out, clearly and carefully, the state machine that handles access requests during expansion (covering both data migration and route updates). In addition, so that normal access requests are not affected, all expansion routines run at low priority and proceed quietly in the gaps between normal read/write requests.
Google's paper is vague about how Bigtable expands, but a clear description can be found in a Bigtable-like system, Hypertable. Hypertable is an open-source C++ implementation of Bigtable. Records in Hypertable are grouped into tablets of bounded size (by default at most 200 MB each) that live on the DFS, and the DFS itself is scalable (new machines can be added to the server farm online), so expanding the total storage space of Hypertable is not a problem. Only when a sub-table (range) grows too large is it split at the middle key into two new tables, and the new sub-table holding the second half of the key range is migrated to another range server. Note that this way of splitting a table still requires scanning the table; in this respect I personally think it is not as smart and convenient as Dynamo. The management of Hypertable tables is not covered here; for more information, see the official documentation.
Load balancing
Load balancing (balancing data storage and access pressure) is a natural strength of the Dynamo system. Because it uses a DHT to spread data evenly over the nodes, there are no hot spots (or, if things heat up, every node in the ring heats up together), and the storage and access pressure on each node should be balanced (this is determined by the MD5 algorithm). In addition, Dynamo has the concept of a virtual node: a vnode can be seen as a resource container (similar to a virtual machine) in which storage runs as a service. The purpose of introducing vnodes is to make resource management granular. For example, if a vnode is allowed to manage only 5 GB of disk and a fixed amount of memory, the service running in it can use only that much. There are two obvious advantages. 1. It is convenient to manage heterogeneous machines with different configurations: machines with more resources host more vnodes, and machines with fewer resources host fewer. 2. It helps expansion. When a new node is added to the DHT ring, keeping the data uniformly distributed would require moving data all around the ring, which undoubtedly increases network traffic. The ideal approach is to expand every node in the ring at once, so that each new node only takes data from the node next to it. Adding just one or a few machines obviously cannot take load evenly from all the other storage nodes. So a physical machine is divided into multiple vnodes, which can be spread evenly among the other nodes in the ring. As machines are added one by one, the uniformity of the data gradually improves; this is a progressive balancing process.
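A small Python sketch shows how vnodes interact with the ring. It is a toy model under stated assumptions: each vnode is placed at the MD5 of a made-up label such as `machine#vnode3`, and the class `VNodeRing` is not part of any real system.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a ring position using its 128-bit MD5 digest."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


class VNodeRing:
    """A toy ring on which each physical machine appears as several vnodes."""

    def __init__(self):
        self._points = []  # sorted list of (position, machine)

    def add_machine(self, machine: str, vnodes: int):
        """Place `vnodes` virtual nodes for one machine on the ring."""
        for i in range(vnodes):
            point = (ring_position(f"{machine}#vnode{i}"), machine)
            bisect.insort(self._points, point)

    def owner(self, key: str) -> str:
        """Return the machine whose vnode is next clockwise from the key."""
        positions = [pos for pos, _ in self._points]
        idx = bisect.bisect_right(positions, ring_position(key))
        return self._points[idx % len(self._points)][1]

    def vnode_count(self, machine: str) -> int:
        return sum(1 for _, m in self._points if m == machine)
```

A stronger machine is simply given more vnodes, and when a machine joins, the only keys that change owner are the ones its new vnodes take over from their clockwise neighbours; nothing moves between the existing machines.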
Bigtable's load balancing is again based on the traditional server farm: a master server monitors the load of the tablet servers and migrates data according to the load of all of them. For example, a very frequently accessed tablet can be migrated to a tablet server that is under low pressure (the data itself still lives on the chunk servers of the DFS, which sit beneath the tablet servers in the hierarchy). For more information, see their papers. In general, there is not much innovation here.
Data access and query Problems
Both Dynamo and Bigtable support inserting key-value records and random lookups by primary key. As mentioned above, if you want range queries by column, Dynamo is powerless and only the Bigtable architecture will do. (But be aware that Bigtable is not as powerful as a relational database: queries work on a single column only, and it cannot run compound queries with conditions on several columns, let alone joins.) Bigtable is closer to a database, while Dynamo is a pure storage system.
Amazon has launched SimpleDB, a system that supports queries. It is similar to Bigtable but appears to be more powerful: it supports =, !=, <, >, <=, >=, starts-with, and, or, not, intersection, union, and other complex query operations. It is truly an outstanding product. Unfortunately, Amazon has not published a paper on its implementation as generously as it did for Dynamo, so one can only guess: some say it was rewritten in Erlang, some say it was developed on top of Dynamo, and some say it is an entirely new implementation. There are many theories and no way to verify them. Here I will only discuss how a similar query capability could be implemented on Dynamo.
The first thing that comes to mind is to add a schema to Dynamo, that is, to divide the value into logical columns so that index files can be built per column at storage time; queries on those columns then follow naturally. An index file can hold a mapping from column value to the set of matching primary keys, and it is stored as a file in a distributed file system. To query a column, first look up the set of primary keys in the index file, then fetch the records from Dynamo by primary key. This method has certain limitations, however: 1. The distributed file system must support concurrent modification of files (because index files change frequently), yet most distributed file systems, for reasons of consistency and efficiency, support only concurrent appends. It is therefore hard to update the index in real time while serving queries. The simple workaround is to update the index files periodically, with the side effect that query results are not fully up to date. 2. Queries are limited to single columns with pre-built indexes (you can of course build composite indexes); arbitrary columns or arbitrary combinations of conditions cannot be supported. I have not thought of a good solution to this.
Another method is to use a relational database as Dynamo's storage engine. If you have read the Dynamo paper, you may remember it mentions that the actual storage engine can be Berkeley DB or MySQL. Query operations over the storage ring can then be broken into pieces: the query is routed to every storage node, executed there separately, and the results are collected; requests that need sorting must be sorted centrally. This approach hands indexing and similar work to the relational database, and we only need to merge the results. We can go further and combine it with a data partitioning policy: instead of spreading data uniformly over the nodes by MD5 key, we distribute it by column value. For example, records whose address attribute is Beijing, Xi'an, or another place name are routed to different designated nodes; a query by place name can then be routed directly to the corresponding storage node, so queries need not be scattered across the whole ring and complex queries with column conditions complete more efficiently. You can also partition a column by ordered intervals. For an age column, for example, the intervals 0 to 10, 10 to 20, and 20 to 30 are each stored on different nodes; this ordered-interval deployment supports sorted queries on the column. But there is a price. The drawback of partitioned storage is that data is not evenly distributed, so load becomes unbalanced. You then need to add nodes to an overloaded partition; for example, increase the number of storage nodes responsible for the 20-30 interval to share the concurrency pressure.
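The scatter-and-gather idea can be sketched with Python's built-in `sqlite3` module standing in for the per-node relational engine. SQLite is used here only because it ships with Python; the `users` table, its columns, and the function names are invented for the example.

```python
import sqlite3


def make_node(rows):
    """Create one storage node backed by an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id TEXT PRIMARY KEY, city TEXT, age INTEGER)")
    conn.execute("CREATE INDEX idx_users_city ON users (city)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    return conn


def scatter_gather(nodes, where_sql, params):
    """Run the same query on every node, then merge and sort centrally.

    where_sql is trusted, developer-supplied SQL in this sketch; user input
    must only ever be passed through params.
    """
    results = []
    for conn in nodes:
        cursor = conn.execute(
            f"SELECT id, city, age FROM users WHERE {where_sql}", params)
        results.extend(cursor.fetchall())
    return sorted(results, key=lambda row: row[2])  # central sort by age
```

Each node answers from its own local index, and only the matching rows cross the network. With partitioning by column value, the loop over `nodes` would shrink to the one node responsible for the requested city.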
Part 3 current computing cloud architecture implementation
The business model of the storage cloud is to sell storage capacity, while the business model of the computing cloud is to sell computing capacity. The underlying technology of the storage cloud is distributed storage, and that of the computing cloud is distributed computing, or more precisely parallel computing: a large computing task is split up and distributed to nodes in the cloud, which compute in parallel, and the results are finally collected and consolidated (for example, sorted and merged). If the computing cloud is a step beyond parallel computing, the progress lies at one level only, namely virtualization of computing resources: all computing resources in the computing cloud are treated as one pool that can be allocated and reclaimed, and users purchase computing resources according to their actual needs.
This resource virtualization benefits from the recent rise of virtual machine technology. Using virtual machines for resource virtualization hides the heterogeneity of the hardware (whatever machines are put together, their computing resources can be quantified, added to the resource pool, and allocated dynamically). It also allows resources to be adjusted dynamically, which can greatly save computing resources in the cloud (dynamic adjustment means resizing resources without restarting the system, one of the most useful features of virtualization). The technology is similar to what we use when we install virtual machines on our own computers. The difference is that individual users virtualize the resources of one physical machine into several, whereas the cloud virtualizes the resources of many physical machines into one large pool that feels to the user like a single machine with enormous resources. But a virtualized resource pool only makes sense when the task can be computed in parallel. For example, take a computing cloud of 100 386-class machines that must process 1 TB of log data. If the log processing can be parallelized, each 386 can process 1/100 TB and the intermediate results can be merged into the final result. If the task cannot be divided, a large computing pool is useless. (The applications of cloud computing are therefore limited; at present it is most useful for websites, which have large data volumes but relatively simple processing.)
All in all: the computing cloud architecture can be viewed as parallel computing + resource virtualization.
Parallel computing architecture (Map/Reduce)
We will not discuss resource virtualization here; there may be a chance to cover it in detail in a dedicated article. Here we talk about parallel computing in the cloud. Parallel computing is an old topic, and MPI-based parallel software is everywhere. MPI exchanges data between tasks by message passing. The basic idea of parallel development is to divide a task into parts that can be completed independently, send them to the computing nodes to be computed separately, and then gather each node's result at a master node for the final summary; the interaction between nodes is done by message passing. The main problems parallel computing faces are: 1. Can the algorithm be divided into independent parts? 2. Fetching the input data and storing intermediate results is costly, because reading massive data puts heavy I/O pressure on the system. In Internet applications such as PageRank, for example, frequent reads of web page data from distributed storage are to a large extent the bottleneck on computing speed.
For the first problem, the computing architecture can do little about how to divide the work; the key lies in the algorithm itself. For the second problem, I/O pressure, the best solution so far is the Map/Reduce approach used by the Hadoop project. The idea is very simple: ship the computing program to the nodes where the data is stored and compute locally, avoiding the pressure of moving data across the network. This is not a new idea; there were many attempts long ago (for example, IBM once built a mobile agent project called Aglets, which distributed computing programs to nodes to compute and collect information). For today's massive data processing, however, this approach is undoubtedly the most attractive and cost-effective.
In short, map is the process of breaking data apart, and reduce is the process of merging the separated data. Take Hadoop's word count example: map turns [one, word, one, dream] into [{one, 1}, {word, 1}, {one, 1}, {dream, 1}], and reduce then turns [{one, 1}, {word, 1}, {one, 1}, {dream, 1}] into the result set [{one, 2}, {word, 1}, {dream, 1}]. The abstraction behind Map/Reduce is part of its essence, but it is not covered in this article (see functional programming languages or other materials). This article focuses on where map's data comes from.
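The word count example can be written out in plain Python. This is a single-process sketch of the two phases, not Hadoop code; the function names are chosen for illustration.

```python
from collections import defaultdict


def map_words(words):
    """Map phase: emit a (word, 1) pair for every input word."""
    return [(word, 1) for word in words]


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


pairs = map_words(["one", "word", "one", "dream"])
counts = reduce_counts(pairs)
```

Because each (word, 1) pair is independent, the map phase can run on many nodes at once, and the reduce phase only needs all pairs for the same word to reach the same reducer.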
Map/Reduce data source
At first glance, the data source for map is not a problem: it is simply a matter of reading local data (as we said, the computing program is shipped, as map's callback operator written in Java, to where the data is and executed there). In massive data processing scenarios, however, you must consider how this combines with the distributed storage system. The most natural match for Hadoop is a distributed file system (DFS): use the file system's metadata to find the nodes that hold each file block, send the callback operator there, and read the data sequentially from the local file system. For applications such as log file analysis this is very efficient, because log files are read sequentially and the file system's read-ahead can be fully exploited. Offline log analysis is the typical application of Map/Reduce.
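The "send the operator to the data" step can be sketched as a small placement function. This is an illustrative model only: the block names, node names, and the least-loaded tie-break are assumptions, and Hadoop's real scheduler is considerably more involved.

```python
def assign_map_tasks(block_locations):
    """Pick, for every block, a node that already stores a replica of it.

    block_locations maps a block id to the list of nodes holding a replica,
    as it would be read from the file system's metadata. Among the holders,
    the node with the fewest tasks so far is chosen.
    """
    load = {}
    plan = {}
    for block, holders in sorted(block_locations.items()):
        node = min(holders, key=lambda name: (load.get(name, 0), name))
        plan[block] = node
        load[node] = load.get(node, 0) + 1
    return plan


# Hypothetical metadata: three blocks of a log file and their replica holders.
locations = {
    "blk1": ["n1", "n2"],
    "blk2": ["n1", "n3"],
    "blk3": ["n1", "n2"],
}
plan = assign_map_tasks(locations)
```

Every task lands on a node that already has the block on local disk, so the map phase reads sequentially from the local file system and no input data crosses the network.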
However, we should also see that map/reduce has obvious limitations. First, if the input requirements are complex, for example the input is the result of querying a data set rather than a file read in sequence, you cannot use the Hadoop map/reduce framework directly. Second, for consistency reasons the distributed file system underneath does not support multiple concurrent writers, and data cannot be modified once written. These properties suit after-the-fact analysis such as log processing, but they make things difficult in scenarios where data is generated in real time. Whether distributed storage systems such as Dynamo and Bigtable can be used in a map/reduce environment therefore becomes a new requirement. My feeling is that Bigtable's storage structure does not make complex queries easy to run locally (for example, a multi-column matching query probably cannot be completed locally and can hardly avoid fetching remote data), and if cross-machine queries bring too much network I/O, that violates the original design intention of the map/reduce architecture for parallel computing. So can Dynamo meet the query requirements? Suppose you use the approach mentioned earlier for queries: define a schema for the records and store them in a traditional relational database on each storage node (you can create indexes on the required columns). The callback operator, which here is the query statement, is then delivered to the node, and the query runs locally in the traditional way. This satisfies both the original intention of map/reduce and the complex input requirements, without affecting real-time data generation. So I think a flexible and convenient parallel computing architecture can be built from Dynamo or a variant of it (in the partition mode described above) plus map/reduce.
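The "query statement as callback operator" idea can be tried out with Python's built-in sqlite3 module. This is a rough sketch under my own assumptions, not Dynamo's actual design: each in-memory SQLite database plays the role of one storage node, the users table and its columns are invented for the example, and the records are placed on the partitions by hand instead of by a real partitioning scheme.

```python
import sqlite3

def make_partition(rows):
    """Create one storage node: a relational table with an index on the queried column."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, city TEXT, age INTEGER)")
    db.execute("CREATE INDEX idx_users_city ON users (city)")
    db.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    return db

# Two hypothetical storage nodes, each holding part of the records.
partitions = [
    make_partition([("u1", "Beijing", 31), ("u2", "Shanghai", 24)]),
    make_partition([("u3", "Beijing", 45), ("u4", "Beijing", 19)]),
]

# The callback operator is the query itself: a multi-column match run locally.
QUERY = "SELECT COUNT(*), SUM(age) FROM users WHERE city = ? AND age >= ?"

def map_query(db, params):
    """Map step: run the query on one node and return its partial result."""
    count, total_age = db.execute(QUERY, params).fetchone()
    return count, total_age or 0  # SUM is NULL when no row matches

def reduce_results(partials):
    """Reduce step: merge the partial results from all nodes."""
    return sum(c for c, _ in partials), sum(t for _, t in partials)

partials = [map_query(db, ("Beijing", 20)) for db in partitions]
matched, age_sum = reduce_results(partials)
print(matched, age_sum)  # 2 76

# New data can arrive at any time; the next query simply sees it.
partitions[0].execute("INSERT INTO users VALUES (?, ?, ?)", ("u5", "Beijing", 28))
matched_after, _ = reduce_results([map_query(db, ("Beijing", 20)) for db in partitions])
print(matched_after)  # 3
```

Each node answers the multi-column query from its own index and returns two numbers, so the network carries partial results rather than records.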
Several star systems in cloud computing architectures
At present all kinds of cloud computing subsystems are springing up one after another. I will briefly mention a few projects I have looked into. If you are interested, you can follow them to keep up with where cloud computing is heading.
1. Bigtable and Dynamo have already been discussed above.
2. HBase is a sub-project of Hadoop. Like Bigtable, it is best suited to storing very sparse data (unstructured or semi-structured data). HBase is good at storing such data because, like Bigtable, it uses a column-oriented storage mechanism.
3. CouchDB is a document-oriented storage service under Apache, written in Erlang. Like the other new storage systems it is distributed and scales well. The difference is that it has no unified schema, and its data organization is flat, with no rows or columns. To run queries and similar operations, you supply your own aggregation and filtering operators and search the documents in map/reduce style. From this perspective it can also implement database-like queries, although the method is completely different. It also exposes a view as the logical interface to the data relationships, which users can think of as a traditional table.
4. SimpleDB is a query-oriented distributed data storage system developed by Amazon. It supplements and enriches Dynamo's key-value storage and is currently used in Amazon's cloud computing service. Its implementation details have not been published in a paper.
5. Pig is an interesting project that Yahoo donated to Apache. It is not a system but a SQL-like language whose goal is to provide a high-level query language on top of map/reduce. Operations written in it are compiled into the map and reduce steps of the map/reduce model, and users can add their own functions. Pig supports many algebraic operations, complex data types (tuple, map), statistical operations (count, sum, avg, min, max), and relational operations (filter, group, order, distinct, union, join, foreach ... generate).
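To give a feeling for what "compiling an operation into map and reduce" means for a language like Pig, here is a toy sketch of a "group by a field, then count" operation turned into a map function and a reduce function. It is not Pig's compiler or syntax; compile_group_count, run_job and the visits records are all hypothetical names invented for this illustration.

```python
from collections import defaultdict

def compile_group_count(field):
    """Turn 'group by `field`, then count' into a (map, reduce) pair of functions."""
    def map_fn(record):
        # Emit the grouping key with a count of one.
        return (record[field], 1)

    def reduce_fn(key, values):
        # Sum all the ones emitted for this key.
        return (key, sum(values))

    return map_fn, reduce_fn

def run_job(records, map_fn, reduce_fn):
    """Tiny in-process map/reduce runner: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        key, value = map_fn(record)
        groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

visits = [
    {"user": "amy", "url": "/home"},
    {"user": "bob", "url": "/home"},
    {"user": "amy", "url": "/cart"},
]

map_fn, reduce_fn = compile_group_count("url")
url_counts = run_job(visits, map_fn, reduce_fn)
print(url_counts)  # [('/home', 2), ('/cart', 1)]
```

The user only says what to group and count; the map and reduce functions are generated, which is the convenience a high-level query language adds on top of raw map/reduce.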
Conclusion:
That is about all I know. Keep in mind that this field is developing rapidly and knowledge changes by the day: storage systems and computing architectures keep popping up and pushing each other forward. Which one is best depends on your actual needs. I have not been studying this area for long, so there are probably many places where I have misunderstood something or explained it unclearly. Criticism and corrections are welcome. If this article helps me make friends with people who study similar systems, my goal is achieved. Haha!
Source: http://blog.csdn.net/kanghua/article/details/2919766