What exactly do "cluster" and "distributed" mean?

Source: Internet
Author: User
Tags: version control system

As a rookie who has only been working for a year, I occasionally come across article titles containing keywords like "distributed" and "cluster", and end up thoroughly confused. Recently I have gained some understanding of these concepts and organized my thoughts, so I am sharing them here with my fellow programmers. If I have gotten anything wrong, corrections are welcome. Thanks.

In fact, during this year of work I have had some contact with distributed and clustering technologies, such as the distributed service framework Dubbo and the search engine Elasticsearch, but I have not studied them in depth.

Concepts are always abstract, and pairing them with examples makes them clearer. So if you already use distributed and clustering technologies, you can read the concepts in this article and think back to the technologies you have used. If you have not used them yet, you can still read this article just to get acquainted, so that when you do meet them later you will at least have a rough impression.

Let's look at how other programmers view "distributed" and "cluster":

(1) Another blogger's view (http://blog.csdn.net/bluishglc/article/details/5483162)

I have made a few changes and additions to his statement, to make it easier for you to understand what he means.

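Simply put, a distributed system improves efficiency by shortening the execution time of a single task, while a cluster improves efficiency by increasing the number of tasks executed per unit of time. For example: if a task consists of 10 subtasks, and each subtask takes 1 hour to run on its own, then running the task on a single server takes 10 hours. With a distributed approach and 10 servers, each server handles only one subtask; ignoring dependencies between subtasks, the whole task finishes in just 1 hour. (A typical representative of this working model is Hadoop's Map/Reduce distributed computing model.) With a cluster approach, again with 10 servers, each server can process the whole task independently. Suppose 10 tasks arrive at the same time: the 10 servers work simultaneously, and after 10 hours all 10 tasks finish together. Seen as a whole, that is still an average of one task completed per hour! (Note the difference between tasks and subtasks here.)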
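To check the arithmetic in the quote, here is a small sketch in Python. The numbers (10 subtasks, 1 hour each, 10 servers) come from the quote; the function names are mine, and the model assumes subtasks have no dependencies and ignores scheduling and network overhead.

```python
import math

SUBTASKS_PER_TASK = 10   # from the quoted example
HOURS_PER_SUBTASK = 1    # from the quoted example
SERVERS = 10             # from the quoted example


def single_server_hours() -> int:
    """One server runs all subtasks of one task back to back."""
    return SUBTASKS_PER_TASK * HOURS_PER_SUBTASK


def distributed_hours() -> int:
    """One task: its subtasks are spread over the servers and run in parallel."""
    rounds = math.ceil(SUBTASKS_PER_TASK / SERVERS)
    return rounds * HOURS_PER_SUBTASK


def cluster_hours(tasks: int) -> int:
    """Many tasks: every server runs whole tasks on its own."""
    rounds = math.ceil(tasks / SERVERS)
    return rounds * single_server_hours()


def cluster_average_hours_per_task(tasks: int) -> float:
    """Throughput view: wall-clock hours divided by tasks completed."""
    return cluster_hours(tasks) / tasks
```

With these numbers, the distributed setup shortens one task from 10 hours to 1 hour, while the cluster still needs 10 hours per task but finishes 10 tasks in that time, which averages out to 1 hour per task.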

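(2) Zhihu (https://www.zhihu.com/question/20004877)

One user described it very simply and clearly:

Distributed: one business is split into multiple sub-businesses, which are deployed on different servers.
Cluster: the same business is deployed on multiple servers.

Another user put it from a different angle:

A cluster is a physical form; distributed is a way of working.

This user's description is concise, but rather abstract:

As I understand it, clustering addresses high availability, while distribution addresses high performance and high concurrency.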

(3) Baidu Encyclopedia (http://baike.baidu.com/view/4804677.htm, http://baike.baidu.com/view/3022776.htm)

Cluster:

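A cluster is a set of mutually independent computers interconnected by a high-speed network. They form a group and are managed as a single system. When a client interacts with the cluster, the cluster looks like a single independent server. Cluster configurations are used to improve availability and scalability.

Distributed:

A network-based computer processing technology, the counterpart of centralized processing. Because the performance of personal computers has improved greatly and their use has become widespread, it has become possible to distribute processing power across all the computers on a network. Distributed computing is the opposite concept to centralized computing, and the data of a distributed computation can be spread over a very large area.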

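Does reading these still leave you feeling unsure? It was the same for me! So let's keep going.

I mentioned that I have worked with the distributed service framework Dubbo, so let's see why it is called a distributed service architecture. (http://dubbo.io/User+Guide-zh.htm#UserGuide-zh-%E8%83%8C%E6%99%AF)

Distributed service architecture: as vertical applications multiply, interaction between applications becomes unavoidable. The core business is extracted into independent services, which gradually form a stable service center, so that front-end applications can respond more quickly to changing market demands. At this point, a distributed service framework (RPC) that improves business reuse and integration is the key.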

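You will also occasionally hear that "Git is a distributed version control system". Why is it distributed? (http://zhidao.baidu.com/link?url=WYNUjpVK8c-G5lq9EP6CMWAAwexIKduWUYlSC09iC5NRPYJI4L7HxoxgTRIiGxKoNQpBy4XCC_j_ 6TOJOSBQZY8O6-NIXCBVUZ2–ZCJWTK)

Git is a distributed version control system, as opposed to a centralized version control system such as SVN. Simply put, distributed version control means that everyone can create an independent code repository to manage, and all version control operations can be done locally. The code each person changes can be pushed and merged into another repository. With something like SVN, there is only one central point of control, and all developers must depend on that one repository. Every version control operation also has to connect to the server to complete. Many companies prefer centralized version control because it gives them tighter control over the code. For individual development, a distributed system like Git is a good choice.

From an ordinary developer's point of view, Git offers the following:
1. Clone the complete Git repository (including code and version information) from the server to your own machine.
2. Create branches and modify code on your own machine for different development purposes.
3. Commit code on the branches you created on your own machine.
4. Merge branches on your own machine.
5. Fetch the latest code from the server and merge it with your own main branch.
6. Generate a patch and send it to the lead developer.
7. Look at the lead developer's feedback. If the lead developer finds a conflict between two ordinary developers (one they can resolve by cooperating), they are asked to resolve the conflict first, and then one of them submits. If the lead developer can resolve it alone, or there is no conflict, the patch is accepted.
8. To resolve conflicts among themselves, ordinary developers can use the pull command, and after resolving the conflict they submit a patch to the lead developer again.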

Looking at these descriptions of the distributed service framework Dubbo and the distributed version control system Git, they do seem to match the statement quoted above: "Distributed: one business is split into multiple sub-businesses, which are deployed on different servers. Cluster: the same business is deployed on multiple servers."

Dubbo extracts the core business into independent service modules. Each module only needs to depend on interfaces, with interface and implementation kept separate, so developers can each finish the service module they are responsible for and in the end assemble a complete system. Their shared goal is to complete one system, and each sub-service module is equivalent to a sub-business. Git is similar.
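To make "depend on the interface, not the implementation" concrete, here is a minimal sketch in plain Python. It is not Dubbo's API: `OrderService`, `InMemoryOrderService`, and `Storefront` are hypothetical names I made up, and in a real deployment the implementation would run on another server behind an RPC proxy.

```python
from abc import ABC, abstractmethod


class OrderService(ABC):
    """Hypothetical service interface that both teams agree on."""

    @abstractmethod
    def create_order(self, user_id: str, item: str) -> str:
        """Create an order and return its id."""


class InMemoryOrderService(OrderService):
    """One team's implementation; it could live on a different server."""

    def __init__(self) -> None:
        self._orders: list[tuple[str, str]] = []

    def create_order(self, user_id: str, item: str) -> str:
        self._orders.append((user_id, item))
        return f"order-{len(self._orders)}"


class Storefront:
    """Another team's module; it only knows the OrderService interface."""

    def __init__(self, orders: OrderService) -> None:
        self._orders = orders

    def buy(self, user_id: str, item: str) -> str:
        return self._orders.create_order(user_id, item)
```

Because `Storefront` only sees the interface, the implementation behind it can be replaced or moved to another machine without changing the caller.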

In fact, distributed systems often cannot do without clusters. Dubbo, Hadoop, and Elasticsearch all show this.

By now the concept of "distributed" may be relatively clear, while the concept of "cluster" may still be vague. It is also not yet clear how a cluster works together with a distributed system, so let's go on to understand clusters.

Clusters fall into three major categories:

High-availability cluster (High Availability Cluster)

Load-balancing cluster (Load Balance Cluster)

Scientific computing cluster (High Performance Computing Cluster)

1. High-availability cluster (High Availability Cluster)

The most common form is an HA cluster made of 2 nodes. It goes by many popular but unscientific names, such as "dual-machine hot standby", "dual-machine mutual backup", and "dual machine".
A high-availability cluster safeguards the ability of a user's application to keep providing service to the outside world. (Note that a high-availability cluster is not there to protect business data. What it protects is the uninterrupted service of the user's business programs, minimizing the impact of software, hardware, and human-caused failures on the business.)

2. Load-balancing cluster (Load Balance Cluster)

Load-balancing system: all nodes in the cluster are active, and they share the system's workload. Typical web server clusters, database clusters, and application server clusters are of this type.

A load-balancing cluster is typically used by web servers and database servers to respond to network requests. When a request arrives, this kind of cluster can check which servers have accepted fewer requests and are not busy, and forward the request to those servers. In terms of checking the state of other servers, load-balancing clusters and fault-tolerant clusters are very close; the main difference is that a load-balancing cluster has more nodes.
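The "send the request to the server that is least busy" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `LeastBusyBalancer` is a hypothetical name, "busy" is measured purely by the number of in-flight requests, and ties go to the server listed first.

```python
class LeastBusyBalancer:
    """Tracks in-flight requests per server and picks the least busy one."""

    def __init__(self, servers: list[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        # Number of requests each server is currently handling.
        self.active = {name: 0 for name in servers}

    def acquire(self) -> str:
        """Choose the server with the fewest in-flight requests."""
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name: str) -> None:
        """Call this when the server has finished a request."""
        if self.active[name] > 0:
            self.active[name] -= 1
```

A caller would wrap each request in `acquire()` and `release()`, so that a slow server, which releases less often, naturally receives fewer new requests.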

3. Scientific computing cluster (High Performance Computing Cluster)

High Performance Computing clusters, HPC clusters for short, are dedicated to providing powerful computing capacity that a single computer cannot provide.

High-performance computing classification:
 
3.1. High-throughput computing (High-Throughput Computing)

One class of high-performance computing can be divided into several subtasks that run in parallel and have no dependencies on each other. SETI@home (Search for Extraterrestrial Intelligence at Home) is an application of this type. The project uses idle computing resources on the Internet to search for extraterrestrial intelligence. The SETI@home server sends a set of data and data patterns to each participating compute node on the Internet; the compute node searches the given data for the given patterns and then sends the search results back to the server. The server is responsible for assembling the data returned from the compute nodes into a complete result. Because a common feature of this type of application is searching for certain patterns in massive amounts of data, this kind of computing is called high-throughput computing. So-called Internet computing belongs to this category. According to Flynn's classification, high-throughput computing falls under SIMD (Single Instruction/Multiple Data).
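The server/compute-node split described above can be imitated on one machine. In this sketch, which is my own illustration and not SETI@home code, threads stand in for the compute nodes, a string stands in for the data, and `search_chunk` and `search_all` are hypothetical names. Chunks overlap by `len(pattern) - 1` characters so that a match crossing a chunk boundary is still found exactly once.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(job: tuple[int, str, str]) -> list[int]:
    """Compute node: find every start position of the pattern in one chunk."""
    offset, chunk, pattern = job
    hits = []
    start = chunk.find(pattern)
    while start != -1:
        hits.append(offset + start)           # report positions in the full data
        start = chunk.find(pattern, start + 1)
    return hits


def search_all(data: str, pattern: str, chunk_size: int, workers: int = 4) -> list[int]:
    """Server: split the data, hand out independent jobs, merge the results."""
    if not pattern or chunk_size <= 0:
        raise ValueError("pattern must be non-empty and chunk_size positive")
    overlap = len(pattern) - 1
    jobs = [
        (i, data[i:i + chunk_size + overlap], pattern)
        for i in range(0, len(data), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_chunk, jobs)  # the subtasks never talk to each other
    return sorted(pos for hits in results for pos in hits)
```

The important property is that no job needs anything from another job, which is what makes this kind of work easy to spread over unrelated machines.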
  
3.2. Distributed computing (Distributed Computing)

The other class of computing is the opposite of high-throughput computing. It can also be divided into several parallel subtasks, but the subtasks are closely related and need to exchange a lot of data. According to Flynn's classification, distributed high-performance computing falls under MIMD (Multiple Instruction/Multiple Data).

Let's talk about where these kinds of clusters are used:

High-availability clusters are not covered further here.

I think Dubbo leans more toward a load-balancing cluster. Those who have used it will know (if you have not, you can look into it yourself): in Dubbo the same service can have multiple providers, so when a consumer arrives, which provider should it call? There is a load-balancing mechanism inside for exactly that.

The search engine Elasticsearch leans more toward the distributed computing kind of scientific computing cluster.

At this point, many of you probably already know some clustering terminology: cluster fault tolerance and load balancing.

Let's take Dubbo as an example:

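Cluster fault tolerance (http://dubbo.io/User+Guide-zh.htm#UserGuide-zh-%E9%9B%86%E7%BE%A4%E5%AE%B9%E9%94%99)

Dubbo provides these fault-tolerance strategies:

Cluster fault-tolerance modes: you can extend the cluster fault-tolerance strategies yourself; see: Cluster extension.
Failover Cluster: automatic failover. When a call fails, retry on another server. (Default.) Usually used for read operations, but retries bring longer latency. The number of retries (not counting the first call) can be set with retries="2".
Failfast Cluster: fail fast. Only one call is made, and a failure is reported immediately. Usually used for non-idempotent write operations, such as adding a record.
Failsafe Cluster: fail safe. When an exception occurs, it is simply ignored. Usually used for operations such as writing audit logs.
Failback Cluster: automatic recovery on failure. Failed requests are recorded in the background and resent periodically. Usually used for message notification operations.
Forking Cluster: call multiple servers in parallel and return as soon as one succeeds. Usually used for read operations with high real-time requirements, at the cost of more service resources. The maximum number of parallel calls can be set with forks="2".
Broadcast Cluster: broadcast the call to all providers, one by one; if any one of them reports an error, the call reports an error. (Supported since 2.1.0.) Usually used to notify all providers to update local resources such as caches or logs.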
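The first three strategies are easy to compare side by side. The sketch below is plain Python written for illustration, not Dubbo's implementation: providers are modeled as zero-argument callables, and `failover`, `failfast`, `failsafe`, `broken`, and `healthy` are all hypothetical names. The `retries=2` default mirrors the `retries="2"` setting quoted above, meaning up to three attempts in total.

```python
from typing import Any, Callable, Optional, Sequence

Provider = Callable[[], Any]


def failover(providers: Sequence[Provider], retries: int = 2) -> Any:
    """Try the next provider on failure, up to 1 + retries attempts in total."""
    if not providers:
        raise ValueError("no providers available")
    last_error: Optional[Exception] = None
    for provider in providers[: retries + 1]:
        try:
            return provider()
        except Exception as exc:      # remember the error and move on
            last_error = exc
    raise last_error


def failfast(providers: Sequence[Provider]) -> Any:
    """Call once; any error goes straight back to the caller."""
    return providers[0]()


def failsafe(providers: Sequence[Provider], default: Any = None) -> Any:
    """Call once; swallow any error and return a default instead."""
    try:
        return providers[0]()
    except Exception:
        return default


def broken() -> str:
    """Hypothetical provider that is down."""
    raise RuntimeError("provider is down")


def healthy() -> str:
    """Hypothetical provider that works."""
    return "ok"
```

With `[broken, healthy]` as the provider list, `failover` falls through to the healthy provider, `failfast` raises the `RuntimeError`, and `failsafe` quietly returns `None`.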

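Load balancing (http://dubbo.io/User+Guide-zh.htm#UserGuide-zh-%E8%B4%9F%E8%BD%BD%E5%9D%87%E8%A1%A1)

Dubbo provides these load-balancing strategies:

Random LoadBalance: random, with the random probability set by weight. The chance of collisions at any single moment is high, but the larger the call volume, the more even the distribution. Using weights by probability also gives a fairly even spread, which makes it easier to adjust provider weights dynamically.
RoundRobin LoadBalance: round robin, with the rotation ratio set by the reduced weights. It suffers from slow providers accumulating requests. For example, the second machine is slow but not down; requests routed to it get stuck there, and over time all requests end up stuck on the second machine.
LeastActive LoadBalance: fewest active calls, choosing randomly among providers with the same active count. The active count is the difference between the counters before and after a call. Slow providers receive fewer requests, because the slower a provider is, the larger that difference becomes.
ConsistentHash LoadBalance: consistent hashing. Requests with the same parameters are always sent to the same provider. When a provider goes down, the requests originally sent to it are spread across the other providers by means of virtual nodes, without causing drastic changes. For the algorithm, see http://en.wikipedia.org/wiki/Consistent_hashing. By default only the first argument is hashed; to change this, configure <dubbo:parameter key="hash.arguments" value="0,1" />. By default 160 virtual nodes are used; to change this, configure <dubbo:parameter key="hash.nodes" value="320" />.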
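Consistent hashing is the least obvious of the four, so here is a small Python sketch of the idea. It is my own illustration rather than Dubbo's code: `ConsistentHashRing` is a hypothetical class, SHA-256 is an arbitrary choice of hash function, and the provider addresses are made up. The default of 160 virtual nodes per provider follows the number quoted above.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps request arguments to providers via a ring of virtual nodes."""

    def __init__(self, providers: list[str], virtual_nodes: int = 160) -> None:
        ring = []
        for provider in providers:
            for i in range(virtual_nodes):
                # Each provider appears many times on the ring.
                ring.append((self._hash(f"{provider}#{i}"), provider))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._owners = [p for _, p in ring]

    @staticmethod
    def _hash(text: str) -> int:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def select(self, argument: object) -> str:
        """Return the provider owning the first virtual node at or after the key."""
        if not self._hashes:
            raise ValueError("no providers on the ring")
        index = bisect.bisect_left(self._hashes, self._hash(str(argument)))
        return self._owners[index % len(self._owners)]  # wrap around the ring


# Hypothetical provider addresses, for illustration only.
PROVIDERS = ["10.0.0.1:20880", "10.0.0.2:20880", "10.0.0.3:20880"]
```

If one provider is removed and the ring is rebuilt, only the keys that belonged to that provider move; every other key keeps its old provider, which is the "no drastic changes" property described in the quote.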

You may also be curious about how the nodes communicate with each other.

Take the automatic node discovery mechanism in earlier versions of Elasticsearch: ES is a peer-to-peer system. It first finds the existing nodes through broadcast, then communicates between nodes using the multicast protocol, and it also supports point-to-point interaction.

Dubbo has a registry and supports several kinds of registries, but ZooKeeper is the recommended one. You can read up on ZooKeeper yourself; many cluster-related frameworks use it. Elasticsearch, of course, implements its own corresponding mechanism.
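To show what a registry does at its simplest, here is an in-memory sketch in Python. It is not the ZooKeeper or Dubbo API: `InMemoryRegistry` and its methods are hypothetical, and a real registry would also handle health checks, change notifications, and running on several machines.

```python
from collections import defaultdict


class InMemoryRegistry:
    """Providers register their address; consumers look addresses up by service name."""

    def __init__(self) -> None:
        self._providers: dict[str, list[str]] = defaultdict(list)

    def register(self, service: str, address: str) -> None:
        """Called by a provider when it starts."""
        if address not in self._providers[service]:
            self._providers[service].append(address)

    def unregister(self, service: str, address: str) -> None:
        """Called when a provider shuts down or is considered dead."""
        if address in self._providers.get(service, []):
            self._providers[service].remove(address)

    def lookup(self, service: str) -> list[str]:
        """Called by a consumer; returns a copy of the current provider list."""
        return list(self._providers.get(service, []))
```

A consumer would call `lookup` to get the provider list and then hand that list to a load-balancing strategy to pick one address.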

That's it for now. I'm a bit tired, so I'll get back to work first ~
