Distributed systems (this article draws on the "courage" blog)

Last Update:2016-05-24 Source: Internet

Author: User

Tags: hadoop, mapreduce

1. How to understand "distributed"?

We often hear the terms "distributed systems", "distributed computing", and "distributed algorithms". What exactly does "distributed" mean? In the narrow sense, it refers to multiple PCs that are geographically spread across different locations.

2. Distributed systems

Distributed system: multiple computers (called nodes), each capable of running independently. The nodes exchange information over a computer network in order to achieve a common goal or task.

Distributed program: a computer program that runs on a distributed system.

Distributed computing: solving computational problems with a distributed system. In distributed computing, a problem is broken down into multiple tasks, and each task can be carried out by one or more computers.

Distinguishing distributed computing from parallel computing: what they have in common is that a large task is divided into small tasks. The differences are as follows. Distributed computing: runs on multiple PCs, each of which completes a different part of the same task. The small tasks are independent of one another, the results on one node have almost no effect on the other nodes, and the real-time requirements are low. Parallel computing (as the term is used here): runs on a single PC, using the multiple cores of its CPU to complete one task together.
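The idea of dividing a large task into independent small tasks can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and worker threads stand in for the nodes, whereas a real distributed system would send each chunk to a different machine over the network.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(numbers, task_count):
    """Divide one large job into roughly equal, independent sub-tasks."""
    size = max(1, -(-len(numbers) // task_count))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def run_task(chunk):
    """The work a single node would do: sum the squares of its own chunk."""
    return sum(n * n for n in chunk)


def sum_of_squares(numbers, task_count=4):
    """Split the job, run the sub-tasks concurrently, then merge the results."""
    tasks = split_into_tasks(numbers, task_count)
    # Threads are an assumption standing in for nodes; no sub-task needs
    # the result of any other, which is what makes the split possible.
    with ThreadPoolExecutor(max_workers=task_count) as pool:
        partial_results = list(pool.map(run_task, tasks))
    return sum(partial_results)
```

Because the sub-tasks share nothing, the only coordination needed is the final merge of the partial results.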

1) Distributed operating system

Distributed operating system: responsible for managing the resources of a distributed processing system and controlling the execution of distributed programs. It differs from a centralized operating system in areas such as resource management, process communication, and system architecture.

2) Distributed file system

A distributed file system supports remote file access, and it transparently manages and accesses files that are distributed across the network.

3) Distributed programming, compilation, and interpretation systems

Distributed programming languages are used to write distributed programs that run on distributed computer systems. A distributed program is composed of several program modules that can be executed independently, and these modules run concurrently on multiple computers within a distributed processing system. Compared with a centralized programming language, a distributed programming language has three distinguishing features: distribution, communication, and robustness.

Layered applications: an application can be classified by its number of tiers, that is, the number of layers that data passes through on its way from the data layer (typically stored in a database) to the presentation layer (displayed on the client). Typically each layer runs on a different system from the others, or in a different process space on the same system. Benefits of tiering: it reduces the complexity of the application as a whole, and it lets the application scale better and keep pace with the needs of the enterprise.

Two-tier applications: the typical structure is a client PC (front end) and a network server (back end) that contains a database. The logical division follows the physical location of the two. Typically the client contains most of the business logic, although as databases and stored procedures have evolved, SQL has made it possible to store and execute business logic on the database server.

Three-tier applications: the most commonly used three-tier architecture consists of a user service layer (the presentation layer), a business service layer, and a data service layer. The business logic layer is separated from both the user interface and the data source. Because of the functional limitations of the two-tier client/server structure, distributed applications are typically divided into three or more tiers, and the components in each layer perform a specific type of processing.
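The separation of the three tiers might look like the following minimal sketch. All class names, the price data, and the bulk-discount rule are hypothetical and exist only to show that each layer talks to the one directly beneath it.

```python
class DataService:
    """Data service layer: the only layer that touches storage."""

    def __init__(self):
        # An in-memory dict stands in for a real database (assumption).
        self._prices = {"book": 12.0, "pen": 1.5}

    def get_price(self, item):
        return self._prices[item]


class BusinessService:
    """Business service layer: rules that are independent of UI and storage."""

    def __init__(self, data_service):
        self._data = data_service

    def order_total(self, item, quantity):
        total = self._data.get_price(item) * quantity
        if quantity >= 10:  # hypothetical bulk-discount rule
            total *= 0.9
        return round(total, 2)


def render_order(business_service, item, quantity):
    """User service (presentation) layer: formats the result for display."""
    total = business_service.order_total(item, quantity)
    return f"{quantity} x {item} = {total:.2f}"
```

In this arrangement the presentation function never sees the price table, so the storage could be swapped for a real database without changing the other two layers.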

4) Distributed database

My understanding: a distributed database is made up of multiple databases (called sites) that are located in different places (geographically distributed) and connected to one another. A distributed DBMS manages all the sites uniformly, so that they are logically unified. Because data distribution is transparent, using it feels like managing data on a single site. The advantages are fault tolerance and faster access.

The wiki gives a more formal definition: a distributed database is a logically unified database composed of multiple physically dispersed database units connected by a computer network. Each connected database unit is called a site or node. A distributed database is managed by a unified database management system, called a distributed database management system.

The basic characteristics of a distributed database are physical distribution, logical integrity, and site autonomy. Other features that follow from these three are transparency of data distribution, a control mechanism that combines centralization with autonomy, appropriate data redundancy, and distributed transaction management. Distributed databases are classified as homogeneous or heterogeneous, depending on whether the database management systems at the sites use the same data model. By type of control, they are classified as globally centralized, globally decentralized, or globally variable.
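Transparency of data distribution and data redundancy can be illustrated with a toy model. This is a sketch under assumptions, not how any particular distributed DBMS works: the class, its hash-based placement, and the `failed_sites` parameter are all hypothetical.

```python
import zlib


class DistributedDatabase:
    """Toy model: callers use put/get and never name a site directly."""

    def __init__(self, site_names, replicas=2):
        self._names = list(site_names)
        self.sites = {name: {} for name in self._names}  # one dict per site
        self.replicas = min(replicas, len(self._names))

    def _sites_for(self, key):
        # A stable checksum picks the first site; replicas go to the next ones.
        start = zlib.crc32(key.encode("utf-8")) % len(self._names)
        return [self._names[(start + i) % len(self._names)]
                for i in range(self.replicas)]

    def put(self, key, value):
        # Redundancy: the same record is written to several sites.
        for name in self._sites_for(key):
            self.sites[name][key] = value

    def get(self, key, failed_sites=()):
        # Fault tolerance: skip sites that are down and try the next replica.
        for name in self._sites_for(key):
            if name not in failed_sites and key in self.sites[name]:
                return self.sites[name][key]
        raise KeyError(key)
```

A caller writes `db.put("user:1", "Alice")` and reads `db.get("user:1")` as if there were a single database, and the read still succeeds if one of the two sites holding the record is unavailable.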

3. Hadoop, HDFS, HBase, Hive

My understanding:

Hadoop is a distributed system infrastructure. Distributed applications are developed on top of this framework to take advantage of a cluster's high-speed computing and storage capacity. This is similar to developing parallel programs on NVIDIA's CUDA parallel architecture to exploit the parallel computing power of the GPU.

HDFS is the file system for Hadoop. With HDFS you can manipulate files: create, delete, rename, and so on. Note that HDFS files are written once and then appended to, rather than edited in place.

HBase: a database system built on the Hadoop architecture. It is not a relational database; it uses a column-based model.

Hive: a high-level language built on Hadoop. It is a computer language similar to SQL, the language used to access and process relational databases.

Official explanation:

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, and can make full use of the cluster for high-speed computation and storage.

HDFS (Hadoop Distributed File System) is the distributed file system implemented by Hadoop. It stores files across the storage nodes in the Hadoop cluster. To external clients, HDFS looks like a traditional hierarchical file system: you can create, delete, move, or rename files, and so on. However, the architecture of HDFS is built on a specific set of nodes, and the files stored in HDFS are split into blocks that are then copied to multiple computers (DataNodes). This is very different from a traditional RAID architecture. The block size (64 MB by default in early Hadoop releases, 128 MB from Hadoop 2.x onward) and the number of block replicas are determined by the client when the file is created. The NameNode controls all file operations. All communication inside HDFS is based on the standard TCP/IP protocol.
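The block and replica arithmetic described above can be sketched as follows. The function name and the round-robin placement are assumptions made for illustration; real HDFS placement is decided by the NameNode and takes rack topology into account. The replication factor of 3 is the usual HDFS default.

```python
import math


def plan_blocks(file_size_bytes, datanodes,
                block_size_bytes=64 * 1024 * 1024, replication=3):
    """Return a list of (block_index, [datanode, ...]) for one file."""
    block_count = math.ceil(file_size_bytes / block_size_bytes)
    # There cannot be more distinct replicas than there are DataNodes.
    replication = min(replication, len(datanodes))
    plan = []
    for block_index in range(block_count):
        # Simplified round-robin placement (an assumption, not HDFS's policy).
        targets = [datanodes[(block_index + r) % len(datanodes)]
                   for r in range(replication)]
        plan.append((block_index, targets))
    return plan
```

For a 200 MB file with 64 MB blocks this yields four blocks, the last of which is only partly full, and each block is assigned to three different DataNodes.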

HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers. HBase began as a sub-project of the Apache Hadoop project. HBase differs from a typical relational database in that it is suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based model. Hadoop HDFS provides highly reliable low-level storage for HBase, Hadoop MapReduce provides high-performance computing power for HBase, and ZooKeeper provides stable service and a failover mechanism for HBase. In addition, Pig and Hive provide high-level language support for HBase, which makes statistical processing of data in HBase very simple. Sqoop provides a convenient RDBMS data import function for HBase, which makes it easy to migrate data from traditional databases to HBase. For the HBase data model and storage structure, see http://www.searchtb.com/2011/01/understanding-hbase.html
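The column-based model can be pictured as a map from row key to "family:qualifier" columns. The following is a simplified in-memory sketch, not the HBase client API: the class and method names are hypothetical, and timestamps and cell versions are left out.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: row key -> {"family:qualifier": value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise ValueError(f"unknown column family: {family}")
        # Qualifiers need no schema; each row may hold different columns.
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_row, stop_row):
        # Rows are visited in sorted key order; stop_row is exclusive.
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key, dict(self._rows[key])
```

Unlike a relational table, a row here stores only the columns that were actually written, which is one reason the model suits sparse or loosely structured data.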

Hive is a Hadoop-based data warehousing tool that maps structured data files (such as XML) to database tables and provides SQL-like query functionality. It converts SQL statements into MapReduce tasks to run them. The advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes Hive well suited to statistical analysis in a data warehouse.
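To give a feel for how a SQL-like statement corresponds to a MapReduce task, the sketch below hand-writes the map, shuffle, and reduce steps for a query of the form SELECT dept, COUNT(*) FROM employees GROUP BY dept. The table, column, and function names are hypothetical, and this is not how Hive actually compiles queries; it only shows the shape of the equivalent job.

```python
from collections import defaultdict


def map_phase(rows, group_column):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row[group_column], 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key to get COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_group_by(rows, group_column):
    """Run the three phases in order, as a single-machine stand-in for a job."""
    return reduce_phase(shuffle(map_phase(rows, group_column)))
```

Writing these phases by hand for every report is the work that a one-line SQL-like statement saves.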

This article is an English version of an article originally published in Chinese on aliyun.com.
