Why binary memtable is not suitable for importing large amounts of data in Cassandra 0.6.1

Last Update:2018-12-07 Source: Internet

Author: User

Tags cassandra hadoop mapreduce


The previous article, "Using Binary memtable to import large amounts of data into Cassandra", explained how to use binary memtable to import large amounts of data into Cassandra.

This week we have been evaluating whether binary memtable can be used to import a large amount of data. My conclusion today is that this version is still not suitable for importing data this way.

The reasons are as follows:

In versions 0.6 and later, the Cassandra cluster dropped UDP communication and uses TCP exclusively, listening on a fixed port (7000). This causes two serious problems:

1. If a machine is a Cassandra server, it has already bound port 7000. We then cannot start the client program that imports data in binary memtable mode on the same machine, because the client itself also needs to bind port 7000.

2. Because the binary memtable method is meant to be combined with MapReduce when importing data, and MapReduce may run multiple map or reduce tasks on a single server, multiple client programs can be started on one server at the same time. Each client program needs to bind port 7000, so only one of them can initialize successfully.
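The port conflict described above can be reproduced with a minimal sketch using plain TCP sockets. This is not Cassandra code: the helper names try_bind and count_successful_clients are hypothetical, and an OS-assigned free port stands in for port 7000 so the sketch does not collide with a real server.

```python
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind and listen on a TCP port; return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "address already in use": another process owns the port.
        sock.close()
        return None


def count_successful_clients(n_clients):
    """Simulate n_clients programs on one host that all need the same fixed port."""
    # Port 0 asks the OS for any free port; it stands in for the fixed port 7000.
    first = try_bind(0)
    port = first.getsockname()[1]
    # Every further "client" tries to bind the port the first one already holds.
    sockets = [first] + [try_bind(port) for _ in range(n_clients - 1)]
    successes = sum(1 for s in sockets if s is not None)
    for s in sockets:
        if s is not None:
            s.close()
    return successes


if __name__ == "__main__":
    print(count_successful_clients(4))
```

However many clients are started, only the first bind succeeds, which is the situation several MapReduce tasks on one machine would run into.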

Binary memtable works as follows: the client serializes the data to be imported and sends the serialized data to each Cassandra server responsible for it (including the replica machines). After receiving the data, each Cassandra server first deserializes it and then puts it into memory. When the data in memory reaches a certain volume, or when a flush command is received, the server runs a flush operation that writes the in-memory data to an SSTable, persisting it to disk. This approach causes the following two problems:

1. Because all machines (including replicas) responsible for a piece of data are computed on the client side, the client sends the data to all of them regardless of whether each Cassandra server is available. If a Cassandra server is unavailable at that moment, it will not contain the data. This can be corrected later by a read repair operation, but if n/2 + 1 of the machines are unavailable, the data will never become consistent.

2. After we write a piece of data to a Cassandra server, we do not know whether the write succeeded, because the Cassandra server that receives the data returns nothing. For details, see sendOneWay and BinaryVerbHandler in the Cassandra source.
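The two problems above can be illustrated with a toy model of a one-way send. This is a sketch under assumptions, not the Cassandra implementation: the Replica class and the send_one_way and replicas_holding helpers are hypothetical names, and an in-memory dict stands in for the memtable.

```python
class Replica:
    """Toy stand-in for a Cassandra server that receives binary memtable data."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.memtable = {}

    def receive(self, key, value):
        # An unavailable replica never sees the message, and nobody is told.
        if self.up:
            self.memtable[key] = value


def send_one_way(replicas, key, value):
    """Fire-and-forget send: no acknowledgement is collected, nothing is returned."""
    for replica in replicas:
        replica.receive(key, value)


def replicas_holding(replicas, key):
    """Names of the replicas that actually stored the key."""
    return [r.name for r in replicas if key in r.memtable]


if __name__ == "__main__":
    replicas = [Replica("a"), Replica("b", up=False), Replica("c")]
    result = send_one_way(replicas, "row1", "payload")
    print(result)                              # the client learns nothing
    print(replicas_holding(replicas, "row1"))  # replica "b" missed the write
```

Because the send returns nothing, the client cannot distinguish a fully replicated write from one that some replicas missed; only a later read can reveal the gap.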

Cassandra's cluster mechanism works as follows: within a keyspace, the data for a key maps to one Cassandra server, and several other Cassandra servers are automatically selected as replicas according to the replication settings we have configured. When a new Cassandra server starts, all Cassandra servers add the newly started server to their own token map, the cluster considers itself to have gained a machine, and the relevant data is subsequently transferred to the new Cassandra server. When a Cassandra server becomes unavailable, the machines responsible for its replicas cache the data destined for the unavailable server in their own system table. Once they detect that the unavailable server has started again, they send the hint data in the system table to it, which completes the backup and recovery process. This mechanism also brings a serious problem:

1. The client program we start uses the same startup procedure as a normal Cassandra server and therefore joins the Cassandra cluster. As a result, the other Cassandra servers in the cluster rebalance their data and stream replica data to the client. At the same time, because the client program is not a real server machine, the system tables of many Cassandra servers end up caching a large amount of data that is supposed to be sent to the client machine.
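The hint build-up described above can be sketched with a toy model of hinted handoff. This is an illustration under assumptions rather than Cassandra's actual code: the Node class and the write and replay_hints functions are hypothetical, and a dict of lists stands in for the hints kept in the system table.

```python
from collections import defaultdict


class Node:
    """Toy cluster member; `hints` stands in for hint rows in the system table."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = defaultdict(list)  # target node name -> [(key, value), ...]


def write(coordinator, target, key, value):
    """Deliver a write to target, or keep a hint on the coordinator if target is down."""
    if target.up:
        target.data[key] = value
    else:
        coordinator.hints[target.name].append((key, value))


def replay_hints(coordinator, target):
    """Send stored hints once target is reachable; return how many were delivered."""
    if not target.up:
        return 0
    pending = coordinator.hints.pop(target.name, [])
    for key, value in pending:
        target.data[key] = value
    return len(pending)


if __name__ == "__main__":
    server = Node("server")
    client = Node("import-client")  # joined the ring, but is not a real server
    client.up = False               # the import job has finished and exited
    for i in range(3):
        write(server, client, f"key{i}", "value")
    print(len(server.hints["import-client"]))  # hints pile up on the server
```

In this model the hints are only drained when the target comes back, so a short-lived import client that joined the ring leaves hints behind on every server that tried to write to it.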

The code of this Cassandra version also has bugs of its own:

1. Once the client program has started, it cannot be shut down cleanly: some threads keep running in the background. This may prevent the Hadoop MapReduce program from running normally. This bug has since been fixed.

2. Other bug reports are available on Cassandra's official website.

Based on these considerations, we decided to abandon the binary memtable method for importing large amounts of data.

More about Cassandra: http://www.cnblogs.com/gpcuster/tag/Cassandra/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
