These questions come up constantly in group discussions: newcomers ask the introductory ones, and people later think of new problems to add. But the problem of getting started matters too, and understanding the principles determines how deep your learning can go.
Hadoop itself is not discussed in this article; only the peripheral software is introduced.
This is the tool I'm asked about most often, and also the most heavily used software around Hadoop.
What the hell is Hive?
Defining Hive strictly is genuinely not easy. For non-Hadoop professionals we usually keep it simple and call it a data warehouse. Technically, though, I don't think that is a rigorous definition: Hive itself does not store any data, not even its metadata, and it has none of the index and storage structures that a database in the traditional sense is designed around. Its metadata is kept in an embedded database or a relational database, and the data itself is read directly from files and paths on HDFS. That is not what a data warehouse means in the traditional sense.
My own way of thinking about it: Hive is a compiler that translates SQL statements into Map/Reduce jobs.
It frees you from the map/reduce development cycle: a single SQL statement does what might previously have taken a day of coding, and I think that is the greatest strength of this tool. So I'm grateful to Facebook for open-sourcing it.
What is metadata?
This is another question countless people ask. Some say metadata is "data that describes data"; I'd rather say it is data that manages data. Take the Terracotta Warriors: as everyone knows there are many pits, not adjacent to each other, and each pit contains many warriors. The warriors themselves are the data; the numbering of the pits, pit one, pit two, together with the count in each, is the metadata. Mention pit one and you know it holds 100 warriors; mention pit two and you know it holds 120. That is the relationship between data and metadata.
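To make the pit analogy concrete, here is a minimal Python sketch (the records and names are my own invention): the warrior records are the data, and the small summary of counts per pit is the metadata.

```python
# Data: the warriors themselves, one record per figure.
warriors = [
    {"pit": 1, "figure": "general"},
    {"pit": 1, "figure": "archer"},
    {"pit": 2, "figure": "charioteer"},
]

# Metadata: a summary that manages the data (which pits exist, how many
# figures each holds) without containing any warrior itself.
metadata = {}
for w in warriors:
    metadata[w["pit"]] = metadata.get(w["pit"], 0) + 1

print(metadata)  # {1: 2, 2: 1}
```

Losing the metadata loses none of the warriors; it only loses the ability to answer "how many are in pit two?" without re-scanning everything.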
Where do Hive's table structure and data come from?
As mentioned above, Hive does not store its own data; all of it comes from files stored on HDFS, and the table structure you design is just a mapping of names onto the columns of those files, nothing more. In other words, if you define field A and field B in a Hive table, then when Hive reads the text file it treats the first column as field A and the second column as field B. A cruder analogy: you can think of Hive as MySQL's CSV engine, or as an Excel that can handle billions of rows, one that you query with SQL.
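The mapping idea can be sketched in a few lines of Python (a toy illustration, not Hive's actual mechanism): the file holds plain delimited text, and the "table" is nothing more than a list of names zipped onto each line's columns.

```python
# A toy version of Hive's schema-on-read: the data is just text on disk,
# and the table definition only assigns names to columns.
schema = ["a", "b"]      # like: CREATE TABLE t (a STRING, b STRING)
line = "hello\tworld"    # one row of the underlying HDFS text file

row = dict(zip(schema, line.split("\t")))
print(row)  # {'a': 'hello', 'b': 'world'}
```

Dropping the table does not touch the file (for external tables); only the name mapping disappears.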
Is HiveQL compatible with SQL-92/95?
Most of the syntax is compatible, but not all of it; many functions are incomplete, for example around clustering and sampling. The good news is that Hive provides a UDF mechanism, so you can write your own functions and load them into HiveQL. HiveQL's syntax is closest to MySQL's. Hive also exposes a Map/Reduce interface: you can embed your own map/reduce jar or script into HiveQL and have HQL return the results of a customized map/reduce computation.
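As one example of plugging your own script into HiveQL, Hive's TRANSFORM clause streams rows to an external program as tab-separated lines on stdin and reads transformed rows back from stdout. A minimal Python script usable that way might look like this (the column layout is an assumption for illustration):

```python
import sys

def transform_line(line):
    """Upper-case the second column of one tab-separated input row."""
    cols = line.rstrip("\n").split("\t")
    cols[1] = cols[1].upper()
    return "\t".join(cols)

if __name__ == "__main__":
    # Hive streams table rows here, one tab-separated line per row.
    for line in sys.stdin:
        print(transform_line(line))
```

On the Hive side you would invoke it with something like `SELECT TRANSFORM(id, name) USING 'python my_script.py' AS (id, name) FROM t;` (table and column names hypothetical).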
Is remote access to Hive convenient?
Very convenient, though non-Java languages need to learn the Thrift framework. For Java, a native JDBC driver is provided. For Windows applications there are third-party ODBC drivers; Cloudera and MapR, for example, both ship a Hive ODBC driver. This means that data analysis previously built on Oracle or MySQL can be migrated to Hive at very small cost.
Can Hive do real-time processing?
Not at all. Although it speaks SQL, its essence is Map/Reduce, and Map/Reduce is built for offline processing, so don't expect Hive to behave like MySQL or Oracle. Some memory-based distributed SQL engines are maturing, such as Cloudera's Impala or Hortonworks's Tez (as it seems to be called); their query speed is much higher, but they still can't do truly real-time queries. Think about it: tens of billions of rows, hundreds of terabytes of data, plus multidimensional statistics and calculations. Even if disk I/O and memory I/O were sufficient, the CPU might not keep up.
For specific syntax and more details, see the DDL and DML documentation on hive.apache.org.
After Hadoop and Hive, this is the most asked-about component, whether the questions are about understanding it or analyzing errors.
Is HBase a database?
It is NoSQL, not a traditional database: a column-family-oriented NoSQL store, and an open-source implementation of Google's BigTable paper.
What is a column family?
I'd like to give a professional explanation of the term, but neither Google nor Baidu turns up a good Chinese interpretation of "column family". So let me make an analogy that isn't especially precise, just easy to understand: in HBase, a column family roughly corresponds to a table in a relational database; a key-value pair is roughly the equivalent of a row; and a column can be seen as a field. A more professional metaphor is a matrix, but a sparse matrix, not a dense one.
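HBase's model is often described as a sorted, sparse map: row key, then column family, then column qualifier, down to the value. A toy Python sketch of that nesting (all row keys, families, and qualifiers below are invented for illustration):

```python
# A toy model of HBase's nested map: row -> family -> qualifier -> value.
# It is sparse: each row stores only the columns it actually has.
table = {
    "user#1001": {
        "info": {"name": "alice", "city": "beijing"},
        "stats": {"logins": "42"},
    },
    "user#1002": {
        "info": {"name": "bob"},  # no city, no stats: sparsity costs nothing
    },
}

print(table["user#1001"]["info"]["name"])  # alice
```

In a relational table every row would carry a slot for every column; here an absent column simply doesn't exist, which is what makes millions of columns feasible.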
How does HBase update a record?
As we all know, HDFS does not allow modifying files: once a file is written and closed, it cannot be changed, only appended to. So it is with HBase too: if you modify a key-value, the operation is appended to the end of HBase's store files rather than overwriting the original record in place. One great benefit of this is that old records are preserved; the modified data is kept, like a log, until it exceeds the retention limit you configured.
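The append-only update scheme can be sketched like this (a deliberate simplification; real HBase layers a memstore, HFiles, and compactions on top of the same idea):

```python
# Append-only store: every "update" adds a new (timestamp, value) version;
# a read returns the newest version, and old versions survive until pruned.
store = []  # list of (row_key, timestamp, value), never modified in place

def put(key, value, ts):
    store.append((key, ts, value))

def get(key):
    versions = [(ts, v) for (k, ts, v) in store if k == key]
    return max(versions)[1] if versions else None

put("row1", "old", ts=1)
put("row1", "new", ts=2)  # does NOT overwrite: it appends a newer version
print(get("row1"))         # new
print(len(store))          # 2 -- the old version is still there
```

Pruning old versions past the retention limit corresponds to HBase discarding excess versions during compaction.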
Does HBase need ZooKeeper installed alongside it?
No need: HBase ships with its own ZooKeeper. ZooKeeper is discussed below, so I won't repeat it here.
Can HBase do real-time queries? Is its capacity limited?
It can be real-time. This is the use case Facebook originally served with Cassandra, handling user logins, user information and so on, before migrating to HBase. And there is almost no limit to capacity: tens of billions of rows and millions of columns are handled with ease.
What does ZooKeeper do?
ZooKeeper is a coordination tool that maintains consistency across a distributed system: locks, configuration synchronization, and other basic operational functions. But don't assume that Hadoop and the surrounding ecosystem always need you to run it yourself; usually you don't operate it directly, and HBase and the rest mostly won't require you to. It is sometimes used for HA in Hadoop 1.0, where it helps guard the NameNode against split-brain. I won't explain split-brain here; Google it.
Does ZooKeeper need to be installed on an odd number of servers?
This is a particularly enduring rumor: in every group, anyone who mentions ZooKeeper will ask it, and many Chinese articles about ZooKeeper assert that the count must be odd. But is it true?
To answer this we have to start from the principle of the Paxos algorithm. Paxos is a leader-election algorithm, and since it is an election, more than half of the participants must agree. One node alone cannot hold a vote, and two nodes are no good either, because the two just keep voting for each other.
Paxos also has prerequisite rules; one of the most basic is that whoever proposes first, those who come after must accept that proposal.
A proposes: I want to elect myself. B, do you agree? B says: I agree to elect A.
Then B proposes: I want to elect myself. A, do you agree? A says: I agree to elect B.
A proposes again: I want to elect myself. B, do you agree? B says: I agree to elect A.
Then B proposes again: I want to elect myself. A, do you agree? A says: I agree to elect B.
...... an election without end.
With each node holding the same number of votes, it goes on like this forever, so two nodes cannot elect a leader; you need at least three to make progress. For example:
A proposes: I want to elect myself. B, do you agree? C, do you agree? B says: I agree to elect A; C says: I agree to elect A. (Note that this is already more than half; in a real-world election it would already have succeeded. But the computer world is strict, so to understand the algorithm, let's keep simulating.)
Then B proposes: I want to elect myself. A, do you agree? A says: more than half have already agreed to elect me, so your proposal is invalid. C says: more than half have already agreed to elect A, so B's proposal is invalid.
Then C proposes: I want to elect myself. A, do you agree? A says: more than half have already agreed to elect me, so your proposal is invalid. B says: more than half have already agreed to elect A, so C's proposal is invalid.
Once the election has produced a leader, the rest become followers and simply obey the leader's orders. There is also a small detail here: in practice it matters who starts first.
This example is not a complete description of the Paxos algorithm or its underlying voting protocol; the algorithm itself and its many follow-on protocols are complex, with many other roles and constraints. The example just makes the idea easier to grasp. The Baidu Encyclopedia introduction to Paxos is simply unreadable: it makes things that could be described simply very complicated, while posing as deeply knowledgeable.
From such an election we can see that ZooKeeper works properly as long as at least 3 servers are running. No matter how many you add, even an even number, only one can be accepted as leader. So, by the same token, the ZooKeeper servers backing HBase need not be an odd number either.
An even number of servers is wasteful for ZooKeeper, but not unusable. The reason is this: if the cluster has 4 servers and 2 go down, you cannot elect a leader; with 5 servers, if 2 go down, the remaining 3 can still elect one. So ZooKeeper does not strictly require an odd number, and roughly from 4 servers up, high availability is basically guaranteed. That should put the "ZooKeeper must be odd" rumor to rest, though its influence may take a long time to fade.
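The availability argument above reduces to simple majority arithmetic. A small Python sketch comparing ensemble sizes, where a quorum means strictly more than half:

```python
def quorum(n):
    """Smallest number of servers that is strictly more than half of n."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many servers may fail while a quorum can still be formed."""
    return n - quorum(n)

for n in [3, 4, 5]:
    print(n, "servers: quorum", quorum(n),
          "- tolerates", tolerated_failures(n), "failure(s)")
# 3 servers: quorum 2 - tolerates 1 failure(s)
# 4 servers: quorum 3 - tolerates 1 failure(s)
# 5 servers: quorum 3 - tolerates 2 failure(s)
```

Note that 4 servers tolerate no more failures than 3 do, which is exactly why even sizes are considered wasteful rather than broken.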
By the way, a simplified version of the Paxos algorithm is covered by a Microsoft patent, and Paxos was proposed by Leslie Lamport, now of Microsoft Research. ZooKeeper is an open-source counterpart of Google's Chubby; both Chubby and ZooKeeper use Paxos-style algorithms to maintain consistency in a distributed system.
Mahout and the rest will be written up later.