Introduction: The author (Xu Peng) has not been reading the Spark source code for long, and these notes were originally written simply so as not to forget things later. When reading source code, one simple way of thinking helps: strive to find a main thread that runs through the whole. In my opinion, the main thread in Spark is how to process data efficiently and reliably in a distributed computing environment.
After gaining some understanding of Spark's internal implementation, one naturally hopes to apply it in real engineering practice. This brings many new challenges, such as which data store to choose: HBase, MongoDB, or Cassandra. Even once the choice is made, many unexpected problems come up in practice.
Solving the problems encountered during development and deployment quickly also requires considerable depth of Linux knowledge. Fortunately, previous working experience with Linux can be put to full use in the big data field.
I have sorted out some of the problems I ran into and share them here.
1. Cassandra
Choosing a NoSQL database is painful. There are currently nearly 150 NoSQL databases on the market, and picking the one that best fits a business scenario from such a crowded field is not easy.
Fortunately, after much screening, the NoSQL databases that are most widely recognized are HBase, MongoDB, and Cassandra.
In what ways does Cassandra attract a large number of developers? Here's a rough analysis.
1.1 High reliability
Cassandra uses gossip as the communication protocol between nodes in the cluster. All nodes in the cluster have equal status and there is no master or slave, so the exit of any single node does not cause the whole cluster to fail.
Cassandra and HBase both borrowed ideas from Google BigTable to build their own systems, but another important innovation of Cassandra is that it brought the peer-to-peer architecture, which originated in file-sharing systems, into NoSQL.
One characteristic of peer-to-peer is decentralization: all nodes in the cluster have the same status, which largely prevents the exit of a single node from stopping the whole cluster.
In contrast, HBase uses a master/slave approach, which carries the risk of a single point of failure.
1.2 High Scalability
Over time, the cluster becomes too small to store newly added data, and the system has to be expanded. Cassandra scales out: adding new nodes to an existing cluster is very easy to do.
1.3 Eventual consistency
Distributed storage systems all face the CAP theorem: no distributed storage system can satisfy consistency, availability, and partition tolerance all at the same time.
Cassandra gives priority to AP, that is, availability and partition tolerance.
Cassandra provides different consistency levels for write operations and read operations, and users can choose the level that suits each specific scenario.
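As a rough illustration of how read and write consistency levels interact, the sketch below checks the commonly cited rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas answering the read (R) plus the number acknowledging the write (W) exceeds the replication factor (N). The function names and the replication factor of 3 are assumptions made for this example and are not part of Cassandra's API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a QUORUM: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3  # assumed replication factor
    q = quorum(n)
    print("QUORUM size:", q)
    print("QUORUM write + QUORUM read:", read_sees_latest_write(q, q, n))
    print("ONE write + ONE read:", read_sees_latest_write(1, 1, n))
```

With three replicas, writing and reading at QUORUM gives overlapping replica sets, while writing and reading at ONE trades that guarantee for lower latency and higher availability.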
1.4 Efficient Write operations
Write operations are very efficient, which is a great advantage for applications that handle very large volumes of real-time data.
Read performance depends on the situation:
If a single read specifies a key value, the result is returned quickly. A range query may target data stored on multiple nodes, so several nodes have to be queried and the result comes back slowly. Reading the full table is very inefficient.
1.5 Structured Storage
Cassandra is a column-oriented database, and its learning curve is relatively gentle for developers coming from an RDBMS.
Cassandra also provides the friendlier CQL language, which is similar to SQL.
1.6 Simple maintenance
From a system maintenance point of view, Cassandra's peer-to-peer architecture keeps maintenance operations simple. Steps like adding nodes, removing nodes, and even adding new data centers are straightforward.
Resources
1. http://cassandra.apache.org
2. http://www.datastax.com/doc
3. http://planetcassandra.org/documentation/
2. Cassandra Data Model
2.1 Single Table Query
2.1.1 Single table primary key query
Suppose we build a personal information database with the ID card number as the primary key, and queries only ever look up by ID card number. The table can be designed as:
CREATE TABLE person (userid text primary key, fname text, lname text, age int, gender int);
The first column name in the primary key is the partition key. In other words, the hash of the partition key decides which partition a record is stored in. If, unluckily, a single-column primary key makes all the hash results fall in the same partition, that partition will fill up.
The solution to this problem is to use a composite partition key so that the data is distributed as evenly as possible across the nodes.
For example, you might set (userid, fname) as the composite partition key. The corresponding table creation statement can be written as
CREATE TABLE person (userid text, fname text, lname text, gender int, age int, PRIMARY KEY ((userid, fname), lname)) WITH CLUSTERING ORDER BY (lname DESC);
A brief explanation of the meaning of PRIMARY KEY ((userid, fname), lname):
(userid, fname) is called the composite partition key
lname is the clustering column
((userid, fname), lname) is called the compound primary key
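The effect of the partition key on data placement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 plus a modulo stands in for Cassandra's partitioner and token ring, the helper names are hypothetical, and the sample data deliberately gives userid only three distinct values to show what happens when a partition key has few distinct values.

```python
import hashlib
from collections import Counter


def node_for(partition_key: tuple, num_nodes: int) -> int:
    """Map a partition key to a node index by hashing it.

    MD5 plus modulo is a stand-in for Cassandra's partitioner and token ring.
    """
    raw = "\x00".join(partition_key).encode("utf-8")
    token = int.from_bytes(hashlib.md5(raw).digest()[:8], "big")
    return token % num_nodes


def rows_per_node(keys, num_nodes: int) -> Counter:
    """Count how many rows land on each node."""
    return Counter(node_for(k, num_nodes) for k in keys)


if __name__ == "__main__":
    # Hypothetical rows: (userid, fname) pairs with only 3 distinct userids.
    rows = [(f"id{i % 3}", f"name{i}") for i in range(300)]
    # Partitioning on userid alone: 3 distinct keys, so at most 3 partitions.
    print(rows_per_node([(uid,) for uid, _ in rows], num_nodes=4))
    # Partitioning on (userid, fname): 300 distinct keys spread over the nodes.
    print(rows_per_node(rows, num_nodes=4))
```

With the single-column key every row for a given userid lands on the same node, whereas the composite key produces many more distinct hash inputs and therefore a more even spread.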
2.1.2 Single table non-primary key query
If you want to query people with the same name in table person, you must create an index for fname or the query will be slow.
CREATE INDEX ON person (fname);
Currently each Cassandra index covers only a single column; a joint index over multiple columns is not allowed.
2.2 Multi-table join query
Cassandra does not support join queries, nor does it support grouping and aggregation operations.
Does that mean that Cassandra just looks beautiful and doesn't really solve the real problem? The answer is obviously no, as long as you don't stick to the idea of an RDBMS to solve the problem.
For example, suppose we have two tables: one (department) records the company's department information, and the other (employee) records the company's employee information. Every employee obviously belongs to a department, and we want to list all the employees of each department. With an RDBMS, the SQL statement can be written as:
SELECT * FROM employee e, department d WHERE e.depid = d.depid;
To achieve the same effect with Cassandra, you must create an additional table (dept_empl), alongside the employee table and the department table, to record the employee information for each department.
Create Table Dept_empl (deptid text,
By now you have probably understood that Cassandra achieves efficient queries through data redundancy, converting a join query into a single-table operation.
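The idea of trading a join for a redundant lookup table can be sketched with plain Python dictionaries. This is a minimal in-memory illustration, not Cassandra code: the two dictionaries stand in for the employee and dept_empl tables, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict

# In-memory stand-ins for two Cassandra tables (hypothetical data).
employee = {}                  # empid -> employee row
dept_empl = defaultdict(dict)  # deptid -> {empid -> employee name}


def add_employee(empid: str, name: str, deptid: str) -> None:
    """Write the same fact to both tables, as a denormalized model requires."""
    employee[empid] = {"name": name, "deptid": deptid}
    dept_empl[deptid][empid] = name


def employees_in(deptid: str) -> list:
    """Answer 'who works in this department?' with a single lookup, no join."""
    return sorted(dept_empl.get(deptid, {}).values())


if __name__ == "__main__":
    add_employee("e1", "Alice", "d1")
    add_employee("e2", "Bob", "d1")
    add_employee("e3", "Carol", "d2")
    print(employees_in("d1"))  # ['Alice', 'Bob']
```

The cost is that every write touches two structures, which is the price paid for answering the per-department question with one read.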
2.3 Grouping and aggregation
The GROUP BY, MAX, and MIN operations that are common in an RDBMS do not exist in Cassandra.
How do you build a data model if you want to group everyone's information by name?
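CREATE TABLE fname_person (fname text, userid text, PRIMARY KEY (fname, userid));
Making userid a clustering column keeps one row per person inside each fname partition, so people who share a name do not overwrite each other.
2.4 Subquery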
Cassandra does not support subqueries. The following illustration shows an example of a subquery in MySQL:
To implement this with Cassandra, the redundant information must be stored in additional tables.
CREATE TABLE office_empl (officecode text, country text, lastname text, firstname text, PRIMARY KEY (...));
CREATE INDEX ON office_empl (country);
2.5 Summary
In general, when building a Cassandra data model, you first need to be clear about how the data will be read, and then use a denormalized design to achieve fast reads. The principle is to trade space for time.
Resources
http://planetcassandra.org/blog/cql-cassandra-query-language/
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
PS: The third section, "Using Spark to enhance Cassandra's real-time analysis", follows in the next part.