A glimpse of Cassandra and Spark data processing

Source: Internet
Author: User





The Apache Cassandra database has recently attracted a lot of interest, mainly because of the availability and performance requirements of modern cloud-based software. So, what is Apache Cassandra? It is a distributed online transaction processing (OLTP) database optimized for high availability and linear scalability.



When deciding whether to use Cassandra, think about the systems you want close to your users: the ones users interact with directly and that must always be available, such as product catalogs, the Internet of Things, medical systems, and mobile applications. For these applications, downtime means lost revenue or even worse outcomes. Netflix was an early adopter of the project, which was open-sourced in 2008, and its contributions and successes helped make the project a big name.

Cassandra became a top-level project of the Apache Software Foundation in 2010 and has been popular ever since. Today, knowing Cassandra can make finding a job a lot easier. It is remarkable that an open source NoSQL technology has grown to stand alongside the giants of enterprise SQL. This raises a question: what makes it so popular?

Thanks to the design pioneered in Amazon's published Dynamo paper, Cassandra can stay online through large-scale hardware and network failures. Because of its peer-to-peer model, there is no single point of failure, so we can survive a rack failure or even a network partition. We can handle a data center failure without compromising the user experience. A distributed system that plans for failure is a distributed system you can rely on, because, honestly, failures happen sooner or later. With Cassandra, we accept that harsh fact of life and build it into the architecture and functionality of the database.
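To make the "no single point of failure" idea concrete, here is a minimal sketch of a Dynamo-style ring in plain Python. It is not Cassandra's implementation: the node names, the MD5-based `token` function, and the `Ring` class are assumptions made for illustration, and a real cluster adds virtual nodes, its own partitioner, and rack-aware replication strategies. The sketch only shows that when each key is copied to several equal peers, losing one of them still leaves reachable copies.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node is equal, and each key lives on several nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sort nodes by ring position so we can walk the ring clockwise.
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        """Return the nodes that hold a copy of `key`."""
        start = bisect.bisect_right(self.ring, (token(key), ""))
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def live_replicas(self, key, down):
        """Replicas of `key` that are still reachable when the nodes in `down` have failed."""
        return [n for n in self.replicas(key) if n not in down]


# Hypothetical five-node cluster; the names are placeholders.
ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("photo:kitten-42")
survivors = ring.live_replicas("photo:kitten-42", down={owners[0]})
print(owners, survivors)
```

With a replication factor of 3, taking down any one owner still leaves two nodes that can serve the key, which is the property the paragraph above describes.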


We can guess what you're thinking: "But I only have a relational database background. Isn't that a very difficult transition?" The answer is yes and no. Data modeling with Cassandra will feel familiar to developers with a relational database background. We use tables to model our data and CQL (the Cassandra Query Language) to query the database. However, unlike SQL, Cassandra supports more complex data structures, such as nested and user-defined types. For example, to store the likes for a kitten photo, we can keep them in a collection stored with the photo itself instead of creating a separate table, which gives us a faster, sequential lookup. Such a statement is very natural in CQL. In our photo table, we need to record the name, the URL, and the people who have liked the photo.
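The photo table described above might look roughly like the following sketch. The table name `photos` and the column names `name`, `url`, and `liked_by` are assumptions, since the text does not give them. The CQL is held in Python strings and is not executed here (running it would need a cluster and a driver); the `Photo` dataclass is only an in-memory stand-in that mirrors how a `set<text>` column behaves.

```python
from dataclasses import dataclass, field

# CQL for a hypothetical photo table; table and column names are made up here.
CREATE_PHOTOS = """
CREATE TABLE photos (
    name     text PRIMARY KEY,
    url      text,
    liked_by set<text>
);
"""

# Adding a like appends to the set stored on the photo's own row.
ADD_LIKE = "UPDATE photos SET liked_by = liked_by + {'alice'} WHERE name = 'kitten';"


@dataclass
class Photo:
    """In-memory stand-in for one row of the table above."""
    name: str
    url: str
    liked_by: set = field(default_factory=set)

    def like(self, user: str) -> None:
        # A set ignores duplicate likes, as a CQL set does.
        self.liked_by.add(user)


kitten = Photo("kitten", "https://example.com/kitten.jpg")
kitten.like("alice")
kitten.like("bob")
kitten.like("alice")  # second like from the same user changes nothing
print(kitten.name, len(kitten.liked_by))
```

Keeping the likes in a collection on the row means a single read by primary key returns the photo and everyone who liked it.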



In a high-performance system, milliseconds can affect user experience and customer retention. Expensive JOIN operations limit our ability to scale out because they add unpredictable network calls. When we denormalize our data so that it can be fetched in as few requests as possible, we take advantage of the falling cost of disk space and in return get predictable, high-performance applications. We embrace denormalization with Cassandra because it offers an attractive tradeoff.
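The tradeoff can be illustrated with a small sketch in plain Python, where each dictionary stands in for a table and each dictionary access stands in for a request to the database. The data, the function names, and the idea of counting lookups are all assumptions chosen for illustration; real request costs depend on the cluster and the data model.

```python
# Normalized layout: likes live in their own table, so reading a photo
# together with its likes needs a second lookup (the "join").
photos = {"kitten": {"url": "https://example.com/kitten.jpg"}}
likes = {"kitten": {"alice", "bob"}}

# Denormalized layout: likes are stored on the photo row itself.
photos_denormalized = {
    "kitten": {"url": "https://example.com/kitten.jpg", "liked_by": {"alice", "bob"}},
}


def read_normalized(name):
    """Return (row, number_of_lookups) for the two-table layout."""
    row = dict(photos[name])        # lookup 1: the photo
    row["liked_by"] = likes[name]   # lookup 2: its likes
    return row, 2


def read_denormalized(name):
    """Return (row, number_of_lookups) for the single-table layout."""
    return photos_denormalized[name], 1  # one lookup returns everything


row_a, cost_a = read_normalized("kitten")
row_b, cost_b = read_denormalized("kitten")
print(row_a == row_b, cost_a, cost_b)
```

Both reads return the same row, but the denormalized layout does so in one request. The price is that the likes are stored with every copy of the photo row, which is the disk space being traded for predictable latency.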

Obviously, we are not limited to counting likes on kitten photos. Cassandra is optimized for high write throughput, which makes it an ideal solution for big data applications that ingest data constantly. Real-time and Internet of Things applications are growing steadily, both in demand and in market performance, and we keep using the data we collect to find ways to improve those applications.

This leads us to the next step. We have covered how to store data in a modern, cost-effective way, but how do we get more power out of it? Specifically, once we have collected the data we need, what do we do with it? How can we effectively analyze hundreds of terabytes of data? How can we get real-time feedback on the information we collect and make decisions in seconds instead of hours? Apache Spark gives us the answer.


Spark is the next step in the big data revolution. Hadoop and MapReduce were revolutionary projects that gave the big data community the opportunity to analyze all the data we had collected. Spark's dramatic increase in performance and significant reduction in code complexity have lifted big data analytics to another level. With Spark, we can run massive batch computations, react quickly with stream processing, make decisions through machine learning, and understand complex relationships through graph traversal. This is not just about providing your customers with a fast and reliable connection to your application (Cassandra already provides that); it is about gaining insight into the data stored in Cassandra, making more informed business decisions, and better meeting customer needs.

