Why SQL is beating NoSQL, and what this means for the future of data: http://geek.csdn.net/news/detail/238939
Translator's note: After years out of the spotlight, SQL is making a comeback. Why, and what does this mean for the data community? This article offers an analysis. The translation follows.
Ever since computers made it possible, we have been collecting data at an exponentially growing rate, which places ever-increasing demands on data storage, processing, and analysis technologies. Over the past decade, software developers concluded that SQL could not keep up with these demands and abandoned it, and NoSQL rose in its place: MapReduce, BigTable, Cassandra, MongoDB, and so on.
Now, however, SQL is making a comeback. All the major cloud providers offer popular managed relational database services: for example, Amazon RDS, Google Cloud SQL, and Azure Database for PostgreSQL (launched by Azure this year). In Amazon's own words, Aurora, its PostgreSQL- and MySQL-compatible database, has been "the fastest growing service in AWS history." SQL interfaces on top of Hadoop and Spark continue to flourish. And just last month, Kafka launched SQL support.
In this article, we'll look at why SQL is coming back now, and what this means for the future of data engineering and analysis.
Chapter 1: A New Hope
To understand why SQL is making a comeback, let's first look at why it was designed in the first place.
Like all good stories, ours begins in the 1970s.
Our story begins at IBM Research in the early 1970s, where the relational database was born. The query languages of that era relied on complex mathematical logic and notation. Donald Chamberlin and Raymond Boyce, two recently minted PhDs, were impressed by the relational data model but saw that the query language would be a major bottleneck to its adoption. So they set out to design a new query language that would (in their own words) "be more accessible to users without formal training in mathematics or computer programming."
A comparison of the two query languages (source)
Think about that for a moment. Before the Internet, before the PC, when the programming language C was first being introduced to the world, two young computer scientists realized that "much of the success of the computer industry depends on developing a class of users other than trained computer specialists." What they wanted was a query language as easy to read as English, one that would also encompass database administration and manipulation.
The result was SQL, first introduced to the world in 1974. Over the following decades, SQL would prove enormously popular. As relational databases such as System R, Ingres, DB2, Oracle, SQL Server, PostgreSQL, and MySQL took over the software industry, SQL became the preeminent language for interacting with a database, the lingua franca of an increasingly crowded and fiercely competitive ecosystem.
(Sadly, Raymond Boyce never had the chance to witness SQL's success. He died of a brain aneurysm one month after giving one of the earliest presentations on SQL, at just 26 years old, leaving behind a wife and a young daughter.)
For a while, it seemed that SQL had fulfilled its mission. But then the Internet happened.
Chapter 2: NoSQL Strikes Back
While Chamberlin and Boyce were developing SQL, they had no idea that a second group of engineers in California was working on another budding project, one that would later proliferate widely and threaten SQL's existence. That project was ARPANET, born on October 29, 1969.
Some of the creators of ARPANET, which eventually grew into today's Internet (source)
SQL was actually doing fine until 1989, when another engineer showed up and invented the World Wide Web.
The physicist who invented the Web (source)
Like a weed, the Internet and the Web proliferated, upending our world. For the data community, though, they created a particular headache: new sources generating data at far higher volumes and velocities than ever before.
As the Internet continued to grow, the software community found that the relational databases of the day couldn't handle this new load. There was a disturbance in the Force, as if a million databases cried out and were suddenly overloaded.
Then two Internet giants made breakthroughs, developing their own non-relational distributed systems to cope with this new onslaught of data: Google published MapReduce (2004) and BigTable (2006), and Amazon released Dynamo (2007). These seminal papers led to the emergence of even more non-relational databases, including Hadoop (based on the MapReduce paper, 2006), Cassandra (inspired by the BigTable and Dynamo papers, 2008), and MongoDB (2009). Because these new systems were largely written from scratch, they eschewed SQL as well, leading to the rise of the NoSQL movement.
The software developer community embraced NoSQL eagerly, arguably far more broadly than its original authors intended. The reasons are easy to understand: NoSQL was new and shiny; it promised scale and power; it seemed like the fast path to engineering success. But then the problems started to appear.
A classic software developer seduced by NoSQL. Don't be this guy.
Developers soon discovered that not having SQL was actually quite limiting. Each NoSQL database offered its own unique query language, which meant more languages to learn (and to teach to your colleagues), greater difficulty connecting these databases to applications, tight coupling between application code and the database, and a missing third-party ecosystem that forced companies to build their own operational and visualization tools.
Being new, these NoSQL languages were also not fully developed. Relational databases had benefited from years of work adding necessary features to SQL (such as JOINs); the immaturity of the NoSQL languages meant that more complexity had to live at the application level. The lack of JOINs also led to denormalization, which in turn led to data bloat and rigidity.
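To make the cost of missing JOINs concrete, here is a minimal sketch using hypothetical users and orders tables (not any particular product's schema): in SQL, the data stays normalized and a single JOIN answers the question; without JOINs, the same answer means either copying user fields into every order row, or re-implementing the lookup and aggregation in application code.

```sql
-- Normalized schema: customer details live in exactly one place.
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    email     TEXT NOT NULL
);

CREATE TABLE orders (
    order_id  INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES users (user_id),
    amount    NUMERIC NOT NULL,
    placed_at TIMESTAMP NOT NULL
);

-- One JOIN answers "total spend per customer".
-- Without JOIN support, each order row would have to carry a copy of
-- the user's name (denormalization), or the lookup and aggregation
-- would have to be re-implemented in application code.
SELECT u.name, SUM(o.amount) AS total_spend
FROM orders AS o
JOIN users  AS u ON u.user_id = o.user_id
GROUP BY u.name;
```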
Some NoSQL databases added their own "SQL-like" query languages, such as Cassandra's CQL. But this often made the problem worse. Using an interface that is almost, but not quite, identical to something much more common actually creates more mental friction: engineers don't know what is supported and what isn't.
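As a rough illustration of that friction (a hypothetical users table, purely for the sake of the example): the query below is perfectly ordinary SQL, yet CQL refuses it out of the box, because filtering on a column that is neither part of the primary key nor indexed requires an explicit opt-in, and some familiar SQL features are simply absent.

```sql
-- Hypothetical CQL table, keyed on user_id (the partition key).
CREATE TABLE users (
    user_id int PRIMARY KEY,
    name    text,
    email   text
);

-- Looks like everyday SQL, and would work in any relational database:
SELECT * FROM users WHERE email = 'ada@example.com';
-- CQL rejects this query as written: filtering on a non-key,
-- non-indexed column requires either ALLOW FILTERING (a full scan)
-- or a secondary index on email. JOINs, meanwhile, do not exist
-- in the language at all.
```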
SQL-like query languages are like the Star Wars Holiday Special. Accept no imitations.
(And always avoid the Star Wars Holiday Special.)
Some in the community recognized the problems with NoSQL early on (for example, DeWitt and Stonebraker in 2008). Over time, through hard-won personal experience, more and more software developers came to agree with them.
Chapter 3: The Return of SQL
Having first been seduced by the dark side, the software community began to see the light, and SQL staged a heroic comeback.
First came the SQL interfaces on top of Hadoop (and later Spark), leading the industry to retroactively redefine NoSQL as "Not Only SQL."
Then came the rise of NewSQL: new scalable databases that fully embraced SQL. H-Store (published in 2008) from MIT and Brown researchers was one of the first scale-out OLTP databases. Google again led the way with its first Spanner paper (published in 2012, with authors including the original MapReduce authors), describing a geo-replicated database with a SQL interface, followed by other pioneers such as CockroachDB (2014).
At the same time, the PostgreSQL community began to revive, adding key improvements such as a JSON data type (2012) and a potpourri of new features in PostgreSQL 10: better native support for partitioning and replication, full-text search support for JSON, and more (with the release slated for later this year). Other companies, such as CitusDB (2016) and our own TimescaleDB (released this year), found new ways to scale PostgreSQL for specialized data workloads.
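To give a flavor of why the JSON support matters, here is a minimal sketch with a hypothetical events table (not specific to TimescaleDB or CitusDB): PostgreSQL lets you store semi-structured documents in a JSONB column, index them, and still query them with ordinary SQL.

```sql
-- Hypothetical table mixing relational columns with a JSONB document.
CREATE TABLE events (
    event_id   BIGSERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload    JSONB NOT NULL
);

-- A GIN index makes containment queries on the JSON payload fast.
CREATE INDEX events_payload_idx ON events USING GIN (payload);

-- Ordinary SQL plus JSON operators: count daily mobile signups.
SELECT created_at::date AS day, count(*) AS signups
FROM events
WHERE payload @> '{"type": "signup"}'
  AND payload ->> 'device' = 'mobile'
GROUP BY created_at::date
ORDER BY day;
```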
In fact, our own journey developing TimescaleDB closely mirrors the industry's path. Early internal builds of TimescaleDB featured our own SQL-like query language, "ioQL." Yes, we too were unable to resist the temptation of the dark side: building our own query language felt powerful. But while it seemed like the easy path, we soon realized it would require a lot more work. We also found ourselves constantly looking up the proper syntax for queries that we could already express in SQL.
One day we realized that building our own query language made no sense; the key was to embrace SQL. That turned out to be one of the best design decisions we have made. Immediately a whole new world opened up. Today, even though our database is only 5 months old, users can run it in production and get all kinds of things out of the box: visualization tools (Tableau), connectors to common ORMs, a variety of tooling and backup options, an abundance of online tutorials and syntax explanations, and more.
Believe in Google, and gain eternal life
Google has been leading the data engineering and infrastructure industry for more than a decade, and we should pay close attention to what it is doing.
Take a look at Google's second major Spanner paper, released just four months ago (Spanner: Becoming a SQL System, May 2017), and you'll find that it bolsters our own findings.
For example, Google began by building on top of BigTable, but then found that the lack of SQL created problems (emphasis in the quotes below is ours):
While these systems provided some of the benefits of a database system, they lacked many traditional database features that application developers often rely on. A key example is a robust query language, meaning that developers had to write complex code to process and aggregate the data in their applications. As a result, we decided to turn Spanner into a full-featured SQL system, with query execution tightly integrated with the other architectural features of Spanner (such as strong consistency and global replication).
Later in the paper, they further capture the rationale for the transition from NoSQL to SQL:
The original API of Spanner provided NoSQL methods for point lookups and range scans of individual and interleaved tables. While the NoSQL methods provided a simple path to launching Spanner, and continue to be useful in simple retrieval scenarios, SQL has provided significant additional value in expressing more complex data access patterns and pushing computation to the data.
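To illustrate what "pushing computation to the data" buys, here is a generic sketch with a hypothetical page_views table (not Spanner's actual API): with only point lookups and range scans, every matching row has to be shipped to the application and aggregated there; with SQL, the database does the aggregation and returns only the answer.

```sql
-- The point-lookup/range-scan style would ship every matching row to
-- the application and aggregate it there. SQL pushes the computation
-- to the database, which returns only the final result.
SELECT country, count(*) AS views
FROM page_views
WHERE viewed_at >= '2017-08-01'
  AND viewed_at <  '2017-09-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```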
The paper also describes how the adoption of SQL does not stop at Spanner, but extends across the rest of Google, where multiple systems now share a common SQL dialect:
Spanner's SQL engine shares a common SQL dialect, called "Standard SQL," with several other systems at Google, including internal systems such as F1 and Dremel (among others), and external systems such as BigQuery…
For users within Google, this lowers the barrier to working across systems. A developer or data analyst who writes SQL against a Spanner database can transfer their understanding of the language to Dremel without worrying about subtle differences in syntax, NULL handling, and so on.
The success of this approach speaks for itself. Spanner has become the "source of truth" for major Google systems, including AdWords and Google Play, while "potential Cloud customers are overwhelmingly interested in using SQL."
Considering that Google helped launch the NoSQL movement in the first place, it is remarkable that it is embracing SQL today. (Leading some to wonder recently: "Did Google send the big data industry on a 10-year head fake?")
What this means for the future of data: SQL as the narrow waist
In computer networking, there is a concept called the "narrow waist."
This idea emerged to solve a key problem: on any given networked device, imagine a stack with hardware layers at the bottom and software layers on top. A variety of networking hardware can exist in the middle, and likewise a variety of software and applications on top. One needs a way to ensure that no matter the hardware, the software can still connect to the network, and that no matter the software, the networking hardware knows how to handle network requests.
In networking, the role of the narrow waist is played by the Internet Protocol (IP), acting as a common interface between the lower-level networking protocols designed for local-area networks and the higher-level application and transport protocols. (Here is one good explanation.) And, in a broad oversimplification, this common interface became the lingua franca of computers, allowing networks to interconnect and devices to communicate, so that this "network of networks" could grow into today's rich and varied Internet.
We believe that SQL has become the narrow waist of data analysis.
We live in an era in which data is becoming "the world's most valuable resource" (The Economist, May 2017). As a result, we have seen a Cambrian explosion of specialized databases (OLAP, time-series, document, graph, etc.), data processing tools (Hadoop, Spark, Flink), data buses (Kafka, RabbitMQ), and so on. We also have more and more applications that need to rely on this data infrastructure, whether third-party data visualization tools (Tableau, Grafana, PowerBI, Superset), web frameworks (Rails, Django), or custom data-driven applications.
Like networking, we have a complex stack, with infrastructure at the bottom and applications on top. Typically, we end up writing a lot of glue code to make this stack work. But glue code can be brittle: it needs to be maintained and tended to.
What we need is a common interface that allows the various parts of this stack to communicate with one another, ideally something already standardized across the industry, so that friction between the different layers is minimized.
That is the power of SQL. Like IP, SQL is a common interface.
But SQL is in fact much more than IP, because data also needs to be analyzed by humans. And true to the goal its creators originally set for it, SQL is readable.
Is SQL perfect? No, but it is the language most of us in the community already know. And while some engineers are already working on more natural-language-oriented interfaces, what will those systems ultimately connect to? SQL.
So there is one more layer at the very top of the stack. And that layer is us, humans.
SQL has returned
SQL has returned. Not just because writing glue code to stitch together NoSQL tools is painful. Not just because learning a myriad of new languages is hard. Not just because standards bring all sorts of advantages.
But also because the world is filled with data. It surrounds us, and binds us. At first, we relied on our human senses and nervous systems to process it. Now our software and hardware systems are becoming smart enough to help us. And as we gather more data to make better sense of our world, the complexity of the systems we need to store, process, analyze, and visualize that data will only continue to grow.
Master data scientist Yoda (source)
We can either live in a world of brittle systems and a million different interfaces, or we can embrace SQL once again, and restore balance to the Force.