This overview covers the data storage technologies available in big data projects, focusing on Couchbase and Elasticsearch: how to use them and how they differ. First, it helps to understand the different technologies in the NoSQL world.
NoSQL
Relational databases were the choice of the past, and for many developers and DBAs almost the only choice, for traditional three-tier architectures. There are many reasons for this: data modeling methods, the query language and the way to interact with data, ensuring data consistency across a deployment, and the ability to serve complex applications.
However, a relational database is not the solution to every data storage problem, and that is why NoSQL emerged. NoSQL provides a new approach instead of the standard SQL-oriented paradigm.
NoSQL technologies are built for high scalability, and many of them are highly distributed and offer high performance. Most of the time they complement an existing RDBMS architecture, for example as a cache server, search engine, unstructured store, or store for volatile information. They fall into four main categories:
Key/value
Column Storage
Document-oriented storage
Graph Storage
Now dive into each of these technologies and choose the one that works best for your scenario.
Key/value
The first and earliest kind of NoSQL data store is key/value. Like a dictionary, it looks up a value by its key. It is typically used for basic information that requires high performance, such as session information that needs to be read and written quickly. Key/value stores are very efficient and usually highly scalable.
Key/value stores are also often used for queueing, to make sure data is not lost, for example in logging architectures or search engine indexing architectures. Redis and Riak KV are well-known key/value data stores. Redis is the more widely used; it is an in-memory key/value store with optional persistence. Redis is often used in web applications, such as Node.js or PHP web applications, to store session-related data; it can serve thousands of session retrievals per second without degrading performance. Another typical scenario is queueing: Redis sits between Logstash and Elasticsearch and buffers data before it is indexed in Elasticsearch for querying.
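As an illustration of the dictionary-like behavior described above, here is a minimal in-memory sketch of a session store with optional expiry. It is not Redis and does not use any Redis API; the `SessionStore` class, its methods, and the key names are hypothetical and exist only for this example.

```python
import time


class SessionStore:
    """In-memory key/value store with optional expiry, loosely modeled on a session cache."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None, now=None):
        now = time.time() if now is None else now
        expires_at = now + ttl_seconds if ttl_seconds is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and now >= expires_at:
            del self._data[key]  # lazily drop expired sessions
            return None
        return value


store = SessionStore()
store.set("session:42", {"user": "jane"}, ttl_seconds=30, now=1000.0)
fresh = store.get("session:42", now=1010.0)    # still within the lifetime
expired = store.get("session:42", now=1031.0)  # past the 30-second lifetime
```

Reads and writes are single dictionary operations, which is the property that makes this kind of store a natural fit for session data.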
Column
Column-oriented storage is used when very large numbers of records must be stored and the limits of key/value stores are reached. Column store technologies may seem hard to grasp for engineers coming from the RDBMS world, but they are actually quite simple. An RDBMS stores data by row, whereas a column store stores it by column. The main benefit of a column database is fast access to massive amounts of data. In an RDBMS, a row is stored contiguously on disk, and different rows may be stored in different places on disk, which makes reading one column across many rows more expensive; in a column database, all the data of a column is stored contiguously.
For example, consider looking up the titles of all blog posts in an RDBMS. With millions of records this requires a lot of I/O operations, whereas in a column database such a query needs only one access. Such databases are very handy for retrieving massive amounts of data from a specific column family, but the trade-off is a lack of flexibility. The most widely used column store is Google Cloud Bigtable, while the most popular open source ones are Apache HBase and Cassandra.
Another benefit of column store databases is ease of scaling: they scale well as the volume of stored data grows. This is why they are mainly used to keep non-volatile, long-lived information.
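The difference between the two layouts can be sketched with plain Python data structures. This only illustrates the access pattern, not how HBase, Cassandra, or Bigtable store data on disk; the blog records are made up for the example.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "title": "Intro to NoSQL", "author": "jane", "body": "..."},
    {"id": 2, "title": "Column stores", "author": "john", "body": "..."},
    {"id": 3, "title": "Graph databases", "author": "jane", "body": "..."},
]

# Column-oriented layout: all values of one column are stored together.
columns = {
    "id": [1, 2, 3],
    "title": ["Intro to NoSQL", "Column stores", "Graph databases"],
    "author": ["jane", "john", "jane"],
    "body": ["...", "...", "..."],
}

# Row layout: every record has to be visited to collect the titles.
titles_from_rows = [row["title"] for row in rows]

# Column layout: the titles are already one contiguous sequence.
titles_from_columns = columns["title"]
```

Both expressions yield the same titles; the column layout simply gets them in a single lookup instead of touching every record.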
Document
A column store is not the best fit for structured data that contains deeply nested structures; that scenario calls for document-oriented storage. Data is still stored as key/value pairs, but everything is packed into what is called a document. A document relies on a structure or encoding such as XML, but most of the time it is JSON (JavaScript Object Notation).
Although document databases are useful for storing and representing structured data, they have a weak side, especially when it comes to interacting with the data. They basically have to traverse the whole document, for example to read one specific field, and this traversal can affect performance.
When you need to store nested information, you can use a document-based database. For example, consider how to express an account in an app, presumably with the following information:
Basic information: Name, birthday, photo, URL, creation date, etc.
Complex information: Addresses, authentication methods (password, Facebook, and other third-party authentication), interests, etc.
This is why NoSQL document databases are often used in web applications: nested objects are easy to represent, and because documents are JSON, they integrate seamlessly with front-end JavaScript technology.
The most widely used document databases are MongoDB, Couchbase, and Apache CouchDB. They are easy to install and start, well documented, and scalable, and above all they are an obvious choice for modern web applications.
Couchbase is the technology chosen here to store, query, and organize data. The choice is based on performance benchmarks indicating that, at high operation throughput, its latency is lower than MongoDB's.
It is also worth noting that Couchbase today is a combination of CouchDB and Memcached, and from a support point of view it makes more sense to use Couchbase.
Graph
A graph database is fundamentally different from the other types of database. It uses a different paradigm to represent data: a tree-like structure of nodes and edges connected to each other through relationships. These databases rose with social networks, for example to represent a user's network of friends, their friends' relationships, and so on. With other types of data storage it is possible to store a user's friend relationships in a document, but storing the friends of friends quickly becomes very complex. With a graph database it is simple: create a node for each friend, connect the nodes through relationships, and traverse the graph to the depth and scope the query needs.
The best-known graph database is Neo4j. As mentioned, the main usage scenario is handling complex relationship information, such as connections between entities; graph databases are also used for classification scenarios.
Figure 2-1 shows how three entities are connected in a graph database.
Figure 2-1. Graph Database Example
The two account nodes in the diagram, Jane and John, are connected by edges that define their relationship: they have known each other since a given date. A group node is connected to both account nodes, showing that Jane and John have been members of the football team since a given date.
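The structure in Figure 2-1 can be sketched with plain Python collections. This is not Neo4j or its query language; the relationship names (`KNOWS`, `MEMBER_OF`), the `neighbors` helper, and the date placeholders are hypothetical.

```python
# Nodes keyed by name; each edge is (source, relationship, target, properties).
nodes = {
    "Jane": {"type": "account"},
    "John": {"type": "account"},
    "Football team": {"type": "group"},
}
edges = [
    ("Jane", "KNOWS", "John", {"since": "hypothetical-date-1"}),
    ("Jane", "MEMBER_OF", "Football team", {"since": "hypothetical-date-2"}),
    ("John", "MEMBER_OF", "Football team", {"since": "hypothetical-date-2"}),
]


def neighbors(name, relationship):
    """Return nodes linked to `name` by `relationship`, ignoring edge direction."""
    found = set()
    for source, rel, target, _props in edges:
        if rel != relationship:
            continue
        if source == name:
            found.add(target)
        elif target == name:
            found.add(source)
    return found


friends_of_john = neighbors("John", "KNOWS")
team_members = neighbors("Football team", "MEMBER_OF")
```

Answering "who is on the team" or "who does John know" becomes a walk over edges rather than a search through nested documents.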
Using NoSQL in the Use Case
For the use case here, you first need a document-oriented NoSQL database that stores, as structured JSON documents, the data held in a relational database. As mentioned earlier, a traditional RDBMS stores data across many related tables, which makes retrieving a complete object complex and inefficient. Figure 2-2 shows an example where an account is split across multiple tables.
Figure 2-2. Account tables
To get all the account information, you basically need to join two or three tables. Now consider that the application has to do this for every connection of every user, and that these connections serve different business logic. In the end, you just want a view of the account itself. What if an API could return the whole account view as a single document when passed an account ID, such as the following?
{
  "ID": "Account_identifier",
  "Email": "[email protected]",
  "FirstName": "Account_firstname",
  "LastName": "Account_lastname",
  "Birthdate": "Account_birthdate",
  "Authentication": [
    { "token": "authentication_token_1", "source": "authentication_source_1", "created": "12-12-12" },
    { "token": "authentication_token_2", "source": "authentication_source_2", "created": "12-12-12" }
  ],
  "Address": [
    { "street": "address_street_1", "City": "address_city_1", "Zip": "address_zip_1", "Country": "address_country_1", "created": "12-12-12" }
  ]
}
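To make the join-and-nest step concrete, here is a minimal sketch that assembles one nested account document from relational tables, using Python's built-in sqlite3 module as a stand-in RDBMS. The table names, column names, and sample values are assumptions for illustration and do not reproduce the schema in Figure 2-2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE account (id TEXT PRIMARY KEY, email TEXT, firstname TEXT, lastname TEXT);
    CREATE TABLE authentication (account_id TEXT, token TEXT, source TEXT);
    CREATE TABLE address (account_id TEXT, street TEXT, city TEXT);
    INSERT INTO account VALUES ('a1', 'jane@example.com', 'Jane', 'Doe');
    INSERT INTO authentication VALUES ('a1', 'token_1', 'password');
    INSERT INTO authentication VALUES ('a1', 'token_2', 'facebook');
    INSERT INTO address VALUES ('a1', '1 Main St', 'Springfield');
""")


def build_account_document(conn, account_id):
    """Assemble one nested document from the three hypothetical tables."""
    account = conn.execute(
        "SELECT * FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    if account is None:
        return None
    document = dict(account)
    for table in ("authentication", "address"):  # fixed names, not user input
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE account_id = ?", (account_id,)
        ).fetchall()
        # Drop the foreign key: nesting already expresses the relationship.
        document[table] = [
            {k: row[k] for k in row.keys() if k != "account_id"} for row in rows
        ]
    return document


document = build_account_document(conn, "a1")
```

Once built, the document can be stored under the account ID, so a later read needs a single lookup instead of several joins.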
The benefits are obvious: by keeping a JSON representation of each entity, the data can be accessed faster and more easily. Going further, this approach can be generalized: serve all read operations from the NoSQL database and send all writes (create, update, delete) to the RDBMS. However, some logic must be implemented to keep the NoSQL store synchronized with the RDBMS and to create the document from the relational database if it is not found in the cache.
Why keep the RDBMS when NoSQL creates documents efficiently and scalably? Because replacing it is not the real goal of the application, and a big-bang migration is best avoided. Suppose the RDBMS was already in place, and the NoSQL store is being integrated because of the RDBMS's lack of flexibility. The aim is to take the best of both technologies, specifically the data consistency of the RDBMS and the scalability of NoSQL.
Besides, this is only a simple query example; the goal is to go further, for example with full-text search on any field of the document. How would you do that in a relational database? With indexes, but would you index every column of every table? In practice that is not feasible, whereas it is easy with a NoSQL technology such as Elasticsearch. Before delving into such a NoSQL caching system, take a look at how to use the Couchbase document database, and then review the limitations that lead to the switch to Elasticsearch.
The scalable architecture first relies on Couchbase, but because of some serious Couchbase limitations, the architecture is first completed with Elasticsearch before moving to it entirely.
Couchbase Introduction
Couchbase is an open source, document-oriented database with a flexible data model; it is performant, scalable, and suited to use cases such as this one, where relational database data is stored as structured JSON documents. Most NoSQL technologies share a similar architecture. This section first looks at how the Couchbase architecture is organized, then describes the naming conventions in Couchbase, then goes into how stored data is queried, and finally discusses cross data center replication.
Architecture
Couchbase is a true shared-nothing architecture, which means there is no single point of failure, because every node in the cluster is self-sufficient and independent. This is how distributed technologies work: nodes do not share any memory or disk storage. Documents are stored in Couchbase as JSON or binary, replicated across the cluster, and organized into units called buckets. A bucket can be scaled according to storage and access needs by setting its RAM cache, and the number of replicas can also be set for resiliency. Under the hood, a bucket is split into smaller units called vBuckets, which are actually data partitions. Couchbase uses a cluster map to map partitions to servers. A Couchbase server can replicate a bucket up to three times within a cluster, and each Couchbase server manages a subset of the active or replica vBuckets. This is how resiliency works in Couchbase: every time a document is indexed it is replicated, and if a node in the cluster goes down, the cluster promotes a replica partition to active to ensure continuous service.
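The partition lookup described above can be sketched as follows. This is a simplified model, not Couchbase's actual client code: the hashing is reduced to a plain CRC32 modulo, the vBucket count of 1024 is assumed, the node names are hypothetical, and each vBucket gets exactly one replica.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed partition count for this sketch
SERVERS = ["node-a", "node-b", "node-c"]  # hypothetical node names

# Cluster map: vBucket id -> (active server, replica server).
cluster_map = {
    vb: (SERVERS[vb % len(SERVERS)], SERVERS[(vb + 1) % len(SERVERS)])
    for vb in range(NUM_VBUCKETS)
}


def vbucket_for(key):
    """Simplified hashing: CRC32 of the key modulo the number of vBuckets."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def server_for(key, failed=frozenset()):
    """Return the server that should handle `key`, promoting the replica on failure."""
    active, replica = cluster_map[vbucket_for(key)]
    if active not in failed:
        return active
    if replica not in failed:
        return replica
    raise RuntimeError("no live copy of this vBucket")


key = "account::42"
primary = server_for(key)
fallback = server_for(key, failed={primary})
```

The client only needs the key and the cluster map to find the right server, and a failed node is handled by switching to the replica entry of the same vBucket.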
Figure 2-3 illustrates that there is only one active copy of the data in the cluster, alongside one or more replicas.
Figure 2-3. Couchbase Active document and replicas
From the client's point of view, if you use a smart client (Java, C, C++, Ruby, and so on), the client is connected to the cluster map and sends each request from the application to the appropriate server. In terms of interaction, there is one very important point: document operations are asynchronous by default. This means that when you update a document, Couchbase does not update it on disk immediately. Figure 2-4 presents the actual process.
Figure 2-4. Couchbase Data Flow
As Figure 2-4 shows, the smart client connects to a Couchbase server instance and first writes the document asynchronously to the managed cache. The client gets a response immediately rather than blocking until the end of the data flow process, although this behavior can be changed at the client level so that the client waits for the write to complete. Then the document is put in the intra-cluster write queue, so it is replicated across the cluster; after that, the document is put in the disk write queue to be persisted on the related node. If multiple clusters are deployed, the Cross Data Center Replication (XDCR) feature can propagate the changes to other clusters, which can be located in different data centers.
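The write path in Figure 2-4 can be modeled with a toy class. This is a sketch of the sequence only (cache first, queues afterwards), not of Couchbase internals; the `Node` class and its attribute names are hypothetical.

```python
from collections import deque


class Node:
    """Toy model of one node: RAM cache plus replication and disk write queues."""

    def __init__(self):
        self.cache = {}
        self.replication_queue = deque()
        self.disk_queue = deque()
        self.disk = {}

    def write(self, doc_id, document):
        # 1. Store in the managed cache and acknowledge immediately.
        self.cache[doc_id] = document
        # 2. Queue the change for replication and for persistence.
        self.replication_queue.append(doc_id)
        self.disk_queue.append(doc_id)
        return "acknowledged"

    def drain_disk_queue(self):
        """Persist queued documents; returns how many were written."""
        written = 0
        while self.disk_queue:
            doc_id = self.disk_queue.popleft()
            self.disk[doc_id] = self.cache[doc_id]
            written += 1
        return written


node = Node()
status = node.write("account::42", {"firstname": "Jane"})
persisted_before = "account::42" in node.disk  # only in cache so far
drained = node.drain_disk_queue()
persisted_after = "account::42" in node.disk   # persisted once the queue drains
```

The gap between the acknowledgement and the drain is exactly the window in which the document exists in memory but not yet on disk.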
Couchbase has its own way of querying data. A document can be queried simply by its document ID, but the real power of Couchbase lies in the view feature. In Couchbase, there is a secondary index called the design document, which is created within a bucket. A bucket can contain multiple types of documents; for example, the bucket of a simple e-commerce application might contain the following:
- Account
- Product
- Cart
- Orders
- Bills
Couchbase uses design documents to segment them logically. A bucket can contain multiple design documents, and each design document can contain multiple views. A view is a user-defined function that indexes the documents within a bucket. The function is a user-defined map/reduce function that maps documents across the cluster, outputs key/value pairs, and stores them in the index for later retrieval. Going back to the e-commerce example, suppose you want to index all orders by account identifier. The map/reduce function is as follows:
function(doc, meta) {
  if (doc.order_account_id)
    emit(doc.order_account_id, null);
}
The if statement restricts the function to documents that contain the order_account_id field, and the function then indexes this identifier. As a result, any client can query Couchbase for data based on this identifier.
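To show what the map step produces, the following sketch re-implements the same idea in Python over an in-memory dictionary. It does not run inside Couchbase; the `build_index` helper, the `emit` callback signature, and the sample documents are assumptions for illustration.

```python
def map_order_by_account(doc, meta, emit):
    """Python rendering of the view's map step: index orders by account id."""
    if doc.get("order_account_id"):
        emit(doc["order_account_id"], None)


def build_index(bucket, map_fn):
    """Run map_fn over every document and collect (key, value, doc id) rows."""
    index = []
    for doc_id, doc in bucket.items():
        map_fn(doc, {"id": doc_id}, lambda k, v: index.append((k, v, doc_id)))
    return sorted(index, key=lambda row: (row[0], row[2]))


bucket = {  # hypothetical documents sharing one bucket
    "order::1": {"order_account_id": "acc-1", "total": 30},
    "order::2": {"order_account_id": "acc-2", "total": 12},
    "order::3": {"order_account_id": "acc-1", "total": 7},
    "product::9": {"name": "kettle"},
}

index = build_index(bucket, map_order_by_account)
orders_for_acc1 = [doc_id for key, _value, doc_id in index if key == "acc-1"]
```

The product document is skipped because it has no `order_account_id` field, so the index contains only orders, sorted by account identifier.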
Cluster Manager and Management Console
Cluster management is handled by a specific node in the cluster, the orchestrator. Whenever a node in the cluster goes down, the orchestrator handles the failover by notifying all other nodes in the cluster and locating the replica partitions of the failed node so they can be promoted to active. Figure 2-5 describes the failover process.
Figure 2-5. Couchbase failover
If the orchestrator node itself goes down, all nodes detect it through the heartbeat watchdog, which runs on every node in the cluster. All cluster-related features can be managed through the API that Couchbase exposes, but there is also a ready-made management console. Couchbase has a secure console for managing and monitoring the cluster; the available actions include server setup, creating buckets, browsing and updating documents, implementing new views, and monitoring vBuckets and the disk write queue.
Figure 2-6 shows the Couchbase console home page, the memory used by the existing buckets, the hard disk used by the data, and the activities of buckets.
Figure 2-6. Couchbase Console Home
Cluster management is performed in the Server Nodes section, which lets users configure failover and replication to prevent data loss. Figure 2-7 shows a single-node installation, which is not safe for failover.
Figure 2-7. Couchbase Server Nodes
At any time, you can add a new Couchbase server by clicking the Add Server button; when this is done, data starts replicating across nodes to enable failover. Clicking the server IP gives access to monitoring data for a bucket, as shown in Figure 2-8.
Figure 2-8. Couchbase Bucket Monitoring
The figure shows a data bucket called Devuser that contains user-related JSON documents. As mentioned earlier, indexing a new document is part of a complex underlying process. Basic metrics are available in the monitoring console to show how the cluster behaves when handling massive amounts of data at a high indexing throughput. For example, the disk write queue statistics can reveal bottlenecks when data is being written to disk.
In Figure 2-9, the drain rate (the number of items written to disk from the disk write queue) is alternately flat on the active side while the replicas are being written, and the average age of the active items grows during those flat periods. An alarming behavior would be to see the average age of the active items keep growing, which would mean that writing is too slow compared to the number of items pushed into the disk write queue.
Figure 2-9. Couchbase Bucket Disk Queue
Managing Documents
The bucket view lets you manage documents from the management console: you can browse buckets, design documents, and views. Documents are stored in a bucket in Couchbase and can be accessed through the Data Bucket section of the management console, as shown in Figure 2-10.
Figure 2-10. Couchbase Console Bucket View
For each bucket, the console gives statistics such as RAM and storage size and the number of operations per second. But the real benefit of this view is that you can browse documents and retrieve them by ID, as shown in Figure 2-11.
Figure 2-11. Couchbase Document by ID
In the same view you can also create design documents and views to index documents, as shown in Figure 2-12.
Figure 2-12. Couchbase Console View Implementation
In Figure 2-12, a view is implemented that retrieves documents based on the company name. The management console makes it very easy to manage documents, but in real life you would start implementing a design document in the management console and then create a backup to industrialize its deployment. All design documents are stored in a JSON file with a simple structure that describes all the views, as shown in Listing 2-1.
Listing 2-1. Design Document JSON Example
[{
  ...
  "doc": {
    "json": {
      "views": {
        "by_id": {"map": "function (doc, meta) {\n emit(doc.id, doc);\n}"},
        "by_email": {"map": "function (doc, meta) {\n emit(doc.email, doc);\n}"},
        "by_name": {"map": "function (doc, meta) {\n emit(doc.firstname, null);\n}"},
        "by_company": {"map": "function (doc, meta) {\n emit(doc.company, null);\n}"},
        "by_form": {"map": "function (doc, meta) {\n emit(meta.id, null);\n}"}
      }
    }
    ...
  }
}]
As shown, documents can be managed through the administration console, but in an industrialized architecture most of this work is done through scripts that use the Couchbase API.
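As a rough sketch of such a script, the following prepares an HTTP request that would publish a design document, using only the Python standard library. The port (8092), the `/{bucket}/_design/{name}` path, the `dev_accounts` name, and the host are assumptions based on the views REST interface of older Couchbase releases and should be checked against the documentation of the version in use; the request is built but deliberately not sent.

```python
import json
import urllib.request

design_document = {
    "views": {
        "by_email": {"map": "function (doc, meta) {\n  emit(doc.email, null);\n}"}
    }
}


def build_design_doc_request(host, bucket, name, document):
    """Prepare (but do not send) a PUT of a design document to the assumed views endpoint."""
    url = f"http://{host}:8092/{bucket}/_design/{name}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_design_doc_request("localhost", "Devuser", "dev_accounts", design_document)
# urllib.request.urlopen(request) would send it; authentication is omitted here.
```

Keeping the design document in a versioned JSON file and pushing it with a script like this makes deployments repeatable across environments.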
Elasticsearch Introduction
Now that you know a NoSQL technology such as Couchbase, consider Elasticsearch, which is also a NoSQL technology but is completely different from Couchbase. It is a distributed datastore provided by the company named Elastic.
Architecture
Elasticsearch is an indexing and search engine built on top of Apache Lucene (a full-text search engine written in Java). From the outset, Elasticsearch was designed to be distributed and scalable: it can scale vertically by adding resources to a node and horizontally by adding nodes, which also provides high availability and resiliency. If a node goes down, the other nodes keep serving requests thanks to replication across the cluster.
Elasticsearch is a schemaless engine; data is stored as JSON and partitioned into what are called shards. A shard is actually a Lucene index and is the smallest unit of scale in Elasticsearch. Shards are organized into indices, with which an application performs read and write interactions. In the end, an index is just a logical namespace in Elasticsearch that groups a collection of shards; when a request comes in, Elasticsearch routes it to the appropriate shards, as shown in Figure 2-13.
Figure 2-13. Elasticsearch Index and Shards
There are two types of shards in Elasticsearch: primary shards and replica shards. When you start an Elasticsearch node, you can begin with a single primary shard, which may be enough, but what happens when the throughput of read and search requests increases rapidly? In that case a single primary shard may not be enough, and another shard is needed. You cannot add primary shards on the fly and expect Elasticsearch to scale; it would have to re-index all the data into the larger set of primary shards. It is therefore important to properly estimate how many primary shards the cluster needs at the start of an Elasticsearch project. Adding more shards on a single node does not increase the capacity of the cluster, because you are still limited by that node's hardware. To increase cluster capacity, additional nodes are needed to host the primary shards, as shown in Figure 2-14.
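A short sketch helps show why the number of primary shards cannot simply be changed later. It assumes that a document is routed by hashing its ID modulo the number of primary shards; CRC32 is used here as a stand-in hash and is not the function Elasticsearch uses, and the document IDs are made up.

```python
import zlib


def shard_for(doc_id, number_of_primary_shards):
    """Stand-in routing: hash the id and take it modulo the primary shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"account-{n}" for n in range(1000)]
with_one = {d: shard_for(d, 1) for d in doc_ids}
with_two = {d: shard_for(d, 2) for d in doc_ids}

# Documents whose target shard changes when going from one to two primaries.
moved = [d for d in doc_ids if with_one[d] != with_two[d]]
```

Every document in `moved` would be looked up on a shard that does not hold it unless the data were re-indexed, which is why the primary shard count is best decided up front.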
Figure 2-14. ElasticSearch Primary Shard allocation
The good thing is that when a new node joins, Elasticsearch automatically copies the shard over the network to it, as shown in Figure 2-14. But how do you make sure that data is not lost? That is the role of replica shards.
Replica shards exist first of all for failover: when a primary shard goes down, a replica is promoted to primary to ensure continuity in the cluster. Replica shards carry the same indexing load as primary shards; once a document has been indexed in the primary shard, it is also indexed in the replica shards. This is why adding replicas to a cluster does not increase indexing performance, but if extra hardware is added, search performance improves. For a three-node cluster with primary shards and two replicas, Figure 2-15 shows how the shards are allocated.
Figure 2-15. ElasticSearch Primary and replica shards
In addition to better performance, replicas help balance requests across the cluster. The last thing to discuss about the pure Elasticsearch architecture is indices and, more specifically, nodes. Indices are grouped onto Elasticsearch nodes, and there are three types of nodes, as shown in Figure 2-16.
Figure 2-16. ElasticSearch cluster topology
Here is a description of the three node types:
Master nodes: These nodes are lightweight and responsible for cluster management. They do not hold any data or serve indexing or search requests. They are dedicated to ensuring cluster stability and carry a very low workload. It is recommended to have three dedicated master nodes to ensure failover through redundancy.
Data nodes: These nodes hold the data and serve indexing and search requests.
Client nodes: These nodes provide load balancing for some processing steps and can take on part of the workload that data nodes would otherwise perform, such as scattering a search request across nodes and gathering the results.
With an understanding of the Elasticsearch architecture, you can now run some queries using the API.
ElasticSearch Monitoring
Elastic provides a plug-in called Marvel, whose goal is to monitor an Elasticsearch cluster. This plug-in is a commercial product, but you can use it for free in development mode.
Download and installation are very simple: https://www.elastic.co/downloads/marvel
Marvel relies on Kibana, the visualization console from Elastic, and comes with a number of visualizations that let an operator know precisely what is happening in the cluster. Figure 2-17 shows the Marvel overview dashboard.
Figure 2-17. ElasticSearch Marvel Console
Marvel provides information on nodes, indices, and shards; CPU utilization; the memory used by the JVM; the indexing rate; and the search rate. Marvel can even go down to the Lucene level and show information about flushes and merges. For example, the allocation of shards in a cluster is shown in Figure 2-18.
Figure 2-18. Marvel's shard allocation view
Marvel can provide a lot of information about the cluster. Figure 2-19 shows a subset of node statistics.
Figure 2-19. Marvel Node Statistics Dashboard
The dashboard is organized into several rows; in fact, there are more than 20 rows that cannot all be seen in the screenshot. Each row contains one or more visualizations related to the row's topic. In Figure 2-19, no GET requests are being sent to the indices; that is why the line chart is flat. In development mode, these statistics help you size your servers, for example by starting with a single-node cluster and observing its behavior against your specific needs. In production, you can follow what is happening inside the cluster without losing any information.
Searching with Elasticsearch
Marvel has another feature called Sense, a query editor for Elasticsearch. The power of Sense lies in its auto-completion of queries, which is very useful when you are not familiar with the Elasticsearch API; see Figure 2-20.
Figure 2-20. Marvel Sense Completion Feature
You can also export a query to cURL, for example to use it in a script, as shown in Figure 2-21.
Figure 2-21. Marvel Sense Copy as cURL Feature
In this case, the query is exported as a curl command; see Listing 2-3.
Listing 2-3. Example of CURL Command Generated by Sense
curl -XGET "http://192.168.59.103:9200/yahoo/_search" -d '{ "query": { "match_all": {} } }'
This query simply searches all the documents under the yahoo index. To demonstrate the benefits of the search API, the examples use a data set from Yahoo that contains the price of Yahoo stock over several years. A key feature of Elasticsearch's search API is the aggregation framework. Data can be aggregated in different ways; the most common is the date histogram, which is equivalent to a timeline. Figure 2-22 shows an example query that aggregates stock data by date with an interval of one month and also calculates the maximum value of the stock price for each month.
Figure 2-22. Marvel Sense Aggregation Example
The result is an array of buckets; each entry contains the date, the number of documents for that month, and the maximum value, as shown in Listing 2-4.
Listing 2-4. Aggregation Bucket
{ "key_as_string": "2015-01-01T00:00:00.000Z", "key": 1420070400000, "doc_count": 20, "by_high_value": { "value": 120 } }
As you can see, the query in Figure 2-22 has two levels of aggregation: the first is the date histogram and the second is the maximum value. Indeed, Elasticsearch supports multi-level aggregations.
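The two-level aggregation can be sketched in two parts: a request body of the kind Sense would send, and a pure-Python emulation of what the aggregation computes. The field names `date` and `high`, the aggregation name `by_date`, and the sample rows are assumptions, since the query in Figure 2-22 is not reproduced here; the `interval` parameter follows the 1.x syntax referenced in this section, and newer releases use different parameter names.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical request body; "date" and "high" are assumed field names.
query = {
    "size": 0,
    "aggs": {
        "by_date": {
            "date_histogram": {"field": "date", "interval": "month"},
            "aggs": {"by_high_value": {"max": {"field": "high"}}},
        }
    },
}

documents = [  # made-up sample rows, not real stock prices
    {"date": "2015-01-05", "high": 110.0},
    {"date": "2015-01-20", "high": 120.0},
    {"date": "2015-02-03", "high": 95.0},
]


def monthly_max(docs):
    """Group documents by calendar month, then take the max of `high` per group."""
    groups = defaultdict(list)
    for doc in docs:
        day = datetime.strptime(doc["date"], "%Y-%m-%d")
        month_start = datetime(day.year, day.month, 1, tzinfo=timezone.utc)
        groups[month_start].append(doc["high"])
    return [
        {
            "key_as_string": start.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
            "key": int(start.timestamp() * 1000),
            "doc_count": len(values),
            "by_high_value": {"value": max(values)},
        }
        for start, values in sorted(groups.items())
    ]


buckets = monthly_max(documents)
```

Each entry in `buckets` has the same shape as the bucket in Listing 2-4: the outer grouping corresponds to the date histogram and the inner `max` to the nested aggregation.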
The search API is very rich and cannot be covered feature by feature here; see:
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/search-search.html
Now that you are familiar with two NoSQL technologies, look at the different ways to integrate them into an e-commerce application.
Using NoSQL as a Cache in a SQL-Based Architecture
Even though you understand the benefits of NoSQL technology over SQL databases, you do not want to break an existing architecture that relies on a SQL database. The following approach completes the architecture and adds flexibility in how data is accessed, based on NoSQL.
Document caching
The first thing to discuss is how to replicate data to a NoSQL back end. The expectation is that every time data is written to the SQL database, a document is created or updated in the NoSQL store. The document is created, its fields are updated, or its sub-documents are enriched to reflect the corresponding table relationships in the RDBMS. When a document is accessed, that is, whenever an API GET request is made, the NoSQL back end is looked up and the document is returned if it is available.
What if the document is not found? A cache miss event is triggered, and the NoSQL manager rebuilds the document from the RDBMS. What if the insert transaction fails at the SQL layer? The framework should be transactional and trigger the document build only when the SQL transaction has completed. Figure 2-23 summarizes this mechanism.
Figure 2-23. Caching a document in NoSQL
Figure 2-23 describes the following:
The first diagram shows how an account is created in the RDBMS and how a document representing the whole account is created in NoSQL.
The second diagram shows how the API gets account information from the NoSQL store.
The third diagram shows a cache miss, where the document is not present in NoSQL and must be rebuilt from the RDBMS.
In effect, NoSQL acts as a cache for the web application. This approach relies on the flexibility of NoSQL to access a whole document without adding load to the SQL database, and it leverages the distributed nature of NoSQL. When the request rate grows sharply, the cluster can be scaled out, which reduces the pressure on the SQL database.
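The three flows in Figure 2-23 can be sketched as a small read-through cache. The `DocumentCache` class, its method names, and the dictionary standing in for the RDBMS are hypothetical; a real implementation would call the NoSQL client and run the rebuild query against the relational database.

```python
class DocumentCache:
    """Read-through cache: serve from NoSQL, rebuild from the RDBMS on a miss."""

    def __init__(self, rebuild_from_rdbms):
        self._store = {}  # stands in for the NoSQL backend
        self._rebuild = rebuild_from_rdbms
        self.misses = 0

    def on_sql_commit(self, account_id):
        """Call only after the SQL transaction has committed successfully."""
        self._store[account_id] = self._rebuild(account_id)

    def get(self, account_id):
        document = self._store.get(account_id)
        if document is None:
            self.misses += 1  # cache miss: rebuild from the source of truth
            document = self._rebuild(account_id)
            if document is not None:
                self._store[account_id] = document
        return document


# Hypothetical relational data, flattened into a dict for the example.
rdbms = {"a1": {"id": "a1", "firstname": "Jane"}}
cache = DocumentCache(lambda account_id: rdbms.get(account_id))

first = cache.get("a1")   # miss, rebuilt from the RDBMS
second = cache.get("a1")  # hit, served from the cache
```

Tying `on_sql_commit` to a successful commit keeps a failed SQL transaction from ever producing a document in the cache.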
Elasticsearch Plug-in for Couchbase
To implement such a caching mechanism, you need to choose a NoSQL technology. The first approach is to use Couchbase alone, but Couchbase's search features are not very good. Indexing data with map/reduce functions is cumbersome, especially when only simple aggregations are needed. Couchbase is a great NoSQL database, but implementing search on it is really painful. As an alternative, you can use the Couchbase/Elasticsearch integration plug-in, which basically streams all data to an Elasticsearch cluster using Cross Data Center Replication (XDCR) (www.couchbase.com/jp/couchbase-server/connectors/elasticsearch).
In terms of operations, this leaves three different technologies to maintain: the RDBMS, Couchbase, and Elasticsearch.
The most effective Elasticsearch
The reasons for keeping Couchbase in use are:
The ability to index a whole object, such as an account, from SQL into NoSQL
The ability to run anything from simple to complex aggregation queries through a flexible search API
This approach starts with Couchbase because, as a best practice, documents are stored in a database. As you gain experience with the architecture, you learn what the most effective way to access and query the data is. In most use cases, Elasticsearch is the most effective option.
NoSQL and Big Data