Brief introduction
Of the many different data models, the relational model has been dominant since the 1980s, with implementations such as Oracle, MySQL, and Microsoft SQL Server, collectively known as relational database management systems (RDBMS). Recently, however, the growing range of use cases has exposed a number of problems with relational databases, mainly for two reasons: certain pitfalls and limitations in data modeling, and the limits of horizontal scaling across large data volumes and many servers. Two trends have brought these issues to the attention of the global software community:
- The exponential growth in the amount of data generated by users, systems, and sensors, further accelerated by the concentration of large parts of this data in huge distributed systems such as Amazon, Google, and other cloud services.
- The growing interdependency and complexity of data, accelerated by the Internet, Web 2.0, social networks, and open, standardized access to data sources from a large number of different systems.
Relational databases struggle increasingly with these trends. This has led to the emergence of a large number of different technologies that address specific aspects of these problems and can work alongside or replace existing RDBMSs, an approach also known as polyglot persistence. Alternatives to relational databases are nothing new; object databases (OODBMS) and hierarchical databases (such as LDAP) have existed for a long time. Over the past few years, however, a number of new projects have emerged that are collectively referred to as NoSQL databases.
This article aims to position graph databases within the NoSQL movement; the second part is a brief introduction to Neo4j, a Java-based graph database.
NoSQL Environment
NoSQL ("not only SQL", i.e., not limited to SQL) covers a very wide range of persistence solutions that do not follow the relational data model and do not use SQL as their query language.
Simply put, NoSQL databases can be divided into four categories according to their data model:
- Key-value stores
- BigTable implementations
- Document stores
- Graph databases
For key/value systems such as Voldemort or Tokyo Cabinet, the smallest modeling unit is a key-value pair. For BigTable clones, it is a tuple with a variable number of attributes, and for document stores such as CouchDB and MongoDB, it is a document. Graph databases, in turn, model the entire data set as one large, dense network structure.
Here, let's take a closer look at two interesting aspects of NoSQL databases: scalability and complexity.
1. Scalability: CAP, ACID vs. BASE
To ensure data integrity, most classical database systems are transaction-based. This guarantees the consistency of the data in all states of data management. These transactional properties are known as ACID (Atomicity, Consistency, Isolation, Durability). However, scaling out ACID-compliant systems has proven to be a problem. In distributed systems, the conflicts between the different aspects of high availability cannot all be resolved at once, which is known as the CAP theorem:
- Strong Consistency (C): all clients see the same version of the data, even while the data set is being updated, for instance by means of the two-phase commit protocol (XA transactions) and ACID.
- High Availability (A): all clients can always find at least one version of the requested data, even if some machines in the cluster are down.
- Partition tolerance (P): the system as a whole keeps its characteristics even when deployed across different servers, transparently to the client.
The CAP theorem postulates that only two of these three aspects of a scaled-out system can be fully achieved at the same time.
To handle large distributed systems, let's take a closer look at how the different CAP properties are traded off.
Many NoSQL databases have, above all, relaxed the requirement for consistency (C) in order to achieve better availability (A) and partition tolerance (P). The result is systems known as BASE (Basically Available, Soft state, Eventually consistent). They have no transactions in the classical sense and introduce constraints on the data model to enable better partitioning schemes (such as the Dynamo system and others). A more in-depth discussion of CAP, ACID, and BASE can be found in this introduction.
2. Complexity
Protein homology network, courtesy of Alex Adai, Institute for Cellular and Molecular Biology, University of Texas.
The growing interconnection of data and systems produces dense data sets that cannot be scaled and automatically partitioned in a simple, straightforward, or domain-independent way, a problem that even Todd Hoff has pointed out. Visualizations of large, complex data sets can be found at Visual Complexity.
The Relational Model
Before we dismiss the relational data model altogether, we should not forget that one reason for the success of relational database systems is E. F. Codd's idea that a relational data model can, in principle, model any data structure without information redundancy or loss by means of normalization. Once the modeling is done, data can be inserted, modified, and queried with SQL in very powerful ways. Some databases even optimize their schemas for insert speed or for multidimensional queries (star schemas) to suit different usage scenarios such as OLTP, OLAP, web applications, or reporting.
That is the theory. In practice, however, RDBMSs run into the limits of the CAP problem mentioned earlier and into implementation problems of high-performance querying: SQL queries that join large numbers of tables and are deeply nested. Other issues include scalability, schema evolution over time, and the modeling of tree structures, semi-structured data, hierarchies, and networks.
Relational models also fit poorly with current software development practices such as object orientation and dynamic languages, a problem known as the object-relational impedance mismatch. As a result, ORM layers such as Hibernate for Java were developed and applied to this mixed environment. They certainly simplify mapping the object model onto the relational data model, but they do not optimize query performance. Semi-structured data in particular tends to be modeled as large tables with many columns, most of which are empty in most rows (sparse tables), which leads to poor performance. Even the alternative of modeling these structures as a large number of join tables is problematic, because every join in an RDBMS is a very expensive set operation.
Graphs as an alternative to relational normalization
Looking at how domain models are projected onto data structures, there are two dominant schools: the relational approach used by RDBMSs, and graph (i.e., network) structures, used for instance in the Semantic Web.
Although graph structures can in theory be normalized even in an RDBMS, the way relational databases are implemented means that recursive structures such as file trees and network structures such as social graphs suffer severe query-performance penalties. Every operation on a relationship in the network results in a join in the RDBMS, implemented as a set operation between the primary-key sets of two tables, which is slow and does not scale as the number of tuples in those tables grows.
Basic terminology of the property graph
In the graph space there is no single, widely accepted terminology, and there are many different kinds of graph models. There is, however, an effort to create the property graph model, which unifies most of the different graph implementations. According to this model, information in a property graph is modeled with three building blocks:
- Nodes (i.e., vertices)
- Relationships (i.e., edges), which have a direction and a type (labeled and directed)
- Properties (i.e., attributes) on both nodes and relationships
More precisely, the model is a directed, labeled, attributed multigraph. In a labeled graph, every edge has a label that serves as its type. A directed graph gives each edge a fixed direction, from the tail or source node to the head or target node. A property graph allows every node and edge to carry a variable list of properties, where a property is a value associated with a name, which simplifies the graph structure. A multigraph allows several edges between the same two nodes, which means that two nodes can be connected multiple times by different edges, even if those edges have the same tail, head, and label.
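To make these building blocks concrete, here is a minimal, purely illustrative Java sketch of the property graph model; all class and field names are invented for this example and do not belong to any particular graph database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative-only sketch of the property graph building blocks described above.
public class PropertyGraph {

    // A node (vertex) carries an arbitrary set of named properties.
    public static class Node {
        public final Map<String, Object> properties = new HashMap<String, Object>();
    }

    // A relationship (edge) is directed (tail -> head), has a type (its label)
    // and may carry its own properties.
    public static class Relationship {
        public final Node tail;
        public final Node head;
        public final String type;
        public final Map<String, Object> properties = new HashMap<String, Object>();

        public Relationship(Node tail, String type, Node head) {
            this.tail = tail;
            this.type = type;
            this.head = head;
        }
    }

    // Keeping edges in a plain list (rather than keyed by their endpoints) makes
    // this a multigraph: two nodes may be connected by several parallel edges,
    // even with the same tail, head, and type.
    public final List<Node> nodes = new ArrayList<Node>();
    public final List<Relationship> relationships = new ArrayList<Relationship>();

    public Node createNode() {
        Node node = new Node();
        nodes.add(node);
        return node;
    }

    public Relationship relate(Node tail, String type, Node head) {
        Relationship rel = new Relationship(tail, type, head);
        relationships.add(rel);
        return rel;
    }
}
```

For instance, a person node with a "name" property could be connected to another person by a "KNOWS" edge that itself carries a "since" property.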
The figure below shows a small labeled property graph: a small person graph, courtesy of TinkerPop.
The great usefulness of graphs has long been recognized, and they map onto problems in many different fields. The most frequently used graph algorithms include various kinds of shortest-path calculations, geodesic paths, and centrality measures (such as PageRank, eigenvector centrality, closeness, degree, HITS, and so on). For a long time, however, the application of these algorithms was limited to research, because no high-performance graph database implementations were available for production environments. Fortunately, the situation has improved in recent years. Several projects have been developed that specifically target 24/7 production use:
- Neo4j - open source, Java, property graph model
- AllegroGraph - closed source, RDF QuadStore
- Sones - closed source, .NET focus
- Virtuoso - closed source, RDF focus
- HypergraphDB - open source, Java, hypergraph model
- Others such as InfoGrid, Filament, FlockDB, and so on
The figure below shows where the main NoSQL categories sit in terms of complexity and scalability.
For more on scaling to size versus scaling to complexity, read Emil Eifrem's blog post.
Neo4j - a Java-based graph database
Neo4j is a graph database implemented in Java and fully ACID-compliant. Data is stored on disk in a format optimized for graph networks. The Neo4j kernel is an extremely fast graph engine with all the characteristics expected of a production database, such as recovery, two-phase commit, and XA compliance. Neo4j has been in 24/7 production use since 2003. The project has just released version 1.0, a major milestone in terms of scalability and community testing. High availability through master-slave replication and online backup are currently in beta and are expected to ship in the next release. Neo4j can be used as an embedded database with no administrative overhead, or as a standalone server exposing a widely used REST interface that is easy to integrate into PHP, .NET, and JavaScript environments. The main focus of this article, however, is the direct, embedded use of Neo4j.
Developers interact with the graph model directly through the Java API, which exposes a very flexible data structure (see the sketch after the list below). The community has also contributed excellent bindings for other languages such as JRuby/Ruby, Scala, Python, and Clojure. Typical data characteristics of Neo4j:
- The data does not require a schema and can even be completely unstructured, which simplifies schema changes and defers data migrations.
- Complex, connected domain data sets are easy to model; for example, access control in a CMS can be expressed as fine-grained access control lists, and object-database and triple-store use cases fit naturally, among other examples.
- Typical areas of use include the Semantic Web and RDF, LinkedData, GIS, gene analysis, social network data modeling, deep recommendation algorithms, and other areas.
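As a rough illustration of the embedded use described above, here is a minimal sketch based on the Neo4j 1.x Java API (package names and transaction idioms as they existed around the 1.0 release); the database path, node names, and properties are made up for the example:

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class EmbeddedExample {

    // Relationship types are simple enums defined by the application.
    enum RelTypes implements RelationshipType {
        KNOWS
    }

    public static void main(String[] args) {
        // Opens (or creates) a graph store in the given directory;
        // "target/neo4j-db" is just an illustrative path.
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase("target/neo4j-db");

        // All write operations must happen inside a transaction (Neo4j is ACID).
        Transaction tx = graphDb.beginTx();
        try {
            Node alice = graphDb.createNode();
            alice.setProperty("name", "Alice");

            Node bob = graphDb.createNode();
            bob.setProperty("name", "Bob");

            // A directed, typed relationship, which may carry its own properties.
            Relationship knows = alice.createRelationshipTo(bob, RelTypes.KNOWS);
            knows.setProperty("since", 2010);

            tx.success();   // mark the transaction as successful
        } finally {
            tx.finish();    // commits on success(), rolls back otherwise
        }

        graphDb.shutdown();
    }
}
```

Note that no schema is declared anywhere: nodes, relationship types, and properties are simply created as the code needs them.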
Even "traditional" RDBMS applications often contain challenging, well-suited data sets such as folder structure, product configuration, product assembly and classification, media meta-data, semantic transactions in the financial sector, and fraud detection.
Around the kernel, Neo4j offers a set of optional components. These include support for building graph structures on top of a meta-model, SAIL (a SPARQL-compliant RDF triple store implementation), and a set of common graph algorithms.
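As an example of the graph algorithm component, a shortest-path search between two existing nodes might look roughly like this. This is a sketch based on the 1.x graph-algo component; the exact class names and signatures (in particular GraphAlgoFactory and Traversal.expanderForTypes) should be checked against the version in use, and the KNOWS relationship type is just an example:

```java
import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.kernel.Traversal;

public class ShortestPathExample {

    // Finds a shortest path (at most 6 hops) over KNOWS relationships
    // between two nodes that already exist in the database.
    static Path shortestFriendPath(Node from, Node to) {
        PathFinder<Path> finder = GraphAlgoFactory.shortestPath(
                Traversal.expanderForTypes(
                        DynamicRelationshipType.withName("KNOWS"),
                        Direction.BOTH),
                6 /* maximum search depth */);
        return finder.findSinglePath(from, to);
    }
}
```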
If you want to run Neo4j as a standalone server, a REST wrapper is also available. This is ideal for architectures built on the LAMP stack. Combined with memcached, e-tags, and Apache-based caching and web tiers, REST even makes it easy to scale for heavy read loads.
Performance?
It is hard to give exact performance benchmarks, since they depend heavily on the underlying hardware, the data set used, and other factors. Appropriately sized, Neo4j handles graphs containing billions of nodes, relationships, and properties without any extra work. Its read performance easily supports traversing 2,000 relationships per millisecond (about 2 million traversal steps per second), fully transactionally, with a warm cache, per thread. For shortest-path calculations, Neo4j is as much as 1,000 times faster than MySQL even on small graphs of a few thousand nodes, and the gap grows with the size of the graph.
The reason is that in Neo4j the speed of graph traversal is constant, independent of the overall size of the graph. Unlike the join operations common in RDBMSs, no set operations are involved that would degrade performance. Neo4j traverses the graph in a lazy style: nodes and relationships are only traversed and returned when the result iterator actually asks for them, which greatly improves performance for large, deep traversals.
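The lazy traversal style can be seen in the classic traverser API. The following is a sketch based on the Neo4j 1.x API; the KNOWS relationship type and the "name" property are assumptions made for the example:

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.ReturnableEvaluator;
import org.neo4j.graphdb.StopEvaluator;
import org.neo4j.graphdb.Traverser;

public class TraversalExample {

    // Breadth-first traversal over outgoing KNOWS relationships, starting
    // from the given node. The returned Traverser is an Iterable<Node> that
    // is evaluated lazily: nodes are fetched only as the loop asks for them.
    static void printFriendNetwork(Node start) {
        Traverser friends = start.traverse(
                Traverser.Order.BREADTH_FIRST,
                StopEvaluator.END_OF_GRAPH,
                ReturnableEvaluator.ALL_BUT_START_NODE,
                DynamicRelationshipType.withName("KNOWS"),
                Direction.OUTGOING);

        for (Node friend : friends) {
            System.out.println(friend.getProperty("name", "<unnamed>"));
        }
    }
}
```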
Write speed depends heavily on the file system's seek times and on the hardware. The ext3 file system combined with SSD disks is a good match and yields roughly 100,000 write transactions per second.
Example: the Matrix social graph
As mentioned above, social networks represent only the tip of the iceberg among graph database applications, but they make an easy-to-understand example. To illustrate Neo4j's basic functionality, this small graph is taken from the movie The Matrix. The figure was generated with Neoclipse, a Neo4j tool based on the Eclipse RCP:
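The kind of code that builds such a graph with the embedded API shown earlier might look like the following sketch; the character names come from the film, while the KNOWS relationship type and the property values are illustrative rather than the article's original data set:

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class MatrixGraphExample {

    // Application-defined relationship type for the social graph.
    enum RelTypes implements RelationshipType {
        KNOWS
    }

    // Builds a tiny fragment of the Matrix character graph.
    // graphDb is an already opened embedded database, as in the earlier example.
    static void createMatrixGraph(GraphDatabaseService graphDb) {
        Transaction tx = graphDb.beginTx();
        try {
            Node neo = graphDb.createNode();
            neo.setProperty("name", "Thomas Anderson");

            Node morpheus = graphDb.createNode();
            morpheus.setProperty("name", "Morpheus");

            Node trinity = graphDb.createNode();
            trinity.setProperty("name", "Trinity");

            // Directed, typed relationships; they can carry properties of their own.
            neo.createRelationshipTo(morpheus, RelTypes.KNOWS);
            morpheus.createRelationshipTo(trinity, RelTypes.KNOWS)
                    .setProperty("age", "3 days");

            tx.success();
        } finally {
            tx.finish();
        }
    }
}
```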