Reproduced from the original address: http://www.cnblogs.com/loveis715/p/5277051.html
I recently used a graph database to support a start-up project. Working with this kind of database turned out to be quite interesting, so here is a brief introduction.
You have probably heard of NoSQL databases. They are often used to solve a series of problems that traditional relational databases handle poorly. NoSQL databases are typically divided into four categories: graph, document, column family, and key-value store. These four types of databases use different data structures to record information, so the scenarios they are suited for also differ.
The most special of the four is the graph database. It differs from the other NoSQL databases in several ways: a rich representation of relationships, full transactional support, but no purely horizontal scale-out solution.
In this article, we will give a brief introduction to neo4j, a very popular graph database in the industry.
Introduction to Graph Databases
I believe that, like me, you have often run into a series of very complex design problems when using a relational database. For example, a movie has leading and supporting actors, as well as a director, special-effects staff, and other participants. These people are usually abstracted into a Person type, which corresponds to a single database table. At the same time, a director may also be an actor in other movies or TV dramas, may be a singer, and may even be an investor in some film company (yes, I did use Vicki as the template for this example). And those film companies in turn produce and manage a series of movies and TV dramas. These interconnections are often complex, and there may be several different relationships between any two entities:
When trying to model these relationships in a relational database, we first need to create a series of tables for the various entities: a table for people, a table for movies, a table for TV dramas, a table for film companies, and so on. These tables then need to be linked by a series of association tables: tables that record which movies a person has acted in, which dramas they have appeared in, which songs they have sung, and which companies they have invested in. We also need association tables recording who is the lead in a movie, who is a supporting actor, who directed it, who did the special effects, and so on. As you can see, we need a large number of association tables to capture this complex set of relationships. As more entities are introduced, we need more and more association tables, which makes the relational solution cumbersome and error-prone.
The crux of the problem is that the relational database is designed around the idea of modeling entities. That design philosophy provides no direct support for the relationships between entities. When we need to describe those relationships, we have to create association tables to record the associations, and these association tables usually record nothing beyond the foreign keys. In other words, association tables merely simulate relationships between entities using functionality the relational database already has. This simulation leads to two very bad consequences: the database has to maintain relationships indirectly through association tables, which makes execution inefficient, and the number of association tables grows sharply.
How inefficient is this approach? Take the investment relationship between people and movies as an example. A design based on an association table often looks like this:
Now, if we want to find all the investors of a particular movie through this relationship, what does a relational database typically do? First, it scans the association table (assuming there is no index) to find all records whose movie field matches the target movie's ID. Next, it looks up the corresponding record in the person table using the primary key stored in the person field of each matching record. If there are only a few matching records, this step can use a clustered index seek (assuming that operator is chosen). The time complexity of the whole operation becomes O(n log n):
As you can see, relationships organized through association tables do not perform well at run time. If the dataset we need to work with contains a great many relationships, and most of our operations traverse those relationships, you can imagine how bad the performance of a relational database becomes.
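For comparison, here is a minimal Cypher sketch of the same lookup in a graph database; the labels and the invested_in relationship name are just illustrative choices of mine, not part of the original example. Because every node holds direct references to its relationships, finding a movie's investors is a single hop from the movie node rather than a join:

// illustrative sketch: find all investors of a movie by following relationships directly
MATCH (m:movie { title: 'Some Movie' })<-[:invested_in]-(investor:person)
RETURN investor.name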
Besides performance, managing the sheer number of association tables is also very annoying. The example above only involves four entities: people, movies, TV dramas, and film companies. Real-world cases are rarely this simple. In some scenarios we need to model many more entities to fully describe the relationships within a domain: the controlling relationships among film companies, the complex shareholding relationships among holding companies, loan and collateral relationships between companies, relationships between people, relationships between people and brands, relationships between brands and the companies behind them, and so on.
As you can see, a traditional relational database is overwhelmed when it has to describe a large number of relationships. It copes well when there are many entities but the relationships between them are relatively simple. For cases where the relationships between entities are very complex, where data often needs to be recorded on the relationships themselves, and where most operations are about those relationships, a graph database with native support for relationships is the right choice. It not only improves runtime performance, but also greatly improves development efficiency and reduces maintenance costs.
In a graph database, there are two main building blocks: node sets and the relationships that connect nodes. A node set is a collection of nodes in the graph, and is the concept closest to a table in a relational database. Relationships, on the other hand, are unique to graph databases. So for someone used to relational database development, correctly understanding relationships is the key to using a graph database well.
Note: "node set" here is my own translation. In the neo4j official documentation it is called a label; the original text reads: "A label is a named graph construct that is used to group nodes into sets; all nodes labeled with the same label belong to the same set." I personally feel that the literal name "label" easily confuses people, so I chose a free translation based on "group nodes into sets", which also makes the correspondence between node set and node easier to grasp.
But don't worry: once you understand how a graph database abstracts data, you will find that the abstraction is actually quite close to a relational database. Simply put, each node carries a label identifying its entity type, that is, the node set it belongs to, and records a series of properties describing the node. In addition, nodes can be connected to each other through relationships. So the abstraction of each node set is somewhat similar to the abstraction of an individual table in a relational database:
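As a minimal Cypher sketch (the label and property names are illustrative), a node belonging to the person node set with a couple of properties looks like this:

// illustrative sketch: a node in the person node set (its label) with two properties
CREATE (p:person { name: 'Alice', born: 1986 })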
But when it comes to representing relationships, relational and graph databases are very different:
As you can see, when we need to represent many-to-many relationships in a relational database, we have to create an association table for each pair of related entities, and these tables usually record no information of their own. If there are multiple kinds of relationships between two entities, we need multiple association tables between them. In a graph database, we only need to state that a particular relationship exists between the two: for example, a direct_by relationship points to the film's director, and an act_by relationship designates the actors who took part in the film. Moreover, through properties on the act_by relationship we can indicate whether an actor is the star of the movie. And as the relationship names above suggest, relationships are directed: if we want a two-way relationship between two node sets, we need to define a relationship for each direction.
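A minimal Cypher sketch of this idea (the relationship and property names are illustrative, following the wording above):

// illustrative sketch: named, directed relationships that carry their own properties
CREATE (m:movie { title: 'Some Movie' })
CREATE (d:person { name: 'Some Director' })
CREATE (a:person { name: 'Some Actor' })
CREATE (m)-[:direct_by]->(d)
CREATE (m)-[:act_by { starring: true }]->(a)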
In other words, compared with the association tables of a relational database, relationships in a graph database can carry properties and therefore express relationships much more richly. So compared with a relational database user, the user of a graph database has an extra weapon when abstracting things: rich relationships:
So when defining the data representation for a graph database, we should abstract the things to be represented in a more natural way: first define the node sets for those things and the properties of each node set, then identify the relationships between them and create the corresponding abstractions for those relationships.
The data hosted in a graph database will then end up with a structure similar to the one shown:
Designing a High-Quality Graph
Having learned the basics of graph databases, we can start experimenting with one. The first thing to figure out is: how do we define a well-designed graph for our graph database? This is actually not difficult; you just need to understand a few points about graph database design.
The first is to distinguish between node sets, nodes, and relationships in the graph. When designing relational databases in the past, we usually used a table to abstract a class of things. For the concept of a person, for instance, we would abstract a table and add records representing two people, Alice and Bob:
In a graph database there are two corresponding concepts: the node set and the node. In general, data in a graph database is presented not through node sets but through individual nodes:
If we need to add support for books to the graph, those books are likewise represented as individual nodes:
In other words, although the concept of a node set still exists in graph databases, it is no longer the most important abstraction. Some graph databases even allow software developers to use schemaless nodes, which further weakens the concept of the node set. Instead, our thinking should start from individual nodes and the series of relationships that exist between them.
So can we define the data on each node arbitrarily? No. A common rule of thumb is to use schemaless design only to the extent that it benefits you. Weakly typed languages, for example, give software developers more flexibility than strongly typed languages, but they are often less maintainable and less rigorous. In the same way, when using schemaless nodes we still have to balance flexibility against maintainability.
This allows us to add various relationships to nodes freely, instead of having to change the database schema to record some foreign key, as in a relational database. That in turn allows software developers to add all kinds of relationships between nodes:
Therefore, in a graph database, the node set is not the most important concept. For example, in some graph databases the IDs of nodes are not organized by node set but are assigned in the order in which nodes are created. When debugging, you may find that the first node in a node set has ID 1 and the second has ID 3, while the node with ID 2 belongs to another node set.
So how do we define an appropriate graph for our business logic? Simply put, each individual thing should be abstracted as a node, and nodes of the same type should be recorded in the same node set. The data contained in the nodes of one node set may differ somewhat; for instance, people may have different responsibilities and are therefore connected to other nodes through different relationships. A person may be an actor, a director, or both. In a relational database we might need separate tables for actors and directors; in a graph database all three kinds of people are data in the same node set, and the difference is only that they are connected to different nodes through different relationships. In other words, the node set in a graph database is not as fine-grained as the table in a relational database.
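As a sketch, the same person node simply participates in both kinds of relationships (the names are again illustrative):

// sketch: one node in the person node set connected as both actor and director
MATCH (p:person { name: 'Someone' }), (m:movie { title: 'Some Movie' })
CREATE (m)-[:act_by]->(p)
CREATE (m)-[:direct_by]->(p)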
Once the node sets are abstracted, we need to find the possible relationships between nodes. These relationships are not limited to crossing node sets. Sometimes they are relationships between nodes within the same node set, or even a relationship from a node to itself:
These relationships have a starting point and an end point. In other words, relationships in a graph database are directed. If we want a mutual relationship between two nodes, such as Alice and Bob knowing about each other, we need to create two know_about relationships between them: one pointing from Alice to Bob, and the other from Bob to Alice:
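In Cypher this might look like the following sketch (built on the names used above):

// sketch: a mutual acquaintance requires one relationship in each direction
MATCH (alice:person { name: 'Alice' }), (bob:person { name: 'Bob' })
CREATE (alice)-[:know_about]->(bob)
CREATE (bob)-[:know_about]->(alice)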
Note, however, that although relationships in the graph are one-way, some graph database implementations, such as neo4j, let us find not only the relationships leaving a node but also the relationships pointing to it. In other words, even though a relationship is one-way, it can be traversed from both its start and its end.
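For example, a Cypher pattern written without an arrowhead matches the relationship regardless of its direction (again just a sketch using the names above):

// sketch: omit the arrow to match know_about relationships in either direction
MATCH (alice:person { name: 'Alice' })-[:know_about]-(other:person)
RETURN other.name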
Using neo4j in Projects
OK, with some basic knowledge of graph databases in place, let's take neo4j as an example of how to use one. neo4j is an open-source graph database from Neo Technology. It organizes data according to the node/relationship model described above and has the following features:
- Transaction support. neo4j requires every change to the data to be performed within a transaction, which guarantees data consistency.
- Powerful graph query capabilities. neo4j lets users manipulate the database through the Cypher language. Cypher is designed specifically for working with graphs, so it can operate on graph data very efficiently. neo4j also provides clients for a range of popular languages, allowing developers who use those languages to get started with neo4j quickly. In addition, projects such as Spring Data Neo4j offer a very simple and straightforward way to manipulate data, making it even easier to get started.
- A certain degree of scale-out capability. Because a node in a graph is usually connected to other nodes by relationships, cutting the graph apart the way typical sharding solutions do is often unrealistic. The scale-out scheme currently provided by neo4j is therefore mainly read/write separation through read replicas. On the other hand, a single neo4j instance can store billions of nodes and relationships, so this degree of horizontal scalability is sufficient for typical enterprise applications.
OK, now let's look at an example of creating and manipulating a graph database via Cypher (from http://neo4j.com/developer/guide-data-modeling/):
// Create a node of type person for Sally; the node's name property is 'Sally' and its age property is 32
CREATE (Sally:person { name: 'Sally', age: 32 })

// Create the John node
CREATE (John:person { name: 'John', age: 27 })

// Create the node for the Graph Databases book
CREATE (gdb:book { title: 'Graph Databases',
                   authors: ['Ian Robinson', 'Jim Webber'] })

// Create a friendship between Sally and John; the since value here is a timestamp.
// Compare this with how dates are usually recorded in a relational database~~~
CREATE (Sally)-[:friend_of { since: 1357718400 }]->(John)

// Create a "has read" relationship between Sally and the Graph Databases book
CREATE (Sally)-[:has_read { rating: 4, on: 1360396800 }]->(gdb)

// Create a "has read" relationship between John and the Graph Databases book
CREATE (John)-[:has_read { rating: 5, on: 1359878400 }]->(gdb)
These statements create three nodes: the person nodes Sally and John and the book node gdb, and also specify the relationships between them:
Note: the original figure is from http://neo4j.com/developer/guide-data-modeling/
I would rather save the time for more useful things, but for the sake of completeness it has to be included.
One thing to note here is that the relationship is one-way. If we want a two-way relationship, such as Sally and John being friends with each other, logically we should need to repeat the relationship creation in the opposite direction. Since I have not tried the latest version of neo4j (it recently introduced breaking, backward-incompatible changes, so we have no way to upgrade), I cannot confirm whether the code above creates the friend_of relationship in only one direction, so please check this yourself. If anyone has experimented with it, please share the result in a comment; I would be grateful.
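If the reverse direction does have to be created explicitly, the extra statement might look like the following sketch; I have not verified this against the latest release:

// sketch: explicitly create the friendship in the opposite direction as well
MATCH (Sally:person { name: 'Sally' }), (John:person { name: 'John' })
CREATE (John)-[:friend_of { since: 1357718400 }]->(Sally)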
Once we have data, we can query it. Although Cypher and SQL operate on different data structures, their syntax is quite similar. For example, the following statement retrieves when Sally and John became friends (from http://neo4j.com/developer/guide-data-modeling/):
MATCH (Sally:person { name: 'Sally' })
MATCH (John:person { name: 'John' })
MATCH (Sally)-[r:friend_of]-(John)
RETURN r.since AS friends_since
There are also some more complex constructs. The following query determines who, of Sally and John, read the book "Graph Databases" first:
MATCH (people:person)
WHERE people.name = 'John' OR people.name = 'Sally'
MATCH (people)-[r:has_read]->(gdb:book { title: 'Graph Databases' })
RETURN people.name AS first_reader
ORDER BY r.on
LIMIT 1
Of course, nobody is willing to hand-write query strings all day; otherwise Hibernate would never have flourished. A popular solution for neo4j is Spring Data Neo4j. By defining a series of Java classes and adding a series of annotations to them, we can use neo4j in our system. Let's take version 3.4.4 of Spring Data Neo4j as an example of how to use it.
First, we need to define the data types for the data that will be stored in neo4j (from http://projects.spring.io/spring-data-neo4j/):
// The @NodeEntity annotation marks a type whose instances are stored in neo4j
@NodeEntity
public class Movie {
    // The @GraphId annotation marks this field as the node id. When creating a new node,
    // the field needs to be left empty (null). I do not know whether 4.0.0 still has this restriction.
    @GraphId Long id;

    // Create a full-text index for this field
    @Indexed(type = fulltext, indexName = "search")
    String title;

    // A direct reference to the Person class. When the movie is saved, the person is automatically
    // saved into the person node set, and the Movie instance keeps a reference to it.
    Person director;

    // The @RelatedTo annotation states that every entity referenced by this collection is connected
    // to the current Movie instance by an acts_in relationship. Note that the direction is INCOMING,
    // i.e. the relationship points from person to movie: "person acts_in movie". On the Person side
    // we could likewise declare a collection of Movie annotated with @RelatedTo, the same acts_in
    // type, and direction OUTGOING.
    // Note: @RelatedTo and @RelatedToVia are supposedly deprecated in 4.0.0, but are still used
    // in the official examples.
    @RelatedTo(type = "acts_in", direction = INCOMING)
    Set<Person> actors;

    @RelatedToVia(type = "rated")
    Iterable<Rating> ratings;

    // Read data using a custom Cypher query
    @Query("start movie=node({self}) match movie-->genre<--similar return similar")
    Iterable<Movie> similarMovies;
}
Next, we create a repository used to perform CRUD operations on the type we just defined:
// Extending the GraphRepository interface gives the Movie class CRUD functionality directly
interface MovieRepository extends GraphRepository<Movie> {
    // Perform a specific operation with a Cypher statement
    @Query("start movie={0} match movie<-[rating:rated]-user return rating")
    Iterable<Rating> getRatings(Movie movie);

    // As with Spring Data JPA, filter criteria can be added by composing method names
    // according to specific naming rules
    Iterable<Person> findByActorsMoviesActorName(String name);
}
Finally, we declare where these components live in the Spring configuration file:
<neo4j:config storeDirectory="target/graph.db" base-package="com.example.neo4j.entity"/>
<neo4j:repositories base-package="com.example.neo4j.repository"/>
neo4j Cluster
OK, now that we know how to use neo4j, the next step is to set up a neo4j cluster to provide a highly available, high-throughput solution. The first thing to understand is that, compared with the nearly unlimited scale-out capabilities of other NoSQL databases, the neo4j cluster has certain limitations. To understand these limitations better, let's first look at the architecture of a neo4j cluster and how it works:
The figure shows a master-slave cluster of three neo4j nodes. Typically, a neo4j cluster consists of one master and multiple slaves. Every neo4j instance in the cluster contains all of the data in the graph, so the failure of any single instance does not cause data loss. The master is primarily responsible for data writes, and the slaves then synchronize the changes from the master to themselves. If a write request arrives at a slave, the slave forwards the request to the master; the write is executed by the master first and then asynchronously propagated to the individual slaves. So in the figure, the red lines representing how data is written run from master to slave and from slave to master, but never from slave to slave. All of this is handled by the transaction propagation component.
Some readers may have noticed that all data writes in a neo4j cluster go through the master: doesn't the master become the system's write bottleneck? The answer is: almost never. First, the complexity of modifying graph data means it cannot be changed as casually as a stack, an array, or other simple data structures. When modifying a graph, we not only modify the nodes themselves but also maintain the relationships between them; this is a relatively complex process and also harder for users to reason about. As a result, graph workloads are usually far more read-heavy than write-heavy. In addition, neo4j has an internal write queue that temporarily buffers write operations, allowing a neo4j instance to absorb sudden bursts of writes. In the worst case, the cluster faces a sustained high volume of writes; in that situation we need to consider scaling the neo4j cluster vertically, because horizontal scaling does not help with this problem.
Conversely, the read throughput of a neo4j cluster can in theory grow linearly with the number of instances in the cluster, since reads can be served by any instance. For example, if a 5-node neo4j cluster can serve 500 read requests per second, adding one more node can expand it to roughly 600 read requests per second.
But when the request volume is very large and the data being accessed is very random, another problem appears: cache misses. Internally, neo4j uses a cache to keep recently accessed data in memory so it can respond to reads quickly. When the request volume is huge and the accessed data is randomly distributed, cache misses occur constantly, so that every read has to go to disk, which greatly reduces the efficiency of the neo4j instance. The solution neo4j offers is called cache-based sharding. Simply put, the same neo4j instance is used to serve all the requests from a single user. The rationale is simple: the data a given user accesses over a period of time tends to be similar, so routing that user's requests to the same neo4j instance greatly reduces the probability of cache misses.
Another component of the neo4j server, cluster management, is responsible for synchronizing the state of each instance in the cluster and monitoring the joining and leaving of other neo4j nodes. It is also responsible for the consistency of leader election results. If the number of failed nodes in a neo4j cluster exceeds half of the nodes in the cluster, the cluster will only accept read operations until the number of healthy nodes again exceeds half of the cluster.
At startup, a neo4j instance first attempts to join the cluster specified in its configuration file. If the cluster exists, it joins as a slave; otherwise the cluster is created and the instance becomes its master.
If a neo4j instance in the cluster fails, the other instances detect the failure within a short time and mark it as unavailable until it recovers and has synchronized its data back up to date. There is one special case: the failure of the master. In that case, the cluster elects a new master through its built-in leader election mechanism.
With the help of the cluster management component, we can also create a global cluster, which has one master cluster and multiple slave clusters. This arrangement allows the master cluster and slave clusters to be deployed in different regions, so that users can access the service closest to them. Similar to the relationship between master and slave instances within a single neo4j cluster, data writes are normally performed in the master cluster, while the slave clusters are only responsible for serving reads.
Improving neo4j Performance
I believe you have seen from the discussion of the neo4j cluster above that it does have certain limitations. These limitations can become bottlenecks for the system as a whole, for example in the number of nodes it can store or in write throughput. In the article "Extensibility of services" we described how scaling up can also improve the overall capability of a service. Besides providing higher-capacity hardware for neo4j, using neo4j more efficiently is itself an important way to scale up.
Similar to the execution plans provided by databases such as SQL Server, neo4j also provides execution plans. When a request is executed, neo4j breaks it down into a series of smaller operators. Each operator performs a part of the work, and together they complete the response to the request. Like SQL Server's execution plans, neo4j's execution plans contain many types of operators, such as scans, seeks, merges, and filters. We can get a tree representation of how a request will be executed through the EXPLAIN or PROFILE commands. By examining these trees, software developers can see how a request is executed inside neo4j:
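For example, prefixing a query with PROFILE runs it and returns the operator tree together with execution statistics, while EXPLAIN only returns the plan without executing the query. A small sketch based on the example data above:

// sketch: inspect how neo4j executes a query against the example data
PROFILE
MATCH (Sally:person { name: 'Sally' })-[:friend_of]->(friend:person)
RETURN friend.name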
All the operators supported by the current version of neo4j are listed at http://neo4j.com/docs/stable/execution-plans.html.
Readers with execution-plan tuning experience may have spotted one operator at first glance: Node Index Seek. Its name directly reveals another tuning tool in neo4j: the index. We know that in SQL Server a clustered index seek is usually the most efficient operation as long as the result set is not too large. Likewise, in neo4j we should use indexes as sensibly as possible, so that the execution plans neo4j generates can use the index-based operators. Recall the Movie class we abstracted earlier with Spring Data Neo4j:
@NodeEntity
public class Movie {
    @GraphId Long id;

    @Indexed(type = fulltext, indexName = "search")
    String title;

    ...
}
The code above shows how to create an index through the @Indexed annotation. If you are using Cypher directly to manipulate neo4j, you can create an index with the following statement:
CREATE INDEX ON :movie(title)
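With that index in place, a lookup by title such as the following sketch can be served by an index-based operator like Node Index Seek instead of scanning every movie node:

// sketch: a title lookup that can take advantage of the index created above
MATCH (m:movie { title: 'Graph Databases' })
RETURN m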
There is one place where neo4j differs slightly from SQL Server: how to understand the @GraphId annotation. In SQL Server, a primary key is not inherently associated with an index, but an index is usually added to it automatically by default. In neo4j, the field annotated with @GraphId is more of an internal implementation detail: neo4j does not automatically build an index on it, but it accesses nodes directly through the value recorded in that field.
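In Cypher, that kind of direct access by internal id looks roughly like the following sketch (42 is just a placeholder id):

// sketch: fetch a node directly by its internal id rather than through an index
MATCH (n)
WHERE id(n) = 42
RETURN n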
There is another thing that can easily hurt neo4j's performance: trying to use neo4j to record data it is not suited for. At the beginning of this article we introduced neo4j's domain: recording graph data, namely node sets, nodes, and the relationships between them. Other kinds of data, such as users' username/password pairs, are not what a graph database is good at. In those cases we should choose a database appropriate for the data. In a large system it is common for many different types of databases to work together, so there is no need to force data that should be recorded by another type of database into neo4j.
Working with other databases
We just mentioned that data which should not be recorded in neo4j should not be forced into it, so that neo4j is not used in an unreasonable way that reduces its efficiency. So where should that data be recorded? The answer is simple: in other types of databases that are suited to it.
Maybe you think that sentence is a platitude. Honestly, I feel the same way. What I actually want to introduce here is how to integrate neo4j with other databases so that they work together to provide a complete service to the user. For some systems we can tolerate a certain degree of inconsistency between these databases; for others we need to keep the data consistent at all times.
neo4j supports three main technical solutions for this: event-based synchronization, periodic synchronization, and periodic full export/import of data. Event-based synchronization sends the same message to both the neo4j-backed service and the service backed by the other database, and the data writes are performed by those services. Periodic synchronization regularly synchronizes data changes from one database to the other. Periodic full export/import simply imports all of one database's data into the other.
All three solutions deal with the case where neo4j and the other databases record the same data. The more common situation, however, is that neo4j records the relatively complex relationships between entities, while the other databases record data with other kinds of representations. The data in neo4j and in those databases only partially overlaps, and each database also has data of its own. This situation is often handled through a multi-step submission. For example, on a dating site, users may set up their account on one page, entering a user name, password, and so on, and then add a series of friends and comments about those friends on the next "add friends" page. In such a system the user's account settings might be recorded in a relational database, while the information about friends is recorded in the graph database. If all of the information from both steps were sent to the back end as a single request, the save might succeed in one database and fail in the other. To avoid this, we split the information across two pages, each with a "Save and continue" button at the bottom. This way, if the account setup in the first step is not saved correctly, the user simply cannot proceed to the friend-adding step; and in the friend-adding step, if the graph database fails to save, we can explicitly tell the user that adding the friend failed and let them retry.
In fact, in many cases the problem of saving data across different databases can be solved by adjusting the design. Moreover, the data recorded by these databases often has very different structures, so splitting the work into a multi-step submission is often the most natural approach for users as well.