The term NoSQL is becoming ubiquitous in recent years. But what exactly is "NoSQL" referring to? How and why is it so useful? In this article, we will write a NoSQL database to answer these questions through pure Python (which I prefer to call it, "light-structured pseudocode").
Oldsql
In many cases, SQL has become a synonym for the database. In fact, SQL is the acronym for strctured Query Language , not the database technology itself. Rather, it refers to a language that retrieves data from an RDBMS (relational database management system, relational databases Management systems). MySQL, MS SQL Server and Oracle belong to one of the RDBMS.
The R in the RDBMS, the "relational" (relational, associative), is the richest part of the content. Data is organized by tables (table) , and each table is made up of columns (column) that are associated with the type . The type of all tables, columns, and their classes is referred to as the database schema (schema or pattern). The schema of each table depicts the structure of the database in its entirety. For example, a Car table called may have some of the following columns:
In a table, each single entry is called a row, or a record. To differentiate each record, a primary key (primary key)is typically defined. The primary key in the table is one of the columns that uniquely identifies each row. In the table Car , VIN is a natural primary key choice because it guarantees that each vehicle has a unique identity. Two different rows may have the same value on make, Model, year, and Color columns, but for different cars, there will definitely be different VIN. Conversely, only To have the same VIN on both lines, we don't have to check the other columns to see that the two lines refer to the same car.
Querying
SQL allows us to get useful information by querying the database. In simple queries , a query is a structured language that questions the RDBMS and interprets the rows it returns as the answer to the question. Assuming that the database represents all registered vehicles in the United States, in order to obtain all the records, we can make the following SQL query on the database:
SELECT make, Model from Car;
To roughly translate SQL into Chinese:
That is, "show me the values of make and Model in each row of the table Car." After executing the query, we will get the results of some queries, each of which is make and Model. If we only care about the color of the car registered in 1994, then you can:
SELECT Color from Car WHERE year = 1994;
At this point, we will get a list similar to the following:
Blackredredwhiteblueblackwhiteyellow
Finally, we can specify the query for a car by using the table's (primary key) primary key , which is VIN:
SELECT * from Car WHERE VIN = ' 2134afger245267 '
The above query will return the property information for the specified vehicle.
The primary key is defined as unique and non-repeatable. In other words, a vehicle with a specified VIN can appear at most once in a table. This is very important, why? Take a look at an example:
Relations
Suppose we are running a car repair business. In addition to some other necessary things, we also need to track the service history of a car, that is, all the trimming records on that vehicle. Then we might create a table with some of the following columns ServiceHistory :
| Vin |
Make |
Model |
Year |
Color |
Service performed |
Mechanic |
Price |
Date |
Thus, each time a vehicle is repaired, we add a new line to the table and write to the service what we do, what the repairman is, how much it costs and how long it will take to service.
But wait a minute, we all know that for the same car, all the information about the vehicle's own columns is constant. That is, if I trim my Black Lexus RX 350 10 times, the information will not change even if make, Model, year and Color, and each time it is still recorded. As opposed to invalid duplicate records, it is more reasonable to store such information only once and to query when necessary.
So what should we do? We can create a second table: Vehicle It has the following columns:
| Vin |
Make |
Model |
Year |
Color |
In this way, for ServiceHistory tables, we can refine to some of the following columns:
| Vin |
Service performed |
Mechanic |
Price |
Date |
You might ask, why does VIN appear in both tables? Because we need to have a way to confirm that the car in the table refers to the car in the ServiceHistory Vehicle table, that is, it is necessary to confirm that two of the two records in the table indicate the same car. In this case, we only need to store the information of each car once. Each time a vehicle comes in for repair, we ServiceHistory create a new row in the table without having to Vehicle add a new record to the table. After all, they are referring to the same car.
We can use SQL query statements to expand Vehicle the ServiceHistory implicit relationships that are contained in two tables:
SELECT Vehicle.model, vehicle.year from Vehicle, servicehistory WHERE Vehicle.vin = Servicehistory.vin and Servicehistory . Price > 75.00;
This query is designed to find the Model and year for all vehicles with a maintenance cost greater than $75.00. Notice that we Vehicle ServiceHistory filter the records that meet the criteria by matching the VIN values in the table. The return will be a few records in the two tables that match the criteria, while "Vehicle.model" and "vehicle.year" mean we only want the Vehicle two in the table Column.
If our database does not have an index (indexes) (the correct one should be indices), the query above needs to perform a table scan ( table Scan) to locate the rows that match the query requirements. Table scan sequentially checks each row in the table in order, which is usually very slow. In fact, the table scan is actually the slowest of all queries.
You can avoid scanning the table by Legasso. We can think of an index as a data structure that allows us to quickly find a specified value (or some value within a specified range) on the indexed column by pre-ordering. That is, if we have an index on the price column, then we do not need to scan the entire table line by row to determine whether its prices are greater than 75.00 , but only need to use the information contained in the index to "jump" to the first price above 75.00 the row, and return each subsequent row (because the index is Order, so the price of these lines is at least 75.00 ).
Indexing is an indispensable tool for improving query speed when dealing with large amounts of data. Of course, as with everything, there is a loss, and using an index can result in some extra consumption: the indexed data structure consumes memory, which can be used to store data in the database. This requires that we weigh the pros and cons and seek a compromise, but it is common practice to index the columns that are frequently queried.
The Clear Box
Because the database is able to examine the schema of a table (which describes what type of data each column contains), advanced features such as indexes can be implemented and can make a reasonable decision based on the data. That is to say, for a database, a table is actually a "black box" (or Transparent box) antonym?
Keep this in mind when we talk about NoSQL databases. This is also a very important part when it comes to the ability to query different types of database engines.
Schemas
We already know that the schema of a table describes the name of the column and the type of data it contains. It also includes other information, such as which columns can be empty, which columns do not allow duplicate values, and any other restrictions on the columns in the table. At any one time a table can have only one schema, and all rows in the table must conform to the schema's rules .
This is a very important constraint. Suppose you have a database table with millions of consumer information. Your sales team wants to add some extra information (such as the age of the user) to improve the accuracy of their email marketing algorithms. This is required to alter (change) an existing table – add a new column. We also need to decide whether each row in the table requires that the column must have a value. In general, it makes sense to have a column with values, but doing so may require information that we cannot easily obtain (such as the age of each user in the database). So at this level, there is a need for some trade-offs.
In addition, making some changes to a large database is usually not a trivial matter. To prevent errors, it is important to have a rollback scenario. But even so, once the schema has changed, we are not always able to undo those changes. Schema maintenance may be one of the most difficult parts of a DBA's work.
Key/value Stores
Before the term "NoSQL" exists, memcached a key/value data store like this key/value data Stores does not require a table schema to provide the functionality of the datastore. In fact, there is no concept of "table" at all when storing k/v. Only keys and values . If the key-value store sounds familiar, it may be because the concept is built on the same principles as dict Python set : a hash table (hash table) is used to provide fast, key-based data queries. A Python-based, most primitive NoSQL database, simply a large dictionary (dictionary).
To understand how it works, write one yourself! First look at some simple design ideas:
A Python dict as the primary data store
Only string type is supported as key (key)
Supports storing integer, string, and list
A simple TCP/IP server that uses ASCLL string to deliver messages
Some like INCREMENT , DELETE , APPEND and STATS such advanced commands (command)
The advantage of having an ASCII-based TCP/IP interface for data storage is that we use simple telnet programs to interact with the server without needing special clients (although this is a good practice and requires only 15 lines of code).
For the information we send to the server and other returns, we need a "wired format". The following is a simple explanation:
Commands supported
PUT
GET
Putlist
APPEND
INCREMENT
DELETE
STATS
Now let's define the structure of the message itself.
Message Structure
Request Messages
A Request message contains a command, a key (key), a value (value), and a value's type (type). The latter three depends on the message type, which is optional and non-mandatory. ;is used as a delimiter. Even if the above options are not included, you must still have three characters in the message ; .
COMMAND; [KEY]; [VALUE]; [VALUE TYPE]
command is one of the commands in the list above
key is a string that can be used as a database key (optional)
VALUE is an integer in the database, list or string (optional)
A list can be represented as a string of strings separated by commas, for example, "Red, green, blue"
value type describes the value that should be interpreted why the type
Examples
Reponse Messages
A response message (reponse messages) contains two sections, ; separated by. The first part is always True|False , it depends on whether the command executed is successful. The second part is the command message, which displays an error message when an error occurs. For those commands that execute successfully, if we do not want the default return value (for example PUT ), a successful message will appear. If we return the value of a successful command (for example GET ), then the second part will be its own value.
Examples
True; Key [foo] set to [1]
True; 1
True; Key [bar] set to [['a', 'b', 'c']]
True; Key [bar] had value [d] appended
True; ['a', 'b', 'c', 'd']
True; {'PUTLIST': {'success': 1, 'error': 0}, 'STATS': {'success': 0, 'error': 0}, 'INCREMENT': {'success': 0, 'error': 0}, 'GET': {'success': 0, 'error': 0}, 'PUT': {'success': 0, 'error': 0}, 'GETLIST': {'success': 1, 'error': 0}, 'APPEND': {'success': 1, 'error': 0}, 'DELETE': {'success': 0, 'error': 0}}
Show Me the code!
I'm going to show all the code in the form of a block summary. The whole code is 180 lines, and it won't take a long time to read them.
Set up
Here are some of the boilerplate code that our server needs:
"" "NoSQL database written in Python" "" # Standard library Importsimport sockethost = ' localhost ' PORT = 50505SOCKET = socket . Socket (socket.af_inet, socket. Sock_stream) STATS = { ' PUT ': {' success ': 0, ' ERROR ': 0}, ' GET ': {' success ': 0, ' ERROR ': 0}, ' GETLIST ': {' succ ' ESS ': 0, ' ERROR ': 0}, ' putlist ': {' success ': 0, ' ERROR ': 0}, ' INCREMENT ': {' success ': 0, ' ERROR ': 0}, ' APPEND ': {' success ': 0, ' ERROR ': 0}, ' DELETE ': {' success ': 0, ' ERROR ': 0}, ' STATS ': {' success ': 0, ' ERROR ': 0}, }
It is easy to see that the above is just a package of imports and some data initialization.
Set Up (Cont ' d)
Next I'll skip some code so that I can continue to show the rest of the code that was prepared. Note that it involves some functions that do not yet exist, but that's OK, we'll cover them later. In the full version (which will be presented at the end), everything will be organized in an orderly fashion. Here is the remaining installation code:
Command_handers = {' PUT ': handle_put, ' GET ': handle_get, ' GETLIST ': handle_getlist, ' putlist ': handle_putlist , ' INCREMENT ': handle_increment, ' APPEND ': handle_append, ' DELETE ': handle_delete, ' STATS ': handle_stats,}data = {}def main (): "" "Main entry point for Script" "" Socket.bind (HOST, PORT) Socket.listen (1) while 1:co Nnection, address = Socket.accept () print (' New connection from [{}] '. Format (address)) data = Connection.recv (4096). Decode () command, key, value = parse_message (data) if command = = ' STATS ': response = Handl E_stats () elif command in (' GET ', ' GETLIST ', ' INCREMENT ', ' DELETE '): Response = Command_handers[command] (key) Elif command in (' PUT ', ' putlist ', ' APPEND ',): RESP Onse = Command_handers[command] (key, value) Else:response = (False, ' Unknown COMMAND type {} '. Format (CO Mmand)) Update_stATS (command, response[0]) Connection.sandall (' {};{} '). Format (response[0], response[1])) connection.close () if __name__ = = ' __main__ ': Main ()
We created COMMAND_HANDLERS it, often referred to as a lookup table (look-up table) . COMMAND_HANDLERS the job is to associate the command with the function that is used to process the command. For example, if we receive an GET order, COMMAND_HANDLERS[command](key) it is tantamount to saying handle_get(key) . Remember, in Python, a function can be considered a value and can be stored in one of the same values as any other value dict .
In the above code, although some commands request the same parameters, I still decide to process each command separately. Although it is simple and rude to force all handle_ functions to accept one key and one value , I hope that these processing functions will be more organized, easier to test, and less prone to errors.
Note that the socket-related code is already very minimalist. Although the entire server is based on TCP/IP communication, there is not much underlying network interaction code.
The last thing you need to pay attention to: the DATA dictionary, because this point is not very important, so you are likely to miss it. DATAis actually used to store the key-value pair, it is they actually constitute our database.
Command Parser
Here are some command parser , which are responsible for interpreting the received message:
def parse_message (data): "" " Return a tuple containing the command, the key, and (optionally) the value cast to the appropriate type. "" " command, key, value, Value_type = Data.strip (). Split (';') If Value_type: if value_type = = ' LIST ': value = Value.split (', ') elif value_type = = ' int ': value = Int (VA Lue) Else: value = str (value) else: value = None return command, key, value
Here we can see that the type conversion has occurred. If you want the value to be a list, we can str.split(',') get the value we want by calling the string. For int , we can simply use the argument as a string int() . str()the same is true for strings.
Command handlers
The following is the code for the command handler. They are very intuitive and easy to understand. Note that although there are a lot of error checking, but it is not exhaustive, very complicated. In the course of your reading, if you find any errors, please visit here for discussion.
def update_stats (Command, Success): "" "Update the stats dict with info on if executing *command* was a *success*" "If success:stats[command][' success '] + = 1 else:stats[command][' ERROR '] + = 1def handle_put (key, VA Lue): "" "Return a tuple containing True and the message to send back to the client." " Data[key] = value return (True, ' key [{}] ' set to [{}] '. Format (key, value)) def handle_get (key): "" "return a tuple con Taining True If the key exists and the message to send back to the client "" "If key is not in Data:return (fals E, ' Error:key [{}] not found '. Format (key)) Else:return (True, Data[key]) def handle_putlist (key, value): "" " Return a tuple containing True if the command succeeded and the message to send back to the client. "" Return Handle_put (key, value) def handle_putlist (key, value): "" "return a tuple containing True if the command succeeded And the message to send back to the client "" "Return HanDle_put (key, value) def handle_getlist (key): "" "Return a tuple containing True if the key contained a list and the Me Ssage to send back to the client. "" Return_value = exists, value = Handle_get (key) if not Exists:return return_value elif not isinstance (value, List): Return (False, ' Error:key [{}] contains non-list value ([{}]) '. Format (Key, value)) Else: Return return_valuedef handle_increment (key): "" "return a tuple containing True if the key ' s value could be increme Nted and the message to send back to the client. "" Return_value = exists, value = Handle_get (key) if not Exists:return return_value elif not isinstance (list_va Lue, List): Return (False, ' Error:key [{}] contains non-list value ([{}]) '. Format (Key, value)) Else : Data[key].append (value) return (True, ' key [{}] had value [{}] appended '. Format (key, value)) def Handle_del Ete (Key): "" "Return a tuple containing True If The key could is deleted and the message to send back to the client. "" If key isn't in Data:return (False, ' Error:key [{}] is not found and could not is deleted. '. Format (key)) Else:del Data[key]def handle_stats (): "" "Return a tuple containing True and the contents of the STATS dict. "" " Return (True, str (STATS))
There are two points to note: Multiple assignments (multiple assignment) and code reuse. Some functions are simply wrappers for more logical use of existing functions, such as handle_get and handle_getlist . Since we sometimes just need a return value for an existing function, and other times we need to check what the function is actually returning, we are using multiple assignments .
Take handle_append a look. If we try to invoke handle_get but the key does not exist, then we simply return the handle_get returned content. In addition, we want to be able to handle_get reference the returned tuple as a single return value. Then when the key does not exist, we can simply use it return return_value .
If it does exist , then we need to check the return value. Also, we want to be able to handle_get reference the return value as a separate variable. To be able to handle both of these situations, we use multiple assignments, taking into account situations where the results need to be handled separately. As a result, you don't have to write multiple lines of code and keep your code clear. return_value = exists, list_value = handle_get(key)can explicitly indicate the return value that we are going to reference in at least two different ways handle_get .
How are this a Database?
The program above is obviously not an RDBMS, but it is definitely a NoSQL database. It's so easy to create because we don't have any actual interaction with data . We just did a minimalist type check to store any content that the user sent. If you need to store more structured data, we may need to create a schema for the database to store and retrieve the data.
Since NoSQL databases are easier to write, easier to maintain, and easier to implement, why not just use MongoDB? Of course there is a reason, or that sentence, there will be lost, we need to the NoSQL database to provide the data flexibility (flexibility) based on the database of the search (searchability).
Querying Data
If we have the NoSQL database above to store the earlier Car data. Then we might use VIN as a key, using a list as the value of each column, that is, 2134AFGER245267 = ['Lexus', 'RX350', 2013, Black] . Of course, we've lost the meaning of each index in the list (meaning) . We just need to know where to index 1 of the Mod that stores the car. El, index 2 stores year.
The bad thing is, what happens when we want to execute the previous query statement? Find 1994 all the colors of the car will become a nightmare. We must traverse DATA each value to confirm whether the value stores the car data or is otherwise irrelevant data, such as checking index 2, to see whether the value of index 2 equals 1994, and then continue to take the value of index 3. This is more than the table scan. Worse, because it not only scans every row of data, it also needs to apply some complex rules to answer queries.
The authors of the NoSQL database, of course, are aware of these issues, (given that the query is a very useful feature) and they have come up with ways to make the query less "unreachable". One approach is to structure the data used, such as JSON, to allow referencing other rows to represent the relationship. At the same time, most NoSQL databases have the concept of namespaces (namespace), and a single type of data can be stored in a "section" of the type that is unique to the database, allowing the query engine to take advantage of the "shape" information of the data being queried.
Of course, although there are some more complex ways to enhance the accessibility of the query, it is always an inescapable problem to compromise between storing a smaller number of schemas and enhancing the accessibility of the query. In this case, our database only supports querying via key. If we need to support richer queries, things can get a lot more complicated.
Summary
At this point, I hope that the concept of "NoSQL" is already very clear. We learned a little bit of SQL and learned how the RDBMS works. We saw how to retrieve data from an RDBMS (using SQL query). By building a NoSQL database of toys, we learned about some of the issues between Queryable and conciseness, and discussed some of the issues that database authors used to address these problems. Some of the methods.