Contents
- Advantages:
- Technical Overview:
- Data format:
- Cache:
- Update mechanism (how to maintain data consistency):
- Distributed:
- Stability:
- Python client:
- What CouchDB can and cannot do:
- Prospects:
- References
CouchDB is a database that geeks have been enthusiastic about for the past two years. Its author is a former Lotus developer. Unlike traditional relational databases, it calls itself a document database. "Document database" does not mean that it can only store text; it is really a schema-less concept. Anyone who has used a relational database knows that every field in a table is declared with a type: int, char, datetime, and so on. A CouchDB document, by contrast, has only three parts: the document ID, the document revision number, and the content. The content can be thought of as one free-form text field in which you define data as you like without worrying about data types, except that the data must be expressed and stored in JSON. For example, a document describing a user can be represented as: {"_id": "1001", "_rev": "1-32443289", "name": "wentrue", "location": "beijing"}. CouchDB is implemented in Erlang and provides its services through a RESTful API. All of its read and write capabilities are available through plain HTTP requests. Because the service interface is so uniform and concise, clients are easy to develop in all kinds of languages, which makes CouchDB convenient for programmers of different backgrounds.
There is not much documentation for CouchDB; most of it is in the overview and the wiki on its website. Here I reorganize it according to my own understanding. I dare not call the result a "complete guide" (Daquan), but it should pass as a "small guide" (Xiao Quan). The following is a brief description organized by keyword.
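As a rough illustration of the HTTP interface described above, the sketch below builds (but does not send) the requests for writing and reading the example user document, using only the Python standard library. The server address http://localhost:5984, the database name "users", and the helper function names are assumptions made for this example; PUT /db/docid and GET /db/docid are CouchDB's document endpoints.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumed local CouchDB address
DB_NAME = "users"                   # hypothetical database name


def build_put_request(doc_id, content, rev=None):
    """Build a PUT request that creates or updates one document."""
    body = dict(content)
    if rev is not None:
        body["_rev"] = rev  # needed when updating an existing document
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{DB_NAME}/{doc_id}",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_get_request(doc_id):
    """Build a GET request that reads one document."""
    return urllib.request.Request(f"{BASE_URL}/{DB_NAME}/{doc_id}", method="GET")


put_req = build_put_request("1001", {"name": "wentrue", "location": "beijing"})
get_req = build_get_request("1001")
print(put_req.get_method(), put_req.full_url)
print(get_req.get_method(), get_req.full_url)
# To actually send a request against a running server:
#   urllib.request.urlopen(put_req)
```

Keeping request construction separate from sending makes it easy to inspect what would go over the wire before pointing the code at a real server.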
Advantages:
CouchDB's main strengths are ease of use and concurrency. The latter, as far as I know, comes mainly from Erlang, which is known for making parallelism easy; I have not tested it myself. Ease of use, however, I have experienced first-hand. Once the service is running, you have not only a data server at your disposal but also a simple web server. If you deploy it locally and open http://localhost:5984/_utils/ in a browser, you will see an administration and query console. In that console you can do basically everything that other clients can do. In other words, even if you do not know any programming language, you can use CouchDB freely through this management interface. And if you know some JavaScript, you can do just about anything with CouchDB.
What does this ease of use mean in practice? Suppose you are a website developer with a lot of data to store, but you do not want to deal with complicated back-end logic or spend much effort on back-end maintenance. With CouchDB you can fetch data directly with JavaScript, display it with HTML + CSS, and write it back with JavaScript. That gives you a simple content publishing system with no back-end architecture at all. There are in fact many websites built on CouchDB in this way.
Technical Overview:
At the bottom of CouchDB is a B-tree storage structure. For efficiency, every insert or update is simply appended as a new leaf node of the tree without deleting the old node, and the revision number determines which data is the latest. The revision number is also used to resolve conflicts between concurrent writes. As a result, the data file keeps growing. You can run the compact or replication process at an appropriate time to remove the old data and shrink the data file.
Comparison with MySQL (what prompted me to leave MySQL for the moment):
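The append-and-compact behaviour described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: it uses a plain Python list instead of a B-tree, integer version numbers instead of CouchDB's revision strings, and the class and method names are made up for this example.

```python
class AppendOnlyStore:
    """Toy append-only store: updates append new versions, never overwrite."""

    def __init__(self):
        self.log = []  # list of (doc_id, version, content), oldest first

    def write(self, doc_id, content):
        """Append a new version of the document and return its version."""
        version = self.latest_version(doc_id) + 1
        self.log.append((doc_id, version, content))
        return version

    def latest_version(self, doc_id):
        versions = [v for d, v, _ in self.log if d == doc_id]
        return max(versions, default=0)

    def read(self, doc_id):
        """Return the content with the highest version number, or None."""
        best = None
        for d, v, content in self.log:
            if d == doc_id and (best is None or v > best[0]):
                best = (v, content)
        return None if best is None else best[1]

    def compact(self):
        """Drop every entry that is not the latest version of its document."""
        latest = {}
        for d, v, content in self.log:
            if d not in latest or v > latest[d][1]:
                latest[d] = (d, v, content)
        self.log = list(latest.values())


store = AppendOnlyStore()
store.write("1001", {"location": "beijing"})
store.write("1001", {"location": "shanghai"})  # old version stays in the log
store.write("1002", {"location": "hangzhou"})
size_before = len(store.log)
store.compact()
size_after = len(store.log)
print(size_before, size_after, store.read("1001"))
```

The point of the sketch is the trade-off: writes are cheap because nothing is rewritten in place, but space is only reclaimed when compaction runs.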
1. Flexible attributes. In CouchDB, attributes can be added and removed freely: to add a new attribute you simply write one more attribute into the content. In MySQL, adding or dropping a column once the table holds a large amount of data takes a long time, and doing it online is even harder to bear (depending on the data size, the table may be locked for anywhere from tens of minutes to a day);
2. Inconsistent attributes between documents. If some records have an attribute and others do not, MySQL can only set the value to 0 or null for the records without it, which affects both storage and queries;
3. Query efficiency. You often need to query on some field to obtain the subset of records that meets a condition. For MySQL to query every field efficiently, you have to create an index on each such field, which makes the index files larger, and creating an index on a new field is again a lengthy process that locks the table. CouchDB uses the map-reduce approach to distribute the computation. (Note: unlike Google's, this map-reduce actually runs on a single machine. I found no multiple processes and no setting for the number of threads, which is what puzzles me most. Is it still single-threaded stream processing?) Although CouchDB, like MySQL without an index, has to scan all the data, you can create a permanent index for common queries. Such indexes are updated incrementally as the data grows, which reduces query time.
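The idea in point 3, a map step that emits key/value pairs, a reduce step that aggregates them, and an index that is updated incrementally, can be sketched as below. This is a toy model in Python, not CouchDB's view engine: the class ToyView, the function names, and the extra user names are hypothetical, and the model assumes documents are only ever appended, never changed.

```python
from collections import defaultdict


def map_by_location(doc):
    """Map step: emit (key, value) pairs for one document."""
    if "location" in doc:  # documents without the attribute emit nothing
        yield doc["location"], 1


class ToyView:
    """Toy permanent view: keeps emitted rows and only maps new documents."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = defaultdict(list)  # key -> list of emitted values
        self.docs_mapped = 0           # how many documents were processed

    def update(self, docs):
        """Incremental update: run the map step only on unseen documents."""
        for doc in docs[self.docs_mapped:]:
            for key, value in self.map_fn(doc):
                self.rows[key].append(value)
        self.docs_mapped = len(docs)

    def reduce(self):
        """Reduce step: sum the emitted values per key."""
        return {key: sum(values) for key, values in self.rows.items()}


docs = [
    {"name": "wentrue", "location": "beijing"},
    {"name": "alice"},  # no location attribute
]
view = ToyView(map_by_location)
view.update(docs)  # first query: every document is mapped
docs.append({"name": "bob", "location": "beijing"})
view.update(docs)  # later query: only the new document is mapped
counts = view.reduce()
print(counts)
```

Note how the document without a location needs no placeholder value; it simply contributes nothing to the index, which is the contrast with the 0-or-null workaround in point 2.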
Data format:
CouchDB stores data in JSON, and by default it also returns data in JSON, which is very convenient for front-end JavaScript. However, if you fetch a large amount of data from another language such as Python, parsing that huge JSON string into a dict with the simplejson package takes noticeable time. One solution is to upgrade simplejson to the latest version, which can be several times faster. Alternatively, you can use cjson, which makes the conversion time negligible. However, cjson's interface and behaviour differ slightly from simplejson's, so if you use couchdb-python as the client you need to modify its code. Another method is to have the server return the data directly in list format and then interpret it directly in Python, avoiding JSON parsing altogether; the speed should then be acceptable. The API for list-format requests can be viewed here.
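To check whether JSON decoding really dominates on your own data, a small timing harness like the following can help. It is a minimal sketch using the standard json module and a synthetic payload; the payload shape and the helper name time_decode are assumptions for illustration, and the timings will vary by machine and library version.

```python
import json
import time


def time_decode(decode, payload, repeats=3):
    """Return (result, best_seconds) for decoding payload with decode()."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = decode(payload)
        best = min(best, time.perf_counter() - start)
    return result, best


# Synthetic payload, loosely shaped like a large query result.
rows = [{"id": str(i), "value": {"name": f"user{i}"}} for i in range(10000)]
payload = json.dumps({"total_rows": len(rows), "rows": rows})

decoded, seconds = time_decode(json.loads, payload)
print(f"decoded {len(decoded['rows'])} rows in {seconds:.4f}s")
# To compare another decoder, pass its function instead of json.loads,
# for example simplejson.loads if that package is installed.
```

Because the decoder is passed in as a function, the same harness can compare several JSON libraries on an identical payload.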
Cache:
There appears to be a cache, because the second run of a query is much faster than the first, and you can also see temporary files generated by the query in CouchDB's data directory. However, I suspect that most of the data still lives on the hard disk, because the CouchDB service uses little memory, so the data is unlikely to be held there. Judging by the speed of the second query, the data does not seem to come straight from memory; it looks more like it is read from disk.
Update mechanism (how to maintain data consistency):
As mentioned above, each document in CouchDB has three parts. _rev is the revision number of the document, and it changes every time the document is updated. This revision number plays an important role in maintaining data consistency. CouchDB puts all attributes, schema-free, into a single content field, so when two programs modify that field at the same time, the data written first may be overwritten by the data written later, resulting in inconsistency. CouchDB uses the revision number to solve this problem. When a program wants to modify a document, it must first read the document's revision number and then send that revision number along with the write. CouchDB checks whether the submitted revision matches the one currently stored. If it matches, the data is written. If it does not match, the document has been modified since the program read it, so CouchDB rejects the write and returns a conflict signal for the client to handle.
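The read-then-write-with-revision cycle described above can be sketched with a small in-memory model. This is a toy illustration, not CouchDB's implementation: revisions here are plain integers rather than CouchDB's revision strings, and ToyDatabase, ConflictError, and update_with_retry are hypothetical names.

```python
class ConflictError(Exception):
    """Raised when the supplied revision is not the current one."""


class ToyDatabase:
    def __init__(self):
        self.docs = {}  # doc_id -> (rev, content)

    def get(self, doc_id):
        return self.docs[doc_id]

    def put(self, doc_id, content, rev=None):
        """Write only if rev matches the stored revision (None for new docs)."""
        current = self.docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self.docs[doc_id] = (new_rev, dict(content))
        return new_rev


def update_with_retry(db, doc_id, change, max_attempts=3):
    """Read, modify, write; on conflict, re-read and try again."""
    for _ in range(max_attempts):
        rev, content = db.get(doc_id)
        try:
            return db.put(doc_id, change(dict(content)), rev)
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")


db = ToyDatabase()
rev1 = db.put("1001", {"name": "wentrue", "location": "beijing"})

# Two programs read the same revision.
rev_a, doc_a = db.get("1001")
rev_b, doc_b = db.get("1001")

db.put("1001", {**doc_a, "location": "shanghai"}, rev_a)  # first write wins
try:
    db.put("1001", {**doc_b, "name": "wen"}, rev_b)  # stale revision
    conflict_seen = False
except ConflictError:
    conflict_seen = True

# The losing program retries on top of the latest revision.
final_rev = update_with_retry(db, "1001", lambda d: {**d, "name": "wen"})
print(conflict_seen, final_rev, db.get("1001")[1])
```

The retry helper shows one possible way for a client to handle the conflict signal: re-read the latest revision and reapply its change, so neither program's update is silently lost.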
Distributed:
Currently, CouchDB (<1.0) has no built-in distributed processing mechanism. The approach to distribution and data consistency advocated on the official website is therefore this: you set up one CouchDB service that takes the writes, and the replication process copies the data to other machines, which act as read services. This distributed architecture sounds a bit silly, and it is. As a consequence, consistency is hard to guarantee: you may post a message on a website and not see it immediately after refreshing the page. I do not know whether the CouchDB developers will work on this in the future. At least the roadmap we can see now does not include more distributed support.
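The "post a message and not see it after refreshing" effect follows directly from the one-writer, many-readers layout. The toy model below illustrates it; the class names and the explicit replicate() call are assumptions for this sketch and do not mirror how CouchDB schedules replication.

```python
class Node:
    """Toy node holding a copy of the data."""

    def __init__(self):
        self.docs = {}


class SingleWriterCluster:
    """One write node; replicas receive changes only when replicate() runs."""

    def __init__(self, replica_count):
        self.writer = Node()
        self.replicas = [Node() for _ in range(replica_count)]
        self.pending = []  # changes not yet copied to the replicas

    def write(self, doc_id, content):
        self.writer.docs[doc_id] = content
        self.pending.append((doc_id, content))

    def read(self, doc_id, replica_index=0):
        """Reads are served by a replica, never by the write node."""
        return self.replicas[replica_index].docs.get(doc_id)

    def replicate(self):
        """Copy all pending changes to every replica."""
        for doc_id, content in self.pending:
            for replica in self.replicas:
                replica.docs[doc_id] = content
        self.pending.clear()


cluster = SingleWriterCluster(replica_count=2)
cluster.write("msg-1", {"text": "hello"})
before = cluster.read("msg-1")  # replica has not caught up yet
cluster.replicate()
after = cluster.read("msg-1", replica_index=1)
print(before, after)
```

Until replication has run, a read from any replica misses the new message, which is exactly the window of inconsistency described above.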
Stability:
Introductions on the Internet present stability as one of CouchDB's features. But the CouchDB 0.9 service that I installed directly from Gentoo's Portage disappeared three times in one week, leaving no trace in the log. So I have reservations about this claim.
Python client:
Because CouchDB provides a simple RESTful interface, writing clients for it in various languages is easy. A large number of clients are listed here. The main Python clients are couchdb-python and couchdbkit. I used the former in my experiments, and the problem I ran into was the efficiency of JSON parsing. If you parse with simplejson, or with the json module that ships with Python (versions after 2.5), all you can do when you hit a large string or dict is wait. At first I thought CouchDB itself was very slow; who would have thought that JSON parsing was taking most of the time. To solve this problem you can install cjson, which is fast, but then you need to modify the client's source code. Another option is to upgrade simplejson to the latest version, which improves the speed greatly.
What CouchDB can and cannot do:
Even with the JSON parsing problem solved, you should not have too much confidence in CouchDB for reading and writing large amounts of data. After all, it is designed for concurrent use by front ends, not for reading or updating millions of records at a time. My test results are as follows. Reading a million records (each record is small): the first query takes a few minutes, and the second query takes about one minute. Updating a million records: you have to wait more than 20 minutes. So if you want to perform batch reads and writes in the background and cannot tolerate this speed, CouchDB is not the best choice. Also, as you update data, the data file grows steadily, because new data is appended to the end of the file and the old data is not deleted. A CouchDB deployment therefore needs plenty of disk space, and you must regularly run the compact or replication process to clear out old data by compressing or copying it.
Prospects:
CouchDB has been much sought after in the past two years, and its community is very active. How active the community is tends to be a key factor in the success or failure of an open source project. The key to CouchDB's appeal is, of course, that it offers some good solutions for storing and indexing web data. If we think of a relational database as a set of workbooks with fixed formats, then a document database is a collection of all sorts of documents. Compared with relational databases, document databases are clearly better suited to diverse and flexible web data. CouchDB is easy to get started with and the system is simple, which matches the geek ideal of being concise yet sufficient. The line in CouchDB's publicity, "like a porn star", is also quite expressive. So I am optimistic about its development. What will the stable version 1.0 look like? It is destined to be a database closely tied to the web.
Personally, I hope it improves further in distributed processing, service configuration, and query efficiency. As for processing large volumes of data, I do not expect much, because it is meant to be a website front-end product rather than a tool for back-end analysis.
References
1. CouchDB official website: http://couchdb.apache.org/
2. CouchDB wiki: http://wiki.apache.org/couchdb/
3. CouchDB getting started guide: http://erlang-china.org/study/couchdb-guide.html