Custom serialization in MongoDB

Source: Internet
Author: User
Tags: mongodb, documentation, mongodb query

I have been studying MongoDB recently. It turns out that cnblogs (Blog Garden) supports Live Writer.

I was excited to come back to it after so many years; I used to write to MySpace and WordPress with Live Writer.

But the former is gone, and the latter is a hassle.

==========================================================================

First, let me recommend a MongoDB query analyzer:

MongoVue

This tool is very easy to use. Even after the trial period expires it can still be used, although only three query windows can be opened at a time.

 

 

I have used db4o and protobuf-net before, so MongoDB feels very familiar: the concepts are largely the same, especially around object persistence, with only slightly different details.

 

==========================================================================

1. Requirement:

One of my new algorithms needs to read an entire collection, which takes dozens of seconds.

At first I used the driver's automatic, attribute-based serialization and deserialization, and no matter how I tuned things, the performance of InsertBatch and FindAll() did not improve.
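For context, here is a minimal sketch of what the attribute-mapped classes might have looked like (the exact class definitions are not shown in this post; the property names are taken from the BSON sample below, and WordTypes is assumed to be a small enum or int):

using System.Collections.Generic;
using MongoDB.Bson;

// Hypothetical reconstruction of the classes that the driver's automatic
// class-map serialization was persisting.
public class Sentence
{
    public ObjectId Id { get; set; }
    public string Value { get; set; }
    public List<CharObj> Chars { get; set; }
}

public class CharObj
{
    public List<WordObj> Words { get; set; }
}

public class WordObj
{
    public int Index { get; set; }
    public int Length { get; set; }
    public int WordTypes { get; set; }
}

// With automatic mapping, storing and loading is simply:
//     collection.InsertBatch(sentences);
//     var all = collection.FindAll().ToList();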

 

2. Thinking:

At first I assumed the extremely slow reads and writes were MongoDB's own problem. Thinking it over carefully today, the key issue is that far too much data is being written to disk.

MongoDB persists data in BSON format, and this format carries a lot of redundancy, especially under the default serialization and deserialization.

 

"_ Id": objectid ("4f4e2a02c992571e54c30465 "),
"Value": "XXXXX ",
"Chars ":[{
"Words ":[{
"Index": 0,
"Length": 2,
"Wordtypes": 0
}]
},{
"Words ":[{
"Index": 0,
"Length": 2,
"Wordtypes": 0
},{
"Index": 1,
"Length": 2,
"Wordtypes": 0
}]
},

Viewing the final data format in MongoVue shows that most of the storage is spent on attribute names that carry almost no information. A quick calculation shows the names take up roughly 5-10 times as much space as the values.

By contrast, protobuf uses numeric tags in place of attribute names, which saves a great deal of space.

However, MongoDB can query on fields and protobuf cannot, so MongoDB does not take the protobuf approach.

I have a collection of 50,000 documents with an average document size of about 4,000 bytes, i.e. roughly 200 MB in total. Such inefficient persistence is really surprising: reading the data back takes dozens of seconds.
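As a rough sanity check (a sketch, assuming the standard BSON element layout of a one-byte type tag, a NUL-terminated element name, and then the raw value), you can measure how much of a single Words element is payload versus naming overhead:

using System;
using MongoDB.Bson;

class BsonSizeCheck
{
    static void Main()
    {
        // One "Words" entry exactly as the default serializer stores it.
        var word = new BsonDocument
        {
            { "Index", 0 },
            { "Length", 2 },
            { "WordTypes", 0 }
        };

        // The three field names alone cost 20 bytes, against only 12 bytes
        // of Int32 payload; the rest is type tags, NULs and the length prefix.
        Console.WriteLine(word.ToBson().Length);   // 43 bytes for this tiny sub-document
    }
}

The ratio only gets worse once the wrapper names ("Chars", "Words") and the per-document overhead of the nested structure are counted in.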

I had read the official MongoDB documentation,

http://www.mongodb.org/display/DOCS/CSharp+Language+Center

so I had some impression that serialization could be customized.

The description in the official documentation is very brief. It only says that the class should implement the IBsonSerializable interface and its four methods; there is no example, and I had no idea how to actually do it.

public class MyClass : IBsonSerializable
{
    // implement the Deserialize method
    // implement the Serialize method
    // implement the GetDocumentId method
    // implement the SetDocumentId method
}

Okay, time for some heavy Googling.

Stack Overflow is a good site:

http://stackoverflow.com/questions/7105274/storing-composite-nested-object-graph

3. Solution:

Part 1: Pack the object into a single number, eliminating the space consumed by attribute names

 

public UInt32 IntValue
{
    get
    {
        var v1 = ((UInt32)WordTypes) << 24;
        var v2 = ((UInt32)Index) << 16;
        var v3 = ((UInt32)Length) << 8;
        var v4 = (UInt32)0; // reserved

        return v1 | v2 | v3 | v4;
    }
}

public void FromInt32(UInt32 value)
{
    this.WordTypes = (WordTypes)(value >> 24);
    this.Index = (Int32)(value << 8 >> 24);
    this.Length = (Int32)(value << 16 >> 24);
}

 
There is nothing special above; it is just left and right shifts. Of course the values may overflow the data type, in which case switch to Int64 or adjust as appropriate.
Note: I do not need to query the fields of this third-level object in MongoDB; I only store it. For retrieval it is converted into a separate string keyword instead.
Since no field-level search is needed, the attributes do not need names at all, so several attributes can be OR'ed bitwise into one value and stored in an array, and every object is still fully preserved.
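A quick round trip of the packing, as a sketch (assuming WordObj exposes the members used above and WordTypes fits in 8 bits):

var w = new WordObj { Index = 1, Length = 2, WordTypes = 0 };
UInt32 packed = w.IntValue;        // 0x00010200 == 66048

var restored = new WordObj();
restored.FromInt32(packed);        // Index == 1, Length == 2, WordTypes == 0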
 
Part 2: Implement IBsonSerializable on the Sentence class
public partial class Sentence : IBsonSerializable
{
    public static int IdSum;

    public bool GetDocumentId(out object id, out Type idNominalType, out IIdGenerator idGenerator)
    {
        id = this.ID = IdSum++;
        idNominalType = typeof(int);
        idGenerator = null;
        return true;
    }

    public void Serialize(MongoDB.Bson.IO.BsonWriter bsonWriter, Type nominalType, IBsonSerializationOptions options)
    {
        bsonWriter.WriteStartDocument();
        bsonWriter.WriteInt32("_id", this.ID);          // an Int32 _id saves more than 10 bytes compared with an ObjectId
        bsonWriter.WriteString("value", this.Value);    // shortening the names to a couple of letters would save a dozen more bytes
        bsonWriter.WriteString("Words", this.WordStr);
        bsonWriter.WriteBoolean("isconf", this.IsConflict);
        bsonWriter.WriteStartArray("C");

        foreach (var item in Chars)
        {
            BsonSerializer.Serialize(bsonWriter, item.Words.Select(v => v.IntValue).ToList());
        }

        bsonWriter.WriteEndArray();
        bsonWriter.WriteEndDocument();
    }

    public void SetDocumentId(object id)
    {
        throw new NotImplementedException();
    }

    public object Deserialize(MongoDB.Bson.IO.BsonReader bsonReader, Type nominalType, IBsonSerializationOptions options)
    {
        //bsonReader.ReadStartDocument();
        //this.ID = bsonReader.ReadInt32();
        //var value = bsonReader.ReadString("v");
        //var wordStr = bsonReader.ReadString("W");
        //bsonReader.ReadStartArray();

        //var list = new List<Int32>();
        //while (bsonReader.ReadBsonType() != BsonType.EndOfDocument)
        //{
        //    var element = BsonSerializer.Deserialize<List<Int32>>(bsonReader);
        //    list.Add(element);
        //}

        //bsonReader.ReadEndArray();
        //var isConflict = bsonReader.ReadBoolean("i");
        //bsonReader.ReadEndDocument();

        if (nominalType != typeof(Sentence))
            throw new ArgumentException("Deserialization is not allowed because the type definitions are inconsistent");

        var doc = BsonDocument.ReadFrom(bsonReader);

        this.ID = (Int32)doc["_id"];
        this.Value = (string)doc["value"];
        this.WordStr = (string)doc["Words"];
        this.IsConflict = (bool)doc["isconf"];
        var list = (BsonArray)doc["C"];

        this.Chars = new List<CharObj>();
        for (int i = 0; i < list.Count; i++)
        {
            var ch = new CharObj { Index = i, Sen = this, Words = new List<WordObj>() };
            this.Chars.Add(ch);

            var words = (BsonArray)list[i];

            foreach (Int32 item in words)
            {
                var wordObj = new WordObj((UInt32)item);
                wordObj.Sen = this;
                ch.Words.Add(wordObj);
            }
        }

        return this;
        //return new Sentence { ID = 1, IsConflict = true, Value = "1", WordStr = "1" };
    }
}
   

There are several points to note:

The first is ID generation. I don't know why the id-assignment method (GetDocumentId) has such complicated parameters, but it does let you bypass the GUID-like ObjectId and use an int, saving some space.

Of course, if the documents are large overall, just use ObjectId; there is no need to bother with int at all. An int id also brings plenty of problems: you have to keep the current maximum value in another collection, and unlike ObjectId it cannot stay unique across multiple collections. So MongoDB's decision to use ObjectId rather than int for document ids makes sense, and when documents are large the ten or so bytes saved are not worth the trouble.
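If you do want Int32 ids anyway, the usual pattern is a separate counters collection that is incremented atomically. A sketch using the legacy 1.x driver (the collection and field names here are illustrative, not from the original post):

using MongoDB.Driver;
using MongoDB.Driver.Builders;

public static class SentenceIds
{
    // Issues sequential Int32 ids from a "counters" collection.
    public static int Next(MongoDatabase db)
    {
        var counters = db.GetCollection("counters");
        var result = counters.FindAndModify(
            Query.EQ("_id", "sentence"),   // one counter document per sequence
            SortBy.Null,
            Update.Inc("seq", 1),
            true,                          // return the updated document
            true);                         // upsert on first use
        return result.ModifiedDocument["seq"].AsInt32;
    }
}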

 

Second, the Serialize method must begin with bsonWriter.WriteStartDocument() and end with bsonWriter.WriteEndDocument(). Remember this, otherwise you will get an error saying the document cannot be written.

 

Third, how do you write a two-level nested collection? The obvious attempt is:

             foreach (var item in Chars)
            {
                bsonWriter.WriteStartArray("words");
                foreach (var w in item.Words)
                    bsonWriter.WriteInt32((Int32)w.IntValue);
                bsonWriter.WriteEndArray();
            }          

 

However, MongoDB will not persist the nested structure written this way.

 

It must be changed to:

             foreach (var item in Chars)
            {
                BsonSerializer.Serialize(bsonWriter, item.Words.Select(v=>v.IntValue).ToList());  
            }         
 
Note that although the BsonSerializer.Serialize parameter is an IEnumerable<T>, you must call ToList() on it; otherwise the data will not be saved successfully.
 
Fourth, when deserializing, the ReadStartDocument/ReadEndDocument approach cannot be used directly here; it always errors out. Instead, read the whole document once with BsonDocument.ReadFrom and then pick the values out of it like a dictionary.
 
4. Comparison

 

 

The new BSON storage format is noticeably more compact.
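The sample of the new output did not survive in this copy, but based on the Serialize method above a document now looks roughly like this (the string and boolean values are illustrative; the packed integers correspond to the Words entries of the old sample):

"_id" : 0,
"value" : "XXXXX",
"Words" : "...",
"isconf" : false,
"C" : [[512], [512, 66048]]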

"_ Id": objectid ("4f4e2a02c992571e54c30465 "),
"Value": "XXXXX ",
"Chars ":[{
"Words ":[{
"Index": 0,
"Length": 2,
"Wordtypes": 0
}]
},{
"Words ":[{
"Index": 0,
"Length": 2,
"Wordtypes": 0
},{
"Index": 1,
"Length": 2,
"Wordtypes": 0
}]
},

 

Compared with the original format, the difference is striking.

 

Checking the average document size in MongoVue: it is now only 364 bytes, down from the frightening 4,000 before.

The total size also drops from roughly 200 MB to 17 MB.

 

Reading now takes about 9 seconds on my laptop, versus more than 40 seconds before. A desktop hard disk is faster still, several times faster, finishing within a few seconds.

 

 

5. Others

In fact, why implement custom persistence at all? First, because the performance was quite worrying. Second, to re-bind the association pointers between objects.

Previously, data read from the database needed its cross-object pointers restored by hand. Now this can be done directly inside the Deserialize method.

In other words, the objects coming out of a query are immediately identical to the in-memory object graph.

The advantage is that it greatly reduces the complexity of the program.
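For example, a sketch (the collection name and the typed-collection setup are assumptions, not from the original post):

using System.Diagnostics;
using System.Linq;
using MongoDB.Driver;

static class GraphCheck
{
    // Read the collection back and verify that Deserialize rebuilt the back-pointers.
    public static void Run(MongoDatabase db)
    {
        var collection = db.GetCollection<Sentence>("sentences");   // collection name assumed
        var first = collection.FindAll().First();

        Debug.Assert(first.Chars[0].Sen == first);            // char -> sentence pointer restored
        Debug.Assert(first.Chars[0].Words[0].Sen == first);    // word -> sentence pointer restored
    }
}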

 

You manipulate MongoDB data objects through pointers just like in-memory objects, and the data is persisted automatically.

Er... I find myself falling in love with MongoDB, despite its many shortcomings.
