The PHP serialization (serialize) format in detail

1. Objective
PHP (starting with PHP 3.05) provides a pair of functions for saving objects: serialize and unserialize. However, the PHP manual only describes how to use these two functions; it does not describe the format of the serialized result. This makes it cumbersome to implement PHP-style serialization in other languages. Some implementations of PHP serialization already exist in other languages, but they are incomplete, and they produce errors when serializing or deserializing more complex objects. So I decided to write this document about the details of the PHP serialization format, so that I would have a more complete reference when writing PHP serializers in other languages. Everything in this article comes from writing test programs and reading the PHP source code, so I cannot guarantee that it is 100% correct, but I will do my best to ensure its correctness, and I will say so clearly in the text wherever I am not sure. Additions and corrections are welcome.
2. Overview
PHP serialized content is a simple text format, but it is sensitive to letter case and to whitespace (spaces, carriage returns, newlines, etc.), and string lengths are counted in bytes (8-bit characters). It is therefore more accurate to describe PHP serialized content as a byte-stream format. When implementing it in another language whose strings are stored as Unicode rather than as bytes, the serialized content should not be kept in a string; it should be kept in a byte-stream object or a byte array, otherwise data exchanged with PHP will be corrupted.
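A quick check of the byte-counting rule, as a minimal sketch. The sample string is an assumption chosen for illustration: the UTF-8 encoding of "é", written as explicit byte escapes so that the result does not depend on the encoding of the source file.

```php
<?php
// "é" encoded as UTF-8 occupies two bytes: 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// The length field counts bytes, not characters.
$serialized = serialize($utf8);
echo $serialized, "\n";          // s:2:"é";  (as rendered on a UTF-8 terminal)
echo strlen($serialized), "\n";  // 9
```

An implementation in a language with Unicode strings would need to encode the text to bytes first and then count those bytes.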
PHP uses different letters to mark different types of data. The article "Using Serialized PHP with Yahoo! Web Services" on the Yahoo developer site lists all of the letters and their meanings:
a - array
b - boolean
d - double
i - integer
o - common object
r - reference
s - string
C - custom object
O - class
N - null
R - pointer reference
U - unicode string
N represents NULL, while b, d, i and s represent the four scalar types. PHP serializers implemented in other languages generally handle serialization and deserialization of these types, although some implementations have problems with s (strings).
a and O are the most commonly used composite types. Most implementations in other languages handle a well, but for O they only implement the PHP 4 object serialization format and do not support the extended object serialization format of PHP 5.
r and R represent object references and pointer references respectively. They are also useful: serializing more complex arrays and objects produces data with these two markers. We will explain them in detail later. I have not yet found an implementation in another language that supports them.
C was introduced in PHP 5 and represents a custom object serialization method. Because it is seldom used, implementations in other languages do not strictly need it, but it is explained in detail later.
U was introduced in PHP 6 and represents a Unicode-encoded string. PHP 6 adds the ability to store strings as Unicode, and so provides a serialization format for such strings. PHP 5 and PHP 4 do not support this type, and those two versions are the current mainstream, so I do not recommend emitting this type when implementing serialization in other languages. Implementing its deserialization is still worthwhile. I will explain its format later.
Finally there is o, the only data type marker I have not figured out. It was introduced in PHP 3 to serialize objects and was replaced by O from PHP 4 onward. In the PHP 3 source code, the serialization and deserialization of o are essentially the same as for the array marker a. In the source code of PHP 4, PHP 5 and PHP 6 it no longer appears in the serializer, but the deserializer in those versions still handles it, and I have not yet worked out what it is converted into. So I will not explain it further for now.
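As a quick, non-authoritative check of the markers described above, the sketch below collects the first byte that serialize() emits for one sample value of each ordinary type. The sample values are arbitrary choices for illustration; r, R, C, U and o do not show up here because they only arise in the special situations covered later.

```php
<?php
// One sample value per ordinary type (arbitrary illustrative values).
$samples = array(null, true, 1, 1.5, 'x', array(), new stdClass());

// Collect the leading marker of each serialized value.
$markers = '';
foreach ($samples as $value) {
    $markers .= substr(serialize($value), 0, 1);
}
echo $markers, "\n"; // NbidsaO
```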
3. NULL and serialization of scalar types
The serialization of NULL and scalar types is the simplest, and it forms the basis for the serialization of composite types. Many PHP developers will already be familiar with this part. If you feel you have mastered it, you can skip this chapter.
3.1. Serialization of NULL
In PHP, NULL is serialized as:
N;
3.2. Serialization of Boolean data
The Boolean data is serialized as:
b:<digit>;
Where <digit> is 0 or 1: <digit> is 0 when the Boolean value is false, and 1 when it is true.
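The two formats above can be checked directly. The sketch also shows one practical consequence: unserialize() returns false on failure, so a genuine serialized false has to be told apart from an error by looking at the input. The variable names are illustrative only.

```php
<?php
var_dump(serialize(null));   // string(2) "N;"
var_dump(serialize(true));   // string(4) "b:1;"
var_dump(serialize(false));  // string(4) "b:0;"

// unserialize() returns false on failure, so "b:0;" must be recognised
// by comparing the input itself.
$input = 'b:0;';
$value = unserialize($input);
$failed = ($value === false && $input !== 'b:0;');
var_dump($failed);           // bool(false)
```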
3.3. Serialization of Integer data
Integer data (integers) are serialized as:
i:<number>;
Where <number> is an integer in the range -2147483648 to 2147483647 (the range of a 32-bit signed integer; on builds of PHP with 64-bit integers the range is correspondingly larger). The number may be preceded by a sign. If the number being serialized exceeds this range, PHP serializes it as a floating-point type instead of an integer. If serialized data produced elsewhere contains an integer outside this range (PHP's own serializer never produces one), deserializing it will not return the expected value.
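A minimal sketch of the integer rules. It uses PHP_INT_MAX rather than a hard-coded limit, on the assumption that the reader may be running a build whose integer range differs from the 32-bit one quoted above.

```php
<?php
var_dump(serialize(-7));                     // string(5) "i:-7;"
var_dump(unserialize('i:+5;'));              // int(5)

// PHP_INT_MAX + 1 no longer fits an integer, so PHP has already turned it
// into a float before serialize() sees it.
$tooBig = PHP_INT_MAX + 1;
var_dump(is_float($tooBig));                 // bool(true)
var_dump(substr(serialize($tooBig), 0, 2));  // string(2) "d:"
```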
3.4. Serialization of double type data
The double type data (floating-point number) is serialized as:
d:<number>;
Where <number> is a floating-point number whose range is the same as that of floating-point numbers in PHP. It can be written in integer form, in decimal form, or in scientific notation. If you serialize positive infinity, <number> is INF; if you serialize negative infinity, <number> is -INF. If the serialized number is larger than the maximum value PHP can represent, deserialization returns infinity (INF); if it is smaller than the minimum precision PHP can represent, deserialization returns 0.
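The accepted spellings of <number> can be tried out as follows. This is only a sketch with arbitrary sample values; it covers the infinite values and the integer and scientific forms mentioned above.

```php
<?php
var_dump(serialize(INF));            // string(6) "d:INF;"
var_dump(serialize(-INF));           // string(7) "d:-INF;"
var_dump(unserialize('d:-INF;'));    // float(-INF)

// Integer form and scientific notation are both accepted after "d:".
var_dump(unserialize('d:5;'));       // float(5)
var_dump(unserialize('d:1.5e3;'));   // float(1500)
```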
3.5. Serialization of string-type data
String data (strings) are serialized as:
s:<length>:"<value>";
Where <length> is the length of <value>. <length> is a non-negative integer and may be preceded by a plus sign (+). <value> is the string value; each character in it is a single-byte character whose range corresponds to character codes 0-255. Each character stands for itself and there are no escape characters. The quotation marks ("") on both sides of <value> are required, but they are not counted in <length>. <value> is effectively a byte stream, and <length> is the number of bytes in that stream.
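Because nothing is escaped, the length field is the only thing that delimits the value. A short sketch with made-up sample strings:

```php
<?php
// Quotes inside the value are not escaped; the byte count delimits it.
var_dump(serialize('say "hi"'));           // string(15) "s:8:"say "hi"";"
var_dump(unserialize('s:8:"say "hi"";'));  // string(8) "say "hi""

// A NUL byte is stored as-is and counted like any other byte.
var_dump(serialize("a\0b") === "s:3:\"a\0b\";"); // bool(true)
```

A parser written in another language should therefore read exactly <length> bytes after the opening quote instead of searching for the closing quote.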
4. Serialization of simple composite types
PHP has two composite types, the array and the object. This chapter covers the serialization format of these two types in simple cases. The serialization of nested composite types and of custom serialization methods is discussed in detail in later chapters.
4.1. Serialization of arrays
Arrays are usually serialized as:
a:<n>:{<key 1><value 1><key 2><value 2>...<key n><value n>}
Where <n> is the number of array elements, <key 1>, <key 2> ... <key n> are the array subscripts, and <value 1>, <value 2> ... <value n> are the values of the elements at those subscripts.
A subscript can only be an integer or a string, and it is serialized in the same format as integer and string data.
An array element value can be of any type, and it is serialized in the same format as that type.
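A small worked example of the array format, using arbitrary sample data. It also shows a detail that follows from how PHP arrays work rather than from the format itself: a string key that looks like a decimal integer is stored as an integer key, so it is written with i: rather than s:.

```php
<?php
$data = array(1 => 1, 'k' => array(true, null));
echo serialize($data), "\n";
// a:2:{i:1;i:1;s:1:"k";a:2:{i:0;b:1;i:1;N;}}

// PHP itself normalises the key '7' to the integer 7.
echo serialize(array('7' => 'x')), "\n";
// a:1:{i:7;s:1:"x";}
```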
4.2. Serialization of objects
Objects are usually serialized as:
O:<length>:"<class name>":<n>:{<field name 1><field value 1><field name 2><field value 2>...<field name n><field value n>}
Where <length> is the string length of the object's class name <class name>, and <n> is the number of fields [1] in the object. These include the fields declared with var, public, protected and private in the object's class and its ancestor classes, but not static fields or constants declared with static and const. In other words, only instance fields are included.
<field name 1>, <field name 2> ... <field name n> are the names of the fields, and <field value 1>, <field value 2> ... <field value n> are the values of the corresponding fields.
A field name is a string and is serialized in the same format as string data.
A field value can be of any type, and it is serialized in the same format as that type.
However, how a field name is serialized depends on its declared visibility. The next section focuses on the serialization of field names.
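A minimal sketch of the object format with public fields only. The class Point and its members are hypothetical names invented for this illustration.

```php
<?php
class Point
{
    const DIMENSIONS = 2;       // constant: not serialized
    public static $count = 0;   // static field: not serialized
    public $x = 1;              // instance fields: serialized
    public $y = 2;
}

echo serialize(new Point()), "\n";
// O:5:"Point":2:{s:1:"x";i:1;s:1:"y";i:2;}
```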
4.3. Serialization of Object field names
Fields declared with var and public are public fields, so their field names are serialized in the same way. The name of a public field is serialized exactly as declared, except that the variable prefix symbol $ used in the declaration is not included.
Fields declared with protected are protected fields: they are visible in the declaring class and in its subclasses, but not from outside through an object instance. When the name of a protected field is serialized, it is preceded by the prefix
\0*\0
Here \0 denotes the single character whose ASCII code is 0, not the two-character combination of a backslash and the digit 0.
Fields declared with private are private fields: they are visible only in the declaring class, not in its subclasses and not through object instances. When the name of a private field is serialized, it is preceded by the prefix
\0<declared class name>\0
Here <declared class name> is the name of the class that declares the private field, not the class name of the object being serialized, because the class that declares a private field is not necessarily the class of the object being serialized; it may be one of its ancestor classes.
When a field name is serialized as a string, the string value includes the prefix added according to its visibility, and the string length includes the length of that prefix. The \0 characters are counted in the length as well.
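The prefixes are easier to see when the NUL bytes are made printable. In the sketch below, the classes Vis, Base and Child and the helper showNul() are hypothetical names introduced only for this illustration; showNul() replaces each NUL byte with the two visible characters \0.

```php
<?php
class Vis
{
    public $a = 1;
    protected $b = 2;
    private $c = 3;
}

class Base
{
    private $secret = 1;
}

class Child extends Base
{
}

// Print each NUL byte as the two characters \0 so it can be read.
function showNul($s)
{
    return str_replace("\0", '\0', $s);
}

echo showNul(serialize(new Vis())), "\n";
// O:3:"Vis":3:{s:1:"a";i:1;s:4:"\0*\0b";i:2;s:6:"\0Vis\0c";i:3;}

// The private field carries the name of the declaring class (Base),
// not the class of the object being serialized (Child).
echo showNul(serialize(new Child())), "\n";
// O:5:"Child":1:{s:12:"\0Base\0secret";i:1;}
```

Note that the lengths 4, 6 and 12 count each NUL as one byte, not as the two characters printed here.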
[1] Note: The PHP manual calls fields "properties". In fact, the object members that PHP 5 lets you define through __set and __get are the ones that deserve to be called properties, because they behave like properties in other languages, whereas what the PHP manual calls properties are called fields in other languages (for example, C#). To avoid confusion, this document calls them fields, not properties.
5. Serialization of nested composite types
The previous chapter discussed the serialization of simple composite types, and simple arrays and objects are indeed easy to serialize. But how does PHP serialize an object or array that contains itself, or the case where A contains B and B in turn contains A? This chapter discusses the serialized form in these cases.
5.1. Object references and pointer references
In PHP, scalar data is passed by value, while composite data (objects and arrays) is passed by reference. However, passing composite data by reference differs depending on whether the & symbol is used to specify the reference explicitly: without &, the result is an object reference; with &, it is a pointer reference.
Before explaining object references and pointer references, let's look at a few examples.
<?php
echo "<pre>";
class SampleClass {
    var $value;
}
$a = new SampleClass();
$a->value = $a;

$b = new SampleClass();
$b->value = &$b;

echo serialize($a);
echo "\n";
echo serialize($b);
echo "\n";
echo "</pre>";
?>
The output of this example is:
O:11:"SampleClass":1:{s:5:"value";r:1;}
O:11:"SampleClass":1:{s:5:"value";R:1;}
You will find that the value field of $a is serialized as r:1, while the value field of $b is serialized as R:1.
But what is the difference between an object reference and a pointer reference?
Let's take a look at the following example:
echo "<pre>";
class SampleClass {
    var $value;
}
$a = new SampleClass();
$a->value = $a;

$b = new SampleClass();
$b->value = &$b;

$a->value = 1;
$b->value = 1;

var_dump($a);
var_dump($b);
echo "</pre>";
You may find the result unexpected:
object(SampleClass)#1 (1) {
  ["value"]=>
  int(1)
}
int(1)
Changing the value of $a->value only changes $a->value, whereas changing the value of $b->value changes $b itself. That is the difference between an object reference and a pointer reference.
Unfortunately, PHP gets the serialization of arrays wrong. Arrays themselves are also passed as object references, but PHP seems to forget this when serializing them, as the following example shows:
echo "<pre>";
$a = array();
$a[1] = 1;
$a["value"] = $a;

echo $a["value"]["value"][1];
echo "\n";
$a = unserialize(serialize($a));
echo $a["value"]["value"][1];
echo "</pre>";
The result is:
1
You will find that the array structure has changed after the original array was serialized and deserialized. The value 1 at $a["value"]["value"][1] in the original array is lost after deserialization.
What is the reason? Let's look at the serialized output:
$a = array();
$a[1] = 1;
$a["value"] = $a;

echo serialize($a);
The result is:
a:2:{i:1;i:1;s:5:"value";a:2:{i:1;i:1;s:5:"value";N;}}
It turns out that after serialization, $a["value"]["value"] has become NULL instead of an object reference.
In other words, PHP only generates object references (r) for objects when serializing. It does not generate object references for scalar types or arrays (or NULL). However, if a reference is made explicitly with the & symbol, it is serialized as a pointer reference (R).
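The rule in the last paragraph can be illustrated with three small arrays. This is a sketch with made-up sample values; the outputs in the comments are what I expect from serialize() for these inputs.

```php
<?php
$obj = new stdClass();
$str = 'v';

// The same object twice: the second occurrence becomes an object reference.
echo serialize(array($obj, $obj)), "\n";
// a:2:{i:0;O:8:"stdClass":0:{}i:1;r:2;}

// The same string twice, without &: simply written out twice.
echo serialize(array($str, $str)), "\n";
// a:2:{i:0;s:1:"v";i:1;s:1:"v";}

// The same string twice, bound with &: the second becomes a pointer reference.
echo serialize(array(&$str, &$str)), "\n";
// a:2:{i:0;s:1:"v";i:1;R:2;}
```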
5.2. The number after the reference marker
As the examples above show, the formats of the object reference (r) and the pointer reference (R) are:
r:<number>;
R:<number>;
You are probably wondering what the <number> at the end is. This section discusses it in detail.
Put simply, <number> is the position at which the referenced object first appears in the serialized string. This position is not a character position but an object position, where "object" here means a value of any type, not just of the object type.
This may not be easy to follow, so here is an example:
class ClassA {
    var $int;
    var $str;
    var $bool;
    var $obj;
    var $pr;
}

$a = new ClassA();
$a->int = 1;
$a->str = "Hello";
$a->bool = false;
$a->obj = $a;
$a->pr = &$a->str;

echo serialize($a);
The result of this example is:
O:6:"ClassA":5:{s:3:"int";i:1;s:3:"str";s:5:"Hello";s:4:"bool";b:0;s:3:"obj";r:1;s:2:"pr";R:3;}
In this example, the first value serialized is the ClassA object, so it is numbered 1. Next come the members of the object: the first member serialized is the int field, numbered 2; then str, numbered 3; and so on. When serialization reaches the obj member, PHP finds that this value has already been serialized with number 1, so it is serialized as r:1;. The pr member is serialized next; PHP finds that it is actually a reference to the str member, which has number 3, so pr is serialized as R:3;.
How does PHP number the values it serializes? When serialization starts, PHP creates an empty table. Before each value is serialized, PHP computes a hash of it and checks whether that hash is already in the table. If it is not, the hash is appended to the end of the table and the addition is reported as successful. If it is, the addition is reported as failed; but before returning, PHP checks whether the value is a reference (defined with the & symbol), and if it is not, it still appends an entry to the table even though the addition failed. When the addition fails, the position of the earlier occurrence is returned as well.
After the attempt to add the hash to the table, if the addition failed, PHP checks whether the value is a reference or an object: for a reference it emits the R marker, and for an object it emits the r marker. Because a failed addition also returns the position of the earlier occurrence, the number that follows R or r is that position.
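One consequence of the table logic described above is that an r entry itself occupies a position, while an R entry does not. The sketch below tries to make that visible; the sample values are arbitrary, and the outputs in the comments are what I expect from the procedure as described, so treat them as something to verify on your own PHP version.

```php
<?php
$obj = new stdClass();
$str = 'x';

// Positions: array = 1, object = 2, the "r:2" entry takes 3, string = 4.
echo serialize(array($obj, $obj, &$str, &$str)), "\n";
// a:4:{i:0;O:8:"stdClass":0:{}i:1;r:2;i:2;s:1:"x";i:3;R:4;}

// Positions: array = 1, string = 2, the "R:2" entry takes none, object = 3.
echo serialize(array(&$str, &$str, $obj, $obj)), "\n";
// a:4:{i:0;s:1:"x";i:1;R:2;i:2;O:8:"stdClass":0:{}i:3;r:3;}
```

A deserializer in another language has to count positions the same way, otherwise later reference numbers will point at the wrong values.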
5.3. Deserialization of object references
PHP is interesting when deserializing object references. If the string being deserialized was not produced by PHP's own serialize() but was built by hand or by another language, then even if an object reference does not point to an object, PHP still deserializes it correctly to the data that the reference points to. For example:
echo "<pre>";
class StrClass {
    var $a;
    var $b;
}

$a = unserialize('O:8:"StrClass":2:{s:1:"a";s:5:"Hello";s:1:"b";r:2;}');

var_dump($a);
echo "</pre>";
The result is:
object(StrClass)#1 (2) {
  ["a"]=>
  string(5) "Hello"
  ["b"]=>
  string(5) "Hello"
}
You will find that after deserializing the example above, the value of $a->b is the same as the value of $a->a, even though $a->a is not an object but a string. So when serializing in another language, you do not have to treat strings as scalar types: if a composite value contains the same string content several times, you can serialize the repeats as object references, and PHP will still deserialize them correctly. This saves space in the serialized content.
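A related sketch that uses R instead of r. R is the marker serialize() itself emits for a string shared through &, so hand-built input of this shape mirrors what PHP produces. The sample data is made up for illustration, and unlike the r example above the two elements end up bound together as PHP references.

```php
<?php
// Hand-built input: the second element points back at position 2.
$linked = unserialize('a:2:{i:0;s:5:"Hello";i:1;R:2;}');
var_dump($linked[1]);        // string(5) "Hello"

// Both elements are references to one value, so a write through one
// is visible through the other.
$linked[1] = 'Bye';
var_dump($linked[0]);        // string(3) "Bye"

echo serialize($linked), "\n";
// a:2:{i:0;s:3:"Bye";i:1;R:2;}
```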
6. Custom object serialization
6.1. Serialization of custom objects in PHP 4
PHP 4 provides two methods, __sleep and __wakeup, for customizing the serialization of objects. However, these two methods do not change the object serialization format; they only affect which fields are serialized. The PHP manual describes them in detail, so I will not cover them further here.
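A minimal sketch of how __sleep changes the set of serialized fields without changing the format. The class Session and its fields are hypothetical names chosen for this illustration.

```php
<?php
class Session
{
    public $user = 'alice';
    public $cache = 'temporary';

    // Only the fields named in the returned array are written out.
    public function __sleep()
    {
        return array('user');
    }
}

echo serialize(new Session()), "\n";
// O:7:"Session":1:{s:4:"user";s:5:"alice";}
```

The result is still an ordinary O record; only the field count and the field list differ.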
6.2. Serialization of custom objects in PHP 5
PHP 5 adds the interface feature and itself provides a Serializable interface. If a user-defined class implements this interface, objects of that class are serialized in the way the user implemented, and the marker of the serialized data is no longer O but C. The format of the C marker is as follows:
C:<name length>:"<class name>":<data length>:{<data>}
Where <name length> is the length of the class name <class name>, and <data length> is the length of the custom serialized data <data>. The custom serialized data <data> is in an entirely user-defined format that may be completely unrelated to the PHP serialization format; it is handled by the serialization and deserialization interface methods that the user implements.
The Serializable interface defines two methods, serialize() and unserialize($data). They are not meant to be called directly; PHP invokes them automatically when its serialization functions are called. serialize() takes no arguments, and its return value is the content of <data>. unserialize($data) takes one parameter, $data, whose value is the content of <data>. In other words, the interface's serialize() method lets the user serialize the object's contents in whatever format they like; PHP does not care about that format and is only responsible for placing the result into <data>. On deserialization, PHP is only responsible for extracting that content and passing it to the user's unserialize($data) method, which deserializes it.
Here is a simple example that illustrates the use of the Serializable interface:
class MyClass implements Serializable
{
    public $member;

    public function __construct()
    {
        $this->member = 'member value';
    }

    public function serialize()
    {
        return wddx_serialize_value($this->member);
    }

    public function unserialize($data)
    {
        $this->member = wddx_deserialize($data);
    }
}
$a = new MyClass();
echo serialize($a);
echo "\n";
print_r(unserialize(serialize($a)));
The output is (view the page source in the browser):
C:7:"MyClass":90:{<wddxPacket version='1.0'><header/><data><string>member value</string></data></wddxPacket>}
MyClass Object
(
    [member] => member value
)
So if you want to implement the C marker of PHP serialization in another language, you also need to provide a mechanism that lets users define, for their own classes, how the <data> content is handled during deserialization; otherwise this content cannot be deserialized.
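The envelope of a C record can be taken apart using only the two length fields, which is roughly what an implementation in another language would do before handing <data> to user code. The function splitCustomRecord() below is a hypothetical helper written for this illustration, not part of PHP, and the sample record is made up.

```php
<?php
// Hypothetical helper (not part of PHP): split a C record into the class
// name and the raw user-defined payload, trusting only the byte counts.
function splitCustomRecord($record)
{
    if (!preg_match('/^C:(\d+):"/', $record, $m)) {
        throw new InvalidArgumentException('not a C record');
    }
    $pos = strlen($m[0]);
    $class = substr($record, $pos, (int) $m[1]);
    $pos += (int) $m[1];

    // \G anchors the match at the offset given as the fifth argument.
    if (!preg_match('/\G":(\d+):\{/', $record, $m, 0, $pos)) {
        throw new InvalidArgumentException('malformed header');
    }
    $pos += strlen($m[0]);
    $length = (int) $m[1];
    $data = substr($record, $pos, $length);

    // The payload must be exactly $length bytes and be followed by "}".
    if (strlen($data) !== $length || substr($record, $pos + $length, 1) !== '}') {
        throw new InvalidArgumentException('length mismatch');
    }
    return array($class, $data);
}

// The payload may itself contain braces; only the length delimits it.
list($class, $data) = splitCustomRecord('C:7:"MyClass":5:{a}b{c}');
echo $class, "\n"; // MyClass
echo $data, "\n";  // a}b{c
```

What happens to $data afterwards is up to the user-supplied mechanism described above; the sketch deliberately stops at the envelope.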
7. Serialization of Unicode strings
Finally, let's talk about the serialization of Unicode strings in PHP 6.
To tell the truth, I don't much like encoding strings as double-byte Unicode. JavaScript stores strings this way, which makes it very inconvenient for handling byte streams. C# also encodes strings this way, but fortunately it provides a comprehensive encoding conversion mechanism, including conversion between such strings and byte streams (in fact, byte arrays), so it is manageable. But for those unfamiliar with it, converting back and forth is a nuisance.
Before PHP 6, PHP always encoded strings as bytes. PHP 6 suddenly introduces Unicode-encoded strings; although this is optional, it still feels uncomfortable, because if it is misconfigured, compatibility with old programs becomes a problem.
As a result, many of the old string-related functions have of course been modified, and the serialization functions are no exception. PHP 6 adds a dedicated marker for serialized Unicode strings, U. The serialization format for Unicode strings in PHP 6 is as follows:
U:<length>:"<unicode string>";
Here <length> is the length of the original Unicode string, not the length of <unicode string>, because <unicode string> is the byte stream after encoding.
Note, however, that although <length> is the length of the original Unicode string, it is not its number of bytes, nor exactly its number of characters; to be precise, it is its number of character units. Unicode strings use UTF-16 encoding, which represents a character with 16 bits, but not every character fits in 16 bits, so some characters need two 16-bit units. In UTF-16, each 16-bit value counts as one character unit, and an actual character consists of either one or two character units. The number of characters in a Unicode string is therefore not always equal to the number of character units, and <length> here is the number of character units, not the number of characters.
How is <unicode string> encoded? The encoding is simple. A character unit with a code below 128 (except \) is written as a single byte. A character unit with a code above 128, and the \ character itself, is converted to a hexadecimal string: it starts with \, and the next four bytes are the hexadecimal form of the character unit, ordered from high to low. That is, bits 16-13 give the first hexadecimal digit character (the letters abcdef are lowercase), bits 12-9 the second, bits 8-5 the third, and bits 4-1 the fourth. Encoding each unit in turn yields the content of <unicode string>.
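The encoding rule can be sketched as a small function. This follows only the description above and has not been checked against a PHP 6 build. The function name encodeUnicodeRecord() is hypothetical, the input is assumed to be a list of UTF-16 code units given as integers (converting real text into code units is left out), and treating the code 128 itself as escaped is also an assumption, since the text only mentions codes below and above 128.

```php
<?php
// Hypothetical encoder for the U format, based only on the description above.
function encodeUnicodeRecord(array $units)
{
    $body = '';
    foreach ($units as $unit) {
        if ($unit < 128 && $unit !== 0x5C) {
            $body .= chr($unit);                // plain single byte
        } else {
            // Backslash followed by four lowercase hexadecimal digits.
            $body .= sprintf('\\%04x', $unit);
        }
    }
    // <length> counts code units, not the bytes of the encoded body.
    return 'U:' . count($units) . ':"' . $body . '";';
}

// "a", "\", U+00E9, and U+1F600 as the surrogate pair D83D DE00.
echo encodeUnicodeRecord(array(0x61, 0x5C, 0xE9, 0xD83D, 0xDE00)), "\n";
// U:5:"a\005c\00e9\d83d\de00";
```

The length is 5 even though the body is 21 bytes long and the text has only four characters, which matches the distinction between character units, bytes and characters drawn above.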
I think there is no need for implementations in other languages to produce this serialization, because content serialized this way is not supported by the PHP servers in mainstream use today. Implementing its deserialization is worthwhile, though, so that data exchanged with PHP 6 in the future can still be read.
