How the Python string object is implemented

PyStringObject struct

In CPython 2, a Python string object corresponds internally to a C struct named PyStringObject. "ob_shash" holds the hash value computed for the string. "ob_sval" holds the character data: "ob_size" bytes followed by a terminating null byte (for compatibility with C). Its declared size is 1 byte, so for an empty string ob_sval[0] = 0; for longer strings the extra bytes are allocated together with the object. "ob_size" is defined by the PyObject_VAR_HEAD macro in the object.h header file. "ob_sstate" indicates whether the string is already in the dictionary used by the interning mechanism, which is described later.

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;
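
To make the layout easier to inspect, here is a minimal sketch that mirrors the struct with ctypes. It assumes that PyObject_VAR_HEAD expands to ob_refcnt, ob_type and ob_size, as it does in a non-debug CPython 2 build. The class name PyStringObjectModel is ours and is not part of CPython.

```python
import ctypes


class PyStringObjectModel(ctypes.Structure):
    """Illustrative mirror of the PyStringObject layout (not a CPython API)."""
    _fields_ = [
        # Assumed expansion of PyObject_VAR_HEAD in a non-debug build
        ("ob_refcnt", ctypes.c_ssize_t),   # reference count
        ("ob_type", ctypes.c_void_p),      # pointer to the type object
        ("ob_size", ctypes.c_ssize_t),     # number of bytes in the string
        # String-specific fields from the struct above
        ("ob_shash", ctypes.c_long),       # cached hash value
        ("ob_sstate", ctypes.c_int),       # interning state
        ("ob_sval", ctypes.c_char * 1),    # first byte of the character data
    ]


# An empty string: ob_size is 0 and the single declared byte is the null byte.
empty = PyStringObjectModel(ob_refcnt=1, ob_size=0, ob_shash=-1, ob_sstate=0)
print(empty.ob_size, PyStringObjectModel.ob_sval.size)
```

The sizes and offsets reported by ctypes depend on the platform, so the sketch is only meant to show the order of the fields and the 1-byte declaration of "ob_sval".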

Create a String object

What happens when a new string is assigned to a variable, as shown below?

>>> s1 = 'abc'

When this code runs, the internal C function "PyString_FromString" is called. Its behavior can be summarized by pseudocode similar to the following:

arguments: string object: 'abc'
returns: Python string object with ob_sval = 'abc'
PyString_FromString(string):
    size = length of string
    allocate string object + size for 'abc'. ob_sval will be of size: size + 1
    copy string to ob_sval
    return object
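
The pseudocode above can be modeled in a few lines of Python. This is only a sketch: the function name string_from_c_string and the dictionary used to stand in for the struct are our own choices, and real allocation happens in C.

```python
def string_from_c_string(data: bytes) -> dict:
    """Model the allocation steps of PyString_FromString (illustrative names)."""
    size = len(data)                  # size = length of string
    ob_sval = bytearray(size + 1)     # size bytes plus the terminating null byte
    ob_sval[:size] = data             # copy string to ob_sval
    return {"ob_size": size, "ob_sval": ob_sval}


obj = string_from_c_string(b"abc")
print(obj["ob_size"], bytes(obj["ob_sval"]))   # 3 b'abc\x00'

# An empty string still owns one byte, and that byte is the null terminator.
print(bytes(string_from_c_string(b"")["ob_sval"]))   # b'\x00'
```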

Each time a new string is used, a string object is allocated.

Shared String object

Python has an elegant optimization: short strings are shared between variables, which saves memory. Here, a short string is a string of 0 or 1 byte. The global variable "interned" is a dictionary that indexes these short strings. The array "characters" indexes strings with a length of 1 byte, such as a single letter, by character value. Next we will see how the array "characters" is used.

static PyStringObject *characters[UCHAR_MAX + 1];
static PyObject *interned;

Next, let's look at what happens when a short string is assigned to a variable in a Python script.

>>> s2 = 'a'

The string object whose content is 'a' is added to the "interned" dictionary. The key in the dictionary is a pointer to the string object, and the corresponding value is the same pointer. In the array "characters", this new string object is referenced at offset 97, because the ASCII value of the character 'a' is 97. The variable "s2" also points to this string object.


What if another variable is assigned the same string 'a'?

>>> s3 = 'a'

After this code runs, the string object created earlier is returned instead of a new one, so the variables "s2" and "s3" both point to the same string object. The array "characters" is used to check whether the string 'a' already exists. If it does, a pointer to the existing string object is returned.

if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL)
{
    ...
    return (PyObject *)op;
}
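
The sharing described above can be imitated in Python. The sketch below is a simplified model, not CPython code: StrObj and from_string are hypothetical names, only 1-byte strings are shared, and the "interned" dictionary is keyed by content rather than by pointer.

```python
UCHAR_MAX = 255


class StrObj:
    """Stand-in for a PyStringObject (illustrative only)."""
    def __init__(self, data: bytes):
        self.data = data


characters = [None] * (UCHAR_MAX + 1)   # one slot per possible byte value
interned = {}                           # simplified: content -> shared object


def from_string(data: bytes) -> StrObj:
    # Fast path: a 1-byte string that already exists is returned as is.
    if len(data) == 1 and characters[data[0] & UCHAR_MAX] is not None:
        return characters[data[0] & UCHAR_MAX]
    op = StrObj(data)
    if len(data) == 1:
        # Register the new 1-byte string so that later requests share it.
        interned[data] = op
        characters[data[0] & UCHAR_MAX] = op
    return op


s2 = from_string(b"a")
s3 = from_string(b"a")
print(s2 is s3, characters[97] is s2)                # True True
print(from_string(b"abc") is from_string(b"abc"))    # False
```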

Next, we create a new short string with the content 'c':

>>> s4 = 'c'

The new string object is added to the "interned" dictionary and is referenced by the array "characters" at offset 99, the ASCII value of 'c'.

The array "characters" is also used when an element of a string is accessed, as in the following Python session.

>>> s5 = 'abc'
>>> s5[0]
'a'

In the second line above, the pointer stored in the array "characters" at offset 97 is returned, instead of a new string with the value 'a' being created. Accessing an element of a string calls a function named "string_item", whose body is given below. The parameter "a" is the string object for 'abc', and the parameter "i" is the index being accessed (0 in this example). The function returns a pointer to a string object.

static PyObject *
string_item(PyStringObject *a, register Py_ssize_t i)
{
    char pchar;
    PyObject *v;
    ...
    pchar = a->ob_sval[i];
    v = (PyObject *)characters[pchar & UCHAR_MAX];
    if (v == NULL)
        // allocate string
    else {
        ...
        Py_INCREF(v);
    }
    return v;
}
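
The lookup performed by "string_item" can be sketched in Python as follows. StrObj is a hypothetical stand-in for a string object, and storing a newly allocated 1-byte string in "characters" is our simplification of what the allocation step does.

```python
UCHAR_MAX = 255
characters = [None] * (UCHAR_MAX + 1)


class StrObj:
    """Stand-in for a 1-byte string object (illustrative only)."""
    def __init__(self, data: bytes):
        self.data = data


def string_item(a: bytes, i: int) -> StrObj:
    """Return the element at index i, reusing the cached object when present."""
    if not 0 <= i < len(a):
        raise IndexError("string index out of range")
    pchar = a[i]
    v = characters[pchar & UCHAR_MAX]
    if v is None:
        v = StrObj(bytes([pchar]))            # allocate string
        characters[pchar & UCHAR_MAX] = v     # simplification: cache it here
    return v


first = string_item(b"abc", 0)
again = string_item(b"cba", 2)
print(first is again, first.data)   # True b'a'
```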

The array "characters" is also used when a function name has a length of 1, as shown below:

>>> def a(): pass

String search

Let's look at what happens when a string search is performed in the following Python code.

>>> s = 'adcabcdbdabcabd'
>>> s.find('abcab')
9

The function "find" returns the index at which the string 'abcab' is found in "s". If the string is not found, the function returns -1.
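
The two outcomes of "find" are easy to check in an interpreter. The snippet below uses only the built-in str.find; the variable names found and missing are ours.

```python
s = "adcabcdbdabcabd"

found = s.find("abcab")     # index of the first occurrence
missing = s.find("xyz")     # -1 when the pattern does not occur

print(found, missing)               # 9 -1
print(s[found:found + 5])           # abcab
```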

So what happens internally? A function named "fastsearch" is called. It is a mix between the Boyer-Moore and Horspool algorithms and combines desirable properties of both.

We call "s" (s = 'adcabcdbdabcabd') the main string and "p" (p = 'abcab') the pattern string. n and m are the lengths of "s" and "p": n = 15 and m = 5.

In the following code segment, the program first checks whether m > n. If so, no match is possible, and the function returns -1 immediately.

w = n - m;
if (w < 0)
    return -1;

When m = 1, the program walks through the characters of "s" one at a time and returns the index of the first match. In this example, the variable "mode" has the value FAST_SEARCH, which means that we want the position of the first match in the main string rather than the number of times the pattern string occurs in the main string.

if (m <= 1) {
    ...
    if (mode == FAST_COUNT) {
        ...
    } else {
        for (i = 0; i < n; i++)
            if (s[i] == p[0])
                return i;
    }
    return -1;
}
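
For a pattern of length 1, the two modes can be sketched in Python as below. The function names find_single and count_single are ours, and the counting branch, which the C excerpt elides, is shown here as a plain count.

```python
def find_single(s: str, ch: str) -> int:
    """FAST_SEARCH-like behavior: index of the first occurrence, or -1."""
    for i, c in enumerate(s):
        if c == ch:
            return i
    return -1


def count_single(s: str, ch: str) -> int:
    """FAST_COUNT-like behavior (simplified): number of occurrences."""
    return sum(1 for c in s if c == ch)


s = "adcabcdbdabcabd"
print(find_single(s, "c"), find_single(s, "z"))   # 2 -1
print(count_single(s, "b"))                       # 4
```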

Now consider the other case, m > 1. The first step is to create a compressed Boyer-Moore delta 1 table (which corresponds to the bad-character rule of the Boyer-Moore algorithm). Two variables are set up in this step: "mask" and "skip".

"mask" is a 32-bit bitmask indexed by the 5 least significant bits of a character. It is built from the characters of the pattern string "p" and is designed as a Bloom filter that tests whether a character occurs in the pattern string. This makes the test very fast, but it can produce false positives: two characters that share the same 5 low bits cannot be told apart. The following code shows how the bitmask is generated in this example.

mlast = m - 1;
/* process pattern[:-1] */
for (mask = i = 0; i < mlast; i++) {
    mask |= (1 << (p[i] & 0x1F));
}
/* process pattern[-1] outside the loop */
mask |= (1 << (p[mlast] & 0x1F));
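
The loop above can be mirrored in Python to inspect the resulting mask. This is a sketch with names of our own (build_mask, may_contain); it also shows one false positive, because 'A' and 'a' share the same 5 low bits.

```python
def build_mask(p: str) -> int:
    """Set one bit per pattern character, chosen by its 5 low bits."""
    mask = 0
    for ch in p:
        mask |= 1 << (ord(ch) & 0x1F)
    return mask


def may_contain(mask: int, ch: str) -> bool:
    """False means ch is definitely absent; True means it may be present."""
    return bool(mask & (1 << (ord(ch) & 0x1F)))


mask = build_mask("abcab")
print(bin(mask))                 # 0b1110
print(may_contain(mask, "a"))    # True
print(may_contain(mask, "d"))    # False
print(may_contain(mask, "A"))    # True, a false positive
```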

The first character of the string "p" is 'a'. The binary representation of the character 'a' is 97 = 1100001. Keeping the 5 least significant bits gives 00001, so "mask" is first set to 10 (1 << 1). After the entire string "p" has been processed, the mask value is 1110. So how is this bitmask used? The following line of code tests whether a character "c" occurs in the string "p".

if (mask & (1 << (c & 0x1F)))

So, does the character 'a' occur in the string "p" ('abcab')? Because 1110 & (1 << ('a' & 0x1F)) = 1110 & 10 = 10, the test is true, and 'a' does occur in 'abcab'. For the character 'd' the test is false, and the same holds for the other characters from 'e' to 'z'. The filter therefore works very well in this example.

The variable "skip" is derived from the last earlier position in the pattern string at which its final character also occurs: it is the number of characters between that position and the final character. If the final character occurs nowhere else in the pattern string, "skip" keeps its initial value, which in the C source is mlast - 1 (the length of "p" minus 2), so that together with the loop's own increment the window advances by the length of "p" minus 1. In this example, the last character of the pattern string is 'b'. The previous 'b' is reached by jumping back over two characters, so the value of "skip" is 2. This variable is used by a rule called the bad-character skip.

In the following example, p = 'abcab' and s = 'adcabcaba'. The comparison starts at index 4 of the main string "s" (counting from 0), which lines up with the last character of the pattern string. That character matches, so the remaining characters are compared. The first mismatch is at index 1 ('b' is not equal to 'd'). The character that follows the current window, 'c' at index 5, also occurs in "p", so the pattern string cannot be moved past it. Instead, the window advances by 3 positions, so that the earlier 'b' of the pattern string lines up with the 'b' at index 4 of the main string, and the match is found at index 3.
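
The computation of "skip" is not shown in the C excerpt above. The sketch below follows our reading of CPython's fastsearch, where "skip" starts at mlast - 1 and is updated inside the loop that builds the mask; treat the exact form as an assumption. The function name compute_skip is ours.

```python
def compute_skip(p: str) -> int:
    """Skip value for the final character of p (sketch, assumed form)."""
    mlast = len(p) - 1
    skip = mlast - 1                  # default: final character occurs nowhere else
    for i in range(mlast):
        if p[i] == p[mlast]:
            skip = mlast - i - 1      # characters between p[i] and the final character
    return skip


p = "abcab"
s = "adcabcaba"
skip = compute_skip(p)
print(skip)                           # 2

# The window at i = 0 fails at index 1 and s[5] = 'c' occurs in p,
# so the window advances by skip plus the loop's own increment.
next_i = 0 + skip + 1
print(next_i, s[next_i:next_i + len(p)] == p)   # 3 True
```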


Next, let's look at the main loop of the search (the real code is written in C, not Python):

for i = 0 to n - m = 10:
    if s[i+m-1] == p[m-1]:
        if s[i:i+mlast] == p[0:mlast]:
            return i
        if s[i+m] not in p:
            i += m
        else:
            i += skip
    else:
        if s[i+m] not in p:
            i += m
return -1

The test "s[i+m] not in p" is implemented with the bitmask, and "i += skip" is the bad-character skip. The line "i += m" runs when the character that follows the current window in the main string is not found in "p". In each case the loop's own increment of "i" is applied as well.

Let's look at how the algorithm runs for our strings "p" and "s". The first steps are the same as in the example above. Then the character 'd' is not found in the string "p", so the search skips ahead by the length of "p", and a match is found right away.
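
To follow the run step by step, here is a runnable Python sketch of the search loop. It is a simplified model of "fastsearch" in FAST_SEARCH mode, not the C implementation: the optional trace argument is our addition, and a read past the end of "s" is modeled as the null byte that terminates a C string.

```python
def fastsearch(s: str, p: str, trace=None) -> int:
    """Return the index of the first occurrence of p in s, or -1 (sketch)."""
    n, m = len(s), len(p)
    w = n - m
    if w < 0:
        return -1
    if m == 0:
        return 0
    if m == 1:
        return next((i for i, c in enumerate(s) if c == p), -1)

    # Build the bitmask and the skip value from the pattern.
    mlast = m - 1
    skip = mlast - 1
    mask = 0
    for i in range(mlast):
        mask |= 1 << (ord(p[i]) & 0x1F)
        if p[i] == p[mlast]:
            skip = mlast - i - 1
    mask |= 1 << (ord(p[mlast]) & 0x1F)

    def in_pattern(idx: int) -> bool:
        ch = s[idx] if idx < n else "\0"   # C would read the null terminator
        return bool(mask & (1 << (ord(ch) & 0x1F)))

    i = 0
    while i <= w:
        if trace is not None:
            trace.append(i)                # record each window position
        if s[i + mlast] == p[mlast]:
            if s[i:i + mlast] == p[:mlast]:
                return i                   # full match
            if not in_pattern(i + m):
                i += m                     # next character cannot be in p
            else:
                i += skip                  # bad-character skip
        elif not in_pattern(i + m):
            i += m
        i += 1                             # the for loop's own increment
    return -1


steps = []
print(fastsearch("adcabcdbdabcabd", "abcab", steps), steps)   # 9 [0, 3, 9]
```

The trace shows that only three window positions are examined: 0, then 3 after the bad-character skip, then 9 after skipping past 'd'.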

This concludes the walkthrough of the Python string object implementation. We hope it helps you learn Python programming.
