Implementation of the String Type in the Lua Source Code

Source: Internet
Author: User

Overview

Lua is fully 8-bit clean: the characters in a Lua string can have any numeric value, including 0, so a string can hold arbitrary binary data. Lua strings are immutable; modifying one actually creates a new string. Strings are objects managed by Lua's automatic memory management (the garbage collector), and, as with the other data types in the Lua source, the string value is implemented by a union, TString. The following walks through the string implementation in the Lua 5.2.1 source code and summarizes points to keep in mind when using strings in Lua.

Source Code Implementation

First, look at TString, the data structure that represents a string. Its source code is as follows (lobject.h):

/*
** Header for string value; string bytes follow the end of this structure
*/
typedef union TString {
  L_Umaxalign dummy;  /* ensures maximum alignment for strings */
  struct {
    CommonHeader;
    lu_byte extra;  /* reserved words for short strings; "has hash" for longs */
    unsigned int hash;
    size_t len;  /* number of characters in string */
  } tsv;
} TString;

Several points about this union definition are worth noting:

I. The member L_Umaxalign dummy in the union TString ensures that the string is aligned to the largest alignment required by a C type. It is defined as follows (llimits.h):

/* type to ensure maximum alignment */
#if !defined(LUAI_USER_ALIGNMENT_T)
#define LUAI_USER_ALIGNMENT_T  union { double u; void *s; long l; }
#endif

typedef LUAI_USER_ALIGNMENT_T L_Umaxalign;

This union member also appears in the implementation of other collectable objects, such as userdata (Udata); the goal is to speed up CPU access to memory through memory alignment.

II. The member tsv in the union is what actually implements the string. Its member CommonHeader is used by the GC; it is defined as a macro that is included in every collectable object (lobject.h):

 
/*
** Common Header for all collectable objects (in macro form, to be
** included in other objects)
*/
#define CommonHeader  GCObject *next; lu_byte tt; lu_byte marked

   

The struct form of this macro is as follows (lobject.h):

/*
** Common header in struct form
*/
typedef struct GCheader {
  CommonHeader;
} GCheader;

The struct GCheader is used in the definition of the union GCObject, the common type for all collectable objects.

III. lu_byte extra: for short strings, it records whether the string is a reserved word; for long strings, it can be used to compute the hash value lazily. unsigned int hash is the hash value of the string (how Lua computes it is covered below), and size_t len is the length of the string.

IV. The structure above only describes the string header; the actual string data is stored immediately after the structure.
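The layout described in points I to IV can be sketched in standalone C. This is a minimal illustration, not Lua's code: DemoString, demo_getstr and demo_new are hypothetical names, the GC header is left out, and plain malloc stands in for Lua's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for TString (no CommonHeader). */
typedef union DemoString {
  union { double u; void *s; long l; } dummy;  /* forces maximum alignment */
  struct {
    unsigned char extra;
    unsigned int hash;
    size_t len;
  } tsv;
} DemoString;

/* The character data starts right after the header. */
static char *demo_getstr(DemoString *ts) {
  return (char *)(ts + 1);
}

/* Allocate header and bytes in one block; returns NULL on failure. */
static DemoString *demo_new(const char *str, size_t l) {
  DemoString *ts = (DemoString *)malloc(sizeof(DemoString) + l + 1);
  if (ts == NULL) return NULL;
  ts->tsv.extra = 0;
  ts->tsv.hash = 0;  /* hashing is left out of this sketch */
  ts->tsv.len = l;
  memcpy(demo_getstr(ts), str, l);  /* embedded zero bytes are copied too */
  demo_getstr(ts)[l] = '\0';        /* terminator is not counted in len */
  return ts;
}
```

Because the length is stored explicitly, a call such as demo_new("a\0b", 3) keeps all three bytes, which is what allows a string to carry binary data.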

Before Lua 5.2.1, there was no distinction between long and short strings: every string was stored in a global hash table, so the Lua virtual machine held only one copy of any given string. Starting with Lua 5.2.1, only short strings (currently defined as a length of at most 40) are placed in the global hash table, while long strings are created independently. A random seed was also introduced into the hash computation to prevent hash DoS attacks, in which an attacker constructs a very large number of different strings with the same hash value and thereby slows down the insertion of externally supplied strings into the global string hash table. The following are the steps Lua 5.2.1 takes to create a new string, with the corresponding code in lstring.c:

(1) If the string length is greater than LUAI_MAXSHORTLEN (40 by default), it is a long string, and the function createstrobj is called directly to create the string object (provided the length fits in the member size_t len; otherwise an error is raised instead). The code is as follows (lstring.c):

  
/*
** creates a new string object
*/
static TString *createstrobj (lua_State *L, const char *str, size_t l,
                              int tag, unsigned int h, GCObject **list) {
  TString *ts;
  size_t totalsize;  /* total size of TString object */
  totalsize = sizeof(TString) + ((l + 1) * sizeof(char));
  ts = &luaC_newobj(L, tag, totalsize, list, 0)->ts;
  ts->tsv.len = l;
  ts->tsv.hash = h;
  ts->tsv.extra = 0;
  memcpy(ts+1, str, l*sizeof(char));
  ((char *)(ts+1))[l] = '\0';  /* ending 0 */
  return ts;
}

As you can see, the incoming string is stored right after the TString structure in memory. Note the last assignment before the return: the string is terminated with '\0', as a C string is.

(2) If the string is a short string, first compute its hash value and find the corresponding chain (the global hash table of short strings resolves collisions by chaining, that is, all colliding elements are placed in the same linked list). Then check whether the string to be created is already in the hash table; if it is, that string is returned directly. Otherwise, the function newshrstr is called, which in turn calls createstrobj above to create a new string and puts it into the hash table. The code is as follows (lstring.c):

/*
** checks whether short string exists and reuses it or creates a new one
*/
static TString *internshrstr (lua_State *L, const char *str, size_t l) {
  GCObject *o;
  global_State *g = G(L);
  unsigned int h = luaS_hash(str, l, g->seed);
  for (o = g->strt.hash[lmod(h, g->strt.size)];
       o != NULL;
       o = gch(o)->next) {
    TString *ts = rawgco2ts(o);
    if (h == ts->tsv.hash &&
        ts->tsv.len == l &&
        (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
      if (isdead(G(L), o))  /* string is dead (but was not collected yet)? */
        changewhite(o);  /* resurrect it */
      return ts;
    }
  }
  return newshrstr(L, str, l, h);  /* not found; create a new string */
}

The global string hash table is stored in the member strt of the virtual machine's global state (lstate.h):

stringtable strt;  /* hash table for strings */

The type stringtable is a struct, defined as follows (lstate.h):

typedef struct stringtable {
  GCObject **hash;
  lu_int32 nuse;  /* number of elements */
  int size;
} stringtable;

The member GCObject **hash is an array of pointers, each of which in essence points to a TString (note that TString includes the macro CommonHeader, whose next member is what links the strings of one bucket into a chain); nuse is the number of elements currently stored in the table; size is the current size of the array hash.
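The lookup-or-create flow over a chained table can be sketched in a few lines of standalone C. This is an illustration under stated assumptions, not Lua's implementation: InternStr, InternTable, intern_hash and intern_str are hypothetical names, the table has a fixed size with no resizing, there is no GC, and the placeholder hash is not luaS_hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INTERN_SIZE 8  /* power of two, so h & (size - 1) selects a bucket */

/* Hypothetical node: a small header followed directly by the bytes. */
typedef struct InternStr {
  struct InternStr *next;  /* chain of strings that share a bucket */
  unsigned int hash;
  size_t len;
} InternStr;

typedef struct InternTable {
  InternStr *hash[INTERN_SIZE];
  unsigned int nuse;  /* number of strings stored */
} InternTable;

/* Placeholder hash for this sketch only; NOT Lua's luaS_hash. */
static unsigned int intern_hash(const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  size_t i;
  for (i = 0; i < l; i++)
    h = h * 31u + (unsigned char)str[i];
  return h;
}

/* Return the stored string with these bytes, or create and link a new one. */
static InternStr *intern_str(InternTable *tb, const char *str, size_t l,
                             unsigned int seed) {
  unsigned int h = intern_hash(str, l, seed);
  InternStr **bucket = &tb->hash[h & (INTERN_SIZE - 1)];
  InternStr *s;
  for (s = *bucket; s != NULL; s = s->next) {
    if (s->hash == h && s->len == l &&
        memcmp(str, (const char *)(s + 1), l) == 0)
      return s;  /* already interned: reuse it */
  }
  s = (InternStr *)malloc(sizeof(InternStr) + l + 1);
  if (s == NULL) return NULL;
  s->hash = h;
  s->len = l;
  memcpy((char *)(s + 1), str, l);
  ((char *)(s + 1))[l] = '\0';
  s->next = *bucket;  /* insert at the head of the chain */
  *bucket = s;
  tb->nuse++;
  return s;
}
```

Interning the same bytes twice returns the same pointer, which is why equality of short strings can later be decided by comparing addresses.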

Before newshrstr inserts a new string, it checks whether nuse is greater than or equal to size. If so, the hash table is too small and is expanded to twice its original size. The corresponding code is as follows (lstring.c):

  if (tb->nuse >= cast(lu_int32, tb->size) && tb->size <= MAX_INT/2)
    luaS_resize(L, tb->size*2);  /* too crowded */

During GC, nuse is checked against size/2 (in Lua 5.1, nuse is compared with size/4); if it is smaller, the string table is resized to half its original size. The corresponding code is as follows (lgc.c):

    int hs = g->strt.size / 2;  /* half the size of the string table */
    if (g->strt.nuse < cast(lu_int32, hs))  /* using less than that half? */
      luaS_resize(L, hs);  /* halve its size */
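The two resize rules can be restated as pure functions that only compute the next table size. This is a sketch for illustration: demo_size_on_insert and demo_size_on_gc are hypothetical names, and INT_MAX stands in for Lua's MAX_INT.

```c
#include <assert.h>
#include <limits.h>

/* Size to use when a new string is about to be inserted:
   double the table once nuse has reached size. */
static int demo_size_on_insert(unsigned int nuse, int size) {
  if (nuse >= (unsigned int)size && size <= INT_MAX / 2)
    return size * 2;  /* too crowded */
  return size;
}

/* Size to use at the GC check: halve the table when fewer than
   size/2 entries are in use. */
static int demo_size_on_gc(unsigned int nuse, int size) {
  int hs = size / 2;
  if (nuse < (unsigned int)hs)
    return hs;
  return size;
}
```

With a table of size 32, for instance, the insert rule grows it to 64 once 32 strings are stored, and the GC rule shrinks it to 16 only when fewer than 16 remain.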

To compare strings, first compare their types: a short string and a long string can never be equal. Short strings are equal exactly when their pointer values are equal. For long strings, first compare the pointer values; if they differ, compare the lengths and then the contents byte by byte.
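The comparison rule can be written as one small function. The sketch below is an assumption-laden illustration rather than Lua's code: CmpStr, CMP_SHORT, CMP_LONG and cmp_eq are hypothetical names, and the bytes are referenced through a pointer instead of being stored after the header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CMP_SHORT, CMP_LONG };  /* hypothetical type tags */

typedef struct CmpStr {
  int tag;
  size_t len;
  const char *bytes;  /* simplification: Lua keeps the bytes after the header */
} CmpStr;

static int cmp_eq(const CmpStr *a, const CmpStr *b) {
  if (a->tag != b->tag) return 0;          /* short vs long: never equal */
  if (a->tag == CMP_SHORT) return a == b;  /* interned: identity decides */
  return a == b ||
         (a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0);
}
```

The short-string branch is only correct because short strings are interned; two separately built CMP_SHORT objects with equal bytes would compare unequal here, a situation the global hash table is there to rule out.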

Summary

(1) Reserved words and metamethod names in Lua are short strings; they are put into the global short-string hash table when the virtual machine starts and are never collected.

(2) Looking up (comparing) strings is efficient, but modifying or inserting strings is comparatively inefficient: in addition to the hash computation, the external string must at least be copied into the virtual machine.

(3) When hashing a long string, Lua does not look at every character, which keeps the hash computation fast. The corresponding code is as follows (lstring.c):

unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ l;
  size_t l1;
  size_t step = (l >> LUAI_HASHLIMIT) + 1;
  for (l1 = l; l1 >= step; l1 -= step)
    h = h ^ ((h<<5) + (h>>2) + cast_byte(str[l1 - 1]));
  return h;
}

/*
** Lua will use at most ~(2^LUAI_HASHLIMIT) bytes from a string to
** compute its hash
*/
#if !defined(LUAI_HASHLIMIT)
#define LUAI_HASHLIMIT  5
#endif

(4) When concatenating many strings, do not apply the string concatenation operator ".." repeatedly; use table.concat or string.format instead to speed up the concatenation.

(5) The string hash algorithm in Lua is JSHash. Many other string hash functions exist; an online article compares them: https://www.byvoid.com/blog/string-hash-compare
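To make points (3) and (5) concrete, the hash loop can be retyped as standalone C together with a helper that counts how many bytes the loop reads. demo_luas_hash, demo_bytes_sampled and DEMO_HASHLIMIT are hypothetical names, and an explicit cast to unsigned char stands in for Lua's cast_byte macro.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HASHLIMIT 5  /* same value as the default LUAI_HASHLIMIT */

/* Standalone retyping of the hash loop, for experimentation only. */
static unsigned int demo_luas_hash(const char *str, size_t l,
                                   unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  size_t step = (l >> DEMO_HASHLIMIT) + 1;
  size_t l1;
  for (l1 = l; l1 >= step; l1 -= step)
    h = h ^ ((h << 5) + (h >> 2) + (unsigned char)str[l1 - 1]);
  return h;
}

/* Number of bytes the loop above reads for a string of length l. */
static size_t demo_bytes_sampled(size_t l) {
  size_t step = (l >> DEMO_HASHLIMIT) + 1;
  size_t l1, n = 0;
  for (l1 = l; l1 >= step; l1 -= step)
    n++;
  return n;
}
```

With these definitions, a 10-byte string has every byte hashed, a 40-byte string has 20 bytes hashed (step 2), and a 1000-byte string has only 31 bytes hashed (step 32). Since the initial value is seed ^ l, the same bytes hash differently under different seeds, which is the property the random seed relies on.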

That is all for this article; I hope it helps you in learning Lua.
