Technical Analysis of the Squid COSS File System

Source: Internet
Author: User

Many companies use Squid as a cache, and Alibaba is no exception; this is no secret. Squid's COSS file system is particularly well suited to caching small files, and many companies rely on it as their main store for cached small images.

While going through my computer today I found a technical analysis I wrote some time ago, and I am sharing it here for reference. Criticism and corrections are welcome.

Squid COSS Technical Analysis
1. COSS Mechanism Analysis
1.1. COSS File Structure Analysis
1.2. COSS Configuration Analysis
1.3. COSS Relocation Mechanism
2. COSS Implementation Analysis
2.1. Squid Filesystem Architecture Analysis
2.2. Core Data Structures Used by COSS
2.2.1. SwapDir
2.2.2. struct _cossinfo
2.2.3. struct _cossstripe
2.3. Key Process Analysis
2.3.1. Rebuild
2.3.2. Accessing an Object on a Cache Hit: storeCossOpen
3. Case Analysis
3.1. Case 1
4. Suggestions on Using COSS

 
1. COSS Mechanism Analysis
1.1. COSS File Structure Analysis
COSS is an application-layer file system that Squid developed specifically to store and manage small files efficiently. The name stands for Cyclic Object Storage System. Logically, a COSS file is divided into m + n stripes. The stripe is the logical storage unit of COSS; physically, the COSS file is simply multiple stripes laid end to end. Here m is the number of stripes held in memory that map onto the file on disk (10 by default, meaning the COSS file supports 10 memory stripes), and n is the maximum number of stripes into which the rest of the COSS file can be divided (disk stripes).

Logically, a COSS file can be pictured as the digit "6": the n disk stripes form a ring, like the lower loop of the "6", and the m memory stripes correspond to the stroke at the top of the "6".

A stripe can hold one or more objects, but an object cannot span two stripes. Consequently, once the stripe size is fixed, COSS cannot store any object larger than a stripe, and some space in each stripe is almost inevitably wasted.

In addition, COSS physically divides each stripe into blocks, so the stripe size must be an integer multiple of the block size. I think Squid uses the block as its physical I/O unit mainly for two reasons: a) it matches the underlying system, so that I/O lines up with the OS filesystem and the physical disk blocks; b) Squid addresses an object by its stripe and its fileno, and fileno has only 24 bits, so the larger the block, the larger the COSS file that can be supported.
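The effect of the 24-bit fileno on the maximum file size can be worked out with a few lines of C. This is only a sketch: it assumes that fileno counts blocks from the start of the COSS file, and the names FILENO_BITS and coss_max_bytes are made up for illustration and do not come from the Squid source.

```c
#include <stdint.h>

/* Assumed width of the file number, as stated in the text. */
#define FILENO_BITS 24

/* Largest range of bytes addressable with a given block size,
 * assuming one fileno value per block. */
static uint64_t coss_max_bytes(uint32_t block_size)
{
    return ((uint64_t)1 << FILENO_BITS) * block_size;
}
```

Under this assumption a 4096-byte block gives a limit of 64 GiB, while a 512-byte block gives only 8 GiB.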

1.2. COSS Configuration Analysis
In squid.conf, a typical COSS configuration looks like this:
cache_dir coss /data/cache1/coss0 34500 max-size=1000000 block-size=4096 overwrite-percent=50 membufs=10
Here 34500 means the COSS file may be at most 34500 MB (about 34.5 GB). max-size is the largest object that can be stored, which can be no larger than a stripe. block-size means each physical storage block is 4 KB. membufs is the number of memory stripes. overwrite-percent means that when the distance between the current stripe cursor and the offset at which an object was originally stored exceeds 50% of the length of the whole COSS file, a hit on that object triggers relocation (the object is moved). overwrite-percent must be well understood, because the relocation mechanism depends on it.
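To make the numbers concrete, the following sketch derives the stripe count and the relocation threshold from the configuration values. It assumes a 1 MB stripe and a threshold equal to the stripe count times overwrite-percent; the function names are hypothetical, and the real Squid code may round differently.

```c
#include <stdint.h>

#define STRIPE_BYTES (1024u * 1024u)  /* assumed stripe size: 1 MB */

/* Number of disk stripes in a COSS file of size_mb megabytes. */
static int coss_num_stripes(uint64_t size_mb)
{
    return (int)((size_mb * 1024u * 1024u) / STRIPE_BYTES);
}

/* Stripe distance beyond which a hit object is relocated
 * (assumption: a plain percentage of the stripe count). */
static int coss_min_stripe_distance(int num_stripes, int overwrite_percent)
{
    return (int)((int64_t)num_stripes * overwrite_percent / 100);
}
```

For the configuration line above this gives 34500 stripes and a threshold of 17250 stripes, that is, half of the ring.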

1.3. COSS Relocation Mechanism
Testing shows that when Squid uses COSS there are many I/O writes even when every request is a cache hit. The root cause is the COSS relocation mechanism.
Relocation was introduced essentially to handle aging and eviction in a COSS file. COSS has no LRU or other aging algorithm, so it cannot delete objects at will; relocation is aging in disguise. To see why, we first need to look at how a COSS file is used.
As mentioned above, the disk stripes of a COSS file form a ring. For convenience, assume the stripes are numbered 1, 2, 3, ..., N. When the COSS file is created, Squid records stripe 1 as the current stripe and keeps an in-memory buffer for it, stripe->membuf (1 MB by default). When an object is created, Squid writes it into stripe->membuf. When stripe->membuf is full, or cannot hold the new object, Squid swaps the buffer out to the corresponding stripe on disk, makes stripe 2 the current stripe, and puts the new object into the membuf of stripe 2. Likewise, when that membuf fills up it is swapped out and stripe 3 becomes current, and so on. When every disk stripe of the COSS file is full, Squid wraps around and reuses stripe 1 (hence the ring), overwriting whatever is stored in the stripe under the current cursor.
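The write path just described can be modelled in a few lines. This is a simplified sketch rather than Squid's code: struct ring_writer and ring_store are hypothetical names, the stripe size is assumed to be 1 MB, and stripes are indexed from 0 instead of 1.

```c
#include <stddef.h>

#define STRIPE_BYTES 1048576u   /* assumed 1 MB stripe */

/* Hypothetical writer state for one COSS file. */
struct ring_writer {
    int num_stripes;   /* N disk stripes, indexed 0..N-1 here */
    int cur_stripe;    /* stripe whose membuf is being filled */
    size_t used;       /* bytes already placed in that membuf */
    int swapouts;      /* how many membufs were flushed to disk */
};

/* Place an object; returns the stripe it lands in, or -1 if it can never fit. */
static int ring_store(struct ring_writer *w, size_t obj_size)
{
    if (obj_size > STRIPE_BYTES)
        return -1;                    /* objects cannot span stripes */
    if (w->used + obj_size > STRIPE_BYTES) {
        w->swapouts++;                /* flush the membuf to its disk stripe */
        w->cur_stripe = (w->cur_stripe + 1) % w->num_stripes;  /* ring wrap */
        w->used = 0;                  /* old contents of the next stripe are overwritten */
    }
    w->used += obj_size;
    return w->cur_stripe;
}
```

With two stripes and three objects of 600000 bytes each, the objects land in stripes 0, 1 and 0 again: the third write wraps around and reuses the first stripe.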
Seen this way, COSS appears to age objects inherently, so why bother developing relocation? My understanding is that without relocation, deleting objects from a COSS file would cause serious internal fragmentation and degrade the service.

How does relocation work?
When Squid accesses an object, it first checks, based on the configured overwrite-percent, whether the object needs relocation. If so, it reads the object and places it in the current stripe->membuf; once that membuf is full, it is swapped out to the current disk stripe. The object therefore stays in the COSS file for the longest possible time (it is not overwritten until the COSS cursor has gone a full circle and reaches it again), which achieves an effect similar to LRU.
As you can see, relocation turns what would be read-only I/O into reads plus writes, which is a disadvantage of COSS. COSS does store small files inside one large file and so avoids the frequent open and close system calls that aufs makes. Even so, using COSS increases the number of I/O operations and consumes more IOPS (reads and writes).
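The relocation test can be sketched as a distance check on the ring. The function names and the exact comparison are assumptions made for illustration; the text only says that the distance between the cursor and the object's stripe is compared with a threshold derived from overwrite-percent.

```c
/* Stripes written since the object's stripe was filled, on a ring of num_stripes. */
static int stripes_since_stored(int cur_stripe, int obj_stripe, int num_stripes)
{
    int d = cur_stripe - obj_stripe;
    return d >= 0 ? d : d + num_stripes;   /* the cursor may have wrapped */
}

/* Nonzero when a hit object is far enough behind the cursor to be relocated. */
static int relocate_required(int cur_stripe, int obj_stripe,
                             int num_stripes, int min_stripe_distance)
{
    return stripes_since_stored(cur_stripe, obj_stripe, num_stripes)
           > min_stripe_distance;
}
```

On a ring of 100 stripes with a threshold of 50, an object in stripe 5 is relocated when the cursor is at stripe 60, but an object in stripe 90 is left alone when the cursor has wrapped to stripe 10, because only 20 stripes have been written since.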

2. COSS Implementation Analysis
2.1. Squid Filesystem Architecture Analysis

<Illustration>

2.2. Core Data Structures Used by COSS
2.2.1. SwapDir
/*
 * Corresponds to one cache_dir line in the configuration
 */
struct _SwapDir {
    const char *type;
    int cur_size;               // total size of all objects in this directory, in KB
    int low_size;               // low-water mark of this directory, computed from max_size and the low watermark, in KB
    int max_size;               // maximum size of this directory, in KB
    char *path;                 // path of this directory
    int index;                  /* This entry's index into the swapDirs array */
    squid_off_t min_objsize;    // smallest object this directory accepts
    squid_off_t max_objsize;    // largest object this directory accepts
    RemovalPolicy *repl;        // aging (removal) policy handler of this directory
    int removals;               // FIXME: not used
    int scanned;                // FIXME: not used
    struct {
        // Set to 1 when the current object is being stored in this directory; used only by cachemgr.
        unsigned int selected:1;
        // Set when this directory is configured read-only in cache_dir.
        unsigned int read_only:1;
    } flags;

    /* The following are the operation function pointers of this FS */
    STINIT *init;               /* Initialise the FS */
    STCHECKCONFIG *checkconfig; /* Verify configuration */
    STNEWFS *newfs;             /* Create a new FS */
    STDUMP *dump;               /* Dump FS config snippet */
    STFREE *freefs;             /* Free the FS data */
    STDBLCHECK *dblcheck;       /* Double check the obj integrity */
    STSTATFS *statfs;           /* Dump FS statistics */
    STMAINTAINFS *maintainfs;   /* Replacement maintenance */
    STCHECKOBJ *checkobj;       /* Check if the FS will store an object */
    STCHECKLOADAV *checkload;   /* Check if the FS is getting overloaded .. */
    /* These two are notifications */
    STREFOBJ *refobj;           /* Reference this object */
    STUNREFOBJ *unrefobj;       /* Unreference this object */
    STCALLBACK *callback;       /* Handle pending callbacks */
    STSYNC *sync;               /* Sync the directory */

    /* The following are the operation functions for objects in this FS */
    struct {
        STOBJCREATE *create;
        STOBJOPEN *open;
        STOBJCLOSE *close;
        STOBJREAD *read;
        STOBJWRITE *write;
        STOBJUNLINK *unlink;
        STOBJRECYCLE *recycle;
    } obj;

    /* Log operation function pointers */
    struct {
        STLOGOPEN *open;
        STLOGCLOSE *close;
        STLOGWRITE *write;
        struct {
            STLOGCLEANSTART *start;
            STLOGCLEANNEXTENTRY *nextentry;
            STLOGCLEANWRITE *write;
            STLOGCLEANDONE *done;
            void *state;
        } clean;
        int writes_since_clean;
    } log;

    struct {
        int blksize;            // block size of this file system
    } fs;

    void *fsdata;
};

This is Squid's filesystem abstraction. Each concrete implementation, such as COSS, registers its own open, close, create, and other functions through these interfaces. The approach mirrors how filesystems are implemented in the Linux kernel.
In a sense, we could build a filesystem layer tailored to our own business on top of this abstraction. This separation into an abstraction layer and an interface layer is an important concept and technique in OS implementation; anyone who has done Linux kernel development will be familiar with it.
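The registration idea can be illustrated with a cut-down table of function pointers. Everything in this sketch (struct mini_swapdir, mini_setup, the two stand-in open functions) is hypothetical and far simpler than Squid's real structure; it only shows how generic code can call a backend through a pointer without naming it.

```c
#include <string.h>

/* Hypothetical, cut-down counterpart of the per-object operation table. */
typedef int OBJOPEN(const char *key);

struct mini_swapdir {
    const char *type;
    struct {
        OBJOPEN *open;
    } obj;
};

/* Backend-specific implementations (stand-ins, not Squid's real functions). */
static int coss_open(const char *key) { (void)key; return 1; }
static int aufs_open(const char *key) { (void)key; return 2; }

/* "Registration": a backend fills in the table with its own functions. */
static void mini_setup(struct mini_swapdir *sd, const char *type)
{
    sd->type = type;
    sd->obj.open = strcmp(type, "coss") == 0 ? coss_open : aufs_open;
}

/* Generic code calls through the pointer and never names a backend. */
static int mini_open(struct mini_swapdir *sd, const char *key)
{
    return sd->obj.open(key);
}
```

A new storage backend would only need to supply its own functions and a setup routine; callers of mini_open stay unchanged.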

2.2.2. struct _cossinfo
Only the data members we care about are listed.

struct _cossinfo {
    dlink_list membufs;                     // list used to reach the memory stripes quickly

    // Shortcut pointer that always points to the membuf of the disk stripe currently in use
    struct _cossmembuf *current_membuf;

    // Current offset within the whole COSS file, which may be greater than maxoffset
    off_t current_offset;                   /* in bytes */

    float minumum_overwrite_pct;            // configured value of overwrite-percent
    int minimum_stripe_distance;            // derived from overwrite-percent: when the distance between the current cursor and the object's base address exceeds this value, the object is relocated. Used only by storeCossRelocateRequired.

    // Disk stripe information
    int numstripes;                         // number of stripes that fit, based on the COSS file size
    struct _cossstripe *stripes;            // array sized to the maximum number of disk stripes the COSS file supports
    int curstripe;                          // number of the stripe currently in use
    int max_disk_nf;                        // equal to the maximum number of blocks in the COSS file

    // Memory stripe information
    off_t current_memonly_offset;
    struct _cossmembuf *current_memonly_membuf;
    int nummemstripes;                      // number of memory stripes; the default is 10

    // Points to the memory stripes: an array of nummemstripes entries
    struct _cossstripe *memstripes;         // an entry is cleared only when storeCossMaybeFreeBuf is called from storeCossWriteMemBufDone
    int curmemstripe;                       // not the number of memory stripes in use, but the memory stripe that is currently free
    /* ............ */
};

A CossInfo corresponds to one COSS file and records the allocation and mapping of each of its stripes.
Note the member cossinfo->stripes. For disk stripes, cossinfo->stripes is allocated during program initialization as an array of numstripes entries (the maximum number of disk stripes). cossinfo->stripes[n] corresponds to the n-th disk stripe and holds enough information to record how that stripe is used.
Another important member is cossinfo->memstripes. Its implementation and usage are essentially the same as for disk stripes.

Squid uses this data structure to manage the entire COSS file. The process analysis below shows how.

2.2.3. struct _cossstripe
struct _cossstripe {
    int id;
    int numdiskobjs;                // number of objects in the stripe
    int pending_relocs;
    struct _cossmembuf *membuf;     // membuf of the stripe while objects are being written to it; is it synchronized to disk once it is full?
    dlink_list objlist;             // list of the objects held in this stripe; the entries are of type StoreEntry
};
 
2.3. Key Process Analysis
2.3.1. Rebuild
During rebuild, Squid opens each COSS file in turn. After reading a stripe, it records the stripe's information in cossinfo->stripes[n], where n is the stripe's logical number within the COSS file, and puts the information for each object found in the stripe into cossinfo->stripes[n]->objlist. The list entries are of type StoreEntry, and StoreEntry->swap_filen and StoreEntry->swap_dirn can be converted, through a fixed COSS mapping, into an offset in the COSS file, at which the data can then be read with fseek.
Simply put, the rebuild process parses each stripe in sequence and restores the information about existing objects into the StoreEntry hash and cossinfo->stripes[n], which completes the rebuild.
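The mapping from a file number to a file offset and a stripe index can be sketched as follows. This assumes that swap_filen counts blocks from the start of the COSS file and that stripes are 1 MB; the helper names are hypothetical, and Squid's own conversion routines may differ in detail.

```c
#include <stdint.h>

#define STRIPE_BYTES 1048576u   /* assumed 1 MB stripe */

/* Byte offset of an object whose file number counts blocks from the start of the file. */
static uint64_t fileno_to_offset(uint32_t fileno, uint32_t block_size)
{
    return (uint64_t)fileno * block_size;
}

/* Index of the disk stripe that contains a byte offset. */
static int offset_to_stripe(uint64_t offset)
{
    return (int)(offset / STRIPE_BYTES);
}
```

With 4096-byte blocks, file number 255 still falls in stripe 0, while file number 256 is the first block of stripe 1.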

2.3.2. Accessing an Object on a Cache Hit: storeCossOpen
The pseudocode of storeCossOpen is as follows:
storeCossOpen()
{
    // Look through the memory stripes in cossinfo to check whether the object being accessed is in a memory stripe.
    storeCossMemPointerFromDiskOffset

    if (memory stripe hit) {
        // Lock the memory stripe so that it cannot be swapped out to disk,
        // even if all memory stripes are occupied.
        storeCossMemBufLock
    } else {
        if (storeCossRelocateRequired) {
            // The relocation condition is met: the object has not been accessed for a long time.
            storeCossAllocate
        } else {
            // The object was accessed recently and its stripe is not far behind the cursor, so it does not need to be relocated right away.
            storeCossMemOnlyAllocate
            // This is the function that logs the following message when the memory stripes are full:
            // storeCossCreateMemOnlyBuf: No free membufs.
            // You may need to increase the value of membufs

            if (memory stripes are full) {
                // Use the current disk stripe instead, which is written to disk once it is full: the same path as relocation.
                storeCossAllocate
                ..................
            }
        }
        ....................
    }
}
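The branch structure of the pseudocode above can be condensed into a small, compilable decision function. The enum values and the function name are illustrative only, and the three inputs stand in for the checks that the real code performs on cossinfo.

```c
/* Possible outcomes of opening a hit object (names are illustrative). */
enum open_path {
    OPEN_MEM_HIT,         /* already in a memory stripe: just lock it */
    OPEN_RELOCATE,        /* old enough: copy into the current disk stripe */
    OPEN_MEMONLY,         /* recent: read into a free memory-only stripe */
    OPEN_FORCED_RELOCATE  /* recent, but no free memory stripe was found */
};

static enum open_path choose_open_path(int in_mem_stripe,
                                       int relocate_required,
                                       int free_mem_stripes)
{
    if (in_mem_stripe)
        return OPEN_MEM_HIT;
    if (relocate_required)
        return OPEN_RELOCATE;
    if (free_mem_stripes > 0)
        return OPEN_MEMONLY;
    return OPEN_FORCED_RELOCATE;
}
```

The last branch is the one that matters for the case study that follows: a recently stored object ends up on the relocation path only because no memory stripe is free.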

3. Case Analysis
3.1. Case 1
Case description:
A Squid server using COSS was put under load, and the log showed:
2009/02/22 10:26:27| storeCossCreateMemOnlyBuf: No free membufs. You may need to increase the value of membufs on the /Cache/coss13 cache_dir
After that, the service time for cache hits climbed to the order of seconds. Normally it is about 6 ms.

Technical analysis:
From the analysis above, the message "storeCossCreateMemOnlyBuf: No free membufs. You may need to increase the value of membufs" appears because the memory stripes are full. Normally, when all m memory stripes are full, Squid swaps some of them out to the COSS file. However, a memory stripe locked by storeCossMemBufLock is not swapped out, and a memory stripe is locked while any object in it is being accessed (a cache hit on that object) and unlocked only after the request has been served.
The message is emitted inside storeCossOpen (on a cache hit), which is why analyzing that function is critical.
The conclusion for this case is therefore as follows.
The message appears because the accessed object had been accessed not long before. ("Not long before" means, with the default overwrite-percent of 50%, within the time it takes the cursor to traverse 50% of the COSS file; if this is unclear, look at how minimum_stripe_distance is calculated in the code.) storeCossRelocateRequired therefore decides that the object should not be relocated, and Squid tries to hold it in a memory stripe. No free memory stripe can be found, however, so the object has to be relocated after all. The consequence is that the object now sits in the stripe under the current cursor; the next time it is accessed, that stripe is too recent to meet the relocation condition, so Squid again looks for a memory stripe to hold it, and a vicious circle forms.
The most likely reason no free memory stripe can be found is that Squid is overloaded: each object takes a long time to serve, so the lock on its memory stripe is held for a long time, the number of concurrently held locks rises, and the memory stripes end up fully occupied.
Since this had not happened before, we infer that the fix is to reduce the response time. Of course, the earlier analysis also shows that changing overwrite-percent and increasing membufs can relieve the problem. But the root cause lies in the lock/unlock cycle, that is, serving popular objects takes too long.
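The way long-held locks exhaust the memory stripes can be modelled with a simple lock counter per stripe. This is a hypothetical sketch, not Squid's bookkeeping: struct mem_stripe and the three helpers are invented names, and a stripe is treated as reusable only when nothing is locking it.

```c
/* Hypothetical memory stripe: lock_count > 0 while any hit on it is being served. */
struct mem_stripe {
    int lock_count;
};

static void mem_stripe_lock(struct mem_stripe *s)   { s->lock_count++; }
static void mem_stripe_unlock(struct mem_stripe *s) { s->lock_count--; }

/* Index of a memory stripe that may be reused, or -1 when all are locked. */
static int find_free_mem_stripe(const struct mem_stripe *s, int n)
{
    for (int i = 0; i < n; i++)
        if (s[i].lock_count == 0)
            return i;
    return -1;
}
```

When requests are slow, locks are released late, so more stripes have a nonzero count at any moment and the search returns -1 more often; in this model that is the condition under which the "No free membufs" message would be logged.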

4. Suggestions on Using COSS
From our technical analysis of COSS, we can see that:
- COSS manages files in stripes and blocks, so hard disk space cannot be fully utilized.
- COSS adds write operations because of relocation.
- The COSS rebuild process has to parse the entire COSS file, so it takes a long time, and the COSS file cannot be used while the rebuild is in progress.
- Because fileno has only 24 bits, the size of a COSS file is limited.
- When objects are deleted from the COSS file (cache purge), many holes appear in it, the number of objects per stripe drops, and read and write efficiency falls.
- COSS cannot effectively support directory-based refresh extensions or deeper content management.
- The size of the objects that COSS can store has an upper limit.

Advantages of COSS:
- Compared with aufs and ufs, a large number of open/close system calls is avoided (those filesystems make one pair for every object access).
- The files are more likely to be stored in relatively contiguous space (the operating system preferentially allocates contiguous space when the COSS file is created), so the readahead function of the OS may be used effectively (a stripe that is read ahead may contain objects that are accessed later).

The analysis above may not be comprehensive, but it still shows that on ordinary hard disks COSS can help when serving small files (mainly by cutting down on open/close calls and using a large number of lseek calls instead). COSS may not be the best choice, however, in the following situations:
- Machines that use SSDs. SSDs read very fast but write comparatively poorly; COSS adds write operations and so does worse than aufs.
- Workloads with frequent file updates or deletions (purges).
- Workloads whose hot files are concentrated but larger in total than the memory stripes can hold. The memory stripes may fill up, causing frequent relocation and possibly the vicious circle described in Case 1.
- Deployments that need Squid to start up quickly.

Finally, whether to run a multi-Squid architecture, and which filesystem to use, depends on the business. From the standpoint of computer system architecture, a system is made up of just four kinds of resources: compute (CPU), storage (memory and external storage), I/O (here meaning mainly the NICs, since external storage also goes through I/O), and the buses that connect them.
To optimize Squid, first determine which of these four resources, CPU or I/O for instance, is the bottleneck for the particular business. Running multiple Squid instances makes full use of the compute of multiple CPUs and reduces the time spent waiting on "serial" operations. However, when multiple Squid instances use multiple COSS files, some overhead is added as well, one part of which is more I/O contention. If the parameters are set badly, I/O contention increases the I/O wait time (the await and avgqu-sz columns of iostat -x can be used to analyze this).
