Implementation and usage of flashcache


For work, I read up on flashcache and recorded my notes as follows:

Implementation

Flashcache is an open-source project developed by Facebook's technical team. It is a kernel module that uses SSDs to cache data and was initially used to accelerate MySQL, but it is designed as a general cache module for any application built on block devices.

Working principle. Built on Device Mapper, flashcache maps a fast SSD and an ordinary hard disk into a single logical block device with a cache, and this logical device is the interface the user operates on. Read/write operations should go directly against this logical device rather than the underlying SSD or ordinary disk; if you operate on the underlying block devices themselves, the caching function provided by the whole is lost.

Kernel level. Flashcache is implemented by adding a cache layer between the file system and the block device driver layer. This relies on the mapping mechanism of the Device Mapper (DM) layer. DM registers itself in the kernel as a virtual block device driver; it is not a real device driver and cannot complete bio processing on its own. Instead, it decomposes, clones, and re-maps bios according to its mapping table, then passes the bios down to the real underlying device driver to start the data transfer. Device Mapper introduces the notion of a target driver: each target driver is described by a target_type and represents one class of mapping, implementing the mapping of block devices in a specific way. Calling a target driver's map method maps a bio dispatched from the upper layer, that is, finds the correct target device and forwards the bio to that device's request queue to complete the operation. flashcache_target is exactly such a new target driver (a new mapping class, so a target_type is required), added to the DM layer as a module.
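Once the module is loaded, the new mapping type shows up next to Device Mapper's built-in targets. A quick way to check (a sketch, assuming the module was built and installed as flashcache.ko as in the Install section below):

# Load the flashcache module and confirm the DM target type is registered
sudo modprobe flashcache
lsmod | grep flashcache
sudo dmsetup targets    # the list should now include "flashcache" alongside linear, striped, etc.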

Logical architecture. At the source-code level, flashcache can be divided into four modules: the scheduling module (the "read/write module"), the logic processing module (the "read/write post-processing module"), the underlying storage module, and the background cleaning module; the latter are built on top of the SSD layout (analyzed later).

The scheduling module corresponds to the flashcache_map mapping function in the code. It is the data entry point of the flashcache layer: read/write requests that reach the logical device pass through the DM layer and enter the scheduling module via flashcache_map. Upon receiving a request, it chooses a processing branch based on factors such as the read/write type of the bio and whether it hits the cache, for example flashcache_read/write or flashcache_uncached_io; within reads and writes, flashcache_read_hit/miss or flashcache_write_hit/miss is selected. After branching, the underlying storage module is called to read or write the data on the disk or the cache.

The logic processing module corresponds to flashcache_io_callback in the code. It is called back after the underlying storage module completes the read/write issued by the scheduling module, hence the name "read/write post-processing module", and it is implemented as a state machine. It continues processing according to the read/write type handled in the scheduling module: for example, on a read miss, after the disk read completes the callback writes the data just read from disk into the SSD; on a write miss, after the SSD write completes the callback performs the metadata update. It also handles read/write errors from the scheduling module.

The underlying storage module provides two ways to perform the real data I/O. One is the dm_io function provided by DM, ultimately implemented via submit_bio, which submits the bios prepared by the scheduling module to the generic block layer and forwards them to the real device driver to complete the transfer. The other is kcopyd, a low-level copy service provided by the kernel, mainly responsible for writing dirty blocks back (from SSD to disk), which may in turn trigger metadata updates.

The background cleaning module cleans data per set. It reclaims dirty blocks based on two policies: (1) the number of dirty blocks in a set exceeds the threshold; (2) a dirty block has been idle longer than the set idle time, fallow_delay, typically 15 minutes; a block untouched for 15 minutes is reclaimed first. Note that there is no separate thread that periodically reclaims idle blocks in the background; cleaning must be triggered by I/O. If a set sees no I/O for a long time, its dirty data stays dirty for a long time, which may endanger data safety.
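Both reclaim triggers map to tunables that are exposed through sysctl once a cache device exists, as shown in the sysctl output later in this article. A hedged sketch of adjusting them (the dev.flashcache.<ssd>+<disk> prefix depends on your device names):

# Per-set dirty block threshold, in percent (default 20)
sudo sysctl -w dev.flashcache.loop0+loop1.dirty_thresh_pct=20
# Idle (fallow) delay in seconds before a dirty block becomes a cleaning candidate (900 = 15 minutes)
sudo sysctl -w dev.flashcache.loop0+loop1.fallow_delay=900
# How aggressively fallow cleaning proceeds once triggered
sudo sysctl -w dev.flashcache.loop0+loop1.fallow_clean_speed=2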

Source code layout. There are two work queues. Reading this alongside the device mapper code (especially dm.c): when the flashcache_create tool creates a flashcache device, the constructor flashcache_ctr is invoked, and it creates a work queue, _delay_clean, which is mainly responsible for cleaning the dirty blocks of the entire cache device; under specific conditions (see the code) flashcache_clean_set is called, and all sets are scanned and cleaned through flashcache_clean_all. The other work queue, _kq_xxx (I cannot remember the exact name), is created in flashcache_init when the flashcache module is loaded. It is driven by five job lists and mainly carries out the work of the logic processing module, the "read/write post-processing module": running metadata updates and their handlers, writing to the SSD after a disk read, and processing the waiting queues, scheduled in different situations after disk or SSD reads/writes complete.

The points at which this work is scheduled can be seen in the flashcache_map function; the processing logic is mainly decided in flashcache_io_callback. If the waiting queue for the same block is not empty, flashcache_do_handler is also called to process the waiting queue.

Data scheduling. For a read, after receiving the bio, flashcache first locates the set on the SSD according to bio->bi_sector, the sector number on the hard disk. It then searches the set for a hit. On a hit, the hard disk sector number is translated to the SSD sector number and the bio is submitted to the SSD for reading. On a miss, the bio is first submitted to the hard disk driver to read the data from disk; when the read completes, the callback starts the write to the SSD, translating the bio's sector number to an SSD sector number and submitting it to the SSD driver so that the data just read from disk is written into the SSD. Writes behave like the file system page cache: data is not written directly to the hard disk but to the SSD, while a threshold (typically 20%) is maintained; when the proportion of dirty blocks reaches this value, they are written back to disk.
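A rough way to observe this path on the simulated device built later in this article is to read the same region twice with direct I/O and compare the counters reported by dmsetup status (a sketch; cachedev is the device created in the Install section):

# First pass: expect misses, data comes from disk and is copied into the SSD
dd if=/dev/mapper/cachedev of=/dev/null bs=4k count=1024 iflag=direct
# Second pass over the same range: expect hits served from the SSD
dd if=/dev/mapper/cachedev of=/dev/null bs=4k count=1024 iflag=direct
# read hits / disk reads / ssd reads should reflect the two passes
sudo dmsetup status cachedev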

Install

For details about the installation process, refer to the guide here. On CentOS 6.5 with a self-compiled linux-3.15.5 kernel, I ran make and make install directly and the installation succeeded.

make -j 4 KERNEL_TREE=/usr/src/kernels/2.6.32-131.0.15.el6.x86_64
sudo make install

The first version of flashcache supported only writeback. A branch supporting writethrough was later maintained separately in the flashcache-wt directory, but the latest version has merged writethrough into the main tree and added a writearound policy.
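The cache mode is selected at creation time with flashcache_create's -p switch; for example (device names are placeholders):

# Writeback: writes land on the SSD and are cleaned to disk later
flashcache_create -p back cachedev /dev/sdb /dev/sdc
# Writethrough: writes go to the SSD and the disk synchronously
flashcache_create -p thru cachedev /dev/sdb /dev/sdc
# Writearound: writes bypass the SSD and go straight to disk
flashcache_create -p around cachedev /dev/sdb /dev/sdc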

You can obtain the latest source code from GitHub.

env GIT_SSL_NO_VERIFY=true git clone https://github.com/facebook/flashcache.git
It is recommended that the first thing you do after downloading the source is to go into the doc directory and read flashcache-doc.txt and flashcache-sa-guide.txt.

Simulated Experiment

Not everyone has SSD or PCI-E flash hardware, so here is a small trick for building a virtual hybrid storage device: even on your own notebook, you can easily simulate a flashcache test environment.

First, we can use memory to simulate a flash device with good performance. The obvious drawback is that everything is lost when the host restarts, but for lab testing this is usually not a problem. There are two ways to simulate a block device with memory: a ramdisk, or tmpfs plus a loop device. Resizing a ramdisk requires modifying grub and rebooting, so here we use tmpfs.

# Limit tmpfs to a maximum of 10 GB to avoid exhausting memory (the test machine has 24 GB of physical memory)
$ sudo mount tmpfs /dev/shm -t tmpfs -o size=10240m
# Create a 2 GB file to simulate a 2 GB flash device
$ dd if=/dev/zero of=/dev/shm/ssd.img bs=1024k count=2048
# Map the file to a loop block device
$ sudo losetup /dev/loop0 /dev/shm/ssd.img

Besides the cache device, you also need a persistent disk device. Similarly, you can use a file on an ordinary disk to back another loop device.

# Create a 4 GB file on an ordinary disk's file system to simulate a 4 GB disk device
$ dd if=/dev/zero of=/u01/jiangfeng/disk.img bs=1024k count=4096
$ sudo losetup /dev/loop1 /u01/jiangfeng/disk.img
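Before building the cache, it is worth confirming that both loop devices are attached, for example:

# List active loop devices and the files backing them
sudo losetup -a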

Now we have a fast device, /dev/loop0, and a slow disk device, /dev/loop1, and can start creating a flashcache hybrid storage device.

$ sudo flashcache_create -p back cachedev /dev/loop0 /dev/loop1
cachedev cachedev, ssd_devname /dev/loop0, disk_devname /dev/loop1 cache mode WRITE_BACK
block_size 8, md_block_size 8, cache_size 0
Flashcache metadata will use 8MB of your 48384MB main memory
$ sudo mkfs.ext3 /dev/mapper/cachedev
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
262144 inodes, 1048576 blocks
52428 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=1073741824
32 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
$ sudo mount /dev/mapper/cachedev /u03

OK. Have a look at the device status and start some simulated tests.

$ sudo dmsetup table
cachedev: 0 8388608 flashcache conf:
        ssd dev (/dev/loop0), disk dev (/dev/loop1) cache mode(WRITE_BACK)
        capacity(2038M), associativity(512), data block size(4K) metadata block size(4096b)
        skip sequential thresh(0K)
        total blocks(521728), cached blocks(83), cache percent(0)
        dirty blocks(0), dirty percent(0)
        nr_queued(0)
Size Hist: 4096:84
$ sudo dmsetup status
cachedev: 0 8388608 flashcache stats:
        reads(84), writes(0)
        read hits(1), read hit percent(1)
        write hits(0) write hit percent(0)
        dirty write hits(0) dirty write hit percent(0)
        replacement(0), write replacement(0)
        write invalidates(0), read invalidates(0)
        pending enqueues(0), pending inval(0)
        metadata dirties(0), metadata cleans(0)
        metadata batch(0) metadata ssd writes(0)
        cleanings(0) fallow cleanings(0)
        no room(0) front merge(0) back merge(0)
        disk reads(83), disk writes(0) ssd reads(1) ssd writes(83)
        uncached reads(0), uncached writes(0), uncached IO requeue(0)
        uncached sequential reads(0), uncached sequential writes(0)
        pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)
$ sudo sysctl -a | grep flashcache
dev.flashcache.loop0+loop1.io_latency_hist = 0
dev.flashcache.loop0+loop1.do_sync = 0
dev.flashcache.loop0+loop1.stop_sync = 0
dev.flashcache.loop0+loop1.dirty_thresh_pct = 20
dev.flashcache.loop0+loop1.max_clean_ios_total = 4
dev.flashcache.loop0+loop1.max_clean_ios_set = 2
dev.flashcache.loop0+loop1.do_pid_expiry = 0
dev.flashcache.loop0+loop1.max_pids = 100
dev.flashcache.loop0+loop1.pid_expiry_secs = 60
dev.flashcache.loop0+loop1.reclaim_policy = 0
dev.flashcache.loop0+loop1.zero_stats = 0
dev.flashcache.loop0+loop1.fast_remove = 0
dev.flashcache.loop0+loop1.cache_all = 1
dev.flashcache.loop0+loop1.fallow_clean_speed = 2
dev.flashcache.loop0+loop1.fallow_delay = 900
dev.flashcache.loop0+loop1.skip_seq_thresh_kb = 0
I used dd to create ssd.img and disk.img as above, built a flashcache device from them, and ran fio against it; the result was about 3 to 4 times better than the same test on a non-flashcache device.
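For reference, a fio job along these lines can be used to compare the flashcache device with the plain loop device (paths and sizes are illustrative, not the exact job I ran):

# Random 4K reads against the file system mounted on the flashcache device
fio --name=randread --filename=/u03/fio.test --rw=randread --bs=4k --size=1g \
    --ioengine=libaio --direct=1 --numjobs=4 --runtime=60 --time_based --group_reporting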

Flashcache command line

Assume that the SSD is /dev/sdb and the SAS disk is /dev/sdc.

Create a flashcache device:

flashcache_create -p back cachedev /dev/sdb /dev/sdc

-p back: specifies writeback as the cache mode.

cachedev: the name of the flashcache device.

List the SSD device first and the SAS disk second.
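flashcache_create also accepts a cache size and block size if you do not want to use the whole SSD; a hedged example (values are illustrative):

# Writeback cache named cachedev, using 10 GB of the SSD with 4 KB cache blocks
flashcache_create -p back -s 10g -b 4k cachedev /dev/sdb /dev/sdc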

In this way, Linux virtualizes a block device with cache:

[[email protected] loop2+loop3]# ll /dev/mapper/cachedev
lrwxrwxrwx 1 root root 7 Aug 18 11:08 /dev/mapper/cachedev -> ../dm-1
The device can now be used just like an ordinary block device. If the original partition /dev/sdc already contains a file system, it can still be used normally after mounting; if not, create a file system on it first, as you would for an ordinary device, then mount and use it.
mount /dev/mapper/cachedev /mnt
To redo flashcache, first umount the corresponding partition, then remove the mapping and destroy the cache:
umount /mnt
dmsetup remove cachedev
flashcache_destroy /dev/sdb

If you need to rebuild, run flashcache_create again as described above.
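With a writeback cache, the SSD may still hold dirty blocks when you tear it down. A cautious sketch based on the sysctls shown in the simulated test above (the dev.flashcache.<ssd>+<disk> prefix depends on your device names):

# Ask flashcache to clean all dirty blocks back to disk
sudo sysctl -w dev.flashcache.sdb+sdc.do_sync=1
# Wait until dirty blocks(...) drops to 0 before dmsetup remove / flashcache_destroy
sudo dmsetup status cachedev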
