Cross-platform Caffe: the I/O model and parallel scenarios (III)

3. Caffe I/O model

Caffe supports GPU acceleration, which places higher demands on the I/O model. Caffe introduces multiple pre-buffering to compensate for the large bandwidth gap between main memory and video memory, and uses a main memory management automaton to control data transfer and synchronization between the two. Together these hide transfer time, improve the utilization of compute resources, and maintain data consistency. Caffe also supports single-machine multi-GPU data parallelism, and its multithreaded I/O model underpins this parallel scenario. This chapter expounds Caffe's I/O model in detail from two aspects, principle and framework analysis, and examines its design ideas and advantages.

3.1 I/O model overview

CPU/GPU heterogeneous programming places higher efficiency demands on the I/O model. To realize the acceleration potential of the GPU, we need to analyze the factors that determine the I/O model's performance. Taking this as the starting point, this section elaborates the design ideas behind the Caffe I/O model and summarizes its overall architecture.

With the introduction of the GPU, we must manipulate two different storage bodies at once: main memory is controlled by the north bridge and connected to the CPU by the address, control, and data buses, while video memory is controlled by the GPU and reaches the CPU only through the PCIe bus. Because there is only one data bus between the CPU and the GPU, the CPU has no direct access to video memory, and a large portion of the GPU's clock cycles are spent exchanging data with the CPU. At the same time, video memory bandwidth averages roughly 10x main memory bandwidth, and this large gap limits GPU acceleration. The main techniques for compensating for the bandwidth gap are time-sharing, asynchrony, and multithreading.

To address these issues, Caffe introduces multiple pre-buffering into its I/O model. A sub-thread is set up in the data input layer (DataLayer) so that, while the GPU computes and the CPU is otherwise idle, four to five batches of data are pre-buffered in main memory, hiding data transfer time and improving the utilization of compute resources. This multithreaded I/O forms the left part of the Caffe I/O model in Figure 3-1, and mainly uses a two-tier producer-consumer model to maintain the critical buffers. Transfer between main memory and video memory is controlled by a main memory management automaton, which synchronizes the data on both sides through Caffe's synchronized memory class (SyncedMemory). To sum up, the Caffe I/O model shown in Figure 3-1 consists mainly of multithreaded I/O and the main memory management automaton: multithreaded I/O is responsible for loading source data from disk into main memory, while the automaton is responsible for data transfer and synchronization between main memory and video memory.

Figure 3-1 Caffe I/O model

3.2 I/O parallel analysis

The amount of training data in a deep learning system is huge and cannot all be loaded into memory. Instead, raw data is loaded from disk into a buffer in main memory, batch by batch, for use by the training program. This process is a producer-consumer problem: a single production step pre-buffers one batch of data and takes little time, whereas a single consumption step spans an entire forward and backward propagation pass and takes much longer. This huge difference in execution time between producer and consumer wastes compute and storage resources, and can even crash the system if no constraint is imposed. Consequently, the producer must enter a blocked state once it detects that the buffer is full. If the I/O model were not multithreaded, the blocking code would sit in the main process and cause deadlock: with the buffer full, the main process blocks, and forward and backward propagation can never run. Producers and consumers must therefore be asynchronous and live on different threads, which is the root reason the I/O model needs a multithreaded design. Producer/consumer access to the buffer is then an asynchronous critical-resource problem and requires a mutex to maintain resource consistency.

The core principle of multithreaded programming is to parallelize code segments that have no causal dependency. To speed up the Caffe I/O model, we need to identify the parts of the I/O process that are not causally dependent on their context.

(1) Datum and Blob (Batch) have no causal dependency

Raw data is loaded from disk into memory and stored in the intermediate data type Datum, which depends only on the input sample. Datum data must then be converted into Caffe's basic data type, Blob; a Blob carries the shape information for forward propagation, which can only be determined when the network is initialized.

Therefore, reading Datum records can begin before the network is initialized, which is the rationale for the DataReader thread design. At the same time, this independence also lays the groundwork for the design of producer/consumer access to the critical resource.

(2) The GPUs have no causal dependency on one another

Caffe's built-in parallel scheme is single-machine multi-GPU data parallelism: each GPU holds a copy of the model, different GPUs cover different segments of the data, and the results from each GPU are then merged synchronously to iteratively update the weights. In such a multi-GPU scenario, each GPU has at least one DataReader covering a different data segment. On the network-structure side, the GPUs can share a root network, as shown in the figure:

Figure 3-2 GPU Pipelining Programming scenario

The image above is a classic multi-GPU pipelined programming scenario. Three GPUs each have their own DataReader but share all layers (including the data input layer, DataLayer). GPU0 is controlled by the main process, GPU1 by thread 1, and GPU2 by thread 2. On the host side of Caffe, that is, in the CPU main process and the worker threads, each layer's forward propagation is guarded by a mutex, but backward propagation is not. Thus, although the main process, thread 1, and thread 2 call Layer::Forward() in parallel, they cannot access the same layer at the same time; each layer is a mutually exclusive critical resource. This behavior constructs a manual pipeline, for example:
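The per-layer locking just described can be sketched with plain C++ threads. This is a hypothetical toy model, not Caffe code: ToyLayer, RunNet, and TotalForwardCalls are illustrative names, and the "computation" is just a counter increment.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Each layer owns a mutex guarding Forward(), so at most one GPU-driving
// thread can execute a given layer at a time, while different threads may
// occupy different layers concurrently -- a manual pipeline.
struct ToyLayer {
  std::mutex forward_mutex;   // guards Forward(), as the text describes
  std::atomic<int> calls{0};
  void Forward() {
    std::lock_guard<std::mutex> lock(forward_mutex);
    ++calls;  // stand-in for the real computation
  }
};

// Each "GPU thread" walks the shared layer stack front to back.
void RunNet(std::vector<ToyLayer>& layers) {
  for (auto& layer : layers) layer.Forward();
}

// Drive num_threads threads through the shared layers and count calls.
int TotalForwardCalls(std::vector<ToyLayer>& layers, int num_threads) {
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i)
    workers.emplace_back(RunNet, std::ref(layers));
  for (auto& t : workers) t.join();
  int total = 0;
  for (auto& layer : layers) total += layer.calls;
  return total;
}
```

Every thread still visits every layer exactly once; the mutexes only serialize access to each individual layer, which is what turns concurrent forward passes into a pipeline.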

While GPU0 occupies Conv1, GPU1 and GPU2 are locked out of it.

When GPU0 moves on to Conv3, Conv1 and Conv2 are idle and can be occupied by the other GPUs.

Backward propagation needs no lock because forward and backward propagation already obey a causal order.

3.3 Multithreaded I/O

The I/O module of a deep learning system follows the producer-consumer model and requires a multithreaded design to avoid deadlock. According to the multithreaded parallelism analysis of Section 3.2, a two-level pre-buffering scheme can improve the parallelism of the I/O module and reduce waiting time. This section analyzes the principle and implementation of Caffe's two-level, multi-buffer I/O in detail.

3.3.1 Producer-consumer model

The producer-consumer pattern is a classic multithreaded synchronization problem with two important features: mutual exclusion and blocking. Mutual exclusion (mutex) and blocking are different concepts: a mutex serializes threads that operate asynchronously on the same resource into a sequential queue, whereas blocking puts a waiting thread to sleep so that it temporarily relinquishes the CPU.

Given that producers and consumers exhibit no random-access or random-write behavior, a queue can serve as the buffer. Because producers and consumers access the critical buffer asynchronously, the queue's pop and push operations must be protected by a mutex to maintain data consistency. Such a queue is called a blocking queue (BlockingQueue). A blocked thread has two properties: ① the CPU relinquishes the thread; ② the thread cannot activate itself. For a blocked thread to be reactivated, the model must be designed as a "dual" one, with producer and consumer exactly symmetric. Producers and consumers block under the following conditions:

  • When the buffer is empty, the consumer blocks itself, refusing further pop operations and handing over CPU control.

  • When the buffer is full, the producer blocks itself, refusing further push operations and handing over CPU control.

Correspondingly, threads are activated under the following conditions:

  • When an element is pushed into a previously empty buffer, the producer should activate the consumer thread.

  • When an element is popped from a previously full buffer, the consumer should activate the producer thread.
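A minimal blocking queue capturing these block/activate rules might look like the following sketch. Caffe's real BlockingQueue (in util/blocking_queue.hpp) is similar in spirit but not identical; this simplified version blocks only on the empty condition.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Simplified blocking queue: pop() sleeps (relinquishing the CPU) while the
// queue is empty, and is activated by a producer's push -- the "dual"
// block/activate behavior the text describes.
template <typename T>
class BlockingQueue {
 public:
  void push(const T& value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(value);
    }
    cond_.notify_one();  // activate a blocked consumer, if any
  }
  T pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    // Block until a producer pushes an element; the predicate guards
    // against spurious wakeups.
    cond_.wait(lock, [this] { return !queue_.empty(); });
    T value = queue_.front();
    queue_.pop_front();
    return value;
  }

 private:
  std::deque<T> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
};
```

Note that this queue is unbounded, so the producer never blocks here; detecting "full" is exactly the problem the QueuePair scheme described in this section solves.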

The above shows that both consumer and producer need to know whether the buffer is empty or full, and must either block themselves or activate the other side. Traditional producer-consumer programs typically use a single buffer queue, which works when the queue's capacity is fixed. In actual I/O, however, the batch size of the training samples varies; the buffer capacity is generally 3 to 4 times the size of one batch (pre-buffering several batches), so its size is indeterminate and the condition "buffer queue full" is hard to detect. To solve this problem, Caffe uses a double-buffer queue group: two blocking queues, free and full, together form a queue pair (QueuePair). To avoid having to detect the buffer's upper bound, the free queue is first seeded with a number of empty element pointers equal to that bound. In each production step, the producer takes an empty Datum element from the free queue, populates it with data, and pushes it into the full queue. The essence of this approach is that, alongside the finished-goods queue (full), the producer is also given a parts queue (free); detecting that the parts queue is empty simulates, and substitutes for, detecting that the finished-goods queue is full.

3.3.2 First-level buffering
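The free/full scheme can be illustrated with a minimal, single-threaded sketch. All names here (Slot, QueuePair, Produce, Consume) are illustrative stand-ins, not Caffe's actual types, and blocking is replaced by a boolean return so the invariant is easy to see.

```cpp
#include <deque>

// Seed the `free` queue with `capacity` empty slots; "buffer full" is then
// simply "free queue empty", so the bounded buffer never needs an explicit
// size check.
struct Slot { int payload = 0; };  // stand-in for a Datum

struct QueuePair {
  std::deque<Slot*> free;   // empty slots, pre-seeded to capacity
  std::deque<Slot*> full;   // produced slots awaiting a consumer

  explicit QueuePair(int capacity) {
    for (int i = 0; i < capacity; ++i) free.push_back(new Slot);
  }
  ~QueuePair() {
    for (Slot* s : free) delete s;
    for (Slot* s : full) delete s;
  }
  bool Produce(int value) {           // producer side
    if (free.empty()) return false;   // free empty == buffer full: would block
    Slot* s = free.front(); free.pop_front();
    s->payload = value;               // populate the empty element
    full.push_back(s);
    return true;
  }
  bool Consume(int* out) {            // consumer side
    if (full.empty()) return false;   // full empty == buffer empty: would block
    Slot* s = full.front(); full.pop_front();
    *out = s->payload;
    free.push_back(s);                // recycle the slot
    return true;
  }
};
```

The total number of slots in circulation is constant: every slot is always in exactly one of the two queues, which is why checking one queue's emptiness substitutes for checking the other's fullness.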

The primary function of Caffe's first-level I/O is to load raw data from disk into a memory buffer; the basic unit of transfer is the Datum, which represents one data sample. The Caffe data layer class hierarchy is shown in Figure 3-3. DataLayer inherits from BasePrefetchingDataLayer and from the thread class InternalThread, because it needs new threads to read data from the database in parallel. A DataLayer object has exactly one DataReader member, reader_, and the DataReader class has a Body member. Each Body controls one data source, and different data sources are distinguished by a hash key. Body inherits from DataReader and InternalThread; it is in effect a thread that loops forever, reading Datum records into the buffer.

Figure 3-3 Data layer class inheritance relationship

Body and DataReader constitute Caffe's first-level data buffer: from the database into a Datum-typed buffer. When the network's data input layer DataLayer is initialized, it instantiates the member variable reader_ of type DataReader, creating a QueuePair-typed contiguous space, queue_pair, as the buffer; its size is the number of prefetched batches multiplied by the capacity of each batch. A Body is keyed by the hash of the layer name and the data source. One Body corresponds to one data source; in multi-GPU training there can be multiple DataLayer instances training against the same database, but they share the single thread opened by the same Body to read data from it. The Body class is a thread class whose constructor, invoked when it is instantiated inside DataReader, starts a data pre-reading thread. Thus the first-level producer is the Body thread, and the buffer is held by the DataReader object reader_.

The Body member function InternalThreadEntry overrides the method of the parent class InternalThread; Figure 3-4 shows the first-level I/O data pre-buffering process. When the Body for a database is instantiated, the data prefetch thread starts, and the thread uses a double blocking queue to synchronize producer and consumer. In a single production step, an empty Datum is first popped from the parts queue free, the Datum is filled with a sample read from the database, and the produced Datum is then pushed, in a blocking manner, into the finished-goods queue full. The first-level producer is this data prefetch thread; the corresponding consumer is the second-level prefetch thread described in the next subsection.
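The production loop just described can be sketched as a toy threaded pipeline. This is a hedged sketch under simplified assumptions: BoundedPair and RunPipeline are hypothetical names, the "database" is a vector of ints standing in for Datum records, and a single condition variable serves both queues.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// A bounded free/full queue pair: pre-seeded free slots, blocking pop on
// either side, wakeups on every push.
class BoundedPair {
 public:
  explicit BoundedPair(int capacity) {
    for (int i = 0; i < capacity; ++i) free_.push_back(0);
  }
  int PopFree() { return Pop(free_); }      // blocks while buffer is full
  void PushFull(int v) { Push(full_, v); }
  int PopFull() { return Pop(full_); }      // blocks while buffer is empty
  void PushFree(int v) { Push(free_, v); }

 private:
  int Pop(std::deque<int>& q) {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [&] { return !q.empty(); });
    int v = q.front(); q.pop_front();
    return v;
  }
  void Push(std::deque<int>& q, int v) {
    { std::lock_guard<std::mutex> lock(m_); q.push_back(v); }
    cv_.notify_all();
  }
  std::deque<int> free_, full_;
  std::mutex m_;
  std::condition_variable cv_;
};

// Producer loop in the style of Body: for each database record, take an
// empty slot, fill it, hand it to the consumer side.
std::vector<int> RunPipeline(const std::vector<int>& database, int capacity) {
  BoundedPair pair(capacity);
  std::thread producer([&] {
    for (int record : database) {
      pair.PopFree();          // claim an empty "Datum" slot
      pair.PushFull(record);   // publish the filled record
    }
  });
  std::vector<int> consumed;
  for (std::size_t i = 0; i < database.size(); ++i) {
    consumed.push_back(pair.PopFull());
    pair.PushFree(0);          // recycle the slot
  }
  producer.join();
  return consumed;
}
```

Because there is a single producer and the queues are FIFO, records reach the consumer in database order, even though producer and consumer run on different threads.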

Caffe's I/O module supports data-parallel training across multiple GPUs, covering several forms of the producer-consumer model: multi-producer multi-buffer, single-producer multi-buffer, and single-producer single-buffer. For example, when multiple GPUs train on the same data set without sharing the data input layer, the setup is a single-producer multi-buffer model, and load can be balanced by reading from each buffer in turn during consumption. Caffe's single-machine multi-GPU data parallelism is described in detail in Chapter 4.

Figure 3-4 First-level I/O data pre-buffering process

3.3.3 Second-level buffering

Caffe's second-level I/O mainly implements the conversion of Datum data into Blob data. The producer side of the second-level I/O, shown in Figure 3-5, transforms Datum records on the prefetch thread of the data prefetching layer (BasePrefetchingDataLayer), maps them into the Blob's local memory, and finally assembles them into a Blob-typed batch. The second-level consumer is DataLayer's Forward pass, which feeds data to the next layer one batch at a time.

Figure 3-5 Second-level I/O pre-buffering

The second-level I/O requires the shape information of the Blobs in the network, which is determined only when the network's data input layer (DataLayer) is initialized. In Caffe, this initialization is mainly implemented by the member function LayerSetUp() of the class BasePrefetchingDataLayer, which calls the subclass's DataLayerSetUp function to initialize the data layer, pre-allocates data space for the consumer, and starts the thread that acts as the second-level producer in advance. The initialization of DataLayer is detailed below.

A layer is Caffe's basic computational unit, with at least one input blob (bottom blob) and one output blob (top blob), and two directions of computation: forward propagation (Forward) and backward propagation (Backward). Deep learning uses many kinds of layers, such as convolution layers and ReLU layers; with object-oriented inheritance and polymorphism, many subclasses of Layer can be implemented. Caffe uses the factory pattern to initialize layer objects of each type: every newly implemented layer must be registered in the layer factory (layer_factory), after which it can be instantiated at runtime by passing a LayerParameter to the CreateLayer function. Figure 3-6 shows the initialization process of the data layer DataLayer. During initialization of the data prefetching layer, the shape of the output blob is inferred mainly from the structure of the Datum data, and the blob is then reshaped accordingly. The procedure calls the LayerSetUp method of the BaseDataLayer class, which invokes the DataLayerSetUp method; the latter is declared in BasePrefetchingDataLayer's parent class BaseDataLayer and implemented in the subclasses of BasePrefetchingDataLayer. This is because Caffe's data input layer has several concrete implementations for different input data types, such as the DataLayer highlighted in this article and ImageDataLayer, and different inputs require different initialization.

Figure 3-6 Data layer (DataLayer) data reading process

The second step of the initialization function LayerSetUp is to access the memory space where the consumer holds its data; if that space does not exist, new storage must be allocated. Since Caffe supports GPU acceleration, the allocation target may be main memory or video memory, which is controlled by the main memory management automaton described in Section 3.4. Allocating storage by touching the data space before starting the second-level data prefetch thread avoids the risk of concurrent cudaMalloc CUDA API calls from multiple threads.

The third step of the initialization function LayerSetUp is to create the data prefetch thread, which is the producer of the second-level buffer. The member function InternalThreadEntry of the BasePrefetchingDataLayer class overrides the method of the parent class InternalThread; Figure 3-7 shows the second-level I/O data pre-buffering process. The second-level prefetch thread first creates a CUDA asynchronous stream, later used to synchronize pre-read batch blobs into GPU memory, and then loops, moving batch blobs from the first-level buffer into the second-level buffer until the thread terminates. In each loop iteration, it first pops a smart pointer to a Blob from the second-level parts queue prefetch_free_; this points to a memory region of one batch's capacity, used to store the Blob data the producer reads from the first-level buffer. Then, based on the Datum records in the first-level buffer, the shape of the batch blob is inferred and the blob is reshaped. The thread then repeatedly reads one Datum from the first-level finished-goods queue full_, converts it to Blob form, and copies it to the appropriate location in the batch blob's memory region, executing as many times as the batch size. The thread loops, each iteration completing one second-level transformation and assembling one batch-sized Blob. The second-level producer is this data prefetch thread; the consumer is DataLayer's forward propagation (Forward) pass; and the buffer is a double blocking queue consisting of prefetch_free_ and prefetch_full_, both of type BlockingQueue. DataLayer's Forward pass is simple: it just copies the prefetched batch blob into the output top blob, after which the used batch blob is pushed back into the prefetch_free_ queue, allowing the producer to continue reading data.
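The batch-assembly step of that loop can be sketched in isolation. AssembleBatch is a hypothetical stand-in: the "datums" are flat float vectors rather than Caffe Datum/Blob types, and reshaping and the CUDA stream are omitted.

```cpp
#include <algorithm>
#include <vector>

// Copy each "datum" into its slice of a flat, pre-sized batch blob,
// batch_size times -- the core of the second-level transformation.
std::vector<float> AssembleBatch(const std::vector<std::vector<float>>& datums) {
  const std::size_t batch_size = datums.size();
  const std::size_t sample_dim = datums.empty() ? 0 : datums[0].size();
  std::vector<float> batch(batch_size * sample_dim);  // the "batch blob"
  for (std::size_t i = 0; i < batch_size; ++i) {
    // Datum i lands at offset i * sample_dim in the batch's flat memory.
    std::copy(datums[i].begin(), datums[i].end(),
              batch.begin() + i * sample_dim);
  }
  return batch;
}
```

In real Caffe this copy runs on the prefetch thread, so forward propagation never waits for individual Datum conversions, only for whole batches.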

Figure 3-7 Second-level I/O data pre-buffering process

3.4 Main memory model

In traditional CUDA programming, managing main memory and video memory is already cumbersome, and a CPU/GPU heterogeneous design complicates things further. Caffe manages access to main and video memory with a main memory management automaton, shown in Figure 3-8. The automaton has four states, defined as an enumeration in the class SyncedMemory: UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, and SYNCED. These states are triggered essentially by four application-facing functions: cpu_data(), gpu_data(), mutable_cpu_data(), and mutable_gpu_data(), with four state-transfer functions beneath them: to_cpu(), to_gpu(), mutable_cpu(), and mutable_gpu(). The first two state-transfer functions maintain the machine before it reaches the SYNCED state, and the latter two break it out of SYNCED. Because the SYNCED state ignores the behavior of to_cpu() and to_gpu(), leaving SYNCED can only be done by manually reassigning the state head.

Figure 3-8 Caffe main memory management automaton

The UNINITIALIZED state has the shortest life cycle of all states and terminates as soon as either the CPU or the GPU requests memory. Over a memory object's lifetime, it is not mandatory to first request main memory, then request video memory, and finally copy the data across. In fact, during GPU operation, most primary storage bodies request video memory directly, as in the forward/backward propagation phases of all layers except DataLayer. Therefore, UNINITIALIZED allows memory to be requested directly via to_gpu(). On leaving this state, besides allocating memory, the memory usually needs to be zeroed.

The HEAD_AT_CPU state indicates that the last data modification was triggered by the CPU. Note that it records only who last modified the data, not who last accessed it. When the GPU is working, this state has the second-shortest life cycle of all states; the automaton usually sits in SYNCED or HEAD_AT_GPU, because most data modification is GPU-triggered.

Three points govern entry into this state:

① Transfer from UNINITIALIZED: the data designates main memory as its first storage carrier.

② Via mutable_cpu_data(): the data is about to be modified, so the state must be reset.

③ cpu_data() and its underlying to_cpu() cannot transfer into this state unless condition ① holds, because mere access does not modify the data.

The HEAD_AT_GPU state indicates that the last data modification was triggered by the GPU; it is almost exactly symmetric with HEAD_AT_CPU.

SYNCED is the most important state, and the only one that is not strictly necessary. The reason for a separate synchronization state is to mark main-memory/video-memory data consistency. Because the class SyncedMemory manages two memory pointers at once, encountering HEAD_AT_CPU while video memory is accessed, or HEAD_AT_GPU while main memory is accessed, would in principle require a memory copy first. That copy can be optimized away: if main memory and video memory already hold the same data, there is no need to copy back and forth. SYNCED therefore marks the data as consistent. There are only two transfers into SYNCED:

① From HEAD_AT_CPU + to_gpu():

This means the CPU's data is newer than the GPU's, and the GPU needs to use it; main memory must be synchronized to video memory at this point.

② From HEAD_AT_GPU + to_cpu():

This means the GPU's data is newer than the CPU's, and the CPU needs to use it; video memory must be synchronized to main memory at this point.

Two preparations are also made during the transfer to SYNCED:

  • Check whether the pointer on the destination side (CPU or GPU) has memory allocated; if not, allocate it.

  • Copy the data to the destination side.

Once in the SYNCED state, to_cpu() and to_gpu() are optimized to skip all of their internal code. The automaton effectively stops working, because at this point only the required memory pointer needs to be returned, with no special maintenance. This quiet period is broken by the mutable_-prefixed functions, which force the state back to HEAD_AT_CPU or HEAD_AT_GPU and start the automaton again.
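The automaton described in this section can be modeled compactly. The following is a simplified, CPU-only model: SyncedMemModel is a hypothetical name, allocations and cudaMemcpy are replaced by a copy counter, and mutable_cpu()/mutable_gpu() are folded into the mutable accessors, so only the state transitions are exercised.

```cpp
// Simplified model of the SyncedMemory automaton. Copies are counted rather
// than performed; Caffe's real class manages cpu_ptr_/gpu_ptr_ and calls
// cudaMemcpy.
class SyncedMemModel {
 public:
  enum Head { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };

  Head head() const { return head_; }
  int copies() const { return copies_; }

  // Read-only accessors: may trigger a sync copy, never change ownership.
  void cpu_data() { to_cpu(); }
  void gpu_data() { to_gpu(); }
  // Mutable accessors: mark the data as last modified on that device,
  // breaking out of SYNCED.
  void mutable_cpu_data() { to_cpu(); head_ = HEAD_AT_CPU; }
  void mutable_gpu_data() { to_gpu(); head_ = HEAD_AT_GPU; }

 private:
  void to_cpu() {
    switch (head_) {
      case UNINITIALIZED: head_ = HEAD_AT_CPU; break;       // first carrier: CPU
      case HEAD_AT_GPU:   ++copies_; head_ = SYNCED; break; // GPU newer: copy back
      case HEAD_AT_CPU:
      case SYNCED:        break;                            // nothing to do
    }
  }
  void to_gpu() {
    switch (head_) {
      case UNINITIALIZED: head_ = HEAD_AT_GPU; break;       // allocate on GPU directly
      case HEAD_AT_CPU:   ++copies_; head_ = SYNCED; break; // CPU newer: copy over
      case HEAD_AT_GPU:
      case SYNCED:        break;
    }
  }
  Head head_ = UNINITIALIZED;
  int copies_ = 0;
};
```

The copy counter makes the optimization visible: once the model reaches SYNCED, repeated read accesses perform no further copies until a mutable accessor restarts the automaton.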

The second-level I/O in Caffe involves synchronizing main memory data into video memory. The ordinary copy function cudaMemcpy uses the default stream, allowing only the main process to copy data to and from video memory. Restricting data replication to the main process limits multithreaded I/O: the copy generally requires blocking until it completes, and blocking the main process is very inefficient. To overcome these problems, Caffe uses asynchronous streams to synchronize main memory and video memory. Like the pipelined architecture of Intel CPUs, NVIDIA GPUs also adopt a pipelined approach that separates I/O from computation. With the asynchronous stream programming API, programmers may submit asynchronous copy streams to the GPU from multiple CPU-side threads, increasing I/O utilization on the GPU side.

