Deep Learning FPGA Implementation Basics 0 (Will FPGAs defeat GPUs and GPPs to become the future of deep learning?)

Source: Internet
Author: User

Requirement: background knowledge for implementing deep learning on FPGAs


Will FPGAs defeat GPUs and GPPs and become the future of deep learning?

In recent years, deep learning has become the most commonly used technique in key areas such as computer vision, speech recognition, and natural language processing, and has drawn great attention from industry. However, deep learning models demand enormous amounts of data and computation, and only better hardware acceleration can keep up as existing datasets and models continue to grow. Existing solutions use clusters of graphics processing units (GPUs) as general-purpose graphics processing units (GPGPUs), but field-programmable gate arrays (FPGAs) offer another solution worth exploring. Increasingly popular FPGA design tools are making FPGAs more compatible with the high-level software frequently used in deep learning, and therefore easier for model builders and deployers to adopt. The flexibility of the FPGA architecture allows researchers to explore model optimizations beyond the fixed architecture of the GPU. At the same time, FPGAs deliver stronger performance per watt, which is critical both for large-scale server deployments and for resource-constrained embedded applications. This article examines deep learning and FPGAs from a hardware-acceleration perspective, pointing out the trends and innovations that make these technologies a good match, and encouraging exploration of how FPGAs can help the field of deep learning.

Introduction

Machine learning has a profound impact on daily life. Whether you click on personalized recommendations on a website, use voice commands on your smartphone, or take photos using facial recognition, some form of AI technology is involved. This new wave of artificial intelligence is accompanied by a shift in algorithm design. In the past, data-based machine learning relied largely on domain expertise to hand-craft the "features" to be learned; by contrast, the ability of computers to extract combined feature representations from large amounts of sample data has enabled significant performance breakthroughs in key areas such as computer vision, speech recognition, and natural language processing. The study of these data-driven techniques, known as deep learning, is now being watched by two key groups in the technology community: researchers who want to train and use these models for extremely high-performance computing across tasks, and application scientists who want to deploy these models for new real-world applications. However, both face the same constraint: hardware acceleration must be strengthened before it can meet the need to scale up existing data and algorithms.

For deep learning, hardware acceleration mainly relies on clusters of graphics processing units (GPUs) used as general-purpose graphics processing units (GPGPUs). Compared with traditional general-purpose processors (GPPs), GPUs offer core computing power that is orders of magnitude higher and are much easier to use for parallel computation. In particular, NVIDIA CUDA, the most mainstream GPGPU programming platform, is used by every major deep learning tool for GPU acceleration. Recently, the open parallel programming standard OpenCL has attracted much attention as an alternative tool for heterogeneous hardware programming, and enthusiasm for these tools has soared. Although OpenCL's support in the deep learning field is somewhat weaker than CUDA's, OpenCL has two unique properties. First, OpenCL is open and free for developers, unlike CUDA's single-vendor approach. Second, OpenCL supports a range of hardware, including GPUs, GPPs, field-programmable gate arrays (FPGAs), and digital signal processors (DSPs).

OpenCL's support for diverse hardware is particularly important for FPGAs, which are strong competitors to GPUs for algorithm acceleration. FPGAs differ from GPUs in that their hardware configuration is flexible, and they can often deliver better performance than GPUs when running key subroutines in deep learning (for example, sliding-window computations). However, configuring FPGAs requires specialized hardware knowledge that many researchers and application scientists lack, and because of this, FPGAs are often regarded as a specialist architecture. Recently, FPGA tools have begun to adopt software-level programming models, including OpenCL, making them increasingly approachable for users trained in mainstream software development.
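To make the "sliding-window" subroutine mentioned above concrete, here is a minimal pure-Python sketch of a 2D "valid" convolution; the function name and the toy inputs are illustrative, not taken from any particular framework. The four nested loops are exactly the regular, deeply repetitive access pattern that pipelined FPGA logic handles well:

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over an image ('valid' mode, no padding):
    each output element is the sum of elementwise products of the
    kernel with the window of the image it currently covers."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1       # output height
    ow = len(image[0]) - kw + 1    # output width
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out

# A 3x3 image of ones convolved with a 2x2 kernel of ones:
# every 2x2 window sums to 4, giving a 2x2 output of fours.
result = conv2d_valid([[1.0] * 3 for _ in range(3)],
                      [[1.0] * 2 for _ in range(2)])
```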

For researchers comparing a range of design tools, the selection criteria usually come down to whether a tool offers user-friendly software development support, a flexible and scalable model design methodology, and the ability to compute quickly enough to reduce the training time of large models. As FPGAs become easier to program thanks to high-abstraction design tools, their reconfigurability makes customized architectures possible, while their high parallel computing power increases execution speed, so FPGAs stand to benefit deep learning researchers.

For application scientists, although similar tool-level options exist, the focus of hardware selection is maximizing performance per unit of energy, thereby reducing costs at large scale. FPGAs therefore appeal through their strong performance per watt and their ability to be customized with application-specific architectures.

FPGAs can meet the needs of both audiences and are thus a logical choice. This article examines the current status of deep learning on FPGAs and the technical developments currently bridging the gap between the two. It therefore has three main purposes. First, it points out that there is an opportunity to explore new hardware acceleration platforms in deep learning, and that the FPGA is an ideal candidate. Second, it outlines the current state of FPGA support for deep learning and identifies potential limitations. Finally, it puts forward key suggestions for the future direction of FPGA hardware acceleration, to help address the problems deep learning will face.


Traditionally, evaluating a hardware acceleration platform has meant weighing flexibility against performance. On one hand, general-purpose processors (GPPs) provide a high degree of flexibility and ease of use, but relatively low performance efficiency. These platforms are easy to obtain, can be produced at low cost, and are suitable for many uses and for reuse. On the other hand, application-specific integrated circuits (ASICs) provide high performance, but at the cost of flexibility, and are harder to produce. These circuits are dedicated to a particular application and are expensive and time-consuming to manufacture.

The FPGA sits between these two extremes. FPGAs belong to the more general family of programmable logic devices (PLDs); simply put, an FPGA is a reconfigurable integrated circuit. As a result, FPGAs offer both the performance advantages of integrated circuits and the reconfigurable flexibility of GPPs. FPGAs implement sequential logic using flip-flops (FFs) and combinational logic using lookup tables (LUTs). Modern FPGAs also contain hardened components for common functions, such as full processor cores, communication cores, compute cores, and block RAM (BRAM). In addition, current FPGA trends favor a system-on-chip (SoC) design approach in which an ARM coprocessor and the FPGA fabric sit on the same chip. The current FPGA market is dominated by Xilinx and Altera, which together hold more than 85% of the market share. FPGAs are also rapidly replacing ASICs and application-specific standard products (ASSPs) for implementing fixed-function logic. The FPGA market was expected to reach $10 billion in 2016.
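To illustrate what "combinational logic via a lookup table" means, here is a small Python sketch of a 4-input LUT, the basic building block just described: the LUT is nothing more than a 16-entry truth table addressed by the four input bits, so any 4-input Boolean function can be "configured" into it. The helper names are illustrative, not FPGA vendor APIs:

```python
def make_lut4(fn):
    """Build a 4-input LUT: precompute a 16-entry truth table for fn,
    then look up results by packing the four input bits into an index."""
    table = [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
             for i in range(16)]
    def lut(a, b, c, d):
        return table[(a << 3) | (b << 2) | (c << 1) | d]
    return lut

# "Configure" the LUT as a 4-input XOR (parity) function.
parity = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
```

The same `make_lut4` call with a different lambda would yield AND, majority-vote, or any other 4-input function, which is exactly why LUT fabrics are so flexible.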

For deep learning, FPGAs offer significant potential beyond the acceleration capability of traditional GPPs. GPP execution at the software level relies on the traditional von Neumann architecture, in which instructions and data are stored in external memory and fetched when needed. This drove the advent of caches, which greatly reduce the cost of expensive external memory operations. The bottleneck of this architecture is the communication between processor and memory, which severely limits GPP performance, especially for the stored information that deep learning techniques frequently need to access. By comparison, the FPGA's programmable logic primitives can implement the data and control paths of common logic functions directly, without relying on the von Neumann architecture. FPGAs can also exploit distributed on-chip memory, as well as deep pipeline parallelism, which aligns naturally with feedforward deep learning methods. Modern FPGAs also support partial dynamic reconfiguration: while one part of the FPGA is being reconfigured, the rest remains usable. This has implications for large-scale deep learning models, since each layer of the FPGA could be reconfigured without disturbing computations in progress on other layers. It would accommodate models that do not fit on a single FPGA, and could also reduce costly global memory reads by keeping intermediate results in on-chip storage.
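The payoff of the pipeline parallelism mentioned above can be shown with a little back-of-the-envelope arithmetic. If every layer of a feedforward network is mapped to its own hardware stage, a new batch can enter the pipeline each cycle, so throughput approaches one batch per cycle once the pipeline fills; a single shared compute unit must instead run every layer for every batch in sequence. The cycle model below is a deliberately simplified sketch (uniform one-cycle stages), not a timing model of any real device:

```python
def pipeline_cycles(num_batches, num_stages):
    # With each layer as its own pipeline stage, total time is the
    # fill time (num_stages - 1 cycles) plus one cycle per batch.
    return num_batches + num_stages - 1

def sequential_cycles(num_batches, num_stages):
    # A single shared unit runs every stage for every batch in turn.
    return num_batches * num_stages

# Streaming 100 batches through a 5-layer network:
pipelined = pipeline_cycles(100, 5)    # 104 cycles
serial = sequential_cycles(100, 5)     # 500 cycles
```

As the batch stream grows, the pipelined cost approaches one cycle per batch, a speedup approaching the number of stages.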

Most importantly, FPGAs offer a different perspective on hardware-accelerated design than GPUs. GPUs and other fixed architectures follow a software execution model and structure computation around executing tasks in parallel on independent compute units. Consequently, the goal when developing GPU code for deep learning is to adapt the algorithm to this model, so that computation is done in parallel and data dependencies are respected. In contrast, an FPGA architecture is tailored to the application. When developing deep learning techniques for FPGAs, there is less pressure to adapt the algorithm to a fixed computing structure, leaving more freedom to explore algorithm-level optimizations. Techniques that require many complex low-level hardware control operations, which are difficult to implement in high-level software languages, are especially attractive for FPGA implementation. However, this flexibility comes at the cost of long compilation (place-and-route) times, which is often a problem for researchers who need to iterate quickly through design cycles.

Beyond compile time, attracting researchers and application scientists who prefer high-level programming languages to FPGA development is particularly difficult. Although fluency in one software language often makes learning another easy, the same does not hold for hardware languages. The languages most commonly used for FPGAs are Verilog and VHDL, both hardware description languages (HDLs). The main difference between these languages and traditional software languages is that an HDL simply describes hardware, whereas software languages such as C describe sequential instructions without needing to know execution details at the hardware level. Describing hardware effectively requires expertise in digital design and circuits; although some low-level implementation decisions can be left to automated synthesis tools, the results often fall short of an efficient design. As a result, researchers and application scientists tend to choose software design, which is mature and offers a wealth of abstractions and conveniences that improve programmer productivity. These trends have made highly abstract design tools in the FPGA field more attractive than ever.

Milestones in FPGA deep learning research:

1987: VHDL becomes an IEEE standard

1992: GANGLION becomes the first FPGA neural network hardware implementation project (Cox et al.)

1994: Synopsys launches the first generation of FPGA behavioral synthesis solutions

1996: VIP becomes the first CNN implementation on an FPGA (Cloutier et al.)

2005: FPGA market value approaches $2 billion

2006: 5 GOPS of processing capability is first achieved on an FPGA using the BP algorithm

2011: Altera launches OpenCL support for FPGAs

A large-scale FPGA-based CNN algorithm study (Farabet et al.) is presented

2016: Building on the Microsoft Catapult project, FPGA-based CNN acceleration for data centers appears (Ovtcharov et al.)

Future prospects

The future of deep learning depends heavily on scalability, for FPGAs and in general. For these technologies to succeed on future problems, they must scale to support the rapidly growing size of data and architectures. FPGA technology is adapting to this trend: hardware is moving toward larger memories, smaller feature sizes, and better interconnects to accommodate multi-FPGA configurations. Intel's acquisition of Altera and IBM's collaboration with Xilinx both signal a shift in the FPGA landscape, and integration of FPGAs into personal applications and data center applications may follow soon. In addition, design tools are likely to evolve toward further abstraction and a more software-like experience, attracting users from a broader range of technical backgrounds.

Common deep learning software tools

Among the software tools most commonly used for deep learning, some already support CUDA and recognize the need to support OpenCL, which would make deep learning on FPGAs easier to achieve. Although, to our knowledge, no deep learning tool yet states explicit support for FPGAs, the following list shows which tools are moving toward OpenCL support:

Caffe, developed by the Berkeley Vision and Learning Center, has informal OpenCL support via its GreenTea project. Caffe also has an AMD version that supports OpenCL.

Torch, a scientific computing framework based on the Lua language, is widely used; its cltorch project provides informal OpenCL support.

Theano, developed by the University of Montreal, is developing a gpuarray backend that provides informal OpenCL support.

DeepCL is an OpenCL library developed by Hugh Perkins for training convolutional neural networks.

For those new to this field who want to choose a tool, our advice is to start with Caffe because it is very common, well supported, and user-friendly. Using Caffe's Model Zoo, it is also easy to experiment with pre-trained models.

Increased training freedom

Some people may assume that training a machine learning algorithm is fully automatic, but in practice there are hyper-parameters that need tuning. This is especially true for deep learning, where model complexity in terms of parameter count is often accompanied by a large number of possible hyper-parameter combinations. Tunable hyper-parameters include the number of training iterations, the learning rate, the batch size, the number of hidden units, the number of layers, and so on. Tuning these parameters amounts to selecting, among all possible models, the one best suited to a problem. Traditionally, hyper-parameters are set either by experience or by systematic grid search or the more efficient random search. Recently, researchers have turned to adaptive methods that use the results of previous hyper-parameter trials to guide the next configuration. Among these, Bayesian optimization is the most common.
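The random search mentioned above can be sketched in a few lines. This is a minimal illustration, not the procedure from any particular paper: the hyper-parameter ranges and the toy objective standing in for a real training run are assumptions chosen purely for demonstration:

```python
import math
import random

def sample_config(rng):
    # Draw one hyper-parameter configuration at random;
    # the learning rate is sampled log-uniformly, as is common practice.
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "hidden_units": rng.choice([64, 128, 256, 512]),
        "num_layers": rng.randint(1, 4),
        "batch_size": rng.choice([16, 32, 64, 128]),
    }

def random_search(evaluate, n_trials=50, seed=0):
    """Evaluate n_trials random configurations; keep the best score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)  # in practice: validation loss after training
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Stand-in objective with a known optimum near learning_rate = 0.01,
# used here instead of an actual (expensive) training run.
def toy_loss(cfg):
    return (math.log10(cfg["learning_rate"]) + 2) ** 2 \
        + 0.001 * cfg["hidden_units"]

best, score = random_search(toy_loss)
```

Grid search would enumerate a fixed lattice of these same ranges; the adaptive methods mentioned above instead feed each trial's score back into the choice of the next configuration.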

Whichever method is used to tune hyper-parameters, the current practice of training with a fixed architecture limits the model space to some extent; that is, we may be glimpsing only a fraction of all possible solutions. Fixed architectures make it easy to explore hyper-parameter settings within a model (for example, the number of hidden units or layers), but hard to explore settings across different models (for example, different model classes), because training a model that does not readily fit the fixed architecture can take a long time. The FPGA's flexible architecture may be better suited to this kind of optimization, since a completely different hardware architecture can be written and accelerated at runtime.

Low-power compute node clusters

The most fascinating aspect of deep learning models is their ability to scale. Deep learning is often extended across multi-node computing infrastructure, whether to discover complex high-level features in data or to increase throughput for data center applications. Current solutions use GPU clusters with MPI over InfiniBand interconnects to provide high-level parallel computing power and fast data transfer between nodes. However, as the workloads of large-scale applications grow increasingly heterogeneous, FPGAs may be the better approach. The FPGA's programmable fabric allows the system to be reconfigured per application and per workload, while the FPGA's low power consumption helps lower costs for the next generation of data centers.


Compared with GPUs and GPPs, FPGAs provide an attractive alternative for meeting the hardware requirements of deep learning. With pipelined parallel computing capability and efficient energy consumption, FPGAs exhibit unique advantages that GPUs and GPPs lack for general deep learning applications. At the same time, design tools are maturing, and it is now possible to integrate FPGAs into common deep learning frameworks. In the future, FPGAs can effectively adapt to the development trends of deep learning and ensure that related applications and research can proceed freely.

