Translation of FPGA Implementations of Neural Networks - Preface


2021.09.29 Rough translation completed by Jzjerry









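Chapter 1 reviews the basics of artificial neural network theory, discusses different approaches to implementing neural networks in hardware (both ASIC and FPGA technologies are covered, with the emphasis on the special characteristics of artificial neural networks), and closes with a brief note on performance evaluation. Particular attention goes to exploiting the parallelism inherent in neural networks and to implementing arithmetic functions appropriately, above all the sigmoid function, for which the chapter makes a significant contribution.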


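A basic problem in every form of parallel computing is how best to map an application onto hardware. The relatively rigid interconnection structure between the basic computing cells of an FPGA makes this problem harder still. Chapters 3 and 4 address it, presenting a theoretical and practical framework that reconciles simple hardware topologies with complex neural network structures. The basic idea is the Field Programmable Neural Array (FPNA), which uses a simplified topology and an original data exchange scheme to produce powerful neural network structures that map easily onto FPGAs. Chapter 3 gives the basic definitions and results of the theoretical framework, and Chapter 4 shows how FPNAs map powerful neural network structures onto digital hardware with little effort. Applications and implementations are described, focusing on a class.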

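Chapter 5 presents a systolic architecture for implementing the back propagation algorithm. It is the first implementation of back propagation that fully parallelizes the computation of the entire learning phase. The array has been implemented successfully on an Annapolis FPGA coprocessor, where it reaches a very favorable performance in the range of 5 GOPS. The new design proposed in the chapter targets Virtex boards. The chapter describes how these high-performance architectures are derived automatically with the systolic array design tool MMAlpha, which facilitates system specification. The tool makes it easy to specify the system in a very high-level language (Alpha), and it also supports design exploration to obtain architectures whose performance is comparable to that of hand-optimized VHDL code.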

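Associative networks have a number of properties, including rapid, compute-efficient best-match and intrinsic fault tolerance, that make them ideal for many applications. Large networks, however, can be slow to emulate because of their storage and bandwidth requirements. Chapter 6 presents a simple but effective associative network model and then discusses the implementation of this model on a single high-end computer workstation, a computer cluster, and FPGA hardware, together with a performance analysis of each.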

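Chapter 7 describes the implementation of an artificial neural network on an FPGA-based reconfigurable parallel computer architecture called the Reconfigurable Orthogonal Memory Multiprocessor (REOMP). This architecture connects $p^2$ memory modules to $p$ reconfigurable processors in row access mode and column access mode. REOMP is considered as an alternative model of the neural network neocognitron. The chapter consists of a description of REOMP, a case study of alternative neocognitron mapping, and a performance analysis of systems made up of 1 to 64 processors.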

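Chapter 8 presents an efficient architecture for the Kohonen Self-Organizing Feature Map (SOFM) based on a new Frequency Adaptive Learning (FAL) algorithm, which efficiently replaces the neighborhood adaptation function of the conventional SOFM. The proposed SOFM architecture is prototyped on a Xilinx Virtex FPGA using the prototyping environment provided by XESS. A robust functional verification environment was developed for rapid prototype development. The chapter also gives experimental results for the quantization of a 512 × 512 pixel color image.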

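Chapter 9 discusses another implementation of SOFMs on reconfigurable hardware. A hardware accelerator for SOFMs has been developed on the basis of the universal rapid prototyping system RAPTOR2000. Using Xilinx Virtex-E FPGAs, RAPTOR2000 can emulate hardware implementations with a complexity of more than 15 million system gates. RAPTOR2000 is linked to its host, a personal computer or workstation, via the PCI bus. With five FPGA modules on RAPTOR2000, typical SOFM applications run up to 190 times faster than a software implementation on a state-of-the-art personal computer.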

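Chapter 10 presents several hardware implementations of a standard Multi-Layer Perceptron (MLP) and of a modified version called the eXtended Multi-Layer Perceptron (XMLP). The extended version is an MLP-like feed-forward network with two-dimensional layers and configurable connection pathways. The chapter discusses hardware implementations developed and tested on an FPGA prototyping board, system specifications at two abstraction levels, register transfer level (VHDL) and a higher algorithmic level (Handel-C), and the exploitation of varying degrees of parallelism. The main test application is speech recognition.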





During the 1980s and early 1990s there was significant work in the design and implementation of hardware neurocomputers. Nevertheless, most of these efforts may be judged to have been unsuccessful: at no time have hardware neurocomputers been in wide use. This lack of success may be largely attributed to the fact that earlier work was almost entirely aimed at developing custom neurocomputers, based on ASIC technology, but for such niche areas this technology was never sufficiently developed or competitive enough to justify large-scale adoption. On the other hand, gate-arrays of the period mentioned were never large enough nor fast enough for serious artificial-neural-network (ANN) applications. But technology has now improved: the capacity and performance of current FPGAs are such that they present a much more realistic alternative. Consequently, neurocomputers based on FPGAs are now a much more practical proposition than they have been in the past. This book summarizes some work towards this goal and consists of 12 papers that were selected, after review, from a number of submissions. The book is nominally divided into three parts: Chapters 1 through 4 deal with foundational issues; Chapters 5 through 11 deal with a variety of implementations; and Chapter 12 looks at the lessons learned from a large-scale project and also reconsiders design issues in light of current and future technology.
Chapter 1 reviews the basics of artificial-neural-network theory, discusses various aspects of the hardware implementation of neural networks (in both ASIC and FPGA technologies, with a focus on special features of artificial neural networks), and concludes with a brief note on performance-evaluation.
Special points are the exploitation of the parallelism inherent in neural networks and the appropriate implementation of arithmetic functions, especially the sigmoid function. With respect to the sigmoid function, the chapter includes a significant contribution.
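As a rough illustration of why the sigmoid needs special treatment in hardware, the sketch below compares the exact function with a piecewise-linear approximation whose slopes are powers of two, so each segment needs only a shift and an add. The breakpoints and coefficients are an assumption taken from a commonly cited piecewise-linear scheme; they are not the method contributed in Chapter 1, and the names `sigmoid_pwl` and `max_err` are hypothetical.

```python
import math

def sigmoid(x):
    """Reference sigmoid, computed in floating point."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_pwl(x):
    """Piecewise-linear sigmoid with power-of-two slopes (assumed coefficients)."""
    a = abs(x)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375   # slope 2**-5
    elif a >= 1.0:
        y = 0.125 * a + 0.625       # slope 2**-3
    else:
        y = 0.25 * a + 0.5          # slope 2**-2
    # Use the symmetry sigmoid(-x) = 1 - sigmoid(x) for negative inputs.
    return y if x >= 0 else 1.0 - y

# Measure the worst-case absolute error on a grid over [-8, 8].
grid = [i / 100.0 for i in range(-800, 801)]
max_err = max(abs(sigmoid_pwl(x) - sigmoid(x)) for x in grid)
print(f"max abs error on [-8, 8]: {max_err:.4f}")
```

Whether an error of this size is acceptable depends on the network and the task, which is the kind of tradeoff the chapter examines.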
Certain sequences of arithmetic operations form the core of neural-network computations, and the second chapter deals with a foundational issue: how to determine the numerical precision format that allows an optimum tradeoff between precision and implementation (cost and performance). Standard single or double precision floating-point representations minimize quantization errors while requiring significant hardware resources. Less precise fixed-point representations may require fewer hardware resources but add quantization errors that may prevent learning from taking place, especially in regression problems. Chapter 2 examines this issue and reports on a recent experiment where we implemented a multi-layer perceptron on an FPGA using both fixed-point and floating-point precision.
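One way quantization error can stop learning is that a weight update smaller than half of the least significant bit rounds away entirely. The sketch below is a minimal illustration under assumed values (a starting weight of 0.5, a constant update of 0.001, and 8 or 16 fractional bits); the helper names `to_fixed` and `accumulate_updates` are hypothetical, and this is not the experiment reported in Chapter 2.

```python
def to_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (overflow is not modeled)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def accumulate_updates(frac_bits, start=0.5, delta=0.001, steps=100):
    """Apply the same small update repeatedly, re-quantizing the weight each time."""
    w = to_fixed(start, frac_bits)
    for _ in range(steps):
        w = to_fixed(w + delta, frac_bits)
    return w

w_q8 = accumulate_updates(frac_bits=8)    # step size 1/256: the update is lost
w_q16 = accumulate_updates(frac_bits=16)  # step size 1/65536: the update survives
w_float = 0.5 + 100 * 0.001               # floating-point reference

print(w_q8, w_q16, w_float)
```

With 8 fractional bits the weight never leaves its starting value, because 0.001 is less than half of 1/256; with 16 fractional bits it tracks the floating-point result closely.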

A basic problem in all forms of parallel computing is how best to map applications onto hardware. In the case of FPGAs the difficulty is aggravated by the relatively rigid interconnection structures of the basic computing cells. Chapters 3 and 4 consider this problem: an appropriate theoretical and practical framework to reconcile simple hardware topologies with complex neural architectures is discussed. The basic concept is that of Field Programmable Neural Arrays (FPNA), which lead to powerful neural architectures that are easy to map onto FPGAs, by means of a simplified topology and an original data exchange scheme. Chapter 3 gives the basic definitions and results of the theoretical framework, and Chapter 4 shows how FPNAs lead to powerful neural architectures that are easy to map onto digital hardware. Applications and implementations are described, focusing on a class.

Chapter 5 presents a systolic architecture for the complete back propagation algorithm. This is the first such implementation of the back propagation algorithm that completely parallelizes the entire computation of the learning phase. The array has been implemented on an Annapolis FPGA-based coprocessor, and it achieves very favorable performance in the range of 5 GOPS. The proposed new design targets Virtex boards. A description is given of the process of automatically deriving these high-performance architectures using the systolic array design tool MMAlpha, which facilitates system specification. This makes it easy to specify the system in a very high-level language (Alpha) and also allows one to perform design exploration to obtain architectures whose performance is comparable to that obtained using hand-optimized VHDL code.

Associative networks have a number of properties, including a rapid, compute-efficient best-match and intrinsic fault tolerance, that make them ideal for many applications. However, large networks can be slow to emulate because of their storage and bandwidth requirements. Chapter 6 presents a simple but effective model of association and then discusses a performance analysis of the implementation of this model on a single high-end PC workstation, a PC cluster, and FPGA hardware.

Chapter 7 describes the implementation of an artificial neural network in a reconfigurable parallel computer architecture using FPGAs, named Reconfigurable Orthogonal Memory Multiprocessor (REOMP), which uses $p^2$ memory modules connected to $p$ reconfigurable processors, in row access mode and column access mode. REOMP is considered as an alternative model of the neural network neocognitron. The chapter consists of a description of the REOMP architecture, a case study of alternative neocognitron mapping, and a performance analysis with systems consisting of 1 to 64 processors.
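The row and column access modes can be pictured with a small model. The sketch below assumes the usual orthogonal-memory arrangement, in which processor $i$ reaches the $p$ modules of row $i$ in row access mode and the $p$ modules of column $i$ in column access mode; the preface does not spell out this mapping, so it is an assumption, as are the names `accessible_modules`, `memory` and `inbox`.

```python
def accessible_modules(proc, p, mode):
    """Return the (row, col) memory modules processor `proc` can reach in `mode`."""
    if mode == "row":
        return [(proc, j) for j in range(p)]
    if mode == "column":
        return [(i, proc) for i in range(p)]
    raise ValueError(f"unknown access mode: {mode}")

p = 4
memory = [[None] * p for _ in range(p)]  # p * p memory modules

# Row access mode: processor i leaves one message per destination in its own row.
for i in range(p):
    for row, col in accessible_modules(i, p, "row"):
        memory[row][col] = f"from {i} to {col}"

# Column access mode: processor j collects everything addressed to it from its column.
inbox = {
    j: [memory[row][col] for row, col in accessible_modules(j, p, "column")]
    for j in range(p)
}

print(inbox[2])
```

In this model no two processors touch the same module within one mode, so an all-to-all exchange takes one row-mode phase followed by one column-mode phase.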

Chapter 8 presents an efficient architecture for the Kohonen Self-Organizing Feature Map (SOFM) based on a new Frequency Adaptive Learning (FAL) algorithm, which efficiently replaces the neighborhood adaptation function of the conventional SOFM. The proposed SOFM architecture is prototyped on a Xilinx Virtex FPGA using the prototyping environment provided by XESS. A robust functional verification environment is developed for rapid prototype development. Various experimental results are given for the quantization of a 512 × 512 pixel color image.

Chapter 9 discusses another implementation of SOFMs in reconfigurable hardware. Based on the universal rapid prototyping system RAPTOR2000, a hardware accelerator for self-organizing feature maps has been developed. Using Xilinx Virtex-E FPGAs, RAPTOR2000 is capable of emulating hardware implementations with a complexity of more than 15 million system gates. RAPTOR2000 is linked to its host, a standard personal computer or workstation, via the PCI bus. For typical applications of SOFMs, a speed-up of up to 190 is achieved with five FPGA modules on the RAPTOR2000 system compared to a software implementation on a state-of-the-art personal computer.

Chapter 10 presents several hardware implementations of a standard Multi-Layer Perceptron (MLP) and a modified version called eXtended Multi-Layer Perceptron (XMLP). This extended version is an MLP-like feed-forward network with two-dimensional layers and configurable connection pathways. The discussion includes a description of hardware implementations that have been developed and tested on an FPGA prototyping board, and it covers system specifications using two different abstraction levels, register transfer level (VHDL) and a higher algorithmic-like level (Handel-C), as well as the exploitation of varying degrees of parallelism. The main test bed application is speech recognition.

Chapter 11 describes the implementation of a systolic array for a non-linear predictor for image and video compression. The implementation is based on a multilayer perceptron with a hardware-friendly learning algorithm. It is shown that even with relatively modest FPGA devices, the architecture attains the speeds necessary for real-time training in video applications and enables more typical applications to be added to the image compression processing.

The final chapter consists of a retrospective look at the REMAP project, which covered the design, implementation, and use of large-scale parallel architectures for neural-network applications. The chapter gives an overview of the computational requirements found in algorithms in general and motivates the use of regular processor arrays for the efficient execution of such algorithms. The architecture, following the SIMD principle (Single Instruction stream, Multiple Data streams), is described, as well as the mapping of some important and representative ANN algorithms. Implemented in FPGA, the system served as an architecture laboratory. Variations of the architecture are discussed, as well as the scalability of fully synchronous SIMD architectures. The design principles of a VLSI-implemented successor of REMAP-β are described, and the paper concludes with a discussion of how the more powerful FPGA circuits of today could be used in a similar architecture.

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2020-2025 Jzjerry Jiang
