Translation of FPGA Implementations of Neural Networks - Preface

2021-09-29

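Chinese translation of the preface to FPGA Implementations of Neural Networks

2021.09.29: rough translation completed by Jzjerry

Translated by guesswork, with DeepL proofreading throughout

This translation was done entirely on a whim: I happened to be preparing for the TOEFL and emailing prospective advisors for my applications, happened to take an interest in a lab working in this direction, happened to be about to write a research proposal (RP), and happened to find this book while searching for highly cited literature.

Gripes:

First, the extremely long and difficult sentences peculiar to engineering writing, the nested clauses, and the heavy use of the passive voice. I have to admire the greatness of the TOEFL writing rubric; it is hard to keep a straight face when confronted with a clause four lines long, but perhaps that is simply academic writing. I can foresee that when I write papers in the future I will become the kind of person I used to hate.

Second, I had assumed my knowledge of hardware would be enough to carry the translation of technical terms, but the density of machine learning and deep learning terms in this preface turned out to be far higher than that of FPGA and hardware terms. With deep learning knowledge scarcer than the chips in a bag of Lay's, finding the correct rendering of each term was genuinely hard, so for most technical terms I put the original in parentheses after the translation. Please forgive any mistakes.

Finally, the length: I never expected a preface to run to 2,400+ characters in translation. Clearly there is a long road ahead for anyone who wants to study this area. I had also planned to translate the table of contents, but in the end laziness won.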

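Preface

In the 1980s and early 1990s there was a great deal of significant work on the design and implementation of neurocomputers. However, most of these efforts have been judged failures: hardware neurocomputers were never put into widespread use. The lack of success stems largely from the fact that the early work aimed almost exclusively at developing custom neurocomputers based on ASIC technology, yet in this particular niche ASIC technology never matured enough, or became competitive enough, to achieve large-scale adoption. On the other hand, the gate arrays of that period were neither large enough nor fast enough for serious artificial neural network (ANN) applications. But the technology has now improved: the capacity and performance of today's FPGAs make them a far more realistic alternative. Designing neurocomputers on FPGAs is therefore a more practical proposition than it was in the past. This book summarizes some of the work toward that goal and consists of 12 papers selected after review from many submissions. The book is nominally divided into three parts: Chapters 1 to 4 deal with foundational issues; Chapters 5 to 11 describe a variety of implementations; and Chapter 12 presents the lessons learned from a large-scale project and reconsiders design issues in light of current and future technology.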

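Chapter 1 reviews the basics of artificial neural network theory, discusses different approaches to implementing neural networks in hardware (covering both ASIC and FPGA technology, with the focus on the special characteristics of artificial neural networks), and closes with a brief note on performance evaluation. Particular attention goes to exploiting the parallelism inherent in neural networks and to suitable implementations of arithmetic functions, above all the sigmoid function, on which the chapter makes a significant contribution.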
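
To make the sigmoid remark concrete, here is a minimal sketch of one hardware-friendly way to approximate the function. It is not taken from the book and is not claimed to be the chapter's method. It uses a four-segment piecewise-linear approximation whose slopes are powers of two, with breakpoints and coefficients as commonly given for the PLAN approximation (an assumption on my part). The names `sigmoid`, `plan_sigmoid`, and `max_abs_error` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Reference logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def plan_sigmoid(x: float) -> float:
    """Piecewise-linear sigmoid approximation (assumed PLAN coefficients).

    Every slope is a power of two, so in hardware each multiplication
    could be replaced by a shift. Illustrative only.
    """
    a = abs(x)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375   # slope 2**-5
    elif a >= 1.0:
        y = 0.125 * a + 0.625       # slope 2**-3
    else:
        y = 0.25 * a + 0.5          # slope 2**-2
    # Use the symmetry sigmoid(-x) = 1 - sigmoid(x) for negative inputs.
    return y if x >= 0 else 1.0 - y


def max_abs_error(lo: float = -8.0, hi: float = 8.0, steps: int = 1601) -> float:
    """Largest gap between the approximation and the reference on a grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(plan_sigmoid(x) - sigmoid(x)) for x in xs)


if __name__ == "__main__":
    print(f"max |error| on [-8, 8]: {max_abs_error():.4f}")
```

The trade-off on display is the one the preface keeps returning to: a few comparisons and shifts replace an exponential and a division, at the price of an approximation error of a little under 0.02.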

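Certain sequences of arithmetic operations form the core of neural network computation, and Chapter 2 addresses a foundational question: how to choose the numerical precision format that gives the best trade-off between precision and implementation (cost and performance). Standard single- or double-precision floating-point representations minimize quantization error but require substantial hardware resources. Lower-precision fixed-point representations may need fewer hardware resources, but they add quantization error that can prevent the network from learning at all, especially on regression problems. Chapter 2 examines this issue and reports on a recent experiment in which we implemented a multilayer perceptron on an FPGA using both floating-point and fixed-point arithmetic.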
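
The claim that quantization error can stop learning is easy to see in a toy setting. The sketch below is not the Chapter 2 experiment; it is a small illustration under my own assumptions (a single weight, a constant update of 0.0004, round-to-nearest quantization), and the helper names `to_fixed` and `apply_updates` are hypothetical.

```python
def to_fixed(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def apply_updates(weight: float, delta: float, steps: int, frac_bits: int) -> float:
    """Add `delta` to `weight` `steps` times, re-quantizing after every step."""
    w = to_fixed(weight, frac_bits)
    for _ in range(steps):
        w = to_fixed(w + delta, frac_bits)
    return w


if __name__ == "__main__":
    # Ideal (unquantized) result after 100 updates would be 0.5 + 0.04 = 0.54.
    for bits in (8, 16):
        print(bits, apply_updates(0.5, 0.0004, 100, bits))
```

With 8 fractional bits the resolution is 1/256, so every update of 0.0004 rounds away and the weight never leaves 0.5. With 16 fractional bits the weight does move and ends close to the ideal 0.54.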

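A basic problem in every form of parallel computing is how best to map an application onto the hardware. On FPGAs the problem is made harder by the relatively rigid interconnect between the basic computing cells. Chapters 3 and 4 take up this problem and present a theoretical and practical framework for reconciling simple hardware topologies with complex neural network architectures. The basic idea is the Field Programmable Neural Array (FPNA), which uses a simplified topology and an original data exchange scheme to produce powerful neural architectures that map easily onto FPGAs. Chapter 3 gives the basic definitions and results of the theoretical framework, and Chapter 4 shows how FPNAs map powerful neural architectures onto digital hardware in a simple way. Applications and implementations are also described, focusing on a class.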

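Chapter 5 presents a systolic array architecture for implementing the backpropagation algorithm. It is the first implementation of backpropagation that fully parallelizes the entire computation of the learning phase. The array has been implemented on an Annapolis FPGA coprocessor, where it reaches a very favorable performance in the range of 5 GOPS. The new design proposed in the chapter targets Virtex boards. The chapter describes how these high-performance architectures are derived automatically with the systolic array design tool MMAlpha, which facilitates system specification: the system can be specified in a very high-level language (Alpha), and design exploration can be performed to obtain architectures whose performance is comparable to that of hand-optimized VHDL code.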

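Associative networks have many properties, including fast, computationally efficient best-match and intrinsic fault tolerance, that make them ideal for many applications. However, large networks can be slow to emulate because of their storage and bandwidth requirements. Chapter 6 presents a simple but effective model of association, then discusses implementations of the model on a single high-end PC workstation, a PC cluster, and FPGA hardware, together with a performance analysis of each.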

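Chapter 7 describes the implementation of an artificial neural network on an FPGA-based reconfigurable parallel computer architecture called the Reconfigurable Orthogonal Memory Multiprocessor (REOMP). The architecture connects $p^2$ memory modules to $p$ reconfigurable processors in a row access mode and a column access mode. REOMP is considered as an alternative model of the neocognitron neural network. The chapter consists of a description of REOMP, a case study of the alternative neocognitron mapping, and a performance analysis of systems with 1 to 64 processors.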
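
The preface gives only the counts ($p^2$ memory modules, $p$ processors) and the two access modes. The sketch below assumes the usual orthogonal-memory arrangement, in which the modules form a $p \times p$ grid, processor $i$ reaches row $i$ in row mode and column $i$ in column mode. That layout is my assumption rather than something stated in the text, and the function names are hypothetical.

```python
from typing import List, Tuple


def accessible_modules(proc: int, p: int, mode: str) -> List[Tuple[int, int]]:
    """Memory modules (row, col) that processor `proc` may touch in `mode`."""
    if not 0 <= proc < p:
        raise ValueError("processor index out of range")
    if mode == "row":
        return [(proc, col) for col in range(p)]
    if mode == "column":
        return [(row, proc) for row in range(p)]
    raise ValueError("mode must be 'row' or 'column'")


def is_conflict_free(p: int, mode: str) -> bool:
    """True if no module is shared by two processors and all p*p are covered."""
    seen = [m for proc in range(p) for m in accessible_modules(proc, p, mode)]
    return len(seen) == len(set(seen)) == p * p


if __name__ == "__main__":
    print(accessible_modules(1, 3, "row"))
    print(accessible_modules(1, 3, "column"))
    print(is_conflict_free(4, "row"), is_conflict_free(4, "column"))
```

Under this assumption each mode partitions the grid among the processors, so all $p$ processors can access memory at once without contention, and switching modes lets data written by rows be read back by columns.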

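Chapter 8 presents an efficient architecture for the Kohonen Self-Organizing Feature Map (SOFM) based on a new Frequency Adaptive Learning (FAL) algorithm, which efficiently replaces the neighborhood adaptation function of the conventional SOFM. The proposed SOFM architecture is prototyped on a Xilinx Virtex FPGA using the prototyping environment provided by XESS. A robust functional verification environment was developed for rapid prototype development. The chapter also gives experimental results for the quantization of a 512 × 512 pixel color image.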

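Chapter 9 discusses another implementation of SOFMs on reconfigurable hardware. A hardware accelerator for SOFMs was developed on the universal rapid prototyping system RAPTOR2000. Using Xilinx Virtex-E FPGAs, RAPTOR2000 can emulate hardware implementations with a complexity of more than 15 million system gates. RAPTOR2000 is connected to its host, a standard personal computer or workstation, over the PCI bus. With five FPGA modules on RAPTOR2000, typical SOFM applications run up to 190 times faster than a software implementation on a state-of-the-art personal computer.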

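Chapter 10 presents several hardware implementations of a standard multilayer perceptron (MLP) and of a modified version called the eXtended Multi-Layer Perceptron (XMLP). The extended version is an MLP-like feed-forward network with two-dimensional layers and configurable connection pathways. The chapter discusses hardware implementations developed and tested on an FPGA prototyping board, system specifications at two abstraction levels, register transfer level (VHDL) and a higher algorithmic level (Handel-C), and the exploitation of varying degrees of parallelism. The main test-bed application is speech recognition.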

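Chapter 11 describes a systolic array implementation of a nonlinear predictor for image and video compression. The implementation is based on a multilayer perceptron with a hardware-friendly learning algorithm. The results show that even on relatively modest FPGA devices the architecture reaches the speed needed for real-time training in video applications, and that it allows more typical applications to be added to the image compression processing.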

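Chapter 12, the final chapter, looks back at the REMAP project, which covered the design, implementation, and use of large-scale parallel architectures for neural network applications. The chapter gives an overview of the computational requirements of the algorithms and motivates the use of regular processor arrays to execute them efficiently. It describes the architecture, which follows the SIMD (Single Instruction stream, Multiple Data streams) principle, and the mapping of some important and representative ANN algorithms onto it. Implemented in FPGAs, the system served as an architecture laboratory. The chapter discusses variations of the architecture and the scalability of fully synchronous SIMD architectures. It also describes the design principles of a VLSI-implemented successor of REMAP-$\beta$ and closes with a discussion of how today's more powerful FPGA circuits could be used in a similar architecture.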

Authors: AMOS R. OMONDI & JAGATH C. RAJAPAKSE

Original text for reference

During the 1980s and early 1990s there was significant work in the design and implementation of hardware neurocomputers. Nevertheless, most of these efforts may be judged to have been unsuccessful: at no time have hardware neurocomputers been in wide use. This lack of success may be largely attributed to the fact that earlier work was almost entirely aimed at developing custom neurocomputers, based on ASIC technology, but for such niche areas this technology was never sufficiently developed or competitive enough to justify large-scale adoption. On the other hand, gate-arrays of the period mentioned were never large enough nor fast enough for serious artificial-neural network (ANN) applications. But technology has now improved: the capacity and performance of current FPGAs are such that they present a much more realistic alternative. Consequently neurocomputers based on FPGAs are now a much more practical proposition than they have been in the past. This book summarizes some work towards this goal and consists of 12 papers that were selected, after review, from a number of submissions. The book is nominally divided into three parts: Chapters 1 through 4 deal with foundational issues; Chapters 5 through 11 deal with a variety of implementations; and Chapter 12 looks at the lessons learned from a large-scale project and also reconsiders design issues in light of current and future technology.

Chapter 1 reviews the basics of artificial-neural-network theory, discusses various aspects of the hardware implementation of neural networks (in both ASIC and FPGA technologies, with a focus on special features of artificial neural networks), and concludes with a brief note on performance-evaluation. Special points are the exploitation of the parallelism inherent in neural networks and the appropriate implementation of arithmetic functions, especially the sigmoid function. With respect to the sigmoid function, the chapter includes a significant contribution.

Certain sequences of arithmetic operations form the core of neural-network computations, and the second chapter deals with a foundational issue: how to determine the numerical precision format that allows an optimum tradeoff between precision and implementation (cost and performance). Standard single or double precision floating-point representations minimize quantization errors while requiring significant hardware resources. Less precise fixed-point representation may require less hardware resources but add quantization errors that may prevent learning from taking place, especially in regression problems. Chapter 2 examines this issue and reports on a recent experiment where we implemented a multi-layer perceptron on an FPGA using both fixed and floating point precision.

A basic problem in all forms of parallel computing is how best to map applications onto hardware. In the case of FPGAs the difficulty is aggravated by the relatively rigid interconnection structures of the basic computing cells. Chapters 3 and 4 consider this problem: an appropriate theoretical and practical framework to reconcile simple hardware topologies with complex neural architectures is discussed. The basic concept is that of Field Programmable Neural Arrays (FPNA) that lead to powerful neural architectures that are easy to map onto FPGAs, by means of a simplified topology and an original data exchange scheme. Chapter 3 gives the basic definition and results of the theoretical framework. And Chapter 4 shows how FPNAs lead to powerful neural architectures that are easy to map onto digital hardware. Applications and implementations are described, focusing on a class.

Chapter 5 presents a systolic architecture for the complete back propagation algorithm. This is the first such implementation of the back propagation algorithm which completely parallelizes the entire computation of the learning phase. The array has been implemented on an Annapolis FPGA based coprocessor and it achieves very favorable performance in the range of 5 GOPS. The proposed new design targets Virtex boards. A description is given of the process of automatically deriving these high performance architectures using the systolic array design tool MMAlpha, which facilitates system specification. This makes it easy to specify the system in a very high level language (Alpha) and also allows one to perform design exploration to obtain architectures whose performance is comparable to that obtained using hand optimized VHDL code.

Associative networks have a number of properties, including a rapid, compute efficient best-match and intrinsic fault tolerance, that make them ideal for many applications. However, large networks can be slow to emulate because of their storage and bandwidth requirements. Chapter 6 presents a simple but effective model of association and then discusses a performance analysis of the implementation of this model on a single high-end PC workstation, a PC cluster, and FPGA hardware.

Chapter 7 describes the implementation of an artificial neural network in a reconfigurable parallel computer architecture using FPGAs, named Reconfigurable Orthogonal Memory Multiprocessor (REOMP), which uses $p^2$ memory modules connected to $p$ reconfigurable processors, in row access mode and column access mode. REOMP is considered as an alternative model of the neural network neocognitron. The chapter consists of a description of the REOMP architecture, a case study of alternative neocognitron mapping, and a performance analysis with systems consisting of 1 to 64 processors.

Chapter 8 presents an efficient architecture of Kohonen Self-Organizing Feature Map (SOFM) based on a new Frequency Adaptive Learning (FAL) algorithm which efficiently replaces the neighborhood adaptation function of the conventional SOFM. The proposed SOFM architecture is prototyped on a Xilinx Virtex FPGA using the prototyping environment provided by XESS. A robust functional verification environment is developed for rapid prototype development. Various experimental results are given for the quantization of a 512 × 512 pixel color image.

Chapter 9 consists of another discussion of an implementation of SOFMs in reconfigurable hardware. Based on the universal rapid prototyping system, RAPTOR2000, a hardware accelerator for self-organizing feature maps has been developed. Using Xilinx Virtex-E FPGAs, RAPTOR2000 is capable of emulating hardware implementations with a complexity of more than 15 million system gates. RAPTOR2000 is linked to its host – a standard personal computer or workstation – via the PCI bus. A speed-up of up to 190 is achieved with five FPGA modules on the RAPTOR2000 system compared to a software implementation on a state of the art personal computer for typical applications of SOFMs.

Chapter 10 presents several hardware implementations of a standard Multi Layer Perceptron (MLP) and a modified version called eXtended Multi-Layer Perceptron (XMLP). This extended version is an MLP-like feed-forward network with two-dimensional layers and configurable connection pathways. The discussion includes a description of hardware implementations that have been developed and tested on an FPGA prototyping board and includes systems specifications using two different abstraction levels: register transfer level (VHDL) and a higher algorithmic-like level (Handel-C) as well as the exploitation of varying degrees of parallelism. The main test bed application is speech recognition.

Chapter 11 describes the implementation of a systolic array for a non-linear predictor for image and video compression. The implementation is based on a multilayer perceptron with a hardware-friendly learning algorithm. It is shown that even with relatively modest FPGA devices, the architecture attains the speeds necessary for real-time training in video applications and enables more typical applications to be added to the image compression processing.

The final chapter consists of a retrospective look at the REMAP project, which was the construction of design, implementation, and use of large-scale parallel architectures for neural-network applications. The chapter gives an overview of the computational requirements found in algorithms in general and motivates the use of regular processor arrays for the efficient execution of such algorithms. The architecture, following the SIMD principle (Single Instruction stream, Multiple Data streams), is described, as well as the mapping of some important and representative ANN algorithms. Implemented in FPGA, the system served as an architecture laboratory. Variations of the architecture are discussed, as well as scalability of fully synchronous SIMD architectures. The design principles of a VLSI-implemented successor of REMAP-β are described, and the paper concludes with a discussion of how the more powerful FPGA circuits of today could be used in a similar architecture.

Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.