Architecture Space and Efficiency
FPGAs and processors are just two architectures at widely distant points in a large design space. For simplicity, this article highlights the area aspects of two axes in this space: instruction depth and data-path word width (Figure B). Control granularity, interconnect richness, and data-memory depth also merit inclusion as major axes defining an architecture’s gross structure. It is possible to build more detailed area models to map out this space and explore the implications of various designs. My thesis presents some examples.1

We can further understand the architecture space by comparing the areas required by particular designs to satisfy a set of application characteristics. Since we have a whole space full of architectures, we can identify those requiring minimum area. We can then use this area as a benchmark to understand the relative efficiency of other architectures.

For example, the best implementation of a high-throughput, fully pipelineable design requiring 10 eight-bit operators might be a spatial architecture with 10 instructions and 80 bit operators. If we assume that the roughly 800,000 λ² required by the FPGA for active interconnect and computing logic is typical, as is 100,000 λ² for instruction and state storage, this implementation might take 80 × 800,000 λ² + 10 × 100,000 λ² = 65 million λ².
If instead we implemented this on an FPGA-like device with bit-level control, we would need 80 × 800,000 λ² + 80 × 100,000 λ² = 72 million λ². The FPGA solution’s efficiency would then be 65/72, or about 90 percent, since the FPGA uses 7 million λ² that a better-matched architecture would avoid.
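The arithmetic behind this comparison is easy to parameterize. The short Python sketch below is an illustration of the sidebar’s area model, not code from the article; the constant names and the area() helper are my own, and the only inputs are the 800,000 λ² and 100,000 λ² per-resource figures assumed above. It reproduces the 65 million and 72 million λ² totals and the roughly 90 percent efficiency:

```python
# Area model assumed in the sidebar (all figures in units of lambda^2):
# ~800,000 lambda^2 per active bit operator (interconnect plus compute),
# ~100,000 lambda^2 per instruction and its state storage.
BIT_OP_AREA = 800_000
INSTR_AREA = 100_000

def area(bit_ops, instructions):
    """Total area of a design point with the given resource mix."""
    return bit_ops * BIT_OP_AREA + instructions * INSTR_AREA

# Fully pipelineable task: 10 eight-bit operators = 80 bit operators.
spatial = area(bit_ops=80, instructions=10)  # 65,000,000 lambda^2
fpga = area(bit_ops=80, instructions=80)     # 72,000,000 lambda^2 (bit-level control)

print(f"spatial: {spatial / 1e6:.0f} million, FPGA: {fpga / 1e6:.0f} million")
print(f"FPGA efficiency: {spatial / fpga:.0%}")  # about 90 percent
```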
Alternatively, if we had a similar design with 10 eight-bit operators, but the design had a sequential (cyclic) dependency of length 10, preventing operator pipelining, the minimum-area design would be different. All the operators must execute in sequence and cannot start working on the next iteration until the previous result is computed. Therefore, both the spatial and the temporal designs will require 10 clock cycles to evaluate the result. In this case, the spatial implementation gives no performance advantage. The temporal implementation achieves the same performance using less area.
Thus, the benchmark architecture has an instruction depth of 10 and a single, active data path of width eight, requiring 8 × 800,000 λ² + 10 × 100,000 λ² = 7.4 million λ².
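The same model gives the non-pipelineable case directly. The sketch below is again my own illustration, repeating the constants and area() helper so it runs on its own; the comparison against a 72 million λ² FPGA implementation, and the resulting efficiency of roughly 10 percent, rests on my added assumption that an FPGA, binding one instruction to each bit operator, must still instantiate all 80 bit operators spatially:

```python
# Same assumed constants: lambda^2 per active bit operator / per instruction.
BIT_OP_AREA = 800_000
INSTR_AREA = 100_000

def area(bit_ops, instructions):
    return bit_ops * BIT_OP_AREA + instructions * INSTR_AREA

# Sequentialized case: one active eight-bit data path reused over an
# instruction depth of 10 -- the minimum-area benchmark from the text.
benchmark = area(bit_ops=8, instructions=10)  # 7,400,000 lambda^2

# Assumption (mine): the FPGA still builds all 80 bit operators spatially,
# even though only one eight-bit operator does useful work each cycle.
fpga_sequential = area(bit_ops=80, instructions=80)  # 72,000,000 lambda^2

print(f"benchmark: {benchmark / 1e6:.1f} million lambda^2")                     # 7.4
print(f"FPGA efficiency (sequential case): {benchmark / fpga_sequential:.0%}")  # ~10%
```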
Figure B. Two axes in the architecture design space: instruction depth (1 to 2,048) versus word width (1 to 2,048 bits), with architectures such as the FPGA, DPGA, PADDI, vDSP, VEGA, MIPS-X, and the T0 vector processor plotted as points in the space.
hybrid processor-and-FPGA systems, such as the GARP architecture described in this issue.
The space between FPGAs and traditional processors is large, as is the range of architectural efficiencies within this space (see the “Architecture Space and Efficiency” sidebar). Growing die capacity opens up this space to the computer architect and system-on-chip designer. The modern designer needs to understand this landscape to build efficient devices for both domain-specific and general-purpose computing tasks. ❖
References
1. S. Knapp, Using Programmable Logic to Accelerate DSP Functions, Xilinx Inc., San Jose, Calif., Mar. 1998.
2. M. Butts, “Future Directions of Dynamically Reprogrammable Systems,” Proc. 1995 IEEE Custom Integrated Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 487-494.
3. I. Goldberg and D. Wagner, Architectural Considerations for Cryptanalytic Hardware, Report CS252, Univ. of California, Berkeley; http://www.cs.berkeley.edu/~iang/isaac/hardware/.
4. P. Gronowski et al., “A 433-MHz 64b Quad-Issue RISC Microprocessor,” Digest of Tech. Papers, 1996 IEEE Int’l Solid-State Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1996, pp. 222-223.
5. W. Tsu et al., “HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array,” Proc. Int’l Symp. Field-Programmable Gate Arrays, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 125-134.
6. A. DeHon, Reconfigurable Architectures for General-Purpose Computing, AI Tech. Report 1586, MIT Artificial Intelligence Laboratory, Cambridge, Mass., 1996.
7. J. Fadavi-Ardekani, “M × N Booth Encoded Multiplier Generator Using Optimized Wallace Trees,” IEEE Trans. VLSI Systems, June 1993, pp. 120-125.
8. T. Isshiki and W.W.-M. Dai, “High-Level Bit-Serial Datapath Synthesis for Multi-FPGA Systems,” Proc. ACM/SIGDA Int’l Symp. Field-Programmable Gate Arrays, ACM Press, New York, 1995, pp. 167-173.
9. K. Kaneko et al., “A 50ns DSP with Parallel Processing Architecture,” Digest of Tech. Papers, 1987 Int’l Solid-State Circuits Conf., IEEE Press, Piscataway, N.J., 1987, pp. 158-159.