Architecture Space and Efficiency
FPGAs and processors are just two architectures at widely distant points in a large design space. For simplicity, this article highlights the area aspects of two axes in this space: instruction depth and data-path word width (Figure B). Control granularity, interconnect richness, and data-memory depth also merit inclusion as major axes defining an architecture’s gross structure. It is possible to build more detailed area models to map out this space and explore the implications of various designs. My thesis presents some examples.1

We can further understand the architecture space by comparing the areas required by particular designs to satisfy a set of application characteristics. Since we have a whole space full of architectures, we can identify those requiring minimum area. We can then use this area as a benchmark to understand the relative efficiency of other architectures.

For example, the best implementation of a high-throughput, fully pipelineable design requiring 10 eight-bit operators might be a spatial architecture with 10 instructions and 80 bit operators. If we assume that the roughly 800,000 λ² required by the FPGA for active interconnect and computing logic is typical, as is 100,000 λ² for instruction and state storage, this implementation might take 80 × 800,000 λ² + 10 × 100,000 λ² = 65 million λ².
If instead we implemented this on an FPGA-like device with bit-level control, we would need 80 × 800,000 λ² + 80 × 100,000 λ² = 72 million λ². The FPGA solution’s efficiency would then be 65/72, or about 90 percent, since the FPGA uses 7 million λ² that a better-matched architecture would avoid.
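The arithmetic behind this comparison is easy to parameterize. The short Python sketch below is an illustration of the sidebar’s area model, not code from the article; the constant names and the area() helper are my own, and the only inputs are the 800,000 λ² and 100,000 λ² per-resource figures assumed above. It reproduces the 65 million and 72 million λ² totals and the roughly 90 percent efficiency:

```python
# Area model assumed in the sidebar (all figures in units of lambda^2):
# ~800,000 lambda^2 per active bit operator (interconnect plus compute),
# ~100,000 lambda^2 per instruction and its state storage.
BIT_OP_AREA = 800_000
INSTR_AREA = 100_000

def area(bit_ops, instructions):
    """Total area of a design point with the given resource mix."""
    return bit_ops * BIT_OP_AREA + instructions * INSTR_AREA

# Fully pipelineable task: 10 eight-bit operators = 80 bit operators.
spatial = area(bit_ops=80, instructions=10)  # 65,000,000 lambda^2
fpga = area(bit_ops=80, instructions=80)     # 72,000,000 lambda^2 (bit-level control)

print(f"spatial: {spatial / 1e6:.0f} million, FPGA: {fpga / 1e6:.0f} million")
print(f"FPGA efficiency: {spatial / fpga:.0%}")  # about 90 percent
```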
Alternatively, if we had a similar design with 10 eight-bit operators, but the design had a sequential (cyclic) dependency of length 10, preventing operator pipelining, the minimum-area design would be different. All the operators must execute in sequence and cannot start working on the next iteration until the previous result is computed. Therefore, both the spatial and the temporal designs will require 10 clock cycles to evaluate the result. In this case, the spatial implementation gives no performance advantage. The temporal implementation achieves the same performance using less area.
Thus, the benchmark architecture has an instruction depth of 10 and a single, active data path of width eight, requiring 8 × 800,000 λ² + 10 × 100,000 λ² = 7.4 million λ².
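The same model gives the non-pipelineable case directly. The sketch below is again my own illustration, repeating the constants and area() helper so it runs on its own; the comparison against a 72 million λ² FPGA implementation, and the resulting efficiency of roughly 10 percent, rests on my added assumption that an FPGA, binding one instruction to each bit operator, must still instantiate all 80 bit operators spatially:

```python
# Same assumed constants: lambda^2 per active bit operator / per instruction.
BIT_OP_AREA = 800_000
INSTR_AREA = 100_000

def area(bit_ops, instructions):
    return bit_ops * BIT_OP_AREA + instructions * INSTR_AREA

# Sequentialized case: one active eight-bit data path reused over an
# instruction depth of 10 -- the minimum-area benchmark from the text.
benchmark = area(bit_ops=8, instructions=10)  # 7,400,000 lambda^2

# Assumption (mine): the FPGA still builds all 80 bit operators spatially,
# even though only one eight-bit operator does useful work each cycle.
fpga_sequential = area(bit_ops=80, instructions=80)  # 72,000,000 lambda^2

print(f"benchmark: {benchmark / 1e6:.1f} million lambda^2")                     # 7.4
print(f"FPGA efficiency (sequential case): {benchmark / fpga_sequential:.0%}")  # ~10%
```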
Figure B. Two axes in the architecture design space: instruction depth (1 to 2,048) versus word width (1 to 2,048 bits), with architectures such as the FPGA, DPGA, PADDI, vDSP, VEGA, MIPS-X, and the T0 vector processor plotted as points in the space.
hybrid processor-and-FPGA systems, such as the GARP architecture described in this issue.
The space between FPGAs and traditional processors is large, as is the range of architectural efficiencies within this space (see the “Architecture Space and Efficiency” sidebar). Growing die capacity opens up this space to the computer architect and system-on-chip designer. The modern designer needs to understand this landscape to build efficient devices for both domain-specific and general-purpose computing tasks. ❖
References
1. S. Knapp, Using Programmable Logic to Accelerate DSP Functions, Xilinx Inc., San Jose, Calif., Mar. 1998.
2. M. Butts, “Future Directions of Dynamically Reprogrammable Systems,” Proc. 1995 IEEE Custom Integrated Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1995, pp. 487-494.
3. I. Goldberg and D. Wagner, Architectural Considerations for Cryptanalytic Hardware, Report CS252, Univ. of California, Berkeley; http://www.cs.berkeley.edu/~iang/isaac/hardware/.
4. P. Gronowski et al., “A 433-MHz 64b Quad-Issue RISC Microprocessor,” Digest of Tech. Papers, 1996 IEEE Int’l Solid-State Circuits Conf., IEEE CS Press, Los Alamitos, Calif., 1996, pp. 222-223.
5. W. Tsu et al., “HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array,” Proc. Int’l Symp. Field-Programmable Gate Arrays, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 125-134.
6. A. DeHon, Reconfigurable Architectures for General-Purpose Computing, AI Tech. Report 1586, MIT Artificial Intelligence Laboratory, Cambridge, Mass., 1996.
7. J. Fadavi-Ardekani, “M × N Booth Encoded Multiplier Generator Using Optimized Wallace Trees,” IEEE Trans. VLSI Systems, June 1993, pp. 120-125.
8. T. Isshiki and W.W.-M. Dai, “High-Level Bit-Serial Datapath Synthesis for Multi-FPGA Systems,” Proc. ACM/SIGDA Int’l Symp. Field-Programmable Gate Arrays, ACM Press, New York, 1995, pp. 167-173.
9. K. Kaneko et al., “A 50ns DSP with Parallel Processing Architecture,” Digest of Tech. Papers, 1987 Int’l Solid-State Circuits Conf., IEEE Press, Piscataway, N.J., 1987, pp. 158-159.