Feed-Back Nand (or nand-nand) FPGA Architecture - AKA: The Folded-NAND Array

The discussion below is based on two devices that are no longer with us and have not been around for almost a decade. The discussion does, however, cover topics of more general interest. To continue:

Intel abandoned the FPGA market a decade ago. Signetics was acquired by Philips and they continued with some development but for the most part abandoned high-end FPGA development. For more info about these two devices architecture and a general discussion about some aspects of FPGA

The following is based on Synglyphica's experience with Signetics' PML line and Intel's one and only FPGA, the iFX780 (and we assume that the reader has some familiarity with both architectures).

We noted that there was a possibility that a combination of the two architectures could be useful in a large suite of imaging and graphics applications. An FPGA designed along the lines we will sketch out below can be used variously as a polynomial address generator and as a fixed kernel 3x3 convolver.

If at times this brief note seems overly critical of the PML architecture it's only that we feel that to get the most out of it Sig should have put a lot more thought and effort into it than they did. To take a concept as elegant as the feed-back nand and then to produce a device as inept and ugly as the PML2552 requires a real lack of feeling for device architecture.

And while we have no quarrel with the iFX780 as such, a device needs to be two or three times more powerful than the 780 in order for it to carve out a niche in imaging and graphics. The problem is that devices similar to the iFX780 cannot keep going down the same architectural path for much longer. Adding inputs to the LUTs at the heart of each CFB has an exponential cost.

To continue, imaging and graphics can use devices in which a great number of large high-speed accumulator/ALUs can be implemented. A bi-quadratic address generator can easily require 150 bits of accumulator. Synglyphica currently has a requirement for a device with 320 to 600 bits of accumulator/ALU. (The 600-bit design is three times faster than the 320-bit design). While Xilinx, Altera, Concurrent Logic (Atmel), and Crosspoint all have devices with something like this capability there are problems with all four of them. For reasons which we will discuss below their support software is expensive and awkward to use. The devices are also overly expensive - often well over $500. In a real gate array the ~~24k gates it would take to implement 800 bits of accumulator would only cost ~~$24. We will return to these questions below but next some general remarks on FPGA device architecture.

Most good digital electronic designs show a high degree of locality, which is to say that the more members there are in a class of resources the fewer the routing resources each member (element) will require. For example, in a typical design (whether of a board or of an IC) only the power rails and one or several clocks are globally routed. In general, in VLSI design a great deal of effort is invested in minimizing fan-in, fan-out, and wire length (i.e., improving the locality of the design). Another way of looking at locality is to see whether or not there is a quasi-linear relationship between the number of elements in a design and the routing resources they consume. If each element in a design has only two inputs and two outputs and only talks to its two nearest neighbors then there is a linear relationship between the number of elements (N) and the routing resources (~~4xN).

(One of the principal features of the adder topology that Synglyphica is currently investigating is that its carry chain is locally wired with extremely low fan-in, fan-out, and wire length.)

On the above facts, Signetics' PML series with their fold-back nand array was somewhat inefficient in as much as every nand could be connected to every other one. For instance a PML design that has a mere 128 nands requires 16,384 programmable interconnects [ PIs ] (and has a fuse-map more or less identical to that of a 16k-bit PROM). In this case Signetics decided to pay the maximum hardware recurring cost for routing (i.e., NxN) and thus avoided the exponential effort involved in routing with limited routes. (If the famous NP-complete {requiring exponential time [X**N, X=>2]} traveling salesman problem is transformed by having a direct route from each city to every other city then it is solvable in polynomial time - ~~N**2). If they had taken another leaf from the neural nets that they had already used as a model they would have subdivided the array into multiple nets with more nands inside each sub-net than I/Os to the sub-net.

As an example, let us consider a design with the same number of fold-back nands (128) but subdivided into eight sub-nets with 16 nands each. Each sub-net is completely interconnected (at a cost of 256 PIs per sub-net) and eight of the nands in each sub-net are globally connected for a total cost of 4096 (64x64) PIs. The total interconnection cost is then 6144 PIs - a savings of 10k PIs over the 16,384 needed when they are all directly interconnected. As a further example, if we double the number of nands in each sub-net in the above example but keep every thing else the same then we will have 256 nands at a total PI cost of 12k (4k + 8k).

Each sub-net in the first example is capable of being wired as a one-bit full adder and in the second example each sub-net could be wired as a two-bit full adder. If we were serious about aiming this architecture towards imaging and graphics then we would replace some of the fold-back nands with fold-back xors.

One further example, if each sub-net had the following resources: 12 data inputs, eight data outputs, four FFs, eight FB xors, and 32 FB nands then each sub-net could be wired as a four-bit ALU. If full programmable inter connectivity is maintained for the FB elements then the interconnect cost will be 1600 PIs per sub-net plus 6144 global PIs for a total of 18,944 PIs.

This last example is not unlike a FB nand/xor version of Intel's iFX780 with each sub-net corresponding to a CFB. Its cost and performance would also be roughly comparable. In the iFX780 complete connectivity is maintained among the CFBs the same as our FB nand/xor example did. The iFX780 has the advantage in some applications - esp. those applications requiring arbitrary function implementation. However, the FB nand/xor architecture has the advantage as soon as you get past four-bit ALUs. If each sub-net needs to be as powerful as an eight-bit ALU then the total device would need to be four times as large (N**2). But to implement an eight-bit ALU in LUT type logic would require 0.5 Mbits.

So far we have looked at the FB-gate approach and the CFB-LUT approach. What other paths are available? We can:

A. take a page from the Xilinx / Concurrent Logic architecture and replicate a block of logic up to 3000 times with each block having moderate to little capability and with routing resources that assume that most designs have good locality; or,

B. continue to take a compromise position by using large and powerful blocks that are fully routable inside the block and each block's I/Os are fully routable to every other block's I/Os; but with each element inside each block not necessarily directly reachable from elements outside the block.

However, if we choose {A}, as soon as the FPGA device architect takes full advantage of locality he forces the end user of the device into the NP-complete realm of placement and routing. (It is not uncommon for Xilinx users to only get a 20% completion rate on auto-placed and routed designs - and these are with known good designs that ultimately do route - and each attempt takes over one hour.) A large part of the cost of support for these devices is the difficulty of writing good software for NP-complete problem sets. Brute force does not work (because of the exponential time requirement) and all other known algorithms have their flaws and difficulties. (Simulated annealing looked promising but implementing it has proven to be difficult.)

If we choose {B}, routing and placement would remain feasible without entering the realm of the NP-complete but the interconnect cost is going to be ~~N**2. If in some sense you make the device eight times larger it will require 64 times the routing resources. At some point that becomes too expensive also.

One further design example:

Assume that the iFX780 currently requires ~~16,360 interconnects to interconnect the eight CFBs (24 inputs * 10 outputs * 64) and that it would be reasonable for a similar but more powerful device to have as its CFB interconnect budget something like four times that (~~64k), then the following is something that might be feasible. Add to the 780 ten new-type CFBs and each of these new-type CFBs would have the resources to implement a 16-bit ALU/accumulator. If we assume that each CFB requires 36 inputs and 20 outputs, then the interconnect cost for this section alone would be 72,000 connections (36 * 20 * 100).

However, various reasonable restrictions could be placed on the connectivity of this section such that the total connectivity cost for connecting eight LUT CFBs with ten 16-bit ALU CFBs would be in the 48k range. The internal connect cost inside each of these CFB ALUs would be in the 1k to 3k range depending on the details of the connection. Synglyphica and others doing imaging and graphics would use this device as a bi-quadratic address generator or as a building block for constructing higher degree address generators. Something like this device could also be used to implement fixed-function convolvers.

And while we have no quarrel with the iFX780 as such, a follow-on device needs to be two or three times more powerful than the 780 in order for devices based on this architecture to carve out a niche in imaging and graphics. The problem is that follow-on devices to the iFX780 cannot keep going down the same architectural path for much longer. Adding inputs to the LUTs at the heart of each CFB has an exponential cost.

proposed FPGA cell