Some Architectural Notes About Devices That Are No Longer With Us - Perhaps Deservedly So.
First, imaging and graphics can use devices in which a great number of large high-speed accumulator/ALUs can be implemented. A bi-quadratic address generator can easily require 150 bits of accumulator. Synglyphica currently has a requirement for a device with 320 to 600 bits of accumulator/ALU. (The 600-bit design is three times faster than the 320-bit design). While Xilinx, Altera, Concurrent Logic (Atmel), and Crosspoint all have devices with something like this capability there are problems with all four of them. For reasons which we will discuss below their support software is expensive and awkward to use. The devices are also overly expensive - often well over $500. In a real gate array the ~~24k gates it would take to implement 800 bits of accumulator would only cost ~~$24. We will return to these questions below but next some general remarks on FPGA device architecture.
Most good designs show a high degree of locality, which is to say that the more members there are in a class of resources the fewer the routing resources each member (element) will require. For example, in a typical design (whether of a board or of an IC) only the power rails and one or several clocks are globally routed. In general, in VLSI design a great deal of effort is invested in minimizing fan-in, fan-out, and wire length (i.e., improving the locality of the design). Another way of looking at locality is to see whether or not there is a quasi-linear relationship between the number of elements in a design and the routing resources they consume. If each element in a design has only two inputs and two outputs and only talks to its two nearest neighbors then there is a linear relationship between the number of elements (N) and the routing resources (~~4xN).
(One of the principal features of the adder topology that Synglyphica is currently investigating is that its carry chain is locally wired with extremely low fan-in and fan-out.)
On the above facts, Signetics' PML series with their fold-back nand array was somewhat inefficient in as much as every nand could be connected to every other one. For instance a PML design that has a mere 128 nands requires 16,384 interconnects. In this case Signetics decided to pay the maximum hardware recurring cost for routing (i.e., NxN) and thus avoided the exponential effort involved in routing with limited routes. (If the famous NP-complete {requiring exponential time [X**N, X=>2]} traveling salesman problem is transformed by having a direct route from each city to every other city then it is solvable in polynomial time - ~~N**2).
If they had taken another leaf from the neural nets that they had already used as a model they would have subdivided the array into multiple nets with more nands inside each sub-net than I/Os to the sub-net.
As an example let us consider a design with the same number of fold-back nands (128) but subdivided into eight sub-nets with 16 nands each. Each sub-net is completely interconnected (at a cost of 256 interconnections per sub-net) and eight of the nands in each sub-net are globally connected for a total cost of 4096 (64x64) interconnections. The total interconnection cost is then 6144 interconnections - a savings of 10k interconnections over the 16,384 needed when they are all directly interconnected. As a further example, if we double the number of nands in each sub-net in the above example but keep every thing else the same then we will have 256 nands at a total interconnection cost of 12k (4k + 8k).
Each sub-net in the first example is capable of being wired as a one-bit full adder and in the second example each sub-net could be wired as a two-bit full adder. If we were serious about aiming this architecture towards imaging and graphics then we would replace some of the fold-back nands with fold-back xors.
One further example, if each sub-net had the following resources: 12 data inputs, eight data outputs, four FFs, eight FB xors, and 32 FB nands then each sub-net could be wired as a four-bit ALU. If full interconnectivity is maintained for the FB elements then the interconnect cost will be 1600 interconnections per sub-net plus 6144 global interconnects for a total of 18,944 connections.
This last example is not unlike a FB nand/xor version of the iFX780 with each sub-net corresponding to a CFB. Its cost and performance would also be roughly comparable. In the iFX780 complete connectivity is maintained among the CFBs the same as our FB nand/xor example did. The iFX780 has the advantage in some applications - esp. those applications requiring arbitrary function implementation. However, the FB nand/xor architecture has the advantage as soon as you get past four-bit ALUs. If each sub-net needs to be as powerful as an eight-bit ALU then the total device would need to be four times as large (N**2). But to implement an eight-bit ALU in LUT type logic would require 0.5 Mbits.
So where does the iFX780 go from here?
Take a page from the Xilinx / Concurrent Logic architecture and replicate a block of logic up to 3000 times with each block having moderate to little capability and with routing resources that assume that most designs have good locality.
However, as soon as the FPGA device architect takes full advantage of locality he forces the end user of the device into the NP-complete realm of placement and routing. (I have talked to Xilinx users who only get a 20% completion rate on auto-placed and routed designs - and these are with known good designs that ultimately do route - and each attempt takes over one hour.) A large part of the cost of support for these devices is the difficulty of writing good software for NP-complete problem sets. Brute force does not work (because of the exponential time requirement) and all other known algorithms have their flaws and difficulties. (Simulated annealing looked promising but implementing it has proven to be difficult.)
Continue to take a compromise position by using large and powerful blocks that are fully routable inside the block and each block's I/Os are fully routable to every other block's I/Os; but with each element inside each block not necessarily directly reachable from elements outside the block.
Routing and placement would remain feasible without entering the realm of the NP-complete but the interconnect cost is going to be ~~N**2. If in some sense you make the device eight times larger it will require 64 times the routing resources. At some point that becomes too expensive also.
Assuming that the iFX780 currently requires ~~16,360 interconnects to interconnect the eight CFBs (24 inputs * 10 outputs * 64) and that a follow on device will have as its CFB interconnect budget something like four times that (~~64k) then the following is something that Intel might consider. Add to the 780 ten new-type CFBs and each of these new-type CFBs would have the resources to implement a 16-bit ALU/accumulator. If we assume that each CFB requires 36 inputs and 20 outputs, the interconnect cost for this section alone would be 72,000 connections (36 * 20 * 100).
However, various reasonable restrictions could be placed on the connectivity of this section such that the total connectivity cost for connecting eight LUT CFBs with ten 16-bit ALU CFBs would be in the 48k range. The internal connect cost inside each of these CFB ALUs would be in the 1k to 3k range depending on the details of the connection. Synglyphica and others doing imaging and graphics would use this device as a bi-quadratic address generator or as a building block for constructing higher degree address generators. Something like this device could also be used to implement fixed-function convolvers.