Contouring Real-Time Video
Finding Edge-Detected Objects in a Scene using FPGAs and Merchant ASICs
{a 5x5 difference-of-Gaussian convolution followed by: 2D edge detection; refinement; and a once-through, left-to-right, top-to-bottom object numbering implemented with FPGAs and row buffers}
"It is a truth universally to be acknowledged that to find the objects in a scene first you must find their edges." Jane Austin Marr
"After stubbing her toe in the dark and falling down three flights of stairs ... when informed that counting the objects in a scene was an open philosophical question and did she count each step as a separate object - she simply counted her bruises." The Z-Files
Introduction
- Image processing operations can be broadly classified into three, sometimes overlapping, categories:
- global operations;
- region of interest (ROI) / feature processing; and,
- neighborhood operations.
- Global operations include:
- color space conversion,
- warping,
- Fourier transforms, and
- fractal and wavelet compression.
- ROI operations include:
- DCT compression,
- histogram and histogram equalization,
- morphing, and
- colorizing.
- Neighborhood operations include
- MxN convolutions and
- any cellular automata (CA) type operation.
- Which brings us to the theme of this paper:
- feature extraction through neighborhood operations.
Neighborhood operations – replace the datum being updated with a value that depends on its value and that of its nearest neighbors:
Convolvers, cellular automata, and neural nets all share the similar feature that the value of the pixel, cell, or neuron is replaced by a value dependent on its current value and the values of its neighbors. In the 2-D case, both MxN sequential convolvers and MxN CA processors need identical data flow capabilities that are best implemented with row buffers and pipeline registers.
Feature extraction through neighborhood operations:
The first neighborhood operation we perform is a zero-energy, 5x5, difference of Gaussian (DOG) using merchant ASICs. (We are currently developing 9x9 and 15x15 convolver modules to better handle low frequency edges). A convolution of this sort can be considered ‘standard’ because it is accomplished with multiplications and additions. For non-standard neighborhood operations, we use field programmable gate arrays combined with row buffers and look-up tables. We do not use them for the convolutions because, as powerful as 4th generation FPGAs are, it is not efficient to implement 3x3 and larger kernel convolutions using them. The resources required to implement a 3x3 convolution include: nine 8x8 multipliers, the nine 16-bit to 20-bit adders needed to sum the multiplier outputs, and two 1kx8 row buffers. If the convolvers are to be cascaded to form larger convolutions (e.g., 5x5) then additional 20-bit adders and registers are needed. Fortunately, affordable merchant ASICs are available (e.g., HSP48908s from Harris Semi). Synglyphica uses four of the 908s cascaded together to perform a 5x5.
The output of the DOG is a 20-bit, twos complement number which we reduce to two or three bits (sign bit and threshold bit or bits - a high threshold and a low threshold) through with a tracking peak detector. We use a tracking threshold whose value is a function of the local maxima and the frame maxima. Experimentally we find that the thresholds depend on the radii of the Gaussian selected and the ‘noise floor.’ It is on this ‘two-bit’ (or three-bit) image that we perform a 2D refinement transforms. The output of the 2D edge refinement transform is a two or three bit image with outlined objects (see tables 1.a,b below). Rather spectacular effects can be generated with the above transforms. (Remember everything we are describing here is done in real time.)
FPGA architecture - the Atmel AT60xx looks like a CA:
Elsewhere I described a CA implemented with Intel IFX780 FPGAs. The principal attraction of the 780 for CA implementation was that its internal memory could be readily configured as row buffers and it had enough other internal resources that interesting CAs could be implemented without additional external logic. But with large, complex CAs the limiting resources are pipeline registers and sequential logic. The Atmel AT60xx family is extremely rich in such resources. To implement a CA with these parts it is necessary to use external row buffers, but the trade-off is acceptable since reasonably priced row buffers are available. In fact the AT60xx architecture seems to be CA inspired. Each cell contains one d-flop that is adequate for encoding the simple alive/dead state of basic CAs and each cell has direct connections to its four nearest, (North South East West) neighbors. Atmel even uses the NSEW terminology of CAs to describe this connection scheme.
Hardware environment - real-time video image processing:
We developed this feature extraction capability on our Digital Video Lab (DVL) which is our proprietary, real-time, D1 video development system. It is a straightforward system in which each hardware module takes in 32-bit pixel data on one set of pins, processes the video data and outputs it on a different set of pins, all at the D1 video rate of 13.5 MHz - see Figure 1.
The basic module set required for feature extraction includes, in the order of processing, a headboard to digitize the video, a color-space-converter (CSC) to convert the YUV of D1 video to RGB, another CSC to convert RGB to HSI (hue, saturation, intensity), a 5x5 convolver module, the FPGA module that does the actual feature extraction, another CSC module to convert HSI to RGB, and a tailboard to convert the RGB digital to RGB analog for display - see Figure 2.
5x5 convolver module - one billion operations per second:
We implemented a 5x5 convolver with four 3x3 convolvers cascaded to form a 5x5. These 3x3 convolvers have eight-bit inputs and 20-bit outputs. Each 3x3 convolver contained two row buffers and two cascade ports. No additional logic was required to bolt them together to perform a 5x5 convolution. Note that a 5x5 convolution requires 60 operations - 25 multiplies, 25 additions, four row-buffer reads and four row-buffer writes, and buffer pointer updates. At 13.5 MHz this results in 810 million operations per second. Because of the inefficiencies of using four 3x3 convolvers instead of one 5x5, over a billion operations per second are actually performed by our 5x5 module. The actual parts used were Harris Semi HSP48908s - see Figure 3.
FPGA module - four AT60XXs plus two row buffers:
In a development system, horsepower is often more important than efficiency or economy. Our compromise solution was to design our AT60xx FPGA module with sockets for four AT60XXs but only two of them have to be populated - the other sockets are populated only if the additional processing power is needed (the two row buffers are also optional) - see Figure 4. We will also describe elsewhere a bit-slice FPGA processor module that uses ten AT6010s - one FPGA for I/O, eight FPGAs for bit-slice processing and one FPGA for cascade summation.
Bit slice strategy
With fine-grain FPGA architectures (e.g.,AT60xx), bit-slice the design wherever possible. If you use series of n-bit latches etc., you will only end up exploding them and redistributing their parts in layout. Fine-grained architectures would benefit greatly both from routers that exploit bit-slice strategies and from compilers that aid the bit-slice approach.
Data format - D1 444 etc.
Most DVL processes only require 24 bits per pixel (e.g., eight bits each of red, green, and blue) but feature extraction requires 32 bits, with eight bits each of hue and saturation and 16 bits of intensity data. In the implementation described below only the intensity data is used for feature extraction with the chroma data carried along only for display purposes. Other implementations of feature extraction use the chroma data as well.
The Difference of Gaussian (DOG):
The Gaussian is a low pass filter that is optimally smooth in both the frequency and spatial domains. The difference of two Gaussian is a high pass filter equivalent to a Gaussian followed by a Laplace operator. The DOG is a near optimal filter for edge and line extraction. The only filters that outperform it are the Gabor filters and they tend to be computationally prohibitive in the 2D case (although they have their advocates). A gross approximation of a zero-energy 3x3 DOG is given by the following filter kernel:
-1 -1 -1 -1 8 -1 -1 -1 -1
As can be seen by inspection, in uniform and slowly changing regions the output will be zero or near zero but it will amplify sharp transitions.
(Note: In his seminal and posthumously published work "Vision" Marr postulated that any filter similar to a Gabor filter would be computationally too expensive and so not likely to be used in the human visual system. It appears that Marr might have been wrong about the human visual system but his point about expense is still valid.)
Zero crossing detection (ZCD):
The relevant data in the output of a DOG convolution are the zero crossings - see Figure 5. A negative spike followed by a positive spike shows a transition from a dark region to a lighter region (Figure 5a). And as the name "zero crossing" indicates, the center of the edge occurs at the point marked ‘x’. A white square on a darker background (Figure 5b) would be outlined / detected by the data shown in Figure 5c. The plus and minus signs indicate the sign of the output of the DOG. Unfortunately, with a zero energy DOG the sign bits by themselves are not enough, the magnitude of the crossing must also be considered. One computationally expensive method is to run three convolutions in parallel: a zero-energy DOG, a positive energy DOG, and a negative energy DOG. The true zero-crossings can be found where their outputs correlate. The negative energy DOG is used to more accurately locate the center of the edge. Computationally milder methods are left as an exercise for the reader but numerous hints are presented below.
positive negative energy energy -1 -1 -1 -1 -1 -1 -1 9 -1 -1 7 -1 -1 -1 -1 -1 -1 -1
ZCD cleanup:
The multiple convolution approach to ZCD cleanup mentioned above would only make sense where a narrow band of relatively high frequencies are of interest such as OCR or barcode decoding. In these two cases the signals of interest are close to the highest frequencies present. Only ‘paper noise’ and ink irregularities are there to confuse the process. For real-image image processing, filter banks with normalized Gaussian of different radii subtracted from each other would be the preferred method of ZCD cleanup and edge correlation. But it is just that case of narrow bands of high frequencies that a 5x5 DOG can handle.
ZCD cleanup via local maxima:
The method of ZCD cleanup that we actually use looks for local maxima in the DOG output and then recodes that as an n-bit number. In the At60xx implementation of this algorithm a local peak detector feeds a comparator bank. The output of the comparator bank is an n-bit number that is used to locate the edges of interest. Display look-up tables for the trivial one threshold case and the slightly less trivial two threshold case are given in tables 1a and 1b:
sb == sign bit, ~sb == inverted sign bit, tb{x} == threshold bit,
table 1.a
tb ~sb val
0x128 (grey)
11 (pos) 192 (white)
10 (neg) 64 (black)
table 1.b
tb1(hi) tb2(lo) ~sb val
x 0 x 128 (grey)
0 1 1 (pos) 160 (light grey)
0 1 0 (neg) 96 (dark grey)
1 x 1 (pos) 224 (white)
1 x 0 (neg) 32 (black)
We might note here that in order to fit the local maxima (peak) detector and the comparator bank into an AT6005 the function had to be converted into a single, monolithic, bit-slice function.
ZCD cleanup via other neighborhood operations:
The desired result of ZCD cleanup is an image similar to Figure 5c with no gaps in the outline and no extraneous data anywhere. Several paths via neighborhood operations are available to that end: (a) a 3x3 Gaussian on the n-bit output of the comparator banks, (b) a CA approach in which solitary points are removed and corners are smoothed and filled, and (c) general dilation / erosion methods (refinement). Methods {b} and {c} are suited to FPGA implementation. In method {b} either a 3x3 neighborhood or a 5x5 neighborhood is used with external row buffers. An AT60xx easily accommodates the post row buffer processing. The penultimate output of a ZCD cleanup is a two-bit image with a sign bit and a threshold bit. The ultimate output from ZCD cleanup is a binary (one bit video) image.
If the ZCD cleanup has been adequately performed a satisfactory binary image can be obtained. If the ZCD cleanup is not done properly then the binary image will be filled with streaks caused by unmatched zero crossings.
Object numbering:
The object numbering is done on a binary image with two column vectors. One vector holds the binary image while the other vector holds the current object numbering. At binary image transitions it uses any previous number that is present in the new data. Otherwise it takes the next available number.
REFERENCES
[Gardner 1970] Martin GARDNER, "The fantastic Combinations of John Conway’s New Solitaire Game ‘Life’," Sc. Am. 223:4 (April 1970), 120-123.
[Marr 1982] David MARR, Vision, W.H. Freeman and Co. (1982)