ABSTRACT

In this paper, we examine several algorithms suitable for the hardware implementation of the discrete Fourier transform (DFT) with non-power-of-two problem size. We incorporate these algorithms into Spiral, a tool capable of automatically generating corresponding hardware implementations. We discuss how each algorithm can be used to generate different types of hardware structures, and we demonstrate that our tool is able to produce hardware implementations of non-power-of-two sized DFTs over a wide range of cost/performance tradeoff points.

Index Terms—Discrete Fourier transform, high-level synthesis, field-programmable gate array, FPGAs, algorithms

1. INTRODUCTION

Transforms such as the discrete Fourier transform (DFT) are very commonly used in signal processing, communication, and scientific computing applications. However, they are difficult and time-consuming to implement in hardware because of the large number of algorithmic and datapath options that must be considered in order to tailor the implementation to a particular application’s cost and performance requirements. The right choices are highly dependent on the context. These problems considerably worsen when the transform size is not a power of two: algorithms become less regular and the relationship between algorithm and hardware implementation becomes more complicated. For these reasons, most existing work on hardware implementation of the DFT with non-power-of-two problem size focuses on building an implementation to meet a specific application’s cost and performance goals, and not on systematically creating a space of designs with different cost/performance characteristics.

Contribution. In this paper we utilize the hardware generation framework underlying Spiral [1, 2] to automatically produce hardware implementations of the DFT with non-power-of-two problem size. We consider several algorithmic options including the Bluestein [3] and mixed radix [4] algorithms, and we examine their applicability to the mathematical framework we use in hardware generation. Because designs are implemented automatically, it is possible to systematically produce and evaluate many more options than could reasonably be done by hand. In doing so, we are able to demonstrate for every input size a range of designs with different cost and performance tradeoffs.

Related work. Most previous work on hardware implementations of non-power-of-two sized DFTs has focused on producing a solution for a specific situation (a given problem size and performance requirement). For example, [5] and [6] consider the specific problem sizes and performance requirements of the Digital Radio Mondiale (DRM) radio broadcasting standard, while [7] presents an FPGA-based pipeline design of a 3,780 point inverse DFT used in a digital television standard.

2. BACKGROUND

In this section, we present relevant background on the discrete Fourier transform and fast Fourier transform algorithms.

Discrete Fourier transform. Computing the discrete Fourier transform on \( n \) points is defined as the matrix-vector product \( y = \text{DFT}_n x \), where \( x \) and \( y \) are \( n \) point input and output vectors (respectively), and

\[
\text{DFT}_n = [\omega_{n}^{k\ell}]_{0 \leq k, \ell < n}, \quad \omega_{n} = e^{-2\pi i/n}.
\]

Fast Fourier transform. Fast Fourier transform (FFT) algorithms allow the computation of \( \text{DFT}_n \) in \( O(n \log n) \) operations. Using the Kronecker product formalism in [4], an FFT algorithm can be written as a formula that represents a factorization of the dense \( \text{DFT}_n \) matrix into a product of structured sparse matrices. For example, the well-known Cooley-Tukey FFT can be written as

\[
\text{DFT}_{mn} = (\text{DFT}_m \otimes I_n)T_{n}^{mn}(I_m \otimes \text{DFT}_n)L_{m}^{mn}.
\]

Here, \( L_{m}^{mn} \) represents a stride permutation matrix that permutes its input according to:

\[
in + j \mapsto jm + i, \quad 0 \leq i < m, \quad 0 \leq j < n.
\]
Further, $I_m$ is the $m \times m$ identity matrix, and $\otimes$ is the tensor product, defined as

$$A \otimes B = [a_{k,l}B], \quad \text{where} \quad A = [a_{k,l}].$$

Lastly, $T_n^{mn}$ is a diagonal matrix of “twiddle factors,” (defined, e.g., in [8]). Figure 1 shows a dataflow interpretation of the formula (1) when $n = 2$ and $m = 2$.

**Automatic Hardware Generation.** Spiral [1, 2] is an automated tool for the generation and optimization of software implementations of linear transforms including the DFT. Spiral takes a given transform matrix (DFT$_n$) and expands it recursively using one or more FFTs, resulting in a formula, which is subsequently optimized and then translated into code. Search enables the selection of the best solution. In [1], we extended the Kronecker formalism using a tagging framework to enable the description of a rich space of hardware architectures by explicitly specifying sequential reuse of computational elements within the formula. This allows Spiral to automatically generate many possible hardware implementations of a transform of a given size, each with different trade-offs between various costs (e.g., circuit area) and performance metrics (e.g., throughput, latency).

Figure 2(a) shows a simplified datapath with no sequential reuse. It contains three repeated stages, each with four parallel computation blocks labeled $A_2$. It processes eight data elements in parallel. Next, Figure 2(b) employs streaming reuse to stream the data vector through the system over multiple cycles at a fixed rate, called the streaming width (two in this example).1 In Figure 2(c), the three computation stages have been collapsed to a single stage using a technique called iterative reuse. Now, the data vector must feed back and pass through the computation block multiple times.

By using different amounts of sequential reuse, different cost/performance trade-offs can be obtained. The more reuse employed, the smaller a datapath will be, but it will be commensurately slower. In [1], we provide more information on our formula-driven hardware generation techniques and evaluate the generated designs.

**3. NON-POWER-OF-TWO FFT ALGORITHMS**

In this section, we discuss four FFT algorithms that we use with our formula-driven generation framework to implement the DFT with non-power-of-two problem size.

**Pease FFT and Iterative FFT.** The Pease FFT [11] and Iterative FFT are commonly used FFT algorithms that decompose an $r^t$ point DFT into DFTs on $r$ points, where $r$ is called the radix. Both algorithms can be derived from (1). The radix-$r$ Pease and Iterative FFTs are (respectively):

$$DFT_{r^t} = \left( \prod_{t=0}^{t-1} \left( I_{r^t} \otimes DFT_{r} \right) D_{r^t} \right) R_{r^t}$$

(2)

$$DFT_{r^t} = \left( \prod_{t=0}^{t-1} \left( I_{r^t} \otimes DFT_{r} \otimes I_{r^t-1} \right) E_{r^t} \right) R_{r^t}$$

(3)

The matrix $R_{r^t}$ is the base-$r$ digit reversal permutation on $r^t$ points. When $r = 2$, this is called the “bit reversal permutation.” $D$ and $E$ are diagonal matrices of twiddle factors. The remaining matrices are as defined in Section 2.

We apply streaming reuse to both algorithms (as in Figure 2(b)), but we only use (2) for iterative reuse (as in Figure 2(c)). When using these algorithms to generate streaming hardware, the streaming width must be a power of $r$. So, as $r$ increases, the space of datapaths that can be generated by these formulas decreases.

**Mixed radix FFT.** The mixed radix FFT algorithm is a variant of (1); it decomposes a DFT of size $r^k s^t$ into multiple DFTs of sizes $r^k$ and $s^t$:

$$DFT_{r^k s^t} = L_{r^k s^t} \left( I_{s^t} \otimes DFT_{r^k} \right) L_{s^t}^{s^t k s^t} T_{s^t}^{r^k s^t} .$$

(4)

The smaller DFT$_{r^k}$ and DFT$_{s^t}$ matrices can then be recursively decomposed using the radix-$r$ and -$s$ algorithms (2) or (3). The two computation stages corresponding to the two tensor products in (4) must be built independently, i.e., they can not be collapsed into one iterative reuse stage as in Figure 2(c).

The structure of (3) imposes an additional restriction on the applicable streaming width: it must evenly divide the

---

1Permutations on streaming data require the use of memory; we implement them as explained in [9, 10].
problem size $r^k s^ℓ$, and it must be a multiple of $rs$. Thus, as the radices $r$ and $s$ get larger, the datapath options become more restricted.

**Bluestein FFT.** The Bluestein FFT [3] is a convolution-based algorithm for any problem size $n$. For a given $n$, the algorithm reduces the DFT to a circular convolution of two vectors of length $m > 2n − 1$. This convolution in turn can then be performed using two DFT$_m$s. Typically $m$ is chosen as the smallest admissible 2-power. We can represent the Bluestein FFT as

$$DFT_n = D^{(2)}_{n \times m} DFT^{-1}_m D^{(1)}_m DFT_m D^{(0)}_{m \times n},$$

where $m = 2^{\lceil\log_2(n)\rceil}+1$ is the smallest power of two greater than $2n − 1$, and the $D$ matrices are diagonal matrices that scale the vector by constant values. $D^{(0)}_{m \times n}$ and $D^{(2)}_{n \times m}$ are non-square matrices, so in addition to scaling the data, $D^{(0)}$ extends the input data vector from $n$ points to $m$ points (by zero padding), and $D^{(2)}$ shortens the output data vector from $m$ points to $n$ points (by discarding unneeded data).

The Bluestein FFT algorithm has a higher operation count than the other algorithms considered, but its structure is more regular than (4); the forward and inverse DFT$_n$ can be collapsed into one logic block based on iterative reuse as in Figure 2(c). In Section 4 we will see that this way the Bluestein algorithm can yield smaller (but slower) hardware implementations than the mixed radix FFT.

4. EXPERIMENTAL RESULTS

In this section, we evaluate the designs generated from (2), (3), (4), and (5). We compare the effects of each algorithm and illustrate how the algorithmic options affect the quality and types of datapaths that can be implemented within our framework. Our experiments evaluate the designs implemented on a Xilinx Virtex-5 FPGA (field-programmable gate array). However, the designs produced by our tool are not FPGA-specific but are also suitable for implementation as an ASIC (application-specific integrated circuit).

**Experimental Setup.** Given a transform of a specific size, a set of algorithmic options, and hardware directives that specify the desired datapath characteristics, Spiral automatically generates a corresponding pipelined hardware implementation in synthesizable register-transfer level Verilog. In these experiments, we generate designs that operate on 16 bit fixed point data words (for each of the real and imaginary parts of complex data), but Spiral is able to generate designs for any precision fixed-point or single precision floating point.

In these experiments, we generate and evaluate all of the options presented above with a streaming width up to 32. We use Xilinx ISE to synthesize and place/route each design for a Xilinx Virtex-5 LX 330 FPGA. When a memory structure is needed, we utilize an on-chip block RAM when we will utilize 2KB or more, otherwise we build the memory out of FPGA logic elements. (This threshold is also a controllable parameter in our generation tool.)

The set of algorithms and datapaths we consider depends on the problem size. Here, we consider four values of $n$ that illustrate four different classes of problem size. In Figure 3, we examine the throughput performance (in million samples per second) versus FPGA area consumed (in Virtex-5 slices) for each design. Data labels are used to indicate the number of Virtex-5 DSP slices consumed by each point in the Pareto-optimal set. A similar tradeoff evaluation can be performed for other cost or performance metrics, such as latency.

**Large prime size.** First, Figure 3(a) shows throughput versus area for implementations of DFT$_{409}$. Because 499 is a large prime number, the only applicable option among those we consider is the Bluestein FFT algorithm. Each black circle represents one Bluestein-based datapath, and the black line illustrates the designs in the Pareto-optimal set: the set of best tradeoff points among those evaluated. The slowest design requires approximately 1,500 slices and has a throughput of 11 million samples per second (MSPS). The fastest is about 18× larger but 60× faster.

**Composite number with larger prime factors.** Next, Figure 3(b) shows results for DFT$_{405}$. In addition to the Bluestein algorithm, we apply the mixed radix FFT algorithm to this problem; it decomposes DFT$_{405}$ into DFT$_{3^1}$, and DFT$_{5}$. This enables two additional designs (with streaming width $3 · 5 = 15$), shown as white triangles. Although the larger of these designs provides a much higher throughput than a similarly sized Bluestein-derived design, it is important to note that the cross-over point (where the mixed-radix designs become the better choice) is quite large (at approximately 15,000 slices). So, if design requirements require a smaller, lower throughput core, the Bluestein algorithm is a better choice.

**Composite number with smaller prime factors.** Third, Figure 3(c) illustrates throughput and area for DFT$_{432}$. Similar to the previous example, DFT$_{432}$ can be implemented using the mixed radix algorithm. However, unlike DFT$_{405}$, we can decompose DFT$_{432}$ into smaller radices: 2 and 3 (because $432 = 2^4 · 3^3$). Since the radices are smaller, the mixed radix algorithm can be used with more options for streaming width, yielding a larger set of Pareto-optimal designs than for DFT$_{404}$. The Bluestein designs are still the best choice for the smallest/slowest designs, but the cross-over point (about 9,000 slices) is lower than in the previous example.

**Power of small prime.** Lastly, Figure 3(d) shows results for DFT$_{243}$ = DFT$_{3^5}$. Because the problem size is a power of three, the radix-3 Pease and Iterative FFTs are applied directly (shown as white triangles). Here, the cores built with the radix-3 algorithms surpass the performance of the Bluestein cores at a much lower area than in the other problems, since the single-radix design can be built with less logic than in the previous multi-radix designs.

These results show that our framework produces designs
with a range of cost/performance trade-offs. This range depends on the number-theoretic properties of the problem size, which affects the set of algorithms and streaming options that are applicable.

5. CONCLUSION

In this paper, we explore hardware implementation of the discrete Fourier transform with problem sizes that are not powers of two. We discuss several algorithms for this class of problems that fit within our automatic hardware generation framework Spiral. We show how different algorithms can be applied to DFTs of different problem sizes, and we discuss the types of hardware structures that may be generated for each. Lastly, we provide an evaluation of designs generated using the algorithmic and datapath options described, and show that the options considered allow the tool to produce designs over a range of cost/performance tradeoff points.

6. REFERENCES


