ON THE IMPLEMENTATION OF ADAPTIVE EQUALIZERS FOR WIRELESS ATM WITH AN EXTENDED QR-DECOMPOSITION-BASED RLS-ALGORITHM

Christian Drewes, Ralph Hasholzner, Joachim S. Hammerschmidt
Institute for Integrated Circuits • Technical University of Munich
Arcisstr. 21 • D-80290 München • Germany
Email: c_drewes@lis.e-technik.tu-muenchen.de

ABSTRACT

An extended QR-decomposition (QRD)-based RLS-algorithm is introduced, applicable for equalization, allowing an estimate of the transmitted symbol, when it is a-priori unknown. To achieve low-cost implementations strength reduction and square-root free Givens-rotations are applied. Several QRD-based RLS-algorithms are compared in terms of the number of mathematical operations. The QR-RLS-algorithm indicates lowest complexity and is always a good candidate when a huge training overhead has to be avoided, resulting in an equalizer of less than 15 taps. Finally, the hardware complexity of a 13 MBaud-DFE for wireless ATM is estimated.

1. INTRODUCTION

It is well-known that fiber-to-the-home is the user access method offering a maximum of data rates. However, investment costs to install demand-covering optical access networks are far too high for any network provider. Therefore, fiber will only be used in the feeder part of access networks, whereas in the last mile the existing infrastructure (twisted pair and coax cable) or wireless transmission will be adopted (Fig. 1). The latter is particularly attractive for new carriers entering the market without their own copper cables in the last mile [2][5].

In order to be competitive with wired solutions, the requirements of broadband fixed wireless access can be summarized as follows [2]:
• Low costs,
• high flexibility to support all kinds of future application,
• support of future Asynchronous Transfer Mode (ATM) end-to-end connections.

In [2][5] it is shown that a Time Division Multiple Access (TDMA)-based system is very promising with regard to the above-mentioned requirements.

To overcome the impairments of the radio channel, adaptive equalizers have to be implemented [9]. In [3], some channel models for the region above 20 GHz are introduced. It was shown that a Decision Feedback Equalizer (DFE, Fig. 2) is an adequate structure for equalizing these channels. The Feed-Forward Filter (FFF) can be run Baud-Spaced (BS) or Fractionally-Spaced (FS). A FS-equalizer has the advantage to comply with the sampling theorem. The transmit and receive filters are assumed to be raised cosine filters with a rolloff factor of 0.5. The influence of the rolloff factor on system performance is analyzed in [3].

For the uplink of a wireless TDMA-system a fast converging Recursive-Least-Squares (RLS)-algorithm has to be used due to the varying channel characteristics of the single time-slots [3][5]. The RLS-algorithm needs about 2N iterations to converge, where N is the number of equalizer taps [6]. To ensure a large system throughput, the training sequence has to be as short as possible. Therefore, the equalizer-length should not exceed about 15 taps, resulting in a training-sequence of minimal length 30 and a throughput of about 85% assuming QPSK-modulation and the transmission of one ATM-cell (53 bytes) per time-slot.

To improve numerical properties, the method of QR-Decomposition (QRD) can be applied to the matrix of incoming data samples [6][7][11]. In an application like equalization the desired symbol is not known a-priori.
Therefore, the QRD-based RLS-algorithms have to be extended to allow an estimation of the desired symbol from the old data (Section 2). The QRD can be implemented recursively using unitary rotations like Givens-Rotations (GR) 46711]. In Section 3, an algebraic transformation is applied to complex multiplications requiring less chip area. Several square-root free GR-QRD-based RLS-algorithms are compared in terms of mathematical operations per iteration exhibiting the eligibility of the QR-RLS-algorithm for the given application. An estimate of the number of gates required for an implementation of a 13 Mbaud-DFE is given. Section 4 concludes this paper.

2. AN EXTENDED QR-RLS-ALGORITHM

2.1 The QR-RLS-algorithm

For a DFE, the conventional QR-RLS algorithm has to be modified: Since the estimated transmitted symbol d(n) is generally not given, we have to add a second column to McWirther’s QR-RLS algorithm 811:

\[
\begin{pmatrix}
\beta R(n-1) & \sqrt{\gamma} p(n-1) \\
\lambda R^*(n) & \sqrt{\gamma} p(n-1) \\
\end{pmatrix}
\begin{pmatrix}
\mathbf{Q}(n) \\
0 \\
\end{pmatrix}
\begin{pmatrix}
\mathbf{R}^*(n) \\
0 \\
\end{pmatrix}
\begin{pmatrix}
\mathbf{u}^*(n) \\
\mathbf{d}^*(n) \\
\end{pmatrix}
\]

where \( \mathbf{Q}(n) \) is an unitary rotation matrix, \( \mathbf{R}(n) \) is an upper triangular matrix with \( \mathbf{R}^*(n)\mathbf{R}(n) = \mathbf{U}^*(n)\mathbf{U}(n) \), \( \mathbf{U}(n) = [\mathbf{u}(n), \sqrt{\lambda} \mathbf{u}(n-1), \ldots, \sqrt{\lambda} \mathbf{u}(0)]^* \). The vector \( \mathbf{u}(n) \) contains the data samples in the equalizer at time \( n \), \( \mathbf{d}^*(n) \) is the estimated transmitted symbol, \( \gamma(n) \) the equalized response, \( e(n) \) the a-priori error, \( \gamma(n) \) the conversion factor, \( \lambda \) the forgetting factor, and \( p(n) \) is the cross-correlation vector of \( d(n) \) and \( u(n) \) premultiplied by \( \mathbf{R}^*(n) \). The \( \cdot^* \) denotes Hermitian transpose, the \( \cdot^\dagger \) denotes the inverse of the Hermitian transpose. With the help of the second column in (1), we can calculate \( \gamma(n) \) and then update \( p(n) \) with the help of \( d'(n) \).

Equation (1) can be proven with the help of the following identity 11:

\[
\mathbf{A}^\dagger \mathbf{B} = \mathbf{C}^\dagger \mathbf{D} \quad \text{for} \quad [\mathbf{AB}] = [\mathbf{CD}] \quad (2)
\]

with properly dimensioned matrices \( \mathbf{A,B,C,D} \) and a unitary matrix \( \mathbf{Q} \). The prove of the first and last column of (1) is straightforward 11], whereas for the second column, it is slightly more complicated:

Using the following relation,

\[
\mathbf{Q}(n) \begin{pmatrix}
\mathbf{0} \\
1 \\
\end{pmatrix} = \begin{pmatrix}
\mathbf{R}^*(n)\mathbf{u}(n) \\
\gamma(n) \\
\end{pmatrix} \quad (3)
\]

we derive from (2) the well-known equation for the conversion factor 11,

\[
\gamma^2(n) = 1 - \mathbf{u}^*(n)[\mathbf{U}^*(n)\mathbf{U}(n)]^{-1}\mathbf{u}(n) \quad (4)
\]

Using (2) and (3) and the second column of (1) as

\[
\begin{pmatrix}
\mathbf{R}^*(n)\mathbf{u}(n) \\
\gamma(n) \\
\end{pmatrix}^* \begin{pmatrix}
\lambda \mathbf{R}^*(n)\mathbf{R}^*(n-1)p(n-1) \\
-\gamma(n)\gamma^*(n) \\
\end{pmatrix}
\]

results in

\[
0 = \lambda \mathbf{u}^*(n)\mathbf{R}^*(n)\mathbf{R}^*(n-1)p(n-1) + \gamma^2(n)\gamma^*(n) \quad (5)
\]

With the help of (4) and the relations

\[
\mathbf{R}^*(n)\mathbf{R}(n) = \mathbf{U}^*(n)\mathbf{U}(n) = \lambda\mathbf{U}^*(n-1)\mathbf{U}(n-1) + \mathbf{u}^*(n)\mathbf{u}(n) \quad (6)
\]

the final result is

\[
0 = \mathbf{u}^*(n)\mathbf{w}(n-1) - \gamma(n)\gamma^*(n) \quad (9)
\]

with the (complex conjugate) coefficient vector \( \mathbf{w}(n) \), showing that matrix relation (1) is true. 

2.2 Other QRD-based RLS-Algorithms

The complexity of the QR-RLS-algorithm is \( O(N^2) \). Taking the shift invariance of the incoming data samples into account, a variety of fast least squares algorithms can be created. These algorithms use forward and backward predictions to reduce complexity to \( O(N) \).

They can be divided into two classes: order recursive or lat-
tice algorithms and fixed order algorithms. For DFEs, multichannel algorithms have to be used. The number of channels equals the number of forward or backward prediction values to be calculated within one symbol period. Thus two channels are required for a BS-DFE and three for a FS-DFE with a doubly sampled FFF. Again, the method of QRD can be applied to improve stability. In [11], a QRD-based Multichannel-Least-Squares-Lattice (QR-MLSL)-algorithm is derived, and in [1] a multichannel fixed order QRD-based Fast-Least-Squares (QR-FLS)-algorithm is presented. Again, because the transmitted symbol is not given a-priori, we have to extend these QRD-based algorithms as done in (1).

3. IMPLEMENTATION ISSUES

The standard implementation of a complex multiplication requires four multipliers and two adders:

\[(a + jb)(c + jd) = (ac - bd) + j(ad + bc)\]

However, it is possible to reduce the number of multipliers to three via strength reduction [10]:

\[
\begin{align*}
(a - b)d + a(c - d) &= ac - bd \\
(a - b)d + b(c + d) &= ad + bc
\end{align*}
\]

Since multipliers are typically more expensive than adders, chip area is reduced at the cost of an increased critical path length by the delay of one adder.

To avoid square-roots, a square-root free version of the GR can be implemented [4] that also reduces the number of multiplications significantly.

In Table 1, the number of (real-valued) multiplications (MUL) and divisions (DIV) of three square-root free QRD-based RLS-algorithms with strength reduction are compared. The multiplication with the forgetting factor can be implemented using a shift-operation and an addition.

In Fig. 3 the number of multiplications plus ten times the number of divisions (assuming a higher implementation cost of divisions) per iteration of the different RLS algorithms are compared for a DFE with equal length feedforward and feedback filters. It turns out that for an equalizer length of less than about 12 taps (BS-DFE) or 18 taps (FS-DFE), the QR-RLS-algorithm indicates lowest complexity. The use of an equalizer with more taps is not recommended, since this decreases throughput due to a longer training sequence.

The QR-RLS-algorithm can be implemented using a systolic structure, Fig. 4, with the processing elements as defined in Figs. 5 and 6. Each cell is storing and updating one value of \(R(n)\) or \(p(n)\).

In Fig. 3 the number of multiplications plus ten times the number of divisions (assuming a higher implementation cost of divisions) per iteration of the different RLS algorithms are compared for a DFE with equal length feedforward and feedback filters. It turns out that for an equalizer length of less than about 12 taps (BS-DFE) or 18 taps (FS-DFE), the QR-RLS-algorithm indicates lowest complexity. The use of an equalizer with more taps is not recommended, since this decreases throughput due to a longer training sequence.

The QR-RLS-algorithm can be implemented using a systolic structure, Fig. 4, with the processing elements as defined in Figs. 5 and 6. Each cell is storing and updating one value of \(R(n)\) or \(p(n)\).

The processing cells on the r.h.s. of Fig. 4 have a double function: First, they are fed with a Zero to calculate the equalized output \(y(n)\) and second, the cells are updated with the estimated transmitted symbol \(d'(n)\). The black cells need not to be implemented. Their function can be taken over by the white cells. Because of the feedback loop full pipelining can not be applied. However, the columns of the processing array in Fig. 4 can be pipelined, whereas the rows cannot.
The implementation complexity of a (4,4)-BS-DFE and a (8,4)-FS-DFE, i.e. 4 or 8 taps in the feedforward section, respectively, and 4 taps in the feedback section for the transmission of QPSK-modulated 26 Mb/s was estimated. The requirements of such an algorithm result in a resolution of the Analog-to-Digital-Converter (ADC) of 9 bit @ 13 or 26 MHz sampling rate, respectively, and a gate count of the QR-RLS-equalizer of 400,000 or 800,000, respectively, with a 0.35 μm-technology. The optimization criterion was a 3 dB-loss of system performance with finite wordlengths and fixed-point arithmetic compared with a 64 bit floating-point arithmetic. The maximum internal wordlength was 13 bit.

4. SUMMARY

This paper showed that the QR-RLS-algorithm as introduced here is a good option for a DFE of a wireless ATM system. It can be implemented on a processor array. However, the feedback loop impedes full pipelining and systolic operation. The hardware complexity of a 26 Mb/s-DFE is huge, but feasible with today’s technology.

Future work in this scope will compare the complexity and performance of QR-RLS- and QR-MLSL-algorithms under finite wordlength constraints. Another topic will be the analysis of normalized QRD-based RLS-algorithms. Some of the internal variables grow more and more while others get smaller and smaller. By normalizing them, one could expect less stringent minimum wordlength requirements, resulting in less chip area.

ACKNOWLEDGEMENT

Supported by: The German Federal Ministry of Education, Science, Research, and Technology within the project "ATMmobil" and the Siemens AG, Munich, Germany.

REFERENCES