Abstract
The two-dimensional discrete wavelet transform (DWT) can be found in the heart of many image-processing algorithms. Until recently, several studies have compared the performance of such transform on parallel architectures. All these studies however considered only separable calculation schemes (the convolution and lifting schemes). These two schemes are however only some out of many possible schemes suitable for parallel architectures. This page introduces and discusses several new non-separable schemes for computation of the 2-D DWT. The non-separable schemes outperform their separable counterparts in cases where an exchange of intermediate results is expensive.
Filters and Polynomials

The widely-used $z$-transform notation is used for the description of underlying wavelet filters. The $z$-transform of a FIR filter $\left( g_k \right)$ is a Laurent polynomial $G(z)$ given by $$G(z) = \sum_{k} \, g_k \, z^{-k} \text{.}$$ Similarly, the transfer function of the two-dimensional filter $\left( g_{k_m,k_n} \right)$ is a polynomial $$G(z_m,z_n) = \sum_{k_m} \sum_{k_n} \, g_{k_m,k_n} \, z_m^{-k_m} z_n^{-k_n} \text{,}$$ where the subscript $m$ refers to the horizontal axis and $n$ to the vertical one. The $G^*(z_m,z_n) = G(z_n,z_m)$ indicates a polynomial transposed to $G(z_m,z_n)$. In order to keep the notation readable, a shortened notation G (in the upright font) is only written instead of the full notation $G(z_m,z_n)$. The shortened notation is used for the matrices of polynomials as well.

The discrete wavelet transform splits the input signal into two components according to a parity of its samples. The forward transform can be computed by a pair of quadrature mirror filters ($\mathrm{G}_0, \mathrm{G}_1$) followed by subsampling by a factor of two. Such transform can also be represented by the polyphase matrix $$\begin{bmatrix} \zeta^{\phantom{+1}} & 0 \\ 0 & \zeta^{-1} \end{bmatrix} \, \begin{bmatrix} \odd{\mathrm{G}_1} & \even{\mathrm{G}_1} \\ \odd{\mathrm{G}_0} & \even{\mathrm{G}_0} \end{bmatrix} \text{,}$$ where operators $\even{\mathrm{G}}$ and $\odd{\mathrm{G}}$ denote the even and odd terms of $\mathrm{G}$, respectively. The lifting scheme decomposes such a convolution into a sequence of short filters. The filters are referred to as the lifting step. The initial polyphase matrix above is factored into $K$ pairs of lifting steps. In each such pair, the first step is referred to as the predict and the second one as the update. An implementation can be represented by the product of polyphase matrices $$\begin{bmatrix} \zeta^{\phantom{+1}} & 0 \\ 0 & \zeta^{-1} \end{bmatrix} \left\{ \prod_k \begin{bmatrix} 1 & \U{}^{(k)} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \P{}^{(k)} & 1 \end{bmatrix} \right\} \text{,}$$ for $K > k > 0$ (descending), where $\zeta$ is some scaling constant, $\P{}^{(k)}, \U{}^{(k)}$ represent the $k$th predict and update filters. The superscript $(k)$ and also the very last scaling step are omitted in the following text.

On parallel architectures, the processing of single or several adjacent signal samples is mapped to independent processing units. To avoid race conditions, these units must use some synchronization method (memory barriers). Considering the lifting scheme, these synchronizations can be required before the lifting steps.

The 2-D transform is defined as the tensor product of 1-D transforms. Consequently, the single-scale transform splits the input signal into a quadruple of wavelet coefficients, illustrated by the colored balls in the figures below.

Non-Separable Convolution Scheme
The non-separable convolution scheme acts a similar role to the separable convolution. However, all horizontal and vertical calculations are performed in a single step. For all pairs of lifting steps, the non-separable convolution can formally be described as $$\NN \text{,}$$ where $\NN$ is a product of 1-D transforms in horizontal and vertical directions. The drawback of this method is the highest number of operations. For example, for the CDF 9/7 wavelet, these filters are of sizes $9\times9$, $7\times9$, $9\times7$, and $7\times7$.
Non-Separable Polyconvolution Scheme The non-separable convolution can be broken into smaller units, referred to as the non-separable polyconvolutions. For a single pair of lifting steps, the scheme follows the mapping $$\begin{bmatrix} \V^*\V & \V^*\U & \U^*\V & \U^*\U \\ \V^*\P & \V^* & \U^*\P & \U^* \\ \P^*\V & \P^*\U & \V & \U \\ \P^*\P & \P^* & \P & 1 \\ \end{bmatrix} \text{,}$$ where $\V = \P\U + 1$ is an auxiliary polynomial. For the CDF wavelets, the scheme is graphically illustrated in the figure. Note that the employed filters are of sizes $5\times5$, $3\times5$, $5\times3$, and $3\times3$. Thus, compared to the non-separable convolution, only half of the operations are required specifically for the CDF 9/7 wavelet.
Non-Separable Lifting Scheme The non-separable lifting scheme is formed by combining the corresponding horizontal and vertical steps of the two-dimensional separable lifting scheme. The number of steps is halved, whereas the number of operations has slightly been increased. The scheme consists of a spatial predict and spatial update step, thus two steps in total for each pair of the original lifting steps. For each pair of $\P$ and $\U$, the scheme is defined by $$\begin{bmatrix} 1 & \U & \U^* & \U\U^* \\ 0 & 1 & 0 & \U^* \\ 0 & 0 & 1 & \U \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \, \begin{bmatrix} 1 & 0 & 0 & 0 \\ \P & 1 & 0 & 0 \\ \P^* & 0 & 1 & 0 \\ \P\P^* & \P^* & \P & 1 \\ \end{bmatrix} \text{.}$$ For the CDF wavelets, the scheme is graphically illustrated in the figure.
Non-Separable Explosion Scheme The explosion scheme can be computed in three steps per each predict and update pair of the original lifting steps. All these steps have a form of an "explosion" with respect to output components. The scheme is described for each pair of $\P$ and $\U$ by $$\begin{bmatrix} 1 & 0 & 0 & \U\U^* \\ 0 & 1 & 0 & \U^* \\ 0 & 0 & 1 & \U \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \, \begin{bmatrix} 1 & \U & \U^* & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \P^* & \P & 1 \\ \end{bmatrix} \, \begin{bmatrix} 1 & 0 & 0 & 0 \\ \P & 1 & 0 & 0 \\ \P^* & 0 & 1 & 0 \\ -\P\P^* & 0 & 0 & 1 \\ \end{bmatrix} \, \text{.}$$ For the CDF wavelets, the scheme is graphically illustrated in the figure.
Non-Separable Implosion Scheme The implosion scheme consists of three spatial lifting steps. Similarly to the above-described non-separable schemes, it is not possible to distinguish the vertical and horizontal steps. In terms of multiply–accumulate operations, all of these steps have the form of an "implosion" with respect to target subbands. This scheme is defined for a single pair of lifting steps by $$\begin{bmatrix} 1 & \U & \U^* & -\U\U^* \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \, \begin{bmatrix} 1 & 0 & 0 & 0 \\ \P & 1 & 0 & \U^* \\ \P^* & 0 & 1 & \U \\ 0 & 0 & 0 & 1 \\ \end{bmatrix} \, \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \P\P^* & \P^* & \P & 1 \\ \end{bmatrix} \text{.}$$ For the CDF wavelets, the scheme is graphically illustrated in the figure.
References