Project 5: Audio Compression
In this project, we will implement an audio compression algorithm using the Modified Discrete Cosine Transform (MDCT) and its inverse transform.
The MDCT and inverse
Let $n$ be an even positive integer. The Modified Discrete Cosine Transform of $x=(x_0,...,x_{2n-1})^T$ is the $n$-dimensional vector \[ y = Mx, \] where $M$ is the $n \times 2n$ matrix \[M_{ij}= \sqrt{\frac{2}{n}}\cos{\frac{(i+\frac{1}{2})(j+\frac{n}{2} + \frac{1}{2}) \pi}{n}} \] for $0\le j \le n -1$ and $0 \le j \le 2n-1$.
Since M is $n\times 2n$, it is not square and therefore non-invertible. However, if the length 2n vectors are overlapped and averaged in the intersecting region, we can effectively create $n \times n$ chunks and get around this limitation. Theorem 11.12 summarizes, reproduced here for convenience.
Theorem 11.12: Let M be the $n \times 2n$ matrix described above, and $N = M^T$. Let $u_1,u_2,u_3$ be $n$-vectors and set $v_1 = M\begin{bmatrix} u_1\\u_2\\ \end{bmatrix}$ and $v_2 = M\begin{bmatrix} u_2\\u_3\\ \end{bmatrix}$ Then the n-vectors $w_1 , w_2 , w_3 , w_4$ defined by \[ \begin{bmatrix} w_1\\w_2\\ \end{bmatrix} = N v_1 and \begin{bmatrix} w_3\\w_4\\ \end{bmatrix} = N v_2 \] satisfy $u_2 = \frac{1}{2} (w_2 + w_3)$.
This exactly reconstructs the matrix.
Of course, in practice we are not interested in exactly reconstructing the matrix, but instead want to simplify our data so that it can be more easily stored. Bit quantization is one technique used to compress files by rounding datapoints. Methods like this are referred to as "lossy compression"; generally there is a trade-off between file size and audio quality. The effects of bit quantization on the audio quality will be described in the next few parts.
In this project, we will implement an audio compression algorithm using the Modified Discrete Cosine Transform (MDCT) and its inverse transform.
The MDCT and inverse
Let $n$ be an even positive integer. The Modified Discrete Cosine Transform of $x=(x_0,...,x_{2n-1})^T$ is the $n$-dimensional vector \[ y = Mx, \] where $M$ is the $n \times 2n$ matrix \[M_{ij}= \sqrt{\frac{2}{n}}\cos{\frac{(i+\frac{1}{2})(j+\frac{n}{2} + \frac{1}{2}) \pi}{n}} \] for $0\le j \le n -1$ and $0 \le j \le 2n-1$.
Since M is $n\times 2n$, it is not square and therefore non-invertible. However, if the length 2n vectors are overlapped and averaged in the intersecting region, we can effectively create $n \times n$ chunks and get around this limitation. Theorem 11.12 summarizes, reproduced here for convenience.
Theorem 11.12: Let M be the $n \times 2n$ matrix described above, and $N = M^T$. Let $u_1,u_2,u_3$ be $n$-vectors and set $v_1 = M\begin{bmatrix} u_1\\u_2\\ \end{bmatrix}$ and $v_2 = M\begin{bmatrix} u_2\\u_3\\ \end{bmatrix}$ Then the n-vectors $w_1 , w_2 , w_3 , w_4$ defined by \[ \begin{bmatrix} w_1\\w_2\\ \end{bmatrix} = N v_1 and \begin{bmatrix} w_3\\w_4\\ \end{bmatrix} = N v_2 \] satisfy $u_2 = \frac{1}{2} (w_2 + w_3)$.
This exactly reconstructs the matrix.
Of course, in practice we are not interested in exactly reconstructing the matrix, but instead want to simplify our data so that it can be more easily stored. Bit quantization is one technique used to compress files by rounding datapoints. Methods like this are referred to as "lossy compression"; generally there is a trade-off between file size and audio quality. The effects of bit quantization on the audio quality will be described in the next few parts.
Part 1: MDCT with simple signal
For the first part of this project, I modified the code provided in our text on pg. 528-529 to read in and compress a sinusoidal signal. Project5_01.m , simplecodec.m
In general, it appears that odd multiples of frequency perform better than even multiples, as demonstrated by the error plot below. I suspect this is related to the fact that cosine is an even function.
Part 2: Applying a windowing function
The MDCT takes chunks of the audio matrix, or windows, (in our case each is 64 samples long) and compresses each window separately. There is a problem with this, however; there is no guarantee that the chunks of audio will match in amplitude after the compression has been carried out. To remedy this, we apply a windowing function, which makes the amplitude go to zero at the endpoints of each window. This ensures that the windows will always be continuous as their endpoints, reducing clicks and hums. I used the following windowing function:
\[ h_i = \sqrt{2} \sin{\frac{(i - \frac{1}{2})\pi}{2n}} \]
\[ h_i = \sqrt{2} \sin{\frac{(i - \frac{1}{2})\pi}{2n}} \]
Each $h_i$ is multiplied by $x_i$ , and the $w_2$ and $w_3$ vectors of the inverse MDCT are multiplied component-wise by the second half and first half of $h_i$ , respectively.
Overall, this method produces better sound at the same bit rate. Interestingly, after the windowing function is applied the even multiples of the frequency are less error-prone than the odd multiples, the opposite of what existed before the windowing function was applied. Code: windowcodec.m
Overall, this method produces better sound at the same bit rate. Interestingly, after the windowing function is applied the even multiples of the frequency are less error-prone than the odd multiples, the opposite of what existed before the windowing function was applied. Code: windowcodec.m
Part 3: A minor point (arts elective portion)
In this section, I construct a basic chord to perform my analysis. To do this, I use the fact that certain integer multiples of a base frequency, added together, make pleasing intervals and chords. For instance, to construct an A-minor chord, we start with a base frequency of 44 Hz. \[ 10\times 44 Hz + 12\times 44Hz + 15\times 44 Hz = A + C + E \]. Code: Project5_03.m
Uncompressed audio:
Uncompressed audio:
Audio without windowing (bit rate = 4): Audio with windowing (bit rate = 4)
Part 4: Music compression
The next step is to get some "real" sound in, and play with that. For that purpose, here is an audio excerpt (from a final project for MUSI 254). It was originally mixed down to an .mp3 file, but I represent it here as a .wav for consistency.
I now proceed to destroy it, by reducing the bit rate.
Windowing, bit rate = 2: No windowing, bit rate = 2:
Windowing, bit rate = 2: No windowing, bit rate = 2:
At bit rate = 6, it doesn't sound so bad. I find that RMS errors of less than $10^{-3}$ are difficult to hear.
Windowing, bit rate = 6: