Fast/SIMD Linear/Trigonometric Math
Notes
Digital Signal Processing (DSP) is usually applied to streams, i.e. processed in signal blocks consisting of (complex) floating-point samples, e.g. with block sizes of 256 .. 2048 samples. In the lab, a big signal file might be processed at once - similar to Octave/Matlab/NumPy/SciPy - but not when it comes to responsive realtime processing.
Operations
The main operations are mixing and filtering.
Mixing
Simply said: multiplication of the input stream with a complex (quadrature) sine carrier. Generation of the carrier $c(n)$ involves the trigonometric functions $\sin()$ and $\cos()$, and, of course, a complex multiplication is required:
$y(n) = x(n) \cdot c(n)$
with the input signal $x(n)$,
the complex carrier $c(n) = \exp(n \cdot i \cdot \Delta\varphi) = \cos(n \cdot \Delta\varphi) + i \cdot \sin(n \cdot \Delta\varphi)$
and $\Delta\varphi = \frac{shift\_frequency}{samplerate} \cdot 2 \pi$.
An unbounded phase $\phi(n) = n \cdot \Delta\varphi$ reduces the accuracy of $\cos()$ and $\sin()$. $\phi(n)$ should always be wrapped to the range $[ -\pi ; +\pi ]$ when used as an argument to trigonometric functions:
$\phi(n) := \operatorname{wrapToPi}(\phi(n-1) + \Delta\varphi)$
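A one-liner sketch of $\operatorname{wrapToPi}()$ - `std::remainder()` already reduces modulo $2\pi$ into $[-\pi; +\pi]$:

```cpp
#include <cmath>

// Wrap a phase to [-pi, +pi]: std::remainder(x, y) returns x - n*y
// with n the integer nearest to x/y, i.e. a result in [-y/2, +y/2].
inline double wrapToPi(double phi) {
    constexpr double two_pi = 6.283185307179586;
    return std::remainder(phi, two_pi);
}
```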
Calculating $c(n)$ doesn't need $\sin()$ and $\cos()$ for every sample $n$ when the calculation is changed to the recurrence
$c(n) := c(n-1) \cdot r$
with complex rotation constant $r := \exp(i \cdot \Delta\varphi) = \cos(\Delta\varphi) + i \cdot \sin(\Delta\varphi) $.
Thus, the calculation of $c(n)$ needs only one complex multiplication per sample. A side effect is that $\operatorname{wrapToPi}()$ isn't necessary in the complex plane of $c(n) \in \mathbb{C}$.
Because of error propagation and accumulated rounding errors, $c(n)$ needs to be renormalized every $M$ samples, so that $|c(n)| = 1$ holds: $c'(n) := \frac{c(n)}{|c(n)|}$ for every $n = k \cdot M$ with $k \in \mathbb{N}$.
For streaming, the generated carrier's phase needs to be continuous across block boundaries, i.e. it has to continue the phase of the previous block.
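A minimal sketch of such a recursive carrier generator (NCO); the names and the renormalization interval M are illustrative choices. Since the state persists between calls, phase continuity across blocks comes for free:

```cpp
#include <cmath>
#include <complex>

// Recursive quadrature carrier: one complex multiplication per sample,
// renormalized every M samples to keep |c(n)| = 1.
struct CarrierGen {
    std::complex<float> c{1.0f, 0.0f};  // c(n-1), starts at phase 0
    std::complex<float> r;              // rotation constant exp(i*dphi)
    int count = 0;
    static constexpr int M = 1024;      // renormalization interval

    CarrierGen(float shift_frequency, float samplerate) {
        const float dphi = shift_frequency / samplerate * 2.0f * 3.14159265f;
        r = { std::cos(dphi), std::sin(dphi) };
    }

    std::complex<float> next() {
        c *= r;                         // c(n) = c(n-1) * r
        if (++count >= M) {             // counter the error accumulation:
            c /= std::abs(c);           // c'(n) = c(n) / |c(n)|
            count = 0;
        }
        return c;
    }
};
```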
Filtering
Filtering involves a filter design - needed only once - for the infinite impulse response (IIR) and finite impulse response (FIR) filter types.
FIR type filters need convolution to apply the filter. The convolution is applied to the input stream (in the time domain).
Fast convolution is performed utilizing the fast Fourier transform (FFT), with algorithms like overlap-add or overlap-save (a.k.a. overlap-scrap).
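A minimal sketch of direct (time-domain) FIR filtering with the filter state carried across block boundaries; for fast convolution, an FFT-based overlap-add/save implementation would replace this inner loop. Names are illustrative, and `n >= taps.size() - 1` is assumed:

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR filter on a block of real samples. 'state' keeps
// the last taps.size()-1 input samples, so the convolution is
// continuous across block boundaries.
void fir_filter(const std::vector<float>& taps,
                std::vector<float>& state,   // size: taps.size() - 1
                const float* in, float* out, size_t n)
{
    const size_t K = taps.size();
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k) {
            // x(i-k): from this block if i >= k, else from the
            // saved tail of the previous block
            const float x = (i >= k) ? in[i - k]
                                     : state[state.size() - (k - i)];
            acc += taps[k] * x;
        }
        out[i] = acc;
    }
    // keep the last K-1 samples for the next call
    for (size_t k = 0; k < state.size(); ++k)
        state[k] = in[n - state.size() + k];
}
```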
Down-conversion and decimation
Mixing the desired signal's (high) frequency down to, or near, zero is called down-conversion.
In addition, an unnecessarily high samplerate is reduced by lowpass filtering and decimation.
Decimation means keeping only every N-th sample and discarding the other N-1. To avoid aliasing from this decimation, a low-pass filter has to be applied beforehand.
There are several algorithms for an optimized combination of low-pass filter and decimation, e.g. half-band filters or cascaded integrator-comb (CIC) structures, to name a few.
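A simplified sketch of fusing the low-pass FIR with decimation by a factor N, so the filter only runs at the (lower) output rate; cross-block state handling is omitted for brevity:

```cpp
#include <cstddef>
#include <vector>

// Combined low-pass FIR and decimation by N: the convolution is only
// evaluated for every N-th input sample. Assumes n >= taps.size().
size_t fir_decimate(const std::vector<float>& taps, int N,
                    const float* in, size_t n, float* out)
{
    size_t m = 0;
    for (size_t i = taps.size() - 1; i < n; i += N) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps.size(); ++k)
            acc += taps[k] * in[i - k];
        out[m++] = acc;
    }
    return m;   // number of output samples
}
```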
Spectrum (plot)
Usually, a power (density) spectrum is displayed, showing the logarithmic spectrum of the input, intermediate or result signals. This involves windowing, the FFT and the power $\operatorname{norm}(z) = \operatorname{Re}(z)^2 + \operatorname{Im}(z)^2$, alternatively calculated as $\operatorname{norm}(z) = z \cdot \operatorname{conj}(z) = z \cdot {z}^\ast$ with the complex FFT bins $z$. Then, the power is converted to a level with the base-10 logarithm ($\log_{10}()$) and scaling (scalar multiplication).
In short: $10 \cdot \log_{10}(\operatorname{norm}(\operatorname{FFT}(x \cdot window)))$
For levels on a plot, a precision and accuracy of 0.1 dB is usually sufficient.
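A sketch of the whole chain; the in-place fft() is a placeholder for whatever FFT library is used (e.g. FFTW, KissFFT):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Placeholder: in-place complex FFT, wrapped behind this signature
// by whichever FFT library is actually used.
void fft(std::vector<std::complex<float>>& buf);

// Power spectrum in dB: window, FFT, norm(), then 10*log10().
std::vector<float> power_spectrum_db(const std::vector<std::complex<float>>& x,
                                     const std::vector<float>& window)
{
    std::vector<std::complex<float>> buf(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        buf[i] = x[i] * window[i];                   // windowing
    fft(buf);                                        // transform
    std::vector<float> db(buf.size());
    for (size_t i = 0; i < buf.size(); ++i)          // norm() = Re^2 + Im^2
        db[i] = 10.0f * std::log10(std::norm(buf[i]) + 1e-20f);
    return db;
}
```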
FM: Frequency modulation
FM demodulation requires a complex multiplication with the previous sample's conjugate and the complex argument $\arg(z) = \operatorname{atan2}( \operatorname{Im}(z), \operatorname{Re}(z) )$:
$y(n) = \arg\{ z(n) \cdot \operatorname{conj}( z(n-1) ) \}$.
Alternatively $y(n) = \operatorname{diff}( \arg(z) ) = \arg(z(n)) - \arg(z(n-1))$
plus wrapping the resulting phase values back to the range $[ -\pi ; +\pi ]$:
$y(n) = \operatorname{wrapToPi}( \operatorname{diff}( \arg(z) ) )$
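A minimal sketch of the first variant. std::arg() is exactly the atan2-based argument above, and it already returns values in $[-\pi; +\pi]$, so no extra wrapping is needed:

```cpp
#include <complex>

// FM discriminator: phase difference between consecutive samples,
// y(n) = arg( z(n) * conj(z(n-1)) ).
struct FmDemod {
    std::complex<float> prev{1.0f, 0.0f};
    float demod(std::complex<float> z) {
        const float y = std::arg(z * std::conj(prev));
        prev = z;
        return y;
    }
};
```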
AM: Amplitude modulation
AM demodulation, with a simple envelope detector, requires a complex multiplication with the same sample's conjugate and a square root ($\operatorname{sqrt}()$):
$y(n) = |z(n)| = \operatorname{abs}(z) = \sqrt{ \operatorname{norm}(z) } = \sqrt{ z \cdot {z}^\ast } = \sqrt{ z \cdot \operatorname{conj}(z) } = \sqrt{ \operatorname{Re}(z)^2 + \operatorname{Im}(z)^2 }$
Then, filter $y(n)$ with a high-pass filter to remove the DC component ..
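A minimal sketch of the envelope detector, followed by a simple one-pole DC blocker as the high pass; the pole coefficient 0.999 is an illustrative choice that sets the cutoff:

```cpp
#include <cmath>
#include <complex>

// AM envelope detector plus DC blocker:
// y(n) = x(n) - x(n-1) + a * y(n-1), with x(n) = |z(n)|.
struct AmDemod {
    float x1 = 0.0f, y1 = 0.0f;                // DC-blocker state
    float demod(std::complex<float> z) {
        const float env = std::sqrt(std::norm(z));   // |z|
        const float y = env - x1 + 0.999f * y1;      // high pass
        x1 = env;
        y1 = y;
        return y;
    }
};
```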
Speed
Single memory pass
Some operations need to be applied consecutively to the stream. Applying the operations sequentially to the blocks requires one pass over the memory for each operation. Depending on the block size (number of samples) and the CPU's cache sizes, this might get quite slow - due to the limited memory bandwidth - compared to calculating all operations in one single pass. But this requires combined functions for the necessary operations ..
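A sketch of such a combined function: mixing and magnitude calculation fused into one loop, so every sample is touched once while it sits in registers - one pass over memory instead of two:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>

// Fused mixing + magnitude in a single pass. The caller keeps the
// carrier state c and the rotation constant r = exp(i*dphi).
void mix_and_magnitude(const std::complex<float>* in, size_t n,
                       std::complex<float>& c,   // carrier state c(n-1)
                       std::complex<float> r,    // rotation constant
                       float* mag_out)
{
    for (size_t i = 0; i < n; ++i) {
        c *= r;                                  // advance carrier
        mag_out[i] = std::abs(in[i] * c);        // mix and take |.|
    }
    c /= std::abs(c);                            // renormalize per block
}
```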
Reduce precision and accuracy
Another optimization is not to pay for unnecessary precision/accuracy. This allows using different approximation levels, e.g. for math functions like $\log_{10}()$ or $\arctan()$, because the spectrum display won't need an accuracy better than 0.5 dB. Much simpler is using/preferring single-precision 'float' over 'double'. Half-precision might be a choice for a small dynamic range.
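As an example of trading accuracy for speed: a crude $\log_2$ via the IEEE-754 bit pattern (the same idea as fastapprox's fasterlog2, listed below). Its error of a few hundredths in $\log_2$ corresponds to roughly 0.1 dB after scaling - fine for a display:

```cpp
#include <cstdint>
#include <cstring>

// Crude log2(x) for x > 0: the float's bit pattern, scaled by 2^-23,
// is exponent+127 plus a linear approximation of log2 of the
// mantissa; the constant absorbs the bias and minimizes the error.
inline float fast_log2(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (float)bits * (1.0f / (1 << 23)) - 126.94269504f;
}

// 10*log10(p) = 10*log10(2) * log2(p)
inline float fast_db(float power) {
    return 3.0102999566f * fast_log2(power);
}
```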
SIMD
Of course, not to be missed are the SIMD (single instruction, multiple data) capabilities of modern processors, like the SSE or AVX instruction sets on x86/x64 or NEON on the ARM architecture. As of 2022, C/C++ compilers still have problems generating these vector instructions through automatic vectorization. Assembler or compiler intrinsic functions can be used to enforce their usage ..
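As a small taste, two complex float multiplications at once with SSE3 intrinsics (interleaved re/im layout, as in an array of std::complex<float>); compile with -msse3 or equivalent:

```cpp
#include <pmmintrin.h>  // SSE3

// Multiply two pairs of complex floats: a, b hold [re0, im0, re1, im1].
// Per lane pair: re = a.re*b.re - a.im*b.im, im = a.re*b.im + a.im*b.re
static inline __m128 cmul_sse3(__m128 a, __m128 b) {
    const __m128 re  = _mm_moveldup_ps(a);   // [a0.re, a0.re, a1.re, a1.re]
    const __m128 im  = _mm_movehdup_ps(a);   // [a0.im, a0.im, a1.im, a1.im]
    const __m128 bsw = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 3, 0, 1)); // swap re/im
    return _mm_addsub_ps(_mm_mul_ps(re, b), _mm_mul_ps(im, bsw));
}
```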
Fixed Point Arithmetic
For ancient processors without any floating point unit (FPU), or slightly more recent but also outdated SIMD units without floating point support, fixed point arithmetic might be interesting. Many microcontrollers also lack an FPU, so the only alternative to a slow floating point emulation library is fixed point arithmetic. You need to be very careful about the varying precision of the operations. Of course, all value ranges must be well determined.
Having to program the math operations manually restricts this approach to simple operations - due to its complexity. Any operation beyond comparison, addition, subtraction and multiplication will be quite complex.
Furthermore, I'm missing a good, freely usable C++ template library for fixed point arithmetic - both with and without SIMD.
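A minimal sketch of the kind of primitive such code needs everywhere, a Q1.15 fractional multiply:

```cpp
#include <cstdint>

// Q1.15 multiply: operands represent values in [-1, +1). The 32-bit
// intermediate is a Q2.30 product, rounded and shifted back to Q1.15.
// Corner case: -1.0 * -1.0 overflows the representable range.
inline int16_t q15_mul(int16_t a, int16_t b) {
    const int32_t p = (int32_t)a * (int32_t)b;   // Q2.30
    return (int16_t)((p + (1 << 14)) >> 15);     // round, rescale
}
```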
Reproducibility
All operations should be exactly reproducible. That is quite important for debugging and for regression/unit-tests.
When generating noisy (AWGN) test signals, it's advisable to use a reproducible pseudo random number generator (PRNG) with a known and adjustable seed value.
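A sketch using the C++ <random> facilities. Caveat: std::normal_distribution is not guaranteed to produce bit-identical sequences across standard library implementations, so strict cross-platform reproducibility may need a hand-rolled Box-Muller transform on top of the seeded PRNG:

```cpp
#include <complex>
#include <cstdint>
#include <random>
#include <vector>

// Reproducible complex AWGN: a std::mt19937 with a fixed, adjustable
// seed regenerates the identical test signal on every run.
std::vector<std::complex<float>> make_awgn(std::size_t n, float sigma,
                                           std::uint32_t seed = 42)
{
    std::mt19937 rng(seed);
    std::normal_distribution<float> gauss(0.0f, sigma);
    std::vector<std::complex<float>> noise(n);
    for (auto& z : noise)
        z = { gauss(rng), gauss(rng) };   // independent I and Q noise
    return noise;
}
```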
Target
This page should collect links and references to relevant libraries and notes. Ideally, we would develop a benchmark for some operations.
SIMD Libraries
see SIMD
Fast atan2f
- mazzo.li
  - Speeding up atan2f by 50x (2021-08-16)
  - vectorized-atan2 (2021-08-16)
- Morlocks' Fun with AVX Blog
  - Branch-Free Blend() '(a < b) ? x : y'
  - Fast Arc Tangent polynomial approximation
- Hacker News
- dsprelated
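For orientation, a sketch of the typical structure such implementations use: octant reduction plus a small polynomial on $[0, 1]$. The coefficients below follow the well-known approximation by Rajan, Wang, Inkol and Joyal (max. error about 0.0015 rad); everything here is illustrative, not taken from the linked articles:

```cpp
#include <algorithm>
#include <cmath>

// Polynomial approximation of atan(x) on [0, 1], max error ~0.0015 rad.
static inline float atan_poly(float x) {
    return 0.78539816f * x
         - x * (std::fabs(x) - 1.0f) * (0.2447f + 0.0663f * std::fabs(x));
}

float fast_atan2f(float y, float x) {
    // reduce to the first octant (ratio in [0, 1]), then undo the
    // reduction; x == y == 0 yields NaN, as there's no defined angle
    const float ax = std::fabs(x), ay = std::fabs(y);
    const float a = std::min(ax, ay) / std::max(ax, ay);
    float r = atan_poly(a);
    if (ay > ax)  r = 1.57079633f - r;   // mirror at 45 degrees
    if (x < 0.0f) r = 3.14159265f - r;   // left half plane
    return (y < 0.0f) ? -r : r;          // lower half plane
}
```

In the vectorized variants discussed in the links above, these three branches become branch-free blends/selects.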
Fast Inverse Square Root
- Matthew Robertson: A Brief History of InvSqrt, 2012
- YouTube (DE): in-depth mathematical derivation of the algorithm
- YouTube (EN):
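For reference, the classic trick the links discuss - a magic-constant initial guess refined by one Newton-Raphson iteration:

```cpp
#include <cstdint>
#include <cstring>

// Fast approximate 1/sqrt(x), as popularized by Quake III Arena.
inline float inv_sqrt(float x) {
    const float half = 0.5f * x;
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);       // bit pattern of x
    i = 0x5f3759df - (i >> 1);           // magic initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y * (1.5f - half * y * y);    // one Newton-Raphson step
}
```

On modern x86, the rsqrtss/rsqrtps instructions (plus one Newton step) are usually both faster and more accurate than this bit trick.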
General
- DirectXMath
- Paul Mineiro's fastapprox
  - with exp, log, pow, sin, cos, tan, erf, .. - but no atan[2]
- Cephes
- Misc
- Sleef (Public Domain)
