development:fast_math
Differences
This shows you the differences between two versions of the page.
| development:fast_math [2025/10/30] – created ayguen | development:fast_math [Unknown date] (current) – external edit (Unknown date) 127.0.0.1 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ~~NOTRANS~~ | ||
| + | {{indexmenu_n> | ||
| + | |||
| + | ====== Fast/SIMD Linear/ | ||
| + | |||
| + | ===== Notes ===== | ||
| + | |||
| + | //Digital Signal Processing// | ||
| + | |||
| + | ==== Operations ==== | ||
| + | |||
| + | Main operations are mixing and filtering. | ||
| + | |||
| + | === Mixing === | ||
| + | |||
| + | Simply said: multiplication of the input stream with a complex (quadrature) sine carrier. Generation of sine carrier $c(n)$ involves the trigonometric functions $\sin()$ and $\cos()$. And, of course, the complex multiplication is required:\\ | ||
| + | $y(n) = x(n) \cdot c(n)$\\ | ||
| + | with the input signal $x(n)$,\\ | ||
| + | the complex carrier $c(n) = \exp(n \cdot i \cdot \Delta\varphi) = \cos(n \cdot \Delta\varphi) + i \cdot \sin(n \cdot \Delta\varphi)$\\ | ||
| + | and $\Delta\varphi = \frac{shift\_frequency}{samplerate} \cdot 2 \pi$.\\ | ||
| + | Boundless phase $\phi(n) = n \cdot \Delta\varphi$ will reduce accuracy of $\cos()$ and $\sin()$. $\phi(n)$ should always get wrapped to the range $[ -\pi ; +\pi ]$, when used as argument to trigonometric functions: | ||
| + | $\phi(n) := \operatorname{wrapToPi}(\phi(n-1) + \Delta\varphi)$ | ||
| + | |||
| + | Calculation of $c(n)$ won't need $\sin()$ and $\cos()$ for every sample $n$, when changing the calculation to\\ | ||
| + | $c(n) := c(n-1) \cdot r$\\ | ||
| + | with complex rotation constant $r := \exp(i \cdot \Delta\varphi) = \cos(\Delta\varphi) + i \cdot \sin(\Delta\varphi) $.\\ | ||
| + | Thus, calculation of $c(n)$ does only need one complex multiplication per sample. An additional side effect is, that $\operatorname{wrapToPi}()$ isn't necessary in the complex plane of $c(n) \in \mathbb{C}$. | ||
| + | |||
| + | Cause of error propagation and rounding towards zero (default in C/C++ programs), $c(n)$ needs to be normalized every bunch of $M$ samples, that $|c(n)| = 1$. Thus $c'(n) := \frac{c(n)}{|c(n)|}$ every $n = k \cdot M$ for all $k \in \mathbb{N}$. | ||
| + | |||
| + | For streaming, the generated carrier' | ||
| + | |||
| + | === Filtering === | ||
| + | |||
| + | Filtering involves design, needed once, for //infinite impulse response// (IIR) and //finite impulse response// (FIR) filter types.\\ | ||
| + | FIR type filters to need convolution for application of the filter. The convolution is applied to the input stream (in the time domain).\\ | ||
| + | Fast convolution is performed utilizing the //fast fourier transform// (FFT), with algorithms like overlap/ | ||
| + | |||
| + | === Down-conversion and decimation === | ||
| + | |||
| + | Mixing the desired signal' | ||
| + | In addition, an unnecessarily high samplerate is reduced by lowpass filtering and decimation. \\ | ||
| + | Decimation means taking out 1 sample every N samples. To avoid aliases in this // | ||
| + | There are several algorithms for an optimized combination of low-pass filter and decimation, e.g. // | ||
| + | |||
| + | |||
| + | === Spectrum (plot) === | ||
| + | |||
| + | Usually, a power (density) spectrum is displayed, showing the logarithmic spectrum of the input, temporary or result signals. This involves windowing, the FFT, power $norm(z) = \operatorname{Re}(z)^2 + \operatorname{Im}(z)^2$, | ||
| + | For short: $10 \cdot \log_{10}(\operatorname{norm}(FFT(x \cdot window)))$\\ | ||
| + | For levels on a plot, a precision and accuracy of 0.1 dB is usually sufficient. | ||
| + | |||
| + | === FM: Frequency modulation === | ||
| + | |||
| + | FM demodulation requires complex multiplication with previous samples conjugate and complex $ \arctan(z) = atan2( \operatorname{Im}(z), | ||
| + | $y(z) = \arctan\{ z(n) \cdot conj( z(n-1) ) \}$. | ||
| + | |||
| + | Alternatively $y(z) = \operatorname{diff}( \arctan(z) ) = \arctan(z(n)) - \arctan(z(n-1)) $\\ | ||
| + | + wrapping back to the range $[ -\pi ; +\pi ]$ of the resulting phase values:\\ | ||
| + | $y(z) = \operatorname{wrapToPi}( \operatorname{diff}( \arctan(z) ) )$ | ||
| + | |||
| + | === AM: Amplitude modulation === | ||
| + | |||
| + | AM demodulation, | ||
| + | $y(z) = |z| = \operatorname{abs}(z) = \sqrt{ \operatorname{norm}(z) } = \sqrt{ z \cdot {z}^\ast } = \sqrt{ z \cdot \operatorname{conj}(z) } = \sqrt{ \operatorname{Re}(z)^2 + \operatorname{Im}(z)^2 }$\\ | ||
| + | Then, filter $y(z)$ by a high pass filter, to remove the DC component .. | ||
| + | |||
| + | |||
| + | ==== Speed ==== | ||
| + | |||
| + | === Single memory pass === | ||
| + | |||
| + | Some operations need to be applied consecutively onto the stream. Applying the operations sequentially onto the blocks, requires one pass over the memory for each operation. Depending on the block size (number of samples) and the CPU's cache sizes, this might get quite slow - due to limited memory bandwidth - compared to calculating all operations in one single pass. But this requires complex/ | ||
| + | |||
| + | === Reduce precision and accuracy === | ||
| + | |||
| + | Another optimization is, not to pay for unnecessary precision/ | ||
| + | |||
| + | === SIMD === | ||
| + | |||
| + | Of course, not to miss, are the capabilities of modern processors for SIMD (single instruction multiple data) like SSE or AVX instruction sets on x64 or NEON on ARM architecture. Now, in 2022, C/C++ compilers still have problems at automatic vectorization to generate these vector instructions. Assembler or compiler intrinsic functions can be used to enforce the usage .. | ||
| + | |||
| + | === Fixed Point Arithmetic === | ||
| + | |||
| + | For ancient processors without any //floating point unit// (FPU) or slightly more recent, but also outdated SIMD units without floating point support, fixed point arithmetic might be interesting. With many microcontrollers, | ||
| + | Having to program the math operations manually, restricts this approach to simple operations - due to it's complexity. Any operations beyond comparison, addition, subtraction and multiplication will be quite complex.\\ | ||
| + | Furthermore, | ||
| + | |||
| + | |||
| + | ==== Reproducibility ==== | ||
| + | |||
| + | All operations should be exactly reproducible. That is quite important for debugging and for regression/ | ||
| + | When generating noisy (//AWGN//) test signals, it's advised to use a reproducible pseudo random number generator (//PRNG//) with a known and adjustable //seed// value. | ||
| + | |||
| + | ==== Target ==== | ||
| + | |||
| + | This page should collect links and references to relevant libraries and notes. | ||
| + | Ideally, we would develop a benchmark for some operations. | ||
| + | |||
| + | |||
| + | ===== SIMD Libraries ===== | ||
| + | |||
| + | see [[development: | ||
| + | |||
| + | ===== Fast atan2f ===== | ||
| + | |||
| + | * mazzo.li | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * Morlocks' | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * Hacker News | ||
| + | * https:// | ||
| + | * https:// | ||
| + | * dsprelated | ||
| + | * [[https:// | ||
| + | * https:// | ||
| + | |||
| + | |||
| + | ===== Fast Inverse Square Root ===== | ||
| + | |||
| + | * https:// | ||
| + | * https:// | ||
| + | * Matthew Robertson: A Brief History of InvSqrt, 2012 | ||
| + | * https:// | ||
| + | * YouTube (DE): in depth mathematical derivation of the algorithm | ||
| + | * https:// | ||
| + | * YouTube (EN): | ||
| + | * https:// | ||
| + | |||
| + | |||
| + | ===== General ===== | ||
| + | |||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | * DirectXMath | ||
| + | * https:// | ||
| + | * https:// | ||
| + | * Paul Mineiro' | ||
| + | * with exp, log, pow, sind, cos, tan, erf, .. - but no atan[2] | ||
| + | * https:// | ||
| + | * https:// | ||
| + | * Cephes | ||
| + | * http:// | ||
| + | * [[https:// | ||
| + | * https:// | ||
| + | * Misc | ||
| + | * [[https:// | ||
| + | * https:// | ||
| + | * Sleef (Public Domain) | ||
| + | * [[https:// | ||
| + | * https:// | ||
| + | * https:// | ||
| + | |||
