development:numeric_math - [2025/10/30] created by ayguen
~~NOTRANS~~
{{indexmenu_n>300}}

====== Numeric / Math / Linear Algebra ======

===== Numerically Robust Variance and Weighted Sum =====

  * John D. Cook's [[https://www.johndcook.com/blog/standard_deviation/|Accurately computing running variance]]
  * https://kunalbandekar.blogspot.com/2011/03/algorithms-for-calculating-variance.html
  * https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
    * [[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm|Welford algorithm]]
    * [[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm|West algorithm]]
  * https://en.wikipedia.org/wiki/Squared_deviations_from_the_mean
  * https://en.wikipedia.org/wiki/Kahan_summation_algorithm
    * https://www.ddj.com/floating-point-summation/184403224
    * https://rosettacode.org/wiki/Kahan_summation
    * https://stackoverflow.com/questions/57301596/optimal-implementation-of-iterative-kahan-summation
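Welford's update, as described in the links above, is compact enough to sketch directly. A minimal C++ sketch (struct and member names are my own choices):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Running mean/variance via Welford's online algorithm:
// single pass, O(1) state, numerically stable (avoids the catastrophic
// cancellation of the naive sum / sum-of-squares formula).
struct RunningStats {
    std::size_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;   // sum of squared deviations from the current mean

    void push(double x) {
        ++n;
        const double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);   // note: uses the *updated* mean
    }
    double variance() const {        // population variance
        return n > 0 ? m2 / static_cast<double>(n) : 0.0;
    }
    double sample_variance() const { // unbiased (n-1) variance
        return n > 1 ? m2 / static_cast<double>(n - 1) : 0.0;
    }
};
```

Pushing 1..5 yields mean 3, population variance 2 and sample variance 2.5, matching the textbook formulas.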


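The Kahan summation linked above carries a compensation term for the low-order bits lost in each addition. A minimal sketch (function name is my own):

```cpp
#include <cassert>
#include <vector>

// Kahan (compensated) summation: small addends are not silently lost
// when added to a much larger partial sum.
double kahan_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    double c = 0.0;                  // compensates for lost low-order bits
    for (double x : xs) {
        const double y = x - c;      // subtract the previous error
        const double t = sum + y;    // big + small: low bits of y may be lost
        c = (t - sum) - y;           // recover exactly what was lost
        sum = t;
    }
    return sum;
}
```

For example, summing {1e16, 1.0, 1.0} naively gives 1e16 (both 1.0s vanish below the ulp), while the compensated sum returns 1e16 + 2 exactly. Note that plain Kahan can still fail when the total itself cancels to near zero; the Neumaier variant (see the Wikipedia article above) handles that case.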
===== C/C++ Libraries =====

  * C: Intel(R) Math Kernel Library (MKL)
    * https://en.wikipedia.org/wiki/Math_Kernel_Library
    * https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
    * https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top.html
  * C: Intel(R) Integrated Performance Primitives (IPP)
    * https://en.wikipedia.org/wiki/Integrated_Performance_Primitives
    * https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html
    * https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top.html
  * C: Vector Optimized Library of Kernels (libVOLK)
    * https://www.libvolk.org/
    * currently licensed under GPLv3
    * the future version 3 will be LGPL, see https://www.libvolk.org/help-us-relicense-volk-under-lgpl.html
  * C++: Eigen
    * https://eigen.tuxfamily.org/
    * https://eigen.tuxfamily.org/dox/index.html
    * pre-allocated arrays can be mapped to Eigen's Matrix/Vector types
      * https://eigen.tuxfamily.org/dox/classEigen_1_1Map.html
      * see https://eigen.tuxfamily.org/dox/group__matrixtypedefs.html for the native types
    * matrix self-transpose/adjoint multiply optimization, e.g. for least squares
      * https://stackoverflow.com/questions/39606224/does-eigen-have-self-transpose-multiply-optimization-like-h-transposeh
      * https://stackoverflow.com/questions/42712307/efficient-matrix-transpose-matrix-multiplication-in-eigen
  * Boost Interval Arithmetic
    * https://www.boost.org/doc/libs/1_75_0/libs/numeric/interval/doc/interval.htm
  * C++: Armadillo
    * https://arma.sourceforge.net/
  * Fastor
    * https://github.com/romeric/Fastor
  * libxsmm
    * https://github.com/libxsmm/libxsmm
  * NumPy
    * [[https://numpy.org/|NumPy]] is Python, but there are promising C++ libraries:
      * https://github.com/dpilger26/NumCpp - see the [[https://dpilger26.github.io/NumCpp/doxygen/html/index.html|documentation]]
      * https://github.com/xtensor-stack/xtensor - see the [[https://xtensor.readthedocs.io/en/latest/|documentation]]
        * optionally utilizes https://github.com/xtensor-stack/xsimd / https://xsimd.readthedocs.io/en/latest/
  * other libraries:
    * https://en.wikipedia.org/wiki/Comparison_of_linear_algebra_libraries
    * https://stackoverflow.com/questions/1380371/what-are-the-most-widely-used-c-vector-matrix-math-linear-algebra-libraries-a

===== IEEE-754 Floating-Point Numbers =====

  * https://en.wikipedia.org/wiki/IEEE_754
  * the existence of denormalized numbers (aka subnormal numbers) should be known
  * the existence of signed zeros might be interesting
  * the existence of the following non-numbers (Not-a-Number: NaN) should be known:
    * quiet NaN (qNaN)
    * signaling NaN (sNaN)
  * positive and negative infinity (+/- inf) exist as well
  * rounding modes can be controlled:
    * to nearest, towards 0, towards +inf or towards -inf
    * be aware that repeatedly rotating a 2D point or complex coordinate in a loop lets rounding errors accumulate, so its magnitude slowly drifts - note that the default rounding mode is //round to nearest, ties to even//, not //round towards zero//; renormalize periodically if needed
  * exceptions can be handled by setting up float traps / signal handlers
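The rounding modes above can be switched at runtime through the standard ''<cfenv>'' interface; ''nearbyint()'' honours the currently active mode. A small sketch (function name is my own):

```cpp
#include <cfenv>
#include <cmath>
// The C standard requires '#pragma STDC FENV_ACCESS ON' for this;
// most compilers accept runtime fesetround() calls regardless.

// Round x under the given mode (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD,
// FE_TOWARDZERO), then restore the previous mode.
double round_with_mode(double x, int mode) {
    const int old_mode = std::fegetround();
    std::fesetround(mode);
    const double r = std::nearbyint(x);  // uses the active rounding mode
    std::fesetround(old_mode);
    return r;
}
```

For example, 2.5 rounds to 2 under the default //ties to even// mode, but to 3 under FE_UPWARD.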

==== Performance Issues with Denormals and NaNs ====

Calculations involving denormals or non-numbers can slow down performance considerably - even when no exception is signalled.\\
It might be desirable to abort a calculation, e.g. a matrix/vector multiplication, at the first occurrence of a NaN - or on one of the other conditions. Unfortunately, SIMD instruction sets have issues with exception trapping and NaN propagation. See Agner Fog's article at https://www.agner.org/optimize/#nan_propagation

Special modes like DAZ (Denormals-Are-Zero) and FTZ (Flush-To-Zero) can be used if the application does not care about very small denormalized numbers, see https://en.wikipedia.org/wiki/Subnormal_number

The IPP library also provides helper functions:

  * https://www.intel.com/content/www/us/en/docs/ipp/developer-reference/2021-9/setdenormarezeros.html
  * https://www.intel.com/content/www/us/en/docs/ipp/developer-reference/2021-9/setflushtozero.html

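A brief sketch of how subnormals can be detected portably, and how FTZ can be enabled on x86 via the SSE control register - the same mechanism the IPP helpers above wrap (the guard and function names here are my own):

```cpp
#include <cfloat>
#include <cmath>

#if defined(__SSE__)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON
// FTZ replaces subnormal *results* with zero. DAZ, which treats subnormal
// *inputs* as zero, is _MM_SET_DENORMALS_ZERO_MODE in <pmmintrin.h>.
void enable_ftz() { _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); }
#endif

// Portable check whether a value is subnormal (denormalized).
bool is_subnormal(double x) { return std::fpclassify(x) == FP_SUBNORMAL; }
```

For example, ''DBL_MIN / 2'' lies below the smallest normal double and is classified as subnormal; with FTZ/DAZ active, such values are treated as zero instead.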

===== Fast Cache-Efficient Matrix Transposition =====

The topic is explained at [[https://en.wikipedia.org/wiki/Transpose|Wikipedia]] - with a specialized [[https://en.wikipedia.org/wiki/In-place_matrix_transposition|article]] on the //in-place// operation. Matrix transposition is also useful in image processing. (De-)interleaving - e.g. of multi-channel audio data - is a different name for the same operation. See https://stackoverflow.com/questions/7780279/de-interleave-an-array-in-place

Transposing a matrix the //naive// way produces many cache misses. That is why specialized algorithms, such as [[https://en.wikipedia.org/wiki/Cache-oblivious_algorithm|cache-oblivious]] ones, are beneficial. There are numerous scientific papers on the topic, e.g. [[https://www.researchgate.net/publication/3838543_Cache-efficient_matrix_transposition|Cache-efficient matrix transposition]]. The problem is also discussed on [[https://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program|Stack Overflow]] - helpfully with some code snippets.

In general, one should consider the following aspects:
  * rectangular or square matrix
  * transposing only from/to rectangular regions of bigger matrices or images
  * in-place or out-of-place operation
  * combination with conjugation
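To illustrate the cache-blocking idea, here is a simple out-of-place tiled transpose sketch (tile size and names chosen arbitrarily; real libraries add SIMD kernels and cache-oblivious recursion on top):

```cpp
#include <algorithm>
#include <cstddef>

// Out-of-place transpose of a row-major rows x cols matrix, processed in
// block x block tiles so that both the read side and the write side stay
// within a small cache footprint.
template <typename T>
void transpose_blocked(const T* src, T* dst,
                       std::size_t rows, std::size_t cols,
                       std::size_t block = 32) {
    for (std::size_t i0 = 0; i0 < rows; i0 += block)
        for (std::size_t j0 = 0; j0 < cols; j0 += block)
            // transpose one tile: dst is cols x rows, row-major
            for (std::size_t i = i0; i < std::min(i0 + block, rows); ++i)
                for (std::size_t j = j0; j < std::min(j0 + block, cols); ++j)
                    dst[j * rows + i] = src[i * cols + j];
}
```

The naive version is the same code without the two outer tile loops; the blocked variant touches the destination in short column bursts per tile instead of striding through the whole destination matrix for every source row.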

Here are some libraries, which should be quite efficient, that provide transpose functions:
  * [[https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-2-image-processing/image-data-exchange-and-initialization-functions/transpose.html|Intel IPP: ippiTranspose_*()]]
  * [[https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/blas-like-extensions/mkl-omatcopy.html|Intel MKL: out-of-place mkl_*omatcopy()]]
    * unfortunately supports no data types smaller than float
  * [[https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/blas-like-extensions/mkl-imatcopy-batch.html|Intel MKL: in-place mkl_*imatcopy()]]
  * [[https://eigen.tuxfamily.org/dox/group__TutorialMatrixArithmetic.html|Eigen: .transpose()]]
    * returns a lazily evaluated transpose expression - beware of aliasing; ''transposeInPlace()'' exists for the in-place case
    * pre-allocated arrays can be mapped to Eigen's Matrix/Vector types
      * https://eigen.tuxfamily.org/dox/classEigen_1_1Map.html
      * see https://eigen.tuxfamily.org/dox/group__matrixtypedefs.html for the native types
  * [[https://arma.sourceforge.net/docs.html#trans|Armadillo: trans()]] - also with a slower [[https://arma.sourceforge.net/docs.html#inplace_trans|in-place variant]]
    * returns a newly created, transposed matrix
    * an [[https://arma.sourceforge.net/docs.html#Mat|advanced constructor]] with ''copy_aux_mem = false'' allows using external data directly

Searching [[https://github.com/|github.com]] for "//transpose//" also yields many results.


Pavel Zemtsov wrote a bunch of related articles at [[https://pzemtsov.github.io/|Experiments in program optimisation]], backed by sources at https://github.com/pzemtsov/article-e1-cache and https://github.com/pzemtsov/article-E1-demux-C:
  * [[https://pzemtsov.github.io/2014/05/02/demultiplexing-of-e1-converting-to-C.html|De-multiplexing of E1 stream: converting to C]]
  * [[https://pzemtsov.github.io/2014/06/04/fast-e1-de-multiplexing-in-C-using-sse-avx.html|Fast E1 de-multiplexing in C using SSE/AVX]]
  * [[https://pzemtsov.github.io/2014/10/01/how-to-transpose-a-16x16-matrix.html|How to transpose a 16×16 byte matrix using SSE]]
  * [[https://pzemtsov.github.io/2014/10/21/bug-story-2-unrolling-16x16-transposition.html|Bug story 2: Unrolling the 16×16 matrix transposition, or be careful with macros]]
  * [[https://pzemtsov.github.io/2014/11/24/how-cache-affects-demultiplexing-of-e1.html|How cache affects demultiplexing of E1]]

Other links:
  * https://github.com/FFmpeg/FFmpeg/blob/master/libavfilter/opencl/transpose.cl
  * https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/arm/neon.S
  * https://www.cs.technion.ac.il/~itai/Courses/Cache/matrix-transposition.pdf
  * https://dl.acm.org/doi/10.1145/3555353
  * https://github.com/romeric/Fastor/blob/master/Fastor/backend/transpose/transpose_kernels.h
  * https://groups.google.com/g/llvm-dev/c/P_CXf5hPro8/m/RI6Fm3bzAQAJ

There is also a newer library: https://github.com/hayguen/libtranspose


===== Links =====

  * Numerical Computation Guide
    * https://docs.oracle.com/cd/E19957-01/806-3568/ncgTOC.html
  * Several blog entries by Bruce Dawson
    * https://randomascii.wordpress.com/category/floating-point/
  * C99 and C++11 floating-point environment reference
    * https://en.cppreference.com/w/c/numeric/fenv
    * https://en.cppreference.com/w/cpp/numeric/fenv