~~NOTRANS~~
{{indexmenu_n>300}}

====== Numeric / Math / Linear Algebra ======

===== Numerically Robust Variance and Weighted Sum =====

  * John D. Cook's [[https://www.johndcook.com/blog/standard_deviation/|Accurately computing running variance]]
  * https://kunalbandekar.blogspot.com/2011/03/algorithms-for-calculating-variance.html
  * https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
    * [[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm|Welford algorithm]]
    * [[https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm|West algorithm]]
  * https://en.wikipedia.org/wiki/Squared_deviations_from_the_mean
  * https://en.wikipedia.org/wiki/Kahan_summation_algorithm
    * https://www.ddj.com/floating-point-summation/184403224
    * https://rosettacode.org/wiki/Kahan_summation
    * https://stackoverflow.com/questions/57301596/optimal-implementation-of-iterative-kahan-summation

===== C/C++ Libraries =====

  * C: Intel(R) Math Kernel Library (MKL)
    * https://en.wikipedia.org/wiki/Math_Kernel_Library
    * https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
    * https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top.html
  * C: Intel(R) Integrated Performance Primitives (IPP)
    * https://en.wikipedia.org/wiki/Integrated_Performance_Primitives
    * https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html
    * https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top.html
  * C: Vector Optimized Library of Kernels (libVOLK)
    * https://www.libvolk.org/
    * currently GPLv3 license
    * future version 3 will be LGPL, see https://www.libvolk.org/help-us-relicense-volk-under-lgpl.html
  * C++: Eigen
    * https://eigen.tuxfamily.org/
    * https://eigen.tuxfamily.org/dox/index.html
    * pre-allocated arrays can be mapped to Eigen's Matrix/Vector types
      * https://eigen.tuxfamily.org/dox/classEigen_1_1Map.html
      * see https://eigen.tuxfamily.org/dox/group__matrixtypedefs.html for the native types
    * matrix self-transpose/adjoint multiply optimization, e.g. for least squares
      * https://stackoverflow.com/questions/39606224/does-eigen-have-self-transpose-multiply-optimization-like-h-transposeh
      * https://stackoverflow.com/questions/42712307/efficient-matrix-transpose-matrix-multiplication-in-eigen
  * Boost Interval Arithmetic
    * https://www.boost.org/doc/libs/1_75_0/libs/numeric/interval/doc/interval.htm
  * C++: Armadillo
    * https://arma.sourceforge.net/
  * Fastor
    * https://github.com/romeric/Fastor
  * libxsmm
    * https://github.com/libxsmm/libxsmm
  * NumPy
    * [[https://numpy.org/|NumPy]] is Python, but there are promising C++ libraries:
      * https://github.com/dpilger26/NumCpp - see [[https://dpilger26.github.io/NumCpp/doxygen/html/index.html|documentation]]
      * https://github.com/xtensor-stack/xtensor - see [[https://xtensor.readthedocs.io/en/latest/|documentation]]
        * optionally utilizes https://github.com/xtensor-stack/xsimd / https://xsimd.readthedocs.io/en/latest/
  * other libraries:
    * https://en.wikipedia.org/wiki/Comparison_of_linear_algebra_libraries
    * https://stackoverflow.com/questions/1380371/what-are-the-most-widely-used-c-vector-matrix-math-linear-algebra-libraries-a

===== IEEE-754 Float Numbers =====

  * https://en.wikipedia.org/wiki/IEEE_754
  * existence of denormalized numbers (aka subnormal numbers) should be known
  * existence of signed zeros might be interesting
  * existence of the following special values should be known
    * quiet NaN (qNaN) and signaling NaN (sNaN) - the non-numbers (Not-a-Number)
    * positive and negative infinity (+/- inf) - not NaNs, but also needing special handling
  * rounding modes can be controlled
    * nearest (the default), towards 0, towards +inf or -inf
  * be aware that rotating a 2D point or complex coordinate in a loop accumulates rounding error; with the (non-default) rounding mode //'round towards zero'// the magnitude will even drift systematically towards zero!
  * exceptions can be handled by setting up float traps / a signal handler

==== Performance Issues with Denormals and NaNs ====

Calculations with denormals or non-numbers slow down performance - even when not signalled.\\
It might be interesting to abort a calculation, e.g. a matrix/vector multiplication, at the first occurrence of a NaN - or of one of the other conditions.

Unfortunately, SIMD instruction sets have issues with exception trapping and NaN propagation. See Agner Fog's article at https://www.agner.org/optimize/#nan_propagation

Special options like DAZ (Denormals-Are-Zero) and FTZ (Flush-To-Zero) can be used if the application does not care about very small denormalized numbers, see https://en.wikipedia.org/wiki/Subnormal_number

The IPP library also provides helper functions:
  * https://www.intel.com/content/www/us/en/docs/ipp/developer-reference/2021-9/setdenormarezeros.html
  * https://www.intel.com/content/www/us/en/docs/ipp/developer-reference/2021-9/setflushtozero.html

===== Fast cache-efficient matrix transposition =====

The topic is explained at [[https://en.wikipedia.org/wiki/Transpose|wikipedia]] - with a specialized [[https://en.wikipedia.org/wiki/In-place_matrix_transposition|article]] for the //in-place// operation.

Matrix transposition can also be utilized in the field of image processing. (De-)interleaving, e.g. of multi-channel audio data, is a different wording for the same operation. See https://stackoverflow.com/questions/7780279/de-interleave-an-array-in-place

Transposing a matrix the //simple// way produces many cache misses. That is why special algorithms, like [[https://en.wikipedia.org/wiki/Cache-oblivious_algorithm|cache-oblivious]] ones, are beneficial. There are numerous scientific papers on this topic, e.g. [[https://www.researchgate.net/publication/3838543_Cache-efficient_matrix_transposition|Cache-efficient matrix transposition]].
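The blocking idea behind such cache-friendly transposes can be sketched in a few lines. This is a minimal out-of-place example, not taken from any of the cited libraries; the tile size ''B = 32'' is an assumption to be tuned per CPU, and cache-oblivious variants recurse instead of using a fixed tile:

<code cpp>
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Cache-blocked, out-of-place transpose of a row-major rows x cols matrix.
// Processing B x B tiles keeps both the contiguous source reads and the
// strided destination writes within a small working set, so far fewer
// cache lines are evicted than with the naive double loop.
template <typename T>
void transpose_blocked(const T* src, T* dst,
                       std::size_t rows, std::size_t cols,
                       std::size_t B = 32)
{
    for (std::size_t i0 = 0; i0 < rows; i0 += B)
        for (std::size_t j0 = 0; j0 < cols; j0 += B)
            for (std::size_t i = i0; i < std::min(i0 + B, rows); ++i)
                for (std::size_t j = j0; j < std::min(j0 + B, cols); ++j)
                    dst[j * rows + i] = src[i * cols + j];
}

int main()
{
    const std::size_t rows = 70, cols = 45;  // deliberately not multiples of B
    std::vector<int> a(rows * cols), t(rows * cols);
    for (std::size_t k = 0; k < a.size(); ++k) a[k] = static_cast<int>(k);
    transpose_blocked(a.data(), t.data(), rows, cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            assert(t[j * rows + i] == a[i * cols + j]);
    return 0;
}
</code>

The inner tile could additionally be unrolled or vectorized with SIMD shuffles, which is what the libraries below do.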
But the problem is also discussed on [[https://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program|stackoverflow]] - helpfully with some code snippets.

In general, one should consider the following aspects:
  * rectangular or square matrix
  * transposing only from/to rectangular regions of a bigger matrix or image
  * in-place or out-of-place operation
  * combination with conjugation

Here are some libraries providing transpose functions that should be quite efficient:
  * [[https://www.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-2-image-processing/image-data-exchange-and-initialization-functions/transpose.html|Intel IPP: ippiTranspose_*()]]
  * [[https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/blas-like-extensions/mkl-omatcopy.html|Intel MKL: out-of-place mkl_*omatcopy()]]
    * unfortunately no data types smaller than float
  * [[https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/blas-like-extensions/mkl-imatcopy-batch.html|Intel MKL: in-place mkl_*imatcopy()]]
  * [[https://eigen.tuxfamily.org/dox/group__TutorialMatrixArithmetic.html|Eigen: .transpose()]]
    * returns a newly created/transposed matrix
    * pre-allocated arrays can be mapped to Eigen's Matrix/Vector types
      * https://eigen.tuxfamily.org/dox/classEigen_1_1Map.html
      * see https://eigen.tuxfamily.org/dox/group__matrixtypedefs.html for the native types
  * [[https://arma.sourceforge.net/docs.html#trans|Armadillo: trans()]] - also with a slower [[https://arma.sourceforge.net/docs.html#inplace_trans|in-place variant]]
    * returns a newly created/transposed matrix
    * [[https://arma.sourceforge.net/docs.html#Mat|advanced constructor]] with ''copy_aux_mem = false'' for using external data directly

[[https://github.com/|github.com]] also produces many results when searching for "//transpose//".
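To make the equivalence of de-interleaving and transposition mentioned above concrete, here is a minimal, deliberately naive (not cache-optimized) sketch for multi-channel audio samples; the function and variable names are illustrative only:

<code cpp>
#include <cassert>
#include <cstddef>
#include <vector>

// De-interleaving N frames of C channels is the same memory movement as
// transposing an N x C row-major matrix into a C x N one: the interleaved
// buffer is "frame-major", the per-channel buffers are "channel-major".
std::vector<std::vector<float>>
deinterleave(const float* in, std::size_t frames, std::size_t channels)
{
    std::vector<std::vector<float>> out(channels, std::vector<float>(frames));
    for (std::size_t f = 0; f < frames; ++f)
        for (std::size_t c = 0; c < channels; ++c)
            out[c][f] = in[f * channels + c];  // one "transposed" sample
    return out;
}

int main()
{
    // stereo, 3 frames: L0 R0 L1 R1 L2 R2
    const float in[] = { 1.f, -1.f, 2.f, -2.f, 3.f, -3.f };
    auto ch = deinterleave(in, 3, 2);
    assert(ch[0][0] == 1.f && ch[0][1] == 2.f && ch[0][2] == 3.f);
    assert(ch[1][0] == -1.f && ch[1][1] == -2.f && ch[1][2] == -3.f);
    return 0;
}
</code>

The articles below show how far such a loop can be optimized with SSE/AVX for the E1 telephony format, which interleaves 32 channels.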
Pavel Zemtsov wrote a bunch of related articles at [[https://pzemtsov.github.io/|Experiments in program optimisation]], backed with sources at https://github.com/pzemtsov/article-e1-cache and https://github.com/pzemtsov/article-E1-demux-C:
  * [[https://pzemtsov.github.io/2014/05/02/demultiplexing-of-e1-converting-to-C.html|De-multiplexing of E1 stream: converting to C]]
  * [[https://pzemtsov.github.io/2014/06/04/fast-e1-de-multiplexing-in-C-using-sse-avx.html|Fast E1 de-multiplexing in C using SSE/AVX]]
  * [[https://pzemtsov.github.io/2014/10/01/how-to-transpose-a-16x16-matrix.html|How to transpose a 16×16 byte matrix using SSE]]
  * [[https://pzemtsov.github.io/2014/10/21/bug-story-2-unrolling-16x16-transposition.html|Bug story 2: Unrolling the 16×16 matrix transposition, or be careful with macros]]
  * [[https://pzemtsov.github.io/2014/11/24/how-cache-affects-demultiplexing-of-e1.html|How cache affects demultiplexing of E1]]

other links:
  * https://github.com/FFmpeg/FFmpeg/blob/master/libavfilter/opencl/transpose.cl
  * https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/arm/neon.S
  * https://www.cs.technion.ac.il/~itai/Courses/Cache/matrix-transposition.pdf
  * https://dl.acm.org/doi/10.1145/3555353
  * https://github.com/romeric/Fastor/blob/master/Fastor/backend/transpose/transpose_kernels.h
  * https://groups.google.com/g/llvm-dev/c/P_CXf5hPro8/m/RI6Fm3bzAQAJ

There is also a newer library: https://github.com/hayguen/libtranspose

===== Links =====

  * Numerical Computation Guide
    * https://docs.oracle.com/cd/E19957-01/806-3568/ncgTOC.html
  * several blog entries by Bruce Dawson
    * https://randomascii.wordpress.com/category/floating-point/
  * C99 and C++11 floating-point environment reference
    * https://en.cppreference.com/w/c/numeric/fenv
    * https://en.cppreference.com/w/cpp/numeric/fenv
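The C99/C++11 floating-point environment linked above can be exercised with a few lines - a minimal sketch of switching rounding modes and polling exception flags (without installing a trap/signal handler). Note that strictly conforming code requires the ''STDC FENV_ACCESS'' pragma, which not all compilers honor, and aggressive optimization can reorder FP operations around the mode switches:

<code cpp>
#include <cassert>
#include <cfenv>
#include <cmath>
// #pragma STDC FENV_ACCESS ON  // mandated by the standard before touching the
//                              // FP environment, but e.g. GCC does not support it

int main()
{
    volatile double x = 2.7;  // volatile prevents compile-time constant folding

    // nearbyint() rounds according to the current rounding mode
    std::fesetround(FE_TOWARDZERO);
    assert(std::nearbyint(x)  ==  2.0);
    assert(std::nearbyint(-x) == -2.0);

    std::fesetround(FE_UPWARD);
    assert(std::nearbyint(x) == 3.0);

    std::fesetround(FE_TONEAREST);  // restore the default mode

    // exception flags can be polled instead of trapped
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double zero = 0.0;
    volatile double r = 1.0 / zero;  // +inf, raises FE_DIVBYZERO
    (void)r;
    assert(std::fetestexcept(FE_DIVBYZERO) != 0);
    return 0;
}
</code>

Polling ''fetestexcept(FE_INVALID)'' after a matrix/vector kernel is one way to implement the "abort at the first NaN" idea from the denormals section above.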