Calculations involving denormals or non-numbers (NaNs) degrade performance - even when no exception is signalled.
It might be interesting to abort a calculation, e.g. a matrix/vector multiplication, at the first occurrence of a NaN - or of one of the other floating-point exception conditions. Unfortunately, SIMD instruction sets have issues with exception trapping and NaN propagation. See Agner Fog's article at https://www.agner.org/optimize/#nan_propagation
Special modes like DAZ (Denormals-Are-Zero) and FTZ (Flush-To-Zero) can be used if the application does not care about very small denormalized numbers, see https://en.wikipedia.org/wiki/Subnormal_number
The IPP library also provides helper functions:
The topic is explained on Wikipedia - with a dedicated article for the in-place operation. Matrix transposition is also useful in image processing. (De-)interleaving, e.g. of multi-channel audio data, is a different name for the same operation. See https://stackoverflow.com/questions/7780279/de-interleave-an-array-in-place
Transposing a matrix the naive way produces many cache misses. That is why specialized algorithms, like cache-oblivious ones, are beneficial. There are numerous scientific papers on this topic, e.g. "Cache-efficient matrix transposition". The problem is also discussed on Stack Overflow - happily with some code snippets.
In general, one should consider the following aspects:
Here are some libraries that should be quite performance-efficient and provide transpose functions:
copy_aux_mem = false for using external data directly
github.com also produces many results when searching for “transpose”.
Pavel Zemtsov wrote a series of related articles at Experiments in program optimisation, backed by sources at https://github.com/pzemtsov/article-e1-cache and https://github.com/pzemtsov/article-E1-demux-C:
Other links:
There is also a new library: https://github.com/hayguen/libtranspose