Unified memory management for CPUs and GPUs

All data structures in TNL carry information about their allocation (on the CPU or GPU). The target device is represented by a C++ template parameter, so the correctness of all memory accesses is checked already at compile time. For example, an array declaration looks as follows:

TNL::Containers::Array< Value, Device, Index, Allocator > array;
  • Value is the data type of the array elements.
  • Device determines where the array is allocated, i.e., on the CPU or GPU.
  • Index is the type used for indexing the array elements.
  • Allocator performs the memory allocation: in conventional CPU memory, GPU memory, page-locked host memory, or CUDA unified memory.
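As a rough sketch of the idea (plain C++ with hypothetical Host/Cuda tags, not TNL's actual implementation), the device can be made part of the array's type, so that an accidental cross-device assignment fails to compile:

```cpp
#include <type_traits>
#include <vector>

// Hypothetical device tags, mimicking TNL::Devices::Host / TNL::Devices::Cuda.
struct Host {};
struct Cuda {};

// Minimal sketch of a device-tagged array: the Device parameter is part of the
// type, so mixing arrays from different devices is rejected at compile time.
template< typename Value, typename Device >
struct Array
{
   std::vector< Value > data;  // real GPU storage would use a CUDA allocator

   // Element-wise assignment is only allowed between arrays on the same device.
   template< typename OtherDevice >
   Array& operator=( const Array< Value, OtherDevice >& other )
   {
      static_assert( std::is_same< Device, OtherDevice >::value,
                     "cross-device assignment requires an explicit transfer" );
      data = other.data;
      return *this;
   }
};
```

Here the `static_assert` plays the role of the compile-time check mentioned above; TNL itself provides explicit cross-device copies instead of simply forbidding them.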

TNL also offers extended smart pointers which simplify passing objects to GPUs.

Flexible parallel reduction

Parallel reduction is one of the most important parallel operations. It is, however, rather tedious to implement. TNL takes advantage of C++ lambda functions and offers an interface for so-called flexible parallel reduction. The following example demonstrates its use for the implementation of a scalar product:

auto fetch = [=] __cuda_callable__ (int i) { return ( a[i] * b[i] ); };
auto reduce = [] __cuda_callable__ (float x, float y) { return x + y; };
auto product = TNL::Algorithms::reduce< Devices::Cuda >( 0, size, fetch, reduce, 0.0 );

The lambda function fetch reads the data to be reduced; in this case, it also multiplies the corresponding elements of the input vectors. The function reduce represents the operation used for the reduction, in this case addition. After replacing Devices::Cuda with Devices::Host, the same operation is computed on the CPU.
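The semantics of the fetch/reduce pair can be illustrated by a serial sketch in plain C++ (the function name `flexibleReduce` is illustrative; it is not TNL's implementation, which runs the loop in parallel on the chosen device):

```cpp
// Serial sketch of flexible parallel reduction: fetch produces the i-th value,
// reduce folds it into the running result, identity is the neutral element.
template< typename Index, typename Fetch, typename Reduce, typename Result >
Result flexibleReduce( Index begin, Index end, Fetch fetch, Reduce reduce, Result identity )
{
   Result result = identity;
   for( Index i = begin; i < end; i++ )
      result = reduce( result, fetch( i ) );   // accumulate fetched values
   return result;
}
```

With fetch returning a[i] * b[i] and reduce performing addition, this computes exactly the scalar product from the example above.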

Expression templates

Most of the BLAS Level 1 functions (and more) are available in TNL in the form of expression templates. They are intuitive, easy to use, and efficient at the same time. The following code based on cuBLAS

cublasHandle_t handle;
cublasCreate( &handle );
const float a1 = 1.0f, a2 = 2.0f, a3 = 3.0f;
cublasSaxpy( handle, size, &a1, a, 1, x, 1 );
cublasSaxpy( handle, size, &a2, b, 1, x, 1 );
cublasSaxpy( handle, size, &a3, c, 1, x, 1 );

is equivalent to the following code in TNL

x = a + 2 * b + 3 * c;

It is simpler and, in addition, up to 1.7 times faster than cuBLAS.
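The speedup comes from loop fusion: the whole expression is evaluated in a single pass instead of three saxpy kernels. A minimal standalone sketch of the technique (plain C++, much simpler than TNL's actual expression templates) looks like this:

```cpp
#include <cstddef>
#include <vector>

// Tiny expression-template sketch: x = a + 2*b + 3*c builds a lightweight
// expression tree that is evaluated in one fused loop on assignment,
// avoiding temporary vectors.
template< typename L, typename R >
struct Add
{
   const L& l;
   const R& r;
   double operator[]( std::size_t i ) const { return l[ i ] + r[ i ]; }
};

template< typename E >
struct Scale
{
   double s;
   const E& e;
   double operator[]( std::size_t i ) const { return s * e[ i ]; }
};

struct Vec
{
   std::vector< double > data;
   double operator[]( std::size_t i ) const { return data[ i ]; }

   template< typename Expr >
   Vec& operator=( const Expr& expr )
   {
      for( std::size_t i = 0; i < data.size(); i++ )  // single fused loop
         data[ i ] = expr[ i ];
      return *this;
   }
};

template< typename L, typename R >
Add< L, R > operator+( const L& l, const R& r ) { return { l, r }; }

template< typename E >
Scale< E > operator*( double s, const E& e ) { return { s, e }; }
```

Each operator only records references to its operands; no arithmetic happens until the assignment, where all terms are combined element by element in one loop.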

Dense and sparse matrices

Dense and sparse matrices are among the most important data structures for the majority of HPC algorithms. TNL offers a unified interface for both dense and sparse matrices. Sparse matrices can have a fixed (tridiagonal or multidiagonal) or general layout. General sparse matrices can be stored in one of several formats optimized for various matrix element patterns, namely CSR, Ellpack, SlicedEllpack, ChunkedEllpack, or BiEllpack. They can also be stored as symmetric (only the lower part and the diagonal are stored) or binary (only the positions of nonzero elements are stored) to minimize memory requirements.
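To give an idea of what such a format looks like, here is a minimal sketch of the CSR layout in plain C++ (the field names are illustrative, not TNL's actual classes):

```cpp
#include <vector>

// Minimal CSR (Compressed Sparse Row) sketch: only nonzero elements are
// stored, row by row, together with their column indices.
struct CsrMatrix
{
   int rows, columns;
   std::vector< int > rowPointers;     // rowPointers[ r ]..rowPointers[ r+1 ] delimit row r
   std::vector< int > columnIndexes;   // column of each stored nonzero
   std::vector< double > values;       // value of each stored nonzero

   double getElement( int r, int c ) const
   {
      for( int k = rowPointers[ r ]; k < rowPointers[ r + 1 ]; k++ )
         if( columnIndexes[ k ] == c )
            return values[ k ];
      return 0.0;   // element not stored, i.e. a structural zero
   }
};
```

The other Ellpack-style formats mentioned above trade this compact layout for more regular memory access patterns, which tends to pay off on GPUs.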

The matrix elements can be easily set with the help of lambda functions as follows (it works on GPUs as well, of course):

auto f = [] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, double& value ) {
   value = rowIdx + columnIdx;
};
matrix.forAllElements( f );
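A serial sketch of what such a traversal does, written for a plain dense matrix (illustrative only; TNL's forAllElements runs in parallel and works with its own matrix types):

```cpp
#include <vector>

// Serial sketch of an "apply f to every element" traversal over a dense
// matrix: f receives the row index, the local index within the row, the
// column index, and a mutable reference to the stored value.
template< typename F >
void forAllElements( std::vector< std::vector< double > >& matrix, F f )
{
   for( int rowIdx = 0; rowIdx < (int) matrix.size(); rowIdx++ )
      for( int columnIdx = 0; columnIdx < (int) matrix[ rowIdx ].size(); columnIdx++ ) {
         // for a dense row, the local index within the row equals the column index;
         // in a sparse format the two generally differ
         int localIdx = columnIdx;
         f( rowIdx, localIdx, columnIdx, matrix[ rowIdx ][ columnIdx ] );
      }
}
```

Passing the lambda from the example above fills each element with the sum of its row and column index.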

You can also perform flexible parallel reduction within individual matrix rows. For example, matrix-vector multiplication can be implemented as follows (again on both CPUs and GPUs):

auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
   return x[ columnIdx ] * value;
};
auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
   y[ rowIdx ] = value;
};
matrix.reduceAllRows( fetch, TNL::plus{}, keep, 0.0 );
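The row-wise reduction can again be pictured with a serial sketch over a CSR matrix (the free function below is illustrative; TNL's reduceAllRows is a parallel method of its matrix classes):

```cpp
#include <vector>

// Serial sketch of row-wise flexible reduction over a CSR matrix: for each
// row, fetch maps every stored element to a value, reduce folds the values,
// and keep stores the per-row result.
template< typename Fetch, typename Reduce, typename Keep >
void reduceAllRows( int rows,
                    const std::vector< int >& rowPointers,
                    const std::vector< int >& columnIndexes,
                    const std::vector< double >& values,
                    Fetch fetch, Reduce reduce, Keep keep, double identity )
{
   for( int rowIdx = 0; rowIdx < rows; rowIdx++ ) {
      double result = identity;
      for( int k = rowPointers[ rowIdx ]; k < rowPointers[ rowIdx + 1 ]; k++ )
         result = reduce( result, fetch( rowIdx, columnIndexes[ k ], values[ k ] ) );
      keep( rowIdx, result );   // store the per-row result, e.g. y[ rowIdx ]
   }
}
```

With fetch returning x[ columnIdx ] * value, addition as reduce, and keep writing into y, each row yields one entry of the matrix-vector product.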

Structured and unstructured numerical meshes

Numerical meshes are necessary building blocks of PDE solvers. TNL offers both structured and unstructured meshes, on either the CPU or GPU.

  • A structured numerical grid can have 1, 2, or 3 dimensions. It can index not only grid cells but also faces, edges, and vertices.
  • Unstructured numerical meshes can have an arbitrary number of dimensions. For this type of mesh, all mesh entities (cells, faces, edges, vertices, etc.) must be stored explicitly in memory. Especially on GPUs, memory is a precious resource. Therefore, in TNL, one can configure which types of mesh entities, and which relations between them, are stored. This way, the unstructured mesh can be fine-tuned for a specific numerical scheme, typically based on the finite volume or finite element method. Unstructured meshes can be imported from the VTI, VTK, VTU, XMLVTK, and Netgen formats.

Numerical solvers

TNL offers iterative solvers for linear systems, including stationary solvers (Jacobi and SOR, currently CPU only) and Krylov subspace methods (CG, BiCGStab, GMRES, CWYGMRES, TFQMR), together with a few preconditioners (Jacobi, plus ILU0 and ILUT on CPU only).
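As a reminder of what the simplest of these does, here is a serial sketch of the Jacobi iteration for a small dense system (illustrative only; TNL's solvers operate on its own matrix and vector types):

```cpp
#include <cstddef>
#include <vector>

// Serial sketch of the Jacobi iteration: each new component is obtained by
// solving its equation for the diagonal unknown while keeping the other
// components fixed at their previous values.
std::vector< double > jacobiSolve( const std::vector< std::vector< double > >& A,
                                   const std::vector< double >& b,
                                   int iterations )
{
   const std::size_t n = b.size();
   std::vector< double > x( n, 0.0 ), next( n );
   for( int it = 0; it < iterations; it++ ) {
      for( std::size_t i = 0; i < n; i++ ) {
         double sum = b[ i ];
         for( std::size_t j = 0; j < n; j++ )
            if( j != i )
               sum -= A[ i ][ j ] * x[ j ];   // subtract off-diagonal contributions
         next[ i ] = sum / A[ i ][ i ];
      }
      x = next;   // synchronous update: all components advance together
   }
   return x;
}
```

The method converges for diagonally dominant systems; in practice the Krylov methods listed above converge much faster and Jacobi is used mostly as a smoother or preconditioner.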

For the solution of ODEs, there is a first-order Euler solver and a fourth-order adaptive Runge-Kutta-Merson solver.
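The first-order method amounts to the update u += tau * f(t, u); a minimal serial sketch (the function name and fixed time step are illustrative; TNL's solvers add adaptivity, error control, and GPU support) looks like this:

```cpp
// Minimal sketch of the explicit Euler method for u' = f(t, u):
// advance from t0 to t1 in a fixed number of equal steps.
template< typename F >
double eulerSolve( F f, double u0, double t0, double t1, int steps )
{
   const double tau = ( t1 - t0 ) / steps;   // fixed time step
   double u = u0, t = t0;
   for( int i = 0; i < steps; i++ ) {
      u += tau * f( t, u );   // first-order update
      t += tau;
   }
   return u;
}
```

The Runge-Kutta-Merson solver replaces this single-stage update with a five-stage one whose embedded error estimate drives the adaptive choice of tau.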

Distributed computing

TNL supports distributed arrays, vectors, matrices, and unstructured meshes.


Tools

TNL also includes a set of supporting tools. These are simple command-line applications for preprocessing or postprocessing of computations, including tools for converting (medical) images to TNL data structures, an exporter of TNL data to VTK or Gnuplot formats, and tools for convergence studies.