Unified memory management for CPUs and GPUs

All data structures in TNL contains information about its allocation (on CPU or GPU). It is represented by a C++ template parameter and so correctness of all memory accesses are checked already at compile time. For example array declaration looks as follows:

TNL::Containers::Array< Value, Device, Index, Allocator > array;

  • Value is data type of the array elements.
  • Device says where the array is allocated i.e. CPU  or GPU.
  • Index is type for indexing the array elements.
  • Allocator performs memory allocation on CPU, GPU, page-locked memory or CUDA unified memory.

TNL also offers extended smart pointers which makes allocation of objects on GPU significantly simpler.

Flexible parallel reduction

Parallel reduction is one of the most important parallel operations. It is, however, rather tedious to implement it. TNL profits from C++ lambda functions and offers something what we call flexible parallel reduction. The following examples demonstrates the scalar product:

auto fetch = [=] __cuda_callable__ (int i) { return ( a[i] * b[i] ); };
auto reduce = [] __cuda_callable__ (float x, float y) { return x + y; };
TNL::Algorithms::reduce< Devices::Cuda >( 0, size, fetch, reduce, 0.0 );

The lambda function fetch serves for reading the data to be reduced, in this case it also multiplies related elements of the input vectors. The function reduce represents the operation used for the reduction. It is addition in this case. After replacing Devices::Cuda with Devices::Host, the same will be done on CPU.

Expression templates

Most of the Blas Level 1 functions (and even more) are available in form of expression templates in TNL. They are very intuitive, easy to use and efficient at the same time. The following code based on Cublas

cublasHandle_t handle;
cublasSaxpy( handle, size, 1.0, a, 1, x, 1 );
cublasSaxpy( handle, size, 2.0, b, 1, x, 1 );
cublasSaxpy( handle, size, 3.0, c, 1, x, 1 );

is equivalent with the following code in TNL

x = a + 2 * b + 3 * c;

It is simpler and in addition it is up to 1.7 times faster compared to Cublas.

Dense and sparse matrices

Dense and sparse matrices are one of the most important data structures for majority of HPC algorithms. TNL offers unified interface for both dense and sparse matrices. The sparse matrices can have fixed (tridiagonal and multidiagonal) or general layout. General sparse matrices can be stored in one of many formats optimized for various matrix elements patterns, namely CSR, Ellpack, SlicedEllpack, ChunkedEllpack or BisectionEllpack. Genral sparse matrices can be stored as symmetric (only lower part and diagonal os stored) or binary (only positions of nonzero-elements are stored) to minimze memory requirements.

The matrix elements can be simply set-up with a help of lambda functions as follows (it works even on GPUs, of course):

auto f = [] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, double& value ) {
value = rowIdx + columnIdx; };
matrix.forAllElements( f );

You can perform flexible parallel reduction within particular matrix rows. For example, matrix-vector multiplication can be implemented as follows (it works even on GPUs, of course):

auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
return x[ columnIdx ] * value; };
auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
y[ rowIdx ] = value; };
matrix.reduceAllRows( fetch, TNL::plus{}, keep, 0.0 );

Structured and unstructured numerical meshes

Numerical meshes are necessary building blocks for PDE solvers. TNL offers both structured and unstructured meshes either on CPU or GPU.

  • Structured numerical grid - can have 1, 2 or 3 dimensions. It is able to index not only grid cells but also faces, edges and vertices including.
  • Unstructred numerical meshes - can have arbitrary number of dimensions. For these type of meshes, all mesh entities (cells, faces, edges, vertices ...) must be stored explicitly in the memory. Especially on GPUs, memory is precious resource. Therefore, in TNL, one can configure what type of mesh enities and relations between them, are supposed to be stored. This way, the unstructured mesh can be fine-tuned for specific numerical scheme, typically based on the finite volume or finite element method. Unstructured meshes can be imported from VTI, VTK, VTU, XMLVTK and Netgen formats.

Numerical solvers

TNL offers iterative solvers for linear systems including stationary solvers (Jacobi, SOR - CPU only currently), Krylov subspace methods (CG, BiCGStab,GMRES, CWYGMRES, TFQMR) together with few preconditioners (Jacobi, ILU0 - CPU only,  ILUT - CPU only).

For solution of ODEs, there is first order Euler and fourth order adaptive Runge-Kutta-Merson solver.

PDE framework

TNL provides a framework for rapid development of PDE solvers. It is based on an architecture similar to client-server which we call problem-solver. On one hand there is PDE problem to be solved represented by a templated C++ class. It is written by the user and it describes mainly organisation of the degrees of freedom and numerical scheme.  In a lot of cases, it is independent on the hardware architecture. On the other hand, there is a solver part implemented in TNL which manages numerical meshes, all necessary solvers (for linear systems or ODE systems) and hardware beneath.

Distibuted computing

TNL supports distributed arrays and vectors, matrices and unstructured meshes.


TNL involves set of supporting tools. They are simple command-line  application for the computation preprocessing or postprocessing including tools for conversion of (medical) images to TNL data structures, exporter of TNL data to VTK or gnuplot and tools for convergence study.