Unified memory management for CPUs and GPUs

All data structures in TNL contains information about its allocation (on CPU or GPU). It is represented by a C++ template parameter and so correctness of all memory accesses are checked already at compile time. For example array definiton looks as follows:

TNL::Array< Value, Device, Index, Allocator >

  • Value is data type of the array elements.
  • Device says where the array is allocated i.e. CPU  or GPU.
  • Index is type for indexing the array elements.
  • Allocator performs memory allocation on CPU, GPU, page-locked memory or CUDA unified memory.

TNL also offers extended smart pointers which makes allocation of objects on GPU significantly simpler.

Flexible parallel reduction

Parallel reduction is one of the most important parallel operations. It is, however, rather tedious to implement it. TNL profits from C++ lambda functions and offers something what we call flexible parallel reduction. The following examples demonstrates the scalar product:

auto fetch = [=] __cuda_callable__ (int i)->float {return ( a[i] * b[i] ); };
auto reduce = [] __cuda_callable__ (float& x, const float& y) {x += y; };
Reduction< Devices::Cuda >::reduce( size,reduce,fetch,0.0 );

The lambda function fetch serves for reading the data to be reduced, in this case it also multiplies related elements of the input vectors. The function reduce represents the operation used for the reduction. It is addition in this case. After replacing Devices::Cuda with Devices::Host, the same will be done on CPU.

Expression templates

Most of the Blas Level 1 functions (and even more) are available in form of expression templates in TNL. They are very intuitive, easy to use and efficient at the same time. The following code based on Cublas

cublasHandle_t handle;
cublasSaxpy( handle, size, 1.0, a, 1, x, 1 );
cublasSaxpy( handle, size, 2.0, b, 1, x, 1 );
cublasSaxpy( handle, size, 3.0, c, 1, x, 1 );

is equivalent with the following code in TNL

x = a + 2 * b + 3 * c;

It is simpler and in addition it is up to 1.7 times faster compared to Cublas.

Efficient data structures

In TNL, you can find set of data structures which are designed to be efficient for algorithms running on both CPUs and GPUs. Such data structures include arrays, dense matrices, sparse matrices (diagonal, tridiagonal, multidiagonal, CSR, Ellpack, Sliced Ellpack) but also orthogonal numerical grids together with conforming unstructured numerical meshes.

With TNL, you can store the numerical meshes on GPU, which makes assembly of linear systems on GPU much more efficient. TNL offers unified, matrix format independent, interface for this purpose.

Numerical solvers

TNL offers iterative solvers for linear systems including stationary solvers (Jacobi, SOR - CPU only currently), Krylov subspace methods (CG, BiCGStab,GMRES, CWYGMRES, TFQMR) together with few preconditioners (Jacobi, ILU0 - CPU only,  ILUT - CPU only).

For solution of ODEs, there is first order Euler and fourth order adaptive Runge-Kutta-Merson solver.

PDE framework

TNL provides a framework for rapid development of PDE solvers. It is based on an architecture similar to client-server which we call problem-solver. On one hand there is PDE problem to be solved represented by a templated C++ class. It is written by the user and it describes mainly organisation of the degrees of freedom and numerical scheme.  In a lot of cases, it is independent on the hardware architecture. On the other hand, there is a solver part implemented in TNL which manages numerical meshes, all necessary solvers (for linear systems or ODE systems) and hardware beneath.

Distibuted computing

TNL supports distributed arrays and vectors, matrices and unstructured meshes.


TNL involves set of supporting tools. They are simple command-line  application for the computation preprocessing or postprocessing including tools for conversion of (medical) images to TNL data structures, exporter of TNL data to VTK or gnuplot and tools for convergence study.


We are currently working on:

  • Documentation
  • Support of GPUs by AMD via HIP
  • Polyhedral meshes
  • Parallel sorting
  • Hashing
  • n-D arrays
  • Adaptive orthogonal numerical grids
  • Computations on clusters (including GPU clusters)
  • Additional matrix formats
  • Higher-precision arithmetic
  • API for Python