Unified memory management for CPUs and GPUs

All data structures in TNL carry information about their allocation (on the CPU or GPU). The target device is represented by a C++ template parameter, so the correctness of all memory accesses is checked already at compile time. For example, an array declaration looks as follows:

TNL::Containers::Array< Value, Device, Index, Allocator > array;
  • Value is the data type of the array elements.
  • Device determines where the array is allocated, i.e., on the CPU or GPU.
  • Index is the type used for indexing the array elements.
  • Allocator performs the memory allocation: in conventional CPU memory, GPU memory, page-locked host memory, or CUDA unified memory.
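As a rough sketch of the idea (plain C++ with hypothetical Host/Cuda tags, not TNL's actual implementation), the device can be made part of the array's type, so that an accidental cross-device assignment fails to compile:

```cpp
#include <type_traits>
#include <vector>

// Hypothetical device tags, mimicking TNL::Devices::Host / TNL::Devices::Cuda.
struct Host {};
struct Cuda {};

// Minimal sketch of a device-tagged array: the Device parameter is part of the
// type, so mixing arrays from different devices is rejected at compile time.
template< typename Value, typename Device >
struct Array
{
   std::vector< Value > data;  // real GPU storage would use a CUDA allocator

   // Element-wise assignment is only allowed between arrays on the same device.
   template< typename OtherDevice >
   Array& operator=( const Array< Value, OtherDevice >& other )
   {
      static_assert( std::is_same< Device, OtherDevice >::value,
                     "cross-device assignment requires an explicit transfer" );
      data = other.data;
      return *this;
   }
};
```

Here the `static_assert` plays the role of the compile-time check mentioned above; TNL itself provides explicit cross-device copies instead of simply forbidding them.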

TNL also offers extended smart pointers which simplify passing objects to GPUs.

Flexible parallel reduction

Parallel reduction is one of the most important parallel operations. It is, however, rather tedious to implement. TNL takes advantage of C++ lambda functions and offers an interface for so-called flexible parallel reduction. The following example demonstrates its use for the implementation of a scalar product:

auto fetch = [=] __cuda_callable__ (int i) { return ( a[i] * b[i] ); };
auto reduce = [] __cuda_callable__ (float x, float y) { return x + y; };
auto product = TNL::Algorithms::reduce< Devices::Cuda >( 0, size, fetch, reduce, 0.0 );

The lambda function fetch reads the data to be reduced; in this case, it also multiplies the corresponding elements of the input vectors. The function reduce represents the operation used for the reduction, in this case addition. After replacing Devices::Cuda with Devices::Host, the same operation is computed on the CPU.
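The semantics of the fetch/reduce pair can be illustrated by a serial sketch in plain C++ (the function name `flexibleReduce` is illustrative; it is not TNL's implementation, which runs the loop in parallel on the chosen device):

```cpp
// Serial sketch of flexible parallel reduction: fetch produces the i-th value,
// reduce folds it into the running result, identity is the neutral element.
template< typename Index, typename Fetch, typename Reduce, typename Result >
Result flexibleReduce( Index begin, Index end, Fetch fetch, Reduce reduce, Result identity )
{
   Result result = identity;
   for( Index i = begin; i < end; i++ )
      result = reduce( result, fetch( i ) );   // accumulate fetched values
   return result;
}
```

With fetch returning a[i] * b[i] and reduce performing addition, this computes exactly the scalar product from the example above.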

Expression templates

Most of the BLAS Level 1 functions (and more) are available in TNL in the form of expression templates. They are intuitive, easy to use, and efficient at the same time. The following code based on cuBLAS

cublasHandle_t handle;
cublasCreate( &handle );
const float a1 = 1.0f, a2 = 2.0f, a3 = 3.0f;
cublasSaxpy( handle, size, &a1, a, 1, x, 1 );
cublasSaxpy( handle, size, &a2, b, 1, x, 1 );
cublasSaxpy( handle, size, &a3, c, 1, x, 1 );

is equivalent to the following code in TNL

x = a + 2 * b + 3 * c;

It is simpler and, in addition, up to 1.7 times faster than cuBLAS.
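The speedup comes from loop fusion: the whole expression is evaluated in a single pass instead of three saxpy kernels. A minimal standalone sketch of the technique (plain C++, much simpler than TNL's actual expression templates) looks like this:

```cpp
#include <cstddef>
#include <vector>

// Tiny expression-template sketch: x = a + 2*b + 3*c builds a lightweight
// expression tree that is evaluated in one fused loop on assignment,
// avoiding temporary vectors.
template< typename L, typename R >
struct Add
{
   const L& l;
   const R& r;
   double operator[]( std::size_t i ) const { return l[ i ] + r[ i ]; }
};

template< typename E >
struct Scale
{
   double s;
   const E& e;
   double operator[]( std::size_t i ) const { return s * e[ i ]; }
};

struct Vec
{
   std::vector< double > data;
   double operator[]( std::size_t i ) const { return data[ i ]; }

   template< typename Expr >
   Vec& operator=( const Expr& expr )
   {
      for( std::size_t i = 0; i < data.size(); i++ )  // single fused loop
         data[ i ] = expr[ i ];
      return *this;
   }
};

template< typename L, typename R >
Add< L, R > operator+( const L& l, const R& r ) { return { l, r }; }

template< typename E >
Scale< E > operator*( double s, const E& e ) { return { s, e }; }
```

Each operator only records references to its operands; no arithmetic happens until the assignment, where all terms are combined element by element in one loop.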

Dense and sparse matrices

Dense and sparse matrices are among the most important data structures for the majority of HPC algorithms. TNL offers a unified interface for both dense and sparse matrices. Sparse matrices can have a fixed (tridiagonal or multidiagonal) or general layout. General sparse matrices can be stored in one of several formats optimized for various matrix element patterns, namely CSR, Ellpack, SlicedEllpack, ChunkedEllpack, or BiEllpack. They can also be stored as symmetric (only the lower part and the diagonal are stored) or binary (only the positions of nonzero elements are stored) to minimize memory requirements.
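To give an idea of what such a format looks like, here is a minimal sketch of the CSR layout in plain C++ (the field names are illustrative, not TNL's actual classes):

```cpp
#include <vector>

// Minimal CSR (Compressed Sparse Row) sketch: only nonzero elements are
// stored, row by row, together with their column indices.
struct CsrMatrix
{
   int rows, columns;
   std::vector< int > rowPointers;     // rowPointers[ r ]..rowPointers[ r+1 ] delimit row r
   std::vector< int > columnIndexes;   // column of each stored nonzero
   std::vector< double > values;       // value of each stored nonzero

   double getElement( int r, int c ) const
   {
      for( int k = rowPointers[ r ]; k < rowPointers[ r + 1 ]; k++ )
         if( columnIndexes[ k ] == c )
            return values[ k ];
      return 0.0;   // element not stored, i.e. a structural zero
   }
};
```

The other Ellpack-style formats mentioned above trade this compact layout for more regular memory access patterns, which tends to pay off on GPUs.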

The matrix elements can be easily set with the help of lambda functions as follows (it works on GPUs as well, of course):

auto f = [] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, double& value ) {
   value = rowIdx + columnIdx;
};
matrix.forAllElements( f );
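A serial sketch of what such a traversal does, written for a plain dense matrix (illustrative only; TNL's forAllElements runs in parallel and works with its own matrix types):

```cpp
#include <vector>

// Serial sketch of an "apply f to every element" traversal over a dense
// matrix: f receives the row index, the local index within the row, the
// column index, and a mutable reference to the stored value.
template< typename F >
void forAllElements( std::vector< std::vector< double > >& matrix, F f )
{
   for( int rowIdx = 0; rowIdx < (int) matrix.size(); rowIdx++ )
      for( int columnIdx = 0; columnIdx < (int) matrix[ rowIdx ].size(); columnIdx++ ) {
         // for a dense row, the local index within the row equals the column index;
         // in a sparse format the two generally differ
         int localIdx = columnIdx;
         f( rowIdx, localIdx, columnIdx, matrix[ rowIdx ][ columnIdx ] );
      }
}
```

Passing the lambda from the example above fills each element with the sum of its row and column index.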

You can also perform flexible parallel reduction within individual matrix rows. For example, matrix-vector multiplication can be implemented as follows (again on both CPUs and GPUs):

auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
   return x[ columnIdx ] * value;
};
auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
   y[ rowIdx ] = value;
};
matrix.reduceAllRows( fetch, TNL::plus{}, keep, 0.0 );
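The row-wise reduction can again be pictured with a serial sketch over a CSR matrix (the free function below is illustrative; TNL's reduceAllRows is a parallel method of its matrix classes):

```cpp
#include <vector>

// Serial sketch of row-wise flexible reduction over a CSR matrix: for each
// row, fetch maps every stored element to a value, reduce folds the values,
// and keep stores the per-row result.
template< typename Fetch, typename Reduce, typename Keep >
void reduceAllRows( int rows,
                    const std::vector< int >& rowPointers,
                    const std::vector< int >& columnIndexes,
                    const std::vector< double >& values,
                    Fetch fetch, Reduce reduce, Keep keep, double identity )
{
   for( int rowIdx = 0; rowIdx < rows; rowIdx++ ) {
      double result = identity;
      for( int k = rowPointers[ rowIdx ]; k < rowPointers[ rowIdx + 1 ]; k++ )
         result = reduce( result, fetch( rowIdx, columnIndexes[ k ], values[ k ] ) );
      keep( rowIdx, result );   // store the per-row result, e.g. y[ rowIdx ]
   }
}
```

With fetch returning x[ columnIdx ] * value, addition as reduce, and keep writing into y, each row yields one entry of the matrix-vector product.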

Structured and unstructured numerical meshes

Numerical meshes are necessary building blocks of PDE solvers. TNL offers both structured and unstructured meshes, on either the CPU or GPU.

  • A structured numerical grid can have 1, 2, or 3 dimensions. It can index not only grid cells but also faces, edges, and vertices.
  • Unstructured numerical meshes can have an arbitrary number of dimensions. For this type of mesh, all mesh entities (cells, faces, edges, vertices, etc.) must be stored explicitly in memory. Especially on GPUs, memory is a precious resource. Therefore, in TNL, one can configure which types of mesh entities, and which relations between them, are stored. This way, the unstructured mesh can be fine-tuned for a specific numerical scheme, typically based on the finite volume or finite element method. Unstructured meshes can be imported from the VTI, VTK, VTU, XMLVTK, and Netgen formats.

Numerical solvers

TNL offers iterative solvers for linear systems, including stationary solvers (Jacobi and SOR, currently CPU only) and Krylov subspace methods (CG, BiCGStab, GMRES, CWYGMRES, TFQMR), together with a few preconditioners (Jacobi, plus ILU0 and ILUT on CPU only).
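As a reminder of what the simplest of these does, here is a serial sketch of the Jacobi iteration for a small dense system (illustrative only; TNL's solvers operate on its own matrix and vector types):

```cpp
#include <cstddef>
#include <vector>

// Serial sketch of the Jacobi iteration: each new component is obtained by
// solving its equation for the diagonal unknown while keeping the other
// components fixed at their previous values.
std::vector< double > jacobiSolve( const std::vector< std::vector< double > >& A,
                                   const std::vector< double >& b,
                                   int iterations )
{
   const std::size_t n = b.size();
   std::vector< double > x( n, 0.0 ), next( n );
   for( int it = 0; it < iterations; it++ ) {
      for( std::size_t i = 0; i < n; i++ ) {
         double sum = b[ i ];
         for( std::size_t j = 0; j < n; j++ )
            if( j != i )
               sum -= A[ i ][ j ] * x[ j ];   // subtract off-diagonal contributions
         next[ i ] = sum / A[ i ][ i ];
      }
      x = next;   // synchronous update: all components advance together
   }
   return x;
}
```

The method converges for diagonally dominant systems; in practice the Krylov methods listed above converge much faster and Jacobi is used mostly as a smoother or preconditioner.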

For the solution of ODEs, there is a first-order Euler solver and a fourth-order adaptive Runge-Kutta-Merson solver.
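The first-order method amounts to the update u += tau * f(t, u); a minimal serial sketch (the function name and fixed time step are illustrative; TNL's solvers add adaptivity, error control, and GPU support) looks like this:

```cpp
// Minimal sketch of the explicit Euler method for u' = f(t, u):
// advance from t0 to t1 in a fixed number of equal steps.
template< typename F >
double eulerSolve( F f, double u0, double t0, double t1, int steps )
{
   const double tau = ( t1 - t0 ) / steps;   // fixed time step
   double u = u0, t = t0;
   for( int i = 0; i < steps; i++ ) {
      u += tau * f( t, u );   // first-order update
      t += tau;
   }
   return u;
}
```

The Runge-Kutta-Merson solver replaces this single-stage update with a five-stage one whose embedded error estimate drives the adaptive choice of tau.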

Distributed computing

TNL supports distributed arrays, vectors, matrices, and unstructured meshes.


Tools

TNL also includes a set of supporting tools. These are simple command-line applications for preprocessing or postprocessing of computations, including tools for converting (medical) images to TNL data structures, an exporter of TNL data to VTK or Gnuplot formats, and tools for convergence studies.