Features
Unified memory management for CPUs and GPUs
All data structures in TNL contain information about its allocation (on CPU or GPU). It is represented by a C++ template parameter and so correctness of all memory accesses are checked already at compile time. For example array declaration looks as follows:
TNL::Containers::Array< Value, Device, Index, Allocator > array;

Value
is data type of the array elements. 
Device
determines where the array is allocated, i.e. CPU or GPU. 
Index
is type for indexing the array elements. 
Allocator
performs memory allocation on CPU, GPU, pagelocked memory or CUDA unified memory.
TNL also offers extended smart pointers which simplify passing objects to GPUs.
Flexible parallel reduction
Parallel reduction is one of the most important parallel operations. It is, however, rather tedious to implement it. TNL profits from C++ lambda functions and offers interface for the socalled flexible parallel reduction. The following example demonstrates its use for the implementation of the scalar product:
auto fetch = [=] __cuda_callable__ (int i) { return ( a[i] * b[i] ); }; auto reduce = [] __cuda_callable__ (float x, float y) { return x + y; }; TNL::Algorithms::reduce< Devices::Cuda >( 0, size, fetch, reduce, 0.0 );
The lambda function fetch
serves for reading the data to be reduced, in
this case it also multiplies related elements of the input vectors. The
function reduce
represents the operation used for the reduction, in this
case addition. After replacing Devices::Cuda
with Devices::Host
, the same
operation will be computed on CPU.
Expression templates
Most of the BLAS Level 1 functions (and even more) are available in TNL in the form of expression templates. They are intuitive, easy to use and efficient at the same time. The following code based on Cublas
cublasHandle_t handle; cublasSaxpy( handle, size, 1.0, a, 1, x, 1 ); cublasSaxpy( handle, size, 2.0, b, 1, x, 1 ); cublasSaxpy( handle, size, 3.0, c, 1, x, 1 );
is equivalent to the following code in TNL
x = a + 2 * b + 3 * c;
It is simpler and in addition it is up to 1.7 times faster compared to Cublas.
Dense and sparse matrices
Dense and sparse matrices are one of the most important data structures for majority of HPC algorithms. TNL offers unified interface for both dense and sparse matrices. The sparse matrices can have fixed (tridiagonal and multidiagonal) or general layout. General sparse matrices can be stored in one of many formats optimized for various matrix elements patterns, namely CSR, Ellpack, SlicedEllpack, ChunkedEllpack or BisectionEllpack. General sparse matrices can be stored as symmetric (only lower part and diagonal os stored) or binary (only positions of nonzeroelements are stored) to minimize memory requirements.
The matrix elements can be simply set up with a help of lambda functions as follows (it works even on GPUs, of course):
auto f = [] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, double& value ) { value = rowIdx + columnIdx; }; matrix.forAllElements( f );
You can perform flexible parallel reduction within particular matrix rows. For example, matrixvector multiplication can be implemented as follows (it works even on GPUs, of course):
auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) > double { return x[ columnIdx ] * value; }; auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable { y[ rowIdx ] = value; }; matrix.reduceAllRows( fetch, TNL::plus{}, keep, 0.0 );
Structured and unstructured numerical meshes
Numerical meshes are necessary building blocks for PDE solvers. TNL offers both structured and unstructured meshes either on CPU or GPU.
 Structured numerical grid can have 1, 2 or 3 dimensions. It is able to index not only grid cells but also faces, edges and vertices including.
 Unstructred numerical meshes can have arbitrary number of dimensions. For these type of meshes, all mesh entities (cells, faces, edges, vertices, etc.) must be stored explicitly in the memory. Especially on GPUs, memory is precious resource. Therefore, in TNL, one can configure what type of mesh enities and relations between them, are supposed to be stored. This way, the unstructured mesh can be finetuned for specific numerical scheme, typically based on the finite volume or finite element method. Unstructured meshes can be imported from VTI, VTK, VTU, XMLVTK and Netgen formats.
Numerical solvers
TNL offers iterative solvers for linear systems including stationary solvers (Jacobi, SOR  CPU only currently), Krylov subspace methods (CG, BiCGStab, GMRES, CWYGMRES, TFQMR) together with few preconditioners (Jacobi, ILU0  CPU only, ILUT  CPU only).
For solution of ODEs, there is first order Euler and fourth order adaptive RungeKuttaMerson solver.
Distibuted computing
TNL supports distributed arrays and vectors, matrices and unstructured meshes.
Tools
TNL involves set of supporting tools. They are simple commandline application for the computation preprocessing or postprocessing including tools for conversion of (medical) images to TNL data structures, exporter of TNL data to VTK or gnuplot and tools for convergence study.