All data structures in TNL contain information about their allocation (on the CPU or the GPU). The target device is given by a C++ template parameter, so the correctness of all memory accesses is checked already at compile time. For example, the array declaration looks as follows:
TNL::Array< Value, Device, Index, Allocator >
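For instance, the same array type can be instantiated on the host or on a CUDA device just by changing the Device parameter (a minimal sketch, assuming the TNL::Containers namespace; the Index and Allocator parameters have sensible defaults):
#include <TNL/Containers/Array.h>

// The same code targets different hardware depending on the Device parameter.
TNL::Containers::Array< float, TNL::Devices::Host > host_array( 1000 );
TNL::Containers::Array< float, TNL::Devices::Cuda > device_array( 1000 );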
TNL also offers extended smart pointers, which make the allocation of objects on the GPU significantly simpler.
Parallel reduction is one of the most important parallel operations. It is, however, rather tedious to implement. TNL profits from C++ lambda functions and offers what we call flexible parallel reduction. The following example demonstrates the scalar product:
auto fetch = [=] __cuda_callable__ ( int i ) -> float { return a[ i ] * b[ i ]; };
auto reduce = [] __cuda_callable__ ( float& x, const float& y ) -> float { return x + y; };
float product = TNL::reduce< Devices::Cuda >( 0, size, fetch, reduce, 0.0 );
The lambda function fetch serves for reading the data to be reduced; in this case, it also multiplies the corresponding elements of the input vectors. The function reduce represents the operation used for the reduction, in this case addition. After replacing Devices::Cuda with Devices::Host, the same reduction is performed on the CPU.
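Put together, a self-contained version of the scalar product might look as follows (a sketch assuming TNL::Containers::Vector and its getConstView accessor; header paths may differ and in some TNL versions reduce lives in the TNL::Algorithms namespace):
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/reduce.h>

using namespace TNL;

float scalarProduct( int size )
{
   Containers::Vector< float, Devices::Cuda > a( size ), b( size );
   a = 1.0f;  // scalar assignment fills the whole vector
   b = 2.0f;
   // Views are lightweight handles that are safe to capture by value in [=].
   auto a_view = a.getConstView();
   auto b_view = b.getConstView();
   auto fetch = [=] __cuda_callable__ ( int i ) -> float { return a_view[ i ] * b_view[ i ]; };
   auto reduce = [] __cuda_callable__ ( float x, float y ) -> float { return x + y; };
   return TNL::reduce< Devices::Cuda >( 0, size, fetch, reduce, 0.0f );
}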
Most of the BLAS Level 1 functions (and even more) are available in the form of expression templates in TNL. They are very intuitive, easy to use and efficient at the same time. The following code based on cuBLAS
cublasHandle_t handle;
cublasCreate( &handle );
const float a1 = 1.0f, a2 = 2.0f, a3 = 3.0f;  // cuBLAS passes scalars by pointer
cublasSaxpy( handle, size, &a1, a, 1, x, 1 );
cublasSaxpy( handle, size, &a2, b, 1, x, 1 );
cublasSaxpy( handle, size, &a3, c, 1, x, 1 );
is equivalent to the following code in TNL
x = a + 2 * b + 3 * c;
It is simpler and, in addition, up to 1.7 times faster compared to cuBLAS.
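For context, the TNL side of this comparison can be set up as follows (a minimal sketch, assuming TNL::Containers::Vector):
#include <TNL/Containers/Vector.h>

using namespace TNL;

Containers::Vector< float, Devices::Cuda > a( size ), b( size ), c( size ), x( size );
a = 1.0f; b = 2.0f; c = 3.0f;
// The expression templates evaluate the whole right-hand side elementwise
// in a single pass, without allocating temporary vectors.
x = a + 2 * b + 3 * c;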
Dense and sparse matrices are among the most important data structures for the majority of HPC algorithms. TNL offers a unified interface for both dense and sparse matrices. Sparse matrices can have a fixed (tridiagonal and multidiagonal) or general layout. General sparse matrices can be stored in one of many formats optimized for various patterns of the matrix elements, namely CSR, Ellpack, SlicedEllpack, ChunkedEllpack or BisectionEllpack. General sparse matrices can also be stored as symmetric (only the lower part and the diagonal are stored) or binary (only the positions of the nonzero elements are stored) to minimize memory requirements.
The matrix elements can be easily set up with the help of lambda functions as follows (it works even on GPUs, of course):
auto f = [] __cuda_callable__ ( int rowIdx, int localIdx, int& columnIdx, double& value ) {
   value = rowIdx + columnIdx;
};
matrix.forAllElements( f );
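The snippet above assumes an existing matrix object; for instance, a dense matrix on the GPU can be created as follows (a sketch assuming TNL::Matrices::DenseMatrix; the sparse matrix types provide the same forAllElements interface):
#include <TNL/Matrices/DenseMatrix.h>

// A 5x5 dense matrix allocated on the GPU; forAllElements then visits
// all elements in parallel on the device.
TNL::Matrices::DenseMatrix< double, TNL::Devices::Cuda > matrix( 5, 5 );
matrix.forAllElements( f );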
You can also perform flexible parallel reduction within particular matrix rows. For example, matrix-vector multiplication can be implemented as follows (again, it works even on GPUs):
auto fetch = [=] __cuda_callable__ ( int rowIdx, int columnIdx, const double& value ) -> double {
   return x[ columnIdx ] * value;
};
auto keep = [=] __cuda_callable__ ( int rowIdx, const double& value ) mutable {
   y[ rowIdx ] = value;
};
matrix.reduceAllRows( fetch, TNL::plus{}, keep, 0.0 );
Numerical meshes are necessary building blocks of PDE solvers. TNL offers both structured and unstructured meshes, either on the CPU or on the GPU.
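A structured mesh, for example, can be set up as follows (a sketch assuming TNL::Meshes::Grid; the setters shown are illustrative and may differ between TNL versions):
#include <TNL/Meshes/Grid.h>

// A 2D structured grid of 100 x 100 cells over the unit square, stored on the GPU.
using GridType = TNL::Meshes::Grid< 2, double, TNL::Devices::Cuda, int >;
GridType grid;
grid.setDimensions( 100, 100 );
grid.setDomain( GridType::PointType( 0.0 ), GridType::PointType( 1.0 ) );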
TNL offers iterative solvers for linear systems, including stationary solvers (Jacobi, SOR; currently CPU only) and Krylov subspace methods (CG, BiCGStab, GMRES, CWYGMRES, TFQMR), together with a few preconditioners (Jacobi, ILU0 - CPU only, ILUT - CPU only).
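Solving a system with the CG solver might look as follows (a hedged sketch: it assumes TNL::Solvers::Linear::CG and that the matrix is passed to the solver as a std::shared_ptr; the exact setup may differ between TNL versions):
#include <memory>
#include <TNL/Containers/Vector.h>
#include <TNL/Matrices/SparseMatrix.h>
#include <TNL/Solvers/Linear/CG.h>

using MatrixType = TNL::Matrices::SparseMatrix< double, TNL::Devices::Cuda >;
using VectorType = TNL::Containers::Vector< double, TNL::Devices::Cuda >;

void solveSystem( std::shared_ptr< MatrixType > matrix, const VectorType& b, VectorType& x )
{
   TNL::Solvers::Linear::CG< MatrixType > solver;
   solver.setMatrix( matrix );  // the matrix is shared with the solver
   solver.solve( b, x );        // x serves as the initial guess and receives the result
}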
For the solution of ODEs, there is a first-order Euler solver and a fourth-order adaptive Runge-Kutta-Merson solver.
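Integrating the scalar ODE u'(t) = t with the Euler solver might look as follows (a hedged sketch: the type TNL::Solvers::ODE::Euler and the setters below follow the TNL tutorials but should be treated as illustrative):
#include <TNL/Containers/StaticVector.h>
#include <TNL/Solvers/ODE/Euler.h>

using Vector = TNL::Containers::StaticVector< 1, double >;

// The right-hand side f(t, u) of the ODE; fu receives the computed derivative.
auto f = [] ( double t, double tau, const Vector& u, Vector& fu ) { fu[ 0 ] = t; };

TNL::Solvers::ODE::Euler< Vector > solver;
solver.setTau( 0.001 );  // integration time step
solver.setTime( 0.0 );
solver.setStopTime( 1.0 );
Vector u( 0.0 );         // initial condition u(0) = 0
solver.solve( u, f );    // u now holds the approximation of u(1)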
TNL provides a framework for rapid development of PDE solvers. It is based on an architecture similar to client-server, which we call problem-solver. On the one hand, there is the PDE problem to be solved, represented by a templated C++ class. It is written by the user and describes mainly the organisation of the degrees of freedom and the numerical scheme. In many cases, it is independent of the hardware architecture. On the other hand, there is the solver part implemented in TNL, which manages the numerical meshes, all the necessary solvers (for linear systems or ODE systems) and the hardware beneath.
TNL supports distributed arrays and vectors, matrices and unstructured meshes.
TNL includes a set of supporting tools. They are simple command-line applications for computation preprocessing or postprocessing, including tools for the conversion of (medical) images into TNL data structures, an exporter of TNL data to VTK or gnuplot formats, and tools for convergence studies.