WaterLily.jl is a simple and fast fluid simulator written in pure Julia. It solves the unsteady incompressible 2D or 3D Navier-Stokes equations on a Cartesian grid. The pressure Poisson equation is solved with a geometric multigrid method. Solid boundaries are modelled using the Boundary Data Immersion Method.
In v1.0, the solver has been ported from serial CPU execution to backend-agnostic execution, including multi-threaded CPU and GPUs from different vendors (NVIDIA and AMD), thanks to KernelAbstractions.jl (KA). We have also recently published an extended abstract preprint with benchmarking details of this port (see arXiv). In this post, we review our approach to porting the code, together with the challenges we have faced.
Introducing KernelAbstractions.jl
The main ingredient of this port is the @kernel macro from KA.
This macro takes a function definition and converts it into a kernel specialised for a given backend. KA can work with CUDA, ROCm, oneAPI, and Metal backends.
As an example, consider the divergence operator for a 2D vector field $\vec{u}=(u, v)$. In the finite-volume method (FVM), using a Cartesian (uniform) grid with unit cells, this is defined as
\[\begin{align} \sigma=\unicode{x2230}(\nabla\cdot\vec{u})\,\mathrm{d}V = \unicode{x222F}\vec{u}\cdot\hat{n}\,\mathrm{d}S\rightarrow \sigma_{i,j} = (u_{i+1,j} - u_{i,j}) + (v_{i,j+1} - v_{i,j}), \end{align}\]where $i$ and $j$ are the indices of the discretised staggered grid (velocity components are stored at the cell faces and scalars at the cell centres).
In WaterLily, we define loops based on CartesianIndex such that I=(i,j,...), which makes writing an n-dimensional solver very straightforward.
With this, to compute the divergence of a 2D vector field we can use
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a] - u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1, ndims(a)))
N = (10, 10) # domain size
σ = zeros(N) # scalar field
u = rand(N..., length(N)) # 2D vector field with ghost cells
for d ∈ 1:ndims(σ), I ∈ inside(σ)
    σ[I] += ∂(d, I, u)
end
where a loop over each dimension d and each Cartesian index I is used. The function δ provides a unit Cartesian step in direction d; for example, for a 2D index I, δ(1, I) returns CartesianIndex(1, 0), and for a 3D index, δ(2, I) returns CartesianIndex(0, 1, 0). This step is used in the derivative function ∂ to implement the divergence equation described above. inside(σ) provides the CartesianIndices of σ excluding the ghost elements, i.e. a range of Cartesian indices starting at (2, 2) and ending at (9, 9) when size(σ) == (10, 10).
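A quick check of these helpers, with the results shown as comments:
δ(1, CartesianIndex(1, 1))    # CartesianIndex(1, 0)
δ(2, CartesianIndex(1, 1, 1)) # CartesianIndex(0, 1, 0)
inside(zeros(10, 10))         # CartesianIndices((2:9, 2:9))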
Note that the divergence operation is independent for each I, and this is great because it means we can parallelize it! This is where KA comes into play.
To generate the divergence operator using KA, we need to write the divergence kernel, which is just the divergence operator written in KA style. We define the divergence kernel _divergence! and a wrapper divergence! as follows
using KernelAbstractions: get_backend, @index, @kernel, @Const
@kernel function _divergence!(σ, u, @Const(I0))
    I = @index(Global, Cartesian) # work-item index, starting at (1, 1, ...)
    I += I0                       # shift into the interior of σ
    σ_sum = zero(eltype(σ))
    for d ∈ 1:ndims(σ)
        σ_sum += ∂(d, I, u)
    end
    σ[I] = σ_sum
end

function divergence!(σ, u)
    R = inside(σ)
    _divergence!(get_backend(σ), 64)(σ, u, R[1]-oneunit(R[1]), ndrange=size(R))
end
Note that in the _divergence! kernel we again operate on Cartesian indices, obtained by calling @index(Global, Cartesian) from the KA @index macro. The range of Cartesian indices is given by the ndrange argument in the wrapper function, where we pass the size of the inside(σ) index range, and the backend is inferred with the get_backend method. Also note that we pass an additional (constant) argument I0, which provides the offset to apply to the indices given by @index, since these naturally start at (1, 1, ...). Using a workgroup size of 64 (the size of the group of threads acting in parallel, see the KA terminology), KA will parallelize over I by multi-threading on the CPU or on the GPU.
In this regard, we just need to change the array type of σ and u from Array (CPU backend) to CuArray (NVIDIA GPU backend) or ROCArray (AMD GPU backend), and KA will specialise the kernel for the desired backend:
using CUDA: CuArray
N = (10, 10)
σ = zeros(N) |> CuArray
u = rand(N..., length(N)) |> CuArray
divergence!(σ, u)
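For the multi-threaded CPU execution no conversion is needed at all, since plain Arrays select the CPU backend; just note that Julia must be started with more than one thread (e.g. julia --threads=auto) for the work-items to actually run in parallel. A minimal sketch:
N = (10, 10)
σ = zeros(N)              # Array selects the CPU backend
u = rand(N..., length(N))
divergence!(σ, u)         # same wrapper, now multi-threaded on the CPU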
Automatic loop and kernel generation
As a stencil-based CFD solver, WaterLily heavily uses for loops to iterate over the n-dimensional arrays. To automate the generation of such loops, the macro @loop is defined
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
This macro takes an expression such as @loop <expr> over I ∈ R, where R is a CartesianIndices range, and produces the loop for I ∈ R <expr> end.
For example, the serial divergence operator could now be simply defined using
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
which generates the I ∈ inside(σ) loop automatically.
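A sketch of what this expands to, inspected with @macroexpand (line-number annotations omitted, expansion shown as comments):
@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
# quote
#     for I = inside(σ)
#         σ[I] += ∂(1, I, u)
#     end
# end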
Even though this might seem like a small improvement (if any), the nice thing about writing loops this way is that the computationally demanding part of the code can be abstracted out of the main workflow. For example, it is easy to add performance macros such as @inbounds and/or @fastmath to each loop by changing the quote block in the @loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        @inbounds for $I ∈ $R
            @fastmath $ex
        end
    end |> esc
end
And, even nicer, we can also use this approach to automatically generate KA kernels for every loop in the code! To do so, we modify @loop so that it generates the KA kernel using @kernel, together with a wrapper call that sets the backend and the workgroup size
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end

function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex]) # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex
The helper functions grab! and rep extract the arguments required by the expression ex (replacing composite arguments, such as struct fields, by plain symbols); these arguments, together with the Cartesian index range, are then passed to the kernel.
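For example, for an expression involving a composite argument (here mag and flow are just illustrative names), grab! collects the arguments the kernel needs while rep strips a composite down to its field name; the results are shown as comments:
sym = []; ex = :(σ[I] = mag(I, flow.u))
grab!(sym, ex)
sym # Any[:σ, :I, :(flow.u)] -> passed to the kernel call as (σ, flow.u, ...)
ex  # :(σ[I] = mag(I, u))    -> the kernel body now refers to the plain argument u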
The code generated by @loop and @kernel can be explored using @macroexpand. For example, for d=1,
@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
we can observe that the code for both CPU and GPU kernels is produced:
Generated code
quote
    begin
        function var"cpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                $(Expr(:aliasscope))
                begin
                    var"##N#341" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        for var"##I#340" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#340") || continue
                            I = KernelAbstractions.__index_Global_Cartesian(__ctx__, var"##I#340")
                            begin
                                I += I0
                                σ[I] += ∂(1, I, u)
                            end
                        end
                    end
                end
                $(Expr(:popaliasscope))
                return nothing
            end
        end
        function var"gpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                if (KernelAbstractions.__validindex)(__ctx__)
                    begin
                        I = KernelAbstractions.__index_Global_Cartesian(__ctx__)
                        I += I0
                        σ[I] += ∂(1, I, u)
                    end
                end
                return nothing
            end
        end
        begin
            if !($(Expr(:isdefined, Symbol("##kern#339"))))
                begin
                    $(Expr(:meta, :doc))
                    var"##kern#339"(dev) = begin
                        var"##kern#339"(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
                    end
                end
                var"##kern#339"(dev, size) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
                end
                var"##kern#339"(dev, size, range) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
                end
                function var"##kern#339"(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
                    if (KernelAbstractions.isgpu)(dev)
                        return (KernelAbstractions.construct)(dev, sz, range, var"gpu_##kern#339")
                    else
                        return (KernelAbstractions.construct)(dev, sz, range, var"cpu_##kern#339")
                    end
                end
            end
        end
    end
    (var"##kern#339"(get_backend(σ), 64))(σ, u, (inside(σ))[1] - oneunit((inside(σ))[1]), ndrange = size(inside(σ)))
end
The best feature we gain from modifying @loop to produce KA kernels is that the divergence operator remains exactly the same as before introducing KA
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
This exact approach is what has allowed WaterLily to keep the same number of lines of code as before the port, just around 800!
Benchmarking
Now that we have all the items in place, we can benchmark the speed-up achieved with KA compared to the serial execution using BenchmarkTools.jl. Let's gather all the code we have used so far into a small benchmarking MWE (see below or download it here). In this benchmark, we will refer to the serial CPU execution as "serial", the multi-threaded CPU execution as "CPU", and the GPU execution as "GPU":
using KernelAbstractions: get_backend, synchronize, @index, @kernel, @groupsize, @Const
using CUDA: CuArray
using BenchmarkTools
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a]-u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1,ndims(a)))
# serial loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
# KA-adapted loop macro
macro loopKA(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end
function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex]) # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex
function divergence!(σ, u)
    for d ∈ 1:ndims(σ)
        @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
    end
end
function divergenceKA!(σ, u)
    for d ∈ 1:ndims(σ)
        @loopKA σ[I] += ∂(d, I, u) over I ∈ inside(σ)
    end
end
N = (2^8, 2^8, 2^8)
# CPU serial arrays
σ_serial = zeros(N)
u_serial = rand(N..., length(N))
# CPU multi-threading arrays
σ_CPU = zeros(N)
u_CPU = copy(u_serial)
# GPU arrays
σ_GPU = zeros(N) |> CuArray
u_GPU = copy(u_serial) |> CuArray
# Benchmark warmup (force compilation) and validation
divergence!(σ_serial, u_serial)
divergenceKA!(σ_CPU, u_CPU)
divergenceKA!(σ_GPU, u_GPU)
@assert σ_serial ≈ σ_CPU ≈ σ_GPU |> Array
# Create and run benchmarks
suite = BenchmarkGroup()
suite["serial"] = @benchmarkable divergence!($σ_serial, $u_serial)
suite["CPU"] = @benchmarkable begin
divergenceKA!($σ_CPU, $u_CPU)
synchronize(get_backend($σ_CPU))
end
suite["GPU"] = @benchmarkable begin
divergenceKA!($σ_GPU, $u_GPU)
synchronize(get_backend($σ_GPU))
end
results = run(suite, verbose=true)
In this benchmark we have used a 3D array σ (scalar field) instead of the 2D array used before, hence demonstrating the n-dimensional capabilities of the current methodology. For N=(2^8,2^8,2^8), the following benchmark results are achieved on a 6-core laptop equipped with an NVIDIA GeForce GTX 1650 Ti GPU card
"CPU" => Trial(52.651 ms)
"GPU" => Trial(7.589 ms)
"serial" => Trial(234.347 ms)
The GPU execution yields a 30x speed-up compared to the serial execution and a 7x speed-up compared to the multi-threaded CPU execution. The multi-threaded CPU execution yields a 4.5x speed-up compared to the serial execution (ideally 6x on the 6-core machine).
As a final note on this section, observe that synchronize is used when running the KA benchmarks. If it was not used, we would only be measuring the time it takes to launch a kernel, not the time it takes to actually run it.
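For example, the difference in what is measured can be seen directly with BenchmarkTools (a sketch; the actual numbers depend on the hardware):
@btime divergenceKA!($σ_GPU, $u_GPU)  # times the kernel launch only
@btime begin                          # times launch + execution
    divergenceKA!($σ_GPU, $u_GPU)
    synchronize(get_backend($σ_GPU))
end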
Challenges
Porting the whole solver to GPU has been mostly a learning exercise. We had no previous experience in software development for GPUs, and KA smoothed the learning curve, so it is a great way to get started. Of course, not everything works out of the box, and we have faced some challenges while doing the port. Here are some of them.
Offset indices in KA kernels
Offset indices are important for boundary-value problems where arrays may contain both the solution and the boundary conditions of the problem. In stencil-based finite-volume and finite-difference methods, the boundary elements are only read when computing the stencils; they are not directly modified when looping over the solution elements of an array. It is precisely in this scenario that offset indices become important.
The KA @index macro only provides natural indices (starting at 1), and this minor missing feature initially derailed us into using OffsetArrays.jl. Of course, this added complexity to the code, and we even observed degraded performance in some kernels. Some time after this (more than we would like to admit), the idea of manually passing the offset index into the KA kernel took shape and quickly yielded a much cleaner solution. Thankfully, this feature will be natively supported in KA in the future (see KA issue #384).
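As a standalone sketch of this trick (a hypothetical kernel that simply marks the cells it visits), the constant offset I0 shifts the natural (1, 1, ...) indices into the interior of the array:
@kernel function _mark!(σ, @Const(I0))
    I = @index(Global, Cartesian) # starts at (1, 1)
    σ[I+I0] = 1                   # shifted into inside(σ)
end
function mark_inside!(σ)
    R = inside(σ)
    _mark!(get_backend(σ), 64)(σ, R[1]-oneunit(R[1]), ndrange=size(R))
end

σ = zeros(10, 10)
mark_inside!(σ); synchronize(get_backend(σ))
σ[inside(σ)] # all ones, while the ghost cells of σ remain 0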
Inlining functions can be important in GPU kernels
In KA, GPU kernels are of course more sensitive than CPU kernels when it comes to the functions that may be called within them. We have observed this sensitivity both at compile time and at runtime.
For example, the δ function was originally implemented with multiple dispatch as
@inline δ(i,N::Int) = CartesianIndex(ntuple(j -> j==i ? 1 : 0, N))
δ(d,I::CartesianIndex{N}) where {N} = δ(d, N)
The main problem here is that this implementation is type-unstable and, without @inline, the GPU kernel compilation was complaining about a dynamic function call (see KA issue #392).
Another inline-related problem can be observed with the derivative function ∂. When the @inline macro is removed from its definition, the GPU performance decays significantly and the GPU benchmark gets even with the CPU one. This shows that the compiler can only apply its performance tricks when the required instructions are visible inside the kernel rather than nested in functions external to it.
Popular functions may not work within kernels
Often we use functions such as norm from LinearAlgebra.jl to compute the norm of an array. A surprise is that some of these do not work inside a kernel, since the GPU compiler may not be equipped to handle them, and hence they need to be manually rewritten in a suitable form. In this case, we define our own norm2(x) = √sum(abs2,x).
Another example is the sum function using generator syntax such as
@kernel function _divergence!(σ, u)
    I = @index(Global, Cartesian)
    σ[I] = sum(u[I+δ(d,I),d]-u[I,d] for d ∈ 1:ndims(σ))
end
which errors during compilation for a GPU kernel. Here, a solution can be to use a different form of sum
@kernel function _divergence!(σ, u)
    I = @index(Global, Cartesian)
    σ[I] = sum(j -> u[I+δ(j,I),j]-u[I,j], 1:ndims(σ), init=zero(eltype(σ)))
end
even though we have observed reduced performance in the latter version (more information in Discourse post #96658). There are efforts in KA directed towards providing a reduction interface for kernels (see KA issue #234).
Limitations of the automatic kernel generation for loops
While the @loop macro that generates KA kernels is fairly general, it also has some limitations. For example, you may have noticed that we have not nested the loop over the dimensions, d ∈ 1:ndims(σ), inside the kernel. The reason is that even though turning
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
into
@loop σ[I] = sum(d->∂(d, I, u), 1:ndims(σ)) over I ∈ inside(σ)
would reduce the number of kernel evaluations, the limitation of the sum function mentioned above makes this approach less performant than writing a kernel for each dimension. Also related to this issue, fusing more than one expression into a single kernel would reduce the overall number of kernel evaluations, but gluing expressions together is not straightforward with the current implementation of @loop.
Beware of race conditions!
When moving from serial to parallel computations, race conditions are a recurring issue. For WaterLily, this issue popped up in the linear solver used for the pressure Poisson equation. Prior to the port, WaterLily relied on the Successive Over-Relaxation (SOR) method (a Gauss-Seidel-type solver), which uses (ordered) backsubstitution and is hence not suitable for parallel execution. The solution here was simply to switch to a better-suited solver such as the Conjugate-Gradient method.
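As an illustration of the kind of update that races (not WaterLily's actual solver code), consider a Gauss-Seidel-style in-place sweep written with the KA version of @loop: each cell reads neighbours that other work-items may be writing at the same time, so the result depends on the execution order.
# In-place smoothing sweep over a 2D pressure field p: parallel work-items read
# and write overlapping entries of p, so this races on CPU threads and on the GPU.
@loop p[I] = 0.25 * (p[I+δ(1,I)] + p[I-δ(1,I)] + p[I+δ(2,I)] + p[I-δ(2,I)]) over I ∈ inside(p)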
Acknowledgements
Special thanks to Valentin Churavy for creating KernelAbstractions.jl and for revising this article. And, of course, to Gabriel D. Weymouth for creating WaterLily.jl and for helping revise this article too! :)