In the modern computing landscape, it has become common knowledge that many compute-heavy operations should be run on GPUs. This of course extends to attempts to do numerical modelling of physical systems, one of the great industry applications of a mathematics education. And in the course of that industry application, it will often become necessary to deploy these models, and the computational tools to simulate them, to consumer devices or, heaven forbid, non-x86 devices.
To that end, a friend provided me with an example problem to build an mac app in swift that could solve the heat equation on apple silicon. The heat equation is an especially simple differential equation, but if the math is the easy part here then we are free to focus on any software and hardware difficulties. Alas I do not have a macbook at the moment. In the spirit of the problem however, I figured the closest thing I could do to implementing it in swift is implementing it in rust, a language I am told is quite similar. And if I could not make a solver that ran on apple silicon, I would instead build one using WebGPU that would run on everything*.
$*$ : see mozilla MDN WebGPU API browser compatibility table. Most desktop browsers support WebGPU on windows, and on other operating systems can be coerced to with the proper browser launch arguments; on mobile, chrome for android should work just fine as will safari for ios. You can check your browser's compatibility here.
Here we will discuss what I learned about WebGPU and Wasm (Web-Assembly) in this project, and some details of rust. I have used rust before, but I am relatively new to web development; in a very real sense, this project marks my first time using javascript (although I have committed to taking all available measures to never install npm or node.js). Moreover my relationship to rust has historically been as a sort of "discount haskell" that suffers the compromises of a language designed closer to the hardware, although of course, since we aim to do some GPU programming, that will be exactly what we need. And of course since I can't help myself, this article will secondarily perform as a tutorial on the heat equation as well as how to implement something like what I ended up making.
I must first confess that, as I said earlier, the problem was specified for me. This includes the particular numerical scheme that we'll discuss how to implement soon. However, I am of course no mere codemonkey and happen to have a penchant for pedagogy, so we will begin with a discussion of the heat equation which will clarify many details of our particular implementation later on.
The derivation of the heat equation is one of the more obvious examples of mathematics acting as analytic natural philosophy in a trench coat. One asks 'given a distribution of thermal energy in a system, how is it to evolve over time?' and immediately has two threads to pull on: that since we are concerned with the evolution, we are expressing our insight in terms of the time derivative (i.e. the instaneous change-per-unit-time); and that what we fundamentally want to encode is that heat spreads out. So what we are trying to find is a function $T(x,t)$ of position $x$ and time $t$ that is specified by an initial distribution $T(x,0)$, and we can try assuming that whatever binding property will define this relationship for us will relate the change in time to something. $$\begin{gather*} \frac{\partial T}{\partial t} = ? \end{gather*}$$
Consider a one dimensional thermal distribution, and imagine that this line-of-material is divided up into very small segments, with one such a segment between positions $x$ and $x + \Delta x$, with segment length $\Delta x$ (although the idea is to make $\Delta x$ very small). With its two points of contact, one to the material segment in the negative direction (less than $x$) and one in the positive direction (greater than $x+\Delta x$), we have two other things which can change the temperature of our segment by conducting heat with it.
We'll now identify each segment by the point it starts at, so let us imagine that the segment we are describing is the one whose temperature at time $t$ is $T(x,t)$ while the next segment is $T(x + \Delta x, t)$ and the previous one is $T(x - \Delta x, t)$. As above, we expect the change of the temperature per unit time in our segment to be proportional to the difference between its temperature and the temperature of the two segments besides it. That means we want our relationship to look something like this: $$\begin{gather*} \frac{\partial T}{\partial t}(x,t) \propto \big[ T(x + \Delta x, t) - T(x, t) \big] + \big[ T(x - \Delta x, t) - T(x, t) \big]. \end{gather*}$$
I've used the symbol $\propto$ here, meaning that instead of speaking of an equivalence, we speak of a proportional relation. Whatever relationship we're discovering here, we know only that these quantities are related to one another (and we'll assume that relationship is linear) but we certainly won't say that they are literally equal yet.
Marked in square brackets are the contributions to the change in temperature in our segment by the two segments adjacent to it. We see in the first term $T(x + \Delta x, t) - T(x, t)$: this is because if the segment to the right is hotter than our segment then this quantity will be positive corresponding to the idea that it should conduct heat into our segment, and thus increase our segment's temperature. Similar reasoning applies to the other segment: when the segment to the left of ours is hotter, it conducts heat into ours, increasing the temperature over time, and so we have $T(x - \Delta x, t) - T(x, t)$ positive. The logic reverses; if $T(x,t)$ is hotter than either of its adjacent segments, we expect it to conduct heat out of itself and thus cool. These two bracketed quantities are thus in opposition. If the difference between the temperature of our segment and the one on its right is the negative of the one on its left, our change over time will be zero and heat will simply flow through the segment without its temperature changing.
Since we spoke of proportionality rather than equivalence, we are at liberty to multiply or divide factors in the above relation. The true coefficient will be discussed shortly, but in the mean time it will do us well to divide our right-hand side by $\Delta x$. Doing so reveals that we have something of the form of a partial derivative when we take our $\Delta x$ toward zero. $$\begin{align*} & \lim_{\Delta x \to 0} \frac{\big[ T(x + \Delta x, t) - T(x, t) \big]}{\Delta x} + \frac{\big[ T(x - \Delta x, t) - T(x, t) \big]}{\Delta x} \\[0em] =& \lim_{\Delta x \to 0} \frac{\big[ T(x + \Delta x, t) - T(x, t) \big]}{\Delta x} - \frac{\big[ T(x, t) - T(x - \Delta x, t) \big]}{\Delta x} \\[0em] \approx& \frac{ \partial T }{\partial x} (x,t) - \frac{\partial T}{\partial x} (x - \Delta x, t) \end{align*}$$
We'll want to avoid absolutely bringing $\Delta x$ to zero as doing so would tell us the above quantity is also zero; we are certainly doing what I'd call "physicist-brained" math but in the spirit of physics, as long as our metaphor remains coherent under our manipulations, the math will be relatively forgiving (and we are in luck since the formula we will derive shortly is two hundred years old).
What this tells us is that the change of temperature of our segment can now be considered proportional to the heat flux, or rather the negative-direction moving heat, to it from the next segment minus the heat flux to the previous segment. The reason for the derivative in space to be the negative-direction moving heat is obvious when you think about what the graph of temperature looks like.
We can divide this by $\Delta x$ again to take our two derivatives, separated in their $x$ argument by a $\Delta x$, and turn them into a single second derivative. $$\begin{gather*} \lim_{\Delta x \to 0} \frac{1}{\Delta x} \left( \frac{ \partial T }{\partial x} (x,t) - \frac{\partial T}{\partial x} (x - \Delta x, t) \right) = \frac{\partial^2 T}{\partial x^2} (x , t) \\[0em] \frac{\partial T}{ \partial t} \propto \frac{\partial^2 T}{\partial x^2} \end{gather*}$$
The last thing we need to decide is what the proportionality constant is so we can turn $\propto$ into $=$. However this is a choice depending on the given material. For instance consider that the dynamics we describe above equally apply to a block of steel just as well as it does to a block of compacted feathers. But we know that steel conducts heat very well, and so we expect temperature differentials within a block of steel to diffuse quickly, whereas a block of compacted feathers (assuming they are sufficiently compacted so that heat conducts through it uniformly) might insulate somewhat and thus diffuse more slowly. So this is a choice we make about our material of study: we encode this choice in the parameter $\kappa$ (greek 'kappa'), representing the thermal conductivity of our material of study, and form the heat equation.
$$\begin{gather*} \frac{\partial T}{ \partial t} = \kappa \frac{\partial^2 T}{\partial x^2} \end{gather*}$$
It must be emphasized then that we will be able to proceed through our reasoning under the assumption that if we know the state of a system's temperature (its temperature at each $x$) then we know its time derivative. Its time derivative is defined autonomously by the state at a present time.
Something amusing which we may want to note before we implement a method to solve this differential equation from initial conditions is what that equation intuitively means. We happen to have a Green's function solution to the one dimensional heat equation, meaning here a solution that applies when the initial condition is infinite heat concentration at $x=0$ corresponding to some finite total heat. Such a solution can be used as a convolution on an initial condition $g(x)$ to give a solution $$\begin{gather*} T(x,t) = \frac{1}{\sqrt{4 \pi \kappa t}} \int_{-\infty}^{\infty} \exp\left( - \frac{(x - \tau)^2}{4 \kappa t}\right) g(\tau) d\tau \end{gather*}$$ where it becomes obvious that this convolution is with a gaussian. The accusation is almost appropriate then that what we proceed to implement here is but a fancy gaussian blur. Almost. Indeed what we implement here is also a fancy gaussian blur, however it is a gaussian blur that works precisely so that if we set the parameters accordingly to a real life thermal-conducting material and a real life initial condition, it will roughly produce the correct thermal distribution at the appropriate time.
Of course at the point that we encode the heat equation in a computer for simulation, we must begin undoing the limits in our derivatives, inserting $\Delta x$ back into things and figuring out how we want to make this work in two dimensions.
In our model, we'll discretize space as a grid of rectangles; note here that this is already an assumption, since we are saying that heat never moves directly diagonally. This assumption will however enable us to reason certain things out, for instance, in moving to two dimensions we don't have that many more considerations as in one dimension since we know heat can only enter or exit a rectangle through one of four sides, two more opposing directions to previous two with similar heat-flow effects at play. So prior to anything particular to our discritization, we may simply add a second heat-flow term for our second dimension and say that $$\begin{gather*} \frac{\partial T}{ \partial t} = \kappa \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} \right), \end{gather*}$$ a statement which happens to remain true regardless of the particular discretization we're using here (see laplace operator).
Next we need to ask how we want to calculate our derivatives. Obviously this will involve some form of treating, for instance $\partial T/\partial t$ as $\big(T(x,t + \Delta t) - T(x, t) \big)/\Delta t$, but the question is if we use that expression or instead go for $\big(T(x,t) - T(x - \Delta t, t) \big)/\Delta t$. Both will have the two temperatures separated by a $\Delta t$ but whether to define this relationship forward or backwards has different effects. In this instance, we will use forward time, meaning the former.
We have to make this choice twice for our spatial derivatives. For instance, we can consider $\partial^2 T/\partial x^2$ three different ways. $$\begin{gather*} \frac{\big(T(x, t) - T(x - \Delta x, t) \big) - \big( T(x - \Delta x, t) - T(x - 2 \Delta x, t) \big)}{\Delta x^2} \\[0em] \\[0em] \frac{\big(T(x + \Delta x, t) - T(x, t) \big) - \big( T(x, t) - T(x - \Delta x, t) \big)}{\Delta x^2} \\[0em] \\[0em] \frac{\big(T(x + 2 \Delta x, t) - T(x + \Delta x, t) \big) - \big( T(x + \Delta x, t) - T(x, t) \big)}{\Delta x^2} \end{gather*}$$
Each of these descriptions of the second derivative have consequences, that is, since we can only calculate the second derivative using two other points, do we use the two points behind, the two points beside the central one, or the two points ahead? It is typical to use the two points beside a central one, and so we use the one in the middle and call our scheme forward time centered space (or FTCS), simplifying it to the following.
$$\begin{gather*} \frac{T(x, t + \Delta t) - T(x,t)}{\Delta t} \approx \frac{T(x + \Delta x, t) - 2 T(x, t) + T(x - \Delta x, t) }{\Delta x^2} \end{gather*}$$
The vague idea then would be to rearrange this approximate equation so that we have $T(x,t + \Delta t)$ on one side and all temperatures at time $t$ on the other, meaning that we would program some state (i.e. the temperature at all positions we care about) which we know at time $t$ and use this to predict time $t+\Delta t$, moving forward in time. This is what we would call the forward Euler method, and roughly corresponds to the idea that we can take a formula for the derivative $f'$ of a function $f$ and relate it approximately to the formula that is used in its limit. $$\begin{align*} &\lim_{\Delta t \to 0 }\frac{f(t + \Delta t) - f(t)}{\Delta t} = f'(t) \\[0em] \implies & f(t+\Delta t) \approx f(t) + f'(t)\Delta t \end{align*}$$
We can also state the backwards Euler method which is similar, but notice here that we check the derivative not at time $t$ but instead at time $t + \Delta t$; naively, this would allow us to run the simulation backwards. $$\begin{gather*} f(t+\Delta t) - f'(t + \Delta t)\Delta t \approx f(t) \end{gather*}$$
Of course it is unlikely that we'd want to run the simulation backwards. The true value of the backwards Euler method to us is that if we are calculating across position and time variables and our derivative function is a linear function of $f$ (or to us, $T$) at different positions, then the idea that we are computing $T(x,t)$ as a function of $T(x,t + \Delta t)$ is not merely a computation backwards in time, it is computed linearly. This means we can express it as a problem of the form $\mathbf{Ax} = \mathbf{b}$ in linear algebra, and use the widely explored class of linear solver algorithms to find $\mathbf{x}$ (which in this case is our $T(x,t + \Delta t)$ at some fixed $t$, thought of as a vector of different positions) that solves $\mathbf{Ax} = \mathbf{b}$ (where $\mathbf{b}$ is $T(x,t)$ which we assume to be known for time $t$). Thus our method which appears to move backwards in time has implicitly specified a method to move forward in time, and so we call the backwards Euler method an Implicit method.
By contrast our forward Euler method is a Explicit method. Explicit methods are of course easier to implement (since one need not implement a linear solver, nor convert their numerical method into a linear problem and back again after solving) but our explicit forward Euler method has a problem: our differential equations were defined under the assumption that $\Delta t$ and $\Delta x$ were basically infinitely small, and when they are not small enough, the forward euler method tends towards an error where it undershoots or overshoots the true function it should approximate.
And this should be apparent to us merely from the equation for the forward Euler method since the point we compute the derivative for (remembering that this is how much the function changes) is taken at time $t$ and then only checked again after $\Delta t$ time has passed. If our derivative is increasing or decreasing in this period then our model can't see that change until after the period has passed. Fortunately, for exactly the reason we describe, the backwards euler method has almost exactly the opposite problem, and so we can average the two methods to get something much better, called the Crank-Nicolson method. For our one dimensional heat equation, that looks something like this:
$$\begin{gather*} \frac{T(x, t + \Delta t) - T(x,t)}{\Delta t} \approx \frac{1}{2} \left( \frac{\partial^2 T}{\partial x^2}(x, t + \Delta t) + \frac{\partial^2 T}{\partial x^2}(x, t)\right). \end{gather*}$$
For brevity we have used the partial derivatives in space as the full expressions of our centered-space second derivatives is quite long, however we can expand the right hand side of the above as broken across two lines:
$$\begin{align*} \approx \frac{1}{2} \Bigg( \frac{T(x + \Delta x, t+ \Delta t) - 2 T(x, t + \Delta t) + T(x - \Delta x, t + \Delta t) }{\Delta x^2} \\[0em] + \frac{T(x + \Delta x, t) - 2 T(x, t) + T(x - \Delta x, t) }{\Delta x^2} \Bigg) \end{align*}$$
As you can see, this is not an explicit method, but it is linear in space so it constitutes an implicit method.
There are other ways to remedy this problem of the forward euler method that will still give us an explicit method however. What the Crank-Nicolson method did for us was allowed us to iterate forward in time using a time derivative which was defined as the average of our $f'(t)$ and $f'(t + \Delta t)$ (which are $\partial_t T(x,t)$ and $\partial_t T(x,t + \Delta t)$ in our case). We might, instead of taking an average, attempt to compute $f'(t + \Delta t/2)$, taking the derivative in the middle of our two timesteps instead. This is fine if we know $f'$ explicitly, but adds a minor complication in our case of the heat equation since we do not know $T(x, t + \Delta t/2)$; our time derivative is dependent on our spatial configuration.
To solve this, for each $\Delta t$ step of our iteration, we need to take a forward Euler half-step to compute a rough approximation of the state at time $t + \Delta t/2$ and then go back and use that as the derivative in a not-quite forward Euler method. This is what we call Runge-Kutta-2 (RK2), Runge-Kutta refering to the class of methods like this one and $2$ refering to how we compute our iteration in two steps (although it is common to do it in four steps as RK4). Our method then looks something like this in one dimension: $$\begin{align*} \frac{\partial^2 T}{\partial x^2}(x,t) &= \frac{T(x + \Delta x, t) - 2 T(x, t) + T(x - \Delta x, t) }{\Delta x^2} \\[0em] T(x, t + \Delta t/2) &= T(x,t) + \frac{\Delta t}{2} \cdot \kappa \frac{\partial^2 T}{\partial x^2}(x,t) \\[0em] T(x, t + \Delta t) &= T(x,t) + \Delta t \cdot \kappa \frac{\partial^2 T}{ \partial x^2} (x, t + \Delta t/2) \end{align*}$$
Of course we may merely add the corresponding term for our second dimension, and we get the method we will actually implement.
$$\begin{align*} \frac{\partial^2 T}{\partial x^2}(x,t) &= \frac{T(x + \Delta x,y, t) - 2 T(x,y,t) + T(x - \Delta x,y, t) }{\Delta x^2} \\[0em] \nabla^2 T(x,y,t) &= \frac{\partial^2 T}{\partial x^2}(x,y,t) + \frac{\partial^2 T}{\partial y^2}(x,y,t) \\[0em] T(x,y, t + \Delta t/2) &= T(x,y,t) + \frac{\Delta t}{2} \cdot \kappa \nabla^2 T(x,y,t) \\[0em] T(x,y, t + \Delta t) &= T(x,y,t) + \Delta t \cdot \kappa \nabla^2 T(x,y,t + \Delta t /2) \end{align*}$$
This is the titular explicit FTCS RK2 method. We should actually check if any of the methods described here work as we would like them to, and indeed we have methods for discussing this which we describe soon. But now that we have our method, we'll implement it before our understanding goes stale.
To do that, we'll need one more thing: a boundary condition. Our centered space scheme allows us to calculate how our state should evolve over time, but to do so it requires knowledge of adjacent points; if our simulation has an edge (which it must) then that edge doesn't have adjacent points to draw on which tell us how the edge should evolve. We have a few options here. We can set the edge as a constant, meaning that our simulation assumes our material is in some thermal bath that keeps its exterior at a constant temperature. We could make the edges wrap around on each other, drawing on the opposite side of our rectangle to inform how the edge evolves.
What we'll do instead though is to assume that the boundary is insulating. This means that at the edge, we expect the temperature gradient to always be zero so our $T(x,t)$ $x$-$T$ graph looks like a flat line at its edge. Recalling our earlier discussion of the heat equation, this constant gradient at the edge means that there is no heat-flow out of the edge, and thus it is insulating.
Setting our simulation zone to be the square patch of space $[0,1] \times [0,1]$, we'll implement this constant gradient by setting the boundary to be equal to its second-inner-most line. That means $T(0,0,t)$ is set to $T(\Delta x, \Delta y, t)$ and $T(1,0.5,t)$ is set to $T(1 - \Delta x, 0.5, t)$ for example. Setting the temperature in this way should violate conservation of energy more than the periodic condition mentioned above (and certainly less than the constant edge temperature condition), but as the edge heats up to its interior, the total thermal energy in the system should stabilize, giving us a good sense of whether the simulation is working correctly.
This was also my first time with many of these rust crates, so we'll focus on some important ones we'll need to understand soon.
(wgpu) The WGPU crate exposes a GPU API corresponding to the WebGPU spec. My understanding is that this API is, while simpler, not particularly in conflict with many other APIs, and so WGPU acts primarily as a way to run Vulkan, Metal, DirectX or WebGPU GPU code from rust.
(tokio) Since the GPU is separate from the CPU, our rust code will have to send jobs to the GPU and often wait for their results to return. This means that many GPU operations run asynchronously, and so we'll need something to deal with that. Rust provides language utilities to deal with asynchronous programming, but not the runtime itself, and so tokio will provide our asynchronous runtime as well as many helpful tools.
(wasm-bindgen) The plan is to get this running in a browser, so we'll need a way to compile our rust code to something that runs on the web as well as a way to run our code once it's there. The crate wasm-bindgen provides macros that instructs our web-assembly compiler how to make our rust functions available in javascript to be run in browser, acting as our FFI.
(web-sys) Of course the above crates are built with the intent that they can run on the web (with the partial exception of tokio, although certainly the features we require run fine in wasm) which means when we use them on the web, we'll need to have rust types corresponding to javascript objects. The crate web-sys provides these types along with methods corresponding to many typical javascript functions.
We'll also make frequent use of bytemuck for casting our data as unsigned 8-bit integer arrays when moving them back and forth from the GPU, and a combination of the crates log and console_log will enable us to display data on our browser's console for debugging.
The calculation itself can be done without a browser so we'll start by just setting up our simulation absent any graphical considerations. Our way of thinking about stored data corresponding to our temperature state will be as follows.
pub struct RectGrid {
array: Vec<f32>,
imax: usize,
jmax: usize
}
impl RectGrid {
pub fn new(width: usize, height: usize) -> Self
{
Self {
array: [0.].repeat(width * height),
imax: width,
jmax: height
}
}
pub fn arrayasslice<S>(&self) -> &[S]
where
S: bytemuck::AnyBitPattern
{
bytemuck::cast_slice(&(self.array))
}
fn setelement(&mut self, i: usize, j: usize, newval: f32) {
self.array[i + j * self.imax] = newval
}
pub fn setbyfunc(&mut self, f: impl Fn(f32,f32) -> f32) {
for n in 0..(self.imax*self.jmax) {
let i: usize = n % self.imax;
let j: usize = n / self.imax;
// SINCE (1,1) IS INCLUDED WE MUST SUBTRACT 1 FROM IJMAX
// OR ELSE X=1 AND Y=1 WILL NOT OCCUR
let x: f32 = (i as f32) / ((self.imax -1) as f32);
let y: f32 = (j as f32) / ((self.jmax -1) as f32);
self.setelement(i,j, f(x,y))
}
}
}
Here I'm omitting the various getter functions as well as other auxillary functions, but this is roughly our model. We should know what the width $W$ and height $H$ of our temperature grid is in rectangular 'pixels' and we will have an array of length width * height. A given $n \in \{0,...,(W \cdot H - 1)\}$ (as typical of array indexing to include zero and exclude the end) will correspond to $(i,j)$ integer coordinates on the grid with $i = n\mod W$ and $j = \lfloor n/W \rfloor$, that is, when we divide $n$ by the width of our grid, the remainder will be the $i$-coordinate and the result of this division rounded down will be the $j$-coordinate. In other words, our array will be one dimensional holding each 'row' of data one after another, so to divide by the row length (with rounding) is to ask which row we're on, and the remainder of that division is where in the row we are. Of course this means that $i \in \{0,...,W - 1\}$ and $j \in \{0,...,H - 1\}$ where $n=W$ would mean we are at $i=0$ and $j=1$ for example. This also means that if we want to include the $(x,y)$ coordinate $(1,1)$ (and indeed the other $(1,y)$ and $(x,1)$ positions) we'll need to subtract one from the width and height before we divide $i$ and $j$ by them to obtain $(x,y)$ positions, as seen in the comment.
In the above code we have the functions arrayasslice and $\texttt{setbyfunc}$ mentioned. Of course rust's vector types have their own function for returning their contents as an array slice but this is generally as an array slice corresponding to the type they contain; we'll need to deal with arrays that are cast to &[u8] for GPU purposes. The function setbyfunc will enable us to turn a function into initial conditions for the simulation. In the absence of a user specified initial condition, we'll use a hard-cutoff function which sets the grid equal to some constant temperature inside of a circle and zero outside of that.
pub fn makemiddleRatTinitconds(r: f32, t:f32 )
-> (impl Fn(f32,f32) -> f32) {
let rsquared = sqnum(r);
move |x:f32,y:f32| -> f32 {
match (sqnum(x-0.5) + sqnum(y-0.5)) < rsquared {
true => t,
false => 0.
}
}
}
Of course, since most of what we do is on the GPU, we won't use the above code very much. It will remain as a working model for us when we move to the GPU though. For instance, we will continue to one dimensional arrays to store our data on the GPU, and while we won't often explicitly calculate $(i,j)$ coordinates, we think in terms of that grid. More often than not, we'll simply use an index $i + j\cdot W$ corresponding to the point we care about, or how we want to move through the grid.
Our manipulation of the GPU starts with the wgpu type Instance. This exists as our entrypoint to asking the GPU what it can do for us, and we'll use it to get an Adapter, which is our connection to a physical device. In order to get that Adapter, we'll first need to specify how we want to interact with the GPU; I mentioned earlier that rust's wgpu uses a WebGPU style API but can also run using Vulkan, Metal, DirectX and WebGPU. We specify which of those modes we're interested in when we create our instance, and this instance will then select an adapter corresponding to needs we've specified.
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor {
#[cfg(not(target_arch = "wasm32"))]
backends: wgpu::Backends::PRIMARY,
#[cfg(target_arch = "wasm32")]
backends: wgpu::Backends::BROWSER_WEBGPU,
..Default::default()
});
let adapter = instance
.request_adapter(&wgpu::RequestAdapterOptions
{
power_preference: wgpu::PowerPreference::default(),
compatible_surface: None,
force_fallback_adapter: false,
})
.await?;
The macros #[cfg(target_arch = "wasm32")] let us enable or disable code based on our compile target. We'll want to use the BROWSER_WEBGPU backend if we're compiling for web and PRIMARY otherwise, which corresponds to telling the instance we're okay with any of Vulkan, Metal, or DirectX12 or WebGPU. There are also exists a backend corresponding to openGL, however that backend won't let us use compute shaders.
The request for an adapter then specifies more things we care about. How high power do we want to run the GPU? What should we do if we can't find an adapter we want? Most importantly to us (although later) is what kind of graphical surface does out adapter need to be compatible with? Since we'll be outputting to a Html canvas later, we'll need some way to specify that. But for now we'll set that to None.
Once we have our adapter, we can get Device and Queue objects, which we'll use to allocate buffers and send instructions to the GPU.
let (device, queue) = adapter
.request_device(&wgpu::DeviceDescriptor {
label: None,
required_features: wgpu::Features::empty(),
experimental_features: wgpu::ExperimentalFeatures::disabled(),
required_limits: if cfg!(target_arch = "wasm32") {
adapter.limits()
} else {
wgpu::Limits::defaults()
},
memory_hints: Default::default(),
trace: wgpu::Trace::Off,
})
.await?;
fn error_capture(error: wgpu::Error) {
println!(error.to_string());
}
device.on_uncaptured_error(Arc::new(error_capture));
As seen here, you may want to use the on_uncaptured_error method so you aren't left in the dark if you get a strange error. We'll discuss how to set up more general logging later.
We can now use this device to create buffers, shader modules, and pipelines. These buffers are our way of storing data on the GPU; there are also textures, which behave mostly like buffers but with some differences in the API. Shaders and pipelines together act as the GPU's functions which we'll call from the CPU later when we add them to the queue. It'll be important to remember as we continue that many of these terms got their names as tools in graphics programming, and so even though we are doing computation, we still speak in terms of compute shaders.
fn helper_basic_compute_shader(
device: &wgpu::Device,
label: Option<&str>,
shader_module: &wgpu::ShaderModule
) -> wgpu::ComputePipeline {
device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
label: label,
layout: None,
module: shader_module,
entry_point: None,
compilation_options: Default::default(),
cache: Default::default(),
})
}
let laplacian_shader = device.create_shader_module(
wgpu::include_wgsl!("laplacian.wgsl"));
let iterate_shader = device.create_shader_module(
wgpu::include_wgsl!("iterate_heat.wgsl"));
let buffer_move_shader = device.create_shader_module(
wgpu::include_wgsl!("buffer_move.wgsl"));
let fix_boundary_conditions_shdr = device.create_shader_module(
wgpu::include_wgsl!("boundary_cond.wgsl"));
let laplacian_pipeline = helper_basic_compute_shader(
device,
Some("Laplacian Pipeline"),
&laplacian_shader
);
let iterate_pipeline = helper_basic_compute_shader(
device,
Some("Iteration Pipeline"),
&iterate_shader
);
let buffer_move_pipeline = helper_basic_compute_shader(
device,
Some("Relocation Pipeline"),
&buffer_move_shader
);
let fix_boundary_conditions_ppln = helper_basic_compute_shader(
device,
Some("Boundary Conds Pipeline"),
&fix_boundary_conditions_shdr
);
Each shader module corresponds to a WGSL file we write, and the pipelines (for our purposes) correspond to individual functions within those wgsl files that we call. We'll get to how we write those files later (I promise it's relatively simple) but for now let's break down the relevant arguments in our helper function's ComputePipelineDescriptor. Many of these arguments are options since it's not strictly necessary to set them.
label: This doesn't matter too much, but for debugging purposes you'll want to set a label so when you make a mistake, you'll know where it happened.
layout: The pipeline layout argument is set to match the bind group the pipeline will use, but we'll have to discuss bind groups after we discuss buffers. Setting this argument simply allows us to set what the pipeline layout should look like so we'll throw an error if we violate our own expectations later. With it set to None, wgpu looks at the file for us to figure out a suitable default.
module: This one just corresponds to the wgsl shader module we plan to extract the pipeline from.
entry_point: Our wgsl file can have multiple functions in it, some of which might be for use in other functions, but multiple of which might be the function we want to call from rust. For this project I've kept every pipeline in its own file, hence the need to set up multiple shader modules, and so we don't need to specify this. We could have instead put all of our pipelines in the one shader module, in which case we would need to specify this entry point.
These wgsl files will need to be compiled into gpu instructions on the individual client device due to the vast differences between device GPUs. Although WASM magic seems to package them away somewhere into the resulting web-assembly during compilation.
Recalling the the method we described earlier, we'll need to encode the following equations in our code:$$\begin{align*} \frac{\partial^2 T}{\partial x^2}(x,t) &= \frac{T(x + \Delta x,y, t) - 2 T(x,y,t) + T(x - \Delta x,y, t) }{\Delta x^2} \\[0em] \nabla^2 T(x,y,t) &= \frac{\partial^2 T}{\partial x^2}(x,y,t) + \frac{\partial^2 T}{\partial y^2}(x,y,t) \\[0em] T(x,y, t + \Delta t/2) &= T(x,y,t) + \frac{\Delta t}{2} \cdot \kappa \nabla^2 T(x,y,t) \\[0em] T(x,y, t + \Delta t) &= T(x,y,t) + \Delta t \cdot \kappa \nabla^2 T(x,y,t + \Delta t /2) \end{align*}$$ and the insulating boundary condition we discussed. We'll get to the actual wgsl code later, but for now what you should know is that of the four pipelines above, the 'Laplacian Pipeline' corresponds to the first two equations above, i.e. how do we calculate $\nabla^2 T(x,y,t)$ for all $(x,y)$ if we know $T(x,y,t)$. The iteration pipeline will do the latter two equations, since you might notice they have a very similar form; to reuse the iteration pipeline in these two instances, we need to replace $\Delta t$ with $\Delta t/2$ and $\nabla^2T(x,y,t)$ with our midpoint $\nabla^2 T(x,y,t + \Delta t/2)$ in one of our iteration pipeline uses. We of course have a pipeline for our boundary conditions, and due to the way that shader calls work, I've elected to have a shader that copies our output buffer corresponding to $T(x,y,t + \Delta t)$ into our input buffer $T(x,y,t)$. There are of course a myriad of ways to avoid having to do this with a shader pipeline, but it will serve later as a very simple example of how our shaders work.
fn helper_compute_interim_data_buffer(
device: &wgpu::Device,
label: Option<&str>,
size: u64,
) -> wgpu::Buffer {
device.create_buffer(&wgpu::BufferDescriptor {
label: label,
size: size,
usage: wgpu::BufferUsages::COPY_DST | wgpu::BufferUsages::STORAGE,
mapped_at_creation: false,
})
}
let data_buffer = device.create_buffer_init(&BufferInitDescriptor {
label: Some("data"),
contents: bytemuck::cast_slice(&initial_data),
usage: wgpu::BufferUsages::COPY_DST
| wgpu::BufferUsages::COPY_SRC
| wgpu::BufferUsages::STORAGE,
});
let export_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("export"),
size: data_buffer.size(),
usage: wgpu::BufferUsages::COPY_DST | wgpu::BufferUsages::MAP_READ,
mapped_at_creation: false,
});
let output_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("output"),
size: data_buffer.size(),
usage: wgpu::BufferUsages::COPY_SRC | wgpu::BufferUsages::STORAGE,
mapped_at_creation: false
});
let laplacian_buffer = helper_compute_interim_data_buffer(
device, Some("laplacian buffer"), data_buffer.size()
);
let midpoint_buffer = helper_compute_interim_data_buffer(
device, Some("midpoint buffer"), data_buffer.size()
);
let midpoint_laplacian_buffer = helper_compute_interim_data_buffer(
device, Some("midpoint laplacian buffer"), data_buffer.size()
);
To define a buffer, we have two tools depending on whether we want to initialize the buffer with some specific data in it. You can see these in the difference between data_buffer and export_buffer, which are created with create_buffer_init vs create_buffer, taking different kinds of struct arguments. You'll see in data_buffer, we initialize with an argument called contents, to which we have provided an initial_data which is simply our RectGrid.array from earlier, a Vec<f32> which we convert to a &[u8] using bytemuck. When we don't specify what data we want to store on the GPU, we'll need to instead say how many bytes we want to allocate, but we can also do this by just copying the size of a previous buffer.
In our use of create_buffer with the BufferDescriptor struct, we have an argument mapped_at_creation, which corresponds to whether the buffer is 'mapped', meaning available both to the GPU and to the CPU such that we may write directly to it outside of queuing a write operation. Obviously sharing memory in this way is much slower, but this argument allows us create buffers that may be written to one time when they are made no matter what other permissions or status they have as a way to get data in.
Those permissions are set by the usage argument, and indicates what properties the buffer needs to have in the GPU. For instance, COPY_SRC and COPY_DST indicate that we might later want to queue a copy between buffers treating the buffer in question as a data source or a data destination, a read or write permission. The STORAGE flags correspond to making the buffer readable and writable to the shaders, i.e. allowing our wgsl code to modify the buffer; in the absence of a need to write to these buffers in our WGSL code, we can use UNIFORM instead, and we'll do that shortly. See that we've made the data_buffer both a copy source and a copy destination, as well as marking it as storage, whereas our output buffer is only a copy source and a storage, since we never need to copy into it, as it is an output. MAP_READ will enable us to make our buffer mapped again at a later time so we can read data out of it to non-GPU memory, and there is of course a matching MAP_WRITE flag that we don't particularly need.
These permissions are simply sets of flags, and so to combine them we use the bitwise OR operation. It was suggested in the tutorial I originally got these flags from that some sets of flags are contradictory, but I haven't seen anything since then that suggests that too many flags does anything other than slowing down the buffer.
fn helper_param_buffer(
device: &wgpu::Device,
label: Option<&str>,
size: u64,
) -> wgpu::Buffer {
device.create_buffer(&wgpu::BufferDescriptor {
label: label,
size: size,
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
mapped_at_creation: false,
})
}
let width_buffer = device.create_buffer_init(&BufferInitDescriptor {
label: Some("width"),
contents: bytemuck::cast_slice(&[width]),
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
});
let height_buffer = device.create_buffer_init(&BufferInitDescriptor {
label: Some("height"),
contents: bytemuck::cast_slice(&[height]),
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
});
let kappa_buffer = helper_param_buffer(device,Some("kappa"),4);
let delta_t_buffer = helper_param_buffer(device,Some("delta_t"),4);
// this is just the same number divided by two so we can reuse the pipeline
let delta_t_2_buffer = helper_param_buffer(device,Some("delta_t_2"),4);
It is of course possible to also define buffers which hold only a single value, and so we have multiple buffers storing parameters for our simulation. Our thermal conductivity parameter $\kappa$, our simulation width and height, and our timestep $\Delta t$. As mentioned earlier, we also store a buffer which is just $\Delta t$ divide by two, but we could have just as easily made this value merely a flag that indicates whether the GPU should divide $\Delta t$ by two itself; since it would need to do so individually on every single thread, I thought I'd save the trouble and cache that calculation.
Since these are parameters, we won't need to change them and can thus use the UNIFORM flags, and we'll set their sizes to four bytes for f32 (our width and height buffers are also four bytes for u32).
These values, both the parameters and the larger arrays, will appear not directly as function arguments to our pipelines but rather more like global variables. What sets these global variables for a pipeline, and thus what will allow us to reuse the iteration pipeline, is called a bind group. For instance, here is the code right at the top of our "Laplacian Pipeline".
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> laplacian: array<f32>;
@group(0) @binding(2) var<uniform> width: u32;
@group(0) @binding(3) var<uniform> height: u32;
This means that as far as our laplacian pipeline is concerned, we have four global variables: data, which is an array of f32 values which we may read from, laplacian which is a similar array which we may read from and write to, and then two uniform values which are simply u32 values we only read from. The names of these variables don't matter on the CPU-side, they only matter to our WGSL code, whereas the way we indicate what data receives these names in our WGSL code from the CPU-side is by their group number and binding number. We specify these as follows.
fn helper_compute_bind_group(
device: &wgpu::Device,
label: Option<&str>,
pipeline: &wgpu::ComputePipeline,
buffer_sequence: &[&wgpu::Buffer]
) -> wgpu::BindGroup {
let mut temp: Vec<wgpu::BindGroupEntry> = Vec::new();
for (n,val) in buffer_sequence.iter().enumerate() {
temp.push(
wgpu::BindGroupEntry {
binding: n as u32,
resource: val.as_entire_binding()
},
)
}
device.create_bind_group(&wgpu::BindGroupDescriptor {
label: label,
layout: &pipeline.get_bind_group_layout(0),
entries: &temp
})
}
// shader that fixes the insulating boundary conditions.
// we want to apply this before we compute the laplacian
// since it effectively fixes the laplacian equal to zero on the boundary.
let fix_boundary_conditions_bg = helper_compute_bind_group(
device, None, &fix_boundary_conditions_ppln,
&[&data_buffer, &width_buffer, &height_buffer]
);
// compute laplacian of data
let stage_one_bind_group = helper_compute_bind_group(
device, None, &laplacian_pipeline,
&[&data_buffer, &laplacian_buffer, &width_buffer, &height_buffer]
);
// compute RK2 midpoint using laplacian
let stage_two_bind_group = helper_compute_bind_group(
device, None, &iterate_pipeline,
&[
&data_buffer, &laplacian_buffer, &midpoint_buffer,
&width_buffer, &height_buffer, &kappa_buffer, &delta_t_2_buffer
]
);
// reuse laplacian pipeline to compute laplacian using midpoint buffer
let stage_three_bind_group = helper_compute_bind_group(
device, None, &laplacian_pipeline,
&[&midpoint_buffer, &midpoint_laplacian_buffer, &width_buffer, &height_buffer]
);
// reuse RK2 midpoint pipeline but with delta_t
// instead of delta_t/2 to compute RK2 result
let stage_four_bind_group = helper_compute_bind_group(
device, None, &iterate_pipeline,
&[
&data_buffer, &midpoint_laplacian_buffer, &output_buffer,
&width_buffer, &height_buffer, &kappa_buffer, &delta_t_buffer
]
);
// send output buffer to data buffer so we can repeat this
let stage_five_bind_group = helper_compute_bind_group(
device, None, &buffer_move_pipeline,
&[&output_buffer, &data_buffer, &width_buffer, &height_buffer]
);
We'll assign a bind group to each use of pipelines we'll need, and for that reason we name them based on the order they'll apply in. A bind group is consisted of a series of BindGroupEntrys which are individally numbered, and so the numbers within a bind group do not necessarily need to occur in order or continuously. Here, that does not matter to us, so we have a little helper function to just assign our bind group entries numbers sequentially.
Notice also that our bind groups require a bind group layout, and this we inherit from the pipeline we plan to use the bind group for. Since we're specialising our bind groups to a pipeline, you can see that we use the laplacian pipeline for two different bind groups and the iterate pipeline for two different bind groups, but with different buffers each time. For instance in the stage one bind group we use the laplacian pipeline so we can calculate the laplacian of our data buffer in our laplacian buffer, but on the second go around, we provide the midpoint buffer in place of the data buffer, and the midpoint laplacian buffer in place of the laplacian buffer. We do a similar thing for our iterate pipeline, but on the second go around we replace the midpoint buffer with the output buffer, the laplacian buffer with the midpoint laplacian buffer, and the delta_t_2_buffer with the delta_t_buffer. Once again, that is because our operation to calculate $T(x,y,t + \Delta t /2)$ is very similar to our operation to calculate $T(x,y,t + \Delta t)$, as is our operation to calculate $\nabla^2 T(x,y,t)$ to calculating $\nabla^2 T(x,y,t + \Delta t /2)$.
Let's now take a look inside our wgsl code. The full code for our laplcian pipeline looks like this:
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> laplacian: array<f32>;
@group(0) @binding(2) var<uniform> width: u32;
@group(0) @binding(3) var<uniform> height: u32;
@compute// Entrypoint
@workgroup_size(64,1,1)
fn main(
//we use these to get which workgroup and where inside workgroup we are
//@builtin(workgroup_id) wid: vec3<u32>,
//@builtin(local_invocation_id) lid: vec3<u32>
//
// but this will just give us a global id equivalent to
// wid * workgroup_size + lid
@builtin(global_invocation_id) gid: vec3<u32>
) {
// valuable reference: https://www.w3.org/TR/WGSL/#arithmetic-expr
// for ij indices. getting i requires % but we dont need it
let roughj = gid.x / width;
// exit if on boundary. might be inefficient but easier to call too many workers
if (gid.x <= width) {return;} // i.e. y=0
if (gid.x >= width * (height - 1)) {return;} // i.e. y=1
if (gid.x == roughj * width) {return;} // i.e. x=0
if (gid.x == (roughj + 1) * width - 1) {return;} // i.e. x=1
let widthfloatmin1 = f32(width);
let heightfloatmin1 = f32(height);
let delta_x_sq = 1.0f / widthfloatmin1 / widthfloatmin1;
let delta_y_sq = 1.0f / heightfloatmin1 / heightfloatmin1;
laplacian[gid.x] =
(
data[gid.x + 1]
-2.0f * data[gid.x]
+ data[gid.x - 1]
) / delta_x_sq
+ (
data[gid.x + width]
-2.0f * data[gid.x]
+ data[gid.x - width]
) / delta_y_sq;
}
At the top we see our buffers provided as global variables provided. Next we see the function that is actually run on our GPU threads, marked with @compute so wgpu knows this is an entrypoint. Next we specify the workgroup_size; this is how many threads are run together on the GPU. Workgroups have sizes in three different dimensions which correspond to how they're identified once in flight; we organized our data in a line so we'll just stick to using the one dimension. Workgroups are then deployed later in dispatches, where many workgroups are set about running together, and with a large number of dispatches each containing workgroups with many threads, we can get whatever we need done. In my research, the understanding I've found is that workgroup threads are guarenteed to run in parallel while dispatched workgroups may or may not be broken up to facilitate a device's maximum thread count. To that end, more threads per work group is better because it means less dispatches (which have some amount of overhead), and the WebGPU spec guarentees that a fully compliant device will provide a maximum of 256 threads per work group which may be distributed amongst an $x$, $y$ and $z$ axis which may individually hold 256, 256, and 64 threads respectively. Some devices can run dramatically more than that, but many devices are not fully WebGPU compliant, and so it is recommended to set the workgroup size to 64. (see the spec)
The next relevant item is how we identify which thread is running when it's in flight. We have a few ways to do this; the first way would be to figure out which thread is running using a local_invocation_id which will tell us which thread we're running on within a workgroup, together with a workgroup_id which tells us where we are in all of the dispatches. We would then calculate which thread we are using $64 \cdot \texttt{workgroup\_id} + \texttt{local\_invocation\_id}$, but fortunately WebGPU will also do this for us with the global_invocation_id.
Once we know which thread we're on, we're ready to actually do something useful. We'll use our thread number, indicated as the $x$ component of our global invocation ID (remembering that this can be three dimensional, but we're only using one of those dimensions) as the ID for which pixel on our grid we want to compute the laplacian for. As long as we dispatch a total of width times height threads, we'll have one for every pixel. But that is too much, as we must remember calculating the laplacian for pixel on our grid involves using adjacent pixels, so we can't do this calculation on the edge without using a value on the next row by accident or drawing from memory outside of the bounds of data_buffer. More generally, it will often be easier to call too many threads than too few (since they come in blocks of 64), and so most of our shaders will start with some condition to dismiss excess threads. In this case, we dismiss threads smaller than width (i.e. threads on the first row) and threads greater than the width times the height minus one (in the final row or beyond it). We also calculate a $j$ (the vertical position of our thread within the grid, or row number) as an integer, meaning it is rounded down per the wgsl spec; if we multiply this number by the width again, we'll get back not the number we started with but the ID corresponding to the left-most pixel on the row, and so this can tell us if we're on the left-hand border of our grid. Subtract one from that, and we'll know if we're on the right-hand border of our grid, since at the end of each row we loop over to the next row.
We can finally calculate the actual laplacian, and this will almost straightforwardly correspond to the formulas we wrote earlier, with adjacent rows or columns acting as $\Delta x$ or $\Delta y$ distant points. The only note to make here is that since we express our data in a line, instead of adding one or subtracting one from our $j$ coordinate, we simply add or subtract a whole width to move between rows, while we add or subtract one to move between columns. $$\begin{align*} \frac{\partial^2 T}{\partial x^2}(x,t) &= \frac{T(x + \Delta x,y, t) - 2 T(x,y,t) + T(x - \Delta x,y, t) }{\Delta x^2} \\[0em] \nabla^2 T(x,y,t) &= \frac{\partial^2 T}{\partial x^2}(x,y,t) + \frac{\partial^2 T}{\partial y^2}(x,y,t) \end{align*}$$
The pattern discussed above then follows us to our other shaders.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read> laplacian: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;
@group(0) @binding(3) var<uniform> width: u32;
@group(0) @binding(4) var<uniform> height: u32;
@group(0) @binding(5) var<uniform> kappa: f32;
@group(0) @binding(6) var<uniform> delta_t: f32;
@compute @workgroup_size(64,1,1)
fn main(
@builtin(global_invocation_id) gid: vec3<u32>
) {
if (gid.x > (width * height)) {return;}
output[gid.x] = data[gid.x] + delta_t * kappa * laplacian[gid.x];
}
This is our iterate shader, and it corresponds to the latter two steps: $$\begin{align*} T(x,y, t + \Delta t/2) &= T(x,y,t) + \frac{\Delta t}{2} \cdot \kappa \nabla^2 T(x,y,t) \\[0em] T(x,y, t + \Delta t) &= T(x,y,t) + \Delta t \cdot \kappa\nabla^2 T(x,y,t + \Delta t /2) \end{align*}$$
Remember when we set up the bind groups that delta_t may not literally be delta_t. Indeed, even with the laplacian pipeline, many of these variables will correspond to different data buffers to facilitate doing the proper step of the RK2 calculation as necessary.
If any of this is confusing to you, the following is the quintessential example of a simple pipeline, which copies the contents of one buffer to another. The process for setting up and running this pipeline is explained in great deal here.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> out: array<f32>;
@group(0) @binding(2) var<uniform> width: u32;
@group(0) @binding(3) var<uniform> height: u32;
@compute
@workgroup_size(64,1,1)
fn main(
@builtin(global_invocation_id) gid: vec3<u32>
) {
if (gid.x > (width * height)) {return;}
out[gid.x] = data[gid.x];
}
The second most complex buffer we'll discuss here (however certainly the most irritating) is the one that enforces the insulating boundary conditions. This is a rather irritating collection of cases, since we have four sides which must set themselves equal to their closest neighbor on the interior of the grid, not to mention corners which must be set to the corner of the interior of the grid, and then we have to avoid double-counting corners. I'll spare you the details of this one, but if you're interested in how it works, you can inspect the copious collection of comments.
@group(0) @binding(0) var<storage, read_write> data: array<f32>;
@group(0) @binding(1) var<uniform> width: u32;
@group(0) @binding(2) var<uniform> height: u32;
@compute// Entrypoint
@workgroup_size(64,1,1)
fn main(
@builtin(global_invocation_id) gid: vec3<u32>
) {
// we need 2 * width + 2 * height with different behaviours for each side.
// regard adding width as ostensibly y+=delta_y since each y coordinate
// position is separated by a width.
if (gid.x < width) { // side 1, y=0, horizontal
if (gid.x == 0) {
//corner, set (0,0) value to (delta_x,delta_y) value
data[0] = data[width + 1];
} else if (gid.x == width){
//corner, set (1,0) value to (1 - delta_x,delta_y) value
data[width] = data[2 * width - 2];
} else {
// set (x,0) values to (x,delta_y) values
data[gid.x] = data[gid.x + width];
}
return;
} else if (gid.x < width + height ){ // side 2, x=0, vertical
// in these cases we must regard gid.x as j+width since we havent subtracted that
if ((gid.x == width) | (gid.x == ((width + height) - 1))) {
return; // corners, handled by sides 1 and 3
}
// set (0,y) values to (delta_x, y) values
let indexwecareabout = (gid.x - width) * width; // y axis
data[indexwecareabout] = data[indexwecareabout + 1];
return;
} else if (gid.x < (2*width) + height ){ // side 3, y=1, horizontal
// in these cases we must regard gid.x as
// x+width+height since we havent subtracted that
if (gid.x == width + height) {
// corner, set (0,1) value to (delta_x, 1 - delta_y) value
data[width * (height - 1)] = data[width * (height - 2) + 1];
} else if (gid.x == (2*width) + height - 1) {
// corner, set (1,1) value to (1 - delta_x, 1 - delta_y) value
data[width * height] = data[width * (height - 1) - 2];
} else {
// set (x,1) values to (x, 1 - delta_y) values
let indexwecareabout = (gid.x - width - height ) + (width * (height - 1));
data[indexwecareabout] = data[indexwecareabout - width];
}
return;
} else if (gid.x < 2 * width + 2 * height ){// side 4, x=1, vertical
if ((gid.x == (2*width) + height) | (gid.x == 2 * width + 2 * height - 1)) {
return; // corners, handled by sides 1 and 3
} else {
// the +1 before we multiply by width is so we are one more row than we want,
// then the -1 takes us to the y=1 side of the previous row
let indexwecareabout = (gid.x - ((2*width) + height) + 1) * width - 1;
data[indexwecareabout] = data[indexwecareabout - 1];
return;
}
} else { return; }
}
There are a few other ways we could implement this, for instance we could improve readability dramatically if we simply calculated $i$ and $j$ using $i = n \mod W$ and $j = n \mod H$ like I described earlier. The reason I've not done this is because I've heard that calculating a remainder is a particularly expensive GPU operation (although its possible this is no longer the case), and we don't actually need it if we're willing to simply imagine our grid. This is actually also true of integer division, but that's a little tougher to get around in the way we've set this up other than by moving from linear workgroups and dispatches to square workgroups and dispatches, which itself is not too hard once you understand this. Moreover that doesn't simplify our boundary condition code since the perimeter of our square is still a line and we aren't going to call threads for every pixel on our grid just to dismiss the ones that aren't on the boundary.
Strictly speaking the caveats for the corners of the boundary also aren't necessary, but we might want to fix them anyway for graphical considerations later.
With the bind groups and the shaders now both explained, we're ready to actually run our computation.
let mut encoder = device.create_command_encoder(&Default::default());
// these braces are to make sure we return any refs borrowed in them.
// principly, encoder must be returned since it is borrowed by
// begin_compute_pass, and is needed so we can call encoder.finish().
{
let mut gputodo = encoder.begin_compute_pass(&Default::default());
let workgroup_quantity = (width * height).div_ceil(64) as u32;
let boundary_conds_wg_quant =
(width*2 + height*2).div_ceil(workgroup_size);
for _ in 0..iteration_quantity {
gputodo.set_pipeline(&fix_boundary_conditions_ppln);
gputodo.set_bind_group(0, &fix_boundary_conditions_bg, &[]);
gputodo.dispatch_workgroups(boundary_conds_wg_quant, 1 , 1);
gputodo.set_pipeline(&laplacian_pipeline);
gputodo.set_bind_group(0, &stage_one_bind_group, &[]);
gputodo.dispatch_workgroups(workgroup_quantity, 1, 1);
gputodo.set_pipeline(&iterate_pipeline);
gputodo.set_bind_group(0, &stage_two_bind_group, &[]);
gputodo.dispatch_workgroups(workgroup_quantity, 1, 1);
gputodo.set_pipeline(&laplacian_pipeline);
gputodo.set_bind_group(0, &stage_three_bind_group, &[]);
gputodo.dispatch_workgroups(workgroup_quantity, 1, 1);
gputodo.set_pipeline(&iterate_pipeline);
gputodo.set_bind_group(0, &stage_four_bind_group, &[]);
gputodo.dispatch_workgroups(workgroup_quantity, 1, 1);
gputodo.set_pipeline(&buffer_move_pipeline);
gputodo.set_bind_group(0, &stage_five_bind_group, &[]);
gputodo.dispatch_workgroups(workgroup_quantity, 1, 1);
}
}
queue.submit([encoder.finish()]);
As you can see, we create an object called a command encoder. This encoder is highly stateful, as is the queue object and the compute pass object that I've called gputodo. During the compute pass, we're able to run compute pipelines, which we'll recall act as our functions. We tell our GPU to-do list which pipeline we want to run, instruct it which bind group to use (and what to label that bind group, recalling all of our bind group variables were @group(0) so we use the argument 0), and then give the instruction to dispatch the workgroups. Since this is just a to-do list, we can then add these instructions to the to-do list many many times, before we let our gputodo go out of scope, releasing encoder, then submit it to the queue.
We should touch on the idea of race conditions, i.e. when deploying so many threads to the GPU, how can we know they wait for one another to finish working on data before proceeding to the next step? This can be more complicated if you want to deal with race conditions within a compute shader, but we can also solve this problem by separating those steps into different compute shaders, which is what I've done here. WGPU will finish each dispatch before it calls the next one, so we can guarentee that we don't have any race condition issues.
The steps in this order ensure that we apply the boundary condition first so that the edge of the interior of our grid isn't cooled by the boundary but rather the boundary is set to precisely the temperature that won't influence the simulation at all; we do this step first ostensibly so that the laplacian (i.e. the second derivative in two dimensions) is zero at the edge of the interior. Next we calculate the laplacian of $T(x,y,t)$ in stage one, which we use in stage two to calculate $T(x,y,t + \Delta t/2)$. In stage three we calculate the laplacian of $T(x,y,t+\Delta t/2)$, and then we use that to calculate $T(x,y,t + \Delta t)$. Finally, we move that output back into our data buffer so that its correctly positioned to repeat the process. As mentioned earlier, there are a myrid of ways to avoid this move, indeed one of them would be simply to use a different set of bind groups and a different first iteration shader for the first pass that uses the data buffer, then every other pass could just use the output buffer as its data buffer. We have options.
The process you see above of creating a to-do list for the GPU then submitting an encoder that carries it to the queue is just one of many things we can tell the queue to do. For instance, we can set our parameters, or instructing the GPU to make a buffer ready to read from CPU, although these steps won't actually happen until we submit the queue to the GPU. This submission can be resource intensive, and so it is generally considered bad practice for a graphical application to submit more than once per frame; this can be remedied by keeping a queue of our own, a Vec<CommandBuffer> which we submit as a slice to the queue when we are ready.
queue.write_buffer(&kappa_buffer, 0, cast_slice(&[kappa]));
queue.write_buffer(&delta_t_buffer, 0, cast_slice(&[delta_t.clone()]));
queue.write_buffer(&delta_t_2_buffer, 0, cast_slice(&[delta_t / 2.0]));
queue.submit([]);
Once we've completed a computation to our satisfaction, we'll have to decide what we want to do with the final result. We have two options that we might be interested in: we can either export the data, or try to display it to the screen in some way. We'll do both, but I'll talk through the exporting side of things first since there are reasons we might want to do this other than just displaying to the screen, i.e. perhaps we want to check the total temperature to make sure it's staying roughly constant.
This is where we'll have to begin discussing the tokio crate.
Earlier, when we requested an Adapter, Device, and Queue, we used a .await method. This is because the process of asking our GPU for these items doesn't happen on the CPU, and so our CPU might need to wait to get a response back. This meant I had to do some reading about how rust asynchronous processes worked, and what I learned was first and foremost that async is not necessarily about threads. The age old joke about 'what does your CPU do when it's told to sleep for a millisecond other than check if a millisecond has passed on every tick?' comes to mind. The answer to that, as far as the thinking about async functions, turns out to be that the CPU does something else.
impl SetupCompute {
pub async fn new(
initial_data: &Vec<f32>,
width: u32,
height: u32,
) -> Self {
...
}
}
You can imagine much of what we've discussed so far going in a function like this one, where we've specified an async function. When we call an async function, you can imagine the processor being given permission to do this task at its leasure. That might involve threading (this depends on how your asyncs are being handled, and indeed rust itself does not weigh in on this which is why we need tokio) or equally it might involve simply performing the function until such a point that it starts a process that takes too long, at which point we start working on something else in the same thread. The way we think about this is that async functions don't return values, they return Future<Output = ...> generic types which will probably eventually return a value. Since javascript also supports asynchronicity with Promises instead of Futures, we can call these kinds of functions from javascript using wasm.
When we need the value from a future, we can use the await method, and this will halt the function it's running in from continuing until the future has returned its value. This is called blocking, in the sense that the function is blocked from doing anything else other than waiting for the returned value; consequently, rust won't let us await in any function that isn't also an async function, since any function that can block should be ignorable while it's waiting for something else so that the function it runs within can continue. Not even in our main function can block without awaiting, and so if you want to run our code downstream of a main function (we don't, since what we'll do is run via wasm FFI), we'd need the following macro to specify how async will be handled in the program.
#[tokio::main]
async fn main() {
...
}
The tokio crate also provides other facilities, such as the sender/receiver abstraction. Due to rust's memory protections, it can be annoying to send information between threads in a way that perhaps C would allow, since one thread needs to hold on to a mutable reference to write a message and another needs to read it constantly to see if its changed; not allowed in rust since this property more broadly could lead to race conditions. The sender/receiver abstraction remedies this, and we'll need it in order to export data from the GPU, since the GPU needs to tell us that its ready for the export to happen.
Note that in general, tokio will not work in web-assembly, but the particular feature of senders and receivers in tokio::sync will.
let mut encoder = device.create_command_encoder(&Default::default());
let (sender,receiver) = tokio::sync::oneshot::channel();
// if the buffer was previously mapped, we'll have to unmap before we can copy into it
export_buffer.unmap();
encoder.copy_buffer_to_buffer(
&output_buffer, 0,
&export_buffer, 0,
output_buffer.size()
);
// We'll either need to instruct the GPU to have our export buffer mapped here,
// in which case this closure takes sender
encoder.map_buffer_on_submit(
&export_buffer,
wgpu::MapMode::Read,
..,
move |result| {sender.send(result).unwrap()}
);
queue.submit([encoder.finish()]);
let our_final_data: Vec<f32> = {
// OR alternatively we'll do it here after we submit. do not do both
self.export_buffer.map_async(wgpu::MapMode::Read, ..,
move |result| sender.send(result).unwrap());
// this device polling step is unnecessary in WASM since device is polled
// automatically, but we'll need it if running on desktop and using the
// latter mapping method; the former does not require a poll since we submit.
device.poll(wgpu::PollType::wait_indefinitely())?;
_ = receiver.await?;
let output_data = self.export_buffer.get_mapped_range(..);
bytemuck::cast_slice(&output_data).to_vec()
};
Here, we copy the contents of our output buffer into our export buffer (since output doesn't have the flags to read to CPU and export doesn't have the flags to be used in a shader pipeline) and tokio gives us a channel with a sender and a receiver. Ownership of this sender is moved into a closure which WGPU activates for us when our export buffer is ready. All we have to do is tell the receiver to wait for its signal, and we'll know that we're ready to get_mapped_range on the contents of the export buffer.
The only other relevant thing here to pay attention to is the zeros in our copy_buffer_to_buffer call: it is common to want to copy only parts of a buffer, so those numbers are our byte offsets. Since we want to copy the whole buffer to another buffer of trhe same size, we set those to zero, but you can easily imagine moving only (a contiguous) part of a buffer.
If all we wanted to do was run our computation without a graphical interface, we'd be done now. We could write the whole program in one async main and run it on the command line. But not only would that not be as fun, many devices won't give us a command line quite as easily. Our user interface of choice today will be the web, and to do that we'll need a few things.
WGPU is made to conform mostly with the WebGPU spec and provide an interface with the GPU to rust that looks a lot like this spec, however since it is made with WebGPU in mind, it is also made with certain facilities for the web in mind. This was my first time using many features of the web (whether that was input elements in HTML or javascript basically at all), but WGPU knows what to do with an HTML canvas element if we can provide one. Getting that canvas element, bare in mind a HTML element, would be something easier done in javascript, and so for this we need the facilities of the web-sys crate.
let window = wgpu::web_sys::window().unwrap_throw();
let document = window.document().unwrap_throw();
// many tutorials get the canvas like this. DO NOT DO IT
let canvas = body.get_element_by_id(CANVAS_ID_IN_HTML).unwrap_throw();
let canvas: Element = document
.query_selector("canvas")
.expect("could not find canvas")
.expect("canvas query returned empty");
let html_canvas_element: HtmlCanvasElement = canvas
.dyn_into()
.expect("man your canvas is bonked or somethin");
Now, I could not tell you why, but for whatever reason getting the canvas by element ID will catastrophically fail later on, with little sign that this is what went wrong. In fact, in my investigation into why, I found that the actual error that was being thrown was on a step that had a comment saying "Not returning this error because it is a type error that shouldn't happen unless the browser, JS builtin objects, or wasm bindings are misbehaving somehow." In fact many tutorials use exactly that method, and this exact error is then not thrown but only if you use the openGL backend, and throwing an error in near identical code only in the WebGPU backend. Inconceivably, changing this particular method of getting the canvas solved everything for me although needless to say this cost me time and motivation.
The crate web-sys ostensibly exposes javascript methods such as get_element_by_id and query_selector, returning results as rust analogs. Since javascript has a different type system to rust, we need to convert them to a rust type such as Element or HtmlCanvasElement using the dyn_into method which will of course fail if we pick a type that doesn't match closely enough. It should be noted that web-sys doesn't expose basically anything by default, hiding most functionality between features which must be manually specified. Consequently my rust toml file looks like this
[target.'cfg(target_arch = "wasm32")'.dependencies]
tokio = { version = "1", features = ["sync"] }
console_error_panic_hook = "0.1.6"
console_log = "1.0"
wasm-bindgen = "0.2.106"
wasm-bindgen-futures = "0.4.56"
web-sys = { version = "0.3.83", features = [
"Document",
"Window",
"Element",
"Element",
"HtmlElement",
"HtmlCanvasElement",
]}
Now remember way back at the beginning when we requested our adapter, and there was that one argument compatible_surface which we ignored? We'll need that now, as we actually have a surface we'll want to draw to. If you're doing this on the desktop, you'll want to create your surface in a different way outlined here, probably using the winit crate.
#[cfg(target_arch = "wasm32")]
let surface: wgpu::Surface = {
let surface_target = wgpu::SurfaceTarget::Canvas(html_canvas_element);
instance.create_surface(surface_target).expect("failed to create surface")
};
let adapter = instance
.request_adapter(&wgpu::RequestAdapterOptions
{
power_preference: wgpu::PowerPreference::default(),
compatible_surface: Some(surface),
force_fallback_adapter: false,
})
.await?;
Most of the steps I described above then apply just the same as before, with the only real caveat being that we we'll want to wait to submit our compute job so that it can be submitted together with instructions to render a frame (which we'll discuss now) and limit our number of iterations to something we can get done in a frame.
What follows will look a lot like how we prepared buffers, shader modules, and pipelines for our computation, except this time we'll have a bunch of extra details related to graphics. A texture is a lot like a buffer, but it has to be able to do certain things that our normal buffers don't do. Imagine using this code for a videogame; if your player gets real close to a texture in 3D space, what does the GPU do if the texture now occupies more of the screen than the resolution of the texture itself? Textures are responsible for dealing with many of these problems, and so I couldn't tell you what every argument of what follows does exactly.
let size = wgpu::Extent3d {
width: width,
height: height,
depth_or_array_layers: 1,
};
let surface_caps = surface.get_capabilities(&adapter);
let surface_format = surface_caps
.formats
.iter()
.copied()
.find(|f| f.is_srgb())
.unwrap_or(surface_caps.formats[0]);
let config = wgpu::SurfaceConfiguration {
usage: wgpu::TextureUsages::RENDER_ATTACHMENT,
format: surface_format,
width: size.width,
height: size.height,
present_mode: surface_caps.present_modes[0],
alpha_mode: surface_caps.alpha_modes[0],
desired_maximum_frame_latency: 2,
view_formats: vec![],
};
let texture = device.create_texture(&wgpu::TextureDescriptor {
label: Some("default texture"),
size,
mip_level_count: 1,
sample_count: 1,
dimension: wgpu::TextureDimension::D2,
format: wgpu::TextureFormat::Rgba8UnormSrgb,
usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
view_formats: &[],
});
let default_texture: Vec<u8> = [0,255,0,255]
.repeat((size.width.div_ceil(64) * size.height * 64) as usize);
queue.write_texture(
wgpu::TexelCopyTextureInfo {
aspect: wgpu::TextureAspect::All,
texture: &texture,
mip_level: 0,
origin: wgpu::Origin3d::ZERO,
},
&default_texture,
wgpu::TexelCopyBufferLayout {
offset: 0,
bytes_per_row: Some(256 * size.width.div_ceil(64)),
rows_per_image: Some(size.height),
},
size,
);
Our config is where our window meaningfully learns how big it's supposed to be. As suggested earlier, the texture (although it is sized such that one of its pixels is probably a pixel on the canvas) must be able to be dynamically scaled, and so setting the texture size tells the GPU nothing about the actual output size.
Defining the texture has many similarities to defining a buffer, and as you can see we specify usage flags just as we did with buffers, as well as indicating we'll want to copy into this texture. Finally, we write an initial value for the texture using a repeating RGBA vector of green pixels.
Our first indication of irritating complexity related to textures comes in a requirement for the analogous method to write_buffer, the write_texture method. Rather than freely copying one set of data to another position in another buffer, the write_texture method does this in blocks of 256 bytes for each row. Now since our texture consists of RGBA data, each pixel will be four bytes. This means we have to size anything we want to copy to our texture buffer in increments of 64 pixel widths. Not a problem now, we'll just oversize our default texture and WGPU won't copy anything that won't fit into the texture.
let view = texture.create_view(&wgpu::TextureViewDescriptor::default());
let sampler = device.create_sampler(&wgpu::SamplerDescriptor {
address_mode_u: wgpu::AddressMode::ClampToEdge,
address_mode_v: wgpu::AddressMode::ClampToEdge,
address_mode_w: wgpu::AddressMode::ClampToEdge,
mag_filter: wgpu::FilterMode::Linear,
min_filter: wgpu::FilterMode::Nearest,
mipmap_filter: wgpu::FilterMode::Nearest,
..Default::default()
});
The view and sampler are involved in that process of interpolating a texture if its resolution doesn't match how it's displaying on the screen. Since we're explicitly trying to avoid that, most of our settings here are linear or nearest-neighbor sampling, anything cheap that we'll hope doesn't need to actually run.
let texture_bind_group_layout =
device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
entries: &[
wgpu::BindGroupLayoutEntry {
binding: 0,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Texture {
multisampled: false,
view_dimension: wgpu::TextureViewDimension::D2,
sample_type: wgpu::TextureSampleType::Float { filterable: true },
},
count: None,
},
wgpu::BindGroupLayoutEntry {
binding: 1,
visibility: wgpu::ShaderStages::FRAGMENT,
ty: wgpu::BindingType::Sampler(wgpu::SamplerBindingType::Filtering),
count: None,
},
],
label: Some("texture_bind_group_layout"),
});
let texture_draw_bind_group =
device.create_bind_group(&wgpu::BindGroupDescriptor {
layout: &texture_bind_group_layout,
entries: &[
wgpu::BindGroupEntry {
binding: 0,
resource: wgpu::BindingResource::TextureView(&view),
},
wgpu::BindGroupEntry {
binding: 1,
resource: wgpu::BindingResource::Sampler(&sampler),
},
],
label: Some("diffuse_bind_group"),
});
let shader = device.create_shader_module(
wgpu::ShaderModuleDescriptor {
label: Some("Shader"),
source: wgpu::ShaderSource::Wgsl(
include_str!("rectangle.wgsl").into()),
}
);
let render_pipeline_layout =
device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
label: Some("Render Pipeline Layout"),
bind_group_layouts: &[&texture_bind_group_layout],
push_constant_ranges: &[]
});
The above should look familiar due to its similarity to the compute pipeline setup. The only striking difference is that we explicitly define a texture bind group layout. What will follow is a division between a vertex shader, which places polygons, and a fragment shader which colors our screen using information about where those polygons are and what they should look like. These shaders are written in a wgsl file called rectangle, and our bind group layout gives wgpu instructions on how we'll provide our view and sampler from earlier to our fragment shader.
The job of our fragment shader is rather opaque; we have GPU steps for using the sampler and view to do whatever texture sampling WGPU finds appropriate. Our vertex shader we can be significantly more hands-on with. Since we want to simply draw our texture to the screen, we can achieve this with two triangles forming a square over the view. To specify those triangles, we'll need to give coordinates, but we might have to give those coordinates multiple times since these two triangles forming a square will share vertices. It's better practice when we're certain we want the edges of the triangles to line up to simply reuse the points together with an index buffer which we'll use to say which triangles have which vertices as their points.
It'll also be important to think about which vertices we specify for the triangle in which order, since our triangles have a front face and a back face. If we view a triangle through its back face, it will be culled to the viewport, so picking the order of vertices correctly is paramount to actually getting to see the thing. My implementation is inspired heavily by sotrh's tutorial which gives some details that I haven't, including how coordinates work for textures.
#[repr(C)]
#[derive(Copy, Clone, Debug, bytemuck::Pod, bytemuck::Zeroable)]
struct Vertex {
position: [f32; 3],
tex_coords: [f32; 2],
}
const VERTICES: &[Vertex] = &[
Vertex {
position: [-1.0, -1.0, 0.0],
tex_coords: [0.0, 1.0],
}, // 0 bottom left
Vertex {
position: [1.0, -1.0, 0.0],
tex_coords: [1.0, 1.0],
}, // 1 bottom right
Vertex {
position: [1.0, 1.0, 0.0],
tex_coords: [1.0, 0.0],
}, // 2 top right
Vertex {
position: [-1.0, 1.0, 0.0],
tex_coords: [0.0, 0.0],
}, // 3 top left
];
// our rectangle is two triangles bisecting the screen on its anti-diagonal
const INDICES: &[u16] = &[0, 1, 2, 0, 2, 3];
...
let vertex_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Vertex Buffer"),
contents: bytemuck::cast_slice(VERTICES),
usage: wgpu::BufferUsages::VERTEX,
});
let index_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
label: Some("Index Buffer"),
contents: bytemuck::cast_slice(INDICES),
usage: wgpu::BufferUsages::INDEX,
});
The following is basically just sotrh's code, but if you have any interest in reproducing what I made, you'll need this.
let render_pipeline = device.create_render_pipeline(
&wgpu::RenderPipelineDescriptor {
label: Some("Render Pipeline"),
multiview: None,
layout: Some(&render_pipeline_layout),
vertex: wgpu::VertexState {
module: &shader,
entry_point: Some("vs_main"),
buffers: &[Vertex::desc()],
compilation_options: Default::default(),
},
fragment: Some(wgpu::FragmentState {
module: &shader,
entry_point: Some("fs_main"),
targets: &[Some(wgpu::ColorTargetState {
format: config.format,
blend: Some(wgpu::BlendState {
color: wgpu::BlendComponent::REPLACE,
alpha: wgpu::BlendComponent::REPLACE,
}),
write_mask: wgpu::ColorWrites::ALL,
})],
compilation_options: Default::default(),
}),
primitive: wgpu::PrimitiveState {
topology: wgpu::PrimitiveTopology::TriangleList,
strip_index_format: None,
front_face: wgpu::FrontFace::Ccw,
cull_mode: Some(wgpu::Face::Back),
polygon_mode: wgpu::PolygonMode::Fill,
unclipped_depth: false,
conservative: false,
},
depth_stencil: None,
multisample: wgpu::MultisampleState {
count: 1,
mask: !0,
alpha_to_coverage_enabled: false,
},
cache: None,
}
);
surface.configure(&device, &config);
We'll also need the shader code itself
// Vertex shader
struct VertexInput {
@location(0) position: vec3<f32>,
@location(1) tex_coords: vec2<f32>,
}
struct VertexOutput {
@builtin(position) clip_position: vec4<f32>,
@location(0) tex_coords: vec2<f32>,
}
@vertex
fn vs_main(
model: VertexInput,
) -> VertexOutput {
var out: VertexOutput;
out.tex_coords = model.tex_coords;
out.clip_position = vec4<f32>(model.position, 1.0);
return out;
}
// Fragment shader
@group(0) @binding(0)
var t_diffuse: texture_2d<f32>;
@group(0) @binding(1)
var s_diffuse: sampler;
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
return textureSample(t_diffuse, s_diffuse, in.tex_coords);
}
Notice above the texture sample function that the fragment shader calls. In my reading about this, one of the key things that separates a fragment shader from other kinds of shaders, resulting amongst other things in differences in computational performance, is that fragment shaders need access to these kinds of facilities.
We'll also need a way to render the texture to the screen. You'll want a function somewhere that does something like the following that you are able to call on each frame. Don't let this function be async or it will cause problems; even though we do a GPU call in here, that doesn't mean the function itself is async, just that the GPU knows to get pixels to the screen as soon as it can.
let output = self.surface.get_current_texture()?;
let view = output
.texture
.create_view(&wgpu::TextureViewDescriptor::default());
let mut encoder = self
.device
.create_command_encoder(&wgpu::CommandEncoderDescriptor {
label: Some("Render Encoder"),
});
{
let mut render_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
label: Some("Render Pass"),
//multiview_mask: None,
color_attachments: &[Some(wgpu::RenderPassColorAttachment {
view: &view,
resolve_target: None,
ops: wgpu::Operations {
load: wgpu::LoadOp::Clear(wgpu::Color::GREEN),
store: wgpu::StoreOp::Store,
},
depth_slice: None,
})],
depth_stencil_attachment: None,
occlusion_query_set: None,
timestamp_writes: None,
});
render_pass.set_pipeline(&self.render_pipeline);
render_pass.set_bind_group(0, &self.texture_draw_bind_group, &[]);
render_pass.set_vertex_buffer(0, self.vertex_buffer.slice(..));
render_pass.set_index_buffer(self.index_buffer.slice(..), wgpu::IndexFormat::Uint16);
render_pass.draw_indexed(0..6, 0, 0..1);
};
queue.submit([encoder.finish()]);
output.present();
Finally, we can get to my favorite part, which is how we turn our simulation into something that displays on the screen. If we set up the code I've given you so far, we'd get a simulation and a separate blank green screen. Since we're doing the heat equation, we'll probably want a heatmap so that color can tell us how hot a part of our simulation is.
To do that, we'll need to pick a notion of what temperature range we're displaying, i.e. how cold is blue and how hot is red. We can set up a shader pipeline and a buffer for computing a texture into, which we'll then copy into our texture with queue.write_texture:
let pad_per_line: u32 = width.div_ceil(64) * 64 - width;
let pad_buffer = device.create_buffer_init(&BufferInitDescriptor {
label: Some("pad_per_line"),
contents: bytemuck::cast_slice(&[pad_per_line]),
usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
});
#[allow(non_snake_case)]
let vis_minT_buffer = helper_param_buffer(device,Some("minT"),4);
#[allow(non_snake_case)]
let vis_maxT_buffer = helper_param_buffer(device,Some("maxT"),4);
let heat_hue_shader = device.create_shader_module(
wgpu::include_wgsl!("heatcolor.wgsl"));
let heat_hue_pipeline = helper_basic_compute_shader(
device,
Some("Heatmap Pipeline"),
&heat_hue_shader
);
let heat_hue_buffer = device.create_buffer(&wgpu::BufferDescriptor {
label: Some("heatmap"),
size: (256 * width.div_ceil(64) * height) as u64,
usage: wgpu::BufferUsages::COPY_SRC | wgpu::BufferUsages::STORAGE,
mapped_at_creation: false
});
let heat_hue_bind_group = helper_compute_bind_group(
device, None, &heat_hue_pipeline,
&[
&data_buffer,
&heat_hue_buffer,
&vis_minT_buffer,
&vis_maxT_buffer,
&width_buffer,
&height_buffer,
&pad_buffer
]
);
Something's come back to haunt us here. Remember when I said that the write_texture operation only works in increments of 64 pixels per row? This means we'll need to pad each row of our heatmap buffer, and then we'll need to tell our shader how much padding we're using so it can ignore those pixels. On writing the texture, WGPU knows not to copy the padded region because our heatmap buffer will have more pixels per row than texture can fit. Everything else here should seem familiar, but we now need to get into writing the shader itself.
There's just one other problem: we'll need to write our pixels as RGBA, four unsigned single byte integers, but WGSL doesn't support the u8 type. We'll need to do some bit level hacking to get around that.
@group(0) @binding(0) var<storage, read> data: array<f32>;
@group(0) @binding(1) var<storage, read_write> rgba_out: array<u32>;
@group(0) @binding(2) var<uniform> minT: f32;
@group(0) @binding(3) var<uniform> maxT: f32;
@group(0) @binding(4) var<uniform> width: u32;
@group(0) @binding(5) var<uniform> height: u32;
@group(0) @binding(6) var<uniform> pad_per_line: u32;
// GOING TO HAVE TO DO SOME EVIL BITWISE MANIPULATION HERE
@compute
@workgroup_size(8,8,1)
fn main(
@builtin(global_invocation_id) gid: vec3<u32>
) {
if ((gid.x >= width) | (gid.y >= height)) {return;}
let range = clamp((data[gid.x + gid.y * width] - minT)/(maxT - minT), 0.0f, 1.0f);
var red: f32 = 0;
var green: f32 = 0;
var blue: f32 = 0;
// we ignore the part of the colorwheel outside of 0-240 degrees
switch u32(floor(range * 4)) {
case 0: {
blue = 1.0f;
green = 4.0f * range;
}
case 1: {
green = 1.0f;
blue = 1.0f - 4.0f * (range - 0.25);
}
case 2: {
green = 1.0f;
red = 4.0f * (range - 0.5);
}
case 3: {
green = 1.0f - 4.0f * (range - 0.75);
red = 1.0f;
}
default: {
red = 1.0f;
}
}
// we use floating point multiplication on f32 representatives of the u8 255
// numbers we want then bitwise clamp them so color bits dont leak into each
// other. due to evil bit reversed storage shenanigans the RGBA is actually ABGR
var color: u32 = 0xFF000000 // 255 in opacity column
+ (u32(0000000255.0f * red) & 0x000000FF) // 0000000255 is 0x000000FF
+ (u32(0000065280.0f * green) & 0x0000FF00) // 0000065280 is 0x0000FF00
+ (u32(0016711680.0f * blue) & 0x00FF0000) // 0016711680 is 0x00FF0000
;
rgba_out[gid.x + gid.y * (pad_per_line + width)] = color;
}
This time I'm giving you an example of a 2D workgroup call so we can avoid an integer division to get our rough approximation of the $j$ coordinate from earlier; our workgroup_size is $(8,8,1)$ and we have to use gid.y as well. See at the top each of the variables we passed in through the bind group.
We can write our four u8 values in using one u32 value which wgsl will support, just so long as we dont let the data leak into other bit columns it doesn't belong in. In this way, we construct our $4\times\texttt{u8}$ by limiting four different u32 values to integers that can be expressed only using their eight bits of the larger u32. Since computes store bits within a value backwards, our RGBA needs to be written in as ABGR. We need not pay mind to reversing the bits within our u8s though since whatever arithmetic we do to get them will also be reversed. (I learned that tidbit playing microcorruption a few years ago, a puzzle game about bit-hacking embedded systems.)
So, referring to our above graph, our procedure will involve fixing our temperature from the data_buffer in the range from minT to maxT (the temperatures corresponding to coldest blue and hottest red) which we do with wgsl's clamp operation so range is never more than one or less than zero. We check which quadrant of the part of the color wheel we're interested in that our temperature should go in, and based on that we apply the appropriate linear functions to get a $[0,1]$ range floating point number for how red, green, or blue our color should be. Then we multiply this floating point ratio with a floating point number corresponding to all-bits-in-byte-on in the u8 we care about. The smaller our ratio is, the more these bits will be shifted down once we convert to u8 (e.g. a ratio 0.5 is roughly a single bit shift), and that's fine but we'll need to switch off any on-bits that are shifted in the wrong columns with a bitwise-And. Finally, we add these up into our RGBA u32 (although the way we've set this up we could equally use bitwise-Or), and send it into the proper pixel slot in our heatmap buffer, making sure to ignore padding pixels.
We'll want to call this shader before before we render, then use a command encoder to instruct the buffer-to-texture copy rather than asking the queue to do it so that the instruction is put on the encoder's to-do-list in order.
let mut pending_queue: Vec<CommandBuffer> = Vec::new()
let mut compute_encoder = device.create_command_encoder(&Default::default());
// ... Our compute shader instructions go here.
pending_queue.push(compute_encoder.finish());
let mut color_encoder = device.create_command_encoder(&Default::default());
let x_workgroup_quantity = self.width.div_ceil(8) as u32;
let y_workgroup_quantity = self.height.div_ceil(8) as u32;
{
let mut gputodo = color_encoder.begin_compute_pass(&Default::default());
gputodo.set_pipeline(&self.heat_hue_pipeline);
gputodo.set_bind_group(0, &self.heat_hue_bind_group, &[]);
gputodo.dispatch_workgroups(x_workgroup_quantity, y_workgroup_quantity, 1);
}
color_encoder.copy_buffer_to_texture(
wgpu::TexelCopyBufferInfo {
buffer: &self.heat_map_buffer,
layout: wgpu::TexelCopyBufferLayout {
offset: 0,
bytes_per_row: Some(256 * self.width.div_ceil(64)),
rows_per_image: Some(self.height)
}
},
wgpu::TexelCopyTextureInfo {
aspect: wgpu::TextureAspect::All,
texture: texture,
mip_level: 0,
origin: wgpu::Origin3d::ZERO
},
textureBuffer.size()
);
pending_queue.push(color_encoder.finish());
let mut render_encoder = device.create_command_encoder(&Default::default());
// ... Our render shader instructions go here
pending_queue.push(render_encoder.finish());
queue.submit(pending_queue.into_boxed_slice());
Since we'll need to call some version of this every frame (and consider we might not want to be running the calculation every frame as well) we'll need a function we can wrap it in, and we'll want this function distinct from the function that gets our device and queue since we only need to do that once. This necessitates some kind of program state struct, and the fine details of how you lay that out don't matter too much so long as you keep your buffers and pipelines in scope. When their rust lifetime ends, wgpu will free the memory, and this also means that if we want to reinitialize everything we can just drop the struct in some scope and that'll clear our buffers rather than any manual steps we might need in another language. In my code, I separated each of these steps into their own method of the state struct.
No matter what you do, you'll need a way to make this state accessible to javascript. One way to do this is to make the state itself into a javascript object, which can then have its methods called over there. The way to do something like that is outlined here. What I elected to do to keep as much inside rust as possible is to maintain a persisting state variable. This can be achieved with the thread_local macro and the RefCell type.
use std::cell::{RefCell};
pub enum WebApp {
Uninitialized,
Idle(WgpuState),
}
thread_local! {
pub static THE_STATE : RefCell<WebApp> = RefCell::new(WebApp::Uninitialized);
}
Rust obviously doesn't like the idea of a global mutable variable such as this since if any thread can access this at any time, it would be the peak of race/write conflicts. The compromise lies in the Cell model, in which we have a rust object which holds on to our state and will only make it available by swapping out the contents. When we swap the value, we don't have merely a mutable reference to the value but for the time being we actually have ownership of the value itself, while an WebApp::Uninitialized sits in its place in THE_STATE. Incidentally, we also need a structure like this for our pending_queue object, since when we're done with the queue we'll need to let our queue fully eat the value rather than copying it by reference. The functions we expose to javascript then look something like this with #[wasm_bindgen] to indicate we want them exposed.
#[wasm_bindgen]
pub fn run_a_compute_iter() -> Result<(), JsValue> {
let globalstate = THE_STATE.replace(WebApp::Uninitialized);
let mut state: WgpuState = match globalstate {
WebApp::Uninitialized => {
// we'll discuss this shortly
log::info!("Can not do! Uninitialized");
return Err(JsValue::from_str("Can not do! Uninitialized"));
}
WebApp::Idle(state) => state
};
let mut pending_queue = state.pending_queue.replace(vec![]);
state.heateq.send_compute_job(&mut pending_queue,&state.device);
state.heateq.send_color_job(&mut pending_queue,&state.device);
state.heateq.color_to_texture(
&mut pending_queue,
&state.device,&state.texture_buffer
);
_ = state.pending_queue.replace(pending_queue);
THE_STATE.set(WebApp::Idle(state));
Ok(())
}
As long as we write our javascript well, we should never encounter the kind of issue that RefCell protects against in which we're trying to access the state while another function is using it.
We're almost ready to actually call some functions from javascript, but first we need to discuss the initialization. This is not as simple as running our GPU initialization function since WASM itself needs a distinct entrypoint where it can also set up everything WASM itself needs to run in javascript. That entry point must be marked #[wasm_bindgen(start)]. This is also a good time to initialize the logging tools that will let us write to the browser console if we want to do some debugging.
#[cfg(target_arch = "wasm32")]
#[wasm_bindgen(start)]
pub async fn run_web() -> Result<(), wasm_bindgen::JsValue> {
console_log::init_with_level(log::Level::Info).unwrap_throw();
log::info!{"initialized log"};
let (sender, receiver) =
tokio::sync::oneshot::channel::<Result<(), wgpu::BufferAsyncError>>();
_ = web_app::INTERNAL_MESSAGE.replace(Some(receiver));
let window = wgpu::web_sys::window().unwrap_throw();
let document = window.document().unwrap_throw();
//let canvas = body.get_element_by_id(CANVAS_ID).unwrap_throw();
let canvas: web_sys::Element = document
.query_selector("canvas")
.expect("could not find canvas")
.expect("canvas query returned empty");
let html_canvas_element: HtmlCanvasElement = canvas
.dyn_into()
.expect("man your canvas is bonked or somethin");
let mut webstate = wgpuworkhorse::WgpuState::new(html_canvas_element)
.await
.map_err(|e| JsValue::from_str(&format!("error {}",e)))?;
webstate.render().map_err(|e| JsValue::from_str(&format!("error {}",e)));
console_error_panic_hook::set_once();
web_app::THE_STATE.set(web_app::WebApp::Idle(webstate));
sender.send(Ok(()));
Ok(())
}
Here we see a few things we've mentioned before, the obtaining of the canvas element, as well as the sender-receiver abstraction (into another RefCell called INTERNAL_MESSAGE which holds a receiver) which can be useful for insisting that javascript wait until this initialization is done before trying to access the GPU. I ran into some problems with that since something about packing the WASM initialization into this init function can make javascript ignore await instructions on the promise it returns. You'll also see here that our wasm_bindgen(start) function must have the output type Result<(), wasm_bindgen::JsValue>, since if it fails it will need to output a javascript error. There are a few examples here of creating a JsValue error from strings.
The top of my javascript code then looks like this:
import init from "./pkg/pet_webgpusolver.js";
async function run() {
await init();
}
await run();
console.log("WASM Loaded");
import {
run_a_compute_iter,
render_a_frame,
update_values,
send_output_to_export,
is_receiver_ready,
get_export_to_num,
junk_current_state,
rinit_with_xy,
parse_csv,
give_current_width,
give_current_height,
init_from_csv_buffer,
get_total_energy_in_one
} from "./pkg/pet_webgpusolver.js";
async function init_energy() {
var init_energy_val = await get_total_energy_in_one();
document.getElementById("total_energy_goes_here").textContent = init_energy_val;
}
init_energy();
We need to call init, which is the function we marked with #[wasm_bindgen(start)], before we use anything from WASM or else it will fail. As I mentioned earlier, we can use both graphics and export the buffer which I do here in get_total_energy_in_one when we sum the temperature across all grid cells so we can check if our simulation is conserving energy, hence we make use of receiver stored in INTERNAL_MESSAGE.
Now as for calling for a simulation and render each frame, we'll want to use javascript's utilities for this. There are other ways to schedule renders on the rust side, but I've been warned javascript has some particularities about when it wants its new frames and may fight us if we try to conflict with it, but all of that is handled for us if we just use the requestAnimationFrame function.
var max_N = 52488;
var N_add = 100;
var current_N = 0;
var current_time = 0;
var current_delta_t = parseFloat(document.getElementById("delta_t").value);
var stop_compute = false;
var get_total_temp = true;
function run_compute() {
run_a_compute_iter();
if (get_total_temp) {
send_output_to_export();
render_a_frame();
get_total_temp = false;
} else {
render_a_frame();
}
current_N = current_N + N_add
current_time = current_time + current_delta_t * N_add;
let receiver_response = is_receiver_ready();
if (receiver_response) {
let total_energy = get_export_to_num();
document.getElementById("total_energy_goes_here").textContent = `${total_energy}`
document.getElementById("total_time_goes_here").textContent = `${current_time}`
get_total_temp = true;
}
if ((current_N < max_N) & (~stop_compute)) {requestAnimationFrame(run_compute);}
}
The rest of the features I implemented, including the ability to implement an initial configuration from a CSV, amount to details you can find on my github repo for this project. As you can imagine it's not that hard to parse a CSV, and all other functions are merely derivative of what I've described above. There is the matter of how you run this on desktop, in which case sotrh's tutorial using winit applies more; my code also does both, but doesn't implement the full suite of features made available on the web for the desktop app at this stage.
To compile the project, you'll need to install wasm-pack and run the command wasm-pack build --target web.
This was ultimately a fun little project, but as I indicated at the beginning, it isn't super mathematically interesting; we are ultimately implementing a very precise gaussian blur. I was also surprised ultimately by how little deeper hardware considerations were made a problem for me; there were race conditions, which is a problem in all threaded work, and buffer sychronisation issues to worry about (heat_hue_buffer had to have rows in multiples of 256 bytes, etc.) and that was kind of it. The WebGPU style of GPU API is, all things considered, much more high level than I expected. There was a simple notion of 'a function' I could call, even if it was shuffled through a few layers of shader-module/pipeline and bind-group arguments, but no special attention had to be paid to lining the system of operations up in any special way. It just kind of worked.
The one thing that didn't kind of just work was the explicitly web related components. Part of the reason I got to know WGPU as well as I did was because I had to spend a very long, demotivating, time figuring out why a canvas element rust-type JsValue was failing a type conversion while literally having a debug implementation that said it's type was correct, altogether with the comments in WGPU saying that error basically shouldn't occur. That was probably the greatest of my grievances, but it was of course an unwelcome surprise when javascript's await command simply didn't do the thing it was supposed to do if the thing it was doing was starting up WASM. I don't have resolutions for either of these errors, I can't say it was impossible for them not to have happened in retrospect and it's my fault in the way that you might like to in programming, the wasm-bindgen simply had a bizarre instability there.
But having overcome those problems, and now understanding how to write make WebGPU work, I can't help but be drawn to doing something a bit more mathematically interesting. During my undergrad I'd taken a class on applied mathematics, where we spent half the course on continuum mechanics and methods for modelling it. I'd struggled back then, in part because of the lecturer, in part because of covid, and in part because I couldn't wrap my head around how the modelling methods we were using worked. I've since remedied that, as I read a book a couple of years ago detailing the motivations for how the FEniCS finite element solver worked; I was astounded at how simple it was. And now, aside from writing this, I can't help but wonder about how I could throw a meshing rust-crate together with some simple case setup, first order lagrange polynomials as basis vectors, assemble it as a $A \mathbf{x} = \mathbf{b}$ problem and throw a GPU linear solver algorithm at it. That last part is supposed to be the hard part, but at the bare minimum a conjugate gradient solver would work just fine, and I'd need that if I wanted to do an implicit finite-difference method anyway.
Perhaps something to do next time.
Now that we know how to build the simulation, we need to ask some basic questions about when it will give us a correct answer and how correct that answer will be. Obviously this is a concern that goes beyond the mere building of the simulation, and so I've put this in an appendix so that you can stop reading if the code is all you're interested in. Personally I think there is no point
So we will investigate the truncation error and the stability analysis.
The rational for how this works can be a bit conceptually deceptive but it comes down to two fundemental observations.
First, our method revolves around allowing pixels on our grid to conduct heat into adjacent pixels. Recall the Green's function solution, $$\begin{gather*} T(x,t) = \frac{1}{\sqrt{4 \pi \kappa t}} \int_{-\infty}^{\infty} \exp\left( - \frac{(x - \tau)^2}{4 \kappa t}\right) g(\tau) d\tau \end{gather*}$$ together with my lamentation that this solution implies the time evolution of $T$ is actually just a gaussian blur. But in a sense means that for our simulation to be perfectly accurate we need a way to make every pixel affect every other pixel simultaneously, even if only a little, since a gaussian is never zero, just very very close to it. In practice, since the fall-off is exponential-squared, each part of our grid doesn't meaningfully need to affect every other one, just those nearby, or in our simulation actually adjacent. But in principle we must now pose the question if it is possible for the true solution to spread out in a way where non-adjacent grid points have larger-than-rounding-error increases, spreading faster than our adjacent-pixel-effector simulation allows in a single time step. If that were to happen, we'd have no reason to think that our model would predict something sensible. In fact the simulation might try to guess that heat should flow more than the difference between temperatures, meaning that we'd see net heat-difference creation; this would be like watching a system's entropy go down in a sense. No good.
So how are we to know when our timestep is small enough? Is there a way to show that the timestep is small enough that only enough time passes for pixels to meaningfully spread heat to only their neighbors? The solution comes to us through history; we must remember that fourier analysis is famed early on for its contributions to the study of the heat equation, and so our method of attack will be to study our solutions $T(x,y,t)$ in terms of a wave basis of functions $T(x,y,t) = E(t) e^{i k_x x + i k_y y}$, composed of a magnitude term $E(t)$ and a complex wave. We're using a complex wave but that doesn't mean we're necessarily thinking about some notion of complex temperature, just that we're decomposing our functions that way, with $k_x$ and $k_y$ standing for the frequency of the wave in different axes. Since the heat equation is linear, any solution of this form will have a corresponding $E(t) e^{-ik_x x - i k_y y}$ solution which will add to make $E(t) \cos(k_x x + k_y y)$ for example, and we can just require that we only consider real solutions of $T$ after our analysis is done.
Let's try this for the heat equation in two dimensions without RK2 applied. The initial assumption is $$\begin{align*} \frac{T(x,y,t + \Delta t) - T(x,y,t) }{\Delta t} =& \kappa \textstyle{\frac{T(x+\Delta x,y,t) - 2 T(x,y,t) + T(x - \Delta x,y,t)}{\Delta x^2}} \\[0em] &+ \textstyle{\kappa \frac{T(x,y + \Delta y,t) - 2 T(x,y,t) + T(x,y - \Delta y,t)}{\Delta y^2} } \end{align*}$$ and we'll now divide both sides by $T(x,y,t)$, assuming the complex plane wave solution. This is why we wanted a basis function of that form, because dividing complex waves by each other allow you to subtract the phase arguments. This will amount to a clever trick that creates a bound for us. $$\begin{align*} \frac{\frac{T(x,y,t + \Delta t)}{T(x,y,t)} - \frac{T(x,y,t)}{T(x,y,t)} }{\Delta t} =& \kappa \frac{\frac{T(x+\Delta x,y,t)}{T(x,y,t)} - 2 \frac{T(x,y,t)}{T(x,y,t)} + \frac{T(x - \Delta x,y,t)}{T(x,y,t)}}{\Delta x^2} \\[0em] &+ \kappa \frac{\frac{T(x,y +\Delta y,t)}{T(x,y,t)} - 2 \frac{T(x,y,t)}{T(x,y,t)} + \frac{T(x,y - \Delta x,t)}{T(x,y,t)}}{\Delta y^2} \\[0em] \frac{\frac{E(t+\Delta t)}{E(t)} - 1 }{\Delta t} =& \kappa \frac{\frac{e^{ik_x (x+\Delta x) + i k_y y}}{e^{ik_x x + i k_y y}} - 2 + \frac{e^{ik_x (x-\Delta x) + i k_y y}}{e^{ik_x x + i k_y y}}}{\Delta x^2} \\[0em] &+ \kappa \frac{\frac{e^{ik_x x + i k_y (y+ \Delta y)}}{e^{ik_x x + i k_y y}} - 2 + \frac{e^{ik_x x + i k_y (y - \Delta y )}}{e^{ik_x x + i k_y y}}}{\Delta y^2} \end{align*}$$ Now using $e^x/e^y = e^{x-y}$ and moving $\Delta t$ and $1$ to the other side, $$\begin{align*} \frac{E(t+\Delta t)}{E(t)} =& 1 + \kappa \Delta t \frac{e^{ik_x \Delta x} - 2 + e^{-ik_x \Delta x}}{\Delta x^2} + \kappa \Delta t \frac{e^{ik_x \Delta y} - 2 + e^{-ik_x \Delta y}}{\Delta y^2} \end{align*}$$ we obtain an expression for the ratio of $E(t)$ growth. Since this was the magnitude over time, this ratio tells us whether or not the magnitude of the wave component is growing, and so any unstable solution will need that to be greater than one in order to produce the weird heat propogation effects of an unstable simulation. One might infer that we can simply demand that this growth rate is less than one, i.e. that the magnitude of plane waves in the simulation only declines, and it turns out this can tell us how small $\Delta t$ needs to be. So let's assert $$\begin{gather*} \frac{E(t+\Delta t)}{E(t)} \leq 1. \end{gather*}$$ To get something meaningful out of this, we'll have to use a property of complex plane waves that $e^{ix} - e^{-ix} = 2i\sin(x)$, and moreover that squaring this, $e^{i2x} + e^{-i2x} - 2= -4 \sin^2(x)$. We have some terms that have a very similar form in our expression, so we can write $$\begin{gather*} \left| \frac{E(t+\Delta t)}{E(t)} \right| = \left|1 - 4\kappa \Delta t \frac{\sin^2(k_x \Delta x / 2)}{\Delta x^2} - 4\kappa \Delta t \frac{\sin^2(k_y \Delta y / 2)}{\Delta y^2} \right| \leq 1 \end{gather*}$$
Elsewhere on this site I've explained why $|x|\leq y$ implies $-y \leq x \leq y$, so we'll apply that here to get two inequalities. $$\begin{gather*} -1 \leq 1 - 4\kappa \Delta t \frac{\sin^2(k_x \Delta x / 2)}{\Delta x^2} - 4\kappa \Delta t \frac{\sin^2(k_y \Delta y / 2)}{\Delta y^2} \leq 1 \\[0em] -2 \leq -4 \kappa \Delta t \frac{\sin^2(k_x \Delta x / 2)}{\Delta x^2} - 4\kappa \Delta t \frac{\sin^2(k_y \Delta y / 2)}{\Delta y^2} \leq 0 \end{gather*}$$
This is really two inequalities, principly revolving around our two $\sin^2$ terms. We actually don't have to pay any attention to what's in them, only noting that because they're squared sines, they can be anywhere in the range $[0,1]$. In this way, the $\leq 0$ branch actually doesn't tell us anything (since we have two non-positive terms), but the $2 \le $ branch does. Let's rewrite it. $$\begin{gather*} 4\kappa \Delta t \frac{\sin^2(k_x \Delta x / 2)}{\Delta x^2} + 4 \kappa \Delta t \frac{\sin^2(k_y \Delta y / 2)}{\Delta y^2} \leq 2 \\[0em] \end{gather*}$$ One way to analyse this is to completely ignore our $\sin^2$ terms, considering only their 'worst case scenario' when they go closest to the bound we're trying not to violate. Since $\sin^2$ ranges in $[0,1]$, our worst case is for them both to be $1$, so we'll assume that. $$\begin{gather*} 4 \kappa \Delta t \frac{1}{\Delta x^2} + 4 \kappa \Delta t \frac{1}{\Delta y^2} \leq 2 \\[0em] 4\kappa \Delta t \left( \frac{1}{\Delta x^2} + \frac{1}{\Delta y^2} \right) \leq 2 \\[0em] \Delta t \le \frac{1}{2 \kappa \left( \frac{1}{\Delta x^2} + \frac{1}{\Delta y^2}\right)} \end{gather*}$$
And there is a stability criteria! Of course, using the stability criteria as an equality would be a bad idea since our calculation is done using floating point arithmetic rather than exact arithmetic, but we can healthily satisfy the bound by setting $\Delta t$ to say, half that.
The real question is whether or not our RK2 method can do better!
No it can't.
This proof will require a lot of terms, and there won't be as much time for hand holding as I'd normally do since there's so much to do and this proof is unlikely to be of interest to anyone except those who can understand it in this form. To make expressions fit, we'll start writing $T(x,y,t)$ as $T^{i,j}_{k}$ to denote grid position $(i,j)$ at timestep $k$. So we'd write $T(x + \Delta x, y, t + \Delta t /2)$ as $T^{i+1,j}_{k+1/2}$. I'll also introduce the notation here that $\nabla_\square^2$ denotes our centered space numerical laplacian as distinct from the true laplacian, with $\delta^2/\delta x^2$ and $\delta^2/\delta y^2$ standing for the partial derivatives as expressed in our numerical scheme rather than the true derivatives. Now let's get going.
Our iteration method in RK2 can be expressed as $$\begin{align*} T^{i,j}_{k+1} &= T^{i,j}_k + \kappa \Delta t \nabla_\square^2 T^{i,j}_{k+1/2} \\[0em] &= T^{i,j}_k + \kappa \Delta t \nabla_\square^2 \left( T^{i,j}_k + \frac{\Delta t}{2} \nabla_\square^2 T^{i,j}_{k} \right) \\[0em] &= T^{i,j}_k + \kappa \Delta t \nabla_\square^2 T^{i,j}_k + \frac{\kappa \Delta t^2}{2} \nabla_\square^2 \nabla_\square^2 T^{i,j}_k \end{align*}$$ It's best we analyse these terms individually or else we'll end up with some very long expressions. We'll study the ratio in the latter two terms assuming plane wave solutions as above. $$\begin{align*} \frac{\nabla_\square^2 T^{i,j}_k}{T^{i,j}_k} &= \frac{1}{T^{i,j}_k}\left(\frac{\delta^2 T^{i,j}_k}{\delta x^2} + \frac{\delta^2 T^{i,j}_k}{\delta y^2}\right) \\[0em] &= \frac{e^{i k_x \Delta x} + e^{-i k_x \Delta x} - 2}{\Delta x^2} + \frac{e^{i k_y \Delta y} + e^{-i k_y \Delta y} - 2}{\Delta y^2} \\[0em] &= \frac{- 4 \sin^2(k_x \Delta x / 2) }{\Delta x^2} + \frac{- 4 \sin^2(k_y \Delta y / 2)}{\Delta y^2} \end{align*}$$ And for the next term we'll have to reuse some parts and proceed to break the problem down further. $$\begin{align*} \frac{\nabla_\square ^2\nabla_\square^2 T^{i,j}_k}{T^{i,j}_k} = \frac{1}{T^{i,j}_k}\left(\frac{\delta^2}{\delta x^2} \frac{\delta^2 T^{i,j}_k}{\delta x^2} + \frac{\delta^2}{\delta x^2} \frac{\delta^2 T^{i,j}_k}{\delta y^2} + \frac{\delta^2}{\delta y^2} \frac{\delta^2 T^{i,j}_k}{\delta x^2} + \frac{\delta^2}{\delta y^2} \frac{\delta^2 T^{i,j}_k}{\delta y^2} \right) \end{align*}$$ This will be easier solved if we focus on the first two terms, since the latter two will be mirror-images in the way we write them. $$\begin{align*} \frac{1}{T^{i,j}_k} \frac{\delta^2}{\delta x^2} \frac{\delta^2 T^{i,j}_k}{\delta x^2} &= \frac{1}{T^{i,j}_k} \frac{\delta^2}{\delta x^2} \frac{T^{i+1,j}_k - 2 T^{i,j}_k + T^{i-1,j}_k}{\Delta x^2} \\[0em] &= \frac{1}{T^{i,j}_k \Delta x^4} \big( T^{i+2,j}_k - 4 T^{i+1,j}_k + 6 T^{i,j}_k - 4 T^{i-1,j}_k + T^{i-2,j}_k \big) \\[0em] &= \frac{1}{\Delta x^4} \big( e^{i 2 k_x \Delta x} - 4 e^{i k_x \Delta x} + 6 - 4 e^{- i k_x \Delta x}+ e^{- i 2 k_x \Delta x}\big) \\[0em] &= \frac{1}{\Delta x^4} \big( e^{i 2 k_x \Delta x} + e^{- i 2 k_x \Delta x} -2 - 4 e^{i k_x \Delta x} - 4 e^{- i k_x \Delta x} + 8 \big) \\[0em] &= \frac{1}{\Delta x^4} \big( -4 \sin^2(k_x \Delta x) + 16 \sin^2{k_x \Delta x / 2} \big) \end{align*}$$ We'll adopt a version of this for $\Delta y$. $$\begin{gather*} \frac{1}{T^{i,j}_k} \frac{\delta^2}{\delta y^2} \frac{\delta^2 T^{i,j}_k}{\delta y^2} = \frac{1}{\Delta y^4} \big( -4 \sin^2(k_y \Delta y) + 16 \sin^2{k_y \Delta y / 2} \big) \end{gather*}$$ Now we repeat this for the cross term. $$\begin{align*} \frac{1}{T^{i,j}_k} \frac{\delta^2}{\delta x^2} \frac{\delta^2 T^{i,j}_k}{\delta y^2} &= \frac{1}{T^{i,j}_k} \frac{\delta^2}{\delta x^2} \frac{T^{i,j+1}_k - 2 T^{i,j}_k + T^{i,j-1}_k}{\Delta y^2} \\[0em] &= \frac{1}{T^{i,j}_k \Delta x^2 \Delta y^2} \begin{pmatrix} T^{i+1,j+1}_k & - 2 T^{i,j+1}_k & + T^{i-1,j+1}_k \\[0em] -2 T^{i+1,j}_k & +4T^{i,j}_k & -2 T^{i-1,j}_k \\[0em] + T^{i+1,j-1}_k & -2 T^{i,j-1}_k & +T^{i-1,j-1}_k\\[0em] \end{pmatrix} \\[0em] &= \frac{1}{\Delta x^2 \Delta y^2} \begin{pmatrix} e^{i k_x \Delta x + i k_y \Delta y} & - 2e^{i k_y \Delta y} & + e^{-i k_x \Delta x + i k_y \Delta y} \\[0em] -2 e^{i k_x \Delta x} & +4 & -2 e^{- i k_x \Delta x} \\[0em] + e^{i k_x \Delta x - i k_y \Delta y} & -2 e^{- i k_y \Delta y} & +e^{-i k_x \Delta x - i k_y \Delta y}\\[0em] \end{pmatrix} \end{align*}$$ Here we've written the terms in a matrix but this is purely so we can stack them better. The pairings we'll exploit to get our $-4 \sin^2$ relation are the pairings across the center of this 3x3 set of terms, and they'll need the corresponding constants $-2$ for each $-4 \sin^2$. For the two diagonal pairings we'll have to subtract two each, so adding a corresponding four to our constant term in the center to cancel that $-4$ we need will make it eight, just enough for our horizontal and vertical pairings. $$\begin{align*} \frac{1}{T^{i,j}_k} \frac{\delta^2}{\delta x^2} \frac{\delta^2 T^{i,j}_k}{\delta y^2} &= \frac{1}{\Delta x^2 \Delta y^2} \begin{pmatrix} -4 \sin^2(k_x \Delta x/2 + k_y \Delta y/2) \\[0em] -4 \sin^2(k_x \Delta x/2 - k_y \Delta y/2) \\[0em] +8 \sin^2(k_x \Delta x/2) \\[0em] +8 \sin^2( k_y \Delta y/2) \\[0em] \end{pmatrix} \end{align*}$$ It suffices to notice by inspection that swapping $x$ and $y$ in this expression would keep it the same, so we also know the other cross term. Now let's return to our original ratio equality. $$\begin{align*} \frac{E(t + \Delta t)}{E(t)} = \frac{T^{i,j}_{k+1}}{T^{i,j}_k} =& 1 + \kappa \Delta t \begin{pmatrix} \frac{-4}{\Delta x^2} \sin^2(k_x \Delta x /2) \\[0em] \frac{-4}{\Delta y^2} \sin^2(k_y \Delta y /2) \\[0em] \end{pmatrix} \\[0em] &+ \frac{\kappa^2 \Delta t^2 }{2\Delta x^4} \begin{pmatrix} -4 \sin^2(k_x \Delta x) \\[0em] +16 \sin^2(k_x \Delta x /2) \\[0em] \end{pmatrix} \\[0em] &+ \frac{\kappa^2 \Delta t^2 }{\Delta x^2 \Delta y^2} \begin{pmatrix} -4 \sin^2(k_x \Delta x/2 + k_y \Delta y/2) \\[0em] -4 \sin^2(k_x \Delta x/2 - k_y \Delta y/2) \\[0em] +8 \sin^2(k_x \Delta x/2) \\[0em] +8 \sin^2( k_y \Delta y/2) \\[0em] \end{pmatrix} \\[0em] & + \frac{\kappa^2 \Delta t^2 }{2\Delta y^4} \begin{pmatrix} -4 \sin^2(k_y \Delta y) \\[0em] +16 \sin^2(k_y \Delta y /2) \\[0em] \end{pmatrix} \end{align*}$$ Oof. We can now turn this into a bound but to make that useful we'll have to figure out how we turn our $\sin^2$ terms into their 'worst case scenario's. But there's something important to note here. Look at the inequality which we derive from this in the way we did for the non-RK2 case. $$\begin{align*} 2 \ge & \kappa \Delta t \begin{pmatrix} \frac{4}{\Delta x^2} \sin^2(k_x \Delta x /2) \\[0em] \frac{4}{\Delta y^2} \sin^2(k_y \Delta y /2) \\[0em] \end{pmatrix} + \frac{\kappa^2 \Delta t^2 }{2\Delta x^4} \begin{pmatrix} 4 \sin^2(k_x \Delta x) \\[0em] -16 \sin^2(k_x \Delta x /2) \\[0em] \end{pmatrix} \\[0em] &+ \frac{\kappa^2 \Delta t^2 }{\Delta x^2 \Delta y^2} \begin{pmatrix} 4 \sin^2(k_x \Delta x/2 + k_y \Delta y/2) \\[0em] +4 \sin^2(k_x \Delta x/2 - k_y \Delta y/2) \\[0em] -8 \sin^2(k_x \Delta x/2) \\[0em] -8 \sin^2( k_y \Delta y/2) \\[0em] \end{pmatrix} \\[0em] & + \frac{\kappa^2 \Delta t^2 }{2\Delta y^4} \begin{pmatrix} 4 \sin^2(k_y \Delta y) \\[0em] -16 \sin^2(k_y \Delta y /2) \\[0em] \end{pmatrix} \ge 0 \end{align*}$$ We have a term of order $O(\kappa \Delta t/ \Delta x^2) + O(\kappa \Delta t/ \Delta y^2)$, which are the same terms as in the non-RK2 case, and all other terms have that corresponding coefficient squared (e.g. $O(\kappa^2 \Delta t^2 / \Delta x^4)$) or some variation of that involving swapping $\Delta x^2$ or $\Delta y^2$. So already, we expect our bound to be close to the normal stability bound because the foward euler stability terms dominate. Moreover, many of the sine arguments are related, which is important since $\sin^2$ has all of its peaks at odd number multiples of $\pi/2$ and zeros for even multiples. So if we were to maximise the standard non-RK2 sine squared terms, we'd zero out many of the other ones in the following way: $$\begin{align*} \kappa \Delta t \begin{pmatrix} \frac{4}{\Delta x^2} \\[0em] \frac{4}{\Delta y^2} \\[0em] \end{pmatrix} + \frac{\kappa^2 \Delta t^2 }{2\Delta x^4} \begin{pmatrix} 0 \\[0em] -16 \\[0em] \end{pmatrix} & + \frac{\kappa^2 \Delta t^2 }{2\Delta y^4} \begin{pmatrix} 0 \\[0em] -16 \\[0em] \end{pmatrix} + \frac{\kappa^2 \Delta t^2 }{\Delta x^2 \Delta y^2} \begin{pmatrix} 0 \\[0em] + 0 \\[0em] -8 \\[0em] -8 \\[0em] \end{pmatrix} \end{align*}$$ Notice here we also zero out sine squared terms with arguments that are the sums or differences of the $k_x \Delta x / 2$ or $k_y \Delta y / 2$ terms, since irrespective of their sums and differences, if they are maximising the non-RK2 sine squared terms, then they are odd number multiples of $\pi/2$ and thus sum and difference to even multiples.
We'll set $A = \kappa \Delta t$, $B = 1/\Delta x^2$ and $C = 1/\Delta y^2$ now and simplify this relation to the following: $$\begin{gather*} 2 \ge 4AB + 4AC - 8 A^2B^2 - 8 A^2 C^2 -16 A^2 B C \ge 0. \end{gather*}$$ If we want to get a bound with $\Delta t$ on one side of this, we'll need to solve this as a quadratic in $A$. We could apply the quadratic equation on both inequalities that form this transitive chain, but if we do this for the $2 \ge $ side we get complex solutions; in fact this is because the $2 \ge$ inequality is unconditionally satisfied. Unlike in the forward Euler stability analysis, our terms that are quadratic in $A$, since they are negative, can now violate the $\ge 0$ condition when our $\sin^2$ terms are non-zero (i.e. in the case where we've maximised the $\sin^2$ terms). So studying that inequality, we must solve the quadratic equation to find the valid bounds on $\kappa \Delta t$. $$\begin{gather*} A^2 \left( -8 B^2 -8C^2 -16 BC \right) + A(4B + 4C) \ge 0 \\[0em] A \Big( A \left( -8 B^2 -8 C^2 -16 BC \right) + 4B + 4C \Big) \ge 0 \end{gather*}$$ The solutions, and thus the terminals of the range $A$ must remain inside, are $\kappa \Delta t = 0$ (meaning things are unstable if $\kappa \Delta t$ is negative) and $$\begin{gather*} \kappa \Delta t = \frac{4B + 4C}{8B^2 + 8 C^2 + 16 BC} = \frac{B + C}{2(B + C)^2} = \frac{1}{2(1/\Delta x^2 + 1/\Delta y^2)} \\[0em] \Delta t = \frac{1}{2 \kappa (1/\Delta x^2 + 1/\Delta y^2)} \end{gather*}$$ so altogether we have $$\begin{gather*} 0 \le \Delta t \le \frac{1}{2 \kappa (1/\Delta x^2 + 1/\Delta y^2)} \end{gather*}$$ just as in the forward Euler stability criterion.
Garbage!!
But there is another consideration we have to make when choosing our method which may show RK2 to be better than forward Euler.
The truncation error stems from a basic fact of calculus known in relation to the Taylor series. That is, in general, $$\begin{gather*} f(x+h) = f(x) + h \frac{d f}{d x}(x) + \frac{h^2}{2!} \frac{d^2 f}{d x^2}(x) + \frac{h^3}{3!} \frac{d^3 f}{d x^3}(x) + ... \end{gather*}$$
And in fact we invoke this whenever we take a derivative. When we say $(f(x+h)-f(x))/h$ what we're doing is subtracting that first term, $f(x)$, dividing the $h$ off of $df/dx$ and then in the limit of $h \to 0$, all the other terms are set to zero. That's all fine and dandy but it's in direct contradiction to the central presumption of, say, the forward Euler method. In order to derive the forward euler method, $f(x+h)=f(x) + hf'(x)$, we necessarily presume that $$\begin{gather*} \frac{f(x+ \Delta x) - f(x)}{\Delta x} = \frac{d f}{dx} \end{gather*}$$ which completely ignores those other terms in the Taylor series. If our $\Delta x$ (or our $\Delta t$ for that matter) is small enough then ignoring these terms may merely be a rounding error, as is the case in the derivative where we take the limit of that rounding error being zero, but failing that, this error is sure to add up over time.
We can use this as a method for comparing numerical techniques; essentially we ask the question, for a given technique, how much does our error reduction decrease when we make our calculation more expensive. If we double the number of iteration steps and half the time step between those iterations, or perhaps divide each pixel of our grid into four, do we half the error? Perhaps quarter it? The answer differs for each technique.
Let's do this for the forward Euler method, using our heat-equation temperature as an example; the way we'll do the calculation won't account for position variable truncation error but that's not what we're hoping to improve with RK2 anyway. The central presumption of the forward Euler method for our heat equation is ostensibly what is written above, so we'll move all terms to one side and define this as the Truncation Error (TE). Since if the presumption we use to derive the Euler method is true then the other side of the equation above is zero, we know that the truncation error should be zero, and its failure to be zero is error.
$$\begin{gather*} \text{TE} = \frac{T(x,y,t + \Delta t) - T(x,y,t)}{\Delta t} - \frac{\partial T}{\partial t} \approx 0 \end{gather*}$$
The true value of $T(x,y,t + \Delta t)$ is, as above, known in relation to $T(x,y,t)$ and derivatives of $T$ at $(x,y,t)$ as the taylor series $$\begin{gather*} T(x,y,t + \Delta t) = T(x,y,t) + \Delta t \frac{\partial T}{\partial t} + \frac{\Delta t^2}{2} \frac{\partial^2 T}{\partial t^2 } + \frac{\Delta t^3}{3!} \frac{\partial^3 T}{\partial t^3} + ... \end{gather*}$$
which we may substitute into the truncation error relation to get the error $$\begin{align*} \frac{T(x,y,t + \Delta t) - T(x,y,t)}{\Delta t} - \frac{\partial T}{\partial t} &= \frac{\Delta t \frac{\partial T}{\partial t} + \frac{\Delta t^2}{2} \frac{\partial^2 T}{\partial t^2 } + \frac{\Delta t^3}{3!} \frac{\partial^3 T}{\partial t^3} + ...}{\Delta t} - \frac{\partial T}{\partial t} \\[0em] &= \frac{\Delta t}{2} \frac{\partial^2 T}{\partial t^2 } + \frac{\Delta t^3}{3!} \frac{\partial^3 T}{\partial t^3} + \frac{\Delta t^4}{4!} \frac{\partial^4 T}{\partial t^4} + ... \end{align*}$$
We can't do anything further with this but we don't really need to. This is where we break out the big-O notation and say that the forward euler method is $O(\Delta t)$, a first order method, meaning that when we half the time step we half the error. And that error is precisely the second, third, and higher order derivative terms that we haven't accounted for when we express the derivative as a difference. This is of course not accounting for error from our spatial resolution $\Delta x$ and $\Delta y$ but we can perform those analyses separately.
The better question is what this method says about our RK2 method. Are we actually getting any improvements out of it? Let's see.
Recall our iteration method again, but this time we'll use $\square$ as a subscript and $\delta/\delta x^2$ to denote our numerical versions of things, since we're principly interested in the difference between the numerical scheme and the true solution with real derivatives. $$\begin{align*} \frac{\delta^2 T}{\delta x^2}(x,t) &= \frac{T(x + \Delta x,y, t) - 2 T(x,y,t) + T(x - \Delta x,y, t) }{\Delta x^2} \\[0em] \nabla_\square^2 T(x,y,t) &= \frac{\delta^2 T}{\delta x^2}(x,y,t) + \frac{\delta^2 T}{\delta y^2}(x,y,t) \\[0em] T_\square(x,y, t + \Delta t/2) &= T(x,y,t) + \frac{\Delta t}{2} \cdot \kappa \nabla_\square^2 T(x,y,t) \\[0em] T_\square(x,y, t + \Delta t) &= T(x,y,t) + \Delta t \cdot \kappa \nabla_\square^2 T_\square(x,y,t + \Delta t /2) \end{align*}$$ In fact our core interest will ultimately be in the differences between these quantities and their non-numerical versions. $$\begin{align*} \text{TE}_{x} &= \frac{\delta^2 T}{\delta x^2}(x,t) - \frac{\partial^2 T}{\partial x^2} \\[0em] \text{TE}_{\nabla} &= \nabla_\square^2 T(x,y,t) - \nabla^2 T(x,y,t)\\[0em] \text{TE}_{t/2} &= T_\square(x,y, t + \Delta t/2) - T(x,y,t + \Delta t/2) \\[0em] \text{TE}_t &= T_\square(x,y, t + \Delta t) - T(x,y,t + \Delta t) \end{align*}$$
What follows is a lot of termwise manipulation, so I'll minimize it by default.
The truncation errors are as follows:$$\begin{align*} \text{TE}_{x} &= \frac{\delta^2 T}{\delta x^2}(x,t) - \frac{\partial^2 T}{\partial x^2} \\[0em] \text{TE}_{\nabla} &= 2 \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} + 2 \sum_{n=2}^\infty \frac{(\Delta y)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial y^{2n}} \end{align*}$$ Moreover, these errors appear whenever we use $\nabla_\square^2$ or $\delta^2/\delta x^2$ as linear operators. $$\begin{gather*} \nabla_\square^2 = \nabla^2 + 2 \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n}}{\partial x^{2n}} + 2 \sum_{n=2}^\infty \frac{(\Delta y)^{2n-2}}{(2n)!} \frac{\partial^{2n}}{\partial y^{2n}} \end{gather*}$$ So we'll define the linear operator $$\begin{gather*} L = 2 \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n}}{\partial x^{2n}} + 2 \sum_{n=2}^\infty \frac{(\Delta y)^{2n-2}}{(2n)!} \frac{\partial^{2n}}{\partial y^{2n}} \end{gather*}$$ and say $\text{TE}_{\nabla} = L(T)$ and $\nabla_\square^2 = \nabla^2 + L$.
We'll do these one by one, starting with $\text{TE}_x$. Immediately we'll apply the Taylor series $T(x + \Delta x,y,t) = \sum_{n=0}^\infty \frac{(\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n}$. $$\begin{align*} %\text{TE}_x = \frac{T_k^{i+1,j} - 2T_k^{i,j} + T^{i-1,j}_k }{\Delta x^2} - \frac{\partial^2 T^{i,j}_k}{\partial x^2} \text{TE}_x &= \frac{T(x + \Delta x, y, t) - 2T(x,y,t) + T(x - \Delta x, y, t)}{\Delta x^2} - \frac{\partial^2 T}{\partial x^2} (x,y,t) \\[0em] &= \frac{ \sum_{n=0}^\infty \frac{(\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n} - 2T(x,y,t) + \sum_{n=0}^\infty \frac{(-\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n} }{\Delta x^2} - \frac{\partial^2 T}{\partial x^2} \end{align*}$$ Our $-2T(x,y,t)$ allows us to eliminate the zeroth order terms on both series. $$\begin{align*} \text{TE}_x &= \frac{ \sum_{n=1}^\infty \frac{(\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n} + \sum_{n=1}^\infty \frac{(-\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n} }{\Delta x^2} - \frac{\partial^2 T}{\partial x^2} \end{align*}$$ The next thing we need to do is notice that the first series has a $(\Delta x)^n$ factor and the other has a $(-\Delta x)^n$ factor, meaning that every odd order term will be the negative of the other series. We'll express these series in terms of their even and odd components, $$\begin{align*} \sum_{n=1}^\infty \frac{(\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n} &= \sum_{n=1}^\infty \frac{(\Delta x)^{2n - 1}}{(2n - 1)!} \frac{\partial^{2 n - 1} T}{\partial x^{2 n - 1}} + \sum_{n=1}^\infty \frac{(\Delta x)^{2n}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \\[0em] \sum_{n=1}^\infty \frac{(-\Delta x)^n}{n!} \frac{\partial^n T}{\partial x^n} &= \sum_{n=1}^\infty \frac{(- \Delta x)^{2n - 1}}{(2n - 1)!} \frac{\partial^{2 n - 1} T}{\partial x^{2 n - 1}} + \sum_{n=1}^\infty \frac{(- \Delta x)^{2n}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \\[0em] &= - \sum_{n=1}^\infty \frac{(\Delta x)^{2n - 1}}{(2n - 1)!} \frac{\partial^{2 n - 1} T}{\partial x^{2 n - 1}} + \sum_{n=1}^\infty \frac{(\Delta x)^{2n}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \end{align*}$$ So let's cancel those odd order terms in our truncation error. Next we separate out the second order term, and then cancel that. $$\begin{align*} \text{TE}_x &= \frac{2}{\Delta x^2} \left( \sum_{n=1}^\infty \frac{(\Delta x)^{2n}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \right) - \frac{\partial^2 T}{\partial x^2} \\[0em] &= 2 \left( \sum_{n=1}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \right) - \frac{\partial^2 T}{\partial x^2} \\[0em] &= 2 \left(\frac{1}{2!} \frac{\partial^2 T}{\partial x^2} + \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \right) - \frac{\partial^2 T}{\partial x^2} \\[0em] &= 2 \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} \end{align*}$$ And we can of course repeat this process for $\text{TE}_y$ to get a similar result. $$\begin{gather*} \text{TE}_y = \frac{\delta^2 T}{\delta y^2} - \frac{\partial^2 T}{\partial y^2} = 2 \sum_{n=2}^\infty \frac{(\Delta y)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial y^{2n}} \end{gather*}$$ This also means we get $\text{TE}_\nabla$. $$\begin{align*} \text{TE}_\nabla &= \nabla_\square^2 T - \nabla^2 T \\[0em] &= \left( \frac{\delta^2 T}{\delta x^2} + \frac{\delta^2 T}{\delta y^2} \right) - \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} \right) \\[0em] &= 2 \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial x^{2n}} + 2 \sum_{n=2}^\infty \frac{(\Delta y)^{2n-2}}{(2n)!} \frac{\partial^{2n} T}{\partial y^{2n}} \end{align*}$$
Here is where the fun begins. To get the total truncation error, we'll need to actually use the heat equation again, which we recall is $$\begin{gather*} \frac{\partial T}{\partial t} (x,y,t) = \kappa \nabla^2 T (x,y,t). \end{gather*}$$ We have, fundamentally, two approximations for the two terms here; note that since we use RK2, we have an almost-forward-Euler method but with the time derivative shifted by $\Delta t /2$. $$\begin{gather*} \frac{T(x,y,t + \Delta t) - T(x,y,t)}{\Delta t} = \frac{\delta T}{\delta t} (x,y,t + \Delta t/2) \approx \frac{\partial T}{\partial t} (x,y,t + \Delta t/2) \\[0em] \nabla_\square^2 T_\square (x,y,t + \Delta t/2) \approx \nabla^2 T(x,y,t + \Delta t/2) \end{gather*}$$ Now, the right hand sides are equal to one another via the heat equation, that is, the actual derivatives without error are the true heat equation, so we use that to set the two approximations approximately equal to one another, with their error this time being the real truncation error. $$\begin{align*} 0 =& \frac{\partial T}{\partial t}(x,y,t+ \Delta t/2) - \kappa \nabla^2 T(x,y,t + \Delta t/2) \\[0em] \text{TE} =& \frac{\delta T}{\delta t}(x,y,t + \Delta t/2) - \kappa \nabla_\square^2 T_\square (x,y,t + \Delta t/2) \\[0em] =& \left(\frac{T(x,y,t+\Delta t) - T(x,y,t)}{\Delta t} \right) \\[0em]& - \kappa \left( \nabla^2 + L \right) \left( T(x,y,t) + \frac{\kappa \Delta t}{2} \nabla_\square^2 T(x,y,t) \right) \end{align*}$$
Now we'll need to start expanding the second term here, but it is worth noting that in the time derivative term, we are meant to use $T(x,y,t+ \Delta t)$ as though it were the true $T$ after a timestep. You might say that we don't have any way to know $T(x,y,t + \Delta t)$ and that is precisely why we have this method, but the expression we have for the truncation error subtracts exactly that method. If our method were perfect, this subtraction would literally be the heat equation with all terms moved to one side as above, equal to zero, and this is how the truncation error will tell us how off our solution is.
Let's focus on the time derivative term. Now that we've done this a couple of times, I'll skip over exactly the process of separating the zeroth order term. We obtain $$\begin{align*} \frac{T(x,y,t+\Delta t) - T(x,y,t)}{\Delta t} = \frac{\sum_{n=1}^{\infty} \frac{\Delta t^n}{n!} \frac{\partial^n T}{\partial t^n}}{\Delta t} &= \frac{\partial T}{\partial t} + \frac{\sum_{n=2}^{\infty} \frac{\Delta t^n}{n!} \frac{\partial^n T}{\partial t^n}}{\Delta t} \\[0em] &= \frac{\partial T}{\partial t} + \sum_{n=2}^{\infty} \frac{\Delta t^{n-1}}{n!} \frac{\partial^n T}{\partial t^n} \end{align*}$$
Now on the Laplacian term, we won't fully expand it yet but observe that we can at least separate out some terms. $$\begin{align*} \kappa \left( \nabla^2 + L \right) \left( T + \frac{\kappa \Delta t}{2} \nabla_\square^2 T \right) =& \kappa \nabla^2 T + \kappa L(T) + \frac{\kappa^2 \Delta t}{2} (\nabla^2 + L) \nabla_\square^2 T \\[0em] =& \kappa \nabla^2 T + \left( \frac{\Delta t}{2} \right) \big((\kappa \nabla^2) (\kappa \nabla^2) T \big) \\[0em] & + \frac{\kappa^2 \Delta t}{2} (L \nabla^2 + \nabla^2 L + LL) T \end{align*}$$
And my favorite step of this forsaken process is now observing that we can actually apply the heat equation on our real differential operators. That is, since the heat equation asserts $\frac{\partial T}{\partial t} = \kappa \nabla^2 T$, and we assert this to be true of the real $T$ which our $T_\square$ approximates, we can apply that equation here to swap instances of $\kappa \nabla^2 T$ for $\frac{\partial T}{\partial t}$. In addition to that, we can apply a presumption of smoothness to swap the ordering of these differential operators to make sure every $\kappa \nabla^2$ becomes a time derivative and our $\nabla^2 L$ and $L \nabla^2$ sum. $$\begin{gather*} \kappa \nabla_\square^2 T_\square (x,y,t + \Delta t/2) = \frac{\partial T}{\partial t} + \frac{\Delta t}{2} \frac{\partial^2 T}{\partial t^2} + \frac{\kappa^2 \Delta t}{2} (2 L \nabla^2 + LL) T \end{gather*}$$ Now we can start applying our subtraction in our truncation error. And in fact, our second time derivative which we got out of our Laplacian term will actually eat the second order term in our time derivative Taylor series. Observe: $$\begin{align*} \text{TE} =& \frac{\partial T}{\partial t} + \sum_{n=2}^{\infty} \frac{\Delta t^{n-1}}{n!} \frac{\partial^n T}{\partial t^n} - \frac{\partial T}{\partial t} - \frac{\Delta t}{2} \frac{\partial^2 T}{\partial t^2} - \frac{\kappa^2 \Delta t}{2} (2 L \nabla^2 + LL) T \\[0em] =& \sum_{n=2}^{\infty} \frac{\Delta t^{n-1}}{n!} \frac{\partial^n T}{\partial t^n} - \frac{\Delta t}{2} \frac{\partial^2 T}{\partial t^2} - \frac{\kappa^2 \Delta t}{2} (2 L \nabla^2 + LL) T \\[0em] =& \frac{\Delta t}{2} \frac{\partial^2 T}{\partial t^2} + \sum_{n=3}^{\infty} \frac{\Delta t^{n-1}}{n!} \frac{\partial^n T}{\partial t^n} - \frac{\Delta t}{2} \frac{\partial^2 T}{\partial t^2} - \frac{\kappa^2 \Delta t}{2} (2 L \nabla^2 + LL) T \\[0em] =& \sum_{n=3}^{\infty} \frac{\Delta t^{n-1}}{n!} \frac{\partial^n T}{\partial t^n} - \frac{\kappa^2 \Delta t}{2} (2 L \nabla^2 + LL) T \end{align*}$$
We're now ready for big-$O$ notation. Observe in our time peturbed Taylor series we have $n=3$ as the smallest term corresponding to a lowest order $\Delta t$ of $O(\Delta t^2)$. For our Laplacian error term, we must recall the definition of $L$, $$\begin{gather*} L = 2 \sum_{n=2}^\infty \frac{(\Delta x)^{2n-2}}{(2n)!} \frac{\partial^{2n}}{\partial x^{2n}} + 2 \sum_{n=2}^\infty \frac{(\Delta y)^{2n-2}}{(2n)!} \frac{\partial^{2n}}{\partial y^{2n}} \end{gather*}$$ Meaning that $L(T)$ has lowest order $O(\Delta x^2) + O(\Delta y^2)$. This also means that, since $\nabla^2$ doesn't contribute any factors of $\Delta x$ or $\Delta y$, we can ignore the contribution of $L(L(T))$ since it would be of order $O(\Delta x^4) + O(\Delta x^2 \Delta y^2) + O(\Delta y^4)$. Altogether then, our truncation error is written $$\begin{gather*} \text{TE} = O(\Delta t^2) + O(\Delta t \Delta x^2) + O(\Delta t \Delta y^2). \end{gather*}$$
So this tells us that our RK2 did in fact improve our truncation error! Now halving the time step reduces the error by a factor of four instead of by a factor of two. Our error improvements are quadratic instead of linear.