Prefix

The original document can be found here.

The code for our project can be found at this location: link. The original code for this project is taken from the following GitHub repository. The baseline code was written by Philip Mocz. Finally, we made the following changes before starting with the project:

  • Fixed a bug that resulted in an infinite loop when setting the plotRealTime flag to False.

Introduction Link to heading

The Kelvin-Helmholtz instability is a phenomenon that arises when two fluid layers of different velocities interface with each other. It results in characteristic patterns that can be observed, e.g., in clouds, on the surface of the sun, or in Jupiter's colourful atmosphere. Philip Mocz created an implementation which produces a visualisation of the characteristic swirls of the K-H instability.

The finite volume method is a popular technique for simulating fluids, or more generally partial differential equations that can be written in a conservative form. This is often the case for equations that describe physical conservation laws. The Euler fluid equations (a simplification of the Navier-Stokes equations) can be represented, besides their primitive description, in such a conservative form. Within the algorithm, Philip Mocz makes use of both representations. He uses the primitive form to extrapolate values in time and space, but the conservative representation to derive the update formula. Numerically, this achieves the best of both worlds: extrapolating within the primitive form is more stable, whereas the conservative representation is used to compute the update efficiently.

In principle, the algorithm works as follows. For each time step and each cell:

  1. Get cell central primitive variables (convert from conservative ones)
  2. Calculate time step size $\Delta t$
  3. Calculate gradient using primitive variables (using central differences)
  4. Extrapolate primitive variables in time
  5. Extrapolate primitive variables to faces
  6. Compute fluxes along each face
  7. Update solution by adding fluxes to conservative variables

First, it is important to note that all of these steps are computed for the density $\rho$, the pressure $p$, and the velocities in both dimensions, $v_x$ and $v_y$. These computations do not rely on each other and can therefore be performed in parallel.

Second, each function call usually performs a computation on the whole grid. The original implementation exploits this by vectorising the computation (e.g. using np.roll). These calculations can be parallelised; they are in fact embarrassingly parallel. A sketch of one time step is shown below.
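
To make the structure concrete, the following is a minimal sketch of one such time step in plain numpy. It is our paraphrase of the scheme, not the original code; apart from getFlux, all names are ours, and steps 4-7 are only indicated.

```python
import numpy as np

def time_step(Mass, Momx, Momy, Energy, dx, gamma=5/3, courant_fac=0.4):
    """One finite-volume time step (illustrative sketch, not the original code)."""
    vol = dx * dx
    # 1. conservative -> primitive variables
    rho = Mass / vol
    vx, vy = Momx / Mass, Momy / Mass
    P = (Energy / vol - 0.5 * rho * (vx**2 + vy**2)) * (gamma - 1)
    # 2. CFL-limited time step (see the note on dt and dx below)
    dt = courant_fac * np.min(dx / (np.sqrt(gamma * P / rho) + np.sqrt(vx**2 + vy**2)))
    # 3. gradients via central differences; np.roll realises periodic boundaries
    rho_dx = (np.roll(rho, -1, axis=0) - np.roll(rho, 1, axis=0)) / (2 * dx)
    # ... analogously for vx, vy, P and for the y-direction
    # 4./5. extrapolate the primitive variables half a step in time and to the faces
    # 6.    compute the fluxes along each face (getFlux, the bottleneck analysed below)
    # 7.    apply the fluxes to the conservative variables and return them
    return Mass, Momx, Momy, Energy, dt
```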

We aim to introduce optimisations that decrease the overall runtime of the computation. To achieve this, we use techniques from the lectures: Cython, to move work into pre-compiled C code and reduce Python overhead, and Dask, to investigate a distributed computation approach. As a little bonus treat, we combined the Dask and Cython approaches. Lastly, we investigated GPU acceleration using PyTorch.

Baseline tests Link to heading

Testing and profiling setup Link to heading

All runtime tests that show results for a specific grid size were run on a 2021 M1 MacBook Pro (16-inch) unless explicitly stated otherwise. For every grid size we simulated 256 time steps without computing the visualisation. Every test was run three times and the runtimes were averaged.

Some of the profiling graphs were computed on a 2020 M1 MacBook Air; we state this explicitly wherever it applies. All implementations were tested for correctness by comparing the final $\rho$ values on the grid between the baseline and the optimised implementations. Every approach presented in this report passed this test.
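
The check itself is a plain array comparison; schematically (names and tolerances here are ours, not the project's actual test code):

```python
import numpy as np

def check_correctness(rho_baseline, rho_optimised, rtol=1e-5, atol=1e-8):
    # element-wise comparison of the final density grids of the two runs
    assert np.allclose(rho_baseline, rho_optimised, rtol=rtol, atol=atol), \
        "optimised implementation diverged from the baseline"
```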

Runtime Link to heading

We approached this problem by first trying to get a basic understanding of what the runtime bottleneck might be. For this we used the line profiler. We simulated[^1] on a 256x256 grid for 256 iterations and came to the conclusion that a large portion (nearly 50%) of the computation time is spent computing the fluxes (the getFlux method).

```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   195                                           @profile
   196                                           def main():
[...]
   272                                               # compute fluxes (local Lax-Friedrichs/Rusanov)
   273       256     357351.0   1395.9     25.0      flux_Mass_X, flux_Momx_X, flux_Momy_X, flux_Energy_X = getFlux(...)
   274       256     320673.0   1252.6     22.4      flux_Mass_Y, flux_Momy_Y, flux_Momx_Y, flux_Energy_Y = getFlux(...)
[...]
```

All other function calls accumulated to the other 50%; however, each was in itself more or less negligible (below 4% at most). This strongly incentivised us to focus our efforts on this part of the computation.
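
The table above was produced with the @profile/kernprof workflow of line_profiler; the same statistics can also be collected programmatically along these lines (a sketch; the module name finitevolume is an assumption):

```python
from line_profiler import LineProfiler

from finitevolume import getFlux, main  # module and entry-point names assumed

profiler = LineProfiler()
profiler.add_function(getFlux)  # record per-line stats inside getFlux as well
profiled_main = profiler(main)  # wrap the simulation entry point
profiled_main()
profiler.print_stats()
```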

The runtime of the simulation mostly depends on two parameters. First and foremost, the number of time steps: each time step is computed in one iteration of the main loop. Figure 1 displays the runtime required for each time step. We can observe that the runtime per iteration is mostly consistent over the course of the simulation.

{{< figure src="images/baseline/baseline_per_iteration_macair.png" caption="Fig 1: Baseline Runtime per Iteration on 2020 MacBook Air" >}}

Important note: to simulate a fixed amount of time, the total number of time steps increases with the grid resolution. In other words, simulating two seconds using a fine grid takes more iterations than simulating two seconds using a coarse grid. The reason is that $dt$, which is computed in each loop iteration, depends on $dx$, the granularity of the mesh.
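
Concretely, the step size obeys a CFL-type condition. Up to the exact value of the Courant factor $C$, the original code chooses roughly

$$\Delta t = C \cdot \min\left(\frac{\Delta x}{\sqrt{\gamma p / \rho} + \sqrt{v_x^2 + v_y^2}}\right),$$

so halving $\Delta x$ roughly doubles the number of iterations needed to cover a fixed simulated time.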

{{< figure src="images/baseline/baseline_performance.png" caption="Fig 2: Baseline Performance" >}}

The second important parameter is the grid size. Finer grids allow for higher precision but increase the computational complexity quadratically. To measure the runtime over a variety of grid sizes, we decided to keep the number of time steps constant: we simulated 256 time steps for all grid sizes and performed each computation three times, then averaged. The baseline/default[^2] is the original implementation of the simulation. Figure 2 visualises the wall times of the baseline implementation for different grid sizes on a log-log scale.

Memory Link to heading

We measured the memory consumption of the simulation over its runtime on a 512x512 grid. Figure 3 depicts the measurement we obtained using Python's mprof tool (part of the memory_profiler package). We see that once all memory has been allocated, all computations are performed in place. This is good, since reallocating memory and waiting for the garbage collector to free memory could lead to performance issues, especially on larger grids.
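
mprof is the command-line front end of memory_profiler; an equivalent trace can be collected in code roughly as follows (a sketch; the module name finitevolume is again an assumption):

```python
from memory_profiler import memory_usage

from finitevolume import main  # entry-point name assumed

# sample the resident memory every 0.1 s while the simulation runs
trace = memory_usage((main, (), {}), interval=0.1)
print(f"peak memory: {max(trace):.1f} MiB")
```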

{{< figure src="images/baseline/baseline_memory_512.png" caption="Fig 3: Memory consumption on a 512x512 grid, measured on an M1 MacBook Air" >}}

Optimizations using Cython Link to heading

Since getFlux was the bottleneck of the simulation according to our profiling, we focused on optimising this function using Cython. We investigated four approaches in total; only one performed better than the baseline[^3]. All runtimes presented in this section were obtained by timing the simulation with different versions of the getFlux method.

Attempt 1: Using np.asarray With Minimal Changes Link to heading

For the first attempt, we make extensive use of Cython's typed memoryviews, both for arguments and for intermediate variables defined in the function. The memoryviews are converted to numpy arrays using np.asarray to allow for vectorised operations. This way there is minimal deviation from the structure of the original code[^4]. Figure 4 depicts the performance of this attempt, which is slightly worse than the baseline. We believe this can be explained by the frequent calls to np.asarray, which likely require interactions with the Python runtime. The original code was vectorised incredibly well, so there were no easy performance gains (e.g. by optimising nested for-loops) that Cython could have achieved automatically without our intervention.
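
In Cython's pure-Python notation, the pattern looks roughly like this (a simplified sketch of the idea, not the project's getFluxAsArray; the averaged-state line stands in for the actual flux arithmetic):

```python
import cython
import numpy as np

def flux_sketch(rho_L: cython.double[:, :], rho_R: cython.double[:, :]):
    # typed memoryviews make the function boundary cheap, but numpy's
    # vectorised operators need ndarrays, hence the asarray round-trips
    rho_L_arr = np.asarray(rho_L)
    rho_R_arr = np.asarray(rho_R)
    rho_star = 0.5 * (rho_L_arr + rho_R_arr)  # averaged state
    return rho_star
```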

Attempt 2: Re-Implement Vectorized Arithmetic Link to heading

We re-implemented[^5] numpy's vectorised operations using simple nested for-loops. For example, the addition of two numpy arrays a + b was re-implemented with nested for-loops using element-wise additions (a sketch follows below). We continued using typed memoryviews in all places. We hoped that Cython would be able to optimise the simpler nested loops, leading to a better runtime; however, this was not the case. The comparison of this attempt with the baseline and Attempt 1 can be seen in Figure 4.

One explanation for the poor performance of this implementation is that each operation requires the program to iterate over the arrays in a separate pass. The original code combines many such simple operations to compute the intermediate variables. Another issue could be that numpy is very good at vectorising these operations, whereas the C compiler used by Cython might not be.
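
As a minimal illustration of the rewrite, a single vectorised expression such as a + b turns into an explicit double loop over the grid (a sketch in pure-Python Cython notation):

```python
import cython

def add_elementwise(a: cython.double[:, :], b: cython.double[:, :],
                    out: cython.double[:, :]):
    # one full pass over the grid for this single operation; a chain of such
    # operations therefore traverses the arrays many times
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] + b[i, j]
```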

{{< figure src="images/cython/cython_attempt_4.png" caption="Fig 4: Attempts 1-4 vs Baseline" >}}

Attempt 3: Single Python Loop Link to heading

We observed that all the computations for the returned fluxes can be performed in an embarrassingly parallel manner: no cell of the returned grids depends on any other cell. Thus, iterating over the input grids one cell at a time and computing the resulting fluxes gives the same result as the vectorised version. We implemented[^6] a single nested loop in Python, where we compute each flux component one cell at a time using typed memoryviews. We disabled some Cython flags (such as boundscheck and wraparound) to help with speed.

As depicted in Figure 4, the performance was the worst out of all attempts. This surprised us: we believed that Cython would be able to minimise the number of calls to the Python runtime, because the nested loops involved only arithmetic operations on elements of typed memoryviews. However, on examining the HTML file generated by the cython -a ... command, we still observed heavy Python runtime interaction, particularly for computations that accessed elements of the typed memoryviews (see Figure 20). The frequent callbacks into the Python runtime must have reduced the number of optimisations Cython was able to make, leading to worse performance.

Attempt 4 (Chosen Optimization): Single C Loop Link to heading

Since Cython was not able to prevent the Python runtime interactions, we decided to reimplement[^7] the core logic of the original getFlux method directly in C. The syntax and structure of the implementation were inspired by the example presented in the Cython documentation (link). The numpy arrays passed to getFluxRawC are in contiguous row-major (C) order, which makes porting the Python code to C easier. Correctness is tested in the flux_test.py file, which verifies that the default getFlux and getFluxRawC produce the same results. We still use typed memoryviews for the numpy arrays, along with disabling some Cython flags.

The runtime comparison of this approach with the other attempts and the baseline is presented in Figure 4. Of all attempts, this approach performed the best and, most importantly, better than the baseline for all grid sizes. This may indicate that performing everything in one loop is the right approach, and that Cython's extensive calls to the Python runtime limited runtime optimisations in Attempt 3. In Figure 5, we see that the performance of this approach remains better than the baseline for grid sizes 2048 and 4096. The bar graphs in Figure 6 highlight the performance difference of this Cython optimisation over the baseline.

{{< figure src="images/cython/cython_final_plot.png" caption="Fig 5: Cython Final Line Plot" >}}

{{< figure src="images/cython/cython_final_bar.png" caption="Fig 6: Cython Final Bar Plot" >}}

Optimizations using PyTorch Link to heading

{{< figure src="images/torch/torch_mps_runtime.png" caption="Fig 7: Torch Line Plot" >}}

{{< figure src="images/torch/torch_mps_bars.png" caption="Fig 8: Torch Bar Plot" >}}

The original implementation makes heavy use of numpy functionality. All data is stored in numpy arrays that are incrementally updated using the computed fluxes. This makes it easy to move the computations onto a GPU using PyTorch[^7]. PyTorch's computations run on tensors, a data type that closely resembles numpy arrays. A big difference, however, is that tensor operations can be executed on accelerators. Because PyTorch mirrors numpy's interface, we simply need to convert said numpy arrays to PyTorch tensors and tell PyTorch to perform the computations on GPU-compatible hardware.

To make the runtime results more comparable to the other approaches, we used the M1's integrated GPU via Metal. This is achieved by specifying the mps device in torch. The main difference to CUDA for our purposes is that Metal does not support 64-bit floats; otherwise, the handling is identical.
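
A minimal sketch of the conversion (the density grid and dx here are stand-ins for the simulation state; the cast to float32 is needed because Metal lacks 64-bit floats):

```python
import numpy as np
import torch

# use Apple's Metal backend when available, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

rho_np = np.random.rand(128, 128)  # stand-in for the simulation's density grid
dx = 1.0 / 128

# Metal has no 64-bit floats, so cast before moving the data to the device
rho = torch.from_numpy(rho_np.astype(np.float32)).to(device)

# tensor operations mirror numpy's, e.g. the periodic shift used for gradients:
rho_dx = (torch.roll(rho, -1, dims=0) - torch.roll(rho, 1, dims=0)) / (2 * dx)
```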

As one can see in Figure 8, making use of accelerators greatly improves the overall runtime for larger grid sizes. However, on smaller grid sizes (Figure 7), the overhead significantly slows down the computation.

Optimizations using Dask Link to heading

We approached optimisation with Dask from two angles[^8]. For all Dask runs, the default scheduler settings were used, which on the MacBook Pro gave us 5 workers with 2 threads per worker.

Each iteration of the simulation depends on the outputs of the previous iteration. Coming up with Dask optimisations that delay the call to compute() until the last possible moment (preferably outside the loop) was therefore difficult. In particular, the termination condition of the original code depends on the time parameter t, which is updated in every iteration based on the value of dt computed inside the loop. Thus, the two optimisations we tried respect this structure of the code, and consequently compute() is called in each iteration. While fixing the number of iterations could generate better results, we were still able to obtain promising improvements.

{{< figure src="images/dask/dask_comparison.png" caption="Fig 9: Dask Attempts 1-2 vs Baseline" >}}

Attempt 1: Dask Arrays Link to heading

As noted above, getFlux is the bottleneck of the baseline simulation. This method is embarrassingly parallelisable (no cell computation depends on any other cell). We used map_blocks from dask.array for the getFlux method[^9]. We only applied this to the computation of the fluxes in the x-direction at first, hoping to extend it to the y-direction if the performance was good. We created dask arrays for the inputs to getFlux in each iteration and then called compute() on the value returned by map_blocks to obtain the fluxes in the x-direction in each loop iteration. The task-stream visualisations were done while running the experiment on a grid size of 128 up to 2 seconds of simulated time.
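
Schematically, the pattern looked as follows (a sketch with a placeholder flux computation and made-up chunking; the real getFlux takes the full set of extrapolated face variables):

```python
import dask.array as da
import numpy as np

def flux_block(rho_L, rho_R):
    # placeholder for getFlux applied to one chunk
    return 0.5 * (rho_L + rho_R)

# wrap the inputs as dask arrays, chunked along the first axis
rho_L = da.from_array(np.random.rand(128, 128), chunks=(16, 128))
rho_R = da.from_array(np.random.rand(128, 128), chunks=(16, 128))

flux_x = da.map_blocks(flux_block, rho_L, rho_R, dtype=float)
result = flux_x.compute()  # forced once per loop iteration, as discussed above
```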

{{< figure src="images/dask/dask_opt1_n_by_8.png" caption="Fig 10: Work Stream for Dask Attempt 1" >}}

{{< figure src="images/dask/dask_opt1_n_by_2.png" caption="Fig 11: Work Stream for Dask Attempt 1 (fewer chunks than workers)" >}}

We varied the chunk sizes of the dask arrays to see what impact they have on the runtime; a plot of this can be seen in Figure 21. In Figure 9 we present only the runtime for the best-performing chunk size, which was 1. In summary, the performance decreased with increasing chunk size.

This implementation always performed worse than the baseline. Several factors could have contributed to this. Firstly, there is overhead involved in Dask creating the dask arrays with the correct chunk sizes, allocating them to the workers, setting up the scheduler, etc. This overhead can take the form of communication between workers. We observed this in the task stream of the computation on the Dask dashboard: the red/dark-red segments in the task stream in Figure 10 indicate a lot of transfer between workers[^10]. This was also present when we had fewer chunks than cores/workers (Figure 11). Additionally, the call to compute() in each loop iteration would also lead to a slowdown.

We thought of some ideas to address these concerns (such as trying to delay compute() for everything except dt). However, our experience with Assignment 4's bonus led us to believe that even with fewer compute() calls, the performance of dask arrays in a single-machine environment would likely still be worse than the baseline (because of the set-up overhead discussed above). Thus, before pursuing this, we tried a different approach: dask delayed.

Attempt 2 (Chosen Optimization): Dask Delayed Link to heading

Instead of focusing only on the parallelisability within getFlux, we observed that several helper-function calls are independent of each other. For instance, computing getFlux in the x- and y-directions can be done independently. Looking at this bigger picture, we decided to use dask delayed: we store the delayed tasks for each helper function in an array and then call dask.compute() for each element of the task array[^11].

Each such helper function was tagged with the @delayed decorator.
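
Schematically, the idea looks like this (placeholder helpers instead of the real ones; here a single dask.compute call gathers both directions):

```python
import numpy as np
from dask import compute, delayed

@delayed
def flux_x(rho_L, rho_R):  # placeholder for getFlux in the x-direction
    return 0.5 * (rho_L + rho_R)

@delayed
def flux_y(rho_L, rho_R):  # placeholder for getFlux in the y-direction
    return 0.5 * (rho_L + rho_R)

a, b = np.random.rand(128, 128), np.random.rand(128, 128)

# the two directions are independent, so the scheduler can run them in parallel
fx, fy = compute(flux_x(a, b), flux_y(a, b))
```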

{{< figure src="images/dask/dask_opt2.png" caption="Fig 12: Task Stream for Dask-Delayed Optimization" >}}

The performance of this implementation compared with Attempt 1 (dask arrays) and the baseline can be seen in Figure 9. We see that for very small grids, this attempt performs worse. This is attributed to the increased number of compute() calls, one per independent helper function, each of which requires all parallel calls to complete before progressing. However, as the grid sizes increase, this approach comes out on top. This indicates that as the grid size increases, the sheer number of computations to be performed outweighs the cost of the multiple compute() calls (and the other overhead required by dask delayed), leading to better performance when doing the computations in parallel for grid sizes larger than 512. The improved performance compared to Attempt 1 could be explained by almost no interaction being required between workers, as can be seen from the absence of red in the task-stream graph in Figure 12. This was our chosen Dask optimisation.

{{< figure src="images/dask/dask_final_plot.png" caption="Fig 13: Dask final plot" >}}

{{< figure src="images/dask/dask_final_bar.png" caption="Fig 14: Dask final bar" >}}

In Figure 13, we see that the performance of this optimisation remains better than the baseline for larger grids, as we hypothesised. Figure 14 highlights the time difference that might otherwise be lost in the log-scale line plot.

Bonus Optimization: Cython + Dask Link to heading

We observed that for the two largest grid sizes (2048 and 4096), our Cython optimisation performed worse than our Dask optimisation. To us, this indicated that any performance advantage obtained from performing the time-consuming getFlux operations in C was lost against the sheer number of operations to be performed, giving our parallel Dask optimisation the advantage.

As a bonus, we decided to combine the two: we performed the getFluxRawC computation (our Cython optimisation) in parallel, using our dask-delayed optimisation[^12]. Structurally, this amounts to wrapping the compiled getFluxRawC calls for the x- and y-directions in delayed tasks, exactly as in the sketch above.

{{< figure src="images/cython_dask/cython_dask_final_plot.png" caption="Fig 15: Cython + Dask final plot" >}}

{{< figure src="images/cython_dask/cython_dask_final_bar.png" caption="Fig 16: Cython + Dask final bar" >}}

This led to incredible results for larger grids, as can be seen in Figure 15. The combined approach outperforms both individual optimisations as well as the baseline. The bar graph in Figure 16 shows the runtimes on a non-logarithmic scale, highlighting the performance boost. For smaller grids, the runtime lies between those of the Cython and Dask optimisations. This could be explained by the Dask overhead pulling the runtime up compared to the sole Cython optimisation, while the use of the optimised getFluxRawC pulls the runtime down compared to the sole Dask optimisation.

Conclusion Link to heading

The final plot of the baseline runtime and of all the chosen optimisations can be seen in Figure 17. Figure 18 shows the bar graphs for the three largest grid sizes to highlight the performance improvements over the baseline.

{{< figure src="images/final/final_plot.png" caption="Fig 17: Concluding results plot" >}}

{{< figure src="images/final/final_bar.png" caption="Fig 18: Concluding results bar" >}}

The trends observed in the runtimes relative to the baseline, and their likely explanations, have been discussed extensively in the respective sections. Here, we would like to make one final, important observation about the general pattern. Optimisation approaches that require overhead, whether initial or recurring (e.g. moving data to the GPU, dask-delayed setup, compute() calls, etc.), struggle on coarse grid resolutions; there, techniques without significant overhead perform better. With increasing grid size, however, the advantages of paying that overhead start to become apparent. Once the sheer number of computations becomes overwhelming, it pays off to prepare for them and perform them in parallel, as can be seen from the better performance of the Dask, Dask-with-Cython, and Torch implementations on larger grids.

Appendix Link to heading

Cython Attempt 3: Python Runtime Interactions Link to heading

{{< figure src="images/misc/cython_attempt_3_annotated.png" caption="Fig 20: Python Runtime Interactions for Cython Attempt 3" >}}

Dask Attempt 1: Dask Arrays Chunk Sizes Variation Link to heading

{{< figure src="images/dask/dask_opt1_chunk_size.png" caption="Fig 21: Chunk Size Variation (Dask Attempt 1)" >}}

Runtime variation Link to heading

{{< figure src="images/misc/boxplot_128grid_2sec.png" caption="Fig 22: Runtime variation of different implementations" >}}

For completeness, we explored how volatile the runtimes of the different implementations were. On the MacBook Pro we could not find any indication that some implementations varied significantly more than others. Figure 22 shows a box plot over runtimes for simulating 2 s, each run performed 20 times on a 128x128 grid. As we can see, the runtimes remained consistent, so we did not pursue this further.


[^1]: Using the MacBook Air (2020).

[^2]: Code for the baseline found here.

[^3]: The code for this section is present in the cython/ directory of the repository here.

[^4]: Code found here (getFluxAsArray).

[^5]: Code found here (getFluxAsLoops).

[^6]: Code found here (getFlux, getFluxRawC). The C code is found here.

[^7]: The code for this section is present in the torch/ directory of the repository here.

[^8]: The code for this section is present in the dask/ directory of the repository here.

[^9]: Code found here.

[^10]: The original HTML file is found here.

[^11]: Code found here.

[^12]: Code found here.