It sounds a bit counter-intuitive, but boosting application performance by limiting the amount of vectorisation carried out is essentially what my postdoc, Vasileios Porpodas, and I have done in our latest paper on automatic vectorisation. We call it TSLP, or Throttled SLP, because it limits the amount of scalar code that the standard SLP algorithm converts into vectors.

The actual paper is available here. Vasileios presented it at PACT last week and will be at the LLVM Developers’ Meeting this week, so I thought it might be interesting to expand on one of the examples we give at the end, showing the source code and how it is actually vectorised with SLP and TSLP. The kernel is compute-rhs, which is a function from the BT benchmark within the NAS parallel benchmark suite. The code we’re interested in is this:

1: rhs[i][j][k][2] = rhs[i][j][k][2] + 2: dx3tx1 * (u[i+1][j][k][2] - 2.0*u[i][j][k][2] + u[i-1][j][k][2]) + 3: xxcon2 * (vs[i+1][j][k] - 2.0*vs[i][j][k] + vs[i-1][j][k]) - 4: tx2 * (u[i+1][j][k][2]*up1 - u[i-1][j][k][2]*um1); 5: 6: rhs[i][j][k][3] = rhs[i][j][k][3] + 7: dx4tx1 * (u[i+1][j][k][3] - 2.0*u[i][j][k][3] + u[i-1][j][k][3]) + 8: xxcon2 * (ws[i+1][j][k] - 2.0*ws[i][j][k] + ws[i-1][j][k]) - 9: tx2 * (u[i+1][j][k][3]*up1 - u[i-1][j][k][3]*um1);

Lovely, isn’t it! The code is executed in a loop nest with the outer loop using iteration variable `i`

, the next using `j`

and the inner loop having iteration variable `k`

. All scalar variables are globals and, although it doesn’t matter to us, everything is a `double`

.

Just by looking at the code, it’s obvious that there are similarities between the two statements, although they are not identical and so vectorisation isn’t trivial. Variables `um1`

and `up1`

are local to the function; all others are global. Also, because we extracted this kernel from the actual benchmark, we didn’t include the code that initialises any of the arrays.

The SLP graph created for these two statements follows. It’s just the same as figure 11(a) from our paper, but I’ve reproduced it here to save flicking back and forth between this page and that.

Going through the nodes, this is how they map to the code above.

- The vector subtract of lines 4 and 9 from their predecesors;
- The vector add of lines 3 and 8 with their predecessors;
- The vector add of lines 2 and 7 with their predecessors;
- The vector load of
`rhs[i][j][k][2]`

and`rhs[i][j][k][3]`

; - The vector multiplication of
`dx3tx1`

with the rest of line 2 and`dx4tx1`

with the rest of line 7; - For some reason, there is no node 6;
- These are scalar loads of globals
`dx3tx1`

and`dx4tx1`

, which are non-contiguous in memory, so can’t be vectorised; - The vector multiplication of
`xxcon2`

with everything else on lines 3 and 8; - The vector add within the brackets in lines 3 and 8;
- The scalar loads of
`vs[i-1]][j][k]`

and`ws[i-1][j][k]`

which are not contiguous in memory; - The vector subtract within the brackets in lines 3 and 8;
- The scalar loads of
`vs[i+1][j][k]`

and`ws[i+1][j][k]`

which are, again, non-contiguous; - The vector multiplication of
`2.0`

with`vs[i][j][k]`

and`ws[i][j][k]`

in lines 3 and 8; - Again, no node 14;
- The actual scalar loads of
`vs[i][j][k]`

and`ws[i][j][k]`

, it’s obvious that these are non-contiguous; - The scalar load of
`xxcon2`

, which is sent to all SIMD lanes; - The multiplication of
`tx2`

with everything in brackets on lines 4 and 9.

Obviously node 0 is the final vector store back to `rhs[i][j][k][2]`

and `rhs[i][j][k][3]`

and the seed for starting vectorisation. We think the compiler has also noticed that the array `u`

is not initialised, so lines 4 and 9 are identical.

This vectorised code results in 10 vector instructions (nodes 0-5, 8, 9, 11, 13), 10 scalar instructions (nodes 7, 10, 12, 15-17), and 10 gather instructions (1 for each of the scalar nodes; where there are two lines out of a scalar, the value is broadcast to all vector lanes, using 1 instruction). The cost model used by SLP in our case assigns a cost of 1 for each of these, leading to a total vector cost of 30. If all the code had been kept as scalar, then it would have also had a cost of 30 (each vector instruction replaces 2 scalar and there would be no gathers, so the cost is 2 * 10 + 10). This leads to no benefits from vectorisation, so the code is left as scalar.

However, looking at the graph it is easy to see that the left-hand side is vector-heavy, whereas the right-hand side has a lot of nasty red scalar instructions. These are obviously pushing up the costs of the graph, since they each have a cost of 2: 1 for the instruction and 1 for the gather we then require. We would love to remove them and see whether vectorisation is then profitable. Of course, this is what TSLP is all about.

TSLP goes through the graph considering the cost of all connected subgraphs that include the root. The following graph is the one it finds with the minimum cost, and therefore ultimately the one that is vectorised.

In this graph, the whole of the upper right-hand side, from node 8 onwards, has been kept as scalar. This means that it’s really only lines 1 & 2, and 6 & 7 that get vectorised, along with the additions and subtractions at the end of each line. The vector cost for this graph is 16 (from 6 vector instructions, 5 scalar instructions and 5 gathers), whereas the scalar cost is 17, a saving of 1. The code is inside 3 loops, each of has 10 iterations, meaning a cost saving of 1000 across the whole function, without accounting for any other code that is vectorised too. Although the amount of vector code is smaller, TSLP has ensured that the final graph is the most profitable and performance is higher than with either scalar code only or SLP vectorisation alone.

## Comments are closed.

What’s the compile-time impact of exploring those subgraps?

It’s small – SLP typically has an overhead of about 1% over O3, TSLP is then a further 1% over that in the general case.

This is very interesting! I’ve read your paper and have a couple of questions:

1) In Figure 5, the vectorized version of shift-LRcorrection program performs worse than the scalar version, allegedly due to an inaccurate cost model. However, as both SLP and TSLP operate on the same cost model, I was surprised to find that the TSLP-generated code performed worse than that of SLP. Could you please provide some details of why that happened?

2) In the paper you argue that static costs do not correspond to actual performance, one reason being that code coverage is not taken into account. Do you not use the execution frequency of the basic block, as this would give at least a hint on how often the code would be executed in relation to the rest of the program?

3) I am also curious on how LLVM derives the copy overhead costs as the SLP vectorizer is executed in opt, which operates on the LLVM IR code and thus should be target-independent. However, as we both know the overhead of using SIMD instructions is very hardware-dependent, so does it mean the SLP vectorizer has to be provided with the designated target machine – and thus make use of Tablegen – even though it runs inside opt?

You might be interested in a paper on instruction selection that we recently published, where we touch upon exactly this phenomenon where excessive use of SIMD instructions could in fact degrade performance due to the copying overhead. In our approach we use constraint programming for selecting instructions and propose a novel program and instruction representation for handling a wider range of instructions, including SIMD instructions. The paper, Modeling Universal Instruction Selection, is published by Springer but is also freely available on my website: http://gabriel.hjort.blindell.se/publications/

Thanks for your interest, Gabriel, I’ll take a look at your paper. The answers to your questions are: