Alias Analysis in HELIX

One of the most important parts of our HELIX compiler is the data dependence analysis we run on the compiler’s IR to determine which instructions are independent of each other.  You can read more about HELIX in general in our original CGO 2012 paper (click through my publications page to get free access to the ACM version).

HELIX’s initial data dependence pass is split into two phases, and it’s the memory alias analysis stage that is most interesting.  This has the job of identifying the locations in memory that are read and written by each instruction so that we can respect all data dependences within the loops we parallelise.  Since alias analysis is not precise, we need to be conservative and include dependences that we cannot prove to be false, but we need to keep these false positives to a minimum since they can significantly affect the amount of parallelism available within a loop.

Having considered a number of different algorithms, we decided to base our pass on VLLPA, a pointer analysis for low-level code, developed by David August‘s group and described in his CGO 2005 paper Practical and Accurate Low-Level Pointer Analysis.  What particularly appealed about this work was that it targets simple IR instructions, almost assembly-like, which fit very well with the IR in ILDJIT, the underlying compiler that HELIX is written within.  We used VLLPA as the starting point for our own alias analysis and augmented it with additional information to increase its accuracy.

The analysis works by determining where in memory each pointer can access, then propagating this information around the application.  Each block of memory that is accessed within a procedure is described as an unknown initial value or UIV.  These represent memory as a base and zero or more offsets from that base. For example, [G0]@4@8 describes the memory accessible from global G0 at an offset of 4, then an offset of 8 from that. I.e., add 4 to the pointer contained in G0, load from that and add another 8 to get the final pointer. UIVs are combined with offsets to form abstract addresses which take the form <S, o>, which represents an abstract structures with a structure name S (a UIV) and offset o.

The algorithm contains several loops, iterating until a fixed point with each. At a high level, it builds strongly connected components (SCCs) from the application’s call graph and analyses each, resolving indirect call targets when function pointer information becomes available. This iterates until no more indirect call targets can be resolved. Within each SCC, procedures are analysed to identify the set of abstract addresses accessed by each memory instruction. These are summarised as a transfer function for each procedure that can be propagated down to its callers. Analysing each procedure is one loop, which terminates when the sets of abstract addresses in memory or registers remains fixed. Transferring the sets between procedures within an SCC is another loop, terminating once more when there are no further changes to the sets. The final stage in the algorithm compares the abstract addresses accessed by each instruction and creates dependences between any pair where the intersection of their sets is non-empty.

The main addition we made to the algorithm was the inclusion of semantic information for each library function that is called. In the baseline case, calls to procedures that the compiler can’t analyse must be treated conservatively. Since we are unable to safely make any assumptions about the memory locations read and written, we must assume that all of memory is touched, leading to dependences between these call instructions and every other instruction that accesses memory within the procedure. Adding semantic information for known calls significantly reduces the number of dependences involving these instructions, in some cases eliminating them entirely.

Here’s an example of a loop that operates on some strings, converting them to integers using a call to atoi and summing them all up.  OK, it’s somewhat contrived, but it demonstrates the point I’m trying to make about library calls.

int sum = 0;
for (i = 0; i < MAX; ++i) {
  int val = atoi(str[i]);
  sum += val;
}

HELIX will automatically identify sum as a reduction variable, something it can privatise within each thread and obtain the final value by summing up each thread’s copies at the end of the loop. The loop also has a handy iteration variable, i, which is also an induction variable, meaning that it can also be privatised.

However, the call to atoi causes problems for the alias analysis without additional semantic information. Being conservative, the analysis will say that there is a memory dependence between calls to this function in different iterations, i.e., iteration 1 cannot execute atoi until iteration 0 has, iteration 2 must wait for iteration 1, etc. This is obviously wrong: atoi doesn’t write memory at all, it only reads it, and the array str isn’t written within the loop, so each iteration is actually independent of all others.

To address this, we add in the semantic information for the call to atoi. Instead of saying the read and write sets of abstract addresses contain all memory locations, we make the write set empty and set the read set to <[str], 0> (the algorithm doesn’t distinguish between accesses to different array elements, but uses just the base address).  This removes the conservative memory RAW, WAR and WAW dependences, leaving HELIX free to parallelise the loop without any synchronisation.

Although this additional information is fairly simple, it was surprisingly effective in reducing the number of conservative dependences identified by our pass.  In part, this was due to the propagation of these dependences down the call graph.  As mentioned above, the sets of abstract addresses read and written by each instruction are combined for each procedure to create its transfer function.  When a library call is conservatively labelled as reading and writing the whole of memory, it means that the whole procedure must be also treated in the same way, as well as the caller’s callers.  And so on.  Removing these conservative dependences means that the effects are seen all the way down the call graph, sometimes enabling parallelisation of loops that call these procedures, possibly indirectly, without the need for synchronisation, which improves performance.