hide
Free keywords:
-
Abstract:
Sample sort, a generalization of quicksort that partitions the input
into many pieces, is known as the best practical comparison based
sorting algorithm for distributed memory
parallel computers. We show that sample sort is also useful on a
single processor. The main algorithmic insight is that
element comparisons can be decoupled from expensive
conditional branching using predicated instructions.
This transformation facilitates optimizations like
loop unrolling and software pipelining.
The final implementation, albeit cache efficient,
is limited by a linear number of memory accesses rather than
the $\Oh{n\log n}$ comparisons.
On an Itanium 2 machine, we obtain a speedup of up to \maxspeedup\
over \texttt{std::sort} from the GCC STL library, which is known as one of
the fastest available quicksort implementations.