Saturday, July 26, 2008

Concurrent Patterns, Part 2

http://citeseer.ist.psu.edu/trinder98algorithm.html

This is more of what I was getting at; although for massively parallel systems the actual examples are all off the mark.

So far I have examined two algorithms; bitonic sort and parallel prefix scan. I have some clarifications on the parallel prefix scan.

You can visualize the first part of the parallel prefix scan as starting with n threads and halving the number of active threads at each stage until you have 1 thread. So the algorithm starts with an inverted pyramid of threads.

The second part runs from 1 thread back to n threads; this is a normal pyramid, assuming you picture time increasing downward, with each level having twice as many threads active as the level before it.
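The two pyramids can be sketched sequentially. This is a minimal sketch of the two-phase scan (the "up-sweep then down-sweep" formulation); each inner-loop iteration is what one thread would do in parallel, and it assumes the input length is a power of two:

```python
def prefix_scan(data):
    """Exclusive prefix sum via the two-phase parallel scan pattern."""
    a = list(data)
    n = len(a)

    # Phase 1, the inverted pyramid: n/2 active "threads", then n/4, ... then 1.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            a[i + 2 * d - 1] += a[i + d - 1]
        d *= 2

    # Phase 2, the normal pyramid: 1 active "thread", then 2, ... then n/2.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = a[i + d - 1]
            a[i + d - 1] = a[i + 2 * d - 1]
            a[i + 2 * d - 1] += t
        d //= 2
    return a
```

For example, `prefix_scan([3, 1, 7, 0, 4, 1, 6, 3])` yields `[0, 3, 4, 11, 11, 15, 16, 22]`, the running sum of everything before each position.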

We are all used to the terms gather and scatter. The parallel prefix scan is basically a certain type of gather operation followed by a specific scatter. I believe the descriptions given in the paper really are not very good, and in any case the algorithm isn't easy to visualize *even* if you attempt to visualize it in terms of gather and scatter. All gather and scatter do in this case is provide the thread management.

Next up is the bitonic sort. That is an algorithm that sorts reasonably efficiently on a massively parallel system (not that I know a better one). It has an interesting property: it takes a fixed number of stages regardless of the condition of the input. The algorithm is somewhat difficult to visualize, and the descriptions I see on the web are generally really bad. Luckily I bought a book on massively parallel systems a long time ago, and it contains a really, really good description of the algorithm.

This contains some pretty darn cool visualizations of it working, however:
http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm

Anyway, if you visualize the algorithm correctly, which is exactly what a parallelization strategy is supposed to provide for you, then it isn't quite so darn tough to reason about. Bitonic sort is very cool in that it builds up inversely sorted sequences (increasing/decreasing), called bitonic sequences, from unordered data. Each stage takes in twice the number of elements as the stage before it, and the sort requires log(n) stage-groupings to work correctly.
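Here is a minimal sequential sketch of the sorting network, assuming a power-of-two input length. The outer loop over k is the log(n) stage-groupings; each compare-exchange in the innermost loop is what one thread would perform in parallel:

```python
def bitonic_sort(data):
    """Sort a power-of-two-length list using the bitonic network."""
    a = list(data)
    n = len(a)
    k = 2
    while k <= n:           # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:       # comparator distance within this stage-grouping
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Alternate ascending/descending blocks of size k, which
                    # is what builds the increasing/decreasing subsequences.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Note that the loop structure never inspects the data to decide how many passes to run: the fixed number of stages mentioned above falls straight out of the nested loops over k and j.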

It all comes down to algorithm visualization and then concise implementation. I believe that separating code from parallelization strategy helps in both respects, and the parallelization strategy should be reusable for different problems. Furthermore, I believe that a system built to support the use of parallelization strategies could provide much more benefit in terms of debuggability, memory and speed profiling, and optimization than a system that was not built with such steps in mind.

For instance, let's say I want a parallel max/min operation. I would use the gather parallelization strategy (begin with many threads; end with one) along with a min or max operation. The compilation system could check that I am accessing shared memory or scratchpad as efficiently as possible, and could generate code that runs the operation efficiently by taking into account the fact that I am using the gather pattern.

This is a trivial example, but I guarantee it wouldn't work the way you think it would on a first attempt on a CUDA processor, unless you have already implemented sum: min and max are structurally identical operations to sum and would use the same gather pattern.
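To make the reuse concrete, here is a sketch of the gather strategy factored out from the operation it combines with. The tree reduction is the same inverted pyramid as before; only the combining operator changes, which is the point about min/max and sum being the same pattern. This assumes a power-of-two input length:

```python
def gather_reduce(data, op):
    """Tree reduction: the gather pattern parameterized by an operator."""
    a = list(data)
    n = len(a)
    d = 1
    while d < n:                        # n/2 "threads", then n/4, ... then 1
        for i in range(0, n, 2 * d):
            a[i] = op(a[i], a[i + d])   # each pair combined by one thread
        d *= 2
    return a[0]
```

The same strategy then serves all three operations: `gather_reduce(xs, max)`, `gather_reduce(xs, min)`, and `gather_reduce(xs, lambda x, y: x + y)` for the sum.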

Chris
