Sunday, July 13, 2008

Some thoughts about concurrency

A while back, when I found out about F#'s syntax trees, I thought that perhaps you could use them to generate CUDA programs, or more generally to program massively parallel systems.

In addition, I just spoke with Jeff about high-level languages and how they enable certain kinds of optimizations that are excessively difficult to get right in lower-level languages.

http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=1

Finally, I have been trying to understand what Monads are and exactly what parser combinators are. I still don't really understand Monads, but I am not too far off. I think if I took a week and really messed around with Haskell, I could figure them out pretty well.
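
To make that concrete for myself, here is about the smallest parser combinator I can write in Haskell (a toy of my own, not any real library). The punchline is that the Monad instance is just the plumbing that feeds one parser's leftover input into the next parser:

    import Data.Char (isDigit)

    -- A parser eats some input and may return a value plus the leftover string.
    newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

    instance Functor Parser where
      fmap f (Parser p) = Parser $ \s ->
        fmap (\(a, rest) -> (f a, rest)) (p s)

    instance Applicative Parser where
      pure a = Parser $ \s -> Just (a, s)
      Parser pf <*> Parser pa = Parser $ \s -> do
        (f, s')  <- pf s
        (a, s'') <- pa s'
        Just (f a, s'')

    -- The Monad instance is the whole trick: bind runs one parser, then
    -- feeds its leftover input to the next. That sequencing *is* the monad.
    instance Monad Parser where
      Parser p >>= f = Parser $ \s -> do
        (a, s') <- p s
        runParser (f a) s'

    -- A primitive parser, and a combinator built from it.
    satisfy :: (Char -> Bool) -> Parser Char
    satisfy ok = Parser $ \s -> case s of
      (c:rest) | ok c -> Just (c, rest)
      _               -> Nothing

    digit :: Parser Char
    digit = satisfy isDigit

    -- ghci> runParser digit "42x"
    -- Just ('4',"2x")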

I spoke with one of the guys in charge of CUDA development about F#, or rather functional languages in general, and his exact quote was "How does this help C++ programmers trying to program in CUDA?"

I didn't have a great answer to this right off the bat, but I am in the nascent stages of having a pretty good answer.

Check out the parallel scan algorithm description for CUDA processors:
http://beowulf.lcs.mit.edu/18.337/lectslides/scan.pdf

The important thing to notice is that there is a high-level description of the algorithm: the author basically describes walking your way from the leaves of a binary tree to the root and back. Each node is actually a bit of code, and the level of the tree you are running at is its argument. The depth of the tree tells you how many threads are running at a given time (or how many computations each thread performs, in the case where there are more leaves than threads).
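
Here is a rough Haskell rendering of that tree walk (my own sketch of the idea, not code from the paper; it assumes the input length is a power of two). Each pass over d is one level of the binary tree: the up-sweep combines children into parents, and the down-sweep pushes partial sums back to the leaves, yielding an exclusive scan:

    import Data.Array

    exclusiveScan :: (a -> a -> a) -> a -> [a] -> [a]
    exclusiveScan op ident xs = elems (downSweep (upSweep arr0))
      where
        n    = length xs                -- assumed to be a power of two
        lgN  = length (takeWhile (< n) (iterate (* 2) 1))
        arr0 = listArray (0, n - 1) xs

        -- Up-sweep (leaves to root): each node combines its two children.
        upSweep arr = foldl levelUp arr [0 .. lgN - 1]
        levelUp a d =
          let stride = 2 ^ (d + 1)
              half   = 2 ^ d
          in a // [ (i + stride - 1, op (a ! (i + half - 1)) (a ! (i + stride - 1)))
                  | i <- [0, stride .. n - 1] ]

        -- Down-sweep (root to leaves): clear the root, then at each level
        -- swap the left child up and combine on the way down.
        downSweep arr = foldl levelDown (arr // [(n - 1, ident)])
                              (reverse [0 .. lgN - 1])
        levelDown a d =
          let stride = 2 ^ (d + 1)
              half   = 2 ^ d
          in a // concat
               [ [ (i + half - 1,   a ! (i + stride - 1))
                 , (i + stride - 1, op (a ! (i + half - 1)) (a ! (i + stride - 1))) ]
               | i <- [0, stride .. n - 1] ]

    -- ghci> exclusiveScan (+) 0 [1, 2, 3, 4]
    -- [0,1,3,6]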

Parallel prefix scan uses a binary numeric operator, (+). This operator tends to be very easy to parallelize in interesting ways because it is associative and commutative: associativity lets you regroup the partial reductions across the tree, and commutativity lets you reorder them. The algorithm would not work with (-), since subtraction has neither property. Anyway, you have a bit of code that takes numeric arguments and that you can prove has various properties.
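
A quick way to convince yourself that associativity is the property doing the work: compare a left-to-right fold with a balanced tree fold (toy code, assumes a non-empty list). They agree for (+) and disagree for (-):

    -- Reduce by pairing up neighbors level by level, like the tree walk.
    treeFold :: (a -> a -> a) -> [a] -> a
    treeFold _  [x] = x
    treeFold op xs  = treeFold op (pairUp xs)
      where
        pairUp (a:b:rest) = op a b : pairUp rest
        pairUp rest       = rest

    -- ghci> (foldl1 (+) [1..8], treeFold (+) [1..8])
    -- (36,36)
    -- ghci> (foldl1 (-) [1..8], treeFold (-) [1..8])
    -- (-34,0)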

Then you can separate the operation itself from its application context (the weird binary-tree meta-algorithm). That context could be one instance drawn from a set of parallel algorithm contexts; let's call this one binary-apply. So in some ways, if you call (binary-apply +), then you *could* automatically parallelize things.
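
Here is a sketch of what binary-apply could look like in Haskell, using par and pseq from the parallel package; the name and interface are just the hypothetical from above, and it assumes a non-empty list and an associative operator:

    import Control.Parallel (par, pseq)

    -- The operator is a parameter; the tree-shaped parallel application
    -- is the reusable context. Each split sparks the left half in
    -- parallel while the current thread reduces the right half.
    binaryApply :: (a -> a -> a) -> [a] -> a
    binaryApply _  [x] = x
    binaryApply op xs  = left `par` (right `pseq` op left right)
      where
        (l, r) = splitAt (length xs `div` 2) xs
        left   = binaryApply op l
        right  = binaryApply op r

The same context then works for any associative operator: binaryApply (+), binaryApply max, binaryApply (++), and so on.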

Later in the prefix scan paper you get some interesting details about the local memory tricks you need in order to use a certain type of memory on the GPU efficiently. These memory tricks could probably be automated, since the answer only involves careful and clever application of an offset to the local memory index for each tree node's read and write requests.
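
As a rough illustration of the kind of index rewriting involved (the constants are the 16-bank G80 case, and this mirrors only the simple form of the paper's offset, so treat it as a sketch, not a general rule):

    -- Shared memory on these GPUs is divided into banks (16 on G80-era
    -- hardware); threads hitting the same bank in the same step get
    -- serialized. Padding the index with one extra slot per bank-width
    -- of elements spreads the accesses across banks.
    numBanks :: Int
    numBanks = 16

    conflictFreeIndex :: Int -> Int
    conflictFreeIndex i = i + i `div` numBanks

    -- ghci> map conflictFreeIndex [0, 16, 32, 48]
    -- [0,17,34,51]   (these four all hit bank 0 before the offset)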

There are a few things that would be interesting to take from this.

The first is that overall you probably have many different algorithm patterns you would like to apply: matrix multiply is going to use one type of algorithm pattern, bitonic sort probably uses a very different one, and this parallel prefix scan exposes yet another. So in essence, I am describing a set of parallel combinators, much like the parser combinators other authors have spoken about at length.
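
To sketch what that set might look like as an interface (the names and signatures are hypothetical, not from any existing library, and the bodies are sequential stand-ins; the point is that each combinator fixes one communication pattern and takes the element-level operation as an argument):

    import Data.List (sortBy)

    -- A real implementation would bind each combinator to its own
    -- hardware schedule; these bodies are placeholders for the shape.
    parMap :: (a -> b) -> [a] -> [b]
    parMap = map                    -- independent per-element work

    parScan :: (a -> a -> a) -> a -> [a] -> [a]
    parScan = scanl                 -- the binary tree walk described above

    parSort :: Ord a => [a] -> [a]
    parSort = sortBy compare        -- stand-in for, e.g., a bitonic network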

The second is that given a specific operation (result[n] = result[n-1] + input[n]), you *should* be able to figure out automatic parallelization of the operation, but it *has* to be described in certain terms. The *should* part makes me think that for some problems (or most problems) this may be a pipe dream, and may in fact be provably NP-complete. But for those problems where you can do automatic parallelization, it would be nice if there were some system for doing this.
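
Here is that recurrence written both ways; the second form names the pattern (a scan), which is exactly the kind of description a tool could recognize and parallelize:

    -- The sequential loop, with a dependence carried through acc.
    sequentialForm :: [Int] -> [Int]
    sequentialForm = go 0
      where
        go _   []     = []
        go acc (x:xs) = let r = acc + x in r : go r xs

    -- The same computation, stated as the pattern it is.
    scanForm :: [Int] -> [Int]
    scanForm = scanl1 (+)

    -- ghci> sequentialForm [1,2,3,4] == scanForm [1,2,3,4]
    -- True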

The third is that given something described using a given parallel combinator, you can debug it at a higher level, perhaps graphically, *before* you run it on 1000s of pieces of hardware. The language can ensure that reads and writes *cannot* happen to the same local space at the same level of the tree. This level of debuggability of parallel algorithms is another holy grail.
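
A toy version of that check, using the up-sweep indexing from the earlier sketch: at level d each node reads slot i + 2^d - 1 and writes slot i + 2^(d+1) - 1, and we can verify per level that no node writes a slot some *other* node reads (a node reading its own write slot is fine, since that is a single thread):

    import Data.List (intersect)

    -- True when the level's cross-node reads and writes are disjoint,
    -- i.e. the whole level can run as one parallel step.
    levelSafe :: Int -> Int -> Bool
    levelSafe n d = null (readSlots `intersect` writeSlots)
      where
        stride     = 2 ^ (d + 1)
        starts     = [0, stride .. n - 1]
        readSlots  = [i + 2 ^ d - 1  | i <- starts]
        writeSlots = [i + stride - 1 | i <- starts]

    -- ghci> all (levelSafe 8) [0, 1, 2]
    -- True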

Anyway, given that I see three *very* different algorithm patterns, and that I can come up with a high-level description of each, I believe some further exploration of the idea of parallel combinators makes a lot of sense.

Chris
