Thursday, July 31, 2008

Time passes

Lots of applications incorporate animation. In fact, I would say that animation and interactivity are the two hardest things to add well to any application. I don't intend to address interactivity as I really don't know that much about it. So then we are left with time.

What exactly is animation? By animation, I generally mean interpolation, not frame-by-frame playback. I mean that you have two values, and an equation that takes you from the first value to the second based on a third, independent variable.
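In code, the simplest version of that equation is plain linear interpolation. A minimal sketch:

    // Linear interpolation: two values and an equation that takes you
    // from 'a' to 'b' based on a third, independent variable 't' in [0, 1].
    float lerp(float a, float b, float t) {
        return a + (b - a) * t;
    }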

This is a generally very useful and beautiful idea. Something really cool that Maya lets you do is add a property that controls a set of other properties through interpolation. Let's say you have an animated face. You pose it so it isn't smiling and take a snapshot of your data (this snapshot is usually called a keyframe). Now you pose it smiling and take another snapshot. Figure out the differences between the two of them and add a property such that when the property is 0 the face isn't smiling and when it is 1 the face is smiling.
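A rough sketch of what that might look like under the covers; the FaceSnapshot type and the names here are mine, not Maya's:

    #include <vector>

    // A snapshot (keyframe) of every animatable value on the face.
    struct FaceSnapshot {
        std::vector<float> values;  // one entry per facial control
    };

    // smile = 0 gives the neutral snapshot, smile = 1 the smiling one,
    // and everything in between is interpolated per component.
    FaceSnapshot evaluate(const FaceSnapshot& neutral,
                          const FaceSnapshot& smiling,
                          float smile) {
        FaceSnapshot out;
        out.values.resize(neutral.values.size());
        for (size_t i = 0; i < neutral.values.size(); ++i)
            out.values[i] = neutral.values[i]
                          + (smiling.values[i] - neutral.values[i]) * smile;
        return out;
    }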

Crazy and simple; but allowing users to do this can *greatly* simplify a lot of tasks. Because the next thing an artist will do is add a set of properties that describe a set of facial expressions and then play with all of them at once to see what happens. They then get completely bizarre output that is often delightful: expressions on characters where no one really understands either what the expression means *or* what is mathematically going on underneath the covers.

So there we have something *animating* due to user input. Now let's say you have a clock that continually increments with time. You patch this value into the facial animation engine and all of a sudden the person may look like they are laughing (assuming the value is modded by its range). But that brings up another topic: what sorts of transformations can you do on the input stream to get interesting behavior?

What if you mod the clock by twice the input range, and when the time falls in the upper half you run it backwards, while in the lower half you run it forwards? This is called ping-pong, and it makes the animation bounce back and forth between the interpolation endpoints, like someone dancing or doing facial exercises.
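Both of those input transformations (the simple wrap from the clock example above, and ping-pong) are tiny; a sketch, with names of my own choosing:

    #include <cmath>

    // Wrap the clock into [0, range): the animation loops forever.
    float wrap(float time, float range) {
        return std::fmod(time, range);
    }

    // Ping-pong: mod by twice the range, then run the upper half backwards.
    float pingPong(float time, float range) {
        float t = std::fmod(time, 2.0f * range);
        return (t < range) ? t : 2.0f * range - t;
    }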

What if you multiply the input by a number? That is time scaling: slow motion or fast forward. You can see how managing this input stream *also* gives you a whole range of creativity and interesting effects.

So you have some function that takes input ranging from 0 to 1 and produces output based on keyframes. You get all sorts of interesting properties by controlling this input. Let's say you use a bezier function to control this input range. Then you get bezier animation; except it is normalized, so you can take the same bezier curve and apply it to several inputs. You can merge input streams by using a combination operator like add, subtract, or divide. You can do any amount of crazy input nonsense and really produce some interesting stuff.
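A sketch of such a normalized easing curve; note this is a one-dimensional bezier in the value with the endpoints pinned at 0 and 1, not the two-dimensional kind:

    // Cubic bezier easing on a normalized input t in [0, 1].
    // p1 and p2 shape the curve; the endpoints are pinned to 0 and 1
    // so the same curve can be reused on any input stream.
    float bezierEase(float t, float p1, float p2) {
        float u = 1.0f - t;
        // B(t) = 3u^2*t*p1 + 3u*t^2*p2 + t^3   (with p0 = 0, p3 = 1)
        return 3.0f * u * u * t * p1
             + 3.0f * u * t * t * p2
             + t * t * t;
    }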

You can also set up processing graphs of these inputs. This mimics behavior where a set of animations runs in the *time context* of other animations.

So now let's get back to applications. Lots of applications allow animation. But none (or very few) of them allow you to set up arbitrary processing graphs to experiment with arbitrarily complex and clever animation systems. Breaking animation down into its components really allows you to do some interesting things.

For instance, what if the beginning value *isn't* a keyframe? What if it is based on something else, like another object's position? Then the animated object will animate from wherever that object currently is to a point in space. We called these dynamic keyframes; they are cool; I swear it. They allow you to mix interactivity with animation; without them you run into a lot of situations where you just can't get the object to move around reasonably.
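A sketch of the idea; std::function here just stands in for however the real system would sample a live value:

    #include <functional>

    // A keyframe whose starting value is sampled at evaluation time,
    // e.g. bound to another object's current position.
    struct DynamicKeyframe {
        std::function<float()> sample;
    };

    float evaluate(const DynamicKeyframe& from, float to, float t) {
        float start = from.sample();      // read the live value now
        return start + (to - start) * t;  // then interpolate as usual
    }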

We have a directed acyclic graph of floating point processing routines (at least; presumably other information could flow down this graph along with the floating point values). We have an object that generates a consistently increasing signal, and we have things that will *reset* that signal when told, so it appears to start from zero. We have sets of functions that, given a floating point value, can produce another floating point value. By combining these functions in clever ways we produce sophisticated and somewhat non-obvious results such as character animation. But the point is that I think it would be cool to allow very open manipulation of this processing graph.
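A sketch of what such a graph might bottom out in; all the names here are hypothetical:

    #include <memory>
    #include <vector>

    // A node in the directed acyclic graph: floats flow in, a float flows out.
    struct TimeNode {
        std::vector<std::shared_ptr<TimeNode>> inputs;
        virtual float evaluate() = 0;
        virtual ~TimeNode() = default;
    };

    // Example processing node: scales its single input, giving
    // slow motion or fast forward depending on the factor.
    struct Scale : TimeNode {
        float factor = 1.0f;
        float evaluate() override { return inputs[0]->evaluate() * factor; }
    };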

Chris

Saturday, July 26, 2008

Concurrent Patterns, Part 2

http://citeseer.ist.psu.edu/trinder98algorithm.html

This is more of what I was getting at; although for massively parallel systems the actual examples are all off the mark.

So far I have examined two algorithms; bitonic sort and parallel prefix scan. I have some clarifications on the parallel prefix scan.

You can visualize the first part of the parallel prefix scan as starting with n threads and halving the number of active threads at each stage until you are down to 1 thread. So the algorithm starts with an inverted pyramid of threads.

Second, you run from 1 thread back to n threads; this is a normal pyramid if you picture time increasing downward, with each level having twice as many active threads as the level before it.

So we are all used to the terms gather and scatter. The parallel prefix scan basically maps a certain type of gather operation followed by a specific scatter. I believe the descriptions given in the paper really are not very good; and in any case the algorithm isn't easy to visualize *even* if you attempt to visualize it in terms of gather and scatter. All gather and scatter do is provide the thread management in this case.
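Here is a sequential sketch of the two pyramids (this is the Blelloch-style exclusive scan; on a massively parallel machine each inner-loop iteration would be one thread's work):

    #include <vector>

    // Exclusive prefix scan, in place. Assumes n is a power of two.
    void prefixScan(std::vector<float>& data) {
        size_t n = data.size();
        if (n < 2) return;
        // Up-sweep (the gather): active pairs halve at each stage.
        for (size_t stride = 1; stride < n; stride *= 2)
            for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
                data[i] += data[i - stride];
        // Down-sweep (the scatter): active pairs double at each stage.
        data[n - 1] = 0.0f;
        for (size_t stride = n / 2; stride >= 1; stride /= 2)
            for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
                float left = data[i - stride];
                data[i - stride] = data[i];
                data[i] += left;
            }
    }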

Next up is the bitonic sort. That is an algorithm that sorts somewhat efficiently (not that I know a better one for this setting) on a massively parallel system. It has an interesting property in that it takes a fixed number of stages regardless of the condition of the input. This algorithm is somewhat difficult to visualize, and the descriptions I see on the web are generally really bad. Luckily I bought a book on massively parallel systems a long time ago and it contains a really, really good description of the algorithm.

This page, however, contains some pretty darn cool visualizations of it working:
http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm

Anyway, if you visualize the algorithm correctly, which is exactly what a parallelization strategy is supposed to provide for you, then it isn't quite so darn tough to reason about. Bitonic sort is very cool in that it builds sequences that increase and then decrease (called bitonic sequences) out of unordered data. Each stage-grouping takes in sequences twice the length of the grouping before it, and it requires log(n) stage-groupings to work correctly.
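A sequential sketch of the compare-exchange network (each inner-loop iteration is what one thread would do; assumes a power-of-two input size):

    #include <algorithm>
    #include <vector>

    // Bitonic sort: log(n) stage-groupings, each merging bitonic
    // sequences twice as long as the previous grouping produced.
    void bitonicSort(std::vector<int>& a) {
        size_t n = a.size();  // must be a power of two
        for (size_t k = 2; k <= n; k *= 2)           // stage-grouping
            for (size_t j = k / 2; j >= 1; j /= 2)   // stages within it
                for (size_t i = 0; i < n; ++i) {     // one "thread" each
                    size_t partner = i ^ j;
                    if (partner > i) {
                        bool ascending = (i & k) == 0;
                        bool outOfOrder = ascending ? a[i] > a[partner]
                                                    : a[i] < a[partner];
                        if (outOfOrder) std::swap(a[i], a[partner]);
                    }
                }
    }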

It all comes down to algorithm visualization and then concise implementation. I believe that separating code from parallelization strategy helps in both respects, and the parallelization strategy should be reusable for different problems. Furthermore I believe that a system built to support the use of parallelization strategies could provide much more benefit in terms of debuggability, memory and speed profiling, as well as optimizations, than a system that was not built with such steps in mind.

For instance, let's say I want a parallel max/min operation. I would use the gather (begin with many threads; end with one) parallelization strategy along with a min or max operation. The compilation system could check that I am accessing the shared memory or scratch pad as efficiently as possible and could generate the code to run the operation efficiently, taking into account the fact that you are using the gather pattern.
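Written sequentially so the pattern is visible (each inner-loop iteration is one thread's work), a sketch might look like:

    #include <algorithm>
    #include <vector>

    // The gather pattern: begin with many "threads", halve the active
    // count each stage, end with one. Assumes a power-of-two size.
    float parallelMax(std::vector<float> data) {
        for (size_t active = data.size() / 2; active >= 1; active /= 2)
            for (size_t i = 0; i < active; ++i)  // one thread per i
                data[i] = std::max(data[i], data[i + active]);
        return data[0];
    }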

This is a trivial example, but I guarantee it wouldn't work the way you think it would upon first attempt on a CUDA processor, unless you had already implemented sum; min and max are structurally identical operations to sum and would use the same gather pattern.

Chris

Sunday, July 13, 2008

Ruminations on opengl

I have been rolling around the idea of thinking of rendering a particular item to the screen as executing all of the opengl state calls used for that item. All of them.

Meaning glEnable(GL_BLEND), as well as glDisable(GL_CULL_FACE) as well as glUniform and glUseProgram.

Given all the necessary state to render a given object, you have a large state vector. Now it wouldn't be efficient to actually program like this, but it makes sense, and sets the stage to think about things in an interesting way.

There are two concepts that I believe are important here:

First, given the set of objects you intend to render, you can produce a set of opengl state vectors by assuming you need to set every single opengl state variable for every single item you want to render.

Now you would like to optimize this. Let's say you render a couple of times and record, for each state property, whether it differed from the previous item's value and, if so, how many distinct values you found.

Now you sort the properties of each state vector by the number of distinct values you found for them. Next you sort your entire render list by the state vectors. Now if you walk through the render list in order, you are guaranteed to change the minimal amount of gl state for each item. You could also fold in expense information, since some gl state properties are expensive to set (glUseProgram) and some aren't.
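A sketch of the sort itself; StateVector is a made-up stand-in, with its properties pre-ordered so that the most expensive ones act as the primary sort key:

    #include <algorithm>
    #include <vector>

    // One fully-specified opengl state vector per renderable item,
    // with properties ordered most-expensive-first (e.g. program,
    // then textures, then enables) so expensive state changes least.
    struct StateVector {
        std::vector<int> properties;  // encoded state values
    };

    // Sort the render list lexicographically by state vector; items
    // sharing expensive state end up adjacent, so walking the list in
    // order issues the minimum number of gl state changes.
    void sortRenderList(std::vector<StateVector>& items) {
        std::sort(items.begin(), items.end(),
                  [](const StateVector& a, const StateVector& b) {
                      return a.properties < b.properties;
                  });
    }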

That gives you an extensible way to order your rendering operations on opengl objects in situations where actually changing the state is what slows the program down.


Let's now take a look at exactly what a state vector is, or perhaps rather what it reminds me of.

The other place I have seen a large number of state items was a CSS system for html I wrote once. The result of a successful CSS cascade is a large vector of properties you apply to an html object during the rendering phase.

Perhaps there could be a graphics css system where the result is the opengl state vector you need to render this particular piece of geometry? I need to think about this for a while to see if it goes anywhere, but it is a start.

Chris

Some thoughts about concurrency

A while back, when I found out about F#'s syntax trees, I thought that perhaps you could use them to write CUDA programs, or to target massively parallel systems in general.

In addition, I just spoke with Jeff about high level languages and how they enable certain types of optimizations that lower level programs make excessively difficult to get right.

http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=1

Finally, I have been trying to understand what Monads are and exactly what parser combinators are. I still don't really understand Monads, but I am not too far off. I think if I took a week and really messed around with Haskell, I could figure them out pretty well.

I spoke with one of the guys in charge of CUDA development about F# or rather functional languages and his exact quote was "How does this help c++ programmers trying to program in CUDA?"

I didn't have a great answer to this right off the bat, but I am in the nascent stages of having a pretty good answer.

Check out the parallel scan algorithm description for CUDA processors:
http://beowulf.lcs.mit.edu/18.337/lectslides/scan.pdf

The important thing to notice is that there is a high level description of the algorithm, where basically the author talks about walking your way from the leaves of a binary tree to the root and back. Each node is actually a bit of code, and the level of the tree you are running at is the argument. The depth of the tree tells you how many threads are running at a given time (or whether each thread will do several computations in the case where there are more leaves than threads).

Parallel prefix scan uses a binary numeric operator (+). This operator is easy to parallelize in interesting ways because it is associative (and commutative), which allows you to regroup the computation cleverly. The algorithm would not work with (-) as written, because subtraction is not associative. Anyway, you have a bit of code that takes numeric arguments and that you can prove has various properties.

Then you can separate the operation itself from the application or algorithm context (the weird binary tree meta-algorithm). The algorithm context could be an instance from a set of parallel algorithm contexts; let's call this one binary-apply. So in some ways, if you call (binary-apply +), then you *could* automatically parallelize things.
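As a host-side sketch of the idea, with binary-apply shown as a plain tree reduction (the simplest instance of the pattern; all names are hypothetical):

    #include <functional>
    #include <vector>

    // binary-apply: the tree-shaped meta-algorithm, separated from the
    // operation. The op must be associative for the regrouping to be
    // valid; on a real target each inner iteration maps to a thread.
    float binaryApply(const std::function<float(float, float)>& op,
                      std::vector<float> data) {
        for (size_t active = data.size() / 2; active >= 1; active /= 2)
            for (size_t i = 0; i < active; ++i)
                data[i] = op(data[2 * i], data[2 * i + 1]);  // adjacent pairs
        return data[0];
    }

    // (binary-apply +) gives a parallel sum without touching the pattern:
    // float total = binaryApply([](float a, float b) { return a + b; }, xs);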

Later in the prefix scan paper you get some interesting details about the local memory tricks you need to apply in order to make using a certain type of memory on the GPU efficient. These memory tricks could probably be automated, since all the answer involves is the careful and clever application of an offset to the local memory index for each tree node's reads and writes.

There are a couple things that would be interesting to take from this.

The first is that overall you probably have many different algorithm patterns you would like to apply: matrix multiply is going to use one type of algorithm pattern, bitonic sort probably uses a very different one, and this parallel prefix scan exposes yet another. So in essence, I am describing a set of parallel combinators, much like the parser combinators other authors have spoken about at length.

The second is that given a specific operation (result[n] = result[n-1] + input[n]), you *should* be able to figure out automatic parallelization of the operation, but it *has* to be described in certain terms. The should part makes me think that for some problems (or most problems), this may be a pipe dream and may in fact be provably NP complete. But for those problems where you can do automatic parallelization, it would be nice if there were some system for doing this.

The third is that given something described using a given parallel combinator, you can debug it at a higher level, perhaps graphically, *before* you run it on thousands of pieces of hardware. The language can ensure that reads and writes *cannot* happen to the same local space at the same level of the tree. This level of debuggability of parallel algorithms is another holy grail.

Anyway, given the fact that I see three *very* different algorithm patterns and I can think of a high level description of those algorithm patterns, I believe there is room to say that some further exploration of the idea of parallel combinators makes a lot of sense.

Chris