Dirk Gregorius wrote:
if the GPUs are so fast, couldn't it be that they use a Jacobi-like solver, which is by its very nature parallel, as opposed to Gauss-Seidel?
The difference between GS and Jacobi is that GS directly changes the state of the rigid bodies as it goes, while Jacobi only changes the state after a complete sweep over the constraints. Jacobi has terrible convergence. But if I take your argument from a few posts back: with 48 "CPUs" in parallel we could blindly try some more iterations, right?
OK - that makes a lot of sense. If the GPU can do Jacobi many, many times faster than the CPU could do GS, then maybe you can do enough iterations to make it worthwhile. However, the idea here is to use the GPU to handle more objects - if we burned too much performance compensating for Jacobi's poor convergence, it might not provide any net benefit.
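To make the difference concrete, here's a toy C++ sketch of the two update schemes (a made-up 1D relaxation problem, not anybody's real solver). Gauss-Seidel reads values already updated in the current sweep; Jacobi double-buffers and commits only after the full sweep - which is exactly what makes every element independent, and hence parallel:

    #include <cstdio>
    #include <vector>

    // Toy 1D system: x[i] relaxes toward the average of its
    // neighbours (think of a chain of constraints between bodies).

    // Gauss-Seidel: writes in place, so later elements in the sweep
    // already see the results of earlier ones.  Inherently sequential.
    void gauss_seidel(std::vector<double>& x) {
        for (size_t i = 1; i + 1 < x.size(); ++i)
            x[i] = 0.5 * (x[i-1] + x[i+1]);   // x[i-1] was updated THIS sweep
    }

    // Jacobi: reads only the old state, writes a fresh copy, commits
    // at the end of the sweep.  Every element could run in parallel.
    void jacobi(std::vector<double>& x) {
        std::vector<double> next(x);
        for (size_t i = 1; i + 1 < x.size(); ++i)
            next[i] = 0.5 * (x[i-1] + x[i+1]); // only last sweep's values
        x.swap(next);
    }

    int main() {
        std::vector<double> a = {0, 9, 9, 9, 9, 0}, b = a;
        for (int it = 0; it < 10; ++it) { gauss_seidel(a); jacobi(b); }
        for (double v : a) printf("%5.2f ", v); printf(" (Gauss-Seidel)\n");
        for (double v : b) printf("%5.2f ", v); printf(" (Jacobi)\n");
    }

Run it and Gauss-Seidel pulls toward the solution noticeably faster for the same iteration count - that's the convergence gap we'd be trying to buy back with raw parallelism.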
So GS works sequentially through a chain of interactions or something? Something like "look at object A - fix up its motion - now look at what that fixing-up did to objects B, C & D"? If that's it, then some parallelism might still be possible if we ran GS in parallel on a bunch of objects that are so far apart that they can't possibly interact with each other?
The idea being to (say) split the world into a spatial grid, then run GS in parallel by picking one object from every alternate cell and running all of those at once. Since we'd know for sure that those objects could never directly interact, would that be enough to let us run them in parallel? I can imagine situations where that might not work - say a long row of boxes, all touching each other: we push on a box at one end, but the consequences of that push traverse multiple cells in my hypothetical grid.
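Something like this, in hand-wavy C++ (the batch-building step is the hard part and is simply assumed here - it amounts to colouring the contact graph so that no two constraints in a batch share a body):

    #include <vector>

    struct Body    { float vel; };
    struct Contact { int a, b; };   // indices of the two bodies involved

    // One GS-style update for one contact - the actual physics is
    // hand-waved down to "split the difference".
    void solve(std::vector<Body>& bodies, const Contact& c) {
        float avg = 0.5f * (bodies[c.a].vel + bodies[c.b].vel);
        bodies[c.a].vel = avg;
        bodies[c.b].vel = avg;
    }

    // Batches are pre-built so no two contacts in the same batch touch
    // the same body (via the alternate-grid-cell trick, or proper
    // graph colouring).
    void solve_iteration(std::vector<Body>& bodies,
                         const std::vector<std::vector<Contact>>& batches) {
        for (const auto& batch : batches) {
            // Everything inside one batch is independent, so this
            // inner loop is the part you could fan out across processors.
            for (const Contact& c : batch)
                solve(bodies, c);
            // Implicit barrier here: the next batch sees updated state,
            // which preserves the Gauss-Seidel flavour between batches.
        }
    }

The row-of-boxes worry then translates into "the push only propagates one batch per barrier" - you don't lose correctness, just some of the within-sweep propagation you'd get from a well-ordered sequential sweep.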
Incidentally - when I talk of 48 CPUs in parallel, it's important to remember that these are NOT general-purpose processors (although the programming 'model' makes it look like they are). There are some things to bear in mind:
1) 48 is the number for the high-end nVidia products - the lower end of the range has (IIRC) just 16 processors. This number will go up over time. One of the reasons GPU performance growth so *FAR* outstrips CPU performance growth is that the GPU designers can just replicate more shader processors as technology shrinks transistor sizes, and we get a more or less linear improvement in horsepower without any messy software changes.
2) Each processor has four parallel floating-point units, so it can do operations on four-component vectors in a single cycle. There are some cunning tricks that can be used to speed things up - e.g. in a loop where some operation on a 'float' happens N times, you may find it better to run the loop N/4 times, doing the operation four ways in parallel, then combine the results into a single float at the end (there's a scalar sketch of this after the list). We also have stuff like four-way conditional testing resulting in four-way booleans.
3) The processors have very specialised instruction sets - things like calculating the length of a vector using Pythagoras take a single clock tick, cross and dot products are single-cycle instructions, and matrix multiplication is accelerated. On the other hand, they are useless at dealing with complex data structures - they don't implement pointers or anything like that. Most current systems don't implement arrays other than 1-, 2-, 3- or 4-element 'vectors' and 2x2, 3x3 or 4x4 matrices.
4) This is a SIMD (Single-Instruction, Multiple Data) architecture. All of those processors run in utter lock-step through their programs. Thus (for example), if you have:
if ( this ) do_that ; else do_something_else ;
...then all of the processors execute BOTH 'do_that' and 'do_something_else' - but the ones that decided not to 'do_that' write-protect their memory while 'do_that' is being executed, whilst the ones that DID 'do_that' write-protect while executing 'do_something_else' (modelled in code after the list). This means that attempting to use conditional code to save processing time is pointless. A similar problem exists with loops. Most current GPUs don't actually have looping instructions, so the compiler has to unroll loops at compile time (which requires fixed termination conditions). Even the ones that do implement proper loops have all of the processors executing the loop as many times as the worst-case processor. This has severe consequences for how you program.
5) When we do these parallel operations, the 'per object' data comes in as textures and the 'per object' results go out as a single texture. Since we can only read about 16 textures into the shader and write only one texture on the output, we have to chop the algorithms up into really simple steps (see the multi-pass sketch after the list). Fortunately, each texture can contain 1, 2, 3 or 4 numbers per object.
6) Setup costs are horrendous. Whilst there may be 'only' 48 processors, it's very inefficient to do things in 48-way-parallel 'chunks'. Doing things ten-thousand-way parallel is much faster because you avoid repeatedly paying those setup costs (some made-up numbers after the list). A better mental model is to imagine you have a fairly slow 10,000-processor machine rather than a blindingly fast 48-processor machine.
7) There is no persistent memory inside the processors - you have to save your results into a single 4-component texture. There is also *almost* no communication between processors. (You can ask what the rate of change of one of your variables is across 'neighbouring' processors - but that's about it.)
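To illustrate point 2, here's the N/4 loop trick in plain scalar C++ (on the GPU the four accumulators would live in one 4-component vector, so the loop body becomes a single vector add):

    #include <cstdio>

    // Sum N floats.  Naive version: one float, N iterations.
    float sum_naive(const float* v, int n) {
        float s = 0;
        for (int i = 0; i < n; ++i) s += v[i];
        return s;
    }

    // 4-wide version: N/4 iterations with four running totals,
    // combined into a single float at the end.  Assumes n is a
    // multiple of 4 to keep the sketch short.
    float sum_4wide(const float* v, int n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += v[i+0]; s1 += v[i+1];
            s2 += v[i+2]; s3 += v[i+3];
        }
        return (s0 + s1) + (s2 + s3);   // combine the four lanes once
    }

    int main() {
        float v[8] = {1,2,3,4,5,6,7,8};
        printf("%g %g\n", sum_naive(v, 8), sum_4wide(v, 8));
    }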
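Point 4 in code form - this is just a model of what the hardware does, written in C++: you pay for both branches every time, and the condition merely selects which result survives:

    // What you write in the shader:
    //     if (cond) x = do_that(); else x = do_something_else();
    //
    // What the SIMD hardware effectively does - BOTH sides run, and
    // the condition only chooses which result is allowed to be written:
    float select_both(bool cond, float a_in, float b_in) {
        float that  = a_in * 2.0f;     // stand-in for do_that()
        float other = b_in + 1.0f;     // stand-in for do_something_else()
        return cond ? that : other;    // masked write, not a real branch
    }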
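And a sketch of the point-5 programming model in CPU-side C++ (all names are invented; a 'texture' here is just an array with up to four floats per object):

    #include <vector>

    // Up to four numbers per object per texture.
    struct Texel { float x, y, z, w; };
    using Texture = std::vector<Texel>;

    // One GPU pass: reads a handful of input textures (at most ~16),
    // writes exactly ONE output texture.  Each 'i' is one processor's
    // worth of work - no cross-object communication inside the pass.
    Texture integrate_pass(const Texture& pos, const Texture& vel, float dt) {
        Texture out(pos.size());
        for (size_t i = 0; i < pos.size(); ++i) {
            out[i].x = pos[i].x + vel[i].x * dt;
            out[i].y = pos[i].y + vel[i].y * dt;
            out[i].z = pos[i].z + vel[i].z * dt;
            out[i].w = 0;
        }
        return out;   // becomes an input texture for the next pass
    }

    // A whole algorithm ends up as a chain of such passes, e.g.:
    //   newVel = applyForces_pass(vel, forces, dt);
    //   newPos = integrate_pass(pos, newVel, dt);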
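Finally, some numbers for point 6 - invented purely to show the shape of the curve, not measured on any real card:

    #include <cstdio>

    int main() {
        const double setup_us    = 100.0;   // made-up fixed cost per pass
        const double per_elem_us = 0.01;    // made-up cost per element
        for (int n : {48, 1000, 10000}) {
            double total = setup_us + n * per_elem_us;
            printf("batch %5d: %7.2f us total, %6.3f us/element\n",
                   n, total, total / n);
        }
    }

With figures like these, at a batch of 48 the setup utterly dominates the per-element cost; at ten thousand it's noise - hence the "slow 10,000-processor machine" mental model.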