Physics Simulation Forum

 

All times are UTC




Post new topic Reply to topic  [ 126 posts ]  Go to page Previous  1 ... 4, 5, 6, 7, 8, 9  Next
Author Message
 Post subject:
PostPosted: Thu Sep 21, 2006 7:36 pm 
Offline
Site Admin
User avatar

Joined: Sun Jun 26, 2005 6:43 pm
Posts: 3893
Location: California, USA
Dirk Gregorius wrote:
Erwin,
if GPUs are so fast, couldn't it be that they use a Jacobi-like solver, which by its very nature is parallel, as opposed to Gauss-Seidel?

Perhaps they use Jacobi. Still, even when using Jacobi, the constraints might share rigid bodies, so velocity impulses from different constraints acting on one shared rigid body need to be combined. Combining those impulses (or updated velocities) sounds like a sequential problem.

Steve wrote:
OK - that makes a lot of sense. If the GPU can do Jacobi many, many times faster than the CPU could do GS, then maybe you can do enough iterations to make it worthwhile. However, the idea here is to use the GPU to handle more objects - if we burned too much performance on Jacobi, it might not provide any benefit.
So GS works sequentially through a chain of interactions or something? Something like "look at object A - fix up its motion - now look at what that fixing up did to objects B, C & D"? If that's it, then some parallelism might still be possible if we ran GS in parallel on a bunch of objects that are so far apart that they can't possibly interact with each other?

The idea being to (say) split the world into a spatial grid, then pick one object from every alternate cell and run GS on all of those in parallel. Since we'd know for sure that those objects could never directly interact, would that be enough to let us run them in parallel? I could imagine situations where that might not work - maybe a long row of boxes, all touching each other: we push on a box at one end, but the consequences of that traverse multiple cells in my hypothetical grid.


Steve, the solving method takes contact constraints between two bodies (typically), so solving usually happens on a body-pair basis, not per body.

Also, we must deal with large stacks where all objects are touching directly or indirectly - one connected graph. You can do some clustering so that groups of objects are solved in parallel. Then you write back velocities (merging/averaging the results in some way, or applying impulses) and let the relaxation (the iterative converging process) deal with your approximations over time.


 Post subject:
PostPosted: Thu Sep 21, 2006 9:46 pm 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
Erwin Coumans wrote:
Steve, the solving method takes contact constraints between two bodies (typically), so solving usually happens on a body-pair basis, not per body.

Well, as long as both sets of body data are present in the input texture, the shader can read a pair of body parameters, crunch the math, and write the results out to ONE output. If both bodies share a result, then the simplest thing is likely just to do the math redundantly - once for each output.

Something like:

void main ( int     myBodyIdent,   // Which body are we processing?
            texture allBodyData,   // What is the data for body[N]?
            texture collisionMap ) // Who collided with whom?
{
  int otherBodyIdent = collisionMap ( myBodyIdent ) ;

  if ( otherBodyIdent == NULL )
    write out zero new forces () ;
  else
  {
    myBodyData    = allBodyData [ myBodyIdent ] ;
    otherBodyData = allBodyData [ otherBodyIdent ] ;

    write out ( computeForcesBetween ( myBodyData, otherBodyData ) ) ;
  }
}

Quote:
Also, we must deal with large stacks where all objects are touching directly or indirectly, so one connected graph. You can make some clustering where you can solve groups of objects in parallel. Then you write back velocities (merge/average results in some way (or apply impulses), and let the relaxation (iterative converging process) deal with your approximations over time.


That sounds OK.

I would imagine things going something like this:

1) Update accelerations from forces, velocities from accelerations, and positions from velocities. All of those things are now sitting around in texture maps with one element per rigid body or something.
2) Do AABB-versus-AABB collision detection and end up with one or more textures representing which objects collided with which. Again, one texture element per object - containing a short list of the things that collided with it. So in the texture, for each object, there is a pixel in the map that contains the identifiers of each other object that collided with this one. I call this the 'collisionMap'. The 'width' of the collision map would be limited to holding maybe the first four or eight (at a stretch) objects that collided with this one...so you might need more than one map - which means more than one O(N) pass through the collision detector. This is what I imagine 'broadphase' looks like.
3) Do more accurate collision detection on just the objects that passed (2), and extract contact points (again, into some kind of texture representation).
4) Process the first, second, third...Nth collision for all objects in parallel until done - producing forces that go back into (1) on the next iteration.

I believe that I completely understand (1) - that's essentially my demo. I think I see how to do (2) as an O(N) process - (3) is a bit of a mystery, and (4) is an O(M) process where M is the most collisions a single object underwent...but I have no clue as to what that entails either.
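To pin down the data layout in step (2), here's a CPU sketch of how such a fixed-width collisionMap might be filled, and of when the extra passes become necessary. The names and the four-slot layout are mine, purely illustrative:

```cpp
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// One 'collisionMap' texel: slots for up to four collider ids per object,
// -1 meaning an empty slot.
constexpr int kMapWidth = 4;
struct CollisionTexel { std::array<int, kMapWidth> other { { -1, -1, -1, -1 } }; };

// Build one map from broadphase pairs. Returns how many objects overflowed
// their four slots -- each overflowed object would need a further map,
// i.e. another O(N) pass through the collision detector.
int buildCollisionMap (std::size_t numObjects,
                       const std::vector<std::pair<int,int>>& pairs,
                       std::vector<CollisionTexel>& map)
{
  map.assign (numObjects, CollisionTexel{});
  std::vector<int> count (numObjects, 0);

  auto add = [&](int self, int other)
  {
    if (count[self] < kMapWidth)
      map[self].other[count[self]] = other;
    ++count[self];                      // keep counting past the limit
  };

  // each pair is recorded in both objects' texels
  for (const auto& p : pairs) { add (p.first, p.second); add (p.second, p.first); }

  int overflowed = 0;
  for (int c : count)
    if (c > kMapWidth) ++overflowed;
  return overflowed;
}
```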


 Post subject:
PostPosted: Thu Sep 21, 2006 10:03 pm 
Offline
Site Admin
User avatar

Joined: Sun Jun 26, 2005 6:43 pm
Posts: 3893
Location: California, USA
Quote:
if ( otherBodyIdent == NULL )
write out zero new forces () ;

If one of the objects is non-null, you still need to calculate some impulse. Bullet uses sequential impulses - solving (contact and other) constraints on the velocity level.
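For illustration, here is a stripped-down CPU sketch of one sequential-impulse contact solve on the velocity level - reduced to 1D motion along the contact normal, no friction or restitution. The names are mine, not Bullet's actual code:

```cpp
// One contact between two bodies, reduced to the contact normal direction.
// The accumulated-impulse clamp is the standard sequential-impulse trick:
// the *total* normal impulse stays non-negative, while an individual
// iteration may push back to undo an earlier over-correction.
struct Body1D { double invMass; double v; };

struct Contact1D
{
  Body1D* a; Body1D* b;        // normal points from a to b
  double accumulated = 0.0;

  void solve ()
  {
    double relVel  = b->v - a->v;                    // along the normal
    double effMass = 1.0 / (a->invMass + b->invMass);
    double j       = -relVel * effMass;              // impulse to stop approach

    double old  = accumulated;
    accumulated = (old + j > 0.0) ? old + j : 0.0;   // clamp the total
    j           = accumulated - old;                 // apply only the delta

    a->v -= j * a->invMass;
    b->v += j * b->invMass;
  }
};
```

Calling solve() repeatedly over all contacts is exactly the Gauss-Seidel-style sweep discussed earlier.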

SteveBaker wrote:
I would imagine things going something like this:

1) Update accelerations from forces, velocities from accelerations, and positions from velocities. All of those things are now sitting around in texture maps with one element per rigid body or something.
2) Do AABB-versus-AABB collision detection and end up with one or more textures representing which objects collided with which. Again, one texture element per object - containing a short list of the things that collided with it. So in the texture, for each object, there is a pixel in the map that contains the identifiers of each other object that collided with this one. I call this the 'collisionMap'. The 'width' of the collision map would be limited to holding maybe the first four or eight (at a stretch) objects that collided with this one...so you might need more than one map - which means more than one O(N) pass through the collision detector. This is what I imagine 'broadphase' looks like.
3) Do more accurate collision detection on just the objects that passed (2), and extract contact points (again, into some kind of texture representation).
4) Process the first, second, third...Nth collision for all objects in parallel until done - producing forces that go back into (1) on the next iteration.

I believe that I completely understand (1) - that's essentially my demo. I think I see how to do (2) as an O(N) process - (3) is a bit of a mystery, and (4) is an O(M) process where M is the most collisions a single object underwent...but I have no clue as to what that entails either.


To solve (3), can you please add sphere support to the demo? Let's try to add the sphere-sphere case first. I know you want more complexity, so once that is working, we'll put the cubes back in.
I would like to see the full pipeline in action first ;-)

The trivial sphere-sphere code inside Bullet is here:
http://svn.sourceforge.net/viewvc/bulle ... iew=markup

With respect to solving the actual constraint: for spheres that is also extremely simple - we can ignore the moment of inertia and angular effects to start with (no friction). Once this works, you/we can add the more interesting bits. Those bits are in the Bullet repository:

http://svn.sourceforge.net/viewvc/bulle ... iew=markup
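In the meantime, here is roughly what the sphere-sphere test plus the simplified solve (no inertia, no friction) boil down to, sketched in plain C++. This is illustrative only, not the Bullet code at those links:

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };
static Vec3   vsub (Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static double vdot (Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Sphere { Vec3 pos; Vec3 vel; double radius; double invMass; };

// Narrowphase: two spheres touch iff the distance between centers is less
// than the sum of radii; the contact normal is the normalized center line.
// (Assumes the centers don't coincide.)
bool sphereSphereContact (const Sphere& a, const Sphere& b,
                          Vec3& normal, double& depth)
{
  Vec3 d = vsub (b.pos, a.pos);
  double dist2 = vdot (d, d);
  double r = a.radius + b.radius;
  if (dist2 >= r * r) return false;
  double dist = std::sqrt (dist2);
  normal = { d.x / dist, d.y / dist, d.z / dist };
  depth  = r - dist;
  return true;
}

// Constraint solve with inertia, angular effects and friction all left
// out, as suggested: just kill the approaching part of the relative
// velocity along the normal.
void resolveContact (Sphere& a, Sphere& b, const Vec3& normal)
{
  double relVel = vdot (vsub (b.vel, a.vel), normal);
  if (relVel >= 0.0) return;                         // already separating
  double j = -relVel / (a.invMass + b.invMass);
  a.vel = { a.vel.x - j * a.invMass * normal.x,
            a.vel.y - j * a.invMass * normal.y,
            a.vel.z - j * a.invMass * normal.z };
  b.vel = { b.vel.x + j * b.invMass * normal.x,
            b.vel.y + j * b.invMass * normal.y,
            b.vel.z + j * b.invMass * normal.z };
}
```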

I can add the sphere-box case, but we can leave the floor/infinite ground plane for prototyping.

You mention you want to try out a GPU broadphase, so please go ahead :) But if you lose too much time on step 2, it would be better to build the broadphase information on the CPU and upload the texture at the start of each frame.

Is that ok?
Erwin

Bullet goes GPU, digg this


 Post subject:
PostPosted: Fri Sep 22, 2006 1:21 pm 
Offline

Joined: Fri Mar 31, 2006 7:13 pm
Posts: 95
Nvidia's G80 is on the way and, as far as I know, will be available in 2006.
ATI's next-gen GPU will probably be available soon too.
Shader Model 4.0 seems to be much better for physics calculations than SM 3.0.
Does anyone know the status of OpenGL and GLSL support for it?


About broadphase on the GPU: in my opinion this must be possible, but even if GPUs are much faster than CPUs, brute force will still be too slow for 16000 objects.
Maybe the ideas from CULLIDE can be used? I'm not sure:
http://gamma.cs.unc.edu/QCULLIDE/


 Post subject:
PostPosted: Fri Sep 22, 2006 2:56 pm 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
DevO wrote:
Nvidia's G80 is on the way and, as far as I know, will be available in 2006.
ATI's next-gen GPU will probably be available soon too.
Shader Model 4.0 seems to be much better for physics calculations than SM 3.0.
Does anyone know the status of OpenGL and GLSL support for it?

The difference between the way OpenGL and GLSL operate compared to DirectX is that they tend to specify the API ahead of where the hardware currently is. So there are lots of GLSL and OpenGL features that don't really work all that well on present hardware - when the next-gen hardware appears, OpenGL and GLSL will already have the features needed to drive it. Also, OpenGL has a comprehensive extension mechanism that allows individual manufacturers to add features to the API to suit their own hardware. Each revision of DirectX is pretty much a statement of where the hardware is right now.

At any rate, the features of OpenGL/GLSL right now are plenty good enough for doing fast physics - the problem we've been having is that only my particular hardware/software setup appears to actually support them! The current generation of nVidia hardware (7xxx) - or even last year's cutting-edge hardware (68xx) - even with present-gen APIs is quite capable of doing GPU physics. ATI hardware (or maybe just their drivers) appears to be lagging in one key area (vertex shader texturing) - but I'm sure they'll catch up - and there are workarounds. 5xxx-series nVidia hardware definitely can't do it though - it doesn't support enough fragment shader facilities - and its support for floating-point textures and FBOs is patchy at best.

Quote:
About broadphase on the GPU: in my opinion this must be possible, but even if GPUs are much faster than CPUs, brute force will still be too slow for 16000 objects.
Maybe the ideas from CULLIDE can be used? I'm not sure:
http://gamma.cs.unc.edu/QCULLIDE/


I'll check it out - thanks for the link.


 Post subject:
PostPosted: Fri Sep 22, 2006 3:24 pm 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
Hmm - it's going to take a while to fully understand that paper. What's interesting is that they are using 5950 hardware - which doesn't even support the features I think we need. It's slower than I'd hope - but on modern hardware, that wouldn't be so bad.

One idea they use (making use of the 'occlusion test' feature) has the potential to speed up the algorithm I have in mind for collision detection.

My problem has been that I want to run 16,000 collision queries through the GPU - but I only have limited space to store the results. If there are too many collisions against a single object - I have no place to store the consequences of that in a single output texture. To fix that, I'd have to run a bunch of collision tests - then read back the resulting texture and examine it in the CPU to see if any objects had more than (say) 4 collisions - and rerun the overflowed tests if there were. But reading back and examining that texture would slow the algorithm down considerably.

What the paper suggests is to use the GL_NV_occlusion_query extension, which allows you to have the GPU count the number of pixels that passed the Z-depth test. I could theoretically use that to count the number of objects whose list of collisions has overflowed - so the CPU could tell whether or not it needs to do another pass without having to read back the full results texture. Grouping the collision tests into batches of a few hundred at a time and testing the occlusion query result after each batch would be pretty efficient, I think.
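The pass-counting logic the query would drive is simple enough to sketch on the CPU. Here the per-object collider counts stand in for what only the GPU would know, and the 'overflowed' number is what the occlusion query would report back; names and the loop shape are made up, just to illustrate the idea:

```cpp
#include <vector>

// Decide how many O(N) passes are needed so that every object's collider
// list fits into mapWidth slots per pass. On the GPU, 'overflowed' would
// come back from the occlusion query as a single pixel count, so the CPU
// never reads the result texture itself.
int passesNeeded (const std::vector<int>& collidersPerObject, int mapWidth)
{
  int passes = 0;
  int overflowed;
  do
  {
    ++passes;
    overflowed = 0;                  // what the occlusion query reports
    for (int c : collidersPerObject)
      if (c > passes * mapWidth)     // colliders not yet stored anywhere
        ++overflowed;
  } while (overflowed > 0);
  return passes;
}
```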

Anyway - I need to finish reading the paper. It's good stuff.


 Post subject:
PostPosted: Fri Sep 22, 2006 11:36 pm 
Offline
Site Admin
User avatar

Joined: Sun Jun 26, 2005 6:43 pm
Posts: 3893
Location: California, USA
I still hope we can use the Bullet broadphase first: create a 2D bitmap that encodes overlap between objects, and upload that to GPU memory.

Focus on getting some narrowphase working first, and then potentially optimize the broadphase for the GPU. ATI and Nvidia both got very good results with such an approach.


 Post subject:
PostPosted: Sat Sep 23, 2006 2:20 am 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
Erwin Coumans wrote:
I still hope we can use the Bullet broadphase first, create a 2D bitmap, that encodes overlap between objects, and upload that to GPU memory.


OK - so what are the inputs? What are the outputs? What are the steps to get from one to the other? Are the outputs forces or impulses or something?

You talk about a "bitmap". Are you thinking of something like:

bool bitmap [ MAX_OBJECT ][ MAX_OBJECT ] ;

bool objectsOverlap ( int object1, int object2 )
{
  return bitmap [ object1 ][ object2 ] ;
}

That wouldn't be at all GPU-friendly. The only way for the GPU to access such a thing would be in parallel - but if we want thousands of objects, then the bitmap has millions of entries, which would be horrifically costly to process even in parallel. Worse still, since most of the bitmap would be 'false', we'd have most of our processors executing the entire narrowphase algorithm with their writes disabled...because (as I've explained before) this is a SIMD machine. That would be very wasteful.

Better would be an array with one element per object, holding a simple list of integers identifying the things that collided with it...or maybe an array with one element per collision that happened, holding a simple list of object pairs: A-collidedwith-B, F-collidedwith-Q...
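For contrast, converting the dense bitmap into the pair-list form would look something like this (illustrative CPU code, nothing GPU-specific):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Turning the dense N x N overlap bitmap into the flat pair list
// (A-collidedwith-B, F-collidedwith-Q, ...). Only the upper triangle is
// scanned, since overlap is symmetric. The output has one entry per
// actual overlap, instead of N*N entries that are mostly 'false'.
std::vector<std::pair<int,int>>
bitmapToPairList (const std::vector<std::vector<bool>>& bitmap)
{
  std::vector<std::pair<int,int>> pairs;
  for (std::size_t i = 0; i < bitmap.size(); ++i)
    for (std::size_t j = i + 1; j < bitmap[i].size(); ++j)
      if (bitmap[i][j])
        pairs.push_back ({ static_cast<int>(i), static_cast<int>(j) });
  return pairs;
}
```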


 Post subject:
PostPosted: Mon Sep 25, 2006 10:42 pm 
Offline
Site Admin
User avatar

Joined: Sun Jun 26, 2005 6:43 pm
Posts: 3893
Location: California, USA
SteveBaker wrote:
Better would be an array with one element per object - with a list of the things that collided against it as a simple list of integers...or maybe an array with one element per collision that happened with a simple list of object pairs: A-collidedwith-B, F-collidedwith-Q...


OK, my suggested approach was probably too old-fashioned for the GPU.

In that case, either one array per object with a small constant number of other elements, or a few arrays each holding at most one interacting pair per element, would be great.
I think it's easiest to first create such an array on the CPU side; if you tell me the format of that array, it can be implemented in a day. The GPU version might take more time ;-)

If we then can run a shader that gets processed for each interacting 'pair', writing the results just for one object, that would be a great start.

Thanks,
Erwin


 Post subject:
PostPosted: Wed Sep 27, 2006 7:23 pm 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
I added some new code to the demo.

* There is now a 'collisionMap' that stores up to four objects that have (nominally) collided with each object.
* There is an extra shader that computes some sort of constraint force on every object from (up to) four objects it might have collided with.

For the sake of a demonstration which doesn't have collision detection yet, I hard-wired the four 'colliding' objects to four red cubes and changed the 'constraint force' to be an attraction like gravity. The result is that the four red cubes look like black holes around which galaxies are forming.
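For the record, the hard-wired 'constraint force' is essentially an inverse-square attraction summed over the (up to) four objects in each body's collisionMap texel. A CPU sketch of what the shader computes per body - the softening constant and 'strength' value here are illustrative, not the demo's exact numbers:

```cpp
#include <cmath>

struct V3 { double x, y, z; };

// Sum an inverse-square attraction toward each attractor. 'strength'
// plays the role of a gravitational constant; the small epsilon softens
// the force so a body sitting exactly on an attractor doesn't divide by 0.
V3 attractionForce (V3 body, const V3* attractors, int count, double strength)
{
  V3 f { 0.0, 0.0, 0.0 };
  for (int i = 0; i < count; ++i)
  {
    double dx = attractors[i].x - body.x;
    double dy = attractors[i].y - body.y;
    double dz = attractors[i].z - body.z;
    double d2 = dx * dx + dy * dy + dz * dz + 1e-6;  // softened distance^2
    double s  = strength / (d2 * std::sqrt (d2));    // strength/r^2, unit dir
    f.x += s * dx;  f.y += s * dy;  f.z += s * dz;
  }
  return f;
}
```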

It's a very pretty demo!

You may still need to use a '-v' on the command line to make it work with some combinations of graphics card and operating system. (Urgh).


 Post subject:
PostPosted: Wed Sep 27, 2006 7:26 pm 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
Incidentally, I got an email from Chris Seitz at nVidia offering help from nVidia on the GPU physics stuff. Initially, this will get me onto their developer forums - which may help me to understand why the heck this code doesn't work properly under Windows with vertex texturing enabled. (We know why it doesn't work with ATI hardware though).


 Post subject:
PostPosted: Wed Sep 27, 2006 7:51 pm 
Offline
Site Admin
User avatar

Joined: Sun Jun 26, 2005 6:43 pm
Posts: 3893
Location: California, USA
Hi Steve,

That is good news from Nvidia!

Also, ATI is very helpful, so we can contact them with specific questions. However, my issue was with an Apple MacBook, where Apple writes the drivers. They are helpful too, and there is a recent demo that looks promising.

By the way, for the Windows Bullet source zipfiles I disabled vertex texture support by default, otherwise the demo always shows grey, I think.
So I modified the line to:
Code:
bool disableVertexTextureSupport = true ;


Until things are fixed, this looks better for the executables ;-)
Erwin


 Post subject:
PostPosted: Wed Sep 27, 2006 9:46 pm 
Offline
User avatar

Joined: Sun Aug 13, 2006 4:41 pm
Posts: 127
Location: Cedar Hill, Texas
Code:
bool disableVertexTextureSupport = true ;


We should probably do that conditionally, so the Linux version still runs fast.


 Post subject:
PostPosted: Wed Sep 27, 2006 10:20 pm 
Offline
Site Admin
User avatar

Joined: Sun Jun 26, 2005 6:43 pm
Posts: 3893
Location: California, USA
Hi Steve,

The source .tgz doesn't include this temporary change; it is just the Windows zip that includes it.

Your latest GPU demo looks very cool, I will upload some binaries for others to play with, and upload a new source distribution. It doesn't have the lower-case methods yet, but otherwise most refactoring is in it.

Erwin


PostPosted: Tue Nov 14, 2006 1:00 pm 
Offline
User avatar

Joined: Mon Aug 15, 2005 8:57 am
Posts: 27
Location: Italy
Hello,
I am testing your GPU demo, since I am going to implement some GP-GPU stuff for my multibody simulator.
The demo is very interesting, but I found some problems on my system.
These are the results, running the EXE in debug mode as suggested by Steve Baker:

GPUphysics -s:
oops: the program crashes!

GPUphysics -p:
OK, static green cubes displayed.

GPUphysics -v:
OK, cubes are moving & colliding.

GPUphysics :
OK, cubes are moving & colliding.
Exactly as in GPUphysics -v mode (in fact there is always a message about the vertex shader being disabled by default on Windows systems).

PROBLEMS:

1)
The performance is a bit too low... I suspect that something is wrong. Just to give you an idea, the 'average' result is the following:
'Performance: 2 passes 256 cubes: other=0.501740ms collisions=16.285310ms'
Is this normal on an Nvidia 7900GS board?

2)
After I close the window by mouse clicking, the background DOS window starts printing the following line THOUSANDS of times:
'GLSL_ShaderPair::getUniformLocation: OpenGL Error - operazione non valida'
until I stop it with Ctrl+C. Note: the phrase 'operazione non valida' is in Italian because my Windows XP is localized in Italian (it means 'invalid operation', anyway...).
I suspect this should not happen - also, I guess this may be related to note 1), that is, the poor performance.

3)
Not as important as 1) or 2), but worth mentioning...
In the first release of the GPU demo, there were floating (not colliding) cubes. This was very fast on my GPU, though a strange effect happened: most times it ran about 10x slower right after starting, but after some seconds it switched into 'good speed' mode.
I mean, this may not be directly related to your GPU demo, since a similar issue also happens with the Nvidia 'Nalu' demo, which runs at 2-3fps for some seconds before going at full speed (but it does not happen with other, even more complex, demos - who knows...).

My system is a Toshiba laptop with an Nvidia GeForce Go 7900GS with 512MB of GPU RAM, with the following specs:

INFO: This hardware supports at most:
4 vert texture samplers
16 frag texture samplers
16 total texture samplers

PS: have you tested the GP-GPU code at
http://www.mathematik.uni-dortmund.de/~ ... orial.html ?
There are some hints there about supported OpenGL modes on ATI/Nvidia etc.

regards,
A.Tasora

