Bullet on GPU

SteveBaker · Post by **SteveBaker** » Sat Sep 02, 2006 8:28 pm

cippyboy wrote:[SteveBaker] Dude, I know what shaders are, I didn't need THAT lesson.

Sorry - on these kinds of forums it's impossible to know who knows what about what - so I erred on the side of a full explanation. Didn't mean to preach a sermon to the converted!

To make things more clear I meant this
2D quad render calls in ortho mode :
glVertex2f(0,0);glTexCoord2f(0,0);
glVertex2f(8,0);glTexCoord2f(1,0);
glVertex2f(0,8);glTexCoord2f(0,1);
glVertex2f(8,8);glTexCoord2f(1,1);

(You'll need to put the texcoord calls *before* the glVertex calls - but OK).

And the picture I did in paint http://www.relativeengine.freegsm.ro/ph ... gpu_fp.jpg

I guess I don't understand what you are saying/asking then.

cippyboy · Post by **cippyboy** » Sat Sep 02, 2006 11:25 pm

Oh, yeah, my bad there.

And I was saying... that at every single pixel you have access to the whole texture/array of everything so... that given, you could actually do everything in a single pixel, although that ain't too optimized.

And there's also a shader extension (GL_ARB_draw_buffers) that lets you put pixels in different drawbuffers (front buffer/back buffer, aux buffers) although I haven't used it yet, you could save the new position/etc in each one of them and then glCopyTexture them. It's interesting what would happen when GL_ARB_draw_buffers and GL_ARB_render_texture meet, saving on more different textures at once sounds nice.

SteveBaker · Post by **SteveBaker** » Sun Sep 03, 2006 12:59 am

cippyboy wrote:Oh, yeah, my bad there.

And I was saying... that at every single pixel you have access to the whole texture/array of everything so... that given, you could actually do everything in a single pixel, although that ain't too optimized.

There are two flaws in that:

1) The whole point of doing things in the GPU is because there is massive parallelism in there. If you did everything in a single pixel, it would be a lot slower than doing it in the main CPU. The idea is that there are maybe 16 to 48 or more fragment processors - each of which has 4-way internal parallelism - so you need to be using them all in parallel to get performance.

2) Each fragment processor can only write out R+G+B+A+Z+Stencil - and in truth, we should probably restrict ourselves to using R+G+B+A alone. So if you tried to process more than one object or more than one timestep or something, you'd have no place to put the results.

So it's VERY clear that you take the inputs for one or two steps of the calculation into the GPU as textures with one texel per object in the scene. You do a few simple calculations in massive parallelism - and what you get out goes back into texture. Each small stage in the calculation proceeds through a different set of shader code until you have your final answers and updated intermediates in texture ready for the next frame of physics.

And there's also a shader extension (GL_ARB_draw_buffers) that lets you put pixels in different drawbuffers (front buffer/back buffer, aux buffers) although I haven't used it yet, you could save the new position/etc in each one of them and then glCopyTexture them. It's interesting what would happen when GL_ARB_draw_buffers and GL_ARB_render_texture meet, saving on more different textures at once sounds nice.

Yes - there is potential for multiple draw targets to allow more stages of the processing pipeline to proceed with a single polygon draw - and therefore in a single shader program. It remains to be seen to what degree you'll be able to output to multiple textures using floating point and with separate R,G,B,A (which for us will probably be X,Y,Z positions or the four quaternion components for rotations)

Erwin Coumans · Post by **Erwin Coumans** » Mon Sep 04, 2006 4:34 pm

A simple proof of concept with thousands of rotating cubes, that keep their transform on the GPU memory (over frames) would be very cool as a start. Once that is working, a library as you describe indeed. But the order is a matter of developer preference

Would that be feasible?
Thanks,
Erwin

SteveBaker wrote: I'll see what I can do. I guess you need a nice simple library where you have:

* Compile shader source code (returns a 'handle' to the shader).
* Allocate an NxM texture (returns a handle).
* Populate some section of a specified texture with floating point data.
* Run a specified shader on a specified set of textures - leaving the results in another specified texture.
* Read back a texture into the CPU.

Seems like that's a good starting point.
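
The five operations in that list can be sketched as a small interface. What follows is a CPU-only mock-up under stated assumptions, not the GPUphysics library: `GpuSketch`, `Texel`, `Texture` and `ShaderFn` are hypothetical names, and a plain C++ callable stands in for a compiled shader.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical CPU-only mock of the five operations listed above.
// None of these names come from the real GPUphysics library.
using Texel = std::array<float, 4>;  // one RGBA floating point texel

struct Texture {
    int width = 0;
    int height = 0;
    std::vector<Texel> texels;  // row-major, width * height entries
};

// Stand-in for a compiled fragment shader: called once per output texel.
using ShaderFn =
    std::function<Texel(const std::vector<const Texture*>& inputs, int x, int y)>;

class GpuSketch {
public:
    // "Compile shader source code": here we just store a callable.
    int compileShader(ShaderFn fn) {
        shaders_.push_back(std::move(fn));
        return static_cast<int>(shaders_.size()) - 1;
    }

    // "Allocate an NxM texture", zero-filled.
    int allocTexture(int width, int height) {
        Texture t;
        t.width = width;
        t.height = height;
        t.texels.assign(static_cast<std::size_t>(width) * height, Texel{});
        textures_.push_back(std::move(t));
        return static_cast<int>(textures_.size()) - 1;
    }

    // "Populate some section of a specified texture with floating point data."
    void populate(int tex, int x0, int y0, int w, int h, const std::vector<Texel>& data) {
        Texture& t = textures_[tex];
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                t.texels[static_cast<std::size_t>(y0 + y) * t.width + (x0 + x)] =
                    data[static_cast<std::size_t>(y) * w + x];
    }

    // "Run a specified shader on a specified set of textures."
    // Results go to a scratch copy first, so the output may also be an input.
    void run(int shader, const std::vector<int>& inputs, int output) {
        std::vector<const Texture*> in;
        for (int handle : inputs) in.push_back(&textures_[handle]);
        Texture result = textures_[output];
        for (int y = 0; y < result.height; ++y)
            for (int x = 0; x < result.width; ++x)
                result.texels[static_cast<std::size_t>(y) * result.width + x] =
                    shaders_[shader](in, x, y);
        textures_[output] = std::move(result);
    }

    // "Read back a texture into the CPU."
    std::vector<Texel> readBack(int tex) const { return textures_[tex].texels; }

private:
    std::vector<ShaderFn> shaders_;
    std::vector<Texture> textures_;
};
```

Handles are plain integers here for brevity. On real hardware the same texture cannot normally be read and written in one pass, which is why the mock writes into a scratch copy before replacing the output.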

SteveBaker · Post by **SteveBaker** » Mon Sep 04, 2006 7:22 pm

Erwin Coumans wrote:A simple proof of concept with thousands of rotating cubes, that keep their transform on the GPU memory (over frames) would be very cool as a start. Once that is working, a library as you describe indeed. But the order is a matter of developer preference ;-)

Would that be feasible?
Thanks,
Erwin

To put the translational velocity in one texture, the rotational velocity (quaternion) in another and the position and rotation of the cube centers in a third and fourth texture is simple. Updating the positions from the velocities takes two passes of render-to-texture (one for translate and the other for rotate) - then you can render your cubes with a vertex shader that takes the position and rotation of each cube from a separate entry in the texture. That much is very do-able.
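
The two update passes described here could look roughly like this on the CPU. It is only a sketch: the struct and function names are hypothetical, and it assumes the rotational velocity is stored as a per-step rotation quaternion, which the post does not spell out.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float w, x, y, z; };

// Hamilton product a * b.
inline Quat mul(const Quat& a, const Quat& b) {
    return { a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
             a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
             a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
             a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w };
}

inline Quat normalize(const Quat& q) {
    float len = std::sqrt(q.w * q.w + q.x * q.x + q.y * q.y + q.z * q.z);
    return { q.w / len, q.x / len, q.y / len, q.z / len };
}

// "Translate" pass: one texel of the position texture, updated from the
// matching texel of the translational velocity texture.
inline Vec3 translatePass(const Vec3& position, const Vec3& velocity, float dt) {
    return { position.x + velocity.x * dt,
             position.y + velocity.y * dt,
             position.z + velocity.z * dt };
}

// "Rotate" pass: apply the per-step rotation to the stored orientation and
// renormalise so rounding error does not accumulate over frames.
inline Quat rotatePass(const Quat& orientation, const Quat& stepRotation) {
    return normalize(mul(stepRotation, orientation));
}
```

Each function corresponds to what one fragment would compute for one cube; the GPU would run it for every texel of the output texture at once.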

However, what concerns me about the whole process is the collision detection.

* If you can do it on the GPU - then that's great news - but I'm deeply skeptical. The algorithms involved in collision detection seem to be full of conditionals and loops of the kind that shaders are poor at handling.

* If you can't do it on the GPU then you'll be sucking positional data out of the GPU all the time - and the cost of doing that would wipe out the benefits of automating A=F/M, V+=At and Pos+=Vt. - which are pretty trivial to do compared to the data transfer times.

So I guess my question is: Aside from simple application of F=ma and s=ut+1/2 at^2 - what else can we do with the data down in the GPU?

Erwin Coumans · Post by **Erwin Coumans** » Mon Sep 04, 2006 8:59 pm

As Havok already showed on both ATI and NVidia, using shader model 3, basic collision detection and solving is possible on the GPU. It won't be GJK in the first run, so we can start with basic sphere-sphere collision detection on GPU, and add more complicated/sophisticated algorithms later.

Let me worry about the GPU collision detection, and you worry about the rendering/GPU framework

How long would a prototype take, that rotates thousands of cubes, given you call it 'very do-able' ?

Erwin

SteveBaker · Post by **SteveBaker** » Tue Sep 12, 2006 2:36 pm

Just a note to let you guys know that I've started working on a 'wrapper' layer to make it quick and easy to use the GPU to accelerate massively parallel calculations.

I should have something to play with in a couple of days.

SteveBaker · Post by **SteveBaker** » Tue Sep 12, 2006 10:33 pm

Work on the low level GPU library went together pretty fast today.

I have a fairly untidy demo of 4096 cubes falling under gravity and bouncing off Y==0 (no proper collisions - just 'if position.y < 0 then velocity.y *= -0.95') but with 100% of the work happening in the GPU. I feed it per-cube data in textures on initialisation:

Force vector, Mass, Initial position/velocity, Initial rotation/rotational-velocity

...then the graphics card updates velocity, new position, etc in-place and automatically renders the cubes in the right positions without the CPU doing any of the calculations.

I'll clean it up and make it pretty tonight - should be able to post the code tomorrow.

Eternl Knight · Post by **Eternl Knight** » Wed Sep 13, 2006 3:19 am

Thanks for the update, Steve. I look forward to seeing how it all fits together

--EK

SteveBaker · Post by **SteveBaker** » Wed Sep 13, 2006 3:35 am

OK - the demo (and a pretty usable library that it's based upon) is done.

http://www.sjbaker.org/tmp/GPUphysics.tgz

...I'm putting it under LGPL licensing for now - we can argue about what it should be later - I'm not a bigot about licensing - just so long as there is something.

Check the 'README' file for more details.

It runs under Linux and builds with 'make'. It uses GLUT so it's easy to port to Windows. I also use 'GLEW' for extension registration stuff. If you don't have GLEW, grab it at http://glew.sf.net - I use 'freeglut' (http://freeglut.sf.net) but regular GLUT will do just fine.

What the demo does is to display 16,000 bouncing, spinning, randomly coloured cubes - each of which is doing A=F/m ; V += (A+g)*deltaT ; Pos += V * deltaT ; (And mindless constant velocity rotation for visual impact!). Then each position is being tested to see if it's below Y==0 and if so then the Y component of velocity is being multiplied by -0.9 to make the cubes bounce and lose a bit of energy each time. Aside from that - there is no collision detection or anything fancy like that. This just demonstrates (and provides a handy abstraction layer for) the ability to do a vast number of simple calculation steps in massive parallelism using a modern graphics card.
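
The per-cube arithmetic described above is small enough to write out. This is a CPU sketch of what each fragment would compute, not the demo's shader code; `Cube`, `stepCube` and `stepAll` are hypothetical names, and the restitution default of 0.9 is taken from the description.

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

struct Cube {
    Vec3 force;     // the single force applied to this cube
    float mass;
    Vec3 position;
    Vec3 velocity;
};

// One time step for one cube:
//   A = F/m ; V += (A+g)*deltaT ; Pos += V*deltaT
// followed by the ground test at Y == 0.
inline void stepCube(Cube& c, const Vec3& g, float deltaT, float restitution = 0.9f) {
    const Vec3 a{ c.force.x / c.mass, c.force.y / c.mass, c.force.z / c.mass };

    c.velocity.x += (a.x + g.x) * deltaT;
    c.velocity.y += (a.y + g.y) * deltaT;
    c.velocity.z += (a.z + g.z) * deltaT;

    c.position.x += c.velocity.x * deltaT;
    c.position.y += c.velocity.y * deltaT;
    c.position.z += c.velocity.z * deltaT;

    // Below the ground plane: reverse the Y velocity and lose a bit of energy.
    if (c.position.y < 0.0f) c.velocity.y *= -restitution;
}

// On the GPU every cube is updated in parallel; here it is a plain loop.
inline void stepAll(std::vector<Cube>& cubes, const Vec3& g, float deltaT) {
    for (Cube& c : cubes) stepCube(c, g, deltaT);
}
```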

Each cube can have a different mass, different initial position, velocity, rotation, rotational velocity, and a single force applied to it.

The data for the position and rotation of each cube is unknown to the CPU - it's all handled inside the graphics card. Clearly that's not going to work for most applications - so I'm writing more code so you can read back the state of objects from the GPU.

It's pretty fast - almost all of the time goes in rendering all those cubes - the physics compute time is far too small to measure reliably - and almost all of it is setup time for which there is considerable scope for optimisation.

OK - so proof of concept ....***DONE***

Now let's talk about how the heck we do collision detection in this kind of setting? That seems to me to be the hard part to do efficiently - and it's not immediately obvious how massive parallelism necessarily helps.

Erwin Coumans · Post by **Erwin Coumans** » Wed Sep 13, 2006 6:27 am

SteveBaker wrote:OK - the demo (and a pretty usable library that it's based upon) is done.

http://www.sjbaker.org/tmp/GPUphysics.tgz

...I'm putting it under LGPL licensing for now - we can argue about what it should be later - I'm not a bigot about licensing - just so long as there is something.

Check the 'README' file for more details.

That is amazingly rapid prototyping! Could you please make it Zlib license? Then it is more compatible with Bullet, and it allows application in closed source/commercial/console games.

Zlib still makes sure your name is in the source files.

It runs under Linux and builds with 'make'. It uses GLUT so it's easy to port to Windows. I also use 'GLEW' for extension registration stuff. If you don't have GLEW, grab it at http://glew.sf.net - I use 'freeglut' (http://freeglut.sf.net) but regular GLUT will do just fine.

What the demo does is to display 16,000 bouncing, spinning, randomly coloured cubes - each of which is doing A=F/m ; V += (A+g)*deltaT ; Pos += V * deltaT ; (And mindless constant velocity rotation for visual impact!). Then each position is being tested to see if it's below Y==0 and if so then the Y component of velocity is being multiplied by -0.9 to make the cubes bounce and lose a bit of energy each time. Aside from that - there is no collision detection or anything fancy like that. This just demonstrates (and provides a handy abstraction layer for) the ability to do a vast number of simple calculation steps in massive parallelism using a modern graphics card.

Each cube can have a different mass, different initial position, velocity, rotation, rotational velocity, and a single force applied to it.

The data for the position and rotation of each cube is unknown to the CPU - it's all handled inside the graphics card. Clearly that's not going to work for most applications - so I'm writing more code so you can read back the state of objects from the GPU.

Well, for effect/eye candy physics, that is already enough. We could have a shader that allows some interaction with objects. Havok had some whirl-wind effect that looked amazing with the tons of objects.

It's pretty fast - almost all of the time goes in rendering all those cubes - the physics compute time is far too small to measure reliably - and almost all of it is setup time for which there is considerable scope for optimisation.

OK - so proof of concept ....***DONE***

Now let's talk about how the heck we do collision detection in this kind of setting? That seems to me to be the hard part to do efficiently - and it's not immediately obvious how massive parallelism necessarily helps.

I will work on the broadphase very soon, creating a 2d bitmap, encoding overlap between two objects. Also, I will spend some time on a parallel sphere-sphere implementation. First I will round off some boring demo-cleaning-up work (almost finished).

Thanks. What requirements does it have? Shader model 3?
Just modified the code so it compiles under Mac OS X. It runs, but asserted in line 60 of fboSupport.cxx. The status is 36054 (didn't check the enum). After commenting that out, it failed later, see below.

erwin-coumans-computer:~/Downloads/GPUphysics apple$ make
g++ -c fboSupport.cxx
g++ -framework GLUT -framework OpenGL -L"/System/Library/Frameworks/OpenGL.framework/Libraries" -lGL -lGLU -o GPU_physics_demo GPU_physics_demo.o fboSupport.o shaderSupport.o -L"/System/Library/Frameworks/OpenGL.framework/Libraries"  -lGLU -lGLEW -lGL -lGLU -lobjc
erwin-coumans-computer:~/Downloads/GPUphysics apple$ ./GPU_physics_demo 
status : 36054
Compiling:CollisionGenerator Frag Shader - 
ERROR: 0:1: '<' :  wrong operand types  no operation '<' exists that takes a left-hand operand of type 'float' and a right operand of type 'const int' (or there is no acceptable conversion)

Failed to compile shader 'CollisionGenerator Frag Shader'.
GPU_physics_demo.cxx:204: failed assertion `collisionGenerator -> compiledOK ()'
Abort trap
erwin-coumans-computer:~/Downloads/GPUphysics apple$

The glewinfo.txt for Apple Powerbook Pro (Intel) is here

I suppose I need to re-insert my freezing nvidia 6800 GT card. Is there another stable nvidia card that works on older AGP slots?
So framebuffer objects is one way of doing GPU shaders. Is there an alternative? Anyway, this is more a hardware issue on my side

I'll try it at work, there is a new nvidia card in one of my machines.

Thanks a lot!
Erwin

SteveBaker · Post by **SteveBaker** » Wed Sep 13, 2006 12:03 pm

Erwin Coumans wrote: That is amazingly rapid prototyping!

Two evenings...Meh...

Could you please make it Zlib license? Then it is more compatible with Bullet, and it allows application in closed source/commercial/console games.

OK - I'll read the zlib license - but it'll probably be OK.

Well, for effect/eye candy physics, that is already enough. We could have a shader that allows some interaction with objects. Havok had some whirl-wind effect that looked amazing with the tons of objects.

Yeah - but particle system approaches are fine if there is no interaction between particles or between particles and the world. We already do this stuff at work (I work in flight simulation - not games - but it's hard to tell the difference sometimes!) - if the position of every particle follows a 100% predictable path then you can compute its position without knowing any past history and you're better off just using 's = ut + 1/2 a t^2' for each particle each frame and just feeding the time into the shader. Even whirlwind type effects have an equation that will tell you the position of every particle from the time and the initial conditions - so you don't need an incremental approach that demands all of this sophistication. We only need it here because the particles are bouncing off the ground (that's why I included that in the demo!).
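
For comparison with the incremental update, the history-free version is a single function of time. A minimal sketch, assuming constant acceleration; `positionAt` and its parameter names are made up for illustration.

```cpp
struct Vec3 { float x, y, z; };

// Closed-form particle position: s = u*t + 1/2*a*t^2, added to the initial
// position. Only the initial conditions and the current time are needed, so
// nothing has to be stored between frames.
inline Vec3 positionAt(const Vec3& initialPosition, const Vec3& u, const Vec3& a, float t) {
    return { initialPosition.x + u.x * t + 0.5f * a.x * t * t,
             initialPosition.y + u.y * t + 0.5f * a.y * t * t,
             initialPosition.z + u.z * t + 0.5f * a.z * t * t };
}
```

A bounce breaks this, because the velocity after the impact depends on when the impact happened, which is why the demo needs state that persists from frame to frame.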

I will work on the broadphase very soon, creating a 2d bitmap, encoding overlap between two objects. Also, I will spend some time on a parallel sphere-sphere implementation.

Certainly if we wished to test ONE sphere (or cuboid or triangle) against a huge number of other spheres (or cuboids or triangles) - that could easily be done this way. But to test a huge number of spheres simultaneously against a huge number of spheres will require some thought and some cleverness or it'll require one polygon draw per sphere. For 16,000 spheres tested against 16,000 spheres using mindless brute force, we'd need to draw 16,000 polygons at 128x128 pixels each. That would take a significant amount of GPU time - but it's certainly do-able within a few milliseconds I think.

The interesting question is whether we could do something like classify which spheres were in which 'grid cell' of the world volume in one pass (yes, that's doable) and then only test the spheres from that grid cell against the others in that grid cell...something like that might speed the algorithm up by many orders of magnitude and still keep it mainly inside the GPU. That's the kind of approach I think might work - but I'm no collision detection expert.
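
As a rough CPU illustration of that grid-cell idea: pass one bins each sphere by the cell containing its centre, pass two tests only spheres that share a cell. The names and the 21-bits-per-axis cell key are assumptions for the sketch, and it deliberately ignores neighbouring cells, so a pair that straddles a cell boundary would be missed.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Sphere { float x, y, z, radius; };

// Hypothetical cell key: three 21-bit cell indices packed into 64 bits.
inline std::uint64_t cellKey(const Sphere& s, float cellSize) {
    auto index = [cellSize](float v) {
        std::int64_t cell = static_cast<std::int64_t>(std::floor(v / cellSize));
        return static_cast<std::uint64_t>(cell + (1 << 20)) & 0x1FFFFFu;
    };
    return (index(s.x) << 42) | (index(s.y) << 21) | index(s.z);
}

inline bool spheresOverlap(const Sphere& a, const Sphere& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    const float r = a.radius + b.radius;
    return dx * dx + dy * dy + dz * dz <= r * r;
}

// Pass 1: classify spheres into grid cells by centre.
// Pass 2: brute-force test only within each cell.
inline std::vector<std::pair<int, int>> collideWithinCells(
        const std::vector<Sphere>& spheres, float cellSize) {
    std::unordered_map<std::uint64_t, std::vector<int>> cells;
    for (int i = 0; i < static_cast<int>(spheres.size()); ++i)
        cells[cellKey(spheres[i], cellSize)].push_back(i);

    std::vector<std::pair<int, int>> hits;
    for (const auto& cell : cells) {
        const std::vector<int>& members = cell.second;
        for (std::size_t a = 0; a < members.size(); ++a)
            for (std::size_t b = a + 1; b < members.size(); ++b)
                if (spheresOverlap(spheres[members[a]], spheres[members[b]]))
                    hits.emplace_back(members[a], members[b]);
    }
    return hits;
}
```

A more complete version would also check adjacent cells, or insert each sphere into every cell its bounds touch.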

Thanks. What requirements does it have? Shader model 3?

I'm not really up on the 'Shader model' thing (that's a Microsoft/DirectX kind of thing and I'm strictly a Linux/OpenGL guy). The two specialised features it requires right now are the ability to render to floating point textures and the ability to read floating point textures into both the VERTEX shader and the fragment shader. Both of those are required for a strictly legal GLSL implementation - but older hardware will probably either fall back on software emulation or simply refuse to do it or drop the precision of the float down to a byte or something horrible.

This is definitely advanced stuff - it's not going to work on nVidia 5xxx cards.

Just modified the code so it compiles under Mac OS X. It runs, but asserted in line 60 of fboSupport.cxx. The status is 36054 (didn't check the enum). After commenting that out, it failed later, see below.

OK - you must have added some lines at the top. That would be at the end of the "checkFrameBufferStatus" function? A fail there probably means that render to floating point texture is unsupported. That would be "A Bad Thing" - do you have the latest OpenGL drivers installed?

erwin-coumans-computer:~/Downloads/GPUphysics apple$ ./GPU_physics_demo 
status : 36054
Compiling:CollisionGenerator Frag Shader - 
ERROR: 0:1: '<' :  wrong operand types  no operation '<' exists that takes a left-hand operand of type 'float' and a right operand of type 'const int' (or there is no acceptable conversion)

OK - that's my bad. On line 191 of GPU_physics_demo.cxx, change '< 0' to '< 0.0'.

I suppose I need to re-insert my freezing nvidia 6800 GT card. Is there another stable nvidia card that works on older AGP slots?

Probably.

So framebuffer objects is one way of doing GPU shaders. Is there an alternative?

Short Answer: No.

Long Answer: Well, maybe...but I wouldn't try it. If we can't render into a three or four component floating point texture, it's hard to see how else we'd get data out of one pass of the shader and into the next. At work, we played briefly with splitting apart a 'float' and repacking it into four bytes and putting that into the RGBA of the frame buffer - then copying the frame buffer into a texture and recombining the four bytes into a float in the next shader pass. However, that means that you can only output one floating point number in each shader pass - so things like calculating a rotation quaternion take four shader passes instead of one - repeating the same calculations each pass but writing out a different component each time.

It's also slow because of copying frame buffer into texture...but we're talking pretty small textures here (128x128 is tiny - and yet gives you 16,000 cubes bouncing around!). The worst part is that OpenGL only GUARANTEES to support up to 16 texture inputs into a shader. If each floating point number is one texture (instead of having up to four floats in a single texture) - then each step of the physics math can have only 16 floating point numbers going into it instead of 64 if you can use floating point texture targets.

Worse still, the packing and unpacking of floating point numbers into RGBA colour bytes is nasty inside a shader (you can't just cast a pointer to a float into a pointer to four bytes because there aren't any pointers in shader-land)....so it has to be done arithmetically - which is insanely painful and will probably grind our performance into the dust.
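
To give a feel for what packing "arithmetically" means, here is one possible scheme written as plain C++. It is a sketch under a strong assumption: the value has already been scaled into the range [0, 1), which real positions would need a known world extent for. The function names are hypothetical and this is not the code that was tried at work.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

using Rgba8 = std::array<std::uint8_t, 4>;

// Split a value in [0, 1) into four base-256 digits using only multiply,
// floor and subtract, the kind of arithmetic a shader without pointers or
// bit operations has available.
inline Rgba8 packUnitFloat(float v) {
    Rgba8 out{};
    float rest = v;
    for (int i = 0; i < 4; ++i) {
        rest *= 256.0f;
        const float digit = std::floor(rest);  // 0..255
        out[i] = static_cast<std::uint8_t>(digit);
        rest -= digit;
    }
    return out;
}

// Recombine the four digits, least significant first.
inline float unpackUnitFloat(const Rgba8& c) {
    float v = 0.0f;
    for (int i = 3; i >= 0; --i) v = (v + static_cast<float>(c[i])) / 256.0f;
    return v;
}
```

Even this simple form costs a handful of instructions per value on both the write and the read side, and it has to be repeated for every component of every vector.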

It's possible that older hardware might be able to do this at 'half-float' precision. My code doesn't support that - but it could be made to do so. I just have doubts that you can do useful physics at 16 bit precision...heck - there are enough limits on object sizes and such already - if we make it any worse, it'll be unusable.

Like I said - this is really a high-end technique. It's going to be a while before most of the hardware out there can support it. But that's also true of the hardware-physics-engine and both of the other physics-on-GPU solutions. At least we don't demand a second graphics card for running the physics like the nVidia solution does!

I think the plan here has to be to get it working so that it's mature code by the time the hardware catches up.

SteveBaker · Post by **SteveBaker** » Wed Sep 13, 2006 3:27 pm

Some thoughts on Parallel Sphere/Sphere collision testing.

OK - so - how about some mindless brute-force sphere/sphere collision testing?! :-)

As with the present demo code, each 'object' is represented by a single texel in the texture map. It has an 'identifier' which is a simple integer that represents its location in the map (it's actually the S and T coordinate within the map bit-packed into a single integer).
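
One way such an identifier could be packed, assuming 16 bits for each coordinate (the post does not say how the bits are split, and these function names are hypothetical):

```cpp
#include <cstdint>

// Pack the integer S and T texel coordinates into one identifier:
// T in the high 16 bits, S in the low 16 bits.
inline std::uint32_t packId(std::uint32_t s, std::uint32_t t) {
    return (t << 16) | (s & 0xFFFFu);
}

inline std::uint32_t idS(std::uint32_t id) { return id & 0xFFFFu; }
inline std::uint32_t idT(std::uint32_t id) { return id >> 16; }
```

For a 128x128 map only 7 bits per coordinate are actually needed, which leaves room for flag bits.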

* Store the center and radius of each sphere in a four-component texture - the center being relative to the origin of the model that it's a part of so we don't have to update it as the object moves.

* We already have the translate/rotate of each model in two more textures.

* Imagine a 'collidemap' texture that contains the "identifiers" of up to four spheres that collided with the sphere represented by this pixel - it's initialised to some out-of-range value which we'll call 'NOCOLLIDE'. Because we have 16 or even 32 bits for the 'collidemap', we can reserve one bit in one of the four entries for a flag called 'OVERFLOW' which indicates that more than four collisions happened here.

* For each 'probe' sphere, draw a polygon with one pixel per 'target' sphere we're testing it against (which might as well be all of them because it's fast).

* In the vertex shader, read the position/rotation map location for our probe sphere and transform it. Pass the transformed data down to the fragment shader.

* In the fragment shader, read the position/rotation of the 'target' sphere that's at this location and compare it to the 'probe' sphere. If they collide, then look at the collidemap for this target and write the identifier of the probe sphere into the next empty slot of the collidemap. If the collide map is already showing four collisions then set the 'OVERFLOW' bit in the collidemap.

If we now draw one polygon for each 'probe' sphere, we'll have a collidemap that contains (for each 'target') the identifiers of up to four 'probe' spheres that this one collided with.
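
The bookkeeping in those steps can be emulated on the CPU to check the logic. This sketch keeps the post's NOCOLLIDE and OVERFLOW ideas but stores the overflow flag as a separate bool rather than a reserved bit, uses the sphere's index as its identifier, and works on world-space spheres directly. All names other than NOCOLLIDE are hypothetical, and it says nothing about whether a fragment shader can do the read-modify-write on the collidemap efficiently.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sphere { float x, y, z, radius; };

constexpr std::uint32_t NOCOLLIDE = 0xFFFFFFFFu;  // out-of-range identifier

// One collidemap texel: up to four probe identifiers plus an overflow flag.
struct CollideEntry {
    std::array<std::uint32_t, 4> ids{{NOCOLLIDE, NOCOLLIDE, NOCOLLIDE, NOCOLLIDE}};
    bool overflow = false;
};

inline bool spheresOverlap(const Sphere& a, const Sphere& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    const float r = a.radius + b.radius;
    return dx * dx + dy * dy + dz * dz <= r * r;
}

// Write the probe into the next empty slot, or flag overflow if all are used.
inline void recordHit(CollideEntry& entry, std::uint32_t probeId) {
    for (std::uint32_t& slot : entry.ids) {
        if (slot == NOCOLLIDE) { slot = probeId; return; }
    }
    entry.overflow = true;
}

// Outer loop: one "polygon" per probe sphere.
// Inner loop: one "pixel" per target sphere.
// firstProbe lets a later pass resume from a given probe after an overflow.
inline std::vector<CollideEntry> buildCollideMap(const std::vector<Sphere>& spheres,
                                                 std::size_t firstProbe = 0) {
    std::vector<CollideEntry> collideMap(spheres.size());
    for (std::size_t probe = firstProbe; probe < spheres.size(); ++probe) {
        for (std::size_t target = 0; target < spheres.size(); ++target) {
            if (probe != target && spheresOverlap(spheres[probe], spheres[target]))
                recordHit(collideMap[target], static_cast<std::uint32_t>(probe));
        }
    }
    return collideMap;
}
```

Because probes are drawn in increasing order, the last stored identifier in an overflowed entry marks where a retest would have to start.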

If there are no 'OVERFLOW' hits then we're done.

If there ARE some 'OVERFLOW's then there were more than four collisions with this target - but we didn't have space to store them. However, we do know the highest 'identifier' that was stored - so whatever fifth (sixth, seventh...) object collided with this target must have had a higher identifier than that.

(It might be that with clever packing of identifiers, we could store more than four collisions in the collidemap - but we'll never be able to store them all because the widest texture we can possibly write to is 4x32bits - so we'll continue to talk about four right now).

The most obvious next step has to involve the CPU. We have to read back (and save) the collidemap - find the set of OVERFLOW'ed target spheres and figure out which was the first probe sphere to overflow an entry.

Then we have to clear out the collidemap and retest all of the probe spheres from then on. It might be worthwhile to figure out the bounding box of the OVERFLOW'ed target spheres and only re-test that rectangle of targets...but the setup costs generally exceed the computation costs, so that may be a pessimal thing to do...especially for thousands to tens of thousands of spheres. For hundreds of thousands and up - then gut feel is that it's worthwhile...but I don't think we'll get to those kinds of numbers.

This strategy will work exceedingly well if the worst case collision situation has four or fewer probe spheres hitting any given target sphere. (The cost for collision detection is one polygon drawn for every 'probe' sphere - which is likely to be a lot less than the cost of actually drawing the 3D models they are connected to. For (say) 20 polygon objects, the cost of brute force, mindless bounding sphere collision testing of every bounding sphere against every other bounding sphere is going to be about 5% of the total GPU time).

It'll work less well if this situation happens - but only rarely and we don't exceed eight collisions.

The worst case is where every single sphere collides with one particular target sphere - then we'll need N/4 passes - the first requiring N polygons to be drawn, the second N-4, then N-8 all the way down to zero. That's better than N^2...but not by much...a 'smart' algorithm running in the CPU could probably do better.
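
The series N + (N-4) + (N-8) + ... sums to roughly N^2/8 polygons. A small helper makes the count concrete; `worstCaseDraws` is a hypothetical name and the `slots` parameter (four identifiers per collidemap texel) is the only assumption.

```cpp
#include <cstdint>

// Total polygons drawn when every pass can record only `slots` new collisions
// for the worst-hit target: N + (N - slots) + (N - 2*slots) + ... down to zero.
inline std::uint64_t worstCaseDraws(std::uint64_t n, std::uint64_t slots = 4) {
    std::uint64_t total = 0;
    std::uint64_t remaining = n;
    while (remaining > 0) {
        total += remaining;
        remaining = (remaining > slots) ? remaining - slots : 0;
    }
    return total;
}
```

For N = 16,384 (a 128x128 map) this comes to 33,562,624 polygons, against 16,384 for a single overflow-free pass.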

So at the end of this, we have some number (hopefully only one!) of these collision maps sitting around on both the CPU and the GPU.

The same kind of technique could be used for sphere/cuboid or cuboid/cuboid collisions...maybe even collisions against triangles too. But the simplistic nature of the shaders is such that we would probably want to make one pass through all of the data for each type of probe/target combination.

Now we could theoretically loop through all of the collision maps deciding what to do about the collisions inside another single-pass shader...but let's understand one thing at a time!

SteveBaker · Post by **SteveBaker** » Wed Sep 13, 2006 5:32 pm

Erwin Coumans wrote:Just modified the code so it compiles under Mac OS X. It runs, but asserted in line 60 of fboSupport.cxx. The status is 36054 (didn't check the enum).

That enum translates to:

GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT_EXT

...which suggests that it wants me to attach something for the stencil buffer or something.
I wrote code to do that but my computer at home didn't seem to like it so I #ifdef'ed it out. Running it at work on different hardware and different drivers got me the same error as you and turning on the stencil stuff fixed it.
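
Printing the status by name rather than as a raw number makes this kind of report easier to read. The sketch below repeats the EXT_framebuffer_object token values locally so it compiles without GL headers; `fboStatusName` is a hypothetical helper, not part of the GPUphysics code.

```cpp
#include <string>

// Map a framebuffer status value to its EXT_framebuffer_object token name.
// The numeric values are the ones defined for the extension in glext.h.
inline std::string fboStatusName(unsigned status) {
    switch (status) {
        case 0x8CD5: return "GL_FRAMEBUFFER_COMPLETE_EXT";
        case 0x8CD6: return "GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT_EXT";
        case 0x8CD7: return "GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT_EXT";
        case 0x8CD9: return "GL_FRAMEBUFFER_INCOMPLETE_DIMENSIONS_EXT";
        case 0x8CDA: return "GL_FRAMEBUFFER_INCOMPLETE_FORMATS_EXT";
        case 0x8CDB: return "GL_FRAMEBUFFER_INCOMPLETE_DRAW_BUFFER_EXT";
        case 0x8CDC: return "GL_FRAMEBUFFER_INCOMPLETE_READ_BUFFER_EXT";
        case 0x8CDD: return "GL_FRAMEBUFFER_UNSUPPORTED_EXT";
        default:     return "unknown framebuffer status";
    }
}
```

The reported 36054 is 0x8CD6 in hexadecimal, which is how it maps to the incomplete-attachment token.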

After commenting that out, it failed later, see below.
Failed to compile shader 'CollisionGenerator Frag Shader'.
GPU_physics_demo.cxx:204: failed assertion `collisionGenerator -> compiledOK ()'

OK - so that was an error in my frag shader that I was 'getting away with' in an older device driver. My machine at work failed on that one too - fixing *that* just got me into more problems that weren't showing up at home.

I'll try to nail these issues here at work where my hardware/driver version evidently matches yours a little better - there are enough changes that I'd better put out another version....watch this space!

Erwin Coumans · Post by **Erwin Coumans** » Wed Sep 13, 2006 6:01 pm

Thanks for the excellent work!

With respect to multi-platform building, I can recommend adding CMake support, which I also do in Bullet. It automatically recognizes/finds Glut on Linux/Apple/Unix. Win32 glut lib is included with Bullet sources. CMake can autogenerate Makefiles, projectfiles (Xcode,KDevelop,MSVC) etc etc. You can still include manual makefiles obviously, having CMakeLists.txt and Makefile is fine. In fact, in Bullet I ship with jamfiles too, to provide some choice

Certainly if we wished to test ONE sphere (or cuboid or triangle) against a huge number of other spheres (or cuboids or triangles) - that could easily be done this way. But to test a huge number of spheres simultaneously against a huge number of spheres will require some thought and some cleverness or it'll require one polygon draw per sphere. For 16,000 spheres tested against 16,000 spheres using mindless brute force, we'd need to draw 16,000 polygons at 128x128 pixels each. That would take a significant amount of GPU time - but it's certainly do-able within a few milliseconds I think.

Bullet's collision detection is a 3 phase process (very common)

phase 1, Broadphase, is based on axis aligned bounding box (AABB) culling. Several ways are popular: Sweep and Prune (SAP) is very optimal, Hashspace is an alternative (not in Bullet yet). SAP incrementally updates the list of overlapping object pairs (based on AABB overlap). The interface is 'addPair' and 'removePair'. This can be translated into a 2d texture/bitmap, where a 1 is 'overlap' between two objects. At the moment, there is just a pairlist, but adding such a bitmap is easy. This broadphase overlap-bitmap can be uploaded to GPU and hopefully used to avoid N^2 tests.
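
A minimal sketch of such an overlap bitmap, driven by the 'addPair' and 'removePair' calls mentioned above. `OverlapBitmap` and its other members are hypothetical; this is not Bullet's broadphase interface, and one byte per pair is used instead of one bit to keep the code short.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// N x N map where entry (a, b) is 1 when the AABBs of objects a and b overlap.
class OverlapBitmap {
public:
    explicit OverlapBitmap(std::size_t objectCount)
        : n_(objectCount), bits_(objectCount * objectCount, 0) {}

    // Called by the broadphase as pairs start and stop overlapping.
    void addPair(std::size_t a, std::size_t b) { set(a, b, 1); }
    void removePair(std::size_t a, std::size_t b) { set(a, b, 0); }

    bool overlaps(std::size_t a, std::size_t b) const { return bits_[a * n_ + b] != 0; }

    // Row-major bytes, laid out so they could be uploaded as a
    // single-channel texture of size() x size() texels.
    const std::uint8_t* data() const { return bits_.data(); }
    std::size_t size() const { return n_; }

private:
    // Keep the map symmetric so either object can look up the other.
    void set(std::size_t a, std::size_t b, std::uint8_t value) {
        bits_[a * n_ + b] = value;
        bits_[b * n_ + a] = value;
    }

    std::size_t n_;
    std::vector<std::uint8_t> bits_;
};
```

A narrowphase pass could then skip any probe/target texel whose bitmap entry is 0. For 16,000 objects the full map is large, so packing to bits or storing only one triangle of the matrix would be worth considering.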

phase 2: Midphase culling for complex objects. A complex object is an object with more than 1 primitive/shape. For example a triangle mesh has multiple triangles. Typically a local-space AABB tree is popular. A stackless tree traversal can be implemented on GPU. I would postpone this stage until later, and just deal with basic primitive objects first.

phase 3: narrow phase, the actual contact point(s) generation. For spheres, this is rather trivial. I just added an extension mechanism in

Basically, phase 1 and phase 2 give potential overlapping primitive pairs. A dispatcher chooses an algorithm based on the two types. In the most recent version of Bullet, it is allowed to register user-defined collision detection algorithms for each type. See http://svn.sourceforge.net/viewvc/bulle ... lgorithms/ for a sphere-sphere example in Bullet.
After the Narrowphase, you pass the contact point into a 'PersistentManifold' that collects and maintains contact points. Those contact points are then processed by the constraint solver. The constraint solver calculates new impulses, which result in new velocities. Those new velocities can be passed into your PROTOTYPE
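
The dispatch step can be pictured as a table keyed on the two shape types. The sketch below is illustrative only: `Dispatcher`, `Shape`, `Contact` and `sphereSphere` are hypothetical names and do not reflect Bullet's actual classes or its registration mechanism.

```cpp
#include <cmath>
#include <map>
#include <optional>
#include <utility>

enum class ShapeType { Sphere, Box };

struct Shape { ShapeType type; float x, y, z; float radius; };

// One contact: point, normal (from a towards b) and penetration depth.
struct Contact { float px, py, pz; float nx, ny, nz; float depth; };

using Algorithm = std::optional<Contact> (*)(const Shape&, const Shape&);

// Narrow phase for two spheres: they touch when the centre distance is
// no more than the sum of the radii.
inline std::optional<Contact> sphereSphere(const Shape& a, const Shape& b) {
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    const float depth = a.radius + b.radius - dist;
    if (depth < 0.0f) return std::nullopt;
    // Coincident centres have no defined normal; pick the X axis arbitrarily.
    const float nx = dist > 0.0f ? dx / dist : 1.0f;
    const float ny = dist > 0.0f ? dy / dist : 0.0f;
    const float nz = dist > 0.0f ? dz / dist : 0.0f;
    // Place the contact point in the middle of the overlapping region.
    const float along = a.radius - 0.5f * depth;
    return Contact{a.x + nx * along, a.y + ny * along, a.z + nz * along, nx, ny, nz, depth};
}

// Chooses an algorithm from the pair of shape types.
class Dispatcher {
public:
    void registerAlgorithm(ShapeType a, ShapeType b, Algorithm fn) { table_[{a, b}] = fn; }

    // Returns nullptr when nothing is registered for this pair.
    Algorithm find(ShapeType a, ShapeType b) const {
        auto it = table_.find({a, b});
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::map<std::pair<ShapeType, ShapeType>, Algorithm> table_;
};
```

The resulting contacts would then be handed to whatever collects them for the solver; that part is not sketched here.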

Looking forward to seeing this prototype running!
I didn't read your "Some thoughts on Parallel Sphere/Sphere collision testing" yet. I will give some feedback on that later.

Erwin

Real-Time Physics Simulation Forum
