Performance on a Cell Blade

danieltracy
Posts: 13
Joined: Mon Oct 13, 2008 11:01 pm

Post by danieltracy »

I've been playing with CellSpuDemo on a QS22 Cell blade. My tests seem to indicate that Bullet isn't benefiting much from USE_PARALLEL_DISPATCHER being defined. I've run tests with the spheres/sphere ground, and set up a test with a box ground and boxes. My tests included small and large island configurations (4-64). The results indicate that the USE_PARALLEL_DISPATCHER mode performs comparably or worse, depending upon conditions.

I've operated under the assumption that defining USE_PARALLEL_DISPATCHER enables the use of SPUs, whereas not defining it disables their use (based upon output). Does undefining USE_PARALLEL_DISPATCHER disable use of SPUs entirely, as I assumed?

Are these results expected due to the experimental nature of Bullet on Cell? Or is it likely I'm doing something wrong?

Are there configurations for which using USE_PARALLEL_DISPATCHER may do significantly better?

In addition, the environment in which I'm using Bullet includes many OBBs with a trimesh landscape. Are there problems/caveats to this configuration with the current Cell code? For example, GJK for triangle/OBB intersection?

Daniel
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Performance on a Cell Blade

Post by Erwin Coumans »

Something must be wrong. The SPUs should give a decent speedup, but I haven't tested on a Cell Blade yet (only PLAYSTATION 3 GameOS and Linux).

I'm not familiar with 'small' and 'large' islands, but it would be good to test with 6 local SPUs first.
Are you using btBvhTriangleMeshShape for the static triangle mesh, and btBoxShape for the moving OBBs?

Can you provide the console output when running "CProfileManager::dumpAll();" after each stepSimulation, so we can compare both versions?
Thanks,
Erwin
danieltracy

Re: Performance on a Cell Blade

Post by danieltracy »

Results for 780 bodies: 300 in one island, the rest in islands of 4.
All steps are 1/60th of a second. The test proceeds for 4500 cycles, with no objects allowed to rest.
The printed profiles are of the last cycle before finishing execution.
The first run is without SPUs, and the second uses 4 SPUs. Any more than 4
SPUs causes the program to seg fault.

If the profiling is accurate, it seems to be spending a relatively large
share of its time in updateAabbs, with the rest in predictUnconstraintMotion and
integrateTransforms.

The physical object setup lends itself well to equilibrium, but I've confirmed that
the objects are not resting, and in fact wouldn't rest for some time if allowed
to do so. There should be plenty of contacts.


No SPUs:

[daniel@cssrev3n1 ibmsdk]$ time ./BasicDemo2
Actual body count: 780
----------------------------------
Profiling: Root (total running time: 4.656 ms) ---
0 -- stepSimulation (99.87 %) :: 4.650 ms / frame (1 calls)
Unaccounted: (0.129 %) :: 0.006 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 4.650 ms) ---
...0 -- synchronizeMotionStates (3.38 %) :: 0.157 ms / frame (2 calls)
...1 -- internalSingleStepSimulation (94.02 %) :: 4.372 ms / frame (1 calls)
...Unaccounted: (2.602 %) :: 0.121 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 4.372 ms) ---
......0 -- updateActivationState (3.50 %) :: 0.153 ms / frame (1 calls)
......1 -- updateCharacters (0.00 %) :: 0.000 ms / frame (1 calls)
......2 -- updateVehicles (0.00 %) :: 0.000 ms / frame (1 calls)
......3 -- integrateTransforms (19.90 %) :: 0.870 ms / frame (1 calls)
......4 -- solveConstraints (3.61 %) :: 0.158 ms / frame (1 calls)
......5 -- calculateSimulationIslands (0.98 %) :: 0.043 ms / frame (1 calls)
......6 -- performDiscreteCollisionDetection (43.55 %) :: 1.904 ms / frame (1 calls)
......7 -- predictUnconstraintMotion (28.29 %) :: 1.237 ms / frame (1 calls)
......Unaccounted: (0.160 %) :: 0.007 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 0.158 ms) ---
.........0 -- processIslands (42.41 %) :: 0.067 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (55.70 %) :: 0.088 ms / frame (1 calls)
.........Unaccounted: (1.899 %) :: 0.003 ms
............----------------------------------
............Profiling: processIslands (total running time: 0.067 ms) ---
............0 -- solveGroup (0.00 %) :: 0.000 ms / frame (0 calls)
............Unaccounted: (100.000 %) :: 0.067 ms
...............----------------------------------
...............Profiling: solveGroup (total running time: 0.000 ms) ---
...............0 -- solveGroupCacheFriendlyIterations (0.00 %) :: 0.000 ms / frame (0 calls)
...............1 -- solveGroupCacheFriendlySetup (0.00 %) :: 0.000 ms / frame (0 calls)
...............Unaccounted: (0.000 %) :: 0.000 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 1.904 ms) ---
.........0 -- dispatchAllCollisionPairs (0.05 %) :: 0.001 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (0.16 %) :: 0.003 ms / frame (1 calls)
.........2 -- updateAabbs (99.63 %) :: 1.897 ms / frame (1 calls)
.........Unaccounted: (0.158 %) :: 0.003 ms

real 0m22.538s
user 0m22.273s
sys 0m0.261s




4 SPUs:

[daniel@cssrev3n1 ibmsdk]$ time ./BasicDemo2
IMAGE OPENED:../../../src/BulletMultiThreaded/out/spuCollision.elf
Actual body count: 780
SPU: hello
SPU: hello
Spu 0 is ready
Spu 1 is ready
SPU: hello
SPU: hello
Spu 2 is ready
Spu 3 is ready
sizeof SpuGatherAndProcessWorkUnitInput: 16
----------------------------------
Profiling: Root (total running time: 6.390 ms) ---
0 -- stepSimulation (99.91 %) :: 6.384 ms / frame (1 calls)
Unaccounted: (0.094 %) :: 0.006 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 6.384 ms) ---
...0 -- synchronizeMotionStates (2.88 %) :: 0.184 ms / frame (2 calls)
...1 -- internalSingleStepSimulation (95.55 %) :: 6.100 ms / frame (1 calls)
...Unaccounted: (1.566 %) :: 0.100 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 6.100 ms) ---
......0 -- updateActivationState (2.31 %) :: 0.141 ms / frame (1 calls)
......1 -- updateCharacters (0.00 %) :: 0.000 ms / frame (1 calls)
......2 -- updateVehicles (0.00 %) :: 0.000 ms / frame (1 calls)
......3 -- integrateTransforms (14.51 %) :: 0.885 ms / frame (1 calls)
......4 -- solveConstraints (4.79 %) :: 0.292 ms / frame (1 calls)
......5 -- calculateSimulationIslands (0.75 %) :: 0.046 ms / frame (1 calls)
......6 -- performDiscreteCollisionDetection (57.20 %) :: 3.489 ms / frame (1 calls)
......7 -- predictUnconstraintMotion (20.34 %) :: 1.241 ms / frame (1 calls)
......Unaccounted: (0.098 %) :: 0.006 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 0.292 ms) ---
.........0 -- processIslands (67.47 %) :: 0.197 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (31.51 %) :: 0.092 ms / frame (1 calls)
.........Unaccounted: (1.027 %) :: 0.003 ms
............----------------------------------
............Profiling: processIslands (total running time: 0.197 ms) ---
............0 -- solveGroup (61.93 %) :: 0.122 ms / frame (6 calls)
............Unaccounted: (38.071 %) :: 0.075 ms
...............----------------------------------
...............Profiling: solveGroup (total running time: 0.122 ms) ---
...............0 -- solveGroupCacheFriendlyIterations (55.74 %) :: 0.068 ms / frame (6 calls)
...............1 -- solveGroupCacheFriendlySetup (40.16 %) :: 0.049 ms / frame (6 calls)
...............Unaccounted: (4.098 %) :: 0.005 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 3.489 ms) ---
.........0 -- dispatchAllCollisionPairs (0.95 %) :: 0.033 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (0.11 %) :: 0.004 ms / frame (1 calls)
.........2 -- updateAabbs (98.88 %) :: 3.450 ms / frame (1 calls)
.........Unaccounted: (0.057 %) :: 0.002 ms
SPU: shutdown
SPU: shutdown
SPU: shutdown
SPU: shutdown

real 0m29.220s
user 0m28.715s
sys 0m0.503s
[daniel@cssrev3n1 ibmsdk]$


Daniel
danieltracy

Re: Performance on a Cell Blade

Post by danieltracy »

I've just recently tested for shorter periods of time and in configurations where objects should fall some distance before contact, and get similar results. Curiously, updateAabbs seems to have high overhead. Very strange. Could it be a problem with the profiling?
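One rough way to sanity-check the profiler is to compare its per-frame total with the average wall-clock time per step taken from the `time` output. The sketch below is only that arithmetic; the helper names `msPerStep` and `relativeGap` are made up for illustration and are not part of Bullet.

```cpp
#include <cassert>

// Average wall-clock time per simulation step, in milliseconds.
// totalSeconds: the "real" figure printed by the shell's `time` command.
// steps: number of simulation steps executed during the run.
double msPerStep(double totalSeconds, int steps) {
    assert(steps > 0);
    return totalSeconds * 1000.0 / steps;
}

// Relative difference between the wall-clock average and the profiler's
// per-frame total (positive when the wall clock reports more time).
double relativeGap(double wallMs, double profiledMs) {
    assert(profiledMs > 0.0);
    return (wallMs - profiledMs) / profiledMs;
}
```

With the figures from the two runs, `msPerStep(22.538, 4500)` comes to about 5.0 ms against a profiled 4.656 ms, and `msPerStep(29.220, 4500)` to about 6.5 ms against 6.390 ms. The wall-clock figure also includes process startup and SPU initialization, and the profile covers only the last frame, so this is a coarse check. It does suggest the profiler's frame totals are in the right range.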

Daniel
danieltracy

Re: Performance on a Cell Blade

Post by danieltracy »

Oh yes. You asked what I was using for the tri-mesh. I'm not using a tri-mesh, but simply a box for the floor. The question regarding trimesh on the Cell was not related to the benchmarks.

Daniel
Erwin Coumans

Re: Performance on a Cell Blade

Post by Erwin Coumans »

The SPU only processes the narrowphase collision detection 'dispatchAllCollisionPairs', so you can ignore all other timings.

PPU:
0 -- dispatchAllCollisionPairs (0.05 %) :: 0.001 ms / frame (1 calls)
SPU:
0 -- dispatchAllCollisionPairs (0.95 %) :: 0.033 ms / frame (1 calls)

So there is pretty much no narrowphase collision detection happening in your scene. Can you create a COLLADA .dae snapshot?
The 0.033 ms is negligible overhead; I would ignore it.
So it would be best to create a scene with actual narrowphase work.
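To put a number on why this scene cannot speed up: if only the narrowphase is offloaded, Amdahl's law bounds the overall gain by the fraction of the frame spent there. The sketch below is plain arithmetic with hypothetical helper names (`offloadFraction`, `amdahlSpeedup`); it assumes perfect scaling across SPUs and ignores DMA and synchronization costs, so a real run would do worse.

```cpp
#include <cassert>

// Fraction of a frame spent in the phase that can be offloaded.
double offloadFraction(double phaseMs, double frameMs) {
    assert(frameMs > 0.0 && phaseMs >= 0.0 && phaseMs <= frameMs);
    return phaseMs / frameMs;
}

// Amdahl's law: best-case overall speedup when a fraction p of the work
// is spread perfectly over n workers and the remainder stays serial.
double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

Using the PPU-only timings above (0.001 ms in dispatchAllCollisionPairs out of 4.650 ms in stepSimulation), `amdahlSpeedup(offloadFraction(0.001, 4.650), 4)` stays below 1.001, that is, well under a 0.1% gain even in the ideal case. For comparison, a scene where half the frame is narrowphase would allow up to 1.6x on 4 SPUs.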

Thanks for the timings,
Erwin
danieltracy

Re: Performance on a Cell Blade

Post by danieltracy »

Yes, it appears that you are right, though that was not my intention. It's difficult to see exactly what is happening when you have no visualization. I'm going to reproduce the same scene in a Windows environment to determine what might be wrong.

Daniel
Erwin Coumans

Re: Performance on a Cell Blade

Post by Erwin Coumans »

danieltracy wrote: Yes, it appears that you are right, though that was not my intention. It's difficult to see exactly what is happening when you have no visualization. I'm going to reproduce the same scene in a Windows environment to determine what might be wrong.

Daniel

That is why I asked for a COLLADA .dae snapshot, so you can watch the simulation snapshot under Windows.
It just takes 3 lines of code, and linking against 3 additional libraries included in the Bullet package (Extras/BulletColladaConverter, Extras/COLLADA_DOM, Extras/libxml).

#include "Extras/BulletColladaConverter/ColladaConverter.h"
...
// Take an existing dynamicsWorld, filled with rigid bodies etc., and export it to a COLLADA .dae XML file.
...
ColladaConverter converter(dynamicsWorld);
converter.save("snapshot.dae");

In a future version of Bullet we plan on making binary snapshots a built-in feature, so there will be no need for additional libraries. You can load/view the COLLADA .dae snapshot on other platforms using the ColladaDemo viewer (Bullet/Demos/ColladaDemo): just drag the file snapshot.dae over the ReleaseColladaDemo.exe.

Hope this helps,
Erwin
danieltracy

Re: Performance on a Cell Blade

Post by danieltracy »

I "ported" the files over to a Win32 machine with Bullet because it makes incremental changes easier to visualize, and I haven't used the Collada feature before. We usually stick with what we know. =)

After stabilizing the initial conditions correctly, using SPUs does indeed speed up the program, sometimes by a little and sometimes by a lot. In my tests it speeds up only slightly when objects are not near very many other objects (island size of 1, 9% speedup). Under those conditions, most of the time (~50%) is taken up by solveConstraints. But when clusters of boxes are near or in contact, the speedup is severalfold.

Strangely enough, in this case, it goes into that funk where most of the time is taken up by updateAabbs() with the remaining 20% in predictUnconstraintMotion(). It may be possible that the results are different due to inaccuracies in the Cell computations? I started all the boxes precisely touching. I'll have to do that Collada thing Manana. =)

P.S. I saw your name on the IEEE-VR 2009 roster. You're hosting the "physics for VR" tutorial/workshop? I'm going to try to be there. I'll be giving a talk on a paper on sweep and prune.

Daniel
Erwin Coumans

Re: Performance on a Cell Blade

Post by Erwin Coumans »

danieltracy wrote: It may be possible that the results are different due to inaccuracies in the Cell computations?

Yes, any change can result in diverging simulation results. Note that we have an SPU-optimized sweep and prune broadphase and also an SPU-optimized constraint solver. Both will be released at some stage.

danieltracy wrote: P.S. I saw your name on the IEEE-VR 2009 roster. You're hosting the "physics for VR" tutorial/workshop? I'm going to try to be there. I'll be giving a talk on a paper on sweep and prune.
Daniel

Yes, we are planning to do this "Physics for VR" workshop, would be nice to meet you there! Do you think you can make it Saturday?

Thanks,
Erwin
danieltracy

Re: Performance on a Cell Blade

Post by danieltracy »

I'm going to try to be there Saturday. Main issue is registration. I'm trying to upgrade from single-day within time and cost constraints. I'm fairly new to Bullet, but I'm quite impressed with its performance.

After getting some Collada output (before and after simulation), it appears that the SPU-enabled simulation does have slightly different properties, and that caused my finely-tuned stable box stacks to fall into a heap of chaos, which drastically increased the number of contacts (thus nullifying the comparative results). Testing on Intel couldn't show this.
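As a small illustration of how two builds of the same code can drift apart, the sketch below sums the same three single-precision values in two different orders and gets two different answers. It is not taken from Bullet and does not claim to be what happened on the SPUs; it only shows that float arithmetic is sensitive to operation order and rounding, which is enough to tip a finely balanced stack. The function names are made up for the example.

```cpp
#include <cassert>

// Sum three floats left to right: (a + b) + c.
// The volatile temporaries force each intermediate to be rounded to float.
float sumLeft(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

// Sum the same three floats right to left: a + (b + c).
float sumRight(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

// True when both orders agree exactly for the given inputs.
bool ordersAgree(float a, float b, float c) {
    assert(a == a && b == b && c == c);  // reject NaN inputs
    return sumLeft(a, b, c) == sumRight(a, b, c);
}
```

With `a = 1.0e8f`, `b = -1.0e8f`, `c = 1.0f` and the default round-to-nearest mode, `sumLeft` returns 1 while `sumRight` returns 0, because adding 1 to -1e8 is lost to rounding at that magnitude. In a physics step, a difference of this kind in one contact can grow over thousands of frames.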

Getting Collada enabled on the Cell blade wasn't trivial for me, though there may have been a better way. The ibmsdk Makefiles do not build the necessary libraries, so I went to the root directory and generated Makefiles for ppc-linux. Then I modified the Makefile in Extras to call ppu32-gcc with some other switches I copied from ibmsdk to prevent linking issues (not doing this results in crashes when pointers are dereferenced across object modules at runtime).

Daniel

P.S. My paper makes several references to your forums with regard to modern sweep and prune features, as I could not find references to them in published literature. It appears that the private industry progressed quite a bit, and sweep and prune hasn't been covered specifically for a long time. I've seen recent papers on subdivision techniques, but I can't find a paper significantly covering sweep and prune since iCollide! (apart from Daniel Coming's work, actually). Just giving you a heads up. Hope you guys don't mind. It's a bit strange to include URLs as references, but the situation is also a bit strange.

SPU-accelerated sweep and prune? Interesting. I considered trying that with batch insertion and removal techniques, doing merge-sorting and swapping using vector permutations and mapping spaces to SPUs, but I don't have practical experience in Cell/SPU programming yet.
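For readers who have not met the technique, here is a minimal single-axis sketch of the batch form of sweep and prune (often called sort and sweep): sort intervals by their lower bound, then scan forward while the next interval still starts inside the current one. It is a generic illustration with made-up names, not Bullet's broadphase and not the SPU version mentioned above; incremental variants keep the endpoint lists sorted between frames instead of re-sorting.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Projection of one bounding box onto a single axis.
struct Interval {
    float lo;
    float hi;
    int id;
};

// Returns all pairs of ids whose intervals overlap on the axis,
// each pair ordered (smaller id, larger id), the list sorted.
std::vector<std::pair<int, int> > sweepAndPrune1D(std::vector<Interval> boxes) {
    // Sort by lower bound so overlap candidates are contiguous in the scan.
    std::sort(boxes.begin(), boxes.end(),
              [](const Interval& a, const Interval& b) { return a.lo < b.lo; });

    std::vector<std::pair<int, int> > pairs;
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        // Stop as soon as a later interval starts past the end of boxes[i].
        for (std::size_t j = i + 1;
             j < boxes.size() && boxes[j].lo <= boxes[i].hi; ++j) {
            pairs.emplace_back(std::min(boxes[i].id, boxes[j].id),
                               std::max(boxes[i].id, boxes[j].id));
        }
    }
    std::sort(pairs.begin(), pairs.end());
    return pairs;
}
```

A full broadphase would repeat this on three axes (or prune on one and test the other two directly) and report only pairs overlapping on all of them.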