Ragdoll performance

S.Lundmark
Posts: 50
Joined: Thu Jul 09, 2009 1:46 pm

Post by S.Lundmark »

Hi,

I'm currently experiencing performance issues with multiple interacting ragdolls. My timings show that four interacting ragdolls cost roughly 10 ms of simulation time in Bullet. We're running Bullet 2.74.

My scene setup: the static geometry is a single btCompoundShape containing hundreds of convex shapes. I could optimize this so that most of them are boxes, but since Bullet has very few box-vs-X special cases I see little reason to attempt it.

My ragdolls are made of multiple rigid bodies, each holding a btCompoundShape with a single btCapsuleShape inside it. The only reason for the btCompoundShape is to apply a local transform to the capsule.

Using the bullet-performance tool (btQuickprof) I see that a lot of the time is spent in the btCompoundShape-lookups (the dynamic tree), almost 50% of the collision-detection. The rest is done by ConvexConvexCollisionAlgorithm.

I am unable to see whether the time is spent in ragdoll vs ragdoll collision or in ragdoll vs scene (static geometry) collision. Are any optimizations planned for Bullet that would help in my situation? I have thought about adding a CapsuleCapsuleCollisionAlgorithm, and maybe some kind of transform shape to avoid part of the btCompoundShape vs btCompoundShape overhead. My last idea is to implement a kd-tree similar to what Christer Ericson suggests in his book. The tree would be used for the static geometry (the scene) to speed up lookups into it.

Since most of these tasks are quite time-consuming (except perhaps the capsule-capsule one), any pointers, guidelines or thoughts about them would be greatly appreciated. Most of the work would be submitted to Bullet.
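For reference, the capsule-capsule idea can be sketched without any Bullet types. A capsule is a line segment swept by a sphere, so two capsules overlap when the distance between their axis segments is at most the sum of their radii. The block below is only a minimal sketch under that assumption: `Vec3`, `Capsule`, `segmentSegmentDistSq` and `capsulesOverlap` are hypothetical names, not Bullet API, and a real collision algorithm would also have to produce contact points and normals.

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline double clamp01(double v) { return v < 0.0 ? 0.0 : (v > 1.0 ? 1.0 : v); }

// Hypothetical capsule: the segment [a, b] swept by a sphere of the given radius.
struct Capsule { Vec3 a, b; double radius; };

// Squared distance between segments [p1, q1] and [p2, q2], using the usual
// clamped closest-point computation for two segments.
inline double segmentSegmentDistSq(Vec3 p1, Vec3 q1, Vec3 p2, Vec3 q2) {
    const double eps = 1e-12;
    Vec3 d1 = q1 - p1, d2 = q2 - p2, r = p1 - p2;
    double a = dot(d1, d1), e = dot(d2, d2), f = dot(d2, r);
    double s = 0.0, t = 0.0;
    if (a <= eps && e <= eps) {
        // Both segments degenerate to points: s = t = 0.
    } else if (a <= eps) {
        t = clamp01(f / e);              // first segment is a point
    } else {
        double c = dot(d1, r);
        if (e <= eps) {
            s = clamp01(-c / a);         // second segment is a point
        } else {
            double b = dot(d1, d2);
            double denom = a * e - b * b;  // zero when the segments are parallel
            if (denom > eps) s = clamp01((b * f - c * e) / denom);
            t = (b * s + f) / e;
            // If t left [0, 1], clamp it and recompute s for the clamped t.
            if (t < 0.0)      { t = 0.0; s = clamp01(-c / a); }
            else if (t > 1.0) { t = 1.0; s = clamp01((b - c) / a); }
        }
    }
    Vec3 diff = (p1 + d1 * s) - (p2 + d2 * t);
    return dot(diff, diff);
}

// Overlap test only: compares squared distances to avoid a square root.
inline bool capsulesOverlap(const Capsule& c1, const Capsule& c2) {
    double rsum = c1.radius + c2.radius;
    return segmentSegmentDistSq(c1.a, c1.b, c2.a, c2.b) <= rsum * rsum;
}
```

The appeal of such a special case is that it is a handful of dot products per pair, with no iteration, which is why it is the cheapest of the ideas listed above to try.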

Cheers,
Simon.
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Post by Erwin Coumans »

10 ms for 4 ragdolls sounds like too much. Is this in an optimized/release build? It depends on the machine, but in a similar scenario with a simplified static environment the time for 5 ragdolls was less than 3 ms.

Is it really necessary to use a btCompoundShape around the btCapsuleShape? There are a few ragdoll examples that use btCapsuleShape only, in combination with either btConeTwistConstraint or btGeneric6DofConstraint.
It is best to avoid a btCompoundShape for each limb, but if you can't, you can disable the dynamic tree by passing false to the btCompoundShape constructor:

Code: Select all

btCompoundShape(bool enableDynamicAabbTree = true);

S.Lundmark wrote: My last idea is to implement a kd-tree
The current acceleration structures should be fast, and they are not likely to be the bottleneck. Can you provide statistics using CProfileManager::dumpAll(), right after the stepSimulation call? If the performance is still slow after removing the btCompoundShape around each limb, we can consider an optimized capsule-capsule test.

Thanks,
Erwin

Post by S.Lundmark »

Hi Erwin,

And thanks for the quick reply!

Yes, it is a release build, on win32 with a high-end machine. I suspect that quite some time is being taken up by collision detection against the static geometry, but I'm unable to verify that.

I will provide you with a dumpAll tomorrow, since I've left work for the day. I'm implementing a test version of the capsule-capsule collision to get a better view of what's taking all the time. I'll also try removing the tree from the single-shape btCompoundShapes, and I'll look into removing the compounds completely, but I think that would require a lot of work.

Cheers,
Simon.

Post by S.Lundmark »

Hi again,

I've removed the dynamic AABB tree from the ragdoll rigid bodies. Unfortunately, the reason the compound exists there sits too deep in our chain for me to solve right now. The results were slightly better: 7-8 ms for 4 ragdolls. What worries me is that there seem to be 299 collision pairs, since dispatchAllCollisionPairs makes 598 calls and is itself called twice per frame. This is without a CapsuleCapsuleCollisionAlgorithm (it's nearly done).

I've added some extra profiled scopes that the dumpall will show.

The dumpall for 4 ragdolls in a heap together with 1 convex rigidbody:


Profiling: stepSimulation (total running time: 7.420 ms) ---
...0 -- synchronizeMotionStates (0.03 %) :: 0.002 ms / frame (3 calls)
...1 -- internalSingleStepSimulation (99.68 %) :: 7.396 ms / frame (2 calls)
Unaccounted: (0.297 %) :: 0.022 ms
----------------------------------
Profiling: internalSingleStepSimulation (total running time: 7.396 ms) ---
......0 -- updateActivationState (0.11 %) :: 0.008 ms / frame (2 calls)
......1 -- updateActions (0.00 %) :: 0.000 ms / frame (2 calls)
......2 -- integrateTransforms (0.70 %) :: 0.052 ms / frame (2 calls)
......3 -- solveConstraints (11.88 %) :: 0.879 ms / frame (2 calls)
......4 -- calculateSimulationIslands (0.54 %) :: 0.040 ms / frame (2 calls)
......5 -- performDiscreteCollisionDetection (85.86 %) :: 6.350 ms / frame (2 calls)
......6 -- predictUnconstraintMotion (0.80 %) :: 0.059 ms / frame (2 calls)
Unaccounted: (0.108 %) :: 0.008 ms
----------------------------------
Profiling: solveConstraints (total running time: 0.879 ms) ---
.........0 -- processIslands (83.39 %) :: 0.733 ms / frame (2 calls)
.........1 -- islandUnionFindAndQuickSort (3.19 %) :: 0.028 ms / frame (2 calls)
Unaccounted: (13.424 %) :: 0.118 ms
----------------------------------
Profiling: processIslands (total running time: 0.733 ms) ---
............0 -- solveGroup (97.14 %) :: 0.712 ms / frame (2 calls)
Unaccounted: (2.865 %) :: 0.021 ms
----------------------------------
Profiling: solveGroup (total running time: 0.712 ms) ---
...............0 -- solveGroupCacheFriendlyIterations (42.98 %) :: 0.306 ms / frame (2 calls)
...............1 -- solveGroupCacheFriendlySetup (55.62 %) :: 0.396 ms / frame (2 calls)
Unaccounted: (1.404 %) :: 0.010 ms
----------------------------------
Profiling: solveGroupCacheFriendlyIterations (total running time: 0.306 ms) ---
..................0 -- solveFrictionConstraintsSIMD (16.34 %) :: 0.050 ms / frame (20 calls)
..................1 -- solveContactConstraintsSIMD (16.67 %) :: 0.051 ms / frame (20 calls)
..................2 -- solveJointConstraintsSIMD (55.56 %) :: 0.170 ms / frame (20 calls)
Unaccounted: (11.438 %) :: 0.035 ms
----------------------------------
Profiling: performDiscreteCollisionDetection (total running time: 6.350 ms) ---
.........0 -- dispatchAllCollisionPairs (98.96 %) :: 6.284 ms / frame (2 calls)
.........1 -- calculateOverlappingPairs (0.19 %) :: 0.012 ms / frame (2 calls)
.........2 -- updateAabbs (0.80 %) :: 0.051 ms / frame (2 calls)
Unaccounted: (0.047 %) :: 0.003 ms
----------------------------------
Profiling: dispatchAllCollisionPairs (total running time: 6.284 ms) ---
............0 -- processConvexConvexCollision (0.00 %) :: 0.000 ms / frame (0 calls)
............1 -- processCompoundCollision (94.68 %) :: 5.950 ms / frame (598 calls)
Unaccounted: (5.315 %) :: 0.334 ms
----------------------------------
Profiling: processCompoundCollision (total running time: 5.950 ms) ---
...............0 -- Compound-Child (32.45 %) :: 1.931 ms / frame (504 calls)
...............1 -- Compound-Child2 (12.05 %) :: 0.717 ms / frame (598 calls)
...............2 -- DBVT-Query (28.71 %) :: 1.708 ms / frame (94 calls)
Unaccounted: (26.790 %) :: 1.594 ms
----------------------------------
Profiling: Compound-Child (total running time: 1.931 ms) ---
..................0 -- processConvexConvexCollision (9.58 %) :: 0.185 ms / frame (38 calls)
..................1 -- processCompoundCollision (69.19 %) :: 1.336 ms / frame (194 calls)
Unaccounted: (21.233 %) :: 0.410 ms
----------------------------------
Profiling: processCompoundCollision (total running time: 1.336 ms) ---
.....................0 -- Compound-Child2 (7.78 %) :: 0.104 ms / frame (194 calls)
.....................1 -- Compound-Child (57.41 %) :: 0.767 ms / frame (194 calls)
Unaccounted: (34.805 %) :: 0.465 ms
----------------------------------
Profiling: Compound-Child (total running time: 0.767 ms) ---
........................0 -- processConvexConvexCollision (72.10 %) :: 0.553 ms / frame (194 calls)
Unaccounted: (27.901 %) :: 0.214 ms
----------------------------------
Profiling: DBVT-Query (total running time: 1.708 ms) ---
..................0 -- processConvexConvexCollision (4.68 %) :: 0.080 ms / frame (8 calls)
..................1 -- processCompoundCollision (66.92 %) :: 1.143 ms / frame (116 calls)
Unaccounted: (28.396 %) :: 0.485 ms
----------------------------------
Profiling: processCompoundCollision (total running time: 1.143 ms) ---
.....................0 -- Compound-Child2 (5.77 %) :: 0.066 ms / frame (116 calls)
.....................1 -- Compound-Child (69.64 %) :: 0.796 ms / frame (116 calls)
Unaccounted: (24.584 %) :: 0.281 ms
----------------------------------
Profiling: Compound-Child (total running time: 0.796 ms) ---
........................0 -- processConvexConvexCollision (85.80 %) :: 0.683 ms / frame (116 calls)
Unaccounted: (14.196 %) :: 0.113 ms


Thanks :)

Simon

Post by S.Lundmark »

Some explanations of the scopes that I added:

"DBVT-Query" is the if(tree) scope in btCompoundCollisionAlgorithm::processCollision. "Compound-Child" is the else scope of that if. "Compound-Child2" is the iteration over numChildren in the same function. If I'm reading the dumpAll correctly, are the 94 calls for DBVT-Query in the first processCompoundCollision the queries against my static geometry? 1.7 ms for 94 calls seems quite high, doesn't it? Compound-Child in the same block should be the ragdoll-ragdoll collision, which is only 1.9 ms for 504 calls. That sounds OK?

/Simon

Post by S.Lundmark »

Removing my scopes gave a better measurement of performance, 6 ms for 5 ragdolls.


Profiling: stepSimulation (total running time: 5.970 ms) ---
...0 -- synchronizeMotionStates (0.07 %) :: 0.004 ms / frame (3 calls)
...1 -- internalSingleStepSimulation (99.55 %) :: 5.943 ms / frame (2 calls)
Unaccounted: (0.385 %) :: 0.023 ms
----------------------------------
Profiling: internalSingleStepSimulation (total running time: 5.943 ms) ---
......0 -- updateActivationState (0.12 %) :: 0.007 ms / frame (2 calls)
......1 -- updateActions (0.02 %) :: 0.001 ms / frame (2 calls)
......2 -- integrateTransforms (1.16 %) :: 0.069 ms / frame (2 calls)
......3 -- solveConstraints (15.38 %) :: 0.914 ms / frame (2 calls)
......4 -- calculateSimulationIslands (0.54 %) :: 0.032 ms / frame (2 calls)
......5 -- performDiscreteCollisionDetection (81.63 %) :: 4.851 ms / frame (2 calls)
......6 -- predictUnconstraintMotion (1.03 %) :: 0.061 ms / frame (2 calls)
Unaccounted: (0.135 %) :: 0.008 ms
----------------------------------
Profiling: solveConstraints (total running time: 0.914 ms) ---
.........0 -- processIslands (95.73 %) :: 0.875 ms / frame (2 calls)
.........1 -- islandUnionFindAndQuickSort (2.63 %) :: 0.024 ms / frame (2 calls)
Unaccounted: (1.641 %) :: 0.015 ms
----------------------------------
Profiling: processIslands (total running time: 0.875 ms) ---
............0 -- solveGroup (97.71 %) :: 0.855 ms / frame (2 calls)
Unaccounted: (2.286 %) :: 0.020 ms
----------------------------------
Profiling: solveGroup (total running time: 0.855 ms) ---
...............0 -- solveGroupCacheFriendlyIterations (40.00 %) :: 0.342 ms / frame (2 calls)
...............1 -- solveGroupCacheFriendlySetup (59.18 %) :: 0.506 ms / frame (2 calls)
Unaccounted: (0.819 %) :: 0.007 ms
----------------------------------
Profiling: performDiscreteCollisionDetection (total running time: 4.851 ms) ---
.........0 -- dispatchAllCollisionPairs (98.17 %) :: 4.762 ms / frame (2 calls)
.........1 -- calculateOverlappingPairs (0.25 %) :: 0.012 ms / frame (2 calls)
.........2 -- updateAabbs (1.50 %) :: 0.073 ms / frame (2 calls)
Unaccounted: (0.082 %) :: 0.004 ms

Edit:
Upon discovering btCapsuleShapeX, I could use that instead of the rotation. This improved performance even more: with 8 stacked ragdolls I now measure about 8 ms, with roughly the same dumpAll breakdown. Still, I would like to improve this further if possible. Thanks for all the suggestions, they have helped a lot!

Post by Erwin Coumans »

Can you please try again to remove the btCompoundShape from each ragdoll limb, and provide DumpAll timings when only using btCapsuleShape(X/Z)?

Inserting additional timers in inner loops hurts performance a lot, as you already noticed, so it is best not to add any when profiling.
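To illustrate where that overhead comes from, here is a minimal sketch of a generic RAII scope timer. It is not Bullet's profiler: `ScopedTimer`, `ScopeStats`, `scopeTable` and `processCollisionSketch` are hypothetical names, and the scope labels are simply borrowed from the dump above. Every instrumented scope pays for two clock reads plus a table lookup, and in a function entered hundreds of times per step that cost ends up inside the very numbers being measured.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulated time and call count for one named scope.
struct ScopeStats {
    double totalMs = 0.0;
    int calls = 0;
};

// Hypothetical global table keyed by scope name (an assumption for this sketch).
inline std::map<std::string, ScopeStats>& scopeTable() {
    static std::map<std::string, ScopeStats> table;
    return table;
}

// Starts timing on construction, records elapsed time on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        auto end = std::chrono::steady_clock::now();           // second clock read
        ScopeStats& stats = scopeTable()[name_];                // lookup per call
        stats.totalMs +=
            std::chrono::duration<double, std::milli>(end - start_).count();
        stats.calls += 1;
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Stand-in for a function with an if/else split, each branch timed separately.
inline void processCollisionSketch(bool hasTree) {
    if (hasTree) {
        ScopedTimer timer("DBVT-Query");
        // ... tree query would go here ...
    } else {
        ScopedTimer timer("Compound-Child");
        // ... per-child iteration would go here ...
    }
}
```

When the timed body is itself only a few microseconds of work, the bookkeeping can be a noticeable fraction of the reported time, so coarse scopes around whole phases give more trustworthy totals.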

Hope this helps,
Erwin

Post by S.Lundmark »

Hi again Erwin,

These are my latest results. The simulation is from 6 ragdolls. I've discovered some issues that I previously had not seen.
Once a ragdoll starts to deactivate (when a few of the body parts are marked as "trying to become inactive", as seen in the debug drawing), the limbs that are trying to become inactive seem to rotate in awkward directions. This sometimes leads to interpenetrations with the static geometry. By awkward directions, I mean that they seem unaffected by gravity and only slightly influenced by the rest of the body's attempt to relax under gravity. Is there something we could have set up wrong that would cause this?

The following dumpAll is for 4 ragdolls in the same simulation island (all interacting with each other), with only btCapsuleShapes directly under the rigid bodies. I did some initial testing with a capsule-capsule algorithm, and it seems to cut the time by almost one millisecond in my case (not included here). But I'd still like to get even better performance out of this. What performance improvement can we expect from 2.75?

Profiling: stepSimulation (total running time: 4.221 ms) ---
...0 -- synchronizeMotionStates (0.21 %) :: 0.009 ms / frame (3 calls)
...1 -- internalSingleStepSimulation (99.22 %) :: 4.188 ms / frame (2 calls)
Unaccounted: (0.569 %) :: 0.024 ms
----------------------------------
Profiling: internalSingleStepSimulation (total running time: 4.188 ms) ---
......0 -- updateActivationState (0.43 %) :: 0.018 ms / frame (2 calls)
......1 -- updateActions (0.02 %) :: 0.001 ms / frame (2 calls)
......2 -- integrateTransforms (1.38 %) :: 0.058 ms / frame (2 calls)
......3 -- solveConstraints (16.74 %) :: 0.701 ms / frame (2 calls)
......4 -- calculateSimulationIslands (2.24 %) :: 0.094 ms / frame (2 calls)
......5 -- performDiscreteCollisionDetection (78.65 %) :: 3.294 ms / frame (2 calls)
......6 -- predictUnconstraintMotion (0.31 %) :: 0.013 ms / frame (2 calls)
Unaccounted: (0.215 %) :: 0.009 ms
----------------------------------
Profiling: solveConstraints (total running time: 0.701 ms) ---
.........0 -- processIslands (91.44 %) :: 0.641 ms / frame (2 calls)
.........1 -- islandUnionFindAndQuickSort (5.14 %) :: 0.036 ms / frame (2 calls)
Unaccounted: (3.424 %) :: 0.024 ms
----------------------------------
Profiling: processIslands (total running time: 0.641 ms) ---
............0 -- solveGroup (94.07 %) :: 0.603 ms / frame (2 calls)
Unaccounted: (5.928 %) :: 0.038 ms
----------------------------------
Profiling: solveGroup (total running time: 0.603 ms) ---
...............0 -- solveGroupCacheFriendlyIterations (34.16 %) :: 0.206 ms / frame (2 calls)
...............1 -- solveGroupCacheFriendlySetup (63.85 %) :: 0.385 ms / frame (2 calls)
Unaccounted: (1.990 %) :: 0.012 ms
----------------------------------
Profiling: performDiscreteCollisionDetection (total running time: 3.294 ms) ---
.........0 -- dispatchAllCollisionPairs (97.12 %) :: 3.199 ms / frame (2 calls)
.........1 -- calculateOverlappingPairs (0.64 %) :: 0.021 ms / frame (2 calls)
.........2 -- updateAabbs (2.13 %) :: 0.070 ms / frame (2 calls)
Unaccounted: (0.121 %) :: 0.004 ms

Post by Erwin Coumans »

So this is without btCompoundShape for each limb right?
S.Lundmark wrote: This sometimes leads to interpenetrations with the static geometry.
What version of Bullet are you using exactly? Is this using your modified convex-convex, capsule-capsule test? Can you make sure to turn off the following optimization? It can cause penetrations in older Bullet versions:

Code: Select all

dynamicsWorld->getDispatchInfo().m_useConvexConservativeDistanceUtil = false;

S.Lundmark wrote:
...1 -- internalSingleStepSimulation (99.22 %) :: 4.188 ms / frame (2 calls)
[...]
......5 -- performDiscreteCollisionDetection (78.65 %) :: 3.294 ms / frame (2 calls)
This is about 2 ms to simulate 4 ragdolls per single simulation frame, with 75% of the total time spent in narrowphase collision detection (calculating contact points). It is clearer to use a single simulation frame for timing/benchmark purposes.
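The arithmetic behind those two figures can be reproduced from the dump. The block below is only a worked example: `ProfileEntry`, `msPerCall` and `percentOf` are hypothetical helpers, and the numbers are copied from the dumpAll output posted above (stepSimulation 4.221 ms, internalSingleStepSimulation 4.188 ms over 2 calls, dispatchAllCollisionPairs 3.199 ms).

```cpp
#include <cassert>
#include <cmath>

// One line of a profile dump: time accumulated per frame and number of calls.
struct ProfileEntry {
    double msPerFrame;
    int calls;
};

// Average cost of one call, e.g. one internal simulation substep.
inline double msPerCall(const ProfileEntry& entry) {
    return entry.msPerFrame / entry.calls;
}

// Share of `part` in `whole`, in percent.
inline double percentOf(double partMs, double wholeMs) {
    return 100.0 * partMs / wholeMs;
}

// Figures copied from the posted dump.
inline double substepCostMs() {
    ProfileEntry internalStep{4.188, 2};
    return msPerCall(internalStep);          // 4.188 / 2 = 2.094 ms per substep
}

inline double narrowphaseSharePercent() {
    const double stepSimulationMs = 4.221;
    const double dispatchAllPairsMs = 3.199;
    return percentOf(dispatchAllPairsMs, stepSimulationMs);  // roughly 75.8 %
}
```

Dividing by the call count matters here because the dump reports time per frame, and each frame in this setup contains two internal substeps.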

Why do you need 2 simulation frames?
Thanks,
Erwin

Post by S.Lundmark »

We're using exactly 2.74 with the SPU optimizations (although these measurements are from win32). This is without my modified capsule-capsule test. I'll try turning off the convex conservative distance util, but it seems that when parts of the ragdoll chain are trying to deactivate, gravity is disabled for them while they keep moving with the velocities they had on the frame they started trying to deactivate. I haven't debugged it any further than that.
Erwin Coumans wrote: Why do you need 2 simulation frames?
We're using 2 simulation frames because the part of our engine that runs Bullet is simulated at 30 Hz (usually), but can sometimes drop to 15 Hz. Bullet compensates for this by itself, if I remember correctly. I thought it was recommended to run Bullet at 60 Hz (splitting up timesteps) for best behavior?
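The "2 calls" in the dumps follow from this setup. stepSimulation takes a time step, a maximum number of substeps and a fixed internal time step, and runs as many fixed substeps as fit into the accumulated time. The block below is a simplified sketch of that accumulator under those assumptions; `FixedStepClock` and `advance` are hypothetical names, not Bullet code, and the real function also interpolates motion states between substeps.

```cpp
#include <cassert>

// Simplified fixed-timestep accumulator (hypothetical, for illustration only).
struct FixedStepClock {
    double localTime = 0.0;  // time accumulated but not yet simulated

    // Returns how many fixed-size substeps to run for this frame.
    int advance(double timeStep, int maxSubSteps, double fixedTimeStep) {
        localTime += timeStep;
        int subSteps = 0;
        if (localTime >= fixedTimeStep) {
            subSteps = static_cast<int>(localTime / fixedTimeStep);
            localTime -= subSteps * fixedTimeStep;
        }
        // Clamp so that a slow frame cannot trigger an unbounded number of steps.
        return subSteps > maxSubSteps ? maxSubSteps : subSteps;
    }
};
```

With a 30 Hz frame and a 60 Hz internal step this gives 2 substeps per call, and 4 when the frame rate drops to 15 Hz, provided the maximum number of substeps allows it. If the cap is lower than that, the simulation loses time rather than catching up.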
Erwin Coumans wrote: So this is without btCompoundShape for each limb right?
That is correct.
Erwin Coumans wrote: This is about 2ms to simulate 4 ragdolls per single simulation frame, with 75% of total time spend in narrowphase collision detection (calculating contact points). It is more clear to use a single simulation frame for timing/benchmark purposes.
Yes, but that still seems like a lot of narrowphase time for only 4 ragdolls, and it seems quite clear that the representation of the geometry the ragdolls collide with is what takes up much of the time. Doing the same thing in our engine with a much simpler version of the static geometry brings the narrowphase for 1 ragdoll down to 0.5 ms, while 1 ragdoll in the case above costs more than 1 ms. I haven't looked into the dynamic tree used by btCompoundShape, but how well does it scale? The geometry represented here is thousands of convex shapes. I'll try to do some performance measurements on the PS3 with the SPU code disabled (we had some box-collision issues that I haven't resolved yet), and maybe we can get a more exact analysis with a more detailed tool.

Thanks,
Simon

Post by S.Lundmark »

I have found the cause of the ragdoll behavior. It was the optimization in http://code.google.com/p/bullet/issues/detail?id=243. It seems that the optimization prevented some parts of the simulation island from simulating when they wanted deactivation but weren't yet deactivated. I haven't really dug into what happened, but reverting only that change restored the expected behavior.

Post by S.Lundmark »

I've started looking into profiling on the PS3 PPU. Do you know of any simple way for me to profile the time spent in the tree lookup for the compound collision algorithm? Everything seems to be inlined, and if I extract the lookup scope in processCollision() in btCompoundCollisionAlgorithm, I think I also get the time spent in narrowphase collision detection in functions called from lower scopes.

At the moment, it seems like 30% or so of my entire frame is spent in the lookup tree and below. Roughly 50% of the time is spent in the compound collision algorithm. This is with only 3 ragdolls, and the entire Bullet frame costs up to 30 ms (on the PPU). I'm not really worried about the PS3, since that'll just go out to the ppu's; I'm more worried about win32/xenon performance.

Any tips/guidelines on further optimizations are greatly appreciated.

Cheers
/Simon

Post by Erwin Coumans »

S.Lundmark wrote:I've started looking into profiling on the ps3-ppu
btCompoundShape hasn't been optimized for static world geometry; it is best to use static triangle meshes through btBvhTriangleMeshShape. It uses btOptimizedBvh, an AABB tree acceleration structure with traversal optimized for SPUs. Is it possible to switch to btBvhTriangleMeshShape?

If you really want to use static world geometry with volumetric convex parts instead of triangles, we can help to reuse the btOptimizedBvh with convex shapes as child shapes (instead of triangles): btOptimizedBvh is faster than btDbvt and can be traversed on SPU. Are you a licensed PS3 developer? If so, I recommend using the spubullet-2.75 release and getting in touch through PS3 Devnet to put in a request for this.
Thanks,
Erwin

Post by S.Lundmark »

Hi again Erwin,

I'll bring this up on ps3 devnet.

Thanks!

/Simon