Performance "problem" with large numbers of static objects

mp-nico
Posts: 5
Joined: Mon Feb 23, 2009 10:20 am


Post by mp-nico »

Hi all,

Recently I was wondering why stepSimulation() took about 10% of our CPU time even though we only have a single _deactivated_ dynamic object and a lot (~10000) of static objects in a scene.

I debugged into Bullet 2.73-sp1 and saw this inside stepSimulation(...)

Code:

	if (numSimulationSubSteps)
	{

		saveKinematicState(fixedTimeStep);

		applyGravity();

		//clamp the number of substeps, to prevent simulation grinding spiralling down to a halt
		int clampedSimulationSteps = (numSimulationSubSteps > maxSubSteps)? maxSubSteps : numSimulationSubSteps;

		for (int i=0;i<clampedSimulationSteps;i++)
		{
			internalSingleStepSimulation(fixedTimeStep);
			synchronizeMotionStates();
		}

	} 

	synchronizeMotionStates();

	clearForces();
Functions such as saveKinematicState(), synchronizeMotionStates(), applyGravity() and clearForces()
each do something like this:

Code:

	for (int i=0;i<m_collisionObjects.size();i++)
	{
		btCollisionObject* colObj = m_collisionObjects[i];
		btRigidBody* body = btRigidBody::upcast(colObj);
		if (body)
		{
				...
		}
	}
There's even more of this inside internalSingleStepSimulation()

I'm aware that a single upcast from btCollisionObject to btRigidBody isn't that costly on its own.
But I figure it becomes relevant if you have 10000 collision objects and a fixed timestep of 1/60.

This is only an estimate, but let's say there are 10 upcast loops per stepSimulation call:
10 * 10000 * 60 = 6,000,000 upcasts per second. That's a little much ...

I made a small test with a base class and 2 different derived classes:
6,000,000 casts take about 180 milliseconds on my PC.
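
The estimate above can be reproduced with a small counting sketch. It assumes the upcast is a cheap type-tag check followed by a cast; `CollisionObject`, `RigidBody`, `UpcastStats` and `countUpcasts` are hypothetical stand-ins, not the Bullet classes, and the sketch only counts casts rather than timing them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for a collision object hierarchy (NOT the Bullet classes).
struct CollisionObject {
    enum Type { kCollisionObject, kRigidBody };
    explicit CollisionObject(Type t = kCollisionObject) : type(t) {}
    virtual ~CollisionObject() = default;
    Type type;
};

struct RigidBody : CollisionObject {
    RigidBody() : CollisionObject(kRigidBody) {}
    // Assumed upcast scheme: check a type tag, then cast.
    static RigidBody* upcast(CollisionObject* obj) {
        return obj->type == kRigidBody ? static_cast<RigidBody*>(obj) : nullptr;
    }
};

struct UpcastStats {
    std::size_t attempts;  // how many upcasts were performed
    std::size_t hits;      // how many of them found a rigid body
};

// Simulates one second of stepping: every step runs `loopsPerStep`
// passes over the whole world, and each pass upcasts every object.
UpcastStats countUpcasts(const std::vector<CollisionObject*>& world,
                         int loopsPerStep, int stepsPerSecond) {
    assert(loopsPerStep >= 0 && stepsPerSecond >= 0);
    UpcastStats stats{0, 0};
    for (int step = 0; step < stepsPerSecond; ++step) {
        for (int loop = 0; loop < loopsPerStep; ++loop) {
            for (CollisionObject* obj : world) {
                ++stats.attempts;
                if (RigidBody::upcast(obj)) ++stats.hits;
            }
        }
    }
    return stats;
}
```

With 9999 plain objects and one rigid body in `world`, `countUpcasts(world, 10, 60)` performs six million upcast attempts, of which only 600 find a rigid body.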

My suggestion is to optimize this by doing the upcast only once for each object and storing the pointers in a separate list, which is then used by all the functions that follow. Another (better?) option would be to insert rigid bodies into that list when addRigidBody() is called (and remove them from the list when removeRigidBody() is called).

This would cost some memory of course.
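
A minimal sketch of the second option, assuming a toy world class: `World`, `CollisionObject` and `RigidBody` here are hypothetical stand-ins rather than the Bullet types, and only the method names addRigidBody()/removeRigidBody() come from the post. The rigid-body list is maintained at add/remove time, so per-step loops never need to upcast.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins (NOT the Bullet classes).
struct CollisionObject {
    virtual ~CollisionObject() = default;
};

struct RigidBody : CollisionObject {
    float mass = 1.0f;
    float force = 0.0f;
};

class World {
public:
    void addCollisionObject(CollisionObject* obj) {
        assert(obj != nullptr);
        m_collisionObjects.push_back(obj);
    }

    // Rigid bodies go into both lists: one extra pointer per body.
    void addRigidBody(RigidBody* body) {
        assert(body != nullptr);
        m_collisionObjects.push_back(body);
        m_rigidBodies.push_back(body);
    }

    void removeRigidBody(RigidBody* body) {
        eraseOne(m_collisionObjects, static_cast<CollisionObject*>(body));
        eraseOne(m_rigidBodies, body);
    }

    // Visits only the rigid bodies; returns how many were touched.
    std::size_t applyGravity(float g) {
        for (RigidBody* body : m_rigidBodies) body->force = body->mass * g;
        return m_rigidBodies.size();
    }

    std::size_t numCollisionObjects() const { return m_collisionObjects.size(); }

private:
    // Removes the first occurrence of p, if present (linear search).
    template <typename T>
    static void eraseOne(std::vector<T*>& v, T* p) {
        auto it = std::find(v.begin(), v.end(), p);
        if (it != v.end()) v.erase(it);
    }

    std::vector<CollisionObject*> m_collisionObjects;
    std::vector<RigidBody*> m_rigidBodies;
};
```

In a world with 10000 static objects and one rigid body, `applyGravity()` touches a single element instead of walking all 10001. Removal uses a linear search here for brevity; that is a design shortcut of the sketch, not a recommendation.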

Another (simpler) improvement I suggest:

Code:

		for (int i=0;i<clampedSimulationSteps;i++)
		{
			internalSingleStepSimulation(fixedTimeStep);
			synchronizeMotionStates();
		}

	} 
	if( clampedSimulationSteps < 1) // HERE ///////////////////////
		synchronizeMotionStates();
Or is there a problem with that I'm not aware of?
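
To make the effect of this change concrete, here is a small model of the control flow that only counts calls. `StepCounts`, `stepOriginal` and `stepProposed` are hypothetical names. The sketch declares the clamped count outside the `if` block so the later check can see it, and it does not model whatever work the final synchronizeMotionStates() does, so it cannot show whether skipping that call is safe.

```cpp
#include <cassert>

struct StepCounts {
    int internalSteps = 0;  // calls to internalSingleStepSimulation()
    int syncCalls = 0;      // calls to synchronizeMotionStates()
};

// Control flow of the quoted stepSimulation() excerpt.
StepCounts stepOriginal(int numSimulationSubSteps, int maxSubSteps) {
    assert(numSimulationSubSteps >= 0 && maxSubSteps >= 0);
    StepCounts c;
    if (numSimulationSubSteps) {
        int clamped = (numSimulationSubSteps > maxSubSteps) ? maxSubSteps
                                                            : numSimulationSubSteps;
        for (int i = 0; i < clamped; ++i) {
            ++c.internalSteps;
            ++c.syncCalls;  // sync inside the substep loop
        }
    }
    ++c.syncCalls;  // unconditional sync after the loop
    return c;
}

// Same flow with the proposed guard on the trailing sync.
StepCounts stepProposed(int numSimulationSubSteps, int maxSubSteps) {
    assert(numSimulationSubSteps >= 0 && maxSubSteps >= 0);
    StepCounts c;
    int clamped = 0;  // declared here so the guard below can read it
    if (numSimulationSubSteps) {
        clamped = (numSimulationSubSteps > maxSubSteps) ? maxSubSteps
                                                        : numSimulationSubSteps;
        for (int i = 0; i < clamped; ++i) {
            ++c.internalSteps;
            ++c.syncCalls;
        }
    }
    if (clamped < 1) ++c.syncCalls;  // only sync if no substep already did
    return c;
}
```

For a single substep, `stepOriginal(1, 1)` synchronizes twice while `stepProposed(1, 1)` synchronizes once; with no substep, both synchronize exactly once.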

I'm posting this because I plan to make these changes and see how they work out.
Unfortunately I'm short of spare time at the moment. Maybe I can save some time if someone here can tell me whether there is something fundamentally wrong with my ideas.
Or maybe someone has had similar problems and can tell me about their experiences.

Thanks for your time.
fido
Posts: 14
Joined: Sun Mar 22, 2009 5:57 pm

Re: Performance "problem" with large numbers of static objects

Post by fido »

Hello,
I have the same problem. Our scenes contain thousands of static trees and only a few dynamic objects.
Most of the CPU time is spent on *pointlessly* iterating over the whole set of rigid bodies. The problem is not only the upcast
but also cache saturation. On certain platforms (Xbox360) it's slow as hell. Another performance
bottleneck is the island manager, which also iterates over all rigid bodies in the scene and sorts them. I hoped for an improvement from
the implementation of active-only sets, but it has been removed from 2.74.
I'm surprised that nobody has spotted this performance problem, which makes Bullet practically unusable for large
scenes. Now I'm thinking about writing it myself, but I don't like that I'd split from the official Bullet,
which would make future upgrades very difficult.
Can I hope for this issue to be resolved anytime soon? I like Bullet anyway, but this issue makes it an order of magnitude slower than
other physics engines.

best regards, Fido
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Performance "problem" with large numbers of static objects

Post by Erwin Coumans »

fido wrote: I have the same problem. Our scenes contain thousands of static trees and only a few dynamic objects.
Have you tried to group trees together into fewer bt(Scaled)BvhTriangleMeshShape?
fido wrote: Now I'm thinking about writing it myself, but I don't like that I'd split from the official Bullet, which would make future upgrades very difficult.
Can you provide a performance profile dump, using CProfileManager::dumpAll(); after a slow stepSimulation call (using an optimized build)?

We will fix this for the next 2.75 version. You can keep track of the progress here:
http://code.google.com/p/bullet/issues/detail?id=128

Thanks,
Erwin
fido
Posts: 14
Joined: Sun Mar 22, 2009 5:57 pm

Re: Performance "problem" with large numbers of static objects

Post by fido »

I haven't tried to group them, as I cannot combine that with other features - streaming the world
and per-tree reactions, which could be difficult once the trees are grouped.
Furthermore, our collision shapes are built up from convex polyhedra.
It would be easier to implement the active list myself :-) I also have other interesting
thoughts - a static rigid body class without dynamic parameters, which could save a lot of
memory and cache misses, separating material properties from the rigid body, etc. There are
really a lot of issues connected with scenes containing many static objects.

I'll try to create a dump; anyway, the best overview I've got is from the XDK PIX profiler.
There are tens of thousands of L2 cache misses in *every* loop iterating over all collisionObjects
(and they have a deadly cost). Most of the time is spent in btSimulationIslandManager::buildIslands,
which also iterates over the whole scene and sorts it. Tomorrow, when I get onto the devkit,
I'll profile and point out other bottlenecks.

best regards, Fido
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Performance "problem" with large numbers of static objects

Post by Erwin Coumans »

fido wrote: I also have other interesting
thoughts - a static rigid body class without dynamic parameters, which could save a lot of
memory and cache misses
You can use btCollisionObject for static objects; there is no need to use btRigidBody. If you can, please share some of your optimizations here in the forum or in the googlecode tracker. We will consider using them, so you don't need to fork.
fido wrote: I'll try to create a dump; anyway, the best overview I've got is from the XDK PIX profiler.
Tomorrow, when I get onto the devkit, I'll profile and point out other bottlenecks.
We plan on improving island management.
Sharing PIX dumps might violate your NDA with Microsoft. If you can create a CProfileManager::dumpAll() from an optimized build, that would be great.
Thanks a lot,
Erwin
fido
Posts: 14
Joined: Sun Mar 22, 2009 5:57 pm

Re: Performance "problem" with large numbers of static objects

Post by fido »

Wow! I completely missed that btCollisionObject can be used in the simulation,
thank you for pointing it out. It will save me a lot of memory! It would probably
be a good enough hack for speeding things up until there is a
real activeList - I can make another container of rigid bodies and iterate
over only this subset inside the simulation.
Anyway, I'll be happy to share my optimizations here in the forum.

About PIX, you're probably right. I'll create a dumpAll instead, plus I'll
describe my findings from "this" tool in some indirect way ;)
fido
Posts: 14
Joined: Sun Mar 22, 2009 5:57 pm

Re: Performance "problem" with large numbers of static objects

Post by fido »

So, here are some figures. It's an island with a couple of thousand trees, represented as static collision objects (I used the
hint you gave me, so I changed the static stuff to btCollisionObject only). There isn't any dynamic rigid body.

First, dumpAll on a PC with a 2.4GHz P4:
----------------------------------
Profiling: Root (total running time: 1.510 ms) ---
0 -- stepSimulation (99.54 %) :: 1.503 ms / frame (1 calls)
Unaccounted: (0.464 %) :: 0.007 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 1.503 ms) ---
...0 -- synchronizeMotionStates (1.46 %) :: 0.022 ms / frame (2 calls)
...1 -- internalSingleStepSimulation (90.95 %) :: 1.367 ms / frame (1 calls)
...Unaccounted: (7.585 %) :: 0.114 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 1.367 ms) ---
......0 -- updateActivationState (0.80 %) :: 0.011 ms / frame (1 calls)
......1 -- updateActions (0.00 %) :: 0.000 ms / frame (1 calls)
......2 -- integrateTransforms (0.88 %) :: 0.012 ms / frame (1 calls)
......3 -- solveConstraints (12.58 %) :: 0.172 ms / frame (1 calls)
......4 -- calculateSimulationIslands (5.41 %) :: 0.074 ms / frame (1 calls)
......5 -- performDiscreteCollisionDetection (78.79 %) :: 1.077 ms / frame (1 calls)
......6 -- predictUnconstraintMotion (1.24 %) :: 0.017 ms / frame (1 calls)
......Unaccounted: (0.293 %) :: 0.004 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 0.172 ms) ---
.........0 -- processIslands (41.28 %) :: 0.071 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (57.56 %) :: 0.099 ms / frame (1 calls)
.........Unaccounted: (1.163 %) :: 0.002 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 1.077 ms) ---
.........0 -- dispatchAllCollisionPairs (0.09 %) :: 0.001 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (0.37 %) :: 0.004 ms / frame (1 calls)
.........2 -- updateAabbs (99.35 %) :: 1.070 ms / frame (1 calls)
.........Unaccounted: (0.186 %) :: 0.002 ms

It's quite bearable, although spending >10% of the whole game frame on nothing is not very good.

And now, the same scene on "that other" platform:
----------------------------------
Profiling: Root (total running time: 31.296 ms) ---
0 -- stepSimulation (99.96 %) :: 31.282 ms / frame (1 calls)
Unaccounted: (0.045 %) :: 0.014 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 31.282 ms) ---
...0 -- synchronizeMotionStates (1.63 %) :: 0.511 ms / frame (2 calls)
...1 -- internalSingleStepSimulation (92.59 %) :: 28.965 ms / frame (1 calls)
...Unaccounted: (5.773 %) :: 1.806 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 28.965 ms) ---
......0 -- updateActivationState (1.33 %) :: 0.386 ms / frame (1 calls)
......1 -- updateActions (0.00 %) :: 0.001 ms / frame (1 calls)
......2 -- integrateTransforms (1.57 %) :: 0.455 ms / frame (1 calls)
......3 -- solveConstraints (20.86 %) :: 6.042 ms / frame (1 calls)
......4 -- calculateSimulationIslands (3.64 %) :: 1.053 ms / frame (1 calls)
......5 -- performDiscreteCollisionDetection (71.12 %) :: 20.601 ms / frame (1 calls)
......6 -- predictUnconstraintMotion (1.41 %) :: 0.407 ms / frame (1 calls)
......Unaccounted: (0.069 %) :: 0.020 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 6.042 ms) ---
.........0 -- processIslands (43.71 %) :: 2.641 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (56.12 %) :: 3.391 ms / frame (1 calls)
.........Unaccounted: (0.166 %) :: 0.010 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 20.601 ms) ---
.........0 -- dispatchAllCollisionPairs (0.02 %) :: 0.004 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (0.17 %) :: 0.036 ms / frame (1 calls)
.........2 -- updateAabbs (99.77 %) :: 20.554 ms / frame (1 calls)
.........Unaccounted: (0.034 %) :: 0.007 ms

This makes it completely unusable, as physics takes more time than all of rendering and game logic.
I can switch between 2 physics engines in our technology (until we resolve all the issues
and choose one of them), and the other one takes nearly zero time in the same scene.

Looking further into the profiler output, stepSimulation produces 21612 L2 cache misses.
Most of them come from internalSingleStep->btCollisionWorld::performDiscreteCollisionDetection->btCollisionWorld::updateAabbs,
the next biggest source is btDiscreteDynamicsWorld::saveKinematicState,
and the rest is in btDiscreteDynamicsWorld::calculateSimulationIslands.

It's clear that there is no reason for that, because nothing is moving in the scene at all. I'm sure
that the only solution is to separate active and non-active objects and work only on the active set, which is
usually much smaller in a typical game scenario.
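
One way to keep such an active set is sketched below, using a swap-and-pop list so that activation and deactivation are both O(1). `Body`, `ActiveSet` and their members are hypothetical names, not Bullet API; the one-dimensional "AABB" just stands in for real bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy body with 1-D bounds (hypothetical, NOT a Bullet class).
struct Body {
    float position = 0.0f;
    float halfExtent = 0.5f;
    float aabbMin = -0.5f;
    float aabbMax = 0.5f;
    int activeIndex = -1;  // slot in the active list, -1 if inactive
};

class ActiveSet {
public:
    void activate(Body* b) {
        assert(b != nullptr);
        if (b->activeIndex >= 0) return;  // already active
        b->activeIndex = static_cast<int>(m_active.size());
        m_active.push_back(b);
    }

    // Swap-and-pop: move the last active body into the freed slot.
    void deactivate(Body* b) {
        assert(b != nullptr);
        if (b->activeIndex < 0) return;  // already inactive
        Body* last = m_active.back();
        m_active[static_cast<std::size_t>(b->activeIndex)] = last;
        last->activeIndex = b->activeIndex;
        m_active.pop_back();
        b->activeIndex = -1;
    }

    // Refreshes bounds of active bodies only; returns how many were touched.
    std::size_t updateAabbs() {
        for (Body* b : m_active) {
            b->aabbMin = b->position - b->halfExtent;
            b->aabbMax = b->position + b->halfExtent;
        }
        return m_active.size();
    }

private:
    std::vector<Body*> m_active;
};
```

With thousands of bodies in the scene and none of them active, `updateAabbs()` returns immediately without touching the memory of a single inactive body, which is the cache behaviour the post is asking for.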

best regards, Fido
fido
Posts: 14
Joined: Sun Mar 22, 2009 5:57 pm

Re: Performance "problem" with large numbers of static objects

Post by fido »

Hi again,

I just made the small change that "mp-nico" suggested, so that
btDiscreteDynamicsWorld keeps pointers to only the btRigidBody objects, and
I got these figures:
----------------------------------
Profiling: Root (total running time: 27.908 ms) ---
0 -- stepSimulation (99.94 %) :: 27.891 ms / frame (1 calls)
Unaccounted: (0.061 %) :: 0.017 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 27.891 ms) ---
...0 -- synchronizeMotionStates (0.01 %) :: 0.002 ms / frame (2 calls)
...1 -- internalSingleStepSimulation (99.95 %) :: 27.876 ms / frame (1 calls)
...Unaccounted: (0.047 %) :: 0.013 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 27.876 ms) ---
......0 -- updateActivationState (0.01 %) :: 0.002 ms / frame (1 calls)
......1 -- updateActions (0.00 %) :: 0.001 ms / frame (1 calls)
......2 -- integrateTransforms (0.01 %) :: 0.002 ms / frame (1 calls)
......3 -- solveConstraints (21.63 %) :: 6.029 ms / frame (1 calls)
......4 -- calculateSimulationIslands (3.78 %) :: 1.055 ms / frame (1 calls)
......5 -- performDiscreteCollisionDetection (74.50 %) :: 20.768 ms / frame (1 calls)
......6 -- predictUnconstraintMotion (0.00 %) :: 0.001 ms / frame (1 calls)
......Unaccounted: (0.065 %) :: 0.018 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 6.029 ms) ---
.........0 -- processIslands (43.85 %) :: 2.644 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (56.01 %) :: 3.377 ms / frame (1 calls)
.........Unaccounted: (0.133 %) :: 0.008 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 20.768 ms) ---
.........0 -- dispatchAllCollisionPairs (0.01 %) :: 0.003 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (0.18 %) :: 0.037 ms / frame (1 calls)
.........2 -- updateAabbs (99.77 %) :: 20.721 ms / frame (1 calls)
.........Unaccounted: (0.034 %) :: 0.007 ms


which is a little bit better. Anyway, the number of cache misses is still pretty high
(19600), as all objects are still iterated over here:
btCollisionWorld::updateAabbs

Since all objects in CollisionWorld are in the same container, I cannot work on a
smaller set. It needs either an active-only set or some extra container for objects
with dirty bounds.
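
The "extra container for objects with dirty bounds" could look roughly like the sketch below. `Object`, `BoundsUpdater`, `setPosition` and `flush` are hypothetical names and the bounds are one-dimensional for brevity; it assumes every transform change goes through one setter that can record the object.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy object with 1-D bounds (hypothetical, NOT a Bullet class).
struct Object {
    float position = 0.0f;
    float halfExtent = 0.5f;
    float aabbMin = -0.5f;
    float aabbMax = 0.5f;
    bool boundsDirty = false;  // true while queued in the dirty list
};

class BoundsUpdater {
public:
    // Assumed single entry point for moving an object.
    void setPosition(Object& obj, float newPosition) {
        obj.position = newPosition;
        if (!obj.boundsDirty) {  // queue each object at most once
            obj.boundsDirty = true;
            m_dirty.push_back(&obj);
        }
    }

    // Recomputes bounds for queued objects only; returns how many were updated.
    std::size_t flush() {
        const std::size_t count = m_dirty.size();
        for (Object* obj : m_dirty) {
            obj->aabbMin = obj->position - obj->halfExtent;
            obj->aabbMax = obj->position + obj->halfExtent;
            obj->boundsDirty = false;
        }
        m_dirty.clear();
        return count;
    }

private:
    std::vector<Object*> m_dirty;
};
```

Calling `flush()` once per step costs time proportional to the number of objects that actually moved, so a scene of purely static geometry pays nothing.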

best regards, Fido
Erwin Coumans
Site Admin
Posts: 4221
Joined: Sun Jun 26, 2005 6:43 pm
Location: California, USA

Re: Performance "problem" with large numbers of static objects

Post by Erwin Coumans »

Is this an unmodified version of Bullet?

It seems strange that updateAabbs is so expensive; updateSingleAabb should never be called. Also, solveConstraints should be free without any constraints. Did you modify the island generation code?

Is this really the same scene, with purely static geometry (no dynamic/moving rigid bodies)?

Code:

void	btCollisionWorld::updateAabbs()
{
	BT_PROFILE("updateAabbs");

	btTransform predictedTrans;
	for ( int i=0;i<m_collisionObjects.size();i++)
	{
		btCollisionObject* colObj = m_collisionObjects[i];

		//only update aabb of active objects
		if (colObj->isActive())
		{
			updateSingleAabb(colObj);
		}
	}
}
Can you figure out why updateAabbs is so expensive, and if the objects are really flagged as inactive?
Can you try to force deactivation for static geometry, using

Code:

object->forceActivationState(ISLAND_SLEEPING);
Thanks,
Erwin
fido
Posts: 14
Joined: Sun Mar 22, 2009 5:57 pm

Re: Performance "problem" with large numbers of static objects

Post by fido »

It's mostly unmodified Bullet; I have only a couple of optimizations there. Anyway, the problem was
indeed activation ;) Now I do this for all static objects:
btCollisionObject* rb = new btCollisionObject();
rb->setCollisionFlags(btCollisionObject::CF_STATIC_OBJECT);
rb->forceActivationState(ISLAND_SLEEPING);
/*
[...] sets btCollisionshape and other stuff
*/
m_pDynamicsWorld->addCollisionObject(rb, short(btBroadphaseProxy::StaticFilter), short(btBroadphaseProxy::AllFilter ^ btBroadphaseProxy::StaticFilter));

and it runs fine:

----------------------------------
Profiling: Root (total running time: 1.620 ms) ---
0 -- stepSimulation (99.07 %) :: 1.605 ms / frame (1 calls)
Unaccounted: (0.926 %) :: 0.015 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 1.605 ms) ---
...0 -- synchronizeMotionStates (0.12 %) :: 0.002 ms / frame (2 calls)
...1 -- internalSingleStepSimulation (99.31 %) :: 1.594 ms / frame (1 calls)
...Unaccounted: (0.561 %) :: 0.009 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 1.594 ms) ---
......0 -- updateActivationState (0.06 %) :: 0.001 ms / frame (1 calls)
......1 -- updateActions (0.06 %) :: 0.001 ms / frame (1 calls)
......2 -- integrateTransforms (0.06 %) :: 0.001 ms / frame (1 calls)
......3 -- solveConstraints (67.25 %) :: 1.072 ms / frame (1 calls)
......4 -- calculateSimulationIslands (16.25 %) :: 0.259 ms / frame (1 calls)
......5 -- performDiscreteCollisionDetection (15.24 %) :: 0.243 ms / frame (1 calls)
......6 -- predictUnconstraintMotion (0.06 %) :: 0.001 ms / frame (1 calls)
......Unaccounted: (1.004 %) :: 0.016 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 1.072 ms) ---
.........0 -- processIslands (36.19 %) :: 0.388 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (63.06 %) :: 0.676 ms / frame (1 calls)
.........Unaccounted: (0.746 %) :: 0.008 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 0.243 ms) ---
.........0 -- dispatchAllCollisionPairs (1.23 %) :: 0.003 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (2.47 %) :: 0.006 ms / frame (1 calls)
.........2 -- updateAabbs (94.65 %) :: 0.230 ms / frame (1 calls)
.........Unaccounted: (1.646 %) :: 0.004 ms

There is still a problem with the overhead of the island manager (considering that the scene does not
contain any dynamic objects), but if I can hope it will improve anytime soon, it's
quite bearable.
Thank you for pointing out the wrong setting!

best regards, Fido