Basics of using OpenCL soft body?

dphil · Post by **dphil** » Mon Apr 11, 2011 12:56 pm

I'm interested in testing out the OpenCL soft body solver, but have been unable to get it working properly. I tried to figure out the basics of what I need to do from the AppOpenCLClothDemo, and it seems that in the simplest case - using a CPU-based CL solver - I need to:

1) Initialize CL (I copied over the demo files and just call their InitCL() function in clstuff.cpp)
2) Provide the soft body solver as the 5th argument to the soft/rigid dynamics world:

Code: Select all

btSoftBodySolver* softBodySolver = new btCPUSoftBodySolver;
btSoftRigidDynamicsWorld* dynamicsWorld = new btSoftRigidDynamicsWorld(dispatcher, broadphase, solver, collisionConfiguration, softBodySolver);

3) call

Code: Select all

softBodySolver->optimize(dynamicsWorld->getSoftBodyArray());

after my soft body is added, although I think this is optional.

I already had a soft dynamics world working before; all I did was add the above. I have everything compiled and linked correctly, and the simulation runs, but my soft body does not seem to be updated properly. The bounding box is actually falling (gravity) correctly (I render the aabb separately and monitor its position), but the mesh vertices (as retrieved via softBody->m_nodes.m_x) remain stationary, and then after 5-10 seconds they just disappear (but output indicates they are still stationary at their original location). My knowledge of OpenCL (and bullet's use of it) is very limited, and I could well be missing something. The only other thing that caught my eye in the demo was use of a (btSoftBodySolverOutput*) member, but from what I could tell this is just used for demo rendering purposes, so I didn't include it in my own. My rendering is done through btOgre, which updates vertex positions from the soft body's m_nodes array, but like I said the vertices don't seem to change.

Just trying to hack together a working test at the moment, without any real knowledge of OpenCL or how bullet is using it. For all I know it wouldn't even work on my system. I am running Mac OS X 10.6 on an Intel Core i5 iMac, with ATI Radeon HD 4850 graphics. I have not specifically installed any drivers or CL frameworks myself; just using what is provided in the bullet source code. However, I can run both the AppVectorAdd_Mini and AppVectorAdd_Apple with success (as noted in the output), which suggests to me that I have the appropriate setup for using OpenCL. I get varying results with the other CL demos:

1) AppOpenCLClothDemo_Mini: Doesn't compile for me (linker error, can't find _clGetProgramInfo symbol)
2) AppOpenCLClothDemo_Apple: Compiles but throws a runtime error (the btAssert(0) at line 1488 of btSoftBodySolver_CPU.cpp) with an error log:

Error in clBuildProgram, Line 1485 in file /Users/Dave/Documents/Bullet Physics/bullet-2.78/src/BulletMultiThreaded/GpuSoftBodySolvers/OpenCL/btSoftBodySolver_OpenCL.cpp, Log:
cvmsErrorCompilerFailure: LLVM compiler has failed to compile a function.

3) AppParticlesOCL_Mini: Compiles and runs, even with correct-looking particle dynamics. However the performance is rather slow (~250ms/step)
4) AppParticlesOCL_Apple: Compiles, but gives runtime error:

OpenCL compiles ParticlesOCL.cl ...

cvmsErrorCompilerFailure: LLVM compiler has failed to compile a function.

Any tips are much appreciated.

Erwin Coumans · Post by **Erwin Coumans** » Mon Apr 11, 2011 4:14 pm

If you check out Bullet/Demos/OpenCLClothDemo, by default it keeps the cloth data on the GPU, using btSoftBodySolverOutputCLtoGL, to avoid slow copying between CPU and GPU. If you need the cloth data on the CPU (likely with Ogre), you need to copy it back using btSoftBodySolverOutputCLtoCPU.
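To make the data flow concrete, here is a minimal sketch of why m_nodes can stay stationary while the solver is happily simulating. Every type in it (Vec3, HostBody, DeviceSolver) is a hypothetical stand-in invented for illustration, not a Bullet class; the only point is that the solver integrates its own buffer and the host-side array changes only after an explicit copy-back.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins, NOT Bullet's types: they only model the data flow.
struct Vec3 { float x, y, z; };

// Plays the role of the soft body's CPU-side node array (like m_nodes[i].m_x).
struct HostBody {
    std::vector<Vec3> nodes;
};

// Plays the role of an accelerated solver that owns its own copy of the positions.
class DeviceSolver {
public:
    // One-time upload from the host array into the solver-owned buffer.
    void upload(const HostBody& body) { positions_ = body.nodes; }

    // Toy update: moves every solver-owned vertex by gravityY * dt.
    // The host array is deliberately not touched here.
    void step(float dt, float gravityY) {
        for (Vec3& p : positions_) p.y += gravityY * dt;
    }

    // Explicit copy-back, the role described above for btSoftBodySolverOutputCLtoCPU.
    void copyToHost(HostBody& body) const {
        assert(body.nodes.size() == positions_.size());
        body.nodes = positions_;
    }

private:
    std::vector<Vec3> positions_;
};
```

With this model, a renderer that reads HostBody::nodes right after step() still sees the original positions; they only move once copyToHost() has been called for that frame.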

The *Mini projects use MiniCL, which is a CPU replacement for an OpenCL implementation. It can run multi-threaded if configured correctly (see the console output for threads), but it is usually slower than a GPU. I just checked the MiniCL.cpp code, and it seems that non-Windows platforms (Linux/Apple) run in single-threaded mode.
I just committed PosixThreadSupport for MiniCL, which should enable multi-threading on Linux/Apple. Can you try this? It might require linking against pthreads though.
http://code.google.com/p/bullet/source/detail?r=2388

I'm surprised about the AppOpenCLClothDemo_Mini linking issue (clGetProgramInfo symbol missing). Where is clGetProgramInfo used?
The OpenCLClothDemo_Mini compiles and runs fine here on Mac OSX 10.6.1 and OSX 10.6.7, just like the CPU and GPU OpenCL versions.
What revision of OSX are you using?

Thanks for testing and feedback!
Erwin

By the way, the OpenCL/DirectCompute cloth acceleration is very limited and doesn't support most features. Hopefully it will become more mature as people start using it.

dphil · Post by **dphil** » Tue Apr 12, 2011 1:46 am

In regard to AppParticlesOCL_Mini:

1) I realized the 250ms/frame performance I noted in my previous post was using the Debug setting, and without resetting the simulation (see below).

2) I've found that I need to "reset" (using the space bar) the particle demo once it's started for it to properly "sort itself out". Otherwise the demo starts out about twice as slow and a number of the particles suddenly disappear. But once I hit the space bar to reset, it starts over nice and fast with all particles staying visible. Doing this in release mode, I achieved ~65ms/step with collision detection on (not bad for 32768 particles), and ~33ms/step with either collision detection or motion integration turned off (using the demo interface). Would be neat to have the same demo but using the standard non-CL bullet structures to compare performance.

3) I updated to your change that uses pthreads, but I have noticed no change in performance (still 65/33ms), though I do note a change in the output. Without PosixThreadSupport (i.e. before your changes), the particle demo output is:

Total number of particles : 32768
STS: Not starting any threads
STS: Created local store at 0x101626ea0 for task MiniCL
CL_DEVICE_NAME: MiniCL CPU
CL_DEVICE_VENDOR: MiniCL, SCEA
CL_DRIVER_VERSION: 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 1
CL_DEVICE_MAX_WORK_ITEM_SIZES: 64 / 24 / 16
CL_DEVICE_MAX_WORK_GROUP_SIZE: 128
CL_DEVICE_MAX_CLOCK_FREQUENCY: 3072 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_IMAGE_SUPPORT: 0
CL_DEVICE_MAX_READ_IMAGE_ARGS: 0
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 0

CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 0
2D_MAX_HEIGHT 0
3D_MAX_WIDTH 0
3D_MAX_HEIGHT 0
3D_MAX_DEPTH 0

CL_DEVICE_EXTENSIONS:
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, FLOAT 1, DOUBLE 1

OpenCL compiles ParticlesOCL.cl ... OK
generating font at resolution 640,480

With PosixThreadSupport, the particle demo output is:

Total number of particles : 32768
startThreads creating 4 threads.
starting thread 0
started thread 0
starting thread 1
started thread 1
starting thread 2
started thread 2
starting thread 3
started thread 3
CL_DEVICE_NAME: MiniCL CPU
CL_DEVICE_VENDOR: MiniCL, SCEA
CL_DRIVER_VERSION: 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 1
CL_DEVICE_MAX_WORK_ITEM_SIZES: 64 / 24 / 16
CL_DEVICE_MAX_WORK_GROUP_SIZE: 128
CL_DEVICE_MAX_CLOCK_FREQUENCY: 3072 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_IMAGE_SUPPORT: 0
CL_DEVICE_MAX_READ_IMAGE_ARGS: 0
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 0

CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 0
2D_MAX_HEIGHT 0
3D_MAX_WIDTH 0
3D_MAX_HEIGHT 0
3D_MAX_DEPTH 0

CL_DEVICE_EXTENSIONS:
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, FLOAT 1, DOUBLE 1

OpenCL compiles ParticlesOCL.cl ... OK
generating font at resolution 640,480

So it appears to start 4 threads, which I assume makes sense on this quad-core machine. I wonder why there is no performance change though. Either it isn't actually distributing the work, or perhaps it somehow already was before anyway, so adding the PosixThreadSupport doesn't add any enhancement...?

As for the AppOpenCLClothDemo_Mini linker error, the log shows:

Undefined symbols:
"_clGetProgramInfo", referenced from:
CLFunctions::compileCLKernelFromString(char const*, char const*, char const*)in btSoftBodySolver_OpenCL.o
CLFunctions::compileCLKernelFromString(char const*, char const*, char const*)in btSoftBodySolver_OpenCL.o
ld: symbol(s) not found

which appears to refer to the clGetProgramInfo calls on lines 1469 and 1471 of btSoftBodySolver_OpenCL.cpp. clGetProgramInfo is declared in cl.h and, I assume, defined in the OpenCL.framework (on a Mac), since I can't find the definition anywhere in the bullet source. I've tried linking the AppOpenCLClothDemo_Mini target ("Link with binary" build phase) to /System/Library/Frameworks/OpenCL.framework, but I get the same linker error. Maybe there are some other Xcode build settings to modify; I am not too experienced with it.

By the way I am using Mac OS X 10.6.7.

dphil · Post by **dphil** » Tue Apr 12, 2011 4:28 am

Think I found the linking problem. It seems like the clGetProgramInfo definition should be in MiniCL.cpp, but it isn't (was it accidentally left out of a commit, perhaps?). I added the following to MiniCL.cpp:

Code: Select all

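// Stub so that code calling clGetProgramInfo links against MiniCL.
// It ignores all arguments and always reports success (0 == CL_SUCCESS).
CL_API_ENTRY cl_int clGetProgramInfo(cl_program         /* program */,
                 cl_program_info    /* param_name */,
                 size_t             /* param_value_size */,
                 void *             /* param_value */,
                 size_t *           /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0
{
	return 0;
}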

and now I have a working demo with waving flags/cloths. I just set it to return 0, since it is never actually used anywhere except where an error has already occurred, and this is how clGetProgramBuildInfo was implemented as well (it also is never called except when an error already occurs, so no problem).
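A stub that returns 0 without touching its output arguments is fine as long as no caller reads them. As a hedged alternative, the sketch below shows a slightly more defensive variant that also clears the outputs. The sketch_ typedefs and the name sketchGetProgramInfo are made up here so the snippet compiles without cl.h; inside MiniCL.cpp the real cl_int, cl_program and cl_program_info types and the real function name would be used instead.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in typedefs (assumptions) so the sketch compiles without cl.h.
typedef int          sketch_cl_int;
typedef void*        sketch_cl_program;
typedef unsigned int sketch_cl_program_info;

// Hypothetical defensive variant of the stub: it still reports success (0),
// but clears the output arguments so the caller never reads garbage.
sketch_cl_int sketchGetProgramInfo(sketch_cl_program      /* program */,
                                   sketch_cl_program_info /* param_name */,
                                   std::size_t            param_value_size,
                                   void*                  param_value,
                                   std::size_t*           param_value_size_ret)
{
    // Zero-fill the result buffer if the caller supplied one.
    if (param_value != NULL && param_value_size > 0)
        std::memset(param_value, 0, param_value_size);

    // Report that no data was written.
    if (param_value_size_ret != NULL)
        *param_value_size_ret = 0;

    return 0; // same numeric value as CL_SUCCESS
}
```

Whether this extra care matters depends on how the calling code uses the results; if the call really is reached only after a build error, the plain `return 0;` version behaves the same in practice.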

As for the demo itself, it runs smoothly for a low number of flags. When I set the flag count to 20, it is noticeably in more of a "slow motion". I would expect a slowdown with a higher number of flags, but I would have thought it would be able to handle that many ok. However, I ran multiple times to see the profiling in the output, and the following is representative of my average results for 20 flags:

----------------------------------
Profiling: Root (total running time: 245.166 ms) ---
0 -- stepSimulation (15.40 %) :: 37.752 ms / frame (1 calls)
Unaccounted: (84.601 %) :: 207.414 ms
...----------------------------------
...Profiling: stepSimulation (total running time: 37.752 ms) ---
...0 -- synchronizeMotionStates (0.00 %) :: 0.000 ms / frame (1 calls)
...1 -- solveSoftConstraints (76.97 %) :: 29.057 ms / frame (1 calls)
...2 -- internalSingleStepSimulation (8.15 %) :: 3.075 ms / frame (1 calls)
...Unaccounted: (14.887 %) :: 5.620 ms
......----------------------------------
......Profiling: internalSingleStepSimulation (total running time: 3.075 ms) ---
......0 -- updateActivationState (0.00 %) :: 0.000 ms / frame (1 calls)
......1 -- updateActions (0.00 %) :: 0.000 ms / frame (1 calls)
......2 -- integrateTransforms (0.00 %) :: 0.000 ms / frame (1 calls)
......3 -- solveConstraints (0.03 %) :: 0.001 ms / frame (1 calls)
......4 -- calculateSimulationIslands (0.03 %) :: 0.001 ms / frame (1 calls)
......5 -- addSpeculativeContacts (0.00 %) :: 0.000 ms / frame (1 calls)
......6 -- performDiscreteCollisionDetection (1.07 %) :: 0.033 ms / frame (1 calls)
......7 -- predictUnconstraintMotionSoftBody (98.76 %) :: 3.037 ms / frame (1 calls)
......8 -- predictUnconstraintMotion (0.00 %) :: 0.000 ms / frame (1 calls)
......Unaccounted: (0.098 %) :: 0.003 ms
.........----------------------------------
.........Profiling: solveConstraints (total running time: 0.001 ms) ---
.........0 -- processIslands (0.00 %) :: 0.000 ms / frame (1 calls)
.........1 -- islandUnionFindAndQuickSort (100.00 %) :: 0.001 ms / frame (1 calls)
.........Unaccounted: (0.000 %) :: 0.000 ms
.........----------------------------------
.........Profiling: performDiscreteCollisionDetection (total running time: 0.033 ms) ---
.........0 -- dispatchAllCollisionPairs (30.30 %) :: 0.010 ms / frame (1 calls)
.........1 -- calculateOverlappingPairs (6.06 %) :: 0.002 ms / frame (1 calls)
.........2 -- updateAabbs (63.64 %) :: 0.021 ms / frame (1 calls)
.........Unaccounted: (0.000 %) :: 0.000 ms

Notice that the stepSimulation time seems ok (~38ms) but there is 207ms of "unaccounted" time! This is obviously what is slowing down the simulation. Any idea what this is? I thought it might be the copying of vertex buffers for rendering, but commenting out:

Code: Select all

for( int flagIndex = 0; flagIndex < m_flags.size(); ++flagIndex )
{
	g_softBodyOutput->copySoftBodyToVertexBuffer( m_flags[flagIndex], cloths[flagIndex].m_vertexBufferDescriptor );
	cloths[flagIndex].draw();
}

in cl_cloth_demo.cpp made no difference (besides making the flags invisible, of course).
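For reference, the arithmetic behind the profiler numbers can be written out as a small sketch. The struct and function names are made up for illustration; the inputs are the Root total (245.166 ms) and the stepSimulation time (37.752 ms) from the dump above. The Root "Unaccounted" entry is presumably whatever part of the frame runs outside any profiled zone.

```cpp
#include <cassert>

// Figures taken from the profiler dump (milliseconds per frame).
struct FrameProfile {
    double totalMs;     // "Root" total running time
    double accountedMs; // time inside profiled children (here: stepSimulation)
};

// Time the profiler could not attribute to any child zone.
double unaccountedMs(const FrameProfile& p) {
    assert(p.totalMs >= p.accountedMs);
    return p.totalMs - p.accountedMs;
}

// Share of the frame spent outside the profiled zones, in percent.
double unaccountedPercent(const FrameProfile& p) {
    return 100.0 * unaccountedMs(p) / p.totalMs;
}

// Frame rate implied by the total frame time.
double framesPerSecond(const FrameProfile& p) {
    return 1000.0 / p.totalMs;
}
```

Plugging in {245.166, 37.752} gives 207.414 ms unaccounted, about 84.6 % of the frame, and roughly 4 frames per second overall, whereas the 37.752 ms of stepSimulation on its own would allow roughly 26 steps per second.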

Real-Time Physics Simulation Forum
