Physics Simulation Forum

 

All times are UTC




Post new topic Reply to topic  [ 4 posts ] 
Author Message
Posted: Wed Jan 11, 2012 11:36 pm

Joined: Thu Jul 20, 2006 1:29 pm
Posts: 25
Hello,

I was wondering about some lines in this file:
https://code.google.com/p/bullet/source ... lverData.h
Lines 133 - 143.

It is interesting that Bullet puts every single variable into a separate array, even though the original SoftBody link structure holds all of those variables in one class/struct.
Can anyone explain that to me? I couldn't find a comment regarding that in the source code.

My guess so far is that it has to do with the "batching": the memory distance between neighbours would get too big if Bullet used one array with a big struct for every link.

Any confirmations/denial? :D

Greetings and keep up the good work!
Daniel


Posted: Thu Jan 12, 2012 6:02 pm

Joined: Fri Jun 18, 2010 12:58 am
Posts: 3
Hi Genscher,
That memory layout transformation is standard practice for data-parallel execution and vector machines, of which the GPU is an example. Remember how the GPU works: you have a set of wide SIMD threads, and each SIMD thread contains many work items that access data simultaneously. If we read from the compacted array of, for example, positions, then neighbouring work items read consecutive positions. (I may or may not have split that into x, y and z, but since the whole position is read at once that matters less.) If we instead read the position from within a larger struct, the accesses from neighbouring work items in the vector stride through memory by the size of that struct. A single vector read can then span multiple cache lines and even DRAM banks, which degrades performance.

This transformation is traditionally known as converting from Array of Structs (AoS) to Struct of Arrays (SoA).
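
To make the layout difference concrete, here is a minimal C++ sketch of the same link data stored both ways: as an Array of Structs (AoS) and as a Struct of Arrays (SoA). The field names (vertexA, restLength and so on) and the helper toSoA are hypothetical, chosen for illustration only; they are not the actual members of Bullet's link structure or solver data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical link record (illustrative field names, not Bullet's).
struct LinkAoS {
    int   vertexA;
    int   vertexB;
    float restLength;
    float strength;
    float massLSC;
};

// Array of Structs: one array, every element carries all fields.
using LinksAoS = std::vector<LinkAoS>;

// Struct of Arrays: one tightly packed array per field.
struct LinksSoA {
    std::vector<int>   vertexA;
    std::vector<int>   vertexB;
    std::vector<float> restLength;
    std::vector<float> strength;
    std::vector<float> massLSC;

    std::size_t size() const { return restLength.size(); }
};

// Convert the AoS layout into the SoA layout, field by field.
LinksSoA toSoA(const LinksAoS& links) {
    LinksSoA out;
    for (const LinkAoS& l : links) {
        out.vertexA.push_back(l.vertexA);
        out.vertexB.push_back(l.vertexB);
        out.restLength.push_back(l.restLength);
        out.strength.push_back(l.strength);
        out.massLSC.push_back(l.massLSC);
    }
    assert(out.vertexA.size() == links.size());
    return out;
}

// Byte distance between the restLength read by work item i and by
// work item i + 1 when each work item handles one link.
constexpr std::size_t kStrideAoS = sizeof(LinkAoS); // whole struct
constexpr std::size_t kStrideSoA = sizeof(float);   // one element
```

With the AoS layout, neighbouring work items reading restLength are a whole struct apart; with the SoA layout they are one float apart, so a wide vector read touches one contiguous run of memory.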

In addition, to make better use of GPU parallelism, where thousands of work items can be in flight at one time, we packed multiple cloths into a single set of arrays. A single dispatch can then work over the whole set of cloths rather than needing a GPU kernel launch for each one. This helps with launch overhead, which can become significant if the cloths are not big enough to fully utilise the machine on their own.
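
A rough CPU-side sketch of that packing idea, in C++: several cloths are appended to one shared set of per-component arrays, and a single pass over those arrays stands in for the single GPU dispatch. All names here (PackedCloths, ClothRange, addCloth, applyOffsetAllCloths) are made up for illustration and are not Bullet's API; the loop index plays the role of the global work item id.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where one cloth's vertices live inside the shared arrays.
struct ClothRange {
    std::size_t firstVertex;
    std::size_t vertexCount;
};

// Hypothetical container: every cloth shares the same SoA arrays.
struct PackedCloths {
    std::vector<float> posX, posY, posZ;
    std::vector<ClothRange> ranges;

    // Append one cloth's vertices; returns the index of the new cloth.
    std::size_t addCloth(const std::vector<float>& x,
                         const std::vector<float>& y,
                         const std::vector<float>& z) {
        assert(x.size() == y.size() && y.size() == z.size());
        ranges.push_back(ClothRange{posX.size(), x.size()});
        posX.insert(posX.end(), x.begin(), x.end());
        posY.insert(posY.end(), y.begin(), y.end());
        posZ.insert(posZ.end(), z.begin(), z.end());
        return ranges.size() - 1;
    }
};

// Stand-in for ONE kernel dispatch covering all cloths at once:
// each iteration corresponds to one work item.
void applyOffsetAllCloths(PackedCloths& cloths, float dy) {
    for (std::size_t i = 0; i < cloths.posY.size(); ++i) {
        cloths.posY[i] += dy;
    }
}
```

The per-cloth ranges are only needed when something has to address one cloth on its own, for example when copying results back to an individual soft body.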

Lee


Posted: Thu Jan 12, 2012 6:44 pm

Joined: Thu Jul 20, 2006 1:29 pm
Posts: 25
Hello Lee,

I wasn't sure whether the cache degradation would be a severe enough problem to justify 7 or more kernel arguments.

Thank you for your time explaining this.

Daniel


Posted: Thu Jan 12, 2012 7:34 pm

Joined: Fri Jun 18, 2010 12:58 am
Posts: 3
That would depend on the cache architecture. That algorithm was written for pre-Fermi/pre-Southern Islands class GPUs, which have more limited caching than we have now. A lot of those architectures have very severe performance issues with uncoalesced access. In general, for vector architectures it's worth thinking about the SoA optimisation, and that includes SSE. Most data-parallel workloads should probably be architected that way unless it would cause a serious maintenance problem.
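
As a small illustration of the SSE case, here is a C++ sketch that advances one position component stored as a plain float array, four elements per instruction. The function name integrateX and the arrays posX/velX are hypothetical. The intrinsics are only used when the compiler reports SSE support; otherwise the scalar loop handles everything.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SOA_HAVE_SSE 1
#endif

// posX[i] += velX[i] * dt for every element of one SoA component.
void integrateX(std::vector<float>& posX,
                const std::vector<float>& velX,
                float dt) {
    assert(posX.size() == velX.size());
    const std::size_t n = posX.size();
    std::size_t i = 0;
#ifdef SOA_HAVE_SSE
    // Because the x components are contiguous, four of them load
    // straight into one 128-bit register with no shuffling.
    const __m128 vdt = _mm_set1_ps(dt);
    for (; i + 4 <= n; i += 4) {
        const __m128 p = _mm_loadu_ps(&posX[i]);
        const __m128 v = _mm_loadu_ps(&velX[i]);
        _mm_storeu_ps(&posX[i], _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
#endif
    // Scalar tail (or the whole array when SSE is unavailable).
    for (; i < n; ++i) {
        posX[i] += velX[i] * dt;
    }
}
```

With an AoS layout the same four x values would sit a full struct apart, so they would have to be gathered into the register one at a time before the vector add could run.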

