Solving a single island across multiple cores

raigan2 · Post by **raigan2** » Tue Jun 09, 2009 7:32 pm

In this month's GD mag there's a paid portion from Intel that talks about this simulator: http://www.infernalengine.com/tech_physics.php

They specifically mention that it can solve an island of constraints in parallel; collision detection and Jacobian building are more easily split up, because each individual test/Jacobian happens in isolation, but how would you parallelize the actual relaxation/solving part?

The only obvious thing I can think of is: take an island and "split" it by ignoring some constraints and/or bodies, so that it becomes two separate islands B,C. Then, solve each island on a different core, and once both are done solve the constraints/bodies that were initially ignored. Or you could just use Jacobi iterations instead of Gauss-Seidel, but that seems like it would have terrible convergence.
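To make the Jacobi versus Gauss-Seidel trade-off concrete, here is a minimal sketch on a plain linear system. It is an illustration under assumptions, not a physics solver: the function name `relax`, the 5x5 tridiagonal matrix, and the tolerance are all made up for this example, and there is no projection/clamping step as there would be in PGS. Jacobi reads only values from the previous sweep, so every row could be updated on a different core; Gauss-Seidel reads values already updated in the current sweep, which is what makes it sequential.

```python
def relax(A, b, gauss_seidel, tol=1e-10, sweep_limit=1000):
    """Iterate on A x = b; return (x, number of sweeps used)."""
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, sweep_limit + 1):
        # Gauss-Seidel reads x itself, so it sees this sweep's updates.
        # Jacobi reads a snapshot taken before the sweep started.
        source = x if gauss_seidel else list(x)
        max_change = 0.0
        for i in range(n):
            off_diag = sum(A[i][j] * source[j] for j in range(n) if j != i)
            new_value = (b[i] - off_diag) / A[i][i]
            max_change = max(max_change, abs(new_value - x[i]))
            x[i] = new_value
        if max_change < tol:
            return x, sweep
    return x, sweep_limit


# Hypothetical test system: diagonally dominant, chain-like coupling.
n = 5
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n

x_gs, sweeps_gs = relax(A, b, gauss_seidel=True)
x_jacobi, sweeps_jacobi = relax(A, b, gauss_seidel=False)
print("Gauss-Seidel sweeps:", sweeps_gs, "Jacobi sweeps:", sweeps_jacobi)
```

On this small system both variants reach the same answer, but Jacobi needs more sweeps to get there, which is the convergence cost being weighed against its easy parallelism.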

Are there any other (better) approaches?

Erwin Coumans · Post by **Erwin Coumans** » Tue Jun 09, 2009 8:00 pm

You can run parallel PGS (projected Gauss-Seidel) on a single island by reorganizing the constraints into batches, where no two constraints in the same batch share a dynamic rigid body. This makes it possible to parallelize simulation islands in which all objects are connected (directly or indirectly).

See my posting and attached GDC 2009 slides by Takahiro Harada in this topic.

Hope this helps,
Erwin

raigan2 · Post by **raigan2** » Tue Jun 09, 2009 9:06 pm

Thanks, I _think_ I understand -- if you had a chain of bodies ABCDE connected by joints AB, BC, CD, DE you could then solve [AB,CD] in one batch and [BC,DE] in the next?

That's pretty smart!
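The chain example above can be reproduced with a simple greedy pass. This is only a sketch of the batching idea described in the thread, not Bullet's actual code: the function name `build_batches`, the tuple layout, and the `static_bodies` parameter are assumptions made for illustration. The default cap of 20 batches mirrors the configurable limit mentioned later in this thread.

```python
def build_batches(constraints, static_bodies=frozenset(), max_batches=20):
    """Greedily group constraints so no batch touches a dynamic body twice.

    constraints: list of (name, body_a, body_b) tuples.
    Returns (batches, leftover). leftover is non-empty only if max_batches
    was reached before every constraint was assigned.
    """
    remaining = list(constraints)
    batches = []
    # Early-out: the loop stops as soon as every constraint has a batch.
    while remaining and len(batches) < max_batches:
        used = set()              # dynamic bodies already claimed in this batch
        batch, deferred = [], []
        for name, a, b in remaining:
            dynamic = {a, b} - set(static_bodies)
            if dynamic & used:
                deferred.append((name, a, b))   # conflict: try the next batch
            else:
                batch.append(name)
                used |= dynamic
        batches.append(batch)
        remaining = deferred
    return batches, remaining


# Chain of bodies A-B-C-D-E connected by joints AB, BC, CD, DE.
chain = [("AB", "A", "B"), ("BC", "B", "C"), ("CD", "C", "D"), ("DE", "D", "E")]
batches, leftover = build_batches(chain)
print(batches)  # [['AB', 'CD'], ['BC', 'DE']]
```

Constraints inside one batch touch disjoint dynamic bodies, so they can be solved on different cores at the same time; the batches themselves are still processed one after another.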

You mentioned that you used 10 batches -- the number of batches would have to depend on the number of constraints involved with the most-constrained body, right? I.e., if you had a body that was being acted upon by 11 different constraints, it wouldn't be possible to schedule all the constraints in 10 batches. This seems reasonable anyway, unless you have a bunch of spiders colliding with each other (with 8 joints to connect legs to thorax, the thorax can only collide with two things before you have problems). But then I suppose you could "solve" this problem by splitting the thorax into two bodies which are constrained to each other.
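The counting argument here can be checked directly: every constraint attached to the same dynamic body must go into a different batch, so the busiest body gives a lower bound on the batch count. The sketch below assumes a hypothetical helper `min_batches_needed` and made-up body names; it computes only that lower bound, and a greedy batcher may end up using more batches than this.

```python
from collections import Counter


def min_batches_needed(constraints, static_bodies=frozenset()):
    """Lower bound on batch count: max constraints on any one dynamic body.

    constraints: list of (body_a, body_b) pairs.
    """
    counts = Counter()
    for a, b in constraints:
        for body in {a, b} - set(static_bodies):
            counts[body] += 1
    return max(counts.values(), default=0)


# Spider: 8 leg joints on the thorax, plus contacts against other objects.
legs = [("thorax", "leg%d" % i) for i in range(8)]
two_contacts = [("thorax", "rock0"), ("thorax", "rock1")]
three_contacts = two_contacts + [("thorax", "rock2")]

print(min_batches_needed(legs + two_contacts))    # 10: just fits in 10 batches
print(min_batches_needed(legs + three_contacts))  # 11: cannot fit in 10 batches
```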

noctoz · Post by **noctoz** » Wed Jun 10, 2009 2:04 pm

http://download.intel.com/technology/it ... -art08.pdf

This paper by Intel has a section about this.
It's on page 5-6.

raigan2 · Post by **raigan2** » Wed Jun 10, 2009 5:18 pm

Awesome, thanks!

Erwin Coumans · Post by **Erwin Coumans** » Wed Jun 10, 2009 6:06 pm

raigan2 wrote: You mentioned that you used 10 batches -- the number of batches would have to depend on the number of constraints involved with the most-constrained body, right?

Actually, if you just check out the Bullet 2.75 beta1, compile the demos, and run Gpu2dDemo or Gpu3dDemo, you can check the batches on-screen. The tests show that typically fewer than 10 batches are required; we just use a configurable limit of 20, with an early-out once all constraints have been assigned to a batch. Indeed, it is possible to create a worst case by attaching many constraints to a single dynamic rigid body.

noctoz wrote: http://download.intel.com/technology/it ... -art08.pdf

Right. We are working with the authors of that Intel paper, Mikhail Smelyanskiy and others (they are just around the corner here in the bay area).

Splitting the constraints into batches is just a starting point. The current Bullet CUDA parallel PGS implementation performs the batching on the CPU. It is based on Takahiro Harada's work, and his implementation already performs batching and all other stages in parallel on the GPU as well. We plan on providing a parallel GPU implementation of the batching in OpenCL in one of the upcoming Bullet releases.

Thanks,
Erwin

raigan2 · Post by **raigan2** » Wed Jun 10, 2009 7:18 pm

Erwin Coumans wrote: Actually, if you just check out the Bullet 2.75 beta1, compile the demos, and run Gpu2dDemo or Gpu3dDemo, you can check the batches on-screen.

I'm looking forward to checking them out once I get a new graphics card

Johan Gidlund · Post by **Johan Gidlund** » Fri Jun 12, 2009 10:43 pm

I'm hoping to try implementing this on Larrabee as soon as we get some test hardware.

Have you done any benchmarking to see how your Bullet implementation scales with lots of cores?
If so, is this test data available somewhere?

Real-Time Physics Simulation Forum
