Selecting one of the other children as first traversal for raycast makes the worst case worse.
What is the worst case? When the ray doesn't touch anything? Then this is correct: in that case the extra CPU time spent on ordering the nodes is wasted. However, sorting the 2 candidate nodes is very cheap (2 dot products), and in practice the overhead is difficult to notice (as you know, it's nothing compared to memory access).
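To give an idea of how little work the ordering step is, here is a minimal C++ sketch. The names (`Vec3`, `Ray`, `firstChildIsNearer`) are made up for this example and are not Opcode or Bullet API, and projecting each child's box center on the ray direction is only one possible sorting key, assumed here for illustration.

```cpp
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline Vec3 sub(const Vec3& a, const Vec3& b) {
    return Vec3{a.x - b.x, a.y - b.y, a.z - b.z};
}

struct Ray { Vec3 origin; Vec3 dir; };

// Returns true when child A should be traversed before child B.
// Cost: two dot products, one per candidate child (hypothetical sorting key:
// projection of each child's box center on the ray direction).
inline bool firstChildIsNearer(const Ray& ray, const Vec3& centerA, const Vec3& centerB) {
    const float projA = dot(ray.dir, sub(centerA, ray.origin));
    const float projB = dot(ray.dir, sub(centerB, ray.origin));
    return projA <= projB;
}
```

The traversal would then recurse into the nearer child first and into the other child second, which only pays off when combined with ray shrinking.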
I doubt you need this 'optimization'.
You only say that because it makes your life easier, but you would include it in Bullet immediately if it were easy.
The optimization is not "vital", but it -definitely- helps.
There is a fundamental problem with the "2 passes" approach we talked about a while back. You said that Opcode would be cleaner if things were separated in distinct phases:
1) traverse the tree, gather touched triangles
2) perform the actual test against reported triangles
This is certainly cleaner, but for raycasts it makes a few optimizations difficult to implement. For example, if you traverse the closest nodes first and shrink the ray according to the current best distance, you stop the traversal earlier (culling a lot of nodes that would otherwise have been visited).
Let's take a pathological case: imagine your mesh is just a vertical pile of triangles, and your ray is vertical as well, touching all of them. If you do 1), then you end up reporting all the nodes of the tree (touching the full tree memory) before performing a single raycast in 2). However, if you do both at the same time, and implement the optimizations I mentioned, then it's likely that you will only fetch a single node (or the first few nodes), and then early exit because the shrunk ray doesn't touch the remaining nodes.
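The pathological case can be sketched in a few lines of C++. This is a deliberately simplified 1D model, not Opcode code: each "triangle" is reduced to a height along the ray axis, the ray starts at z = 0 and points straight up, and `Node`, `build`, `gatherTouched` and `raycastShrinking` are hypothetical names used only for this illustration. Both functions return the number of nodes they traverse, so the two approaches can be compared.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1D reduction of the "vertical pile" case: triangle i is horizontal at
// heights[i], so its hit distance along the upward ray is simply its height.
struct Node {
    float zMin, zMax;   // extent of the node along the ray axis
    int left, right;    // child indices, -1 for a leaf
};

// Builds a balanced binary tree over heights[first, last), sorted ascending.
int build(std::vector<Node>& nodes, const std::vector<float>& heights,
          std::size_t first, std::size_t last) {
    assert(first < last);
    const int self = static_cast<int>(nodes.size());
    nodes.push_back(Node{heights[first], heights[last - 1], -1, -1});
    if (last - first > 1) {
        const std::size_t mid = first + (last - first) / 2;
        const int l = build(nodes, heights, first, mid);
        const int r = build(nodes, heights, mid, last);
        nodes[self].left = l;   // use the index: the vector may have reallocated
        nodes[self].right = r;
    }
    return self;
}

// Phase 1 of the two-pass scheme: collect every leaf the full-length ray
// touches. Returns the number of nodes traversed.
int gatherTouched(const std::vector<Node>& nodes, int i, float maxDist,
                  std::vector<float>& touched) {
    const Node& n = nodes[i];
    if (n.zMax < 0.0f || n.zMin > maxDist) return 0;
    if (n.left < 0) { touched.push_back(n.zMin); return 1; }
    int visited = 1 + gatherTouched(nodes, n.left, maxDist, touched);
    visited += gatherTouched(nodes, n.right, maxDist, touched);
    return visited;
}

// Single pass: near child first, and every hit shrinks the ray to `best`.
// Returns the number of nodes traversed.
int raycastShrinking(const std::vector<Node>& nodes, int i, float& best) {
    const Node& n = nodes[i];
    if (n.zMax < 0.0f || n.zMin > best) return 0;   // culled by the shrunk ray
    if (n.left < 0) { if (n.zMin < best) best = n.zMin; return 1; }
    // The ray points up, so the left (lower) child is the near one.
    int visited = 1 + raycastShrinking(nodes, n.left, best);
    visited += raycastShrinking(nodes, n.right, best);
    return visited;
}
```

In this toy model, with eight stacked triangles at heights 1 to 8, the gather pass traverses all 15 nodes before any triangle test happens, while the single pass traverses 4 nodes (one path from the root to the nearest leaf) and culls everything else.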
I found that this optimization does help in real cases - at least in a SW version.
Better would be to randomize the children, so you get a good average case.
I doubt this is really useful.
Also, doing a stackless skip-list traversal allows you to perform calculations on the sub-trees in parallel.
Maybe, but I think this is off-topic.
Is this optimization really worth the hassle?
With usual trees it certainly is, especially since it's very easy to implement. With stackless trees, no, it is not.
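For readers unfamiliar with stackless trees, here is a rough C++ sketch of a skip-list (escape index) traversal, reusing a simplified 1D overlap test. The `SkipNode` layout and `traverseStackless` are assumptions made for this example, not the actual Opcode or Bullet node format. As I understand it, the reason ordering is a hassle here is that the visiting order is baked into the array layout: nodes are always walked in the same depth-first order, so choosing the nearer child per ray would need something more than this loop.

```cpp
#include <cassert>
#include <vector>

// Nodes are stored in depth-first (preorder) order. `escapeIndex` is the index
// of the next node to visit when this node's subtree is skipped (hypothetical
// layout, absolute indices).
struct SkipNode {
    float zMin, zMax;   // extent of the node along the ray axis
    int escapeIndex;
    bool isLeaf;
};

// Returns the indices of the leaves touched by a ray covering [0, maxDist].
std::vector<int> traverseStackless(const std::vector<SkipNode>& nodes, float maxDist) {
    std::vector<int> hits;
    const int count = static_cast<int>(nodes.size());
    int i = 0;
    while (i < count) {
        const SkipNode& n = nodes[i];
        assert(n.escapeIndex > i);   // guarantees forward progress
        const bool overlap = n.zMax >= 0.0f && n.zMin <= maxDist;
        if (overlap && n.isLeaf) hits.push_back(i);
        if (overlap || n.isLeaf) ++i;    // step to the next node in preorder
        else i = n.escapeIndex;          // skip the whole subtree
    }
    return hits;
}
```

There is no stack and no recursion, only a single index moving forward through the array, which is what makes this form attractive but also what fixes the traversal order.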
How many bytes are you using for your stackless Node?
I'm not the one who rewrote them, but I think it's 20 bytes.