Well it would be very nice to test the speed in my environment. I would be thankful if you could send it to me.
It would also be nice to rip out all of these defines and make the code more modern C++, template based.
Have you used totally different triangles in all 100,000,000 tests, or the same triangle?
Probably the best way to code SIMD code today is to use intrinsics. They should be pretty portable. At least much more portable than assembler code.
I'll prepare a ZIP file and send it to you via PM (personal message). Let me know whether you have different results.
My routines don't have any macros or defines. I'd say my routines are pretty much clean C code, but no templates.
BTW, the tropp code involves a bit of a scam, obviously concocted to boost speed by pushing part of the required code outside of his function by him requiring unnatural input arguments (four edge vectors and two triangle vertex positions instead of what every engine has available - the positions of the 6 triangle vertices). So I made sure ALL my timings were based on a level playing field - the timer starts with the same arguments - the 6 triangle vertices - and thus computing two edge vectors is overhead in his routine that he doesn't count. I hate when people pull scams like that. Still, I appreciate their work, because my improvement of their code is the fastest routine, and I would never have figured out their approach myself.
I ran my tests on my engine with a bunch of objects rotating and moving around in 3D space and suffering random collisions with each other. I had the collision detection code call all these routines with the same data each time it needed to determine whether two triangles were intersecting. In all cases, the vertices of both triangles had just been accessed by the collision detection routine, and therefore in all cases that information was in the cache. Therefore, every triangle-pair sent to these routines was unique (and fairly randomized), and no routine had any cache advantages or disadvantages.
I program in straight assembler. I dislike intrinsics for various reasons, but I won't get into that here (not relevant to this topic). Oh, but I do have a question related to this. Currently I have 4 versions of each assembly language routine: #1: 32-bit mode MASM syntax; #2: 32-bit mode linux syntax; #3: 64-bit mode MASM syntax; #4: 64-bit mode linux syntax. Does anyone know whether it is possible to compile the linux syntax code on windoze with the tools that would be running with CodeBlocks (on windoze)? Clearly this process generates an object file that gets linked into an executable that runs on windoze. My question is, if I was to take that object file and make it part of a VisualStudio project, would that link in correctly? If only one of the function protocols is supported in this mode (cdecl or stdcall or whatever), I can live with that. It would be nice to downshift from keeping 4 versions in sync to "only" 2 versions.
Later: I found that PM doesn't let me attach files. Oh well. I created a ZIP file that contains two .cpp files and uploaded the ZIP file where you can download it: http://www.iceapps.com/triangle_triangle_intersection.zip
One .cpp file is the tropp code exactly as his partner posted it on the internet, except for a few very minor formatting touch-ups. This is precisely the file I called to perform the speed tests I mentioned. Note again that you need to start the timer before you compute the four edge vectors his function requires as arguments, or you're not benchmarking on equal footing. You will find those four lines of code near the top of my "improved version" of the tropp function in the other file. Just start your timer, execute those four lines, then call the tropp function, then stop the timer when it returns. That will be equivalent to calling the other intersection functions in the other file (which perform those operations inside the functions themselves). Probably I should have put that code inside tropp's function and changed his arguments, but I didn't feel right doing that (even with comments to that effect), so I didn't.
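For anyone setting up the timing, here is a minimal sketch of the "equal footing" idea: the clock starts while only the 6 vertices are in hand, so the edge-vector setup is counted. Everything named here is an assumption for illustration: `Vec3`, `sub`, `timed_call`, and especially `edge_based_routine`, which is a dummy stand-in with a made-up signature and NOT the tropp function. Which vertex pairs form the edges is also assumed (both edges taken from vertex 0).

```cpp
#include <chrono>

struct Vec3 { double x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

static double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// HYPOTHETICAL stand-in for a routine that wants one vertex plus two edge
// vectors per triangle. Dummy workload only, not a real intersection test.
static int edge_based_routine(const Vec3& c1, const Vec3& p1, const Vec3& p2,
                              const Vec3& d1, const Vec3& q1, const Vec3& q2) {
    (void)c1; (void)d1;  // unused by the dummy workload
    return dot(cross(p1, p2), cross(q1, q2)) != 0.0 ? 1 : 0;
}

// Times one call starting from the 6 vertices, so computing the four edge
// vectors falls inside the measured interval.
static int timed_call(const Vec3 tri_a[3], const Vec3 tri_b[3],
                      std::chrono::nanoseconds& elapsed) {
    const auto start = std::chrono::steady_clock::now();

    const Vec3 p1 = sub(tri_a[1], tri_a[0]);  // assumed edge choice
    const Vec3 p2 = sub(tri_a[2], tri_a[0]);
    const Vec3 q1 = sub(tri_b[1], tri_b[0]);
    const Vec3 q2 = sub(tri_b[2], tri_b[0]);
    const int result = edge_based_routine(tri_a[0], p1, p2, tri_b[0], q1, q2);

    const auto stop = std::chrono::steady_clock::now();
    elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return result;
}
```

A single call is far too short to time reliably, so in practice you would loop over many triangle pairs inside one start/stop interval and divide by the count.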
The other .cpp file contains functions from my engine, including two additional triangle-triangle routines. The one labeled "30% slower" is the one that makes perfect sense to me and is what I wrote from tabula rasa originally. It executes 6 divide operations, which surely slows it down. I see how to eliminate the divide overhead on 4 of those divides by performing other non-math operations before I access the result of the divide. Furthermore, two divides can be performed in parallel in SIMD/SSE2+ assembly/intrinsics. But I haven't done any of this because the code would become less readable and especially because I suspect the routine would probably still be 10% to 20% slower than my cleaned-up and slightly improved version of the tropp routine. I did find a couple other optimizations I could make in my original function, but I sorta lost interest because I have lots of other work to do on my engine, and suspect that I would never quite make it faster than the current fastest function.
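To illustrate the "two divides in parallel" remark, here is a rough sketch using SSE2 intrinsics (x86/x86-64 only). The helper name `divide_pair` is made up for this example and is not from the ZIP file. It only shows the packed divide itself, not how it would fit into the intersection routine.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Computes n0/d0 and n1/d1 with a single packed double-precision divide.
// Hypothetical helper for illustration only.
static void divide_pair(double n0, double d0, double n1, double d1,
                        double& q0, double& q1) {
    const __m128d num = _mm_set_pd(n1, n0);  // low lane = n0, high lane = n1
    const __m128d den = _mm_set_pd(d1, d0);  // low lane = d0, high lane = d1
    const __m128d quo = _mm_div_pd(num, den);

    double out[2];
    _mm_storeu_pd(out, quo);                 // out[0] = low lane, out[1] = high lane
    q0 = out[0];
    q1 = out[1];
}
```

Whether this actually beats two scalar divides depends on the CPU and on how much independent work surrounds it, so it would need timing like everything else.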
ig_triangle_triangle_intersection() - fastest one, improvement of the tropp routine
ig_triangle_triangle_intersectiono() - my original tabula rasa implementation
tri_tri_intersect3D - tropp function in separate file
In the second file I included several other functions that these functions call, just in case replacing them isn't easy for you (or will be easier to replace with these to examine).
A few notes about my code. I have typedefs in a file to make consistent, readable type names, so you'll see lots of variables declared with types like f64, s32, s64, and so forth. I'm sure you can create equivalent typedefs in your program without any trouble, and I'm sure you know f64 is "double", f32 is "single" or "float" (gee, I forget which it is now!), s32 is a 32-bit signed integer, u32 is a 32-bit unsigned integer, and so forth. The only non-trivial type names might be "cpu" and "f64vec4". The "cpu" type is a 32-bit signed integer when compiling 32-bit mode applications, and a 64-bit signed integer when compiling 64-bit mode applications. f64vec4 is what it sounds like - a structure containing four f64 variables called .x .y .z .w defined with convenient unions so the variables can alternatively be accessed as s64 integers as .ix .iy .iz .iw or as array elements .a[0] ~ .a[3]. If you need or want my file that creates all these and many other useful type names, let me know.
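If you'd rather recreate the types than wait for the file, something along these lines should be close enough to compile against. This is a guess at equivalent definitions based only on the description above, not the actual header. Using `std::intptr_t` for `cpu` is an assumption, and the anonymous structs inside the union are a compiler extension in C++ (accepted by GCC, Clang and MSVC), as is reading a union member other than the one last written.

```cpp
#include <cstdint>

// Assumed equivalents of the type names described in the post.
typedef double        f64;
typedef float         f32;
typedef std::int32_t  s32;
typedef std::uint32_t u32;
typedef std::int64_t  s64;

// Assumption: a signed integer as wide as a pointer, so 32 bits in
// 32-bit builds and 64 bits in 64-bit builds.
typedef std::intptr_t cpu;

// Four f64 values reachable as .x .y .z .w, as s64 integers .ix .iy .iz .iw,
// or as array elements .a[0] through .a[3].
// Relies on anonymous structs and union type punning (compiler extensions).
union f64vec4 {
    struct { f64 x, y, z, w; };
    struct { s64 ix, iy, iz, iw; };
    f64 a[4];
};
```

The integer view is handy for things like inspecting the raw bits of a component, for example checking a sign bit without a floating-point compare.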
You will also note my math function names are "excessively readable and descriptive"... as in "very long". At least you're unlikely to wonder what these functions do. You'll see what I mean. Also, the code is written to "align and look right" when tabs are displayed as 4 spaces wide. Please note the terms attached to this code as stated at the top of the file. Note that the tropp functions (in the separate .cpp file) are not my code, but were downloaded from the internet.
If you find these functions call any functions that I didn't include, let me know and I'll dig them up.
Let me know your impressions, opinions and independent timing results.