Inline and Non-Inline
Performance coding is always a balance between execution speed and resources consumed. Even achieving the desired execution speed is a balance, as the extra code complexity needed for a given optimization can at worst defeat the speed increase the change was meant to deliver. A general method to improve execution speed is having a function inlined, that is, compiled directly into the calling function. Let's consider the work that is normally done when calling a function:
1. Marshal the parameters for the function onto the stack
2. Obtain the address of the function, and jump to the location.
3. Construct a stack frame for local variables
4. Perform the function execution
5. Deconstruct the stack frame
6. Marshal return values into the expected return locations (stack or registers)
7. Return to the calling function
Having the code inline, compiled directly into the calling function, removes all the steps except the function execution itself. The parameters are used directly, though copies may still be made on the stack if necessary (as part of the stack frame already created by the calling function). The return values are directly available to the calling function and will be used in that way. There is no jump to a new execution pointer, and no need for a jump back. In all, we have for the most part eliminated the memory work of moving the parameters and creating a new stack frame, along with the branches caused by the instruction pointer jumps.
So what's the problem? The issue is that code is no different than data in many ways. It needs to be loaded onto the CPU for processing and execution. Inline compilation duplicates the function body at every call site, and the resulting code bloat causes more memory fetches (and instruction cache misses) when executing the code. In performance-critical code this can lead to execution slowdowns that are hard to pin down. One way of controlling this is to keep tighter control over which functions are inlined and which are not. Just as there is a compiler hint to inline a function, it is possible to mark functions so that they are not inlined.
For example, let's say you have a function that has critical performance requirements. In looking over the implementation, you notice that it has two possible execution paths, and that 95% of the time callers take only one of them (possibly the second path is for initialization, overflow or other error handling). The problem is that 80% of the code is in the rarely used second path. By splitting the function in two, one for the common case and a second containing the bulk of the code (but rarely executed), it is possible to greatly reduce the executed size of the function (improving instruction cache performance) at minimal cost.
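A minimal sketch of that split, under a few assumptions: the `RingBuffer` type and both function names are made up for illustration, and the `FORCE_INLINE` / `NO_INLINE` macros simply wrap the MSVC and GCC/Clang spellings of the inline and no-inline hints. The common path stays tiny and inlinable, while the rare overflow path is pushed out of line so it is not copied into every caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compiler hints; these spellings cover MSVC and GCC/Clang.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#  define NO_INLINE    __declspec(noinline)
#else
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE    __attribute__((noinline))
#endif

// Hypothetical fixed-capacity buffer, used only for illustration.
struct RingBuffer {
    static constexpr std::size_t kCapacity = 4;
    std::uint32_t items[kCapacity];
    std::size_t   count;
    std::size_t   dropped;  // how many times the rare path ran
};

// Rare path: overflow handling. The bulky code lives here, out of line,
// so it is not duplicated at every call site of push().
NO_INLINE void push_overflow(RingBuffer& rb, std::uint32_t value)
{
    // Drop the oldest entry to make room for the new one.
    for (std::size_t i = 1; i < RingBuffer::kCapacity; ++i)
        rb.items[i - 1] = rb.items[i];
    rb.items[RingBuffer::kCapacity - 1] = value;
    ++rb.dropped;
}

// Common path: small enough to inline cheaply.
FORCE_INLINE void push(RingBuffer& rb, std::uint32_t value)
{
    assert(rb.count <= RingBuffer::kCapacity);
    if (rb.count < RingBuffer::kCapacity) {
        rb.items[rb.count++] = value;
        return;
    }
    push_overflow(rb, value);  // an ordinary call, taken rarely
}
```

Whether this actually wins depends on the real hit rates and code sizes, so it is worth checking the generated assembly and profiling before and after.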
Take Away:
1. Function inlining can produce a performance increase by flattening the compiled code, at the cost of code bloat.
2. When an inline function contains a relatively large amount of code that is rarely executed, there may be an advantage to moving that code into a secondary function tagged as non-inline to reduce the bloat.
Pass In Register
Well, apparently it's been over three years since I last made a post. I was originally planning on trying to post something every week, but I am really bad when it comes to any type of regular correspondence. We'll see how long I manage to keep it going this time, eh :)
So, as for the title: pass in register. This can be a very good performance gain in some cases, but it needs to be balanced against the number of available registers. On the PPC platforms (consoles) we have a ton of registers, so passing things by register can be done regularly without too much forethought. On the PC, however, the number of available registers is much more limited and care should be taken when using this approach. Keep in mind as well that the optimizer for the PC platforms can often do a better job, since most functions that use pass-in-register semantics are most likely inlined as well. Be careful trying to be smarter than the optimizer (even if you think, like me, that most optimizers are only slightly better than a five-year-old when it comes to manipulating code).
If you have read this far, there is a good chance you are shaking your head: pass in register on the PC? When using vector (SIMD) variables, most people believe that it is not possible to actually do this on the PC. As it turns out, it is possible, just a little annoying. Assuming you are using the Microsoft compiler, pass in register is done by using the native type. Be warned, typedefs are not equivalent in this case. The common method for creating a cross-platform math library is either to use a type definition to map the native type to a common name or, more commonly, to use a structure to contain the native type (usually within a union for element access). As it turns out, there is no way (to my knowledge) to convince the compiler to pass either of these in register, neither the type definition nor the container structure. As a matter of fact, the only way that I have found to work is to use the actual __m128 type as the variable type in the parameter. Documentation also warns that this will work only for the first three parameters. If you try to pass more than three parameters using this method, you will get the standard compiler error about being unable to guarantee variable-stack alignment requirements.
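Here is a rough sketch of the pattern, not a definitive recipe. `Vec4`, `load_aligned` and `madd` are hypothetical names; only `__m128` and the `_mm_*` intrinsics from `<xmmintrin.h>` are real. The worker function takes the native type by value, and the wrapper overload is passed by reference and unwrapped before the call.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics and the native __m128 type

// Hypothetical wrapper of the kind a cross-platform math library uses.
struct Vec4 {
    __m128 v;
};

// Loads four floats; _mm_load_ps requires a 16-byte aligned address.
inline __m128 load_aligned(const float* p)
{
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    return _mm_load_ps(p);
}

// Native type in the parameter list (three parameters at most, per the
// limit described above), so the values can be passed in registers.
inline __m128 madd(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);  // a * b + c in each lane
}

// Wrapper overload: taken by reference, unwrapped before the real call.
inline Vec4 madd(const Vec4& a, const Vec4& b, const Vec4& c)
{
    Vec4 r;
    r.v = madd(a.v, b.v, c.v);
    return r;
}
```

Whether the arguments really travel in XMM registers depends on the compiler, target and calling convention, so the release-build assembly is the only reliable confirmation.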
Take away:
1) It is possible to pass in register on the PC platform.
2) Using this passing method can provide a performance boost by preventing a store-load on the stack.
3) Needs to be balanced against the usage case and the number of available registers.
PPC Compiler
I was quite proud of the way I had designed my math and collision code base using templates so that it allowed for easy flexibility between float and double computations. With the native 64-bit nature of the new PPC chips this could be a very strong asset for collisions that require extra precision (quadratic surfaces for instance). Then I find out that my good friend, Mr. Compiler, insists on doing a heap shuffle on each and every parameter for which a 1:1 mapping between variable and register type does not exist. For instance, the compiler will not move a vector through on 4 float registers or a matrix through on 3/4 vector registers. It will insist on doing a heap shuffle, even when inlining the code (don't ask me, I'm just saying what I see in the release-optimized asm output). This is enough for me to want to commit serious bodily harm on someone; the speed loss is ridiculous (for instance, some hand tweaking of one loop in the code base changed my frame rate from 2 FPS to 55 FPS). There are times you just want to take the compiler out back for a few rounds, eh :-) So as it stands, the only way to get the needed efficiency would be to use a #define network of math functions, since this would allow for the automatic transfer of a matrix as vectors. What a pain in the ass, eh :-( Anyways, going to play with it a bit more and see, but as far as I know this was never solved for the PS2 compiler either, so I don't have high hopes.
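To make the two options concrete, here is a small sketch with made-up names (`Vec3`, `dot`, `VEC3_DOT`); it is not the actual code base. The template form passes each vector as a single aggregate parameter, which is the case the compiler handled badly. The macro form expands textually at the call site, so there is no parameter to pass at all.

```cpp
// Templated vector so the same code serves float and double.
template <typename T>
struct Vec3 {
    T x, y, z;
};

// Function form: each vector travels as one aggregate parameter.
template <typename T>
constexpr T dot(const Vec3<T>& a, const Vec3<T>& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Macro form: expands in place, leaving only component expressions.
// Arguments are evaluated more than once, so pass plain variables.
#define VEC3_DOT(a, b) ((a).x * (b).x + (a).y * (b).y + (a).z * (b).z)
```

The macro gives up type checking and is unsafe with arguments that have side effects, which is a large part of why it is such a pain to maintain.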
Xbox360 GPU functions
I have been spending some time on the Xbox360 recently, working out how best to use the L2 locking functionality of the hardware and the specific GPU function calls in the API. Essentially they allow for greater separation between CPU and GPU execution, minimizing the number of synchronization points. This has required a rewrite of how video constants are stored and manipulated in general, keeping in mind the 64-byte alignment that is required for data transfer from the CPU to the GPU. Overall it's been interesting.
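As a rough illustration of the alignment constraint only (no Xbox360 API calls are shown, and `ConstantBlock` and `kGpuAlign` are hypothetical names), standard C++ can enforce the 64-byte requirement at compile time:

```cpp
#include <cstddef>

// 64 bytes is the alignment cited above for CPU-to-GPU transfers.
constexpr std::size_t kGpuAlign = 64;

// Hypothetical block of shader constants: four float4 registers = 64 bytes.
struct alignas(kGpuAlign) ConstantBlock {
    float reg[4][4];
};

// Fail the build if the layout ever drifts from the transfer granularity.
static_assert(alignof(ConstantBlock) == kGpuAlign,
              "block must start on a 64-byte boundary");
static_assert(sizeof(ConstantBlock) % kGpuAlign == 0,
              "block must be a whole number of 64-byte units");
```

Arrays of such blocks then stay aligned element by element, as long as the allocator that provides the memory honors the alignment too.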
Someone emailed me recently pointing out that my HTML parser mangled the code drop online, dropping any code after a division symbol (the parser was interpreting it as a failed comment). This has been fixed, so the code base should be more reasonable now. If anyone sees any other problems, please email me!
Implemented a basic input library through XInput. Bought myself a 360 controller for Windows so that I would never have to revisit DirectInput ever again. Anyone who has ever had to create a robust and thorough solution using it will understand: it's a nightmare. I understand why it was designed the way it was (a PC can have any type of input device), but from a game point of view it could drive you nuts. XInput is just a slam-bam-thank-you-ma'am in-and-out affair. It's wonderful.
Vectorization of a Physics Solver
Been spending the last few days taking a standard physics solver setup and solution and vectorizing the resulting operations. It's been a lot of fun and will make porting it to, or working with it on, an SPU much easier. I have also been trying to isolate small tasks to get a good to-do list going for the holiday break. Finally, been working out the last remaining issues on the X360 build, which is now up and running. I threw in basic controller, audio and XMV support since it literally only takes six lines of code on the X360. It's amazing how easy the SDK for that platform is to use. One day when I feel extremely masochistic and in need of a good sledgehammer to the brain, I'll work on the PS3 port. It is possible hell will freeze over first. Hard to say. I did manage to survive multiple PS2 titles, so it ain't all bad :-)
©2010 by Andrew Aye