MarkF Senior Heliman Location: Palo Alto, CA
| How to Write 10X Faster CodeHi, Gang!
Warning: We're about to get into coding tips and techniques here, which is really far afield from helicopters. Skip to the next message if you're not interested in this!
These days, there aren't too many of us left that are real efficiency fiends. Of those that are, most started in the early days of computing when almost any "real time" project was a major challenge (I started back with the 8008 and the Signetics 2650)! While this is off the topic of the receiver, per se, I'll describe some of the techniques that I've learned over the years, which start well before contemplating assembler language.
Before we get started, my receiver project is structured around what I call "Events", that is either an Input (reading the pin connected to the RF portion of the receiver), or an Output Set (driving servo pins High), or an Output Clear (driving servo pins low), so I'll be using that terminology below.
Know When to Break The Rules!
One of the first, and most important techniques for improving efficiency is to know when, and how, to break the rules of modularity. Computer Science profs will drill into you over and over again how important it is to maintain strict modularity guidelines like data hiding, abstraction and isolation, etc, and they are wise to do so! However, as Ralph Waldo Emerson once wrote, "A foolish consistency is the hobgoblin of little minds". When you're getting started, there is no question whatsoever that strict modularity is the way to go. However, as you gain more experience, you'll begin to identify times when strictly enforcing modularity can seriously hinder performance.
One of the best-known examples of this is "TCP/IP Fast Forwarding", which is central to the Internet. To begin with, TCP/IP itself does not adhere rigidly to the standard OSI network model, it breaks modularity for performance. What Fast Forwarding does, though, is to let the lowest-level network code reach up two layers into the network stack directly, to handle the 95% case of simple network forwarding. By doing this, CPU overhead can be dropped by an order of magnitude.
Overdoing this can easily wind you up with a spaghetti mess of unmaintainable code, so the challenge is how to do it well and appropriately. Unfortunately, this mostly takes experience, for it's difficult to specify a hard and fast set of rules which tell you when to do this. However, when performance demands it, modularity-breaking really can be appropriate. Along those lines...
Design Your Data Structures for the Inner Loop
When I tackle a performance-critical program, I'll start with how I store data first, keeping in mind how the inner loop will access the data. While multiple nicely-coded structures can be pretty, it can also result in far less efficient data accesses. The receiver uses a simple linear array of packed structures that allows each event to be read by the event code in just one instruction, instead of the 10X more it might take with structures that aren't designed for access efficiency.
Global Variables Aren't Necessarily Evil!
OK, so maybe you'll prefer me to say that global variables are an evil necessity! There are many times when global variables are important for performance enhancement, and this is definitely the case here. As one example, I could be passing around the "time_base" variable that determines the timing of every action in the receiver, but why? Instead, this is a global variable, as are many of the others.
Note that this, too, shouldn't be overdone, since your goal is still to try to be as modular as possible. However, the key is to recognize when modularity is getting in the way, rather than helping out!
Use Fast Algorithms
After you've designed the data structures, use smart algorithms. The most tightly coded assembler-language bubble sort routine will probably be blown away by a C quick-sort. Along those lines, the add_event() routine uses a binary search and an insertion sort to efficiently add new events to the event lists.
Think first... about algorithms. THEN start coding!
Make the Tighest Inner Loop You Can
Still in the realm of algorithm design, the next step is to move whatever you can outside of the inner loop code. Simple examples include things like counters, where you count the number of bytes received as you receive them. Instead of doing this, move the count determination outside of the loop. The goal here is to pare away every operation other than that which is absolutely essential to the core problem.
Remove Special Cases
The next key thing to do is to try to make everything in the inner loop as consistent as possible. Wherever you can, check for special cases outside of the inner loop, rather than inside the inner loop. Comparisons and branches are extremely slow!
The consequence of this is that you'll want to make everything handled by the inner loop as consistent as possible. In the case of the receiver, the code always loads both a delay length, and an output value - even for input events! The reason for this is that it is far faster to always read both than it is to check to see if this is an input event and just read the delay value.
Eliminate Library Calls
Here's one of the first tricks that are relatively unique to assembly language. Wherever possible, don't call library routines in an inner loop, instead perform them inline. A perfect example in the receiver is the rf_input() function, which needs to keep a 64-bit long buffer of the most recently received samples. The C compiler forces a library call to accomplish the shifting of the "long long" rf_sample value, even if you specify inlining (and that's not at all unusual). By replacing this entire call with two ARM7 "MOV xx,yy,RRX" instructions, we save more than an order of magnitude here.
Use Data Types that C Doesn't Support
Continuing with the rf_input() function, try to use unique features of the processor that C doesn't support. A perfect example is the processor's carry flag, which essentially allows you to have a 65-bit data type. Accomplishing an efficient input in assembler requires just one instruction to load the Input register, one to shift the input into the carry flag, and two to do the 64-bit shift.
Since C doesn't support this data type, you have to replace it with the before-mentioned library call, the Input read, an immediate read of a mask value, an AND operation, a Shift operation, and an Or operation. Much slower!
Use The Instruction Set!
The next thing to learn is to carefully study the instruction set of the processor that you are using, and to seek ways of exploiting its unique abilities. A great example here is the first instruction that starts each event in the receiver: "LDMIA R4!,{R2-R3}". This single opcode loads both the duration of an event, as well as the outputs to be written, and it advances the data pointer to the next event in the event list.
Another, and even more significant example, is that the ARM instruction set supports conditional execution. That is, it lets the processor execute an instruction if and only if the processor flags are set in a particular way as the result of a previous operation. Before I wrote the event compiler, I used this capability to allow the event handler to process all three different types of events with no branches, by loading the event_type into the processor flags and then conditionally executing the I/O instructions (this is also another example of removing special cases).
Similarly, the ARM lets you decide on an instruction-by-instruction basis whether you want to have the operation affect the processor flags, it has incredibly powerful shift capabilities, etc. Spending the time to learn these capabilities in-depth can make a big difference!
Eliminate Temporary Variables
One huge benefit of assembly language is that it lets you eliminate temporary variables that compilers will require. For example, very few C compilers will generate an instruction like "LDR R0,[R0]", in which case a register that's pointing to something is loaded with the value of what it's pointed to. Using methods like this reduces the number of registers required, which can prevent running out of registers. This can lead to not needing to store values in memory, which is a very good thing!
Eliminate Loads and Stores
Compilers rarely are able to structure complex routines such that no memory references are required. However, if you very carefully analyze the flow of data in your inner loop, you can usually manage to keep all intermediate values in registers, rather than in memory. While this isn't as important on the chips we're using here, this can be enormously important on fast X86 machines, where a single memory reference can result in the loss of hundreds of instruction execution opportunities. With processors like those, just eliminating a single "uncached" memory reference can speed up code nearly two orders of magnitude!
Write a Compiler
When all of these techniques don't meet your expectations, you can always do what I did, which is to write a compiler that generates its own machine code on-the-fly! This is perhaps the ultimate performance-enhancing trick, for it results in the fastest possible execution speed. It's also not as hard as you might think - the code I'll be publishing will show you one way it can be done.
Summary
After all is said and done, it really is possible to deliver 10X speed-ups in many situations by using assembly langauge. However, choose your opportunities carefully, for writing this kind of code requires far more time than normal C hacking. By carefully designing the entire program so that the inner loop is minimized, it's then worth the effort to optimize the hell out of it!
Have Fun!
MarkF
Addendum: While it isn't relevant to what we're doing here, I'll mention another approach that's extremely important when working on the "big" CPUs like the P4 and the Opteron. These processors execute multiple instructions simultaneously, and each instruction has a variable amount of execution time. When an instruction needs the output of a prior instruction that hasn't yet completed, it will force the CPU to "stall", wasting cycles until the prior instruction completes. While the compiler will try to "schedule" instructions in an order that minimizes execution time, no compiler can come close to what humans are capable of. In several cases, we've created routines that execute the first half of an inner loop, then we've interleaved, or "folded together" the second half of the inner loop with another copy of the first half of the inner loop, the purpose of which is to ensure that the processor is never stalled. This can have an enormous impact on performance. In one nice case in the AMD Opteron Performance Optimization manual, AMD shows that it takes 16 clock cycles to perform a single complex multiplication on the Opteron processor. However, if you very carefully write the code, it's possible to perform four complex multiplications in 17 clock cycles!!! Now, writing this kind of code is definitely in the "advanced" category, but it's a powerful technique that can have a major impact on performance-critical code! |