So far, this code compiles with Borland C++ V4.0 for cases up to N=16, Sunpro4 for cases up to N=64, and Photon C++ for N=16. Andrew Dalgleish (andrewd@axonet.com.au) was able to compile it with Microsoft Visual C++ 4.1 for N=256 (however, the FFT wasn't completely inlined). Thomas Kunert (kunert@Ptprs1.phy.tu-dresden.de) was able to compile it with IBM xlC for N=256 (possibly also not completely inlined). Felix Kasza (felixk@mailbag.shd.de) set a record with N=512 on Visual C++ (Alpha XL 300, single processor, NT 4.0), which required 80 MB of memory. If you succeed in compiling this code on another platform, please let me know.
I used the FFT routine four1(..) from 'Numerical Recipes in C' as a base. Using that routine as a benchmark, the inlined version runs 4.5 times faster on Borland C++ (N=16), and 3.7 times faster on Sunpro4 (N=16). These figures should be taken with a grain of salt, since the comparison is not like for like: the Numerical Recipes version calculates its weights on the fly, whereas the template metaprogram version precalculates them at compile time.
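To make the difference between the two ways of obtaining weights concrete, here is a minimal sketch, not the code from the program itself. It assumes a C++11 compiler and uses constexpr functions as a stand-in for the template metaprogram technique. The names sine_tail, sine_series, Weight and runtime_sin are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

// Nested Taylor series for sin(x)/x, evaluated recursively:
//   1 - x^2/(2*3) * (1 - x^2/(4*5) * (1 - ...))
// k is the current term, terms is where the recursion stops.
constexpr double sine_tail(double x2, int k, int terms) {
    return k > terms
        ? 1.0
        : 1.0 - x2 / ((2.0 * k) * (2.0 * k + 1.0)) * sine_tail(x2, k + 1, terms);
}

// sin(x) from 16 series terms; intended for the small angles used here.
constexpr double sine_series(double x) {
    return x * sine_tail(x * x, 1, 16);
}

// Weight number I of an N-point transform. N and I are template
// parameters, so the compiler can fold each weight into a constant.
template <unsigned N, unsigned I>
struct Weight {
    static constexpr double sine()   { return sine_series(2.0 * kPi * I / N); }
    static constexpr double cosine() { return sine_series(2.0 * kPi * I / N + kPi / 2.0); }
};

// A run-time counterpart for comparison: the same weight, computed
// when the program runs rather than when it is compiled.
inline double runtime_sin(unsigned n, unsigned i) {
    assert(n != 0);
    return std::sin(2.0 * kPi * i / n);
}

// Evidence that the weight is available to the compiler:
// static_assert only accepts compile-time constants.
static_assert(Weight<4, 1>::sine() > 0.999999 && Weight<4, 1>::sine() < 1.000001,
              "sin(pi/2) is evaluated at compile time");
```

Weight<16, 3>::sine() and runtime_sin(16, 3) give the same value to within rounding, but only the second one costs anything when the program runs, which is one reason the speedup figures favour the inlined version.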
For very large N, four1(..) may be faster, since the fully inlined code will not fit in the instruction cache, whereas the compact loops of four1(..) will. To tackle large transforms, it's best to use a routine like four1(..) for the outer stages and bottom out to an inlined version once the sub-transforms reach N=16, 32, or 64.
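The bottoming-out idea can be sketched as follows. This is an illustration under assumptions, not four1(..) and not the metaprogram itself: it uses a plain recursive radix-2 transform with the exp(-2*pi*i*k*n/N) sign convention, and the hypothetical names SmallFFT, fft_rec, hybrid_fft and kCutoff are invented for this example. SmallFFT fixes its size at compile time, so a compiler is free to inline and unroll it, but unlike the real program it does not precompute its weights.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;
const double kTwoPi = 6.28318530717958647692;
const unsigned kCutoff = 16;  // size at which the generic routine hands over

// Fixed-size kernel: N is a template parameter, so every loop bound is
// known at compile time and the recursion can be inlined completely.
template <unsigned N>
struct SmallFFT {
    static void apply(const cplx* in, std::size_t stride, cplx* out) {
        SmallFFT<N / 2>::apply(in, 2 * stride, out);                   // even samples
        SmallFFT<N / 2>::apply(in + stride, 2 * stride, out + N / 2);  // odd samples
        for (unsigned k = 0; k < N / 2; ++k) {
            cplx e = out[k];
            cplx o = std::polar(1.0, -kTwoPi * k / N) * out[k + N / 2];
            out[k] = e + o;
            out[k + N / 2] = e - o;
        }
    }
};

// Terminates the template recursion: a 1-point transform is a copy.
template <>
struct SmallFFT<1> {
    static void apply(const cplx* in, std::size_t, cplx* out) { out[0] = in[0]; }
};

// Generic run-time routine: small code that stays in cache for any n.
// It splits the problem until the pieces reach kCutoff points.
void fft_rec(const cplx* in, std::size_t n, std::size_t stride, cplx* out) {
    if (n == kCutoff) {
        SmallFFT<kCutoff>::apply(in, stride, out);
        return;
    }
    fft_rec(in, n / 2, 2 * stride, out);
    fft_rec(in + stride, n / 2, 2 * stride, out + n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        cplx e = out[k];
        cplx o = std::polar(1.0, -kTwoPi * k / n) * out[k + n / 2];
        out[k] = e + o;
        out[k + n / 2] = e - o;
    }
}

// Entry point: n must be a power of two and at least kCutoff.
std::vector<cplx> hybrid_fft(const std::vector<cplx>& x) {
    const std::size_t n = x.size();
    assert(n >= kCutoff && (n & (n - 1)) == 0);
    std::vector<cplx> out(n);
    fft_rec(&x[0], n, 1, &out[0]);
    return out;
}
```

Calling hybrid_fft on a 1024-point vector runs six generic splitting stages and then 64 instances of the 16-point kernel. Raising kCutoff to 32 or 64 trades more inlined code for fewer generic stages; which value wins depends on the compiler and the cache, so it is worth measuring.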
The program is useful as an illustration of template metaprogram techniques: