T. Veldhuizen, "Using C++ template metaprograms," *C++ Report* Vol. 7
No. 4 (May 1995), pp. 36-43.

*C++ Report (ISSN 1040-6042) is published nine times per year by SIGS Publications Inc., 71 West 23rd St., 3rd Floor, New York, NY 10010. © Copyright 1995 SIGS Publications Inc.*

Back to the Blitz++ Home Page

Back to Todd's Home Page

Here's a simple example which generates factorials at compile time:

template<int N> class Factorial { public: enum { value = N * Factorial<N-1>::value }; }; class Factorial<1> { public: enum { value = 1 }; };Using this class, the value N! is accessible at compile time as Factorial<N>::value. How does this work? When Factorial<N> is instantiated, the compiler needs Factorial<N-1> in order to assign the enum 'value'. So it instantiates Factorial<N-1>, which in turn requires Factorial<N-2>, requiring Factorial<N-3>, and so on until Factorial<1> is reached, where template specialization is used to end the recursion. The compiler effectively performs a for loop to evaluate N! at compile time.

Although this technique might seem like just a cute C++ trick, it becomes powerful when combined with normal C++ code. In this hybrid approach, source code contains two programs: the normal C++ run-time program, and a template metaprogram which runs at compile time. Template metaprograms can generate useful code when interpreted by the compiler, such as a massively inlined algorithm -- that is, an implementation of an algorithm which works for a specific input size, and has its loops unrolled. This results in large speed increases for many applications.

Here is a simple template metaprogram "recipe" for a bubble sort algorithm.

inline void swap(int& a, int& b) { int temp = a; a = b; b = temp; } void bubbleSort(int* data, int N) { for (int i = N - 1; i > 0; --i) { for (int j = 0; j < i; ++j) { if (data[j] > data[j+1]) swap(data[j], data[j+1]); } } }A specialized version of bubble sort for N=3 might look like this:

inline void bubbleSort3(int* data) { int temp; if (data[0] > data[1]) { temp = data[0]; data[0] = data[1]; data[1] = temp; } if (data[1] > data[2]) { temp = data[1]; data[1] = data[2]; data[2] = temp; } if (data[0] > data[1]) { temp = data[0]; data[0] = data[1]; data[1] = temp; } }In order to generate an inlined bubble sort such as the above, it seems that we'll have to unwind two loops. We can reduce the number of loops to one, by using a recursive version of bubble sort:

void bubbleSort(int* data, int N) { for (int j = 0; j < N - 1; ++j) { if (data[j] > data[j+1]) swap(data[j], data[j+1]); } if (N > 2) bubbleSort(data, N-1); }Now the sort consists of a loop, and a recursive call to itself. This structure is simple to implement using some template classes:

template<int N> class IntBubbleSort { public: static inline void sort(int* data) { IntBubbleSortLoop<N-1,0>::loop(data); IntBubbleSort<N-1>::sort(data); } }; class IntBubbleSort<1> { public: static inline void sort(int* data) { } };To sort an array of N integers, we invoke IntBubbleSort<N>::sort(int* data). This routine calls a function IntBubbleSortLoop<N-1,0>::loop(data), which will replace the for loop in j, then makes a recursive call to itself. A template specialization for N=1 is provided to end the recursive calls. We can manually expand this for N=4 to see the effect:

static inline void IntBubbleSort<4>::sort(int* data) { IntBubbleSortLoop<3,0>::loop(data); IntBubbleSortLoop<2,0>::loop(data); IntBubbleSortLoop<1,0>::loop(data); }The first template argument in the IntBubbleSortLoop classes (3,2,1) is the value of i in the original version of bubbleSort(), so it makes sense to call this argument I. The second template parameter will take the role of j in the loop, so we'll call it J. Now we need to write the IntBubbleSortLoop class. It needs to loop from J=0 to J=I-2, comparing elements data[J] and data[J+1] and swapping them if necessary:

template<int I, int J> class IntBubbleSortLoop { private: enum { go = (J <= I-2) }; public: static inline void loop(int* data) { IntSwap<J,J+1>::compareAndSwap(data); IntBubbleSortLoop<go ? I : 0, go ? (J+1) : 0>::loop(data); } }; class IntBubbleSortLoop<0,0> { public: static inline void loop(int*) { } };Writing a base case for this recursion is a little more difficult, since we have two variables. The solution is to make both template parameters revert to 0 when the base case is reached. This is accomplished by storing the loop flag in an enumerative type (go), and using a conditional expression operator (?:) to force the template parameters to 0 when 'go' is false. The class IntSwap<I,J> will perform the task of swapping data[I] and data[J] if necessary. Once again, we can manually expand IntBubbleSort<4>::sort(data) to see what is being generated:

static inline void IntBubbleSort<4>::sort(int* data) { IntSwap<0,1>::compareAndSwap(data); IntSwap<1,2>::compareAndSwap(data); IntSwap<2,3>::compareAndSwap(data); IntSwap<0,1>::compareAndSwap(data); IntSwap<1,2>::compareAndSwap(data); IntSwap<0,1>::compareAndSwap(data); }The last remaining definition is the IntSwap<I,J>::compareAndSwap() routine:

template<int I, int J> class IntSwap { public: static inline void compareAndSwap(int* data) { if (data[I] > data[J]) swap(data[I], data[J]); } };The swap() routine is the same as before:

inline void swap(int& a, int& b) { int temp = a; a = b; b = temp; }It's easy to see that this results in code equivalent to

static inline void IntBubbleSort<4>::sort(int* data) { if (data[0] > data[1]) swap(data[0], data[1]); if (data[1] > data[2]) swap(data[1], data[2]); if (data[2] > data[3]) swap(data[2], data[3]); if (data[0] > data[1]) swap(data[0], data[1]); if (data[1] > data[2]) swap(data[1], data[2]); if (data[0] > data[1]) swap(data[0], data[1]); }In the next section, the speed of this version is compared to the original version of bubbleSort().

In theory, unwinding loops can result in a performance loss for processors with instruction caches. If an algorithm contains loops, the cache hit rate may be much higher for the non-specialized version the first time it is invoked. However, in practice, most algorithms for which specialization is useful are invoked many times, and hence are loaded into the cache.

The price of specialized code is that it takes longer to compile, and generates larger programs. There are less tangible costs for the developer: the explosion of a simple algorithm into scores of cryptic template classes can cause serious debugging and maintenance problems. This leads to the question: is there a better way of working with template metaprograms? The next section presents a better representation.

IntBubbleSort<N>: IntBubbleSortLoop<N-1,0> IntBubbleSort<N-1> IntBubbleSort<1>: (nil) IntBubbleSortLoop<I,J>: IntSwap<J,J+1> IntBubbleSortLoop<(J+2<=I) ? I : 0, (J+2<=I) ? (J+1) : 0> IntBubbleSortLoop<0,0>: (nil) IntSwap<I,J>: if (data[I] > data[J]) swap(data[I],data[J]);Although this representation is far from elegant, it is more compact and easier to work with than the corresponding class implementations. When designing template metaprograms, it's useful to avoid thinking about details such as class declarations, and just concentrate on the underlying production rules.

C++ version | Template metaprogram version |
---|---|

if (condition) statement1; else statement2; |
// Class declarations template<bool C> class _name { }; class _name<true> { public: static inline void f() { statement1; } // true case }; class _name<false> { public: static inline void f() { statement2; } // false case }; // Replacement for 'if/else' statement: _name<condition>::f(); |

C++ version | Template metaprogram version |
---|---|

int i; switch(i) { case value1: statement1; break; case value2: statement2; break; default: default-statement; break; } |
// Class declarations template<int I> class _name { public: static inline void f() { default-statement; } }; class _name<value1> { public: static inline void f() { statement1; } }; class _name<value2> { public: static inline void f() { statement2; } }; // Replacement for switch(i) statement _name<I>::f(); |

C++ version | Template metaprogram version |
---|---|

int i = N; do { statement; } while (--i > 0); |
// Class declarations template<int I> class _name { private: enum { go = (I-1) != 0 }; public: static inline void f() { statement; _name<go ? (I-1) : 0>::f(); } }; // Specialization provides base case for // recursion class _name<0> { public: static inline void f() { } }; // Equivalent loop code _name<N>::f(); |

int countBits(int N) { int bit3 = (N & 0x08) ? 1 : 0, bit2 = (N & 0x04) ? 1 : 0, bit1 = (N & 0x02) ? 1 : 0, bit0 = (N & 0x01) ? 1 : 0; return bit0+bit1+bit2+bit3; } int i = countBits(13);Writing a template metaprogram version of this function, the argument N is passed as a template parameter, and the four temporary variables (bit0,bit1,bit2,bit3) are replaced by enumerative types:

template<int N> class countBits { enum { bit3 = (N & 0x08) ? 1 : 0, bit2 = (N & 0x04) ? 1 : 0, bit1 = (N & 0x02) ? 1 : 0, bit0 = (N & 0x01) ? 1 : 0 }; public: enum { nbits = bit0+bit1+bit2+bit3 }; }; int i = countBits<13>::nbits;

sin x = x - x^3/3! + x^5/5! - x^7/7! + ...

A C implementation of this series might be

// Calculate sin(x) using j terms float sine(float x, int j) { float val = 1; for (int k = j - 1; k >= 0; --k) val = 1 - x*x/(2*k+2)/(2*k+3)*val; return x * val; }Using our production rule grammar analogy, we can design the corresponding template classes needed to evaluate sin x using J=10 terms of the series. Letting x = 2*Pi*I/N, which permits us to pass x using two integer template parameters (I and N), we can write

Sine<N,I>: x * SineSeries<N,I,10,0> SineSeries<N,I,J,K>: 1-x*x/(2*K+2)/(2*K+3) * SineSeries<N*go, I*go,J*go, (K+1)*go> SineSeries<0,0,0,0>: 1;Where go is the result of (K+1 != J). Here are the corresponding class implementations:

template<int N, int I> class Sine { public: static inline float sin() { return (I*2*M_PI/N) * SineSeries<N,I,10,0>::accumulate(); } }; // Compute J terms in the series expansion. K is the loop variable. template<int N, int I, int J, int K> class SineSeries { public: enum { go = (K+1 != J) }; static inline float accumulate() { return 1-(I*2*M_PI/N)*(I*2*M_PI/N)/(2*K+2)/(2*K+3) * SineSeries<N*go,I*go,J*go,(K+1)*go>::accumulate(); } }; // Specialization to terminate loop class SineSeries<0,0,0,0> { public: static inline float accumulate() { return 1; } };The redeeming quality of that cryptic implementation is that a line of code such as

float f = Sine<32,5>::sin();gets compiled into this single 80486 assembly statement by Borland C++:

mov dword ptr [bp-4],large 03F7A2DDEhThe literal 03F7A2DDEh represents the floating point value 0.83147, which is the sine of 2*Pi*I/N. The sine function is evaluated at compile time, and the result is stored as part of the processor instruction. This is very useful in applications where sine and cosine values are needed, such as computing a Fast Fourier Transform [2]. Rather than retrieving sine and cosine values from lookup tables, they can be evaluated using compile-time functions and stored in the instruction stream. Although the implementation of an inlined FFT server is too complex to describe in detail here, it serves as an illustration that template metaprograms are not limited to simple applications such as sorting. An FFT server implemented with template metaprograms does much of its work at compile time, such as computing sine and cosine values, and determining the reordering needed for the input array. It can use template specialization to squeeze out optimization for special cases, such as replacing

y = a * cos(0) + b * sin(0);with

y = a;The result is a massively inlined FFT server containing no loops or function calls, which runs at roughly three times the speed of an optimized non-specialized routine.

[2] W. Press et al., Numerical Recipes in C, Cambridge University Press, 1992.

Todd Veldhuizen

Dept. of Systems Design Engineering

University of Waterloo, Waterloo, Ontario

Canada. N2L 3G1

Tel: (519) 885-1211 ext. 6087

Fax: (519) 746-3077