Benchmarking of FFT algorithms

Please download to get full document.

View again

of 3
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Document Description
Benchmarking of FFT algorithms
Document Share
Document Tags
Document Transcript
  BENCHMARKING OF FFT ALGORITHMS *  Abstract-  A large number of Fast Fourier Transform(FFT) algorithms have been developed over the years.Among these, the most promising are the Radix-2,Radix-4, Split-Radix, Fast Hartley Transform (FHT),Quick Fourier Transform (QFT), and the Decimation-in-Time-Frequency (DITF) algorithms. In this paper,we present a rigorous analysis of these algorithms thatincludes the number of mathematical operations,computational time, memory requirements, and objectcode size. The results of this work will serve as aframework for creating an object-oriented, poly-functional FFT implementation which willautomatically choose the most efficient algorithm givenuser-specified constraints.  INTRODUCTION  Though development of the Fast Fourier Transform (FFT)algorithms is a fairly mature area, several interestingalgorithms have been introduced in the last ten years thatprovide unprecedented levels of performance. The firstmajor breakthrough was the Cooley-Tukey algorithmdeveloped in the mid-sixties which resulted in a flurry of activity on FFTs [2]. Further research led to thedevelopment of the Fast Hartley Transform, and Split-Radix algorithm. Recently, two new algorithms have alsoemerged: the Quick Fourier Transform and theDecimation-in-Time-Frequency algorithm.While there has been extensive research on the theoreticalefficiency of these algorithms, there has been little researchto-date comparing these algorithms on practical terms.Architectures have become quite complex today withmulti-level caches, super-pipelined processors, long wordinstruction sets, etc. Efficiency, as we will show, isintricately related to how an algorithm can be implementedon a given architecture. The issues to be considered includecomputation speed, memory, algorithm complexity,machine architecture, and the compiler design. In thispaper, we report on preliminary work to find the mostefficient algorithm under application-specific constraints.Our approach is to benchmark each algorithm under a varietyof constraints, and to use the benchmark statistics to create awrapper capable of choosing the algorithm best suited for agiven application. Obviously, object-oriented programmingmethodologies will play a large role.  ALGORITHMS The definition of the Discrete Fourier Transform (DFT) isshown in (1).(1)Most of the algorithms take the divide-and-conquer approachto reduce computations. The Radix-2 and Radix-4approaches decompose the N-point DFT computations intosets of two and four-point DFTs, respectively [2]. The corecomputation in a Radix-4 butterfly involves fewer complexmultiplications than the Radix-2 butterfly, yielding anincrease in efficiency when the order of the transform is apower of 4. To take advantage of this fact, the Split-radixalgorithm makes use of both the Radix-2 and Radix-4decomposition [3].The Hartley Transform, shown in (2), further reducescomputation by replacing the complex exponential term inthe DFT with a kernel using real variables [4].(2)This reduces the number of real multiplications andadditions, with only a modest gain in memory. The maindrawback of the Hartley Transform is the additionalcomputation needed to transform the results from the realHartley coefficients to the standard complex Fouriercoefficients. However, since the relationship between the twoforms of coefficients is linear, the additional cost incurred is  X k  ( )  x n e  j 2 π kn –  N  ------------------- n  0=  N   1– ∑ =  X   H   k  ( )  1  N  --------  x n 2 π kn N  -------------      cos 2 π kn N  -------------      sin+ n  0=  N   1– ∑ =  Michael Balducci, Aravind Ganapathiraju, Jonathan Hamaker, Joseph Picone Department of Electrical and Computer EngineeringMississippi State UniversityMississippi State, Mississippi 39762, USAPh (601) 325-3149 - Fax (601) 325-3149{ganapath, hamaker, picone}  Ajitha Choudary, Anthony Skjellum Department of Computer ScienceMississippi State UniversityMississippi State, Mississippi 39762, USAPh (601) 325-8435 - Fax (601) 325-8997{ajitha, tony} * ThisworkwassupportedbyDARPAthroughUSAirForce’sRomeLab-oratories under contract F30602-96-1-0329.  less than the efficiency gained.The Quick Fourier Transform (QFT) uses the symmetry of the cosine and sine terms to reduce the number of complexcalculations [5]. The QFT breaks a signal into its even andodd components. A Discrete Cosine Transform (DCT) isused on the even samples to calculate the real portions of the Fourier coefficients, while a Discrete Sine Transform(DST) is used on the odd samples to compute theimaginary portions. Both the DCT and DST are in turncomputed recursively. An important aspect of the QFT isthat all complex operations occur at the last stage of recursion, making it well-suited for real data.Finally, the Decimation-In-Time-Frequency (DITF)algorithm leverages the Radix-2 approach in both thetime-domain (DIT) and frequency (DIF). The DITF isbased on the observation that the DIT algorithm has amajority of its complex operations towards the end of thecomputation cycle and the DIF algorithm has a majoritytowards the beginning. The DITF makes use of this fact byperforming the DIT at the outset and then switching to aDIF to complete the transform. Combining thesealgorithms comes at the cost of computing complexconversion factors at the transition stage [6].  EVALUATION METHODOLOGY  Criteria:  Computation speed was selected as the corecriteria for comparison since the fastest method is generallythe most desirable one. However, the amount of memoryavailable is not unlimited; therefore, memory was alsoincluded as a measure of efficiency. The number of mathematical operations is also important since it isdirectly related to the computation time and the hardwarerequirements. The additions and multiplications werebroken into floating-point and integer operations becausefloating point operations are more costly in computationtime than are integer operations (on most hardware). Of course, all of this is mitigated by the degree to which thecompiler can perform optimizations. Modern compilers areable to optimize code for speed and hardware usage bysuch techniques as loop-unrolling, delayed-branching, etc.The level of optimization performed on an algorithm willbe highly algorithm and implementation dependent. Implementation:  Bearing in mind the evaluation criteria,we designed a class structure that is intuitive and allows foreasy inclusion of new algorithms to the existing set.Computation speed is measured using system utilitiesaccurate to . For evaluation of memory usage, wedeveloped floating-point and integer classes that have thefeatures of accumulating a count for every variable that isdeclared and counting the number of mathematical1 msoperations. This gives a very efficient method for viewing thedynamic memory usage. An iterative approach to testing wasalso used to reduce the transients of processor loading. Thismethod involved running each test for a large number of iterations, using median values for comparison.  RESULTS An often used criterion for comparison of FFT algorithmsthus far has been the number of mathematical operationsinvolved. These statistics usually do not, however, accountfor the cost of incrementing integer counters and indices.Integer operations can be a significant portion of thecomputation time. Operation counts for a 1024-pointcomplex FFT is presented in Table 1. Most algorithms tradefloating-point operations for integer operations.A criterion closely related to the number of operations is thecomputation speed. A summary of the computation times forthe selected algorithms is shown in Table 2. One wouldexpect the algorithm with the least number of operations tobe the fastest. However, we show that compiler optimizationsplay a large role by virtue of the large difference between theFHT and all other algorithms. It is interesting to note that thedifference in performance tends to decrease as the orderincreases. As the order increases in the FHT, the cost of converting the Hartley coefficients to Fourier coefficientsseems to becomes substantial, explaining the less dramaticdifference in performance.It is a well-known fact that most FFT algorithms achieve ancomplexity. The constants of proportionality arewhat differentiates the performance of the algorithms. Asshown in Table 2, second-order effects can be dominated bycompiler efficiency. For lower orders the FHT algorithm inits present implementation seems to be making better use of the cache than the other algorithms. This advantage flattensout as the order increases, though. O N N  log [ ] Table 1. Comparison of mathematical operations involved ina 1024-point complex FFT. Algorithm FloatMultsFloatAddsIntMultsIntAddsBinShiftsRAD-2 20480 30720 0 15357 1024RAD-4 15701 28842 336 8877 2738SRFFT 10016 25488 502 12448 2937FHT 18704 32056 0 8367 4246QFT 8448 31492 16 70058 316DITF 16640 28800 1076 18839 1086  Another common design criterion in practical systems ismemory. Algorithm efficiency can always be traded formemory and code size. Table 3 illustrates this fact. CONCLUSIONS We have presented results on a comprehensive collection of FFT algorithms, each of which was programmed in asimilar framework. We have generated statistics tosupplement the mathematical formulation for thecomplexity of these algorithms. Combined, this data givesthe developer a clear picture of the computationalrequirements of each algorithm.At our current level of implementation, the FHT appears tobe the best overall algorithm. It is interesting to note thatthe higher number of mathematical operations does notnecessarily translate into a reduction in speed. The FHTrequires a marginally higher number of floating-pointoperations than the Radix-4, yet the FHT is faster than theRadix-4 by more than a factor of 2. This implies that the FHTmakes more efficient use of the resources available (cache,pipelining, etc.) than does the Radix-4. Also, this makes iteven more important to look at the order of operationsbecause multiplications and additions that can be pipelinedneed much less computation time than algorithms where theflow of operations cannot be easily pipelined. This is anespecially important consideration for the current generationof complex architectures (such as the Pentium Pro chip).We also see that, in general, the number of integer additionsis a good indication of speed. Note that the ranking of algorithms by speed is almost the same as the ranking of thealgorithms by integer additions. The only exception is theQFT that seems to make up for the high number of integeroperations by using very few floating-point multiplications.Our work has laid the foundation for an “intelligent”environment which will automatically choose the bestalgorithm and execute it for a given set of user constraints.As the hardware available is becoming more specialized, itbecomes imperative that the software take full advantage of the hardware capabilities. In the future, our work will beextended to parallel processing environments. Detaileddocumentation of our experiments can be downloaded from  ACKNOWLEDGEMENTS We gratefully acknowledge the suggestions and assistancegiven by Shane Hebert of ICDCRL in the Computer ScienceDepartment at Mississippi State University.  REFERENCES [1] J.W. Cooley and J.W. Tukey, “An Algorithm forMachine Computation of Complex Fourier Series,”  Math. Comp. , vol. 19, pp. 297-301, April 1965.[2] C.S.Burrus and T.W.Parks,  DFT/FFT and Convolution Algorithms: Theory and Implementation , John Wileyand Sons, New York, NY, USA, 1985.[3] P. Duhamel and H. Hollomann, “Split Radix FFTAlgorithm,”  Electronic Letters , vol. 20, pp. 14-16, Jan.1984.[4] R. Bracewell,  The Hartley Transform , Oxford Press,Oxford, England, 1985.[5] H. Guo, G.A. Sitton, and C.S. Burrus, “The Quick Discrete Fourier Transform,”  Proc. of ICASSP,  vol. III,pp. 445-447, Adelaide, Australia, April 1994.[6] A. Saidi, “Decimation-In-Time-Frequency FFTAlgorithm.”  Proceedings of ICASSP,  vol. III, pp. 453-456, Adelaide, Australia, April 1994.Table 2. Comparison of computation speed [in ] forvarying FFT orders on a 200 MHz processor UltraSparcmachine compiled using gcc with optimization level 3. Thereported speeds are median values over 1000 iterations. FFT OrderAlgorithm 64 256 1024 4096 16384RAD-2 238 952 3952 17857 77714RAD-4 191 762 3476 14714 60333SRFFT 238 905 3810 17429 74524FHT 48 286 1333 7143 31905QFT 143 762 3476 15952 78000DITF 238 1000 4191 18714 80333 µ secTable 3. Comparison of memory usage [in bytes] for a1024-point complex FFT on a 200 MHz UltraSparccompiled using gcc with optimization level 3. Note thatpeak memory requirements are shown. Algorithm Memory Object SizeRAD-2 24616 1740RAD-4 8344 2020SRFFT 24656 2232FHT 32832 4812QFT 49152 7520DITF 24616 3060
Search Related
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks