FezSpider Perfomance issue

biren · November 8, 2012, 12:21am

Hi all,


 void FFT1D(int dir,int m,double *x, double *y)
        {
   

            long nn, i, i1, j, k, i2, l, l1, l2;
            double c1, c2, tx, ty, t1, t2, u1, u2, z;
            /* Calculate the number of points */
            nn = 1;
            for (i = 0; i < m; i++)
                nn *= 2;
            /* Do the bit reversal */
            i2 = nn >> 1;
            j = 0;
            for (i = 0; i < nn - 1; i++)
            {
                if (i < j)
                {
                    tx = x[i];
                    ty = y[i];
                    x[i] = x[j];
                    y[i] = y[j];
                    x[j] = tx;
                    y[j] = ty;
                }
                k = i2;
                while (k <= j)
                {
                    j -= k;
                    k >>= 1;
                }
                j += k;
            }
            /* Compute the FFT */
            c1 = -1.0;
            c2 = 0.0;
            l2 = 1;
            for (l = 0; l < m; l++)
            {
                l1 = l2;
                l2 <<= 1;
                u1 = 1.0;
                u2 = 0.0;
                for (j = 0; j < l1; j++)
                {
                    for (i = j; i < nn; i += l2)
                    {
                        i1 = i + l1;
                        t1 = u1 * x[i1] - u2 * y[i1];
                        t2 = u1 * y[i1] + u2 * x[i1];
                        x[i1] = x[i] - t1;
                        y[i1] = y[i] - t2;
                        x[i] += t1;
                        y[i] += t2;
                    }
                    z = u1 * c1 - u2 * c2;
                    u2 = u1 * c2 + u2 * c1;
                    u1 = z;
                }
                c2 = sqrt((1.0 - c1) / 2.0);
                if (dir == 1)
                    c2 = -c2;
                c1 = sqrt((1.0 + c1) / 2.0);
            }
            /* Scaling for forward transform */
            if (dir == 1)
            {
                for (i = 0; i < nn; i++)
                {
                    x[i] /= (double)nn;
                    y[i] /= (double)nn;

                }
            }

        }

              This is a function of FFT transform in RLP in FezSpider.

I find that it takes 15sec to execute. I want to improve its performance.
If I use FezHydra then how much performance it will improve?
If you have another option then please share.

Thanks

Duke_Nukem · November 8, 2012, 1:16am

Could you give me a suitable set of input parameters and I’ll try it on a number of different mainboards.

biren · November 8, 2012, 1:27am

@ Duke Nukem -
in this function
nx=160
ny=160
dir=1


 void FFT2D(COMPLEX *c, int nx, int ny, int dir)
      {
           COMPLEX *Output;
           Output = RLPext->malloc(25600 * sizeof(COMPLEX));
		
           //// Transform the Rows 
               index = 0;
		    for (i=0;i<=nx -1;i++)
			{
                for (j = 0; j <= ny - 1; j++)
                {
                    Output[index].real = c[index].real;
                    Output[index].imag = c[index].imag;
                    index++;
				}
			}

            for (j = 0; j < ny; j++)
            {
                for (i = 0; i < nx; i++)
                {
                    real[i] = c[j*ny+i].real;
                    imag[i] = c[j * ny + i].imag;
                }
                // Calling 1D FFT Function for Rows
				
                m = 7;//(int)Log(nx,2);//Finding power of 2 for current number of points e.g. for nx=512 m=9
				
               FFT1D(dir, m,real,imag);
				

				
                for (i = 0; i < nx; i++)
                {

                    Output[j * ny + i].real = real[i];
                    Output[j * ny + i].imag = imag[i];
                    
                }
               
            }

            // Transform the columns  
            //*real = new double[ny];
            //imag = new double[ny];*/

            for (i = 0; i < nx; i++)
            {
                for (j = 0; j < ny; j++)
                {

                    real[j] = Output[j * ny + i].real;
                    imag[j] = Output[j * ny + i].imag;
                    
                }
               
                // Calling 1D FFT Function for Columns
				m = 7;//(int)Log((double)ny, 2);//Finding power of 2 for current number of points e.g. for nx=512 m=9

                FFT1D(dir, m,real,imag);

                for (j = 0; j < ny; j++)
                {
                    Output[j * ny + i].real = real[j];
                    Output[j * ny + i].imag = imag[j];
                    
                }
               
            }

			//// transfer output values in to fourior///////////////
			index=0;
       for (i=0;i<=nx -1;i++)
	{
                for (j = 0; j <= ny - 1; j++)
                {
                    Fourier[index].real = Output[index].real;
                    Fourier[index].imag = Output[index].imag;
                    index++;
		}
	}
            
			
			
		 
		   //if(Output)
				RLPext->free(Output);
        
        }

this image are given to this fft2d function .but fft1d takes 15 seconds to execute . how to improve performance

leforban · November 8, 2012, 3:43am

There’s different way of computing an FFT according to its size, the radix 2 is not necessarly the best one (see the FFTW project or the spiral project from CMU).

Moreover try to extract the nx-1, ny-1 computations from the loop.

biren · November 8, 2012, 3:53am

@ leforban - fez hydra is give better speed instead of fez spider? if i use fez hydra code excution speed will improve ? plz give me suggestion

leforban · November 8, 2012, 4:00am

Even if Hydra is Faster than Spider in term of MHz, the only thing that will result in a real speed up is a perfect match between the architecture of the processor and the algorithm. In this field we often talk about architecture, algorithm adequation ( A-cube paradigm). Therefore you need to hand tune your FFT algorithm in order to help the compiler to extract butterfly pattern that will be executed in an efficient way (usually by a MAC operator).

Duke_Nukem · November 8, 2012, 4:01am

There is a fair bit of undefines etc in FFT2D can we just go with FFT1D and a typical input set for that?

leforban · November 8, 2012, 4:17am

As an example if size of your samples is known before computation, there’s no need to compute:

for (i = 0; i < m; i++)
nn *= 2;

if m is a constant in your application nn can be known before compilation.

Gus_Issa · November 8, 2012, 4:25am

Hydra is about 5 times faster than spider but hydra has floating point processor so it will be a lot faster. I am not sure by how much but maybe 30 times faster. I would love to know if anyone has tried this.

Pablo · November 8, 2012, 7:27am

If you are using it on hydra be sure to use a compiler with float hardware support enabled…

Duke_Nukem · November 8, 2012, 9:12am

Well I might have to post this in Code Share as I put together a PI Calculator and I’m using that for performance comparisons. I’m just waiting on the Spider to calculate 200 places of PI, but it is the slowest of my mainboards (Spider, Hydra, Cerberus, Mountaineer USB, ArgonR1), but the Spider is still my go to board for prototyping as it can do anything. The Mountaineer board is the most surprising as its quick and the Cerberus gives huge bang for the buck.

Any mainboards that I’m mssing?

Justin · November 8, 2012, 9:15am

C64?

WouterH · November 8, 2012, 10:04am

The hydra will be plenty of times faster. You are using floating point (float/double), and the EMX (FEZ Spider) doesn’t have a floating-point-unit.
So all floating point operations result in a huge blob of assembly code. The Hydra has a floating-point-unit.

When I was developing the 3D engine on the hydra I was surprised that it does more then a million floating point operations each second, don’t expect this on EMX.

Duke_Nukem · November 8, 2012, 10:45am

The results of my testing:

Compiled with Debug (not run in debug however)

Calculate 20 digits of PI

Mountaineer USB - 1.8 sec
FEZ Hydra - 3.5 sec
FEZ Cerberus - 4.5 sec
ArgonR1 - 5.5 sec
FEZ Spider - 13.6 sec

Calculate 200 digits of PI

Mountaineer USB - 03:28
FEZ Hydra - 06:22
FEZ Cerberus - 08:29
ArgonR1 - 10:29
FEZ Spider - 26:00

Compiled with Release

Calculate 20 digits of PI

Mountaineer USB - 1.4 sec
FEZ Hydra - 2.6 sec
FEZ Cerberus - 3.4 sec
ArgonR1 - 4.6 sec
FEZ Spider - 10.5 sec

Calculate 200 digits of PI

Mountaineer USB - 02:36
FEZ Hydra - 04:43
FEZ Cerberus - 06:32
ArgonR1 - 08:52
FEZ Spider - 20:13

Code Share Project http://www.tinyclr.com/codeshare/entry/597

Architect · November 8, 2012, 11:25am

Good job and interesting results.

Do you have Cerbuino to test?

WouterH · November 8, 2012, 12:22pm

@ Duke Nukem - problem is that you’re not measuring floating point abilities as most of the code uses integer math.

Duke_Nukem · November 8, 2012, 4:35pm

@ WouterH you are correct and hence why I had to post the code for this as you can’t do a performance test and not post the code so people can see exactly what your testing. Perhaps someone has a Whetstone or other FPU tests they can test and post as I’ll run it against my collection of mainboards.

biren · November 19, 2012, 5:19am

Fez Spider does not have support FPU. so we are thinking about to use fixed point library like “Libfixmath”. fez hydra does not support camera so we can not use fez hydra. if we use Fixed Point library how much performance incraese? Any suggestion Please tell me.

Brett · November 19, 2012, 5:55am

Biren, please DO NOT DOUBLE POST. Yes, yelling.

biren · November 19, 2012, 11:05pm

@ Gus - if we use Fixed Point library like “libfixmath” how much performance incraese in Fez Spider? Any suggestion Please tell me.