EMX execution speed question

Hi,

i have encountered some issues with execution speed, after reading multiple posts i set to write a small benchmark (below) and tests shows that my EMX module does a single empty loop with 1000 iterations at 55~60msec, is that reasonable? am I testing this the wrong way? it sounds very slow to me

Any advise is appreciated

Liran

using System;
using System.Diagnostics;
using Microsoft.SPOT;
using Microsoft.SPOT.Hardware;

namespace MFConsoleApplication8
{
    public class Program
    {
        public static void Main()
        {
            int i = 0;

            var ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            var ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Debug.Print("Empty= " + (ts1.Subtract(ts).Ticks.ToString()));

            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            i++;
            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Debug.Print("i++=" + (ts1.Subtract(ts).Ticks.ToString()));
            Debug.Print("Ticks Per MSec=" + TimeSpan.TicksPerMillisecond.ToString());
        }

        /* Results on my EMX
        Empty= 2582
        i++=3035
        Ticks Per MSec=10000
        */
    }
}

@ liran.shahar - first problem: the debug mode is quite slower than regular release speed.

@ VB-Daniel:
I agree but I am not sure this is relevant to the code I showed since I timed only specific commands, my application (as many other) use loops all the time and if this simple action takes a long time then…

Debug.Print() commands are very costly, try removing them.

@ Simon: the Debug.Print is only here to show my timing, obviously I don’t have them inside the working code, other than that please refer to the code example I sent, it is very clear and demonstrate my point.
You could run the sample on your device and see the results for yourself, are they logical? that slow? in the emulator it runs about x10 faster (in case the .NETMF is the cause)

@ liran.shahar - Actually, I do not quite understand what are you measuring and what your code does.

Anyway, EMX is pretty much the slowest thingy in the whole still-supported NETMF fleet. It’s an old module based on old ARM, so if you are creating something new, better go with G400, as it costs about the same amount and is ~20 times faster — when works, that is :slight_smile:

Also, plain NETMF is not suitable for intensive calculations. For that, you should use RLP (speed increased ~40 times), or do not use NETMF at all.

@ Simon: my code does two measurements

  1. empty measurement between two getmachinetime just to get the basic time diff
  2. a measurement of a simple “i++” increament

it then prints the ticks difference and also the ticks per milisecond value, from it you can see that a simple command takes a lot of time, try to do a empty loop between 1…1000 and time it with ticks, convert to MSEC and you’ll see that EMX takes about 57 msec to complete such loop which to me makes no sense, EMX is supposed to be 70Mhz device, even with the heavy and non-optimized .NET M.F it shouldn’t respond this way

can you time your device (EMX or G400 whatever you have) and let us know what it your timing results?

Liran

I still don’t understand your definition of an “empty loop”. It’s not empty at all, and actually there’s no loop in your code. If you want, I can compare EMX and G400, but then post exact program you want to try. That code you posted already contains Debug.Prints, so there’s not much point in benchmarking it.

Hi,
I dont think that you can calculate that NETMF needs for 1000 i++ 1000 times the time that it needs for one i++. I do not know much about this, but many commands e.g. array.copy are performed in native code. So I think that you really have to measure the time that it needs for the loop

for (int i = 0; i < 1000; i++)
{
}
to get a correct result.

Edit:
Btw: I measured the time for 100 x i++ and calculated on 1000 x. I got the result of 13.8 milliseconds.

This seems like the old “I’m running on bare metal” vs “I’m running on an interpreted managed framework” discussion.

netmf is not real time. Stuff happens behind the scenes, you don’t see it, you don’t get to (overly) control it, you have to accept it.

As a simple change to this “test” code, you should drop all debug.prints until the end. Use multiple variables so you can capture the information along the way; as it stands today the string creation can cause GC to kick in part way between all your tests which can influence the timing you see - but again, I think this isn’t a valid test.

As the others have said, EMX isn’t the newest processor around, but still even with a faster proc with more memory etc etc, you’re not running a compiled app direct on the processor. And if you want to get onto the processor, RLP is there to let you.

@ liran.shahar -
Hi,
I made some tests with a little modification of your approach:
The tests were made on a Spider (EMX), Cerbuino Bee, Cobra-2 and Raptor:
I hope that I didnt miss anything or made wrong calculations. Please tell me if you see any faults in my approach.
My code was this:


using System;
using System.Collections;
using System.Threading;
using Microsoft.SPOT;
using Microsoft.SPOT.Presentation;
using Microsoft.SPOT.Presentation.Controls;
using Microsoft.SPOT.Presentation.Media;
using Microsoft.SPOT.Presentation.Shapes;
using Microsoft.SPOT.Touch;
using Gadgeteer.Networking;
using GT = Gadgeteer;
using GTM = Gadgeteer.Modules;
using Gadgeteer.Modules.GHIElectronics;

namespace SpiderLoopSpeed_Test
{

    public partial class Program
    {
        // This method is run when the mainboard is powered up or reset.   
        void ProgramStarted()
        {

            Debug.Print("Program Started");
            button.ButtonPressed += button_ButtonPressed;
        }

        void button_ButtonPressed(Button sender, Button.ButtonState state)
        {
            int i = 0;
            long Empty = 0;
            long Load = 0;
            byte[] buffer1 = new byte[1000];
            byte[] buffer2 = new byte[1000];
            byte[] buffer3 = new byte[250];

            TimeSpan ts;
            TimeSpan ts1;
            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();

            Empty = ts1.Subtract(ts).Ticks;
            Debug.Print("Empty= " + Empty.ToString());
            Debug.Print("Ticks Per MSec=" + TimeSpan.TicksPerMillisecond.ToString());
            Debug.Print("\r\n");

            //****************************************************
            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            i++;
            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Load = ts1.Subtract(ts).Ticks;

            Debug.Print("Duration of one i++: " + (Load - Empty).ToString() + " ticks or " + (Load - Empty) / (TimeSpan.TicksPerMillisecond / 1000) + " microseconds");
            Debug.Print("Time before operation: Seconds: " + ts.Seconds.ToString() + "   Milliseconds: " + ts.Milliseconds.ToString() + "   Ticks: " + ts.Ticks.ToString());
            Debug.Print("Time after operation: Seconds : " + ts1.Seconds.ToString() + "   Milliseconds: " + ts1.Milliseconds.ToString() + "   Ticks: " + ts1.Ticks.ToString() + "\r\n");
            //****************************************************

            //****************************************************
            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            i++;
            i++;
            i++;
            i++;
            i++;
            i++;
            i++;
            i++;
            i++;
            i++;

            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Load = ts1.Subtract(ts).Ticks;
            Debug.Print("Duration of ten i++: " + (Load - Empty).ToString() + " ticks or " + (Load - Empty) / (TimeSpan.TicksPerMillisecond / 1000) + " microseconds");
            Debug.Print("Time before operation: Seconds: " + ts.Seconds.ToString() + "   Milliseconds: " + ts.Milliseconds.ToString() + "   Ticks: " + ts.Ticks.ToString());
            Debug.Print("Time after operation: Seconds : " + ts1.Seconds.ToString() + "   Milliseconds: " + ts1.Milliseconds.ToString() + "   Ticks: " + ts1.Ticks.ToString() + "\r\n");
            //****************************************************


            //****************************************************
            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();

            for (int i2 = 0; i2 < 250; i2++) { }

            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Load = ts1.Subtract(ts).Ticks;
            Debug.Print("Duration of for Loop with 250 runs: " + (Load - Empty).ToString() + " ticks or " + (Load - Empty) / (TimeSpan.TicksPerMillisecond / 1000) + " microseconds");
            Debug.Print("Time before operation: Seconds: " + ts.Seconds.ToString() + "   Milliseconds: " + ts.Milliseconds.ToString() + "   Ticks: " + ts.Ticks.ToString());
            Debug.Print("Time after operation: Seconds : " + ts1.Seconds.ToString() + "   Milliseconds: " + ts1.Milliseconds.ToString() + "   Ticks: " + ts1.Ticks.ToString() + "\r\n");
            //****************************************************

            //****************************************************
            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();

            foreach (byte myByte in buffer3) { }

            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Load = ts1.Subtract(ts).Ticks;
            Debug.Print("Duration of foreach Loop with 250 runs: " + (Load - Empty).ToString() + " ticks or " + (Load - Empty) / (TimeSpan.TicksPerMillisecond / 1000) + " microseconds");
            Debug.Print("Time before operation: Seconds: " + ts.Seconds.ToString() + "   Milliseconds: " + ts.Milliseconds.ToString() + "   Ticks: " + ts.Ticks.ToString());
            Debug.Print("Time after operation: Seconds : " + ts1.Seconds.ToString() + "   Milliseconds: " + ts1.Milliseconds.ToString() + "   Ticks: " + ts1.Ticks.ToString() + "\r\n");
            //****************************************************

            //****************************************************
            ts = Microsoft.SPOT.Hardware.Utility.GetMachineTime();

            Array.Copy(buffer1, 0, buffer2, 0, 1000);

            ts1 = Microsoft.SPOT.Hardware.Utility.GetMachineTime();
            Load = ts1.Subtract(ts).Ticks;
            Debug.Print("Duration copying a byte array with 1000 elements : " + (Load - Empty).ToString() + " ticks or " + (Load - Empty) / (TimeSpan.TicksPerMillisecond / 1000) + " microseconds");
            Debug.Print("Time before operation: Seconds: " + ts.Seconds.ToString() + "   Milliseconds: " + ts.Milliseconds.ToString() + "   Ticks: " + ts.Ticks.ToString());
            Debug.Print("Time after operation: Seconds : " + ts1.Seconds.ToString() + "   Milliseconds: " + ts1.Milliseconds.ToString() + "   Ticks: " + ts1.Ticks.ToString() + "\r\n");
            //*****************************************************
        }
    }
}
 

The surprising, hardly believable results can be seen in the pictures:
The first surprising thing was, that there were slight differences between the measurements from run to run, so if you try the code, you can get slightly different results:

I++ Operation:
The fastet machine was the Cerbuino, Raptor was equal to Cobra 2 and Spider was faster than Raptor and Cobra-2

10 x i++ one behind another:
Cerbuino was the fastes machine again. Raptor was on second place, needed for ten operation the same time as for one! Cobra-2 needed for ten operation only 3.5 times the time it needed for one operation.

For(int i = 0; i < 250; i++) and foreach(byte b in buffer) operations:
Raptor is the fastes machine, Cerbuino is faster than Cobra-2. Cobra-2 is twice as fast as Spider.

Array.Copy for bytearray with 1000 elements:
The operation is very fast compared with normal loops for all the devices. The collumns can hardly be seen in the picture.
Surprising: Cerbuino is faster than Raptor.

2 Likes

Cerb is fast for small operation probably because he’s got everything on SRAM. All the other participants have to use external DRAM, introducing some lag.

But these are synthetic test :slight_smile: In my real world operation, Raptor is wayyyyy much faster, than Spider :slight_smile:

Ah… yes, this may be a good explanation.

I think you are also right that the results of these tests cannot be extrapolated to the performance of a real application. Already the measurement oft he „empty value may not be appropriate as the processors my use prefetch mechanisms (or other things beyond my understanding ) that can make execution times vary depending on the commands executed before.
To achieve reasonably accurate results it seems essential to measure a sequence of multiple times the same command and from this calculate the time for one command.
But then it has to be considered that the first execution of one command can last much longer than the following executions of the same command.

For me the most important finding of these tests was: The commands of the Array Class are very fast .