Odd RLP results

Hi,

I’m converting a .NETMF 4.2 application for a Cerbuino Bee over to 4.3 and have had some odd results in the execution time of RLP native code. It seems that the native code that I was running in RLPlite is now slower than a managed code version that achieves the same result. I don’t remember the results in 4.2 and I haven’t really got the time to retest it, but I’m certain it was faster in RLP.

It’s a simple function that parses a byte array of ASCII characters that represent a floating point number into an actual floating point number. The output of a test program gives this;

[quote]This segment uses the C# interpreter to get the double.
Parse time in ms: 0.29299999999999998
’61.935’ returned as:61.9350014

This segment uses native C/C++ to get the double.
Parse time in ms: 1.123
’61.935’ returned as:61.9350014[/quote]

Does anyone think the result of the managed code version seems quite fast? Any other thoughts (please be kind about my code!:-[)?

Here’s the native code;

int RLP_charArrayToDouble(void **args)
{
    // retrieve passed args
    float* num = (float*)args[0]; // this is the array to hold the result
    unsigned int length = *((int*)args[1]);
    unsigned char* str = (unsigned char*)args[2];

    // create some new vars
    bool decimalFound = false;
    bool preDec = true;
    float preDecimalMulti = (float)1;
    float postDecimalMulti = 0.1;
    float ret = 0;
    int i = 0;

    for(i=0; i<length; i++)
    {
        if (str[i] == 0x2e)
        {
            decimalFound = true;
        }
        else if (str[i] != 0x2e)
        {
            if (decimalFound)
            {
                ret = ret + ((float)(str[i] - '0') * postDecimalMulti);
                postDecimalMulti = postDecimalMulti / (float)10;
            }
            else if (!decimalFound)
            {
                ret = (ret * preDecimalMulti) + (float)(str[i] - '0');
                if (preDec)
                {
                    preDecimalMulti = (float)10;
                    preDec = false;
                }
            }
        }
    }

    num[0] = ret;
    return 1;
}

And the managed code;

        static float ByteArrayToDouble(byte[] value)
        {
            float myDouble = 0;
            int preDecimalMulti = 1;
            float postDecimalMulti = (float)0.1;
            bool decimalFound = false;
            bool preDec = true;

            for (int i = 0; i < value.Length; i++)
            {
                if (value[i] == '.')
                {
                    decimalFound = true;
                }
                else if (value[i] != '.')
                {
                    if (decimalFound)
                    {
                        myDouble = myDouble + ((int)(value[i] - '0') * postDecimalMulti);
                        postDecimalMulti = postDecimalMulti / 10;
                    }
                    else if (!decimalFound)
                    {
                        myDouble = (myDouble * preDecimalMulti) + (int)(value[i] - '0');
                        if (preDec)
                        {
                            preDecimalMulti = 10;
                            preDec = false; // multiplier only needs setting once
                        }
                    }
                }
            }

            return myDouble;
        }

OK, so I tested this with a G120 and added a method to parse this with .NET functions and got this;

[quote]This segment uses the C# interpreter to get the double.
Parse time in ms: 0.96419999999999995
’61.935’ returned as: 61.9350014

This segment uses native C/C++ to get the double.
Done.

Waiting for debug commands…

Parse time in ms: 4.4663000000000004
’61.935’ returned as: 61.9350014

This segment uses .NETMF functions to get the double.
Parse time in ms: 1.9902
’61.935’ returned as: 61.9350014[/quote]

What are the extra messages from RLP on the G120? Printing to the debug window is not something I want to be doing in RLP as it always seems to slow applications down.

The NETMF functions used are;

char[] chars = UTF8Encoding.UTF8.GetChars(byteArrayOfChars);
string s = new String(chars);
floatReturned = (float)double.Parse(s);

@ wolfbuddy - Can you post the program you use to invoke the RLP function? I would test in 4.2 if you can, to eliminate a potential reason. Make sure you’re optimizing the build too.

Below is the test program I am using.

using System;
using System.Text;
using Microsoft.SPOT;
using Microsoft.SPOT.Hardware;
using GHI.Processor;

namespace RLPTestProgram
{
    public class Program
    {
        public static void Main()
        {
            double TotalTime = 0;
            long TickStart = 0;
            long TickEnd = 0;

            byte[] binfile = Resources.GetBytes(Resources.BinaryResources.G120RLP);
            var elfImage = new RuntimeLoadableProcedures.ElfImage(binfile);
            var RLP_charArrayToDouble = elfImage.FindFunction("RLP_charArrayToDouble");

            byte[] charArrayBytes = new byte[7];
            charArrayBytes = Encoding.UTF8.GetBytes("61.935");
            int[] length = new int[1];
            length[0] = charArrayBytes.Length;
            float[] doubleReturned = new float[1];

            Debug.Print("\nThis segment uses the C# interpreter to get the double.");
            TickStart = DateTime.Now.Ticks;
            doubleReturned[0] = ByteArrayToDouble(charArrayBytes);
            TickEnd = DateTime.Now.Ticks - TickStart;
            TotalTime = ((double)TickEnd / (double)TimeSpan.TicksPerMillisecond); // 1 tick is 1/10 of 1 µs.
            Debug.Print("Parse time in ms: " + TotalTime);
            Debug.Print("'61.935' returned as: " + doubleReturned[0].ToString());

            doubleReturned[0] = 0;
            Debug.Print("\nThis segment uses native C/C++ to get the double.");
            TickStart = DateTime.Now.Ticks;
            RLP_charArrayToDouble.Invoke(doubleReturned, length, charArrayBytes);
            TickEnd = DateTime.Now.Ticks - TickStart;
            TotalTime = ((double)TickEnd / (double)TimeSpan.TicksPerMillisecond); // 1 tick is 1/10 of 1 µs.
            Debug.Print("Parse time in ms: " + TotalTime);
            Debug.Print("'61.935' returned as: " + doubleReturned[0].ToString());

            doubleReturned[0] = 0;
            Debug.Print("\nThis segment uses .NETMF functions to get the double.");
            TickStart = DateTime.Now.Ticks;
            char[] chars = UTF8Encoding.UTF8.GetChars(charArrayBytes);
            string s = new String(chars);
            doubleReturned[0] = (float)double.Parse(s);
            TickEnd = DateTime.Now.Ticks - TickStart;
            TotalTime = ((double)TickEnd / (double)TimeSpan.TicksPerMillisecond); // 1 tick is 1/10 of 1 µs.
            Debug.Print("Parse time in ms: " + TotalTime);
            Debug.Print("'61.935' returned as: " + doubleReturned[0].ToString());

           
        }

        static float ByteArrayToDouble(byte[] value)
        {
            float myDouble = 0;
            int preDecimalMulti = 1;
            float postDecimalMulti = (float)0.1;
            bool decimalFound = false;
            bool preDec = true;

            for (int i = 0; i < value.Length; i++)
            {
                if (value[i] == '.')
                {
                    decimalFound = true;
                }
                else if (value[i] != '.')
                {
                    if (decimalFound)
                    {
                        myDouble = myDouble + ((int)(value[i] - '0') * postDecimalMulti);
                        postDecimalMulti = postDecimalMulti / 10;
                    }
                    else if (!decimalFound)
                    {
                        myDouble = (myDouble * preDecimalMulti) + (int)(value[i] - '0');
                        if (preDec)
                        {
                            preDecimalMulti = 10;
                            preDec = false; // multiplier only needs setting once
                        }
                    }
                }
            }

            return myDouble;
        }
    }
}

What do you mean by optimizing the build? Which build?

I tried to compile the RLP function in EM::Blocks (as per the guide that one of the forum guys wrote) but I always get an exception when my C# app calls;



The only way I can get it to work is to use the batch file method like in the examples linked in the RLP guide that you GHI chaps wrote.

Cheers

@ wolfbuddy - There are two things going on here. RLP was changed between 4.2 and 4.3. While the API was made easier to use, there is a larger managed-to-native marshalling overhead when calling Invoke on 4.3 than there was on 4.2. The first call is quite a bit more expensive than subsequent calls (the first call initializes various data structures used in marshalling the parameters), so you should see a shorter time on your second call. The overhead on calls after the first is constant with regard to the number of parameters. When you perform very little work on the native side, as in your example, the bulk of the execution time is taken up by the overhead. As you perform more work on the native side, this overhead becomes less and less noticeable.
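
One way to act on the point about fixed overhead is to do more work per Invoke, for example by parsing several numbers in a single native call. The sketch below is only an illustration: the function name RLP_parseBatch, its argument order, and the comma-separated input format are all assumptions and are not part of the RLP API or of the code posted in this thread. Only the `int name(void **args)` shape is taken from the native function above.

```c
#include <stddef.h>

/* Parse one unsigned decimal such as "61.935" from str[0..len).
 * No validation: every byte is assumed to be a digit or '.'. */
static float parse_one(const unsigned char *str, size_t len)
{
    float result = 0.0f;
    float scale = 0.1f;      /* weight of the next fractional digit */
    int decimal_found = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (str[i] == '.') {
            decimal_found = 1;
        } else if (decimal_found) {
            result += (float)(str[i] - '0') * scale;
            scale /= 10.0f;
        } else {
            result = result * 10.0f + (float)(str[i] - '0');
        }
    }
    return result;
}

/*
 * Hypothetical batched entry point (name and argument order are assumptions):
 *   args[0] = float array that receives the results
 *   args[1] = pointer to the capacity of that array
 *   args[2] = byte array of comma-separated ASCII numbers
 *   args[3] = pointer to the number of bytes in args[2]
 * Returns the number of values written.
 */
int RLP_parseBatch(void **args)
{
    float *out = (float *)args[0];
    unsigned int capacity = *(unsigned int *)args[1];
    const unsigned char *str = (const unsigned char *)args[2];
    unsigned int length = *(unsigned int *)args[3];
    unsigned int start = 0, count = 0, i;

    /* i == length acts as a virtual terminator for the last field. */
    for (i = 0; i <= length && count < capacity; i++) {
        if (i == length || str[i] == ',') {
            if (i > start)
                out[count++] = parse_one(str + start, i - start);
            start = i + 1;
        }
    }
    return (int)count;
}
```

With this shape the marshalling cost is paid once per batch rather than once per number, so it matters less the more values each call carries.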

If you design your program such that the garbage collector is never invoked, there is another optimization you can make if you call the function multiple times. Create a new function that takes the byte array and saves the pointer. Invoke that once at the start of your program. Change the original function to take no parameters and directly reference the pointer saved by the function you added. Call this to convert the data. Changes you make in the managed array will automatically appear on the native side. Be careful though, if the garbage collector ever runs, you run the risk of crashing or corrupting your board. See the warning at the end of this section: https://www.ghielectronics.com/docs/50/rlp#449. Remember that it’s not just your program that can cause the garbage collector to run, assemblies you reference can too (including Microsoft’s).
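
The native side of that idea might look roughly like the sketch below. The names RLP_setBuffers and RLP_parseSaved are made up for illustration, and saving the result and length arrays alongside the byte array is an assumption added here so that the second function really needs no parameters. The warning above still applies in full: the saved pointers are only valid while the garbage collector never moves the managed arrays.

```c
/* Pointers captured once at start-up. They stay valid ONLY as long as the
 * garbage collector never moves or frees the managed arrays behind them. */
static float *saved_result = 0;
static const unsigned int *saved_length = 0;
static const unsigned char *saved_input = 0;

/* Hypothetical one-time setup (name and argument order are assumptions):
 *   args[0] = float array for the result
 *   args[1] = int array holding the text length
 *   args[2] = byte array holding the ASCII text */
int RLP_setBuffers(void **args)
{
    saved_result = (float *)args[0];
    saved_length = (const unsigned int *)args[1];
    saved_input = (const unsigned char *)args[2];
    return 1;
}

/* Hypothetical parameterless parse: reads the saved buffers directly.
 * Returns 0 if RLP_setBuffers has not been called yet. */
int RLP_parseSaved(void **args)
{
    float ret = 0.0f;
    float scale = 0.1f;   /* weight of the next fractional digit */
    int decimal_found = 0;
    unsigned int i;

    (void)args;           /* unused: nothing is marshalled per call */
    if (!saved_result || !saved_length || !saved_input)
        return 0;

    for (i = 0; i < *saved_length; i++) {
        if (saved_input[i] == '.') {
            decimal_found = 1;
        } else if (decimal_found) {
            ret += (float)(saved_input[i] - '0') * scale;
            scale /= 10.0f;
        } else {
            ret = ret * 10.0f + (float)(saved_input[i] - '0');
        }
    }

    saved_result[0] = ret;
    return 1;
}
```

Because the length is read through a pointer on every call, a change to the managed length array shows up on the native side in the same way as a change to the text itself.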

Thanks for the detailed explanation.

Then it’s not a viable option if it can cause the board to crash and it is out of my control.

@ wolfbuddy - RLP is aimed more at improving performance for processor intensive tasks. Things like calculating hashes or controlling low level device peripherals. It is not optimized for improving the performance of tasks that already take ~1ms in managed code.

That said, we can take a look in future SDKs and see if there are any obvious places where we can reduce the flat overhead in invoking the function. There is a limit though because there is a cost in invoking any native function from managed code that applies to all of NETMF.


@ wolfbuddy - For now, using something like the code below, I was able to get it down to 0.65ms on the G120. It allocates a buffer on the native side that does not get garbage collected. It uses that buffer to read the parameters. It’s up to you to free it when you’re done. In C#, you can write to that buffer using AddressSpace with the address returned from init. The space allocated for the string data to parse is only 11 bytes. You can of course increase that, but for whatever value you pick, make sure you don’t write more than is allocated or you can corrupt things.
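
To make the layout and the size limit concrete, here is a small sketch of the 12-byte block described above: 11 data bytes followed by one length byte. The helper name pack_block and the two macros are hypothetical. In the real program the packing happens in C# before AddressSpace.Write, so this only illustrates the bounds check that whoever fills the buffer has to perform.

```c
#include <stddef.h>
#include <string.h>

#define DATA_CAPACITY 11                    /* bytes reserved for the ASCII text */
#define BLOCK_SIZE    (DATA_CAPACITY + 1)   /* plus one trailing length byte */

/*
 * Hypothetical helper: copy `len` bytes of `text` into a 12-byte block laid
 * out as [0..10] = data, [11] = length. Returns 0 and leaves the block
 * untouched if the text would not fit, 1 on success.
 */
int pack_block(unsigned char block[BLOCK_SIZE], const unsigned char *text, size_t len)
{
    if (len > DATA_CAPACITY)
        return 0;                           /* refuse to overrun the allocation */

    memset(block, 0, BLOCK_SIZE);           /* clear stale digits from earlier calls */
    memcpy(block, text, len);
    block[DATA_CAPACITY] = (unsigned char)len;
    return 1;
}
```

If the allocation is made larger, both the capacity and the offset of the length byte have to change together on the managed and native sides.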

Since you are only using a float (which is 32 bits), we can stick it in the int return value and then reinterpret as a float in C#. If you need to return data larger than what can fit in the 32 bits of an int, you can use the same trick with AddressSpace to return results as well. You’ll just need a call to AddressSpace.Read(parameterAddress, parameters) after parse.Invoke(). Then you can extract whatever data you put in the native buffer on the native side. That adds another 0.25ms to the total time though.
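
The reinterpretation trick can be sketched on its own as below. This is not the code used in the example that follows, which casts the pointer directly. It is an alternative that copies the bits with memcpy, which avoids pointer-aliasing problems on compilers that optimize aggressively. The helper names are made up, and the sketch assumes a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an int32_t without converting the value.
 * Assumes sizeof(float) == sizeof(int32_t). */
int32_t float_to_bits(float value)
{
    int32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Inverse of float_to_bits: rebuild the float from its raw bit pattern. */
float bits_to_float(int32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

The native function would return float_to_bits(result), and the managed side would do the equivalent of bits_to_float, which is what the BitConverter call in the C# code below does.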


using System;
using System.Text;
using System.Threading;
using GHI.Processor;
using Microsoft.SPOT;

public class Program {
    public static void Main() {
        var image = new RuntimeLoadableProcedures.ElfImage(Resources.GetBytes(yourRLPImage));
        var init = image.FindFunction("init");
        var uninit = image.FindFunction("uninit");
        var parse = image.FindFunction("parse");
        var parameterAddress = (uint)init.Invoke();
        var parameters = new byte[11 + 1];
        var toParse = "61.935";
        var parsed = 0.0;
        var resultAsInt = 0;

        if (parameterAddress == 0)
            throw new OutOfMemoryException();

        Encoding.UTF8.GetBytes(toParse, 0, toParse.Length, parameters, 0);
        parameters[11] = (byte)toParse.Length;

        for (int i = 0; i < 10; i++) {
            var s = DateTime.UtcNow.Ticks;

            AddressSpace.Write(parameterAddress, parameters);

            resultAsInt = parse.Invoke();
            parsed = BitConverter.ToSingle(BitConverter.GetBytes(resultAsInt), 0);

            var e = DateTime.UtcNow.Ticks;

            Debug.Print(((e - s) / (double)TimeSpan.TicksPerMillisecond).ToString("N2") + " " + parsed.ToString());

            Thread.Sleep(250);
        }

        uninit.Invoke();

        init.Dispose();
        uninit.Dispose();
        parse.Dispose();
    }
}


#define G120

#include "../RLP.h"

void* ptr;
unsigned char* length;
unsigned char* input;

int init(void** args) {
    ptr = RLP->malloc(11 + 1);    
    
    if (!ptr)
        return 0;
    
    input = (unsigned char*)ptr;   // bytes 0..10 hold the ASCII text
    length = input + 11;           // byte 11 holds the text length
        
    return (int)ptr;
}

int uninit(void** args) {
    RLP->free(ptr);

    return 0;
}

int parse(void** args) {
    int decimalFound = 0;
    int preDec = 1;
    float preDecimalMulti = 1.0;
    float postDecimalMulti = 0.1;
    float result = 0.0;
    unsigned char i;

    for (i = 0; i < *length; i++) {
        if (input[i] == 0x2E) {
            decimalFound = 1;
        }
        else {
            if (decimalFound) {
                result += ((float)(input[i] - '0') * postDecimalMulti);
                
                postDecimalMulti /= 10.0;
            }
            else {
                result = (result * preDecimalMulti) + (float)(input[i] - '0');
                
                if (preDec) {
                    preDecimalMulti = 10.0;
                    preDec = 0;
                }
            }
        }
    }

    return *(int*)(&result);
}


Fascinating! :think: Thanks, I will have a good read to try and understand the techniques you’ve used here.

How do you know for sure that it will not be garbage collected?

@ wolfbuddy - RLP->malloc does not make the garbage collector aware of the allocation. It is a function we provide for you in RLP.

I can’t see this function mentioned in the documentation here;

https://www.ghielectronics.com/docs/50/rlp#449

Is there some documentation on all of the available RLP functions somewhere?

@ wolfbuddy - RLP.h provides the definitions for every function we make available for you. You can find it in https://www.ghielectronics.com/downloads/NETMF/RLP/RLP%20Examples.zip
