Cerberus stops working on <procedure>.Invoke

Hi,

I’m trying to use RLPLite on a FEZ Cerberus, but when the procedure is invoked, the Cerberus just stops (or seems to stop) processing. I’ve spent the last 8 hours on this without finding the problem.

I double-checked the location of the procedure in the .map file and also checked that the data in the binary file at that location equals the binary representation of the C function.

The native function (I simplified it to do nothing but return 0):


int RLP_Init(unsigned int *par0, int *par1, unsigned char *par2)
{
     return 0;
}

Extract from lst file:


  11:src/main.c    **** int RLP_Init(unsigned int *par0, int *par1, unsigned char *par2)
  12:src/main.c    **** {
  73              		.loc 1 12 0
  74              		.cfi_startproc
  75              		@ args = 0, pretend = 0, frame = 0
  76              		@ frame_needed = 0, uses_anonymous_args = 0
  77              		@ link register save eliminated.
  78              	.LVL0:
  13:src/main.c    **** 	return 0;
  14:src/main.c    **** }
  79              		.loc 1 14 0
  80 0000 0020     		movs	r0, #0	@ ,
  81              	.LVL1:
  82 0002 7047     		bx	lr	@ 
  83              		.cfi_endproc

Extract from .map file:


.text           0x2001a000      0x22c
 *(.text)
 .text          0x2001a000      0x104 ./src/RLPHeap.o
                0x2001a000                SimpleHeap_Initialize
                0x2001a01c                SimpleHeap_IsAllocated
                0x2001a030                SimpleHeap_Release
                0x2001a088                SimpleHeap_Allocate
                0x2001a0e8                SimpleHeap_ReAllocate
 .text          0x2001a104      0x124 ./src/main.o
                0x2001a104                RLP_Init  <<<< This is the called function
                0x2001a108                setQuaternion
                0x2001a11e                getQuaternion
                0x2001a134                quaternionMultiply
                0x2001a20c                allocateMemory
                0x2001a21c                freeMemory

The call in .NET: (I checked in the debugger that the locations 0x104-0x107 of binFile contain 0x00 20 40 47 as seen in the .lst file)


byte[] binFile = Resource.GetBytes(Resource.BinaryResources.main);
AddressSpace.Write(0x2001A000, binFile, 0, binFile.Length);

rlpInit = new RLPLite.Procedure(0x2001a104);

rlpInit.Invoke(new uint[0], new int[0], new byte[0]); // Executing this line stops Cerberus

This is the gcc call I’m using to build the native function:


arm-none-eabi-gcc -c -Os -g0 -mlittle-endian -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -D__FPU_USED=1  -Wall -I. -IC:\STM32F4xx_DSP_StdPeriph_Lib_V1.0.1\Libraries\CMSIS\Include -mapcs-frame -fno-builtin -gdwarf-2 -fverbose-asm -Wa,-ahlms=src/main.lst  src/main.c -o src/main.o

I suspect I may be building the native code for the wrong -mcpu or something similar, but I haven’t been able to pin it down, and I’m running out of ideas for where to look next. I hope someone has an idea what the problem could be, or where and how I could investigate it further. Is there any way to debug into the native function to see what’s going on?

This project is supposed to become an RC airplane autopilot. I’m not sure whether the final version will still be running on .NET MF, but it’s a great platform for prototyping. Currently I’m working on a 9DOF IMU based on an InvenSense MPU6050 (gyro + accelerometer) and a Honeywell HMC5883L (compass). The native C code is then supposed to run a Kalman filter, so I’m really interested in seeing how it performs on the Cerberus.

Best regards,
Markus

@ andre.marschalek - The RLPLite signature differs from the full RLP provided with the premium libraries.

I just tested your change, but with no change in the behaviour.

The suggested function signature, as well as the link to the wiki, seems to be part of the RLP library from the premium framework, but the Cerberus is running the open-source version, which only supports RLPLite.

This change also does not solve the issue; the call to the procedure still halts the system.

The thread right below about the Cerbuino Bee mentions an R/W area starting at 0x2009A000 instead of 0x2001A000. This address is not mentioned in the wiki entry for the Cerb family. Is there something I need to take care of that I haven’t done?

I just changed my linker settings to have two memory areas, RX at 0x2001A000 with size 0x3000 and RW at 0x2009A000 with size 0x3000, but with no effect. Could this still be related to my problem?

So I went on simplifying everything in order to narrow down the issue, but so far with no change.

I’m just posting everything I have right now:

C#:


public partial class Program
{
    void ProgramStarted()
    {
        Debug.Print("Program Started");

        button.ButtonPressed += new Button.ButtonEventHandler(button_ButtonPressed);
    }

    void button_ButtonPressed(Button sender, Button.ButtonState state)
    {
        byte[] binFile = Resources.GetBytes(Resources.BinaryResources.main);
        AddressSpace.Write(0x2001A000, binFile, 0, binFile.Length);

        RLPLite.Procedure proc = new RLPLite.Procedure(0x2001A000);
        proc.Invoke(new uint[1], new int[1], new byte[1]);
    }
}

C:


int RLP_Init(void *par0, int *par1, unsigned char *par2)
{
	return 0;
}

Map:



Memory Configuration

Name             Origin             Length             Attributes
RAM1             0x2001a000         0x00003000         xrw
*default*        0x00000000         0xffffffff

Linker script and memory map

LOAD ./src/main.o
LOAD c:/yagarto-20121222/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/lib\libc.a
LOAD c:/yagarto-20121222/bin/../lib/gcc/arm-none-eabi/4.7.2\libgcc.a
START GROUP
LOAD c:/yagarto-20121222/bin/../lib/gcc/arm-none-eabi/4.7.2\libgcc.a
LOAD c:/yagarto-20121222/bin/../lib/gcc/arm-none-eabi/4.7.2/../../../../arm-none-eabi/lib\libc.a
END GROUP
                0x00000000                . = ALIGN (0x4)

.text           0x2001a000       0x1c
 *(.text)
 .text          0x2001a000       0x1c ./src/main.o
                0x2001a000                RLP_Init

.glue_7         0x2001a01c        0x0
 .glue_7        0x00000000        0x0 linker stubs

.glue_7t        0x2001a01c        0x0
 .glue_7t       0x00000000        0x0 linker stubs

.vfp11_veneer   0x2001a01c        0x0
 .vfp11_veneer  0x00000000        0x0 linker stubs

.v4_bx          0x2001a01c        0x0
 .v4_bx         0x00000000        0x0 linker stubs
                0x2001a01c                . = ALIGN (0x4)

.rodata
 *(.rodata)
                0x2001a01c                . = ALIGN (0x4)

.data           0x2001a01c        0x0
 *(.data)
 .data          0x2001a01c        0x0 ./src/main.o
                0x2001a01c                . = ALIGN (0x4)

.bss            0x2001a01c        0x0
                0x2001a01c                __bss_start__ = .
 *(.bss)
 .bss           0x2001a01c        0x0 ./src/main.o
                0x2001a01c                __bss_end__ = .
                0x2001a01c                end = .
OUTPUT(main.elf elf32-littlearm)

.comment        0x00000000       0x11
 .comment       0x00000000       0x11 ./src/main.o
                                 0x12 (size before relaxing)

.ARM.attributes
                0x00000000       0x33
 .ARM.attributes
                0x00000000       0x33 ./src/main.o

Make all:


arm-none-eabi-gcc -c -O0 -g0 -mlittle-endian -mthumb -mcpu=cortex-m4   -Wall
      -I. -IC:\STM32F4xx_DSP_StdPeriph_Lib_V1.0.1\Libraries\CMSIS\Include -mapcs-frame
      -fno-builtin  -fverbose-asm -Wa,-ahlms=src/main.lst  src/main.c -o src/main.o

arm-none-eabi-gcc  ./src/main.o -nostartfiles -Wl,--Map -Wl,./main.map -lc -lgcc
       -Wl,--omagic -T cerberus.ld  -o main.elf

arm-none-eabi-objcopy -O ihex main.elf main.hex
arm-none-eabi-objcopy -O binary main.elf main.bin

This seems to be as simple as it can get. The binary file now has a length of 28 bytes, the first 25 bytes corresponding to the binary representation of the instructions in the .lst file.
I’ve also verified with an AddressSpace.Read() after the AddressSpace.Write() that the binary was written to the device.

I’d better also add the device capabilities from MFDeploy:


ClrInfo.targetFrameworkVersion:         4.2.0.0
SolutionReleaseInfo.solutionVersion:    4.2.3.3
SolutionReleaseInfo.solutionVendorInfo: Copyright (C) GHI Electronics, LLC
SoftwareVersion.BuildDate:              Nov 19 2012
SoftwareVersion.CompilerVersion:        410462
FloatingPoint:                          True
SourceLevelDebugging:                   True
ThreadCreateEx:                         True
LCD.Width:                              0
LCD.Height:                             0
LCD.BitsPerPixel:                       0
AppDomains:                             True
ExceptionFilters:                       True
IncrementalDeployment:                  True
SoftReboot:                             True
Profiling:                              False
ProfilingAllocations:                   False
ProfilingCalls:                         False
IsUnknown:                              False

Are there any must-have includes or linked libraries that I’m currently missing?
The -mcpu=cortex-m4 setting should also be correct for the STM32F405, right?

Is it okay that the latest GHI libraries have slightly different version numbers?
GHI.OSHW.Hardware = 4.2.3.1
GHI.OSHW.Native = 4.2.3.0

Right now, I’m completely out of ideas :frowning:

Addition:

I changed the native code to switch off the debug LED for a short time, in order to see whether the problem occurs before or after the native code is called.
I hope I’ve done it the right way, as I can’t test my code.


void Delay(__IO uint32_t nCount) {
  while(nCount--) {
  }
}

int RLP_Init(void *par0, int *par1, unsigned char *par2)
{
	GPIOC->BSRRH = (1 << 4);
	Delay(1000000L);
	GPIOC->BSRRL = (1 << 4);
	return 0;
}

I activated the DebugLed in .NET before calling the RLPLite function, so that I don’t have to deal with setting the pin to output etc.

The LED never changes, so I assume the CPU halts before executing my native code.

This problem completely blocks me from continuing to work on my project, as the managed code is way too slow to run the Kalman filter at up to 200 Hz. :frowning:

I changed your suggestion a bit, as I just came across the hardware encoder interface in Codeshare, which comes with compiled native binaries.

I had been searching for complete example projects before, but I only searched for Cerberus and not for Cerbuino Bee projects.

So I just tried to run it and it works fine.

This leaves us with

  1. the native code and the calls from managed code
  2. compilation

Next I’m gonna try to compile the natives for the hardware encoder myself. If that does not work, I’m gonna see whether I can spot the differences between the binaries.

EDIT:

Okay, I just installed Keil uVision, as there are projects for Cerb-family available.
Using the project which comes with the hardware encoder, I was able to compile it myself using uVision and run it using RLPLite. I guess my linker settings for Yagarto have been wrong in some way.

But if anyone has a complete RLPLite project using Yagarto, I would still be interested in it!

Now that I can invoke my native code, I’d like to share the first performance results:

I’m currently only running a quaternion multiplication. Quaternions are basically 4-dimensional vectors, and a multiplication can be computed by:


qResult[0] = q1[0]*q2[0]-q1[1]*q2[1]-q1[2]*q2[2]-q1[3]*q2[3];
qResult[1] = q1[0]*q2[1]+q1[1]*q2[0]+q1[2]*q2[3]-q1[3]*q2[2];
qResult[2] = q1[0]*q2[2]+q1[2]*q2[0]+q1[3]*q2[1]-q1[1]*q2[3];
qResult[3] = q1[0]*q2[3]+q1[3]*q2[0]+q1[1]*q2[2]-q1[2]*q2[1];

So it consists of 16 multiplications and 12 additions/subtractions.
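For reference, here is a minimal self-contained sketch of the same product as a C function. The name `quat_mul`, the `float` element type and the {w, x, y, z} storage order are my assumptions for illustration; they are not taken from the actual project sources.

```c
#include <assert.h>

/* Hamilton product r = a * b for quaternions stored as {w, x, y, z}.
   r must not alias a or b, because r[0] is written while the
   inputs are still needed for r[1]..r[3]. */
void quat_mul(const float a[4], const float b[4], float r[4])
{
    assert(r != a && r != b);
    r[0] = a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3];
    r[1] = a[0]*b[1] + a[1]*b[0] + a[2]*b[3] - a[3]*b[2];
    r[2] = a[0]*b[2] + a[2]*b[0] + a[3]*b[1] - a[1]*b[3];
    r[3] = a[0]*b[3] + a[3]*b[0] + a[1]*b[2] - a[2]*b[1];
}
```

A quick sanity check is the identity i * j = k: multiplying {0, 1, 0, 0} by {0, 0, 1, 0} should give {0, 0, 0, 1}.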

So here are the performance results:

500 multiplications calculated by managed code only: 0.95 seconds
500 calls to RLPLite, each time executing one multiplication: 0.17 seconds

Running the whole loop on native code (compiled with -O3):
5,000: 0.0095 seconds
50,000: 0.061 seconds
500,000: 0.58 seconds
1,000,000: 1.15 seconds

To see the effect of code optimization, I recompiled with -O0 and re-executed the last run:
1,000,000: 1.30 seconds
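To make these totals easier to compare, they can be converted to an average cost per multiplication. The helper below is only a sketch of that arithmetic; `micros_per_item` is a name I made up for this post.

```c
#include <assert.h>

/* Average cost of one item in microseconds, given the total run time
   in seconds and the number of items processed in that time. */
double micros_per_item(double total_seconds, long items)
{
    assert(items > 0);
    return total_seconds * 1e6 / (double)items;
}
```

With the measurements above this gives roughly 1900 µs per multiplication in managed code (0.95 s / 500), roughly 340 µs when every multiplication is a separate RLPLite call (0.17 s / 500), and about 1.15 µs (-O3) or 1.30 µs (-O0) per multiplication when the whole loop of 1,000,000 runs natively.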

I bet you can use some assembly and make it even faster.

As I only have the lite version of µVision, I can’t check what the assembly looks like. But the code Yagarto produced was using DSP instructions like multiply-accumulate, so I’m not sure whether I could improve on it myself.
The only thing I can tell about the µVision code is that the function executing the multiplication has a size of 216 bytes.

But due to my laziness, the actual multiplication is executed in a separate function, called once per iteration of the loop. By doing the calculation inside the loop I could get rid of all the register pushes/pops, as well as the branch/return.
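One way to get that effect without copying the formula into the loop by hand is to mark the helper `static inline`, which allows (but does not force) the compiler to expand it at the call site. The sketch below uses hypothetical names (`quat_mul_inline`, `quat_mul_repeat`) and a made-up loop body that multiplies an accumulator by the same quaternion n times; my real loop may look different.

```c
#include <assert.h>

/* static inline is only a hint: the compiler may expand this at the call
   site, which removes the call/return and the register saves around it. */
static inline void quat_mul_inline(const float a[4], const float b[4], float r[4])
{
    r[0] = a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3];
    r[1] = a[0]*b[1] + a[1]*b[0] + a[2]*b[3] - a[3]*b[2];
    r[2] = a[0]*b[2] + a[2]*b[0] + a[3]*b[1] - a[1]*b[3];
    r[3] = a[0]*b[3] + a[3]*b[0] + a[1]*b[2] - a[2]*b[1];
}

/* Hypothetical benchmark loop: acc = acc * q, repeated n times, in place. */
void quat_mul_repeat(float acc[4], const float q[4], unsigned long n)
{
    assert(acc != q);
    while (n--) {
        float t[4];                 /* temporary, because r must not alias a */
        quat_mul_inline(acc, q, t);
        acc[0] = t[0];
        acc[1] = t[1];
        acc[2] = t[2];
        acc[3] = t[3];
    }
}
```

Whether the call actually disappears depends on the compiler and the optimization level, so the generated listing is the place to confirm it.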

As soon as I have some more code to show I’m gonna publish what I have so far.

EDIT:
I just realized that the function containing the loop has a size of 242 bytes, so it seems µVision put the code inside the loop by itself :slight_smile:

Yes, it seems that either my compiler or linker settings are wrong.

I think the next thing I’ll do is write a ‘custom tool’ for Visual Studio that automatically updates my function addresses after a build, because right now I’m getting tired of reading .map files and applying the new offsets.
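The core of such a tool could be a small parser for the symbol lines of a GNU ld map file, i.e. lines of the form "0x2001a104   RLP_Init". This is only a sketch under that assumption; the function names are made up, and the Keil map format is different and not handled here.

Separately, and purely as something to verify rather than a confirmed fix: Cortex-M cores execute only Thumb code and expect bit 0 of an indirect branch target to be set, while the GNU ld map lists RLP_Init at an even address. Whether RLPLite.Procedure sets that bit itself is not established in this thread, so `with_thumb_bit` is an assumption to experiment with.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parses one GNU ld map line of the form
       "                0x2001a104                RLP_Init"
   Returns 1 and stores the address if the line names `symbol`, else 0. */
int map_line_address(const char *line, const char *symbol, unsigned long *address)
{
    unsigned long value;
    char name[128];

    assert(line && symbol && address);
    if (sscanf(line, " 0x%lx %127s", &value, name) != 2)
        return 0;               /* not an "address name" line */
    if (strcmp(name, symbol) != 0)
        return 0;               /* some other symbol */
    *address = value;
    return 1;
}

/* Scans a whole map file; returns 1 if the symbol was found, else 0. */
int map_file_address(const char *path, const char *symbol, unsigned long *address)
{
    char line[512];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return 0;
    while (!found && fgets(line, sizeof line, f))
        found = map_line_address(line, symbol, address);
    fclose(f);
    return found;
}

/* ASSUMPTION (unverified for RLPLite): sets bit 0 to mark a Thumb entry point. */
unsigned long with_thumb_bit(unsigned long address)
{
    return address | 1UL;
}
```

A post-build step could call `map_file_address("main.map", "RLP_Init", &addr)` and write the result into a generated C# constants file, so the managed code never contains hand-copied offsets.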