Inline assembler

I want to use inline assembler. Which instruction set should I use? The ARM or the Thumb instruction set?

I already figured that inline assembler goes like this:


asm("<instruction>");
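
For anything beyond a bare instruction, GCC's extended asm form lets the compiler pick the registers. A minimal sketch, assuming GCC and a file compiled in ARM mode (no -mthumb); `add_words` is a made-up name for illustration, and the plain C branch is only there so the snippet also builds on a PC:

```c
#include <stdint.h>

/* Hypothetical helper, not from this thread: adds two words with a single
   inline ARM instruction when built for ARM mode, plain C otherwise. */
static inline uint32_t add_words(uint32_t a, uint32_t b)
{
#if defined(__arm__) && !defined(__thumb__)
    uint32_t sum;
    /* %0 is the output operand, %1 and %2 are the inputs;
       "r" asks GCC to place each one in a general-purpose register. */
    __asm__ ("add %0, %1, %2" : "=r" (sum) : "r" (a), "r" (b));
    return sum;
#else
    return a + b;   /* portable fallback so the sketch builds off-target */
#endif
}
```

Calling `add_words(2, 3)` should give 5 on either path.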

I think everything we do in RLP is ARM rather than Thumb.

I’ve written a small piece of assembler that updates the screen as fast as possible.

I’m rotating a bitmap on the 320 x 240 screen.

Now I have a question about the instruction execution speed of the LPC24xx.

You can compare the assembly code with this pseudo C code:


y = 240;
while (y-- > 0)
{
  x = 320;
  while (x-- > 0)
  {
    // 24 assembly instructions here
  }
}

So drawing one frame takes 240 * ((320 * (24 + 3)) + 3) = 2074320 instructions. But I ‘only’ get a throughput of 4 to 5 frames per second, which would mean the CPU executes at ‘only’ about 10 MIPS. Is this true? What is causing this bottleneck?
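
The arithmetic can be double-checked with a few lines of C. This is only a sketch of the estimate; the function names are made up, and it counts every instruction as one unit, which ignores any time spent waiting on memory:

```c
#include <stdint.h>

/* Instructions per frame for the loop structure above:
   `body` instructions plus `overhead` loop instructions per pixel,
   plus `overhead` loop instructions per row. */
uint32_t instructions_per_frame(uint32_t width, uint32_t height,
                                uint32_t body, uint32_t overhead)
{
    return height * (width * (body + overhead) + overhead);
}

/* Implied instruction rate (instructions per second) at a given frame rate. */
uint32_t implied_rate(uint32_t per_frame, uint32_t frames_per_second)
{
    return per_frame * frames_per_second;
}
```

With width 320, height 240, body 24 and overhead 3 this gives 2074320 instructions per frame, so 4 to 5 frames per second works out to roughly 8.3 to 10.4 million instructions per second.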

You are doing a lot of operations on SDRAM, and SDRAM is not as fast as the processor. Now you can see where a processor cache can really help.

This is the result of my inline assembler test in RLP on the FEZ Cobra

WOW…this is fast considering no hardware acceleration and ARM7 on QVGA screen. Very nice work.

very nice

It even runs slightly faster now, since I was able to remove another 3 asm instructions by combining them. ARM instruction set is really neat when it comes to instruction combining.
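
As a generic illustration of that kind of combining (not necessarily the instructions merged here), ARM data-processing instructions can shift their second operand for free, so a shift and an add fit in one instruction. A sketch assuming GCC in ARM mode; `word_offset` is a made-up name and the C branch is only a portable stand-in:

```c
#include <stdint.h>

/* Hypothetical example: address of 32-bit element `index` counted from `base`. */
static inline uint32_t word_offset(uint32_t base, uint32_t index)
{
#if defined(__arm__) && !defined(__thumb__)
    uint32_t result;
    /* One instruction: the last operand is shifted left by 2
       (multiplied by 4) before it is added to base. */
    __asm__ ("add %0, %1, %2, lsl #2"
             : "=r" (result) : "r" (base), "r" (index));
    return result;
#else
    return base + (index << 2);   /* same result in portable C */
#endif
}
```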

First managed code sends the 64x64 GHI image to RLP.

Next, managed code calculates an optimised (full circle contains 256 degrees) sin / cos lookup table.

Last, managed code calls the Animate function with a sin / cos for the current angle and increments the angle.
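
To give an idea of what those three steps amount to, here is a rough plain C sketch. It is not the code from this project: the 8.8 fixed-point format, the 16-bit pixels, the tiling of the bitmap and all names are assumptions. With 256 steps per circle the angle fits in one byte and wraps around by itself. The table is filled with a rotation recurrence (hard-coded sin and cos of one step) only to keep the sketch free of a math library dependency.

```c
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 240
#define BMP_SIZE 64            /* power of two, so "& (BMP_SIZE - 1)" wraps */

int16_t sin_table[256];        /* sin(angle) * 256, angle in 1/256 of a circle */

void build_sin_table(void)
{
    const double step_sin = 0.024541228522912288;  /* sin(2*pi/256) */
    const double step_cos = 0.999698818696204220;  /* cos(2*pi/256) */
    double s = 0.0, c = 1.0;
    for (int i = 0; i < 256; i++) {
        double scaled = s * 256.0;
        /* round to nearest without needing lround() */
        sin_table[i] = (int16_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
        double next_s = s * step_cos + c * step_sin;   /* rotate by one step */
        c = c * step_cos - s * step_sin;
        s = next_s;
    }
}

int table_sin(uint8_t angle) { return sin_table[angle]; }
int table_cos(uint8_t angle) { return sin_table[(uint8_t)(angle + 64)]; }

/* Draws one frame: for every screen pixel, rotate its offset from the
   screen centre and fetch the matching (tiled) bitmap pixel. */
void draw_rotated_frame(uint16_t *frame, const uint16_t *bitmap,
                        int sin8, int cos8)
{
    for (int y = 0; y < SCREEN_H; y++) {
        int dy = y - SCREEN_H / 2;
        for (int x = 0; x < SCREEN_W; x++) {
            int dx = x - SCREEN_W / 2;
            /* The unsigned cast keeps the shift well defined for negative
               offsets; only the low bits are used after masking. */
            uint32_t u = (uint32_t)(dx * cos8 + dy * sin8) >> 8;
            uint32_t v = (uint32_t)(dy * cos8 - dx * sin8) >> 8;
            frame[y * SCREEN_W + x] =
                bitmap[(v & (BMP_SIZE - 1)) * BMP_SIZE + (u & (BMP_SIZE - 1))];
        }
    }
}
```

A caller would run `build_sin_table()` once, then per frame call `draw_rotated_frame(frame, bitmap, table_sin(angle), table_cos(angle))` and increment `angle`.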

This is what the Animate function looks like (this method draws a single frame ASAP :slight_smile: )

Nice!

One thing I wanted to try was to move the old MPEG2 video playback RLP on EMX and see how fast can EMX do video. Are you up for this challenge? I am sure I can convince GHI to get you some new toys :wink:

Hmm, I don’t think the Cobra can handle MPEG decoding. As you said, memory access is really a bottleneck. That’s why I used 8 CPU registers to cache everything and reduced the inner loop to 14 asm instructions. No C optimiser will ever compact it that much.

Plus doing mpeg decoding in asm will be a pain in the *** 8)

I meant take the same code and run it on EMX, not rewrite it in assembly :slight_smile:

EMX can do MPEG, but how fast can it do it :slight_smile: At 320x240 ChipworkX can do over 60fps, so assuming EMX is 6 times slower you get 10fps, which is okay. Or make the video smaller and get 15fps, which is enough.

If I have some spare time I’ll look into it :slight_smile:

Just captured another video of the rotating bitmap. Now with a 128x128 bitmap and some random movement.

This belongs on code.tinyclr.com so we can all enjoy the awesomeness.

cough hint cough 8)

pretty amazing for sure

Looks very smooth!

Thank you

Here you go:
http://code.tinyclr.com/project/295/rlp-bitmap-rotation-demo/

Cool stuff.
Mind sharing which IDE you use for native code and where you got the info on the syntax?
I used PICs for a while and I’m used to having everything in one place: doc sheet + instruction set + free IDE with good tutorials. (I’m spoiled, I know.) With ARM everything seems scattered, and there are tons of third-party IDEs.
Maybe it’s just me :-[

Well, I just use “Programmer’s Notepad” for editing and a console window for compiling. Nothing fancy.

NOTE: all links below are for the FEZ Cobra, I don’t know if they match other boards.

I use the “Sample Code Bundle” from here:
http://ics.nxp.com/support/documents/microcontrollers/?scope=LPC2468

On the same page you’ll find the “LPC24xx User Manual (UM10237)”

Then you have the instruction set:

More information on inline assembler in GCC:
http://www.ethernut.de/en/documents/arm-inline-asm.html

http://www.devrs.com/gba/files/asmc.txt

You can always let the compiler generate an assembly list file from your C code. There you will learn a lot about how the compiler does things in assembly.

During development I found that memory access is the bottleneck. The ARM has some internal registers that are free to use, so if you see that the C code generates too many memory accesses and you want to speed things up, you can cache values in those registers for faster access.
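
The same idea can often be tried in C first, before dropping to assembler: copy whatever the loop needs into local variables so the compiler can keep it in registers. A small sketch with made-up names (`blit_state` and both functions are hypothetical); comparing the listings that GCC's -S option produces for the two versions shows the difference:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state that lives in memory (for example in SDRAM). */
struct blit_state {
    const uint16_t *src;
    uint16_t *dst;
    size_t count;
    uint16_t tint;
};

/* Straightforward version: every iteration goes through `st`. The compiler
   may have to reload st->tint after each store, because as far as it
   knows dst could be pointing at it. */
void blit_from_memory(struct blit_state *st)
{
    for (size_t i = 0; i < st->count; i++)
        st->dst[i] = (uint16_t)(st->src[i] ^ st->tint);
}

/* Cached version: read the fields once into locals, which the compiler
   can keep in registers for the whole loop. */
void blit_from_registers(const struct blit_state *st)
{
    const uint16_t *src = st->src;
    uint16_t *dst = st->dst;
    size_t count = st->count;
    const uint16_t tint = st->tint;

    while (count-- > 0)
        *dst++ = (uint16_t)(*src++ ^ tint);
}
```

Both functions produce the same output; only the number of memory accesses inside the loop differs.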