Bottleneck

I’ve implemented the Bresenham Line algorithm in RLP using a PutPixel implemented as assembler.

Then I’ve made a demo in managed code that moves some lines on the screen.

When I run the demo with the RLP Line function, the code is more than 10 times slower than when I draw the lines on a Bitmap and then Flush to the screen.

How can that be explained?

Can you show the code?


inline void Line(WORD x0, WORD y0, WORD x1, WORD y1, WORD color)
{
	short e2 = 0;
	WORD dx = x0 > x1 ? (x0 - x1) : (x1 - x0);
	WORD dy = y0 > y1 ? (y0 - y1) : (y1 - y0);
	short sx = x0 < x1 ? 1 : -1;
	short sy = y0 < y1 ? 1 : -1;
	short err = dx - dy;

	for (;;)
	{
		// PutPixel
		asm volatile (
			"mov	r0, %[x]"								"\n"
			"mov	r1, %[y]"								"\n"
			"mov	r2, %[color]"							"\n"
			"ldr	r3, =" G_STRINGIFY(LCD_UPBASE)			"\n"	// r3 = LCD_UPBASE
			"ldr	r3, [r3]"								"\n"	// r3 = LCD base address
			"add	r3, r0, lsl #1"							"\n"	// r3 += x << 1
			"add	r3, r1, lsl #7"							"\n"	// r3 += y << 7
			"add	r3, r1, lsl #9"							"\n"	// r3 += y << 9
			"strh	r2, [r3]" 								"\n"	// *r3 = color
			:// no output
			:[x]"r"(x0),[y]"r"(y0),[color]"r"(color)//input
			:"r0","r1","r2","r3"	// clobbered registers
		);
		
		if ((x0 == x1) && (y0 == y1))
			break;
			
		e2 = 2 * err;
		
		if (e2 > -dy)
		{
			err -= dy;
			x0 += sx;
		}
		if (e2 < dx)
		{
			err += dx;
			y0 += sy;
		}
	}
}

Note that DrawLine in C# is not really managed. Only the call is managed; the drawing itself all happens in C++.

I know, Gus. But I should get at least the same speed, unless the RLP invoke has a huge overhead or waits for something.

I doubt that RLP overhead alone would make it 10 times slower, but I am not sure what the cause is.

Just curious :wink: but I hardly know anything about RLP…
What if you replace the ASM part with NOPs?
Of course you are not going to see anything on the screen,
but if you measure, is this “nothing” faster than managed?
That would show whether the slowness comes from the PutPixel or from something else.

I don’t know ARM assembler, but I used to run into similar problems when programming direct VESA video access on PCs.
I remember being disappointed when writing a graphics library for the PC: it was very slow, even though everything was written in assembler. Accessing video RAM pixel by pixel was very slow, but if I drew in ordinary memory first, I could copy the whole buffer to video RAM with about 4 lines of assembler such as
"rep stosw" etc. and the speed was fine.

Maybe totally unrelated… :slight_smile:
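
For readers who have not seen the technique, here is a minimal C sketch of the "draw in memory, then copy in one go" approach described above. It is not code from this thread: the array `video_ram` is only a stand-in for real LCD memory, the width of 320 is inferred from the 640-byte row stride in the asm PutPixel, the height of 240 is an assumption, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FB_WIDTH  320   /* inferred from the 640-byte row stride in the asm PutPixel */
#define FB_HEIGHT 240   /* assumed panel height */

/* Stand-in for video memory; on real hardware this would be the LCD frame buffer. */
uint16_t video_ram[FB_WIDTH * FB_HEIGHT];

/* Off-screen buffer in ordinary RAM that receives all drawing. */
uint16_t back_buffer[FB_WIDTH * FB_HEIGHT];

/* Write one 16-bit pixel into the off-screen buffer (hypothetical helper). */
void back_put_pixel(unsigned x, unsigned y, uint16_t color)
{
    assert(x < FB_WIDTH && y < FB_HEIGHT);
    back_buffer[y * FB_WIDTH + x] = color;
}

/* Copy the finished frame to video memory in a single bulk transfer. */
void back_flush(void)
{
    memcpy(video_ram, back_buffer, sizeof back_buffer);
}
```

Nothing shows up in `video_ram` until `back_flush()` is called, which is the same pattern as drawing on a Bitmap and then calling Flush. Whether it helps on a given board depends on how slow individual writes to LCD memory actually are there.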

@ Nicolas: Thanks for your suggestion. When I think about it, the method you describe is the way it works when using the bitmap flush. So it is worth a try.

So I ran this code (with PutPixel intact)


int start = Environment.TickCount;
screen.Clear();
for (int i = 0; i < 1000; i++)
    screen.DrawLine(Color.White, 1, 60, 75, 281, 187);
screen.Flush();
int duration = Environment.TickCount - start;
Debug.Print("Bitmap.DrawLine 1000 lines took " + duration + "ms");

start = Environment.TickCount;
lcd.clearProcedure.Invoke((ushort)0x0000);
for (int i = 0; i < 1000; i++)
    lcd.lineProcedure.Invoke((ushort)60, (ushort)75, (ushort)281, (ushort)187, (ushort)0xFFFF);
duration = Environment.TickCount - start;
Debug.Print("RLP Line 1000 lines took " + duration + "ms");

This is the (sad) result:


Bitmap.DrawLine 1000 lines took 879ms
RLP Line 1000 lines took 11649ms

Then I completely commented out the PutPixel asm routine, then this is the result:


Bitmap.DrawLine 1000 lines took 879ms
RLP Line 1000 lines took 11119ms

Then, commented out the Clear methods in the test, result:


Bitmap.DrawLine 1000 lines took 877ms
RLP Line 1000 lines took 11100ms

So as I say, RLP seems more than 10 times slower…
What am I missing here?

I did one more test, I moved the for loop to RLP, and only Invoke the RLP method once. So, RLP draws the line 1000 times with just one managed call.

This is the result:


Bitmap.DrawLine 1000 lines took 875ms
RLP Line 1000 lines took 488ms

Amazing :slight_smile: Almost twice as fast as the bitmap method.
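
The native code for that last test was not posted, so the following is only a sketch of what "moving the for loop to RLP" could look like. It uses the same Bresenham error-term scheme as the Line function above, but writes into a plain array (`lcd_buffer`, a stand-in for LCD memory) instead of using the asm PutPixel. The name `draw_line_repeated` and its parameter list are hypothetical, not the actual RLP entry point.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LCD_WIDTH  320   /* inferred from the 640-byte row stride in the asm PutPixel */
#define LCD_HEIGHT 240   /* assumed panel height */

/* Stand-in for LCD memory so the sketch can run anywhere. */
uint16_t lcd_buffer[LCD_WIDTH * LCD_HEIGHT];

/* Bresenham line using the same error-term scheme as Line() above, in plain C. */
void draw_line(int x0, int y0, int x1, int y1, uint16_t color)
{
    int dx = abs(x1 - x0);
    int dy = abs(y1 - y0);
    int sx = x0 < x1 ? 1 : -1;
    int sy = y0 < y1 ? 1 : -1;
    int err = dx - dy;

    for (;;) {
        assert(x0 >= 0 && x0 < LCD_WIDTH && y0 >= 0 && y0 < LCD_HEIGHT);
        lcd_buffer[y0 * LCD_WIDTH + x0] = color;

        if (x0 == x1 && y0 == y1)
            break;

        int e2 = 2 * err;
        if (e2 > -dy) { err -= dy; x0 += sx; }
        if (e2 < dx)  { err += dx; y0 += sy; }
    }
}

/* Hypothetical batched entry point: a single call from managed code,
   with the repeat loop running entirely on the native side. */
int draw_line_repeated(int x0, int y0, int x1, int y1, uint16_t color, int count)
{
    for (int i = 0; i < count; i++)
        draw_line(x0, y0, x1, y1, color);
    return count;
}
```

With this shape the per-call cost of crossing from managed to native code is paid once instead of 1000 times, which is the only thing that changed between the 11100ms and 488ms measurements.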

Conclusion: Invoking methods from RLP has a huge overhead. GHI, how can this be finetuned?

(BTW I would like to make an alpha-blending graphics library for the cobra, so it’s not just test :stuck_out_tongue: )

You can try GeneralArray, which has very little overhead.

This is the result when using GeneralArray. We’re getting closer :slight_smile:


Bitmap.DrawLine 1000 lines took 867ms
RLP Line 1000 lines took 3450ms
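
The GeneralArray version of the code was not posted either. As a rough illustration only, the sketch below shows one way the native side could decode line parameters from a single packed array, assuming the array reaches native code as a pointer to 32-bit elements. The five-words-per-line layout, the `LineArgs` struct and `unpack_line` are all invented for this example and are not part of the RLP API.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_LINE 5   /* x0, y0, x1, y1, color: a layout assumed for this sketch */

/* Decoded parameters for one line (hypothetical type). */
typedef struct {
    uint16_t x0, y0, x1, y1, color;
} LineArgs;

/* Decode line number `index` from the packed array.
   Returns 1 on success, 0 if the array does not hold that many lines. */
int unpack_line(const uint32_t *array, unsigned word_count,
                unsigned index, LineArgs *out)
{
    assert(array != 0 && out != 0);

    unsigned base = index * WORDS_PER_LINE;
    if (base + WORDS_PER_LINE > word_count)
        return 0;

    out->x0    = (uint16_t)array[base + 0];
    out->y0    = (uint16_t)array[base + 1];
    out->x1    = (uint16_t)array[base + 2];
    out->y1    = (uint16_t)array[base + 3];
    out->color = (uint16_t)array[base + 4];
    return 1;
}
```

A layout like this would also allow many different lines to be handed over in one call, rather than one line per call.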

[quote]Amazing Double as fast as the bitmap method.
Conclusion: Invoking methods from RLP has a huge overhead. GHI, how can this be finetuned? [/quote]

Did you try a single invoke with GeneralArray?
I’m curious how that compares to your best time without generalarray - 488ms.

That ‘best time’ of 488ms was a test with the for-loop in RLP instead of in C#. So there is only one call from managed to unmanaged code.

To compare 1000 calls from managed to unmanaged code, you should compare:

1000 calls to draw a line without using general array took 11100ms
1000 calls to draw a line by using general array took 3450ms

Roughly speaking, a call with general array has about 3ms of overhead, and a call without general array has about 10ms of overhead.
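
The arithmetic behind those two figures can be written out as a small C helper. This is only a back-of-the-envelope estimate: it treats the 488ms single-call run as the cost of the drawing itself and subtracts it from each 1000-call total, even though the runs were not measured under identical conditions. The function name is made up for this example.

```c
#include <assert.h>

/* Average extra cost per managed-to-native call, in microseconds.
   looped_ms:      total time with one call per line
   single_call_ms: total time with the loop inside native code
   calls:          number of calls in the looped run */
long per_call_overhead_us(long looped_ms, long single_call_ms, long calls)
{
    assert(calls > 0);
    return (looped_ms - single_call_ms) * 1000 / calls;
}
```

Plugging in the numbers from this thread, `per_call_overhead_us(11100, 488, 1000)` gives 10612 (about 10.6ms per call without general array) and `per_call_overhead_us(3450, 488, 1000)` gives 2962 (about 3ms per call with general array).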

Have you tried taking the word “inline” out of the function declaration? The function is a bit big for inlining and may be hurting you. You also can’t really inline functions with managed code (not manually anyway, the JITter does it) so that may be adding some overhead there.

I even programmed the whole line routine in assembler inside the main RLP function but that didn’t speed things up.

I even made the function ‘naked’ with my own prolog and epilog code to make the asm output as small as possible… Still no luck.

Those numbers do not seem reasonable! I am passing this on to the team for verification.

@ Wouter, mind if I enlist your help w/ some assembly? RE: [url]http://www.tinyclr.com/forum/20/3408/#/2/msg32484[/url]

Using this code:


DateTime start, end;
int e = 0;
int[] intArray = new int[100];
byte[] byteArray = new byte[100];

start = DateTime.Now;
DoNothing.Invoke();
end = DateTime.Now;
Debug.Print("Invoke() in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.Invoke(e);
end = DateTime.Now;
Debug.Print("Invoke(int) in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.InvokeEx(intArray);
end = DateTime.Now;
Debug.Print("InvokeEx(int[]) in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.InvokeEx(intArray, byteArray);
end = DateTime.Now;
Debug.Print("InvokeEx(int[], byte[]) in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.Invoke(byteArray, byteArray);
end = DateTime.Now;
Debug.Print("Invoke(byte[], byte[]) in " + (end - start).Ticks / 10 + " micro seconds");

We get these results:

So passing no arguments is fast.
GeneralArray is the next best option.
We can optimize RLP in the future, but this requires major changes to how methods are invoked and arguments are passed…

Mike-

Could you try the same test, but wrap the function calls in an unsafe {} block? You will need to go into the project preferences to allow unsafe code to compile and run it.

Thanks