Bottleneck

WouterH · May 3, 2011, 4:33pm

I’ve implemented the Bresenham Line algorithm in RLP using a PutPixel implemented as assembler.

Then I’ve made a demo in managed code that moves some lines on the screen.

When I run the demo with the RLP Line function, the code is more than 10 times slower than when I draw the lines on a Bitmap and then Flush to the screen.

How can that be explained?

Architect · May 3, 2011, 4:38pm

Can you show the code?

WouterH · May 3, 2011, 4:48pm


inline void Line(WORD x0, WORD y0, WORD x1, WORD y1, WORD color)
{
	short e2 = 0;
	WORD dx = x0 > x1 ? (x0 - x1) : (x1 - x0);
	WORD dy = y0 > y1 ? (y0 - y1) : (y1 - y0);
	short sx = x0 < x1 ? 1 : -1;
	short sy = y0 < y1 ? 1 : -1;
	short err = dx - dy;

	for (;;)
	{
		// PutPixel
		asm volatile (
			"mov	r0, %[x]"								"\n"
			"mov	r1, %[y]"								"\n"
			"mov	r2, %[color]"							"\n"
			"ldr	r3, =" G_STRINGIFY(LCD_UPBASE)			"\n"	// r3 = LCD_UPBASE
			"ldr	r3, [r3]"								"\n"	// r3 = LCD base address
			"add	r3, r0, lsl #1"							"\n"	// r3 += x << 1
			"add	r3, r1, lsl #7"							"\n"	// r3 += y << 7
			"add	r3, r1, lsl #9"							"\n"	// r3 += y << 9
			"strh	r2, [r3]" 								"\n"	// *r3 = color
			:// no output
			:[x]"r"(x0),[y]"r"(y0),[color]"r"(color)//input
			:"r0","r1","r2","r3"	// clobbered registers
		);
		
		if ((x0 == x1) && (y0 == y1))
			break;
			
		e2 = 2 * err;
		
		if (e2 > -dy)
		{
			err -= dy;
			x0 += sx;
		}
		if (e2 < dx)
		{
			err += dx;
			y0 += sy;
		}
	}
}
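For readers who do not read ARM assembler, here is a minimal plain-C sketch of what the PutPixel block above computes. It assumes a 16-bit-per-pixel display that is 320 pixels wide (640 bytes per row), which is what the shifts `y << 7` plus `y << 9` imply. The names `pixel_offset` and `put_pixel` are hypothetical, and `base` stands in for the address read from `LCD_UPBASE`.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumption: 2 bytes per pixel, 320 pixels (640 bytes) per row. */
enum { BYTES_PER_PIXEL = 2, BYTES_PER_ROW = 640 };

/* Byte offset of pixel (x, y), using the same shifts as the assembler:
 * (x << 1) is x * 2, and (y << 7) + (y << 9) is y * 128 + y * 512 = y * 640. */
static size_t pixel_offset(uint16_t x, uint16_t y)
{
    return ((size_t)x << 1) + ((size_t)y << 7) + ((size_t)y << 9);
}

/* Plain-C equivalent of the inline PutPixel. `base` is a stand-in for the
 * frame buffer address; it is volatile because it may be display memory. */
static void put_pixel(volatile uint16_t *base, uint16_t x, uint16_t y,
                      uint16_t color)
{
    base[pixel_offset(x, y) / BYTES_PER_PIXEL] = color;
}
```

A compiler will usually turn this into a handful of instructions very similar to the hand-written version, so the pixel write itself is unlikely to be where the time goes.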

Gus_Issa · May 3, 2011, 4:49pm

Note that DrawLine in C# is not really managed code. Only the call is managed; the drawing itself all happens in native C++.

WouterH · May 3, 2011, 4:54pm

I know, Gus. But I should get at least the same speed, unless the RLP invoke has a huge overhead or waits for something.

Gus_Issa · May 3, 2011, 5:10pm

I doubt that RLP overhead would make it 10 times slower, but I am not sure why it is so slow.

Nicolas3 · May 3, 2011, 5:15pm

Just curious, though I hardly know anything about RLP…
What if you replace the ASM part with NOPs?
Of course you won't see anything on the screen,
but if you measure it, is this “nothing” faster than managed code?
That would show whether the slowness comes from the PutPixel or from something else.

I don’t know ARM assembler, but I used to run into similar problems when programming direct VESA video access on PCs.
I remember being disappointed when writing a graphics library for the PC: it was very slow, even though everything was written in assembler. Accessing video RAM pixel by pixel was very slow, but if I did the drawing in memory, I could copy the whole buffer to video RAM with about 4 lines of assembler such as
"rep movsw", and the speed was fine.

Maybe totally unrelated…
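The idea of drawing in ordinary memory and then copying everything to video RAM in one go can be sketched in C as follows. This is only an illustration under assumptions: a 320x240, 16-bit display (the height is a guess, the thread only implies the width), and the hypothetical names `back_buffer`, `back_put_pixel` and `flush_to`.

```c
#include <stdint.h>
#include <string.h>

enum { FB_WIDTH = 320, FB_HEIGHT = 240 };  /* assumed panel size */

/* Off-screen buffer in ordinary RAM; all drawing goes here first. */
static uint16_t back_buffer[FB_WIDTH * FB_HEIGHT];

/* Draw into RAM only; nothing reaches the display yet.
 * Out-of-range coordinates are ignored. */
static void back_put_pixel(int x, int y, uint16_t color)
{
    if (x >= 0 && x < FB_WIDTH && y >= 0 && y < FB_HEIGHT)
        back_buffer[y * FB_WIDTH + x] = color;
}

/* Copy the finished frame to display memory as one block transfer,
 * the C counterpart of a "rep movsw" style copy. */
static void flush_to(uint16_t *display)
{
    memcpy(display, back_buffer, sizeof back_buffer);
}
```

This is roughly the pattern that drawing on a Bitmap and then calling Flush follows: many cheap writes to RAM, then a single bulk copy.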

WouterH · May 4, 2011, 12:57am

@ Nicolas: Thanks for your suggestion. When I think about it, the method you describe is the way it works when using the bitmap flush. So it is worth a try.

WouterH · May 4, 2011, 2:06am

So I ran this code (with PutPixel intact):


int start = Environment.TickCount;
screen.Clear();
for (int i = 0; i < 1000; i++)
    screen.DrawLine(Color.White, 1, 60, 75, 281, 187);
screen.Flush();
int duration = Environment.TickCount - start;
Debug.Print("Bitmap.DrawLine 1000 lines took " + duration + "ms");

start = Environment.TickCount;
lcd.clearProcedure.Invoke((ushort)0x0000);
for (int i = 0; i < 1000; i++)
    lcd.lineProcedure.Invoke((ushort)60, (ushort)75, (ushort)281, (ushort)187, (ushort)0xFFFF);
duration = Environment.TickCount - start;
Debug.Print("RLP Line 1000 lines took " + duration + "ms");

This is the (sad) result:


Bitmap.DrawLine 1000 lines took 879ms
RLP Line 1000 lines took 11649ms

Then I completely commented out the PutPixel asm routine, then this is the result:


Bitmap.DrawLine 1000 lines took 879ms
RLP Line 1000 lines took 11119ms

Then, commented out the Clear methods in the test, result:


Bitmap.DrawLine 1000 lines took 877ms
RLP Line 1000 lines took 11100ms

So as I say, RLP seems more than 10 times slower…
What am I missing here?

WouterH · May 4, 2011, 2:47am

I did one more test: I moved the for loop into RLP and invoke the RLP method only once. So RLP draws the line 1000 times with just one managed call.

This is the result:


Bitmap.DrawLine 1000 lines took 875ms
RLP Line 1000 lines took 488ms

Amazing! Twice as fast as the bitmap method.

Conclusion: invoking RLP methods from managed code has a huge overhead. GHI, how can this be fine-tuned?

(BTW, I would like to make an alpha-blending graphics library for the Cobra, so it’s not just a test.)
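Moving the loop into native code generalizes to batching: pass many line segments in one call so the managed-to-native cost is paid once. Below is a minimal sketch of the native side, assuming a 320x240, 16-bit frame buffer. The `Segment` struct and the functions `draw_line` and `draw_lines` are hypothetical names for illustration and not part of RLP.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

enum { FB_W = 320, FB_H = 240 };  /* assumed frame buffer size */

/* Hypothetical layout of one line segment in the batch. */
typedef struct { int16_t x0, y0, x1, y1; } Segment;

/* Bresenham line, same error-term scheme as the Line function above,
 * but with int arithmetic and clipping to the buffer. */
static void draw_line(uint16_t *fb, int x0, int y0, int x1, int y1,
                      uint16_t color)
{
    int dx = abs(x1 - x0), dy = abs(y1 - y0);
    int sx = x0 < x1 ? 1 : -1, sy = y0 < y1 ? 1 : -1;
    int err = dx - dy;

    for (;;) {
        if (x0 >= 0 && x0 < FB_W && y0 >= 0 && y0 < FB_H)
            fb[y0 * FB_W + x0] = color;
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 > -dy) { err -= dy; x0 += sx; }
        if (e2 < dx)  { err += dx; y0 += sy; }
    }
}

/* One entry point, many lines: returns the number of segments drawn. */
static size_t draw_lines(uint16_t *fb, const Segment *segs, size_t count,
                         uint16_t color)
{
    for (size_t i = 0; i < count; i++)
        draw_line(fb, segs[i].x0, segs[i].y0, segs[i].x1, segs[i].y1, color);
    return count;
}
```

With this shape, a frame with hundreds of lines costs one invoke instead of hundreds, which is where the 488ms result comes from.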

Gus_Issa · May 4, 2011, 8:05am

You can try GeneralArray, which has very little overhead.

WouterH · May 4, 2011, 12:34pm

This is the result when using GeneralArray. We’re getting closer:


Bitmap.DrawLine 1000 lines took 867ms
RLP Line 1000 lines took 3450ms
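For illustration only, this is one way the native side could read the five line arguments once they arrive packed in a single array of 32-bit values instead of as five separate arguments. The slot order, the `LineArgs` struct and `unpack_line_args` are assumptions made for this sketch; it does not reproduce the actual RLP entry-point signature.

```c
#include <stdint.h>

/* Hypothetical container for the unpacked arguments. */
typedef struct { uint16_t x0, y0, x1, y1, color; } LineArgs;

enum { LINE_ARG_COUNT = 5 };

/* Read x0, y0, x1, y1, color from five consecutive 32-bit slots.
 * Returns 0 on success, -1 if a pointer is null or too few values
 * were passed. */
static int unpack_line_args(const uint32_t *packed, uint32_t count,
                            LineArgs *out)
{
    if (packed == 0 || out == 0 || count < LINE_ARG_COUNT)
        return -1;
    out->x0    = (uint16_t)packed[0];
    out->y0    = (uint16_t)packed[1];
    out->x1    = (uint16_t)packed[2];
    out->y1    = (uint16_t)packed[3];
    out->color = (uint16_t)packed[4];
    return 0;
}
```

The point of the design is that only one buffer crosses the managed/native boundary per call, rather than one marshalled value per argument.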

Felix · June 4, 2011, 5:22pm

[quote]Amazing Double as fast as the bitmap method.
Conclusion: Invoking methods from RLP has a huge overhead. GHI, how can this be finetuned? [/quote]

Did you try a single invoke with GeneralArray?
I’m curious how that compares to your best time without GeneralArray (488ms).

WouterH · June 5, 2011, 3:19am

That ‘best time’ of 488ms was a test with the for-loop in RLP instead of in C#. So there is only one call from managed to unmanaged code.

To compare 1000 calls from managed to unmanaged code, you should compare:

1000 calls to draw a line without using GeneralArray took 11100ms
1000 calls to draw a line by using GeneralArray took 3450ms

In general, you can say that a call with GeneralArray has about 3ms of overhead, and a call without GeneralArray has more than 10ms of overhead (after subtracting the roughly 0.5ms it takes to draw one line).
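The arithmetic behind those per-call figures, as a small sketch. It assumes the 488ms single-invoke run is a fair estimate of the pure drawing time for 1000 lines; the function name `per_call_overhead_us` is made up for this example.

```c
#include <stdint.h>

/* Average per-call overhead in microseconds:
 * (total time - time spent actually drawing) / number of calls.
 * Returns 0 for invalid input (no calls, or drawing time > total). */
static uint32_t per_call_overhead_us(uint32_t total_ms, uint32_t drawing_ms,
                                     uint32_t calls)
{
    if (calls == 0 || total_ms < drawing_ms)
        return 0;
    return (total_ms - drawing_ms) * 1000u / calls;
}
```

With the measurements from this thread, `per_call_overhead_us(11100, 488, 1000)` gives 10612 (about 10.6ms per plain Invoke) and `per_call_overhead_us(3450, 488, 1000)` gives 2962 (about 3ms per GeneralArray call).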

Silic0re · June 5, 2011, 11:38am

Have you tried taking the word “inline” out of the function declaration? The function is a bit big for inlining and may be hurting you. You also can’t really inline functions with managed code (not manually anyway, the JITter does it) so that may be adding some overhead there.

WouterH · June 5, 2011, 11:47am

I even programmed the whole line routine in assembler inside the main RLP function but that didn’t speed things up.

I even made the function ‘naked’ with my own prolog and epilog code to make the asm output as small as possible… Still no luck

Gus_Issa · June 5, 2011, 12:05pm

Those numbers do not seem reasonable! I am passing this on to the team for verification.

Skewworks · June 8, 2011, 1:14pm

@ Wouter, mind if I enlist your help w/ some assembly? RE: [url]http://www.tinyclr.com/forum/20/3408/#/2/msg32484[/url]

User_5 · June 10, 2011, 12:57pm

Using this code:


DateTime start, end;
int e = 0;
int[] intArray = new int[100];
byte[] byteArray = new byte[100];

start = DateTime.Now;
DoNothing.Invoke();
end = DateTime.Now;
Debug.Print("Invoke() in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.Invoke(e);
end = DateTime.Now;
Debug.Print("Invoke(int) in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.InvokeEx(intArray);
end = DateTime.Now;
Debug.Print("InvokeEx(int[]) in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.InvokeEx(intArray, byteArray);
end = DateTime.Now;
Debug.Print("InvokeEx(int[], byte[]) in " + (end - start).Ticks / 10 + " micro seconds");

start = DateTime.Now;
DoNothing.Invoke(byteArray, byteArray);
end = DateTime.Now;
Debug.Print("Invoke(byte[], byte[]) in " + (end - start).Ticks / 10 + " micro seconds");

We get these results:

So passing no arguments is fast.
GeneralArray is the next best option.
We can optimize RLP in the future, but this requires major changes to how invoking and argument passing work…

Silic0re · June 10, 2011, 2:58pm

Mike-

Could you try the same test, but wrap the function calls in an unsafe {} block? You will need to allow unsafe code in the project settings for it to compile and run.

Thanks