
[solved] Panda runs around 3x faster than Cobra?


#1

I ordered a Cobra since my app outgrew the Panda’s memory. Converting the app was very easy, but something was not working as expected.

So I did some tests and found out that the Cobra is CONSIDERABLY slower than the Panda II.

What am I doing wrong?

Regards
Mark


#2

Most probably the doctor isn’t able to reply without further examination.

So, if you want an answer to your question, you’re going to have to provide us with some more details, and in particular your code (or at least the slow part).


#3

Cobra is much faster, so I do not know how you can state that it’s slower than Panda.
As Eric said, please show us some code. :wink:


#4

I’ve got a Panda II and a Cobra sitting around; if you post code I can run a test for you.


#5

Panda II and Cobra use the same processor family, but Cobra (EMX) also relies on external memory (overhead) and has a TFT display controller (overhead) and an Ethernet controller (overhead), so depending on what you are doing, EMX can be a bit slower than USBizi.


#6

I have a sample program that performs many very simple “benchmarks” (like summing a byte array, converting Int32 to byte[], etc.) and times them. The program is plain NET MF; it does not rely on any GHI functionality, so it should work without any modifications on any NET MF device.

The complete solution (Stopwatch class, main program, test methods) can be downloaded from my public mobile.me disk here https://public.me.com/m.munte

Please don’t tell me real world usage is any different from these simple benchmarks. I’m timing exactly the stuff I use in my current project.

E.g. I need to adjust several stepper motors connected through RS485. The motor controllers use an ASCII protocol that requires me to convert UInt32 to a byte array. Since the built-in .NET function (Int32.ToString()) is too slow, I built my own. It runs about 2-3 times faster. I also coded an RLP version; basically I took the itoa code from someone. The RLP version runs 50x faster, but the RLP overhead of 1 to 2 milliseconds per function call kills all the gain.

Now about my Cobra: it is plain. No TFT connected. Just the USB device cable connected to the PC. It has the latest firmware (released a few days ago).

If you run the test app and compare to a Panda II you’ll see the Cobra is really between 2 and 3 times slower for most tests.

I asked for someone with a ChipworkX to run this program some days ago but got no reply other than Gus’s affirmation that it is 6x faster - now I do not know whether that is 6x faster than Panda or Cobra…

I’m kind of confused. I ordered the Panda just to get to know NET MF and see if it is a viable choice. It is. Since my final app needs Ethernet, and some functions have not been implemented on the Panda (eg serialization is missing), and I was almost running out of memory, I decided the Cobra would be my board for further development. It is the board for professional development.

Please try out my sample code. Maybe I’m just doing something wrong. I’d love that to be the case!!!

Mark


#7

Wow, this is surprising. Overhead? 50%? Starting from a fresh flash on each, Cobra seems to be half the speed of Panda II:

Cobra

PerformanceTester - PrintSomeInfo
System Version: 4.1.6.0
Cpu.SlowClock: 18000000
Cpu.SystemClock: 18000000
Debugger Attached: True

PerformanceTester - MiscTests
time nothing [0.010 ms]
time Utility.ComputeCRC 2701537051 [0.175 ms]
time IntPlaces 0: 1 [0.282 ms]
time IntPlaces 4000: 4 [0.462 ms]
time IntPlaces -4000: 5 [0.584 ms]
time IntPlaces int.MaxValue: 10 [0.766 ms]
time IntPlaces int.MinValue: 11 [0.867 ms]

PerformanceTester - ArrayTests
time fill bytes1 [15.327 ms]
time FillArrayWithIndexValue bytes1 [15.438 ms]
time Array.Clear [0.059 ms]
time Array.Copy [0.222 ms]
time bytes1.CopyTo [0.287 ms]
time Array.IndexOf 98 is 98 [0.561 ms]
time Utility.CombineArrays 200 [0.149 ms]
time Utility.ExtractValueFromArray 50462976 [0.109 ms]
time Utility.ExtractRangeFromArray 10 [0.123 ms]
time bytes2[99] == 99 True [0.052 ms]
time bytes2[bytes2.Length-1] == 99 True [0.087 ms]

PerformanceTester - StringTests 100chars
time init string of 100 chars [0.031 ms]
time UTF8Encoding.UTF8.GetBytes [0.462 ms]
time UTF8Encoding.UTF8.GetChars [1.106 ms]
time new string(chars) [0.212 ms]

PerformanceTester - StringTestsShort 20chars
time init string of 20 chars [-0.014 ms]
time UTF8Encoding.UTF8.GetBytes [0.386 ms]
time UTF8Encoding.UTF8.GetChars [0.573 ms]
time new string(chars) [0.089 ms]

PerformanceTester - SomeLoopOps
Time loop: i++, u++ [10.477 ms]
Time loop: i += 1, u += 1 [10.598 ms]
Time loop: f = 2.0f/3.0f [9.062 ms]
Time loop: d = 2.0/3.0 [9.556 ms]
Time loop: if (i % 10 == 0) u4++ [16.101 ms]

PerformanceTester - ClearByteArray
create new byte array [0.043 ms] 00
Array.Clear [0.117 ms]
clear using ‘for ++’ [12.701 ms]
clear using ‘for --’ [13.236 ms]
clear using ‘while --’ [12.838 ms]

PerformanceTester - SumByteArray
sum byte array ‘for each’ [17.493 ms]
sum byte array ‘for ++’ [14.535 ms]
sum byte array ‘while --’ [14.558 ms]

PerformanceTester - IntToByteArrayTests
time IntToASCII 0: 0 [0.595 ms]
time IntToASCII 4000: 4000 [1.183 ms]
time IntToASCII -4000: -4000 [1.772 ms]
time IntToASCII int.MaxValue: 2147483647 [2.886 ms]
time IntToASCII int.MinValue: -2147483647 [3.362 ms]
time int.ToString + GetBytes 0: 0 [2.033 ms]
time int.ToString + GetBytes 4000: 4000 [1.823 ms]
time int.ToString + GetBytes -4000: -4000 [2.511 ms]
time int.ToString + GetBytes int.MaxValue: 2147483647 [2.098 ms]
time int.ToString + GetBytes int.MinValue+1: -2147483647 [2.609 ms]

tests ran in: [694.379 ms]
The thread ‘’ (0x1) has exited with code 0 (0x0).
Done.

=========================================================================
Panda II

PerformanceTester - PrintSomeInfo
System Version: 4.1.6.0
Cpu.SlowClock: 18000000
Cpu.SystemClock: 18000000
Debugger Attached: True

PerformanceTester - MiscTests
time nothing [0.005 ms]
time Utility.ComputeCRC 2701537051 [0.134 ms]
time IntPlaces 0: 1 [0.162 ms]
time IntPlaces 4000: 4 [0.187 ms]
time IntPlaces -4000: 5 [0.231 ms]
time IntPlaces int.MaxValue: 10 [0.306 ms]
time IntPlaces int.MinValue: 11 [0.348 ms]

PerformanceTester - ArrayTests
time fill bytes1 [5.702 ms]
time FillArrayWithIndexValue bytes1 [5.779 ms]
time Array.Clear [0.101 ms]
time Array.Copy [0.139 ms]
time bytes1.CopyTo [0.174 ms]
time Array.IndexOf 98 is 98 [0.299 ms]
time Utility.CombineArrays 200 [0.073 ms]
time Utility.ExtractValueFromArray 50462976 [0.050 ms]
time Utility.ExtractRangeFromArray 10 [0.064 ms]
time bytes2[99] == 99 True [0.025 ms]
time bytes2[bytes2.Length-1] == 99 True [0.084 ms]

PerformanceTester - StringTests 100chars
time init string of 100 chars [0.017 ms]
time UTF8Encoding.UTF8.GetBytes [0.234 ms]
time UTF8Encoding.UTF8.GetChars [0.443 ms]
time new string(chars) [0.120 ms]

PerformanceTester - StringTestsShort 20chars
time init string of 20 chars [0.017 ms]
time UTF8Encoding.UTF8.GetBytes [0.191 ms]
time UTF8Encoding.UTF8.GetChars [0.237 ms]
time new string(chars) [0.117 ms]

PerformanceTester - SomeLoopOps
Time loop: i++, u++ [3.891 ms]
Time loop: i += 1, u += 1 [3.843 ms]
Time loop: f = 2.0f/3.0f [3.398 ms]
Time loop: d = 2.0/3.0 [3.497 ms]
Time loop: if (i % 10 == 0) u4++ [6.175 ms]

PerformanceTester - ClearByteArray
create new byte array [0.026 ms] 00
Array.Clear [0.102 ms]
clear using ‘for ++’ [4.486 ms]
clear using ‘for --’ [5.041 ms]
clear using ‘while --’ [5.145 ms]

PerformanceTester - SumByteArray
sum byte array ‘for each’ [6.528 ms]
sum byte array ‘for ++’ [5.075 ms]
sum byte array ‘while --’ [5.436 ms]

PerformanceTester - IntToByteArrayTests
time IntToASCII 0: 0 [0.205 ms]
time IntToASCII 4000: 4000 [0.553 ms]
time IntToASCII -4000: -4000 [0.612 ms]
time IntToASCII int.MaxValue: 2147483647 [1.092 ms]
time IntToASCII int.MinValue: -2147483647 [1.195 ms]
time int.ToString + GetBytes 0: 0 [0.906 ms]
time int.ToString + GetBytes 4000: 4000 [0.963 ms]
time int.ToString + GetBytes -4000: -4000 [1.548 ms]
time int.ToString + GetBytes int.MaxValue: 2147483647 [0.976 ms]
time int.ToString + GetBytes int.MinValue+1: -2147483647 [1.163 ms]

tests ran in: [337.640 ms]
The thread ‘’ (0x1) has exited with code 0 (0x0).


#8

Unless .NETMF implements a pipeline or a prefetcher unit, I don’t see how an EMX module can be faster than a USBizi, as USBizi runs from internal flash/RAM at almost 0 wait states (WS) and EMX should need some WS to access external memory.


#9

GHI, am I doing something wrong or is this the real speed difference?
For most tests the Panda is two times faster. For the loop and sum byte array tests it is more like three times faster.
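
To put numbers on that, here is a small sketch in C that divides a few Cobra timings by the matching Panda II timings. The values are copied from the two logs in post #7; the struct and function names are only illustrative and are not part of the test app.

```c
#include <stddef.h>
#include <stdio.h>

/* Timings in milliseconds, copied from the Cobra and Panda II logs in post #7. */
struct sample {
    const char *name;
    double cobra_ms;
    double panda_ms;
};

static const struct sample samples[] = {
    { "fill bytes1",               15.327,   5.702 },
    { "loop: i++, u++",            10.477,   3.891 },
    { "sum byte array 'for ++'",   14.535,   5.075 },
    { "IntToASCII int.MaxValue",    2.886,   1.092 },
    { "UTF8 GetBytes (100 chars)",  0.462,   0.234 },
    { "tests ran in (total)",     694.379, 337.640 },
};

/* How many times longer the Cobra took than the Panda II. */
double slowdown(double cobra_ms, double panda_ms)
{
    return cobra_ms / panda_ms;
}

/* Prints one "name ratio" line per sample. */
void print_slowdowns(void)
{
    size_t i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("%-28s %.2fx\n", samples[i].name,
               slowdown(samples[i].cobra_ms, samples[i].panda_ms));
}
```

For these six rows the ratios come out between roughly 2x and 2.9x, with the loop and array tests at the upper end and the total run time close to 2x.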

I don’t have to say this is very very disappointing. The Cobra is your “professional” board.

I’d still love to see the numbers from a ChipworkX. Gus has said so many times ChipworkX runs 6 times faster - without ever mentioning compared to what, Panda or Cobra. This gave me the impression that Cobra and Panda run at the same speed.

So what shall I do with my Cobra now?


#10

I’ll send you my postal address and you can send it to me seeing how you no longer want it.


#11

I have already said that EMX and USBizi have the same processor but due to many factors, mainly external memory, USBizi will run faster! I am not sure why this is surprising.
And yes ChipworkX is about 6 times the speed of EMX. This is NOT an advertised speed and nothing that I am promising you but this is what I have personally seen when I use jpeg images. GHI didn’t and doesn’t promise any speed or performance but GHI does provide you with RLP and other features to help out when more speed is needed, features you do not see anywhere else :slight_smile:

USBizi and EMX are both professional and commercial offers from GHI. I still do not see where the disappointment is coming from? EMX is (by far) the most featured NETMF module in the world and still priced very reasonably. It has more to offer than ChipworkX, how is this disappointing!!!


#12

I’m still hoping someone with a ChipworkX will post the output from the app.

@ Gus:

I didn’t say you promised anything, nor GHI.

I am well aware that GHI currently has the best hardware feature support for NET MF - better than any other supplier. I also think your price is very reasonable. The last plus is a lively community - this is worth quite a lot.

But when you offer two boards advertising they use the same CPU of 72 MHz, but one of them costs three times more and is for professional use - then I guess you can understand that I am disappointed for having bought said board and afterwards finding out it is not just a tad slower, but actually a LOT slower.

You knew it from the beginning didn’t you?
Still, you formulated it nicely

:wink:

Maybe this is all just so obvious since the Cobra uses external RAM. Well, then I am disappointed that I was so naive, but I learned my lesson.


#13

I am still not sure what you are trying to accomplish here! Do you have a project that EMX can’t handle because of speed? What is it? How is EMX speed not good enough and why can’t you use RLP?

Can we concentrate on your needs and see how I or GHI can help to fit those needs?


#14

Adding to the question above, I see you disable the GC messages in your code; you shouldn’t. You need the messages to be visible, because if GC is firing continuously then you are not seeing accurate results.

…still, the speed is not what we should concentrate on but seeing how we can accomplish your needs.


#15

BTW, I ran the test code after setting the cobra into headless mode to see if it would make a difference. While it’s possible I messed up and the display controller was still active, it seemed to make NO difference. Timings were the same.

Seems to be all external RAM/FLASH access waits.

Since Panda II and Cobra use the same chip and Panda is all internal RAM/FLASH, is there a way to load the program just into internal flash and run only with internal RAM on the Cobra? Or at least get critical routines and operations into the internal RAM & FLASH (I presume that the Cobra has some internal memory).

I tend to agree that the Cobra’s performance, as the near-top-level offering under the ChipworkX solutions, is a bit disappointing.

Also, I think the touch screen with the $200 LCD is horrible for the price in terms of accuracy and response time.


#16

Hello Gus,
I think that if you follow how the thread evolved from the beginning, it is not quite fair to ask what I’m trying to “accomplish here”. It all started with a question, and ended with me expressing a kind of frustration at things being as they are.

I tend to speak out loud when I like things, and also when I do not like them.

We already talked about RLP and I said that RLP in its current form, with a 1 to 2 ms function call overhead (on a Panda, don’t know on the EMX), is not much help. RLP might be super for drawing images or the like - something that would take dozens or hundreds of ms to complete in managed code. But it is useless for speeding up small functions.

RLP is actually pretty devilish if you ask me: people write native code that totally ties them to GHI, since RLP is not standard to .NET. If all code being developed in RLP was done in interop instead, we would actually be driving development of NET MF forward. This is quite a difference!

As for my project: I will concentrate on finishing it, disregarding speed.

By the time I’m ready

  1. I’ll get back to your offer to help get the work done in the time needed
  2. maybe there will be a FEZ Mammoth available
  3. I might just use hardware from another supplier.

I have learned enough about NET MF to say it is a viable option for what I want to accomplish, and a dream to develop for. For this the Panda was an excellent choice.

Now, this problem of interpreted code speed has been discussed so many times here and at the Netduino forums.
All who care about the subject please have a look here http://www.tinyclr.com/forum/1/3618/#/1/msg34245

This definitely sounds like a challenge! Digging into the execution engine of TinyCLR and messing around there… hum… nice place to get started :)

I’d love to hear how strong community interest is for this.

Gus, would you join such a project to help? What about the other GHI guys?
I’ll try to talk to some other NET MF porting pros to see who might help too (Device Solutions, Secret Labs, Sytech Designs, ?).

Imagine having a configurable JIT in NET MF 4.3 - wouldn’t that be a pretty nice feature?

Cheers
Mark


#17

I go through tens/hundreds of emails and forum posts daily, so I apologize if I asked questions you had already answered before. This is why we always recommend a new thread with one direct question.

All I am trying to see is what I/GHI can do to help in completing your project.

JIT sounds great and we talked to Microsoft about it years ago! Unfortunately, GHI is already working on so many improvements for NETMF, some you know of and some you don’t. Maybe we can get to JIT in the future, but even if we do or Microsoft does, this will not help your current project, right? This will take months/years to complete. So we are back to the first question: what can we do “today” to help you?!

Please keep in mind that this free support is shared with the whole community, so the more we argue or discuss things that will not help you today, the more we are basically harming the whole community. I could spend the extra time on writing a tutorial or another free ebook, for example :slight_smile: Please continue to post and we will continue to help as much as we can.


#18

Or you could continue to make nice videos on how to solder SMT stuff :slight_smile:
I do watch one of those every few days…

As I said, I’ll just forget about performance for now.

But since you asked again, what you/GHI could do:

  • think of any way to greatly speed up RLP in a future (soon) release. We know that RLP is slow probably due to passing variable arguments on the managed side (I bet you use reflection to pass that over to the RLP backend in Interop). So a set of “fast RLP” functions with minimal overhead / limited argument passing choices would be great!

  • what I need most is a way to convert numbers (Int32, float) to a byte array so they can be sent out to a bunch of stepper motors on an RS485 network. The stepper controllers unfortunately use an ASCII-based protocol. A bare Int32 to string + get byte array takes 1-2 ms on the Panda, more on Cobra. I need to send around 10-15 of these commands, worst case might be even more, to 5 motors within 20 ms.

So if you provide a set of fast utility functions for int => byte array, byte array => int to the GHI libraries that would also help me avoid the performance problem. At least the biggest one I identified so far.
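
As a rough illustration of the “byte array => int” direction for an ASCII protocol, here is a minimal sketch in C. The function name `ascii_to_int32` and its signature are hypothetical, not an existing GHI or NET MF call, and it assumes the input is an optional minus sign followed only by decimal digits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parses an optional '-' followed by decimal digits
 * from buf[0..len). Returns true and stores the value in *out on success;
 * returns false on empty input, a non-digit byte, or a value that does not
 * fit in int32_t. */
bool ascii_to_int32(const uint8_t *buf, size_t len, int32_t *out)
{
    size_t i = 0;
    bool neg = false;
    uint32_t acc = 0;
    uint32_t limit;

    if (len > 0 && buf[0] == 0x2D) {   /* 0x2D is '-' in ASCII */
        neg = true;
        i = 1;
    }
    if (i == len)
        return false;                  /* no digits at all */

    /* Largest allowed magnitude: 2147483648 if negative, else 2147483647. */
    limit = neg ? 2147483648u : 2147483647u;

    for (; i < len; i++) {
        uint32_t d;
        if (buf[i] < 0x30 || buf[i] > 0x39)
            return false;              /* not an ASCII digit '0'..'9' */
        d = (uint32_t)(buf[i] - 0x30);
        if (acc > (limit - d) / 10)
            return false;              /* acc * 10 + d would exceed limit */
        acc = acc * 10 + d;
    }

    if (neg)
        *out = (acc == 2147483648u) ? INT32_MIN : -(int32_t)acc;
    else
        *out = (int32_t)acc;
    return true;
}
```

Whether something like this pays off on the device depends on how it is called; the per-call RLP overhead mentioned in this thread would apply to a native version.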

Not sure how useful this is for others.

If I had to choose between one or the other I’d definitely say enhance RLP in the short run, help resurrect the JIT in the long run :wink:


#19

Those are already built in :slight_smile: for int and for float. They are in Util and in Utilities classes.


#20

Thanks.
But unfortunately not - I had looked into those but they don’t really help.

My description above was a bit misleading in the end. My special need is to convert numbers to/from an ASCII coded byte array.

So:
UInt32: 4000
ToString(): "4000"
GetBytes: 0x34, 0x30, 0x30, 0x30

Alternatively:
Place each digit in an array (loop doing v % 10)
Since the 0 starts at hex code 0x30 in ASCII, offset each byte by that.

I coded this in managed C#; it works but is still slow, though faster than the ToString() variant.
http://www.torsten-horn.de/techdocs/ascii.htm

So the number of bytes used is the number of digits, and optional sign byte.
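
The digit-by-digit approach described above can be sketched in C roughly like this. It is not the code from my project or the RLP version; the name `int32_to_ascii` is made up for illustration, and it assumes the caller supplies the output buffer (11 bytes covers any Int32, including the sign).

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper: writes the decimal ASCII form of v into out,
 * without a terminating NUL. Returns the number of bytes written,
 * or 0 if cap is too small. */
size_t int32_to_ascii(int32_t v, uint8_t *out, size_t cap)
{
    uint8_t tmp[10];   /* digits, least significant first (max 10 digits) */
    size_t n = 0;
    size_t len = 0;
    /* Take the magnitude as unsigned so that INT32_MIN does not overflow. */
    uint32_t u = (v < 0) ? (uint32_t)0 - (uint32_t)v : (uint32_t)v;

    do {
        tmp[n++] = (uint8_t)(0x30 + (u % 10));   /* '0' is 0x30 in ASCII */
        u /= 10;
    } while (u != 0);

    if (n + (v < 0 ? 1u : 0u) > cap)
        return 0;                                /* buffer too small */

    if (v < 0)
        out[len++] = 0x2D;                       /* optional sign byte '-' */
    while (n > 0)
        out[len++] = tmp[--n];                   /* most significant digit first */
    return len;
}
```

Calling it with 4000 and an 11-byte buffer returns 4 and fills the buffer with 0x34, 0x30, 0x30, 0x30.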

Cheers
mark