How long does it take to fix a bug? Over a year of work!

Gary_Beaver · August 17, 2015, 8:38am

We come to you today with very good news, but first let us give you some history. Maybe you will learn something as well.

Over a year ago, one of our important customers contacted us to say that some EMX modules were failing at high temperatures, around 70 degrees Celsius (158 degrees Fahrenheit). They had used EMX for years with no problems, so something must have changed. We started looking into lot numbers and manufacturing dates. It all looked random, but we were certain that this problem had not existed before. We could not pin down a definite date, but a few things had changed: the SDRAM we used had been discontinued and we had replaced it with a compatible version, we had moved to a new building, our processors had new manufacturing dates, and we had replaced the stencil printer with a high-end (and very expensive) solder jet printer.

The first candidate was the solder jet printer. This is a new technology that allows companies to print solder right onto the raw circuits instead of using stencils. While it costs hundreds of thousands of dollars, it may not be perfect. There was also a concern that the reflow process was not tuned correctly for the type of solder we use with the solder jet printer. So a few samples were shipped to one of our partners who owns an X-ray machine. They found no problems in the soldering quality. OK, good… it is not the machines!

We then started looking into the SDRAM, but we learned that some modules never fail even though they use the exact same SDRAM. Still, we went straight to the SDRAM datasheet and triple-checked everything. No problems were found there.

Now we were sure it was a static issue, right? No, wrong. An independent company was hired to test our facility for static, and they found no issues.

So it had to be a bad batch of processors, right? We quickly got hold of NXP and started working with them on investigating the issue. According to them, it was a moisture issue and the chips needed to be baked! This was the worst answer they could have given us, because we are now sure this was not the problem and that they simply pulled out this random answer to get us off their back! We spent weeks, maybe months, baking and trying different things. We even went as far as thinking the problem was related to an error in our manufacturing.

At this point, we had spent months upon months of work, with literally thousands of dollars just running down the drain. But we knew nothing would stop our engineers from finding the mysterious issue.

The last resort was to test the individual components on the EMX module: CPU, flash, RAM and peripherals. The difficult part was that EMX would only fail after it was put in a heat chamber and left running for a few hours. You can imagine the frustration there. We would write a test program, load it on EMX, put EMX in the oven, and wait a few hours just to see whether that individual test would fail. For that, we kept a few EMX modules cycling through different tests.
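To give a feel for what one of those per-component tests could look like, here is a minimal sketch of a RAM pattern test that can be repeated for the length of a heat soak. This is not our actual test code: the function names `ram_pattern_errors` and `soak_first_failure`, the seed value and the iteration count are all made up for illustration, and a real test would run against the SDRAM region rather than an ordinary buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write an address-dependent pattern and its inverse
   over a region, read each back, and count the words that do not match. */
size_t ram_pattern_errors(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t errors = 0;
    for (int pass = 0; pass < 2; pass++) {
        /* Second pass inverts every bit so each cell holds both 0 and 1. */
        uint32_t flip = pass ? 0xFFFFFFFFu : 0u;
        for (size_t i = 0; i < words; i++)
            base[i] = (seed ^ (uint32_t)i) ^ flip;
        for (size_t i = 0; i < words; i++)
            if (base[i] != ((seed ^ (uint32_t)i) ^ flip))
                errors++;
    }
    return errors;
}

/* Hypothetical soak loop: repeat the test with a changing seed.
   Returns 0 if every iteration passed, otherwise the 1-based number
   of the first iteration that saw a mismatch. */
uint32_t soak_first_failure(volatile uint32_t *base, size_t words,
                            uint32_t iterations)
{
    for (uint32_t n = 1; n <= iterations; n++)
        if (ram_pattern_errors(base, words, 0xA5A5A5A5u + n) != 0)
            return n;
    return 0;
}
```

Running one such loop per component (one build for RAM, one for flash, and so on) is what lets a failure in the oven point at a single part instead of the whole module.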

We finally found the issue! One of the timing parameters was setting the clocks right on the edge of accepted levels. We slowed things down very slightly and EMX never failed again.

But whose fault is this? And why did it take so long and cost so much to solve? We want to say it is simply bad luck. The same values worked on thousands of EMX modules running in hundreds of products around the world for about 7 years! Then one day, something in the CPU, in the SDRAM, in the PCB impedance, or a combination of those caused a few to fail at extreme temperatures.

We completely understand our customer’s frustration over this issue. There was unfortunately no magical answer. Only hard work and dedication solved this mystery. Please accept our apologies for what happened; we hope everyone understands that this was beyond anyone’s control.

The fix is available in the current beta release. Thank you for continuing to believe in GHI Electronics and standing behind our products.

SDK: https://www.ghielectronics.com/support/netmf/sdk/37/ghi-electronics-netmf-sdk-2015-r1-pre-release-4

Wolfgang_Feneberg · August 17, 2015, 9:19am

Better late than never 8)

Mike · August 17, 2015, 9:51am

while reading Gary’s post, this exact thought came to mind. we must have a similar engineering background!

Gus_Issa · August 17, 2015, 11:02am

What makes me pull my hair out is how thousands of modules worked fine for 7 years and then one day they stopped working.

Duke_Nukem · August 17, 2015, 12:51pm

In the world of IT it’s a wonder when shit doesn’t happen.

Mike · August 17, 2015, 2:36pm

for those who are wondering, shit is short for semi hardened information technology.

Jay_Jay · August 17, 2015, 3:38pm

Global warming is to blame :whistle:

Gene · August 17, 2015, 8:56pm

Congratulations! I’ve been chasing the dreaded random reboot issue for months and tens of thousands of dollars, but I am nearly convinced it is fixed in the latest pre-release. So I hope to be in the same happy place you are soon.

May I ask what timing parameter was the culprit? What were its min and max acceptable values and what was it set at?

Thanks - Gene

Gus_Issa · August 17, 2015, 9:15pm

It is in the memory accelerator delay cycles. The change would not affect the device speed.
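As a rough illustration of how a delay-cycle setting can sit "right on the edge", the sketch below computes the smallest whole number of CPU clocks that covers a memory access and how much slack is left over. The 72 MHz clock and 40 ns access time used in the note after the code are made-up numbers, not values from this thread or from any datasheet, and the function names are hypothetical.

```c
#include <stdint.h>

/* Smallest whole number of CPU clocks that covers one memory access.
   Computes ceil(access_ns * cpu_mhz / 1000) in integer math. */
uint32_t min_delay_cycles(uint32_t cpu_mhz, uint32_t access_ns)
{
    return (access_ns * cpu_mhz + 999u) / 1000u;
}

/* Slack in picoseconds between the time the configured cycles allow
   and the time the memory needs. A negative result means the setting
   is too fast for that access time. */
int32_t margin_ps(uint32_t cpu_mhz, uint32_t access_ns, uint32_t cycles)
{
    int64_t window_ps = (int64_t)cycles * 1000000 / cpu_mhz;
    return (int32_t)(window_ps - (int64_t)access_ns * 1000);
}
```

With those assumed numbers, `min_delay_cycles(72, 40)` gives 3 cycles and `margin_ps(72, 40, 3)` leaves under 2 ns of slack, while one extra cycle leaves more than 15 ns. If the real access time drifts to 42 ns, `margin_ps(72, 42, 3)` goes negative. A setting with that little slack can pass for years and still fail once parts and temperature line up the wrong way.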