EMX: Temperature issue with some of our products

We are using the EMX module in a few products and have shipped some hundreds of units. Some have come back after being observed to fail in the field. I’m still looking for answers and haven’t quite got to the bottom of it, so we’re interested to see if anyone else has a similar story. I realise it’s difficult to get all the details out in a forum post. Theories, speculations and questions are welcome at this point.

After some tests we supposed that temperature might be a factor. We concluded that the suspect units were all failing, without other prompting, at temperatures below those expected for the product as a whole and also below those specified for the EMX module. The temperatures at which they failed were above an ambient air temperature of 30 degrees Celsius. All temperatures I quote are in degrees Celsius.

We had four samples with similar faults. I’ve since accidentally destroyed one while trying some experiments, so I still have three that display the unwanted symptom. We did a number of tests to see roughly at what temperature they fail. The product has some circuits other than the EMX module, so we checked first to see if they were the cause of the problem. Having eliminated those as suspects, we were more or less left with the EMX module to look at.

Further temperature tests showed that each unit would fail at about the same temperature each time. We now had a box set up that cables and temperature probes could enter while the device was under test. We had thought that the point of failure might be related to a particular point in time in the running program, or to the number of iterations a routine had reached, or something like that. Changing the program to a simple small program disproved that. We also regulated the time it took the unit to reach the temperature at which we were seeing it fail.

Sometimes when the unit failed we would see in the USB debug log a short spurt of RAM memory listing before the USB and the microcontroller failed. It seemed to depend on which program we had loaded into the unit. Other times, usually on a different unit, it just stopped and there would be no hint in the debug output that it had failed. We would have to notice that the indicator LEDs were not flashing or notice the absence of periodic debug outputs. When we had an external watchdog enabled, the failure was noticed a minute or two later when the watchdog tried to reset the EMX module. Once it had reached this failure, a reset usually did not help. Neither did a full power down/power up reset. If it restarted at all after such a reset, it would fail again within 20 seconds. To get it to operate again, the unit had to be left to cool to perhaps 10 to 20 degrees below the fail temperature before it would reliably restart and operate for a time.
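The "absence of periodic debug outputs" check above could be automated on the host side rather than spotted by eye. The following is only a minimal sketch in Python, assuming a test log that pairs each temperature reading with a flag saying whether a debug line arrived since the previous reading. The `Sample` record, the `find_lockup` function and the 15 second timeout are hypothetical, chosen for illustration, and are not taken from the test setup described in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Sample:
    t: float         # seconds since the start of the test
    temp_c: float    # temperature reading in degrees Celsius
    heartbeat: bool  # True if a debug line arrived since the previous sample


def find_lockup(samples: Iterable[Sample],
                timeout_s: float = 15.0) -> Optional[Tuple[float, float]]:
    """Return (time, temperature) of the last heartbeat that was followed
    by more than timeout_s of silence, or None if no such silence occurs."""
    last_beat: Optional[Sample] = None
    for s in samples:
        if s.heartbeat:
            last_beat = s
        elif last_beat is not None and s.t - last_beat.t > timeout_s:
            return (last_beat.t, last_beat.temp_c)
    return None


# Made-up readings: the unit goes quiet after t = 30 s.
example_log = [
    Sample(0, 40.0, True),
    Sample(10, 47.0, True),
    Sample(20, 52.0, True),
    Sample(30, 55.0, True),
    Sample(40, 56.0, False),
    Sample(50, 57.0, False),
]
lockup = find_lockup(example_log)
```

Recording the temperature at the last heartbeat, rather than at the moment the silence is noticed, avoids overstating the fail temperature while the box keeps heating.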

Testing against other units we had in the office, we could not get them to fail even when the ambient temperature was higher than the highest we might have imagined. In such tests we allowed the ambient temperature to rise above 70 degrees C. In other tests with units that had not displayed any temperature-related issues, we allowed the temperature to rise above 85 degrees C, which is above the specified temperature of the module and the silicon. Those units continued to work under these conditions.

Such temperature tests have not been performed before shipping. So far we only have the four samples with largely proven symptoms. I think we can’t tell how poor the yield has been until the products have been used in the field for some time and failures are reported. We are in the process of taking bunches out of current batches to test for the symptoms.

We were interested in what part of the module was most temperature affected. We ran an experiment to heat up different components of the module (RAM, flash, uC, Ethernet PHY, crystals) one at a time, while trying to avoid thermal shock and trying to keep the heat concentrated on the component under test. The NXP microcontroller was clearly the most sensitive component during these tests. Other components withstood 90 deg C to 100 deg C without the unit failing.

We set up some experiments to measure the microcontroller chip surface temperature and the temperature of the air above the components. One sensor, designated CSL, was a laser guided infrared temperature sensor pointed at the microcontroller surface. One sensor, designated CSK, was a K type thermocouple probe in contact with the chip surface (towards the centre). The third, designated BA, was a K type thermocouple measuring air temperature about 10mm to 15mm above the surface of the module. The whole experiment was inside a plastic box. I was a little skeptical about the readings from the laser guided infrared sensor. I tended to rely more on the K type attached to the surface of the IC.

We put a resistor heater in the box as well (30W to 45W) to accelerate the tests.

This dropbox folder below has a PDF of graphs showing temperature plots during the experiments. They might need a little explanation. Even though the temperatures in the tests in the PDF might appear a little high we still believe the module should be able to continue working in those conditions.

We did 4 tests of each of the three that fail within specified temperature maximums. For comparison there are two tests of one that is difficult to make fail. In the first test of the OK one, I couldn’t get it to fail and you can see where I let the experiment cool back down. It was hard to keep the box at these temperatures and you can see at the end where I heat the box a little externally and then check some readings and then heat it again. In all the other graphs the final recorded temperature was the temperature at which the unit appeared to fail.

In the second test of the OK one I did get it to fail (at an ambient temperature somewhere in the 8x degrees Celsius). At the very least the microcontroller reset, but when it did, it started up OK and continued to operate, which is not the same behaviour as I had seen with the others.

The failing sample #1 seems to fail around Chip surface: 55 C, Box ambient: 48 C
The failing sample #2 seems to fail around Chip surface: 68 C, Box ambient: 59 C
The failing sample #3 seems to fail around Chip surface: 71 C, Box ambient: 59 C
The OK sample #4 seemed to fail around Chip surface: 92.8 C, Box ambient: 81.6 C

I’m not too worried about the OK sample.
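For reference, the gap between chip surface and box ambient in the figures above can be worked out directly. This is just arithmetic on the four quoted pairs, sketched in Python; the dictionary layout and the `rise_c` helper are made up for illustration.

```python
# (chip surface, box ambient) at the point of failure, degrees Celsius,
# copied from the four summary lines above.
fail_points = {
    "failing #1": (55.0, 48.0),
    "failing #2": (68.0, 59.0),
    "failing #3": (71.0, 59.0),
    "OK #4": (92.8, 81.6),
}


def rise_c(chip_c: float, ambient_c: float) -> float:
    """Chip surface temperature rise above box ambient, to 0.1 degree."""
    return round(chip_c - ambient_c, 1)


rises = {name: rise_c(chip, amb) for name, (chip, amb) in fail_points.items()}

for name, rise in rises.items():
    chip, amb = fail_points[name]
    print(f"{name}: chip {chip} C, ambient {amb} C, rise {rise} C")
```

In these runs the chip surface sat somewhere between 7 and 12 degrees above the box ambient at the point of failure, so an ambient figure alone understates what the microcontroller package was seeing.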

Some other experiments and actions we are looking at are:

  • testing with a range of other random production stock samples
  • getting more back from the field that are seen to fail to test for the same symptom
  • go back with the failing samples and test at a highish but regulated ambient temperature
  • Get the board and module x-rayed to see if there is anything going on under the chip or under the module maybe
  • Get the microcontroller of one or more failing samples reballed to see if it was an issue with a solder ball

I’ve been trying to do as many non-destructive tests as I can on the samples I have, because I don’t know if more will come back from the field with the same symptoms.

There’s more to the story and I’ve probably left out some context but I’m hoping this is enough to get some ideas and feedback about the issues we are seeing.

If you got to here, thanks for persevering,
John Dowdell

I suggest you call Gus at GHI on the telephone.

Please call or email us directly. We will assist you promptly.

I think I can add a little to this story.

On my “Spider” mainboard I noticed the same temperature issue.
It was standing on my desk in direct sunlight and without any warning it suddenly stalled.
I noticed this behavior a couple of times but just moved the unit out of the sun and the problem disappeared.

I also noticed the issue worsened when using the Ethernet J11D module.

After switching to the wireless module the problem also was gone.

Nevertheless I noticed the Spider / EMX modules are very “picky” about supply voltage.
I really don’t know if this is linked to the problems described but I think it’s worth trying or investigating.

Tom.

Are you an Aussie, John? :) Welcome, didn’t know you were a netmf guy!

Sorry for coming in so late with this reply, but we’ve just had some very similar problems to what John describes so well. I don’t know whether he found a cause or solution, it would be nice to have some closure to the thread.

Here it started with a report from one of our client sites that on hot days one of the buttons (the “select button”) stopped working, and they had to hold the device out the window to cool before the button worked again. With all their meters. They said the ambient temperature was around 50C when this happened. This had never been reported by other sites, which had worked over a couple of summers with temperatures as high or higher.

We got a couple shipped back for inspection, and they were right. The button stopped working somewhere above 40C.
In this design the “select” button input is pulled low by a FET, with the gate pulled high when the actual button is pressed, and tied to ground by a 100K resistor. A Schottky diode from gate to 4V battery was reverse leaking way above spec at high temperature, pulling the gate high (up to 2V at 50C) and holding the button continuously ON.
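As a rough check on the numbers above, the leakage current implied by the gate voltage follows from Ohm’s law across the pull-down resistor. A minimal sketch in Python, assuming "100K" means 100 kΩ, the button is not pressed, the FET gate itself draws negligible current, and the diode leakage is the only current flowing into the gate node. The function names are made up for illustration.

```python
def leakage_current_a(gate_v: float, pulldown_ohm: float) -> float:
    """Current through the pull-down resistor needed to hold the gate at
    gate_v. Under the assumptions above this equals the diode leakage."""
    return gate_v / pulldown_ohm


def gate_voltage_v(leakage_a: float, pulldown_ohm: float) -> float:
    """Gate voltage produced by a given leakage current into the pull-down."""
    return leakage_a * pulldown_ohm


PULLDOWN_OHM = 100e3   # the 100K gate-to-ground resistor
GATE_V_AT_50C = 2.0    # gate voltage observed at 50C

i_leak = leakage_current_a(GATE_V_AT_50C, PULLDOWN_OHM)
print(f"Implied reverse leakage at 50C: {i_leak * 1e6:.0f} uA")
```

Under those assumptions the 2V reading corresponds to about 20 uA of reverse leakage, and a lower-value pull-down would reduce the gate voltage for the same leakage in proportion.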

We found that there was a batch of ON-Semi diodes which were leaking way above the curves in their spec sheets, whilst the NXP diodes we used before and after were fine. All worked ok at normal temperatures. Replacing the diode fixed the problem, and we are working through getting the ones in the field shipped back and updated. Luckily it appears the majority of builds with the out-of-spec diode ended up in Canada, where we haven’t had any reports of 50C day problems!

So far, this is a different problem than the one John reported. HOWEVER, we then thought it prudent to temperature test the boards we had in stock. I started on this, and quickly ran into John’s problem. Whilst most boards work quite happily at around 66C (PCB temp measured with IR Pyrometer and thermocouple), some just “lock up” well before that. I have one that just stops between 52-55C, and won’t reset until it is allowed to cool.

Another was worse: it worked perfectly well and passed all tests at 20C, but locked up at 34C. Again it stayed locked up until cooled, then worked properly until heated, when it locked again.

I then attempted to reprogram it, but the updated program wouldn’t download. I used EMXUpdater to erase/reinstall the firmware, which reported as successful, but it still refused to accept a program. After repeated attempts, I have put it aside for now.

I have plenty more queued up to test, but if there was any resolution to John’s original message it would be good to hear.

David

We wouldn’t mind testing a few boards for you. Please contact us directly.

Adding some more data points, I’ve just finished temperature testing another 20 previously loaded and tested boards from stock, and of those only one has a lock-up problem, consistently between 56-57C. That I had two with problems out of the first four I tested must have been a coincidence.
I’ve also tested several more back from the field, with no problems up to 60C.
The one that locks up at 56-57C shouldn’t be a problem in the field anyway, as I don’t see them (the operators) working in an ambient that high.
When I get the time I’ll test some of the unloaded modules, just in case there is a problem with any of them, and if so ship them back to GHI for diagnosis.
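To put a rough number on how uncertain these counts are (2 lock-ups in the first 4 boards, then 1 in the next 20), here is a small sketch in Python that pools them and computes a 95% Wilson score interval for the lock-up rate. It assumes the boards are independent samples from the same population, which boards pulled from stock may not be, and the `wilson_interval` helper is written here for illustration only.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, tested: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    p = failures / tested
    denom = 1 + z * z / tested
    centre = (p + z * z / (2 * tested)) / denom
    half = z * math.sqrt(p * (1 - p) / tested
                         + z * z / (4 * tested * tested)) / denom
    return centre - half, centre + half


# Counts from the posts above: (boards that locked up, boards tested).
first_batch = (2, 4)
second_batch = (1, 20)

failures = first_batch[0] + second_batch[0]
tested = first_batch[1] + second_batch[1]
low, high = wilson_interval(failures, tested)
print(f"{failures}/{tested} locked up; 95% interval {low:.1%} to {high:.1%}")
```

With only 24 boards the interval is wide, roughly 4% to 31%, so these counts on their own don’t pin down the rate in the wider population.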

We have a very sweet temperature chamber.

Hi all,

does this thread have a solution? We have a very similar problem: no problems at room temperature, but random watchdog restarts with the EMX above 40 deg C. Some boards work fine all the time, some reboot at 60, etc. Can you confirm problems in an EMX production batch?

We will appreciate any help.

Thanks a lot.
Vasek

@ vasek - Do you have a few boards that you can send to us?

Sure we can send you some of the most problematic boards, but EMX modules are already part of our hardware design, so we can send you only whole devices (although they are not that much bigger). Please let me know the postal address where we should send devices.

Can you please confirm there is no such thing as a “problematic batch”? How did the others in this thread solve the issue? Was the problem in the EMX or in something else?

Thanks
Vasek

@ vasek - The problems in the past have been a combination of different things, I will email you shipping address etc.

What “things”, and what should we check in our PCB design before we send you devices? Maybe it’s something trivial in our design that we have missed so far.

Thanks
Vasek

@ vasek - The most recent issue was the reflow process. It would honestly be easier to just send a board to us than to try to diagnose hardware issues over the forum. I emailed you already; let me know if you don’t get it.