We are using the EMX module in a few products and have shipped some hundreds of units. Some have come back after being observed to fail in the field. I’m still looking for answers and haven’t quite got to the bottom of it. We’re interested to see if anyone else has a similar story. I realise it’s difficult to get all the details into a forum post. Theories, speculation and questions are welcome at this point.
After some tests we suspected that temperature might be a factor. We concluded that the suspect units were all failing, without other prompting, at temperatures below those expected for the product as a whole and also below those specified for the EMX module. The failures occurred at surrounding air (ambient) temperatures above 30 degrees Celsius. All temperatures I quote are in degrees Celsius.
We had four samples with similar faults. I’ve since accidentally destroyed one while trying some experiments, so I still have three that display the unwanted symptom. We ran a number of tests to find roughly what temperature they fail at. The product has circuits other than the EMX module, so we first checked whether they were the cause of the problem. Having eliminated those as suspects, we were more or less left with the EMX module to look at.
Further temperature tests showed that each unit would fail at about the same temperature each time. By now we had a box set up that let cables and temperature probes enter while the device was under test. We had thought that the point of failure might be related to a particular point in the running program, or to the number of iterations of a routine it was up to, or something like that. Changing to a simple small program disproved that, and we could regulate the time it took the unit to reach the temperature at which we were seeing it fail.
Sometimes when the unit failed we would see a short burst of RAM listing in the USB debug log before the USB and the microcontroller failed. This seemed to depend on which program we had loaded into the unit. Other times, usually on a different unit, it just stopped and there was no hint in the debug output that it had failed. We would have to notice that the indicator LEDs were not flashing, or notice the absence of periodic debug output. When we had an external watchdog enabled, the failure was noticed a minute or two later when the watchdog tried to reset the EMX module. Once the unit had reached this failure, a reset usually did not help. Neither did a full power down/power up. If it restarted at all after such a reset, it would fail again within 20 seconds. To get it to operate again, the unit had to be left to cool to perhaps 10 to 20 degrees below the fail temperature before it would reliably restart and operate for a time.
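For the silent failures, something along these lines could pick out the stall automatically rather than relying on someone watching the LEDs. It is only a sketch in Python, not part of our actual test rig: the function name `stall_point`, the shape of the two logs and the timeout value are all assumptions made for illustration. It takes the arrival times of the periodic debug lines plus a time-stamped temperature log, and reports the last temperature recorded at or before the final debug line once the unit has been silent for too long.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def stall_point(
    heartbeats: Sequence[float],
    temp_log: Sequence[Tuple[float, float]],
    now: float,
    timeout_s: float,
) -> Optional[Tuple[float, Optional[float]]]:
    """Return (time of last debug line, temperature at that time) if stalled.

    heartbeats: arrival times in seconds of periodic debug lines, ascending.
    temp_log:   (time_s, temp_c) samples, ascending by time (hypothetical format).
    now:        current time in the same time base.
    timeout_s:  longest silence tolerated before calling it a stall (assumed).
    Returns None if no debug line was ever seen or the unit is still talking.
    """
    if not heartbeats:
        return None
    last = heartbeats[-1]
    if now - last <= timeout_s:
        return None  # still within the allowed silence, not a stall yet
    # Find the latest temperature sample taken at or before the last debug line.
    times = [t for t, _ in temp_log]
    idx = bisect_right(times, last) - 1
    temp = temp_log[idx][1] if idx >= 0 else None
    return last, temp
```

The timeout would need to be comfortably longer than the debug output period, otherwise an ordinary gap between lines would be reported as a failure.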
Testing against other units we had in the office, we could not get them to fail even when the ambient temperature was higher than the highest we would have imagined. In such tests we let the ambient temperature rise above 70 degrees C. With units that had not displayed any temperature-related issues, we let the temperature rise above 85 degrees C, which is above the specified temperature of the module and the silicon. Those units continued to work under these conditions.
No such temperature tests were performed before shipping. So far we only have the four samples with largely proven symptoms. I think we can’t tell how poor the yield has been until the products have been used in the field for some time and are then reported to keep failing. We are in the process of taking samples from current batches to test for the symptoms.
We were interested in which part of the module was most affected by temperature. We set up an experiment to heat individual components of the module (RAM, flash, microcontroller, Ethernet PHY, crystals) while trying to avoid thermal shock and to keep the heat concentrated on one component at a time. The NXP microcontroller was clearly the most sensitive component in these tests. The other components withstood 90 deg C to 100 deg C without the unit failing.
We set up some experiments to measure the microcontroller chip surface temperature and the temperature of the air above the components. One sensor, designated CSL, was a laser-guided infrared temperature sensor pointed at the microcontroller surface. The second, designated CSK, was a K-type thermocouple probe in contact with the chip surface (towards the centre). The third, designated BA, was a K-type thermocouple measuring air temperature about 10 mm to 15 mm above the surface of the module. The whole experiment was inside a plastic box. I was a little skeptical about the readings from the infrared sensor and tended to rely more on the K-type thermocouple attached to the surface of the IC.
We put a resistor heater in the box as well (30W to 45W) to accelerate the tests.
The Dropbox folder below has a PDF of graphs showing temperature plots during the experiments. They might need a little explanation. Even though the temperatures in the tests in the PDF might appear a little high, we still believe the module should be able to continue working in those conditions.
We did four tests on each of the three units that fail within the specified temperature maximums. For comparison there are two tests of one that is difficult to make fail. In the first test of the OK one I couldn’t get it to fail, and you can see where I let the experiment cool back down. It was hard to keep the box at these temperatures, and you can see at the end where I heat the box a little externally, check some readings and then heat it again. In all the other graphs the final recorded temperature is the temperature at which the unit appeared to fail.
In the second test of the OK one I did get it to fail (at an ambient temperature somewhere in the 8x degrees Celsius). At the very least the microcontroller reset, but when it did, it started up OK and continued to operate, which is not the same behaviour as I had seen with the others.
The failing sample #1 seems to fail around Chip surface: 55 C, Box ambient: 48 C
The failing sample #2 seems to fail around Chip surface: 68 C, Box ambient: 59 C
The failing sample #3 seems to fail around Chip surface: 71 C, Box ambient: 59 C
The OK sample #4 seemed to fail around Chip surface: 92.8 C, Box ambient: 81.6 C
I’m not too worried about the OK sample.
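One thing worth pulling out of those four results is how far the chip surface sits above the box ambient at the point of failure. The short Python sketch below just does that subtraction on the numbers quoted above; the names `FAIL_POINTS` and `chip_rise` are made up for the example.

```python
# Failure points copied from the four results above, in degrees Celsius:
# (chip surface, box ambient).
FAIL_POINTS = {
    "failing #1": (55.0, 48.0),
    "failing #2": (68.0, 59.0),
    "failing #3": (71.0, 59.0),
    "OK #4": (92.8, 81.6),
}


def chip_rise(points):
    """Return chip-surface minus box-ambient temperature for each sample."""
    return {name: round(chip - amb, 1) for name, (chip, amb) in points.items()}


for name, rise in chip_rise(FAIL_POINTS).items():
    print(f"{name}: chip surface is {rise} C above box ambient at failure")
```

The rise comes out between 7 C and about 12 C for all four, including the OK sample. That may suggest the failing units are not simply running hotter than the good one and instead give up at a lower chip temperature, although with one thermocouple per test I would not read too much into differences of a few degrees.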
Some other experiments and actions we are looking at are:
- testing a range of other random production stock samples
- getting more units back from the field that are seen to fail, to test for the same symptom
- going back to the failing samples and testing at a highish but regulated ambient temperature
- getting the board and module x-rayed to see if there is anything going on under the chip or under the module
- getting the microcontroller of one or more failing samples reballed to see if it is an issue with a solder ball
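For the regulated ambient test in the list above, one simple option would be on/off control of the resistor heater with a bit of hysteresis, so the box hovers around a chosen temperature instead of ramping until the unit fails. The Python sketch below shows only the decision logic. The function name `heater_command`, the setpoint and the band width are assumptions, and reading the box air thermocouple and switching the heater are hardware specific and left out.

```python
def heater_command(
    ambient_c: float,
    heater_on: bool,
    setpoint_c: float,
    band_c: float = 1.0,
) -> bool:
    """On/off control with hysteresis: return True if the heater should be on.

    ambient_c:  latest box air temperature in degrees Celsius.
    heater_on:  current heater state.
    setpoint_c: target box ambient temperature (chosen by the tester).
    band_c:     half-width of the dead band around the setpoint (assumed 1 C).
    """
    if ambient_c <= setpoint_c - band_c:
        return True  # too cold, switch the heater on
    if ambient_c >= setpoint_c + band_c:
        return False  # too hot, switch the heater off
    return heater_on  # inside the band, keep the current state
```

Calling this once per temperature reading and feeding the result back in as `heater_on` stops the heater from chattering when the reading sits right at the setpoint.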
I’ve been trying to do as many non-destructive tests on these samples as I can, because I don’t know if more will come back from the field with the same symptoms.
There’s more to the story and I’ve probably left out some context but I’m hoping this is enough to get some ideas and feedback about the issues we are seeing.
If you got to here, thanks for persevering,