We have a server at a client's site that has been showing signs of instability (random reboots and a couple of panics).
Suspecting hardware problems, I brought it down and ran memtest86 on it.
Then something odd happened.
Memtest shows plenty of memory errors, spread over pretty much the whole of the installed 512MB. When I stopped the tests halfway through the first pass there were two pages of errors; the lowest was at about 50MB and the highest at circa 430MB.
But the really strange thing is that the error bits are 00000000, i.e. the test data and the "bad" data match perfectly. There is no ECC on this machine, so how can the data I put in the memory and the data I take out be the same and yet there be a memory error? Memtest clearly shows the "good" and "bad" columns with matching data but still flags these as errors.
I have ordered replacement RAM anyway, but I was just interested in what other type of memory error I am seeing here?
On Wed, Jun 07, 2006 at 06:30:54PM +0100, Wayne Stallwood wrote:
> But the really strange thing is that the error bits are 00000000, i.e. the test data and the "bad" data match perfectly. There is no ECC on this machine, so how can the data I put in the memory and the data I take out be the same and yet there be a memory error? Memtest clearly shows the "good" and "bad" columns with matching data but still flags these as errors.
> I have ordered replacement RAM anyway, but I was just interested in what other type of memory error I am seeing here?
Could it be related to a memory timing problem? I've got a weird problem at the moment with an i810-based machine. It's running Windows but just gets lots of weird errors, random reboots, crashes etc. while using it (no, this is more than the normal amount, before anyone says) ;)
Anyhow, I've run memtest86+ on it several times and the RAM is OK. I even swapped the RAM around in the memory slots in case it was the portion of RAM used as video memory, and it's all OK. In desperation I put the old RAM from this machine back in and it's perfectly OK. The only major difference is that the old RAM is CAS3 and the newer stuff CAS2. Even more weird is that the machine ran Linux fine with no problems at all! I'm still at a bit of a loss to explain what's going on; the only other thing I can think of is a driver problem for the i810 chipset and it not liking faster memory.
Thanks
Adam
On Thu, 2006-06-08 at 10:34 +0100, Adam Bower wrote:
> Could it be related to a memory timing problem? I've got a weird problem at the moment with an i810-based machine. It's running Windows but just gets lots of weird errors, random reboots, crashes etc. while using it (no, this is more than the normal amount, before anyone says) ;)
The thought had crossed my mind, but this memory is the vendor-installed, factory-original part and the machine was faultless until recently.
Therefore I think it is more likely to be a fault than an incorrect specification.
I have mailed the author of memtest86 to see if he has any input. In any case the RAM is being replaced tomorrow morning (which will hopefully clear the fault).
I meant to post back to the list on this ages ago.
Just for future reference, the author of Memtest got back to me and told me that errors like this usually mean a problem with either the Northbridge or the CPU.
Sure enough, when I returned to site with the memory I had ordered, I noticed that the CPU heatsink support frame had become partly detached and therefore the heatsink wasn't in full contact with the CPU. Fortunately no obvious permanent damage seems to have been done to the CPU, and reattaching the support frame has cured the fault.
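For anyone puzzling over the same display in future, here is a rough sketch of the reasoning, assuming the "error bits" column is simply the XOR of the expected and observed words. This is not memtest86's actual code; the function names `error_bits` and `classify` are made up for illustration. A flagged error with an all-zero mask means the two words being reported are identical, so the comparison itself (or the path the data took to reach it) is the likelier suspect rather than the memory cells.

```python
def error_bits(good: int, bad: int) -> int:
    """Bits that differ between the expected and observed 32-bit words."""
    return (good ^ bad) & 0xFFFFFFFF


def classify(good: int, bad: int, flagged: bool) -> str:
    """Rough interpretation of one reported result (illustrative only)."""
    mask = error_bits(good, bad)
    if flagged and mask == 0:
        # A mismatch was signalled, yet the reported words are identical.
        # The stored data looks fine, so suspect the CPU or chipset instead.
        return "inconsistent: suspect CPU or chipset"
    if flagged:
        # An ordinary memory error: at least one bit really differs.
        return f"bit error: mask {mask:08x}"
    return "ok"


if __name__ == "__main__":
    print(classify(0xFFFFFFFF, 0xFFFFFFFF, flagged=True))
    print(classify(0xFFFFFFFF, 0xFFFFFFFB, flagged=True))
```

The first call corresponds to what I saw (matching "good" and "bad" columns, error still flagged); the second is what a genuine single-bit RAM fault would look like under the same assumption.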
On a slightly related note... those new Intel LGA socket heatsinks (the new arrangement with pins on the board and pads on the chip) seem quite nasty. If anybody gets to build one, take note of how much the board is flexed by the retaining mechanism when the heatsink retaining clips are tensioned. It also looks like (at least with the Intel original part) removal and refitting of the heatsink with the board in situ will be at worst impossible and at best very likely to damage the PCB. I built up a custom machine for a client the other day and was horrified by this arrangement.
I think the new retaining mechanism is supposed to be BTX compliant (on BTX the heatsink is supported by the case, with the PCB sandwiched in between) but it doesn't work very well with the current ATX mainboards and cases, and it's not even certain that BTX will now ever properly take off (certainly it has gone nowhere fast in the OEM markets).