Project: 3064 (Run 2, Clone 124, Gen 26) ERROR 0x0
Posted: Sat Aug 16, 2008 12:52 pm
Got a Segmentation Fault / communications ERROR 0x0 several times at ~33% on one quad-core 2.6.25-gentoo-r6 SMP folding host. At least once this was after doing `ctrl-c, wait for all cores to finish, and restart before hitting the problem %', and it still errored at about 33%. Checked ping times and /etc/hosts: all okay, and the box had been running successfully for months before. The client is running in the foreground in screen over ssh. (Ugh, hope that's not part of the problem. Guess I'll try running it in the background with &> /dev/null and tail -f the logs.)
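For anyone who wants to try the background-plus-tail idea, here is a minimal sketch. The helper name run_detached and the console log filename are made up for illustration, and it sends output to a file instead of /dev/null so that anything printed before a crash is kept. It assumes a POSIX shell with nohup available.

```shell
#!/bin/sh
# run_detached OUTFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND in the background, detached from
# the terminal's hangup signal, with stdout and stderr appended to
# nothing else but OUTFILE. Prints the PID of the background process.
run_detached() {
    out=$1
    shift
    nohup "$@" >"$out" 2>&1 &
    echo $!
}
```

Usage would be something like `cd foldingathome && run_detached fah-console.log ./fah6 -smp -verbosity 9`, followed by `tail -f FAHlog.txt` (assuming the client writes its log to FAHlog.txt in that directory; fah-console.log is just a name I picked).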
After getting the same error at the same % several times on the original box, I ran memtest for several hours with no errors and stresscpu for six hours with no errors (stresscpu loads all cores at 100%, vs. about 35% average usage for fah6 -smp). Ambient temps have been unusually low the past few days, and there were no problems earlier during the heat of summer, so I doubt that's a factor.
Started the unit again on the same box and also on another quad running a 2.6.26-gentoo kernel (different glibc too). This one also died with a seg fault at 33%, but luckily it got there before the original did, and I had seen Ivoshiee's post about "Multiple failures at the same point - UNKNOWN, 0x0 or 0x1". So I archived the work unit at about 32% and posted the archives here* for people to see whether they can run it successfully, or to troubleshoot.
Edit: // Hmm: running `./fah6 -smp -verbosity 9 >/dev/null 2>&1 &' (from inside the folding dir) segfaults immediately on both machines. Never ran into that before. Got new WUs on both boxes. Both are a1 cores. Now running in the foreground in screen over ssh. //
Also, fwiw, I just noticed that the format of client.cfg seems to have changed with 6.02: bigpackets and machineid are swapped, and I don't see any 'local' field in a brand-new foldingathome directory made with the release client (vs. the beta that made the earlier config the erroring WU was running from).
* Unfortunately Comcast only allows 8MB uploads, so I had to split the work directory archive from the foldingathome directory archive and use tar -> bzip to get them small enough. The wiki 'sneakernet' stuff referred to in the "what to do with 0x0 at same point" post implied queue.dat and all of work would be enough, but that was very nearly as big as everything, so I kept everything. So extract with tar -xjf and cp work.3064.2.124.26.bad back into foldingathome/ as the work dir if you want to test it on your machine.
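The extract-and-copy step could be sketched roughly like this. The function name restore_wu is made up, and it assumes the archive unpacks to a single top-level directory (e.g. work.3064.2.124.26.bad); adjust if yours is laid out differently. Any existing work dir is moved aside rather than overwritten.

```shell
#!/bin/sh
# restore_wu ARCHIVE SAVED_DIR_NAME CLIENT_DIR
# Hypothetical helper: unpacks a .tar.bz2 ARCHIVE into a temp dir and
# copies the directory SAVED_DIR_NAME from it to CLIENT_DIR/work.
# Assumption: SAVED_DIR_NAME is a top-level directory in the archive.
restore_wu() {
    archive=$1
    saved=$2
    client=$3
    tmp=$(mktemp -d) || return 1
    tar -xjf "$archive" -C "$tmp" || return 1
    # Keep whatever the client is currently working on.
    if [ -d "$client/work" ]; then
        mv "$client/work" "$client/work.saved.$$" || return 1
    fi
    cp -r "$tmp/$saved" "$client/work" || return 1
    rm -rf "$tmp"
}
```

For example: `restore_wu work-archive.tar.bz2 work.3064.2.124.26.bad foldingathome` (the archive filename here is a placeholder, use whatever the downloaded file is called). Stop the client before doing this.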