
Project: 3064 (Run 2, Clone 124, Gen 26) ERROR 0x0

Posted: Sat Aug 16, 2008 12:52 pm
by rada
Several times now: Segmentation fault, communications ERROR 0x0 at ~33% on one quad-core 2.6.25-gentoo-r6 SMP folding host. At least once I did a `ctrl-c, wait for all cores to finish, and restart' before reaching the problem %, and it still errored at about 33%. Checked ping times and /etc/hosts: all okay, and this setup has been running successfully for months. This is running in the foreground in screen over ssh. (Ugh, hope that's not part of the problem. Guess I'll try bg &> /dev/null and tail -f the logs.)

After getting the same error at the same % several times on the original box, I ran memtest for several hours with no errors and stresscpu for six hours with no errors (stresscpu loads all cores at 100%, vs. about 35% average usage for fah6 -smp). Ambient temps have been unusually low the past few days, and there were no problems earlier during the heat of summer, so I doubt that's a factor.

Started the unit again on the same box and also on another quad running a 2.6.26-gentoo kernel (different glibc too). That one also died with a seg fault at 33%, but luckily it got there before the original did, and I saw Ivoshiee's post about Multiple failures at the same point - UNKNOWN, 0x0 or 0x1. So I archived the work unit at about 32% and posted the archives here* so people can see whether it runs successfully for them, or troubleshoot.

Edit: //hmm: running `./fah6 -smp -verbosity 9 >/dev/null 2>&1 &' from the folding dir segfaults immediately on both machines. Never run into that before. Got new WUs on both boxes; both are a1 cores. Now running in fg in screen over ssh. //

Also, fwiw, just noticed the format of client.cfg seems to have changed with 6.02: bigpackets and machineid are swapped, and I don't see any 'local' field after making a brand-new foldingathome directory with the release client (vs. the beta that wrote the earlier config the erroring WU was running from).

* Unfortunately Comcast only allows 8MB uploads, so I had to split the work directory archive from the foldingathome directory archive and use tar -> bzip to get them small enough. The wiki 'sneakernet' material referred to in the 'what to do with 0x0 at same point' post implied queue.dat and all of work would be enough, but that was very nearly as big as everything, so I kept everything. So extract with tar -xjf and cp work.3064.2.124.26.bad back into foldingathome/ as the work dir if you want to test it on your machine.

Re: Project: 3064 (Run 2, Clone 124, Gen 26) ERROR 0x0

Posted: Sat Aug 16, 2008 9:12 pm
by rada
Doh! Embarrassing how long it took before I noticed this and did anything about it... ugh.
-- kernel messages (also posted with the archives linked above) --


-- problem on this host first

Aug  7 16:30:29 host4 FahCore_a1.exe[24648]: segfault at 790fe10 ip 5cc68a sp 41bcbd60 error 4 in FahCore_a1.exe[400000+362000]
Aug  7 16:30:29 host4 FahCore_a1.exe[24650]: segfault at 2f1ef80 ip 5cc669 sp 4124ed60 error 4 in FahCore_a1.exe[400000+362000]
Aug  7 16:30:29 host4 FahCore_a1.exe[24645]: segfault at 7f39f20 ip 5cc674 sp 41459d60 error 4 in FahCore_a1.exe[400000+362000]
Aug  7 16:30:29 host4 FahCore_a1.exe[24646]: segfault at 3412180 ip 5cade6 sp 41c4bf30 error 4 in FahCore_a1.exe[400000+362000]

Aug  7 18:55:39 host4 FahCore_a1.exe[24672]: segfault at 1f1ff90 ip 5cc669 sp 41bb6d60 error 4<6>FahCore_a1.exe[24673]: segfault at 732ff20 ip 5cc674 sp 41917d60 error 4<6>FahCore_a1.exe[24677]: segfault at 8f65e10 ip 5cc68a sp 425f7d60 error 4 in FahCore_a1.exe[400000+362000]
Aug  7 18:55:39 host4 in FahCore_a1.exe[400000+362000]
Aug  7 18:55:39 host4 in FahCore_a1.exe[400000+362000]

Aug 10 07:25:47 host4 FahCore_a1.exe[24691]: segfault at 2d90cdf30 ip 5cc674 sp 408bed60 error 4<6>FahCore_a1.exe[24695]: segfault at 2d959bde0 ip 5cc67f sp 4253ed60 error 4 in FahCore_a1.exe[400000+362000]
Aug 10 07:25:47 host4 FahCore_a1.exe[24689]: segfault at 2d9b011e0 ip 5cc67f sp 41693d60 error 4 in FahCore_a1.exe[400000+362000] in FahCore_a1.exe[400000+36

Aug 10 18:51:31 host4 FahCore_a1.exe[24746]: segfault at 13bf490 ip 5cc67f sp 40b73d60 error 4<6>FahCore_a1.exe[24752]: segfault at 1edb350 ip 5cade6 sp 4184df30 error 4 in FahCore_a1.exe[400000+362000]
Aug 10 18:51:31 host4 in FahCore_a1.exe[400000+362000]

Aug 12 14:20:11 host4 FahCore_a1.exe[25013]: segfault at 21195f0 ip 5cc669 sp 42441d60 error 4<6>FahCore_a1.exe[25009]: segfault at 2de9c58e0 ip 5cc674 sp 41b9bd60 error 4 in FahCore_a1.exe[400000+362000] in FahCore_a1.exe[400000+362000]
Aug 12 14:20:11 host4 FahCore_a1.exe[25015]: segfault at 20220c8f0 ip 5cc674 sp 413a3d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 12 14:20:11 host4 FahCore_a1.exe[25011]: segfault at 1d9bc00 ip 5cad99 sp 41183f30 error 4 in FahCore_a1.exe[400000+362000]

Aug 13 06:54:17 host4 FahCore_a1.exe[25037]: segfault at 2dd96d8f0 ip 5cc674 sp 40b1ed60 error 4<6>FahCore_a1.exe[25039]: segfault at 202ace8f0 ip 5cc674 sp 42316d60 error 4 in FahCore_a1.exe[400000+362000]

Aug 13 17:30:13 host4 FahCore_a1.exe[4144]: segfault at 2ddb248e0 ip 5cc674 sp 41f84d60 error 4<6>FahCore_a1.exe[4140]: segfault at 2024618e0 ip 5cc674 sp 41341d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 13 17:30:13 host4 in FahCore_a1.exe[400000+362000]
Aug 13 17:30:13 host4 FahCore_a1.exe[4142]: segfault at 2f465f0 ip 5cc669 sp 42188d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 13 17:30:13 host4 FahCore_a1.exe[4146]: segfault at 2ef8c60 ip 5cad99 sp 416a8f30 error 4 in FahCore_a1.exe[400000+362000]

Aug 14 04:08:27 host4 FahCore_a1.exe[4169]: segfault at 20306f8f0 ip 5cc674 sp 41d87d60 error 4 in FahCore_a1.exe[400000+362000]<6>FahCore_a1.exe[4173]: segfault at 3b4f5f0 ip 5cc669 sp 40ac6d60 error 4
Aug 14 04:08:27 host4 in FahCore_a1.exe[400000+362000]
Aug 14 04:08:27 host4 FahCore_a1.exe[4167]: segfault at 2dd6108f0 ip 5cc674 sp 4222ed60 error 4 in FahCore_a1.exe[400000+362000]
Aug 14 04:08:27 host4 FahCore_a1.exe[4168]: segfault at 18ffc10 ip 5cad99 sp 40f1ff30 error 4 in FahCore_a1.exe[400000+362000]

Aug 14 11:45:41 host4 FahCore_a1.exe[4222]: segfault at 4cd9260 ip 5cc669 sp 423f3d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 14 11:45:41 host4 FahCore_a1.exe[4218]: segfault at 5e88170 ip 5cc669 sp 4147bd60 error 4 in FahCore_a1.exe[400000+362000]
Aug 14 11:45:41 host4 FahCore_a1.exe[4220]: segfault at 2836e9fd0 ip 5cadcb sp 41046f30 error 4 in FahCore_a1.exe[400000+362000]
Aug 14 11:45:41 host4 FahCore_a1.exe[4216]: segfault at 17f9820 ip 5ce05e sp 411bcaa0 error 4 in FahCore_a1.exe[400000+362000]

Aug 14 22:21:53 host4 FahCore_a1.exe[4244]: segfault at 2de9fb8f0 ip 5cc674 sp 410d3d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 14 22:21:53 host4 FahCore_a1.exe[4246]: segfault at 23bd5f0 ip 5cc669 sp 42673d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 14 22:21:53 host4 FahCore_a1.exe[4243]: segfault at 2023c88f0 ip 5cc674 sp 4251ad60 error 4 in FahCore_a1.exe[400000+362000]

Aug 15 01:34:36 host4 FahCore_a1.exe[4262]: segfault at 17904a0 ip 5cad99 sp 415f3f30 error 4 in FahCore_a1.exe[400000+362000]
Aug 15 01:34:36 host4 FahCore_a1.exe[4260]: segfault at 243a380 ip 5cc674 sp 4243dd60 error 4 in FahCore_a1.exe[400000+362000]
Aug 15 01:34:36 host4 FahCore_a1.exe[4264]: segfault at 147f9f0 ip 5cc68a sp 424bdd60 error 4 in FahCore_a1.exe[400000+362000]

Aug 15 12:03:36 host4 FahCore_a1.exe[4293]: segfault at 2de0198e0 ip 5cc674 sp 41109d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 15 12:03:36 host4 FahCore_a1.exe[4287]: segfault at 2035488f0 ip 5cc674 sp 42108d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 15 12:03:36 host4 FahCore_a1.exe[4288]: segfault at 30805f0 ip 5cc669 sp 41bb0d60 error 4 in FahCore_a1.exe[400000+362000]

Aug 15 22:51:55 host4 FahCore_a1.exe[4342]: segfault at 155b380 ip 5cc674 sp 40918d60 error 4<6>FahCore_a1.exe[4348]: segfault at 23249f0 ip 5cc68a sp 41a8dd60 error 4 in FahCore_a1.exe[400000+362000]
Aug 15 22:51:55 host4 in FahCore_a1.exe[400000+362000]
Aug 15 22:51:55 host4 FahCore_a1.exe[4344]: segfault at 2c3f4a0 ip 5cad99 sp 40a7bf30 error 4 in FahCore_a1.exe[400000+362000]

Aug 16 12:01:55 host4 fah6[4448]: segfault at 0 ip f7f0b143 sp f7e3a964 error 6 in[f7ea3000+12a000]
Aug 16 12:04:10 host4 fah6[4455]: segfault at 0 ip f7e89143 sp f7db8964 error 6 in[f7e21000+12a000]
Aug 16 12:05:11 host4 fah6[4460]: segfault at 0 ip f7ec4143 sp f7e1a964 error 6 in[f7e5c000+12a000]

-- then second host (used to test same WU) 
Aug 16 02:20:19 host2 dot[21430]: segfault at 8 ip 2afac7365223 sp 7fffe4f366a0 error 4 in[2afac7300000+14b000]
Aug 16 02:20:19 host2 dot[21431] general protection ip:2abefd352273 sp:7fffaef496b0 error:0 in[2abefd2ed000+14b000]
Aug 16 02:20:19 host2 dot[21432]: segfault at 8 ip 2ab519146223 sp 7fff931568d0 error 4 in[2ab5190e1000+14b000]
Aug 16 02:20:19 host2 dot[21433] general protection ip:2b67e99df273 sp:7fffc28bc020 error:0 in[2b67e997a000+14b000]
Aug 16 02:20:19 host2 dot[21433] general protection ip:2b67e99df273 sp:7fffc28bc020 error:0 in[2b67e997a000+14b000]
Aug 16 02:20:19 host2 dot[21434] general protection ip:2acf2373d273 sp:7fff88b5e2d0 error:0 in[2acf236d8000+14b000]
Aug 16 02:20:19 host2 dot[21435] general protection ip:2ae598651273 sp:7fff13c4c3b0 error:0 in[2ae5985ec000+14b000]
Aug 16 02:20:19 host2 dot[21436]: segfault at 8 ip 2b8461f7b223 sp 7fff4a31fa90 error 4 in[2b8461f16000+14b000]
Aug 16 02:20:19 host2 dot[21437] general protection ip:2b2554eef273 sp:7fff573adb20 error:0 in[2b2554e8a000+14b000]
Aug 16 02:20:19 host2 dot[21438]: segfault at 8 ip 2b8ca055a223 sp 7fff0bd434b0 error 4 in[2b8ca04f5000+14b000]
Aug 16 02:20:19 host2 dot[21439] general protection ip:2b724bae3273 sp:7fff607b7f20 error:0 in[2b724ba7e000+14b000]

Aug 16 02:44:11 host2 __ratelimit: 9 messages suppressed
Aug 16 02:44:11 host2 FahCore_a1.exe[17738]: segfault at 263e640 ip 5cc669 sp 4194cd60 error 4 in FahCore_a1.exe[400000+362000]
Aug 16 02:44:11 host2 FahCore_a1.exe[17734]: segfault at 202d7e940 ip 5cc674 sp 40aded60 error 4 in FahCore_a1.exe[400000+362000]
Aug 16 02:44:11 host2 FahCore_a1.exe[17735]: segfault at 1f96cc0 ip 5cad99 sp 42558f30 error 4 in FahCore_a1.exe[400000+362000]
Aug 16 02:44:11 host2 FahCore_a1.exe[17740]: segfault at 2df0e7930 ip 5cc674 sp 42128d60 error 4 in FahCore_a1.exe[400000+362000]
Aug 16 07:18:46 host2 fah6[3602]: segfault at 0 ip f7ebc903 sp f7e0c970 error 6 in[f7e4f000+13a000]
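
For anyone wanting to summarize kernel lines like these, here is a rough Python sketch (not part of the original post) that tallies FahCore_a1.exe segfaults by instruction pointer and decodes the `error N` field. It assumes the `segfault at ADDR ip IP sp SP error N` layout shown above and the standard x86 page-fault error-code bits; the names `tally_fault_sites` and `decode_error` are made up for illustration, and the sample is a few lines copied from the log with wrapped lines rejoined.

```python
import re
from collections import Counter

# Matches "proc[pid]: segfault at ADDR ip IP sp SP error N" records, including
# records the kernel interleaved on one line (separated by "<6>").
SEGFAULT_RE = re.compile(
    r"(?P<proc>[\w.]+)\[(?P<pid>\d+)\]:? segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
)

def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code (assumed x86)."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"

def tally_fault_sites(log_text, proc="FahCore_a1.exe"):
    """Count segfaults per instruction pointer for one process name."""
    counts = Counter()
    for m in SEGFAULT_RE.finditer(log_text):
        if m.group("proc") == proc:
            counts[int(m.group("ip"), 16)] += 1
    return counts

# A few lines taken from the log above.
sample = """\
Aug  7 16:30:29 host4 FahCore_a1.exe[24648]: segfault at 790fe10 ip 5cc68a sp 41bcbd60 error 4 in FahCore_a1.exe[400000+362000]
Aug 10 07:25:47 host4 FahCore_a1.exe[24691]: segfault at 2d90cdf30 ip 5cc674 sp 408bed60 error 4<6>FahCore_a1.exe[24695]: segfault at 2d959bde0 ip 5cc67f sp 4253ed60 error 4 in FahCore_a1.exe[400000+362000]
Aug 16 02:44:11 host2 FahCore_a1.exe[17734]: segfault at 202d7e940 ip 5cc674 sp 40aded60 error 4 in FahCore_a1.exe[400000+362000]
Aug 16 12:01:55 host4 fah6[4448]: segfault at 0 ip f7f0b143 sp f7e3a964 error 6 in[f7ea3000+12a000]
"""

for ip, n in tally_fault_sites(sample).most_common():
    print(f"ip {ip:#x}: {n} fault(s)")
print("error 4:", decode_error(4))
print("error 6:", decode_error(6))
```

Feeding it the whole log instead of `sample` would show how tightly the a1 core faults cluster around a handful of addresses on both hosts, which is one way to check that the two machines are failing at the same place in the core.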

Re: Project: 3064 (Run 2, Clone 124, Gen 26) ERROR 0x0

Posted: Mon Aug 18, 2008 10:17 am
by toTOW
There's no record of results for Project: 3064 (Run 2, Clone 124, Gen 26). :(