I think I have a bad one, WU 18244 Core 0x26
Posted: Fri May 09, 2025 11:50 pm
by dschief
This has failed 3 times in the last 2 days for other folders. It started out at over 1 hour per frame.
That has gradually come down; run time per frame is now 13 min 28 sec.
This is on a GTX 1660 Super. The same card in my Win11 box is putting out over 1 million PPD.
Might be worth keeping an eye on.
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 12:50 am
by Joe_H
What are the Run, Clone, and Gen numbers for the WU? No way for us to tell and keep an eye on it without those.
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 1:39 am
by dschief
RCG 441,0,59
Run times are now 5 min 55 sec.
I've never had a WU whose frame times went from over an hour to 5:55. Normally they are the same down to the second??
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 2:03 am
by arisu
Was it the frame estimate that went from over an hour to 5:55 or did you time the actual duration of the frames? Because the initial frame estimate is sometimes very inaccurate even if it is often correct down to the second. What matters is how long each frame is actually taking.
All the failures from the other three people for P18244 R441 C0 G59 are instant failures which indicates a core compatibility problem, and those are not unheard of for core26.
Can you post the log for that core so we can see how long each frame takes? Because I suspect that it's just an ETA estimation issue along with the WU getting unlucky and being sent to three people in a row with incompatible libraries for core26, which makes it seem like it's bad but could just be a false positive.
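One way to check actual frame times rather than the estimate is to difference the timestamps of the progress lines in the log. A minimal sketch, assuming progress lines look like `HH:MM:SS:...Completed N out of M steps (P%)`. The sample lines are made up for illustration and are not from this WU, and the exact log format may differ between client versions.

```python
import re
from datetime import timedelta

# Assumed line shape: "HH:MM:SS:<anything>Completed N out of M steps (P%)".
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):.*Completed \d+ out of \d+ steps \((\d+)%\)"
)

def frame_durations(lines):
    """Return seconds elapsed between consecutive progress lines."""
    stamps = []
    for line in lines:
        m = PROGRESS.match(line)
        if m:
            h, mi, s, _pct = map(int, m.groups())
            stamps.append(timedelta(hours=h, minutes=mi, seconds=s))
    durations = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta < 0:  # timestamps carry no date, so they wrap at midnight
            delta += 24 * 3600
        durations.append(int(delta))
    return durations

# Hypothetical sample lines, 4 minutes apart, crossing midnight.
sample = [
    "23:55:00:WU01:FS00:0x26:Completed 10000 out of 1000000 steps (1%)",
    "23:59:00:WU01:FS00:0x26:Completed 20000 out of 1000000 steps (2%)",
    "00:03:00:WU01:FS00:0x26:Completed 30000 out of 1000000 steps (3%)",
]
print(frame_durations(sample))
```

If the numbers printed are steady from frame to frame, the WU itself is behaving and only the displayed estimate is off.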
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 2:30 am
by dschief
The wonky numbers are from the "View Work Unit Details" page you get when you click on the little i in the client window. The actual log shows a stable 4:06. It's the view details page that's messed up.
Sorry for the confusion. I should have dug a little deeper before pushing the panic button.
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 5:18 am
by arisu
A lot of people get confused by the way the client interface works. Some people have even dumped perfectly good work units because the client seemed glitchy when it hit a known bug that caused PPD to inflate to impossible numbers. The interface and the estimation algorithm definitely need improvement.
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 1:29 pm
by dschief
Follow-up: I dug back through the log, and the problem child (W1) finished up overnight.
06:13:45:11 W1 finished 100%
06:13:49:11 W1 Credited.
It must have been big; it seemed to take longer than average to upload.
Client seems to be back to normal, whatever that is. Frame times are right at 3 min.
Another crisis survived, case closed.
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sat May 10, 2025 4:10 pm
by Joe_H
Project 18244 is a medium sized project in terms of atom count, but the researcher may be loading up additional information on the return. More frequent checkpoints and sanity checks would result in a larger upload.
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Sun May 11, 2025 4:51 am
by arisu
Joe_H wrote: ↑Sat May 10, 2025 4:10 pm
Project 18244 is a medium sized project in terms of atom count, but the researcher may be loading up additional information on the return. More frequent checkpoints and sanity checks would result in a larger upload.
More frequent checkpoints don't increase the total size (only the last one is sent), but more frequent XTC and especially TRR trajectory snapshots would. Those are files that encode the types, positions, and velocities of each particle. XTC is like TRR but contains reduced-resolution trajectories to reduce file size. 18244 doesn't use TRR trajectories, which are what would account for the biggest size increase.
It does use more frequent XTC trajectory snapshots than most projects (every 0.4% so 250 total) but that wouldn't make it extremely large by itself, and it's not sending any extra files (the only big files are the serialized final checkpoint and the XTC file).
Usually only non-solvent atoms are written to the XTC file but because they're testing a new type of water (OCL3-pol), it might be that the XTC file contains the solvent as well, which would inflate size significantly especially because they're sending 250 XTC trajectories (usually it's more like 20 to 50 that get sent). The core.xml file would answer why it's bigger.
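As rough arithmetic for the point above: a snapshot every 0.4% gives 100 / 0.4 = 250 frames, and the raw coordinate data scales with the number of atoms written per frame. A sketch, assuming 3 single-precision floats (12 bytes) per atom per frame before any compression. The atom counts are hypothetical placeholders, not figures for 18244, and real XTC files are smaller because the format is compressed.

```python
def snapshot_count(interval_percent):
    """Number of snapshots when one is written every interval_percent of the run."""
    return round(100 / interval_percent)

def raw_coordinate_bytes(atoms_per_frame, frames, bytes_per_atom=12):
    """Upper-bound size: 3 float32 coordinates per atom, no compression."""
    return atoms_per_frame * frames * bytes_per_atom

frames = snapshot_count(0.4)   # one snapshot every 0.4% of the run
solute_only = 5_000            # hypothetical non-solvent atom count
with_solvent = 60_000          # hypothetical count including water

for label, atoms in [("solute only", solute_only), ("with solvent", with_solvent)]:
    mib = raw_coordinate_bytes(atoms, frames) / 2**20
    print(f"{label}: {frames} frames, about {mib:.1f} MiB uncompressed")
```

The ratio between the two cases is what matters here: writing the solvent multiplies the trajectory size by roughly the ratio of total atoms to non-solvent atoms, whatever the actual counts are.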
Re: I think I have a bad one, WU 18244 Core 0x26
Posted: Mon May 12, 2025 11:51 am
by appepi
There are other projects from this family, e.g. 18251, that (for unknown reasons) run very slowly (~1/3 usual PPD) on the TU106 version of the RTX 2060, and also on the GTX 1660 and 1660 Super. The only solution seems to be to avoid them. See viewtopic.php?t=42221 for more.