I think I have a bad one, WU 18244 Core 0x26

dschief
Posts: 163
Joined: Tue Dec 04, 2007 5:56 am
Hardware configuration: Gigabyte Z790 UD AC , i7-14700K (x2) Win11
ASUS Z97_K , i5-4460 (x2) Win10 & UBUNTU 24.04
GPU's ASUS RTX1660 x2, RTX3050 x2
EVGA1060
Location: California Wine country

I think I have a bad one, WU 18244 Core 0x26

Post by dschief »

This WU has failed 3 times in the last 2 days for other folders. It started out at over 1 hour per frame.
That has gradually come down; run time per frame is now 13 min 28 sec.

This is on a GTX 1660 Super. The same card in my Win11 box is putting out over 1 million PPD.

Might be worth keeping an eye on.
Joe_H
Site Admin
Posts: 8118
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: I think I have a bad one, WU 18244 Core 0x26

Post by Joe_H »

What are the Run, Clone, and Gen numbers for the WU? No way for us to tell and keep an eye on it without those.
dschief

Re: I think I have a bad one, WU 18244 Core 0x26

Post by dschief »

RCG 441, 0, 59

Run times are now 5 min 55 sec.

I've never had a WU whose frame times went from over an hour to 5:55. They are always the same, down to the second??
arisu
Posts: 466
Joined: Mon Feb 24, 2025 11:11 pm

Re: I think I have a bad one, WU 18244 Core 0x26

Post by arisu »

Was it the frame estimate that went from over an hour to 5:55 or did you time the actual duration of the frames? Because the initial frame estimate is sometimes very inaccurate even if it is often correct down to the second. What matters is how long each frame is actually taking.

All the failures from the other three people for P18244 R441 C0 G59 are instant failures, which indicates a core compatibility problem, and those are not unheard of for core26.

Can you post the log for that core so we can see how long each frame takes? Because I suspect that it's just an ETA estimation issue along with the WU getting unlucky and being sent to three people in a row with incompatible libraries for core26, which makes it seem like it's bad but could just be a false positive.
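
For anyone who wants to check actual frame durations rather than the client's estimate, here is a rough sketch. It assumes each progress line in the log starts with an HH:MM:SS timestamp and contains "Completed N out of M steps"; the exact layout varies between client versions, so the regex and the sample lines below are made up for illustration and may need adjusting.

```python
import re

# Assumed shape of a progress line: "HH:MM:SS...Completed <done> out of <total> steps".
# This pattern is an illustration, not a guaranteed match for every client version.
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}).*Completed \d+ out of \d+ steps")

def frame_durations(lines):
    """Return the seconds elapsed between consecutive progress lines."""
    stamps = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            h, mi, s = (int(m.group(i)) for i in (1, 2, 3))
            stamps.append(h * 3600 + mi * 60 + s)
    # Timestamps carry no date, so wrap differences around midnight.
    return [(cur - prev) % 86400 for prev, cur in zip(stamps, stamps[1:])]

# Hypothetical sample lines, not taken from a real log.
sample = [
    "10:00:00:WU01:FS00:0x26:Completed 10000 out of 1000000 steps (1%)",
    "10:04:06:WU01:FS00:0x26:Completed 20000 out of 1000000 steps (2%)",
    "10:08:12:WU01:FS00:0x26:Completed 30000 out of 1000000 steps (3%)",
]
print(frame_durations(sample))  # [246, 246]
```

If the printed durations are steady, the WU is progressing normally and only the estimate is off.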
dschief

Re: I think I have a bad one, WU 18244 Core 0x26

Post by dschief »

The wonky numbers are from the "View Work Unit Details" page you get when you click on the little i in the
client window. The actual log shows a stable 4:06. It's the view details page that's messed up.
Sorry for the confusion; I should have dug a little deeper before pushing the panic button.
arisu

Re: I think I have a bad one, WU 18244 Core 0x26

Post by arisu »

A lot of people get confused by the way the client interface works. Some people have even dumped perfectly good work units because the client seemed glitchy when it hit a known bug that caused PPD to inflate to impossible numbers. The interface and the estimation algorithm definitely need improvement.
dschief

Re: I think I have a bad one, WU 18244 Core 0x26

Post by dschief »

Follow-up: I dug back through the log, and the problem child (W1) finished up overnight.

06:13:45:11 W1 finished 100%

06:13:49:11 W1 Credited. It must have been big; it seemed to take longer than average to upload.

The client seems to be back to normal, whatever that is. Frame times are right at 3 min.

Another crisis survived, case closed.
Joe_H

Re: I think I have a bad one, WU 18244 Core 0x26

Post by Joe_H »

Project 18244 is a medium sized project in terms of atom count, but the researcher may be loading up additional information on the return. More frequent checkpoints and sanity checks would result in a larger upload.
arisu

Re: I think I have a bad one, WU 18244 Core 0x26

Post by arisu »

Joe_H wrote: Sat May 10, 2025 4:10 pm Project 18244 is a medium sized project in terms of atom count, but the researcher may be loading up additional information on the return. More frequent checkpoints and sanity checks would result in a larger upload.
More frequent checkpoints don't increase the total size (only the last one is sent), but more frequent XTC and especially TRR trajectory snapshots would. Those are files that encode the types, positions, and velocities of each particle. XTC is like TRR but contains reduced-resolution trajectories to reduce file size. 18244 doesn't use TRR trajectories, though, and those are what would account for the biggest size increase.

It does use more frequent XTC trajectory snapshots than most projects (every 0.4% so 250 total) but that wouldn't make it extremely large by itself, and it's not sending any extra files (the only big files are the serialized final checkpoint and the XTC file).

Usually only non-solvent atoms are written to the XTC file, but because they're testing a new type of water (OCL3-pol), it might be that the XTC file contains the solvent as well. That would inflate the size significantly, especially because they're sending 250 XTC trajectories (usually it's more like 20 to 50 that get sent). The core.xml file would answer why it's bigger.
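
To put rough numbers on that, here is a back-of-the-envelope sketch. It counts uncompressed single-precision coordinates only (3 floats per atom per snapshot), so it is an upper bound; real XTC files are compressed and come out smaller. The atom counts are made-up placeholders, not the actual figures for 18244.

```python
def snapshot_count(interval_percent):
    """Number of snapshots when one is written every interval_percent of the run."""
    return round(100 / interval_percent)

def raw_trajectory_bytes(n_atoms, n_snapshots, bytes_per_coord=4):
    """Uncompressed size of x, y, z positions for every atom in every snapshot."""
    return n_atoms * 3 * bytes_per_coord * n_snapshots

# Hypothetical atom counts, for illustration only.
SOLUTE_ATOMS = 5_000        # non-solvent atoms
ALL_ATOMS = 60_000          # solute plus solvent

frames = snapshot_count(0.4)                       # every 0.4% of the run
solute_only = raw_trajectory_bytes(SOLUTE_ATOMS, frames)
with_solvent = raw_trajectory_bytes(ALL_ATOMS, frames)

print(frames)                                      # 250
print(solute_only / 1e6, "MB without solvent")
print(with_solvent / 1e6, "MB with solvent")
print(with_solvent / solute_only, "x larger")
```

The size scales linearly with both the number of atoms written and the number of snapshots, so including the solvent at 250 snapshots multiplies the two effects together.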
appepi
Posts: 84
Joined: Wed Mar 18, 2020 2:55 pm
Hardware configuration: HP Z600 (5) HP Z800 (3) HP Z440 (3)
ASUS Turbo GTX 1060, 1070, 1080, RTX 2060 (3)
Dell GTX 1080
Location: Sydney Australia

Re: I think I have a bad one, WU 18244 Core 0x26

Post by appepi »

There are other projects from this family, e.g. 18251, that (for unknown reasons) run very slowly (~1/3 of usual PPD) on the TU106 version of the RTX 2060, and also on the GTX 1660 and 1660 Super. The only solution seems to be to avoid them. See viewtopic.php?t=42221 for more.