7.6.10 Fails to upload CPU jobs

Posted: Mon Apr 20, 2020 6:04 pm
by tomralph
I've been chasing something like this since 7.6.8, and I think I've dug into it enough to have useful information for the 7.6.10 release.

My CPU jobs fail to upload with various timeouts on 7.6.8+:

Code:

16:46:31:WU05:FS00:Trying to send results to collection server
16:46:31:WU05:FS00:Uploading 2.72MiB to 129.213.40.229
16:46:31:WU05:FS00:Connecting to 129.213.40.229:8080
17:03:05:WU05:FS00:Upload 13.77%
17:03:05:ERROR:WU05:FS00:Exception: Transfer failed
I ran tcpdump to see what was happening and whether a firewall was eating my connections. What I observe is a healthy TCP session to the remote server:

Code:

No.	Time	Source	Destination	Protocol	Length	Info
48	1961.252835	10.x.x.x	129.213.40.229	TCP	74	42396 → 8080 [SYN] Seq=0 Win=26880 Len=0 MSS=8960 SACK_PERM=1 TSval=341073309 TSecr=0 WS=128
49	1961.348237	129.213.40.229	10.x.x.x	TCP	74	8080 → 42396 [SYN, ACK] Seq=0 Ack=1 Win=62636 Len=0 MSS=8960 SACK_PERM=1 TSval=2976270607 TSecr=341073309 WS=128
50	1961.348291	10.x.x.x	129.213.40.229	TCP	66	42396 → 8080 [ACK] Seq=1 Ack=1 Win=26880 Len=0 TSval=341073333 TSecr=2976270607
51	1961.348402	10.x.x.x	129.213.40.229	TCP	171	42396 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26880 Len=105 TSval=341073333 TSecr=2976270607 [TCP segment of a reassembled PDU]
We then try to transmit full frames to the server using the NIC's MTU with the 'do not fragment' bit set:

Code:

Frame 52: 9014 bytes on wire (72112 bits), 9014 bytes captured (72112 bits)
Ethernet II, Src: Vmware_81:4d:a3 (00:50:xx:xx:xx:xx), Dst: 02:50:56:56:44:52 (02:50:56:56:44:52)
Internet Protocol Version 4, Src: 10.x.x.x, Dst: 129.213.40.229
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
    Total Length: 9000
    Identification: 0x1baf (7087)
    Flags: 0x4000, Don't fragment
        0... .... .... .... = Reserved bit: Not set
        .1.. .... .... .... = Don't fragment: Set
        ..0. .... .... .... = More fragments: Not set
        ...0 0000 0000 0000 = Fragment offset: 0
    Time to live: 64
    Protocol: TCP (6)
    Header checksum: 0xf00f [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.x.x.x
    Destination: 129.213.40.229
The NIC's MTU is 9000:

Code:

ens192    Link encap:Ethernet  HWaddr 00:50:xx:xx:xx:xx
          inet addr:10.x.x.x  Bcast:10.x.x.x  Mask:255.x.x.x
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
I believe the 9014-byte frame from above is being dropped at the NIC.
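
For reference, the sizes in the captures line up with simple header arithmetic. This is a minimal sketch, assuming an untagged Ethernet II header (14 bytes) and IPv4 and TCP headers without options (20 bytes each); the function names are made up for illustration. It only shows how the 9014-byte frame, the 9000-byte IP total length and the MSS of 8960 relate to each other, not where the frame is actually dropped.

```python
# Header sizes assumed for this sketch: untagged Ethernet II,
# IPv4 without options, TCP without options.
ETHERNET_HEADER = 14
IPV4_HEADER = 20
TCP_HEADER = 20


def frame_size(ip_total_length: int) -> int:
    """Bytes on the wire for an IP packet, excluding the frame check sequence."""
    return ip_total_length + ETHERNET_HEADER


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IP packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(frame_size(9000))   # 9014, matches Frame 52 in the CPU upload
print(frame_size(2788))   # 2802, matches Frame 5 in the GPU upload
print(mss_for_mtu(9000))  # 8960, matches the MSS in the SYN and SYN, ACK
print(mss_for_mtu(1500))  # 1460, the MSS after dropping the MTU to 1500
```

So the 9014-byte frame is a full 9000-byte IP packet, which is exactly what an MTU of 9000 and a negotiated MSS of 8960 allow the sender to emit.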

When I capture a GPU workload, the packets never grow beyond 5k; they slow-start at 2k and then ramp up over time to 5k.

Code:

Frame 5: 2802 bytes on wire (22416 bits), 2802 bytes captured (22416 bits)
Ethernet II, Src: Vmware_81:4d:a3 (00:50:xx:xx:xx:xx), Dst: 02:50:56:56:44:52 (02:50:56:56:44:52)
Internet Protocol Version 4, Src: 10.x.x.x, Dst: 128.252.203.2
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
    Total Length: 2788
    Identification: 0x5ff1 (24561)
    Flags: 0x4000, Don't fragment
        0... .... .... .... = Reserved bit: Not set
        .1.. .... .... .... = Don't fragment: Set
        ..0. .... .... .... = More fragments: Not set
        ...0 0000 0000 0000 = Fragment offset: 0
    Time to live: 64
    Protocol: TCP (6)
    Header checksum: 0x22cd [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.x.x.x
    Destination: 128.252.203.2
Question: is there any difference between the CPU and GPU workload upload code that would cause the network packets to be generated differently? I can resolve the upload failure by setting the NIC's MTU to 1500, which allows both CPU and GPU workloads to upload without error.
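
Since MTU 1500 works and MTU 9000 does not, one way to narrow this down would be to look for the largest packet size that survives the path with 'do not fragment' set. The following is a rough sketch, assuming a Linux host with iputils ping (`-M do` forbids fragmentation) and a remote that answers ICMP echo, which the upload server may not. The helper names are hypothetical, and the search takes the probe as a parameter so the logic can be tried without touching the network.

```python
import subprocess
from typing import Callable, List, Optional

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def ping_command(host: str, mtu: int) -> List[str]:
    """Build one iputils ping probe whose IP packet is exactly `mtu` bytes."""
    payload = mtu - ICMP_OVERHEAD
    return ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host]


def probe_with_ping(host: str, mtu: int) -> bool:
    """Return True if a single unfragmented echo of this size gets a reply."""
    result = subprocess.run(
        ping_command(host, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_working_mtu(
    probe: Callable[[int], bool], low: int = 576, high: int = 9000
) -> Optional[int]:
    """Binary-search the largest size in [low, high] for which probe succeeds.

    Assumes that if a size works, every smaller size works too.
    Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Example against a real host (not run here):
# largest_working_mtu(lambda mtu: probe_with_ping("129.213.40.229", mtu))
```

If the result comes back well below 9000, that would suggest the large frames are lost somewhere on the path rather than in the client's upload code, but a host that ignores ICMP echo would make this probe inconclusive.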

Re: 7.6.10 Fails to upload CPU jobs

Posted: Mon Apr 20, 2020 7:20 pm
by Tohya