7.6.10 Fails to upload CPU jobs
Posted: Mon Apr 20, 2020 6:04 pm
I've been chasing something like this since 7.6.8, and I think I've dug into it enough to have useful information for the 7.6.10 release.
My CPU jobs fail to upload with various timeouts on 7.6.8+.
I ran tcpdump to see what was happening and whether a firewall was eating my connections. What I observe is a healthy TCP session to the remote server:
We then try to transmit full frames to the server using the NIC's MTU with the 'do not fragment' bit set:
The NIC's MTU is 9000
I believe the 9014-byte frame (frame 52 in the capture) is being dropped at the NIC.
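For anyone checking the numbers, here is a small sketch of the size arithmetic behind the capture. It assumes a plain Ethernet II header with no VLAN tag and an IPv4 header without options (the capture shows "Header Length: 20 bytes"); the function names are mine, not anything from the client.

```python
ETH_HEADER = 14   # Ethernet II header: dst MAC + src MAC + EtherType (assumes no VLAN tag)
IPV4_HEADER = 20  # IPv4 header without options, as shown in the capture
TCP_HEADER = 20   # base TCP header, options not counted


def frame_len_on_wire(ip_total_length: int) -> int:
    """Captured frame length for a given IPv4 'Total Length' field."""
    return ip_total_length + ETH_HEADER


def mss_for_mtu(mtu: int) -> int:
    """MSS a host would normally advertise for a given interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(frame_len_on_wire(9000))  # 9014, the CPU upload frame
print(frame_len_on_wire(2788))  # 2802, the GPU upload frame
print(mss_for_mtu(9000))        # 8960, matches MSS=8960 in the SYN
print(mss_for_mtu(1500))        # 1460, what a 1500 MTU would advertise
```

So the 9014-byte frame is exactly one full 9000-byte IP packet plus the Ethernet header, which is consistent with the sender filling the whole MTU.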
When I capture a GPU workload, the packets never grow beyond 5k; they slow-start at 2k and then ramp up over time to 5k.
Question: is there any difference between the CPU and GPU workload upload code that would cause the network packets to be generated differently? I can resolve the upload failure by setting the NIC's MTU to 1500, which allows both CPU and GPU workloads to upload without error.
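One way to check where large packets stop getting through is to probe with DF-marked pings of decreasing size. Below is a rough sketch, assuming Linux iputils `ping` (where `-M do` sets the don't-fragment bit and `-s` is the ICMP payload size). The helper names are hypothetical, and a failed probe can also just mean the remote end filters ICMP, so treat the result as a hint rather than proof.

```python
import subprocess
from typing import Callable, List

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def ping_command(host: str, mtu: int) -> List[str]:
    """Build a Linux iputils ping that sends one DF-marked packet of `mtu` bytes."""
    payload = mtu - ICMP_OVERHEAD
    return ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host]


def make_ping_probe(host: str) -> Callable[[int], bool]:
    """Return a probe that reports whether a DF ping of the given size got a reply."""
    def probe(mtu: int) -> bool:
        result = subprocess.run(ping_command(host, mtu), capture_output=True)
        return result.returncode == 0
    return probe


def largest_working_mtu(probe: Callable[[int], bool],
                        low: int = 576, high: int = 9000) -> int:
    """Binary-search the largest size in [low, high] for which probe() succeeds.

    Assumes probe(low) succeeds and that anything above the limit fails.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Simulated path that only passes packets up to 1500 bytes (no network needed).
print(largest_working_mtu(lambda mtu: mtu <= 1500))  # 1500
```

To run it against a real host, pass `make_ping_probe("<server address>")` as the probe instead of the simulated lambda.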
Client log from a failed CPU upload:
Code:
16:46:31:WU05:FS00:Trying to send results to collection server
16:46:31:WU05:FS00:Uploading 2.72MiB to 129.213.40.229
16:46:31:WU05:FS00:Connecting to 129.213.40.229:8080
17:03:05:WU05:FS00:Upload 13.77%
17:03:05:ERROR:WU05:FS00:Exception: Transfer failed
Code:
No. Time Source Destination Protocol Length Info
48 1961.252835 10.x.x.x 129.213.40.229 TCP 74 42396 → 8080 [SYN] Seq=0 Win=26880 Len=0 MSS=8960 SACK_PERM=1 TSval=341073309 TSecr=0 WS=128
49 1961.348237 129.213.40.229 10.x.x.x TCP 74 8080 → 42396 [SYN, ACK] Seq=0 Ack=1 Win=62636 Len=0 MSS=8960 SACK_PERM=1 TSval=2976270607 TSecr=341073309 WS=128
50 1961.348291 10.x.x.x 129.213.40.229 TCP 66 42396 → 8080 [ACK] Seq=1 Ack=1 Win=26880 Len=0 TSval=341073333 TSecr=2976270607
51 1961.348402 10.x.x.x 129.213.40.229 TCP 171 42396 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26880 Len=105 TSval=341073333 TSecr=2976270607 [TCP segment of a reassembled PDU]
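Both SYN packets above advertise MSS=8960, and a sender may not exceed the MSS the other side advertised. As a sketch of how to pull that out of the capture text (the Info strings are trimmed copies of packets 48 and 49; `parse_mss` is just an illustrative helper):

```python
import re


def parse_mss(info: str) -> int:
    """Extract the MSS option value from a tcpdump/Wireshark Info string."""
    match = re.search(r"\bMSS=(\d+)", info)
    if match is None:
        raise ValueError("no MSS option found")
    return int(match.group(1))


# Info fields from packets 48 (client SYN) and 49 (server SYN, ACK), ports omitted.
CLIENT_SYN = "[SYN] Seq=0 Win=26880 Len=0 MSS=8960 SACK_PERM=1 TSval=341073309 TSecr=0 WS=128"
SERVER_SYN_ACK = "[SYN, ACK] Seq=0 Ack=1 Win=62636 Len=0 MSS=8960 SACK_PERM=1 TSval=2976270607 TSecr=341073309 WS=128"

# The smaller of the two advertised values bounds segment size in both directions.
effective_mss = min(parse_mss(CLIENT_SYN), parse_mss(SERVER_SYN_ACK))
print(effective_mss)            # 8960
print(effective_mss + 20 + 20)  # 9000, largest IP packet with 20-byte IP and TCP headers
```

Nothing in the handshake tells either end that a hop in between might have a smaller MTU; that only shows up later through path MTU discovery.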
Code:
Frame 52: 9014 bytes on wire (72112 bits), 9014 bytes captured (72112 bits)
Ethernet II, Src: Vmware_81:4d:a3 (00:50:xx:xx:xx:xx), Dst: 02:50:56:56:44:52 (02:50:56:56:44:52)
Internet Protocol Version 4, Src: 10.x.x.x, Dst: 129.213.40.229
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
    Total Length: 9000
    Identification: 0x1baf (7087)
    Flags: 0x4000, Don't fragment
        0... .... .... .... = Reserved bit: Not set
        .1.. .... .... .... = Don't fragment: Set
        ..0. .... .... .... = More fragments: Not set
        ...0 0000 0000 0000 = Fragment offset: 0
    Time to live: 64
    Protocol: TCP (6)
    Header checksum: 0xf00f [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.x.x.x
    Destination: 129.213.40.229
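For reference, the 16-bit value 0x4000 in the dump is the combined IPv4 flags and fragment-offset field. A minimal sketch of how it decodes, using the standard IPv4 bit layout (the function name is my own):

```python
def decode_ipv4_flags(field: int) -> dict:
    """Split the 16-bit IPv4 flags/fragment-offset field into its parts."""
    return {
        "reserved": bool(field & 0x8000),
        "dont_fragment": bool(field & 0x4000),
        "more_fragments": bool(field & 0x2000),
        "fragment_offset": field & 0x1FFF,  # counted in 8-byte units
    }


print(decode_ipv4_flags(0x4000))
# {'reserved': False, 'dont_fragment': True, 'more_fragments': False, 'fragment_offset': 0}
```

With DF set, a router that cannot forward the 9000-byte packet is not allowed to fragment it. It has to drop it and is expected to send back an ICMP "fragmentation needed" message, which the sender only sees if nothing filters that ICMP on the way back.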
Code:
ens192    Link encap:Ethernet  HWaddr 00:50:xx:xx:xx:xx
          inet addr:10.x.x.x  Bcast:10.x.x.x  Mask:255.x.x.x
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
Frame from a GPU workload capture, for comparison:
Code:
Frame 5: 2802 bytes on wire (22416 bits), 2802 bytes captured (22416 bits)
Ethernet II, Src: Vmware_81:4d:a3 (00:50:xx:xx:xx:xx), Dst: 02:50:56:56:44:52 (02:50:56:56:44:52)
Internet Protocol Version 4, Src: 10.x.x.x, Dst: 128.252.203.2
    0100 .... = Version: 4
    .... 0101 = Header Length: 20 bytes (5)
    Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
    Total Length: 2788
    Identification: 0x5ff1 (24561)
    Flags: 0x4000, Don't fragment
        0... .... .... .... = Reserved bit: Not set
        .1.. .... .... .... = Don't fragment: Set
        ..0. .... .... .... = More fragments: Not set
        ...0 0000 0000 0000 = Fragment offset: 0
    Time to live: 64
    Protocol: TCP (6)
    Header checksum: 0x22cd [validation disabled]
    [Header checksum status: Unverified]
    Source: 10.x.x.x
    Destination: 128.252.203.2