scp Hangs Caused by an MTU Problem

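Yesterday I ran into a bizarre network problem that had me stumped for a while.
One machine at the company would always hang when sending data to the outside network, whether via svn, scp, git, rsync, or anything else.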

Take an svn commit, for example:

work@ubuntu:~/svn_repos$ svn ci Resources/anim/images/walk.png  -mx
Adding  (bin)  Resources/anim/images/walk.png
Transmitting file data .

After printing that last line it froze, with no response no matter how long I waited. After pressing Ctrl+C, it printed the following:

svn: Commit failed (details follow):
svn: While preparing '/home/work/svn_repos/Resources/anim/images/walk.png' for commit
svn: Caught signal

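Text files, however, committed without any trouble. My first guess was a slow network, or a problem with how svn transfers binary data. I tried git, scp, and rsync, and the symptoms were similar every time.
Here is scp, for example: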

work@ubuntu:~/svn_repos$ scp  Resources/anim/images/walk.png work@example.com:~/
walk.png         100%   89KB  89.5KB/s   00:00

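The terminal shows 100% and then stalls. Even Ctrl+C has no effect, so the only option is to kill the process. Turning on verbose logging with -vvv did not reveal anything obvious either.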

work@ubuntu:~/svn_repos$ scp -vvv Resources/anim/images/walk.png work@example.com:~/

… …
debug1: Sending command: scp -v -t ~/
debug2: channel 0: request exec confirm 1
debug2: fd 3 setting TCP_NODELAY
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug3: Wrote 192 bytes for a total of 2391
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
Sending file modes: C0644 91624 walk.png
debug3: Wrote 64 bytes for a total of 2455
debug2: channel 0: rcvd ext data 27
Sink: C0644 91624 walk.png
debug2: channel 0: written 27 to efd 6
walk.png 100% 89KB 89.5KB/s 00:00
debug3: Wrote 12924 bytes for a total of 15379

It reports 100% at the end, and then it still hangs.

Searching for "scp hangs at 100" and "scp binary file hung up" turned up nothing useful.

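Text worked and binary did not; svn, git, scp, and rsync all behaved the same way; and only this one machine was affected, since the same transfers to the remote host worked fine from my own machine. I had never seen anything this strange, and for a while I was at a loss. Eventually it occurred to me that file size might be the real factor, so I created a 50KB text file and tried to commit it. That hung too, which confirmed the idea. The next question was exactly how large a file had to be to trigger the hang.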

The dd command can create a file of a specified size.

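# Create a 1KB file (1024 blocks of 1 byte each)
dd if=/dev/zero of=file_to_create bs=1 count=1024
# Create a 1KB file (one 1KB block)
dd if=/dev/zero of=file_to_create bs=1k count=1
# Create a 10MB file (GNU dd on Linux needs an uppercase M; BSD/macOS dd accepts a lowercase m)
dd if=/dev/zero of=file_to_create bs=1M count=10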

Using binary search on the file size, I eventually pinned it down: scp succeeds with a 1390-byte file and hangs with a 1391-byte file.
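A search like this is easy to script. The sketch below is an illustration, not the exact procedure used here. It assumes that every size up to some threshold transfers and every size above it hangs, that key-based SSH login is set up, and that GNU `timeout` is available to turn a hang into a failure. The function names, the 10-second limit, and the remote file name `probe.bin` are all hypothetical.

```shell
# Hypothetical probe: create a file of SIZE bytes and try to copy it.
# A hang is converted into a failure by killing scp after 10 seconds
# (SIGKILL, because Ctrl+C had no effect on the stuck scp).
probe_scp() {
    local size file rc
    size=$1
    file=$(mktemp) || return 1
    head -c "$size" /dev/zero > "$file"
    timeout -s KILL 10 scp -q -o BatchMode=yes "$file" work@example.com:~/probe.bin
    rc=$?
    rm -f "$file"
    return "$rc"
}

# find_max_ok LO HI PROBE
# LO must be a size known to work, HI a size known to hang.
# Prints the largest size for which PROBE succeeds.
find_max_ok() {
    local lo hi probe mid
    lo=$1
    hi=$2
    probe=$3
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$mid"; then
            lo=$mid    # mid still works: the threshold is at or above mid
        else
            hi=$mid    # mid hangs: the threshold is below mid
        fi
    done
    echo "$lo"
}

# Example (1KB known good, 50KB known bad):
#   find_max_ok 1024 51200 probe_scp
```

Between 1KB and 50KB this needs about 16 probes, so even with a 10-second timeout on each failing attempt it finishes in a couple of minutes.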

What kind of odd number is 1391? Searching again for "scp hung up 1391" finally led me to a question titled "Why does SCP hang on copying files larger than 1405 bytes?", where one answer says:

This definitely sounds like MTU problems (like Konerak pointed out), this is how I would test this:
   ip link set eth0 mtu 1400
This temporally sets the allowed size for network packets to 1400 on the network interface eth0 (you might need to adjust the name). Your system will then split all packets above this size before sending it on to the network. If this fixes the scp command, you need to find the problem within the network or make this ugly fix permanent ; )

Looking at the problem machine again, I found that its default MTU is 1500.

work@ubuntu:~$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:24:e8:2c:3d:76 brd ff:ff:ff:ff:ff:ff

I followed the advice above and set the MTU to 1400, and sure enough, the problem went away!
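One way to check this kind of diagnosis without touching scp is to send pings with the "don't fragment" bit set. This was not verified on the machine in question; it is a sketch that assumes Linux iputils `ping` (which provides `-M do`), IPv4, and a remote host that answers ICMP echo. The helper name `icmp_payload_for_mtu` is made up for illustration.

```shell
# Payload size that makes an IPv4 ICMP echo packet exactly MTU bytes long:
# MTU minus the 20-byte IP header (no options) and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Temporarily lower the MTU, as in the quoted answer (needs root, lost on reboot):
#   sudo ip link set eth0 mtu 1400
#
# Probe the path with fragmentation forbidden:
#   ping -c 3 -M do -s "$(icmp_payload_for_mtu 1500)" example.com   # full-size packet
#   ping -c 3 -M do -s "$(icmp_payload_for_mtu 1400)" example.com   # reduced-size packet
```

If the full-size probe gets no reply while the smaller one does, that is consistent with some link on the path carrying less than 1500 bytes while the error messages that would normally tell the sender to shrink its packets never arrive.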

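It is probably a problem with some router along the network path, and I could not be bothered to track it down. Incidentally, I had run into this kind of packet loss at a router caused by a misconfigured MTU once before, in a course project. Back then I solved it by having the program split the data into smaller packets automatically.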
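The details of that course project are not recorded here, so the following is only a generic sketch of application-level splitting: cut the payload into pieces no larger than a chosen limit and send each piece separately. The function names are hypothetical, and the 1400-byte limit in the example is simply borrowed from the MTU value above.

```shell
# chunk_count TOTAL MAX: number of pieces needed to carry TOTAL bytes
# when each piece holds at most MAX bytes (ceiling division).
chunk_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# split_payload FILE MAX PREFIX: write FILE out as PREFIXaa, PREFIXab, ...
# with every piece at most MAX bytes long.
split_payload() {
    split -b "$2" "$1" "$3"
}

# Example: the 91624-byte walk.png in pieces of at most 1400 bytes
#   chunk_count 91624 1400
#   split_payload walk.png 1400 walk.part.
```

Concatenating the pieces in name order (`cat walk.part.* > walk.png`) restores the original file on the receiving side.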

Last modified on 2011-12-23