Wayne Stallwood wrote:
Some of the things you mention below almost sound like an MTU mismatch problem, but why this would be intermittent I am not sure. MTU problems could explain why a ping works until the packet size is increased and could also explain why traffic encapsulated by your VPN may work while other connections fail.
That was what I thought, which is what led me in that direction, but I can't really say it's got me anywhere.
The MTU is set on the router to 1500. If I ping with packets of 1472 bytes with fragmentation disabled (i.e. allowing for the 28 bytes of overhead: 20 for the IP header plus 8 for the ICMP header) I get consistently good results; with 1473-byte packets the ping fails because fragmentation would be needed, so the path does seem to support a 1500 MTU. But if I increase the packet size beyond 1472 whilst allowing fragmentation I get intermittent results. I don't really know what that "means", though. Maybe some routers are just dropping some ICMP packets while still letting normal traffic through (although it's odd that it's not consistent).
For example, I can ping "www.lug.org.uk" with packets of 1472 and below, but not 1473+, even though the connection seems fine right now. So I assume that something is dropping the large packets somewhere (consistently). If I try the same to my own server (www.more-solutions.co.uk) I get some good results and some bad ones.
(If someone else can try pinging those two with 1473-byte packets and tell me if my results differ from theirs, that would be useful. "ping -s 1473 www.more-solutions.co.uk" under Linux should do it. NB I've been using a 2sec timeout; I should probably increase that to see if the packets are just very slow coming back.)
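For anyone wanting to repeat the test, here is a rough Python sketch of the payload arithmetic and of the two ping variants (fragmentation prohibited and fragmentation allowed). The helper names are made up for illustration, and the flags assume the Linux iputils ping ("-M do" sets Don't Fragment, "-M dont" leaves it clear, "-W" is the reply timeout in seconds); other ping implementations use different options. It only builds and prints the command lines, it does not send anything.

```python
import shlex

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_icmp_payload(mtu):
    """Largest ping payload that fits in a single unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def ping_command(host, payload, allow_fragmentation, timeout_s=2, count=5):
    """Build a Linux iputils ping command line (hypothetical helper, not executed)."""
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), "-s", str(payload)]
    # "-M do" prohibits fragmentation; "-M dont" does not set the DF bit.
    cmd += ["-M", "dont" if allow_fragmentation else "do"]
    cmd.append(host)
    return cmd


if __name__ == "__main__":
    limit = max_icmp_payload(1500)  # 1472 for a 1500-byte MTU
    for host in ("www.lug.org.uk", "www.more-solutions.co.uk"):
        for payload in (limit, limit + 1):
            for allow in (False, True):
                print(shlex.join(ping_command(host, payload, allow)))
```

Raising timeout_s above 2 would show whether the large packets are being dropped or are just slow to come back.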
What are the line stats on your routers showing? It doesn't sound completely like an ADSL/line problem but the intermittent nature sort of points in that direction.
All numbers Downstream/Upstream:
  SNR Margin        26.0/28.0 dB
  Line Attenuation  57.4/31.5 dB
  Errored Seconds   74/174
  Loss of Signal    0/0
  Loss of Frame     0/0
  CRC Errors        210/273
  Data Rate         1152/288 kbps
Whilst those CRC errors aren't great (uptime is 3 days 4 hrs) they're not atypical for my office ADSL line. My home one is at 8Mbps (nice and close to the exchange!) but I can't get at the stats remotely.
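To put those counters in proportion, here is a quick back-of-envelope calculation in Python. It assumes the counters were reset when the router last restarted, so that they cover the full 3 days 4 hours of uptime; the function names are just for illustration.

```python
UPTIME_HOURS = 3 * 24 + 4  # 3 days 4 hours = 76 hours

# (downstream, upstream) counters as reported by the router
CRC_ERRORS = (210, 273)
ERRORED_SECONDS = (74, 174)


def per_hour(count, hours=UPTIME_HOURS):
    """Average number of events per hour of uptime."""
    return count / hours


def errored_fraction(errored_seconds, hours=UPTIME_HOURS):
    """Fraction of the uptime that was spent in errored seconds."""
    return errored_seconds / (hours * 3600)


if __name__ == "__main__":
    for label, crc, es in zip(("downstream", "upstream"), CRC_ERRORS, ERRORED_SECONDS):
        print(f"{label}: {per_hour(crc):.2f} CRC errors/hour, "
              f"{errored_fraction(es):.4%} of uptime errored")
```

That works out to roughly three CRC errors an hour in each direction, with well under 0.1% of the uptime counted as errored seconds.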
NB: Restarting the router sometimes makes a difference for a minute or two after the restart, but no more than that. And of course it means any established SSH connections get dropped, so it's not really worthwhile. Both routers are pretty cheap-n-cheerful things (office one is Conexant chipset, not sure about home one but it's Safecom branded and I think they're usually Conexant too). I've updated firmware without effect. If I get the chance I'll build an IPCop box, stick a USB ADSL modem in it, and see if that works any better.
When you are in a state of not being able to start a new ssh connection, can you still telnet to port 22 of an external ssh server and see the server banner (or do you get connection timed out even doing that)?
Next time it fails I'll try that. My guess would be that the connection will time out even though an established connection to the same server will continue just fine, but I'm not 100% sure I've actually tried it.
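If telnet isn't to hand, the same banner check can be sketched in a few lines of Python. This is only a rough equivalent of "telnet host 22": the function name and the 10-second default timeout are my own choices, and it simply reads the first line the server sends after the TCP connection opens (an SSH server normally sends its identification string straight away).

```python
import socket


def ssh_banner(host, port=22, timeout_s=10.0):
    """Return the first line the server sends, or None on failure or timeout."""
    buf = b""
    try:
        # The timeout applies to the connect and to each recv on the socket.
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            while b"\n" not in buf and len(buf) < 256:
                chunk = sock.recv(256)
                if not chunk:  # server closed the connection
                    break
                buf += chunk
    except OSError:  # refused, unreachable, or timed out
        return None
    line = buf.split(b"\n", 1)[0].strip()
    return line.decode("ascii", errors="replace") or None


if __name__ == "__main__":
    banner = ssh_banner("www.more-solutions.co.uk")
    print(banner if banner else "no banner (connection failed or timed out)")
```

A None result while an already established SSH session to the same server keeps working would match the guess above.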