I have two ADSL lines from completely different providers (Demon and The Phone Coop, www.thephone.coop), both behaving the same way. The lines are both in the Peterborough area but on different exchanges. Although this isn't a Linux problem, I think Linux is most likely to give me the tools to resolve it.
Basically, some of the time the ADSL lines just work: I can browse the web, collect email, download ISOs, etc.
However the rest of the time (and it's becoming increasingly "most of the time") I can't. To all intents and purposes the connection has gone. Except, and this is what's throwing me, connections which were already up will continue to work. For example, if I have an SSH connection somewhere it'll still work; I use Hamachi (www.hamachi.cc) as a VPN client and I can still access my home PC from work and vice versa through it when the connection is otherwise useless; and if I'm downloading an ISO using BitTorrent it'll usually continue just fine. The ADSL router isn't showing anything unusual (I have different hardware at home from at work and have swapped both with no effect).
All I have managed to do is:
(a) Confirm that DNS is not the problem. DNS doesn't work properly during an "outage" but equally I can't access sites/email/etc by IP address either.
(b) Small packet stuff seems OK. Often (not always) I can ping a site which I can't view.
(c) Following on from the above, if I tweak the ping packet size so that fragmentation occurs the ping becomes unreliable.
(d) Optimised sites like Google are far more likely to work than non-optimised sites with lots of data.
There's no obvious pattern to the time of day/day of week (non-peak times are probably better but it's not clear cut). Rain makes things worse (but our office ADSL line has never liked rain, and I think that's a different problem).
Any idea what is going on or how to find out? The ISPs so far haven't been much help, but since I have the same problem with two different ISPs I don't think it's an ISP problem. A traceroute from my office to somewhere and from my home to the same place don't show any obvious points of commonality until they reach the destination (ie I'm pretty sure that apart from going through BT cables the two ISPs aren't sharing the same hardware/connectivity somewhere).
At home my router uses an embedded Linux which I can SSH into, so suggestions for things I can try from there welcome. It uses BusyBox and has limited tools but there may be some useful stuff under /proc if I knew what to look for.
PS: This has been going on for a few weeks but is definitely getting worse.
On Sat, 2006-10-07 at 10:52 +0100, Mark Rogers wrote:
I have two ADSL lines from completely different providers (Demon and The Phone Coop, www.thephone.coop), both behaving the same way. The lines are both in the Peterborough area but on different exchanges. Although this isn't a Linux problem, I think Linux is most likely to give me the tools to resolve it.
Basically, some of the time the ADSL lines just work: I can browse the web, collect email, download ISOs, etc.
However the rest of the time (and it's becoming increasingly "most of the time") I can't. To all intents and purposes the connection has gone.
Very strange.
Some of the things you mention below almost sound like an MTU mismatch problem, but why this would be intermittent I am not sure. MTU problems could explain why a ping works until the packet size is increased and could also explain why traffic encapsulated by your VPN may work while other connections fail.
What are the line stats on your routers showing? It doesn't sound completely like an ADSL/line problem but the intermittent nature sort of points in that direction.
When you are in a state of not being able to start a new ssh connection, can you still telnet to port 22 of an external ssh server and see the server banner (or do you get connection timed out even doing that)?
Wayne Stallwood wrote:
Some of the things you mention below almost sound like an MTU mismatch problem, but why this would be intermittent I am not sure. MTU problems could explain why a ping works until the packet size is increased and could also explain why traffic encapsulated by your VPN may work while other connections fail.
That was what I thought, which is what led me in that direction, but I can't really say it's got me anywhere.
The MTU is set on the router to 1500. If I ping with packets of 1472 bytes with fragmentation disabled (ie allowing for the 28 bytes of IP and ICMP header overhead) I get consistently good results; with 1473-byte packets the ping fails because fragmentation would be needed, so the path does seem to support an MTU of 1500. But if I increase the packet size beyond 1472 whilst allowing fragmentation I get intermittent results. I don't really know what that "means", though. Maybe some routers are just dropping some ICMP packets while still allowing normal packets (although it's odd that it's not consistent). For example, I can ping "www.lug.org.uk" with packets of 1472 and below, but not 1473+, even though the connection seems fine right now. So I assume that something is dropping the large packets somewhere (consistently). If I try the same to my own server (www.more-solutions.co.uk) I get some good results and some bad ones.
(If someone else can try pinging those two with 1473-byte packets and tell me if my results differ from theirs that would be useful. "ping -s 1473 www.more-solutions.co.uk" under Linux should do it. NB I've been using a 2sec timeout; I should probably increase that to see if the packets are just very slow coming back.)
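The 1472/1473 boundary above follows from the header sizes. As a rough sketch (assuming a plain 20-byte IPv4 header with no options and the 8-byte ICMP echo header; the function names are made up for illustration), the arithmetic looks like this:

```python
IP_HEADER = 20    # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo request/reply header


def max_unfragmented_ping_payload(mtu: int) -> int:
    """Largest ping data size that fits in one IP packet of size `mtu`."""
    return mtu - IP_HEADER - ICMP_HEADER


def fragments_needed(payload: int, mtu: int) -> int:
    """Number of IP packets needed to carry a ping with `payload` data bytes."""
    total = ICMP_HEADER + payload      # what IP has to carry
    if total <= mtu - IP_HEADER:
        return 1                       # fits in a single packet
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return -(-total // per_fragment)   # ceiling division


print(max_unfragmented_ping_payload(1500))  # 1472
print(fragments_needed(1473, 1500))         # 2
```

So with an MTU of 1500 a 1472-byte ping is the largest that travels as a single packet, and 1473 bytes is the first size that has to be split in two.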
What are the line stats on your routers showing ? It doesn't sound completely like an ADSL/Line problem but the intermittent nature sort of points in that direction.
All numbers Downstream/Upstream:
SNR Margin: 26.0/28.0 dB
Line Attenuation: 57.4/31.5 dB
Errored Seconds: 74/174
Loss of Signal: 0/0
Loss of Frame: 0/0
CRC Errors: 210/273
Data Rate: 1152/288 kbps
Whilst those CRC errors aren't great (uptime is 3 days 4 hrs) they're not atypical for my office ADSL line. My home one is at 8Mbps (nice and close to the exchange!) but I can't get at the stats remotely.
NB: Restarting the router sometimes makes a difference for a minute or two after the restart, but no more than that. And of course it means any established SSH connections get dropped, so it's not really worthwhile. Both routers are pretty cheap-n-cheerful things (office one is Conexant chipset, not sure about home one but it's Safecom branded and I think they're usually Conexant too). I've updated firmware without effect. If I get chance I'll build an IPCop box, stick a USB ADSL modem in it, and see if that works any better.
When you are in a state of not being able to start a new ssh connection can you still telnet to port 22 of an external ssh server and see the server banner (or do you get connection timed out even doing that)
Next time it fails I'll try that. My guess would be that the connection will timeout even though an established connection to the same server will continue just fine, but I'm not 100% sure I've actually tried it to be sure.
I wrote:
Next time it fails I'll try that. My guess would be that the connection will timeout even though an established connection to the same server will continue just fine, but I'm not 100% sure I've actually tried it to be sure.
OK, conveniently enough it just failed on me.
Yes, I can still establish a new SSH connection to my server. I cannot, however, establish a web connection to a page on the same server.
Now, a few minutes later, the connection seems to have died altogether. SSH dropped, VPN dropped, etc. Router still thinks it has a good connection, but I can't get anything out of it now. Time for a router reboot methinks...
Hmm, the restart left my router IP-less. I swapped to the bt_test login, which worked, then back to my own, which has also now worked, but the connection is still "dead" (no SSH even). Pinging my server is fine with 1472-byte packets, though, and intermittent with 1473-byte packets. Ah, several minutes later it's back.
I'm going to create some test pages on my web server of varying sizes to see if they work when the link seems dead.
This is all pretty weird stuff I must admit
FYI pinging large packets to more-solutions.co.uk worked for me with very consistent results. I tried 1473 and larger with no problem and a quick reply.
It's interesting that you can get the ssh banner with telnet. I guess it is the fragmented packet thing not working that is stopping a proper ssh connection from being forged: the banner is probably less than one transmission unit, but a full key exchange to get to the login prompt probably isn't.
So it sort of looks like anything that involves data exchange over your MTU size fails, which sort of points to a very broken IP stack somewhere. But given that you have got this on two different machines, with two different connections via two different routers, I can't think why. Is there anything in common between the machines: same kernel? Same firewall configuration?
I have a larger box of straws somewhere we can try clutching at..bear with me while I try to find it :-)
Ahh here we go, extra large Tesco Value pack....Can you ping from the linux router ? Can you do the ping large packet test from there to isolate everything else ?
Wayne Stallwood wrote:
So it sort of looks like anything that involves data exchange over your MTU size fails, which sort of points to a very broken IP stack somewhere.
s/fails/is unreliable/
Ping tests "normally" (ie when things seem OK) show fairly high packet loss using large packets, say around 10-30% depending on the mood of the line. When it breaks I get 90-100% loss on large packets. In both cases I'll usually get 0-5% loss on small packets.
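One possible reading of these numbers, offered only as a simplified model: if every packet on the line is lost independently with the same probability (an assumption, and real ADSL errors tend to come in bursts), then a fragmented ping needs more packets to survive and so fails more often. The function name below is made up for illustration.

```python
def echo_success_probability(per_packet_loss: float, fragments: int) -> float:
    """Chance that a ping round trip succeeds.

    The request and the reply each travel as `fragments` packets, and
    losing any one of them loses the whole ping. Assumes independent,
    identical loss per packet, which is a simplification.
    """
    return (1.0 - per_packet_loss) ** (2 * fragments)


for loss in (0.01, 0.05, 0.10):
    small = 1.0 - echo_success_probability(loss, 1)
    large = 1.0 - echo_success_probability(loss, 2)
    print(f"per-packet loss {loss:.0%}: small ping loss {small:.1%}, "
          f"two-fragment ping loss {large:.1%}")
```

Under this model large pings lose roughly twice as often as small ones when loss is low. That would not by itself explain 90-100% loss, which suggests something beyond random line errors when the link "breaks".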
What are the correct MTU/MRU/MSS values? I have 1500/1500/1432 at work, but the MSS looks wrong to me. I'm not entirely sure what MSS affects, though.
But given that you have got this on two different machines, with two different connections via two different routers I can't think why. Are there any commons between machines, same kernel ? same firewall configuration ?
Nothing in common that I can put my finger on. At both ends I have a LAN, but currently my home LAN is just a Win2k PC and DVD player, plus the (Linux firmware) router. (I have three of my "home" Linux boxes in the office. Almost all my tests so far have been on the office line since that's where I seem to live these days; indeed it's where I am now.)
My office LAN has 6-10 Linux boxes of varying flavours, and a similar number of Win2k/XP machines scattered around. All suffer ADSL loss at the same time ("looks like the Internet's gone again" is a common phrase in the office right now). Ping tests I've been running from a RedHat box and a Win2k box, with no obvious differences in results.
According to http://usertools.plus.net/exchanges/mso.php?id=24202 there was a BT outage affecting "Peterborough MUX 001 & 002" tail end of September, and this affected both of my lines (based on searching for both exchanges at http://usertools.plus.net/exchanges/ and getting directed to the same faults page). So that gives me some common BT territory, although whether relevant I don't know.
I have a larger box of straws somewhere we can try clutching at..bear with me while I try to find it :-)
The bear will probably not be interested in the straws!
Ahh here we go, extra large Tesco Value pack....Can you ping from the linux router ? Can you do the ping large packet test from there to isolate everything else ?
Can't check from work but will do from home when I get there. I've just ordered another Linux firmware router (http://www.ebuyer.com/UK/product/90976) which I'll plug in at the office to test with when it arrives on Tuesday (if eBuyer deliver on time - a small possibility).
On Sat, Oct 07, 2006 at 02:57:03PM +0100, Mark Rogers wrote:
What are correct MTU/MRU/MSS values? I have 1500/1500/1432 at work, but MSS looks wrong to me. not entirely sure what MSS affects, though.
MTU should be set to 1458, it might seem to work with a higher setting, but some sites will break and you may see other intermittent problems.
BT (iirc) said some time ago they were going to fix the MTU "problem". So it could be that it works some of the time on the ATM backend, but breaks whenever a different router on the ATM network comes into play (for whatever reason) because you've got a higher MTU set. That could explain the behaviour you are seeing.
Adam
Adam Bower wrote:
MTU should be set to 1458, it might seem to work with a higher setting, but some sites will break and you may see other intermittent problems.
I've not seen 1458 mentioned anywhere before.
Most people seem to say 1500/1500/1460 (MTU/MRU/MSS) for BT, but I'm happy to try 1458 - what about MRU/MSS? I guess 1458/1418 respectively?
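For reference, the usual relationship between MTU and TCP MSS is MSS = MTU minus the IP and TCP headers. A minimal sketch, assuming 20-byte IPv4 and 20-byte TCP headers with no options (the function name is made up for illustration):

```python
def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """TCP maximum segment size that fits in one IP packet of size `mtu`.

    Assumes option-free IPv4 and TCP headers (20 bytes each).
    """
    return mtu - ip_header - tcp_header


print(mss_for_mtu(1500))  # 1460
print(mss_for_mtu(1458))  # 1418
```

By the same arithmetic, an MSS of 1432 would correspond to an MTU of 1472, which matches neither 1500 nor 1458.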
NB: Most of the routers I've tried have had default settings for a BT connection and usually they're 1500/1500/1460 I think, I just noticed that the MSS looks out on this one, probably left over from one of the many tests I've tried recently.
<edit>I've now seen reference to 1458 as BT's preferred MTU so I'm trying it now.</edit>
When tweaking my router's MTU settings, should I need to change my PC settings at all? Presumably the router will fragment packets as required? I guess dropping the network down to 1458 would improve throughput, but it should work if I don't, yes?
BT (iirc) said some time ago they were going to fix the MTU "problem", so it could be that on the ATM backend it is working sometimes but when a different router comes into play for whatever reason on the ATM network that it breaks because you've got a higher mtu set and this could explain the behaviour you are seeing.
Just to be clear: if I ping with a packet size of 1472 and fragmentation disabled, I get consistent success at MTU=1500 (although having said that maybe I should recheck to be sure). If your suspicion is correct, presumably I'd be losing packets above 1458 (ie 1430 ICMP packet size)?
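Finding the largest ping size that gets through with fragmentation disabled can be done as a binary search rather than by hand. The sketch below is only an outline: `probe` stands in for "send a don't-fragment ping of this size and report whether a reply came back", and `simulated_probe` is a made-up stand-in for a real network path so the logic can be tried offline.

```python
from typing import Callable


def largest_passing_payload(probe: Callable[[int], bool],
                            low: int = 0, high: int = 1472) -> int:
    """Largest payload in [low, high] for which probe(payload) is True.

    Assumes every size up to some limit passes and every larger size
    fails. Returns low - 1 if even the smallest size fails.
    """
    if not probe(low):
        return low - 1
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid          # mid fits, look higher
        else:
            high = mid - 1     # mid is too big, look lower
    return low


def simulated_probe(path_mtu: int) -> Callable[[int], bool]:
    """Hypothetical path: a ping fits if payload + 28 header bytes <= MTU."""
    return lambda payload: payload + 28 <= path_mtu


print(largest_passing_payload(simulated_probe(1458)))  # 1430
```

On a line with intermittent loss like this one, a real probe would need to send several pings per size and count any reply as a pass, otherwise a single dropped packet would skew the result.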
On Sat, Oct 07, 2006 at 06:05:11PM +0100, Mark Rogers wrote:
Adam Bower wrote:
MTU should be set to 1458, it might seem to work with a higher setting, but some sites will break and you may see other intermittent problems.
I've not seen 1458 mentioned anywhere before.
Most people seem to say 1500/1500/1460 (MTU/MRU/MSS) for BT, but I'm happy to try 1458 - what about MRU/MSS? I guess 1458/1418 respectively?
That's interesting, as the recommended value has been 1458 for over 3 years now, and this is a BT Wholesale recommendation, not just a random suggestion found on the net. I've never seen any advice that suggested otherwise (although googling for the "wrong" settings does reveal some people suggesting them, but with no good reason). The one time I had problems with lost/broken packets and DSL was when I'd forgotten to adjust the MTU down to 1458 on a box running IPCop with a USB Speedtouch modem.
NB: Most of the routers I've tried have had default settings for a BT connection and usually they're 1500/1500/1460 I think, I just noticed that the MSS looks out on this one, probably left over from one of the many tests I've tried recently.
Just out of interest, which routers? Again, every router i've used has come pre-configured to 1458 apart from a couple of old routers that were set to 1500 prior to the BT wholesale advice. I've never had to change (or bothered to change) MSS or MRU, or even look to what they are set at, which suggests I've not had a problem with them before.
<edit>I've now seen reference to 1458 as BT's preferred MTU so I'm trying it now.</edit>
Good good, let's see how you get on for now.
When tweaking my router's MTU settings, should I need to change my PC settings at all? Presumably the router will fragment packets as required? I guess dropping the network down to 1458 would improve throughput, but it should work if I don't, yes?
Nope, you shouldn't need to adjust the PC settings at all, the router should do the fragmentation (although, what routers are you using, I find it strange that you've found something with an "odd" default? I wonder if the hardware is doing something else or if anyone else who has the same router has had any problems).
Just to be clear: if I ping with a packet size of 1472 and fragmentation disabled, I get consistent success at MTU=1500 (although having said that maybe I should recheck to be sure). If your suspicion is correct, presumably I'd be losing packets above 1458 (ie 1430 ICMP packet size)?
Not necessarily, it might be a red herring and nothing to do with your problems at all. I just noticed that you were setting (what looked to me to be) "odd" MTU sizes and figured it would be best to start with and stick to the BT "recommended" size for now and see what happens as the only problems I ever had were when I had MTU set wrong which broke sites like ebay.co.uk and a couple of others, oh, and it totally screwed up a vpn too.
Thanks Adam
Adam Bower wrote:
That's interesting, as the recommended amount has been 1458 for over 3 years now, and this is a BT Wholesale recommendation, not just a random suggestion found on the net. I've never seen any advice that suggested otherwise.
I have no doubt that you're right, but every time I've Googled for "BT MTU MRU MSS" I've been directed to 1500 for BT and 1490 (IIRC) for AOL.
When I Google for 1458 I see where you're coming from, though!
NB: Most of the routers I've tried have had default settings for a BT connection and usually they're 1500/1500/1460 I think, I just noticed that the MSS looks out on this one, probably left over from one of the many tests I've tried recently. Just out of interest, which routers?
Hard to say since I don't usually check default settings, but I've never seen one set to something different when I have checked. As mentioned elsewhere I'm currently playing with a Safecom unit (cheap+cheerful) and an unbranded Conexant chipset box. We've used branded stuff in the past and if I ever checked the MTU it said 1500 (as I'd remember if it said something else) but I don't recall now which ones I have checked.
<edit>I've now seen reference to 1458 as BT's preferred MTU so I'm trying it now.</edit>
Good good, let's see how you get on for now.
On that point: I'm now running with 1458 at home and work. No idea how stable the work connection is since I'm now at home, but my home one is just as bad as before. (This email being sent via VPN to my work PC since my home PC is having problems).
This is where it gets interesting, though: I can now ssh into my router.
From there:
ping -s 1500 www.more-solutions.co.uk gives:
--- www.more-solutions.co.uk ping statistics ---
63 packets transmitted, 63 packets received, 0% packet loss
round-trip min/avg/max = 40.0/47.1/50.0 ms
However, the same command from my W2k box (now ping -l 1500 www.more-solutions.co.uk) gives:
Ping statistics for 212.69.210.250:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
Explain that! Snide comments about W2k vs Linux I can come up with myself! I need to bring a Linux box back home to try, but when I've sent this email I'll reboot this PC with a LiveCD and see whether that's better.
Nope, you shouldn't need to adjust the PC settings at all, the router should do the fragmentation (although, what routers are you using, I find it strange that you've found something with an "odd" default?
I'm sure 1500 is a very common default! Which routers have you found with defaults of 1458?
Of-course I have to be open to the possibility that the problems at home and at work have completely different causes, even if the symptoms are similar.
On Sat, Oct 07, 2006 at 11:04:44PM +0100, Mark Rogers wrote:
Explain that! Snide comments about W2k vs Linux I can come up with myself! I need to bring a Linux box back home to try, but when I've sent this email I'll reboot this PC with a LiveCD and see whether that's better.
Perhaps the W2k box is setting the DF (don't fragment) bit or something, so the router is just dropping the packet? Was that ping running over the VPN?
Nope, you shouldn't need to adjust the PC settings at all, the router should do the fragmentation (although, what routers are you using, I find it strange that you've found something with an "odd" default?
I'm sure 1500 is a very common default! Which routers have you found with defaults of 1458?
Anything Netgear, Speedtouch USB modems all seem to default to 1458 with Windows, a couple of brands I forget but I think possibly Dlink, Dynamode and 3com. It would be quite easy I guess given how many consumer routers are out there to find plenty that are set to 1500 by default.
Of-course I have to be open to the possibility that the problems at home and at work have completely different causes, even if the symptoms are similar.
Very possibly, try perhaps using ping and setting do not fragment bits and see what happens. I don't really have too much recent experience of fiddling lots with the innards of tcp/ip so i'm a bit out of ideas for now. I'll try re-reading the thread when I feel less ill perhaps.
Adam
Adam Bower wrote:
On Sat, Oct 07, 2006 at 11:04:44PM +0100, Mark Rogers wrote:
Explain that! Snide comments about W2k vs Linux I can come up with myself! I need to bring a Linux box back home to try, but when I've sent this email I'll reboot this PC with a LiveCD and see whether that's better.
Perhaps the W2k box is setting the DF (don't fragment) bit or something, so the router is just dropping the packet? Was that ping running over the VPN?
No, I've not done any tests over the VPN, I've just used the VPN to give me VNC access to a PC in the office (or vice versa) to do remote tests.
Right, some more test results...
I rebooted to a Mepis LiveCD (closest to hand) and got identical results. So I changed MTU back to 1500 and tried that, and that worked.
So I rebooted to W2k and it also seemed pretty reliable at 1500, so I left it overnight and found the router "dead" in the morning. (I could ping the router but not access its web front end, SSH into it, etc; no traffic was going through it either.) So I think I have a router issue at home, and have switched back to my old router (which is the same as the one in the office, an unbadged Conexant unit from Solwise). I left that at 1500 for the time being and it seems sort of OK, but not what I'd expect from an 8Mbps connection (based on what I had before). I therefore need to do some proper tests when I get chance, probably tomorrow evening or Tuesday.
Meanwhile, I've just popped into the office and (with 1458 as MTU) it seems OK right now, but that's almost meaningless; I'll have a better idea after it's been in use for a day tomorrow.
Anything Netgear, Speedtouch USB modems all seem to default to 1458 with Windows, a couple of brands I forget but I think possibly Dlink, Dynamode and 3com. It would be quite easy I guess given how many consumer routers are out there to find plenty that are set to 1500 by default.
Interesting, I'll get my colleagues to check which brands use what default in future.
Of-course I have to be open to the possibility that the problems at home and at work have completely different causes, even if the symptoms are similar.
Very possibly, try perhaps using ping and setting do not fragment bits and see what happens. I don't really have too much recent experience of fiddling lots with the innards of tcp/ip so i'm a bit out of ideas for now. I'll try re-reading the thread when I feel less ill perhaps.
OK, I'm going to do nothing for a day (mostly because I don't have much choice) and get a proper idea of what is really going on while I'm not changing anything, then try some more tests and document what happens properly this time around. I'm determined to work this out!!
I hope you feel less ill soon, and only in a small part for selfish reasons. Thanks for your help.
PS: Any good TCP/IP book recommendations? I know a fair bit but this exercise is highlighting just how much I don't know (or thought I know but don't).
On Sun, Oct 08, 2006 at 01:56:33PM +0100, Mark Rogers wrote:
OK, I'm going to do nothing for a day (mostly because I don't have much choice) and get a proper idea of what is really going on while I'm not changing anything, then try some more tests and document what happens properly this time around. I'm determined to work this out!!
Well, changing too much at once in the situation you're in probably won't help, as when it breaks again you'll get different problems/symptoms. This is why I suggested using tcpdump, as it gives you a proper record of what happened with various settings. Of course as you don't have root on the webserver it probably won't be too much help.
I hope you feel less ill soon, and only in a small part for selfish reasons. Thanks for your help.
Thanks, feeling lots better today, had a nasty cold which always makes my head feel fuzzy so I can think a bit better now.
PS: Any good TCP/IP book recommendations? I know a fair bit but this exercise is highlighting just how much I don't know (or thought I know but don't).
Discussed this on IRC just yesterday. The best book I've ever read on the subject has to be "TCP/IP Illustrated: The Protocols v. 1" by W. Richard Stevens, ISBN 0201633469. It was released in 1994, and I read it circa 2000. Don't let the age put you off, it's a _very_ good book, at least to get started. I've never read the following 2 volumes (one of which isn't by the same author iirc) but I've heard that they are not as good as the first volume. His other books on Unix programming are very good and still highly respected even if they sound a bit dated.
Adam
On 08-Oct-06 Adam Bower wrote:
[...] Discussed this on IRC just yesterday. The best book I've ever read on the subject has to be "TCP/IP Illustrated: The Protocols v. 1" by W. Richard Stevens, ISBN 0201633469. It was released in 1994, and I read it circa 2000. Don't let the age put you off, it's a _very_ good book, at least to get started. I've never read the following 2 volumes (one of which isn't by the same author iirc) but I've heard that they are not as good as the first volume. His other books on Unix programming are very good and still highly respected even if they sound a bit dated.
W. Richard Stevens was famous for the quality of his writings and teaching on Unix (especially networking issues).
He died in 1999 -- see an obituary at
http://dan.drydog.com/6bone/w.richard.stevens.obituary.html
but his resuscitated personal website is still alive:
(it was originally just www.kohala.com). You can see the list of his publications etc. there.
Ted.
-------------------------------------------------------------------- E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk Fax-to-email: +44 (0)870 094 0861 Date: 08-Oct-06 Time: 20:44:51 ------------------------------ XFMail ------------------------------
On Sunday 08 October 2006 15:06, Adam Bower wrote:
The best book I've ever read on the subject has to be "TCP/IP Illustrated: The Protocols v. 1" by W. Richard Stevens, ISBN 0201633469. It was released in 1994, and I read it circa 2000. Don't let the age put you off, it's a _very_ good book, at least to get started.
Indeed, they are invaluable reference books even today - It is just criminal that Norfolk Library Services saw fit to remove both Vol.1 & Vol.2 from their catalogue earlier in the year.
Regards, Paul.
Adam Bower wrote:
Well, changing too much at once in the situation you're in probably won't help, as when it breaks again you'll get different problems/symptoms. This is why I suggested using tcpdump, as it gives you a proper record of what happened with various settings. Of course as you don't have root on the webserver it probably won't be too much help.
I'm looking at the feasibility of this; I just need to set something up somewhere that does work that I can give myself access to from home/work. Normally of-course I'd just do something between home and work but that means working with two known-bad connections.
PS: Any good TCP/IP book recommendations?
Discussed this on irc just yesterday, best book i've ever read on the subject has to be "TCP/IP Illustrated: The Protocols v. 1" by W.Richard Stevens ISBN 0201633469,
Thanks, will look this up.
Had "TCP/IP Analysis and Troubleshooting Toolkit" (Kevin Burns, 0-471-42975-9) recommended to me elsewhere, which I just happened to find on eBay and which has already been delivered, so that gives me something to get started with.
In the meantime:
Having reduced MTU to 1458 the connection is noticeably better but still noticeably bad. From what I'm starting to understand, I'm not doing much to control the size of my received packets (even with MRU set to the same as MTU) so I wonder whether that's an issue. Actually knowing what data was getting through would of course help. I'm finding comments about "MSS clamping" which sound relevant but which currently mean nothing to me. For example: http://www.solwiseforum.co.uk/showthread.php?t=1902&page=2
Assuming I get two test PCs, one at a known good location, to which I have root access, what do I need to do to install tcpdump and set up a useful test? Are there any LiveCDs I could use? (It would make it easier to use someone else's hardware temporarily.)
Mark Rogers wrote:
Assuming I get two test PCs, one at a known good location, to which I have root access, what do I need to do to install tcpdump and set up a useful test?
OK, I now have root access to a machine with tcpdump available: what should I do next?
On Wed, Oct 11, 2006 at 09:06:16AM +0100, Mark Rogers wrote:
Mark Rogers wrote:
Assuming I get two test PCs, one at a known good location, to which I have root access, what do I need to do to install tcpdump and set up a useful test?
OK, I now have root access to a machine with tcpdump available: what should I do next?
You'll need to do something along the lines of
tcpdump -i eth0 -s 0 -w myoutputfile host $nameofdesktop
and then a
tcpdump -i eth0 -s 0 -w myoutputfile host $nameofremote
-i chooses the interface, -s 0 means grab the entire packet, and -w gives the name of the output file. Be careful if you are sending lots of data, as the output file could be large; also, if you have an ssh connection open you will log that too and end up in a kind of positive feedback loop if you are not careful! The host bit means only grab packets going to or from that remote machine; you can specify either an IP address or a domain name.
Then you really want to get copies of the output files onto your desktop to examine them in the program Ethereal (which has recently changed its name to wireshark due to a trademark problem). You can also use ethereal for packet capture btw, and there is a command line version. I can't really help too much with packet analysis as it has been a while since I last played with this kind of thing.
You can at least examine packet headers and payloads and see how big the packets are when they leave the machine, and how big they are at the other side. It will be a bit of a learning curve but I think the easiest way for you to make progress is to just try this and see what happens and ask for specific advice if you get stuck. Or alternatively you could perhaps make some packet captures available for others to look at online.
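If a quick list of packet sizes is all that's wanted before opening a capture in Wireshark, the file written by tcpdump -w can also be read directly. This is a rough sketch that assumes the classic pcap file format (24-byte file header, then a 16-byte header in front of each packet); the function name is made up, and it has not been checked against captures from the machines in this thread.

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def packet_lengths(stream: BinaryIO) -> Iterator[Tuple[int, int]]:
    """Yield (captured_length, original_length) for each packet record.

    Lengths include the link-layer header, so on Ethernet a 1458-byte
    IP packet would show up as 1472 bytes (14-byte Ethernet header).
    """
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("too short to be a pcap file")
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"   # written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"   # written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file")
    while True:
        record = stream.read(16)
        if len(record) < 16:
            return     # end of file
        _sec, _usec, caplen, origlen = struct.unpack(endian + "IIII", record)
        data = stream.read(caplen)   # skip over the packet bytes
        if len(data) < caplen:
            return     # truncated capture
        yield caplen, origlen


# Example use (file name taken from the tcpdump command above):
# with open("myoutputfile", "rb") as f:
#     for caplen, origlen in packet_lengths(f):
#         print(caplen, origlen)
```

Comparing the size lists from the two ends should show whether the large packets are the ones going missing.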
Thanks Adam
Adam Bower wrote:
You'll need to do something along the lines of
tcpdump -i eth0 -s 0 -w myoutputfile host $nameofdesktop
Thanks, I'm playing with that now.
[snip] and also if you have an ssh connection open you will log that and end up in a kind of positive feedback loop if you are not careful!)
Good point, wouldn't have thought of that! I can use my VPN to ssh from my home PC to set up the logging without the SSH session getting in the way, now that you've pointed that out.
Then you really want to get copies of the output files onto your desktop to examine them in the program Ethereal (which has recently changed its name to wireshark due to a trademark problem). You can also use ethereal for packet capture btw, and there is a command line version. I can't really help too much with packet analysis as it has been a while since I last played with this kind of thing.
I have Wireshark installed and have played with it before (it was Ethereal then, of course); I'll see how much I can work out for myself, then come back here with the stuff I get stuck with.
You can at least examine packet headers and payloads and see how big the packets are when they leave the machine, and how big they are at the other side. It will be a bit of a learning curve but I think the easiest way for you to make progress is to just try this and see what happens and ask for specific advice if you get stuck.
That's all good advice: much better to investigate for myself at first. Everything still points to this being a packet loss issue with large packets, though, so I suspect all I'll be able to do is get some traces from each end showing certain packets making it and others not; if nothing else a tcpdump .cap file sent to the ISP might get passed on to someone who won't tell me to try rebooting the router again :-) If I can just prove that there is a problem and get past the "nobody else has reported any problems" response that'll help!
I wrote:
I'm going to create some test pages on my web server of varying sizes to see if they work when the link seems dead.
Some test results from this:
I wrote a short PHP script which dumps as many chars as I tell it to in the URL, plus the HTTP headers of course.
Tested from home, I cannot access www.more-solutions.co.uk (homepage), but can access:
http://www.more-solutions.co.uk/test.php?bytes=10
http://www.more-solutions.co.uk/test.php?bytes=100
http://www.more-solutions.co.uk/test.php?bytes=1000
http://www.more-solutions.co.uk/test.php?bytes=1200
http://www.more-solutions.co.uk/test.php?bytes=1276
but cannot access:
http://www.more-solutions.co.uk/test.php?bytes=1277
.. and above.
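Finding that cut-off by hand is tedious, so here is a rough Python sketch of how the search could be automated with a binary search. It assumes every size up to some threshold works and every size above it fails, which may not hold on a flaky line. The names largest_working and fetch_ok are made up for this example; fetch_ok just requests the test.php page described above.

```python
import http.client
import urllib.request

def largest_working(probe, low, high):
    """Binary-search for the largest n in [low, high] where probe(n) is True.

    Assumes probe(low) succeeds and that success is monotonic in n.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid       # mid works, so the threshold is mid or higher
        else:
            high = mid - 1  # mid fails, so the threshold is below mid
    return low

def fetch_ok(n, timeout=10):
    """Hypothetical probe: True if the n-byte test page downloads in time."""
    url = "http://www.more-solutions.co.uk/test.php?bytes=%d" % n
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return True
    except (OSError, http.client.HTTPException):
        return False
```

Calling largest_working(fetch_ok, 10, 2000) while the link is misbehaving should home in on the largest size that still gets through. Because the fault is intermittent, it is probably worth running it a few times and checking that the answer is stable.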
If I use: wget -s http://www.more-solutions.co.uk/test.php?bytes=1276 .. I get a file of 1411 bytes (ie including HTTP headers etc). An MTU of 1458 means this is below the MTU with a bit of a margin, implying that the actual MTU limit should be lower still?
I am getting so confused!!
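The arithmetic behind that comparison can be sketched as below. The assumptions are mine, not measured: the whole 1411-byte response travels in a single TCP segment, the IPv4 header is 20 bytes with no options, and the TCP header is 20 bytes plus any options (the timestamp option, if enabled, adds 12 bytes with padding). The function names are made up for this example.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, minimum TCP header without options

def ip_packet_size(tcp_payload, tcp_options=0):
    """Size of the IP packet carrying tcp_payload bytes in one segment."""
    return IPV4_HEADER + TCP_HEADER + tcp_options + tcp_payload

def fits(tcp_payload, mtu, tcp_options=0):
    """True if the resulting IP packet is no larger than the MTU."""
    return ip_packet_size(tcp_payload, tcp_options) <= mtu

working = ip_packet_size(1411)              # bytes=1276 case: 1451
with_timestamps = ip_packet_size(1411, 12)  # same, with TCP timestamps: 1463
```

Under these assumptions the working response is a 1451-byte IP packet, comfortably inside an MTU of 1458, and a response one byte longer would still fit. With TCP timestamps the same response would be 1463 bytes, which is over 1458. So the numbers only make sense once the real header sizes are known, which is exactly what a packet capture at each end would show.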
On Sat, Oct 07, 2006 at 11:37:02PM +0100, Mark Rogers wrote:
If I use: wget -s http://www.more-solutions.co.uk/test.php?bytes=1276 .. I get a file of 1411 bytes (ie including HTTP headers etc). An MTU of 1458 means this is below the MTU with a bit of a margin, implying that the actual MTU limit should be lower still?
I am getting so confused!!
Can you use tcpdump (or ethereal, or whatever) to grab the entire packet on the remote server and also grab it on your local lan and see exactly how big it is?
Also, is this traffic going over real http or is it being encapsulated in the vpn stuff? Some VPNs set the Don't Fragment bit, iirc, and once you add the VPN header and an extra TCP header (or whatever) to the packet, you end up with a giant packet overall that can't be sent, or won't fit through the link without being repackaged, which then breaks the VPN (I think; I can't remember exactly what the mechanism is, but it is worth considering).
I think the first way forwards is both local and remote tcpdumps, if you can't work out the answer perhaps putting the dumps online would mean we could diagnose a bit for you?
Adam
Adam Bower wrote:
Can you use tcpdump (or ethereal, or whatever) to grab the entire packet on the remote server and also grab it on your local lan and see exactly how big it is?
I'll need some help but in principle yes I can do this.
Also, is this traffic going over real http or is it being encapsulated in the vpn stuff?
Real http. I didn't describe things very well: I was using VNC over the VPN to run the test from the office on my office PC, just without me being in my office. The http request was from my office PC to my web server, directly over ADSL.
I think the first way forwards is both local and remote tcpdumps, if you can't work out the answer perhaps putting the dumps online would mean we could diagnose a bit for you?
Can you describe in more detail what you need?
NB: I don't have root access to my web server; there's a lot I can do via SSH to it but it's not 100% under my control. If this becomes an issue I'll see if I can set something else up to test against; I'm only using my web server as a test because it's convenient. The problems are "global" for me.