The most confusing network problem ever!
I’ve spent the last week with a very odd networking problem, and all my attempts at figuring it out and fixing it have yielded more questions than answers.
I first noticed the problem after taking my network apart because I got a new Rogers VOIP cable box (Rogers Home Phone service) added, and I opted to hook it up myself so I could do a nice wiring job. I got that done and hooked everything back up just the way it was – firewall/server box, cable modem and router all powered on and connected just as they were before. Things seemed normal until I tried to go to windowsupdate.microsoft.com and my browser stalled halfway through loading it – refreshes didn’t help much; sometimes it would get a bit further, but it never got to loading the ActiveX part of it. Other sites were working fine though, which was really confusing.
I thought maybe Windows XP was pwned or something, so I tried Vista – it worked, which led me to think XP was in trouble – so I spent countless hours searching for anything – something – that I thought might be the cause. I tried fixing *everything*: network stacks, registry, hardware, drivers, WSH reinstall – you name it. Nothing worked. Then I noticed that uploads were very slow from both XP and Vista, so I got to tinkering with stuff to figure that out.
Eventually I came to messing with cables and I thought maybe my wireless router was to blame. I reset it to factory defaults and set it up again, but it made no difference. So then I tried tinkering with how my network was routed. My network was set up as such: Internet -> cable modem -> Linux firewall box -> router -> workstations. That configuration had been working fine until now. After trying various different setups, I found that the following all work fine: Internet -> cable modem -> workstation; Internet -> cable modem -> router -> workstations; Internet -> cable modem -> router -> Linux firewall box -> workstation.
That final configuration is what really has me boggled – if the router itself works, and the workstation to the server to the router works, why doesn’t the workstation to the router to the server work? I thought maybe it was some upgrade I had done to the box (which is running Gentoo), so I rebuilt every program on the system, recompiled the kernel, made sure iptables (which normally does masquerading from one NIC out the other to the cable modem) was set up right and even tried upgrading and downgrading iptables versions – none of it changed the problem. I even checked stuff like IPv4 forwarding, and made sure ECN was disabled and not compiled into the kernel – I honestly don’t know what else it could be. Yes, I *even* made new cables just in case the ones I had were starting to fray a bit at the ends.
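For anyone wanting to repeat the IPv4 forwarding and ECN checks mentioned above, here is a minimal sketch in Python. It assumes a Linux box that exposes `ip_forward` and `tcp_ecn` under `/proc/sys/net/ipv4` (a kernel built without ECN support may not have the `tcp_ecn` file at all); the helper names `read_flag` and `describe` are made up for illustration and are not part of any tool.

```python
from pathlib import Path

# Standard location of the IPv4 sysctl files on Linux.
PROC_NET = Path("/proc/sys/net/ipv4")


def read_flag(name, base=PROC_NET):
    """Return the integer stored in a sysctl file such as ip_forward."""
    return int((base / name).read_text().strip())


def describe(base=PROC_NET):
    """Summarise the two settings checked in the post (hypothetical helper)."""
    forwarding = read_flag("ip_forward", base)  # 1 means the box routes packets
    ecn = read_flag("tcp_ecn", base)            # 0 means ECN is switched off
    return {
        "forwarding_enabled": forwarding == 1,
        "ecn_disabled": ecn == 0,
    }
```

Calling `describe()` on the firewall box should report forwarding as enabled if masquerading is going to work at all; the `base` argument is only there so the function can be pointed at a test directory.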
So, I’m sitting here now, using my wireless router to handle the cable modem DHCP, with my Linux server just hanging off the router along with the workstations – not the setup I want, but at least sites are working. I just hope someone reads this and has a clue as to what the hell might be going on to cause this problem! It’s driving me f-ing nuts and I want it figured out so I can fix it!
*UPDATE*
After giving up on the problem following a week or so of messing around, I finally went and baselined the server to get a fresh Gentoo install on it, since it really did need one, and I kinda hoped that it would fix the problem as well. Unfortunately it didn’t – it was just as bad as before… however, this gave me the final piece of the puzzle to really nail down what the hell was going on (oh, and I had baselined my workstation too with a fresh install of Vista Ultimate).
So, I had 2 fresh OS installs, a network cable I knew worked, a router I knew worked… the only thing left was the network cards in use. I was pretty sure the one in my workstation was OK since it worked fine without the server in the way. As for the server – when I put it all back the way I wanted, I noticed that I’d experience lag “spikes” on my SSH terminal (traffic to and from the net seemed to be on/off as well, and consistently *slow*). The only odd thing was that neither network card on the server was reporting *any* errors.
Outbound traffic to the Internet from the server itself seemed fine, and combined with the local lag spikes I was encountering… I figured perhaps it was the local network card on the server, despite the lack of errors. They were both $10 Realtek NICs I bought over 6 years ago, and I’d had one die on me unexpectedly before in my workstation (and then it decided to work again months later when I forgot it was busted and swapped it into a system). So, since my workstation had onboard networking and Vista had drivers for it, I took the other NIC out of the workstation, removed the local NIC from my server and replaced it with the one from my workstation. Power everything back on, check to make sure things are running right and presto!
So, next on my shopping list are two new $23 (ooh, I’m upgrading in price!) D-Link 530s to replace those crappy Realteks. I’m wondering if I’ll notice an overall speed increase from swapping them both out too – that’d be a nice little bonus.
*2nd UPDATE*
It’s been a while and things were going OK, however I started noticing that my ISP POP3 account was lagging out horribly whenever I’d check for email, and uploads were going slow again. I thought “crap, the damned problem is still there”, which it was – it never really went away. I went out and got 2 replacement D-Link DGE-530T cards (gigabit NICs) which also support VLANs and other fancy stuff. Initially I had just the one card and replaced the outbound NIC, hoping it was just that one (the one I hadn’t replaced in the earlier swap). It didn’t seem to make any difference. I then got the other NIC and swapped it in, and even switched the power supply for a beefier one I had lying around from an old system (I had to cannibalize a replacement fan for it since I had yanked the one out of the new power supply for a case fan in my desktop). I figured two new cards and a stable power supply should do the trick… but nope, still didn’t make a difference!
I figured the last thing I could play with were the MTU settings – I had noticed a while back that my outbound card on the server was getting an MTU of 576 from the cable modem, while my internal LAN had its usual 1500. I thought that Linux would just gracefully handle the larger packets, split them up accordingly and deal with it for me – I mean after all, this problem hadn’t shown up in all the time I’d had the cable modem. Even so, I figured what the hey – chop off 16 bytes (8 for the wireless router, another 8 for the server – just to be sure) and give it a whirl. So I set my server’s internal connection to have an MTU of 560, and the same for the router’s WAN port which goes to the server. Give it a shot and what do you know – full upload speed!
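To put rough numbers on the MTU mismatch, here is a small Python sketch of the arithmetic. It assumes plain 20-byte IPv4 and TCP headers with no options, and the function names `tcp_mss` and `fragment_payloads` are just made up for this example. It only shows how a packet would be split if fragmentation is allowed; it doesn’t claim to explain exactly why my traffic was stalling.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet at this MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fragment_payloads(packet_size, mtu):
    """Payload sizes of the IPv4 fragments needed to carry one packet.

    Every fragment except the last must carry a multiple of 8 bytes,
    because fragment offsets are counted in 8-byte units.
    """
    payload = packet_size - IPV4_HEADER
    if packet_size <= mtu:
        return [payload]  # fits as-is, no fragmentation needed
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk)
        payload -= chunk
    return sizes
```

Under those assumptions, `tcp_mss(1500)` is 1460 while `tcp_mss(576)` is only 536, and `fragment_payloads(1500, 576)` gives three fragments carrying 552, 552 and 376 bytes. So every full-size packet from the LAN either gets chopped into three on its way out the 576-byte link, or has to be resent smaller.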
So, after *all* this headache, all I had to do was tweak my MTU settings! I figure that when they came and installed my VOIP home phone service they changed the cable modem’s MTU setting; either that, or it’s just coincidence that on that day the ISP lowered the MTU setting enough to cause a noticeable problem.