There’s nothing like Incident Driven Development, even in your home office or homelab. Monday morning (Sep 23, 2024), as I was getting ready to hop into the shower, I started to have internet trouble. I’ve been having some small issues with the 6 GHz band covering my house, so at first I assumed this was the problem, but after swapping to the 2.4 GHz SSID, the issues persisted. It turned out that my AT&T fiber connection was offline - quite the rarity, as outside of power outages, I’ve never had a major outage. The initial status report gave an estimated resolution of 5 PM that day, which I interpreted as a “we’re still figuring things out, so we’ll Scotty Method the estimate” kind of thing. As the day went on, however, the Estimated Time for Resolution (ETR) of 5 PM became 6 PM, and then 10 PM. By the time we hit the third delay, my IT Spidey Sense was tingling. This wasn’t going to be a quick-ish fix, and given that I work from home, I needed a contingency plan.
Prior to installing fiber internet, I was a happy Spectrum customer. While I only had a 300/20 plan, I was mostly happy with the service they provided. I was still able to serve my homelab and work from home without issue. I even had a great customer service story with them. One day, in a rare occurrence, I was having issues getting to my work network, which was hosted in a datacenter not far from my house. However, all of our monitoring and alerts reported that our sites were online. Getting the ol’ network troubleshooting tools out, I discovered that there was a routing issue between Spectrum and the datacenter. I called the NOC, and they reported that they were aware of the issue, but things seemed to be slow going on Spectrum’s side. I was able to call the regular phone support line, and when talking to the technician, I explained the situation and provided my findings. The technician was able to escalate the issue to the Spectrum NOC, and within 30 minutes, the issue was resolved. I was impressed. If Spectrum could have offered symmetrical service, I honestly would have stayed with them.
However, when AT&T Fiber became available in my neighborhood, the price was right, getting symmetrical gigabit speeds was a huge boon in multiple ways, and it saved me a little money by offsetting the cost of one of the streaming services I was using, so it was hard to say no. I got my install completed, gave it a shakedown cruise for a few days, and then called Spectrum to cancel my service. I disconnected my cable modem, and stored it in my office closet. I never thought I’d have cable service again, but it turned out having it on hand would save my bacon.
At 6:30 PM on Monday, I rang up Spectrum sales. I had to sit through a slog of offers (sales folk are just doing their job, gotta let them try), but once I had settled on their cheap 100/10 plan, I was able to wire up my modem, talk with the activation tech, and I was online again in a matter of minutes. For $30/month for the next year, I have a new backup internet connection. It’s certainly not the fastest thing on the planet, but I’m not going to complain about the cost, nor the ease in getting it set up.
However, there’s no way in heck that I was going to lose an opportunity to have some fun and learn a thing or two here. I’m paying for a backup connection now, so I wanted to see how easy it was to set up automatic WAN failover. The router for my home network is a Mikrotik RB4011iGS+RM, a powerful little device that I purchased when I got my fiber connection. This was far from my first rodeo with Mikrotik hardware, and for those that haven’t used them before, you get a hell of a stateful firewall, ISP level routing protocol support, and a slew of other enterprise-grade features for significantly less than other enterprise-grade router firewall devices. They do come with a bit of a learning curve, however: the documentation can be a bit obtuse at times, and they don’t hold your hand when it comes to setting up advanced features. I’m slowly falling out of love with them, but I haven’t quite settled on what will be my next router, and it’s still a solid piece of kit.
Mikrotik has a feature called Netwatch, which is a fairly simple system for monitoring the status of a host or IP address. When an event is detected, you can trigger a script to run based on the state of the monitored host. To set up automatic failover, I needed to do a few things:
- Add the gateway IP address of my primary and secondary connections to Netwatch. My netwatch configuration for each interface is checking each address every 10 seconds, with a 1 second timeout. I may adjust this later, to avoid flapping between connections for brief outages, but things seem to be working well for now.
- Modify the DHCP configuration of any ISP connection to not populate the default gateway (while noting the default route that the DHCP connection would have provided).
- For any ISP connections offered via DHCP, I needed to add a static route to the default gateway that the DHCP connection would provide.
- For your primary connection, set the route metric to 1.
- For your secondary connection, set the route metric to something higher than 1 (I used 10, adjust this accordingly for your own network connection).
- On the primary internet connection, we need to define an up and down script. The up script enables the route to the primary connection’s default gateway, and the down script disables the route to the primary connection’s default gateway.
This basic configuration is all that is needed to set up Netwatch. You may need to adjust your interface configuration, ensuring that your network interfaces aren’t part of a switch group or bridge. Mikrotik’s versatility makes this very possible, but configuring interfaces as such is out of scope, as I already had two interfaces that were configured appropriately for this setup.
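To make the moving parts easier to picture, here is a minimal Python sketch of the logic described above. It is a toy model, not RouterOS configuration and not my actual scripts: the `Route` class, the `active_route` and `on_probe` helpers, and the gateway addresses (taken from the documentation ranges) are all made up for illustration. The idea is simply that the router uses the enabled default route with the lowest metric, and the Netwatch up and down scripts toggle the primary route.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    """A default route, modeled after the static routes in the steps above."""
    name: str
    gateway: str        # hypothetical gateway address
    metric: int         # lower wins: 1 for primary, 10 for secondary
    enabled: bool = True


def active_route(routes: List[Route]) -> Optional[Route]:
    """Return the enabled route with the lowest metric, or None if all are disabled."""
    candidates = [r for r in routes if r.enabled]
    return min(candidates, key=lambda r: r.metric, default=None)


def on_probe(route: Route, reachable: bool) -> None:
    """Stand-in for the Netwatch up/down scripts: enable or disable the watched route."""
    route.enabled = reachable


primary = Route("primary", "192.0.2.1", metric=1)
secondary = Route("secondary", "198.51.100.1", metric=10)
routes = [primary, secondary]

on_probe(primary, reachable=False)   # down script fires: traffic moves to the secondary
failover_choice = active_route(routes).name

on_probe(primary, reachable=True)    # up script fires: traffic returns to the primary
failback_choice = active_route(routes).name
```

Only the primary route is ever toggled in this sketch. The secondary stays enabled the whole time and is simply outranked by the lower metric whenever the primary is available.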
With this configuration in place, automatic failover and failback seem to work like a charm! I do need to do some additional testing, as when I tried to fail back this morning (while the primary connection was still not functioning properly), Layer 3 connectivity to the gateway was working just well enough to cause a false positive failback. I may need to probe the next hop instead, or perhaps a trusted endpoint on the internet, or perhaps change the Netwatch type to something other than “simple”.
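One generic way to tame both flapping and premature failback is hysteresis: require several consecutive probe results before changing state, and be pickier about coming back up than about going down. The sketch below is a hypothetical Python illustration of that idea, not a Netwatch feature and not something I have running on the router. The `FlapDamper` name and its thresholds are assumptions chosen for the example.

```python
class FlapDamper:
    """Only flip the link state after N consecutive probes disagree with it."""

    def __init__(self, down_after: int = 3, up_after: int = 6) -> None:
        self.down_after = down_after   # failed probes needed to declare the link down
        self.up_after = up_after       # good probes needed to declare it up again
        self.state_up = True
        self._streak = 0               # consecutive probes contradicting state_up

    def observe(self, probe_ok: bool) -> bool:
        """Feed in one probe result and return the (possibly updated) link state."""
        if probe_ok == self.state_up:
            # The probe agrees with the current state, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.up_after if probe_ok else self.down_after
            if self._streak >= threshold:
                self.state_up = probe_ok
                self._streak = 0
        return self.state_up


damper = FlapDamper(down_after=3, up_after=2)
states = [damper.observe(ok) for ok in [False, False, False, True, False, True, True]]
```

With a 10 second probe interval, the default thresholds here would mean roughly 30 seconds of failures before failing over and about a minute of clean probes before failing back. That trades a slower reaction for fewer false positives.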
Cool! Failover works! But I want to know when my internet connections are functioning, and when failover is happening. Fortunately, I have a tool in my homelab toolbox that I use for notifications of all kinds called ntfy. Ntfy is a wonderful tool, and it’s offered both as a public service and self-hostable. Not only can you send basic text notifications, it has support for priority, images and attachments, and it even has a facility for sending push notifications to your phone (even if you’re self hosting!). Sending a message with ntfy is as simple as sending an HTTP POST to the ntfy server with the appropriate information configured as either headers or a JSON payload. I updated my scripts to send a notification to my ntfy server when the primary connection goes down, and when the primary connection comes back up. I still would like information when the secondary connection goes down - I just haven’t gotten around to setting up those scripts yet. Mostly because Mikrotik’s scripting language breaks my brain a little bit, and even though I can likely copy-pasta the script I used for my primary connection, the thought of having to deal with that makes me grumpy, so I put it off for now. 😁
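For anyone who hasn’t used ntfy, here is roughly what such a notification looks like as an HTTP request, sketched in Python rather than in Mikrotik’s scripting language. This is not the script running on my router. The server URL, topic name, and message text are placeholders, while the `Title`, `Priority`, and `Tags` headers are the ones ntfy documents for publishing.

```python
import urllib.request

# Hypothetical values: substitute your own ntfy server and topic.
NTFY_SERVER = "https://ntfy.example.com"
NTFY_TOPIC = "homelab-wan"


def build_ntfy_request(message: str, title: str, priority: str = "default",
                       tags: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST that publishes a message to an ntfy topic."""
    headers = {"Title": title, "Priority": priority}
    if tags:
        headers["Tags"] = tags
    return urllib.request.Request(
        url=f"{NTFY_SERVER}/{NTFY_TOPIC}",
        data=message.encode("utf-8"),   # the request body is the notification text
        headers=headers,
        method="POST",
    )


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


down_alert = build_ntfy_request(
    "Primary WAN is down, failing over to the secondary connection.",
    title="WAN failover",
    priority="high",
    tags="warning",
)
# send(down_alert)  # uncomment once NTFY_SERVER points at a real server
```

Building the request separately from sending it keeps the sketch easy to inspect without a live server. On the router itself, the equivalent request would have to be issued from the Netwatch up and down scripts.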
As of the time of finishing this post, AT&T has (actually) restored service this time. While the outage page still reports that the incident is ongoing, I’m back in business on my primary connection. For right now, given the cost of the Spectrum connection, I’m going to keep it around. We’ll see whether I keep that connection after the promotional period ends, or go back to investigating a cellular failover solution, which is what I had been considering before Incident Driven Development encouraged me to set up something with the gear I already had on hand. Maybe I can convince Spectrum to keep me on the sweetheart deal since I’m not actively using the connection. That assumes that I even want to deal with the Retentions department to keep the price. 😁