Which Managed Switch? Part 2
We are often consulted on which managed switch series to use for a project. Sometimes this is part of a much larger greenfield design, and other times it’s an informal direct message asking, “Hey Josh, what do you think of XYZ switch?” Anyone who has ever asked me that question probably wasn’t prepared for the waterfall that followed. I stand firmly in the position that there is no one-size-fits-all answer, so in this series I’ll attempt to address some of the nuances.
Part 1 covered considerations for selecting a switch for IT vs OT networks. Here in Part 2, I’m specifically zooming in on switch selection for Industrial/SCADA/PCS/DCS/[OT] networks.
Vendors have increasingly muddied the waters with product offerings ranging from “light” to “basic” to “fully” managed. These categories are vendor-specific and have no objective standard for comparison. Below, I will highlight the specific product features we most frequently find valuable, along with some additional key considerations. In our experience, even the most mature OT networks typically use under 20% of the features available, so let’s try to narrow the focus.
1) Diagnostics:
Many people see the network as a black box, treat it as “plug and pray”, and when problems occur aren’t sure where to look. In fairness, on an unmanaged switch, you are operating blind. Anecdotally, the overwhelming majority (some statistics say as high as 70-80%) of network problems are Layer 1 or physical issues. One of the major benefits of a managed switch is being able to view port statistics, particularly CRC errors and collisions, as this can point you quickly towards a root cause and fix: physical termination or potential EMI/RFI for increasing CRCs and duplex configuration for increasing collisions. N:1 Port Mirroring is another important capability as it allows feeding traffic to monitoring tools such as Wireshark for detailed packet analysis. We share multiple examples of leveraging this capability to shorten troubleshooting and resolution in our OT networking training classes.
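As a minimal sketch of the triage logic above, assume you have two snapshots of a port's error counters, taken some time apart by whatever means your switch offers (CLI, GUI, SNMP). The Python below compares them and suggests where to look first. `PortCounters` and `diagnose` are hypothetical names for illustration, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCounters:
    """Snapshot of cumulative error counters for one switch port."""
    crc_errors: int
    collisions: int


def diagnose(before: PortCounters, after: PortCounters) -> list[str]:
    """Return likely causes based on which counters increased between snapshots."""
    hints = []
    if after.crc_errors > before.crc_errors:
        hints.append("CRC errors increasing: check physical termination and EMI/RFI")
    if after.collisions > before.collisions:
        hints.append("Collisions increasing: check duplex configuration")
    return hints


if __name__ == "__main__":
    # Hypothetical counter values for illustration only.
    earlier = PortCounters(crc_errors=10, collisions=0)
    later = PortCounters(crc_errors=250, collisions=0)
    for hint in diagnose(earlier, later):
        print(hint)
```

The point is that the trend matters more than the absolute value: a counter that is non-zero but static is history, while one that keeps climbing is a live problem.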
2) 802.1Q VLAN:
Almost every network we encounter in OT is “flat” - one big network in a single subnet and broadcast domain. This normally starts out OK and gets worse over time as the system grows. Segmentation through the use of VLANs is one of the best technologies OT networks can benefit from for increased resilience. Most managed switches have 802.1Q VLAN support, so my point here isn’t just inclusion of the feature but, more importantly, how VLANs are configured. Compare the effort required to configure a single Access VLAN port, to configure a VLAN Trunk, and especially to add a new VLAN and make it available across the entire network; you will likely find that these tasks are meaningfully simpler on some platforms than on others.
3) Layer 2 Redundancy Protocols:
Most managed switches will support one or more Spanning Tree variants, whether industry standard (RSTP, MSTP) or proprietary (PVST, Rapid PVST+), all of which I pretty equally despise in OT networks. We can and often do make one of those work if we have to, but generally speaking, multiple characteristics of Spanning Tree IMO make it a bad fit for OT. Beyond STP, most vendors will offer one or more “ring” technologies and possibly some “chain” or “segment” technologies as well, which can be very useful. The best solutions in my view, however, will also offer various options to connect, for example, two rings together in a redundant fashion through some type of ring “coupling”. When a vendor lacks this last option, you may find the lack of flexibility frustrating when trying to deploy your ideal redundant network architecture. It’s also worth comparing the behavior of these technologies as you configure them. I find some more annoying than others because they create more extraneous communication disruptions during initial commissioning. While this technically only matters for “live” production network conversions, I will always prefer whichever version causes the fewest disruptions. I love being able to add redundancy to a network without having to ask for production downtime to make it happen.
4) IGMP Snooping and Querying:
Improper multicast management is possibly the most frequent and certainly the most impactful OT network killer I run into. Years ago, many automation vendors, particularly Allen-Bradley/Rockwell, implemented multicasting as part of their communication options. This is often presented to the automation engineer/PLC programmer as “implicit” or “produce/consume” messaging that is simple and easy to configure vs. legacy “explicit” messaging. And multicasting on a properly configured managed network absolutely has substantial benefits for the producers of data AND the network itself in terms of efficiency and utilization. The problem is that most networks are NOT set up properly for it, and the results range from annoying (laggy HMIs) to disastrous (automation devices including PLCs locking up, tipping over, crashing, etc. until they are power cycled). IGMP Snooping and Querying is THE critical managed switch feature that, when properly configured, will facilitate proper distribution of multicasts on your network to maximize its potential and avoid the productivity-crushing flooding associated with its absence.
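To illustrate the difference, here is a toy Python model of how a switch picks egress ports for a multicast frame. It is a deliberately simplified sketch, not any vendor's implementation: real snooping switches also forward to querier/router ports and age out memberships, which is why a querier must be present to keep the membership table refreshed. The function name, port numbers, and group address are all hypothetical.

```python
def egress_ports(all_ports, ingress_port, group, memberships, snooping_enabled):
    """Return the set of ports a multicast frame for `group` is sent out of.

    memberships maps a multicast group address to the set of ports that
    reported interest in it (what a snooping switch learns from IGMP reports).
    """
    candidates = set(all_ports) - {ingress_port}
    if not snooping_enabled:
        # No snooping: multicast is treated like broadcast and flooded.
        return candidates
    # Snooping: forward only to ports that joined this group.
    return candidates & memberships.get(group, set())


if __name__ == "__main__":
    ports = range(1, 9)                 # hypothetical 8-port switch
    joined = {"239.192.1.1": {3, 5}}    # hypothetical group and member ports
    print(sorted(egress_ports(ports, 1, "239.192.1.1", joined, False)))
    print(sorted(egress_ports(ports, 1, "239.192.1.1", joined, True)))
```

In this example the flooded frame is sent out seven ports, while the snooped frame reaches only the two that joined. Multiply that by every producer on a flat network and the devices that never asked for the traffic are the ones that suffer.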
5a) Firmware Upgrades - General:
Some vendors have multiple firmware variants available for a given switch. These variants offer increasing levels of functionality at higher price points. With some vendors, you can purchase at one level and later purchase a license to field-upgrade the switch, while with others your decision at time of purchase is what you are stuck with. Oddly, despite this, I have witnessed someone accidentally downloading and installing Layer 2 firmware onto a Layer 3 switch. This killed all operational routing on the Layer 3 switch, which had a major production impact until this bizarre cause was identified. I was shocked to learn that a switch that couldn’t be field-upgraded could be field-DOWNgraded by accident. Meanwhile, other vendors ask you to make a single choice at purchase, Layer 2 or Layer 3, and then share the same firmware package across the entire product line so such mistakes aren’t in play. And what about version upgrades over time - are they free or do they require some paid support contract? Lots of mines in this category so buyer beware!
5b) Firmware Upgrades - Time:
I am accustomed to upgrade times of a few minutes across multiple vendors, and this is an important factor in being able to keep firmware up to date, let alone accomplish multiple rounds of firmware upgrades through very large projects with longer commissioning schedules. Unfortunately, however, some industrial switches have SUBSTANTIALLY longer firmware upgrade times, as long as 45-60 minutes. Speaking plainly, I’m not sure how anyone finds this acceptable. You can build maintenance plans around any amount of time as long as it is known, but on too many occasions I’ve seen upgrade times this long used as a reason to avoid upgrades altogether.
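To put those numbers in perspective, here is a rough Python calculation of the total upgrade time for a site. The switch count of 40 and the 3-minute figure are assumptions chosen purely for illustration, the 60 minutes is the upper end mentioned above, and the sketch assumes switches are upgraded one at a time.

```python
def total_upgrade_hours(switch_count: int, minutes_per_switch: float) -> float:
    """Total hours to upgrade every switch sequentially, one at a time."""
    return switch_count * minutes_per_switch / 60


if __name__ == "__main__":
    switches = 40  # hypothetical site size
    print(total_upgrade_hours(switches, 3))   # fast platform: 2.0 hours
    print(total_upgrade_hours(switches, 60))  # slow platform: 40.0 hours
```

One of those fits in a single maintenance window. The other is a full working week per round of upgrades, before counting reboots or verification.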
5c) Firmware Upgrades - Regression Testing:
Working with countless managed switches over the last decade+, I’ve noticed a HIGH degree of variability in firmware regression testing between vendors. Some seem to test firmware thoroughly before release and as a result rarely have major bugs that affect basic operation. The results from others, however, clearly suggest that testing must be extremely limited, as on multiple occasions we have encountered bugs that raise the question “Did anyone even try this out in a lab environment?”
Everyone has bugs and on occasion has significant ones; I am not here to imply otherwise. But if you are evaluating switches, one exercise I would strongly recommend is pulling the entire release note history for the switch’s firmware and spending just 10 minutes looking at the version release dates, bug fix descriptions and known anomaly entries. This exercise makes it fairly easy to recognize which vendors do more testing prior to release.
6) Boot Times:
Similar to the upgrade time discussion above, I am accustomed to switches with boot times ranging from 30 seconds to maybe 2 minutes on the high end. There are others, however, with boot times closer to 7 minutes. Shall we take a guess which I prefer? It would be one thing if the switches with 7-minute reboots didn’t require reboots often, but they also seem to have a pesky habit of requiring reboots for all sorts of occasions. Granted, this is a whole separate issue unto itself, but the two compounded lead to quite the OT-unfriendly situation. Just. No.
7) Management Options:
Consider what your preferred management method will be (Command Line Interface, Graphical User Interface, Network Configuration or Network Management Software) and evaluate this interface against frequent configuration, maintenance, and security tasks. If you plan to manage through individual switch GUIs, compare the performance and capability of the GUI. We’ve observed that some GUIs are highly responsive and reliable while others can be laggy and, worse yet, have wildly inconsistent results in applying configurations. Refer to the firmware release notes mentioned in 5c above for some examples. Further, check if the vendor has a Network Configuration or Network Management software package to aid in configuration and monitoring. These packages can make complex and time-consuming tasks simple and automated/scheduled, and some can even integrate alarming and visualization of network health into your SCADA system. These are some of my favorite projects.
Lastly, consider whether you require quick device replacement by non-technical personnel. The idea here is a 2AM switch failure impacting production. Do you have staff on-site at 2AM who can load configuration onto your shelf spare, or would you like the ability to restore a shelf spare from flash media (USB, SD card, etc.)? Think about a non-technical user being able to pull an SD card out of the failed unit, swap in the spare, place the same SD card back in, and power up. Keep in mind these options vary widely, as does the way people handle them (does the USB stick stay installed 24/7 or is it sitting in a desk potentially missing config updates?). Also be sure to actually test that a vendor’s flash media restore option WORKS, as we have certainly encountered situations where it does NOT (firmware bug, etc.).
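On the “missing config updates” risk: if you can export both the running configuration and the copy on the flash media as text, a simple digest comparison will tell you whether the backup has gone stale. This is a minimal Python sketch under that assumption. The function names are hypothetical, and how you obtain the two config texts depends entirely on your platform.

```python
import hashlib


def config_digest(config_text: str) -> str:
    """SHA-256 digest of a configuration, ignoring trailing whitespace per line."""
    normalized = "\n".join(line.rstrip() for line in config_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def backup_is_current(running_config: str, backup_config: str) -> bool:
    """True if the backup copy matches the running configuration."""
    return config_digest(running_config) == config_digest(backup_config)


if __name__ == "__main__":
    # Made-up config lines for illustration; not any vendor's syntax.
    running = "vlan 10\nvlan 20\nvlan 30\n"
    backup = "vlan 10\nvlan 20\n"
    print(backup_is_current(running, backup))
```

A check like this only proves the backup matches; it does not prove the restore works, which still needs the hands-on test described above.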
8) Security:
Security features on industrial switches are evolving at a rapid rate, which is a great change to see. At this time I don’t have any SINGLE security feature to call out but rather recommend first developing a security policy and then evaluating whether the policy can be enforced with the selected switch. A partial list of potential security features includes: MAC-based Port Security, Port-based Access Control with 802.1X, Guest/unauthenticated VLAN, Integrated Authentication Server (IAS), RADIUS VLAN Assignment, Denial-of-Service Prevention, DoS Prevention Drop Counter, VLAN-based ACL, Ingress VLAN-based ACL, Basic ACL, Access to Management restricted by VLAN, Device Security Indication, Audit Trail, CLI Logging, HTTPS Certificate Management, Restricted Management Access, Appropriate Use Banner, Configurable Password Policy, Configurable Number of Login Attempts, SNMP Logging, Multiple Privilege Levels, Local User Management, Remote Authentication via RADIUS, User Account Locking, Password change on first login.
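One way to make that evaluation concrete, sketched here in Python on the assumption that you can reduce both your policy and each datasheet to a list of feature names, is to compare the two as sets and see what is missing. The switch names and the features assigned to them below are made up for illustration.

```python
def missing_features(policy_requirements, switch_features):
    """Return the policy requirements a switch cannot enforce, sorted by name."""
    return sorted(set(policy_requirements) - set(switch_features))


if __name__ == "__main__":
    # Hypothetical policy: features your security policy says must be enforceable.
    policy = {
        "Port-based Access Control with 802.1X",
        "Remote Authentication via RADIUS",
        "Audit Trail",
        "Configurable Password Policy",
    }

    # Hypothetical candidates; real feature lists come from vendor datasheets.
    candidates = {
        "Switch A": policy | {"MAC-based Port Security"},
        "Switch B": {"Port-based Access Control with 802.1X", "MAC-based Port Security"},
    }

    for name, features in candidates.items():
        gaps = missing_features(policy, features)
        print(name, "meets policy" if not gaps else f"missing: {gaps}")
```

Starting from the policy rather than the datasheet keeps the comparison honest: a long feature list is irrelevant if the three things you actually require are not on it.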
9) Support:
This one is tough to confirm until you try, but you could ask industry colleagues for their experience. How is the vendor’s (and/or that vendor’s distributor/VAR/partner you will be purchasing from) support both commercially and technically? Can you easily get what you need when you need it? While listed last in this post, this one could have a tremendous impact on your overall experience so make sure this question is appropriately weighted.
As mentioned in previous product selection posts, we appreciate these considerations are not trivial and require a lot of investigation to answer on both the internal and equipment vendor sides. Having gone through this gauntlet many times, we can certainly help accelerate and add clarity to the process. Don’t hesitate to reach out if a consultation would help for your next project!