Posted Saturday, February 21, 2015 in Network Engineering
As companies grow, so do their information needs. More offices can mean more wide area network interconnects. More data centers, servers, and higher-bandwidth applications can mean more network devices with a complex mesh of differing networking protocols. Network administrators can easily find themselves awash in an unmaintainable mass of network devices and configurations. It is therefore important to understand where network scalability can be hindered by administrative overhead.
One straightforward approach to managing a network device is through a console cable and terminal session, directly inputting configurations through the device's command line interface. This approach can be perfectly acceptable for a small number of network devices, but what happens when these devices are no longer physically accessible? What happens when there are no longer a few network devices but several hundred? Engineering solutions to these questions and more requires an automated and centralized approach.
The first step to understanding the engineering undertaking required to manage networking at scale is to consider the components of networking devices. One concept often referenced when considering software-defined networks is that of the control plane and data plane (Kirkpatrick, 2013). The basic idea of this concept is that the control plane handles network routing decisions whereas the data plane handles the forwarding (switching) of traffic. Separating the control plane and the data plane, at least conceptually, can be an important management consideration. Being able to decouple the control plane and data plane from a single device into a centralized management architecture is a core concept of software-defined networks (Fundation, 2013). The same concepts underpinning software-defined networks can also be implemented in decentralized network architectures. In a decentralized network the control plane and data plane still exist but in the form of layer 3 switches and routers. Layer 3 refers to the network layer of the Open Systems Interconnection (OSI) model, which in TCP/IP networks performs IP routing between different IP subnets.
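The split between the two planes can be illustrated with a minimal Python sketch. It is a toy model, not how any particular device works: the function names `build_forwarding_table` and `forward`, the next-hop labels, and the routes are all hypothetical. The control plane step builds a forwarding table once; the data plane step only performs a longest-prefix lookup against it.

```python
import ipaddress

def build_forwarding_table(routes):
    """Control plane: turn (prefix, next_hop) routes into a lookup table,
    sorted so that the most specific prefix is checked first."""
    table = [(ipaddress.ip_network(prefix), next_hop) for prefix, next_hop in routes]
    return sorted(table, key=lambda entry: entry[0].prefixlen, reverse=True)

def forward(table, destination):
    """Data plane: longest-prefix match of a destination against the table."""
    address = ipaddress.ip_address(destination)
    for network, next_hop in table:
        if address in network:
            return next_hop
    return None  # no matching route: the packet would be dropped

# Hypothetical routes and next-hop labels, for illustration only.
table = build_forwarding_table([
    ("10.0.0.0/8", "core-router"),
    ("10.1.2.0/24", "branch-switch"),
    ("0.0.0.0/0", "internet-gateway"),
])
print(forward(table, "10.1.2.7"))    # branch-switch
print(forward(table, "10.9.9.9"))    # core-router
print(forward(table, "192.0.2.10"))  # internet-gateway
```

In a software-defined network the first function would conceptually run on a central controller and the second on each forwarding device; in a decentralized network both run on the same router or layer 3 switch.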
Decoupling the control plane and data plane, at least conceptually, can assist in formulating management strategies for networking at scale. When a network engineer was interviewed, they reported that the two main management techniques in use were the Simple Network Management Protocol (SNMP) and IP Flow Information Export (IPFIX). The interviewee (see Appendix A) remarked on using SNMP for interface management, a data plane configuration, and using NetFlow (Cisco's proprietary flow export protocol, on which IPFIX is based) for traffic inspection, a control plane level inspection.
Management of a large number of network devices can be difficult, especially when multiple vendors and network operating systems are involved. Command line access to a network device, either through a direct console connection or through a remote secure shell (SSH) connection, can require different commands and configurations according to the device's operating system. A universal approach to managing network devices can be achieved through the use of SNMP. SNMP allows for both management and monitoring of network devices through management information base (MIB) object identifiers (OIDs), which can be shared across operating systems and vendors for common functionality (Stallings, 2007). Shaffi & Al-Obaidy (2013) wrote that use of SNMP has drastically reduced administrative effort compared to direct command line management. Management of network devices can be centralized and supplemented by a network management system utilizing SNMP. The interviewee spoke of the use of Cisco's LAN Management Solution (LMS), which utilizes SNMP to perform management functions on a large scale. Devices can be cataloged and configurations conformed to an organization-wide standard, further simplifying management and troubleshooting.
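As an illustration of why OIDs make a vendor-neutral approach possible, the sketch below lists a few well-known objects from the standard system group and interface table (MIB-II / IF-MIB) and builds the list of OIDs a poller might request from any device. The helper names `interface_oid` and `poll_plan` are hypothetical, and no SNMP library is used; the sketch only constructs identifiers.

```python
# Standard MIB-II / IF-MIB object identifiers, common to most vendors.
SYSTEM_OIDS = {
    "sysDescr": "1.3.6.1.2.1.1.1.0",
    "sysUpTime": "1.3.6.1.2.1.1.3.0",
    "sysName": "1.3.6.1.2.1.1.5.0",
}
IF_TABLE_COLUMNS = {
    "ifDescr": "1.3.6.1.2.1.2.2.1.2",
    "ifOperStatus": "1.3.6.1.2.1.2.2.1.8",
    "ifInOctets": "1.3.6.1.2.1.2.2.1.10",
    "ifInErrors": "1.3.6.1.2.1.2.2.1.14",
    "ifOutOctets": "1.3.6.1.2.1.2.2.1.16",
}

def interface_oid(column, if_index):
    """Return the OID instance for one column of one interface row."""
    return f"{IF_TABLE_COLUMNS[column]}.{if_index}"

def poll_plan(if_indexes):
    """List every OID a poller would request for one device."""
    oids = list(SYSTEM_OIDS.values())
    for if_index in if_indexes:
        oids.extend(interface_oid(column, if_index) for column in IF_TABLE_COLUMNS)
    return oids

print(interface_oid("ifInOctets", 3))  # 1.3.6.1.2.1.2.2.1.10.3
print(len(poll_plan([1, 2])))          # 13
```

The same plan could be sent to devices from different vendors, which is the property a network management system relies on when cataloging a mixed fleet.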
Shaffi & Al-Obaidy (2013) listed the uses of SNMP as ranging from configuration and fault management to security and accounting. An important property of SNMP is that it can be used programmatically by network management systems to perform a wide variety of functions. Other systems can also utilize SNMP to perform supplementary management functions, such as inventory and licensing control, by querying network device information. SNMP can be compared to an application programming interface (API) in that it provides remote access to dissimilar systems. Non-network devices such as servers and sensor devices can also utilize SNMP in a similar fashion, such that a central monitoring system may monitor all devices on a network.
SNMP can be used for both control plane and data plane management. While there are some limitations in using SNMP to monitor IP flow information, Antoniades, Hu, Sim and Dovrolis (2013) found that SNMP can still be used to infer IP flow information from the aggregation of link performance. There are also limitations for SNMP-based management of software-defined networks. In the decoupled control plane and data plane paradigm, an access method is needed to allow the data plane to be managed by a centralized control plane. Esteves, Granville and Boutaba (2013) reported on several such technologies that, in addition to enabling a software-defined network, also replace many management functions of SNMP. Fundation (2012) noted that use of a software-defined network can allow the use of inexpensive networking devices which need only the data plane functionality and the ability to communicate with a centralized control plane. In one sense, software-defined network protocols such as OpenFlow (Esteves et al., 2013) can be considered as having SNMP integrated, since they provide management functionality similar to SNMP but within a centralized framework.
Unlike software-defined networks, SNMP is ubiquitous among network devices (Shaffi & Al-Obaidy, 2013). Where complete management using software-defined networks isn't supported or implemented, SNMP offers a widely used method for centralized management.
While SNMP doubles as both a management and monitoring protocol for network devices, there are some cases where more data is required to make an informed management decision. One method of extending the information provided by monitoring systems is through the use of packet captures. A packet capture is a raw data representation of the packets sent and received over an interface. However, packet capture can be a resource-intensive operation requiring significant processing and storage capabilities. Using IPFIX, or a proprietary counterpart such as Cisco's NetFlow, reduces the resource requirements of packet captures by using packet sampling and summarization (Drago, Barbosa, Sadre, Pras & Schönwälder, 2011). Data from IPFIX can be used at a control plane level across the entirety of a monitored network. Drago et al. (2011) reported on common uses of IPFIX data such as finding which protocol and/or IP is causing the main load on a given link, determining TCP flag and port number usage between end nodes, and identifying what settings are configured for specific protocol traffic. While SNMP can be used to set and monitor configurations such as those on ISDN links described by the interviewee, IPFIX data is needed to understand how such configurations perform at a network-wide level, for example by exposing bottlenecks and routing inconsistencies.
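The kind of summarization described above, such as finding which host or port causes the main load, can be sketched in a few lines of Python. The `Flow` record below is a deliberately simplified, hypothetical stand-in for an exported flow record (real IPFIX records carry many more fields), and the sample data is invented for illustration.

```python
from collections import Counter, namedtuple

# Hypothetical, simplified flow record; not the actual IPFIX record layout.
Flow = namedtuple("Flow", "src dst protocol dst_port byte_count")

def top_talkers(flows, key, limit=3):
    """Sum bytes per value of `key` (a Flow field name) and rank the largest."""
    totals = Counter()
    for flow in flows:
        totals[getattr(flow, key)] += flow.byte_count
    return totals.most_common(limit)

flows = [
    Flow("10.0.0.5", "10.0.1.9", "tcp", 443, 120_000),
    Flow("10.0.0.5", "10.0.1.9", "tcp", 25, 900_000),
    Flow("10.0.0.7", "10.0.1.9", "udp", 53, 4_000),
    Flow("10.0.0.7", "10.0.1.20", "tcp", 443, 300_000),
]
print(top_talkers(flows, "src"))                # [('10.0.0.5', 1020000), ('10.0.0.7', 304000)]
print(top_talkers(flows, "dst_port", limit=1))  # [(25, 900000)]
```

Because each record summarizes a whole conversation rather than every packet, the collector stores and processes far less data than a full packet capture would require.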
Choudhary and Srinivasan (2013) presented an approach to security management in which IPFIX data was utilized to accurately detect intrusion attempts. Software-defined networks can perform similar detection operations, not only to detect attacks but also to reroute suspicious or important network traffic through security devices such as a firewall or intrusion detection system (Fundation, 2012). However, Esteves et al. (2013) argued that IPFIX data should not be utilized due to its potential for private information to be disclosed. Instead, Esteves et al. (2013) recommended that SNMP monitoring be used to infer information normally provided by IPFIX data. Esteves et al. (2013) wrote that protocol and IP data provided by IPFIX could be replaced by monitoring where traffic originated from. For example, if a spike in throughput from an email server correlated with a spike in throughput on a WAN link, then the spike could be attributed to email traffic. Unfortunately, given the variety of uses of IPFIX data pointed out by Drago et al. (2011), such as usage of TCP flags and protocols, it is unlikely that the recommendation of Esteves et al. (2013) could fully replace IPFIX.
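The email server example can be made concrete with a small sketch that correlates per-server throughput with WAN link throughput. This is only one possible way to perform such an inference, not a method taken from the cited authors; the sample values are invented, and a high correlation suggests rather than proves the source of a spike.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)

# Hypothetical throughput samples in Mbit/s, polled at the same intervals.
email_server = [5, 6, 5, 40, 42, 6]
file_server = [20, 21, 19, 20, 22, 21]
wan_link = [30, 32, 29, 65, 69, 32]

# Pick the server whose throughput tracks the WAN link most closely.
likely_source = max(
    {"email": email_server, "file": file_server}.items(),
    key=lambda item: pearson(item[1], wan_link),
)[0]
print(likely_source)  # email
```

The sketch also shows the limit of the approach: it can point to a server, but it cannot reveal TCP flags, ports, or protocols the way flow data can.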
Simply utilizing a network management system that implements management technologies such as SNMP, IPFIX, and software-defined networks isn't the only factor in efficiently managing a network at scale. Complexity can arise due to procedures, configurations, and practices. It is therefore important to be able to identify complexity and plan for an effective management strategy.
The first step to identifying complexity is to understand its impact. Burgess (2012) wrote of red flags in network management, where a network may seem to be properly managed until an incident arises. If a network incident were to occur in which network services were degraded or lost, Burgess (2012) wrote that IT management should consider what the resolution time would be, what alternatives to the service exist, what proactive work is performed, and how much of the network is documented. Using such criteria, certain aspects of network management can be identified as potential areas of complexity.
Multiple different physical interfaces can be one such area of complexity. Legacy WAN interfaces in particular, such as integrated services digital network (ISDN) connections, require TCP parameters to be configured for the connection to be properly utilized (Khan, 2012). Documenting connections, especially those that require a configuration unique to the network, is one method of ensuring faster incident resolution. Contact information should also be listed for links that require support from a service provider, such as a WAN link.
Another important consideration is layer 3 routing configurations. Routing protocols such as open shortest path first (OSPF) use configurations such as cost metrics to determine routes between sender and receiver. Protocols such as OSPF and routing information protocol (RIP) utilize a dynamic routing mechanism which calculates a network route according to an algorithm. A pitfall that comes with configurations for dynamic routing is that an improper configuration may still be a working configuration. That is, network traffic may be improperly balanced between multiple paths but would still operate in a degraded state. Using link statistics from SNMP can provide an interface for detecting inefficient routing (Shaffi & Al-Obaidy, 2013). For example, a network management system may provide a logical graphical map that displays network throughput across routes. An intuitive display can assist in identifying potential areas of concern so that routing configurations may be corrected in the event that network service is degraded or lost. Using IPFIX data can also assist in identifying routing information which can expose routing issues even with limited access to all network devices in a network route (Drago et al., 2011).
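A degraded-but-working routing configuration of this kind can be surfaced by comparing utilization across parallel paths. The sketch below assumes utilization figures have already been derived from SNMP link statistics; the function names, the group and link names, and the 30% threshold are all hypothetical choices for illustration.

```python
def path_imbalance(utilization):
    """Return (max - min) utilization across parallel paths, as a fraction 0..1."""
    values = list(utilization.values())
    return max(values) - min(values) if values else 0.0

def flag_unbalanced(groups, threshold=0.30):
    """Return names of path groups whose utilization spread exceeds `threshold`."""
    return [name for name, paths in groups.items()
            if path_imbalance(paths) > threshold]

# Hypothetical link utilization (fraction of capacity) per group of parallel paths.
groups = {
    "dc1-to-dc2": {"link-a": 0.82, "link-b": 0.11},    # one path carries nearly everything
    "dc1-to-branch": {"link-c": 0.40, "link-d": 0.35},
}
print(flag_unbalanced(groups))  # ['dc1-to-dc2']
```

A flagged group is not necessarily misconfigured, since unequal costs may be intentional, but it gives the administrator a concrete place to start reviewing routing metrics.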
Similar to routing configurations, virtual networks can be a cause of complexity due to less than obvious configuration states. As Esteves et al. (2013) pointed out, there are several different management frameworks for virtual networks that may or may not include SNMP and IPFIX. Management frameworks for virtual networks need to be included in a network management strategy, as routing decisions, security measures, and TCP configurations can occur in the virtual network that affect physical network performance. For example, in a virtual infrastructure, several virtual machines may reside on a single physical host. A virtual machine may be transferred to another physical host as part of a failover or load balancing of resources, causing the virtual network interface for the virtual machine to be transferred and virtualized on a different physical interface. The new physical interface may be unable to handle the increase in network traffic or may be improperly configured to accept traffic from the migrated virtual machine's IP address. If the network administrator is unaware of how the virtual network is configured and how it operates, then such a virtual machine migration could be the cause of a network outage.
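The two migration failure modes described above, insufficient capacity and an interface that does not accept the virtual machine's address, can be expressed as a simple pre-migration check. The sketch is illustrative only: the `port` dictionary and its keys are hypothetical and do not correspond to any hypervisor or switch API.

```python
import ipaddress

def can_accept_vm(vm_ip, vm_peak_mbps, port):
    """Check a destination port's allowed subnets and spare capacity for a VM.

    `port` is a hypothetical dict with 'allowed_subnets', 'capacity_mbps',
    and 'current_mbps' keys. Returns a list of problems (empty means OK)."""
    problems = []
    address = ipaddress.ip_address(vm_ip)
    if not any(address in ipaddress.ip_network(net) for net in port["allowed_subnets"]):
        problems.append(f"{vm_ip} is outside the subnets allowed on this port")
    headroom = port["capacity_mbps"] - port["current_mbps"]
    if vm_peak_mbps > headroom:
        problems.append(f"needs {vm_peak_mbps} Mbit/s but only {headroom} Mbit/s is free")
    return problems

port = {"allowed_subnets": ["10.20.0.0/16"], "capacity_mbps": 1000, "current_mbps": 850}
print(can_accept_vm("10.20.4.8", 100, port))  # []
print(can_accept_vm("10.30.4.8", 400, port))  # two problems reported
```

Running a check like this before a failover or load-balancing move is one way to bring the virtual network into the same documented management strategy as the physical network.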
Another cause of network outages due to unmanaged complexity can be security measures. Network management systems, security devices, and security-focused routing can all be potential causes of legitimate network traffic being dropped. For example, Choudhary and Srinivasan (2013) described a method of blocking network attacks from recognized patterns in IPFIX data. However, without fully understanding and documenting such a system, network administrators may be caught off guard when legitimate data is dropped due to a false positive in the security system. Similarly, firewalls and network intrusion prevention systems can block legitimate network traffic. Without awareness of where the network traffic is being dropped or what the configuration of the security devices is, the resolution time for a security-caused outage could be unnecessarily long.
Three major strategies can be employed to perform efficient network management at scale. These strategies are monitoring, remote centralized management, and policy-based management. The first of these strategies, monitoring, can be defined as a proactive approach to network management. Data from SNMP and IPFIX can be centrally collected by a monitoring system to provide both a view of the current state of the network and an immediate view of some configuration data. For example, a monitoring view may display information such as the throughput of each interface on a network device as well as interface configuration data. Much of the information relevant to determining the state of a network device can be retrieved using SNMP (Shaffi & Al-Obaidy, 2013), making SNMP the primary information mechanism for most monitoring tools. More detailed information can be retrieved using IPFIX to describe traffic flow between paths (Pras, Sadre, Sperotto, Fioreze, Hausheer, & Schönwälder, 2009) and classification of traffic to determine network utilization by applications (Dario & Silvio, 2010).
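Interface throughput of the kind shown in a monitoring view is typically derived from two successive polls of an octet counter. The following sketch assumes the 32-bit `ifInOctets`/`ifOutOctets` counters and at most one counter wrap between polls; the function names and sample values are hypothetical.

```python
COUNTER32_MAX = 2 ** 32  # ifInOctets/ifOutOctets are 32-bit counters that wrap

def throughput_bps(previous_octets, current_octets, interval_seconds):
    """Average bits per second between two polls of a 32-bit octet counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Modular subtraction handles a single counter wrap between polls.
    delta = (current_octets - previous_octets) % COUNTER32_MAX
    return delta * 8 / interval_seconds

def utilization(bps, if_speed_bps):
    """Fraction of the interface speed (ifSpeed) in use."""
    return bps / if_speed_bps

rate = throughput_bps(1_000_000, 38_500_000, 300)
print(rate)                            # 1000000.0
print(utilization(rate, 100_000_000))  # 0.01
```

On fast links a 32-bit counter can wrap more than once within a polling interval, which this sketch cannot detect; polling more frequently or using the 64-bit high-capacity counters (for example `ifHCInOctets`) avoids that problem.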
Routine maintenance may also be supplemented by a monitoring system. For example, alerts for licenses and warranties that are about to expire can facilitate proper inventory management. Setting up an effective monitoring solution can be very time-intensive, but it can also significantly reduce management effort by catching issues before they become an outage. For example, detailed monitoring of input errors on an interface, with a connected alert, can help indicate when a physical cable has become degraded. Replacing the degraded cable in a controlled manner could be the difference between full availability of network services and an unexpected network outage.
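An input-error alert of this kind can be reduced to a ratio of errored packets to received packets between two polls. The sketch below is a minimal illustration; the function names, the interface names, and the 0.1% threshold are assumptions rather than recommended values.

```python
def error_ratio(errors_before, errors_now, packets_before, packets_now):
    """Share of received packets that were errored between two polls."""
    packets = packets_now - packets_before
    errors = errors_now - errors_before
    return errors / packets if packets > 0 else 0.0

def cable_alerts(samples, threshold=0.001):
    """Return interfaces whose input error ratio exceeds `threshold`.

    `samples` maps interface name -> (errors_before, errors_now,
    packets_before, packets_now); the 0.1% default is an assumed threshold."""
    return sorted(name for name, counters in samples.items()
                  if error_ratio(*counters) > threshold)

samples = {
    "Gi0/1": (10, 12, 1_000_000, 2_000_000),  # 2 errors in a million packets
    "Gi0/2": (0, 5_000, 0, 1_000_000),        # 0.5% errored: worth inspecting
}
print(cable_alerts(samples))  # ['Gi0/2']
```

Alerting on the ratio rather than the raw error count keeps busy interfaces from generating noise while still catching a cable that is beginning to fail.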
Along with an effective monitoring solution, a comprehensive remote network management system is critical to success. Just as with the monitoring system, prior setup of a network management system and its procedures is important for long-term management efficacy. Remote management protocols such as SSH, SNMP, and IPFIX require configuration of the network device. Lack of configuration could result in an unmanageable device until physical access to the device is obtained. Likewise, improper configuration of remote management protocols can leave a device open to unauthorized access. Some common misconfigurations of remote management protocols are allowing any IP address to connect, using generic access credentials, not using access restrictions, and not using SNMP encryption. However, when remote management protocols are globally configured and secure access controls are utilized, a network management system can greatly reduce the management impact of a large network (Shaffi & Al-Obaidy, 2013).
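The common misconfigurations listed above lend themselves to an automated audit. The sketch below checks a device description for each of them; the `device` dictionary, its keys, and the list of generic secrets are hypothetical and would need to be mapped to whatever inventory or configuration source is actually available.

```python
import ipaddress

# Assumed examples of generic credentials; extend for a real environment.
GENERIC_SECRETS = {"public", "private", "admin", "cisco", "password"}

def audit_remote_management(device):
    """Return findings for one device's remote management settings.

    `device` is a hypothetical dict; its keys are assumptions for this sketch."""
    findings = []
    # A missing source list is treated as "any address allowed".
    for source in device.get("allowed_sources", ["0.0.0.0/0"]):
        if ipaddress.ip_network(source).prefixlen == 0:
            findings.append("management access allowed from any IP address")
    if device.get("snmp_community", "").lower() in GENERIC_SECRETS:
        findings.append("generic SNMP community string")
    # Only SNMPv3 with privacy enabled encrypts the payload.
    if device.get("snmp_version") != "3" or not device.get("snmp_privacy", False):
        findings.append("SNMP traffic is not encrypted")
    if not device.get("ssh_enabled", False):
        findings.append("SSH is not enabled for remote command line access")
    return findings

bad = {"allowed_sources": ["0.0.0.0/0"], "snmp_community": "public",
       "snmp_version": "2c", "ssh_enabled": True}
good = {"allowed_sources": ["10.0.50.0/24"], "snmp_version": "3",
        "snmp_privacy": True, "ssh_enabled": True}
print(audit_remote_management(bad))   # three findings
print(audit_remote_management(good))  # []
```

Run across an inventory, a check like this turns the list of common mistakes into a repeatable report instead of a manual review.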
Lastly, policy-based management can be used to facilitate several management decisions. When working with network management, the question of what level of service is acceptable may arise. What should the expected availability of a network service be, and how much administrative effort should be expended in the event of an incident? These questions can drive both how the network is designed and the administrative effort expended during an incident. For example, if 99.995% availability of network service to an organization's e-commerce site is expected, then no more than about 26 minutes of downtime per year can occur for network access to the site. This availability requirement may be part of a service level agreement or determined by weighing the maximum acceptable loss of revenue against the cost of achieving the availability. To achieve such a high level of availability, the network engineers must design the network with multiple redundancies, multiple Internet routes, and precise monitoring of possible issues. Additionally, policy-based management may be directed by configuration management. For example, templates may be used to control the exact configuration of each network device. A configuration review board could then require that all changes to the approved templates be reviewed, tested, and approved before being implemented. Precise control over how the network is configured, with multiple individuals involved, can help prevent complex and undocumented configurations from slowing down incident resolution times.
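The relationship between an availability target and its downtime budget is simple arithmetic, sketched below. The function name is hypothetical, and the calculation assumes a 365-day year measured in minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years

def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime allowed in a period for a given availability target."""
    return period_minutes * (1 - availability_percent / 100)

for target in (99.9, 99.99, 99.995):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes per year")
```

A target of 99.995% works out to roughly 26.3 minutes per year, compared with about 52.6 minutes for 99.99%, which shows how quickly each additional fraction of a percent shrinks the room for unplanned outages.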
In preparation for the review of network management concepts, an interview was held with a network engineer of an organization with a large network footprint. The architecture of this network comprised a diverse set of network devices, a virtual infrastructure, and multiple data centers. The large majority of the network was centrally managed by a Cisco network management system with a comprehensive monitoring solution. However, when discussing the virtual infrastructure, it was found that the virtual networks were under a different management strategy than the physical network. This meant that the control plane for each data center was internally managed by the virtual infrastructure, independently of the other data centers. If a virtual machine needed to be migrated from one data center to another, either the control plane configuration would need to be modified to accommodate the virtual machine or the virtual machine's address space would need to be reconfigured (data plane configuration). Fortunately, the organization planned for an upgrade to utilize Cisco's control plane across all data centers and virtual networks. In doing so, virtual machines could seamlessly migrate between data centers without reconfiguration since the data plane would be effectively decoupled from the control plane. It would appear that software-defined networking is not fully open between vendors, and many proprietary solutions such as Cisco's vPath require a specific vendor selection. While more entrenched management technologies such as SNMP and IPFIX can be expected, support for software-defined networking requires close inspection of vendor support.
Managing a large network with a diverse variety of network interfaces and protocols requires an effective strategy to reduce administrative overhead. Ideally, administrative efforts, including monitoring and management, should occur in a centralized, documented, and methodical manner. Utilizing open network management protocols such as SNMP and IPFIX can facilitate centralized management due to their wide availability across multiple network vendors. However, software-defined networks that utilize a single protocol such as OpenFlow may also provide a simple yet powerful replacement for SNMP-based network management.
As Burgess (2012) explained, red flags from a lack of network management planning can easily catch a network administrator off guard. Proactive configuration of monitoring and management solutions can provide significant long-term returns on the effort invested. Senior management of network services should also be aware that network administration cannot occur in a void. Policies that define availability and level of effort in the event of an incident are needed to properly design and manage a network. Without a clear understanding of the level of service needed, costs can easily get out of control or the network could be under-protected, leading to significant losses. For example, if network service to an e-commerce site accounts for a significant share of revenue, then the design of the network should account for a high availability requirement. Conversely, multiple high-cost network redundancies should not be spent on access to an information-only site with little loss due to downtime.