Original document is at: http://www.cert.org/research/ch2.html

CERT Coordination Center

DRAFT Technical Report

June 1996

Last Revised June 3, 1996

2 Toward the Design of Survivable Systems Under Construction

Four fundamental problems must be overcome if our vision of survivable networked information systems is to be realized.

  1. System designers, managers, administrators, and users lack an awareness and understanding of the concepts and practices associated with system survivability. Basic research is necessary to identify the foundational concepts and techniques that can be used to characterize, evaluate, and ultimately quantify the survivability attributes of a system. The techniques associated with traditional computer systems security are better understood, but are typically poorly applied. Survivability has yet to overcome the first hurdles of awareness and basic understanding.
  2. Systems are rarely designed from the outset with security in mind, leaving security to be addressed only through post-design patches, through add-ons, or not at all. We need not only to address this problem, but also to generalize our approach to encompass survivability by design. We assert that survivability by design is the only viable approach that can withstand both the evolution of the system and the evolution of the networked environment of which the system is a part.
  3. System security is typically not assessed in the context of the other attributes of software quality, such as performance, ease of use, extensibility, maintainability, and interoperability. As a result, appropriate engineering tradeoffs among critical software attributes are not explicitly made, with typically unfortunate consequences for delivered systems, and yielding inadequate evaluations of existing systems. We need to broaden the focus of our concern and develop engineering methodologies that ensure that system survivability is assessed in the full context of the other attributes of software quality.
  4. Most of the security and (very limited) survivability research and practice to date have been based on a bounded system paradigm which assumes administrative control over all of a system's computational and communication resources. This approach does not support the design of systems which must survive in an unbounded network domain such as the Internet or its future incarnation.

For a software engineering methodology to successfully support the design and analysis of survivable systems, it will have to effectively address the four fundamental problem areas described above. We believe that a solution for each of these problems lies in an architecture-based engineering tradeoff approach to software analysis and design. The goal is to abstract the fundamental design principles of survivability into an architectural representation which makes clear the survivability properties of proposed and delivered systems. Designers and evaluators must be able to model the behavior of a system at the abstract architectural level to provide the leverage necessary to have a sweeping and profound impact on improving the survivability of delivered systems. This approach will enhance survivability in the context of appropriate tradeoffs among the other attributes of software engineering quality.

We assert that an architecture-based approach to survivability by design will not only improve survivability in today's unbounded network environment (e.g., the Internet), but will also successfully mitigate risks to survival in the most plausible future network environments. We believe that one such environment will be characterized by a shift from customer-owned and -controlled computing resources (communicating over an unbounded network) to a world of autonomous agents where most of the computing resources will be contained in the unbounded network infrastructure and will be provided by a multitude of computing and communications service providers. In this future environment, most computing and communications tasks will be accomplished by remote agents on behalf of the customers they represent. We will show that an architecture-based approach can extend its survivability design and evaluation capabilities into this and other plausible future environments.

2.1 The Promise of Survivable Systems - An Ideal Future State

In this subsection we first define survivable systems as an abstract, idealized concept. We next describe a vision of a future computing environment which we believe will form the context within which future survivable systems will function. Finally, we outline several desirable design capabilities for future survivable systems.

2.1.1 Survivable Systems - Assumptions and Basic Principles

In the realm of networked information systems, the natural escalation of offensive threats versus defensive countermeasures has demonstrated time and again that no practical systems can be built which are invulnerable to attack. Affordability will always be a significant factor in the design, implementation, and maintenance of the systems we build to support our national infrastructure (such as the power grid, the public switched communications networks, and the financial networks) and our national defense. In fact, the trend toward increased sharing of common infrastructural components in the interests of economy will ensure that the civilian networked information infrastructure will always be an inseparable part of our national defense.

Practical, affordable systems are virtually never 100% customized, but rather are built from commonly available off-the-shelf components whose internal structure is no secret to the community at large. The trend toward building systems through integration and reuse, rather than through "one-shot," customized design and coding efforts, is one of the cornerstones of modern software engineering strategy. Unfortunately, the intellectual complexity associated with most software design, coding, and testing efforts ensures that exploitable bugs can and will be discovered in commercial and public domain products whose internal structures are widely known. When these products are incorporated as components of larger systems, those systems become vulnerable to attack strategies based on the exploitable bugs associated with the component products. Popular commercial and public domain components offer an attacker a ubiquitous set of targets, with well-known and typically unvarying internal structures. This lack of variability among components (and therefore lack of variability among systems) allows a single attack strategy to have a wide-ranging and potentially devastating impact.

The traditional use of the term "computer security" has a binary implication which suggests that at any moment in time a system is either safe or compromised. We believe that the term "computer security" and the limited viewpoint it engenders (e.g., largely ignoring the aspects of recovery from a compromise and of maintaining performance during and after an intrusion) are inadequate to support the necessary improvement of the state of the art and the state of the practice of protecting computer systems from attack. In contrast, the term survivable systems refers to systems whose components collectively accomplish their mission even under attack and despite active intrusions which effectively damage some significant portion of the system. Accomplishment of mission refers to the ability to deliver essential functionality in a timely manner, where the definitions of essential and timely are up to the users of the system. This collective capability should not be dependent upon the survival of a specific information resource, computation, or communication. In a military setting, essential might refer to the maintenance of overwhelming technical superiority, and timely may refer to the ability to deliver results in less than one decision cycle of the enemy. In the public sector, essential and timely may refer to some composite measure of the disruption of stock trades and bank transactions, representing a threshold that must not be exceeded.

In short, the nature of computing dictates that even hardened systems can and will be broken. Robustness under attack is at least as important as hardness or resistance to attack. Hardness certainly contributes to survivability, but robustness under attack is the essential characteristic that distinguishes survivability from traditional computer security. Although the concepts and techniques associated with system survivability are embryonic, they include (but are not limited to) traditional areas of software engineering and computer science such as reliability, testing, dependability, fault tolerance, verification of correctness, performance, and computer security.

Survivability requires robustness under conditions of war, terrorism, crime, and accident. The concept of survivability includes fault tolerance, but is not equivalent to it. Fault tolerance relates to the statistical probability of an accidental fault or combination of faults, not to a malicious attack. For example, an analysis of a system may determine that the simultaneous occurrence of the three statistically independent faults f1 to f3 will cause the system to crash. The probability of the three independent faults occurring simultaneously by accident may be vanishingly small, but a malicious intruder with knowledge of the system's internals can orchestrate the simultaneous occurrence of the three faults needed to bring down the system. A fault-tolerant system would not need to address the possibility of the three faults occurring simultaneously if the probability of occurrence is below a threshold of concern, whereas a survivable system would require a contingency plan to be in place to deal with it.
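The arithmetic behind this contrast can be sketched in a few lines of Python. The per-fault probabilities and the threshold of concern below are hypothetical values chosen only for illustration; they are not measurements of any real system, and the two decision functions are a simplified reading of the paragraph above rather than a definitive analysis method.

```python
from math import prod

# Hypothetical probabilities that each fault occurs by accident in a given interval.
ACCIDENTAL_FAULT_PROBABILITIES = {"f1": 1e-4, "f2": 1e-4, "f3": 1e-4}

# Hypothetical threshold below which a fault-tolerance analysis ignores an event.
THRESHOLD_OF_CONCERN = 1e-9

# The combination of faults that the analysis found will crash the system.
CRASH_CONDITION = {"f1", "f2", "f3"}


def joint_accidental_probability(probabilities):
    """Probability that all faults coincide by accident, assuming independence."""
    return prod(probabilities.values())


def fault_tolerance_must_address(probabilities, threshold):
    """Fault tolerance addresses the event only if it is statistically likely enough."""
    return joint_accidental_probability(probabilities) >= threshold


def survivability_must_address(faults_intruder_can_cause, crash_condition):
    """Survivability addresses the event whenever an intruder can cause every fault."""
    return crash_condition <= faults_intruder_can_cause  # subset test


joint = joint_accidental_probability(ACCIDENTAL_FAULT_PROBABILITIES)
print(f"joint accidental probability: {joint:.1e}")
print("fault tolerance must address it:",
      fault_tolerance_must_address(ACCIDENTAL_FAULT_PROBABILITIES, THRESHOLD_OF_CONCERN))
print("survivability must address it:",
      survivability_must_address({"f1", "f2", "f3"}, CRASH_CONDITION))
```

Under these assumed numbers the accidental coincidence falls below the threshold, so a fault-tolerance analysis may ignore it, while the survivability analysis cannot, because the intruder's ability to trigger the faults does not depend on their accidental probabilities.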

Redundancy is another factor that can contribute to the survivability of systems, but redundancy alone is insufficient since multiple identical backup systems share identical vulnerabilities. A survivable system would require each backup system to offer equivalent functionality, but to exhibit significant variance from the others in its implementation. This would thwart any attempt to compromise the primary system and all backup systems with a single attack strategy.
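The point about implementation variance can be illustrated with a small Python sketch. The replica names and flaw labels are hypothetical, and modeling each implementation as a set of exploitable flaws is a deliberate simplification made for this example.

```python
def replicas_felled_by(attack, replicas):
    """Return the names of replicas whose implementation contains the exploited flaw."""
    return {name for name, flaws in replicas.items() if attack in flaws}


def survives_single_attack(attack, replicas):
    """The service survives if at least one replica is unaffected by the attack."""
    return len(replicas_felled_by(attack, replicas)) < len(replicas)


# Hypothetical data: each replica maps to the set of flaws in its implementation.
identical_replicas = {
    "primary": {"flaw-A", "flaw-B"},
    "backup-1": {"flaw-A", "flaw-B"},
    "backup-2": {"flaw-A", "flaw-B"},
}
diverse_replicas = {
    "primary": {"flaw-A", "flaw-B"},
    "backup-1": {"flaw-C"},
    "backup-2": {"flaw-D"},
}

for label, replicas in (("identical", identical_replicas), ("diverse", diverse_replicas)):
    print(label, "replicas survive an attack on flaw-A:",
          survives_single_attack("flaw-A", replicas))
```

With identical backups, one attack strategy reaches every replica. With diverse backups offering the same functionality, no single flaw in this example is shared by all of them.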

In summary, when we discuss system survivability, we are referring to the collective capability of a system to deliver essential functionality in a timely manner, in the face of an actual or threatened attack or accident. This collective capability should be maintained even if a significant portion of the system is incapacitated by the attack or accident.

2.1.2 The Future Networked Computing Environment

As an idealized concept, it is hard to argue with the notion of survivable systems. However, to evaluate the prospects for the success of any proposed realization of a survivable system, one must first understand the context (i.e., the computing environment) within which that survivable system will operate. This subsection describes some of the characteristics of what we believe to be the most plausible future networked computing environment: a largely unbounded network domain infrastructure supporting an agent-based computing and communication paradigm.

2.1.2.1 Unbounded Network Domains - The Context for Survivable Systems

To better understand the context within which survivable systems must operate, it is important to distinguish between bounded systems and unbounded systems. A bounded system is one in which all of the parts of the system are under a unified administrative control and can be completely characterized and controlled. At a bare minimum, the behavior of a bounded system can be understood and all of its various parts identified. In an unbounded system there is no unified administrative control over its parts. We use the term administrative control in the strictest sense, which includes the power to impose and enforce sanctions, not merely to recommend an appropriate security policy. It is possible that some parts of an unbounded system cannot even be identified by an outside observer, let alone have actions or behaviors that fall within predictable limits.

Unbounded systems are a significant component of today's computing environment and will play an even larger role in the future. The Internet - a non-hierarchical network of systems under local administrative control only - is one current example of an unbounded system. There are conventions that allow the parts of the Internet to work together, but there is no global administrative control to assure that these parts are behaving according to these conventions. Unfortunately, the security problems associated with the operation of unbounded systems are greatly underestimated.

The architecture of secure bounded systems is built upon the notion of a security policy - its existence, its enforcement, or its lack - imposed by the exercise of administrative control. In contrast, an unbounded system has no administrative control with which to impose global security policy. For instance, on the Internet today the backbone architecture is independent of security policy considerations because there is no global administrative control.

An unbounded system can be composed of bounded and unbounded systems connected together in a network. Figure 1, below, represents an unbounded domain consisting of a collection of bounded systems, where each bounded system is under separate administrative control. If each bounded system were completely disconnected from the other systems, it would be possible to fully characterize the security state of each system in terms of security policies.

Figure 1: An unbounded domain viewed as a collection of bounded systems
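The distinction between bounded and unbounded systems can also be expressed as a small data model. The following Python sketch is one possible reading of the definitions above; the class name, field names, and site names are hypothetical, and "enforcing authority" stands for administrative control in the strict sense used here (the power to impose and enforce sanctions).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Part:
    name: str
    # Authority able to impose and enforce sanctions on this part.
    # None means no such authority is known to the observer.
    enforcing_authority: Optional[str]


def is_bounded(parts):
    """True when every part answers to one and the same enforcing authority."""
    authorities = {part.enforcing_authority for part in parts}
    return len(authorities) == 1 and None not in authorities


# Two hypothetical sites, each under its own administrative control.
site_a = [Part("a-mail", "site-a-admin"), Part("a-files", "site-a-admin")]
site_b = [Part("b-web", "site-b-admin"), Part("b-db", "site-b-admin")]

# Connecting them yields a domain with no unified administrative control.
domain = site_a + site_b

print("site A bounded:", is_bounded(site_a))
print("site B bounded:", is_bounded(site_b))
print("combined domain bounded:", is_bounded(domain))
```

Each site taken alone is bounded, but the collection is not, which mirrors the arrangement in Figure 1.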

2.1.2.2 Agent-Based Computing - An Emerging Computing and Communications Paradigm for Internetworked Information Systems

Agents are executable software objects whose executions are not tied to any specific host or computing resource or to any geographical or logical network location. Agents perform computation and communication at the behest of a user, but, from the user's perspective, the execution platforms typically exist somewhere within an unbounded domain. The conceptual model of agent operation is one where the agent, at the request of a user, goes to one or more remote hosts to perform a computation or gather information, and then returns to the user with the results of that "journey of computation." An agent's mode of operation may range from partially to fully autonomous, and the degree to which an agent is autonomous may vary throughout the life of that agent.
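The conceptual model of agent operation can be sketched as follows. This is a toy, single-process simulation written in Python: the `Host` and `Agent` classes, their methods, and the sample data are all hypothetical, and real agent migration, authentication, and autonomy are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Host:
    """A simulated remote execution platform holding some local data."""
    name: str
    data: Dict[str, Any]

    def execute(self, task: Callable[[Dict[str, Any]], Any]) -> Any:
        # The visiting agent's task runs against this host's local data.
        return task(self.data)


@dataclass
class Agent:
    """An agent carrying a task on behalf of a user."""
    owner: str
    task: Callable[[Dict[str, Any]], Any]
    results: Dict[str, Any] = field(default_factory=dict)

    def journey(self, itinerary: List[Host]) -> Dict[str, Any]:
        # Visit each host in turn, perform the task, and carry the results back.
        for host in itinerary:
            self.results[host.name] = host.execute(self.task)
        return self.results


hosts = [
    Host("host-1", {"records": [3, 5, 8]}),
    Host("host-2", {"records": [13, 21]}),
]
agent = Agent(owner="user", task=lambda data: sum(data["records"]))
report = agent.journey(hosts)
print(report)
```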

Blurring the line between computation and communication, these agents provide a means of remote distributed computation that is radically different from traditional computing paradigms. An agent has a computational task to fulfill for its user and has some associated resource budget and authorization and authentication primitives. Redundancy for agents (through replication) is a lightweight operation from a resource standpoint at the individual replicant level, but the ability to spawn an unlimited number of replicants may forever change the face of computing and resource allocation as we know it today. For example, supporting survivability through inference channels would be an efficient strategy in an agent-based computational environment.

Just as domain name servers advertise hostnames, domain service servers will advertise services and prices and authorization requirements. For example, services may be free to educational users, while others are charged depending on the number of service-based units of work delivered, or on the more traditional resource-based units consumed, such as CPU time, storage, etc. Agents may contain built-in discriminators to prioritize and negotiate selection of the service offerings, or they may rely on third party broker agents which do the searching and selecting of services desired by the primary agent. Natural biological and immunological analogs appear to apply more readily to agent-based computation than to more traditional complex process set computation. Much of today's object-oriented and neural-net research can readily be applied to agents. Analogies to DNA are strong, and replication strategies that require two genetically similar but diverse agents to merge prior to replication provide added fault-tolerance.
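A built-in discriminator of the kind described above might look like the following Python sketch. The advertisement fields, provider names, prices (in hypothetical credits per unit of work), and the cheapest-affordable-offer rule are all assumptions made for illustration; the text does not prescribe a particular selection policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Advertisement:
    """A service offering as a hypothetical domain service server might advertise it."""
    provider: str
    service: str
    price_per_unit: int  # hypothetical credits per service-based unit of work
    free_for_educational: bool = False

    def cost(self, units: int, educational_user: bool) -> int:
        if educational_user and self.free_for_educational:
            return 0
        return self.price_per_unit * units


def select_offer(advertisements: List[Advertisement], service: str, units: int,
                 budget: int, educational_user: bool = False) -> Optional[Advertisement]:
    """Return the cheapest advertisement within budget for the service, or None."""
    affordable = [
        ad for ad in advertisements
        if ad.service == service and ad.cost(units, educational_user) <= budget
    ]
    return min(affordable, key=lambda ad: ad.cost(units, educational_user), default=None)


offers = [
    Advertisement("provider-a", "translation", price_per_unit=5),
    Advertisement("provider-b", "translation", price_per_unit=3),
    Advertisement("provider-c", "translation", price_per_unit=10, free_for_educational=True),
]

commercial_choice = select_offer(offers, "translation", units=100, budget=400)
educational_choice = select_offer(offers, "translation", units=100, budget=400,
                                  educational_user=True)
print("commercial user selects:", commercial_choice.provider)
print("educational user selects:", educational_choice.provider)
```

A third-party broker agent could run the same selection on behalf of the primary agent; only the location of the logic changes.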

Here are some of the salient features of a future agent-based computing environment:

2.1.3 Desirable Design Capabilities for Survivable Systems

We believe that an architecture-based approach to the analysis and design of survivable systems in unbounded network domains would make possible the following capabilities. Each of these capabilities would lead to significant improvement in the development and deployment of large-scale information systems.

2.2 The Current State of the Practice in Survivable Systems

Much of today's research and practice in the field of computer systems security takes a perilously narrow view of the means by which one can defend against computer intrusions. This narrow view is dangerously incomplete because it focuses almost exclusively on hardening a system (e.g., using firewall technology or an Orange Book approach to host protection) to prevent a break-in or other malicious attack, but says little about how to detect a computer security incident or what to do once a security incident has occurred or is under way. The view is accompanied by security evaluation techniques that limit their focus to the relative hardness of a system, as opposed to assessing a system's robustness under attack.

Although the application of the term survivability to computer systems is relatively new, the practice of survivability is not. Much of the survivability practice to date has been in the realm of incident response (IR) teams, and in fact the CERT Coordination Center has, since its inception, been supporting the survivability of the Internet community by providing incident response services (helping organizations respond to and recover from incidents) and by publishing and distributing vulnerability advisories (akin to public health notices). The CERT/CC has also promoted variability by encouraging sites to separate different services onto different machines. Thus, the CERT/CC (and its IR team in particular) has always been concerned about survivability (though it has never called it that) and has been very successful in helping sites with risk mitigation and recovery. The realm of incident response is the survivability of deployed systems, and incident response is currently the state of the art for deployed systems.

Our seven years of experience at the CERT Coordination Center in responding to computer security incidents has shown us that how organizations respond to (and recover from) computer intrusions is at least as important as the steps they take to prevent them. Although the CERT/CC deals with survivability primarily at the level of the system administrator, which is quite different from the architectural level which is the focus of this paper, its experience provides ample evidence of the effectiveness of dealing with survivability issues on a broad scale. We believe that the widespread availability and use of survivable systems by the Internet community and throughout the Internet infrastructure itself will provide the best hope for the dramatic improvements necessary to make the Internet a more survivable networked information "system of systems" that will be viable for commerce, defense, the conduct of government, and the support of major elements of the national infrastructure (e.g., power grid, public switched network, and air traffic control).

Currently, little of the basic technology in security engineering and system integration applies to unbounded systems, but instead assumes that the capability exists to identify, define, and characterize the extent of administrative control over a system, all access points to that system, and all signals that may appear at those access points. In unbounded systems such as the current Internet and the future National Information Infrastructure, these boundary conditions cannot be fully determined.

On the Internet today, the cornerstone of security is based upon the notion of a firewall, which is an attempt to create a logically bounded system within a physically unbounded one. We assert that "bounded-system thinking" within unbounded domains leads to security designs and architectures that are fundamentally flawed. One notable example is the use of a firewall as the basic security component of the Internet. This approach is severely limited and can be readily circumvented by exploiting the fundamental differences between bounded and unbounded systems. Traditional firewalls are the state of the art for security architectures, but not for survivable systems, because they are passive, filter-only devices. The evolution of firewalls into active firewalls, which have detection and response capability, will allow firewalls to have a role in survivable systems.

2.3 Steps Toward Achieving the Promise of Survivable Systems

The focus of this section is on how to bridge the gap between our current state and the future promise of survivable systems. We first look at the critical scalability problems facing incident response teams, and then the issues facing designers of survivable systems.

2.3.1 Issues in Incident Handling "In-the-Large"

One of the most crucial issues coincident with the exponential growth of the Internet is how to scale incident response and vulnerability resolution activities to keep up with the dramatic increase in intruder activity. Despite heroic efforts by individual response team members, incident response teams have fallen behind in the quantity and level of services they offer, and are steadily losing ground.

A scalable incident response capability requires at least a partially shared knowledge base with extensive automated workflow support. This section gives a brief overview of several of the issues and design considerations for the development of a scalable security information infrastructure. The purpose of this infrastructure is threefold:

  1. To provide extensive automated support for the global coordination of incident response and vulnerability resolution activities.
  2. To facilitate the collection of sufficient data during the incident handling process to generate new knowledge, promoting the measured improvement of the state-of-the-practice and the state-of-the-art of networked computer systems security.
  3. To promote the measured improvement of the state-of-the-practice of incident handling itself.

The exponential growth of the Internet has produced an unprecedented rise in the number of computer security incidents, threatening the capacity and ability of current incident response teams to successfully intervene. Moreover, the dramatic shift in the character of the Internet from solely an educational and research network to a global network supporting corporate, government, national defense, and other high stakes activities has greatly increased the incentives for computer intrusions. This shift in the character of the Internet has also increased the likelihood of more skilled attention being devoted to the task of breaking network security, including attacks on the infrastructure of the Internet itself.

The scalable security information infrastructure will provide for the capture (for later query and analysis) of incident-related information, comprising an incident response and software vulnerability knowledge base. This knowledge base will allow researchers, vendors, software engineering practitioners, educators, and incident response personnel to learn from real-world evidence (and actual artifacts) of survivability mistakes and successes, and to take proactive steps to improve the survivability of networked information systems. For example, the analysis of the information contained in this knowledge base will help in the identification of those system administration and software engineering practices that make sites vulnerable to attack. Other analyses will provide valuable insights into attack patterns and incident trends, allowing investigators to discern the broad patterns shaped by individual incidents. Researchers will use the "lessons learned" to improve the survivability of the next generation of networked information systems. Finally, analyses of data on the workflow interactions between response teams and their constituencies (and between cooperating response teams) will contribute to the continuous improvement of the practice of incident handling itself.

2.3.2 Issues in Software Design for Survivable Systems

2.3.2.1 Survivability by Design

The CERT Coordination Center's collection of software vulnerability data provides empirical evidence that vendors continue to release software containing essentially the same classes of security flaws, over and over again, year by year. The vendor response to these flaws is typically the issuance of a patch, or add-on, that addresses the immediate problem at hand, but does not solve the problem at the design level, which is often the cause of the subsequent reappearance of the same type of flaw in the same software.

We can no longer afford to ignore the survivability aspects of the creation and continuous improvement of large-scale systems, or to treat the security and survivability of such systems in isolation (i.e., as mere add-ons or patches to existing designs). Security and survivability must be designed in from the outset, and must be an integral part of an evolutionary design process for that system.

2.3.2.2 Architecture-Based Development of Survivable Systems

Currently, security changes to a software system are rarely made at the architectural level. Security enhancements or corrections for security vulnerabilities are typically made in the form of "patches" and are not reflected in the architectural description of the system. This "disconnect" between the architecture and the implementation of a system all too often has an unexpected and undesirable impact on security, as well as on the other attributes of software quality. On a daily basis, the CERT Coordination Center sees the real-world damage (in the form of system intrusions) caused by the lack of a theoretical foundation upon which to build sound software engineering practices for the design, implementation, and maintenance of survivable systems.

One of the major goals of research into software architectures for survivable systems is to build a predictive model to describe the survivability implications of multiple bounded and unbounded systems that collectively form a large, unbounded domain. Another major goal is to provide the architectural primitives, techniques, and the supporting tools necessary to build highly survivable systems that exist within unbounded domains.

A description of the two major goals of this work is provided below:

  1. Security and Survivability "what if" modeling of an unbounded system
  2. Security and Survivability architectures for unbounded systems

Security and Survivability "What If" Modeling of an Unbounded System

We will create a predictive security and survivability model of unbounded systems in general and the Internet in particular. The most salient characteristic of our model is that it will allow a user to conduct "what-if" analyses of the survivability characteristics of a particular component, collection of components or system.

The deliverable is a detailed description of a predictive model for understanding and evaluating security attributes in an unbounded system. The predictive nature of the model will allow us to derive security and survivability characteristics from the specific architectural descriptions of model elements in a simulation of an unbounded system (such as the Internet). It is expected that this model (or modeling technology) will be a step towards creating more secure architectures for the Internet, the NII, and future unbounded systems.

Currently, the CERT Coordination Center performs vulnerability analysis to create and distribute workarounds, and to assist vendors in creating and distributing patches. During the data analysis phase of this project, we will abstract the vulnerability record into classes of architectural and implementation practices that lead to specific vulnerabilities. This analysis will be linked with an analysis of intruder behavior to help determine the real-world security impact of specific architectural and implementation practices. The abstraction and classification of vulnerabilities and their links to real-world intrusions will contribute to the creation of our predictive model by allowing us to extrapolate behavior and system vulnerabilities for future systems.

The aspects of reality that the model must have abstract representations for include organizational culture (e.g., security practices), technology, organizational policies, and the intruder environment. The representation of the model must be reliable (e.g., clearly mappable to reality), precise, and limited for practicality. The model should also support fuzzy numbers and fuzzy relationships. Note that user representation provided by the model may range from a simple gauge package (e.g., dashboard) to fully sensualized visualization techniques.

Our view of the most useful structure for the model is one that would support scenario-based "what-if" analyses, much like a "spreadsheet for survivability." This spreadsheet will provide a composite view of a system's survivability in the context of an organization and the intruder environment, and will allow you to ask questions. For example, "An IP-spoofing attack is directed at X. What's at risk?"

Such a model would be well suited for the kinds of risk-based tradeoff decisions that business executives typically make. This constraint-based propagational model has significant advantages over models that would merely "stamp" a component or system with a numerical survivability rating. Simple survivability rating numbers won't tell you what would happen in a specific attack scenario or how to improve your system's capacity to survive an attack. Moreover, a numerical rating won't tell you how a specific component or system would perform in the context of your organization's unique culture, policies, threats, and vulnerabilities.
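To give a feel for what a scenario-based "what-if" query could look like, the following Python sketch propagates risk through a graph of trust relationships. It is only an illustration under strong assumptions: the component names and trust links are hypothetical, risk is assumed to propagate transitively along every link, and organizational culture, policies, and fuzzy relationships are not modeled at all.

```python
from collections import deque

# Hypothetical model: each component maps to the components that trust it
# (for example, by accepting its source IP address as proof of identity).
TRUSTED_BY = {
    "X": ["file-server", "print-server"],
    "file-server": ["payroll-db"],
    "print-server": [],
    "payroll-db": [],
    "public-web": [],
}


def at_risk(target, trusted_by):
    """Return every component reachable from the attacked target via trust links."""
    exposed = set()
    queue = deque([target])
    while queue:
        component = queue.popleft()
        for dependent in trusted_by.get(component, []):
            if dependent != target and dependent not in exposed:
                exposed.add(dependent)
                queue.append(dependent)
    return exposed


# "An IP-spoofing attack is directed at X. What's at risk?"
baseline = at_risk("X", TRUSTED_BY)
print("at risk:", sorted(baseline))

# What if the file server stopped trusting X? Change one "cell" and recompute.
revised_model = dict(TRUSTED_BY, X=["print-server"])
revised = at_risk("X", revised_model)
print("at risk after the change:", sorted(revised))
```

Unlike a single numerical rating, the answer names the specific components exposed in the scenario and shows how a proposed change alters that exposure.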

Security and survivability architectures for unbounded systems

This work will enable us to understand and specify the security and survivability tradeoffs between alternative architectures. Ultimately it will allow us to understand and specify the tradeoffs between security and other attributes of software quality (such as performance, maintainability, and ease-of-use). The architecture models, in conjunction with the unbounded domain model, will enable us to characterize (and perhaps quantify) the risks associated with using specific architectures for particular applications or purposes, in the context of an unbounded domain.

Another benefit of our work will be to produce alternative architectures to replace the contemporary (flawed) firewall architecture. System administrators on the Internet are currently employing filtering routers and application gateways at critical points in the Internet infrastructure to restrict various services that have been found to be insecure. This architectural solution does not take into account the rapid growth of service types and advances that render firewall architecture ineffective (such as Mobile-IP, which allows a user to change their access point to the network while keeping the same source IP address). Our work will produce alternatives to the existing firewall architecture to allow administrators to maintain a high state of security at any specified point in the Internet or any other unbounded system.

2.3.2.3 Survivability and Architecture Tradeoff Analysis

The flip side of the problem of designers neglecting to consider security or survivability in their designs is that when security or survivability actually is considered, it is typically in isolation from the other attributes of software engineering quality, such as performance, dependability, modifiability, and ease of use. There is no common model and no common architectural language for understanding, assessing, and articulating the software engineering tradeoffs available to designers of survivable systems. The bottom line is that software designers, their managers, and their customers need to be able to specify the tradeoffs between enhanced survivability and the other attributes of software quality (including affordability). They also need to be able to evaluate the success of competing designs (and implementations) in achieving their overall specifications, including survivability.

In the next few years, we expect to see ground-breaking research attempt to bring together the various attributes associated with software engineering into an integrated software evaluation framework. This framework will allow the software design practitioner to consolidate the metrics associated with the individual attributes of software engineering for the purpose of deriving composite measures of software quality. As a result of this work, the practitioner will be able to analyze and evaluate tradeoffs among the various attributes on both a qualitative and quantitative basis. These software engineering attributes currently include:

At present, researchers in each of these separate "ilities" do not share a common language or common methodology, and therefore metrics for the various attributes cannot be combined and tradeoffs cannot be evaluated in any sensible way. The most glaring deficiency seen in the design of software systems today is that survivability is typically not among the software quality attributes being considered. Unfortunately, this is not surprising, given that in traditional software engineering the other quality attributes dominate the design process. Survivability and security, if considered at all, are an afterthought, relegated to a series of add-on patches that are put in place after problems are discovered and reported by the customer or by an incident response team. Because such patches are typically knee-jerk reactions to an emergency situation, rather than the result of a principled systems engineering design, they do not solve broad classes of problems, but instead only close a small number of holes and leave countless others open. In fact, such patches occasionally introduce new security vulnerabilities (or bring old ones back to life).

Even at the time that a patch is being designed, the focus of "how to patch the problem" is usually on speed and ease of solution, as well as performance and functionality (e.g., "don't degrade existing abilities") rather than on designing the best solution from a survivability or security standpoint. Conversely, the decision is sometimes made to maximize security and survivability (i.e., "close the hole") at any cost, leading to a design effort in which there is no consideration of appropriate tradeoffs.

The characteristics and the dimensions of the design tradeoffs that will arise once survivability is made an inherent part of the software engineering process are a fertile area for exploratory and applied research, with the potential for a very high-impact payoff in software engineering process improvement. Note that we will treat key survivability attributes at the architecture level. For instance, we will attempt to model variability at the component level.

The results of such work will include the development of survivability metrics, composite metrics, key practices, and an understanding of software engineering tradeoffs at the point of design instead of as post-design patches. Such work would also support the continuous improvement of survivable systems, using a principled evolutionary design process that allows architecture tradeoff analysis at every step of the way.
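To make the notion of a composite metric and a design-time tradeoff more concrete, the following minimal sketch compares two hypothetical architectures. It assumes each attribute has already been scored on a common 0-to-1 scale and combines the scores with a simple weighted average. The architectures, scores, weights, and the function composite_score are all invented for illustration, and a weighted average is only one simplistic way to combine attributes, not the evaluation framework anticipated in this report.

```python
from typing import Dict

Scores = Dict[str, float]


def composite_score(scores: Scores, weights: Scores) -> float:
    """Weighted average of per-attribute scores (assumed to lie in [0, 1])."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same attributes")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[a] * weights[a] for a in weights) / total_weight


# Invented scores for two hypothetical candidate architectures.
architecture_a = {"survivability": 0.9, "performance": 0.5, "ease_of_use": 0.5}
architecture_b = {"survivability": 0.4, "performance": 0.9, "ease_of_use": 0.8}

# Two invented weightings that express different organizational priorities.
survivability_first = {"survivability": 0.6, "performance": 0.2, "ease_of_use": 0.2}
performance_first = {"survivability": 0.2, "performance": 0.5, "ease_of_use": 0.3}

for label, weights in (("survivability first", survivability_first),
                       ("performance first", performance_first)):
    a = composite_score(architecture_a, weights)
    b = composite_score(architecture_b, weights)
    preferred = "A" if a > b else "B"
    print(f"{label}: A={a:.2f} B={b:.2f} preferred={preferred}")
```

With these invented numbers, the preferred architecture changes when the weighting changes. The sketch is meant only to show why a tradeoff has to be stated explicitly at design time rather than left implicit.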