WHAT ARE FIRST FAULT PROBLEM RESOLUTION TECHNOLOGIES???
by Dan Skwire
OVERVIEW
This paper will describe just what we mean by 'first fault problem resolution technologies',
how these technologies can contribute to lower costs, shorter service outages, better service delivery, and improved customer
usage of computer systems. We contrast these technologies with 'second fault problem resolution technologies' and the tools
used in both. We survey the broad range of facilities provided by commercial operating systems vendors for 'first and second'
fault resolution, and additionally describe the current spectrum of vendor products that support first fault problem resolution.
We recognize that we are using some terms uniquely and they thus require explicit definition:
-
first fault problem resolution technologies
-
second fault problem resolution technologies
-
tools
-
vendor products
-
problem reproduction – problem recreation – 'problem do-over' (equivalent terms)
DEFINITIONS
The phrase 'first fault problem resolution technologies' combines several terms and concepts that deserve
detailed description. Understanding them is fundamental to understanding the issues involved and the opportunities for improvement.
Let's break down the phrase into components: “first fault”, “problem”,
“resolution”, and finally “technologies”. Then we examine the phrase 'first fault problem resolution technologies'
in its entirety.
We later discuss 'second fault problem resolution' and related tools. We will compare 'first-fault'
problem resolution and its tools to 'second-fault' problem resolution and its tools.
'First fault' refers to the fact that in the real world, many products (software, hardware, or
otherwise) will eventually exhibit some indicator of an undesired behavior – a 'fault'. This 'fault' or 'defect' may
be major or minor. In the computer world, faults are often assigned severity levels, which give more detail
about the impact of the fault. The concept of receiving a 'first fault' implies that one can do something with this
notification, and this is often the case. In many environments, a complete failure is used to generate a repair action; in
other environments, a fault that does not imply total failure can be a reason to initiate some kind of repair and/or recovery
action, or re-action.
A “problem” refers to the fact that there is such an anomaly, a behavior that is
either not normal, or not desired. Problems need to be resolved. You can call an unusual behavior a 'quirk' or an 'anomaly',
but if it indicates a system state that needs to be altered for continued optimal operation, then we call that a 'problem'.
“Resolution” refers to some change in behavior that is needed – either total
or partial repair, or just some modification (if you are using 'gas' too quickly in a car, you can reduce your speed to lower
the rate at which you consume the gasoline, thus using it more efficiently). Thus, using multiple words together, 'problem
resolution' means performing some change in behavior to reach a more optimal, or desired, or normal state.
The whole point we are making here is that we need to identify, understand,
and devise a corrective solution not only for major problems, like computer crashes. If we can also get word of some intermediate
state of computer hardware or software, then we can take corrective action to avoid a totally failed 'crashed'
state. Surely, we want to be aware of any intermediate problem indications so we have the opportunity to implement a defensive,
corrective action. Example: seeing that there are 'soft' disk errors may lead one to watch or even aggressively replace the
affected disk drive before one experiences a solid 'hard' failure. Thus, 'first fault problem resolution' refers to being
aware of an initial computer problem with sufficient information to form a defensive action plan to resolve that error state.
“First fault problem resolution technologies” refer to the many ways, the many technologies
available to perform resolution of computer problems when their 'first' error indication is received. These technologies can
involve manual methods (a set of operator-readable scripts describing potential system states and recommended human actions
to take), or automatic methods (programmed systems) that would initiate system actions upon the receipt of particular error
stimuli.
Key steps in this process of 'first fault problem resolution' are:
-
recognizing that an error has occurred
-
data collection in such detail and accuracy that the problem can be analyzed and associated
with a solution/resolution to that single occurrence of the fault.
-
initiating an alert to a human or another program to start the resolution process
-
verifying that the alert has been received
-
verifying that the recovery action has started, processed and been successful
-
verifying that the likelihood of the problem recurring is minimized
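The key steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any vendor's product: the names FaultRecord and handle_first_fault, the 'signature' format, and the notify and recover callables are all hypothetical. The last step (minimizing recurrence) is not modeled; the signature is simply a key that a later occurrence could be matched against.

```python
import time
import traceback
from dataclasses import dataclass, field


@dataclass
class FaultRecord:
    """Everything captured at the moment the fault is first seen."""
    signature: str            # hypothetical key: exception type + failing function
    detail: str               # full traceback text
    captured_at: float = field(default_factory=time.time)
    alert_acknowledged: bool = False
    recovered: bool = False


def handle_first_fault(exc, notify, recover):
    """Run the key steps once for a single fault occurrence.

    notify(record) -> bool   returns True when the alert was received
    recover(record) -> bool  returns True when recovery succeeded
    Both callables are supplied by the caller (assumed interfaces).
    """
    # Recognize the error and collect data right away.
    tb = traceback.extract_tb(exc.__traceback__)
    where = tb[-1].name if tb else "unknown"
    record = FaultRecord(
        signature=f"{type(exc).__name__}@{where}",
        detail="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    )
    # Raise the alert and verify that it was received.
    record.alert_acknowledged = bool(notify(record))
    # Start recovery and verify its outcome.
    record.recovered = bool(recover(record))
    return record
```

The data is collected before the alert or the recovery is attempted, so the record survives even if either of those later steps fails.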
A second fault occurrence is a recurrence of that same first fault, either by accident
(recurrence in 'production'), or by explicit effort, as in a vendor or customer 'test laboratory'.
Very often 'problem reproduction' or 'problem recreation' is used with primitive technologies,
or in environments where the code is new and tools haven't been created, or in environments where there are only minimal
production tools to 'trap' and/or 'catch' the original (first) instance of the problem's occurrence. Sophisticated operating
system environments have built-in 'second-fault' tools, but many other environments require custom-building that second-fault
tool.
Along with 'problem reproduction/recreation' is an underlying, usually unspoken implication
that additional test tools often will be created, or code will be altered ('instrumented') to obtain sufficient diagnostic
information to debug that problem. With pre-planning, all that tooling and customized effort, often abandoned when the problem
is debugged, could have been built in advance – that, in essence, is what a FIRST FAULT PROBLEM RESOLUTION tool is!
Very often 'first fault problem resolution' tools are not created by designers building a product, a priori, in a laboratory,
but by experienced problem-solvers, who have had to build this instrumentation/scaffolding for 'second-fault problem resolution'
after they are given a product with insufficient first-fault problem resolution capabilities!
“Problem reproduction (recreation)” is unfortunate, and it yields many challenges:
-
how do you know for sure that you haven't just recreated the same problem's symptom with a different
problem? You end up chasing a different problem, and only find out much later on.
-
It takes time to set up a problem reproduction environment.
-
It costs money to build the additional problem reproduction environment: additional hardware,
software, environmentals, support personnel, licenses, etc.
-
While the support person is setting up the problem repro, the problem may happen again in production
to the customer. Don't you have any tools to collect better data at the customer site where the problem REALLY happens (perhaps
again and again)? Surely it is embarrassing when the customer continues to have the problem and you can only obtain data in
your 'simulated customer environment', while the real customer has no data-capture tools in his environment!
-
There may be repeated 'CNN stories' as newspapers and other media repeat the story of the calamity
that continues to repeat because insufficient problem resolution facilities were available when the problem first occurred.
Of course, rapid 'first fault' problem resolution will prevent:
-
availability shortfalls/problems
-
instrumenting the product
-
reproducing/recreating the problem
-
explaining excessive downtime
-
user dissatisfaction while their problem is unresolved
FAULT TOLERANT SYSTEMS CHALLENGES
A very proactive organization will use one of many fault tolerant techniques (Microsoft Cluster,
SUN Clustering, AIX clustering, Tandem Non-Stop, Stratus, Marathon, IBM's MVS, etc.) to provide a means of automatically and transparently
recovering from a major fault that impacts major hardware (servers/storage). These functions work, but they may mask the original
error and make the task of determining it even more difficult. The reason is that the recovery process can re-initialize
much of the data describing system status, so that data would need to be captured before (or during) system recovery
in order to debug the original problem. What good is recovering from a problem if you continue to rapidly have that problem,
over and over again? You need a way of capturing data sufficient to debug that problem on the problem's first occurrence (hopefully,
you can stop data collection for any repeat occurrences of that same problem, if you so choose).
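A small Python sketch may make the ordering concrete: status data is copied before recovery runs, and only for the first occurrence of a given fault. The class name, the 'signature' string, and the read_state and recover callables are hypothetical and stand in for whatever a real fault tolerant product exposes.

```python
class FirstOccurrenceCapture:
    """Save diagnostic state before recovery, once per fault signature."""

    def __init__(self):
        self.captures = {}   # signature -> state saved at the first occurrence
        self.counts = {}     # signature -> number of occurrences seen

    def on_fault(self, signature, read_state, recover):
        """read_state() returns a dict of status data; recover() may reset it."""
        self.counts[signature] = self.counts.get(signature, 0) + 1
        if signature not in self.captures:
            # Copy the state BEFORE recovery, which may re-initialize it.
            self.captures[signature] = dict(read_state())
        recover()
        return self.captures[signature]
```

Repeat occurrences still increment the count, so the frequency of the problem is visible even though the data collection itself happens only once.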
KINDS OF TECHNOLOGIES AVAILABLE FOR FIRST FAULT PROBLEM RESOLUTION
We now survey the range of products available to facilitate first-fault problem resolution,
describing first the vendor operating system features, then the various classes of additional vendor tools.
VENDOR OPERATING SYSTEM FEATURES
Messages: Of note are the IBM mainframe operating systems, which contain coded and architected
messages that facilitate message automation and highly effective database search. Other platforms have varying message schemas,
both for normal-operation status indicators and for error status descriptions. Usually messages are kept in a disk-based
log file. There is a vendor product, LogLogic, which analyzes many platforms' accumulated message logs.
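To illustrate why coded messages lend themselves to automation, here is a hedged Python sketch. The message layout (a three-letter component prefix, a serial number, and a one-letter severity code) and every message ID shown are invented for illustration; they are not taken from any vendor's message catalog.

```python
import re

# Assumed layout: component prefix, serial number, one-letter severity,
# then free text. This pattern is an illustration, not a vendor format.
MESSAGE_ID = re.compile(
    r"^(?P<prefix>[A-Z]{3})(?P<number>\d{3,5})(?P<severity>[IWEA])\s+(?P<text>.*)$"
)

SEVERITY_NAMES = {
    "I": "information",
    "W": "warning",
    "E": "error",
    "A": "action required",
}


def parse_message(line):
    """Split a coded message into fields, or return None for free-form text."""
    match = MESSAGE_ID.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = SEVERITY_NAMES[fields["severity"]]
    return fields


def needs_attention(lines):
    """Return message IDs (prefix + number) for everything above information level."""
    flagged = []
    for line in lines:
        parsed = parse_message(line)
        if parsed and parsed["severity"] != "information":
            flagged.append(parsed["prefix"] + parsed["number"])
    return flagged
```

Because the ID and severity sit at fixed positions, automation can key on them directly and the same ID can serve as a search term in a problem database; free-form log text offers neither.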
System trace: This internal operating system feature in the IBM mainframe operating systems,
provides a continuous wrap-around trace of major operating system events (task dispatches, I/O initiations and interrupts,
faults, etc). It has been continually engineered with microcode and operating system tailoring, analysis, and speedups since
the 1960's. It is a true 'black-box'.
IBM z/OS and its predecessors are the only known operating system environments to have the
'system trace' running by default when the system first comes up. Other vendors have been known to have various sub-system
and/or application traces running concurrently, by default, when their software first comes up, but this is not yet a known and
accepted practice in the computer industry. There is a continual fear of the few percentage points of performance lost due
to trace overhead. However, in non-computer systems, such as automobiles, trains, and trucks, black-boxes are being successfully
implemented for first-fault problem resolution.
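The wrap-around idea behind such a trace can be shown with a short Python sketch. It is only an analogy for the concept: the class and method names are hypothetical, and a real operating system trace is implemented very differently.

```python
import time
from collections import deque


class WrapAroundTrace:
    """Fixed-size in-memory trace; the oldest entries are overwritten first."""

    def __init__(self, capacity=1024):
        self._entries = deque(maxlen=capacity)

    def record(self, event, **data):
        # Appending to a full deque silently drops the oldest entry.
        self._entries.append((time.monotonic(), event, data))

    def snapshot(self):
        """Copy the trace so it can be saved with other first-fault data."""
        return list(self._entries)
```

Because the buffer has a fixed size, the trace can be left running from startup at a bounded memory cost, and the most recent events are always on hand when a fault occurs.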
Storage dumps: these are usually available for system problems and application problems. Problems
detected within the operating system require dumping of system internal data areas; application-detected problems will generate
a usually smaller dump of the application data space. Both kinds of storage dumps may be challenged in that rapidly-changing
data areas ('volatile data') may require special processing in order to be useful. Often there are storage dump tailoring
options available. There is great sophistication in this area in the IBM z/OS mainframe operating system, and many facilities
have recently been added to the SUN Solaris system. Dumping of system data is primitive in Linux operating systems. Microsoft has improved
its system storage dumping and formatting facilities. The formatting and analysis of the storage dump by a vendor-provided
tool is also key. IBM's IPCS is a leader; unix environments have great collaborative tools in KDB and MDB.
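At the application level, the same idea can be approximated in Python by walking the traceback of a failure and recording each frame's local variables. This is a rough sketch only; dump_frames is a hypothetical helper, and it captures far less than a true storage dump would.

```python
def dump_frames(exc):
    """Return (function name, line number, locals) for each frame of the fault."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        # repr() each value now: the objects may change after the fault,
        # which is the 'volatile data' concern in miniature.
        local_vars = {name: repr(value) for name, value in frame.f_locals.items()}
        frames.append((frame.f_code.co_name, tb.tb_lineno, local_vars))
        tb = tb.tb_next
    return frames
```

Converting the values to text at capture time freezes them, at the cost of losing the live objects for later inspection.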
Performance monitor data: This gross macroscopic data, containing CPU utilization levels, channel
utilization, device utilization, storage utilization and paging rates, etc, can be very valuable in working performance problems,
of course. The data can be of limited value in resolving a defect where a part of the software just breaks, or crashes. Major
examples include unix's iostat, mpstat, and cpustat; Microsoft Windows Task Manager; and IBM z/OS's RMF.
Error data: Sophisticated computer systems have separate databases of the various kinds of errors
received – soft and hard errors, within various components such as server, storage, network, or software. In the
UNIX world, IBM AIX has the 'errpt' command, which generates error record reports. In the IBM mainframe world, the program
known as EREP performs similar functions, but also includes archiving, trending, reporting/analysis, etc. Microsoft Windows
has a similar Event Log, which is viewed as an administrator's tool and has gained increasing levels of detail over time.
Generalized data collectors: Many systems have features that will either manually or automatically
collect appropriate diagnostic data. Various unix vendors have tools to manually collect data: SUN Solaris has the 'explorer'
report, IBM AIX has the 'snap' data collector. In addition, SUN storage has an 'extractor' tool to collect storage array information.
The IBM z/OS (“MVS”) operating system includes automatic data collection, performed by the recovery/termination
process: trace, stack, volatile register data are all collected as part of the “SUMDUMP” system dump process.
There is no formal 'data collector' for Linux; however, one was written by an IBM Linux person and made generally available
on the internet (see 'Best Practices for Solving Problems In Non-z/OS Environments').
Very often 'alerts' are generated for transient errors – the classic example is a disk
drive 'soft' error. These soft errors are remedied via retry of the temporary error; similarly, processor storage can experience
a transient error and can have data recomputed via parity or Error-Correcting Codes (ECC).
A proactive organization or product would want to get notified of these soft errors –
for either immediate action, trending, etc.
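The retry-and-trend pattern described above might look like the following Python sketch. The class name, the per-device counter, and the alert threshold are assumptions made for illustration; real drivers and storage arrays apply their own retry and reporting policies.

```python
from collections import Counter


class SoftErrorMonitor:
    """Retry transient errors, count them per device, and flag a trend."""

    def __init__(self, max_retries=3, alert_threshold=5):
        self.max_retries = max_retries
        self.alert_threshold = alert_threshold   # hypothetical tuning value
        self.soft_errors = Counter()
        self.alerts = []

    def run(self, device, operation):
        """Call operation(); retry on OSError and record each soft error."""
        for attempt in range(self.max_retries + 1):
            try:
                return operation()
            except OSError:
                if attempt == self.max_retries:
                    raise   # retries exhausted: this is now a hard error
                self.soft_errors[device] += 1
                if self.soft_errors[device] == self.alert_threshold:
                    self.alerts.append(device)
```

The caller still gets its data after a successful retry, while the counter preserves the evidence that would otherwise be hidden by the retry.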
In many environments, network-connected programs will receive these alerts and there is automation
to process and/or react to them. In other environments, the alerts will trigger a 'phone-home' to an external service organization.
Phone-home has been utilized extensively for storage (disk array technologies), since data loss can have great impact.
Initially, phone-home was used for servers (IBM RSS circa 1981).
Usually, the external service organization receiving the 'phone-home' can also 'dial-in' to
remotely examine the system state, in real-time, before sending out a repair person. Often the failing part can be known and
recommended for repair within minutes of initial notification. In recent years, phone-home via the internet has been exploited
as security concerns are resolved.
There are other technologies that provide manual aids for problem diagnosis (aids for
a physical person, onsite), as supplied by Qualtech Systems.
SURVEY OF VENDOR PRODUCTS CURRENTLY AVAILABLE FOR FIRST FAULT PROBLEM RESOLUTION
Probably the most notable tool for general first fault problem resolution today is the BMC AppSight
product. It provides telemetry, and the ability to replay a failing transaction in Microsoft .NET and J2EE environments. It includes
a true black box. There are few competitors – one is AVIcode.
There are extensive computer system monitors: they monitor server performance, up/down
state and multiple performance parameters. There are many products in this category; Nimsoft is one example.
Change management: monitor changes within your Enterprise's server complex, and have an audit
trail to do forensic defensive analysis if there are any problems. Tripwire is an example.
SURVEY OF TOOLS AVAILABLE FOR SECOND FAULT PROBLEM RESOLUTION
SUN Solaris has a very highly developed and sophisticated tool, DTrace, available since Solaris 10
(2005), to provide 'visibility' into several thousand specific system and application operations. It is very flexible, minimally
disruptive, and very well received. DTrace was built as a competitor to the IBM Generalized Trace Facility (GTF), which
monitors and records a smaller number of system and application events. Both these tools come supplied with their respective
operating systems.
Since 1970, an industry-unique tool has been the IBM Program Event Recording hardware (“PER”
hardware), which allows monitoring of specific programs or address ranges for instruction fetching, or specific storage ranges
for storage alteration; this is accomplished with special proprietary hardware and associated software on the IBM z/OS and
z/VM operating systems. PER monitoring is a powerful second-fault tool that can also be used for performance and/or other
event visibility research. There are no known tools outside of IBM using this feature, although at one time the Intel x86
architecture included similar functionality; there is no known software that uses it to date. It is not known
if the hardware feature still works.
CompuWare STROBE – production 'hot-spot' performance visibility and analysis tool, for
IBM mainframes.
There are many other software development tools to provide instruction analysis, but very few
are suitable for production (high CPU usage, and performance-sensitive), vs. the above listed tools.