First Fault Problem Resolution Technologies LLC
WHAT ARE FIRST FAULT PROBLEM RESOLUTION TECHNOLOGIES?


by Dan Skwire

OVERVIEW

This paper describes what we mean by 'first fault problem resolution technologies' and how these technologies can contribute to lower costs, shorter service outages, better service delivery, and improved customer usage of computer systems. We contrast these technologies with 'second fault problem resolution technologies' and the tools used in both. We survey the broad range of facilities provided by commercial operating system vendors for first and second fault resolution, and describe the current spectrum of vendor products that support first fault problem resolution.


We recognize that we are using some terms uniquely and they thus require explicit definition:

  • first fault problem resolution technologies

  • second fault problem resolution technologies

  • tools

  • vendor products

  • problem reproduction – problem recreation – 'problem do-over' (equivalent terms)


DEFINITIONS

The phrase 'first fault problem resolution technologies' combines a set of terms and concepts that deserve detailed description. They are fundamental to understanding the issues involved and the opportunities for improvement.


Let's break down the phrase into components: “first fault”, “problem”, “resolution”, and finally “technologies”. We then examine the phrase 'first fault problem resolution technologies' in its entirety.

We later discuss 'second fault problem resolution' and its related tools, and compare 'first-fault' problem resolution and its tools to 'second-fault' problem resolution and its tools.


'First fault' refers to the fact that in the real world, many products (software, hardware, or otherwise) will eventually exhibit some indicator of an undesired behavior: a 'fault'. This 'fault' or 'defect' may be major or minor. In the computer world, faults are often assigned severity levels, which give more detail about the impact of the fault. The concept of receiving a 'first fault' implies that one can do something with this notification, and this is often the case. In many environments, a complete failure is what triggers a repair action; in other environments, a fault that does not imply total failure can be a reason to initiate some kind of repair or recovery action.


A “problem” refers to the fact that there is such an anomaly, a behavior that is either not normal, or not desired. Problems need to be resolved. You can call an unusual behavior a 'quirk' or an 'anomaly', but if it indicates a system state that needs to be altered for continued optimal operation, then we call that a 'problem'.


“Resolution” refers to some change in behavior that is needed – either total or partial repair, or just some modification (if you are using 'gas' too quickly in a car, you can reduce your speed to lower the rate at which you consume the gasoline, thus using it more efficiently). Thus, using multiple words together, 'problem resolution' means performing some change in behavior to reach a more optimal, or desired, or normal state.


The whole point we are making here is that we need not only to identify, understand, and devise a corrective solution for major problems, like computer crashes; if we can also get word of some intermediate state of the computer hardware or software, we may be able to take corrective action that avoids a totally failed, 'crashed' state. Surely, we want to be aware of any intermediate problem indications so we have the opportunity to implement a defensive, corrective action. Example: seeing that there are 'soft' disk errors may lead one to watch or even aggressively replace the affected disk drive before one experiences a solid 'hard' failure. Thus, 'first fault problem resolution' refers to being aware of an initial computer problem with sufficient information to form a defensive action plan to resolve that error state.


“First fault problem resolution technologies” refer to the many ways, the many technologies available to perform resolution of computer problems when their 'first' error indication is received. These technologies can involve manual methods (a set of operator-readable scripts describing potential system states and recommended human actions to take), or automatic methods (programmed systems) that would initiate system actions upon the receipt of particular error stimuli.


Key steps in the process of 'first fault problem resolution' are:

  • recognizing that an error has occurred

  • collecting data in such detail and accuracy that the problem can be analyzed and associated with a solution/resolution from that single occurrence of the fault

  • initiating an alert to a human or another program to start the resolution process

  • verifying that the alert has been received

  • verifying that the recovery action has started, processed and been successful

  • verifying that the likelihood of the problem recurring is minimized
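
The steps above can be sketched in code. This is only a minimal illustration, not a description of any vendor's implementation: the names `FaultRecord` and `handle_first_fault`, and the `collect`, `alert`, and `recover` callbacks, are hypothetical and stand in for whatever a real platform provides. The sketch assumes the error has already been recognized, and it leaves the last step (confirming that recurrence is minimized) to later follow-up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultRecord:
    """Everything captured about one occurrence of a fault (hypothetical type)."""
    fault_id: str
    diagnostics: Dict[str, str]
    alert_acknowledged: bool = False
    recovered: bool = False
    log: List[str] = field(default_factory=list)


def handle_first_fault(
    fault_id: str,
    collect: Callable[[], Dict[str, str]],
    alert: Callable[[FaultRecord], bool],
    recover: Callable[[FaultRecord], bool],
) -> FaultRecord:
    # Step 1: the caller has recognized the error and passes its identifier.
    # Step 2: collect data immediately, before recovery can overwrite it.
    record = FaultRecord(fault_id, collect())
    record.log.append("collected")
    # Steps 3 and 4: raise the alert and verify that it was received.
    record.alert_acknowledged = alert(record)
    record.log.append(
        "alert acknowledged" if record.alert_acknowledged else "alert NOT acknowledged"
    )
    # Step 5: run the recovery action and verify that it succeeded.
    record.recovered = recover(record)
    record.log.append("recovered" if record.recovered else "recovery FAILED")
    return record
```

The ordering is the point of the sketch: data collection comes before the alert and before recovery, so that the single occurrence leaves enough evidence behind to be analyzed.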



A second fault occurrence is a recurrence of that same first fault, either by accident (recurrence in 'production') or by explicit effort, as in a vendor or customer 'test laboratory'.

Very often 'problem reproduction' or 'problem recreation' is used with primitive technologies, in environments where the code is new and tools haven't yet been created, or in environments where there are simply too few production tools to 'trap' or 'catch' the original (first) instance of the problem. Sophisticated operating system environments have built-in 'second-fault' tools, but many other environments require that second-fault tool to be custom-built.


Along with 'problem reproduction/recreation' is an underlying, usually unspoken implication that additional test tools often will be created, or code will be altered ('instrumented') to obtain sufficient diagnostic information to debug that problem. With pre-planning, all that tooling and customized effort, often abandoned when the problem is debugged, could have been built in advance – that, in essence, is what a FIRST FAULT PROBLEM RESOLUTION tool is! Very often 'first fault problem resolution' tools are not created by designers building a product, a priori, in a laboratory, but by experienced problem-solvers, who have had to build this instrumentation/scaffolding for 'second-fault problem resolution' after they are given a product with insufficient first-fault problem resolution capabilities!


“Problem reproduction (recreation)” is unfortunate, and it yields many challenges:

  • How do you know for sure that you haven't just recreated the same symptom with a different underlying problem? You end up chasing a different problem, and only find out much later.

  • It takes time to set up a problem reproduction environment.

  • It costs money: the additional problem reproduction environment needs additional hardware, software, environmentals, support personnel, licenses, etc.

  • While the support person is setting up the problem reproduction, the problem may happen again in production to the customer. Don't you have any tools to collect better data at the customer site, where the problem REALLY happens (perhaps again and again)? Surely it is embarrassing when the customer continues to have the problem and you can only obtain data in your 'simulated customer environment', while the real customer has no data-capture tools in his environment!

  • There may be repeated 'CNN stories' as newspapers and other media repeat the story of the calamity that continues to recur because insufficient problem resolution facilities were available when the problem first occurred.

Of course, rapid 'first fault' problem resolution will prevent:

  • accruing excessive downtime

  • bad publicity world-wide

  • availability shortfalls/problems

  • instrumenting the product

  • reproducing/recreating the problem

  • explaining excessive downtime

  • user dissatisfaction while their problem is unresolved


FAULT TOLERANT SYSTEMS CHALLENGES


A very proactive organization will use one of many fault tolerant technologies (Microsoft Cluster, SUN Clustering, AIX clustering, Tandem Non-Stop, Stratus, Marathon, IBM's MVS, etc.) to provide a means of automatically and transparently recovering from a major fault that impacts major hardware (servers/storage). These functions work, but they may mask the original error and make it even more difficult to determine. The reason is that the recovery process can re-initialize much of the data describing system status, so that data needs to be captured before (or during) system recovery in order to debug the original problem. What good is recovering from a problem if you continue to have that problem, over and over again? You need a way of capturing data sufficient to debug that problem on its first occurrence (and you can then stop data collection for any repeat occurrences of that same problem, if you so choose).
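
As a rough sketch of this idea, the following code snapshots volatile state before the recovery routine runs, and skips the capture for repeat occurrences of the same fault. The class name `CaptureBeforeRecovery`, the `snapshot` callback, and the notion of a fault 'signature' string are assumptions made for illustration only; real fault tolerant platforms expose their own hooks.

```python
import time
from typing import Callable, Dict, Set


class CaptureBeforeRecovery:
    """Snapshot volatile state before a recovery routine resets it."""

    def __init__(self, snapshot: Callable[[], Dict[str, object]]):
        self._snapshot = snapshot      # hypothetical reader of system status
        self._seen: Set[str] = set()   # fault signatures already captured
        self.captures: Dict[str, Dict[str, object]] = {}

    def on_fault(self, signature: str, recover: Callable[[], None]) -> bool:
        """Run recovery; return True if data was captured for this occurrence."""
        captured = False
        if signature not in self._seen:
            # First occurrence: copy the state BEFORE recovery re-initializes it.
            data = dict(self._snapshot())
            data["captured_at"] = time.time()
            self.captures[signature] = data
            self._seen.add(signature)
            captured = True
        recover()                      # recovery always runs, captured or not
        return captured
```

Recovery is never delayed for a repeat occurrence, while the first occurrence pays a one-time cost to preserve the evidence that recovery would otherwise erase.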



KINDS OF TECHNOLOGIES AVAILABLE FOR FIRST FAULT PROBLEM RESOLUTION

We now survey the range of products available to facilitate first-fault problem resolution, describing first the vendor operating system features, then the various classes of additional vendor tools.



VENDOR OPERATING SYSTEM FEATURES

Messages: Of note are the IBM mainframe operating systems, which contain coded and architected messages that facilitate message automation and highly effective database search. Other platforms have varying message schemas, both for normal operation (status indicators) and for error status descriptions. Usually the messages are kept in a disk-based log file. There is a vendor product, LogLogic, which analyzes the accumulated message logs of many platforms.
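
To illustrate why coded messages make automation easier, here is a small sketch that splits a message into an identifier and free text, then looks the identifier up in an action table. The message layout (a letter prefix, a number, and a one-letter type code) is only loosely modeled on mainframe-style identifiers and is an assumption, as are the sample identifiers and the contents of the `ACTIONS` table.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed layout: component prefix, number, one-letter type code, then text.
MESSAGE_RE = re.compile(
    r"^(?P<prefix>[A-Z]{3,5})(?P<number>\d{3,5})(?P<type>[A-Z])\s+(?P<text>.*)$"
)

# Hypothetical automation table keyed by message identifier.
ACTIONS: Dict[str, str] = {
    "ABC123E": "page the storage on-call engineer",
    "ABC200I": "no action; informational",
}


def parse_message(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (message id, type code, text), or None for an uncoded line."""
    match = MESSAGE_RE.match(line.strip())
    if match is None:
        return None
    msg_id = match.group("prefix") + match.group("number") + match.group("type")
    return msg_id, match.group("type"), match.group("text")


def action_for(line: str) -> str:
    """Choose an automated reaction from the message identifier alone."""
    parsed = parse_message(line)
    if parsed is None:
        return "unparsed; route to a human"
    return ACTIONS.get(parsed[0], "unknown message; search the knowledge base")
```

Because the lookup key is the identifier rather than the wording, the text of a message can change between releases without breaking the automation or the database search.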


System trace: This internal feature of the IBM mainframe operating systems provides a continuous wrap-around trace of major operating system events (task dispatches, I/O initiations and interrupts, faults, etc.). It has been continually engineered with microcode and operating system tailoring, analysis, and speedups since the 1960s. It is a true 'black box'.
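
The wrap-around idea can be shown with a fixed-size buffer in which the newest entry overwrites the oldest. This is a generic sketch of the technique, not the IBM implementation; the class name `WrapAroundTrace` and the event names are invented for the example.

```python
from collections import deque
from itertools import count
from typing import Deque, List, Tuple


class WrapAroundTrace:
    """Fixed-size in-memory trace: new entries overwrite the oldest ones."""

    def __init__(self, capacity: int):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._entries: Deque[Tuple[int, str, str]] = deque(maxlen=capacity)
        self._seq = count(1)

    def record(self, event: str, detail: str = "") -> None:
        """Append one trace entry with a running sequence number."""
        self._entries.append((next(self._seq), event, detail))

    def dump(self) -> List[Tuple[int, str, str]]:
        """Return surviving entries, oldest first, e.g. for a storage dump."""
        return list(self._entries)
```

Memory use is bounded no matter how long the system runs, and at the moment of a fault the buffer holds the most recent events leading up to it.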


IBM z/OS and its predecessors are the only known operating system environments to have the 'system trace' running by default when the system first comes up. Other vendors have been known to have various subsystem and/or application traces running by default when their software first comes up, but this is not yet a known and accepted practice in the computer industry. There is a continual fear of the few percentage points of performance lost to trace overhead. However, in non-computer systems, such as automobiles, trains, and trucks, black boxes are being successfully implemented for first-fault problem resolution.


Storage dumps: These are usually available for both system problems and application problems. Problems detected within the operating system require dumping of system internal data areas; application-detected problems usually generate a smaller dump of the application data space. Both kinds of storage dump face the challenge that rapidly changing data areas ('volatile data') may require special processing in order to be useful. Often there are storage dump tailoring options available. There is great sophistication in this area in the IBM z/OS mainframe operating system, and many facilities have recently been added to SUN Solaris. Dumping of system data is primitive in Linux operating systems. Microsoft has improved its system storage dumping and formatting facilities. The formatting and analysis of the storage dump by a vendor-provided tool is also key: IBM's IPCS is a leader, and Unix environments have great collaborative tools in KDB and MDB.


Performance monitor data: This gross macroscopic data, containing CPU utilization levels, channel utilization, device utilization, storage utilization, paging rates, etc., can of course be very valuable in working performance problems. It can be of limited value in resolving a defect where a part of the software just breaks or crashes. Major examples include the Unix tools iostat, mpstat, and cpustat, Microsoft Windows Task Manager, and IBM z/OS RMF.


Error data: Sophisticated computer systems keep separate databases of the various kinds of errors received (soft and hard errors) within various components such as server, storage, network, or software. In the UNIX world, the 'errpt' command (on IBM AIX, for example) generates error record reports. In the IBM mainframe world, the program known as EREP performs similar functions, and also includes archiving, trending, and reporting/analysis. Microsoft Windows has a similar Event Log, which is viewed as an administrator's tool and has gained increasing levels of detail over time.


Generalized data collectors: Many systems have features that will either manually or automatically collect appropriate diagnostic data. Various Unix vendors have tools to manually collect data: SUN Solaris has the 'explorer' report, and IBM AIX has the 'snap' data collector. In addition, SUN storage has an 'extractor' tool to collect storage array information. The IBM z/OS (“MVS”) operating system includes automatic data collection, performed by the recovery/termination process: trace, stack, and volatile register data are all collected as part of the “SUMDUMP” system dump process. There is no formal 'data collector' for Linux; however, one was written by an IBM Linux person and made generally available on the internet (see 'Best Practices for Solving Problems In Non-z/OS Environments').
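
A generalized collector can be sketched as a set of small, independent collection routines whose results are bundled into one report. The code below is a toy illustration using only the Python standard library; it is not modeled on 'explorer', 'snap', or any other vendor tool, and the function names `collect_diagnostics` and `to_json` are hypothetical.

```python
import json
import platform
import time
from typing import Callable, Dict, Optional


def collect_diagnostics(
    extra: Optional[Dict[str, Callable[[], object]]] = None,
) -> Dict[str, object]:
    """Gather basic system facts plus any caller-supplied collectors."""
    collectors: Dict[str, Callable[[], object]] = {
        "hostname": platform.node,
        "os": platform.system,
        "os_release": platform.release,
        "machine": platform.machine,
    }
    collectors.update(extra or {})
    report: Dict[str, object] = {"collected_at": time.time()}
    for name, collector in collectors.items():
        try:
            report[name] = collector()
        except Exception as exc:
            # One failing collector must not abort the whole collection.
            report[name] = f"<collection failed: {exc!r}>"
    return report


def to_json(report: Dict[str, object]) -> str:
    """Serialize the report so it can be attached to a problem record."""
    return json.dumps(report, indent=2, default=str, sort_keys=True)
```

Isolating each collector matters in practice: a system that has just faulted is exactly the one on which some data sources will be unavailable, and a partial report is far more useful than none.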


Very often 'alerts' are generated for transient errors – the classic example is a disk drive 'soft' error. These soft errors are remedied via retry of the temporary error; similarly, processor storage can experience a transient error and can have data recomputed via parity or Error-Correcting Codes (ECC).

A proactive organization or product would want to be notified of these soft errors, whether for immediate action or for trending.
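
Trending of soft errors can be sketched as a count of recent errors per device, with an alert once the count within a time window reaches a threshold. The class name `SoftErrorTrend`, the device names, and the threshold and window values are illustrative assumptions, not figures recommended by any vendor.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque


class SoftErrorTrend:
    """Flag a device once it logs too many soft errors in a time window."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window = window_seconds
        self._times: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def report(self, device: str, timestamp: float) -> bool:
        """Record one soft error; return True when the device needs attention."""
        times = self._times[device]
        times.append(timestamp)
        # Discard errors that have aged out of the sliding window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

A single retried error stays quiet, while a cluster of them on one device prompts action before a 'hard' failure occurs.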


In many environments, network-connected programs will receive these alerts, and there is automation to process and/or react to them. In other environments, the alerts will trigger a 'phone-home' to an external service organization. Phone-home has been utilized extensively for storage (disk array) technologies, since data loss can have great impact. Initially, phone-home was used for servers (IBM RSS circa 1981).


Usually, the external service organization receiving the 'phone-home' can also 'dial-in' to remotely examine the system state, in real time, before sending out a repair person. Often the failing part can be identified and recommended for replacement within minutes of initial notification. In recent years, phone-home via the internet has been exploited as security concerns are resolved.



There are other technologies that provide diagnostic aids for a person working onsite, such as those supplied by Qualtech Systems.


SURVEY OF VENDOR PRODUCTS CURRENTLY AVAILABLE FOR FIRST FAULT PROBLEM RESOLUTION

Probably the most notable tool for general first fault problem resolution today is the BMC AppSight product. It provides telemetry and the ability to replay a failing transaction in Microsoft .NET and J2EE environments. It includes a true black box. There are few competitors; one is AVIcode.


There are extensive computer system monitors: they monitor server up/down state and multiple performance parameters. There are many products in this category, Nimsoft among them.


Change management: monitor changes within your enterprise's server complex, and keep an audit trail for forensic, defensive analysis if there are any problems. An example is Tripwire.


SURVEY OF TOOLS AVAILABLE FOR SECOND FAULT PROBLEM RESOLUTION

SUN Solaris has a very highly developed and sophisticated tool, DTrace, introduced with Solaris 10 (released in 2005), which provides 'visibility' into several thousand specific system and application operations. It is very flexible, minimally disruptive, and very well received. DTrace was built as a competitor to the IBM Generalized Trace Facility (GTF), which monitors and records a smaller number of system and application events. Both these tools come supplied with their respective operating systems.


Since 1970, an industry-unique tool has been IBM's Program Event Recording ('PER') hardware, which allows monitoring of specific programs or address ranges for instruction fetching, or of specific storage ranges for storage alteration; this is accomplished with special proprietary hardware and associated software on the IBM z/OS and z/VM operating systems. PER monitoring is a powerful second-fault tool that can also be used for performance and/or other event visibility research. There are no known tools outside of IBM using this feature. The Intel x86 architecture provides broadly similar functionality through its debug registers, which debuggers use for hardware breakpoints and watchpoints, but we know of no production problem resolution tool built on it.


Compuware STROBE: a production 'hot-spot' performance visibility and analysis tool for IBM mainframes.


There are many other software development tools that provide instruction analysis, but very few are suitable for production environments, where CPU usage is high and workloads are performance-sensitive, unlike the tools listed above.


