Classifying problems
Faults need to be classified according to their criticality, that is, the
impact a specific problem may have on business operations. The questions that
need to be answered may include:
· How critical is this problem?
· What is the impact on the overall operations of the business?
· Should the contingency and disaster recovery plan be enacted?
· Does the business have the expertise to deal with the problem and provide a
  satisfactory solution?
Problems that are regarded as non-critical (low criticality) do not threaten
the daily operations of a business. Operations will continue with some level
of disruption. This disruption may affect a standalone system, a series of
systems or an entire network. An example of a non-critical problem would be an
Internet server going down due to a hardware failure. This is certainly
non-routine, but if the business does not rely on the Internet for its core
operations, those operations can continue, only without Internet access.
Problems that are regarded as critical have the potential to seriously impair
the function of a business. These types of faults will generally require IT
personnel to enact a contingency and disaster recovery plan. Businesses that
are not prepared for these types of faults and that have not formulated a
sound contingency and disaster recovery strategy will suffer serious
consequences, including a total halt of business operations and loss of
revenue. An example of this type of fault would be an inaccessible database
server holding inventory, ordering and sales data, without which business
cannot proceed.
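The difference between the two examples above can be sketched in code. This is
a minimal illustration only, not a prescribed method: the function name
`is_critical` and the service names are hypothetical, and the sketch assumes
that a fault is critical only when the failed service is one the business
needs for its core operations.

```python
def is_critical(failed_service, core_services):
    """Return True if the failed service is needed for core operations.

    Assumption for illustration: criticality depends only on whether the
    failed service appears in the set of core services.
    """
    return failed_service in core_services


# Hypothetical business whose core operations depend on its database server
# but not on Internet access.
core_services = {"database server"}

print(is_critical("internet server", core_services))  # non-critical fault
print(is_critical("database server", core_services))  # critical fault
```

In practice the assessment also weighs the other questions listed earlier,
such as whether the business has the expertise to resolve the problem.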
Quite often, IT support managers and supervisors are responsible for assessing
the criticality of faults. Companies use different scales to represent
criticality. The following is one suggestion of how such a scale could be
implemented:
Criticality Level or Risk | Definition | Disaster Recovery
1 | High potential impact to a large number of users. Involves network/system down time. | Enact Disaster Recovery Plan.
2 | High potential impact to a large number of users or a business-critical service. May result in some down time. | May require enacting Disaster Recovery Plan.
3 | Medium potential impact to a smaller number of users or a business service. Resolution may require some down time. | Disaster Recovery Plan enactment not warranted. Remedial action required.
4 | Lower potential service or user impact. Resolution may require some down time. | Disaster Recovery Plan enactment not warranted. Remedial action required.
5 | No user or service impact. No down time. | Disaster Recovery Plan enactment not warranted. Remedial action optional.
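A scale like the one above can be recorded as a simple lookup so that support
staff apply it consistently. The following is a sketch under stated
assumptions: the names `DISASTER_RECOVERY_ACTION` and `recovery_action` are
hypothetical, and the wording of each action is taken from the suggested scale
rather than from any standard.

```python
# Suggested criticality scale, keyed by level (1 = most severe).
DISASTER_RECOVERY_ACTION = {
    1: "Enact Disaster Recovery Plan",
    2: "May require enacting Disaster Recovery Plan",
    3: "Disaster Recovery Plan enactment not warranted. Remedial action required.",
    4: "Disaster Recovery Plan enactment not warranted. Remedial action required.",
    5: "Disaster Recovery Plan enactment not warranted. Remedial action optional.",
}


def recovery_action(level):
    """Look up the suggested action for a criticality level (hypothetical helper)."""
    if level not in DISASTER_RECOVERY_ACTION:
        raise ValueError(f"Unknown criticality level: {level}")
    return DISASTER_RECOVERY_ACTION[level]


print(recovery_action(1))
```

Rejecting unknown levels, rather than silently defaulting, keeps a mistyped
level from being treated as a low-criticality fault.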
Hardware faults
Apart from classifying faults as critical or non-critical, you will need to
use other classifications to aid the troubleshooting process. A typical
classification is by the source of the fault: either a hardware device or
component, or software (system or application).
Hardware faults are often reasonably easy to troubleshoot, as the symptoms of
the fault tend to be fairly obvious. For example, if the power supply unit of
a computer fails, the computer will not power up. Hardware faults can be
difficult to diagnose, though, if the fault and its symptoms only appear
intermittently, that is, the fault is not present all the time. For example,
some hardware components only develop faults under certain conditions, such as
when the temperature of the device reaches a certain threshold.
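One way to investigate an intermittent fault of this kind is to log the
operating conditions each time the device is checked and look for a pattern.
The sketch below is illustrative only: the readings are made-up sample values,
and `lowest_fault_temperature` is a hypothetical helper, not part of any
diagnostic tool.

```python
def lowest_fault_temperature(readings):
    """Return the lowest temperature at which a fault was logged, or None.

    `readings` is a list of (temperature_celsius, fault_seen) pairs.
    """
    fault_temps = [temp for temp, fault_seen in readings if fault_seen]
    return min(fault_temps) if fault_temps else None


# Made-up sample log: (temperature in degrees Celsius, was a fault seen?)
readings = [(45, False), (52, False), (68, True), (71, True), (60, False)]

threshold = lowest_fault_temperature(readings)
print(threshold)  # 68 for the sample data above
```

If no faults are logged below that temperature, heat becomes a plausible
suspect, although a small log like this can only suggest a cause, not prove it.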
Hardware faults can sometimes be rectified fairly quickly by replacing the
failed component. Usually, technicians will have common Field-Replaceable
Units (FRUs) available. FRUs are simply common components that can be replaced
in the field with reasonable ease. Examples of FRUs include:
· Hard Disk Drives
· Floppy Disk Drives
· Optical Drives (CD, CD-R, DVD, etc.)
· Memory (RAM)
· Sound Cards
· Video Cards
· Keyboard & Mouse
· Network Interface Cards
· Network Patch Leads
Software faults
As you might have guessed, software faults are those faults that are caused by
a software component. The software component may be part of the system
software or may be application software.
Software faults can sometimes be tricky to troubleshoot. Even when the source
of the problem is known to be software, it is not always clear which software
component is actually causing the fault.
System Software Faults – these are faults that are caused by system software.
Generally speaking, the operating system is regarded as system software.
However, some application software might also install system components that
it needs to run, and these could become (and quite frequently are) the source
of faults. System software faults can be caused by:
· Corruption of software components
· Incorrect system configuration
· Documented and undocumented bugs
· Compatibility issues (hardware and software)
System software faults can have system-wide
implications, which might hinder the operations of the whole system.
Application Software Faults – these types of faults are rooted in application
software components. Generally, they only affect the application software in
question, and the rest of the system operates normally. As with system
software faults, the source of these faults can be traced to one or more of
the following:
· Corruption of software components
· Incorrect application configuration
· Documented and undocumented bugs
· Compatibility issues (hardware and software)
Security-related faults
Security-related faults are faults that weaken the security of a system. They
might have their source in hardware, software, configuration or design.
More often than not, security-related faults are the consequence of:
· Other faults (for instance, a hardware fault in a firewall device might
  expose systems that would normally be protected by that device)
· Improper configuration
· Unpatched software bugs
· System design flaws
· Undiscovered security holes/backdoors
Generally, the occurrence of any of the above issues can result in security
being compromised, possibly exposing confidential and private information.
Rectifying this type of fault usually requires engaging personnel with
expertise in the area.
Security faults are sometimes loosely referred to as ‘exploits’, since a
security fault does not in itself represent a real threat unless someone
malicious discovers it and chooses to exploit it. It is imperative that
proactive action be taken to minimise the effect of security compromises.
Boot time faults
Boot time faults are faults that occur during the start-up sequence of a
computer system. They are critical in that they can halt the boot sequence,
and possibly the system altogether, rendering it unusable.
Boot time faults can have their source in software or hardware. Software
causes are usually improper configuration, missing system files or
incompatibilities (typically after new software has been deployed). Hardware
causes are usually failure of the boot device (typically the hard disk drive)
or of another major component, such as RAM or the video card. Failed hardware
peripherals might have an impact on booting up, but will not necessarily halt
the system or make it unbootable.