15. Reliability and Availability

호프 2023. 12. 7. 17:14

Reliability and Availability

Service Availability

$ Availability = \frac{Uptime}{Uptime + Downtime}$
$ Availability = \frac{TotalInServiceTime - Downtime}{TotalInServiceTime} $
- calculate availability based on service downtime
$ Availability = \frac{MTBF}{MTBF + MTTR} $
Service availability ratings are commonly quantified as a number of nines (for example, 99.9% is "three nines")
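The availability formulas above can be checked numerically. The following is a minimal sketch, not a standard library API: the function names are made up for illustration, and a 365-day year is assumed when converting availability into yearly downtime.

```python
import math

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def availability_from_downtime(total_hours: float, downtime_hours: float) -> float:
    """Availability = (TotalInServiceTime - Downtime) / TotalInServiceTime."""
    return (total_hours - downtime_hours) / total_hours

def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year, assuming a 365-day year."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability) * minutes_per_year

def number_of_nines(availability: float) -> float:
    """0.999 -> about 3, 0.9999 -> about 4, and so on."""
    return -math.log10(1.0 - availability)

if __name__ == "__main__":
    a = availability_from_mtbf(mtbf_hours=999.0, mttr_hours=1.0)
    print(a, number_of_nines(a), annual_downtime_minutes(a))
```

For example, three nines (0.999) corresponds to roughly 525.6 minutes, or about 8.8 hours, of downtime per year.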

Availability Metrics

Mean time between failure (MTBF)

average time between when a workload begins normal operation and its next failure

Mean time to repair(recovery) (MTTR)

period of time during which the workload is unavailable while the failed subsystem is repaired or returned to service

Mean time to detection (MTTD)

amount of time between a failure occurring and when repair operations begin

Availability with Redundancy
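The notes give no formula under this heading. A common textbook model, assumed here, is that redundant replicas fail independently, so the service is down only when every replica is down at once. A minimal sketch under that assumption (the function name is hypothetical):

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies where any one copy suffices.

    Assumes failures are independent, which real deployments only approximate.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    all_down = (1.0 - replica_availability) ** replicas
    return 1.0 - all_down

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, parallel_availability(0.99, n))
```

Under this model, two replicas that are each 99% available yield about 99.99% availability. Correlated failures (shared power, shared software bugs) would make the real figure lower.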

Service Reliability

Reliability

defined as the ability of an item to perform a required function under stated conditions for a stated time period
$ Service Reliability = \frac{Successful Responses}{Total Requests} \times 100\% $
because most services are very reliable, it is more convenient to focus on the much smaller number of unsuccessful requests (service defects)
$ DPM = \frac{Unsuccessful Requests}{Total Requests} \times 1{,}000{,}000 $
- DPM = Defects per million
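As a quick numerical illustration, the sketch below computes both metrics from request counts. The function names are made up for this example, and DPM is scaled to one million requests, following the "defects per million" definition.

```python
def service_reliability_percent(successful: int, total: int) -> float:
    """Successful responses as a percentage of total requests."""
    return successful / total * 100.0

def defects_per_million(unsuccessful: int, total: int) -> float:
    """Unsuccessful requests normalized to one million total requests."""
    return unsuccessful / total * 1_000_000

if __name__ == "__main__":
    total = 2_000_000
    failed = 100
    print(service_reliability_percent(total - failed, total))
    print(defects_per_million(failed, total))
```

With 100 failed requests out of 2,000,000, reliability is about 99.995%, which is hard to compare at a glance, while the same data reads as about 50 DPM.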

Reliability Curve

The bathtub curve
The failure rate of a system usually depends on time
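One common way to model a bathtub-shaped failure rate, assumed here rather than taken from the notes, is to add three Weibull hazard terms: a decreasing one for early failures, a constant one for random failures, and an increasing one for wear-out. All shape and scale values below are arbitrary illustrative choices.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def bathtub_failure_rate(t: float) -> float:
    """Sum of early-failure, random, and wear-out hazards (illustrative parameters)."""
    early = weibull_hazard(t, shape=0.5, scale=1000.0)    # decreasing with time
    random = weibull_hazard(t, shape=1.0, scale=2000.0)   # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=5000.0) # increasing with time
    return early + random + wear_out

if __name__ == "__main__":
    for hours in (10.0, 1500.0, 6000.0):
        print(hours, bathtub_failure_rate(hours))
```

With these parameters the rate is high at 10 hours, lowest around the middle of life, and rises again by 6000 hours, which is the bathtub shape.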

Failure types

Failure

an event that makes a system fail to operate
failures are inevitable in complex systems
failures can impact the service delivered to users
- service response time can degrade
- isolated service requests can fail to respond within an acceptable time
- repeated service requests can fail

Failure types

Permanent faults (hard error)

A continuing error
Caused by some physical failures

Temporary faults

Transient fault (soft error) leads to independent one-time errors
Intermittent faults occur due to a weak system component

Service-Level vs Machine-Level Failure

Service-Level Failure

A situation in which a particular service does not operate normally during the period in which it is supposed to be available
Operator-caused or misconfiguration errors are the largest contributors
Hardware-related faults account for only 10-25% of the total failure events
- But this is because fault-tolerance techniques are already implemented in hardware
It is harder to tolerate general software bugs or operator mistakes than known hardware failure patterns

Machine-Level Failure

A state in which an entire system or device does not operate

Hardware Failure and Software Failure

Hardware Failure

Processor errors: aging
DRAM soft errors: leakage power
DRAM hard errors: cosmic radiation during delivery
Disk errors (the dominant cause of failures in datacenters)
- disk failure rates typically range between 2% and 4%
- disks can crash
Hydrophilic dust
Network failure due to communication channel breaking
Random failures from manufacturing defects

Software Failure

major reason = upgrades

Other types of Failure

Cloud Management System (CMS) Failures

Overflow: if the queue is full, new requests will be dropped
Timeout: when the waiting time of a queued request exceeds its due time
Data resource missing: when some data are removed but the data resource record isn't updated
Computing resource missing: when a machine is turned off without notifying the CMS
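The overflow and timeout cases can be sketched with a bounded request queue. This is a toy model, not the behavior of any particular cloud management system; `Request`, `BoundedRequestQueue`, and the `due` field are names invented for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    name: str
    due: float  # latest time at which service may still begin

class BoundedRequestQueue:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._queue = deque()
        self.overflowed = []  # names of requests dropped because the queue was full
        self.timed_out = []   # names of requests dropped because they waited too long

    def submit(self, request: Request) -> bool:
        """Overflow: a new request is dropped when the queue is already full."""
        if len(self._queue) >= self.capacity:
            self.overflowed.append(request.name)
            return False
        self._queue.append(request)
        return True

    def next_request(self, now: float) -> Optional[Request]:
        """Timeout: skip queued requests whose due time has already passed."""
        while self._queue:
            request = self._queue.popleft()
            if now > request.due:
                self.timed_out.append(request.name)
                continue
            return request
        return None

if __name__ == "__main__":
    q = BoundedRequestQueue(capacity=2)
    for r in (Request("a", due=5.0), Request("b", due=20.0), Request("c", due=20.0)):
        q.submit(r)
    print(q.next_request(now=10.0), q.overflowed, q.timed_out)
```

Here request `c` is lost to overflow at submission time, while `a` is lost to timeout only when the server finally gets to it.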

Security Failure

Customer faults: about 95% of failures
Software breaches: when attackers can gain access to the customer information
Security policy failures

Human Operational Faults

Misoperation: accidental faults made by human operators
Misconfiguration: network node software or cloud management software is misconfigured

Environmental Failure

Environmental disasters: play a major role in the dependability of cloud systems
Cooling system failure: servers shut down completely, or run under-utilized and are regarded as unavailable

Features for Reliability

Hardware Features for Reliability

Processor
- error detection with instruction retry
- errors detected by residue checking
Memory
- parity or ECC protection of memory components
- redundant array of independent memory
Storage
- RAID configuration
- journaling file systems for file repair after crashes
- checksums on both data and metadata
System
- hot swapping of components
- partitioning, virtual machines
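The "checksums on both data and metadata" item under Storage can be illustrated with a small sketch. It uses CRC-32 from Python's standard `zlib` module as a stand-in; real storage systems may use different checksum algorithms, and the helper names here are hypothetical.

```python
import zlib
from typing import Tuple

def store_block(data: bytes) -> Tuple[bytes, int]:
    """Return the block together with a CRC-32 checksum computed at write time."""
    return data, zlib.crc32(data)

def verify_block(data: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(data) == stored_checksum

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by flipping a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

if __name__ == "__main__":
    block, checksum = store_block(b"metadata: inode 42")
    print(verify_block(block, checksum))
    print(verify_block(flip_bit(block, 3), checksum))
```

A checksum like this only detects corruption. Repairing it requires redundancy such as RAID or ECC, which are the other items in the list.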

Mitigating Hardware Failures via Virtualization

The virtualization layer of software decouples the VM instance from the physical hardware
Virtual CPU
- abstraction of the available physical CPUs or processor cores
- Virtual CPU can be used by the hypervisor to mitigate the impact of a single physical CPU failure
- If a physical CPU fails -> reallocate another physical CPU's resources to the affected virtual CPU, and restart the VM
Virtual NIC
- NIC (network interface card): hardware component that connects the host computer to the external network
- A virtual NIC provides an abstraction of that physical component to the guest OS by mapping to a physical NIC or to a virtual network
- A VM can be connected to multiple physical NICs via its virtual NICs
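The virtual CPU remapping described above can be sketched as a toy model. This does not reflect how any real hypervisor is implemented; `ToyHypervisor` and its methods are invented for illustration, and the "least loaded healthy CPU" placement rule is an assumption.

```python
from typing import Dict, List

class ToyHypervisor:
    def __init__(self, physical_cpus: List[str]) -> None:
        self.healthy = set(physical_cpus)
        self.mapping: Dict[str, str] = {}  # virtual CPU id -> physical CPU id

    def assign(self, vcpu: str, pcpu: str) -> None:
        if pcpu not in self.healthy:
            raise ValueError(f"{pcpu} is not a healthy physical CPU")
        self.mapping[vcpu] = pcpu

    def _load(self, pcpu: str) -> int:
        """Number of virtual CPUs currently mapped to this physical CPU."""
        return sum(1 for mapped in self.mapping.values() if mapped == pcpu)

    def handle_cpu_failure(self, failed_pcpu: str) -> List[str]:
        """Remap every virtual CPU on the failed physical CPU.

        Returns the virtual CPUs whose VMs would need a restart.
        """
        self.healthy.discard(failed_pcpu)
        if not self.healthy:
            raise RuntimeError("no healthy physical CPU left")
        to_restart = []
        for vcpu, pcpu in list(self.mapping.items()):
            if pcpu == failed_pcpu:
                # Assumed policy: pick the least loaded healthy CPU, ties broken by name.
                target = min(sorted(self.healthy), key=self._load)
                self.mapping[vcpu] = target
                to_restart.append(vcpu)
        return to_restart

if __name__ == "__main__":
    hv = ToyHypervisor(["cpu0", "cpu1", "cpu2"])
    hv.assign("vm1-vcpu0", "cpu0")
    hv.assign("vm2-vcpu0", "cpu1")
    print(hv.handle_cpu_failure("cpu0"), hv.mapping)
```

Because the guest only ever sees the virtual CPU, the remapping is invisible to it apart from the restart.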

Service Level Agreement (SLA)

SLA serves as the blueprint and warranty for cloud computing services
document specific parameters, minimum service levels, and remedies for any failure to meet the specified requirements
determine the pricing model and payment terms
describe QoS features, guarantees and limitations of one or more cloud-based IT resources
- use service quality metrics to express measurable QoS characteristics
- Availability, reliability, performance, scalability, resiliency (elasticity)
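To make "measurable QoS characteristics" and "remedies" concrete, the sketch below checks a measured monthly availability against SLA tiers. The thresholds and credit percentages are hypothetical values chosen for the example, not terms from any real provider, and the function names are made up.

```python
from typing import List, Tuple

# Hypothetical tiers: (minimum availability, service credit in percent).
# Ordered from the strictest threshold to the loosest.
EXAMPLE_TIERS: List[Tuple[float, int]] = [
    (0.999, 0),   # SLA met, no credit
    (0.99, 10),   # below 99.9% but at least 99%
    (0.0, 25),    # below 99%
]

def monthly_availability(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (TotalInServiceTime - Downtime) / TotalInServiceTime."""
    return (total_minutes - downtime_minutes) / total_minutes

def service_credit_percent(availability: float, tiers: List[Tuple[float, int]]) -> int:
    """Return the credit of the first tier whose threshold is satisfied."""
    for threshold, credit in tiers:
        if availability >= threshold:
            return credit
    return tiers[-1][1]

if __name__ == "__main__":
    minutes_in_30_days = 30 * 24 * 60
    measured = monthly_availability(minutes_in_30_days, downtime_minutes=50)
    print(measured, service_credit_percent(measured, EXAMPLE_TIERS))
```

In a 30-day month, 50 minutes of downtime falls just short of three nines, so under these example tiers the customer would receive the 10% credit.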
