Data Centre Failure – Series 3
Storage is the most sensitive part of your infrastructure. If you lose your data, you have lost everything. There are many factors to take into consideration, such as the type of storage solution, synchronous or asynchronous replication, and latency and bandwidth restrictions between data centres. Every solution has different requirements. Latency can drastically affect synchronous replication, because a round-trip time (RTT) is added to every write, whereas for asynchronous replication this is much less of an issue. Design errors may only become apparent in certain failure scenarios, such as a data centre interconnect failure, which can result in split brain. Be careful not to over-automate and over-engineer things that should be kept simple in the first place. Split brain occurs when both data centres are active at the same time and accept writes independently. The two copies then drift out of sync, which may result in full storage restores from tape.
History of Storage
Small Computer System Interface (SCSI) was one of the first open standards for storage. It was standardised by the American National Standards Institute (ANSI) for attaching peripherals, such as storage, to computers. Initially, it wasn’t very flexible: it could connect only 15 devices over a flat copper ribbon cable of at most 20 meters. Fibre Channel replaced the flat cable with optical fiber, which overcomes the 20-meter restriction. However, it still uses the same SCSI protocol: Fibre Channel transports SCSI information units over optical fiber, which is commonly described as SCSI over fiber.
We then started to put disks into enclosures, known as storage arrays. Storage arrays increase resilience and availability by eliminating single points of failure (SPOFs). Applications no longer write to or own a physical disk but instead write to a LUN (logical unit number), a logical disk that supports read/write operations. LUNs allow multi-access support by permitting numerous hosts to access the same storage array.
Eventually, vendors designed storage area networks (SANs). A SAN provides access to block-level data storage, while network-attached storage (NAS) provides file-level access. SAN fabrics do not use SCSI to forward traffic; vendors introduced a dedicated routing protocol, Fabric Shortest Path First (FSPF). FSPF was invented by Brocade and is conceptually similar to OSPF for IP networks. They also implemented virtual SANs (VSANs), which are similar to VLANs on Layer 2 networks but used for storage. A VSAN is a collection of ports that represents a virtual fabric.
Remote Disk Access
Traditionally, servers would have a Host Bus Adapter (HBA) and run FC, FCoE, or iSCSI to communicate with the remote storage array. Another method is to send individual file system calls to a remote file system, known as NAS. The protocols used for this are CIFS and NFS. CIFS was developed by Microsoft and is an open variation of the Server Message Block (SMB) protocol. NFS, developed by Sun Microsystems, runs over TCP and gives you access to shared files, as opposed to SCSI, which gives you access to remote disks. The speed of file access depends on your application. Slow performance is generally not caused by NFS or CIFS themselves. If your application is well written and reads large chunks of data, it will be fine over NFS. If, on the other hand, your application is badly written, it is best to use iSCSI, because the host then does most of the buffering.
Why not use the LAN instead of Fibre Channel?
Initially, there was a huge variety of operating systems. Most of them already used SCSI and had device drivers that implemented connectivity to local SCSI host adapters. The storage industry decided to offer the same SCSI protocol to the same device drivers but over a Fibre Channel physical infrastructure. Everything above Fibre Channel was left unchanged. This kept backward compatibility with older adapters, which is why the old SCSI protocols remained in use. Fibre Channel has its own set of requirements and terminology. The host still thinks it is writing to a disk 20 m away, which requires tight timings. The network must therefore have low latency, which limits the maximum distance to around 100 km. Nothing can be lost, and in-order delivery is critical. The result is that FC requires a lossless network, which usually means a very expensive dedicated network. With this approach you have one network for the LAN and one network for storage.
Fibre Channel over Ethernet (FCoE) was introduced to get rid of fiber-only networks by offering I/O consolidation between the servers and the switches. It takes the entire Fibre Channel frame and puts it into an Ethernet frame. FCoE requires lossless Ethernet (Data Center Bridging, DCB) between the servers and the first switch, i.e. between the VN and VF ports. It is mainly used to reduce the amount of cabling between servers and switches, so it is an access-tier solution. The IEEE defined a number of standards to make Ethernet lossless. The first, 802.3x, throttles the sending device by issuing a PAUSE frame. As a result of the PAUSE frame, the server stops ALL transmission. But we need a way to stop only the lossless part of the traffic, i.e. the FCoE traffic. This is Priority-based Flow Control (PFC, 802.1Qbb), which allows you to pause a single class of service. There is also Quantized Congestion Notification (QCN, 802.1Qau), an end-to-end mechanism that can tell the sending device to slow down. The servers, switches and storage arrays negotiate the class parameters to decide which traffic will be lossless.
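The difference between a link-wide PAUSE and a per-class pause can be sketched with a toy model. This is not a protocol implementation: the Port class, its method names, and the choice of priority 3 for FCoE are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

NUM_PRIORITIES = 8  # 802.1p defines eight priority values (0-7)


@dataclass
class Port:
    """Toy model of a sending port with one queue per priority (hypothetical)."""
    queues: dict = field(
        default_factory=lambda: {p: [] for p in range(NUM_PRIORITIES)}
    )
    link_paused: bool = False                             # set by an 802.3x PAUSE
    paused_priorities: set = field(default_factory=set)   # set by 802.1Qbb PFC

    def enqueue(self, priority, frame):
        self.queues[priority].append(frame)

    def receive_pause(self):
        """802.3x behaviour: the whole link stops transmitting."""
        self.link_paused = True

    def receive_pfc(self, priority):
        """802.1Qbb behaviour: only the named priority stops transmitting."""
        self.paused_priorities.add(priority)

    def transmit(self):
        """Return every frame that may be sent now, highest priority first."""
        if self.link_paused:
            return []
        sent = []
        for priority in sorted(self.queues, reverse=True):
            if priority in self.paused_priorities:
                continue  # paused frames stay queued, they are not dropped
            sent.extend(self.queues[priority])
            self.queues[priority] = []
        return sent


def demo(flow_control):
    """Queue one FCoE frame and one LAN frame, then apply a flow-control type."""
    fcoe_priority = 3  # assumption: FCoE traffic is mapped to priority 3
    port = Port()
    port.enqueue(fcoe_priority, "fcoe-write")
    port.enqueue(0, "web-traffic")
    if flow_control == "pause":
        port.receive_pause()
    else:
        port.receive_pfc(fcoe_priority)
    return port.transmit()


if __name__ == "__main__":
    print("802.3x PAUSE:", demo("pause"))
    print("802.1Qbb PFC:", demo("pfc"))
```

With PAUSE nothing leaves the port, whereas with PFC the ordinary LAN frame is still sent while the FCoE frame waits in its queue.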
Storage Replication for Disaster Recovery
The primary reasons for storage replication are disaster recovery and fulfilling service level agreements (SLAs). When data centre services fail over from one DC to another, how accurate will your data be? The level of data consistency depends on the solution in place and how you choose to replicate your data. There are two types of storage replication: synchronous and asynchronous.
Synchronous replication has a number of steps. The host writes to the local disk, the local disk writes to the disk in the remote location, and only when the remote disk says OK is an OK sent back to the host. Synchronous replication guarantees that the data is perfectly in sync. However, it requires tight timeouts, which severely limits the distance between the two data centres. If there are long distances between data centres, you need to implement asynchronous replication. The host writes to the local disk and the local disk immediately says OK without writing to, or waiting for an acknowledgement from, the remote disk. The local disk sends the write request to the remote disk in the background. If you are using traditional LUN-based replication between two data centres, most solutions make one of the disks read-only and the other read-write. Latency problems occur when a VM is started in the data centre that has only the read-only copy, because its writes must cross the interconnect to the writable copy. Another major design factor is how much bandwidth storage replication consumes between data centres.
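To make the cost of the two acknowledgement paths concrete, here is a minimal latency sketch. It assumes the steps happen strictly one after another, as described above, and ignores queueing and protocol overhead; the function names and the example figures are illustrative, not measurements from any product.

```python
def write_latency_ms(local_write_ms, rtt_ms, remote_write_ms, synchronous):
    """Time until the host receives its OK for a single write.

    Assumption: the synchronous path is strictly sequential, i.e. local write,
    then one round trip to the remote array, which includes the remote write.
    """
    if synchronous:
        return local_write_ms + rtt_ms + remote_write_ms
    # Asynchronous: the local array acknowledges immediately and
    # replicates to the remote array in the background.
    return local_write_ms


def max_serial_writes_per_second(latency_ms):
    """Upper bound for one writer that waits for each OK before the next write."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Illustrative numbers only: 0.5 ms per array write, 5 ms RTT between sites.
    for mode, sync in (("synchronous", True), ("asynchronous", False)):
        latency = write_latency_ms(0.5, 5.0, 0.5, synchronous=sync)
        rate = max_serial_writes_per_second(latency)
        print(f"{mode}: {latency:.1f} ms per write, {rate:.0f} serial writes/s")
```

Under these assumed figures the round trip dominates the synchronous write, which is why the distance between sites matters far more than the speed of either array.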
A better storage architecture is to use a distributed file system in which both ends are writable. Replication is done not at the disk level but at the file level. The type of replication you use comes down to the recovery point objective (RPO), a business continuity term for how much data you can afford to lose. If you require an RPO of zero, you must use synchronous replication, which, as discussed, requires a number of steps before a write is acknowledged to the application. Synchronous replication also has distance and latency restrictions, which vary depending on the chosen storage solution. For example, VMware vSAN supports an RTT of up to 5 ms. It is a distributed file system, so replication is done not at a traditional LUN level but at a file level. It employs synchronous replication between data centres, adding the RTT to every single write.
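An RTT budget translates into a rough distance limit. The sketch below assumes light travels through optical fiber at about 200 km per millisecond (roughly two thirds of its speed in a vacuum) and treats any switch, DWDM, or array delay as a single hypothetical equipment_delay_ms parameter. Real fiber paths are also longer than the straight-line distance between sites.

```python
FIBER_KM_PER_MS = 200.0  # approximation: light covers about 200 km of fiber per ms


def propagation_rtt_ms(fiber_km):
    """Round-trip propagation delay over a fiber path of the given length."""
    return 2 * fiber_km / FIBER_KM_PER_MS


def max_fiber_km(rtt_budget_ms, equipment_delay_ms=0.0):
    """Longest fiber path that fits the RTT budget.

    equipment_delay_ms is a hypothetical allowance for everything that is
    not propagation delay (switching, transponders, array processing).
    """
    usable_ms = max(rtt_budget_ms - equipment_delay_ms, 0.0)
    return usable_ms * FIBER_KM_PER_MS / 2


if __name__ == "__main__":
    print("RTT over 100 km of fiber:", propagation_rtt_ms(100), "ms")
    print("Fiber reach for a 5 ms budget:", max_fiber_km(5.0), "km")
    print("Same budget with 1 ms of equipment delay:", max_fiber_km(5.0, 1.0), "km")
```

This is an upper bound on paper only; a measured RTT between the two sites is the number to design against.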
Most storage solutions are eventually consistent. You write to a file, the file locks, and the file is then eventually copied to the other end. This offers much better performance, but the RPO is obviously non-zero.
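How far from zero the RPO ends up depends on how quickly the interconnect drains the writes that have not been copied yet. The following sketch steps through a write profile one second at a time and tracks the unreplicated backlog; the function name, the fixed link rate, and the sample profile are assumptions chosen for illustration.

```python
def replication_backlog(writes_mb_per_s, link_mb_per_s):
    """Return the unreplicated backlog (MB) at the end of each second.

    writes_mb_per_s: MB written locally in each one-second interval.
    link_mb_per_s:   MB the interconnect can replicate per second (assumed constant).
    """
    backlog = 0.0
    history = []
    for written in writes_mb_per_s:
        # New writes join the backlog; the link drains up to its capacity.
        backlog = max(backlog + written - link_mb_per_s, 0.0)
        history.append(backlog)
    return history


if __name__ == "__main__":
    profile = [50, 200, 200, 50, 50]  # illustrative burst, in MB per second
    link = 100                        # illustrative replication bandwidth, MB/s
    history = replication_backlog(profile, link)
    peak = max(history)
    print("Backlog per second (MB):", history)
    print("Data at risk if the primary site fails at the peak:", peak, "MB")
    print("Time to drain the peak with no new writes:", peak / link, "s")
```

The peak backlog is the amount of data that exists only at the primary site, so it is a direct view of the exposure that a non-zero RPO accepts, and it shows how replication bandwidth and write bursts interact.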