Storage Systems

 

Overview

Today we will talk about storage of data. This refers to non-volatile data storage (which means the data persists without power) as opposed to a computer system's main memory (RAM). We'll look at some different types of storage technology and their trade-offs, at some tools for improving the efficiency and reliability of storage, and at backups.

With storage, there are a few factors that are important to consider, which will be given different weight for different applications:


 

Storage Technology

There are multiple types of storage technology that have been used over the years:

There are also multiple types of storage interface protocols. These define the way that the storage device is connected to the rest of the computer through the motherboard. Some common ones are:


 

RAID

RAID (Redundant Array of Inexpensive/Independent Disks) refers to using multiple disk drives together as one logical drive. The array appears to the host machine as one drive, but actually stores data across several physical drives. RAID provides two functions:

  1. Reduce the time needed to read/write data
  2. Provide increased data integrity

There are several different RAID configurations based on the number of disks and the desired balance between performance and reliability. There are three techniques employed by the different configurations:

  1. Striping: data is split into blocks that are spread across the drives, so reads and writes can proceed in parallel.
  2. Mirroring: the same data is written to two or more drives, so a copy survives if one drive fails.
  3. Parity: extra data computed from the blocks on the other drives is stored, so the contents of a failed drive can be rebuilt.


 

RAID Configurations

There are different RAID configurations which are geared towards different trade-offs of performance vs. reliability and require different numbers of drives. Some of the most widely used are:

There are other RAID configurations (as you can infer from the missing numbers), but these are the most widely used in practice.
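
Parity-based configurations such as RAID 5 can rebuild a failed drive because parity is computed with XOR. The following is only a toy sketch of that idea, not how a real RAID controller is implemented: each "disk" holds a single byte, and the byte values and the function names parity and recover are made up for illustration.

```shell
# Toy illustration of XOR parity, not a real RAID implementation.
# Each "disk" holds one byte, written here as a number from 0 to 255.

# parity A B: XOR of two data bytes, as stored on the parity disk.
parity() {
    echo $(( $1 ^ $2 ))
}

# recover SURVIVOR PARITY: rebuild the byte of a lost data disk
# from the surviving data byte and the parity byte.
recover() {
    echo $(( $1 ^ $2 ))
}

disk1=178   # hypothetical byte on data disk 1
disk2=77    # hypothetical byte on data disk 2
p=$(parity "$disk1" "$disk2")

# Pretend disk 1 has failed: rebuild it from disk 2 and the parity.
rebuilt=$(recover "$disk2" "$p")
echo "parity=$p rebuilt=$rebuilt"   # prints "parity=255 rebuilt=178"
```

The two functions are identical on purpose: XOR-ing the parity with the surviving data gives back the missing data, which is why one extra drive's worth of parity is enough to survive the loss of any single drive.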

On Linux, RAID can be set up using the mdadm tool, but doing so is a bit beyond the scope of this class.


 

NAS

A Network-Attached Storage device is a server on a network whose primary function is to serve files. It's essentially a computer with a set of drives attached to it, usually using some RAID configuration. The operating system, which is often a Linux or other UNIX variant, runs off a separate boot drive which is not part of the RAID array.

They generally support a number of network authentication/transfer protocols depending on the environment:

They provide an option for storage that is more "plug and play" than configuring RAID drives yourself. They are also typically smaller and quieter than a full server with attached drives.


 

Cloud Storage

Just as the hosting of computational resources has moved to the cloud in recent years, organizations have also outsourced data storage to cloud services. These take care of things like redundancy and performance, but of course come with recurring costs, both for the amount of data stored and for how often it is accessed.

These services typically offer different tiers for types of data. For example, "live" data being accessed regularly by an application will generally need to be accessed much more quickly than data treated as an off-site backup.

Google has prices listed on their cloud storage page.


 

Backups

All important data should be backed up regularly. How often backups should be made varies, but at least once a week is a good rule of thumb. Many organizations back up data nightly.

There is a commonly cited "3-2-1 Rule" when it comes to performing backups:

Keep at least 3 copies of the data, on 2 different types of storage media, with at least 1 copy stored off-site.


 

Rsync

To actually perform a backup on a Linux system, the rsync command is especially helpful. One could use the cp or scp commands, which we've already seen. The problem with these is that they copy every file from scratch, even when most of the files already exist at the destination with few or no differences.

Instead, rsync copies only the differences between the source and destination, which greatly speeds up backups. It can be used to make a copy on a local system (such as from one drive to another), or across the network. When working over the network, it uses SSH by default for authentication.

To perform such a backup locally we could do something like the following:

$ rsync -r /home/projects/ /media/ifinlay/external

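This copies everything from the projects directory onto an external hard drive, provided it's mounted at that location. The -r flag is used so everything under projects is copied recursively. rsync automatically skips files that have not changed and, when working over the network, sends only the changed parts of the files that have. It judges whether a file has changed by comparing its size and modification time, so in practice the -a (archive) flag, which implies -r and also preserves modification times, is usually a better choice than -r alone.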

We can copy to an external server using SSH with syntax like this:

$ rsync -r /home/projects/ ifinlay@cpsc.umw.edu:/home/ifinlay/backup

Of course you need to be able to SSH into the given machine with the given username for this to work. With SSH keys set up, it will not ask for a password. The remote machine can be specified using either a domain name or IP address.

Either the source or the destination can be remote, but not both. So rsync can be used to push files to a remote server or to pull files from a remote server. To transfer between two remote servers, log in to one of them and run rsync from there.

Some useful rsync flags:

One can of course invoke rsync from a cron or anacron job so that backups are conducted automatically on a fixed schedule.