Cryptography

Overview

Cryptography refers to the practice of securing communication between two parties. We will look at methods for securing messages so they can't easily be read by third parties, as well as the concepts behind digital signatures and hashing.

Most cryptography systems rely on computational security, which can in theory be broken but which requires far more computation than is feasible. This is the case for most encryption schemes.

There is also unconditional security, in which there is no way for an attacker to distinguish the correct plain text, even with unlimited computation power.

Caesar Ciphers

There are many simple encryption schemes which are not suitable for practical use as they are easily broken by computers. One of these is the Caesar cipher, so called because Julius Caesar used it in his writing.

In a Caesar cipher, we shift each character by a fixed amount:

D is mapped to A, E to B, F to C, and so on
Credit: Wikipedia

The bsdgames package comes with a caesar command which can be used to compute Caesar ciphers on text:

$ echo "hello" | caesar 1
ifmmp

In encryption, we have the plain text, which in this case is "hello", the cipher text, "ifmmp", and the key, which is the information needed to transform the cipher text back into the plain text.

There is a special case of Caesar cipher where the shift amount is 13. In this case, the value 13 can be used for both encryption and decryption. This is called ROT13 and has been used in the early days of the Internet to weakly hide information, such as spoilers on message boards.

$ echo "this is a spoiler" | caesar 13
guvf vf n fcbvyre

$ echo "guvf vf n fcbvyre" | caesar 13
this is a spoiler

Vim also supports ROT13 which can be done with the g? action. So to apply it to a whole file, one could do ggg?G.

This sort of cipher is easily broken through frequency analysis, which looks at the most used symbols in a cipher text and attempts to intuit the most likely corresponding values in the plain text. For instance, with long enough of a piece of text, 'e' will be the most common letter for English plain text. If we map 'e' on to only one symbol, it will be clear what that is. More advanced frequency analysis has been used to break ciphers in the past.

Vigenère and OTP

The Vigenère cipher, named after a French diplomat and cryptographer, is sort of a more advanced version of the Caesar cipher. Rather than shift each character by the same amount, the key is used to cycle through a sequence of shift amounts. For example, if the plain text is hello and the key is owl, then we will use shift amounts corresponding to the letters in the key: 15, 23, 12. When the key runs out, we repeat it. This would give the cipher text "wbxal". This is harder to break because each letter can be mapped to multiple possibilities. For instance, the two 'l' characters were transformed into 'x' and 'a'.

With a key being shorter than the message being sent, the Vigenère cipher can be broken by computers. However if we make the key the same length as the message, this scheme is unbreakable, providing unconditional security. For this to work, we must use a truly random key, and use it only once, for one message. Re-using the key breaks this property. For this reason, the key is called a "one-time pad".

This scheme has been used in the past, but has some downsides:

For it to achieve unconditional security, the key must be truly random, not pseudo-random.
The key must be sent securely somehow, or agreed upon beforehand.
The key can also only be used once ever.

If there is a secure way of sending the key, one can just use that method to send the message instead (since they are of the same length). This makes it of no practical use in a networked setting.

Other Simple Ciphers

There are also substitution ciphers, where each letter is substituted with another, such as in a cryptograms. A Caesar cipher is really a special case of a substitution cipher. For this type of cipher, the key is the mapping of plain text characters to cipher text characters. These are amenable to frequency analysis for the same reason a Caesar cipher is.

Another simple type of cipher is a transposition cipher, in which a message is written in a two-dimensional grid. We then pick a different ordering for those letters. For example with the plain text message "THE TEST IS AT NINE", we could lay the characters out in a grid:

The letters are laid out in a 5x4 grid

Possible ciphers from this include:

TSTAHTNMEIIXTSNXEAEX
EAEXTSNXEIIXHTNMTSTA
ETEHTASITSENINTXXXMA

Here the "key" is essentially the agreed-upon grid size and secondary ordering of cells.

Symmetric Key Encryption

In ciphers such as the Vigenère cipher, we use the same key for both encryption and decryption. This is called symmetric key encryption. In addition to the weak schemes talked about above, there are several stronger methods which provide some measure of computational security.

The Data Encryption Standard (DES) was developed at IBM and published in 1975. With computing power today, it is not considered secure as it's amenable to brute force attacks. A stronger version called Triple DES was published in 1981. A vulnerability was found in 2016 and it was deprecated by the US National Standards Institute (NIST) in 2019.

The Advanced Encryption Standard (AES) was published by the NIST in 2001, and is still considered to provide strong computational security.

There are several command-line encryption programs. An easy to use one is ccrypt which uses AES. We can encrypt a file giving a key with:

$ ccrypt secret.txt
Enter encryption key: 
Enter encryption key: (repeat)

And then late decrypt it with the -d flag:

$ ccrypt -d secret.txt.cpt

Symmetric key encryption is used to encrypt files, and has uses for securing sensitive information. However, there is one difficulty with symmetric key encryption: communicating the key among multiple parties. This makes it inapplicable for things like secure networking. How could you and another host on a network communicate the secret key?

Asymmetric Key Encryption

This difficulty is addressed with asymmetric key encryption, in which there are two keys involved: a public key and a private key. As the names indicate, you can share your public key freely with the world, but must keep your private key safely private. The two keys are mathematically linked and must be generated together. This scheme is also called public key encryption.

One of the most widely used asymmetric key schemes is RSA, published in 1977 and named after the initials of its three creators. RSA creates key pairs based on the problem of factoring large numbers. The private key is essentially two very large prime numbers, and the public key is the product of those two numbers. Going from the private key to the public key is simple: just multiply the numbers together. The other direction is computationally difficult however. There is no algorithm to factor a large number except try to divide it by every prime number and see if it divides evenly.

RSA is widely used and still secure, if used with large keys. However it has begun to be supplanted by elliptic curve cryptography. These algorithms are also based on a hard computational problem: that of elliptical paths through finite fields. ECC schemes provide greater security at smaller key sizes than RSA, which is why they are becoming more widely used.

Public key cryptography is used for SSH, as we have seen, and is also the basis for HTTPS.

Public/Private Key Functions

Public and private key pairs can be used for two main purposes: encryption and digital signatures:

Encryption:For Alice to send a secret message to Bob, she will use his public key to encrypt it. Bob can freely share his public key and so this way anyone can send him encrypted messages. These messages can only be decrypted with his private key, so only he will be able to read them.
Signing: A digital signature is proof of the source of a message. If Alice wants to send Bob a message that he can trust is from her, she will sign it with her private key. Anyone who gets this message can then use Alice's public key to verify the signature. This will reveal if the matching private key was indeed used to sign the message.

Oftentimes we will want to use both of these features: to send a message which is encrypted and others can trust came from us. However, the order in which we perform these two operations has different effects:

If we sign first, and then encrypt, that looks like this:
```
plain text -> *Sign with Alice's private key* -> signed plain text
         -> *Encrypt with Bob's public key*  -> cipher text
```
Now, Bob decrypts the message with his private key and can verify that Alice did indeed send it. The potential problem is that Bob can now take Alice's signed plain text and encrypt it with Claire's public key and send it to her. This makes it look like Alice sent Claire a message when she didn't.
If we encrypt first, and then sign, that looks like this:
```
plain text -> *Encrypt with Bob's public key* -> cipher text
          -> *Sign with Alice's private key* -> cipher text + signature
```
Here, Bob will use Alice's public key to verify she sent the message, and then use his private key to decrypt the message. The problem in this scenario is someone receiving this message could remove Alice's signature, sign it with their own private key, and then send the message on to Bob, making it look like they sent him the message.

A potential workaround for this issue is to build some form of identity into the message. For instance, if we include Alice and Bob's identities into the original plain text, then sign, then encrypt, then there is no way for Bob to pass it to someone else as a message for them (because the plain text can't be modified without breaking the digital signature).

HTTPS

Encryption and authentication are both used in HTTPS. When connecting to a site over HTTPS, the site signs data with its private key. We then use the public key associated with that site by a certificate authority to verify the identity of the site.

We also then use the site's public key to encrypt data only the site will be able to read. Asymmetric cryptography is only used for the initial "handshake" when connecting to a site. A random symmetric key is communicated between the two parties and after that symmetric key encryption (such as AES) is used for the bulk of the data transfer.

Symmetric key cryptography alone could not be used, because there would be no way to communicate the key securely between the server and the client.

Hashing

Hashing is related to, but distinct from, encryption. Hashing computes a hash value from data, but does not encrypt it. With the appropriate key, one can convert a cipher text back into a plain text. But one cannot compute the original input data from a hash value, under any circumstance.

Hashing is used for the following:

Storing passwords securely
Computing hash values on files to allow for quick integrity checking
Identifying pieces of information by number, such as git commits
In data structures like hash tables and bloom filters

From a systems perspective, they are used to check for data integrity. If you transfer a large file, you can compute a hash for it before and after the transfer and ensure they are the same, that is the data wasn't corrupted accidentally.

Good hash algorithms have the following properties:

Reproducibility: hashing the same data must always result in the same hash.
Entanglement: every bit in the hash depends on every bit in the input. One small change to the input results in a completely different hash value.
Pseudo-randomness: hash values should look indistinguishable from random data.
Non-reversibility: it must be impossible to recover the original data from the hash value.

There are a few common hash algorithms that can be used from the command line. The MD5 hash algorithm can be used through the md5sum command:

$ md5sum project.tar.gz
b57d091e60e47cb9cbb3891329b23e73  project.tar.gz

The MD5 hash algorithm is good for detecting accidental changes to data, or for quickly checking if two files have the same contents where you don't suspect any foul play. However, MD5 has been broken in the sense that an attacker can provide two files which have different data, but which are instrumented to have the same hash value.

In such cases, there are other hash algorithms which are not known to have these flaws. Examples include SHA and Blowfish. The SHA hash can be use with a variety of bit lengths:

$ sha1sum project.tar.gz 
1ad4e648b87fddb3da0ab68296d328b7b8024e64  project.tar.gz

$ sha224sum project.tar.gz 
e7b7cf2731770fa29a2af4effb492d82f1cbc15d5d9d9cca072ac7b4  project.tar.gz

$ sha256sum project.tar.gz 
746c88d41af26e17e61153be7b4c01a076a2fd3dc49eee04b44738516394b999  project.tar.gz

$ sha384sum project.tar.gz 
81b99d08c4cf5c27d915b3384609e55801e803550915c5f38473f67298fe1a215f0c0547c47b6d71ffd260ca2193bc50  project.tar.gz

$ sha512sum project.tar.gz 
a0b6de158b7734f213efd26457743e978d0f14653e95dd91c306641ffe6f3370fb211f922591a9c59b44d21edfb67ddbe433d306d02c697f3bf3e59a512d6220  project.tar.gz

Hashing is used for checking data integrity, but not for actually encrypting data. With hashing there is no key, just the algorithm itself which varies.