TLS and HTTPS

Overview

As seen in the last lab, data sent over TCP is stored as-is inside TCP packets. Anyone who sees a packet can read the data it contains. Because the Internet is decentralized, your packets can pass through routers operated by many different organizations. Also, on wireless networks, any machine connected to the network can see packets as they are broadcast, whether it is the intended recipient or not.

Possibly worse, unencrypted traffic can be modified by whoever operates a router that handles your packets, such as your ISP. Comcast has injected warnings about bandwidth usage into customer pages, and Verizon has injected cookies to track customers.

To prevent third parties from reading the data, it must be encrypted.

SSL (Secure Sockets Layer) was developed by Netscape and first released in 1995. It progressed to version 3.0, and its successor was renamed TLS (Transport Layer Security). TLS is still under active development. Sometimes people still refer to it as SSL.

HTTPS is a version of HTTP which uses TLS for securing the requests and the responses. HTTPS is one of the major uses for TLS, but not the only one. SMTP can use TLS for secure email transmission, and your own programs can use TLS as well.

TLS

TLS does two main things:

Encrypt data sent over a socket. Rather than putting the data that the application is sending and receiving into the packets, TLS will put encrypted data in. Now anyone looking at the packets will see unreadable gibberish.
Verify that hosts are who they say they are. You do not want your web client to communicate with a web server that is just pretending to be your bank's web server. You need to make sure it really is. Otherwise there is no point in encrypting the traffic.

TLS sits between the application layer and the transport layer. It allows applications to send data over TCP knowing that it is secured.
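
As a rough illustration of TLS sitting between an application and TCP, the sketch below uses Python's standard ssl module to wrap an ordinary TCP socket. The function names build_request and https_get, and the host example.com in the usage note, are placeholders chosen for this example and are not part of the lab.

```python
import socket
import ssl


def build_request(host, path="/"):
    """Build a plain-text HTTP/1.1 GET request as bytes."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def https_get(host, path="/", port=443, timeout=10):
    """Send one GET request over TLS and return the raw response bytes."""
    # The default context verifies the server's certificate and host name.
    context = ssl.create_default_context()
    # Step 1: an ordinary TCP connection (transport layer).
    with socket.create_connection((host, port), timeout=timeout) as tcp_sock:
        # Step 2: wrap it in TLS. The handshake happens here.
        with context.wrap_socket(tcp_sock, server_hostname=host) as tls_sock:
            # Step 3: the application writes plain text; TLS encrypts it
            # before it goes into TCP packets.
            tls_sock.sendall(build_request(host, path))
            chunks = []
            while True:
                data = tls_sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)
```

With network access, a call such as https_get("example.com") should return bytes that begin with an HTTP status line. The application code reads and writes plain text throughout; only the wrap_socket step differs from an unencrypted client.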

TLS Limitations

There are some things that TLS does not handle:

The IP addresses of your machine and of the machine you are connecting to still appear in plain text inside packets. For IP to work, all routers must be able to read these, so they can't be encrypted.
Likewise port numbers appear unencrypted in packets.
If your program addresses a host by domain name instead of IP, the domain name will be readable as part of a DNS request. We will talk about DNS later on.

So TLS does not stop people from seeing who you are talking to, just what you are saying.

Public and Private Keys

TLS is based on the idea of having a public and a private key. You can tell everyone your public key, but must keep the private key secret. This is called asymmetric cryptography.

This is based on having an algorithm with certain mathematical properties:

It must be virtually impossible to guess the private key from looking at the public key. The keys are linked by some sort of one-way function, such that the private key gives us the public key, but not the other way around.
If something is encrypted with the public key, it can only be decrypted with the private key. Thus to send encrypted data to somebody, you only need to know their public key. They can use their own private key to decrypt the text (and they are the only one who can do that).
If something is "signed" with the private key, it can be verified with the public key. If we sign something with our private key, then anyone with our public key can tell that we have done so. As long as the private key is truly kept private, anyone can verify that the data is from us.

TLS supports a number of different cryptographic algorithms (also called "ciphers"). The different versions of TLS include different algorithms to choose between. Sometimes flaws are found in certain algorithms, and new ones are added.
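
To get a feel for how many algorithms are on offer, the following sketch asks Python's standard ssl module which cipher suites a default client would be willing to use. The helper name enabled_cipher_names is made up for this example, and the list it prints depends on the Python and OpenSSL versions installed, so no particular output is shown.

```python
import ssl


def enabled_cipher_names():
    """Return the names of cipher suites enabled in a default client context."""
    context = ssl.create_default_context()
    # get_ciphers() returns one dict per suite, with keys such as
    # "name" and "protocol".
    return [suite["name"] for suite in context.get_ciphers()]


names = enabled_cipher_names()
print(len(names), "cipher suites enabled")
for name in names[:5]:
    print(name)
```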

Example Algorithm: RSA

RSA is one of the oldest and most widely used asymmetric cryptography algorithms. It is based around the fact that finding the factors of a large composite number is a difficult problem, but finding the product of two numbers is easy.

Consider these examples:

What is 72253 * 59209?
What two prime numbers multiplied together give us 4294365653?

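The first question is trivial to answer: a computer finds this product almost instantly. The second question is much harder. No efficient method is known for factoring large numbers. The simplest approach is brute force, trying possible factors one at a time, and even the best known algorithms become impractical when the number is hundreds of digits long, as it is in real RSA keys.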
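
The difference between the two questions can be sketched in a few lines of Python. The function smallest_factor is a hypothetical helper written for this example. It uses plain trial division, which is the brute force approach and not what a serious factoring tool would use.

```python
def smallest_factor(n):
    """Return the smallest factor of n greater than 1, by trial division."""
    if n % 2 == 0:
        return 2
    candidate = 3
    # A composite n always has a factor no larger than its square root.
    while candidate * candidate <= n:
        if n % candidate == 0:
            return candidate
        candidate += 2
    return n  # no smaller factor found, so n itself is prime


# Multiplying is a single operation.
product = 72253 * 59209
print(product)  # 4278027877

# Factoring means searching: try candidates until one divides evenly.
n = 4294365653
p = smallest_factor(n)
print(p, n // p)
```

A ten-digit number like this one still falls to the search in a fraction of a second. The work grows with the square root of the number, though, so the same loop is hopeless against a number hundreds of digits long.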

RSA essentially works by using the large composite number as the public key, and the two prime numbers as the private key. (There are a few more details we are skipping over, but that's the main idea).

This allows for all three of the requirements of an asymmetric cryptographic algorithm:

Nobody can guess the private key based off the public one.
We can encrypt data using the public key in such a way that it can only be unlocked if you know the factors of the large composite number.
We can sign data using the private key in such a way that people can tell that it was done using the factors of the large composite number.
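
The encrypting and signing properties above can be tried out with "textbook" RSA on tiny numbers. This is only a sketch under heavy simplifying assumptions: the primes 61 and 53 are far too small to be secure, the exponents e and d are among the details skipped over above, and real implementations add padding and sign a hash of the data. The function names are chosen for this example. The three-argument pow with a negative exponent needs Python 3.8 or later.

```python
# Toy key generation. Real keys use primes hundreds of digits long.
p, q = 61, 53
n = p * q                # 3233, the public modulus
phi = (p - 1) * (q - 1)  # 3120, easy to compute only if you know p and q
e = 17                   # public exponent, shares no factor with phi
d = pow(e, -1, phi)      # private exponent, chosen so (e * d) % phi == 1

public_key = (e, n)
private_key = (d, n)


def encrypt(message, public_key):
    """Encrypt an integer smaller than n using the public key."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, private_key):
    """Recover the original integer using the private key."""
    exponent, modulus = private_key
    return pow(ciphertext, exponent, modulus)


def sign(value, private_key):
    """Produce a signature that only the private key holder can make."""
    exponent, modulus = private_key
    return pow(value, exponent, modulus)


def verify(value, signature, public_key):
    """Check a signature using only the public key."""
    exponent, modulus = public_key
    return pow(signature, exponent, modulus) == value


message = 65
ciphertext = encrypt(message, public_key)
print(decrypt(ciphertext, private_key) == message)  # True

signature = sign(message, private_key)
print(verify(message, signature, public_key))       # True
print(verify(message + 1, signature, public_key))   # False
```

Anyone who could factor 3233 back into 61 and 53 could recompute phi and d, which is why the scheme depends on factoring being hard.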

Other cryptographic functions are based on similar principles. There is a lot of theoretical math that goes into providing good ciphers that work well and are secure.

HTTPS Certificates

With HTTPS, these public keys are contained in certificates. If you direct your browser at an HTTPS server, you need to know the public key the server is using, in order to verify its identity.

When you connect to a site, it will give you its certificate. But how can you trust that the certificate for the site is the real one?

Generating certificates with RSA, or any other cipher, is not a hard thing to do. What is to stop a site from forging a certificate and giving it to your browser?

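HTTPS gets around this problem by relying on a trusted third party, called a certificate authority. A certificate authority checks that whoever requests a certificate really controls the site, and then signs the certificate, including the site's public key, with the authority's own private key. Browsers and operating systems ship with a list of certificate authorities they trust, along with each authority's public key.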

Some of the larger authorities are:

IdenTrust
Comodo
DigiCert
GoDaddy
GlobalSign

When you connect your browser to a new web site using HTTPS, the site will provide its certificate. Your browser will then check that the certificate was signed by a certificate authority it trusts, that it was issued for the domain you asked for, and that it has not expired. If these checks pass, your browser will continue. If they do not, you will get a security warning.

Assuming the certificate is accepted, your browser will use the public key contained therein to confirm that the server holds the corresponding private key, and the two sides will agree on a temporary secret key. The rest of the connection is encrypted with that shared key, which is much faster than encrypting everything with the public key.
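
A program can look at the certificate a server presents in much the same way a browser does. The sketch below uses Python's standard ssl module. The helper names flatten_name, fetch_certificate and describe are invented for this example, and which fields a given certificate contains varies from site to site.

```python
import socket
import ssl


def flatten_name(name):
    """Turn the nested tuples used by getpeercert() into a plain dict."""
    return {key: value for rdn in name for key, value in rdn}


def fetch_certificate(host, port=443, timeout=10):
    """Connect to host, verify its certificate, and return it as a dict."""
    context = ssl.create_default_context()  # uses the system's trusted CAs
    with socket.create_connection((host, port), timeout=timeout) as tcp_sock:
        # wrap_socket raises ssl.SSLCertVerificationError if the certificate
        # cannot be traced back to a trusted certificate authority.
        with context.wrap_socket(tcp_sock, server_hostname=host) as tls_sock:
            return tls_sock.getpeercert()


def describe(cert):
    """Summarise who the certificate is for and who issued it."""
    subject = flatten_name(cert["subject"])
    issuer = flatten_name(cert["issuer"])
    return (
        f"subject: {subject.get('commonName')}\n"
        f"issuer:  {issuer.get('organizationName')}\n"
        f"expires: {cert.get('notAfter')}"
    )
```

With network access, print(describe(fetch_certificate("example.com"))) should show the site name, the authority that issued the certificate, and its expiry date.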

HTTPS Usage

HTTPS is necessary for sites that communicate private information, such as credit card information. It is not as necessary for sites that don't communicate private information, but still has benefits:

It prevents anyone from modifying the pages you are seeing.
It prevents anyone from knowing which pages you are viewing (but not what domains).
It allows you to have confidence you are seeing the real version of a site.

Most browsers indicate in the address bar that HTTPS is being used. For example, Firefox displays a padlock icon next to the address.

Browsers will generally warn you if a page has a login form and isn't using HTTPS, and will definitely warn you if a site gives the browser a certificate which can't be verified by an authority. The site https://badssl.com/ can be used to test bad certificates.
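
The same kind of check can be made from a program. This sketch, again using Python's standard ssl module, reports whether a host's certificate verifies. The function name check_certificate and its return strings are assumptions made for this example.

```python
import socket
import ssl


def check_certificate(host, port=443, timeout=5):
    """Return a short string describing whether host's certificate verifies."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as tcp_sock:
            with context.wrap_socket(tcp_sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as err:
        # Raised when the certificate is expired, self-signed, issued for a
        # different host name, or otherwise cannot be verified.
        return f"certificate rejected: {err.verify_message}"
    except OSError as err:
        # Covers DNS failures, refused connections, timeouts and other
        # TLS errors.
        return f"connection failed: {err}"
```

Assuming the test hosts that badssl.com advertises are reachable, a call such as check_certificate("expired.badssl.com") would be expected to report a rejected certificate, while a correctly configured site should report ok.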

HTTPS overtook HTTP in terms of market share in September 2018. Google has encouraged adoption by treating HTTPS as a positive signal when ranking search results.

In order for a site to use HTTPS, it must generate a key pair and have a certificate authority issue a signed certificate for its public key. This can cost money, but Let's Encrypt, a certificate authority announced in 2014, issues certificates for free.