Signed and Floating Point Numbers

 

Overview

We have seen how positive, unsigned numbers can be stored in binary and converted back and forth with decimal values. Today we will look at how we can use binary to represent both signed numbers, which can be positive or negative, and numbers with a fractional component.


 

2's Complement

One way we could implement signed numbers is sign-magnitude: store the unsigned value, and then store one extra bit for the sign. There are two major downsides to this scheme:

  1. There are two zero values: +0 and -0
  2. There's no easy way to do addition and subtraction

Instead we use a system called two's complement. One nice thing about this system is that signed numbers which are positive are stored exactly the same way as unsigned numbers.

The way we encode negative numbers is to follow these steps:

  1. Take the absolute value of the number, and convert that to an unsigned binary value
  2. Flip all of the bits
  3. Add 1

For example, let's say we want to convert the number -52 to binary. We begin with the binary value for positive 52:

00110100

When talking about signed numbers, we need to be aware of the size of the value. Here, we are assuming we're storing it in a single byte. So two leading zeroes (which will be flipped to leading ones) are shown here. We then flip the bits:

11001011

And finally, we add 1:


    11001011
  +        1
    --------
    11001100

For signed numbers, the leftmost bit tells us the sign: if it's 0 it is positive (or zero), and if it's 1, it's negative.

When we see a binary number, we now need to know whether it is signed or not to say for sure what its decimal value is. Generally we use the following steps to determine the value:

  1. If the number is unsigned, its value is determined the normal way
  2. Otherwise, we look at the left-most bit. If it's 0, then the number is non-negative and we determine the value the normal way
  3. If the left-most bit is a 1, the number is negative. We find its absolute value by negating it (flipping the bits and adding one); the value of the signed number is then the negative of that.

Remember that signed numbers are not always negative. It's just that they can be negative.


 

Signed Operations

There are a couple of places, in both software and hardware, where we need to know whether we are dealing with signed or unsigned numbers.

The first is when we are shifting a binary value to the right. Imagine we have the value -52 as computed above, and we want to shift it one place to the right:


    11001100
>>         1

When we shift the value to the right, we are opening up a new bit position on the left-hand side. If we fill this with a 0, we are effectively changing the value to be positive:


    11001100
>>         1
    01100110

The value of this binary number is 102. However, we want to maintain the link between right-shifting and division: right-shifting by 1 position should be equivalent to dividing by 2. If we instead fill in the left-most bit with a 1, then we get the following:


    11001100
>>         1
    11100110

The value of this number is -26, which is the correct result of dividing -52 by 2. So when right-shifting, we need to know whether the value is signed. If it is, we must fill the new position with a copy of the sign bit (an arithmetic shift) rather than with 0 (a logical shift).

This also must be done when extending a signed number. For instance, if we copy a 16-bit short value into a 32-bit int variable, the number will need to be extended. Unsigned values are always extended with leading 0's on the left. But signed numbers are sign extended, with copies of the sign bit filling the new positions.


 

Floating Point Numbers

In the early days of computing, different computer systems used different schemes for storing floating point numbers. Nowadays, almost all systems use the IEEE 754 standard for floating point numbers, published in 1985. It includes 32-bit numbers (float in C) and 64-bit numbers (double in C), as well as other sizes.

These standards break the number into three sections:

  Bits   Sign   Exponent   Mantissa
  32     1      8          23
  64     1      11         52

These numbers are essentially written in binary "scientific notation": the value is normalized so a single 1 sits to the left of the binary point (the binary analog of the decimal point), and the exponent determines how far to shift that point left or right. For example:


1011011.11010001

becomes:


1.01101111010001 * 2^6

The number to the left is called the mantissa. We won't actually store the leading "1.", since it's assumed to be there. The mantissa essentially stores the bit pattern of the number, and the exponent tells us where the binary point goes.

The exponent can be positive (indicating an absolute value greater than 1), or negative (indicating an absolute value less than 1). But rather than storing a signed number for the exponent, the format adds a bias value to the true exponent. The bias value for 32-bit floats is 127, and for 64-bit doubles it's 1023. This way the smallest exponent is encoded as 0 (which would represent a true exponent of -127) and the largest is encoded as all 1's.


 

Floating Point Precision

Just as in decimal, we can't store all fractional values with complete precision using floating point numbers. In decimal, for instance, we can't write the value $\frac{1}{3}$ or $\frac{1}{7}$ exactly, as they have repeating decimal expansions. In general, we can only express a fraction precisely in decimal if its denominator's prime factorization contains only 2's and 5's, the prime factors of our base 10.

In binary, things are even worse, since our base has only the prime factor 2. The only values which can be stored precisely are fractions where the denominator is a power of 2. Other fractional values, such as $\frac{1}{3}$ or even $\frac{1}{5}$, are represented with repeating binary patterns. The size of the mantissa determines how many repetitions of the pattern we can store, and so how close the stored approximation is.


 

Floating Point Example

As an example, let's convert the decimal value 200.625 to a 32-bit floating point binary number. The fraction .625 is $\frac{5}{8}$, so we can represent this number exactly.

We can start by converting 200 to binary, which is 11001000. We can then convert the fractional part to binary. To do that, we need to recognize that the columns to the right of the binary point have place values $\frac{1}{2}$, $\frac{1}{4}$, $\frac{1}{8}$, etc. So the value .625 can be written as the binary fraction .101. Putting this together, we get the following binary value:

11001000.101

We can then shift the binary point 7 positions to the left which gives us the mantissa value:

1.1001000101

The leading 1. is not included with the mantissa (as it would always be present, so there's no need to actually store it). We then add the true exponent of 7 to the bias value 127, giving us 134. We can convert that to binary as 10000110. This gives us the following 32-bit float value:

0 10000110 10010001010000000000000

The first bit is the sign bit, the next eight bits are the biased exponent, and the following 23 bits are the mantissa (with extra 0's on the end to fill the space).


 

Fixed Point Numbers

The reason they are called floating point numbers is that the binary/decimal point can move left and right. There are also numbers called fixed point numbers, where the decimal point is always in a specific place in the number. The example we're most familiar with is money: we are used to seeing a price like $23.40, but $23.4 or $23.407 both look weird.

Fixed point numbers can be represented in a program by essentially changing the units used. Rather than declare money using a floating point number like this:


double price = 54.42;   // dollars

We can do it like this:


int price = 5442;   // cents

This is generally better for cases where precision is desired, such as when dealing with monetary values. As mentioned above, floating point numbers have potential precision and rounding issues, while integers are exact.