Type Systems

Introduction

A data type defines a set of values and a collection of operations that may be performed.

Types are very important to programming languages. They allow us to abstract away the fact that all memory is made of bits.

The type system for a language answers the following questions:

What types are built in?
When are types checked?
What is considered a type error?
How are types converted?
How are new types created?
When are two types considered the same?

Primitives

Primitives are simple data types consisting of only one value. They normally correspond to a single memory cell on a machine.

Integers

Integers come in different sizes which can store different ranges of numbers:

Size	Typical Name	Signed Range
8	char or byte	-128 to 128
16	short or half word	-32768 to 32767
32	int or word	-2147483648 to 2147483647
64	long or double word	-9223372036854775808 to 9223372036854775807

Some languages define the size of integer primitives (Java) and some do not (C and C++). Some languages have no limits on the size of integers (such as Python or Racket).

Some languages allow unsigned numbers which can be twice as large.

Signed numbers are normally implemented with two's complement notation.

Floating Point

Floating point data types model real numbers. They only store approximations.

There are different ways to store floating point numbers. The modern standard is IEEE 754

IEEE 754 floating point numbers store a number called the significand (or mantissa) and an exponent. They also store the sign of the number as one bit.

Bits	Name	Significand Bits	Exponent Bits
32	Single Precision	23	8
64	Double Precision	52	11

Fixed Point

Fixed point numbers are those with a fixed number of decimal positions that can be represented exactly in memory. They are implemented using integers.

Characters

Characters represent text characters such as letters, digits or symbols. Because computers can only store numbers, an encoding us used to store characters. ASCII is traditional, but lacks support for non-Western languages.

Unicode has supplanted ASCII as the standard character encoding. Many languages can use both ASCII and Unicode.

Booleans

The boolean data type can store one of two values: true or false. They are typically implemented using a byte of memory.

Conversions

Languages handle type conversions in different ways:

Automatic.
Casts.
Conversion functions.

Types can be coerced or converted. A coercion means that you treat the type as a different one. A conversion means that you actually change it.

Arrays

Arrays store multiple elements of another data type. The individual elements are accessed with a numerical index.

All elements in an array must be the same type.
Indexing can start at 1 or 0.
Indices can be range-checked.
Some languages support slices.

Arrays are implemented with a contiguous block of memory. If the array is indexed from 0, the address of individual elements is calculated with:


address = start + (index * size-of-element);

Arrays of arrays, or multi-dimensional arrays, are also supported by most languages. This can be done row-major or column-major.

Some languages allow for arrays to be initialized directly:


int list [] = {1, 3, 5, 7, 9};
list = [x for x in range(0, 10) if x % 2 == 1]

Array operations include:

Indexing.
Assignment.
Concatenation.
Matrix / vector operations.

Strings

String types represent a sequence of characters as text.

Strings can be implemented as built-in types, simple arrays of characters, or objects.

Strings can be mutable, or immutable.

Common string operations include:

Assignment.
Comparison.
Concatenation.
Substring.
Pattern matching.

Enumerations

An enumeration is a type where the programmer specifies all possible values. In C-based languages, this looks like:


enum Day {
    Monday,
    Tuesday,
    Wednesday,
    Thursday,
    Friday,
    Saturday,
    Sunday
};

They are typically implemented with integers. Some languages allow them to be coerced to/from integers and some do not.

In ML based languages, enumeration types can aid pattern matching:


type day = Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday

match d with Monday -> // code
           | Tuesday -> //code
    ...

This will give an error if we do not handle every enumerated value.

Enumerations were only added to Python in version 3.4, but with a library and not an intrinsic part of the language.

Some languages, such as Pascal and Ada, have subrange types which allow us to specify ranges of integers as valid values:


subtype index is Integer range 1..1000;

When a variable of type "index" is declared, it will be checked to ensure it is within that range.

Data Structures

Associative Arrays

An associative array is one that maps one type of data to another. For example an associative array can map people's names to phone numbers or book ISBN's to titles etc.

Associative arrays are also called hashes, maps, or dictionaries.
Tuples

A tuple is similar to a record except that instead of accessing the individual fields by name, they are accessed by position. Unlike arrays and lists, tuples are typically used with an exact, known number of elements. They can be used for quickly grouping multiple items into a single type.
Lists

Lists represent linked lists. Some languages implement only arrays or linked lists natively and leave the other to libraries. Some languages (such as Ocaml) implement both natively.

Type Checking

Type checking is the process of making sure that variables and values are used consistently with the type of data they store. Type checking is also extended to functions and expressions.

A type error is a situation where a variable or value is used with an operation that is not supported for that type. The following is a type error in C:


float x = 5.0;
char* name = "Bob";
int comb = x + name;

x.c:8:16: error: invalid operands to binary + (have float and char *)

A language is called strongly typed if it will never allow invalid operations to take place. C is weakly typed, because we can make the compiler do operations like the above:


float x = 5.0;
char* name = "Bob";
int comb = x + (float)((int)name);

Strong/weak typing is not the same as static/dynamic typing:

	Static	Dynamic
Strong	Java, Haskell	Python, Racket
Weak	C, C++	Perl, Shell languages

Static Typing

Static type binding can be done either implicitly or explicitly. Explicit declaration means that the programmer must specify the types of variables. This is done in languages like C, C++, and Java.

The code below has a few type annotations:


int fact(int x) {
  if(x == 0) {
    return 1;
  } else {
    return x * fact(x - 1);
  }
}

Type declarations can become quite verbose when using templates or object-oriented programming:


vector<int>::iterator i = nums.begin();

Implicit static typing means that the programmer does not need to declare types. In languages with type inferencing, the types are still statically bound, but it is done by the compiler. These languages also normally provide the ability to declare types if necessary. Type inferencing can lead to difficult compiler errors if the programmer makes a type error.

Below is the factorial function in Haskell which uses type inference:


fact x =
  if x == 0 then
    1
  else
    x * fact (x - 1)

What are some benefits and drawbacks of type inference?

Dynamic Typing

Dynamic type binding is when the type of a variable is not decided until the program runs. Dynamically-bound types are always implicit.

Below is the factorial function in Python which uses dynamic binding:


def fact(x):
  if x == 0:
    return 1
  else:
    return x * fact(x - 1)

Dynamic typing can lead to run time errors that a compiler would have caught. Consider the following Python code that attempts to add the numbers from from 1 to 100:


total = 0 
number = 1 

while number <= 100:
  toal = total + number
  number = number + 1 

print(total)

This code prints "0" instead of "5050". This type of error can be difficult to debug.

Consider the following Python code:


def getValue(which):
  if which:
    return 10
  else:
    return "oops"


number = getValue(which) + 1 
print(number)

This code will run correctly if "which" is true. However, if "which" is false, the function will return a different type and produce a type error.

What are the trade-offs between static and dynamic typing?

Type Equivalence

Type equivalence refers to the rules on whether deciding two types are equivalent. Consider this code:


struct A {
    int x;
    float y;
};

struct B {
    int t1;
    float t2;
};

void print(A a) {
    printf("%d, %f\n", a.x, a.y);
}

int main() {
    B b;
    print(b);
}

Should this code be allowed?

This depends on the type equivalence rules of the language:

Nominative
Nominative equivalence means that the names of the types must match. This means the above code would not be valid.
Structural
Structural equivalence means that the structure of types must match. This means would be valid.

Most languages use nominal equivalence. Some exceptions:

Ocaml uses structural equivalence in its object system.
Go uses structural equivalence with its interfaces.
C++ uses structural equivalence when expanding templates. Java generics use nominative equivalence.

C is nominally typed for most constructs (including structs), but its "typedefs" are just aliases so they are structurally typed.

Type Systems

A languages type system is composed of the following things:

What data types are available.
What operations are available for each type.
How new data types are created.
The rules for converting and coercing types.
Whether it is strongly or weakly typed.
Whether it is dynamically or statically typed.
What rules are used for type equivalence.

The type system of a language is probably the most important factor in a languages design.

Scope

The scope of a variable is the portion of code over which it is accessible.

Local variables are those whose scope is limited to a single function.

Global variables have a scope of the entire program.

What are the scope and lifetime of "x" in this code?


int function() {
  static int x = 0;
  x++;
  return x;
}

In most languages, there can be multiple variables with the same name in a given scope. What does the following code print?

int y = 10;

void function() {
  int y = 20;
  printf("%d", y);
}

This is called "shadowing". In languages without declarations, this can lead to subtle issues as in this code:


g = 10

def function():
  g = 1 

function()
print(g)

This code would be fixed with the "global" keyword:


g = 10

def function():
  global g                                                                                                                       
  g = 1 

function()
print(g)

There are two ways of determining the scope of variables: static and dynamic.

Static Scope

Static scope is based on the code of the program. This means that scope can be determined just by looking at the code.

When a variable is referenced, it is searched for in the current bolck, then the enclosing block, on and on, until the global scope is reached.


int x = 10;

void f() {
  int x = 5;

  if(x > 0) {
    printf("%d", x);    
  }
}

Many languages allow nested functions:


def function1(y):
  def function2(x):
    return x + 1
  return function2(y)

In this code, function2 itself is local to function1 and can only be called inside of that function.

All common languages use static scoping.

Dynamic Scope

With dynamic scoping, scope is based on the call sequence of the running program. This means you cannot determine which variables are in scope by looking at the code.

References to variables are searched through the stack.


int x = 10;

void f() {
  printf("%d", x);
}

void g() {
  int x = 20;
  f();
}

int main() {
  g();
}

What will this code print using static scoping?

What about dynamic scoping?

LISP was originally dynamically scoped and some variants still are.