Information Representation

From Rice Wiki

Inside a computer, everything is bits. The topic of information representation discusses how different information inside a computer program actually looks like in the background.

Word

Instead of one bit at a time, computers do operations on several at the same time. The word size of the computer is the number of bits representing each piece of data that it processes.

Notably, since instructions are also "data" received by the CPU, each instruction is also constrained by the word size.

Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).

Byte and Endianess

Most modern day computers are byte-addressable. The byte is the lowest unit of data that has an address and can be accessed at once.

Since we are storing multiple bytes of a word across memory, the order in which we store them, endianess, needs to be specified.

There are two choices: Big Endian and Little Endian. Let's consider how to store 0x10203040 in a 32-bit machine

In Big Endian, the most significant byte is stored at the lowest part of an address (i.e. big end first). Addresses would look something like 0x10, 0x20, 0x30, 0x40. BE is used on the internet.

In Little Endian, the least significant byte is stored at the lowest part of an address (i.e. little end first). Addresses would look something like 0x40, 0x30, 0x20, 0x10. LE is used on intel machines.

Besides different CPUs using different endianess to run instructions, most file formats specify endianess to support different machines. For example, a Unicode text file has a BOM (byte order mark) at the start to denote whether the file is BE or LE.

Signed Integers

While a series of bits naturally represent a binary number, negative numbers are a bit more complicated.

The first method to represent signed integers with bits is signed magnitude. Under this system, the first bit (sign bit) is used as a negative sign; If it is 0, then the number is positive, otherwise it is negative.

The second method is called 2's complement, and it is more widely used due to several advantages covered later.

In this system, think of the first bit in the bit pattern as negative. For example, 101 would be 4 + 1 = 5 as an unsigned number, but -4 + 1 = -3 as a signed number represented by 2's complement.

The advantage of using 2's complement over signed magnitude is twofold:

  • Normal binary arithmetic can be applied.
  • There is no negative zero

To negate signed numbers, flip all bits and add 1. This comes from math:

Floating Point

To store decimal values, the floating point representation is used. Similar to scientific notation, it uses a series of bits called the mantissa to store significant binary digits, and another series of bits called the exponent to store the order of magnitude of the number. The final calculation is something like

To handle negative numbers, the first bit is used as a sign bit.

To handle negative exponents, all exponents are subtracted by a constant to center the range of exponents on 0. For example, when 8 bits is used to represent the exponent (as is usually the case), the range of possible values is 0 to 255, so 127 is subtracted from the exponent so that the actual range is -127 to 128.

Arrays

The simplest structured group of values is an array, in which multiple values of the same type is stored consecutively in a block of memory. We then keep track of the address of the first element in the array and the size of each element so that we can find the n-th element easily.

For example, consider an array of 10 32-bit integers arr. To access the 6th element, I take the address