Information Representation: Difference between revisions

From Rice Wiki
(Created page with "Inside a computer, everything is bits. The topic of '''information representation''' discusses how different information inside a computer program actually looks like in the background. Category:Computer Architecture")
 
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Inside a computer, everything is bits. The topic of '''information representation''' discusses how different information inside a computer program actually looks like in the background.
Inside a computer, everything is bits. The topic of '''information representation''' discusses how different information inside a computer program actually looks like in the background.
= Word =
Instead of one bit at a time, computers do operations on several at the same time. The '''word size''' of the computer is the number of bits representing each piece of data that it processes.
Notably, since instructions are also "data" received by the CPU, each instruction is also constrained by the word size.
Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).
== Byte and Endianess ==
Most modern day computers are [[byte-addressable memory]]. The byte is the lowest unit of data that has an address and can be accessed at once.
Since we are storing multiple bytes of a word across memory, the order in which we store them, '''endianess''', needs to be specified.
There are two choices: Big Endian and Little Endian. Let's consider how to store 0x10203040 in a 32-bit machine
In '''Big Endian''', the most significant byte is stored at the lowest part of an address (i.e. big end first). Addresses would look something like 0x10, 0x20, 0x30, 0x40. BE is used on the internet.
In '''Little Endian''', the least significant byte is stored at the lowest part of an address (i.e. little end first). Addresses would look something like 0x40, 0x30, 0x20, 0x10. LE is used on intel machines.
Besides different CPUs using different endianess to run instructions, most file formats specify endianess to support different machines. For example, a Unicode text file has a BOM (byte order mark) at the start to denote whether the file is BE or LE.
= Signed Integers =
While a series of bits naturally represent a binary number, negative numbers are a bit more complicated.
The first method to represent signed integers with bits is '''signed magnitude'''. Under this system, the first bit (''sign bit'') is used as a negative sign; If it is 0, then the number is positive, otherwise it is negative.
The second method is [[two's complement]].
= Floating Point =
To store decimal values, the '''floating point''' representation is used. Similar to scientific notation, it uses a series of bits called the '''mantissa''' to store significant binary digits, and another series of bits called the '''exponent''' to store the order of magnitude of the number. The final calculation is something like
<math>1.m\times2^{e}</math>
To handle negative numbers, the first bit is used as a sign bit.
To handle negative exponents, all exponents are subtracted by a constant to center the range of exponents on 0. For example, when 8 bits is used to represent the exponent (as is usually the case), the range of possible values is 0 to 255, so 127 is subtracted from the exponent so that the actual range is -127 to 128.
= Arrays =
The simplest structured group of values is an '''array''', in which multiple values of the same type is stored consecutively in a block of memory. We then keep track of the address of the first element in the array and the size of each element so that we can find the n-th element easily.
For example, consider an array of 10 32-bit integers <c>arr</c>. To access the 6th element, I can simply move 5 integer sizes after the first element to get the address I want:
<sh>
*(&arr + 5*sizeof(int))
</sh>
Notably, this is also a major reason why array indices start at 0 in most programming languages.
= Struct =
A '''struct''' boils down to an array that can have different types of values, each taking up different sizes.
[[Category:Computer Architecture]]
[[Category:Computer Architecture]]

Latest revision as of 15:15, 22 May 2024

Inside a computer, everything is bits. The topic of information representation discusses how different information inside a computer program actually looks like in the background.

Word

Instead of one bit at a time, computers do operations on several at the same time. The word size of the computer is the number of bits representing each piece of data that it processes.

Notably, since instructions are also "data" received by the CPU, each instruction is also constrained by the word size.

Word size is determined by the CPU. For example, a 32-bit CPU has a word size of 32 bits (or 4 bytes).

Byte and Endianess

Most modern day computers are byte-addressable memory. The byte is the lowest unit of data that has an address and can be accessed at once.

Since we are storing multiple bytes of a word across memory, the order in which we store them, endianess, needs to be specified.

There are two choices: Big Endian and Little Endian. Let's consider how to store 0x10203040 in a 32-bit machine

In Big Endian, the most significant byte is stored at the lowest part of an address (i.e. big end first). Addresses would look something like 0x10, 0x20, 0x30, 0x40. BE is used on the internet.

In Little Endian, the least significant byte is stored at the lowest part of an address (i.e. little end first). Addresses would look something like 0x40, 0x30, 0x20, 0x10. LE is used on intel machines.

Besides different CPUs using different endianess to run instructions, most file formats specify endianess to support different machines. For example, a Unicode text file has a BOM (byte order mark) at the start to denote whether the file is BE or LE.

Signed Integers

While a series of bits naturally represent a binary number, negative numbers are a bit more complicated.

The first method to represent signed integers with bits is signed magnitude. Under this system, the first bit (sign bit) is used as a negative sign; If it is 0, then the number is positive, otherwise it is negative.

The second method is two's complement.

Floating Point

To store decimal values, the floating point representation is used. Similar to scientific notation, it uses a series of bits called the mantissa to store significant binary digits, and another series of bits called the exponent to store the order of magnitude of the number. The final calculation is something like

To handle negative numbers, the first bit is used as a sign bit.

To handle negative exponents, all exponents are subtracted by a constant to center the range of exponents on 0. For example, when 8 bits is used to represent the exponent (as is usually the case), the range of possible values is 0 to 255, so 127 is subtracted from the exponent so that the actual range is -127 to 128.

Arrays

The simplest structured group of values is an array, in which multiple values of the same type is stored consecutively in a block of memory. We then keep track of the address of the first element in the array and the size of each element so that we can find the n-th element easily.

For example, consider an array of 10 32-bit integers arr. To access the 6th element, I can simply move 5 integer sizes after the first element to get the address I want:

*(&arr + 5*sizeof(int))

Notably, this is also a major reason why array indices start at 0 in most programming languages.

Struct

A struct boils down to an array that can have different types of values, each taking up different sizes.