Array (Data Structure)
Arrays (data structure) are a type of linear data structure that can hold an ordered collection of values. As opposed to the array (ADT), the array data structure specifies an implementation that the values are of homogeneous size and stored in contiguous memory. They are extremely ubiquitous and among the oldest, most widely used data structures in programming.
Unless otherwise specified, for the remainder of this wiki, the word "array" will refer to the data structure and not the abstract data type.
Arrays are often used to implement abstract data types such as arrays (ADT) and stacks. They could also be used to implement inefficient queues and double-ended queues.
Contents
Properties of the Array
The simplest implementation of an array only stores the memory address of its first index. Some languages also store the size of the array to prevent buffer overflow errors. Insertion and retrieval of an item at a given index is performed by taking the index \(i\), multiplying by the item size \(s\) in the number of bytes, and adding it to the base memory address \(b\).
For zero-based numbering languages, the memory address \(m\) of an index is given by
\[m = b + s i.\]
Consider the following code snippet in C:
1 2 3 4 5 6uint32_t prime_numbers[5]; prime_numbers[0] = 2; prime_numbers[1] = 3; prime_numbers[2] = 5; prime_numbers[3] = 7; prime_numbers[4] = 11;
Line 1 declares an array of length 5 that holds 32-bit unsigned integers. 32 bits is 4 bytes, so our item size \(s\) is 4. This means our array is \(5 \times 4 = 20\) bytes in total size. The first index is assigned an arbitrary memory address; for our example, let's say \(b = \mathtt{0xB3A0}\). The entire array uses memory addresses \(\mathtt{0xB3A0}\) through \(\mathtt{0xB3B3}\).
Lines 2-6 each compute the memory address for the given index and write the value as a 4-byte unsigned integer. For example, on line 5, the memory address of index 3 is \[\mathtt{0xB3A0} + 4 \cdot 3 = \mathtt{0xB3AC}.\] Therefore the value 7 is written as a 4-byte unsigned integer starting at address \(\mathtt{0xB3AC}\).
To store heterogeneously sized items in an array in C, pointers to those items, instead of the items themselves, would have to be explicitly stored. This is how arrays work in higher-level languages like Java and Python. Instead of storing a raw item directly in the array, a reference to the actual item is stored. This all happens implicitly, as Java and Python don't have a pointer type. A consequence of this is that traversing the array sequentially might actually mean accessing non-contiguous portions of memory. Instead of all the values existing in contiguous portion of memory, the values are stored in arbitrary locations in memory that may be far apart from each other. The locality of the data and the ability of the CPU to cache it efficiently are lost.
Arrays have a fixed size which is declared when it is created. A more complex data structure that allows for resizing an array after creation is the dynamic array.
Time Complexity
Arrays are among the fastest data structures since the index lookup process is simply an integer multiplication operation followed by an integer addition operation--both very fast operations on modern CPUs. Below are the time complexities of the basic functions required by the array (ADT).
Operation | Complexity |
get | \(O(1)\) |
set | \(O(1)\) |
Space Complexity
Arrays take up linear space with respect to the number of elements \(n\) they hold. They also have the benefit that no space is wasted in memory (because it is initialized with a certain size):
\[\begin{array} &&\text{Space - O(n)} \end{array}.\]
Two-dimensional Arrays
Two-dimensional arrays are also stored in contiguous memory. For zero-based numbering languages, the memory address of an item at given indices \(i\) and \(j\) in an array of item size \(s\) bytes, \(r\) rows, \(c\) columns, and base memory address \(b\) is given by
\[m = b + s(ci + j).\]
Consider the following code snippet in C:
1 2 3 4 5 6 7uint32_t transform[3][2]; transform[0][0] = 37; transform[0][1] = 0; transform[1][0] = 0; transform[1][1] = 39; transform[2][0] = 1; transform[2][1] = 4;
Line 1 declares a two-dimensional array with \(r=3\) rows and \(c=2\) columns that holds 32-bit unsigned integers. 32 bits is 4 bytes, so our item size \(s\) is 4. This means our array is \(3 \times 2 \times 4 = 24\) bytes in total size. The first index is assigned an arbitrary memory address; for our example let's say \(b = \mathtt{0xC704}\). The entire array uses memory addresses \(\mathtt{0xC704}\) through \(\mathtt{0xC71B}\).
Lines 2-7 each compute the memory address for the given index and write the value as a 4-byte unsigned integer. For example, on line 7, the memory address of indices \(i=2\), \(j=1\) is \[\mathtt{0xC704} + 4 \cdot (2 \cdot 2+ 1) = \mathtt{0xC718}.\] Therefore, the value 4 is written as a 4-byte unsigned integer starting at address \(\mathtt{0xC718}\).
This technique can be generalized for storing \(n\)-dimensional arrays in contiguous memory.
Jagged Arrays
The two-dimensional arrays covered above are by definition always rectangular. Often times this doesn't quite fit the problem at hand and ends up wasting space. This can be avoided by using a one-dimensional array which stores references to other one-dimensional arrays of heterogeneous length. This is slightly slower as it adds a level of indirection, but often times the memory saved makes it worth it. Higher-level languages such as Java and Python only allow jagged arrays for multidimensional arrays.