Arrays: The Cornerstone of Organized Data

Arrays are one of the most fundamental and widely used data structures in computer programming. They serve as a crucial building block for more complex data structures and algorithms.

Contents

Understanding Arrays
- The Advantages and Limitations of Arrays
Advantageous Characteristics of Arrays: Random Access
Insertion and Deletion in Arrays
Dynamic Arrays in Programming Languages

Understanding Arrays

Array is a collection of items stored at contiguous memory locations. The idea is to store multiple items of the same type together, making it easy to calculate the position of each element by simply adding an offset to a base value, the memory location of the first element of the array (generally denoted by index 0).

The Advantages and Limitations of Arrays

Arrays come with a set of advantages that make them a go-to data structure:

Direct Access: Arrays provide O(1) time complexity for accessing elements if the index is known. This is because they enable direct access to their elements via indexing.
Cache-Friendly: Due to their contiguous memory allocation, arrays tend to make better use of CPU cache, a small data storage layer that a computer’s processor uses to quickly access data.

Why does continuous memory allocation make better use of the CPU cache?

When data is stored in contiguous memory locations, as in the case of arrays, fetching elements of the array results in loading neighboring elements into the cache along with the requested data. This takes advantage of spatial locality, which refers to the tendency of programs to access memory locations that are close to each other.

Because CPUs often employ cache lines, which are blocks of contiguous memory that are loaded into the cache together, accessing elements of an array that are stored contiguously increases the likelihood of loading multiple elements into the cache with a single memory access. This reduces the number of cache misses and improves overall performance, as subsequent accesses to nearby elements can be satisfied from the cache rather than fetching from main memory.

In contrast, data structures that do not have contiguous memory allocation, such as linked lists, may not exhibit good spatial locality. Accessing elements in a linked list typically involves traversing non-contiguous memory locations, which can lead to frequent cache misses and poorer cache utilization.

However, arrays also have limitations:

Fixed Size: Once an array is created, it cannot be resized to store more elements than its declared size.
Insertion and Deletion: Operations like insertion and deletion can be costly since they may require shifting elements to maintain the array’s contiguous nature.

Advantageous Characteristics of Arrays: Random Access

In general, Arrays have two limitations:
1. Contiguous memory space
2. Same type of stored data
It is because of these two limitations that it has its advantageous characteristic: random access.

At their core, arrays function by organizing data so that each element is accessible by a computed index. The base address of the array, combined with the index, is used to calculate the address of any element in the array.

For instance, consider an array int a[10]. If the base address is 1000 and the size of an integer is 4 bytes, the address of a[9] would be computed as 1000 + (9 * 4) = 1036.

Insertion and Deletion in Arrays

Inefficient Insertion in Arrays

Insertion in an array is generally an O(n) operation. The efficiency of inserting an element into an array depends on the position at which it needs to be inserted.

End of Array: If you insert at the end of the array and the array has enough capacity, this operation is O(1) because it does not require shifting any existing elements.
Middle or Beginning: Insertion at any position other than the end requires elements to be shifted to make room for the new element, which can be time-consuming for large arrays, hence the O(n) complexity.

Inefficient Deletion in Arrays

Deletion from an array can also be an O(n) operation due to the need to maintain the continuity of elements. Similar to insertion:

End of Array: Deletion at the end is O(1) because it involves merely removing the last element without affecting the rest of the array.
Middle or Beginning: Deleting an element from any other position requires shifting subsequent elements one position to the left to fill the gap, leading to O(n) complexity.

Mark-Sweep of Arrays Deletion

Mark-sweep is a strategy borrowed from garbage collection algorithms in memory management (e.g. JVM). In the context of arrays, it can optimize deletions in scenarios involving multiple deletions.

Mark Phase: Instead of immediately deleting an element and shifting the others, the element is marked for deletion.
Sweep Phase: After all the deletions are marked, the array is traversed to compact it, effectively “sweeping” out all the marked elements and shifting the unmarked ones in a single pass.

This strategy is particularly efficient when multiple elements need to be deleted at once, as it reduces the number of shifts.

Example:

Consider an array [2, 3, 5, 7, 11, 13, 17, 19] where we want to delete all prime numbers less than 10. In the mark phase, we would mark 2, 3, 5, 7 for deletion. In the sweep phase, we would shift the remaining elements once to obtain [11, 13, 17, 19].

Dynamic Arrays in Programming Languages

While the concept of arrays is universal, their implementation can vary across different programming languages. Nowadays, many programming languages provide dynamic arrays (like ArrayList in Java, vector in C++ STL and list in Python) that can resize themselves when elements are added or removed.

Even though dynamic arrays support expansion, if you can determine the size of the data to be stored in advance, it is better to specify the data size in advance when creating the dynamic arrays (like ArrayList).

This is because the expansion operation involves memory request and data transfer, which is time-consuming.

Reference:

Wang, Zheng (2019) The Beauty of Data Structures and Algorithms. Geek Time.