The Bit Array

An Efficient and Multipurpose Data Structure

A bit array, also referred to as bitmask, bit map, bit set, bit string, or bit vector, is a compact array-based data structure designed for efficient bit storage. Its versatility extends to implementing simple set data structures, leveraging bit-level parallelism in hardware for rapid operations. A typical bit array stores kw bits, where w represents the number of bits in the storage unit, such as a byte or word, and k is a nonnegative integer. However, if w does not evenly divide the number of bits to be stored, internal fragmentation may lead to some space wastage.

Definition

A bit array functions as a mapping from a domain (usually a range of integers) to values in the set {0, 1}. These values can be interpreted in various contexts, such as Yes/no ,dark/light, absent/present, locked/unlocked, valid/invalid, and more. Crucially, only two possible values exist, enabling their efficient storage in a single bit. Similar to other arrays, accessing a single bit involves applying an index to the array. Assuming the array's size (or length) is n bits, it can specify a subset of the domain (e.g., {0, 1, 2, ..., n−1}). Here, a 1-bit indicates the presence, while a 0-bit signifies the absence of a number in the set. This set data structure utilizes approximately n/w words of space, where w is the number of bits in each machine word. Whether the least significant bit (of the word) or the most significant bit indicates the smallest-index number is largely arbitrary, with a tendency to prefer the former (on little-endian machines).

In the realm of mathematics, a finite binary relation can be represented by a bit array known as a logical matrix. In the calculus of relations, these arrays undergo matrix multiplication with Boolean arithmetic. Such composition represents the composition of relations, showcasing the bit array's versatility and applicability in abstract mathematical concepts.

The efficacy of bit arrays in harnessing hardware parallelism and providing a compact representation makes them invaluable across various computational scenarios, from low-level hardware operations to abstract mathematical relations. A comprehensive understanding of their structure and applications significantly enhances algorithmic design and computational efficiency.

Data Compression Using Bit Arrays

A bit array offers highly efficient storage for random bits, assuming each bit is equally likely to be 0 or 1, and they are independent. However, since most data exhibits non-random patterns, there's an opportunity for more compact storage. For instance, run-length encoding is commonly employed to compress non-random streams. It's crucial to balance compression efforts, as overly aggressive compression might negate the advantages of bit-level parallelism (vectorization). Instead of compressing bit arrays as bit streams, an alternative is to compress them as streams of bytes or words, as seen in

Bitmap index (compression)Opens in a new tab.

Pros and Cons of Bit Arrays

Despite their simplicity, bit arrays present notable advantages over other data structures:

  • Exceptional compactness: No other data structures can match the ability to store n independent data pieces in n/w words.
  • Efficient manipulation: Small bit arrays can be stored and manipulated in the register set for extended periods without requiring memory accesses.
  • Exploitation of bit-level parallelism: Bit arrays often outperform other structures on practical datasets due to their ability to leverage bit-level parallelism, limit memory access, and maximize data cache utilization, even when other structures might be more asymptotically efficient.

However, bit arrays are not a universal solution:

  • Without compression, they are inefficient for sparse sets (few elements compared to their range) in both time and space. For such cases, compressed bit arrays, Judy arrays, tries, or Bloom filters may be more suitable.
  • Accessing individual elements can be costly and challenging in certain programming languages. For cases where random access is more common than sequential access and the array is relatively small, a byte array may be preferable, unless byte addressing is not available.

Applications of Bit Arrays

Due to their compact nature, bit arrays find applications in scenarios where space or efficiency is critical. Common use cases include:

  • Representation of boolean flags or ordered sequences of boolean values.
  • Priority queues, where a bit at index k is set if and only if k is in the queue. This structure is utilized, for example, in the Linux kernel, benefiting from a find-first-zero operation in hardware.
  • Allocation of memory pages, inodes, disk sectors, etc., often referred to as a bitmap in such cases. However, the term "bitmap" is also commonly associated with raster images, which may use multiple bits per pixel.
  • Implementation of Bloom filters, a probabilistic set data structure trading a small probability of error for efficient storage of large sets. Probabilistic hash tables based on bit arrays are also feasible, allowing acceptance of false positives or false negatives.
  • Construction of succinct data structures that utilize minimal space. Operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become crucial in this context.
  • Abstraction for examining streams of compressed data, where elements may not align with bytes or occupy portions of bytes. For instance, the compressed Huffman coding representation of a single 8-bit character can vary from 1 to 255 bits in length.
  • In information retrieval, bit arrays serve as a suitable representation for posting lists of frequently occurring terms. Computing gaps between adjacent values in a list of strictly increasing integers and encoding them using unary coding results in a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2n. This aligns with the special case of Golomb coding where the parameter M is 1. This parameter is typically chosen when −log(2 − p) / log(1 − p) ≤ 1, indicating that the term occurs in at least 38% of documents.

Bit Array Operations: Pseudocodes

Initialization

Initializing a bit array involves setting up its structure and allocating necessary resources:


function InitializeBitArray(n, w)
    arraySize = Ceil(n/w)
    bitArray = new Array(arraySize)
    SetAllBitsToZero(bitArray)
    return bitArray
end function

function SetAllBitsToZero(bitArray)
    for each word in bitArray
        word = 0
    end for
end function

Set Operation

Setting a specific bit to 1 in the bit array:


procedure SetBit(bitArray, index)
    wordIndex = index / w
    bitIndex = index % w
    bitArray[wordIndex] = bitArray[wordIndex] | (1 << bitIndex)
end procedure

Clear Operation

Clearing a specific bit to 0 in the bit array:


procedure ClearBit(bitArray, index)
    wordIndex = index / w
    bitIndex = index % w
    bitArray[wordIndex] = bitArray[wordIndex] & (~(1 << bitIndex))
end procedure

Get Operation

Retrieving the value of a specific bit in the bit array:


function GetBit(bitArray, index)
    wordIndex = index / w
    bitIndex = index % w
    return (bitArray[wordIndex] >> bitIndex) & 1
end function

Toggle Operation

Toggling a specific bit (changing 0 to 1 or 1 to 0) in the bit array:


procedure ToggleBit(bitArray, index)
    wordIndex = index / w
    bitIndex = index % w
    bitArray[wordIndex] = bitArray[wordIndex] ^ (1 << bitIndex)
end procedure

Count Set Bits

Counting the number of set (1) bits in the bit array:


function CountSetBits(bitArray)
    count = 0
    for each word in bitArray
        count += HammingWeight(word)
    end for
    return count
end function

function HammingWeight(word)
    count = 0
    while word != 0
        count += word & 1
        word = word >> 1
    end while
    return count
end function

Length Operation:
    
    Function Length(array A):
        n := length of A in bits
        return n
    
    
Substring Operation:
    
    Function Substring(array A, start, length):
        
        B := new array of length bits
        for i from 0 to length - 1:
            B[i] := A[start + i]
        return B
    
    
Lexicographical Compare Operation:
    
    Function LexicographicalCompare(array A, array B):
        n := min(length of A, length of B)
        for i from 0 to n - 1:
            if A[i] < B[i]:
                return -1
            else if A[i] > B[i]:
                return 1
        
        if length of A < length of B:
            return -1
        else if length of A > length of B:
            return 1
        else:
            return 0
    
    
Concatenation Operation:
    
    Function Concatenate(array A, array B):
        C := new array of length (length of A + length of B) bits
        for i from 0 to length of A - 1:
            C[i] := A[i]
        for j from 0 to length of B - 1:
            C[length of A + j] := B[j]
        return C
    
    
Reverse Operation:
    
    Function Reverse(array A):
        B := new array of length bits
        for i from 0 to length - 1:
            B[i] := A[length - 1 - i]
        return B
    
    
Population / Hamming Weight Operation:
    
    Function HammingWeight(array A):
        count := 0
        for i from 0 to length of A - 1:
            count := count + (A[i] == 1)
        return count
    
    
i will keep adding more in future