Linear Data Structures

Here we’re quickly going to go through linear data structures. What defines them is the order relation between the structure’s elements. They’re defined as abstract types, independent from the implementation.

Frequent operations:

  • element access
  • search
  • reading or writing
  • insertion (all succeeding elements are moved one place up)
  • deletion (all succeeding elements are moved one place down)
  • finding the predecessor and successor of an element
  • list length
  • merging two lists
  • etc.

There are numerous ways to implement a linear list; as mentioned in another post (Data structures), the most frequent are sequential and linked representations.

Linear Data Structures: Arrays

This is the most basic and widely used data type. It’s a homogeneous structure with a finite number of elements, offering efficient access since the physical order of elements matches the logical one. The “disadvantage”, aside from the fixed size (memory space, max size), is that insert/delete operations can be costly.

The simplest one is the one-dimensional array, or vector. To locate an element (direct access), we use an index (position).

Multi-dimensional arrays are a generalization of one-dimensional ones. A two-dimensional array (or matrix) can be viewed as a one-dimensional array whose elements are themselves one-dimensional arrays.

Array Operations (representation, placement)

The basic operation is direct access using the array name and an index value. Since arrays usually occupy sequential memory locations, inserting or deleting an element at an arbitrary position is inconvenient: depending on the size, many elements have to be shifted (and indexes changed), so performance can suffer.

As mentioned earlier (in the Data structures post), we can run into issues if the element size doesn’t match the memory word size.

No filling:
|00000000|00001111|11112222|22223333| 

Filling:
|00000000|0000----|11111111|1111----|  

Padding:
|0000|0000|0000|1111|1111|2222|2222|3333|  

* 0, 1, 2, 3 = elements
* - = padding/filling 

In that case, we have 3 options:

  1. no filling : optimal space usage, but elements can span memory words, so access needs multiple word reads, offsets and extractions, and performance suffers
  2. filling : the remaining space in the last memory word is left unused. Space usage is not optimal, but access is faster. We can calculate the usage as the ratio between the element size and the space reserved for it (s = element size in memory words): Su = s/⌈s⌉
    If s > 0.5 we rely on filling; if it’s < 0.5, we turn to padding
  3. padding : placing multiple elements in one memory word. If we have k elements in one memory word, usage: Su = ks/⌈ks⌉ where k = ⌊1/s⌋
    Here, the remainder until the next memory word is left empty.
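
A quick numeric example: with a 32-bit memory word and a 24-bit element, s = 24/32 = 0.75 > 0.5, so we fill: Su = s/⌈s⌉ = 0.75/1 = 75%, and 8 bits of each word stay unused. With a 12-bit element, s = 0.375 < 0.5, so we pad: k = ⌊1/s⌋ = 2 elements per word, Su = ks/⌈ks⌉ = 0.75/1 = 75%, with the remaining 8 bits left empty (filling alone would give only 37.5% here).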

Since memory is a linear structure, a vector maps onto it naturally, but a matrix (X[1:M, 1:N]) has to be converted to a linear form. We can do this either by rows (C, Pascal) or by columns (FORTRAN). With rows, we run through the column index first (N), slowly shifting the row index (M). With the column approach it’s the other way around: M first, N second.
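
Just as a sketch of the mapping (assuming 0-based indexing and elements of one word each): the offset of X[i][j] in an M×N matrix is i*N + j for row-major layout and j*M + i for column-major. A quick C check, since C itself stores 2-D arrays by rows:

#include <stdio.h>

#define M 3
#define N 4

/* Offset (in elements) of X[i][j] for a row-major M x N matrix. */
static size_t row_major_offset(size_t i, size_t j) { return i * N + j; }

/* Offset (in elements) of X[i][j] for a column-major M x N matrix. */
static size_t col_major_offset(size_t i, size_t j) { return j * M + i; }

int main(void) {
    int x[M][N];
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = (int)row_major_offset(i, j);

    /* Viewing the same memory as a flat vector confirms the row-major formula:
       it prints 0 1 2 ... 11 in memory order. */
    const int *flat = (const int *)x;
    for (size_t k = 0; k < M * N; k++)
        printf("%d ", flat[k]);
    printf("\n");

    printf("column-major offset of X[2][1] would be %zu\n", col_major_offset(2, 1));
    return 0;
}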

I don’t want to go into the math details, but the way arrays are laid out can be useful to know. Programmers can’t affect the layout itself, but knowing how it works for a specific programming language lets them write more efficient code. There’s a good example of this below.

We’re summing a matrix and each element fits in one memory word. The matrix is stored in memory by rows. A cache of 256 memory words is divided into 16 blocks of 16 words, and the first matrix element maps to the first word of the first block. Which of the two sums is more efficient?

SUM A:
sum=0
for i = 1 to 32 do
    for j = 1 to 16 do
        sum = sum + M[i,j]
    end_for
end_for

SUM B:
sum=0
for j = 1 to 16 do
    for i = 1 to 32 do
        sum = sum + M[i,j]
    end_for
end_for 
In SUM A, each miss brings an entire block (a row, in this case) into the cache, so there’s one miss per row: 32 misses. The total number of accesses is 32 × 16 = 512. Hit ratio: (512 – 32) / 512 = 93.75%.
In SUM B it’s a disaster: only one element of each fetched block gets used before the block is evicted, so essentially every access is a miss. We keep emptying the cache and fetching blocks just to read one matrix element. Hit ratio: ~0%.
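
The same two traversal orders as a C sketch (32×16 matrix, as in the example; the timing difference only becomes visible on matrices that don’t fit in cache, but the access pattern is the point):

#include <stdio.h>

#define ROWS 32
#define COLS 16

/* SUM A: row-by-row traversal; matches the row-major layout, cache friendly. */
static long sum_a(int m[ROWS][COLS]) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

/* SUM B: column-by-column traversal; jumps a whole row between accesses. */
static long sum_b(int m[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}

int main(void) {
    static int m[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 1;
    /* Same result, very different access pattern (and hit ratio) on large matrices. */
    printf("%ld %ld\n", sum_a(m), sum_b(m));
    return 0;
}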

Additional Placement Optimizations

On a matrix where a significant share of the elements are 0 we can optimize the storage. Two examples: triangular matrices and sparse arrays/matrices.

A triangular matrix (X[1:n, 1:n]) has X[i,j] = 0 for i < j (lower triangular) or i > j (upper triangular). With <= or >= instead, it’s strictly lower or strictly upper triangular, respectively. By storing only the non-zero triangle, we can save roughly 50% of the space.
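
A sketch of the packed layout in C, assuming 0-based indices and a lower triangular matrix stored row by row in a vector of n(n+1)/2 elements; element (i, j) with j <= i then sits at offset i(i+1)/2 + j:

#include <stdio.h>

#define N 4

/* Lower triangular N x N matrix packed row by row into N(N+1)/2 slots.
   Row i contributes i+1 elements, so row i starts at offset i*(i+1)/2. */
static size_t tri_offset(size_t i, size_t j) {   /* requires j <= i */
    return i * (i + 1) / 2 + j;
}

int main(void) {
    double packed[N * (N + 1) / 2] = {0};

    packed[tri_offset(2, 1)] = 3.5;   /* X[2][1] = 3.5 */
    packed[tri_offset(3, 3)] = 1.0;   /* X[3][3] = 1.0 */

    /* Reading X[i][j]: from the packed vector on/below the diagonal, 0 above it. */
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++)
            printf("%4.1f ", j <= i ? packed[tri_offset(i, j)] : 0.0);
        printf("\n");
    }
    return 0;
}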

Sparse arrays/matrices have a large share of zero-valued elements (e.g. >80%). I didn’t use a large matrix, just a small one for the sake of the example:

    7 0 0
M = 0 2 9
    0 6 0

We can convert it to one record:

R   C   V
1   1   7
2   2   2 
2   3   9 
3   2   6 

or 3 independent vectors:

R     C    V
1 ->  1 -> 7
2 ->  2 -> 2 
      3 -> 9 
3 ->  2 -> 6 

Not an ideal example, but you get the picture. In the second form, R holds indices into C (the column indices), and C holds indices into V (the non-zero values). We no longer have direct access, but we do get better space usage. On large matrices that can be a significant saving.
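
A minimal C sketch of the first (triplet) form for the 3×3 example above; the price of the saved space is that reading M[i][j] becomes a scan instead of a direct access:

#include <stdio.h>

/* One non-zero element: its row, column and value. */
struct triplet {
    int    row, col;
    double val;
};

int main(void) {
    /* The 3x3 example matrix has 4 non-zero elements out of 9. */
    struct triplet m[] = {
        {1, 1, 7.0},
        {2, 2, 2.0},
        {2, 3, 9.0},
        {3, 2, 6.0},
    };
    size_t nnz = sizeof m / sizeof m[0];

    /* No direct access any more: reading M[i][j] means scanning the triplets. */
    int i = 2, j = 3;
    double value = 0.0;                 /* anything not listed is implicitly 0 */
    for (size_t k = 0; k < nnz; k++)
        if (m[k].row == i && m[k].col == j)
            value = m[k].val;
    printf("M[%d][%d] = %g\n", i, j, value);
    return 0;
}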

Linear Data Structures: Linked lists

Linked lists remove some of the problems arrays have. Access is indirect, but the size is flexible/dynamic. Arrays have elements placed sequentially in memory; linked lists can have elements anywhere in memory. That means there must be a way to maintain the order of elements, and it’s achieved by storing the explicit address of the next element inside each element. So each element, or node, contains a pointer (e.g. next) besides its data. Access to the list goes through one external pointer tied to the first node (the pointer itself is not part of the list). An empty list is null/nil (list = null). The basic example of a singly linked list:

List -> Data | Next -> Data | Next -> Data | Null

A list can be ordered or unordered (by some internal parameter/field). Lists whose elements point to both the successor and the predecessor are called doubly linked lists. There are also circular linked lists (the successor of the last node points to the first node in the list).

Usually, lists are homogeneous, but not exclusively.

Linked List Operations

Lists are generated dynamically, and that requires mechanisms to allocate and deallocate nodes and space. One approach is to keep the free nodes and the list nodes in separate lists. I guess it’s more performant, avoiding the cost of freeing and allocating memory over and over, but I’m not fully convinced the gain is significant in general. Pseudo-code works best for showing the principle:

GET_NODE:
if (avail = null) then
    ERROR
end_if
q = avail
avail = next(avail)
next(q) = nil
return q

FREE_NODE(p):
next(p) = avail
avail=p
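
The same idea as a C sketch, assuming a fixed pool of nodes threaded onto an avail list up front; get_node and free_node then just move nodes between the avail list and the caller, with no malloc/free in the hot path:

#include <stdio.h>
#include <stdlib.h>

struct node {
    int          data;
    struct node *next;
};

#define POOL_SIZE 64
static struct node pool[POOL_SIZE];
static struct node *avail;              /* head of the free-node (avail) list */

/* Thread the whole pool onto the avail list once, up front. */
static void init_pool(void) {
    for (int i = 0; i < POOL_SIZE - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    avail = &pool[0];
}

/* GET_NODE: take the first free node off the avail list. */
static struct node *get_node(void) {
    if (avail == NULL) {
        fprintf(stderr, "node pool exhausted\n");
        exit(EXIT_FAILURE);
    }
    struct node *q = avail;
    avail = avail->next;
    q->next = NULL;
    return q;
}

/* FREE_NODE: return a node to the front of the avail list. */
static void free_node(struct node *p) {
    p->next = avail;
    avail = p;
}

int main(void) {
    init_pool();
    struct node *n = get_node();
    n->data = 42;
    printf("%d\n", n->data);
    free_node(n);
    return 0;
}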

Operations are mostly conducted through pointers, since they maintain the list’s integrity.

Linked list insert

Adding a node after a specific node p:

INSERT_AFTER (p, new_data)
q = GETNODE
data(q) = new_data 
next(q) = next(p)
next(p) = q

Adding a node logically before node p (the new node is linked after p and the data is swapped, so no pointer to p’s predecessor is needed):

INSERT_BEFORE (p, new_data)
q = GETNODE
next(q) = next(p)
data(q) = data(p)
data(p) = new_data
next(p) = q
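
And a C sketch of the two inserts (using plain malloc here, so the block stays self-contained); the before-variant physically links the new node after p and swaps the data, which is what makes it work without a pointer to p’s predecessor:

#include <stdio.h>
#include <stdlib.h>

struct node {
    int          data;
    struct node *next;
};

static struct node *new_node(int data) {
    struct node *q = malloc(sizeof *q);
    if (!q) { perror("malloc"); exit(EXIT_FAILURE); }
    q->data = data;
    q->next = NULL;
    return q;
}

/* Link a new node directly after p. */
static void insert_after(struct node *p, int data) {
    struct node *q = new_node(data);
    q->next = p->next;
    p->next = q;
}

/* Logically insert before p: physically insert after p, then swap the data. */
static void insert_before(struct node *p, int data) {
    struct node *q = new_node(p->data);   /* q takes over p's old content */
    q->next = p->next;
    p->data = data;                       /* p now holds the new content  */
    p->next = q;
}

int main(void) {
    struct node *list = new_node(10);
    insert_after(list, 30);
    insert_before(list->next, 20);        /* list is now 10 -> 20 -> 30 */
    for (struct node *p = list; p; p = p->next)
        printf("%d ", p->data);
    printf("\n");
    return 0;
}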

Linked list delete

DELETE_AFTER (p)
q = next(p)
next(p) = next(q)
FREENODE(q)

Deleting the pointed-to node p itself (e.g. the first node, where there is no predecessor), by copying the successor’s data:

DELETE (p)
q = next(p)
next(p) = next(q)
data(p) = data(q)
FREENODE(q)

Linked list search

Find the node with the specific content:

SEARCH (list, data)
p = list
while ( p != null ) and ( data(p) != data ) do
    p = next(p)
end_while
return p

In an unordered list, we have to go through all of the elements. In an ordered one, we can stop earlier (here data is an integer):

SEARCH_ORD (list, data)
p = list
while ( p != null ) and ( data(p) < data ) do
    p = next(p)
end_while
if ( p != null ) and ( data(p) > data ) then
    p = null
end_if
return p

Linked list invert

INVERTING (list)
p = list
q = null
while ( p != null ) do
     r = q
     q = p
     p = next(p)
     next(q) = r
end_while
list = q

Linked list concatenation

CONCATENATION ( list1, list2 )
if ( list1 = null ) then
    list1 = list2
    return
else if ( list2 = null ) then
    return
end_if
p = list1
while ( next(p) != null ) do
    p = next(p)
end_while
next(p) = list2

Circular linked list insert

INSERT_AFTER_C (p, data)
q = GETNODE
data(q) = data
next(q) = next(p)
next(p) = q
if ( list = p ) then
    list = q
end_if

Circular linked list concatenation

CONCATENATION_C ( list1, list2 )
if ( list1 = null ) then
     list1 = list2
     return
else if ( list2 = null ) then
     return
end_if
p = next(list1)
next(list1) = next(list2)
next(list2) = p
list1 = list2

Searching a circular list can put us in an infinite loop on an unsuccessful search. To avoid that, we add one header node at the front of the list, with some distinctive content that can’t appear in the “valid” nodes.

Inserting a node:

INSERT_C_H (list, data)
p = GETNODE
data(p) = data
next(p) = next(list)
next(list) = p

Doubly linked list

The disadvantage of a circular list is that you have to go through the entire list to reach the previous node. A doubly linked list offers a solution for that: each node is tied to both its predecessor and its successor. It can have a header, or be circular, too. The operations are a bit more complex:

INSERT_AFTER_D (p, data)
q = GETNODE
r = next(p)
data(q) = data
prev(q) = p
next(q) = r
next(p) = q
prev(r) = q

DELETE_D (p)
q = prev(p)
r = next(p)
next(q) = r
prev(r) = q
FREENODE(p)

Linked Lists vs Sequential Lists

Although we need additional pointers in each node (using more space), it’s a relatively small overhead. There are numerous potential uses, e.g. vector implementation of linked lists, sparse arrays implemented via linked lists, set representations, polynomial representation, etc.

Linear Data Structures: Stack

A stack is a dynamic (usually homogeneous) structure; it’s basically a linear list with a specific access discipline: elements are added/removed only from one end (LIFO – Last In First Out).

The bottom of the stack is fixed. Since only the top moves, we need to keep track of it (the stack pointer). Creating a stack S of maximum size n:

INIT_S (S, n)
ALLOCATE(S[1:n])
top[S] = 0

To check if a stack is empty:

STACK_EMPTY (S)
if ( top[S] = 0 ) then
    return true
else
    return false
end_if

Adding an element:

PUSH (S, data)
if ( top[S] = n ) then
    ERROR ( Overflow )
else
    top[S] = top[S] + 1
    S[top[S]] = data
end_if

Removing top element:

POP (S)
if ( top[S] = 0 ) then
    return UNDERFLOW
else
    x = S[top[S]]
    top[S] = top[S] - 1
    return x
end_if

Fetch a TOP element only:

TOP (S)
if ( top[S] = 0 ) then
    return UNDERFLOW
else
    return S[top[S]]
end_if
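
The same stack operations as a C sketch over a fixed-size array; here top holds the number of stored elements and a boolean return value stands in for the ERROR routine:

#include <stdio.h>
#include <stdbool.h>

#define STACK_MAX 100

struct stack {
    int data[STACK_MAX];
    int top;                          /* number of elements; 0 means empty */
};

static void stack_init(struct stack *s)        { s->top = 0; }
static bool stack_empty(const struct stack *s) { return s->top == 0; }

/* PUSH: false signals overflow. */
static bool push(struct stack *s, int x) {
    if (s->top == STACK_MAX)
        return false;
    s->data[s->top++] = x;
    return true;
}

/* POP: false signals underflow, the value comes back through *out. */
static bool pop(struct stack *s, int *out) {
    if (stack_empty(s))
        return false;
    *out = s->data[--s->top];
    return true;
}

/* TOP: peek without removing. */
static bool top(const struct stack *s, int *out) {
    if (stack_empty(s))
        return false;
    *out = s->data[s->top - 1];
    return true;
}

int main(void) {
    struct stack s;
    stack_init(&s);
    push(&s, 1); push(&s, 2); push(&s, 3);
    int x;
    while (pop(&s, &x))
        printf("%d ", x);             /* prints 3 2 1 */
    printf("\n");
    return 0;
}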

We can implement numerous stacks sequentially within one linear structure. As a simple example, we can keep two stacks in one vector: the first stack’s bottom is tied to the bottom of the vector (growing upwards) and the second stack’s bottom is tied to the vector’s top (growing downwards). We can have as many stacks as we can logically place in a vector; in a vector of size 20, we can place 5 stacks (of size 4). We can optimize further: for instance, when one of those 5 stacks gets full, we can shift the stacks above it (moving their elements upwards) to make space for an additional element. I won’t write those operations/algorithms down, but they’re out there.

Personally I wouldn’t want to complicate my life this way, but there is some logic to it. If we’re short on space, implementing the two stacks as independent vectors would allocate twice the memory. If those stacks don’t grow much, we could get significant savings. Nowadays memory is relatively cheap, but this might still be handy (under certain limits and conditions).

Stack Usages

Typical examples: compilers, operating systems, arithmetic expression processing, subroutine handling.

In arithmetic expressions, we have binary operators (an operand on each side) and unary operators (operand in front/behind). That’s infix notation, and it’s not really suitable for compilers: in expressions with multiple operators and operands we can’t simply determine what belongs where. E.g. in A*B+C, is B an operand of * or of +? We can use parentheses to make it explicit, and, to reduce the number of parentheses, we can rely on operator priorities.

Determining the order of operator execution in subexpressions, based on priority and nesting level, requires the compiler to go through an infix expression multiple times. That’s the main disadvantage of infix notation.

Further on, we can use prefix and postfix notation (the operator placed before or after its operands): A+B => +AB or AB+.

To convert an infix expression to postfix:

  • fully parenthesize the expression
  • replace each right parenthesis with the operator of its subexpression
  • remove the left parentheses

The same goes for prefix (just mirrored, moving the operators to the left parentheses).

Calculating the expression in postfix notation:

EVAL_EXP ( postfix )
INIT_STACK(S, n)
while ( not_end_of postfix ) do
    x = INPUT ( postfix )
    if ( x = operand ) then
        PUSH(S, x)
    else if ( x = un_op ) then
        oprnd = POP(S)
        res = x oprnd
        PUSH(S, res)
    else if ( x = bin_op ) then
        oprnd2 = POP(S)
        oprnd1 = POP(S)
        res = oprnd1 x oprnd2
        PUSH(S, res)
    end_if
end_while
res = POP(S)
if ( STACK_EMPTY(S) ) then
    return res
else
    ERROR ( INCORRECT EXPRESSION )
end_if

Algorithm example: A + B / C ^ D - E * F
7 + 8 / 2 ^ 3 - 1 * 2

X       operand1      operand2      res       S
7                                             7
8                                             7 8
2                                             7 8 2
3                                             7 8 2 3
^          2             3           8        7 8 8
/          8             8           1        7 1
+          7             1           8        8
1                                             8 1
2                                             8 1 2
*          1             2           2        8 2
-          8             2           6        6

So, the result of previous expression is 6.
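
A C sketch of the evaluator for a space-separated postfix string with non-negative integer operands and the binary operators + - * / ^ (no unary operators and barely any error handling, just to mirror the table above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <math.h>

#define MAX 64

/* Evaluate a space-separated postfix expression with the binary operators + - * / ^ . */
static double eval_postfix(const char *expr) {
    double stack[MAX];
    int top = 0;
    char buf[256];
    strncpy(buf, expr, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        if (isdigit((unsigned char)tok[0])) {
            stack[top++] = atof(tok);          /* operand: push it */
        } else {
            double b = stack[--top];           /* binary operator: pop two ... */
            double a = stack[--top];
            double r = 0;
            switch (tok[0]) {
            case '+': r = a + b; break;
            case '-': r = a - b; break;
            case '*': r = a * b; break;
            case '/': r = a / b; break;
            case '^': r = pow(a, b); break;
            }
            stack[top++] = r;                  /* ... and push the result */
        }
    }
    return stack[--top];
}

int main(void) {
    /* 7 + 8 / 2 ^ 3 - 1 * 2 in postfix form, as in the table above. */
    printf("%g\n", eval_postfix("7 8 2 3 ^ / + 1 2 * -"));   /* prints 6 */
    return 0;
}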

Infix to Postfix Conversion

An efficient algorithm that goes through the expression just once, with linear complexity – O(n):

IN2POST (infix, postfix)
INIT_STACK (S, n)
rank = 0
next = INPUT(infix)
while (next) do
   if ( next = operand ) then
       OUTPUT (next, postfix)
       rank = rank + 1
   else
        while ( not STACK_EMPTY(S) ) and ( IPR(next) <= SPR(TOP(S)) ) do
           x = POP(S)
           OUTPUT(x, postfix)
           rank = rank + R(x)
           if ( rank < 1 ) then
                ERROR (  INCORRECT EXPRESSION )
           end_if
       end_while
       if ( STACK_EMPTY (S) or ( next != ')' )) then
           PUSH(S, next)
       else
           x = POP(S)
      end_if
  end_if
  next = INPUT ( infix )
end_while
while ( not STACK_EMPTY(S) ) do
    x = POP(S)
    OUTPUT (x, postfix)
    rank = rank + R(x)
end_while
if ( rank != 1 ) then
    ERROR ( INCORRECT EXPRESSION )
end_if

Since there’s no requirement to write parentheses, we need to define operator priorities.

operator    IPR     SPR      R
  +,-        2       2       -1
  *,/        3       3       -1
   ^         5       4       -1
   (         6       0        -
   )         1       0        -

IPR – input priority (when the operator arrives at the stack), SPR – stack priority (once it’s already on the stack), R – rank adjustment

A + B * C - ( D + E - F ^ G ^ H ) * ( I + ( J + K ) * L ) / M

next      S        postfix                   rank
A                  A                          1
+         +        A                          1 
B         +        AB                         2
*         +*       AB                         2
C         +*       ABC                        3
-         -        ABC*+                      1
(         -(       ABC*+                      1
D         -(       ABC*+D                     2
+         -(+      ABC*+D                     2
E         -(+      ABC*+DE                    3 
-         -(-      ABC*+DE+                   2
F         -(-      ABC*+DE+F                  3
^         -(-^     ABC*+DE+F                  3
G         -(-^     ABC*+DE+FG                 4 
^         -(-^^    ABC*+DE+FG                 4 
H         -(-^^    ABC*+DE+FGH                5
)         -        ABC*+DE+FGH^^-             2
*         -*       ABC*+DE+FGH^^-             2 
... 

For the expression A B C * + D / , a machine code equivalent (not optimal):

LOAD B
MUL C
STORE T1
LOAD A
ADD T1
STORE T2
LOAD T2
DIV D
STORE T3

Postfix notation maps more naturally onto a stack machine. The optimal version:

PUSH A
PUSH B
PUSH C
MUL
ADD
PUSH D
DIV

Recursion (subroutine calls and returns) is another major stack usage, but that’s a topic of its own.

Linear Data Structures: Queue

Similar to the stack, it’s a linear dynamic structure, but with two access points: elements are added at one end (rear) and removed from the other (front) – FIFO, First In First Out. As with the stack, it’s usually a homogeneous structure.

Sequential / Vector representation

The sequential representation is implicitly homogeneous and limited in size. There are two pointers, front and rear. One implementation we can rely on is to move all elements down after each delete operation. A more optimal approach is to delay any shifting until the rear reaches the end of the vector. An even better implementation is a circular buffer (empty: front = rear = 0). If front <= rear, the number of elements is rear - front + 1; if front > rear, it’s n - front + rear + 1.

|   |   |   |   | 2 | 0 | 3 |   |
                  f       r
n=8, f=5, r=7 => front < rear => rear-front+1 => 7 - 5 + 1 => 3

 | 8 | 3 |   |   | 2 | 0 | 3 | 1 |
       r           f       
n=8, f=5, r=2 => front > rear => n - front + rear + 1 => 8 - 5 + 2 + 1 => 6 

The usual operations: creation, insert, delete, reading, and checking whether the queue is empty. Initializing a queue of size n:

INIT_QUEUE (Q, n)
ALLOCATE(Q[1:n])
front[Q] = rear[Q] = 0

The insert operation first moves the rear pointer to the next position. If rear = front, the queue is full; otherwise the element is added. One specific situation is inserting into an empty queue, when front has to be set to 1.

INSERT (Q, p)
rear[Q] = rear[Q] mod n + 1
if ( front[Q] == rear[Q] ) then
    ERROR ( Overflow )
else
    Q[rear[Q]] = p
    if ( front[Q] == 0 ) then
        front[Q] = 1
    end_if
end_if

For deletion, we first check if the queue is empty (front == 0). If not, we return the element and move the front forward; if front == rear (we just removed the last element), we set both back to 0.

DELETE (Q)
if ( front[Q] == 0 ) then
    return ERROR (underflow)
else
    e = Q[front[Q]]
    if ( front[Q] = rear[Q] ) then
        front[Q] = rear[Q] = 0
    else
        front[Q] = front[Q] mod n + 1
    end_if
    return e
end_if

Checking if queue is empty:

QUEUE_EMPTY (Q)
if ( front[Q] == 0 ) then
    return true
else
    return false
end_if

Getting the front element without removal:

FRONT (Q)
if ( front[Q] == 0 ) then
    return ERROR (underflow)
else
    return Q[front[Q]]
end_if

Note: in some languages the vector’s lower bound is 0 (C/C++), so the condition front = rear = 0 can’t be used as the criterion for an empty queue. One solution is to let front point to the position just before the actual front element, while rear keeps pointing to the last one. Empty queue: front = rear = n - 1. In that case:

QUEUE_EMPTY_1 (Q)
if ( front[Q] == rear[Q] ) then
    return true
else
    return false
end_if

DELETE_1 (Q)
if ( front[Q] == rear[Q] ) then
    return ERROR (underflow)
else
    front[Q] = ( front[Q] + 1 ) mod n
    return Q[front[Q]]
end_if

There’s one disadvantage again: when the queue gets full, front = rear, so a full queue looks the same as an empty one. To circumvent this, we don’t allow the queue to hold more than n - 1 elements. Full queue: front = (rear + 1) mod n.

INSERT_1 (Q, x)
rear[Q] = ( rear[Q] + 1 ) mod n
if ( front[Q] = rear[Q] ) then
    ERROR (overflow)
else
    Q[rear[Q]] = x
end_if
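
A C sketch of the 0-based variant from the note: front points one slot before the logical front, at most N-1 elements are stored, and the overflow check is done before moving rear so a failed insert leaves the queue untouched:

#include <stdio.h>
#include <stdbool.h>

#define N 8   /* capacity of the buffer; at most N-1 elements can be stored */

struct queue {
    int data[N];
    int front;          /* one position before the logical front element */
    int rear;           /* position of the last element */
};

static void queue_init(struct queue *q)        { q->front = q->rear = N - 1; }
static bool queue_empty(const struct queue *q) { return q->front == q->rear; }

/* INSERT: false signals overflow (queue already holds N-1 elements). */
static bool enqueue(struct queue *q, int x) {
    if ((q->rear + 1) % N == q->front)
        return false;
    q->rear = (q->rear + 1) % N;
    q->data[q->rear] = x;
    return true;
}

/* DELETE: false signals underflow, the value comes back through *out. */
static bool dequeue(struct queue *q, int *out) {
    if (queue_empty(q))
        return false;
    q->front = (q->front + 1) % N;
    *out = q->data[q->front];
    return true;
}

int main(void) {
    struct queue q;
    queue_init(&q);
    for (int i = 1; i <= 5; i++)
        enqueue(&q, i);
    int x;
    while (dequeue(&q, &x))
        printf("%d ", x);            /* prints 1 2 3 4 5 */
    printf("\n");
    return 0;
}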

Double Ended Queue

Inserts and deletions are conducted from both ends, so front and rear lose their strict meaning. There are a couple of variants: no restrictions, input-restricted deque (insertion at one end only) and output-restricted deque (deletion at one end only).

Linked list queue

An external pointer points to the first node of the list (the front); a separate rear pointer tracks the last node. The insert operation creates a new node, points the current rear’s next to it, and then moves the rear pointer onto the new node.

INSERT_L (q, x)
p = GETNODE
info(p) = x
next(p) = nil
if ( rear[q] == nil ) then
    q = p
else
    next( rear[q] ) = p
end_if
rear[q] = p

Deletion removes an element and reassigns the pointer(s):

DELETE_L (q)
if ( q == nil ) then
    ERROR (underflow)
else
    p = q
    x = info(p)
    q = next(p)
    if ( q == nil ) then
        rear[q] = nil
    end_if
    FREENODE(p)
    return x
end_if

Priority Queue

This is a structure in which the order of elements (by content, by value) decides which element gets removed. It’s necessary to have a value/field (or several of them) on which elements can be compared; this defines the “priority” of an element. Two types of priority queues:

  • max-priority
    • inserts are arbitrary, but deletion always removes the maximum-valued element
  • min-priority
    • inserts are arbitrary, but deletion always removes the minimum-valued element

As for the implementation, there are different options with different efficiencies: linear structures, sequential or linked lists.
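
As a minimal sketch, a min-priority queue over a plain vector in C: O(1) insert and O(n) remove-min by linear scan (good enough for small n; more efficient structures exist, but they’re beyond the linear ones covered here):

#include <stdio.h>

#define MAX 64

/* Min-priority queue over a plain vector: O(1) insert, O(n) remove-min. */
struct pq {
    int data[MAX];
    int size;
};

/* Returns 0 on success, -1 on overflow. */
static int pq_insert(struct pq *q, int x) {
    if (q->size == MAX)
        return -1;
    q->data[q->size++] = x;
    return 0;
}

/* Remove and return the smallest (highest priority) element.
   The caller must make sure the queue isn't empty. */
static int pq_remove_min(struct pq *q) {
    int min_i = 0;
    for (int i = 1; i < q->size; i++)
        if (q->data[i] < q->data[min_i])
            min_i = i;
    int min = q->data[min_i];
    q->data[min_i] = q->data[--q->size];   /* fill the hole with the last element */
    return min;
}

int main(void) {
    struct pq q = { .size = 0 };
    pq_insert(&q, 7);
    pq_insert(&q, 2);
    pq_insert(&q, 5);
    while (q.size > 0)
        printf("%d ", pq_remove_min(&q));   /* prints 2 5 7 */
    printf("\n");
    return 0;
}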

Conclusion

Linear data structures are the building blocks of everything: whether you use them in their raw form, through a library, on some distributed system (semaphores) or in your buffer-overflow reverse engineering stuff, they’re an unavoidable part of computing. This was a rough walkthrough, mixing a little bit of everything. It was mostly theory (pseudocode aside), but we’ll most likely get back to this at some point, adding practical examples in C/C++ or Java.