Hyper Quicksort / Parallel Quicksort

A parallel sorting technique for sorting elements on machines connected in a hypercube topology.

Quicksort basics

One of the most commonly used sorting algorithms on sequential computers is Quicksort. The basic idea of the algorithm is to select a pivot in the input sequence and divide the sequence into two halves: the pivot is placed at its correct position, the left half contains the elements smaller than the pivot, and the right half contains the elements greater than it. The algorithm then sorts the left and right sequences recursively. You can read more here: Quicksort.
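For reference, a minimal sequential quicksort in C, along the lines of the GeeksforGeeks version linked in the references (the array contents and the Lomuto-style partition are just illustrative choices), looks roughly like this:

```c
#include <stdio.h>

/* Swap two integers in place. */
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition: put the pivot (arr[high]) at its final position and
   return that position; smaller elements end up on its left. */
static int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = low - 1;
    for (int j = low; j < high; j++)
        if (arr[j] < pivot)
            swap(&arr[++i], &arr[j]);
    swap(&arr[i + 1], &arr[high]);
    return i + 1;
}

/* Recursively sort the halves to the left and right of the pivot. */
void quicksort(int arr[], int low, int high) {
    if (low < high) {
        int p = partition(arr, low, high);
        quicksort(arr, low, p - 1);
        quicksort(arr, p + 1, high);
    }
}

int main(void) {
    int arr[] = {9, 4, 7, 1, 3, 8, 2};
    int n = sizeof(arr) / sizeof(arr[0]);
    quicksort(arr, 0, n - 1);
    for (int i = 0; i < n; i++) printf("%d ", arr[i]);
    printf("\n");
    return 0;
}
```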

Introduction and underlying assumptions for Hyper Quicksort

In the current world, many tasks are performed in parallel and in a distributed manner. Quicksort, too, can be run in parallel: sorting N numbers on a P-processor hypercube, with multiple machines working at the same time, is called Hyper Quicksort. We consider here an implementation of Hyper Quicksort under the assumption that we have set up a cluster of machines and that all machines can read and write the shared global address space asynchronously. This is known as the Shared Address Space formulation in parallel computing.

Hypercube topology is explained here. We sort N elements using P processors connected in a hypercube topology. Hence, the dimension d of the hypercube is log₂(P).

Figure1: Hypercube of 3 dimensions containing 8 processors

Algorithm explanation:

  • First, the input sequence of size N is divided equally and assigned statically to the processors, so each processor holds N/P elements.

  • Select one processor, pick one element from its sequence as the pivot, and broadcast that element to all the other processors.

  • Now repeat the following steps in a loop that runs d times (a rough code sketch of this loop is given after the list).

  • Partition the sequence into two lists B1 and B2, where elements of B1 are smaller than the pivot and elements of B2 are larger.

  • Halve the underlying topology in each iteration and give each processor a mapping to the neighbor with which it should communicate.

Figure2: Iteration-wise neighbor for each processor after halving the topology

The underlying topologies after each iteration are shown here:

Figure3: Partition of topology in iteration1

Figure4: Partitions of topology in iteration2

Figure5: Partitions of topology in iteration3

  • Now exchange the lists B1 and B2 between neighbors so that one half of the topology holds all the elements smaller than the pivot and the other half holds all the greater ones.

  • Combine the received data into one list.

  • According to the current partition of the topology, select one processor from each partition and broadcast one of its elements as the new pivot to all the other processors in that partition, unless this is the last iteration, in which case each partition contains only one processor.

  • After d iterations, each processor holds some of the elements, and going from the first processor to the last, all the elements in processor Pi are smaller than all the elements in processor Pi+1.

  • Each processor now has an unsorted list of elements, so run a simple Quicksort on each processor to sort its data locally.

  • Gather all the elements from the processors onto P0 and let P0 print the list, OR have each processor print its elements in increasing order of processor id (from P0 to Pp).
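To make the d-iteration loop above concrete, here is a minimal, untested sketch in C with MPI. It is not my actual cluster code: the buffer bound MAX_LOCAL, the toy random input, and picking a[0] as the pivot are illustrative assumptions, and it presumes a reasonably balanced pivot so the local buffers never overflow.

```c
/* hyperquicksort_sketch.c - a rough illustration of the d-iteration loop. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LOCAL 4096                     /* assumed bound on elements per process */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int d = 0;                             /* dimension: p is assumed to be 2^d */
    while ((1 << d) < p) d++;

    int n_local = 8;                       /* toy stand-in for the N/P local elements */
    int a[MAX_LOCAL];
    srand(rank + 1);
    for (int k = 0; k < n_local; k++) a[k] = rand() % 1000;

    MPI_Comm comm = MPI_COMM_WORLD;        /* communicator of the current sub-cube */
    for (int iter = 0; iter < d; iter++) {
        int sub_rank, sub_size;
        MPI_Comm_rank(comm, &sub_rank);
        MPI_Comm_size(comm, &sub_size);

        /* 1. The root of the sub-cube picks a pivot and broadcasts it. */
        int pivot = (sub_rank == 0 && n_local > 0) ? a[0] : 0;
        MPI_Bcast(&pivot, 1, MPI_INT, 0, comm);

        /* 2. Partition the local list into B1 (< pivot) and B2 (>= pivot). */
        int b1[MAX_LOCAL], b2[MAX_LOCAL], n1 = 0, n2 = 0;
        for (int k = 0; k < n_local; k++)
            if (a[k] < pivot) b1[n1++] = a[k]; else b2[n2++] = a[k];

        /* 3. The partner differs in the highest bit of the sub-cube rank:
              the lower half keeps B1 and sends B2; the upper half does the opposite. */
        int partner  = sub_rank ^ (sub_size >> 1);
        int lower    = sub_rank < (sub_size >> 1);
        int *send_buf = lower ? b2 : b1;
        int *keep_buf = lower ? b1 : b2;
        int n_send    = lower ? n2 : n1;
        int n_keep    = lower ? n1 : n2;

        int recv_buf[MAX_LOCAL], n_recv;
        MPI_Status st;
        MPI_Sendrecv(&n_send, 1, MPI_INT, partner, 0,
                     &n_recv, 1, MPI_INT, partner, 0, comm, &st);
        MPI_Sendrecv(send_buf, n_send, MPI_INT, partner, 1,
                     recv_buf, n_recv, MPI_INT, partner, 1, comm, &st);

        /* 4. Combine the kept and received chunks into one local list. */
        for (int k = 0; k < n_keep; k++) a[k] = keep_buf[k];
        for (int k = 0; k < n_recv; k++) a[n_keep + k] = recv_buf[k];
        n_local = n_keep + n_recv;

        /* 5. Split the communicator: each half becomes its own sub-cube. */
        MPI_Comm next;
        MPI_Comm_split(comm, lower ? 0 : 1, sub_rank, &next);
        if (comm != MPI_COMM_WORLD) MPI_Comm_free(&comm);
        comm = next;
    }

    /* Finally, sort locally; the ranks now hold globally ordered blocks. */
    qsort(a, n_local, sizeof(int), cmp_int);
    printf("rank %d: %d elements after local sort\n", rank, n_local);

    if (comm != MPI_COMM_WORLD) MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```

Splitting the communicator with MPI_Comm_split at every iteration is one convenient way to model the halved sub-cubes of Figures 3-5, and flipping a single bit of the rank to find the partner mirrors the hypercube neighbor mapping of Figure 2.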

Implementation

MPI in C is used to perform Hyper Quicksort. MPI is responsible for the communication among the connected machines in the cluster. The plain Quicksort code is referenced from QuickSort.

My implementation is here.

Please note that:

This approach is only applicable to multiple machines (i.e., parallel execution), and Hyper Quicksort needs a manually set-up cluster (which I fortunately had in my lab, so I wrote this code). That is, however, not always practical.

If you set up the cluster successfully, the commands to connect to the cluster and run the files are:
  • ssh hostname

On the local computer:
  • scp local_file_path hostname:remote_path

On the remote terminal:
  • mpicc filename.c -o output.o
  • mpirun -np number_of_worker_nodes output.o
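For instance, assuming Open MPI is installed and a plain-text hostfile (here called hosts, just a placeholder name) lists the worker machines, a concrete run on 8 nodes could look like this:

  • mpicc hyper_quicksort.c -o output.o
  • mpirun -np 8 --hostfile hosts output.o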

Complexity Analysis

The parallel runtime of a program depends on the parallel algorithm and the underlying platform. When we implement an algorithm in parallel, there is also additional overhead beyond computation, such as idling and inter-process communication. In the case of Hyper Quicksort, initially N/P numbers are mapped to each processor, and we assume that a balanced pivot is selected at each step of the algorithm. For an underlying hypercube topology, the algorithm includes these steps:

  1. Broadcast the pivot to all the processors - broadcasting takes O(d) time, where d is the dimension of the hypercube of P processors. The dimension of the hypercube is log₂(P) because a d-dimensional hypercube has 2^d nodes in total.

  2. Partition the elements in each processor with respect to the selected pivot - this takes O(N/P) time, since each processor holds about N/P elements.

  3. Exchange (send and receive) the chunks (the higher list B2 or the lower list B1) between neighboring processors - this takes O(N/P) time, since at most N/P elements are exchanged in each iteration.

  4. Merge the exchanged chunks - this takes O(N/P) time to rearrange the N/P elements after the exchange.

This entire process runs for d iterations before the final local quicksort:
Tp(N) = d ( O(d) + O(N/P) + O(N/P) + O(N/P) ) + local quicksort
= O(d^2) + O(d·N/P) + O(d·N/P) + O(d·N/P) + local quicksort
= O(d^2) + O(d·N/P) + local quicksort
= O((log P)^2) + O((N/P) log P) + local quicksort

Also, the local quicksort on each machine runs on at most N/P elements, so:

Tp(N) = O((log P)^2) + O((N/P) log P) + O((N/P) log(N/P))
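As a rough worked example using the numbers from the observations below: for P = 8 processors (d = 3) and N = 262144 elements, each processor starts with N/P = 32768 elements. The parallel phase costs on the order of d^2 = 9 broadcast steps plus d·(N/P) = 3 × 32768 ≈ 98,000 element operations for partitioning, exchanging, and merging, while the final local quicksort costs roughly (N/P)·log₂(N/P) = 32768 × 15 ≈ 492,000 comparisons, so for inputs this large the local sort dominates.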

Load balancing

Pivot selection is the key to load balancing here. With a good pivot selection, the load remains balanced after the exchange. Otherwise, the exchange causes a load imbalance: in the worst case, after the exchange one half of the hypercube has almost no work to do while the other half has to sort nearly all the exchanged elements.

Observations

By measuring the overall parallel execution time of this program, we can observe three things:

  • For a very large input, N = 262144, more processors yield more speedup. However, as we keep increasing the number of processors, the efficiency of the program is expected to decrease.

  • If we reduce the input to N = 65536, then with the same numbers of processors the speedup still increases, though not as sharply; the parallel system still achieves a better parallel time than with fewer processors.

  • If we have a small amount of data, N = 4096, the speedup first increases and then drops sharply because the message-passing overhead between processors dominates. Hence, the parallel execution time is considerably higher than in the other cases, and the efficiency of the system drops significantly for the same reason.

Improvements

The algorithm could distribute the data among the processors dynamically instead of assigning it statically. This would add communication overhead, but it would make the algorithm more general.

References: For the implementation of simple sequential quicksort and partition algorithm: https://www.geeksforgeeks.org/quick-sort/
