Median-finding Algorithm
Median-finding algorithms (also called linear-time selection algorithms) use a divide and conquer strategy to efficiently compute the smallest number in an unsorted list of size , where is an integer between and . Selection algorithms are often used as part of other algorithms; for example, they are used to help select a pivot in quicksort and also used to determine -order statistics such as the maximum, minimum, and median elements in a list.
Contents
Median-finding Algorithm
The problem a median-finding algorithm solves is the following:
Given an array of numbers and an index where find the smallest element of
This problem can certainly be solved using a sorting algorithm to sort a list of numbers and return the value at the index. However, many sorting algorithms can’t go faster than time. A median-finding algorithm can find the smallest element in a list in time.
In practice, median-finding algorithms are implemented with randomized algorithms that have an expected linear running time. However, this wiki will focus on the median-of-medians algorithm, which is a deterministic algorithm that runs in linear time.
The Median-of-medians Algorithm
The median-of-medians algorithm is a deterministic linear-time selection algorithm. The algorithm works by dividing a list into sublists and then determines the approximate median in each of the sublists. Then, it takes those medians and puts them into a list and finds the median of that list. It uses that median value as a pivot and compares other elements of the list against the pivot. If an element is less than the pivot value, the element is placed to the left of the pivot, and if the element has a value greater than the pivot, it is placed to the right. The algorithm recurses on the list, honing in on the value it is looking for.
Median-of-medians Algorithm[1]
The algorithm takes in a list and an index—
median-of-medians(A, i)
. Assume that all elements of are distinct (though the algorithm can be further generalized to allow for duplicate elements).
Divide the list into sublists each of length five (if there are fewer than five elements available for the last list, that is fine).
Sort each sublist and determine the median. Sorting very small lists takes linear time since these sublists have five elements, and this takes time. In the algorithm described on this page, if the list has an even number of elements, take the floor of the length of the list divided by 2 to find the index of the median.
Use the median-of-median algorithm to recursively determine the median of the set of all the medians.
Use this median as the pivot element, . The pivot is an approximate median of the whole list and then each recursive step hones in on the true median.
Reorder such that all elements less than are to the left of , and all elements of that are greater than are to the right. This is called partitioning. The elements are in no particular order once they are placed on either side of . For example, if is the list to the right of maybe look like (i.e. not in sorted order). This takes linear time since comparisons occur—each element in is compared against only.
[1] 6. Let be the “rank” of meaning, for a set of numbers is the smallest number in .
7a. If , then return .
7b. If , then recurse using median-of-medians on .
7c. If , recurse using the median-of-medians algorithm on .
Show the steps for the median-of-medians algorithm to find the third lowest score in the list, , of exam scores.
Note: Some implementations of this algorithm, like the one below, are zero-indexed, meaning that the lowest score will be the lowest score in the list. In this example, use zero-indexing—so the third lowest score will be the element from the left in a sorted list.
First, we break the list into lists of five elements:
Sort each list:
Then, get the median out of each list and put them in a list of medians,
Sort this:
Pick the median from that list—since the length of the list is 2, and we determine the index of the median by the length of the list divided by two: we get the index of the median is 1, and M[1] = 76.
Use this as the pivot element and put all elements in that are less than 76 to the left and all elements greater than 76 to the right:
Find the index of 76, which is 5. How does 5 compare with 3? Since we must recurse on the left half of the list which is .
This list is only five elements long, so we can sort it and find what is at index 3: and 43 is at index three.
So, 43 is the fourth smallest number of .
Implementation of the Median-finding Algorithm
Here is a Python implementation of the median-of-medians algorithm[2]. This implementation works on lists that have unique elements (no repeated elements).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
Note the source cited here does not have a completely correct implementation but did inspire this (better) implementation.
Using the algorithm described for the median-of-medians selection algorithm, determine what the list of medians will be on the following input:
median_of_medians(A,7)
Hint: In the code, medians
is the list of these medians.
Bonus: Feel free to run the code and add some strategic print messages.
Say you wanted to use the above implementation to find the largest element in instead of the smallest. What change could you make to the implementation (not to the function's inputs) to achieve this?
Swaplow
andhigh
in the partitioning step so that
1 2high = [j for j in A if j < pivot] low = [j for j in A if j > pivot]
Make the changes and test it out with the following test cases:
1 2 3B = [1,2,3,4,5,6] print median_of_medians(B,0) #the first largest element should be 6 print median_of_medians(B,5) #the fifth largest element should be 1 (remember 0 indexing)
Now try the next example to see how you can find the largest element by carefully selecting an value.
What could you input to the original implementation above to find the largest element in a list?
The largest element of a list will always be the "least smallest" element. For example, in a list of length the least smallest element in the list is the ninth smallest (remember zero-indexing where the zeroth smallest is the smallest element). More generally, to find the largest element in the list, callmedian_of_medians(A, len(A)-1)
.Try this out with the following test cases:
1 2 3 4 5D = [1,2,3,4,5,6] # 6 is the largest (least small) element in D print median_of_medians(D, len(D)-1) E = [9,5,4,3] #9 is the largest (least small) element in E print median_of_medians(E, len(E)-1)
How could you select an value so that you can find the largest value in without modifying the original implementation itself?
The value you input for the variable would be
len(A) - x - 1\)
, where is the number largest value you want to find.
Complexity of the Median-of-medians Algorithm
The median-of-medians algorithm runs in time. is divided into sublists of five elements each. If is the list of all of the medians from these sublists, then has —one median for each of the sublists. Let's call the median of this list (the median of the medians) . Half of the elements in are less than . Half of . For each of these elements, there are two elements that are smaller than it (since these elements were medians in lists of five elements—two elements were smaller and two elements were larger). Therefore, there are and, in the worst case, the algorithm may have to recurse on the remaining elements. The time for dividing lists, finding the medians of the sublists, and partitioning takes time, and with the recursion factored in, the overall recurrence to describe the median-of-medians algorithm is
The master theorem can be used to show that this recurrence equals .
Why 5?
The median-of-medians divides a list into sublists of length five to get an optimal running time. Remember, finding the median of small lists by brute force (sorting) takes a small amount of time, so the length of the sublists must be fairly small. However, adjusting the sublist size to three, for example, does change the running time for the worse.
If the algorithm divided the list into sublists of length three, would be greater than approximately elements and it would be smaller than approximately elements. This would cause a worst case recursions, yielding the recurrence which by the master theorem is which is slower than linear time.
In fact, for any recurrence of the form , if , the recurrence will solve to , and if , the recurrence is usually equal to . [3]
The median-of-medians algorithm could use a sublist size greater than 5—for example, 7—and maintain a linear running time. However, we need to keep the sublist size as small as we can so that sorting the sublists can be done in what is effectively constant time.
See Also
References
- Moshkovitz, D., & Tidor, B. Introduction & Median Finding. Retrieved May 30, 2016, from http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-design-and-analysis-of-algorithms-spring-2012/lecture-notes/MIT6_046JS12_lec01.pdf
- , g. python Implementing median of medians algorithm. Retrieved May 30, 2016, from https://www.reddit.com/r/learnprogramming/comments/3ld88o/pythonimplementing_median_of_medians_algorithm/
- Trevisan, L. W4231: Analysis of Algorithms. Retrieved May 31, 2016, from http://people.eecs.berkeley.edu/~luca/w4231/fall99/slides/l3.pdf