Lecture 34: Sorting (III)

1 SELECTION

Suppose that we want to find the kth smallest key in a list. In other words, we want to know which item has index j if the list is sorted (where j = k - 1). We could simply sort the list, then look up the item at index j. But if we don't actually need to sort the list, is there a faster way? This problem is called selection.

One example is finding the median of a set of keys. In an array of n items, we are looking for the item whose index is j = floor(n / 2) in the sorted list.

2 Quickselect

What if we don't use quickselect here, and use heapsort instead?

  1. Heapsort can extract the maximum (or minimum) very quickly, so
  2. to find the kth largest item among n elements:
  3. first build a heap, which takes n time;
  4. then remove the maximum and rebalance the heap, which takes log n time;
  5. repeat the previous step k times;
  6. total: n + k log n.
[Extension] The overall complexity can actually be improved to n + k log k.

The idea is to modify this method so that, instead of repeatedly updating the heap that holds all n elements, you work with an auxiliary heap that holds at most k elements at a time, where k is the rank of the item you're looking for.
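
A minimal Java sketch of that idea follows; this is my own illustration (the class and method names are hypothetical), not the lecture's code, with java.util.PriorityQueue standing in for the small auxiliary heap. Bottom-up heapify builds a max-heap over all n keys in O(n) time; each of the k extractions then pops the largest remaining candidate and pushes its two children, so the auxiliary heap never holds more than about k entries and the total cost is O(n + k log k).

import java.util.PriorityQueue;

public class HeapSelect {

  // Find the kth largest of a[0..n-1] in O(n + k log k) time.
  // Assumes 1 <= k <= a.length.
  public static int kthLargest(int[] a, int k) {
    int n = a.length;
    int[] heap = a.clone();

    // Bottom-up heapify: builds a max-heap in O(n) time.
    for (int i = n / 2 - 1; i >= 0; i--) {
      sink(heap, i, n);
    }

    // Auxiliary max-heap of candidate indices into the big heap,
    // ordered by key.  It never holds more than k + 1 indices.
    PriorityQueue<Integer> aux =
        new PriorityQueue<>((x, y) -> Integer.compare(heap[y], heap[x]));
    aux.add(0);                        // start at the root

    int result = heap[0];
    for (int step = 0; step < k; step++) {
      int i = aux.poll();              // largest remaining candidate
      result = heap[i];                // the (step + 1)th largest key
      int left = 2 * i + 1, right = 2 * i + 2;
      if (left < n)  aux.add(left);    // its children become candidates
      if (right < n) aux.add(right);
    }
    return result;
  }

  // Standard percolate-down for a max-heap stored in an array.
  private static void sink(int[] h, int i, int n) {
    while (2 * i + 1 < n) {
      int child = 2 * i + 1;
      if (child + 1 < n && h[child + 1] > h[child]) child++;
      if (h[i] >= h[child]) break;
      int t = h[i];  h[i] = h[child];  h[child] = t;
      i = child;
    }
  }
}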

What about using quicksort to sort the list first?
That is clearly worse: sorting takes n log n time.

We can modify quicksort to perform selection for us. Observe that when we choose a pivot v and use it to partition the list into three lists I1, Iv, and I2, we know which of the three lists contains index j, because we know the lengths of I1 and I2. Therefore, we only need to search one of the three lists.

Here's the quickselect algorithm for finding the item at index j - that is, having the (j + 1)th smallest key.

Start with an unsorted list I of n input items.
Choose a pivot item v from I.
Partition I into three unsorted lists I1, Iv, and I2.
  - I1 contains all items whose keys are smaller than v's key.
  - I2 contains all items whose keys are larger than v's.
  - Iv contains the pivot v.
  - Items with the same key as v can go into any of the three lists.
    (In list-based quickselect, they go into Iv; in array-based quickselect,
    they go into I1 and I2, just like in array-based quicksort.)
if (j < |I1|) {
  Recursively find the item with index j in I1; return it.
} else if (j < |I1| + |Iv|) {
  Return the pivot v.
} else {   // j >= |I1| + |Iv|.
  Recursively find the item with index j - |I1| - |Iv| in I2; return it.
}

[Attention] If you use the array-based quicksort source code (QuickSortOnArray), replace |I1| with i and |Iv| with 1; 'i' is the variable defined in that code.

The advantage of quickselect over quicksort is that we only have to make one recursive call, instead of two. Since we make at most one recursive call at every level of the recursion tree, quickselect is much faster than quicksort. I won't analyze quickselect here, but it runs in Theta(n) average time if we select pivots randomly.

We can easily modify the code for quicksort on arrays, presented in Lecture 31, to do selection. The partitioning step is done exactly according to the Lecture 31 pseudocode for array quicksort. Recall that when the partitioning stage finishes, the pivot is stored at index "i" (see the variable "i" in the array quicksort pseudocode). In the quickselect pseudocode above, just replace |I1| with i and |Iv| with 1.
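
For concreteness, here is a compact array-based quickselect in Java. It is a sketch of mine, not the Lecture 31 code: it uses a simple Lomuto-style partition with a random pivot rather than the Lecture 31 partition, but the correspondence above still holds; after partitioning, the pivot rests at index i, so |I1| is the number of items to its left and |Iv| is 1. Because the partition is done in place, the target index j never needs adjusting.

import java.util.Random;

public class Quickselect {
  private static final Random RAND = new Random();

  // Return the value that would sit at index j if a were sorted.
  // Assumes 0 <= j < a.length.
  public static int select(int[] a, int j) {
    int low = 0, high = a.length - 1;
    while (true) {
      // Swap a random pivot to the front, then partition a[low..high].
      swap(a, low, low + RAND.nextInt(high - low + 1));
      int pivot = a[low];
      int i = low;
      for (int m = low + 1; m <= high; m++) {
        if (a[m] < pivot) swap(a, ++i, m);
      }
      swap(a, low, i);                 // pivot now rests at index i

      if (j < i) {                     // j lies in I1
        high = i - 1;
      } else if (j == i) {             // j hits the pivot (|Iv| = 1)
        return a[i];
      } else {                         // j lies in I2
        low = i + 1;
      }
    }
  }

  private static void swap(int[] a, int x, int y) {
    int t = a[x];  a[x] = a[y];  a[y] = t;
  }
}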

3 A LOWER BOUND ON COMPARISON-BASED SORTING

Suppose we have a scrambled array of n numbers, with each number from 1…n occurring once. How many possible orders can the numbers be in?

The answer is n!, where n! = 1 * 2 * 3 * ... * (n-2) * (n-1) * n.

Here's why: the first number in the array can be anything from 1…n, yielding n possibilities. Once the first number is chosen, the second number can be any one of the remaining n-1 numbers, so there are n * (n-1) possible choices of the first two numbers. The third number can be any one of the remaining n-2 numbers, yielding n * (n-1) * (n-2) possibilities for the first three numbers. Continue this reasoning to its logical conclusion.

Each different order is called a permutation of the numbers, and there are n! possible permutations. (For Homework 9, you are asked to create a random permutation of maze walls.)

Observe that if n > 0,

\[ n! = 1 \cdot 2 \cdots (n-1) \cdot n \;\le\; n \cdot n \cdots n \cdot n \;=\; n^n \]

and (supposing n is even)

\[ n! = 1 \cdot 2 \cdots (n-1) \cdot n \;\ge\; \frac{n}{2} \cdot \left(\frac{n}{2}+1\right) \cdots (n-1) \cdot n \;\ge\; \left(\frac{n}{2}\right)^{n/2} \]

so n! is between \((n/2)^{n/2}\) and \(n^n\). Let's look at the logarithms of both these numbers: \(\log\big((n/2)^{n/2}\big) = (n/2)\log(n/2)\), which is in Theta(n log n), and \(\log(n^n) = n\log n\). Hence, log(n!) is also in Theta(n log n).

A comparison-based sort is one in which all decisions are based on comparing keys (generally done by "if" statements). All actions taken by the sorting algorithm are based on the results of a sequence of true/false questions. All of the sorting algorithms we have studied are comparison-based.

Suppose that two computers run the same sorting algorithm at the same time on two different inputs. Suppose that every time one computer executes an "if" statement and finds it true, the other computer executes the same "if" statement and also finds it true; likewise, when one computer executes an "if" and finds it false, so does the other. Then both computers perform exactly the same data movements (e.g. swapping the numbers at indices i and j) in exactly the same order, so they both permute their inputs in exactly the same way.

A correct sorting algorithm must generate a different sequence of true/false answers for each different permutation of 1…n, because it takes a different sequence of data movements to sort each permutation. There are n! different permutations, thus n! different sequences of true/false answers.

If a sorting algorithm asks d true/false questions, it generates <= 2^d different sequences of true/false answers. If it correctly sorts every permutation of 1…n, then n! <= 2^d, so log_2 (n!) <= d, and d is in Omega(n log n). The algorithm spends Omega(d) time asking these d questions. Hence,

==============================================================================
EVERY comparison-based sorting algorithm takes Omega(n log n) worst-case time.
==============================================================================
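
In symbols, the whole argument compresses to one chain of inequalities, using the bound \(n! \ge (n/2)^{n/2}\) derived earlier:

\[
2^d \;\ge\; n! \;\ge\; \left(\frac{n}{2}\right)^{n/2}
\quad\Longrightarrow\quad
d \;\ge\; \frac{n}{2}\log_2\frac{n}{2} \;\in\; \Omega(n \log n).
\]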

This is an amazing claim, because it doesn't just analyze one algorithm. It says that of the thousands of comparison-based sorting algorithms that haven't even been invented yet, not one of them has any hope of beating O(n log n) time for all inputs of length n.

4 LINEAR-TIME SORTING

However, there are faster sorting algorithms that can make q-way decisions for large values of q, instead of true/false (2-way) decisions. Some of these algorithms run in linear time.

The comparison-based algorithms (selection, insertion, heap, quick, and merge sort) make only 2-way decisions. The algorithms introduced below make q-way decisions.

5 Bucket Sort   QWayDecision Stable

Bucket sort is well suited to a LinkedList implementation, while counting sort is slightly better suited to an Array.

Bucket sort works well when keys are distributed in a small range, e.g. from 0 to q - 1, and the number of items n is larger than, or nearly as large as, q. In other words, when q is in O(n).

  • q: the number of possible keys (keys range from 0 to q - 1);
  • n: the number of items.

    BucketSort or Counting Sort places strict demands on the maximum value (or range) of the keys.
    BucketSort or Counting Sort works best when the range of the items' keys is about the same as the number of items.
    For example:  min(item.key) = 1;  max(item.key) = 100;  ====> q = 100
                  number(item) = 100;                       ====> n = 100
                  q = O(n): BucketSort or Counting Sort is appropriate.
    But if:       min(item.key) = 1000;  max(item.key) = 1100;  ====> q = 1100
                  number(item) = 100;                           ====> n = 100
                  q >> n: BucketSort or Counting Sort alone is wasteful; use Radix Sort instead.


5.1 Key number i goes into Queue number i.

We allocate an array of q queues (or singly-linked lists with tail references, which are basically the same thing, but we only need the queue operations), numbered from 0 to q - 1. The queues are called buckets. We walk through the list of input items, and enqueue each item in the appropriate queue: an item with key i goes into queue i.

Each item illustrated here has a numerical key and an associated value.

             -------------------------------------------------------------
       Input | 6:a | 7:b | 3:c | 0:d | 3:e | 1:f | 5:g | 0:h | 3:i | 7:j |
             -------------------------------------------------------------

                 0       1       2       3       4       5       6       7
             -----------------------------------------------------------------
Queue fronts |   .   |   .   |   *   |   .   |   *   |   .   |   .   |   .   |
             ----|-------|---------------|---------------|-------|-------|----
                 v       v               v               v       v       v
              ------- -------         -------         ------- ------- -------
              | 0:d | | 1:f |         | 3:c |         | 5:g | | 6:a | | 7:b |
              |  .  | |     |         |  .  |         |  *  | |  *  | |  .  |
              ---|--- -------         ---|---         ------- ------- ---|---
                 v       ^               v               ^       ^       v
              -------    |            -------            |       |    -------
              | 0:h |    |            | 3:e |            |       |    | 7:j |
              |  *  |    |            |  .  |            |       |    |  *  |
              -------    |            ---|---            |       |    -------
                 ^       |               v               |       |       ^
                 |       |            -------            |       |       |
                 |       |            | 3:i |            |       |       |
                 |       |            |  *  |            |       |       |
                 |       |            -------            |       |       |
                 |       |               ^               |       |       |
             ----|-------|---------------|---------------|-------|-------|----
Queue tails  |   .   |   .   |   *   |   .   |   *   |   .   |   .   |   .   |
             -----------------------------------------------------------------

When we're done, we concatenate all the queues together in order.

                 0       1       2       3       4       5       6       7
             -----------------------------------------------------------------
Queue fronts |   .   |   .   |   *   |   .   |   *   |   .   |   .   |   .   |
             ----|-------|---------------|---------------|-------|-------|----
                 v       v               v               v       v       v
              ------- -------         -------         ------- ------- -------
              | 0:d | | 1:f |         | 3:c |         | 5:g | | 6:a | | 7:b |
          ...>|  .  | |     |........>|  .  |      ..>|  *  |>|  *  |>|  .  |
              ---|--- -------         ---|---      :  ------- ------- ---|---
                 v       ^               v         :                     v
              -------    :            -------      :                  -------
              | 0:h |....:            | 3:e |      :                  | 7:j |
              |  *  |                 |  .  |      :                  |  *  |
              -------                 ---|---      :                  -------
                                         v         :
                                      -------      :
                                      | 3:i |.......
                                      |  *  |
                                      -------


Concatenated output:
------- ------- ------- ------- ------- ------- ------- ------- ------- -------
| 0:d |>| 0:h |>| 1:f |>| 3:c |>| 3:e |>| 3:i |>| 5:g |>| 6:a |>| 7:b |>| 7:j |
------- ------- ------- ------- ------- ------- ------- ------- ------- -------

This data structure is exactly like a hash table (plus tail references), but the hash code just maps the key i to bucket i, and there is no compression function because there is no need for compression.

Concatenation is just a loop over the buckets: maintain a reference to the tail of the output list so far; skip each empty bucket, and whenever you reach a non-empty bucket, splice its queue onto the end of the output and update the tail.
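
Here is a compact Java sketch of the whole algorithm; it is my own illustration (the Item class and method names are assumptions, not the lecture's code), with java.util.ArrayDeque playing the role of the list-based queues with tail references.

import java.util.ArrayDeque;

public class BucketSort {

  // One input item: an integer key in [0, q - 1] plus an associated value.
  static class Item {
    final int key;
    final String value;
    Item(int key, String value) { this.key = key; this.value = value; }
    public String toString() { return key + ":" + value; }
  }

  // Stable bucket sort in Theta(q + n) time.
  public static Item[] sort(Item[] items, int q) {
    // Theta(q): allocate one FIFO queue ("bucket") per possible key.
    @SuppressWarnings("unchecked")
    ArrayDeque<Item>[] buckets = new ArrayDeque[q];
    for (int i = 0; i < q; i++) {
      buckets[i] = new ArrayDeque<>();
    }

    // Theta(n): an item with key i goes to the BACK of queue i.
    // Enqueueing at the back is what makes the sort stable.
    for (Item it : items) {
      buckets[it.key].addLast(it);
    }

    // Theta(q + n): concatenate the queues in order, skipping empty ones.
    Item[] out = new Item[items.length];
    int j = 0;
    for (ArrayDeque<Item> bucket : buckets) {
      while (!bucket.isEmpty()) {
        out[j++] = bucket.pollFirst();
      }
    }
    return out;
  }
}

Sorting the ten items from the figure above with q = 8 yields exactly the concatenated output shown: 0:d, 0:h, 1:f, 3:c, 3:e, 3:i, 5:g, 6:a, 7:b, 7:j. Note that 3:c, 3:e, and 3:i keep their input order, which is the stability property discussed in Section 5.4.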

5.2 Running time of BucketSort

Bucket sort takes Theta(q + n) time, in the best case and in the worst case.

  • It takes Theta(q) time to initialize the buckets in the beginning and to concatenate them together in the end.
  • It takes Theta(n) time to put all the items in their buckets.

If q is in O(n) - that is, the number of possible keys isn't much larger than the number of items we're sorting - then bucket sort takes Theta(n) time.

5.3 How did we get around the Omega(n log n) lower bound on comparison-based sorting?

Bucket sort is not comparison-based. We are making a q-way decision every time we decide which queue to put an item into, instead of the true/false decisions provided by comparisons and "if" statements.

5.4 What is a Stable Sort

Bucket sort (as I've described it here) is said to be stable.

A sort is stable if items with equal keys come out in the same order they went in.

For example, observe that 3:c, 3:e, and 3:i appear in the same order in the output above as they appeared in the input.

Bucket sort is not the only stable sort we have seen; insertion sort, selection sort, and mergesort can all be implemented so that they are stable. The linked list version of quicksort we have seen can be stable, but the array version is decidedly not. Heapsort is never stable. (Actually, we can make heapsort stable using a simple trick called a secondary key, which I might describe later in the semester.)

5.5 If q is in O(n), then BucketSort; else, RadixSort

Take note that bucket sort is ONLY appropriate when keys are distributed in a small range; i.e. q is in O(n). On Monday we'll study a sorting algorithm called radix sort that will fix that limitation. The stability of bucket sort will be important for radix sort.

RadixSort uses BucketSort as a subroutine, and the stability of BucketSort is critical to the success of RadixSort.