Big O Notation

Big O notation is a notation used when talking about growth rates. It formalizes the notion that two functions "grow at the same rate," or one function "grows faster than the other," and such.

It is very commonly used in computer science, when analyzing algorithms. Algorithms have a specific running time, usually declared as a function on its input size. However, implementations of a certain algorithm in different languages may yield a different function. For example, an algorithm with input size \(n\) bytes, when implemented in C++, might take time \(n^2\) microseconds; but when implemented in Python, it might take time \(1000n^2 + 1000n\) microseconds due to it being a slower language. Thus, instead, we often talk about an algorithm's growth rate; whether the algorithm only takes time proportional to its input size (linear), or the square of the size (quadratic), or perhaps it doesn't depend on its input size (constant). This is formalized in big O notation, stating that the algorithm above runs in "quadratic time," or "\(O(n^2)\)."

Another common use is when talking about approximations in power series. A power series has many terms, usually infinite, but for the sake of approximations we often don't need too many terms. For example, the sine function has power series \(\sin x = x - \frac{x^3}{6} + \frac{x^5}{120} - \frac{x^7}{5040} + \cdots\); to approximate \(\sin 0.1\) to four digits, we most likely don't need terms with exponent greater than 4, because \(0.1^5 = 0.00001\) doesn't contribute to the fourth digit. We formalize this also with big O notation; we state "\(\sin x = x - \frac{x^3}{6} + O(x^5)\)," indicating that the terms afterwards are too insignificant to matter.

Intuitive Meaning

Informally, if we have functions \(f,g\) such that \(f\) eventually grows slower than some multiple of \(g\), we say \(“f = O(g)."\)

For example, consider the functions \(f(n) = 1000n^2\) and \(g(n) = n^3\) as \(n \to \infty\). "Eventually," namely when \(n > 1000\), we see that \(f(n) < g(n)\), and thus \(f(n)\) grows slower than \(g(n)\). This means \(f = O(g)\), or \(1000n^2 = O(n^3)\).

In fact, since it says "some multiple of \(g\)," we also have \(1000n^2 = O(n^2)\); for some multiple of \(n^2\) \(\big(\)namely \(1001n^2\big),\) eventually \((\)in this case \(n > 1)\) we have \(f(n) < g(n)\).

The above examples talk when the input approaches \(\infty\), as often used in analysis of algorithms. (Algorithms can be used for arbitrarily large input size, and hence we talk about approaching infinity.) In approximations of power series, we often talk about functions as the input approaches a specific, finite value: for example, approaching 0. We can also use big O notation in this case.

For example, consider the functions \(f(n) = 1000n^2\) and \(g(n) = n\) as \(n \to 0\). "Eventually," which in this case means \(0 < n < 0.001\), we have \(f(n) < g(n)\), and thus \(f = O(g)\) as \(n \to 0\).

Formal Definition

We formalize the informal definition above, mostly by clarifying what "eventually" and "slower" means.

If \(f(x), g(x)\) are functions on the real numbers, then we say \(f = O(g)\) or \(“f\) is big-oh of \(g"\) as \(x \to \infty\) if there exist some constants \(M, c\) such that \(|f(x)| \le c |g(x)|\) for all \(x > M\).

If \(f(x), g(x)\) are functions on the real numbers, then we say \(f = O(g)\) or \(“f\) is big-oh of \(g"\) as \(x \to a\) if there exist some constants \(\delta, c\) such that \(|f(x)| \le c |g(x)|\) for all \(0 < |x-a| < \delta\).

We can verify that

for \(f(n) = 1000n^2\) and \(g(n) = n^3\) as \(n \to \infty\), we pick \(M = 1000, c = 1;\)
for \(f(n) = 1000n^2\) and \(g(n) = n^2\) as \(n \to \infty\), we pick \(M = 1, c = 1001;\)
for \(f(n) = 1000n^2\) and \(g(n) = n\) as \(n \to 0\), we pick \(\delta = 0.001, c = 1.\)

Technically, the appropriate notation is \(f \in O(g)\), because the equal sign does not imply symmetry. (It is to be read as "is" in English, instead of "is equal to.") Thus it's more correct to say \(1000n^2 \in O(n^3)\) instead of \(1000n^2 = O(n^3)\). In this way, we treat \(O(g)\) as the set of all functions \(f\) satisfying the definition above. However, the equal sign is more common, and such is an accepted abuse of notation.

Landau Notations

Big O notation has a few related notions. If \(f, g\) are functions, then informally

\(f = O(g)\) (big-oh) if eventually \(f\) grows slower than some multiple of \(g;\)
\(f = o(g)\) (little-oh) if eventually \(f\) grows slower than any multiple of \(g;\)
\(f = \Omega(g)\) (big-omega) if eventually \(f\) grows faster than some multiple of \(g;\)
\(f = \omega(g)\) (little-omega) if eventually \(f\) grows faster than any multiple of \(g;\)
\(f = \Theta(g)\) (theta) if eventually \(f\) grows at the same rate as \(g.\)

Formally,

If \(f(x), g(x)\) are functions on the real numbers, then as \(x \to \infty\), we say

\(f = O(g)\) if there exist \(M, c\) such that \(|f(x)| \le c |g(x)|\) for all \(x > M;\)

\(f = o(g)\) if for any positive \(c\), there exists \(M\) such that \(|f(x)| \le c |g(x)|\) for all \(x > M;\)

\(f = \Omega(g)\) if there exist \(M, c\) such that \(|f(x)| \ge c |g(x)|\) for all \(x > M;\)

\(f = \omega(g)\) if for any positive \(c\), there exists \(M\) such that \(|f(x)| \ge c |g(x)|\) for all \(x > M;\)

\(f = \Theta(g)\) if there exist \(M, c_1, c_2\) such that \(c_1 |g(x)| \le |f(x)| \le c_2 |g(x)|\) for all \(x > M.\)

Similar definitions can be made for the \(x \to a\) case, by replacing \(x > M\) with \(0 < |x-a| < \delta\) \((\)and asserting existence of \(\delta\) instead of \(M).\)

Properties

There are a few properties involving Landau notations, all pretty simple to prove from the definitions:

\(f = \Theta(g)\) if and only if \(f = O(g)\) and \(f = \Omega(g)\).
\(f = O(g)\) if and only if \(g = \Omega(f)\).
\(f = o(g)\) if and only if \(g = \omega(f)\).
If \(f = \Theta(g)\), then \(g = \Theta(f)\).
If \(f = O(g)\) and \(g = O(h)\), then \(f = O(h)\). If either (or both) of the big-oh is actually a little-oh (\(o\)), then \(f = o(h)\).
If \(f = \Omega(g)\) and \(g = \Omega(h)\), then \(f = \Omega(h)\). If either (or both) of the big-omega is actually a little-omega (\(\omega\)), then \(f = \omega(h)\).
If \(f = \Theta(g)\) and \(g = \Theta(h)\), then \(f = \Theta(h)\). If any one (but not both) of the theta is replaced by another notation, then the conclusion uses the same notation; for example, \(f = O(g)\) and \(g = \Theta(h)\) implies \(f = O(h)\).
If \(f = o(g)\), then \(f = O(g)\).
If \(f = \omega(g)\), then \(f = \Omega(g)\).
We have \(f = O(f), f = \Theta(f), f = \Omega(f)\), but \(f \neq o(f), f \neq \omega(f)\).

Use in Computer Science

In computer science, we often encounter algorithms to analyze. The analysis most often asks for the running time of the algorithm, but it can also be memory consumption, stack depth, and other resources, often expressed as a function of the input size.

Unfortunately, implementations of algorithms will result in different resource consumption. As given in the introduction, an example is an algorithm which, when given an input size of \(n\) bytes, can be implemented in C++ to take \(n^2\) microseconds, but in Python it's implemented to take \(1000n^2 + 1000n\) microseconds due to Python being a slower language. Not to mention that if we state the algorithm in pseudocode, it's unclear how much time it takes; it might take "\(10n^2 + 100n\) lines" to run, but we don't know how many microseconds it takes to execute a line of pseudocode.

However, for the analysis of algorithms, the most important aspect of it is how it behaves with different input size. An algorithm might take time proportional to its input size; thus, doubling the input size makes it run twice as long. Or perhaps it's proportional to its input size squared, so doubling the input size causes it to run four times as long.

To ignore the differences in implementation, we often resort to its asymptotic growth rate. Given a resource consumption of \(f\), we find a simple function \(g\) such that \(f = \Theta(g)\). Different implementations will be hidden in the \(\Theta\); the focus is now on \(g\), whether it takes linear time \(\big(g(n) = n\big),\) or quadratic time \(\big(g(n) = n^2\big),\) or some other value. The behavior on different input sizes is still retained here; doubling the input value will result in the same behavior on \(g\) \(\big(\)doubled for \(g(n) = n\), four times for \(g(n) = n^2\), and so on\(\big).\) Thus from the view point of analysis of algorithms, we don't lose any information for simplifying \(f\) into \(g\), since both give the same behavior. This allows us to compare different algorithms according to their growth rates, and claim that one algorithm is "faster" than another: if we have two algorithms with resource consumptions \(f_1, f_2\), and we have \(f_1 = \Theta(g_1), f_2 = \Theta(g_2)\), then we say the first algorithm is faster than the second if \(g_1 = o(g_2)\).

Although it's much better to say \(f = \Theta(g)\), in practice computer scientists often say \(f = O(g)\) instead even if they mean the former. This is another abuse of notation; the other Landau notations aren't used very often, and sometimes analyzing the exact runtime is so difficult that they make some simplifying assumptions even if the result becomes slightly weaker.

Also in computer science, some algorithms give different behaviors when given different input. Analysis of algorithms often asks for the worst case and the average case of the running times; given different inputs of a specific size, the worst case is the longest time taken for a specific input, while the average case is the average time taken over all inputs. Sometimes the best case running time is also analyzed, but it's much rarer than the other two.

When analyzing runtime of an algorithm, there are many common growth rates that we have specific names for. (Memory consumption and such are generally less impressive, mostly either constant, logarithmic, or linear.) Here are some of them, from fastest to slowest. In all of these, \(n\) increases without bound. This is usually the input size (in bytes or bits or digits), but sometimes when the input is a natural number, we take \(n\) as the value of this number, or when the input is a list of elements, \(n\) is the number of elements without considering how large the individual elements are.

\(O(1)\): Constant-time algorithm runs in a constant time no matter how large the input is. For example, checking whether the first byte of a file is null (0x00) is constant time; no matter how large the file is, we only need to inspect the first byte. As another example, programs that ignore their input and compute the answer to a specific problem are also constant-time, even though this problem might take a very long time.
\(O(\log n\)): Logarithmic-time algorithm runs in time proportional to the logarithm of the input. The most common example is perhaps binary search: given a sorted array (with random access) of \(n\) elements, we can find whether a specific item is in the array or not by using binary search, dividing the search space by half by checking the middle element.
\(O(n)\): Linear-time algorithm runs in time proportional to the input. This can be good or bad: when \(n\) is the number of elements in an array of integers, radix sorting allows us to sort it in linear-time, a very good performance, but when \(n\) is a positive integer and we want to check whether it's prime, doing trial division over all numbers \(2, 3, \ldots, n-1\) is a poor performance.
\(O(n \log n)\): Often encountered in sorting algorithms, a linearithmic-time algorithm runs in time that's not particularly distinguishable from linear-time for "reasonable" input. Many comparison-based sorting algorithms (merge sort, heap sort, etc.) take linearithmic-time, because it has been proven to be the best running time possible for comparison-based sorting algorithms.
\(O(n^2)\): Quadratic-time algorithms take time proportional to the square of the input. Example algorithms that take this time are schoolbook multiplication of two \(n\)-digit numbers or two polynomials of degree \(n\) (faster algorithms exist, such as Karatsuba's algorithm), some slow sorting algorithms (selection sort and insertion sort), and solving the longest common subsequence problem using dynamic programming.
\(O(n^3)\): The most common cubic-time algorithms are perhaps the running times of schoolbook matrix multiplication (again, faster algorithms exist such as Strassen's algorithm), computing determinant using Gaussian elimination (which can be done using matrix multiplication) and finding a triangle in a graph of \(n\) vertices (which can also be done using matrix multiplication). Other than that, a cubic-time algorithm will likely have the structure of a loop through \(n\) values inside another loop of \(n\) values inside a third loop of \(n\) values; the three examples above naturally give rise to this structure, but it's uncommon to see any other.
\(O(a^n)\): For an exponential-time algorithm, increasing the input by one is enough to multiply the algorithm's running time (by \(a\)). Note that if \(a < b\), then \(a^n = o(b^n)\) (they are not asymptotically the same). An example of this is to evaluate whether a Boolean formula on \(n\) variables can be satisfied; trying each possibility requires \(2^n\) cases to be checked (multiplied by the time it takes to actually evaluate the formula).
\(O(n!)\): A factorial-time algorithm is even slower. Such algorithms often check every permutation of an array, in which there are \(n!\) of them.

You lost your wedding ring on the beach, and have no memory of when it came off. Thus, you decide to do a brute force grid search with your metal detector, where you divide the beach into strips, and walk down every strip, scanning the whole beach, until you find it. For simplicity, assume the beach is a square of side length \(l\) meters, each of your strips has a constant width of 1 meter, and it takes 10 seconds to walk 1 meter (it's hard to walk while searching). Find the big-oh performance of your ring finding algorithm.

beach scanning going on

The beach has size \(l \times l\). Since you divide it into strips of width 1 meter, you have \(l\) strips, each of length \(l\) meters. Thus it takes \(10l\) seconds to check each strip, and \(10l^2\) seconds to check the entire beach (plus some time taken to move to the next strip, but it shouldn't be very relevant). Thus our algorithm takes about \(10l^2\) seconds, which we can simplify with big-oh notation to \(O(l^2)\). \(_\square\)

Consider the following two algorithms to print all permutations of \(1, 2, 3, \ldots, n\) that are strictly increasing:

Algorithm 1: Check each permutation on whether it's sorted or not, and print each one that is sorted.
Algorithm 2: The only such permutation is \(1, 2, 3, \ldots, n\), so just print it.

Determine the running time (in big-oh notation) of each of the two algorithms.

Algorithm 2 is very simple. It takes \(O(n)\) time, because we're printing \(n\) numbers.

Algorithm 1, however, is not very simple. It inspects each permutation, in which there are \(n!\) of them. Each permutation requires \(O(n)\) time to check: we need to make \(n-1\) comparisons. Thus, the running time is \(n! \cdot O(n) = O(n \cdot n!) = O\big((n+1)!\big)\).

Algorithm 1 could be better. If we modify the checking part so it stops after finding an incorrect comparison (instead of continuing uselessly), this can improve the performance. It will also make the analysis much more complicated. Half of the permutations will stop at comparison 1; among half that passes, \(\frac23\) will stop at comparison 2; among the \(\frac16\) that passes, \(\frac34\) will stop at comparison 3, and so on. Thus, while in the worst case, we need to do \(n-1\) comparisons, over all permutations we actually need a total of

\[n! \cdot \left( \frac{1}{2!} \cdot 1^2 + \frac{1}{3!} \cdot 2^2 + \frac{1}{4!} \cdot 3^2 + \cdots + \frac{1}{n!} \cdot (n-1)^2 \right) + (n-1)\]

comparisons. This is pretty difficult to evaluate, but we can show the parenthesized expression tends to \(e-1 = O(1)\), so in fact the total number of comparisons is \(O(n!)\). \(_\square\)

Retiring Dec 20: Data Structures

Contents