One of the simplest questions one might have about a list is whether or not some item is in the list. For example, if you are writing an algorithm to determine if someone should have access to a members-only website, you can solicit their information and then see if it matches some item in the members list.
Suppose that you have an unsorted array of 1000 emails and you want to check if email@example.com is in the array. Using a naive algorithm where you move sequentially through the list checking whether each element matches email@example.com, what is the maximum number of comparisons you would need to make?
The search algorithm described in the previous question is known as linear search. Clearly, the worst case is very bad, since you might have to go through every element in the array.
What is the best case for the number of comparisons you will need to make when searching for a specific email in an array of 1000 emails?
So, for linear search, the best case is great (just 1 comparison!) but the worst case is very bad: you may have to compare the email with every element in the array.
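The best and worst cases above can be checked with a minimal sketch of linear search that also counts comparisons. The function name `linear_search` and the placeholder addresses in `members` are assumptions made for illustration, not part of the original problem.

```python
def linear_search(items, target):
    """Return (index, comparisons); index is -1 if target is absent."""
    comparisons = 0
    for index, item in enumerate(items):
        comparisons += 1  # one equality check per element visited
        if item == target:
            return index, comparisons
    return -1, comparisons


# Hypothetical member list: 1000 distinct placeholder addresses.
members = [f"user{i}@example.com" for i in range(1000)]

best = linear_search(members, "user0@example.com")      # target is first
worst = linear_search(members, "user999@example.com")   # target is last
missing = linear_search(members, "nobody@example.com")  # target is absent
print(best, worst, missing)  # (0, 1) (999, 1000) (-1, 1000)
```

Note that a target in the last position and a target that is missing altogether both cost the full 1000 comparisons.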
What about on average? If the email is in the array (whose elements are distinct and randomly arranged), what is the average number of comparisons that you will need to make?
(Note: While we are assuming the email is in the array, the algorithm can't make that assumption; it needs to confirm the presence of the email before it stops.)
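To summarize, to determine if an element is in an array with \(n\) elements using linear search, the number of comparisons we need to make is 1 in the best case, \(n\) in the worst case, and about \(\frac{n}{2}\) on average (exactly \(\frac{n+1}{2}\) when the element is present and equally likely to be in any position).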
While the average case might not seem too bad, it’s important to think about what this means for large \(n.\) For an array of 1,000,000 elements, an average of about 500,000 comparisons is certainly better than the worst case of 1,000,000, but both are problematic in that they scale linearly with \(n;\) that is, with twice as much data, the algorithm will need twice as many comparisons.
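To make the linear scaling concrete, here is a small sketch that computes the average number of comparisons directly. It assumes the target is present and equally likely to be in any of the \(n\) positions, so a target in position \(k\) costs \(k\) comparisons; the helper name `average_comparisons` is chosen for illustration.

```python
def average_comparisons(n):
    """Average linear-search comparisons when the target is equally likely
    to sit at any of the n positions (position k costs k comparisons)."""
    return sum(range(1, n + 1)) / n


for n in (1_000, 2_000, 1_000_000):
    print(n, average_comparisons(n))
# 1000 500.5
# 2000 1000.5
# 1000000 500000.5
```

Doubling \(n\) from 1000 to 2000 roughly doubles the average, which is what scaling linearly means in practice.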
This isn’t great in a world with increasingly large data sets. Can we do better?
At first glance, it feels hard to do better than linear search. If—as with an unsorted list—we know nothing about the elements and their organization within the array, then we can’t do any better than just checking all of the elements.
But what if the list were sorted? Assuming our target is present, what is the maximum number of comparisons needed to find it in a sorted array with 1000 elements?
Hint: With a sorted list, knowing if an element is greater or less than the element we’re looking for can be useful!
The search described in the previous problem is called binary search.
Consider a sorted array. In short, binary search repeatedly chooses the element in the middle of the remaining candidates, and then determines whether the desired element is exactly this chosen element or, if not, whether it would lie to its left or right. In each iteration, the number of remaining candidates is halved, making binary search very efficient. This is especially important when dealing with a very large array.
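The halving idea can be sketched as follows. This is one common way to write binary search, not the only one; the name `binary_search` and the choice to count each inspection of a middle element as a single comparison are assumptions made for illustration.

```python
def binary_search(sorted_items, target):
    """Return (index, steps); index is -1 if target is absent.

    Each step inspects one middle element and counts as one comparison.
    """
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2
        steps += 1
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be to the right
        else:
            high = mid - 1  # target can only be to the left
    return -1, steps


numbers = list(range(1000))  # already sorted
worst = max(binary_search(numbers, t)[1] for t in numbers)
print(worst)  # 10
```

Trying every possible target in a 1000-element array shows that no search needs more than 10 steps, compared with up to 1000 for linear search.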
How much does the faster binary search matter? The short answer: a lot. Comparing linear search and binary search serves as a clear demonstration that the choice of algorithm matters.
Suppose you have a sorted array with 100,000,000 elements in it. Assuming the worst case for each method, about how many times more comparisons will linear search make than binary search?
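One way to check an estimate for this question is to compute both worst cases directly. The sketch below assumes that linear search needs \(n\) comparisons in the worst case and that binary search needs \(\lfloor \log_2 n \rfloor + 1\) steps, counting each inspection of a middle element as one comparison; the function names are hypothetical.

```python
def linear_worst_case(n):
    """Worst case for linear search: every element is checked."""
    return n


def binary_worst_case(n):
    """Worst case for binary search on n >= 1 elements.

    For a positive integer n, n.bit_length() equals floor(log2(n)) + 1,
    the number of halvings needed to get down to a single candidate.
    """
    return n.bit_length()


n = 100_000_000
ratio = linear_worst_case(n) / binary_worst_case(n)
print(linear_worst_case(n), binary_worst_case(n), round(ratio))
```

Running this for a few different values of `n` shows how quickly the gap widens as the array grows.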
Modern data sets tend to be massive; it’s critical to store and structure data in a way that makes common questions easily answerable. Throughout this course, we’ll consider what data types and structures should be used in various situations.
So far, we’ve already seen one repeated theme: lists tend to be more useful when sorted. But we haven’t answered the question of how to sort a list, and how computationally complicated it would be to do so. We’ll start to answer this question in the next quiz!