Search Engines

When we type words into a search bar, computers in many different places spring to work in the blink of an eye. Search engines give us a level of access to information that would have been breathtaking to humans a generation ago.

We take web search for granted, but it's an amazing feat of engineering and computer science. In this course, you'll peek behind the curtain to learn how clever algorithms and data structures make this technology possible.

Searching the Whole World

                     

It’s hard to develop intuition about how search works. Computers are very fast, but the amount of information collected on the internet is very big.

Should looking for all occurrences of the word "puzzle" in ten years of your email history be fast because computers are fast, or is it slow because ten years of email history is an awful lot of information to sort through?

Searching the Whole World

                     

To find all the instances of a word in a document, the simplest strategy is to read the document from beginning to end.

This would be tedious and slow for a human. But when books are stored as text on a computer, the approach can work.

A computer can read through documents the same way a human does, letter by letter, word by word, from the beginning to the end. Unlike a human, a computer can do this task with blazing speed and essentially perfect accuracy.

Searching the Whole World

                     

Here’s a search engine that lets you query words from the Sherlock Holmes stories by Arthur Conan Doyle.

Type "puzzle" into the search bar below:

Based on the results, does Arthur Conan Doyle use "puzzle" as a verb, a noun, or both?

Searching the Whole World

                     

A reasonably designed computer program on a personal computer takes about a second to "read" any one of these documents to find all instances of a word or phrase:

  • the complete works of Shakespeare
  • Leo Tolstoy’s epic novel War and Peace
  • the Bible
  • the collected stories of Sherlock Holmes.

How long should you expect it to take a computer to read all four of those documents from beginning to end to count all instances of a word or phrase?

Searching the Whole World

                     

Computers can read the Bible from beginning to end faster than a human can turn a page. When we say the computer is “reading,” we mean that it is doing something similar to what a human does: processing the text letter by letter, word by word, from front to back.

Here’s a slightly-slowed-down visualization of a computer locating the occurrences of several words in the King James Bible:

So unlike humans, computers can quickly perform search by reading straight through — the entire Bible takes about a second.

According to Wikipedia, there are at least 450 different English translations of the Bible.

How long would it take for a computer to "read" all 450 translations in order to find out how many times each one uses a certain word?

Searching the Whole World

                     

You can’t just rely on the blinding speed of computers when there is lots of text to search through. It takes about 15 seconds for the users of Twitter to produce text equal in length to the Bible.

If a computer can read through a book the size of the Bible in a second, how many hours of Twitter output could that single computer read through in an hour of computing?

Searching the Whole World

                     

The output from Twitter is a huge, but comprehensible, amount of information. It still takes weeks or months of Twitter’s output to match the amount of text contained in the Library of Congress's 39 million physical books.

How long would you expect it to take for a single computer to search once through the entire Library of Congress, if the whole Library of Congress is about a month of tweets?

Searching the Whole World

                     

The simple read-from-front-to-back approach will stop working at the scale of large libraries. A few days isn't an extraordinary amount of time, but you probably don't want to run a computer for several days just to figure out where a particular word appears in the Library of Congress.

We expect search engines to respond in fractions of a second, even when faced with the firehose of text they have to deal with.

Even if you have multiple modern computers searching multiple parts of the text, you’d need a ridiculously large number of computers to achieve search-engine-like speeds.

Searching the Whole World

                     

The simple approach to reading a book —letter by letter and word by word — lets a computer almost instantly search a book. That approach just won't work anymore when you want to search the entire web.

A large search engine like Google must keep track of hundreds of trillions of web pages and respond to millions of search queries every hour.

Searching the Whole World

                     

Modern search engines and our Sherlock search engine are built on a fundamentally different approach.

This foundation also allow you to search in more powerful ways, like searching for exact phrases or using AND and OR in queries. In Chapter 2, you'll build up the skills you need to build a search engine by solving puzzles with Sherlock Holmes.

The ideas behind modern search engines are not new. In the next exploration, you’ll see that the basis of modern search has been around for hundreds — even thousands — of years.

Searching the Whole World

                     
×

Problem Loading...

Note Loading...

Set Loading...