In 1998, PageRank represented a radical new approach to navigating the Web, one that was unique in its ability to make effective determinations about the quality of webpages and to keep pace with the Web’s exponential growth.
The Problem of Web Search: Quality versus Breadth
Prior to PageRank, Web navigation took two primary forms: directories, such as Yahoo.com (Fig. 1), and keyword searching, such as that used by Altavista.com (and then Google; see Fig. 2). Directories organized websites into hierarchical categories for easy browsing, much like a library card catalog or the yellow pages of a phone book. Directories benefited from the familiarity of such interfaces, and because they were updated and maintained by people, they were generally reliable indicators of quality. This made directories highly effective for some purposes, such as covering popular topics like sports or stable websites like those of major news outlets. However, because of the Web’s expandability and reconfigurability (nearly anyone with Web access could create a Web page), new pages appeared online at a rapid rate, and these mercurial sites were nearly impossible for human-powered directories to evaluate and add to their databases in a timely fashion.
Keyword searching relied on different tools. Neither directories nor keyword-based search engines searched the Web directly; rather, both searched indices of webpages stored in proprietary databases. Unlike human-curated directories, however, keyword-search databases were built by programs, variously called web crawlers or spiders, which navigated automatically from page to page via the links connecting them, adding information about the sites they visited to the database.
Because keyword-search sites automatically indexed the Web, their databases were necessarily more complete than those of directories; that breadth, however, came at the cost of the quality judgments provided by a directory’s human curators. Although spiders could easily index millions of pages, they were unable to make effective judgments about the contents of those pages. As Sergey Brin and Larry Page (1998) observed, simple searches “return[ed] too many low quality” results (p. 107). For example, in 1998 a search for “Bill Clinton” on a then-popular search site gave preference to a page consisting of a single image and the text “Bill Clinton Sucks” over the President’s official whitehouse.gov biography page (Brin & Page, 1998, p. 111). Thus, unlike a directory site, which would contain a curated list of quality websites devoted to politics, a search for “politics” on a keyword-search site like Altavista would return all of the indexed webpages containing that word: a potentially bewildering list of thousands or millions of results arranged with minimal or no quality control.
In short, directories and keyword searching each had benefits and drawbacks. Directories provided superior quality control, but were unable to manage the expandability and reconfigurability of the Web and its nature as “a vast collection of completely uncontrolled heterogeneous documents” (Brin & Page, 1998, p. 111), while keyword searching, which could keep up with this expansion (via automated indexing of websites), was unable to effectively rank its results in a way that gave prominence to quality pages.
What is PageRank?
PageRank offered a novel solution to these limitations, combining the scope offered by automation with a method for determining the quality of indexed sites that could compete with that of directories.
The logic of PageRank draws from a graph-theoretic analysis of citation networks. Just as the most influential academic papers tend to acquire the most citations, Larry Page, Sergey Brin, Rajeev Motwani, and Terry Winograd (1998) assumed that the highest-quality pages on the Web would be those with many incoming links (p. 2). PageRank was designed to use the number of links to a page to determine that page’s quality, treating each link between webpages as a citation from one page to another and assigning each page a score, its PageRank, based on how many other pages linked to it (Fig. 3).
To this basic citation counting, PageRank added a twist: links from pages with a high PageRank were given a higher value within the system than links from pages with a low one (Brin & Page, 1998, p. 110). A link from a page with a low PageRank was therefore worth less than a link from a page with a high PageRank. In this way, PageRank used “the link structure of the Web” to find pages with a high number of these “citations,” assigning each page a value commensurate with its quality (Brin & Page, 1998, p. 109).
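To make this weighting concrete, the following minimal sketch (not taken from the original papers) computes scores iteratively for a hypothetical three-page web in Python; the 0.85 damping factor follows the value Brin and Page (1998) describe as typical, and the page names and function name are purely illustrative.

# A minimal sketch of the link-weighting idea described above, run on a
# tiny hypothetical link graph. The 0.85 damping factor is the value
# Brin and Page (1998) report as typical; everything else is illustrative.

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute scores for a dict mapping each page
    to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # A page's new score is a share of the score of every page
            # that links to it, so links from high-ranked pages count more.
            incoming = sum(
                ranks[other] / len(links[other])
                for other in pages
                if page in links[other]
            )
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks

# Hypothetical three-page web: A and B both cite C, and C cites A.
web = {"A": ["C"], "B": ["C"], "C": ["A"]}
print(pagerank(web))

In this toy example the page cited by both others ends up with the highest score, and the page it links to inherits much of that value, illustrating how a single link from a prominent page outweighs links from obscure ones.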
Note: Portions of this text are adapted from Jones (2012).