Yesterday while I was having a blast reading “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” I happened across some fun facts.
We got into some of the more technical goods from the paper yesterday, but figured these would also be a worthwhile, or at least more enjoyable, read. Friday and all.
1. “Wow, you looked at a lot of pages from my web site. How did you like it?” – people encountering a crawler for the first time
They note that they received almost daily emails from people either concerned about copyright issues or asking if they liked the site after looking at it. For many people with web pages, this was one of the first crawlers they had seen.
“It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, ‘Wow, you looked at a lot of pages from my web site. How did you like it?’ There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, ‘This page is copyrighted and should not be indexed.’”
More innocent times.
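For anyone curious what the robots exclusion protocol looks like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `ExampleBot` crawler name, the example.com URLs, and the `is_allowed` helper are all made up for illustration; this is not how Google's crawler was implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths here are invented for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a made-up crawler name.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/notes.html"))  # False
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"))  # True
```

A copyright notice on the page itself does nothing here: a well-behaved crawler only consults the site's robots.txt file.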
2. A billion web documents predicted by 2000
“It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. . . The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.”
Now in 2018, there are reportedly 130 trillion documents on the web — an extraordinary number indeed. And sure enough, their search has scaled to meet it.
3. Google took up 55 GB of storage
“The total of all the data used by the search engine requires a comparable amount of storage, about 55 GB.”
Now, Google is 2 billion lines of code. As noted by one of their engineering managers in 2016, the repository contains 86 TB of data.
4. “People are still only willing to look at the first few tens of results.”
Please note: “tens.”
They write about the need for more precision in search. Remember the days when people regularly clicked past page 1?
5. Percentage of .com domains: from 1.5 to 60, to now 46.5
They note how “commercialized” the web was already becoming, leaving search engine technology “to be largely a black art and to be advertising oriented.”
“The Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997.”
According to Statista, the share of .com domains is down to 46.5% as of May 2018.
“With Google,” they wrote, “we have a strong goal to push more development and understanding into the academic realm.”
6. “There are two types of hits: fancy hits and plain hits”
After going into some technical detail about optimized compact encoding, they reveal that their complex compact encoding preparations are categorized simply, endearingly, into fancy and plain.
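For the curious, the paper describes each hit as fitting in two bytes. A plain hit has a capitalization bit, three bits of relative font size, and 12 bits of word position. A fancy hit sets the font-size bits to 7 as a flag, then uses 4 bits for the hit type and 8 bits for position. Here is a rough sketch of that packing in Python. The bit ordering, the function names, and the clamping of oversized positions are assumptions made for this example, not details taken from the paper.

```python
FANCY_FLAG = 0b111  # font-size value reserved to mark a fancy hit


def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit: 1 cap bit, 3 font-size bits, 12 position bits."""
    if not 0 <= font_size < FANCY_FLAG:
        raise ValueError("plain hits use font sizes 0 through 6")
    if position < 0:
        raise ValueError("position must be non-negative")
    # Assumption: positions too large for 12 bits are clamped to the maximum.
    position = min(position, 0xFFF)
    return (int(capitalized) << 15) | (font_size << 12) | position


def pack_fancy_hit(capitalized: bool, hit_type: int, position: int) -> int:
    """Pack a fancy hit: 1 cap bit, font size 7, 4 type bits, 8 position bits."""
    if not 0 <= hit_type <= 0xF:
        raise ValueError("hit type must fit in 4 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    # Assumption: positions too large for 8 bits are clamped to the maximum.
    position = min(position, 0xFF)
    return (int(capitalized) << 15) | (FANCY_FLAG << 12) | (hit_type << 8) | position


def is_fancy(hit: int) -> bool:
    """A hit is fancy when its three font-size bits are all set."""
    return ((hit >> 12) & 0b111) == FANCY_FLAG


plain = pack_plain_hit(True, 3, 5000)  # position clamps to 4095
fancy = pack_fancy_hit(False, 2, 10)
print(hex(plain), is_fancy(plain))  # 0xbfff False
print(hex(fancy), is_fancy(fancy))  # 0x720a True
```

Either way, every hit stays within 16 bits, which is the whole point of the hand-optimized layout.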
7. Already defending user experience in anticipating search
From the start, it seems Brin and Page fought for users to not need to excessively specify their queries in order to get desired information. They wrote:
“Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like “Bill Clinton” they should get reasonable results since there is a enormous amount of high quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.”
It’s interesting that this was so clearly in their thinking from the beginning. At last week’s Search Summit, Googler Juan Felipe Rincon said, “The future of search is no search, because search implies uncertainty. Instead, it will be about how you populate something before someone knows what they don’t know.”
8. There was a typo
In the second paragraph of section 3.2, they write “Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem.”
Did you catch it? The verb should be, “companies which are deliberately manipulating search engines become” or “companies which deliberately manipulate search engines become.” Of the utmost gravity, we know.
Just goes to show that even an incomplete verb phrase won’t keep you from doing some pretty cool stuff in the world. And of course, that even the best of us need editors.
9. Search Engine Watch shout out
We tweeted this yesterday, but felt the need to share again for extra emphasis. Our very own Search Engine Watch was cited in the paper, stating that top search engines claimed to index 100 million web documents as of November 1997. Been a fun 21 years.
10. They chose these photos
Happy Friday, everyone.
Source: Search Engine Watch RSS