The Library of Alexandria
in Egypt was one of the great intellectual institutions
of the ancient world. For three centuries beginning
around 300 B.C., the pharaoh Ptolemy and his heirs amassed
virtually all the great Greek literature and philosophy,
and tried to collect the whole world’s writings,
from cookbooks to medical texts. As a result, the cultures
of ancient civilizations changed as their scholars congregated
in Alexandria to read, study and write, absorbing Greek
influences.
But a library containing more than 500,000 papyrus scrolls
is useless if you can’t find the ones you want.
So Zenodotus, the first librarian of Alexandria, struck
upon the most enduring classification system ever invented:
he alphabetized the scrolls. Callimachus, one of his
successors, invented the bibliography, organizing the
collection into categories. The poet Philetas created
the first comprehensive dictionary at the library, which
Zenodotus improved by alphabetizing. Didymus wrote commentaries
and glossaries of the holdings, and Dionysius Thrax
created the first book on grammar.
When you’re faced with a body of knowledge many
orders of magnitude larger than anything seen in history,
you need to invent new ways to search, organize and
study it. Fortunately, a rich intellectual environment
enables creative people to rise to the challenge.
The Internet is today’s equivalent of the Alexandria
library, with more than 500 billion web pages
and growing. Making sense of this morass is more crucial
than ever in a world that runs on information. The right—or
wrong—intelligence affects decisions from running
economies to going to war.
It has become obvious that search technology is the
single most important application on the Internet. “The
sheer size and comprehensiveness of the Internet, perhaps
its greatest feature, would be useless if we didn’t
have search to take advantage of it,” notes Esther
Dyson, a longtime technology pundit and editor-at-large
at CNET Networks.
For more than a decade, the biggest innovations in Internet
search technology have come from one place—Stanford’s
computer science department. Most of that work was done
by graduate students under professors in the department’s
database group; much of it was financed by the government-supported
Digital Library Initiative—the project that gave
birth to search king Google. Without those students,
Internet search might be stuck in the pre-Hellenistic
age.
The Digital Library Initiative was not intended to create
technologies for Internet search; Stanford’s original
grant proposal in 1994 made no mention of the Internet
at all. The project started as an attempt by the Department
of Defense to make it easier to find computer research
papers electronically. Stanford and five other universities
each received about $800,000 annually to collaborate.
By 1998, the project’s budget and scope had grown,
as the National Institutes of Health, the National Science
Foundation and more universities got involved.
As it happened, 1994 was the year Netscape Communications
released its web browser, transforming the esoteric
Internet into the point-and-click World Wide Web. (People
now use the two terms interchangeably.) Suddenly, the
graphical web, which encompasses the overwhelming majority
of Internet sites, became the place to look for research
papers.
“The Internet completely changed things underneath
us,” says Professor Hector Garcia-Molina, chair
of computer science. As soon as the Digital Library
funding started coming in, the Internet became the focus
of most Stanford researchers’ efforts. Faculty
took a laissez-faire approach, encouraging students
to conduct research in any area they could think of.
The results were spectacular. “Google would never
have existed if not for the Digital Library program,”
says Jeffrey Ullman, professor emeritus of computer
science, who was Google co-founder Sergey Brin’s
adviser.
The first Stanford students to make a commercial success
out of helping people find things on the Internet were
David Filo and Jerry Yang, who started Yahoo! The venture
was never a true search engine—a software program
that pulls up web pages relevant to keywords the user
types. Rather, it started simply as a hand-selected
list of interesting websites called “Jerry’s
Guide to the World Wide Web.” It evolved into
“Yet Another Hierarchical Officious Oracle,”
or Yahoo!, a portal offering hand-selected sites and
free software deemed useful by Yahoo’s “domain”
experts—the equivalent of Callimachus’s
bibliography. To find other web pages, Yahoo! offered
search engines licensed from other companies.
The Yahoo! story also began in 1994. As part of their
Stanford doctoral course work, Filo, MS ’90, and
Yang, ’90, MS ’90, wrote a business plan
based on their web guide. Students had to evaluate each
other’s plans, and Brian Lent, a PhD student in
the database group, gave Yahoo! a D-minus. Lent, MS
’95, thought the selection process should be automated,
rather than hiring scores of experts to find the right
sites as the web grew.
Let that be a lesson to anyone with ambitious plans
for their research: you have to ignore a lot of naysayers.
When Filo invited Lent to join Yahoo!
as employee No. 1, to keep the founders on
their toes with his skepticism, Lent laughed. “You
couldn’t pay me enough money to work for a company
called Yahoo!” he recalls saying at the time.
Still, Lent was at least partially right. By the late
1990s, almost all search engines had given up trying
to make search a profitable enterprise and were busily
transforming themselves into portals modeled after Yahoo!
But after Google showed up in 1998, most of those portals
went out of business, while Yahoo! spent about $2 billion
buying search technology to add to its site. Microsoft
eventually started creating its own search technology,
hoping to release it sometime next year.
Throughout the 1990s, search engines primarily retrieved
pages according to how many times given keywords were
found on a site. It’s as simple an idea as alphabetizing
scrolls, and no more innovative than Yahoo!’s
approach. But these engines were easy to fool. For example,
by simply typing “sex” over and over again
in black type on a black background to make the words
invisible, site programmers could attract a lot of hits
from search engines, whether or not the site had anything
to do with the topic people were looking for.
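To see why counting keywords was so easy to game, consider a minimal sketch of that style of ranking. The function names and the two sample pages are invented for illustration; no real search engine's code is being reproduced here.

```python
from collections import Counter


def keyword_score(page_text: str, query: str) -> int:
    """Count how many times the query's words appear in the page text."""
    words = Counter(page_text.lower().split())
    return sum(words[term] for term in query.lower().split())


def rank_by_keyword_count(pages: dict[str, str], query: str) -> list[str]:
    """Return page names ordered from most keyword matches to fewest."""
    return sorted(
        pages,
        key=lambda name: keyword_score(pages[name], query),
        reverse=True,
    )


# Hypothetical pages: one genuine, one stuffed with a repeated keyword.
pages = {
    "honest-review": "a short review of a hiking boot",
    "stuffed-page": "boot boot boot boot boot buy watches here",
}
ranking = rank_by_keyword_count(pages, "boot")
```

In this toy example the stuffed page outranks the genuine review simply because it repeats the word more often, which is exactly the weakness described above.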
When Google’s search engine was officially launched
in December 1998, it was distinguished by one big attribute.
It worked.
At its core is the PageRank system, invented by Larry
Page (and named after him) while he was working on his
PhD at Stanford. PageRank, which judges a site’s
importance by analyzing outside links to it, was the
first true innovation in search technology since the
bibliography. It takes advantage of the unique properties
of the web—the network of links that makes its
name so apt.
Garcia-Molina, Page’s adviser, recalls how it
all started. Page came into his office one day in 1995
to show him a neat trick he had discovered. The AltaVista
search engine not only collected keywords from sites,
but also could show what other sites linked to them.
AltaVista did not exploit this link information in the
way Google would, but Page suggested it would be a good
way to rank sites. He reasoned that those with the most
links probably were the most popular and would prove
most useful to searchers: they should be listed first
in the search results. He began creating his own software
for analyzing links between sites.
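Page's starting point, ranking sites by the links pointing at them, can be sketched in a few lines. The first function below simply counts inbound links, as the paragraph above describes. The second is a simplified version of the iterative PageRank calculation as it is commonly presented in textbooks; the damping value of 0.85, the iteration count and the four-page toy web are assumptions made for illustration, not details taken from Google's actual system.

```python
def inbound_link_counts(links: dict[str, list[str]]) -> dict[str, int]:
    """Count how many distinct other pages link to each page."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank: a page is important if important pages link to it.

    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for source, targets in links.items():
            if targets:
                # A page splits its current score evenly among its links.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                for page in pages:
                    new_rank[page] += damping * rank[source] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: pages a, b and d all link to c.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
counts = inbound_link_counts(toy_web)
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In the toy web, page c collects the most inbound links and also ends up with the highest score, so it would be listed first.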
Meanwhile Lent, the student who had all but failed Yahoo!’s
business plan, had been working with Brin on a research
project within the database group. In 1995, they decided
to try a little associative data mining. This is the
process of finding pieces of information that commonly
occur together. Retailers use it to search through their
sales records and determine whether different items
are frequently bought at the same time by customers.
(They then can place those products as far apart as
possible in the store, hoping to lure customers into
additional purchases.)
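The retail example can be made concrete with a rough sketch that counts which pairs of items show up together in the same purchase. The basket contents, the function name and the threshold are all made up for illustration; real association-mining systems use far more elaborate methods.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets: list[list[str]],
                   min_support: int) -> dict[tuple[str, str], int]:
    """Return item pairs that appear together in at least `min_support` baskets."""
    pair_counts: Counter = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical sales records, one list per customer purchase.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=3)
```

Here only bread and butter are bought together often enough to clear the threshold. The same counting idea carries over to words that occur together on web pages.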
Brin and Lent worked on ways to find specific word combinations
that often occurred together on the Internet, such as
authors and their book titles. This required searching
through masses of web data, so Brin wrote a “crawler”
program—software that visits websites, summarizes
their content and stores the data in a central location
accessible to graduate students and search companies.
He intended to call the crawler “Googol,” after
the word coined by the 9-year-old nephew of mathematician
Edward Kasner for the number 10 to the 100th power, to reflect
the enormous amount of data they were collecting. For
two years, Lent recalls, they did not realize they were
spelling the word incorrectly.
Later, Page combined his method of analyzing “back”
links pointing to a given website with Brin’s
web crawler, and their combined research moved under
the Digital Library umbrella.
Lent, who had a tendency to wander back and forth between
university research and corporate life, did not stick
around to work with Page and Brin, a decision he confesses
he regrets. But in early 1996, Lent explains, “We
all said, ‘There will never be another Yahoo!’”
Their research seemed purely an academic exercise. Lent
was itching to get back into business, so he joined
a start-up company.
But the Google search engine, first set up to troll
through Stanford’s own web pages, was an immediate
hit with students and faculty. Page and Brin became
convinced of its commercial potential. With help from
Stanford’s Office of Technology Licensing and
a number of professors (see sidebar) they managed to
get their company funded. To bring in revenue, they
borrowed an idea from GoTo.com (later renamed Overture
and acquired by Yahoo!), a sort of Yellow Pages search
engine that went through ads, not websites. Google now
simultaneously searches through websites and its own
advertisers, listing the relevant ads next to the search
results. This has become the most successful advertising
approach on the Internet.
Is it always that easy to start a company out of Stanford?
Of course not. But, says Ullman, “The value system
we have at Stanford doesn’t sneer at commercial
utility.”
Not everyone agrees with that assessment. Scott Hassan,
who helped Page and Brin with some of the early programming
for Google while in the master’s program, thought
work that showed commercial potential was discouraged
at the University. “I saw people at Stanford who
waited until they left to do interesting things,”
he says. But, he adds, “Stanford does make it
easy to buy the patents.” Hassan, who co-founded
eGroups, later sold to Yahoo!, says he just didn’t
realize it while he was there. “Office of Technology
Licensing policies are very pro-inventor. They will
even help you file the patents. But all that isn’t
very well publicized at Stanford.”
Page, MS ’98, and Brin, MS ’95, may have
become yet another two PhD students to disappoint their
mothers by dropping out of grad school to start a company.
But the research they started continues at Stanford,
officially encapsulated in a project known as WebBase.
Using the techniques first developed by the Google founders,
the core of WebBase is a huge archive of websites now
stored at the San Diego Supercomputing Center. Researchers
from Stanford and other universities around the world
can download and work with information about millions
of websites as they develop search and retrieval technology.
Stanford has continued to supply Google with brainpower
and new ideas in search. For six years, nearly everybody
who graduated under a faculty adviser in the database
group either stayed in academia or went to work at Google.
That record was only recently broken when one alum went
to IBM’s Almaden Research Center. “We used
to joke that if Google went under, all our grads would
be unemployed,” says Professor Jennifer Widom.
As for Lent, he has not given up. He got a call from
Microsoft in 2003, telling him the company wanted “to
kill Google,” he recalls. He considered joining
the team, but decided that if Microsoft could do it,
so could he. Lent is now an “entrepreneur in residence”
at Silicon Valley venture capital firm Mohr, Davidow
Ventures, putting together a start-up team that will
tailor search to individuals’ interests.
Lent describes his quest as “a bit psychotic—I
mean, who goes after Google?” But he thinks Google
left him an opening. “I felt Google was stagnating,”
he says. “Their core premise is still link analysis.
But the other half of the equation is user behavior.”
Lent has an algorithm he calls “Dynamic PageRank,”
which adds the dimension of time to web searches in
order to better determine people’s interests.
How long do people stay on web pages; what hour, day
or week are they most active; what ads do they most
often click on; and what products do they most often
buy? By tracking their interests and behavior, Lent
thinks he will be able to give web searchers better
results.
Because he “passed on two companies” that
spun out of Stanford and became huge successes, Lent
notes, “I need to give it a try. Google and Yahoo!,
be warned.” Unless, of course, one of the companies
becomes impressed enough to buy his start-up.
Google has already bought a company that was developing
technology to personalize web searching. That company
was founded (you guessed it) by a few Stanford
computer science graduate students.
Glen Jeh was in the PhD program in 2003, working within
the database group, when he co-wrote (with Widom) a
prizewinning conference paper called “Scaling
Personalized Web Search.” His approach to personalizing
searches lets people specify their interests in advance.
The catch is that adding individual preferences to
web searches is computationally difficult.
Since there are millions of users, each with separate
criteria, there are simply too many permutations to
quickly find all the websites that simultaneously match
search terms, have the highest PageRanks and correlate
with their lists of interests.
Jeh, MS ’03, came up with the idea of “partial
vectors,” common preferences shared by many people.
Sites that match many of these preferences are given
higher priority even before anyone does a search, narrowing
the field. Then when an individual does a search, his
or her other preferences are calculated in. That can
still require a lot of expensive computing power, though,
so two other PhD candidates, Taher H. Haveliwala, MS
’01, and Sepandar Kamvar, PhD ’04, improved
the efficiency of calculating Jeh’s partial vectors,
and the trio set up a company called Kaltix last year.
Google snapped it up within months.
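The general idea of doing the expensive work before anyone searches, then blending in an individual's preferences at query time, can be illustrated with a heavily simplified sketch. This is not Jeh's partial-vectors method or Kaltix's technology. It only shows a preference-weighted variant of link ranking, computed once per topic page and later combined per user. The three-page web, the weights, the damping value and the function names are all assumptions, and the sketch assumes every page has at least one outbound link.

```python
def personalized_rank(links: dict[str, list[str]],
                      preference: dict[str, float],
                      damping: float = 0.85,
                      iterations: int = 100) -> dict[str, float]:
    """Link-based ranking biased toward the pages named in `preference`."""
    pages = list(links)
    rank = {page: preference.get(page, 0.0) for page in pages}
    for _ in range(iterations):
        # The "restart" share goes only to the preferred pages.
        new_rank = {page: (1.0 - damping) * preference.get(page, 0.0)
                    for page in pages}
        for source, targets in links.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def combine(basis: dict[str, dict[str, float]],
            weights: dict[str, float]) -> dict[str, float]:
    """Blend precomputed per-page rankings using one user's weights."""
    pages = next(iter(basis.values())).keys()
    return {page: sum(w * basis[pref][page] for pref, w in weights.items())
            for page in pages}


# Hypothetical three-page web.
toy_web = {
    "news": ["sports", "tech"],
    "sports": ["news"],
    "tech": ["news", "sports"],
}

# Done ahead of time: one ranking per possible preferred page.
basis = {page: personalized_rank(toy_web, {page: 1.0}) for page in toy_web}

# Done at search time: a cheap weighted blend for one user.
user_weights = {"sports": 0.7, "tech": 0.3}
blended = combine(basis, user_weights)
direct = personalized_rank(toy_web, user_weights)
```

Because each update step in this sketch is linear in the preference weights, the cheap blend of precomputed rankings matches what a full calculation for that user would produce, which is what makes precomputing worthwhile.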
Some of Stanford’s computer science grads have
stayed in academia, and continue to conduct research
into the intricacies of web search. Junghoo Cho, MS
’97, PhD ’02, is an assistant professor
at UCLA. He’s concerned about Google’s ability
to alter the makeup of websites. Since a relatively
small number of sites have the most links, and Google
retrieves them first, those sites get visited more often
and even more people link to them. Cho’s studies
indicate that Google in effect drives more and more
traffic to fewer and fewer sites.
Search technology research also continues at Stanford.
Professor Andreas Paepcke, director of the Digital Library
program, and several grad students are working on programs
to search through digital photographs. Their technique
combines data from the camera’s date/time stamps
with information such as birthdays, holidays, vacations
and major events—even data from Global Positioning
System satellites—to help identify what photographs
depict. This is the first step in searching through
them.
Chris Manning, a professor in Stanford’s artificial
intelligence group, is trying to get computers to understand
“natural language,” with all its semantic
subtleties, as it is used (and misused) by humans. One
of Silicon Valley’s Great Tech Hopes is a “semantic
web” that will allow computers employed by search
engines and other sites to respond to questions written
in plain English, or other languages. This is something
the search site Ask Jeeves claims to do, but even Ask
Jeeves executives admit their first versions were mainly
a gimmick, simply picking out keywords in the questions
people typed. The company is trying to improve that
technology.
Stanford’s significant role as originator of search
technology may be winding down, though. For one thing,
this academic year will be the last for Digital Library
funding. And leading research is moving into corporations,
now that Google has demonstrated how profitable it can
be. “We’ve been discussing the question
of whether there’s anything new to do in search,”
says Garcia-Molina. “With all these big companies
out there, what can we do?”
Professor David Cheriton, an early investor in Google,
puts it more bluntly. “When you have something
like Google occur, where you can hire a bunch of great
researchers all motivated by stock options, it’s
hard for pure research organizations like universities
to compete.”
Did anyone say, “There will probably never be
another Google?”