Inside one of Google’s data centres in The Dalles, Oregon.
Below are copy-pasted sections of the article that I found most useful:
We – the billion components of the collective questioning mind – have got used to asking Google pretty much anything and expecting it to point us to some kind of satisfactory answer. It’s long since become the place most of us go for knowledge, possibly even, desperately, for wisdom. And it is already almost impossible to imagine how we might have gone about finding the answer to some of these questions only 15 years ago without it – a visit to the library? To a doctor? To Citizens Advice? To a shrink?
That rate of change – of how we gather information, how we make connections and think – has been so rapid that it invites a further urgent Google question. Where will search go next?
In the past couple of years, a great advance in voice-recognition technology has allowed you to talk to search apps – notably on iPhone’s Siri as well as Google’s Jelly Bean – while Google Now, awarded 2012 innovation of the year, will tell you what you want to know – traffic conditions, your team’s football scores, the weather – before you ask it, based on your location and search history.
Searching is ever more intimately related to thinking.
The man who is, these days, in charge of the vast majority of the world’s questing and wandering and seeking and traversing is called Amit Singhal. Aged 44, he is head of Google Search. For a dozen years he has held the responsibility, taken over from Sergey Brin, for writing and refining the closely guarded algorithm – more than 200 separate coded equations – that powers Google’s endless trawl for answers through pretty much all of history’s recorded knowledge. So far, he has never stopped finding ways to make it ever smarter and quicker.
This year, Google will roll out what it calls its Knowledge Graph, the closest any system has yet come to creating what Tim Berners-Lee, originator of the web itself, called “the semantic web”, the version that had understanding as well as data, that could itself provide answers, not links to answers.
The Knowledge Graph is a database of the 500 million most searched-for people, places and things in the Google world. For each one of these things, it has established a deep associative context that makes it more than a string of words or a piece of data. Thus, when you type “10 Downing Street” into Google with Knowledge Graph, it responds to that phrase not as any old address but much in the way you or I might respond – with a string of real-world associations, prioritised in order of most frequently asked questions.
Knowledge Graph, you might say, is the beginning of that “timeless interval”. Google has already come closer than anyone could ever have imagined to the “nothing was left to be collected” part of that equation. It is not only in searchable possession of the trillions of pages of the world wide web, but is also well on the way to photographing all the world’s streets, scanning all the world’s books and collecting every video uploaded to the public internet, mostly on its own YouTube. In recent years, it has been assiduously accumulating as much human voice recording as possible, in all the languages and dialects under the sun, in order to power its translation and voice recognition projects. It is doing the same for face recognition in films and photographs. Not to mention the barely used possibilities of the great mass of information Google possesses regarding the interests and communications and movements and search history of just about everyone with a phone or an internet connection.
This data has been collected not just for the purpose of feeding it back to us as accurately as possible, but also for the wider purpose: of teaching Google how to think for itself. Singhal has worked with what he calls “signals of salience” for the past dozen years, finding ever more accurate text- and link-based methods of making searches happen. But also, crucially, as these signals have become ever more sophisticated, Singhal and his team have been able to “observe the whole world interacting with the data, and with that we were able to begin to do something else, which was to begin to make the computer understand the context of what was being asked”.
In talking to Singhal, it is quite easy to get caught up in the utopian possibilities of the technology and quite easy, of course, to forget that Google has also created wealth faster and more efficiently than any company in history; that it is probably the most effective generator of advertising dollars ever invented; and that a great deal of what it knows about us we might well want it not to (an unease that might grow by association now that Facebook has announced a search engine of its own data, one that promises to be even more intimate in its revelation of personal history than Google has ever dared to be).
My research interests are in the area of information retrieval (IR), its application to web search, web graph analysis, and user interfaces for search.
- Speech Retrieval: Increasing amounts of spoken communication are stored in digital form for archival purposes (for instance, broadcast material). With advances in automatic speech recognition (ASR) technology, it is now possible to automatically transcribe speech with reasonable accuracy. Once transcribed, IR methods can be used to search speech collections. Think of this as a search engine for speech. However, the interesting problem is how to search speech given a large number of automatic speech recognition errors. More recently I have done some work in this area. When at AT&T Labs, we developed SCAN, a system that combines speech recognition, information retrieval and user interface techniques to provide a multimodal interface to speech archives.
- Document Ranking: Also called text/document searching/retrieval (that makes four phrases by the way), this is the best known part of our field. If you are reading this page, chances are that you have already used a “search engine” before. Document ranking is what search engines do: given a user query, how to rank a large collection of documents (web pages, news articles, your email, someone else’s email that you happen to have hacked, …) so that what you are looking for is ranked ahead of other less useful (or useless) documents.
- Question Answering: People have questions and they need answers, not documents. Automatic question answering will definitely be a significant advance in the state of the art of information retrieval technology. Systems that can do reliable question answering without domain restrictions have not been developed yet.
- Document Routing/Filtering: This is the “query by example” version of document ranking. Once you point the system to a few “good documents”, the system then tracks all NEW documents and points you to only those ones that you should be looking at. Typically the system tries to find new documents that are similar to the documents that you said were good.
- Automatic Text Summarization: Documents are huge and we don’t always want to read them all. (I don’t know about you but I certainly don’t have the patience. And given the stuff you find on the web …) Techniques that automatically “summarize” documents will be tremendously useful. Domain independent text summarization is very hard, at times even for humans; typically machines do summarization by text extraction. Relevant pieces (sentences, paragraphs, …) of text are typically extracted and presented as a “summary”.
- Miscellaneous (TREC): Since 1992 the National Institute of Standards and Technology (NIST), along with DARPA, has sponsored an annual conference called the Text REtrieval Conference (TREC) to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. I have been actively participating in TRECs since TREC-3 (held in 1994).
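
Several of the tasks listed above (document ranking, "query by example" routing, and summarization by text extraction) can be illustrated with one small sketch. The Python code below is a toy under stated assumptions, not the method of any particular search engine or of the systems named above: it assumes a plain TF-IDF weighting with cosine similarity, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`, `summarize`) and the sample texts are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per token list."""
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per text
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document index) pairs, most similar document first.

    For simplicity the query is included when computing document frequencies.
    """
    vectors = tfidf_vectors([tokenize(d) for d in documents] + [tokenize(query)])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(vectors[:-1])]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def summarize(text, k=1):
    """Extract the k sentences most similar to the whole text, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    top = sorted(i for _, i in rank(text, sentences)[:k])
    return [sentences[i] for i in top]


if __name__ == "__main__":
    docs = [
        "Speech recognition transcribes broadcast audio into text",
        "Search engines rank web pages for a user query",
        "Text summarization extracts relevant sentences from documents",
    ]
    for score, index in rank("how do search engines rank pages", docs):
        print(round(score, 3), docs[index])
```

Routing or filtering can reuse `rank` by passing a known "good document" as the query and scoring each newly arrived document against it. The `summarize` function shows extraction in its simplest form; because it compares each sentence with the whole text, it tends to favour longer sentences, which a real system would need to correct for.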