Today is Bill Gates's last day at Microsoft. The New York Times has a nice article. Hate him or love him, the man has delivered solutions for people all around the world. And for dedicating his earnings, time, and energy to philanthropy, I for one admire him.
In a loosely related article, the Wall Street Journal pokes fun at the relationship between Microsoft, Yahoo, and Google. A delightful read.
Friday, June 27, 2008
Friday, June 20, 2008
Google speller needs a grammar lesson?
Google is amazing. Among other things, their speller does a great job. Which is why it was shocking to me that the other day, when I was comparing Google and Powerset, I found an obvious error.
My query "who did leonard marry in yeomen of the guard" got the speller to suggest "who did leonard married in yeomen of the guard". Even a third-grader knows that is wrong. While a plain old n-gram model may not fix that (bigrams almost certainly cannot, since "did" and "married" are never adjacent), the resulting document count may help: the original query had 92100 results, and the suggestion only 9640.
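To make both points concrete, here is a minimal sketch in Python. The helper names (`ngrams`, `keep_original`) and the "fewer hits than the original" rule are my own assumptions for illustration; I have no idea how Google's speller actually decides.

```python
def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def keep_original(original_hits, suggested_hits):
    """Hypothetical sanity check: suppress a suggestion that matches
    fewer documents than the query the user actually typed."""
    return original_hits > 0 and suggested_hits < original_hits


suggestion = "who did leonard married in yeomen of the guard".split()

# "did" and "married" are separated by "leonard", so no bigram pairs them.
print(("did", "married") in ngrams(suggestion, 2))             # False
# A trigram does span the agreement error.
print(("did", "leonard", "married") in ngrams(suggestion, 3))  # True

# Result counts from the two queries above.
print(keep_original(92100, 9640))                              # True
```

A real speller would presumably weigh many more signals; the point is only that the result counts already argue against this particular suggestion.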
Monday, June 16, 2008
My tiny test with Powerset
As a small, anecdotal evaluation of Powerset, I tried two queries. On both queries, Powerset did not do as well as Google, yet there are promising signs of what Powerset can do.
The first query was a variation of Ron Kaplan's example (see previous post) - to see who John McCain praised, I used the query "McCain praises". The top two snippets showed that McCain praised Bush in 2004. The other snippets are somewhat problematic - the lexical parser seems to have ignored word ordering: "Tom Coburn [...] praising McCain" was returned as a result, as were "McCain was promoted" and "McCain's bill". The highlighting suggests that the words "promoted" and "bill" were matched with "praises". If done right, this kind of loose matching could be a tremendous boost in recall.
In comparison, Google returned two recent instances of McCain praising someone. In June 2008, McCain praised Hillary and Jindal, the Governor of Louisiana. It appears that Google beat out Powerset, 2-1.
If you look closely at the top section of Powerset's result page, you will see a section under Factz: McCain praised Tatopoulos. Now who is this Tatopoulos? Unfortunately for Powerset, this factoid was taken from an article on the film Outlander. Interesting? Yes. Relevant? Not really.
Names can be very ambiguous. "Michael Jordan" usually refers to arguably the best basketball player who ever lived; yet in the information retrieval world, that name sometimes refers to a professor at Berkeley. In the same way, "McCain" here refers to two different people. Can a system like Factz do a better job? Perhaps enumerate the various ambiguities in tabs? That could be quite useful.
The second query was, I thought, a bit more tricky. In Gilbert and Sullivan's play, The Yeomen of the Guard, Colonel Fairfax married Elsie, but Elsie thought Fairfax was another guy, Leonard. With such a mess, perhaps the brains in Powerset would do a better job. My query, "who did fairfax marry in yeomen of the guard", is phrased as a question, which Powerset is built to handle. But yet again, Google came up with better results. The fourth and fifth results are relevant: "Elsie agrees to be Fairfax's bride", "The disguised Fairfax discovers that it is Elsie that he has married". Powerset, on the other hand, did not give even a single female character's name from the play.
That may seem a bit pathetic, but notice that Google's results do not come from Wikipedia. Since Powerset only indexed Wikipedia, it can only do as well as Wikipedia. Google did not return anything from Wikipedia for that query on the first page, so perhaps that Wikipedia article was not very helpful. Clicking through from Powerset's result, a fancy outline of the Wikipedia article showed up on the right. At the bottom of that outline tool, there is a search bar - I typed in the word "marry", and sections on the page became highlighted. That by itself is nothing new, but if you look closer, the words highlighted are not all "marry" - instead, words such as "wedding", "marriage", and "wed" are matched. That is neat.
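For the curious, the highlighting behavior can be imitated with a toy Python sketch. The `RELATED` table is hand-written and hypothetical, the sample sentence is made up, and this is certainly not how Powerset does it (they presumably derive such relations from morphology and a lexicon rather than a lookup table).

```python
import re

# Hypothetical, hand-written table of related word forms.
RELATED = {"marry": {"marry", "married", "marriage", "wedding", "wed"}}


def highlight(text, query):
    """Wrap every word related to the query in square brackets."""
    terms = RELATED.get(query.lower(), {query.lower()})

    def mark(match):
        word = match.group(0)
        return f"[{word}]" if word.lower() in terms else word

    return re.sub(r"[A-Za-z']+", mark, text)


# Made-up sentence, not a quote from the Wikipedia article.
print(highlight("Elsie agrees to wed; the wedding follows.", "marry"))
```

Note that the literal word "marry" never appears in the sample sentence, yet two words still get marked.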
Wednesday, June 11, 2008
Powerset - a recent talk
Recently I attended a talk by Powerset's chief scientist, natural language processing guru Ron Kaplan. He explained how usage patterns in natural languages (e.g. English) make bag-of-words search engines suffer in precision and recall. For example, to search for "who Obama criticized" one may use the query words "obama criticized". In that case, a bag-of-words search engine (e.g. Google) would return documents with the phrases "Hillary criticized Obama" and "McCain criticized Obama" - these are cases where a search engine returns irrelevant results, a precision problem. In other cases, synonyms for "criticized" are not matched, so documents with "Obama rejected" are not returned - this means relevant results are missed, a recall problem.
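The two failure modes are easy to reproduce with a toy bag-of-words matcher. This is only a sketch in Python: the function names are mine, the third document is a made-up completion of the "Obama rejected" example, and real engines rank results rather than apply a strict all-words filter.

```python
import re


def bag(text):
    """Lowercased set of words; word order is deliberately thrown away."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(query, doc):
    """True when every query word occurs somewhere in the document."""
    return bag(query) <= bag(doc)


docs = [
    "Hillary criticized Obama",     # Obama is the target, not the critic
    "McCain criticized Obama",      # same problem
    "Obama rejected the proposal",  # made-up relevant document with a synonym
]

for doc in docs:
    print(matches("obama criticized", doc), "-", doc)
```

The first two documents match even though Obama is on the receiving end (precision), while the third is skipped because "rejected" is not literally "criticized" (recall).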
He then went on to discuss the technologies Powerset is built on: Lexical Functional Grammar, Transfer/Glue Semantics, finite-state morphology - technologies to learn/detect structure from free-form documents. Crowd-sourced, structured data streams are also used - Freebase and Wikipedia come to mind. By acquiring content (indexing terms with lexical/semantic information) and packing ambiguity into compact states, a new generation of software is ready to improve a wide range of tasks: summarization, question answering, translation, entity extraction... As for search, with less than 1 second in query response time, and less than 1 second per sentence in indexing time, he thinks natural language search is ready for prime time. Or at least, ready to convince investors to put in more money.
A handful of my co-workers also went to the talk. Most of them were underwhelmed. "Sure it was a good talk, but I heard it two years ago," said one. "I tried something similar to his example and it didn't work," said another. To be fair, Powerset has been over-hyped - some thought it would be a Google killer. And since it indexes only Wikipedia, it would not be far-fetched to say they under-delivered. However, in terms of bringing innovation to the table and trying to improve the search experience, I like their efforts. I won't put money on them yet, but I'll be watching.