Wednesday, May 02, 2012

Google News + Google Translate could be much better

Google News is one of many Google technologies that I love. Every day I check both the US version and the HK version. Google Translate is another fantastic service, even if the lack of a free API is a bit short-sighted in my mind.

Imagine my disappointment, when Google News US clustered a Chinese article poorly. Click through the image below to see the grey "translate" button next to the text "From China".

That article is about a Chinese badminton player, Chen Jin, getting the last spot at the London Olympics. The news cluster is about activist Chen Guangcheng, a blind, self-taught human rights lawyer, escaping prison and getting to the U.S. embassy. (He has since left the U.S. embassy.)

Misclassified news articles do appear fairly frequently in Google News. Had the misclassified article been in English, I would probably have just ignored it.

Instead, I took a relevant news article and the Wikipedia entry about Chen Guangcheng, and compared them against two translated articles. One translated article is the one listed by Google News US, about Chen Jin, the badminton player. The other is an article about Chen Guangcheng, from Google News HK. Basic word frequency shows clearly that Google News US made a mistake. I leave it as an exercise for the reader to sum up the total percentage of word hits and misses in both cases. I imagine that weighting words by their information content (e.g. IDF) and taking phrasing into account would make it even more obvious. Behavior-based learning from bilingual users should also help.
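For readers who want to try the comparison themselves, here is a minimal sketch of the word-frequency check. It is not the script behind the tables below; the tokenizer, the function names (`tokenize`, `overlap_report`) and the toy sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def overlap_report(article_text, cluster_text):
    """Share of the article's word occurrences that also occur in the cluster."""
    article = Counter(tokenize(article_text))
    cluster = Counter(tokenize(cluster_text))
    total = sum(article.values())
    if total == 0:
        return {"hit_pct": 0.0, "miss_pct": 0.0}
    # A "hit" is any occurrence of a word that appears at least once in the cluster.
    hits = sum(count for word, count in article.items() if word in cluster)
    hit_pct = 100.0 * hits / total
    return {"hit_pct": hit_pct, "miss_pct": 100.0 - hit_pct}


# Toy text, made up for this example (not the real articles).
cluster = "activist chen guangcheng escaped to the embassy in beijing"
sports = "chen jin wins badminton seat for london olympics"
politics = "chen guangcheng left the embassy in beijing"

print(overlap_report(sports, cluster))
print(overlap_report(politics, cluster))
```

An article that belongs in the cluster should show a much higher hit percentage than one that merely shares a surname with it.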

[In the tables below, the columns are, from left to right: word, frequency_in_article, frequency_in_cluster, percentage_in_article, percentage_in_cluster]

Frequently used words in translated article on Chen Jin that appear in cluster

chinese 7  66 1.79 0.59
beijing 3  17 0.77 0.15
   chen 3 133 0.77 1.20
  china 2  46 0.51 0.41
     's 1  64 0.26 0.58
   also 1   9 0.26 0.08
    may 1  14 0.26 0.13
   news 1  17 0.26 0.15
 people 1  14 0.26 0.13
   said 1  12 0.26 0.11

Frequently used words in translated article on Chen Jin that do not appear in cluster

         london 8 0 2.05 0.00
          seats 8 0 2.05 0.00
        olympic 6 0 1.53 0.00
      badminton 5 0 1.28 0.00
       olympics 5 0 1.28 0.00
            six 5 0 1.28 0.00
          games 4 0 1.02 0.00
           gold 4 0 1.02 0.00
        project 4 0 1.02 0.00
     advantages 3 0 0.77 0.00
       capacity 3 0 0.77 0.00
         medals 3 0 0.77 0.00
       shooting 3 0 0.77 0.00
      strengths 3 0 0.77 0.00
            win 3 0 0.77 0.00
        ability 2 0 0.51 0.00
     delegation 2 0 0.51 0.00
         diving 2 0 0.51 0.00
    eligibility 2 0 0.51 0.00
            jin 2 0 0.51 0.00
          medal 2 0 0.51 0.00
          order 2 0 0.51 0.00
       projects 2 0 0.51 0.00
        qualify 2 0 0.51 0.00
          rifle 2 0 0.51 0.00
          skeet 2 0 0.51 0.00
          table 2 0 0.51 0.00
         tennis 2 0 0.51 0.00
  weightlifting 2 0 0.51 0.00
        achieve 1 0 0.26 0.00
       addition 1 0 0.26 0.00
           army 1 0 0.26 0.00
        aroused 1 0 0.26 0.00
         become 1 0 0.26 0.00
            big 1 0 0.26 0.00
            bus 1 0 0.26 0.00
         caught 1 0 0.26 0.00
            cnr 1 0 0.26 0.00

In contrast, take a translated article on Guang Cheng:

Frequently used words in translated article on Guang Cheng that appear in cluster

      chen 14 133 2.57 1.20
guangcheng 11  29 2.02 0.26
        hu  9   8 1.65 0.07
   embassy  8  15 1.47 0.13
   beijing  6  17 1.10 0.15
      u.s.  6  20 1.10 0.18
        's  5  64 0.92 0.58
  activist  3  34 0.55 0.31
     china  3  46 0.55 0.41
    escape  3  31 0.55 0.28
      name  3   7 0.55 0.06
    rights  3  29 0.55 0.26
  security  3   7 0.55 0.06
government  2   7 0.37 0.06
     human  2  17 0.37 0.15
      said  2  12 0.37 0.11
      also  1   9 0.18 0.08
    arrest  1  19 0.18 0.17
     blind  1  30 0.18 0.27
      case  1   8 0.18 0.07
    family  1  16 0.18 0.14
   foreign  1   7 0.18 0.06
      free  1   7 0.18 0.06
     house  1  24 0.18 0.22
    lawyer  1  10 0.18 0.09
     legal  1   9 0.18 0.08
     linyi  1   7 0.18 0.06
      news  1  17 0.18 0.15
    people  1  14 0.18 0.13
    return  1  25 0.18 0.22
     state  1   9 0.18 0.08
     times  1  15 0.18 0.13
      wife  1   9 0.18 0.08

Frequently used words in translated article on Guang Cheng that do not appear in cluster

        mainland 7 0 1.28 0.00
         affairs 3 0 0.55 0.00
       political 3 0 0.55 0.00
          states 3 0 0.55 0.00
       yesterday 3 0 0.55 0.00
           enter 2 0 0.37 0.00
         evening 2 0 0.37 0.00
            hide 2 0 0.37 0.00
        informed 2 0 0.37 0.00
     questioning 2 0 0.37 0.00
          screen 2 0 0.37 0.00
         station 2 0 0.37 0.00
      2012-04-30 1 0 0.18 0.00
        addition 1 0 0.18 0.00
           armed 1 0 0.18 0.00
             ask 1 0 0.18 0.00
          assist 1 0 0.18 0.00
           audio 1 0 0.18 0.00
         butcher 1 0 0.18 0.00
         capture 1 0 0.18 0.00
           chase 1 0 0.18 0.00
         chasing 1 0 0.18 0.00
           clock 1 0 0.18 0.00
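
Taking up the exercise for the hit tables only: the sketch below simply adds up the percentage_in_article column of the two "appear in cluster" tables above. The numbers are copied from those tables. The miss tables are left out because they may not list every word, so a total for them would be a guess. The variable names are mine.

```python
# percentage_in_article values copied from the Chen Jin "appear in cluster" table
chen_jin_hits = [1.79, 0.77, 0.77, 0.51] + [0.26] * 6

# percentage_in_article values copied from the Guang Cheng "appear in cluster" table
guangcheng_hits = (
    [2.57, 2.02, 1.65, 1.47, 1.10, 1.10, 0.92]
    + [0.55] * 6
    + [0.37] * 3
    + [0.18] * 17
)

chen_jin_total = round(sum(chen_jin_hits), 2)
guangcheng_total = round(sum(guangcheng_hits), 2)

print("Chen Jin article, hit percentage:", chen_jin_total)
print("Guang Cheng article, hit percentage:", guangcheng_total)
```

On the rows shown, the listed hits come to about 5.4% of the Chen Jin article and about 18.3% of the Guang Cheng article.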

Wednesday, February 16, 2011

Uncle Wah and Ping Pong Rhymes

=== 1 ===

Earlier this year, Uncle Wah died. Uncle Wah was not my real uncle, but a prominent politician in Hong Kong. He founded a teachers' union in the 70s, helped draft the Basic Law in the 80s, and pushed for democracy in China and Hong Kong, most notably after June 4, 1989. Browsing memorial sites, I came across some of his short articles [0], published every third day in a respected Hong Kong newspaper. These short pieces cover a large number of topics, from poetry to politics, from meetings with his students to meetings with powerful world leaders, and I find them to be a great read. To make reading easier, I pulled the articles onto my Kindle.

If you read Chinese, feel free to ask me for a copy. If you don't, I hope to translate some snippets as time allows. Feedback/comments/alternate translations are most welcome. Below are two snippets from an article published 2003/12/31, titled "Couplets by Mr. Sun Yat-Sen" [1].

=== 2 ===

On a visit to Cheung Chi-Tung, governor of a southern province, the young Sun sent in a name card, with the following note:

"Dear Brother Tung, Student Sun-Wen requests an audience." [2]

Cheung was unhappy with that, and sent back these words on the card:

"Three lines of scribble for
One powerful duke
How dare a school boy claim to be my equal" [3]

This was obviously a Ping Pong Rhyme challenge [4]. Sun read it with a slight smile, and quickly wrote back:

"Ten thousand books read and
One thousand miles walked
More proud could be a peasant than a noble" [5]

Cheung realized the visitor was no ordinary Joe, and hurriedly welcomed him in as if he were an important official.


In 1915, Yuan Shikai abolished the new Republic of China, and made himself King. Meanwhile, Sun married Soong Ching-ling in Japan, while actively making plans against Yuan. One day, as the Suns took a stroll in a garden, their conversation turned to Yuan. Soong came up with the opening of a Ping Pong Rhyme for Sun to complete:

"To the Garden we go
To drive out the King
And revive our Home" [6]

This opening rhyme plays a neat trick with the way the words are written: word 3 (園 - garden) becomes word 11 (國 - country) by replacing word 8 (袁 - yuan) with word 1 (或 - perhaps/by chance).

Sun thought for a while, then said:

"On this path I walk
No turning back
No idle talk" [7]

Not only is this response a good match in sound and meaning, it also plays the same trick with word composition: word 3 (道 - road) becomes word 11 (途 - path) by replacing word 8 (首 - head) with word 1 (余 - first person reference: "I").

=== 3 ===

To check my translation, I sent English versions of the Ping Pong Rhymes to a few friends. My brother wrote back with another famous rhyme that I might have read a long time ago:

Among the rocks in the mountains is an old tree - this tree is firewood;
Beside the white splashes of the river stream is a good lady - young ladies are wonderful. [8]

There are neat word tricks in both lines: putting word 1 above word 2 gives word 3, putting word 5 next to word 6 gives word 7, and words 8 and 9 combine to form word 11.

Isn't this amazing? Are there examples of anagrams in poetry?

In any case, I thought writing this up would be a good way to remember Uncle Wah, as well as the Centenary of the Xinhai Revolution of 1911.

[2] Chinese people back in the day have many names - see
[3] 持三字帖,見一品官,儒生妄敢稱兄弟
[4] Ping Pong Rhyme is my translation for duilian, also known as couplets. Words (in all positions) have to match in sound and meaning.
[5] 讀萬卷書,行千里路,布衣亦可傲王侯
[6] 或入園中,逐出老袁還我國
[7] 余行道上,義無回首瞻前途
[8] 山石岩下古木枯,此木是柴; 白水泉邊女子好,少女最妙

Wednesday, July 09, 2008

Tool of the Month

The best applications not only solve real problems, they also put a smile on your face. Here is a fantastic resignation letter generator for Yahoo.

Honorable mention: PicLens

Friday, June 27, 2008

Bill Gates's last day

Today is Bill Gates's last day at Microsoft. The New York Times has a nice article. Hate him or love him, the man has delivered solutions for people all around the world. And for dedicating his earnings, his time, and his energy to philanthropy, I for one admire him.

In a loosely related article, the Wall Street Journal pokes fun at the relationship between Microsoft, Yahoo, and Google. A delightful read.

Friday, June 20, 2008

Google speller needs a grammar lesson?

Google is amazing. Among other things, their speller does a great job. Which is why it was shocking to me that the other day, when I was comparing Google and Powerset, I found an obvious error.

My query "who did leonard marry in yeomen of the guard" got the speller to suggest "who did leonard married in yeomen of the guard". Even a third-grader knows that is wrong. While plain old n-grams may not fix that (bi-grams almost certainly cannot), the resulting document counts may help: the original query had 92,100 results, and the suggestion only 9,640.
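
As a rough illustration of that last idea, here is a sketch of a result-count sanity check. It is purely hypothetical and says nothing about how Google's speller actually works; the function name `should_suppress_suggestion` and the `min_ratio` parameter are my own inventions.

```python
def should_suppress_suggestion(original_hits, suggested_hits, min_ratio=1.0):
    """Return True when a rewritten query looks worse than the original.

    Hypothetical rule: drop the suggestion if it would return fewer than
    `min_ratio` times as many documents as the query the user typed.
    """
    if original_hits <= 0:
        # Nothing matched the original query, so any suggestion is worth showing.
        return False
    return suggested_hits < original_hits * min_ratio


# Result counts quoted in this post.
print(should_suppress_suggestion(92100, 9640))
```

With the counts above, the check flags the suggestion for suppression, since it matches roughly a tenth as many documents as the original query.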

Monday, June 16, 2008

My tiny test with Powerset

As a small, anecdotal evaluation of Powerset, I tried two queries. On both queries, Powerset did not do as well as Google, yet there are promising signs of what Powerset can do.

The first query was a variation of Ron Kaplan's example (see previous post) - to see who John McCain praised I used the query "McCain praises". The top two snippets showed that McCain praised Bush in 2004. The other snippets are somewhat problematic - the lexical parser seems to have ignored word ordering: "Tom Coburn [...] praising McCain" was returned as a result, as were "McCain was promoted" and "McCain's bill". The highlighting suggests that the words "promoted" and "bill" were matched with "praises". If done right, this kind of loose matching could be a tremendous boost in recall.

In comparison, Google returned two recent instances of McCain praising someone. In June 2008, McCain praised Hillary and Jindal, the Governor of Louisiana. It appears that Google beat out Powerset, 2-1.

If you look closely at the top section of Powerset's result page, you will see a section under Factz: McCain praised Tatopoulos. Now who is this Tatopoulos? Unfortunately for Powerset, this factoid was taken from an article on the film Outlander. Interesting? Yes. Relevant? Not really.

Names can be very ambiguous. "Michael Jordan" usually refers to arguably the best basketball player who ever lived; yet in the information retrieval world, that name sometimes refers to a professor at Berkeley. In the same way, "McCain" here refers to two different people. Can a system like Factz do a better job? Perhaps enumerate the various ambiguities by tabs? That could be quite useful.

The second query was, I thought, a bit more tricky. In Gilbert and Sullivan's play, The Yeomen of the Guard, Colonel Fairfax married Elsie, but Elsie thought Fairfax was another guy, Leonard. With such a mess, perhaps the brains in Powerset would do a better job. My query, "who did fairfax marry in yeomen of the guard", is phrased as a question, which Powerset is built to handle. But yet again, Google came up with better results. The fourth and fifth results are relevant: "Elsie agrees to be Fairfax's bride", "The disguised Fairfax discovers that it is Elsie that he has married". Powerset, on the other hand, did not give even one female character's name from the play.

That may seem a bit pathetic, but notice that Google's results do not come from Wikipedia. Since Powerset only indexed Wikipedia, it can only do as well as Wikipedia. Google did not return anything from Wikipedia for that query on the first page, so perhaps that Wikipedia article was not very helpful. Clicking through from Powerset's result, a fancy outline of the Wikipedia article showed up on the right. At the bottom of that outline tool, there is a search bar - I typed in the word "marry", and sections on the page became highlighted. That by itself is nothing new, but if you look closer, the words highlighted are not all "marry" - instead, words such as "wedding", "marriage", and "wed" are matched. That is neat.

Wednesday, June 11, 2008

PowerSet - a recent talk

Recently I attended a talk by PowerSet's chief scientist, natural language processing guru, Ron Kaplan. He explained how usage patterns in natural languages (e.g. English) make bag-of-words search engines suffer in precision and recall. For example, to search for "who Obama criticized" one may use the query words "obama criticized". In that case, a bag-of-words search engine (e.g. Google) would return documents with the phrases "Hillary criticized Obama" and "McCain criticized Obama" - these are cases where a search engine returns results that are irrelevant, a precision problem. In other cases, synonyms for "criticized" are not matched, so documents with "Obama rejected" are not returned - this means relevant results are missing, a recall problem.
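
To make the two failure modes concrete, here is a toy sketch of bag-of-words matching. It is my own illustration, not anything shown in the talk: the matcher, the four tiny "documents" and their relevance labels are all assumptions.

```python
def bag_of_words_match(query, document):
    # A document matches when it contains every query word, in any order.
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return query_words <= document_words


# Toy documents, labeled by hand: is each one relevant to "who Obama criticized"?
documents = {
    "Hillary criticized Obama": False,
    "McCain criticized Obama": False,
    "Obama criticized McCain": True,
    "Obama rejected the proposal": True,  # relevant, but uses a synonym
}

query = "obama criticized"
retrieved = [doc for doc in documents if bag_of_words_match(query, doc)]
relevant_retrieved = [doc for doc in retrieved if documents[doc]]
relevant_total = sum(1 for is_relevant in documents.values() if is_relevant)

precision = len(relevant_retrieved) / len(retrieved)
recall = len(relevant_retrieved) / relevant_total

print("retrieved:", retrieved)
print("precision:", precision, "recall:", recall)
```

In this toy set, word order is ignored, so the two documents where Obama is the one being criticized are retrieved (hurting precision), while the "rejected" document is missed (hurting recall).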

He then went on to discuss what technologies PowerSet is built on: Lexical Functional Grammar, Transfer/Glue Semantics, finite-state morphology - technologies to learn/detect structure from free-form documents. Crowd-sourced, structured datastreams are also used - Freebase and Wikipedia come to mind. By acquiring content (indexing terms with lexical/semantic information) and packing ambiguity into compact states, a new generation of software is ready to improve a wide range of tasks: summarization, question answering, translation, entity extraction... As for search, with less than 1 second in query response time, and less than 1 second per sentence in indexing time, he thinks natural language search is ready for prime time. Or at least, ready to convince investors to put in more money.

A handful of my co-workers also went to the talk. Most of them were underwhelmed. "Sure it was a good talk, but I heard it two years ago," said one. "I tried something similar to his example and it didn't work," said another. To be fair, PowerSet has been over-hyped - some thought it would be a Google Killer. And by only indexing Wikipedia, it would not be far-fetched to say they under-delivered. However, in terms of bringing innovation to the table, and trying to improve the search experience, I like their efforts. I won't put money on them yet, but I'll be watching.