Wednesday, May 02, 2012

Google News + Google Translate could be much better

Google News is one of many Google technologies that I love. Every day I check both the US version and the HK version. Google Translate is another fantastic service, even if the lack of a free API strikes me as a bit shortsighted.

Imagine my disappointment when Google News US clustered a Chinese article poorly. Click through the image below to see the grey "translate" button next to the text "From China".



That article is about a Chinese badminton player, Chen Jin, getting the last spot at the London Olympics. The news cluster is about activist Chen Guangcheng, a blind, self-taught human rights lawyer, escaping house arrest and getting to the U.S. embassy. (He has since left the U.S. embassy.)

Misclassified news articles do appear fairly frequently in Google News. Had the misclassified article been in English, I would probably have just ignored it.

Instead, I took a relevant news article and the Wikipedia entry on Chen Guangcheng as a stand-in for the cluster, and compared them against two translated articles. One translated article is the one listed by Google News US, about Chen Jin, the badminton player. The other is an article about Chen Guangcheng, from Google News HK. Basic word frequency shows clearly that Google News US made a mistake. I leave it as an exercise for the reader to sum up the total percentage of word hits and misses in both cases. I imagine that weighting words by their information content (e.g. IDF) and taking phrases into account would make it even more obvious. Behavior-based learning from bilingual users should help too.
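For anyone who wants to try the same comparison, here is a minimal Python sketch of the word-frequency idea. It is not the script that produced the tables in this post: the tokenizer, the tiny stop-word list, and the function names `tokenize` and `compare` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the list behind the tables).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "at", "for", "on"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def compare(article_text, cluster_text):
    """Split the article's words into those seen in the cluster and those not.

    Each row is (word, count_in_article, count_in_cluster,
                 percent_of_article, percent_of_cluster).
    """
    article = Counter(tokenize(article_text))
    cluster = Counter(tokenize(cluster_text))
    # Guard against empty input so the percentages never divide by zero.
    article_total = sum(article.values()) or 1
    cluster_total = sum(cluster.values()) or 1

    rows = []
    for word, a_count in article.most_common():
        c_count = cluster[word]  # Counter returns 0 for unseen words
        rows.append((word, a_count, c_count,
                     100.0 * a_count / article_total,
                     100.0 * c_count / cluster_total))

    hits = [row for row in rows if row[2] > 0]
    misses = [row for row in rows if row[2] == 0]
    return hits, misses


if __name__ == "__main__":
    # Made-up one-line texts, only to show the shape of the output.
    hits, misses = compare(
        "Chen wins badminton seat for London",
        "Chen escapes to embassy in Beijing, Chen says",
    )
    for word, a, c, a_pct, c_pct in hits + misses:
        print(f"{word:>12} {a} {c:3d} {a_pct:.2f} {c_pct:.2f}")
```

An article that belongs in the cluster should put most of its frequent words in the hits list; a misclassified one should put most of them in the misses list.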

[In the tables below, the columns are, from left to right: word, frequency in article, frequency in cluster, percentage in article, percentage in cluster.]

Frequently used words in translated article on Chen Jin that appear in cluster


chinese 7  66 1.79 0.59
beijing 3  17 0.77 0.15
   chen 3 133 0.77 1.20
  china 2  46 0.51 0.41
     's 1  64 0.26 0.58
   also 1   9 0.26 0.08
    may 1  14 0.26 0.13
   news 1  17 0.26 0.15
 people 1  14 0.26 0.13
   said 1  12 0.26 0.11


Frequently used words in translated article on Chen Jin that do not appear in cluster


         london 8 0 2.05 0.00
          seats 8 0 2.05 0.00
        olympic 6 0 1.53 0.00
      badminton 5 0 1.28 0.00
       olympics 5 0 1.28 0.00
            six 5 0 1.28 0.00
          games 4 0 1.02 0.00
           gold 4 0 1.02 0.00
        project 4 0 1.02 0.00
     advantages 3 0 0.77 0.00
       capacity 3 0 0.77 0.00
         medals 3 0 0.77 0.00
       shooting 3 0 0.77 0.00
      strengths 3 0 0.77 0.00
            win 3 0 0.77 0.00
        ability 2 0 0.51 0.00
     delegation 2 0 0.51 0.00
         diving 2 0 0.51 0.00
    eligibility 2 0 0.51 0.00
            jin 2 0 0.51 0.00
          medal 2 0 0.51 0.00
          order 2 0 0.51 0.00
       projects 2 0 0.51 0.00
        qualify 2 0 0.51 0.00
          rifle 2 0 0.51 0.00
          skeet 2 0 0.51 0.00
          table 2 0 0.51 0.00
         tennis 2 0 0.51 0.00
  weightlifting 2 0 0.51 0.00
        achieve 1 0 0.26 0.00
       addition 1 0 0.26 0.00
           army 1 0 0.26 0.00
        aroused 1 0 0.26 0.00
         become 1 0 0.26 0.00
            big 1 0 0.26 0.00
            bus 1 0 0.26 0.00
         caught 1 0 0.26 0.00
            cnr 1 0 0.26 0.00
...
In contrast, here are the same tables for a translated article on Chen Guangcheng.


Frequently used words in translated article on Chen Guangcheng that appear in cluster


      chen 14 133 2.57 1.20
guangcheng 11  29 2.02 0.26
        hu  9   8 1.65 0.07
   embassy  8  15 1.47 0.13
   beijing  6  17 1.10 0.15
      u.s.  6  20 1.10 0.18
        's  5  64 0.92 0.58
  activist  3  34 0.55 0.31
     china  3  46 0.55 0.41
    escape  3  31 0.55 0.28
      name  3   7 0.55 0.06
    rights  3  29 0.55 0.26
  security  3   7 0.55 0.06
government  2   7 0.37 0.06
     human  2  17 0.37 0.15
      said  2  12 0.37 0.11
      also  1   9 0.18 0.08
    arrest  1  19 0.18 0.17
     blind  1  30 0.18 0.27
      case  1   8 0.18 0.07
    family  1  16 0.18 0.14
   foreign  1   7 0.18 0.06
      free  1   7 0.18 0.06
     house  1  24 0.18 0.22
    lawyer  1  10 0.18 0.09
     legal  1   9 0.18 0.08
     linyi  1   7 0.18 0.06
      news  1  17 0.18 0.15
    people  1  14 0.18 0.13
    return  1  25 0.18 0.22
     state  1   9 0.18 0.08
     times  1  15 0.18 0.13
      wife  1   9 0.18 0.08
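As a partial answer to the exercise, the sketch below adds up the percentage-in-article column of the two "appear in cluster" tables. The numbers are copied from the rows listed in this post; the variable names are made up for illustration. The "do not appear in cluster" tables are cut off, so the miss totals cannot be computed from what is shown here.

```python
# (word, percentage_in_article) rows copied from the Chen Jin "appear in cluster" table.
chen_jin_hits = [
    ("chinese", 1.79), ("beijing", 0.77), ("chen", 0.77), ("china", 0.51),
    ("'s", 0.26), ("also", 0.26), ("may", 0.26), ("news", 0.26),
    ("people", 0.26), ("said", 0.26),
]

# Same column from the Chen Guangcheng "appear in cluster" table.
chen_guangcheng_hits = [
    ("chen", 2.57), ("guangcheng", 2.02), ("hu", 1.65), ("embassy", 1.47),
    ("beijing", 1.10), ("u.s.", 1.10), ("'s", 0.92), ("activist", 0.55),
    ("china", 0.55), ("escape", 0.55), ("name", 0.55), ("rights", 0.55),
    ("security", 0.55), ("government", 0.37), ("human", 0.37), ("said", 0.37),
    ("also", 0.18), ("arrest", 0.18), ("blind", 0.18), ("case", 0.18),
    ("family", 0.18), ("foreign", 0.18), ("free", 0.18), ("house", 0.18),
    ("lawyer", 0.18), ("legal", 0.18), ("linyi", 0.18), ("news", 0.18),
    ("people", 0.18), ("return", 0.18), ("state", 0.18), ("times", 0.18),
    ("wife", 0.18),
]


def total_hit_percentage(rows):
    """Sum the share of the article's words that also occur in the cluster."""
    return round(sum(pct for _, pct in rows), 2)


chen_jin_total = total_hit_percentage(chen_jin_hits)
chen_guangcheng_total = total_hit_percentage(chen_guangcheng_hits)

print(f"Chen Jin article:        {chen_jin_total:.2f}% of words hit the cluster")
print(f"Chen Guangcheng article: {chen_guangcheng_total:.2f}% of words hit the cluster")
# Chen Jin article:         5.40% of words hit the cluster
# Chen Guangcheng article: 18.30% of words hit the cluster
```

Going by the listed rows alone, the article that belongs in the cluster shares more than three times as much of its text with the cluster as the badminton article does, and most of the badminton article's hits are generic words such as "chinese", "beijing" and "china".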


Frequently used words in translated article on Chen Guangcheng that do not appear in cluster


        mainland 7 0 1.28 0.00
         affairs 3 0 0.55 0.00
       political 3 0 0.55 0.00
          states 3 0 0.55 0.00
       yesterday 3 0 0.55 0.00
           enter 2 0 0.37 0.00
         evening 2 0 0.37 0.00
            hide 2 0 0.37 0.00
        informed 2 0 0.37 0.00
     questioning 2 0 0.37 0.00
          screen 2 0 0.37 0.00
         station 2 0 0.37 0.00
      2012-04-30 1 0 0.18 0.00
        addition 1 0 0.18 0.00
           armed 1 0 0.18 0.00
             ask 1 0 0.18 0.00
          assist 1 0 0.18 0.00
           audio 1 0 0.18 0.00
         butcher 1 0 0.18 0.00
         capture 1 0 0.18 0.00
           chase 1 0 0.18 0.00
         chasing 1 0 0.18 0.00
           clock 1 0 0.18 0.00