Pattern Recognition

Blog


    October 05

    Dr. Z interviewed me for ARCast.tv

    No, not the Daimler AG boss, the Microsoft Architect Evangelist Zhiming Xue. We talked about how to approach web application internationalization and what is involved in the process: Re-architecting Applications for Internationalization
     
    Thanks Dr. Z!
    September 10

    Cross-border online buyers

    Online buyers don't seem to hesitate to buy goods across borders according to a new Forrester Research report - via TechFlash. That is if language and culture are reasonably close ... or localized. I would suspect the threshold to buying is even lower for digital goods as there is no delivery delay.
    April 10

    Consulting site updates

    I added quite a bit of information to my site Achim Ruopp Internationalization Consulting:
    Now stay tuned for the video of the interview I recently recorded - this site is becoming multimedia-enabled! ;-)
    May 13

    Unicode won on the web!

    According to the official Google blog, Unicode, namely UTF-8, became the most frequent encoding for content on the web last December. Congratulations Unicode! It has been a long, hard road.

    ACL 2008 conference paper: "Applying Morphology Generation Models to Machine Translation"

    I am currently working with Kristina Toutanova and Hisami Suzuki at the Microsoft Research Machine Translation group. I gathered data for them for an upcoming ACL 2008 conference paper and they were nice enough to add me to the author list. You can download the paper here (note that it is copyrighted by Microsoft and the Association for Computational Linguistics).
    December 09

    My first conference paper

    I was excited to hear that the paper "Finding parallel texts on the web using cross-language information retrieval" authored by myself and Fei Xia was accepted for the cross-language information access workshop at the IJCNLP 2008 conference! I won't go to Hyderabad myself, but hope our results will help other researchers build parallel corpora. And if you are not a language researcher: "parallel corpora" is just a fancy term for translated texts in two languages (preferably lots of them).

    Perl script to detect encoding of one or multiple text files

    Recently I had the need to do a very common task again: identify the character encoding of a bunch of text files in a directory. The tedious way to do this is to open each file in an editor that detects the encoding and then see what encoding was identified (e.g. opening it in Notepad and choosing File/Save As ...). Of course one would think that there are command line tools available online that achieve just that on a number of files without opening them manually. My searches always turn up descriptions of how character encoding detection is done in browsers, or library functions for the purpose. But no command line tools!? Finally I got fed up and took some time to put together a short Perl script:
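    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Encode::Guess;
    
    if (@ARGV < 1)
    {
        print "Usage: perl $0 <file or glob pattern> ...\n";
        exit 1;
    }
    
    # Expand patterns ourselves so the script also works in shells
    # that do not glob (e.g. cmd.exe on Windows).
    my @files = map { glob($_) } @ARGV;
    
    foreach my $file (@files)
    {
        my $fh;
        unless (open($fh, '<', $file))
        {
            print "Cannot open file $file: $!\n";
            next;
        }
        binmode($fh);
    
        # Only the first 500 bytes are used for the guess.
        if (read($fh, my $filestart, 500))
        {
            my $enc = guess_encoding($filestart);
            if (ref($enc))
            {
                print "$file:\t", $enc->name, "\n";
            }
            else
            {
                # On failure guess_encoding returns an error string, not an object.
                print "Encoding of file $file can't be guessed\n";
            }
        }
        else
        {
            print "Cannot read from file $file\n";
        }
        close($fh);
    }
    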
    
    This assumes you have Perl installed (tested on v5.8.8) including the Encode::Guess module. This is not very sophisticated (see the documentation for the module), but is sufficient for most jobs, e.g. quickly finding out in what encoding somebody has saved some files or if the files are all in a consistent encoding.
    August 21

    How to be good (as a company)

    My company's mission is to "Enable people to use their language and cultural conventions on the web, while allowing them to bridge cultural and language borders." This is what the company does, but lately I have been realizing that it also matters a whole lot how a company does things. I am talking about values here, not the practical execution, although the latter should be guided by the former.
     
    What should my company's values be? I'm sure I can find an infinite number of value statements on the web. Now, some smart people at Google thought about the same thing and came up with the phrase "Don't be evil". According to the Wikipedia entry, Larry Page said: "I think it's much better than Be Good or something. When you are making decisions, it causes you to think. I think that's good." Sounds like a smart approach, doesn't it? But see where it got them with regard to their China policy and the many other decisions that were weighed against this rather undefined motto! (Yes, a more detailed explanation of their philosophy is available, but they are often judged by the short phrase alone.) Thinking after the fact leads to relativism.
     
    Another thing that is important to me is to build a social enterprise. The best guidance that I found for this is from somebody who built one of the most exemplary social businesses of the last 100 years - Nobel laureate Muhammad Yunus. The key phrase for me in his guest commentary for the latest G8 summit is "Many of the problems in the world today, including poverty, persist because of a too narrow interpretation of capitalism." Right! Where is it written that companies have to exclusively subscribe to the profit motive? Looking beyond profits also helps to avoid problems like this.
     
    So after all this laying of the foundation, here are the values my company should embody:
    1. For profit. Services and software that add value for the customers are to be sold for profit (otherwise I'd make it a non-profit)
    2. Open. Where reasonable, source and data should be open. What do I mean by reasonable? With regard to source code, my goal is to use open source as much as possible and contribute improvements back to the community. I think open source works particularly well for infrastructure that is useful for everybody (e.g. operating systems, general algorithms like the ones you can find in an algorithm book, algorithms that implement standards). Algorithms that implement the "secret sauce" of a company's main idea in an application are a different matter: it is not reasonable to open-source them (see 1.), they do not necessarily benefit from being open-sourced, and they rarely benefit the community (often they are too specialized). I guess in this approach I'm not too far from Google's.
      The same approach applies to data - data gathered from public sources or raw data provided by the user should be openly accessible (user data of course only to the user who provided it). Data derived from this data by the "secret sauce" algorithms however, should not be openly accessible.
    3. Green. Any software my company provides should be as power efficient as possible while still reaching the stated purpose. The operations of the company must be run in a sustainable way (reduce, reuse, recycle).
    4. Secure. Software needs to follow commonly accepted industry security standards. Data needs to be kept secure to standards.
    5. Human Rights. Any projects that the company participates in must respect human rights. It shouldn't be necessary to state this, but I'm afraid nowadays it is. (I will contribute to open source projects that could potentially be used to violate these rights. The open-source licenses unfortunately don't usually address this issue.)
    May 12

    Web 2.0 Expo 2007 presentation and demo now available

    If you didn't make it to the Web 2.0 Expo in San Francisco this year, the presentation is up on the expo website and my consulting web site. What took me a while longer was porting the Perl demo code, which I ran on Windows at the expo, to Un*x. It is now available here.
     
    One issue I had porting the code was the huge difference between locale identifiers under Windows and Un*x - the latter uses 2-letter language and country identifiers whereas the former uses 3-letter identifiers or complete names. This is definitely something that affects the cross-platform portability of internationalized script language code.
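    To make the mismatch concrete, here is a minimal Python sketch that maps a few Windows-style locale names to their Un*x counterparts. The table is a small hand-written illustration, not an exhaustive or authoritative list, and the helper name to_posix_locale is made up for this example.

```python
# Illustrative only: a tiny hand-maintained table, not a complete mapping.
WINDOWS_TO_POSIX = {
    "english_united states": "en_US",
    "enu": "en_US",
    "german_germany": "de_DE",
    "deu": "de_DE",
    "french_france": "fr_FR",
    "fra": "fr_FR",
}


def to_posix_locale(windows_name: str) -> str:
    """Translate a Windows-style locale name into a Un*x-style one.

    A trailing code page such as ".1252" is dropped before the lookup.
    Unknown names raise KeyError so that gaps in the table are noticed.
    """
    base = windows_name.split(".", 1)[0].strip().lower()
    return WINDOWS_TO_POSIX[base]


if __name__ == "__main__":
    for name in ("German_Germany.1252", "ENU", "French_France"):
        print(name, "->", to_posix_locale(name))
```

    Raising an error for unknown names is a deliberate choice in this sketch: silently falling back to a default locale tends to hide exactly the portability problems described above.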
     
    The other issue was moving from the module Locale::Maketext::Gettext to Locale::gettext. Not a big deal.
     
    Why isn't the application live on my consulting website? Hm, my web host is still running the very old Perl 5.6.1 and I could not find some of the modules I need that run on this version of Perl. Bummer. Time to switch web hosts or upgrade to a virtual or dedicated server?
    March 27

    Talk: Making Cents of Yens and Euros: Web 2.0 Internationalization

    At O'Reilly's Web 2.0 Expo next month I'll be presenting a talk on internationalization issues in a range of technologies that are considered "Web 2.0". What's exciting for me is to cover how this new programmable web - and for me programmability is what Web 2.0 is mainly about - can be created in a way that makes it accessible to users in many countries. Up to now, many Web 2.0 applications that are popular in the United States are either not available in international versions or become available only after a long time lag.
    Yes, this is due to the fact that many of the apps are created by startups for which it is hard enough to get the apps promoted and used in one country. And yes, many of them are also quite culture-specific. But as with other software, there is no reason to assume that they lack a common basis that can be leveraged to build big enough user bases in a range of countries. See for example Orkut in Brazil.
    Ideally these apps even allow people to communicate or trade across language and cultural borders.
    March 07

    Speedy UTF-8 to UTF-16 conversion from up north

    Via Rick Jelliffe comes a pointer to a new u8u16 library developed by Rob Cameron at Simon Fraser University, that promises a significant speed-up of UTF-8 to UTF-16 conversion using SIMD instructions. A really interesting concept, particularly useful for UTF-16 based platforms (.NET, Windows, Java) when dealing with web data (which is mostly in UTF-8). Of course the conversion from UTF-16 to UTF-8 needs to be covered too and maybe the concept could be expanded to non-algorithmic conversions for legacy encodings.
    One question in my mind is whether the speed-up addresses a significant bottleneck. I have never heard of this conversion being a performance issue, and I guess one has to have a lot of data to convert, or frequent conversions, before the speed-up makes a difference in a larger application.
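    For reference, the conversion in question is functionally just a decode followed by an encode. A minimal Python sketch using the built-in codecs follows; it shows the plain conversion in both directions only, not the SIMD technique of u8u16, and the function names are made up for this example.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 bytes to UTF-16 (little-endian, no byte order mark)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 little-endian bytes (no BOM) back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    sample = "Grüße, 世界".encode("utf-8")
    converted = utf8_to_utf16(sample)
    # Round trip to check that nothing was lost on the way.
    assert utf16_to_utf8(converted) == sample
    print(len(sample), "UTF-8 bytes ->", len(converted), "UTF-16 bytes")
```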
    February 14

    The operating system knows best - coarse grain concurrency using pipes

    In my previous post I wrote about how the current languages (and runtimes) really have a hard time exploiting the benefits that the new multicore processors bring. This previous post mainly dealt with exploiting concurrency on a fine grain level (e.g. loops).
    I don't know exactly what triggered my realization that there is an old-school mechanism in both Windows and Un*x that enables many scenarios of coarse grain concurrency: anonymous pipes. It was either the announcement of Yahoo! Pipes or my reluctance to rewrite a bunch of scripts that interact heavily via command line IO.
    How do anonymous pipes enable coarse grain concurrency? Say we have a problem that requires some multi-step sequential manipulation/filtering of some kind of data. One way to attack this would be to code this sequence into the main routine of the program in your favorite programming language and pass the data around in programming language structures or references to them. Works fine, except that we'd have a hard time getting any of these steps to run concurrently on a multicore processor with today's languages/runtimes.
    The better way with regard to concurrency would be to split up the steps into their own little programs, which receive their input via STDIN and output the result of their processing via STDOUT. To execute all steps in sequence you just string them together with anonymous pipes:
    s1|s2|s3|...|sn
    There are many benefits:
    • All steps create their own processes and get scheduled according to their resource needs by the operating system.
    • The steps can be written in different programming languages.
    • Steps can be easily exchanged without recompiling.
    • Additional filtering steps can be inserted into the chain (provided they obey the data formats).
    Of course there are a lot of things to be cautious about:
    • Do the steps warrant creating individual processes? (nowadays process creation seems to be relatively cheap)
    • Can the data be serialized efficiently into a byte stream? (Note that I don't say character stream here - character encoding in a command shell is a topic for another post - see this older post of mine on this topic).
    • Do the performance benefits of exploiting the multiple cores outweigh the additional overhead of process creation and data serialization?
    • Does the OS do a better job of using the resources than my code can do? (Considering the OS developers spent years on optimizing this the bet is against you)
    • Debugging could be harder.
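    The chained-steps idea above can also be driven from a script instead of a shell. The following is a minimal Python sketch under stated assumptions: the two filter steps are hypothetical one-off programs invented for this example, and run_pipeline is a made-up helper name. Each step runs as its own process, so the operating system is free to schedule them on different cores.

```python
import subprocess
import sys

# Two hypothetical filter steps, each a tiny stand-alone program that reads
# bytes from STDIN and writes bytes to STDOUT.
STEP_KEEP_ERRORS = (
    "import sys\n"
    "for line in sys.stdin.buffer:\n"
    "    if b'error' in line:\n"
    "        sys.stdout.buffer.write(line)\n"
)
STEP_UPPERCASE = (
    "import sys\n"
    "for line in sys.stdin.buffer:\n"
    "    sys.stdout.buffer.write(line.upper())\n"
)


def run_pipeline(steps, data: bytes) -> bytes:
    """Run s1|s2|...|sn as separate processes connected by anonymous pipes."""
    procs = []
    for code in steps:
        # The first step reads from a pipe we write to; every later step
        # reads directly from the previous step's STDOUT.
        stdin = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(
            [sys.executable, "-c", code],
            stdin=stdin,
            stdout=subprocess.PIPE,
        )
        if procs:
            # Close our copy so the earlier step sees a broken pipe if the
            # later one exits first.
            procs[-1].stdout.close()
        procs.append(proc)

    # Fine for small inputs; large inputs should be fed from a separate
    # thread so that full pipe buffers cannot block this process.
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    result = procs[-1].stdout.read()
    procs[-1].stdout.close()
    for proc in procs:
        proc.wait()
    return result


if __name__ == "__main__":
    log = b"ok\nerror one\nfine\nerror two\n"
    print(run_pipeline([STEP_KEEP_ERRORS, STEP_UPPERCASE], log))
```

    Note that the steps exchange bytes, not characters, which sidesteps the shell encoding question raised in the list above.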

    The Lambda the Ultimate blog had a good discussion on this and the relation to functional programming concepts a little over a year ago.

     

    October 17

    Multicore processors or how to choose a programming language for the next 5 years

    I love the open source scripting languages: Perl, Python, PHP and Ruby -  the P in LAMP (ok, they have to rename the last one). They are easy to pick up, support different programming paradigms, are available for many platforms, have extensive libraries, have great communities driving them forward, are free (as in beer and freedom) ... I could go on and on.
     
    All of them are in different phases of growing up. Perl 6, Python 3000, Ruby 2.0 all promise bigger and better things in terms of language design and functionality. But (judging from my web searches) not many people talk about how the next revolution in programming - multicore computing - affects the language runtimes (I know, big words, but bear with me).
     
    The battle between PC hardware companies is heating up again: ads for multicore processors are running on TV during primetime. What happens though, when you run a Perl script containing code like this on one of these shiny new multicore machines?
    for($i = 0 ; $i < 1000000 ; $i++)
    {
        $array[$i] = some_expensive_function($array[$i]);
    }
    One of the cores is awfully busy while the others are idly sitting around! Assuming the function some_expensive_function has no side effects, the task could easily be split up among the different processor cores.
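    As a minimal sketch of what splitting up this loop by hand could look like, here is a Python version that hands the calls to a pool of worker processes from the standard library. The body of some_expensive_function is a stand-in and parallel_map is a made-up helper name; the chunk size of 1000 is an arbitrary assumption, not a tuned value.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def some_expensive_function(x):
    """Stand-in for real work; it must be free of side effects."""
    return sum(i * i for i in range(x % 100)) + x


def parallel_map(func, items, workers=None, start_method=None):
    """Apply func to every item, spreading the calls over several processes.

    func must be picklable (for example a module-level function).
    Results come back in the same order as the inputs.
    start_method=None uses the platform's default way of starting workers.
    """
    context = multiprocessing.get_context(start_method)
    with ProcessPoolExecutor(max_workers=workers, mp_context=context) as pool:
        # chunksize batches the items so that inter-process overhead
        # does not swamp the actual work.
        return list(pool.map(func, items, chunksize=1000))
```

    A script would call this as array = parallel_map(some_expensive_function, array), placed under an if __name__ == "__main__": guard so that worker processes which re-import the script do not start pools of their own. The point of the sketch is that the parallelism is explicit: nothing here happens automatically.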
     
    I hear you saying: "But yes, of course the script has to be multi-thread enabled to make use of all the cores." However, this requires additional, non-trivial work - as Herb Sutter says in his excellent 2005 article: "The free lunch is over". Herb urges everybody to brush up on their skills in writing multi-threaded applications. He says: "Implicitly parallelizing compilers can help a little, but don’t expect much; they can’t do nearly as good a job of parallelizing your sequential program as you could do by turning it into an explicitly parallel and threaded version."
     
    This is one way to approach the problem - what I would call "handcrafting" your parallelism. I'm sure you can get very well performing applications out of this; applications like the video encoders used to measure multicore performance are already multi-thread enabled today.
     
    That is if you can get this handcrafting right - the web is full of tales of multi-threaded programming gone bad. Applications like this are notoriously hard to debug.
     
    Is there a better solution? If I have to (re-)learn concepts is there one that deals with this problem a little more elegantly?
     
    Turns out there is: functional programming. From the Wikipedia entry: "Disallowing side effects provides for referential transparency, which makes it easier to verify, optimize, and parallelize programs, and easier to write automated tools to perform those tasks".
     
    Excellent - so I just have to adopt the functional programming constructs available in Perl, Python and Ruby and the things that are parallelizable will be parallelized automatically for me? Wishful thinking for now, unfortunately. I couldn't find any info that any of the present runtimes are thread-aware (not just thread-safe), especially for functional programming constructs.
     
    What to do? Wait for the new Parrot, YARV, CPython runtimes? Possibly contribute there? Judging from the Perl 6 history this could take a while.
     
    Use one of the functional programming languages like Erlang or F#? Certainly attractive from a learning point of view, but I would always have to trade off at least one of the advantages of the P languages mentioned at the start of the post.
     
    Fortunately there seems to be a way out: Python. For Python, unlike for Perl and Ruby, there are multiple runtimes, among them the Java VM and the .NET runtime. There is a strong motivation for Sun and Microsoft to make the bytecode of these runtimes work as fast as possible on multicore machines. All we need now is for IronPython and Jython to analyze the functional constructs and parallelize them automatically if possible.
     
    For the really tricky performance bottlenecks there will be C/C++ extensions using the tried and tested OpenMP.
     
    Natural language processing requires a lot of computing power. Multicore machines promise to make this available.
     
    Update: Just found out about RubyCLR. Along with JRuby this will make it possible to target .NET and Java with Ruby. Both seem to be less mature than IronPython and Jython, though.
    August 04

    Google to release 5-gram language model

    Google will release their 5-gram language model trained on a corpus of about 1 trillion words. Wow! What I would like to know is whether this covers only English or other languages as well.
    They say you can use it no matter how small your computing resources are - that can be debated. In fact I think that infrastructure will become more important in CompLing. Flat file processing almost seems to be an obsession in our field, stemming from the Unix culture, but at some point this doesn't cut it anymore, especially for applications like online learning. The infrastructure needn't be Google-sized, but I think good old relational databases can help.
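    To illustrate the relational-database idea on a toy scale, here is a minimal Python sketch that counts n-grams and keeps them in an SQLite table instead of a flat file. The table layout and the function names are assumptions made for this example; they say nothing about how Google's data is actually distributed.

```python
import sqlite3
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def store_counts(conn, tokens, n=5):
    """Count the n-grams in tokens and store them in the table 'ngram'.

    Assumes the table is empty or absent; merging counts from several
    calls is left out to keep the sketch short.
    """
    counts = Counter(" ".join(gram) for gram in ngrams(tokens, n))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngram (gram TEXT PRIMARY KEY, freq INTEGER)"
    )
    conn.executemany("INSERT INTO ngram (gram, freq) VALUES (?, ?)", counts.items())
    conn.commit()


def lookup(conn, gram):
    """Return the stored frequency of gram, or 0 if it was never seen."""
    row = conn.execute("SELECT freq FROM ngram WHERE gram = ?", (gram,)).fetchone()
    return row[0] if row else 0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    text = "the cat sat on the mat and the cat sat on the rug"
    store_counts(conn, text.split(), n=3)
    print(lookup(conn, "the cat sat"))
```

    The primary key gives indexed lookups by n-gram, which is the part that flat files make awkward once the data no longer fits in memory.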
     
    Update: Unfortunately this n-gram model is only available for English right now.
    March 28

    G/localization talk at ETech 2006

    danah boyd posted a transcript of the talk she gave at ETech 2006. It contains very interesting observations about what forms cultures in online (and offline) communities and what motivates people to participate in them. A couple of remarks on her sections on language and machine translation:
    • it is true that users are in control and have the choice to participate/not participate - one reason why controlled languages likely won't work in this context
    • online communities need a balance of the opportunity to create subcultures but also provide a degree of leakage so that people can get to know "familiar strangers" - an interesting question here would be how to create this leakage across multiple languages and of course if this makes sense
    • machine translation doesn't have to encode the cultural conventions of the developer - especially statistical machine translation systems are quite language-independent
    January 19

    Software sucks

    A dense but interesting essay by Jaron Lanier on the brittleness of software and what it means for its economics. Most interesting for me is his take on comparing computer and natural languages: "The degree to which human, or "natural" language is unlike computer code cannot be overemphasized. Language can only be understood by the means of interpretation, so ambiguity is central to its character, and is properly understood as a strength rather than a weakness. Perfect precision would rob language of its robustness and potential for adaptation. Human language is not a phenomenon which is well understood by either science or philosophy, and it has not been reproduced by technologies." Well, we are working on the latter. He does bring up the interesting question, though, of how the inherent imperfection and ambiguity of language can be part of the solution rather than part of the problem in NLP applications. How can we leverage these properties to build robust NLP applications?
    November 22

    Darpa Grand Challenge

    Just a case in point for what I wrote in my last post: Spiegel magazine has an article (in German) about the Darpa Grand Challenge, where more civilian oriented technology beat out more military oriented technology.

    November 19

    Integrated speech recognition and MT for Iraq war

    Via Wired News: War-Zone Test for Babel-Fish Tool. This seems to be a good field test for integrating compling components to create something like a babel fish, but for me this still falls in the "shoot first, ask questions later" category. I do not believe war is or should be the mother of invention. We need to find ways to use these technologies to avoid conflicts.

    November 18

    R is for statistics

    An article over on O'Reilly Net about The R Project for Statistical Computing. Seems interesting as a visualization tool for some of the statistics we do in our projects in class.

    Turing's Cathedral

    Via Tim O'Reilly I found a reference to a recent George Dyson talk on Turing's Cathedral. It talks about how von Neumann's model in computing is going to be displaced by Google/search engines because "All the answers in the known universe are there, and some very ingenious algorithms are in place to map them to questions that people ask." and this is going to help avoid having to have addressing and make computing more robust.

    I don't quite see this yet - there is still a huge semantic gap between what is found by search engines and what a computer can use for automatic processing. Every search engine user does this semantic work when browsing through the search results and picking the links that most likely contain answers to what the user "meant" with the query.

    Which reminds me that I should probably get back to working on the search engine that is due as homework in two weeks ;-)

    Nevertheless a good read and source of an insightful quote from Turing regarding AI: "In attempting to construct such machines we should not be irreverently usurping His power of creating souls, any more than we are in the procreation of children, [...] Rather we are, in either case, instruments of His will providing mansions for the souls that He creates."