Pattern Recognition (Achim's blog)

Blog


    October 05

    Dr. Z interviewed me for ARCast.tv

    No, not the Daimler AG boss, but the Microsoft Architect Evangelist Zhiming Xue. We talked about how to approach web application internationalization and what the process involves: "Re-architecting Applications for Internationalization".
     
    Thanks Dr. Z!
    September 10

    Cross-border online buyers

    According to a new Forrester Research report (via TechFlash), online buyers don't seem to hesitate to buy goods across borders. That is, if language and culture are reasonably close ... or the offering is localized. I suspect the threshold for buying is even lower for digital goods, as there is no delivery delay.
    May 13

    Unicode won on the web!

    According to the official Google blog, Unicode, namely UTF-8, became the most frequent encoding for content on the web last December. Congratulations, Unicode! It has been a long, hard road.
    December 09

    Perl script to detect encoding of one or multiple text files

    Recently I again needed to do a very common task: identify the character encoding of a bunch of text files in a directory. The tedious way is to open each file in an editor that detects the encoding and check what it identified (e.g. opening it in Notepad and choosing File/Save As ...). One would think there are command line tools available online that do just that for a number of files without opening them manually. But my searches always turn up descriptions of how browsers detect character encodings, or library functions for the purpose. No command line tools!? Finally I got fed up and took some time to put together a short Perl script:
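    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Encode::Guess;
    
    # Expect exactly one argument: a file name or a glob pattern such as "*.txt".
    if (@ARGV != 1)
    {
        print "Usage: perl $0 <file or glob pattern to analyze>\n";
        exit 1;
    }
    
    # Expand the pattern here so it also works when the shell does not expand it.
    my @files = glob($ARGV[0]);
    
    foreach my $file (@files)
    {
        # Use a lexical filehandle and report files that cannot be opened.
        my $fh;
        unless (open($fh, '<', $file)) {
            print "Cannot open file $file: $!\n";
            next;
        }
        binmode($fh);
    
        # Only the first 500 bytes of each file are examined.
        if (read($fh, my $filestart, 500)) {
            # guess_encoding returns an object on success, an error string otherwise.
            my $enc = guess_encoding($filestart);
            if (ref($enc)) {
                print "$file:\t", $enc->name, "\n";
            }
            else {
                print "Encoding of file $file can't be guessed\n";
            }
        }
        else {
            print "Cannot read from file $file\n";
        }
    
        # Close each file inside the loop, not once at the end.
        close($fh);
    }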
    
    This assumes you have Perl installed (tested on v5.8.8), including the Encode::Guess module. The script is not very sophisticated (see the documentation for the module), but it is sufficient for most jobs, e.g. quickly finding out in what encoding somebody has saved some files, or whether a set of files uses a consistent encoding.
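    For readers without Perl at hand, the same idea can be sketched in Python. This is only an illustration, not what Encode::Guess does internally: it looks for a byte-order mark first and then tries ASCII and UTF-8, which roughly corresponds to the default checks the module documentation describes. The function name guess_from_bytes and the returned labels are made up for this sketch.

```python
import codecs

# Byte-order marks, with UTF-32 listed before UTF-16 because the
# UTF-32LE mark starts with the same two bytes as the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8 (with BOM)"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_from_bytes(data):
    """Return a label for the guessed encoding, or None if nothing fits.

    Hypothetical helper for illustration only.
    """
    # 1. A byte-order mark settles the question immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise try strict decoding, most restrictive encoding first.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # 3. Legacy 8-bit encodings cannot be told apart this way.
    return None


if __name__ == "__main__":
    print(guess_from_bytes(b"plain text"))
    print(guess_from_bytes("Grüße".encode("utf-8")))
    print(guess_from_bytes(b"\xe4\xf6\xfc"))  # Latin-1 bytes, not valid UTF-8
```

    As with the Perl script, a result of None only means that none of the few candidates matched; it says nothing about which legacy encoding was actually used.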
    May 12

    Web 2.0 Expo 2007 presentation and demo now available

    If you didn't make it to the Web 2.0 Expo in San Francisco this year, the presentation is up on the expo website and my consulting web site. What took me a while longer was porting the Perl demo code, which I ran on Windows at the expo, to Un*x. It is now available here.
     
    One issue I had porting the code was the huge difference between locale identifiers under Windows and Un*x - the latter uses 2-letter language and country identifiers, whereas the former uses 3-letter identifiers or complete names. This is definitely something that affects the cross-platform portability of internationalized script language code.
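    To make the difference concrete, here is a small Python sketch (not the demo code itself) of the kind of lookup a cross-platform script ends up needing. The table WINDOWS_TO_POSIX and the function to_posix_locale are hypothetical names, and the handful of entries is illustrative only; a real application would need a complete mapping.

```python
# Illustrative entries only: Windows-style names (full names or
# 3-letter abbreviations) mapped to Un*x-style language_COUNTRY codes.
WINDOWS_TO_POSIX = {
    "German_Germany": "de_DE",
    "French_France": "fr_FR",
    "Japanese_Japan": "ja_JP",
    "deu": "de_DE",
    "fra": "fr_FR",
    "jpn": "ja_JP",
}


def to_posix_locale(windows_name):
    """Translate a Windows-style locale name to a Un*x-style one.

    A trailing code page such as ".1252" is ignored. Returns None
    when the name is not in the (deliberately tiny) table.
    """
    base, _, _codepage = windows_name.partition(".")
    return WINDOWS_TO_POSIX.get(base)


if __name__ == "__main__":
    print(to_posix_locale("German_Germany.1252"))
    print(to_posix_locale("fra"))
```

    Keeping the mapping in one table means the rest of the script can work with a single style of identifier, whichever platform it runs on.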
     
    The other issue was moving from the module Locale::Maketext::Gettext to Locale::gettext. Not a big deal.
     
    Why isn't the application live on my consulting website? Hm, my web host is still running the very old Perl 5.6.1, and for some of the modules I need I could not find a release that runs on this version of Perl. Bummer. Time to switch web hosts or upgrade to a virtual or dedicated server?
    March 27

    Talk: Making Cents of Yens and Euros: Web 2.0 Internationalization

    At O'Reilly's Web 2.0 Expo next month I'll be presenting a talk on internationalization issues in a range of technologies that are considered "Web 2.0". What's exciting for me is to cover how this new programmable web - and for me programmability is what Web 2.0 is mainly about - can be created in a way that makes it accessible to users in many countries. Up to now, many Web 2.0 applications that are popular in the United States are not available in international versions, or only after a long time lag.
    Yes, this is due to the fact that many of the apps are created by startups for which it is hard enough to get the apps promoted and used in one country. And yes, many of them are also quite culture-specific. But as with other software, there is no reason to assume that they lack a common basis that can be leveraged to create big enough user bases in a range of countries. See, for example, Orkut in Brazil.
    Ideally these apps even allow people to communicate or trade across language and cultural borders.
    March 07

    Speedy UTF-8 to UTF-16 conversion from up north

    Via Rick Jelliffe comes a pointer to a new u8u16 library developed by Rob Cameron at Simon Fraser University, which promises a significant speed-up of UTF-8 to UTF-16 conversion using SIMD instructions. A really interesting concept, particularly useful for UTF-16 based platforms (.NET, Windows, Java) when dealing with web data (which is mostly in UTF-8). Of course the conversion from UTF-16 to UTF-8 needs to be covered too, and maybe the concept could be expanded to non-algorithmic conversions for legacy encodings.
    One question in my mind is whether the speed-up addresses a significant bottleneck. I have never heard of this conversion being a performance issue, and I guess one has to have a lot of data to convert, or frequent conversions, before the speed-up makes a difference in a larger application.
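    For reference, this is the conversion in question, sketched in plain Python. It uses the standard codecs and has nothing to do with the SIMD approach of u8u16; it only shows what goes in and what comes out. The function name utf8_to_utf16le and the sample string are made up for the illustration.

```python
def utf8_to_utf16le(data):
    """Transcode UTF-8 bytes to UTF-16 (little-endian, no BOM).

    Raises UnicodeDecodeError if the input is not valid UTF-8.
    """
    return data.decode("utf-8").encode("utf-16-le")


if __name__ == "__main__":
    sample = "Grüße, 世界 😀".encode("utf-8")
    converted = utf8_to_utf16le(sample)
    # ASCII grows from 1 to 2 bytes, CJK shrinks from 3 to 2, and the
    # emoji stays at 4 bytes (a surrogate pair in UTF-16).
    print(len(sample), len(converted))  # prints: 20 24
    # The conversion is lossless in both directions.
    assert converted.decode("utf-16-le").encode("utf-8") == sample
```

    Whether this step is ever the bottleneck can be checked by timing it on one's own data before reaching for a faster library.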
    March 28

    G/localization talk at ETech 2006

    danah boyd posted a transcript of the talk she gave at ETech 2006. It contains very interesting observations about what forms cultures in online (and offline) communities and what motivates people to participate in them. A couple of remarks on her sections on language and machine translation:
    • it is true that users are in control and have the choice to participate/not participate - one reason why controlled languages likely won't work in this context
    • online communities need to balance the opportunity to create subcultures with a degree of leakage, so that people can get to know "familiar strangers" - an interesting question here would be how to create this leakage across multiple languages and, of course, whether this makes sense
    • machine translation doesn't have to encode the cultural conventions of the developer - especially statistical machine translation systems are quite language-independent