December 09

My first conference paper

I was excited to hear that the paper "Finding parallel texts on the web using cross-language information retrieval", authored by myself and Fei Xia, was accepted for the cross-language information access workshop at the IJCNLP 2008 conference! I won't go to Hyderabad myself, but I hope our results will help other researchers build parallel corpora. And if you are not a language researcher: "parallel corpora" is just a fancy term for translated texts in two languages (preferably lots of them).

Perl script to detect encoding of one or multiple text files

Recently I needed to do a very common task again: identify the character encoding of a bunch of text files in a directory. The tedious way to do this is to open each file in an editor that detects the encoding and then check which encoding it identified (e.g. by opening the file in Notepad and choosing File/Save As ...). One would think that there are command line tools available online that do just that for a number of files without opening them manually. But my searches only ever turn up descriptions of how character encoding detection is done in browsers, or library functions for the purpose. No command line tools!? Finally I got fed up and took some time to put together a short Perl script.
The script assumes you have Perl installed (tested on v5.8.8), including the Encode::Guess module. The detection is not very sophisticated (see the documentation for the module), but it is sufficient for most jobs, e.g. quickly finding out in what encoding somebody has saved some files, or whether the files are all in a consistent encoding.
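
For illustration, here is a minimal sketch of what such a script can look like. It is not necessarily the original listing: the helper name detect_encoding is hypothetical, and the sketch relies only on the default suspects of Encode::Guess (ASCII, UTF-8, and UTF-16/32 with a byte order mark), so a Latin-1 file, for example, would be reported as unknown rather than identified.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;    # default suspects: ascii, utf8, UTF-16/32 with BOM

# Hypothetical helper: return the guessed encoding name for one file,
# or a diagnostic string if the file cannot be read or no guess is possible.
sub detect_encoding {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return "cannot open: $!";
    my $data = do { local $/; <$fh> };    # slurp the raw bytes
    close $fh;

    # guess_encoding returns an encoding object on success
    # and a plain error string otherwise.
    my $guess = guess_encoding($data);
    return ref $guess ? $guess->name : "unknown ($guess)";
}

# Print one "file<TAB>encoding" line per file named on the command line.
for my $file (@ARGV) {
    printf "%s\t%s\n", $file, detect_encoding($file);
}
```

Saved under a placeholder name such as detect_enc.pl, it could be run as `perl detect_enc.pl *.txt` in a shell that expands wildcards. Additional suspects can be passed to guess_encoding as extra arguments, but single-byte encodings such as Latin-1 tend to make the guess ambiguous, which is why the sketch sticks to the defaults.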