This document hasn't been maintained for many years now and is here only for reference.

UTF-8 Setup Mini HOWTO

This document covers the basic steps needed to edit/create UTF-8 encoded files on a Linux system. A few notes on editing Japanese are also included. Most of the information is based on my experience with software included in SuSE 9.1.

This document reflects my understanding of the subject. It may not be completely accurate and is not meant to be comprehensive. Comments are welcome.

Mike Fabian's "CJK Support in SuSE Linux" and Markus Kuhn's "UTF-8 and Unicode FAQ for Unix/Linux" are great resources. See links to these documents in the References section. Also, if you're not already familiar with the POSIX definition of i18n environment variables, you may want to read it before continuing.

Terminals, Editors and Screen

How can I get xterm to display UTF-8 characters correctly?

You can run xterm in UTF-8 mode with:

xterm -u8

Or if you prefer to use UTF-8 the majority of the time, you can put this line in your .Xresources file:

xterm*utf8: 1

If you specify this xterm resource, but then want to use an xterm in single-byte mode, you can start it with the +u8 option:

xterm +u8

If you use a UTF-8 enabled xterm, you probably want to make sure your locale is UTF-8 as well. For example, to switch your locale to Canadian English in UTF-8 mode, you would run (in bash):

export LANG=en_CA.UTF-8
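
To check which locale is actually in effect, keep in mind that POSIX gives LC_ALL precedence over LC_CTYPE, which in turn takes precedence over LANG. The following is a minimal sketch of that check. The function name locale_is_utf8 is made up for illustration, and it only inspects the variable values; it does not ask the C library whether the locale is installed.

```shell
# Hypothetical helper: succeeds if the effective character-type locale
# names a UTF-8 codeset. Precedence: LC_ALL, then LC_CTYPE, then LANG
# (empty values are treated the same as unset).
locale_is_utf8() {
    effective="${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
    case "$effective" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

if locale_is_utf8; then
    echo "UTF-8 locale in effect"
else
    echo "not a UTF-8 locale"
fi
```

If this reports a non-UTF-8 locale even though LANG is set as above, a stray LC_ALL or LC_CTYPE setting is the likely cause.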

You may also want to use a Unicode font for your xterm so as to be able to view more characters. Here are my xterm font resources:

xterm*font:     -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
xterm*wideFont: -misc-fixed-medium-r-normal-ja-18-120-100-100-c-180-iso10646-1

The wideFont resource is needed for languages such as Japanese. If a wideFont resource is not specified, xterm will try to use a font that is double the width of the regular font, but if this font does not exist, the Japanese characters will not display properly.

You can use xfontsel to choose a font, or use xlsfonts to get a listing of all the Unicode fonts installed on your system:

xlsfonts | grep iso10646-1 | less

I have installed the efont-unicode SuSE RPM to give me additional Unicode fonts to choose from.

If you would like to experiment with other terminals, try mlterm. I like it because it allows you to change the terminal encoding on the fly. You can do this with ctrl+<right-mouse-button>.

How do I get UTF-8 to display correctly under screen?

You can run screen in UTF-8 mode with:

screen -U

Make sure to use the UTF-8 option when reconnecting to a detached screen as well:

screen -Ux

How can I get vim to work with UTF-8 files?

Start vim with:

vim "+set encoding=utf-8"

or if you are already editing the file, you can set the encoding as follows:

:set enc=utf-8

If your terminal's encoding is different from the file's encoding, you may need to set the terminal encoding as well. For example, if you are using a non-UTF-8 xterm, but would like to edit a UTF-8 file containing characters in the Latin-1 range, you will need to set the terminal encoding to Latin-1.

:set tenc=latin-1

How can I get XEmacs to work with UTF-8 files?

If you want XEmacs to load UTF-8 files correctly, add the following lines to your ~/.xemacs/init.el:

(require 'un-define)
(set-coding-priority-list '(utf-8))
(set-coding-category-system 'utf-8 'utf-8)

Note that Emacs does not deal well with these additions, so if you also run Emacs, then adding the following will keep Emacs from complaining:

;; Are we running XEmacs or Emacs?
(defvar running-xemacs (string-match "XEmacs\\|Lucid" emacs-version))

...

(if (not running-xemacs) nil
  ;; enable Mule-UCS
  (require 'un-define)

  ;; by default xemacs does not autodetect Unicode
  (set-coding-priority-list '(utf-8))
  (set-coding-category-system 'utf-8 'utf-8))

These lines will get XEmacs to load UTF-8 files in UTF-8 mode (it will display a "u" in the bottom left corner of your status bar). If you have already loaded a file and would like to start inputting UTF-8, you can use C-x RET f to set the file coding system to UTF-8. You may also have to set the terminal coding system to UTF-8; this seems to be necessary, for example, when XEmacs is run in non-graphical mode inside a UTF-8 enabled xterm. You can set the terminal coding system using C-x RET t.

Caution: I have had problems with XEmacs double encoding in the case where 1) the file contains UTF-8, 2) the file is loaded in non-UTF-8 mode, 3) the user switches to UTF-8 mode (using C-x RET f), 4) enters some text, and 5) saves. In other words, if your file already contains UTF-8 characters, make sure that it is loaded in UTF-8 mode before editing it.
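
To see what double encoding looks like at the byte level, you can reproduce it outside the editor. This is only a sketch of the effect, not of what XEmacs does internally; it assumes iconv and od are installed and uses "á" (written as octal escapes so the example does not depend on your terminal encoding).

```shell
# "á" (U+00E1) encoded as UTF-8 is the two bytes c3 a1.
printf '\303\241' | od -An -tx1

# If those two bytes are misread as Latin-1 and converted to UTF-8 again,
# each byte becomes its own two-byte sequence: c3 83 c2 a1.
printf '\303\241' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

A file that shows sequences like c3 83 c2 a1 where a single accented character is expected has most likely been double encoded.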

Inputting Latin-1 and Japanese

Latin-1

How can I input Latin-1 characters under X?

First you need to define a Multi_key:

After activating the Multi_key, you can press Multi_key, a, ', for example, to get á.
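
One way to define a Multi_key is with xmodmap. The sketch below assumes you are willing to give up the Menu key for this purpose; that choice is mine for illustration, and any other spare key will do. The COMPOSE_KEYSYM and COMPOSE_EXPR variable names are likewise made up.

```shell
# Assumption: the Menu key is the one to turn into Multi_key (compose).
COMPOSE_KEYSYM='Menu'
COMPOSE_EXPR="keysym $COMPOSE_KEYSYM = Multi_key"

# Apply the mapping only when an X display and xmodmap are available.
if [ -n "${DISPLAY:-}" ] && command -v xmodmap >/dev/null 2>&1; then
    xmodmap -e "$COMPOSE_EXPR" || echo "xmodmap failed" >&2
fi
```

You can run xev and press the key you have in mind to find out which keysym it currently produces.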

Japanese

I generally use Canna as the conversion backend and kinput2 as the input server. See [2] for information on how to start these up. There are, however, newer input methods (such as SCIM, Anthy, etc.) that are worth experimenting with.

I recently tried SCIM for Japanese input. I found the automatic kana-kanji conversions to be quite poor. Almost every kanji it chose was not the one I wanted, and so I found I had to constantly press a number to choose the appropriate kanji. I have heard good things about it though, so perhaps there's a way to configure it to be smarter.

I also tried Anthy with XEmacs (japanese-anthy input method), and found it to be very similar to the japanese-canna input method.

How can I input Japanese when using vim?

How can I input Japanese when using XEmacs?

kinput2 is not needed to input Japanese in XEmacs.

If Anthy is installed on your system, it is also possible to set the input method in XEmacs to japanese-anthy. Anthy is an alternative conversion engine from the Heke project.

How can I input Japanese in a browser?

If you always use kinput2 as your input server, you can set XMODIFIERS to '@im=kinput2' in your .profile. Furthermore, setting LC_CTYPE is unnecessary if you are already using a Japanese environment.
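
The .profile entry could look like the following. It is written as a separate assignment and export so that it also works in shells other than bash.

```shell
# ~/.profile fragment: name kinput2 as the X input method server.
XMODIFIERS='@im=kinput2'
export XMODIFIERS
```

The setting takes effect for programs started after your next login.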

How do I change the encoding of a page in Conkeror to utf-8?

References

  1. UTF-8 and Unicode FAQ for Unix/Linux
  2. CJK Support in SuSE Linux
  3. POSIX definition of i18n environment variables


Marjan Parsa
Last modified: Sat Jan 16 00:21:36 EST 2010