University of Bristol | ILRT | IntDev blog

This is a blog from the Internet Development Team at ILRT, Bristol. We build websites and web applications for a wide variety of customers, many in the UK higher education sector.

Unicode and the web

Why bother?

If you create content for the web, you may have to transfer accented characters (or text in non-Roman scripts) onto a web page. But even if the text looks all right when viewed within your organisation, it may be illegible to the wider world. To avoid this, you need to know a little about character sets and Unicode.

The bad old days

In the 1990s, only a limited range of characters for writing languages other than English could be used in text displayed on the web. Some characters with accents (and cedillas, umlauts etc.) were available, but there was no simple way of displaying any others.

For the benefit of the computer, each possible character had a numeric equivalent. The ASCII codes 0-127 covered unadorned letters, numbers and punctuation; the codes 128-255, added by extensions to ASCII such as ISO 8859-1 (Latin-1), covered the available accented letters. So the HTML &#225; (using the code 225) produces the character á; the more memorable &aacute; also works.
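The link between the number 225 and the character á can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative and not part of the original article.

```python
import html

a_acute = "\u00e1"  # the character á

# The numeric code of á is 225.
code = ord(a_acute)

# The numeric reference and the named entity both decode to the same character.
from_number = html.unescape("&#225;")
from_name = html.unescape("&aacute;")

# In the single-byte Latin-1 (ISO 8859-1) encoding, á is stored as the one byte 225.
latin1_bytes = list(a_acute.encode("latin-1"))

print(code, from_number, from_name, latin1_bytes)
```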

However, you didn’t have to go far to run into problems. There was no standard way of producing Welsh ŵ and ŷ, for example. Nor was there a simple way of making text available in non-Roman scripts such as those required to write Greek, Russian, Hebrew, Arabic, Hindi, and Mandarin. Some workarounds existed, but they were not universally understood and offered only a partial solution. It was not unusual for text in non-Roman scripts to be put on webpages as a scanned image of printed text.

Unicode arrives

The Unicode standard was developed as a universally agreed way of representing all scripts in use. Unicode keeps the same numbers for the ASCII codes mentioned above, but expands the range of codes to over 100,000.

A look at the list of Unicode character ranges (e.g. http://www.alanwood.net/unicode/#links) illustrates the diversity of human writing systems: everything from Babylonian cuneiform to scripts for India’s many languages. There are also symbols, such as mathematical notation, and even I Ching hexagrams, dominoes and mah-jongg tiles!

But I still can’t read it!

Chances are that some characters on the Unicode resource pages still won’t appear correctly, but will look like boxes, e.g. ᡩ (if you don’t have a Mongolian font). This means that your system doesn’t support those particular Unicode character ranges. Windows offers support for additional languages:

  • go to the Control Panel
  • choose ‘Regional and Language Options’
  • you can install support for ‘complex script and right-to-left’ languages and East Asian languages.

You will also of course need to have an appropriate font available. Many scripts are included in the Unicode versions of standard fonts such as Arial, but if you still can’t read a script you may need to install a font that includes it. Some non-commercial ones are listed at http://www.alanwood.net/unicode/fonts.html.

Writing for the web with Unicode

Odd words or characters can be inserted onto a webpage without special software. For individual characters, the HTML is “&#” followed by the appropriate (decimal) number in Unicode, followed by “;”. So the Welsh characters mentioned above can be written as &#373; (ŵ) and &#375; (ŷ). For more extended typing, Unicode keyboards are available for the more common scripts. For example, Microsoft Global has an East Asian keyboard for MS Office applications.
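If you have more than a handful of characters to convert, the numeric references can be generated rather than looked up by hand. The following is a minimal Python sketch; the helper name to_char_refs is hypothetical, and it relies on the standard "xmlcharrefreplace" error handler, which replaces each character that cannot be written in ASCII with its decimal reference.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference.

    to_char_refs is a hypothetical helper name used for illustration only.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# The Welsh letters ŵ (U+0175) and ŷ (U+0177) become &#373; and &#375;.
welsh = to_char_refs("\u0175 and \u0177")
print(welsh)
```

Plain ASCII text passes through unchanged, so the helper can be applied to a whole paragraph.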

Remember, though, that a word in non-Roman characters may not be legible if your readers don’t have an appropriate font, so consider giving a transliteration into Roman characters in addition or instead. It is also important to specify in the page header that you are using the UTF-8 encoding, as follows:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
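To see why the declaration matters, the sketch below imitates in Python what can happen when UTF-8 text is read as if it were in a single-byte encoding. Latin-1 is used here as an assumed stand-in for whatever encoding a browser might otherwise guess; the variable names are illustrative.

```python
text = "\u0175"  # the Welsh letter ŵ

# UTF-8 stores ŵ as two bytes.
utf8_bytes = text.encode("utf-8")

# Read as Latin-1, each byte is treated as a separate character,
# so the reader sees two unrelated symbols instead of ŵ.
misread = utf8_bytes.decode("latin-1")

# Read as UTF-8, the original character comes back intact.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes, misread, correct)
```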

At ILRT we work to this standard, essential in a world where the Internet crosses all linguistic borders, for example in the European Agency site.

See also the longer version of this article.

Virginia Knight – Senior Technical Researcher

This entry was posted on 4th May 2010 at 2:27 pm and is filed under Briefings.