
13.5 Unicode Text

Unicode text is text in the UTF-16 encoding, as opposed to a string, which uses the MacRoman encoding. Unicode is the native system-level encoding of Mac OS X, so text supplied by the System is often Unicode text rather than a string. For example:

tell application "Finder" to set x to (get name of disk 1)
class of x -- Unicode text

Similarly, some Mac OS X-native applications, such as TextEdit, return text values as Unicode text. Unicode is capable of expressing tens of thousands of characters, and in its fullest form will express about a million, embracing every character of every written language in history. Eventually we may expect that AppleScript will become completely Unicode-savvy; all AppleScript text will be Unicode text, and the old string type will fade into oblivion.

Unicode text is basically indistinguishable from a string; the differences between them are handled transparently. Whatever you can do to a string, you can do to Unicode text. If you get an element of a Unicode text value, the result is Unicode text. If you concatenate Unicode text and a string, the result is Unicode text (though if you concatenate a string and Unicode text, you get a string; this is troublesome and might change in a future version of AppleScript). You can explicitly coerce between a string and Unicode text, and AppleScript implicitly coerces for you as appropriate.
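
The rules just described can be sketched as follows. This is only an illustration, assuming the version of AppleScript discussed here; the variable names s and u are made up for the example, and the class reported by the last line is the behavior that might change in a future version:

set s to "hello" -- a literal, so a string
set u to s as Unicode text -- explicit coercion
class of u -- Unicode text
class of (character 1 of u) -- Unicode text
class of (u & s) -- Unicode text
class of (s & u) -- string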

Nevertheless, Unicode text is currently still a second-class citizen in AppleScript, and can be hard to work with. You can't even type a Unicode text literal in AppleScript. Well, you can, but AppleScript will render it as MacRoman when you compile the script, so any characters outside the range of MacRoman are lost. And AppleScript's supplied string manipulation commands, such as the scripting addition command ASCII character, don't work outside the MacRoman range either.

One workaround is to construct a character as hex data (see "Data" later in this chapter) and coerce it to Unicode text. So, for example, the following code yields a z-hacek (ž), Unicode code point hex 017E:

set myZ to «data utxt017E» as Unicode text
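
A character built this way can then be combined with ordinary string literals, provided the Unicode text is the first operand of the concatenation. The following is a sketch only, assuming the concatenation rule described earlier in this section holds in your version of AppleScript; the variable name w and the sample text are illustrative:

set myZ to «data utxt017E» as Unicode text
set w to myZ & "ivot" -- Unicode text comes first
class of w -- Unicode text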

Another approach is to write the data out to a file and read it back in. This works because AppleScript gives you a wide variety of ways to treat file data. Here's an example (on reading and writing files, see Chapter 20):

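set f to a reference to file "myDisk:myFile"
open for access f with write permission
-- write decimal 382 (hex 017E) as a two-byte integer
write 382 to f as small integer starting at 0
-- read the same two bytes back, interpreted as UTF-16
set s to read f as Unicode text from 0 to 1
close access f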

After that, s is a z-hacek, because decimal 382 is hex 017E (1 × 256 + 7 × 16 + 14 = 382). There is also support for exchanging data with a file as UTF-8; but there is no internal support for AppleScript text in UTF-8 encoding, so if you read text as UTF-8, it is converted to UTF-16:

set f to a reference to file "myDisk:myFile"
open for access f with write permission
write "this is a test" to f as «class utf8» starting at 0
close access f
open for access f
set s to read f as «class utf8»
close access f
class of s -- Unicode text

Still another approach is to talk to the shell. This has the advantage that a good Unix scripting language, such as Perl, will let you express string data more conveniently than AppleScript will; it works because the do shell script scripting addition command returns Unicode text by default. So, for example:

set p to "use utf8;\n"
set p to p & "print chr(0x017E);"
set s to do shell script "perl -e " & quoted form of p

After that, s is a z-hacek. One must hope that some time soon these manipulations will cease to be necessary.

An older class, international text, is less likely to arise on Mac OS X. It was a way of representing text in accordance with a particular language and script (where "script" means a writing system); each language-script combination had its own rules (an encoding) for how particular sequences of bytes were mapped to characters (glyphs). The mess created by this multiplicity of encodings is the reason why Unicode is a Good Thing.
