Temporarily making \u escapes use a different encoding on the command line
For debugging purposes, I made a little utility that can show the exact bytes (as a hex dump) of each command line argument fed to it:
$ cat cmdline
#!/bin/bash
# Read this shell's own NUL-delimited argument list from /proc.
mapfile -d '' array < /proc/$$/cmdline
# Skip argv[0] (bash) and argv[1] (the script name); hex-dump each remaining argument.
for arg in "${array[@]:2}"; do printf '%s' "$arg" | xxd -; echo; done
Now I can create complex test data and everything seems to work as expected. For example:
$ ./cmdline $'\u00ff' $'\xff'
00000000: c3bf ..
00000000: ff .
As can be seen, the $'\u00ff' sequence produces the bytes corresponding to the UTF-8 encoding of Unicode code point 255, and the $'\xff' sequence produces the single byte with the value 255.
I expect the UTF-8 encoded result given my locale settings, but I would like to be able to get different bytes instead, corresponding to different text encodings of Unicode code point 255. Conceptually, to my mind, $'\u00ff' represents the Unicode character ÿ (lowercase y with diaeresis), not any particular sequence of bytes. For example, I would like to be able to specify a UTF-16 encoding and get either 00ff or ff00 in my resulting hex dump (according to the chosen endianness), or use the cp1252 encoding and get ff, or use the cp437 encoding and get 98.
I tried setting locale environment variables, but they seem to have no effect:
$ echo $LANG
en_CA.UTF-8
$ locale -m | grep 1252 # verify that the locale is supported
CP1252
$ LANG="en_CA.CP1252" ./cmdline $'\u00ff' # try to use it
00000000: c3bf ..
Similarly, setting LC_ALL gives a warning from setlocale that the locale cannot be changed, and otherwise has no effect.
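As far as I can tell, locale -m lists the charmaps the system knows about rather than the locales that have actually been generated; the latter can be checked separately, for example:
$ locale -a | grep -i 1252 # list installed locales, not just known charmaps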
How can I change the behaviour of the $'\uNNNN' style escape sequences at the command line?
1 answer
I don't think you can persuade Bash to use a non-UTF-8 encoding internally; I expect much of its code assumes that strings are ASCII-compatible. So $'\u00FF' will always translate, inside Bash, to the UTF-8 byte sequence c3 bf.
If you want to take a UTF-8 byte sequence that Bash (or another utility) has produced and translate it to UTF-16, use iconv:
$ printf %s $'\u1234\u00FF' | iconv -t utf-16 | xxd -g2 -e
00000000: feff 1234 00ff ..4...
As you can see, you'll have to contend with the \uFEFF BOM if you do this.
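If the BOM gets in the way, asking for a byte-order-specific variant such as UTF-16BE or UTF-16LE should avoid it (at least with GNU iconv), and the same approach covers the single-byte encodings mentioned in the question. A sketch, with the output I'd expect from GNU iconv:
$ printf %s $'\u1234\u00FF' | iconv -t UTF-16BE | xxd -g2
00000000: 1234 00ff .4..
$ printf %s $'\u00FF' | iconv -t CP1252 | xxd
00000000: ff .
$ printf %s $'\u00FF' | iconv -t CP437 | xxd
00000000: 98 .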
Note that this will barf if you give it $'\xFF', which isn't valid UTF-8. Mixing the two encodings is a recipe for sadness.
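For example, with GNU iconv I'd expect something along these lines (the exact message varies between implementations):
$ printf %s $'\xFF' | iconv -t utf-16
iconv: illegal input sequence at position 0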