A bash script had a latin-1 field that needed to be re-encoded as utf-8.
Fortunately, latin-1 has only 256 characters, and only the top 128 need special handling: the lower 128 are plain ASCII, which utf-8 encodes unchanged.
The minimal system had the busybox od command, so I decided to convert the variable to a numeric octal stream, like this:
$ read FIELD
Hello everybody I am the thing
The field can then be converted to octal like this:
$ echo "$FIELD" | od -b
0000000 110 145 154 154 157 040 145 166 145 162 171 142 157 144 171 040
0000020 111 040 141 155 040 164 150 145 040 164 150 151 156 147 012
0000037
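One caveat worth checking before relying on this output: od normally abbreviates consecutive identical output lines to a single `*`, and the later sed step would silently drop those bytes. The sketch below is only an illustration with a made-up sample value; it assumes an od that accepts `-v` (GNU coreutils does, and the busybox build should be checked).

```shell
# 48 identical bytes span three 16-byte od lines.
sample=$(printf '%048d' 0)

# Without -v, repeated lines may be abbreviated to a single "*".
printf '%s' "$sample" | od -b

# With -v, every line is printed in full, so no bytes go missing downstream.
printf '%s' "$sample" | od -v -b
```

For short fields without long runs of one character the difference never shows up, which is why it is easy to miss.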
Then strip the offsets, leaving just the octal character values, each preceded by a space:
$ echo "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//'
110 145 154 154 157 040 145 166 145 162 171 142 157 144 171 040
111 040 141 155 040 164 150 145 040 164 150 151 156 147 012
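To see what the two sed substitutions do, here is a small sketch run on canned od output rather than on a live field. The helper name `strip_offsets` and the sample lines are made up for illustration.

```shell
# s/[^ ]*//  removes the first run of non-space characters (the offset column).
# s/ *$//    removes any trailing padding.
strip_offsets() {
    sed -e 's/[^ ]*//;s/ *$//'
}

# A padded data line and the final offset-only line, as od -b would print them.
cleaned=$(printf '%s\n' '0000000 110 151 012   ' '0000003' | strip_offsets)

# Brackets make the surviving leading space visible.
printf '[%s]\n' "$cleaned"
```

The leading space in front of every value is kept on purpose: it is the separator that a later step turns into a variable prefix. The offset-only last line becomes an empty line and contributes nothing.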
Then join the lines together:
$ echo "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012'
110 145 154 154 157 040 145 166 145 162 171 142 157 144 171 040 111 040 141 155 040 164 150 145 040 164 150 151 156 147 012
and then convert each space to $_lu_, which is a nice variable prefix:
$ echo "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g'
$_lu_110$_lu_145$_lu_154$_lu_154$_lu_157$_lu_040$_lu_145$_lu_166$_lu_145$_lu_162$_lu_171$_lu_142$_lu_157$_lu_144$_lu_171$_lu_040$_lu_111$_lu_040$_lu_141$_lu_155$_lu_040$_lu_164$_lu_150$_lu_145$_lu_040$_lu_164$_lu_150$_lu_151$_lu_156$_lu_147$_lu_012
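The string above is just text until something expands it. A minimal sketch of that expansion, with two table entries defined by hand purely for illustration (the `template` and `result` names are hypothetical):

```shell
# Hand-defined entries for two byte values only.
_lu_110='H'
_lu_151='i'

# What the pipeline would produce for the two-byte field "Hi".
template='$_lu_110$_lu_151'

# eval re-parses the text, so each $_lu_NNN is replaced by its value.
result=$(eval "printf '%s' \"$template\"")
printf '%s\n' "$result"
```

Running eval on data is usually risky, but here the evaluated text is built only from od's octal digits and the fixed `$_lu_` prefix, so nothing from the field itself reaches the shell parser.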
Now if all those variables were defined to hold the utf-8 values, we could convert the field, like this:
$ FIELD=$(eval echo \"$(echo -n "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g' )\")
or, as a bash function:
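latin1_to_utf8() {
    # printf avoids the trailing newline that a here-string would append;
    # od -v stops od from collapsing repeated 16-byte lines into "*".
    eval echo -n \"$( printf '%s' "$1" | od -v -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g' )\"
}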
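A quick end-to-end check of the idea, written as a sketch rather than as the script's real code. It assumes a table holding only the four bytes the example needs, and it uses a variant of the function built on printf so that it also runs in a plain POSIX shell.

```shell
# Minimal table: only the byte values used below (illustrative subset).
_lu_143='c'; _lu_141='a'; _lu_146='f'
_lu_351=$(printf '\303\251')     # utf-8 bytes for e-acute (latin-1 0xE9)

# Variant of the function for this sketch; same pipeline, printf instead of echo.
latin1_to_utf8() {
    eval "printf '%s' \"$(printf '%s' "$1" | od -v -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d '\012' | sed -e 's/ /$_lu_/g')\""
}

input=$(printf 'caf\351')        # "cafe" with e-acute, as latin-1 bytes
output=$(latin1_to_utf8 "$input")

# Dump the result; the octal bytes should read 143 141 146 303 251.
printf '%s' "$output" | od -v -b
```

The single latin-1 byte 351 comes out as the two utf-8 bytes 303 251, while the ASCII bytes pass through untouched.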
Here is how we define those variables; this code must be run on a fully-featured box with access to iconv.
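# One assignment per latin-1 byte value from 1 to 255;
# each value is the octal-escaped utf-8 encoding of that byte.
for i in $(seq 1 255)
do
    echo "_lu_$(printf "%03o" $i)"=\$\'$( printf $( printf '\\x%x' $i ) | iconv -f latin1 -t utf-8 | od -b | sed -e 's/[^ ]*//;s/ *$//;s/ /\\/g' )\'
done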
and the text it outputs
...
...
_lu_176=$'\176'
_lu_177=$'\177'
_lu_200=$'\302\200'
_lu_201=$'\302\201'
_lu_202=$'\302\202'
...
...
will be pasted into the script that runs in the reduced environment.
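If no box with iconv is at hand, the same table can be computed with shell arithmetic. Latin-1 byte values equal their Unicode code points, so values below 128 stay as one byte, and values from 128 to 255 become the two utf-8 bytes 110000xx 10xxxxxx. This is a sketch under that assumption, not the method used above; the function name `gen_lu_table` is hypothetical.

```shell
# Print _lu_NNN=$'...' lines for byte values 1..255 without iconv.
gen_lu_table() {
    q="'"    # a literal single quote, kept out of the format strings
    i=1
    while [ "$i" -le 255 ]; do
        if [ "$i" -lt 128 ]; then
            # ASCII: one byte, unchanged.
            printf '_lu_%03o=$%s\\%03o%s\n' "$i" "$q" "$i" "$q"
        else
            # Lead byte: 192 (0xC0) plus the top two bits of the value.
            # Continuation byte: 128 (0x80) plus the low six bits.
            printf '_lu_%03o=$%s\\%03o\\%03o%s\n' "$i" "$q" \
                $(( 192 | (i >> 6) )) $(( 128 | (i & 63) )) "$q"
        fi
        i=$(( i + 1 ))
    done
}

gen_lu_table
```

The lines it prints should match the iconv-generated ones, for example `_lu_200=$'\302\200'` for byte value 128.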