Monday, 24 September 2012

Latin1 to utf-8 without iconv

I recently helped someone convert latin1 text to utf-8 on a minimal system with no access to iconv.

A bash script had a latin1 field and needed to encode it to utf-8.

Fortunately, latin-1 has only 256 characters. The lower 128 are plain ASCII and are byte-for-byte identical in utf-8; only the upper 128 need converting, and each of those becomes a two-byte utf-8 sequence.
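
To make that concrete: a latin-1 byte value is also its Unicode code point, so the utf-8 bytes can be worked out with a little arithmetic. The following is only an illustrative sketch (the function name is made up here); it prints the utf-8 bytes for one latin-1 byte as backslash-octal escapes.

```bash
# Hypothetical helper, for illustration only: print the utf-8 encoding of
# a single latin-1 byte value as backslash-octal escapes.
latin1_byte_as_utf8_octal() {
  b=$(( $1 ))   # accepts decimal, 0x.. hex or 0.. octal input
  if [ "$b" -lt 128 ]; then
    # ASCII range: the byte is unchanged in utf-8
    printf '\\%03o\n' "$b"
  else
    # Upper half: two bytes, 110000xx followed by 10xxxxxx
    printf '\\%03o\\%03o\n' $(( 0xC0 | (b >> 6) )) $(( 0x80 | (b & 0x3F) ))
  fi
}

latin1_byte_as_utf8_octal 0x41   # 'A'      prints \101
latin1_byte_as_utf8_octal 0xE9   # e-acute  prints \303\251
```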

The minimal system had the busybox od command, so I decided to convert the variable to a stream of octal byte values. Starting with a field like this:


$ read FIELD
Hello everybody I am the thing

which can be converted to octal like this:

$ echo "$FIELD" | od -b
0000000 110 145 154 154 157 040 145 166 145 162 171 142 157 144 171 040
0000020 111 040 141 155 040 164 150 145 040 164 150 151 156 147 012
0000037

and then strip the leading offset column, leaving just the octal byte values, each preceded by a space:


$ echo "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//'
 110 145 154 154 157 040 145 166 145 162 171 142 157 144 171 040
 111 040 141 155 040 164 150 145 040 164 150 151 156 147 012

and then join the lines together:


$ echo "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012'
 110 145 154 154 157 040 145 166 145 162 171 142 157 144 171 040 111 040 141 155 040 164 150 145 040 164 150 151 156 147 012

and then convert each space to $_lu_, which turns every octal value into a reference to a variable with the prefix _lu_:


$ echo "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g'
$_lu_110$_lu_145$_lu_154$_lu_154$_lu_157$_lu_040$_lu_145$_lu_166$_lu_145$_lu_162$_lu_171$_lu_142$_lu_157$_lu_144$_lu_171$_lu_040$_lu_111$_lu_040$_lu_141$_lu_155$_lu_040$_lu_164$_lu_150$_lu_145$_lu_040$_lu_164$_lu_150$_lu_151$_lu_156$_lu_147$_lu_012


Now if all those variables were defined to hold the utf-8 values, we could convert the field, like this:

$ FIELD=$(eval echo \"$(echo -n "$FIELD" | od -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g' )\")

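As a bash function (od is given -v here so that runs of identical 16-byte lines are not collapsed into a single "*" line, which would silently drop text):


latin1_to_utf8() {
  # The here-string appends a newline, which comes back out as the
  # trailing newline of the converted text.
  eval echo -n \"$( <<<"$1" od -v -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g' )\"
}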
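
A small usage sketch may help here. It assumes a hand-written subset of the _lu_ variables, just enough for one sample word, and repeats the function so that the snippet runs on its own; the variable names latin1_field and utf8_field are made up for the example.

```bash
# Hand-written subset of the lookup table, just enough for this demo
_lu_012=$'\012'        # newline (appended by the here-string)
_lu_143=$'\143'        # c
_lu_141=$'\141'        # a
_lu_146=$'\146'        # f
_lu_351=$'\303\251'    # e-acute: one latin-1 byte becomes two utf-8 bytes

latin1_to_utf8() {
  # -v stops od from collapsing repeated output lines into "*"
  eval echo -n \"$( <<<"$1" od -v -b | sed -e 's/[^ ]*//;s/ *$//' | tr -d $'\012' | sed -e 's/ /$_lu_/g' )\"
}

latin1_field=$'caf\351'                        # "cafe" with e-acute, as latin-1 bytes
utf8_field=$(latin1_to_utf8 "$latin1_field")   # $( ) drops the trailing newline
printf '%s' "$utf8_field" | od -b              # inspect the converted bytes
```

Any byte whose _lu_ variable is not defined expands to nothing, so with a partial table like this one, other characters would simply disappear from the result.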


Here is how we define those variables; this code must be run on a fully-featured box with access to iconv.

for i in `seq 1 255`
do
  echo "_lu_$(printf "%03o" $i)"=\$\'$( printf $( printf '\\x%x' $i ) | iconv -f latin1 -t utf-8 | od -b | sed -e 's/[^ ]*//;s/ *$//;s/ /\\/g' )\'
done

and the text it outputs:

...
...
_lu_176=$'\176'
_lu_177=$'\177'
_lu_200=$'\302\200'
_lu_201=$'\302\201'
_lu_202=$'\302\202'
...
...

This output is then pasted into the script that runs in the reduced environment.
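
If no box with iconv is at hand, the same table could probably be produced with shell arithmetic alone, since a latin-1 byte value equals its Unicode code point and every byte from 0200 upwards encodes as two utf-8 bytes. This is an alternative sketch, not the method used above, and gen_lu_table is a made-up name; it would be worth comparing its output against an iconv-generated table before relying on it.

```bash
# Alternative generator that needs no iconv (a sketch under the
# assumptions stated above, not the original method).
gen_lu_table() {
  i=1
  while [ "$i" -le 255 ]
  do
    if [ "$i" -lt 128 ]; then
      # ASCII: the utf-8 form is the same single byte
      printf "_lu_%03o=\$'\\\\%03o'\n" "$i" "$i"
    else
      # Upper half: lead byte 110000xx, continuation byte 10xxxxxx
      printf "_lu_%03o=\$'\\\\%03o\\\\%03o'\n" "$i" \
        $(( 0xC0 | (i >> 6) )) $(( 0x80 | (i & 0x3F) ))
    fi
    i=$(( i + 1 ))
  done
}

# Show three entries for comparison with the sample output
gen_lu_table | grep '^_lu_20[0-2]='
```

The grep line should print the same _lu_200 to _lu_202 entries as the sample. In bash the table could also be loaded directly with eval "$(gen_lu_table)" instead of being pasted in.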
