Tuesday, 16 August 2011

Importing a threaded mbox into Google Blogger

My first blog used WordPress. I worked out a script to export the WordPress database as a threaded mbox, and for a while I ran my blog as a web interface to a Citadel IMAP folder.

Circumstances moved me to Google Blogger, and I didn't want to lose my old posts or comments.

So I wrote a sed script to convert a threaded mbox into Google Blogger format.

It's pretty complete. If you want to know how it works, just read the source, man!

#! /usr/bin/sed -f

/^Content-[Tt]ype: /{
  # keep reading until the next From line (or the end of the file)
  :-body
  $!{
    N
    /\nFrom - /!b-body
  }
}

/^Content-[tT]ype: multipart\/alternative/{
  # Are we a comment (convert to text) or a post
  x;/\n *<category.*#comment.*>/!{x;b-get-html};x

  :-get-text
  /\nContent-Transfer-Encoding: quoted-printable/{
    s/=0D=0A/\n/g
    s/=20/ /g
    s/=\n//g
    s/=3D/=/g
  }

  s/^.*\nContent-type: text\/plain/Content-Type: text\/plain/i
  :-chop
  /^\n/b-chopped
  s/^[^\n]*\n//;
  b-chop
  :-chopped
  s/\n--[^\n]*\nContent-type.*//i
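  s/^\n\n*//
  # escape the plain text for XML, as the text/html branch does
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g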

  s/^/  <content type="html">\n/i;
  s/$/\n  <\/content>/i;

  H

  s/^.*/From - /
  # this comment is finished; skip the post-only steps below
  b-got

  :-get-html
  # the kind#post category is appended once, in the text/html block below

  /\nContent-Transfer-Encoding: quoted-printable/{
    s/=0D=0A/\n/g
    s/=20/ /g
    s/=\n//g
    s/=3D/=/g
  }

  s/^.*\nContent-type: text\/html/Content-Type: text\/html/i
  s/\n--[^\n]*\nContent-type.*//i

  :-got
}

/^Content-[tT]ype: text\/html/{
  # try to save the <title> instead of the subject line
  /<title>/{
    x
    s/$/\n/
    G
    s/\n\n.*<title>\([^<]*\)<\/title>.*/\n  <title>\1<\/title>/i
    # rid of the old title
    s/ *<title>[^<]*<\/title>\n//
    x
  }

  s/^.*\?<body\b[^>]*>[\n ]*//i
  s/[\n ]*<\/body>.*//i

  # Are we a comment (convert to text) or a post
  x;/\n *<category.*#comment.*>/!{x;b-convert-html};x

  :-convert-text
  s/<br\b[^>]*>/\n/gi
  s/<p\b[^>]*>/\n/gi
  s/<[^>]*>//g
  s/&#8230;/…/g
  s/^\n\n*//
  s/\n\n*/\n/g
  b-quote

  :-convert-html
  x
  s|$|\n  <category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#post'/>|
  x

  :-quote
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g

  s/^/  <content type="html">\n/i;
  s/$/\n  <\/content>/i;

  H

  s/^.*/From - /
}

/^From - /{
  s/.*//
  1!{
    s/$/<\/entry>\n/
    H
    x
    p
    x
  }
  s/.*/<entry>/
  h
  d
}

/^Message-ID:/{
  s/^[^:]*:[[:space:]]*<\?//
  s/>\?$//
  s/#/_/g
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g
  s/^/  <id>/
  s/$/<\/id>/
  H
  d
}

/^References:/{
  s/^[^:]*:[[:space:]]*<\?//
  s/>.*$//
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g
  s/^/  <thr:in-reply-to ref="/
  s/$/" type="text\/html"\/>/
  s|$|\n  <category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#comment'/>|
  H
  d
}

/^Subject:/{
  s/^[^:]*:[[:space:]]*//
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g
  s/^/  <title>/
  s/$/<\/title>/
  H
  d
}

/^From:/{
  s/^[^:]*:[[:space:]]*//
  s/^"*//
  s/".*//
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g
  s/^/  <author><name>/
  s/$/<\/name><\/author>/
  H
  d
}

/^Date:/{
  s/^[^:]*:[[:space:]]*<\?//
  s/\([^ ]*\) *\([^ ]*\) *\([^ ]*\) *\([^ ]*\) *\([^ ]*\) *\(...\)\(..\).*/\4-\3-\2T\5.000\6:\7/
  s/Dec/12/i
  s/Nov/11/i
  s/Oct/10/i
  s/Sep/09/i
  s/Aug/08/i
  s/Jul/07/i
  s/Jun/06/i
  s/May/05/i
  s/Apr/04/i
  s/Mar/03/i
  s/Feb/02/i
  s/Jan/01/i
  # pad a single-digit day so the timestamp stays well-formed
  s/-\([0-9]\)T/-0\1T/
  s/&/\&amp;/g; s/"/\&quot;/g; s/</\&lt;/g; s/>/\&gt;/g
  s/^/  <published>/
  s/$/<\/published>/
  H
  s/published/updated/g
  H
  d
}

${
#  a</content></entry></feed>
}
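
To make the Date: handling concrete, here is a small sketch that feeds one header through the same substitutions the script applies. The sample header and the convert_date name are made up for illustration, only the August rule is included, and GNU sed is assumed (for `\?` and the `i` flag).

```shell
#!/bin/sh
# Sketch: run one made-up Date header through the script's date substitutions.
# Assumes GNU sed; only the August rule is shown for brevity.
convert_date() {
  sed -e 's/^[^:]*:[[:space:]]*<\?//' \
      -e 's/\([^ ]*\) *\([^ ]*\) *\([^ ]*\) *\([^ ]*\) *\([^ ]*\) *\(...\)\(..\).*/\4-\3-\2T\5.000\6:\7/' \
      -e 's/Aug/08/i'
}

# prints 2011-08-16T10:30:00.000+01:00
printf 'Date: Tue, 16 Aug 2011 10:30:00 +0100\n' | convert_date
```

The weekday is captured but thrown away, and the time zone `+0100` is split into `+01:00`.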


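The quoted-printable handling in the script is partial: it only undoes `=0D=0A`, `=20`, soft line breaks and `=3D`. Here is a small sketch of those four substitutions on a made-up two-line sample, assuming GNU sed (for `\n` in the replacement). The decode_qp name is hypothetical, and the leading loop just joins all input lines into one pattern space, which the script gets from its own N loop.

```shell
#!/bin/sh
# Sketch: the script's four quoted-printable substitutions on sample text.
# Assumes GNU sed. This is not a full quoted-printable decoder.
decode_qp() {
  sed '
    # join every input line into one pattern space
    :a
    $!{
      N
      ba
    }
    s/=0D=0A/\n/g
    s/=20/ /g
    s/=\n//g
    s/=3D/=/g
  '
}

# prints "a=b cd" and then "e" on a second line
printf 'a=3Db=20c=\nd=0D=0Ae\n' | decode_qp
```
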
The hard part is that Google will sometimes drop posts or comments without explaining why (even after claiming to have imported 20 comments, only 17 may be available after import), so it took a lot of trial and error.
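
Since the import count can't be trusted, one way to keep track is to count the entries in the converted file before each import and compare with what is visible afterwards. A rough sketch, assuming each category element sits on its own line (as the script writes it). The count_kind and sample_feed names and the sample entries are invented for illustration.

```shell
#!/bin/sh
# Sketch: count posts and comments in converted output before importing.
# Counts lines, so it relies on one category element per line.
count_kind() {
  # $1 = kind to count (post or comment); reads the converted output on stdin
  grep -c "kind#$1'"
}

# Invented sample standing in for real converter output.
sample_feed() {
  cat <<'EOF'
<entry>
  <id>post-1</id>
  <category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#post'/>
</entry>
<entry>
  <id>comment-1</id>
  <category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#comment'/>
</entry>
<entry>
  <id>comment-2</id>
  <category scheme='http://schemas.google.com/g/2005#kind' term='http://schemas.google.com/blogger/2008/kind#comment'/>
</entry>
EOF
}

echo "posts: $(sample_feed | count_kind post)"        # posts: 1
echo "comments: $(sample_feed | count_kind comment)"  # comments: 2
```

With real output, the sample_feed call would be replaced by reading the converted file.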

One comment seemed to be dropped because it contained this text:

13. Use with the GNU Affero General Public License.

And to prove it, I was able to post that as a comment to a Google blog, but if I exported and imported the blog the comment would be dropped.
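
When hunting for a suspect like this, it can help to list which entries in the converted file contain a given phrase, so that exactly those can be checked after an import. A rough sketch, assuming GNU sed and the one-element-per-line layout the converter writes. The ids_containing and sample_feed names and the sample entries are made up.

```shell
#!/bin/sh
# Sketch: print the <id> of every entry whose text contains a phrase.
ids_containing() {
  # $1 = phrase, used as a sed regex: keep it free of "/" and special characters
  sed -n "
    /<entry>/,/<\/entry>/{
      H
      /<\/entry>/{
        x
        /$1/{
          s/.*<id>\([^<]*\)<\/id>.*/\1/
          p
        }
        s/.*//
        x
      }
    }
  "
}

# Invented sample standing in for real converter output.
sample_feed() {
  cat <<'EOF'
<entry>
  <id>comment-1</id>
  <content type="html">
Nice post.
  </content>
</entry>
<entry>
  <id>comment-2</id>
  <content type="html">
13. Use with the GNU Affero General Public License.
  </content>
</entry>
EOF
}

# prints comment-2
sample_feed | ids_containing 'GNU Affero General Public License'
```

Each entry is collected in the hold space and tested as a whole once its closing tag arrives.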
