Converting WordPress to Webby

The process of converting my old WordPress posts to Webby was relatively painless, but there are a few things worth sharing.

The first step was to export my WordPress MySQL database and create a local copy, and then to create DataMapper classes corresponding to the two tables I was interested in, wp_posts and wp_comments.

mysql> describe wp_posts;
+-----------------------+---------------------+
| Field                 | Type                |
+-----------------------+---------------------+
| ID                    | bigint(20) unsigned |
| post_author           | bigint(20)          |
| post_date             | datetime            |
| post_date_gmt         | datetime            |
| post_content          | longtext            |
| post_title            | text                |
| post_category         | int(4)              |
| post_excerpt          | text                |
| post_status           | varchar(20)         |
| comment_status        | varchar(20)         |
| ping_status           | varchar(20)         |
| post_password         | varchar(20)         |
| post_name             | varchar(200)        |
| to_ping               | text                |
| pinged                | text                |
| post_modified         | datetime            |
| post_modified_gmt     | datetime            |
| post_content_filtered | text                |
| post_parent           | bigint(20)          |
| guid                  | varchar(255)        |
| menu_order            | int(11)             |
| post_type             | varchar(20)         |
| post_mime_type        | varchar(100)        |
| comment_count         | bigint(20)          |
+-----------------------+---------------------+
24 rows in set (0.01 sec)                      

mysql> describe wp_comments;                   
+----------------------+---------------------+
| Field                | Type                |
+----------------------+---------------------+
| comment_ID           | bigint(20) unsigned |
| comment_post_ID      | int(11)             |
| comment_author       | tinytext            |
| comment_author_email | varchar(100)        |
| comment_author_url   | varchar(200)        |
| comment_author_IP    | varchar(100)        |
| comment_date         | datetime            |
| comment_date_gmt     | datetime            |
| comment_content      | text                |
| comment_karma        | int(11)             |
| comment_approved     | varchar(20)         |
| comment_agent        | varchar(255)        |
| comment_type         | varchar(20)         |
| comment_parent       | bigint(20)          |
| user_id              | bigint(20)          |
+----------------------+---------------------+
15 rows in set (0.00 sec)

And no, I don’t know why they have wp_posts.ID as a bigint(20) and then wp_comments.comment_post_ID, which should be the same size, as an int(11). This is a database that has been upgraded a few times so perhaps that’s a legacy thing.

While DataMapper can easily accept a non-standard primary key in a table, it gets a little trickier when you are linking two tables together using has n and belongs_to. I found it simpler to just change the names of the primary keys and foreign key. So, after creating a new database and loading the mysqldump file with all my blog’s data, I ran the following:

ALTER TABLE wp_posts CHANGE ID id bigint(20) unsigned;
ALTER TABLE wp_comments CHANGE comment_ID id bigint(20) unsigned;
ALTER TABLE wp_comments CHANGE comment_post_ID post_id int(11);

Update: I think I cracked the custom parent_key, child_key bit in DataMapper.

class Post
  has n,       :comments,
               :parent_key => [:ID],              # primary key on wp_posts
               :child_key  => [:comment_post_ID]  # foreign key on wp_comments
end

class Comment
  belongs_to   :post,
               :parent_key => [:ID],              # primary key on wp_posts
               :child_key  => [:comment_post_ID]  # foreign key on wp_comments
end

See parent_key_example.rb for a full working example. This should negate the need to change field names as above, but I haven’t fully tested it.

One of the really nice things about DataMapper is that it will happily ignore any fields in your database which you don’t mention explicitly. So, you only have to define DataMapper properties for the fields you want to be able to work with. The top of my post.rb file looks like:

class Post
  include DataMapper::Resource
  storage_names[:default] = 'wp_posts'

  property :id, Integer, :serial => true # original field name ID
  property :post_date, DateTime
  property :post_content, Text
  property :post_title, String
  property :post_status, String
  property :post_name, String

  has n, :comments, :comment_approved => true, :order => [:comment_date]

And my comment.rb file starts with:

class Comment
  include DataMapper::Resource
  storage_names[:default] = 'wp_comments'

  property :id, Integer, :serial => true # original field name comment_ID
  property :post_id, Integer # original field name comment_post_ID
  property :comment_author, String
  property :comment_author_url, String
  property :comment_date, DateTime
  property :comment_content, String
  property :comment_approved, Boolean
  property :user_id, Integer

  belongs_to :post

So, just like that I can access all my posts and comments using DataMapper classes, and I can do things like post.comments.

The initialization for DataMapper is simply:

require "rubygems"
require "fileutils" # FileUtils is used by Post#publish and Post.publish_all
require "dm-core"
DataMapper.setup(:default, 'mysql://localhost/ananelson_wordpress?socket=/tmp/mysql.sock')

# Local files
require "lib/comment"
require "lib/post"

Now, how do I get the content formatted nicely? WordPress takes the data stored in the database and feeds it through a PHP function called the_content.

// This is an excerpt from the WordPress source code. http://wordpress.org/about/gpl/
function the_content($more_link_text = '(more...)', $stripteaser = 0, $more_file = '') {
    $content = get_the_content($more_link_text, $stripteaser, $more_file);
    $content = apply_filters('the_content', $content);
    $content = str_replace(']]>', ']]&gt;', $content);
    echo $content;
}

The apply_filters function is the thing that interests me. More digging in the WordPress source revealed:

// This is an excerpt from the WordPress source code. http://wordpress.org/about/gpl/

add_filter('the_content', 'wptexturize');
add_filter('the_content', 'convert_smilies');
add_filter('the_content', 'convert_chars');
add_filter('the_content', 'wpautop');
add_filter('the_content', 'prepend_attachment');

# snip...

add_filter('comment_text', 'wptexturize');
add_filter('comment_text', 'convert_chars');
add_filter('comment_text', 'make_clickable', 9);
add_filter('comment_text', 'force_balance_tags', 25);
add_filter('comment_text', 'convert_smilies', 20);
add_filter('comment_text', 'wpautop', 30);
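
The optional third argument in those add_filter calls is a priority: WordPress runs the filters for a hook from the lowest number to the highest, and 10 is the default when none is given. So for comment_text the effective order is make_clickable (9), wptexturize and convert_chars (10), convert_smilies (20), force_balance_tags (25), then wpautop (30). Here is a minimal Ruby sketch of that ordering rule. The FILTERS constant and the Ruby add_filter and apply_filters methods are hypothetical stand-ins written for illustration, not a port of the WordPress code, and the lambdas just tag the text so the order is visible.

```ruby
# Hypothetical Ruby stand-ins for WordPress's add_filter / apply_filters.
FILTERS = Hash.new { |hash, tag| hash[tag] = [] }

# Register a callable under a tag. 10 mirrors WordPress's default priority.
def add_filter(tag, callable, priority = 10)
  FILTERS[tag] << [priority, FILTERS[tag].size, callable]
end

# Run every filter for the tag, lowest priority first. Ties keep their
# registration order because the insertion index is the second sort key.
def apply_filters(tag, value)
  FILTERS[tag].sort_by { |priority, index, _| [priority, index] }
              .reduce(value) { |text, (_, _, callable)| callable.call(text) }
end

add_filter("comment_text", ->(t) { t + " texturize" })    # default priority 10
add_filter("comment_text", ->(t) { t + " clickable" }, 9) # runs first
add_filter("comment_text", ->(t) { t + " autop" }, 30)    # runs last

puts apply_filters("comment_text", "start")
# prints "start clickable texturize autop"
```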

So, WordPress has a number of filters which are applied to the post content and the comments after the text is pulled out of the database. The simplest way I could think of to replicate this behaviour was to just use these same WordPress filters. I decided that I could live without convert_smilies, and that there was no reason not to use make_clickable for my posts as well as for the comments, so that left me with a standard list of filters. I wrote a short PHP-based shell script:

#!/usr/bin/env php -q
<?php

include 'wp/plugin.php';

include 'wp/kses.php';
include 'wp/formatting.php';
include 'wp/shortcodes.php';

$text = file_get_contents($argv[1]);

$text = wptexturize($text);
$text = convert_chars($text);
$text = make_clickable($text);
$text = force_balance_tags($text);
$text = wpautop($text);

echo $text;
?>

Then I just had to wrap the shell script in Ruby.

def wp_format(text)
  tmpfile = "temp.txt"
  File.open(tmpfile, 'w') do |f|
    f.write text
  end

  result = `./wp_format #{tmpfile}`
  `rm #{tmpfile}`
  puts result
  result
end

For some reason Ruby’s Tempfile library gave me some strange filenames which either got garbled or weren’t palatable to system(), so I just used “temp.txt”. You could always add a timestamp if you wanted to.
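
For anyone who would rather not hard-code “temp.txt”, here is a minimal sketch of one alternative, not what was used for this conversion. It assumes a Ruby recent enough to have Tempfile.create, and it passes the path to the child process as a separate argument through Open3.capture2, so no shell ever has to interpret the filename. The method name format_via_file and the command parameter are made up for illustration; the default command is the wp_format script above.

```ruby
require "open3"
require "tempfile"

# Writes text to a temporary file, runs `command <path>` and returns its stdout.
# Any program that takes a filename as its last argument would work here.
def format_via_file(text, command = ["./wp_format"])
  Tempfile.create(["wp_format", ".txt"]) do |file|
    file.write(text)
    file.flush # make sure the bytes are on disk before the child reads them
    output, status = Open3.capture2(*command, file.path)
    raise "#{command.first} exited with #{status.exitstatus}" unless status.success?
    output
  end # Tempfile.create removes the file when the block finishes
end
```

Calling format_via_file(post_content) would then stand in for wp_format(post_content), and swapping the command for something like ["cat"] is a quick way to check the plumbing without PHP.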

Now, I need to recreate the perma-url scheme I had set up in WordPress.

  def filedir
    location = "../content/" # relative path to webby content dir
    location + "said/on/" + post_date.strftime("%Y/%m/%d/") + post_name
  end

  def filename
    filedir + "/index.txt"
  end

I used a directory “said/on” (yeah, sorry, I was feeling too clever that day) followed by Year/Month/Day and then the post slug. So, in my Post class I have two methods: filedir, which builds the directory path ending in the post slug, and filename, which appends index.txt to it (.txt since this is going into Webby).
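
To make the resulting layout concrete, here is a small standalone sketch of the same path logic outside the Post class. The names CONTENT_DIR, post_dir and post_file are hypothetical, and the date and slug are made up purely for illustration.

```ruby
# Hypothetical standalone version of Post#filedir and Post#filename.
CONTENT_DIR = "../content" # relative path to the Webby content dir, as in the post

# Directory for one post: said/on/YYYY/MM/DD/<slug>
def post_dir(post_date, post_name)
  File.join(CONTENT_DIR, "said", "on", post_date.strftime("%Y/%m/%d"), post_name)
end

# The file Webby will later turn into that post's page.
def post_file(post_date, post_name)
  File.join(post_dir(post_date, post_name), "index.txt")
end

# Made-up date and slug, not a real post from the blog.
puts post_file(Time.new(2008, 7, 4), "hello-world")
# prints "../content/said/on/2008/07/04/hello-world/index.txt"
```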

Finally, I need code which formats comments and posts, and then a method to iterate over all published posts and all approved comments to print them in that format.

In post.rb:
  def webby_header
%{---
title: #{post_title}
created_at: #{post_date.to_s}
---
}
  end

  def publish
    FileUtils.mkdir_p(filedir)
    File.open(filename, "w") do |f|
      f.write(webby_header)
      if [33].include?(id) # Post no. 33 and wp_format don't get along.
        f.write(post_content)
      else
        f.write(wp_format(post_content))
      end
      if !comments.empty?
        f.write("\n\n<hr>\n\n<h3>Comments</h3>\n")
        comments.each do |c|
          f.write(c.to_html)
        end
      end
    end
  end

  def self.publish_all
    FileUtils.rm_rf("../content/said")
    Post.all(:post_status => 'publish').each do |p|
      p.publish
    end
  end
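
One thing to watch with a hand-built header like webby_header: assuming Webby reads that block as YAML, a title containing a colon or starting with a quote could trip up the parser. A minimal sketch of one way around that is to let Ruby's YAML library quote the title. The method name yaml_header is hypothetical, the example title is made up, and this was not part of the original conversion.

```ruby
require "yaml"

# Hypothetical alternative to webby_header: the YAML library quotes the title
# when it needs to, and the date line is written as in the original header.
def yaml_header(title, created_at)
  { "title" => title }.to_yaml +    # emits the opening "---" line and the title
    "created_at: #{created_at}\n" + # unquoted, as in the original header
    "---\n"                         # closing marker
end

# Made-up title containing a colon, which plain interpolation would not quote.
header = yaml_header("DataMapper: a quick tour", Time.new(2008, 7, 4, 9, 30, 0))
puts header
```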

In comment.rb:
  def author_with_url
    if comment_author_url.to_s.empty?
      comment_author
    else
      %{<a href="#{comment_author_url}">#{comment_author}</a>}
    end
  end

  def to_html
    %{
<b>#{author_with_url}</b> #{comment_date.strftime("%d %b %Y")}
#{wp_format(comment_content)}

}
  end

Not the most beautiful of code, but I’m only using it once and it works.

So, when I call Post.publish_all, I get one said/on/YYYY/MM/DD/post-name/index.txt file per published post in my Webby content directory.

And the next time I call rake build, each of those text files will be converted to an HTML page.

I have ignored tags and categories, and I didn’t have to deal with images in any of my blog posts, so that made this job easier. I did have to manually tweak the output for two of these blog posts. In one of them, quotation marks were turned into some bizarre character and, since there were only 6 of them, I changed them by hand. Also one of my posts resisted wp_format completely so I just excluded that one from being formatted and added a Webby textile filter, which worked just fine.

If I had more posts to convert I would have investigated the reasons behind these problems and adjusted my code accordingly, but in this case it made sense to just fix them.

So, there you are. A relatively painless export. I can see that DataMapper is going to be my tool of choice for quickly working with legacy databases and exporting or reformatting them. It’s so quick to set up, and then you have access to any Ruby library you need to help you process your data.

You are free to make use of any of these scripts subject to the terms of the GPL. We really, really need a decent license for code snippets which fits in a single line comment. I’m going with GPL on this one since that is WordPress’s license and I’m using bits of their code here. But, if you want to do something similar to what I have done here not relating to WordPress then you can consider the code I have written to be in the public domain or, if you prefer, the MIT license. And, that’s the code, not the blog post. Of course, if you find this useful I’d love to hear about it in the comments, by email or on your blog.

If you are looking for any of my old posts, there is a list of them here.

Source Code

Download All (tgz)
Download All (zip)