Clean HTML from Microsoft Word: Rube Goldberg Method

Wouldn’t it be nice to be able to save nice clean HTML from Microsoft Word? Here’s a way that is only slightly convoluted.

If you use the “Web Page, Filtered” option from Word, you get buried in piles of CSS cruft, but as noted here, you can use the Publish > Blog menu to generate pretty clean HTML… to a blog. So, just sign yourself up for a free blog account somewhere, figure out how to enable remote publishing, and um… BZZZZT! Next idea please!

So here’s a Python script that is a small MetaWebLog server. Run it locally on your machine and publish straight to your hard drive!

How to use it

  1. Save the script below as mwl.py
  2. Run “python mwl.py” from a command prompt
  3. From the Office menu thingy, choose Publish > Blog
  4. Go ahead and click Register Now when prompted to register a blog account
  5. For the blog provider, pick “Other” and click Next
  6. In the “API” drop-down, choose “MetaWebLog”
  7. For “Blog Post URL”, enter “http://localhost:8585″
  8. You can optionally put whatever you want for the “User Name” and “Password”, and check the “Remember Password” box so that Word won’t keep nagging you for credentials
  9. If all goes well, Word should report “Account registration successful.”
  10. Where it says “Enter Post Title Here”, type what you want your file to be named (it will be saved in your Documents folder, and a .html extension will be added)
    WARNING: It will overwrite any existing file at that location!
  11. Click the Publish button, and your wonderfully clean HTML file will be saved

Update 10 Jun 2012: The script outputs UTF-8 now.

# mwl.py - by Eric Smith - http://esmithy.net

from SimpleXMLRPCServer import SimpleXMLRPCServer
import os.path

PORT = 8585
BLOG_URL = "http://localhost:{0}".format(PORT)
HTML = u'{1}'

def get_user_blogs(key, username, password):
    return [{'url':BLOG_URL, 'blogid':'1', 'blogName':'Save HTML'}]
    
def new_post(blogid, username, password, struct, publish):
    filename = _get_filename(struct['title'])
    _write_html(filename, struct['title'], struct['description'])
    return struct['title']

def get_post(postid, username, password):
    filename = _get_filename(postid)
    print 'Reading HTML from ' + filename
    struct = {'title':postid}
    with open(filename, 'r') as f:
        struct['description'] = f.read()
    return struct

def edit_post(postid, username, password, struct, publish):
    filename = _get_filename(postid)
    _write_html(filename, struct['title'], struct['description'])
    return True

def _get_filename(n):
    return os.path.expanduser('~/Documents/{0}.html'.format(n))
    
def _write_html(filename, title, body):
    print 'Saving HTML to ' + filename
    with open(filename, 'w') as f:
        f.write(HTML.format(title, body).encode("utf-8"))

def main():
    server = SimpleXMLRPCServer(('localhost', PORT))
    server.register_introspection_functions()
    server.register_function(get_user_blogs, 'blogger.getUsersBlogs')
    server.register_function(new_post, 'metaWeblog.newPost')
    server.register_function(get_post, 'metaWeblog.getPost')
    server.register_function(edit_post, 'metaWeblog.editPost')
    print "Listening on port {0}".format(PORT)
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass

if __name__ == '__main__':
    main()

10 thoughts on “Clean HTML from Microsoft Word: Rube Goldberg Method

  1. I am trying to get this working and have the script running on Linux (Fedora 17) with Python 2.7.3 and on Windows 7 with Python 2.7.3. The script runs and I can connect with Word 2010 but when I publish, Word complains that it can’t publish the page and the resulting file is empty on the server. Does this require a specific version of Python, like 3.x, or is something else not quite right?

    Shawn

  2. I tested with Python 3.2 and it fails to run at all. It looks like a lot of the syntax has changed in Python 3.2. given that, I don’t believe you are using 3.2 and that testing with 2.7 is the right approach.
    After further debugging, it seems that it fails during the write call in this function:
    def _write_html(filename, title, body):
    print ‘Saving HTML to ‘ + filename
    with open(filename, ‘rw’) as f:
    f.write(HTML.format(title, body))

    I see the print there but nothing after that and Word complains. If I add another print after the f.write (indented properly), it isn’t shown on the command line running the code. If I comment out the “with open” and the “f.write” lines and run it, Word is successful in publishing and any later prints in the flow show up on the command line.

    It does create the file wherever it is defined in the code (~/Documents/.html) but the file is zero length. Same behavior on Windows and Linux.

    Thoughts?

  3. Ok, I got it working. It didn’t like one of my tests in Word. Appears to be invalid in this case. Works good though, thanks!

    • Hi Shawn,

      I should have mentioned that I wrote this targeting Python 2.x rather than 3.x. Glad that you got it to work with absolutely no help from me ;)

      Let me know if you think there are tweaks I should make to the code or the instructions to make life easier.

      Thanks

      • The only thing I would suggest is moving the hostname information and save location up to the top and stored in a variable (for each). That way, you have PORT, HOSTNAME, and SAVE_FOLDER that can easily be modified and will handle changing ports, changing from localhost to a remote host (maybe a central host that several people use?) and setting the save folder (useful in the case where you are using a central host and want the files in a predictable place there regardless of the user doing the posting).

        If you really want to be “crazy”, you could use the username passed down to create a folder in that save folder to store that “users” files in so that multiple users don’t collide (or less easily collide anyway).

        Those would be pretty straight forward changes to make that would make it cleaner to use. The only other thing I would suggest is seeing what it will take to get it working under Python 3, since that is the currently “actively developed” stable version. I don’t know enough python myself to know what it doesn’t like some of the syntax (the things it doesn’t like don’t make sense to me).

        I will most likely be poking around with it and if I get it working with those types of changes above, I will send the changed version on to you. What is the best way to do that?

      • I got those changes in and it is working pretty nicely. I have some changes in that actually check for and create the user’s folder when they connect if it doesn’t exist. Try getting the file from this link and let me know if that works for you or not: http://dl.dropbox.com/u/2274101/mwl.py

        • I thought about parameterizing the script a bit with command-line arguments for host, port, save location, etc., but ultimately decided to go with the simpler solution just the way I wanted it. If more of a central, multi-user environment is what you’re looking for, then great — your modified script looks good.

          On the Python 2.x vs. 3.x issue — 2.x continues to be the “workhorse” version of the language as people slowly migrate to 3.x and it gets wider adoption. In some environments (including some I have to work with in my day job), it is hard to even get a 2.7 version, let alone 3.x. If you’re interested in converting my script, though, take a look at http://docs.python.org/library/2to3.html.

          Thanks for your interest.

  4. Thanks. Just trying to figure out how to adapt this to put in Apache\cgi-bin, where it will always be accessible without needing to run the file first, of course!

  5. Started trying to create a script that would be served by Apache Server by placing in Apache\cgi-bin. Ended up with a much simpler solution:
    Create a backup of Blog.dotx template then open it with word.
    Click on the down arrow next to the Quick Access Toolbar and select More Commands
    Select Choose commands from: Developer tab > Design View > OK
    Click the newly added Design View icon and selct the ‘Enter title’ block at top of the document then delete it.
    Save as Blog.dotx in templates folder and the next time you select Publish > Blog it will give you the html page you wanted!
    Unfortunately you still get the ‘create a blogging account’ nag screen, though.

  6. Previous instructions slightly wrong:
    To create clean html pages in Word

    Backup Blog.docx in templates folder
    In Word, select Publish > Blog
    Add Design view to Quick Access toolbar from Developer tab in the opened blogging window
    Select Design View and delete the top ‘Enter title’ block
    Save as Blog.dotx in templates folder

    To create clean HTML from Word documents you can now select Publish > Blog and save the file
    Still get ‘Create a blogging account’ dialog, though

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>