I was recently given the task of converting many PDFs to HTML. This is not fun. However, it becomes fun if you get to play with Sublime Text 2’s settings, packages, and amazing, stupendous power. Here’s what I wanted to do:
- In Adobe Reader, save the PDF as HTML. Then do all of the following to the resulting HTML file.
- run HTML Tidy
- delete doctype tag
- delete opening/closing HTML tags, and closing Body tag
- unindent two tabs
- delete opening Body tag
- delete head tag (and all child tags)
- delete style blocks
- delete all br tags
- delete all •
- delete all classes
- delete all span tags
- remove all width and height attributes from image tags
- replace “ with "
- replace ” with "
- replace ’ with '
- replace copyright symbol ©
- replace trademark symbol ™
- replace registration symbol ®
- reconnect words that have been randomly broken into multiple “words”
- convert all totally uppercase strings to “Title Case”
- wrap each section in the appropriate tag
- style as needed (should hardly need anything given the previous step and the fact that styles have already been written)
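The "Title Case" step is one I did not automate in Sublime, but it is easy to sketch in code. Here is a rough Python version, assuming "totally uppercase" means a word of two or more capital letters. The name `title_case_caps` is made up for this example, and the sketch will also flatten acronyms (PDF becomes Pdf), so treat it as a starting point rather than a finished tool.

```python
import re

# Assumption: an "all uppercase string" is a word of 2+ ASCII capital letters.
# Single capitals such as "A" or "I" are left untouched.
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def title_case_caps(text):
    """Capitalize each all-uppercase word; leave mixed-case words alone."""
    return ALL_CAPS_WORD.sub(lambda m: m.group(0).capitalize(), text)


if __name__ == "__main__":
    print(title_case_caps("ANNUAL REPORT for 2012"))
```

Run it on the text content only. Pointing it at raw markup would also change any uppercase tag or attribute names.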
Using Sublime Text, steps 2 through 19 could be automated. Here’s how.
- Install “HtmlTidy” package
- Open Command Palette, type “install package”, type “HtmlTidy”.
- This assumes you’ve already installed the package “Package Control”. If not, install it first.
- Install “RegReplace” package
- Open Command Palette, type “install package”, type “RegReplace”.
- As far as I can tell, this is the only way to automate a large set of find/replace commands in Sublime Text. Rather surprising, but true. Anyway, this package works great.
- Read up on macros. They’re really, really easy. Mostly you just need to know Ctrl+Q, which both starts and stops the recording of a macro.
- Record a macro for steps 2 through 5: press Ctrl+Q to start recording, perform the steps, then press Ctrl+Q again to stop. After that, choose Tools -> Save Macro and name it for future use.
- Start using RegReplace to do steps 6 through 19.
- You’ll want to copy over two files, and then customize those files to your heart’s content.
- (There’s actually more that could be said here, more detail, but I’ve run out of time, and my memory is bad. Either way! This post is mostly complete and still has some value.)
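Since I can't reproduce my RegReplace settings here, this is a rough Python sketch of the same idea: an ordered list of find/replace rules covering most of steps 6 through 19. Every pattern below is my own guess at what each step needs, not something copied from the package, and the names `RULES`, `strip_img_sizes` and `clean` are invented for the example. The quote and symbol replacements are left out.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Ordered (pattern, replacement) pairs. All patterns are assumptions.
RULES = [
    (re.compile(r"<head\b.*?</head>", FLAGS), ""),    # head tag and children
    (re.compile(r"<style\b.*?</style>", FLAGS), ""),  # style blocks
    (re.compile(r"<body\b[^>]*>", FLAGS), ""),        # opening body tag
    (re.compile(r"<br\s*/?>", FLAGS), ""),            # br tags
    (re.compile("\u2022"), ""),                        # bullet characters
    (re.compile(r"""\s+class\s*=\s*(?:"[^"]*"|'[^']*')""", FLAGS), ""),  # classes
    (re.compile(r"</?span\b[^>]*>", FLAGS), ""),      # span tags, contents kept
]

IMG_TAG = re.compile(r"<img\b[^>]*>", FLAGS)
SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s/>]+)""", FLAGS
)


def strip_img_sizes(html):
    """Remove width and height attributes, but only inside img tags."""
    return IMG_TAG.sub(lambda m: SIZE_ATTR.sub("", m.group(0)), html)


def clean(html):
    """Apply every rule in order, then fix up the img tags."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return strip_img_sizes(html)


if __name__ == "__main__":
    sample = (
        '<head><title>x</title></head><body class="a">'
        '<p class="MsoNormal"><span style="x">Hi</span><br/>\u2022 there '
        '<img src="a.png" width="10" height="20"></p>'
    )
    print(clean(sample))
```

Regexes are a blunt tool for HTML. The class rule, for example, would also match the text ` class="..."` if it appeared outside a tag. That was fine for my one-off cleanup, but check the output before trusting it.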