Converting Word Documents to Markdown: A Complete Guide

Using Pandoc, Python, or online tools for conversion to Markdown


Converting Word documents to Markdown is a common task for technical writers, developers, and content creators who want to move their content to platforms that use Markdown, such as GitHub, GitLab, and static site generators like Hugo.

This guide covers multiple approaches and tools to accomplish this conversion effectively.


Why Convert Word to Markdown?

Markdown offers several advantages over Word documents:

  • Version control friendly: Plain text format works well with Git
  • Platform independent: Readable on any system without special software
  • Future-proof: Simple text format that won’t become obsolete
  • Web-ready: Easy to convert to HTML for websites and blogs
  • Lightweight: Much smaller file sizes
  • Automation friendly: Easy to process programmatically

Method 1: Using Pandoc (Recommended)

Pandoc is a universal document converter that excels at converting between markup formats. It is the most reliable tool for Word to Markdown conversion.

What Pandoc Preserves:

  • Headings (converted to #, ##, ###, etc.)
  • Bold and italic formatting
  • Lists (bulleted and numbered)
  • Links and references
  • Tables (converted to Markdown tables or HTML)
  • Code blocks and inline code
  • Images (with --extract-media option)
  • Footnotes

Installing Pandoc

On Ubuntu/Debian:

sudo apt update
sudo apt install pandoc

On macOS:

# Using Homebrew
brew install pandoc

# Or download from the official website
# https://pandoc.org/installing.html

On Windows:

# Using Chocolatey
choco install pandoc

# Or download the installer from:
# https://github.com/jgm/pandoc/releases

Verify Installation:

pandoc --version
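
When the conversion is driven from a script, the same check can be made programmatically. The following is a minimal Python sketch that uses only the standard library; the function names `find_tool` and `tool_version` are illustrative choices, not part of Pandoc.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def tool_version(name):
    """Return the first line of `<name> --version`, or None if the tool is missing."""
    path = find_tool(name)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tool_version("pandoc")
    print(version if version is not None else "pandoc not found on PATH")
```

Calling `tool_version("libreoffice")` works the same way and can be used before attempting the DOC workflow described later.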

Converting with Pandoc

For DOCX files (modern Word format):

pandoc document.docx -o document.md

For older DOC files: Pandoc cannot directly read .doc files. You need to convert them to .docx first using LibreOffice:

# First convert DOC to DOCX
libreoffice --headless --convert-to docx document.doc

# Then convert DOCX to Markdown
pandoc document.docx -o document.md

Advanced Pandoc Options:

# Convert with specific Markdown variant
pandoc document.docx -t gfm -o document.md  # GitHub Flavored Markdown

# Extract images to a folder
pandoc document.docx --extract-media=./images -o document.md

# Request pipe tables and raw HTML explicitly (Pandoc's Markdown already enables both by default)
pandoc document.docx -t markdown+pipe_tables+raw_html -o document.md

# Convert with custom template
pandoc document.docx --template=custom.template -o document.md
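
If these options are set from a program rather than typed by hand, it can help to build the argument list in one place. The sketch below is one possible way to do that in Python; `build_pandoc_command` and `convert` are hypothetical helper names, and only the Pandoc flags already shown above are used.

```python
import subprocess


def build_pandoc_command(src, dest, markdown_format="markdown", media_dir=None):
    """Build the argument list for a DOCX to Markdown pandoc call."""
    cmd = ["pandoc", str(src), "--from=docx", f"--to={markdown_format}"]
    if media_dir is not None:
        # Ask pandoc to write embedded images into media_dir
        cmd.append(f"--extract-media={media_dir}")
    cmd += ["-o", str(dest)]
    return cmd


def convert(src, dest, **options):
    """Run pandoc and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pandoc_command(src, dest, **options), check=True)


if __name__ == "__main__":
    # Show the command without running it
    print(build_pandoc_command("document.docx", "document.md", "gfm", "./images"))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.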

Method 2: Using LibreOffice + Pandoc (For DOC files)

When dealing with older .doc files, this two-step process works best:

Installing LibreOffice

On Ubuntu/Debian:

sudo apt update
sudo apt install libreoffice

On macOS:

brew install --cask libreoffice

On Windows: Download the installer from the LibreOffice website

Conversion Process:

# Step 1: Convert DOC to DOCX
libreoffice --headless --convert-to docx document.doc

# Step 2: Convert DOCX to Markdown with Pandoc
pandoc document.docx -o document.md

# Clean up intermediate file (optional)
rm document.docx

Batch Conversion Script with pandoc:

Create a script to convert multiple files:

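#!/bin/bash
# convert-docs.sh: convert every .doc file in the current directory to Markdown

for file in *.doc; do
    # Skip the literal pattern when no .doc files match
    [ -f "$file" ] || continue

    # Filename without the .doc extension
    name="${file%.doc}"

    # Do not overwrite (and later delete) an existing DOCX with the same name
    if [ -e "${name}.docx" ]; then
        echo "Skipping $file: ${name}.docx already exists" >&2
        continue
    fi

    echo "Converting $file..."

    # Convert DOC to DOCX, then DOCX to Markdown; report a failure in either step
    if libreoffice --headless --convert-to docx "$file" &&
       pandoc "${name}.docx" -o "${name}.md"; then
        echo "✓ Created ${name}.md"
    else
        echo "✗ Failed to convert $file" >&2
    fi

    # Clean up the intermediate DOCX file
    rm -f "${name}.docx"
done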

Make it executable and run:

chmod +x convert-docs.sh
./convert-docs.sh
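
The same two-step process can also be sketched in Python. This version writes the intermediate DOCX into a temporary directory so that it never collides with files next to the source. It assumes the `libreoffice` and `pandoc` executables are on the PATH (on some systems the LibreOffice binary is called `soffice`), and all function names are illustrative.

```python
import subprocess
import tempfile
from pathlib import Path


def doc_to_docx_command(doc_path, out_dir):
    """LibreOffice call that writes the converted DOCX into out_dir."""
    return ["libreoffice", "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(doc_path)]


def docx_to_md_command(docx_path, md_path):
    """Pandoc call that converts a DOCX file to Markdown."""
    return ["pandoc", str(docx_path), "-o", str(md_path)]


def convert_doc(doc_path):
    """Convert one .doc file to a .md file next to it, via a temporary DOCX."""
    doc_path = Path(doc_path)
    md_path = doc_path.with_suffix(".md")
    with tempfile.TemporaryDirectory() as tmp:
        docx_path = Path(tmp) / (doc_path.stem + ".docx")
        # Step 1: DOC to DOCX inside the temporary directory
        subprocess.run(doc_to_docx_command(doc_path, tmp), check=True)
        # Step 2: DOCX to Markdown next to the original file
        subprocess.run(docx_to_md_command(docx_path, md_path), check=True)
    # The temporary directory and the intermediate DOCX are removed here
    return md_path
```

A call such as `convert_doc("report.doc")` would leave only `report.md` behind; `report.doc` here is a placeholder file name.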

Method 3: Online Converters (Quick & Easy)

For occasional conversions, online tools can be convenient:

Popular Online Converters:

Pros and Cons:

  • Pros: No installation required, works on any device
  • Cons: Privacy concerns, file size limits, less control over output

Method 4: Using Word’s Built-in Export (Limited)

Word can save a document as filtered HTML, which an HTML to Markdown converter such as Pandoc can then turn into Markdown:

  1. Open your document in Word
  2. Go to File → Export → Change File Type
  3. Select Web Page, Filtered (*.html)
  4. Convert the resulting HTML file to Markdown with Pandoc:
pandoc document.html -o document.md

Note: This method often produces suboptimal results compared to direct DOCX conversion.

Method 5: Programming Solutions

Python with python-docx:

#!/usr/bin/env python3
"""Basic DOCX to Markdown conversion: paragraph text only, no formatting."""
import sys

from docx import Document


def docx_to_markdown(docx_path, md_path):
    # Read the docx file
    doc = Document(docx_path)

    # Collect the plain text of each non-empty paragraph
    # (headings, lists, tables, and inline formatting are not converted)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]

    # Separate paragraphs with blank lines, as Markdown expects
    markdown_content = '\n\n'.join(paragraphs) + '\n'

    # Write to file
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write(markdown_content)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python docx_to_md.py input.docx output.md")
        sys.exit(1)

    docx_to_markdown(sys.argv[1], sys.argv[2])
    print(f"Converted {sys.argv[1]} to {sys.argv[2]}")

Install dependencies:

pip install python-docx

Note: This is a basic implementation. Pandoc will produce better results for complex documents.

Handling Common Issues

1. Complex Tables

# Request pipe tables, which most Markdown renderers support (Pandoc's Markdown already enables them by default)
pandoc document.docx -t markdown+pipe_tables -o document.md

2. Images Not Converting

# Extract images to a separate folder
pandoc document.docx --extract-media=./images -o document.md
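
After extraction, the image links in the Markdown file point into the extraction folder, which may not match where a website expects its images. The sketch below shows one way to rewrite those link prefixes in Python. It is a simplified example: `rewrite_image_paths` is a hypothetical helper, the `/static/img/` prefix is an assumed target, and only inline `![alt](path)` images are handled, not reference-style images or raw HTML `<img>` tags.

```python
import re

# Matches inline Markdown images: ![alt](target) with an optional "title"
IMAGE_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)((?:\s+"[^"]*")?)\)')


def rewrite_image_paths(markdown, old_prefix, new_prefix):
    """Replace old_prefix at the start of every image target with new_prefix."""
    def replace(match):
        alt, target, title = match.groups()
        if target.startswith(old_prefix):
            target = new_prefix + target[len(old_prefix):]
        return f"![{alt}]({target}{title})"

    return IMAGE_PATTERN.sub(replace, markdown)


if __name__ == "__main__":
    sample = "Intro ![Chart](./images/media/image1.png) and [a link](./images/readme.md)"
    print(rewrite_image_paths(sample, "./images/", "/static/img/"))
```

Ordinary links are left untouched because the pattern requires the leading `!` of an image.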

3. Formatting Loss

# Keep raw HTML for formatting that Markdown cannot express (Pandoc's Markdown already enables raw_html by default)
pandoc document.docx -t markdown+raw_html -o document.md

4. Character Encoding Issues

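# Pandoc reads and writes UTF-8; if characters look wrong, check the encoding your editor or terminal assumes.
# Stating both formats explicitly rules out format misdetection.
pandoc --from=docx --to=markdown document.docx -o document.md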

Best Practices

1. Pre-conversion Preparation

  • Clean up your Word document before conversion
  • Use consistent heading styles (Heading 1, Heading 2, etc.)
  • Avoid complex formatting that doesn’t translate well to Markdown
  • Use Word’s built-in list formatting rather than manual bullets

2. Post-conversion Cleanup

  • Review the output for formatting issues
  • Fix table formatting if needed
  • Adjust image paths and alt text
  • Clean up extra line breaks or spacing issues
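
The last cleanup step is easy to automate. Below is a minimal Python sketch that strips trailing whitespace and collapses runs of blank lines outside fenced code blocks. The name `tidy_markdown` is illustrative. The function assumes hard line breaks are not encoded as two trailing spaces, since those would be stripped as well.

```python
def tidy_markdown(text):
    """Strip trailing whitespace and collapse repeated blank lines.

    Lines inside fenced code blocks (``` or ~~~) are left unchanged.
    """
    out = []
    in_fence = False
    blank_run = 0
    for line in text.splitlines():
        # Toggle fence state on every opening or closing fence line
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
        if in_fence:
            out.append(line)
            blank_run = 0
            continue
        line = line.rstrip()
        if line == "":
            blank_run += 1
            if blank_run > 1:
                continue  # drop every blank line after the first in a run
        else:
            blank_run = 0
        out.append(line)
    return "\n".join(out).strip("\n") + "\n"
```

Running the converted file through such a function before committing it keeps diffs small when the same document is converted again later.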

3. Automation Tips

# Create an alias for common conversion
echo 'alias doc2md="pandoc --from=docx --to=markdown"' >> ~/.bashrc

# Function for batch conversion
doc2md_batch() {
    for file in *.docx; do
        # Skip the literal pattern when no .docx files match
        [ -f "$file" ] || continue
        pandoc "$file" -o "${file%.docx}.md"
    done
}
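
For larger document trees, a Python variant can walk subfolders and skip files whose Markdown output is already up to date. This is a sketch under assumptions: the helper names `plan_conversions` and `convert_all` are made up for illustration, "up to date" is judged by file modification time only, and `pandoc` is expected on the PATH.

```python
import subprocess
from pathlib import Path


def plan_conversions(root):
    """Yield (docx, md) path pairs under root whose Markdown is missing or stale."""
    for docx in sorted(Path(root).rglob("*.docx")):
        if docx.name.startswith("~$"):
            continue  # skip the lock files Word creates while a document is open
        md = docx.with_suffix(".md")
        if not md.exists() or md.stat().st_mtime < docx.stat().st_mtime:
            yield docx, md


def convert_all(root):
    """Run pandoc for every pending pair and return the list of created files."""
    created = []
    for docx, md in plan_conversions(root):
        subprocess.run(["pandoc", str(docx), "-o", str(md)], check=True)
        created.append(md)
    return created
```

Separating the planning step from the pandoc call makes it possible to print or review the pending list before anything is converted.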

Comparison of Methods

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Pandoc | Excellent quality, many options | Requires installation | Regular conversions, complex documents |
| LibreOffice + Pandoc | Handles DOC files | Two-step process | Legacy DOC files |
| Online Converters | No installation | Privacy, limited features | Quick one-off conversions |
| Word Export | Built-in | Poor quality output | Simple documents only |
| Programming | Customizable | Requires coding | Automated workflows |

Summary

For most users, Pandoc is the recommended solution for converting Word documents to Markdown. It provides the best balance of quality, features, and reliability. For legacy .doc files, the LibreOffice + Pandoc combination works well.

The keys to a successful conversion are:

  1. Prepare your Word document with consistent formatting
  2. Choose the right tool for your specific needs
  3. Review and clean up the output
  4. Automate the process if you’re doing regular conversions

With these tools and techniques, you can efficiently convert your Word documents to Markdown format while preserving most of the original formatting and structure.

Quick Reference Commands

# Basic conversion (DOCX to Markdown)
pandoc document.docx -o document.md

# DOC to Markdown (two steps)
libreoffice --headless --convert-to docx document.doc
pandoc document.docx -o document.md

# GitHub Flavored Markdown
pandoc document.docx -t gfm -o document.md

# Extract images
pandoc document.docx --extract-media=./images -o document.md

# Batch convert all DOCX files
for file in *.docx; do pandoc "$file" -o "${file%.docx}.md"; done