Extract Text from PDFs with PDFMiner in Python
Master PDF text extraction with Python
PDFMiner.six is a powerful Python library for extracting text, metadata, and layout information from PDF documents.
Unlike simple PDF readers, it provides deep analysis of PDF structure and handles complex layouts effectively.

What is PDFMiner and Why Use It?
PDFMiner is a pure-Python library designed to extract and analyze text from PDF documents. The .six version is the actively maintained fork that supports Python 3.x, while the original PDFMiner project is no longer updated.
Key Features:
- Pure Python implementation (no external dependencies)
- Detailed layout analysis and text positioning
- Font and character encoding detection
- Support for encrypted PDFs
- Command-line tools included
- Extensible architecture for custom processing
PDFMiner is particularly useful when you need precise control over text extraction, need to preserve layout information, or work with complex multi-column documents. While it may be slower than some alternatives, its accuracy and detailed analysis capabilities make it the preferred choice for document processing pipelines. For the reverse workflow, you might also be interested in generating PDFs programmatically in Python.
Installation and Setup
Install PDFMiner.six using pip:
pip install pdfminer.six
For virtual environments (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install pdfminer.six
If you’re new to Python package management, check out our Python Cheatsheet for more details on pip and virtual environments.
Verify the installation:
python -c "import pdfminer; print(pdfminer.__version__)"
The library includes several command-line tools:
- pdf2txt.py - Extract text from PDFs
- dumppdf.py - Dump PDF internal structure
- latin2ascii.py - Convert Latin characters to ASCII
These tools complement other PDF manipulation utilities like Poppler that provide additional functionality such as page extraction and format conversion.
Basic Text Extraction
Simple Text Extraction
The most straightforward way to extract text from a PDF:
from pdfminer.high_level import extract_text
# Extract all text from a PDF
text = extract_text('document.pdf')
print(text)
This high-level API handles most common use cases and returns the entire document as a single string.
Extract Text from Specific Pages
To extract text from specific pages:
from pdfminer.high_level import extract_text
# Extract text from pages 2-5 (0-indexed)
text = extract_text('document.pdf', page_numbers=[1, 2, 3, 4])
print(text)
This is particularly useful for large documents where you only need certain sections, significantly improving performance.
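Because page_numbers is 0-indexed, off-by-one mistakes are easy when callers think in 1-based page numbers. A small conversion helper can guard against that (parse_page_ranges is a hypothetical name, not part of pdfminer.six):

```python
def parse_page_ranges(spec):
    """Convert a 1-based range string like '2-5,8' into the
    0-indexed list that extract_text's page_numbers expects.
    Hypothetical helper, not part of pdfminer.six."""
    pages = []
    for part in spec.split(','):
        if '-' in part:
            start, end = part.split('-')
            # range() end is exclusive, so the 1-based end maps directly
            pages.extend(range(int(start) - 1, int(end)))
        else:
            pages.append(int(part) - 1)
    return pages

# parse_page_ranges('2-5,8') -> [1, 2, 3, 4, 7]
```

You can then pass the result straight to extract_text(..., page_numbers=parse_page_ranges('2-5,8')).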
Extract Text with Page Iteration
For processing pages individually:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())
This approach gives you more control over how each page is processed, useful when working with documents where page structure varies.
Advanced Layout Analysis
Understanding LAParams
LAParams (Layout Analysis Parameters) control how PDFMiner interprets document layout. Understanding the difference between PDFMiner and simpler libraries is crucial here - PDFMiner actually analyzes the spatial relationships between text elements.
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
# Create custom LAParams
laparams = LAParams(
    line_overlap=0.5,      # Min vertical overlap for chars to share a line
    char_margin=2.0,       # Max gap between chars on the same line
    line_margin=0.5,       # Max gap between lines in the same text box
    word_margin=0.1,       # Gap beyond which a space is inserted
    boxes_flow=0.5,        # -1.0 (horizontal order) to +1.0 (vertical order)
    detect_vertical=True,  # Detect vertical text
    all_texts=False        # Skip layout analysis inside figures
)
text = extract_text('document.pdf', laparams=laparams)
Parameter Explanation:
- line_overlap: vertical overlap (as a proportion of character height) required for two characters to be considered on the same line
- char_margin: maximum spacing between characters on the same line (as a multiple of character width)
- line_margin: maximum spacing between lines grouped into the same text box
- word_margin: spacing threshold beyond which a space character is inserted between words
- boxes_flow: how much horizontal vs. vertical position determines text box reading order (-1.0 to +1.0; None disables advanced layout analysis)
- detect_vertical: enable detection of vertical text (common in Asian languages)
Extracting Layout Information
Get detailed position and font information:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTTextLine, LTChar
for page_layout in extract_pages('document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            # Get bounding box coordinates
            x0, y0, x1, y1 = element.bbox
            print(f"Text at ({x0}, {y0}): {element.get_text()}")
            # Iterate through lines
            for text_line in element:
                if isinstance(text_line, LTTextLine):
                    # Get character-level details
                    for char in text_line:
                        if isinstance(char, LTChar):
                            print(f"Char: {char.get_text()}, "
                                  f"Font: {char.fontname}, "
                                  f"Size: {char.height}")
This level of detail is invaluable for document analysis, form extraction, or when you need to understand document structure programmatically.
Handling Different PDF Types
Encrypted PDFs
PDFMiner can handle password-protected PDFs:
from pdfminer.high_level import extract_text
# Extract from password-protected PDF
text = extract_text('encrypted.pdf', password='your_password')
Note that PDFMiner can only extract text from PDFs - it cannot bypass security restrictions that prevent text extraction at the PDF level.
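When several candidate passwords might apply, you can try them in order. This is a sketch with a hypothetical try_passwords helper (not part of pdfminer.six); the extraction function is injectable so the retry logic can be exercised without a real encrypted PDF:

```python
def try_passwords(pdf_path, passwords, extract=None):
    """Try candidate passwords in order; return (password, text) for
    the first one that decrypts the file, or (None, None) if none work.
    Hypothetical helper, not part of pdfminer.six."""
    if extract is None:
        # Default to pdfminer's real extractor when none is injected
        from pdfminer.high_level import extract_text
        extract = extract_text
    for pw in passwords:
        try:
            return pw, extract(pdf_path, password=pw)
        except Exception:
            # pdfminer raises PDFPasswordIncorrect for a wrong password
            continue
    return None, None
```

In production you would catch pdfminer's specific exception rather than a bare Exception, so that unrelated errors (a missing file, a corrupt PDF) are not silently swallowed.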
Multi-Column Documents
For documents with multiple columns, tune LAParams:
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
# Optimize for multi-column layouts
laparams = LAParams(
    detect_vertical=False,
    line_margin=0.3,
    word_margin=0.1,
    boxes_flow=0.3  # Lower value for better column detection
)
text = extract_text('multi_column.pdf', laparams=laparams)
The boxes_flow parameter is particularly important for multi-column documents - lower values help PDFMiner distinguish between separate columns.
Non-English and Unicode Text
PDFMiner.six returns extracted text as a Python str, so Unicode decoding is handled for you (the codec argument to extract_text is deprecated in recent releases). The important part is writing the text out with an explicit encoding:
from pdfminer.high_level import extract_text
text = extract_text('multilingual.pdf')
# Save to file with UTF-8 encoding
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)
Working with Scanned PDFs
PDFMiner cannot extract text from scanned PDFs (images) directly. These require OCR (Optical Character Recognition). However, you can integrate PDFMiner with OCR tools.
Here’s how to detect if a PDF is scanned and needs OCR:
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTFigure, LTImage
def is_scanned_pdf(pdf_path):
    """Check if PDF appears to be scanned (mostly images)"""
    text_count = 0
    image_count = 0
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, (LTFigure, LTImage)):
                image_count += 1
            elif hasattr(element, 'get_text'):
                if element.get_text().strip():
                    text_count += 1
    # If mostly images and little text, likely scanned
    return image_count > text_count * 2

if is_scanned_pdf('document.pdf'):
    print("This PDF appears to be scanned - use OCR")
else:
    text = extract_text('document.pdf')
    print(text)
For scanned PDFs, consider integrating with Tesseract OCR or using tools to extract images from PDFs first, then applying OCR to those images.
Command-Line Usage
PDFMiner includes powerful command-line tools:
Extract Text with Command-Line Tools
# Extract text to stdout
pdf2txt.py document.pdf
# Save to file
pdf2txt.py -o output.txt document.pdf
# Extract specific pages
pdf2txt.py -p 1,2,3 document.pdf
# Extract as HTML
pdf2txt.py -t html -o output.html document.pdf
Advanced Options
# Custom layout parameters (-L line margin, -W word margin)
pdf2txt.py -L 0.3 -W 0.1 document.pdf
# Extract with detailed layout (XML)
pdf2txt.py -t xml -o layout.xml document.pdf
# Set password for encrypted PDF
pdf2txt.py -P mypassword encrypted.pdf
These command-line tools are excellent for quick testing, shell scripts, and integration into automated workflows.
Performance Optimization
Processing Large PDFs
For large documents, consider these optimization strategies:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LAParams
# Process only needed pages
def extract_page_range(pdf_path, start_page, end_page):
text_content = []
for i, page_layout in enumerate(extract_pages(pdf_path)):
if i < start_page:
continue
if i >= end_page:
break
text_content.append(page_layout)
return text_content
# Disable layout analysis for speed
from pdfminer.high_level import extract_text
text = extract_text('large.pdf', laparams=None) # Much faster
Batch Processing
For processing multiple PDFs efficiently:
from multiprocessing import Pool
from pdfminer.high_level import extract_text
import os
def process_pdf(pdf_path):
    """Process single PDF file"""
    try:
        text = extract_text(pdf_path)
        output_path = pdf_path.replace('.pdf', '.txt')
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)
        return f"Processed: {pdf_path}"
    except Exception as e:
        return f"Error processing {pdf_path}: {str(e)}"

# Process PDFs in parallel
def batch_process_pdfs(pdf_directory, num_workers=4):
    pdf_files = [os.path.join(pdf_directory, f)
                 for f in os.listdir(pdf_directory)
                 if f.endswith('.pdf')]
    with Pool(num_workers) as pool:
        results = pool.map(process_pdf, pdf_files)
    for result in results:
        print(result)

# Usage
batch_process_pdfs('/path/to/pdfs', num_workers=4)
Common Issues and Solutions
Issue: Incorrect Text Order
Problem: Extracted text appears jumbled or out of order.
Solution: Adjust LAParams, especially boxes_flow:
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
laparams = LAParams(boxes_flow=0.3)  # Try different values
text = extract_text('document.pdf', laparams=laparams)
Issue: Missing Spaces Between Words
Problem: Words run together without spaces.
Solution: Increase word_margin:
laparams = LAParams(word_margin=0.2) # Increase from default 0.1
text = extract_text('document.pdf', laparams=laparams)
Issue: Encoding Errors
Problem: Strange characters or (cid:xx) placeholders in the output.
Solution: These usually come from fonts with missing or broken character maps; the deprecated codec argument will not fix them. Post-process the extracted text (e.g., Unicode normalization, stripping unmapped glyphs) and always write output files with an explicit encoding:
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)
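When odd characters survive extraction, a small post-processing pass often helps. A minimal sketch (clean_extracted_text is a hypothetical helper, not part of pdfminer.six):

```python
import unicodedata

def clean_extracted_text(text):
    """Normalize Unicode and drop stray control characters left over
    from extraction. Hypothetical helper, not part of pdfminer.six."""
    # NFKC folds ligatures such as U+FB01 ('fi') into plain letters
    text = unicodedata.normalize('NFKC', text)
    # Keep newlines, tabs and form feeds (extract_text separates
    # pages with '\x0c'); drop other control characters
    return ''.join(ch for ch in text
                   if ch in '\n\t\x0c'
                   or unicodedata.category(ch)[0] != 'C')
```

Adjust the whitelist to your needs; if you don't care about page boundaries, drop '\x0c' from it as well.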
Issue: Memory Errors with Large PDFs
Problem: Out of memory errors with large files.
Solution: Process page by page:
from pdfminer.high_level import extract_pages
def extract_text_chunked(pdf_path, chunk_size=10):
    """Extract text in chunks to reduce memory usage"""
    all_text = []
    page_count = 0
    for page_layout in extract_pages(pdf_path):
        page_text = []
        for element in page_layout:
            if hasattr(element, 'get_text'):
                page_text.append(element.get_text())
        all_text.append(''.join(page_text))
        page_count += 1
        # Yield a chunk every chunk_size pages
        if page_count % chunk_size == 0:
            yield ''.join(all_text)
            all_text = []
    # Yield remaining text
    if all_text:
        yield ''.join(all_text)
Comparing PDFMiner with Alternatives
Understanding when to use PDFMiner versus other libraries is important:
PDFMiner vs PyPDF2
PyPDF2 is simpler and faster but less accurate:
- Use PyPDF2 for: Simple PDFs, quick extraction, merging/splitting PDFs
- Use PDFMiner for: Complex layouts, accurate text positioning, detailed analysis
PDFMiner vs pdfplumber
pdfplumber builds on PDFMiner with a higher-level API:
- Use pdfplumber for: Table extraction, simpler API, quick prototyping
- Use PDFMiner for: Maximum control, custom processing, production systems
PDFMiner vs PyMuPDF (fitz)
PyMuPDF is significantly faster but has C dependencies:
- Use PyMuPDF for: Performance-critical applications, large-scale processing
- Use PDFMiner for: Pure Python requirement, detailed layout analysis
Practical Example: Extract and Analyze Document
Here’s a complete example that extracts text and provides document statistics:
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextBox, LTChar
from collections import Counter

def analyze_pdf(pdf_path):
    """Extract text and provide document analysis"""
    # Extract full text
    full_text = extract_text(pdf_path)
    # Statistics
    stats = {
        'total_chars': len(full_text),
        'total_words': len(full_text.split()),
        'total_lines': full_text.count('\n'),
        'fonts': Counter(),
        'font_sizes': Counter(),
        'pages': 0
    }
    # Detailed analysis
    for page_layout in extract_pages(pdf_path):
        stats['pages'] += 1
        for element in page_layout:
            if isinstance(element, LTTextBox):
                for line in element:
                    for char in line:
                        if isinstance(char, LTChar):
                            stats['fonts'][char.fontname] += 1
                            stats['font_sizes'][round(char.height, 1)] += 1
    return {
        'text': full_text,
        'stats': stats,
        'most_common_font': stats['fonts'].most_common(1)[0] if stats['fonts'] else None,
        'most_common_size': stats['font_sizes'].most_common(1)[0] if stats['font_sizes'] else None
    }

# Usage
result = analyze_pdf('document.pdf')
print(f"Pages: {result['stats']['pages']}")
print(f"Words: {result['stats']['total_words']}")
print(f"Main font: {result['most_common_font']}")
print(f"Main size: {result['most_common_size']}")
Integration with Document Processing Pipelines
PDFMiner works well in larger document processing workflows. For example, when building RAG (Retrieval-Augmented Generation) systems or document management solutions, you might combine it with other Python tools for a complete pipeline.
Once you’ve extracted text from PDFs, you often need to convert it to other formats. You can convert HTML content to Markdown using Python libraries or even leverage LLM-powered conversion with Ollama for intelligent document transformation. These techniques are particularly useful when PDF extraction produces HTML-like structured text that needs cleaning and reformatting.
For comprehensive document conversion pipelines, you might also need to handle Word document to Markdown conversion, creating a unified workflow that processes multiple document formats into a common output format.
Best Practices
- Always use LAParams for complex documents - The default settings work for simple documents, but tuning LAParams significantly improves results for complex layouts.
- Test with sample pages first - Before processing large batches, test your extraction settings on representative samples.
- Handle exceptions gracefully - PDF files can be corrupted or malformed. Always wrap extraction code in try-except blocks.
- Cache extracted text - For repeated processing, cache extracted text to avoid re-processing.
- Validate extracted text - Implement checks to verify extraction quality (e.g., minimum text length, expected keywords).
- Consider alternatives for specific use cases - While PDFMiner is powerful, sometimes specialized tools (like tabula-py for tables) are more appropriate.
- Keep PDFMiner updated - The .six fork is actively maintained. Keep it updated for bug fixes and improvements.
- Document your code properly - When sharing PDF extraction scripts, use proper Markdown code blocks with syntax highlighting for better readability.
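The validation advice above can be sketched as a small checker (validate_extraction is a hypothetical helper; the thresholds are illustrative, not recommended values):

```python
def validate_extraction(text, min_chars=100, required_keywords=()):
    """Return a list of detected problems (an empty list means the
    extraction looks plausible). Hypothetical helper; tune the
    thresholds to your own documents."""
    problems = []
    stripped = text.strip()
    if len(stripped) < min_chars:
        problems.append(f'too short: {len(stripped)} chars')
    for kw in required_keywords:
        if kw.lower() not in text.lower():
            problems.append(f'missing keyword: {kw}')
    # Many U+FFFD replacement characters suggest encoding trouble
    if text and text.count('\ufffd') / len(text) > 0.01:
        problems.append('too many replacement characters')
    return problems
```

Running such a check after every extraction lets a batch pipeline flag suspicious files for manual review instead of silently storing garbage.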
Conclusion
PDFMiner.six is an essential tool for Python developers working with PDF documents. Its pure-Python implementation, detailed layout analysis, and extensible architecture make it ideal for production document processing systems. While it may have a steeper learning curve than simpler libraries, the precision and control it offers are unmatched for complex PDF extraction tasks.
Whether you’re building a document management system, analyzing scientific papers, or extracting data for machine learning pipelines, PDFMiner provides the foundation for reliable PDF text extraction in Python.
Related Resources
Related Articles on This Site
- Pdf manipulating tools in Ubuntu - Poppler - Comprehensive guide to command-line PDF tools including pdftotext, pdfimages, and other poppler utilities that work alongside PDFMiner in document processing workflows
- How to extract images from PDF - Cheatsheet - Learn how to extract embedded images from PDFs using poppler command-line tools, complementing PDFMiner’s text extraction capabilities
- Generating PDF in Python - Libraries and examples - Explore Python libraries for PDF generation including ReportLab, PyPDF2, and FPDF to create the reverse workflow of PDF text extraction
- Python Cheatsheet - Essential Python syntax reference including file handling, string operations, and best practices for writing clean PDF processing scripts
- Converting HTML to Markdown with Python: A Comprehensive Guide - When building document conversion pipelines, learn how to convert HTML (extracted from PDFs or web) to Markdown format using Python libraries
- Convert HTML content to Markdown using LLM and Ollama - Advanced technique using local LLMs to intelligently convert HTML content to Markdown, useful for cleaning up extracted PDF text
- Using Markdown Code Blocks - Master Markdown syntax for documenting your PDF extraction code with proper formatting and syntax highlighting
- Converting Word Documents to Markdown: A Complete Guide - Complete document conversion guide including Word, PDF, and other formats for cross-platform document processing pipelines