How to extract the markdown text from HTML using LLM Ollama?

To convert HTML to Markdown text with an LLM running in Ollama, you can use the ReaderLM-v2 model.

Convert HTML content to Markdown using LLM and Ollama


The Ollama model library includes models that can convert HTML content to Markdown, which is useful for content conversion tasks.

For example, the reader-lm model, which is based on Qwen2, is trained to do this.
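Before trying such a model it has to be present locally. Here is a minimal sketch that pulls a model only when `ollama list` does not show it yet. The `ensure_model` function and the `OLLAMA_CMD` variable are hypothetical names I made up for illustration; only `ollama list` and `ollama pull` are real Ollama commands.

```shell
#!/bin/bash
# Hypothetical helper: pull a model only if `ollama list` does not show it yet.
# OLLAMA_CMD is an assumption that lets you swap the binary (e.g. for testing).
OLLAMA_CMD="${OLLAMA_CMD:-ollama}"

ensure_model() {
    local model="$1"
    # `ollama list` prints a header line, then one model per line with the
    # name (usually name:tag) in the first column.
    if "$OLLAMA_CMD" list | awk -v m="$model" '
            NR > 1 { split($1, parts, ":"); if ($1 == m || parts[1] == m) found = 1 }
            END { exit !found }'; then
        echo "$model is already installed"
    else
        "$OLLAMA_CMD" pull "$model"
    fi
}
```

Usage would be something like `ensure_model reader-lm` or `ensure_model milkey/reader-lm-v2:latest`.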


ReaderLM-v2

I tried the next version of this model, reader-lm-v2. ReaderLM-v2 is built on Qwen2.5-1.5B-Instruct. I can confirm that it works, but the conversion is somewhat slow…

Imagine a 500KB HTML page that you need to extract text from. It might contain 100,000 tokens, or, to be generous, let it be only 10k tokens.
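To get a feeling for these numbers, here is a rough sketch that estimates the token count of an HTML file from its size. It assumes about 4 bytes per token, which is only a common rule of thumb for English text; HTML markup may tokenize quite differently, so treat the result as an order of magnitude. The function names and the `BYTES_PER_TOKEN` variable are made up for this example.

```shell
#!/bin/bash
# ASSUMPTION: ~4 bytes per token. This is a rough heuristic, not a property
# of the ReaderLM tokenizer. Override BYTES_PER_TOKEN to try other ratios.
BYTES_PER_TOKEN="${BYTES_PER_TOKEN:-4}"

# Estimate tokens from a byte count (integer division)
estimate_tokens() {
    local bytes="$1"
    echo $(( bytes / BYTES_PER_TOKEN ))
}

# Estimate tokens for a file on disk
estimate_file_tokens() {
    local file="$1"
    local bytes
    bytes=$(wc -c < "$file")
    estimate_tokens "$bytes"
}
```

Under this assumption, `estimate_tokens 500000` gives 125000, so a 500KB page lands in the "100000 tokens" range rather than the 10k one.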

I took a sample page of 121KB, and the conversion time on my PC was ~1sec.

Calling Ollama from the command line

#!/bin/bash

MODEL="milkey/reader-lm-v2:latest"
INPUT_FILE="prompt.html"
OUTPUT_FILE="response.md"

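# Build the prompt: instruction text followed by the file content.
# printf turns \n into real newlines (inside double quotes bash keeps \n literal).
PROMPT="$(printf 'Extract the main content from the given HTML and convert it to Markdown format.\nhtml:\n%s' "$(cat "$INPUT_FILE")")"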

# Call Ollama and save the response
ollama run "$MODEL" "$PROMPT" > "$OUTPUT_FILE"

echo "Ollama response saved to $OUTPUT_FILE"
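One caveat with passing the whole page as an argument: on Linux a single command-line argument is typically limited to 128 KiB, so a 500KB page would not fit. A possible workaround is to send the prompt on stdin instead. The sketch below assumes that `ollama run` reads the prompt from stdin when input is piped, which is worth verifying on your Ollama version. `build_prompt`, `convert_html` and `OLLAMA_CMD` are hypothetical names used for illustration.

```shell
#!/bin/bash
# Sketch: feed the prompt to `ollama run` on stdin instead of as an argument.
# ASSUMPTION: `ollama run MODEL` uses piped stdin as the prompt.
MODEL="${MODEL:-milkey/reader-lm-v2:latest}"
OLLAMA_CMD="${OLLAMA_CMD:-ollama}"   # hypothetical override, e.g. for testing

# Print the instruction followed by the HTML file content
build_prompt() {
    local input_file="$1"
    printf 'Extract the main content from the given HTML and convert it to Markdown format.\nhtml:\n'
    cat "$input_file"
}

# Convert one HTML file to Markdown: convert_html INPUT OUTPUT
convert_html() {
    local input_file="$1" output_file="$2"
    if [ ! -r "$input_file" ]; then
        echo "Cannot read $input_file" >&2
        return 1
    fi
    build_prompt "$input_file" | "$OLLAMA_CMD" run "$MODEL" > "$output_file"
}
```

With these functions loaded, `convert_html prompt.html response.md` does the same job as the script above, and a loop over `*.html` files can call it once per page.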


Useful links