Parsing Carrier PDF Invoices with pdfplumber Step-by-Step

This page resolves the three failures that break pdfplumber parsers in freight audit production: extract_tables() returning fragmented or empty rows, memory exhaustion on large carrier PDFs, and silent rate drift that slips past validation.

The Failure You Are Hitting

You wrote a parser that works on the carrier sample PDF and shipped it into the PDF Invoice Parsing with Python stage. In production it degrades in one of three observable ways:

page.extract_tables() returns [] (or page.extract_table() returns None), or rows come back with cells shifted one column to the right, so line-item descriptions land in the rate field and the record is unusable.
A 500-page consolidated invoice or a 300-DPI scanned remittance drives container RSS past its limit and the worker is OOMKilled mid-batch, leaving the ingestion event half-processed.
Extraction succeeds, the payload validates, and the audit later shows a fuel surcharge that does not match the contract: the parser read 1,284.50 as 128450 because the decimal point was stripped along with the thousands separator, and nothing flagged it.

Unlike a structured EDI 210/810 transaction set, a PDF carries no semantic markup. pdfplumber reconstructs tables from the visual position of glyphs, so any layout the carrier ships that differs from your sample silently produces wrong-but-plausible output instead of an exception.

Root Cause Analysis

These failures are rarely defects in pdfplumber itself. They trace to four production conditions that a one-carrier sample never exercises:

Unnormalized coordinate space. Carriers email scans at 200 or 300 DPI and rotate pages (page.rotation of 90 or 180). extract_tables() uses the default lines strategy, which depends on ruled grid lines; borderless or rotated tables collapse into one column.
Unbounded page caching. pdfplumber lazily caches the parsed object tree of every page you touch. Iterate a large document while accumulating rows into a list and you hold the whole PDF plus every extracted DataFrame fragment in memory at once.
Lossy numeric extraction. Glyph-to-string reconstruction does not understand locale. $1,284.50, 1.284,50, and a value split across two text spans by a kerning gap all reach your code as raw strings. Decimal() rejects them outright, and the usual workaround of stripping every separator misparses them silently.
Rigid validation with no tolerance band. A parser that either accepts or hard-rejects a row cannot tell a genuine surcharge from an extraction artefact, so teams loosen the gate until real drift passes through unnoticed.
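
To make the lossy-numeric condition concrete, here is a minimal sketch of how those raw strings behave. The helper names `naive_amount` and `strict_amount` are hypothetical and exist only for this illustration; `naive_amount` stands in for the digit-only cleanup that tends to creep into parsers.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

def naive_amount(raw: str) -> Decimal:
    """Hypothetical anti-pattern: keep digits only, which also drops the decimal point."""
    return Decimal(re.sub(r"\D", "", raw))

def strict_amount(raw: str) -> Optional[Decimal]:
    """Return a Decimal, or None when Decimal() refuses the raw string."""
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None

for raw in ("$1,284.50", "1.284,50", "1284.50"):
    print(f"{raw!r}: strict={strict_amount(raw)} naive={naive_amount(raw)}")
```

Only the already-clean string survives `Decimal()` directly. The naive cleanup accepts all three but returns 128450 each time, which is the wrong-but-plausible value described above.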

Reproducible Diagnostic

Before changing any extraction code, confirm which failure you have. This snippet prints the signals that distinguish a rotation/border problem from an empty-text scan:

import pdfplumber

with pdfplumber.open("carrier_sample.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    text = page.extract_text() or ""
    print(f"size={page.width:.0f}x{page.height:.0f} rotation={page.rotation}")
    print(f"tables_found={len(tables)} first_table_rows={len(tables[0]) if tables else 0}")
    print(f"text_chars={len(text)} words={len(page.extract_words())}")

Read the output like a decision tree:

Signal	Likely cause	Where to fix
`rotation` is 90 or 180	page rendered sideways	normalize before extraction (Step 1)
`tables_found=0` but `text_chars` high	borderless table	switch `table_settings` to `text` strategy (Step 3)
`text_chars=0`, `words=0`	image-only scan, no text layer	route to OCR or dead-letter (Step 3)
tables found but cells shifted	merged/spanned columns	tighten `snap`/`join` tolerances (Step 3)

A text_chars=0 page will never yield data from pdfplumber alone — it has no embedded text layer, and no amount of table_settings tuning recovers it.
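
When the diagnostic has to run across a whole folder of samples, the decision tree can be encoded as a small function. This is a sketch under assumptions: `classify_page`, its labels, and the `min_text_chars=200` cut-off for "high" are illustrative choices, not values taken from pdfplumber or from any carrier.

```python
def classify_page(rotation: int, tables_found: int, text_chars: int,
                  words: int, min_text_chars: int = 200) -> str:
    """Map the printed diagnostic signals to a likely-cause label."""
    if text_chars == 0 and words == 0:
        return "image_only_scan"      # no text layer: OCR or dead-letter
    if rotation % 360 != 0:
        return "rotated_page"         # check rotation before tuning anything else
    if tables_found == 0 and text_chars >= min_text_chars:
        return "borderless_table"     # text exists but no ruled lines were found
    if tables_found == 0:
        return "no_table_detected"    # too little text to say more
    return "inspect_cells"            # tables found: look for shifted columns

print(classify_page(rotation=0, tables_found=0, text_chars=1500, words=300))
```

The image-only check comes first on purpose: a page without a text layer cannot be rescued by any of the later branches.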

Resolution Path

The fix is a four-step parser that normalizes coordinates, streams pages to bound memory, extracts with a cascading fallback, and validates against a tolerance band. Pin your dependencies first so CI and production agree exactly:

# requirements.txt
pdfplumber==0.11.4
pandas==2.2.3
pydantic==2.10.6
structlog==24.4.0

Step 1 — Normalize rotation and validate structure

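Reject encrypted or page-less PDFs before they enter the worker, and check each page's rotation so the coordinate space matches the table strategy:

import re
import subprocess
import pdfplumber

def is_parseable(pdf_path: str) -> bool:
    """Fail fast on unreadable, encrypted, or page-less PDFs before ingestion."""
    try:
        result = subprocess.run(
            ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
        )
    except subprocess.CalledProcessError:
        return False  # pdfinfo could not read the file
    if re.search(r"^Encrypted:\s+yes", result.stdout, re.MULTILINE):
        return False  # pdfinfo can succeed on encrypted files, so check explicitly
    pages = re.search(r"^Pages:\s+(\d+)", result.stdout, re.MULTILINE)
    return bool(pages) and int(pages.group(1)) > 0

def upright(page: "pdfplumber.page.Page") -> "pdfplumber.page.Page":
    """Return the page for extraction after checking its /Rotate value.

    pdfplumber exposes page.rotation but, as far as we know, offers no
    rotate() method; it applies /Rotate itself when it builds the page's
    coordinates. A page that still reads sideways (rotated content without
    a /Rotate flag) has to be re-rotated upstream, before it reaches this parser.
    """
    deg = page.rotation % 360
    if deg % 90:
        raise ValueError(f"unsupported page rotation: {deg}")
    return page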

Step 2 — Stream pages to bound memory

Never materialize the full document. A generator yields one page at a time and dereferences it so the garbage collector can reclaim the cached object tree before the next iteration:

import structlog
from pathlib import Path
from typing import Generator, Tuple

logger = structlog.get_logger()

def page_streamer(pdf_path: Path) -> Generator[Tuple[int, "pdfplumber.page.Page"], None, None]:
    """Yield pages one-by-one with explicit teardown to keep RSS flat."""
    with pdfplumber.open(pdf_path) as pdf:
        total = len(pdf.pages)
        for idx, page in enumerate(pdf.pages, start=1):
            logger.info("page_yield", path=str(pdf_path), page_idx=idx,
                        total_pages=total, rotation=page.rotation)
            yield idx, upright(page)
            page.flush_cache()   # drop pdfplumber's per-page object cache

Memory stays flat across a 1-page or 1,000-page document because at most one page tree is resident. Write results incrementally to Parquet rather than appending to a list — the same discipline that keeps async batch processing workers within their per-task memory budget.
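
The incremental-write discipline can be sketched independently of the storage format. In the example below, `write_in_batches` and its `sink` callable are hypothetical names; in a real worker the sink would wrap whatever Parquet writer you use, and the batch size of 500 is an arbitrary starting point rather than a tuned value.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]

def write_in_batches(rows: Iterable[Row], sink: Callable[[List[Row]], None],
                     batch_size: int = 500) -> int:
    """Flush rows to `sink` every `batch_size` rows; return the total written."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Row] = []
    written = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            sink(batch)
            written += len(batch)
            batch = []            # start a new list so the sink may keep the old one
    if batch:                     # flush the final partial batch
        sink(batch)
        written += len(batch)
    return written

flushed_sizes: List[int] = []
total = write_in_batches(({"line": str(i)} for i in range(7)),
                         lambda b: flushed_sizes.append(len(b)), batch_size=3)
print(total, flushed_sizes)
```

Because the rows arrive from a generator and leave in fixed-size batches, the number of rows held in memory is bounded by `batch_size` rather than by the length of the document.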

Step 3 — Extract with a cascading fallback

A single strategy cannot survive layout variety. Try a coordinate-aligned table, fall through to a text-and-regex parse when completeness is too low, then dead-letter the page rather than emit garbage:

import re
from typing import List, Dict, Optional

ROW_COMPLETENESS_MIN = 0.85  # accept a table only if 85% of cells are populated

TABLE_SETTINGS = {            # borderless carrier tables need the text strategy
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
    "join_tolerance": 4,
}

def extract_rows(page: "pdfplumber.page.Page") -> List[Dict[str, str]]:
    """Coordinate table -> regex text -> dead-letter, in that order."""
    tables = page.extract_tables(TABLE_SETTINGS)
    if tables and _completeness(tables[0]) >= ROW_COMPLETENESS_MIN:
        return _to_records(tables[0])

    logger.warning("table_strategy_failed", page_idx=page.page_number)
    text = page.extract_text()
    if text:
        return _regex_line_items(text)

    logger.error("extraction_exhausted", page_idx=page.page_number)
    return []  # caller routes an empty result to the dead-letter queue

def _completeness(table: List[List[Optional[str]]]) -> float:
    if not table:
        return 0.0
    total = sum(len(r) for r in table)
    filled = sum(1 for r in table for c in r if c and c.strip())
    return filled / total if total else 0.0

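def _to_records(table: List[List[Optional[str]]]) -> List[Dict[str, str]]:
    if not table:
        return []
    # Keep empty header cells in place so the remaining headers stay aligned
    # with their columns; zip() also tolerates rows shorter than the header.
    headers = [(h or "").strip().lower().replace(" ", "_") for h in table[0]]
    return [
        {h: (cell or "").strip() for h, cell in zip(headers, row) if h}
        for row in table[1:] if any(row)
    ]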

def _regex_line_items(text: str) -> List[Dict[str, str]]:
    pattern = re.compile(
        r"(?P<description>[A-Z][A-Z\s\-\./]{2,})\s+"
        r"(?P<quantity>\d+)\s+"
        r"(?P<unit_rate>\$?\d[\d,]*\.\d{2})"
    )
    return [m.groupdict() for m in pattern.finditer(text)]
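
Cell completeness alone does not catch the shifted-column case, because a shifted row is still fully populated. One possible extra guard, sketched here as an assumption rather than as part of the parser above, is to check that the rate field actually looks like an amount. `MONEY_RE` and `shifted_fraction` are hypothetical names.

```python
import re
from typing import Dict, List

# Loose shape check only: optional "$", a digit, then digits and separators.
MONEY_RE = re.compile(r"\$?\d[\d.,]*")

def shifted_fraction(records: List[Dict[str, str]],
                     rate_field: str = "unit_rate") -> float:
    """Fraction of records whose rate field does not look like an amount."""
    if not records:
        return 0.0
    bad = sum(
        1 for r in records
        if not MONEY_RE.fullmatch(r.get(rate_field, "").strip())
    )
    return bad / len(records)

sample = [
    {"description": "LINEHAUL", "unit_rate": "$1,284.50"},
    {"description": "", "unit_rate": "FUEL SURCHARGE"},   # description landed in the rate column
]
print(shifted_fraction(sample))
```

A caller could treat a high fraction the same way as low completeness and fall through to the text parse; what counts as "high" is a per-carrier decision.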

Step 4 — Validate against a tolerance band with a circuit breaker

Locale-safe numeric coercion plus a per-contract tolerance band catches the silent-drift case, and a circuit breaker halts the batch when failures accumulate instead of corrupting thousands of records:

from decimal import Decimal, InvalidOperation
from pydantic import BaseModel, field_validator, model_validator, ValidationError
from typing import Iterator

def to_money(raw: str) -> Decimal:
    """Coerce '$1,284.50', '1.284,50', or '1284,50' to a Decimal."""
    s = raw.replace("$", "").replace(" ", "").strip()
    if "," in s and "." in s:                 # last separator is the decimal point
        dec = max(s.rfind(","), s.rfind("."))
        s = s[:dec].replace(",", "").replace(".", "") + "." + s[dec + 1:]
    elif s.count(",") == 1 and len(s) - s.rfind(",") == 3:
        s = s.replace(",", ".")               # lone comma before two digits: decimal comma
    else:
        s = s.replace(",", "")                # remaining commas are thousands separators
    try:
        return Decimal(s)
    except InvalidOperation as exc:
        raise ValueError(f"unparseable amount: {raw!r}") from exc

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_rate: Decimal
    expected_rate: Decimal
    tolerance_pct: Decimal = Decimal("0.025")  # 2.5% default band

    @field_validator("unit_rate", "expected_rate", mode="before")
    @classmethod
    def coerce_money(cls, v):
        # Raw extracted strings go through to_money; numeric input passes through.
        return to_money(v) if isinstance(v, str) else v

    @model_validator(mode="after")
    def within_band(self) -> "LineItem":
        # Runs once every field is validated. A field_validator on unit_rate
        # cannot see expected_rate, because that field is declared after it.
        expected, tol = self.expected_rate, self.tolerance_pct
        if expected and abs(self.unit_rate - expected) / expected > tol:
            raise ValueError(f"rate drift: {self.unit_rate} vs {expected} (>{tol*100}%)")
        return self

class CircuitBreaker:
    def __init__(self, max_failures: int = 5):
        self.failures, self.max_failures, self.open = 0, max_failures, False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.open = True
            logger.critical("circuit_breaker_open", failures=self.failures)

    def reset(self):
        self.failures, self.open = 0, False

def validate(rows: Iterator[dict], breaker: CircuitBreaker) -> Iterator[LineItem]:
    for row in rows:
        try:
            item = LineItem(**row)
            breaker.reset()
            yield item
        except ValidationError as exc:
            logger.warning("validation_failed", error=str(exc), row=row)
            breaker.record_failure()
            if breaker.open:
                raise RuntimeError("batch halted: consecutive validation failures exceeded")

A drift error, which pydantic surfaces as a ValidationError, does not mean the carrier overbilled. It means the number disagrees with the contract band, which is exactly the input that accessorial charge scoring is built to weigh. The parser flags; it never adjudicates.
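
The band arithmetic is easy to check by hand. The sketch below uses illustrative amounts, not real tariff data, and the helper names `drift_fraction` and `outside_band` are hypothetical. It shows a genuine 6% deviation and a decimal-point extraction error both landing outside a 2.5% band, while a 2% deviation passes.

```python
from decimal import Decimal

def drift_fraction(actual: Decimal, expected: Decimal) -> Decimal:
    """Relative deviation of the extracted rate from the contracted rate."""
    if expected == 0:
        raise ValueError("expected rate must be non-zero")
    return abs(actual - expected) / expected

def outside_band(actual: Decimal, expected: Decimal, tolerance: Decimal) -> bool:
    """True when the deviation exceeds the tolerance band."""
    return drift_fraction(actual, expected) > tolerance

band = Decimal("0.025")
print(outside_band(Decimal("106.00"), Decimal("100.00"), band))   # 6% deviation
print(outside_band(Decimal("102.00"), Decimal("100.00"), band))   # 2% deviation
print(drift_fraction(Decimal("128450"), Decimal("1284.50")))      # dropped decimal point
```

The first and third cases are indistinguishable to the validator: both are simply outside the band. That is why the flag carries the raw values forward instead of a verdict.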

Verification

Confirm each failure is actually closed rather than hidden. The assertions below belong in the integration suite that runs on every new carrier template:

def test_memory_flat_on_large_pdf():
    import tracemalloc
    tracemalloc.start()
    try:
        for _ in page_streamer(Path("fixtures/500page_consolidated.pdf")):
            pass
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # stop tracing so later tests are not slowed down
    assert peak < 256 * 1024 * 1024, "page streaming leaked: peak >256 MB"

def test_borderless_table_recovers():
    with pdfplumber.open("fixtures/borderless_carrier.pdf") as pdf:
        rows = extract_rows(upright(pdf.pages[0]))
    assert rows and "unit_rate" in rows[0], "fallback failed to recover rows"

def test_thousands_separator_parses():
    assert to_money("$1,284.50") == Decimal("1284.50")
    assert to_money("1.284,50") == Decimal("1284.50")

In production, the proof is in the logs: a healthy parser emits one page_yield per page, occasional table_strategy_failed warnings that recover via regex, and extraction_exhausted only on genuinely image-only scans. A spike in circuit_breaker_open means a carrier changed their layout — investigate the template, do not raise the tolerance band.
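
The log-based health check can be approximated with a count of event names per batch. This is a rough sketch: `batch_health` is a hypothetical helper, and it assumes you can already pull the event names for one document out of your log store.

```python
from collections import Counter
from typing import Dict, Iterable, Union

def batch_health(events: Iterable[str], total_pages: int) -> Dict[str, Union[bool, int]]:
    """Summarize one document's log events into the signals described above."""
    counts = Counter(events)
    return {
        "all_pages_seen": counts["page_yield"] == total_pages,
        # table strategy failed but text was available for the regex parse
        "fell_back_to_regex": counts["table_strategy_failed"] - counts["extraction_exhausted"],
        "dead_lettered": counts["extraction_exhausted"],
        "breaker_tripped": counts["circuit_breaker_open"] > 0,
    }

events = ["page_yield"] * 3 + ["table_strategy_failed"] * 2 + ["extraction_exhausted"]
print(batch_health(events, total_pages=3))
```

Note that falling back to the regex parse is not proof of recovery: the regex can still return no rows, so pair this summary with a row-count check.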

Preventive Configuration

Stop the regression from returning by encoding these as configuration, not tribal knowledge:

Per-carrier table settings. Keep a SCAC -> table_settings map so each carrier resolves to the strategy and tolerances proven against its layout, instead of one global default that silently rots.
Per-contract tolerance bands. Fuel surcharges typically need ±5% while base linehaul holds to ±1%; source the band from the same versioned tariff store that threshold tuning and alerting reads, so a contract change updates both in lockstep.
CI structure gate. Run is_parseable() and a one-page extraction assertion in CI on every PR that adds or edits a carrier template, so a malformed fixture fails the build, not the night batch.
Observability alerts. Export structlog JSON to your stack and alert on circuit_breaker_open and any extraction_exhausted rate above baseline — both indicate a layout shift the parser cannot yet handle.
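
A per-carrier settings map might look like the sketch below. The SCAC codes `ABCD` and `WXYZ`, the tolerance values, and the `settings_for` helper are placeholders for illustration; only the setting keys themselves are pdfplumber's.

```python
from typing import Any, Dict

# Fallback mirrors pdfplumber's default lines strategy.
DEFAULT_TABLE_SETTINGS: Dict[str, Any] = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}

# Hypothetical carriers: overrides proven against each carrier's layout.
CARRIER_TABLE_SETTINGS: Dict[str, Dict[str, Any]] = {
    "ABCD": {"vertical_strategy": "text", "horizontal_strategy": "text",
             "snap_tolerance": 4, "join_tolerance": 4},
    "WXYZ": {"snap_tolerance": 6},
}

def settings_for(scac: str) -> Dict[str, Any]:
    """Return a fresh settings dict: defaults overlaid with carrier overrides."""
    merged = dict(DEFAULT_TABLE_SETTINGS)
    merged.update(CARRIER_TABLE_SETTINGS.get(scac.strip().upper(), {}))
    return merged

print(settings_for("abcd"))
```

Returning a copy keeps one caller's tweaks from leaking into the shared defaults, and an unknown SCAC falls back to the default rather than raising, so a new carrier degrades into the fallback path instead of crashing the worker.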

The same coordinate-and-fallback discipline carries directly to rate-sheet work such as extracting FTL zone-based pricing from carrier PDFs, where the document is a tariff rather than an invoice but the failure modes are identical.

FAQ

Why does extract_tables() return an empty list when I can clearly see the table?

The default lines strategy needs ruled grid lines. Most carrier invoices use whitespace, not borders, to separate columns, so pdfplumber finds no lines and emits nothing. Pass the text strategy (vertical_strategy="text", horizontal_strategy="text") with snap_tolerance/join_tolerance around 4, as in Step 3, and the borderless table reconstructs.

How do I stop the worker from being OOMKilled on large consolidated invoices?

Do not call pdfplumber.open() and then build a list of every row. Stream pages with a generator that calls page.flush_cache() after each yield (Step 2) and write results incrementally to Parquet. Peak memory then reflects a single page, not the whole document, so a 1,000-page PDF uses roughly the same RSS as a one-page one.

extract_text() returns an empty string — can I still parse the PDF?

No, not with pdfplumber alone. An empty extract_text() with words=0 means the page is a scanned image with no embedded text layer. Route it to an OCR step (or the dead-letter queue) — table-setting tweaks cannot recover characters that were never encoded as text.

Should the parser correct a rate it thinks is wrong?

Never. This stage converts a PDF into a typed payload and flags values that fall outside the contract tolerance band. Deciding whether a flagged charge is a real overbill belongs downstream in rate validation and dispute routing, which work off the contract master data rather than the document.

PDF Invoice Parsing with Python — the parent stage this parser plugs into.
Converting XML Carrier Invoices to pandas DataFrames — the structured-format sibling with the same memory and drift concerns.
Implementing Async Batch Invoice Processing with Celery — how to fan these parsers across workers safely.
Extracting FTL Zone-Based Pricing from Carrier PDFs — the same pdfplumber techniques applied to tariff sheets.
Threshold Tuning & Alerting — where the tolerance bands this parser enforces are configured.
