Extracting FTL Zone-Based Pricing from Carrier PDFs

This page resolves the three failures that break zone-matrix extractors in production: silently dropped origin-destination rows, worker OOM on multi-hundred-page tariff PDFs, and undetected zone-rate drift that overwrites an active tariff with a stale one.

The Failure You Are Hitting

You built a parser against one carrier’s zone tariff sample, wired it into the FTL Base Rate Extraction stage, and it works on the sample. In production it degrades in one of three observable ways:

page.extract_tables() returns a zone grid with rows missing or shifted one column, so a rate for zone 7 lands under zone 6 and the lane key resolves to the wrong tariff cell — wrong-but-plausible output, no exception.
A 400-page consolidated rate sheet loads into a single monolithic frame, the worker’s RSS blows past its limit, and it is OOMKilled mid-ingest, leaving the dataset half-written with no version stamp.
Extraction succeeds, the dataset validates, and weeks later a reconciliation shows a lane billed at the prior quarter’s zone rate — the carrier moved a zone boundary mid-cycle without incrementing the contract version, and nothing flagged the change.

Unlike a structured EDI 210/810 transaction set, a tariff PDF carries no semantic markup. The extractor reconstructs the zone matrix from the visual position of glyphs, so any layout the carrier ships that differs from your sample produces a corrupt rate table instead of an error.

Failure Definition

Zone-based FTL pricing maps origin-destination pairs to predefined tariff zones, then applies a base-rate matrix that scales by weight tier, distance band, or equipment type. The extractor’s job is to turn that printed matrix into a version-controlled, lookup-ready dataset for the rest of Freight Contract Architecture & Rate Mapping to read. The failure is any path where the emitted matrix diverges from the printed one without raising: truncated tables, misaligned zone headers, merged cells that collapse into a single row, or rasterized grids that bypass native text extraction entirely.
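To make "lookup-ready" concrete, the following is a minimal sketch of the shape a downstream reader might consume. The ZoneMatrix class, its field names, the contract id, and the rates are all hypothetical illustrations, not values from any carrier tariff, and weight tiers and equipment types are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ZoneMatrix:
    """Hypothetical lookup-ready form of one printed zone matrix."""
    contract_id: str
    # (origin_zone, destination_zone) -> base rate; illustrative values only
    rates: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def base_rate(self, origin_zone: str, dest_zone: str) -> Optional[float]:
        """Return the printed base rate, or None when the cell is absent.

        Returning None instead of a default keeps a dropped row visible
        to the caller rather than pricing against a made-up value.
        """
        return self.rates.get((origin_zone, dest_zone))


matrix = ZoneMatrix(
    contract_id="EXAMPLE-FTL",
    rates={("6", "7"): 1850.00, ("7", "7"): 1420.00},
)
```

A missing cell returning None is the point of the sketch: a silently dropped origin-destination row should surface as an absent rate, not as a plausible number.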

Root Cause Analysis

These failures are rarely defects in pdfplumber, camelot, or PyMuPDF themselves. They trace to four production conditions a one-carrier sample never exercises:

Repeating-header bleed. Zone matrices span dozens of pages, reprinting the ZONE header band and footer artifacts on every page. Concatenate page tables naively and those header rows become phantom data rows, inflating the matrix and corrupting the deterministic key.
Monolithic frame construction. Loading a multi-hundred-page rate sheet into one DataFrame holds the whole document plus every extracted fragment in memory at once, which triggers garbage-collection thrashing and the OOM kill.
Single-strategy extraction. extract_tables() defaults to a ruled-line strategy; a borderless or stream-formatted zone grid collapses into one column, and with no fallback the page yields nothing.
No drift baseline. Carriers adjust zone boundaries, rename lanes, or apply mid-cycle surcharges without touching the contract version. With no content hash to compare against, a re-ingest silently overwrites the active tariff — the same blind spot that threshold tuning and alerting exists to close on the validation side.

Reproducible Diagnostic

Before changing extraction code, confirm which failure you have. Enable structured logging at the extraction boundary so every page emits the three signals that distinguish header bleed from a borderless grid from an image-only scan:

import logging
import json
import hashlib
from typing import Generator, Dict, Any, List

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "line": record.lineno,
            **(getattr(record, "extra_data", {}))
        }
        return json.dumps(log_entry)

logger = logging.getLogger("ftl_rate_extractor")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

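def compute_matrix_hash(matrix: List[List[Any]]) -> str:
    """Deterministic SHA-256 hash of the zone matrix for drift detection."""
    # Separate rows with "\n" and cells with "|" so row boundaries affect the hash;
    # render None as "" so frames with missing cells still serialize.
    rows = ("|".join("" if cell is None else str(cell) for cell in row) for row in matrix)
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()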

def log_extraction_event(pdf_path: str, page_num: int, table_idx: int,
                         status: str, details: Dict[str, Any]) -> None:
    """Structured logging for every extraction-boundary event."""
    logger.info(
        f"Extraction event: {status}",
        extra={
            "extra_data": {
                "pdf_path": pdf_path,
                "page": page_num,
                "table_index": table_idx,
                "status": status,
                **details
            }
        }
    )

Read the emitted stream like a decision tree: a SCHEMA_MISMATCH on the column count points at merged cells or a shifted grid; a flood of SKIPPED events with reason insufficient_rows points at header/footer fragments; and zero SUCCESS events across a many-page document point at a borderless table the default strategy cannot see, or at an image-only scan with no text layer. The events themselves are emitted by the extractor in Step 1.
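The decision tree above can be sketched as a small classifier over one run's parsed log lines. The function name diagnose, the returned labels, and the "flood" threshold (more short-table skips than successes) are assumptions made for illustration; only the status and reason values come from the logging code on this page.

```python
from collections import Counter
from typing import Dict, Iterable, List


def diagnose(events: Iterable[Dict[str, object]], page_count: int) -> str:
    """Map one run's extraction events to the most likely failure mode."""
    run: List[Dict[str, object]] = list(events)
    counts = Counter(str(e.get("status")) for e in run)
    short_skips = sum(
        1 for e in run
        if e.get("status") == "SKIPPED" and e.get("reason") == "insufficient_rows"
    )

    # Pages exist but nothing was extracted: no table was detected at all.
    if page_count > 0 and counts["SUCCESS"] == 0:
        return "borderless_or_image_only"
    if counts["SCHEMA_MISMATCH"] > 0:
        return "merged_cells_or_shifted_grid"
    # Assumption: "a flood" means more short-table skips than successful tables.
    if short_skips > counts["SUCCESS"]:
        return "header_footer_fragments"
    return "healthy"
```

Feed it the JSON lines of a single PDF run (for example, each line passed through json.loads); the page count is needed because a borderless grid may produce no events at all.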

Resolution Path

The fix is a four-step extractor: stream pages and filter header bleed to bound memory, cascade through parsers under a CI gate, hash the result against a drift baseline, and write a versioned dataset. Pin dependencies first so CI and production agree exactly:

# requirements.txt
pdfplumber==0.11.4
camelot-py==0.11.0
polars==1.12.0
pyarrow==17.0.0

Step 1 — Stream pages and filter header bleed

Monolithic frame construction is the primary OOM cause. A generator yields one validated zone table at a time as a small eager Polars DataFrame instead of building one frame for the whole document up front. The schema gate rejects column-count drift, and the ZONE filter drops header rows that bleed across page breaks:

import pdfplumber
import polars as pl

def stream_zone_tables(pdf_path: str, expected_cols: int = 8) -> Generator[pl.DataFrame, None, None]:
    """Memory-safe generator yielding validated zone tables page-by-page."""
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()
            for t_idx, table in enumerate(tables):
                if not table or len(table) == 0:
                    continue

                # Drop header/footer fragments by row-count threshold
                if len(table) < 3:
                    log_extraction_event(pdf_path, page_num, t_idx, "SKIPPED",
                                         {"reason": "insufficient_rows"})
                    continue

                # Schema gate: reject any page whose column count drifts
                col_count = len(table[0])
                if col_count != expected_cols:
                    log_extraction_event(pdf_path, page_num, t_idx, "SCHEMA_MISMATCH",
                                         {"expected": expected_cols, "actual": col_count})
                    continue

                # Normalize cells, then assign names from the header row
                cleaned = [[str(cell).strip() if cell else "" for cell in row] for row in table]
                headers = [f"col_{i}" if not cleaned[0][i] else cleaned[0][i]
                           for i in range(len(cleaned[0]))]
                df = pl.DataFrame(
                    {h: [row[i] for row in cleaned[1:]] for i, h in enumerate(headers)}
                )

                # Drop reprinted header rows that bleed across page breaks
                df = df.filter(~pl.col(headers[0]).str.starts_with("ZONE"))

                if df.height > 0:
                    log_extraction_event(pdf_path, page_num, t_idx, "SUCCESS",
                                         {"rows": df.height, "cols": df.width})
                    yield df

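The generator never holds more than one page table of its own, so the extracted data stays small whether the tariff has 1 page or 1,000. pdfplumber does cache each page's parsed objects, so on very large tariffs also release every page once its tables are yielded (page.close() in recent pdfplumber releases); otherwise the cache grows with the document. The same streaming discipline keeps async batch processing workers inside their per-task memory budget. The filter selects the column by name rather than by dtype, which also sidesteps the rename of pl.Utf8 to pl.String in the Polars 0.20 series.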

Step 2 — Cascade through parsers under a CI gate

No single strategy survives layout variety. Try native page-streaming extraction first, then fall through to camelot's lattice flavor for ruled grids the native pass misreads, and gate the result on the share of rows whose first column is a valid numeric zone code. A borderless grid needs camelot's stream flavor instead; add it to the parsers list the same way for carriers that ship one. This is the same coordinate-and-fallback approach proven in parsing carrier PDF invoices with pdfplumber, applied to a tariff rather than an invoice:

import camelot

def extract_with_fallback(pdf_path: str, expected_cols: int = 8) -> pl.DataFrame:
    """Multi-parser fallback routing with a strict CI gate."""
    parsers = [
        ("pdfplumber_native", lambda p: list(stream_zone_tables(p, expected_cols))),
        ("camelot_lattice", lambda p: [
            pl.from_pandas(t.df) for t in camelot.read_pdf(p, flavor="lattice", pages="all")
        ]),
    ]

    for parser_name, extraction_fn in parsers:
        try:
            logger.info(f"Attempting parser: {parser_name}")
            frames = extraction_fn(pdf_path)
            if not frames:
                continue

            combined = pl.concat(frames, how="diagonal")
            if combined.height == 0:
                continue

            # CI gate: reject if >15% of rows fail the zone-code shape check
            first_col = combined.columns[0]
            valid_count = combined.filter(
                pl.col(first_col).str.contains(r"^\d{3,5}$")
            ).height
            valid_ratio = valid_count / combined.height

            if valid_ratio < 0.85:
                log_extraction_event(pdf_path, 0, 0, "CI_BLOCKED",
                                     {"parser": parser_name, "valid_ratio": valid_ratio})
                raise ValueError(f"{parser_name} failed CI gate: {valid_ratio:.2%} valid rows")

            log_extraction_event(pdf_path, 0, 0, "PARSER_SELECTED",
                                 {"parser": parser_name, "total_rows": combined.height})
            return combined

        except Exception as e:
            logger.warning(f"Parser {parser_name} failed: {e}")
            continue

    raise RuntimeError("All parsers failed. Route to OCR fallback or manual review queue.")

A page with no embedded text layer will fail both parsers and reach the final RuntimeError — that is correct behaviour. An image-only tariff scan must go to OCR or a review queue, not silently emit an empty matrix.
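One way to route image-only pages before any parser runs is to check for a text layer directly. The sketch below assumes a page object exposing chars and images lists, as pdfplumber pages do; the helper name needs_ocr and the min_chars threshold are hypothetical, and stand-in objects are used so the example runs without a PDF.

```python
from types import SimpleNamespace
from typing import Any


def needs_ocr(page: Any, min_chars: int = 1) -> bool:
    """True when a page carries images but (almost) no extractable characters."""
    chars = getattr(page, "chars", [])
    images = getattr(page, "images", [])
    return len(chars) < min_chars and len(images) > 0


# Stand-ins for real pages: a scan is one big image with no character objects.
scanned = SimpleNamespace(chars=[], images=[{"name": "Im0"}])
native = SimpleNamespace(chars=[{"text": "Z"}], images=[])
blank = SimpleNamespace(chars=[], images=[])
```

Checking up front turns the final RuntimeError into an explicit routing decision, and a genuinely blank page is not mistaken for a scan.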

Step 3 — Hash against a drift baseline

Silent drift is the failure that survives a clean extraction. Compute a deterministic hash of the extracted matrix and compare it to the last approved baseline for that contract. A mismatch halts automated activation and routes the tariff to staging for analyst review:

def detect_rate_drift(current_df: pl.DataFrame, baseline_hash: str, contract_id: str) -> Dict[str, Any]:
    """Compare the current extraction against the baseline to flag silent drift."""
    matrix_as_list = current_df.to_numpy().tolist()
    current_hash = compute_matrix_hash(matrix_as_list)
    drift_detected = current_hash != baseline_hash

    audit_record = {
        "contract_id": contract_id,
        "current_hash": current_hash,
        "baseline_hash": baseline_hash,
        "drift_detected": drift_detected,
        "action": "REVIEW_REQUIRED" if drift_detected else "APPROVED"
    }

    log_extraction_event("audit_trail", 0, 0, "DRIFT_CHECK", audit_record)
    return audit_record

A REVIEW_REQUIRED verdict does not mean the carrier erred — it means the printed matrix no longer matches the approved snapshot, which is exactly the signal an analyst needs before the new tariff can supersede the active one.
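A hash only says that something changed; an analyst also needs to see what. The following is a minimal sketch of a cell-level diff, assuming the approved snapshot itself (not only its hash) can be loaded and that the first cell of each row is a unique zone key. The helper name diff_matrices and the record fields are hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List

Matrix = List[List[str]]


def diff_matrices(baseline: Matrix, current: Matrix) -> List[Dict[str, object]]:
    """List row- and cell-level differences between two zone matrices."""
    # Assumption: the first cell of each row is a unique zone key.
    old = {row[0]: row[1:] for row in baseline if row}
    new = {row[0]: row[1:] for row in current if row}

    changes: List[Dict[str, object]] = []
    for zone in sorted(set(old) | set(new)):
        if zone not in new:
            changes.append({"zone": zone, "change": "removed"})
        elif zone not in old:
            changes.append({"zone": zone, "change": "added"})
        else:
            # zip_longest pads the shorter row so a dropped cell shows as a change
            pairs = zip_longest(old[zone], new[zone], fillvalue="")
            for col, (before, after) in enumerate(pairs, start=1):
                if before != after:
                    changes.append({"zone": zone, "change": "modified",
                                    "column": col, "old": before, "new": after})
    return changes
```

Attaching this list to the REVIEW_REQUIRED record lets the reviewer go straight to the moved boundary or changed rate instead of comparing two PDFs by eye.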

Step 4 — Write a versioned dataset

Only after the CI gate passes and the drift check returns APPROVED (or an analyst approves a flagged change) does the extractor write the matrix as a versioned Parquet partition keyed by contract_id and current_hash. Downstream stages — lane matching and rate validation — read that immutable snapshot, never the live PDF.
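The write itself might look like the sketch below, which builds the partition path from contract_id and the matrix hash and publishes the file with an atomic rename so a killed worker cannot leave a half-written snapshot. The function name, directory layout, and file name zones.parquet are assumptions; the actual serialization is passed in as write_fn so the example stays library-agnostic.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_versioned_snapshot(root: Path, contract_id: str, matrix_hash: str,
                             write_fn: Callable[[Path], None]) -> Path:
    """Write one immutable snapshot under contract_id=/matrix_hash= partitions."""
    part_dir = Path(root) / f"contract_id={contract_id}" / f"matrix_hash={matrix_hash}"
    target = part_dir / "zones.parquet"
    if target.exists():
        return target  # same content hash already stored; never rewrite it

    part_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(Path(tmp_name))
        os.replace(tmp_name, target)  # atomic rename within one filesystem
    except BaseException:
        # Leave no partial file behind if the write or rename fails
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

With Polars, write_fn could be lambda p: df.write_parquet(p). Because the path includes the content hash, re-ingesting an unchanged tariff is a no-op and a changed one lands beside the old snapshot instead of replacing it.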

Verification

Confirm each failure is closed rather than hidden. These assertions belong in the integration suite that runs on every new carrier template:

import tracemalloc
from pathlib import Path

def test_memory_flat_on_large_tariff():
    tracemalloc.start()
    for _ in stream_zone_tables("fixtures/400page_zone_tariff.pdf"):
        pass
    _, peak = tracemalloc.get_traced_memory()
    assert peak < 256 * 1024 * 1024, "page streaming leaked: peak >256 MB"

def test_header_bleed_filtered():
    frame = extract_with_fallback("fixtures/repeating_header_tariff.pdf")
    first_col = frame.columns[0]
    assert frame.filter(pl.col(first_col).str.starts_with("ZONE")).height == 0

def test_drift_flags_changed_matrix():
    frame = extract_with_fallback("fixtures/zone_tariff_q2.pdf")
    result = detect_rate_drift(frame, baseline_hash="deadbeef", contract_id="ACME-FTL-2026")
    assert result["action"] == "REVIEW_REQUIRED"

In production the proof is in the logs: a healthy run emits one SUCCESS per zone page, and CI_BLOCKED only when a layout genuinely breaks the gate. Treat every SCHEMA_MISMATCH as a page the native parser skipped: its rows are missing from the output unless the whole document falls through to the camelot pass, so investigate it rather than ignore it. A DRIFT_CHECK event with drift_detected set usually means a carrier moved a zone boundary or changed a rate; investigate the tariff, do not lower the valid-ratio threshold.

Preventive Configuration

Encode the fix as configuration, not tribal knowledge:

Per-carrier expected_cols and table strategy. Keep a SCAC -> {expected_cols, flavor} map so each carrier resolves to the column count and parser proven against its layout, instead of one global default that silently rots.
Baseline hash in the versioned tariff store. Persist current_hash on every approved ingest so the next extraction has a baseline to diff against — the same store that calculating dynamic fuel surcharges reads its tier snapshots from.
CI valid-ratio gate. Run extract_with_fallback() against a one-page fixture in CI on every PR that adds or edits a carrier template, so a malformed grid fails the build, not the night batch.
Drift alerts. Export the ftl_rate_extractor JSON stream to your observability stack and alert on any CI_BLOCKED or drift_detected event above baseline — both indicate a layout or boundary shift the extractor cannot yet absorb.
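The per-carrier map in the first item can be as small as a frozen dataclass and a dictionary. This is a sketch under stated assumptions: the SCAC codes and column counts are invented placeholders, and CarrierLayout and layout_for are hypothetical names; the flavor values mirror camelot's lattice and stream flavors.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CarrierLayout:
    expected_cols: int
    flavor: str  # camelot flavor proven against this carrier: "lattice" or "stream"


# Hypothetical SCACs and values, for illustration only
CARRIER_LAYOUTS: Dict[str, CarrierLayout] = {
    "ABCD": CarrierLayout(expected_cols=8, flavor="lattice"),
    "WXYZ": CarrierLayout(expected_cols=10, flavor="stream"),
}


def layout_for(scac: str) -> CarrierLayout:
    """Resolve a carrier's layout, failing loudly for unregistered carriers."""
    key = scac.strip().upper()
    try:
        return CARRIER_LAYOUTS[key]
    except KeyError:
        raise LookupError(f"No extraction layout registered for SCAC {key!r}") from None
```

Raising for an unknown carrier, rather than falling back to a global default, forces a new template through the CI fixture before its first production ingest.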

FAQ

Why do reprinted ZONE headers show up as data rows in my matrix?

Zone tariffs reprint the header band on every page, and a naive page-table concatenation treats each reprint as a data row. Filter rows whose first column starts with the header token (df.filter(~pl.col(headers[0]).str.starts_with("ZONE")), Step 1) before yielding, so only genuine zone rows survive into the combined frame.

How do I stop the worker from being OOMKilled on a 400-page rate sheet?

Do not build one DataFrame for the whole document from raw page output. Stream pages with the generator in Step 1, which parses one page table at a time and yields only validated frames for the final concat. Peak memory then tracks a single page's parse state plus the cleaned rows, not the whole document, so a 400-page tariff stays close to the footprint of a short one.

The carrier changed a zone rate but kept the same contract version — how do I catch it?

Version numbers are not trustworthy drift signals. Hash the extracted matrix with compute_matrix_hash() and compare it to the stored baseline (Step 3). Any change in the extracted cell text produces a different hash, flips the verdict to REVIEW_REQUIRED, and halts automated activation regardless of what the version field says.

Should the extractor correct a zone rate it thinks is wrong?

Never. This stage converts a tariff PDF into a typed, versioned dataset and flags matrices that diverge from the approved baseline. Deciding whether a changed rate is valid belongs to an analyst, and pricing a shipment against it belongs to downstream rate validation — not the extractor.

FTL Base Rate Extraction — the parent stage this extractor feeds.
Parsing Carrier PDF Invoices with pdfplumber Step-by-Step — the same coordinate-and-fallback technique applied to invoices.
How to Map LTL Class Rates to JSON Schemas — the class-based sibling with a different matrix shape but identical drift concerns.
Calculating Dynamic Fuel Surcharges with Python Formulas — reads the versioned snapshot this extractor produces.
Threshold Tuning & Alerting — where the validation-side guards that complement drift detection are configured.
