LTL Rate Sheet Digitization

LTL rate sheet digitization is the deterministic ingestion and normalization stage that turns less-than-truckload carrier tariffs into version-controlled, lookup-ready rate matrices for a freight audit pipeline. Where FTL Base Rate Extraction deals with lane-driven, weight-tiered flat rates, LTL pricing is multidimensional in kind: it intersects NMFC freight classes (50–500), weight breaks, origin/destination zone matrices, and a negotiated discount off a base tariff, so both the parsing logic and the canonical schema differ structurally. This stage parses raw carrier rate sheets, isolates the class/weight rate grid, enforces canonical data types, pivots the records into a queryable matrix, and emits a versioned, hashed dataset. It is the LTL-specific entry into Freight Contract Architecture & Rate Mapping, and it stops precisely at canonical serialization — rate validation, dispute resolution, and invoice matching are downstream responsibilities that read the dataset this stage produces, never part of it.

The audience is Python ETL developers and freight auditors who own the rate store. The boundary matters: every downstream tier treats the digitized LTL matrix as a single source of truth, so a silently dropped weight break or a misread freight class becomes an audit gap that surfaces months later during carrier reconciliation. The scope of this guide is the parse-to-dataset path for a single carrier tariff; it ends at the field contract handed to validation and never prices a shipment itself.

Prerequisites & Input Contract

This stage assumes a working ingestion gateway upstream: carrier rate sheets land via SFTP, AS2, or a portal export, are persisted as raw bytes, and are hashed before this digitizer runs. The digitizer is a pure transform — given one raw tariff artifact plus its source hash, carrier identifier, and effective date, it produces a set of normalized LTL rate records — so it carries no network or storage concerns of its own.

Dependency	Minimum version	Purpose in this stage
`python`	3.10+	Modern typing, `match` statements, structural pattern handling
`pydantic`	2.5+	Strict schema enforcement, type coercion, field validation
`pdfplumber`	0.11+	Layout-aware table detection in carrier PDFs
`pandas`	2.1+	Workbook ingestion, header normalization, matrix pivot
`decimal` (stdlib)	—	Exact monetary arithmetic for base rates and min charges
`hashlib` (stdlib)	—	SHA-256 content hashing for dataset provenance
`pyarrow`	14+	Versioned Parquet partition writes

The config keys this stage reads should come from the pipeline’s settings layer, never hard-coded per carrier:

Config key	Default	Meaning
`ltl.table_strategy`	`lines`	`lines` for ruled grids, `text` for whitespace-delimited rate blocks
`ltl.class_range`	`50,500`	Valid NMFC freight-class bounds; rows outside this range dead-letter
`ltl.weight_unit`	`lb`	Source weight unit for weight-break columns
`ltl.quarantine_dir`	`s3://audit-dlq/ltl/`	Destination for unreadable or schema-failing artifacts
`ltl.partition_keys`	`carrier_id,effective_date`	Parquet partition columns for downstream query pruning
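
As a minimal sketch of how these keys might be read, assuming the settings layer hands the stage a flat mapping of dotted keys to strings: the `LtlSettings` container and `load_ltl_settings` helper are hypothetical names used for illustration, not part of the pipeline described here.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple

# Defaults mirror the config table above.
DEFAULTS = {
    "ltl.table_strategy": "lines",
    "ltl.class_range": "50,500",
    "ltl.weight_unit": "lb",
    "ltl.quarantine_dir": "s3://audit-dlq/ltl/",
    "ltl.partition_keys": "carrier_id,effective_date",
}


@dataclass(frozen=True)
class LtlSettings:  # hypothetical container for the parsed keys
    table_strategy: str
    class_range: Tuple[int, int]
    weight_unit: str
    quarantine_dir: str
    partition_keys: Tuple[str, ...]


def load_ltl_settings(raw: Mapping[str, str]) -> LtlSettings:
    """Overlay supplied keys on the defaults and parse the compound values."""
    merged = {**DEFAULTS, **raw}
    strategy = merged["ltl.table_strategy"]
    if strategy not in ("lines", "text"):
        raise ValueError(f"unknown ltl.table_strategy: {strategy!r}")
    low, high = (int(part) for part in merged["ltl.class_range"].split(","))
    if low > high:
        raise ValueError("ltl.class_range must be written as 'low,high'")
    keys = tuple(
        key.strip() for key in merged["ltl.partition_keys"].split(",") if key.strip()
    )
    return LtlSettings(
        table_strategy=strategy,
        class_range=(low, high),
        weight_unit=merged["ltl.weight_unit"],
        quarantine_dir=merged["ltl.quarantine_dir"],
        partition_keys=keys,
    )
```

Failing fast on an unknown strategy or a reversed class range keeps a misconfigured carrier from reaching the parser at all.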

The input contract is deliberately narrow. The digitizer accepts one of three heterogeneous artifacts and routes anything outside that set to quarantine before normalization:

Inbound format	Typical source	What the digitizer must isolate
PDF rate sheet	Carrier portal / email	The class/weight rate grid, excluding FSC and accessorial footnotes
Legacy XLSX/XLS workbook	Pricing-team handoff	The rate sheet(s) carrying weight-break or freight-class columns
Portal CSV export	TMS or carrier API	Header-mapped columns with explicit class and weight-break declarations

Scanned, image-only PDFs are out of scope for structural detection and must pass through an OCR pre-pass; the coordinate-calibration and OCR-fallback techniques for difficult carrier PDFs are shared with PDF Invoice Parsing with Python, which the ingestion gateway runs before this stage sees the bytes.
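
The routing decision can be sketched as a check of the file suffix against the leading bytes of the artifact. This is an illustration under stated assumptions: `classify_artifact` is a hypothetical helper, the magic-byte constants are the standard PDF, ZIP, and OLE2 signatures, and the CSV test (no NUL bytes in the head) is a rough heuristic that would reject UTF-16 exports.

```python
from pathlib import PurePath

PDF_MAGIC = b"%PDF"
ZIP_MAGIC = b"PK\x03\x04"                           # XLSX is a ZIP container
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # legacy XLS compound file


def classify_artifact(filename: str, head: bytes) -> str:
    """Return 'pdf', 'excel', 'csv', or 'quarantine' for a raw artifact.

    `head` is the first few bytes of the file; suffix and content must agree.
    """
    suffix = PurePath(filename).suffix.lower()
    if suffix == ".pdf" and head.startswith(PDF_MAGIC):
        return "pdf"
    if suffix == ".xlsx" and head.startswith(ZIP_MAGIC):
        return "excel"
    if suffix == ".xls" and head.startswith(OLE2_MAGIC):
        return "excel"
    # Heuristic only: treat a .csv with no NUL bytes in its head as text.
    if suffix == ".csv" and b"\x00" not in head:
        return "csv"
    return "quarantine"
```

Requiring the suffix and the signature to agree catches renamed files, such as a PDF saved with an `.xlsx` extension, before they reach a parser that would fail on them.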

Architecture Detail: Column & Field Mapping

The architecture of this stage is a map from each carrier’s idiosyncratic header layout onto one canonical record. Carriers label the same concept a dozen ways — Class, Freight Class, and NMFC Class all mean the same field, while Wt Break, Weight Break, and Min Wt collapse onto another. The table below is the contract the normalizer enforces, and it doubles as the reference for which fields are mandatory. The deeper work of expressing this contract as a reusable, validatable JSON document — including nested zone matrices and $ref reuse — is covered in How to map LTL class rates to JSON schemas.

Canonical field	Common carrier headers	Data type	Validation rule
`freight_class`	Class, Freight Class, NMFC Class	INT	In `[50, 500]`, `multiple_of` 5
`weight_break`	Wt Break, Weight Break, Min Wt	DECIMAL(10,2)	Non-negative, strictly ascending per lane
`base_rate`	Rate, Base Rate, CWT Rate	DECIMAL(10,2)	Non-negative, currency-precise
`min_charge`	Min, Min Charge, AMC	DECIMAL(10,2)	Non-negative; defaults to `0.00`
`origin_zip`	Origin, O-Zip, From	CHAR(5)	5-digit US ZIP
`dest_zip`	Dest, D-Zip, To	CHAR(5)	5-digit US ZIP
`carrier_id`	— (injected from context)	VARCHAR(10)	2–10 chars, joined to carrier registry
`effective_date`	— (injected from context)	DATE	`YYYY-MM-DD`, pins point-in-time resolution

Two LTL-specific subtleties drive the schema. First, the rate basis is per-hundredweight (CWT): base_rate is the rate per 100 lb, so the matrix stores rates, not totals, and the actual charge is computed downstream against billable weight. Second, the weight_break column is not a value but the floor of a tier — a 500 lb break applies to every shipment from 500 lb up to the next break — which is why the assembly stage pivots weight breaks into columns rather than treating them as discrete rows. Density-based reclassification and FAK (freight-all-kinds) substitutions change which class applies but never the matrix shape, so they are normalized as attributes on the record, not as new rows.
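
To make the CWT basis and the tier-floor semantics concrete, here is a worked sketch of how a downstream consumer might turn one matrix row into a charge. The break and rate values are invented for illustration, `linehaul_charge` is a hypothetical helper, and the half-up rounding is an assumption. Negotiated discounts are left out, and the pricing itself stays outside this stage.

```python
from bisect import bisect_right
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical tier floors (lb) and per-hundredweight rates for one lane and class.
BREAKS = [Decimal("0"), Decimal("500"), Decimal("1000"), Decimal("2000")]
RATES = [Decimal("48.00"), Decimal("42.50"), Decimal("36.75"), Decimal("31.00")]


def linehaul_charge(weight_lb: Decimal, min_charge: Decimal) -> Decimal:
    """Charge = CWT rate of the tier containing the weight x (weight / 100)."""
    # bisect_right returns the position of the first floor above the weight,
    # so the applicable tier is the one just before it.
    tier = bisect_right(BREAKS, weight_lb) - 1
    if tier < 0:
        raise ValueError("weight is below the lowest break")
    charge = RATES[tier] * weight_lb / Decimal("100")
    charge = charge.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # The minimum charge is a floor on the total, not a rate.
    return max(charge, min_charge)


# 750 lb falls in the 500 lb tier: 42.50 x 7.5 = 318.75
print(linehaul_charge(Decimal("750"), Decimal("89.00")))
# 100 lb rates at 48.00 x 1 = 48.00, so the 89.00 minimum applies
print(linehaul_charge(Decimal("100"), Decimal("89.00")))
```

A shipment that weighs exactly 500 lb lands in the 500 lb tier, which is what "floor of a tier" means in practice.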

Step-by-Step Implementation

Step 1 — Structural parsing & table isolation

Carrier rate sheets rarely share a tabular layout, so extraction must prioritize structural detection over text scraping. Separate structural parsing from semantic normalization to prevent cascading failures: this routine isolates I/O and layout parsing only, applies heuristic filters to discard non-rate artifacts (legal disclaimers, cover pages, footers), and routes unreadable files to quarantine. It never attempts semantic interpretation — that belongs to Step 2.

import logging
import pdfplumber
import pandas as pd
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


def extract_ltl_tables(file_path: Path, quarantine_dir: Path) -> List[pd.DataFrame]:
    """Extract candidate rate tables from a carrier PDF or Excel workbook.

    Applies structural heuristics to filter noise and routes unreadable
    files to quarantine rather than raising into the batch driver.
    """
    tables: List[pd.DataFrame] = []
    try:
        suffix = file_path.suffix.lower()

        if suffix == ".pdf":
            with pdfplumber.open(file_path) as pdf:
                for page in pdf.pages:
                    for raw in page.extract_tables():
                        # Heuristic: minimum row depth + a freight-class indicator
                        has_class = any(
                            "class" in str(cell).lower()
                            for row in raw for cell in row if cell
                        )
                        if len(raw) >= 4 and has_class:
                            df = pd.DataFrame(raw[1:], columns=raw[0])
                            df.dropna(how="all", inplace=True)
                            if not df.empty:
                                tables.append(df)

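        elif suffix in (".xlsx", ".xls"):
            # Open the workbook once; the context manager releases the file
            # handle so a later quarantine rename is not blocked.
            with pd.ExcelFile(file_path) as workbook:
                for sheet_name in workbook.sheet_names:
                    df = workbook.parse(sheet_name, header=0)
                    # Cast to str first: numeric or datetime headers would
                    # otherwise break the .str accessor.
                    cols = (
                        df.columns.astype(str).str.lower()
                        .str.replace(r"\s+", "_", regex=True)
                    )
                    if any(k in cols for k in ("weight_break", "freight_class", "class")):
                        tables.append(df)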

        else:
            logger.warning("Unsupported file format: %s", file_path.name)

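    except Exception as exc:
        logger.error("Ingestion failed for %s: %s", file_path.name, exc)
        # Assumes quarantine_dir is a local or mounted path on the same
        # filesystem; an object-store URI such as the s3:// default needs
        # an upload step here instead of a rename.
        quarantine_dir.mkdir(parents=True, exist_ok=True)
        file_path.rename(quarantine_dir / file_path.name)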

    return tables

Common mistake: running text extraction before table detection. Native text order does not respect column boundaries, so two adjacent class columns collapse into one string and the rate for class 100 is attributed to class 85. Detect tables structurally first; fall back to text only for rasterized pages routed through the OCR pre-pass.
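
One way the `ltl.table_strategy` key could feed structural detection is by translating it into pdfplumber's table settings. This is a sketch: `table_settings_for` is a hypothetical helper, and it assumes the two config values map directly onto pdfplumber's `vertical_strategy` and `horizontal_strategy` options.

```python
def table_settings_for(strategy: str) -> dict:
    """Translate ltl.table_strategy into a pdfplumber table_settings dict."""
    if strategy not in ("lines", "text"):
        raise ValueError(f"unsupported table strategy: {strategy!r}")
    # "lines" follows ruled grid lines; "text" infers columns from the
    # alignment of words on whitespace-delimited rate blocks.
    return {"vertical_strategy": strategy, "horizontal_strategy": strategy}


# Usage inside the page loop of Step 1 (requires pdfplumber):
#     settings = table_settings_for("text")
#     for raw in page.extract_tables(table_settings=settings):
#         ...
```

Keeping the translation in one function means a per-carrier override changes a config value, not the parsing code.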

Step 2 — Schema mapping & strict normalization

Raw extracted DataFrames carry carrier-specific terminology, string-formatted numbers ("$12.40", "1,000"), and inconsistent casing. They must conform to a rigid schema before entering the matrix. Pydantic v2 provides runtime validation, type coercion, and explicit per-field error reporting; the normalizer standardizes headers, cleans numeric strings, enforces the NMFC class bounds, and returns only records that validate while logging the rest for the audit trail.

import re
from decimal import Decimal
from pydantic import BaseModel, Field, ValidationError


class LtlRateRecord(BaseModel):
    origin_zip: str = Field(pattern=r"^\d{5}$")
    dest_zip: str = Field(pattern=r"^\d{5}$")
    freight_class: int = Field(ge=50, le=500, multiple_of=5)
    weight_break: Decimal = Field(ge=0)
    base_rate: Decimal = Field(ge=0)               # per-hundredweight (CWT)
    min_charge: Decimal = Field(default=Decimal("0.00"), ge=0)
    carrier_id: str = Field(min_length=2, max_length=10)
    effective_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")


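COLUMN_MAPPING = {
    # Keys are headers after normalization (lowercased, punctuation stripped),
    # so "O-Zip" arrives here as "ozip". Aliases follow the field-mapping table.
    "freight class": "freight_class",
    "nmfc class": "freight_class",
    "class": "freight_class",
    "weight break": "weight_break",
    "wt break": "weight_break",
    "min wt": "weight_break",
    "weight": "weight_break",
    "base rate": "base_rate",
    "cwt rate": "base_rate",
    "rate": "base_rate",
    "min charge": "min_charge",
    "min": "min_charge",
    "amc": "min_charge",
    "origin zip": "origin_zip",
    "origin": "origin_zip",
    "ozip": "origin_zip",
    "from": "origin_zip",
    "dest zip": "dest_zip",
    "dest": "dest_zip",
    "dzip": "dest_zip",
    "to": "dest_zip",
}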


def normalize_dataframe(
    df: pd.DataFrame, carrier_id: str, effective_date: str
) -> List[LtlRateRecord]:
    """Map raw columns to the canonical schema, coerce types, and validate.

    Returns only successfully parsed records; logs failures for audit trails.
    Header normalization returns a NEW frame — never mutates the caller's.
    """
    df = df.copy()
    df.columns = (
        df.columns.str.lower()
        .str.replace(r"[^a-z0-9\s]", "", regex=True)
        .str.strip()
    )
    rename_map = {k: v for k, v in COLUMN_MAPPING.items() if k in df.columns}
    df = df.rename(columns=rename_map)

    valid: List[LtlRateRecord] = []
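    for idx, row in df.iterrows():
        try:
            payload = {"carrier_id": carrier_id, "effective_date": effective_date}
            for field in LtlRateRecord.model_fields:
                if field in payload:
                    continue
                val = row.get(field)
                # Leave absent or blank cells out of the payload so schema
                # defaults (min_charge) apply instead of failing on None/NaN.
                blank = not val.strip() if isinstance(val, str) else pd.isna(val)
                if not blank:
                    payload[field] = val

            # Strip currency symbols and thousands separators before coercion.
            # The minus sign is kept so a negative value fails the ge=0 check
            # rather than being silently turned positive.
            for numeric_field in ("weight_break", "base_rate", "min_charge"):
                val = payload.get(numeric_field)
                if isinstance(val, str):
                    payload[numeric_field] = re.sub(r"[^\d.\-]", "", val)

            valid.append(LtlRateRecord(**payload))

        except ValidationError as ve:
            # Warning level keeps rejected rows visible in the audit trail.
            logger.warning("Row %s failed validation: %s", idx, ve.json())
        except Exception:
            logger.error("Unexpected normalization error at row %s", idx, exc_info=True)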

    return valid

Common mistake: calling df.rename(columns=..., inplace=True). With inplace=True the rename mutates the caller’s DataFrame, so a downstream re-read of the same frame sees already-renamed columns and the mapping silently no-ops. Always assign the result of rename to a copy, as above, and keep the normalizer side-effect-free.
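
A related hazard is two source headers resolving to the same canonical field, which leaves a frame with duplicate column names after the rename. The sketch below shows one way to detect that before renaming. `ALIASES` is a small subset of the mapping for illustration, `build_rename_map` is a hypothetical helper, and collapsing runs of internal whitespace is an addition that the Step 2 normalizer does not perform.

```python
import re
from typing import Dict, Iterable

# Illustrative subset of the header aliases; keys are normalized headers.
ALIASES = {
    "class": "freight_class",
    "nmfc class": "freight_class",
    "wt break": "weight_break",
    "rate": "base_rate",
}


def normalize_header(raw: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace in a header."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", raw.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def build_rename_map(headers: Iterable[str]) -> Dict[str, str]:
    """Map raw headers to canonical fields, refusing ambiguous sheets."""
    rename: Dict[str, str] = {}
    claimed: Dict[str, str] = {}  # canonical field -> raw header that claimed it
    for raw in headers:
        canonical = ALIASES.get(normalize_header(raw))
        if canonical is None:
            continue  # unmapped columns are left alone
        if canonical in claimed:
            raise ValueError(
                f"{raw!r} and {claimed[canonical]!r} both map to {canonical}"
            )
        claimed[canonical] = raw
        rename[raw] = canonical
    return rename
```

Raising on a collision forces a per-carrier decision about which column is authoritative instead of letting the pivot pick one implicitly.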

Step 3 — Canonical class/weight matrix assembly

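Validated records are flat rows; downstream lookup wants a matrix. This stage pivots weight breaks into columns, indexes by origin_zip, dest_zip, and freight_class, sorts the breaks numerically, and forward-fills the gaps carriers leave when they omit an intermediate break. The pivot turns a rate lookup into a single row-and-column selection for a (lane, class, break) key rather than a scan over flat rows. Note that the function below resets the index before returning, so a consumer that wants indexed reads should set the index again on the lane and class columns.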

def assemble_rate_matrix(records: List[LtlRateRecord]) -> pd.DataFrame:
    """Construct a canonical LTL rate matrix from validated records.

    Pivots weight breaks into columns and forward-fills omitted intermediate
    breaks so every (lane, class) row exposes a rate at every known break.
    """
    if not records:
        return pd.DataFrame()

    df = pd.DataFrame([r.model_dump() for r in records])

    matrix = df.pivot_table(
        index=["origin_zip", "dest_zip", "freight_class"],
        columns="weight_break",
        values="base_rate",
        aggfunc="first",
    )

    # Weight breaks are tier floors: sort ascending, then carry rates forward
    # across breaks the carrier did not restate.
    matrix = matrix.reindex(sorted(matrix.columns, key=float), axis=1)
    matrix = matrix.ffill(axis=1)

    matrix = matrix.reset_index()
    matrix.columns.name = None
    return matrix

Common mistake: back-filling instead of forward-filling the breaks. LTL rates step down as weight rises (the heavier the shipment, the lower the CWT rate), and an omitted break inherits the rate of the break below it, not above. Back-filling assigns the cheaper high-weight rate to the lighter, omitted tier and understates the expected charge for every shipment that rates in a filled cell.
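
The size of that error is easy to see in a plain-Python sketch with invented numbers. `fill_breaks` is a hypothetical stand-in for the pandas fill, used here only to compare the two directions on one lane.

```python
from decimal import Decimal
from typing import Dict, List, Optional


def fill_breaks(
    breaks: List[int], stated: Dict[int, Decimal], forward: bool = True
) -> Dict[int, Optional[Decimal]]:
    """Fill omitted breaks by carrying the last stated rate along the walk.

    forward=True walks breaks low to high (the break below wins);
    forward=False walks high to low (the break above wins).
    """
    filled: Dict[int, Optional[Decimal]] = {}
    carried: Optional[Decimal] = None
    for floor in sorted(breaks, reverse=not forward):
        carried = stated.get(floor, carried)
        filled[floor] = carried
    return filled


# The carrier states 500 lb and 2000 lb and omits the 1000 lb break.
stated = {500: Decimal("42.50"), 2000: Decimal("31.00")}
forward = fill_breaks([500, 1000, 2000], stated)
backward = fill_breaks([500, 1000, 2000], stated, forward=False)

# A 1,200 lb shipment rates in the 1000 lb tier at 12 hundredweight.
cwt_units = Decimal("12")
print(forward[1000] * cwt_units)   # 510.00 with the forward fill
print(backward[1000] * cwt_units)  # 372.00 with the back fill
```

In this example the back-filled matrix understates the expected charge by 138.00, so an invoice billed correctly at 510.00 would look like an overcharge.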

Step 4 — Versioned serialization & pipeline handoff

The assembled matrix is serialized to Parquet with an immutable, hashed metadata manifest so the dataset is reproducible and the version is verifiable. The content hash is the spine of point-in-time resolution: a tariff whose body changed but whose effective date did not must produce a different hash, or downstream lookups silently resolve to a superseded rate.

import json
import hashlib
from datetime import datetime, timezone


def serialize_and_handoff(
    matrix: pd.DataFrame, carrier_id: str, output_dir: Path
) -> Path:
    """Serialize the canonical matrix to Parquet with a hashed manifest."""
    if matrix.empty:
        raise ValueError("Cannot serialize an empty rate matrix")

    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    output_path = output_dir / f"{carrier_id}_ltl_rates_{timestamp}.parquet"

    # Parquet expects string column names, so stringify the weight-break
    # labels once, then hash and write that same frame.
    frame = matrix.rename(columns=str)

    # Deterministic content hash for version control and drift detection.
    content_hash = hashlib.sha256(
        frame.to_json(orient="records").encode()
    ).hexdigest()

    manifest = {
        "carrier_id": carrier_id,
        "generated_at": timestamp,
        "record_count": len(matrix),
        "schema_version": "v2.1",
        "content_hash": content_hash,
        "pipeline_stage": "ltl_digitization",
    }

    output_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(output_path, engine="pyarrow", index=False)
    (output_path.parent / f"{output_path.stem}_meta.json").write_text(
        json.dumps(manifest, indent=2)
    )

    logger.info("Rate matrix serialized: %s (hash %s)", output_path.name, content_hash)
    return output_path

Common mistake: hashing the file path or timestamp instead of the matrix body. A path-based version changes on every run even when the rates are identical, so renewals look like edits and the audit history fills with phantom versions. Hash the serialized record body only, and treat an unchanged hash as a no-op write.
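
The no-op rule can be sketched with the standard library alone. The helpers below are hypothetical and make two assumptions: the manifest directory holds one carrier's `*_meta.json` files, whose timestamped names sort chronologically, and the record body is already JSON-serializable (rates as strings, not `Decimal`).

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional


def content_hash(records: List[dict]) -> str:
    """Hash the record body only; key order and whitespace do not matter."""
    body = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def latest_hash(manifest_dir: Path) -> Optional[str]:
    """Return the content_hash of the newest manifest, or None if there is none."""
    manifests = sorted(manifest_dir.glob("*_meta.json"))
    if not manifests:
        return None
    return json.loads(manifests[-1].read_text())["content_hash"]


def should_write(records: List[dict], manifest_dir: Path) -> bool:
    """True when the body differs from the most recent serialized version."""
    return content_hash(records) != latest_hash(manifest_dir)
```

A caller would check `should_write` before serializing and skip both the Parquet file and the manifest when it returns `False`, so a re-delivered but unchanged tariff leaves no new version behind.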

Validation & Testing

Digitization correctness is verified with table-driven tests over fixtures that mirror real carrier quirks, not idealized grids. Build fixtures for: a clean multi-break lane, an out-of-range freight class, a currency-formatted rate string, an omitted intermediate break, and an all-empty noise table.

import pytest
from decimal import Decimal
from pydantic import ValidationError


def _record(**overrides) -> dict:
    base = {
        "origin_zip": "60601",
        "dest_zip": "30301",
        "freight_class": 100,
        "weight_break": Decimal("500"),
        "base_rate": Decimal("42.50"),
        "min_charge": Decimal("89.00"),
        "carrier_id": "RLCA",
        "effective_date": "2026-01-01",
    }
    base.update(overrides)
    return base


def test_valid_record_round_trips():
    rec = LtlRateRecord(**_record())
    assert rec.freight_class == 100
    assert rec.base_rate == Decimal("42.50")


def test_off_grid_class_rejected():
    with pytest.raises(ValidationError):
        LtlRateRecord(**_record(freight_class=92))  # not a multiple of 5


def test_currency_string_is_cleaned():
    df = pd.DataFrame([{"class": "100", "weight break": "500",
                        "rate": "$42.50", "min": "89.00",
                        "origin zip": "60601", "dest zip": "30301"}])
    records = normalize_dataframe(df, carrier_id="RLCA", effective_date="2026-01-01")
    assert len(records) == 1
    assert records[0].base_rate == Decimal("42.50")


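def test_forward_fill_carries_rate_across_omitted_break():
    recs = [
        LtlRateRecord(**_record(weight_break=Decimal("500"), base_rate=Decimal("42.50"))),
        LtlRateRecord(**_record(weight_break=Decimal("2000"), base_rate=Decimal("31.00"))),
        # Class 85 omits the 2000 lb break, so it must inherit the 500 lb rate.
        LtlRateRecord(**_record(freight_class=85, weight_break=Decimal("500"),
                                base_rate=Decimal("38.00"))),
    ]
    matrix = assemble_rate_matrix(recs)
    by_class = matrix.set_index("freight_class")
    # A stated break keeps its own rate; the fill never overwrites it.
    assert float(by_class.loc[100, Decimal("2000")]) == 31.00
    # The omitted break carries the rate of the break below it.
    assert float(by_class.loc[85, Decimal("2000")]) == 38.00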

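The currency-string test pins the cleaning contract: a row that arrives as formatted text still validates after cleaning. The off-grid test shows the model raising ValidationError when it is constructed directly; inside normalize_dataframe the same error is caught and logged, so an off-grid row yields zero records and no exception reaches the batch driver, which is the behaviour the driver depends on. That normalizer-level path deserves its own fixture. Run the suite under pytest -q in CI and gate merges on it.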

Performance & Tuning

Digitization time is dominated by PDF table extraction and per-row schema validation, not by the pivot arithmetic, so tuning targets memory and batch shape rather than the pivot itself.

Batch size. Process one carrier tariff per worker task and validate 200–400 rows per batch as a starting point. Larger batches amortize the Parquet write; smaller batches lower peak memory and shrink the blast radius of a poison file.
Memory footprint. Stream PDF pages through pdfplumber rather than materializing a multi-hundred-page rate sheet into one DataFrame, and use chunksize for CSV exports above 500 MB. Streaming keeps peak memory roughly proportional to one page or chunk; a monolithic load of the same file can trigger garbage-collection thrashing and OOM-kill a CI runner.
Concurrency. The digitizer is stateless per tariff, so it parallelizes cleanly across a worker pool. The heavier fan-out of many tariffs at once is handled out-of-process by Async Batch Processing Workflows rather than threaded inside this function.
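
The batch-shape advice can be sketched with the standard library. `batched_rows` is a hypothetical helper, and the default of 300 rows is simply a point inside the suggested 200–400 range, not a measured optimum.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List


def batched_rows(rows: Iterable[dict], batch_size: int = 300) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size rows without loading the whole source."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage: csv.DictReader is lazy, so only one batch is held in memory at a time.
sample = io.StringIO("class,weight break,rate\n100,500,42.50\n100,1000,36.75\n85,500,38.00\n")
for batch in batched_rows(csv.DictReader(sample), batch_size=2):
    print(len(batch), "rows ready for validation")
```

The same generator works for any row iterator, so the batch size can be tuned per carrier without touching the validation code.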

Failure Modes

Five scenarios account for the overwhelming majority of production incidents in LTL rate-sheet digitization. Each has a stable signature and a known resolution path.

Merged class-header collapse. A carrier merges adjacent class header cells, so extract_tables returns one fused column and every rate is attributed to the wrong class. Diagnostic: the header row is one or more cells shorter than the widest data row. Resolution: switch ltl.table_strategy to text for that carrier and re-derive class columns from positional offsets.
Off-grid freight class. A rate sheet lists class 92.5 or a typo’d class 600, which the multiple_of=5 / [50, 500] constraints reject. Diagnostic: a spike of ValidationError entries scoped to freight_class. Resolution: 77.5 and 92.5 are standard NMFC classes, so if the carrier rates them, widen the validator (for example, a decimal field checked against the published class list) rather than dead-lettering valid rows; dead-letter values that are genuinely off the scale, such as 600. Never round silently.
Weight-break direction inversion. A carrier publishes breaks high-to-low, so a forward-fill applied in source column order carries the higher-weight rate into the lighter, omitted breaks. Diagnostic: the assembled matrix shows CWT rates rising with weight. Resolution: sort breaks ascending before the ffill (as in Step 3) and assert monotonic non-increasing rates per lane in a post-pivot check.
Accessorial and FSC bleed. A fuel or liftgate footnote sits inside the table boundary and is parsed as a rate row, injecting a phantom class. Diagnostic: records whose freight_class is null or whose origin_zip fails the ZIP pattern. Resolution: enforce the class and ZIP patterns at the schema boundary so the bleed dead-letters; route the genuine accessorial codes to Accessorial Charge Taxonomy Mapping and the fuel index to Fuel Surcharge Formula Implementation.
Mid-cycle rate edit without a version bump. A carrier adjusts a CWT rate but reuses the prior effective date, so the upsert overwrites the historical rate and breaks point-in-time resolution. Diagnostic: a matrix whose content_hash changed while effective_date did not. Resolution: reject any write where the hash changed but the effective date did not, and force a new version with a fresh effective date.
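
Two of these resolutions reduce to small guards that can run after the pivot and before the write. Both helpers are hypothetical sketches, and the version guard assumes the manifest is extended with an `effective_date` field, which the Step 4 manifest does not currently carry.

```python
from decimal import Decimal
from typing import Mapping, Optional


def rates_non_increasing(rates_by_break: Mapping[Decimal, Decimal]) -> bool:
    """True when CWT rates never rise as the weight break rises."""
    ordered = [rates_by_break[floor] for floor in sorted(rates_by_break)]
    return all(lower >= higher for lower, higher in zip(ordered, ordered[1:]))


def is_silent_edit(
    prior: Optional[Mapping[str, str]], new: Mapping[str, str]
) -> bool:
    """True when the content changed under an unchanged effective date.

    Both mappings are manifests; `effective_date` is an assumed extra field.
    """
    if prior is None:
        return False  # first version for this carrier, nothing to compare
    return (
        prior["effective_date"] == new["effective_date"]
        and prior["content_hash"] != new["content_hash"]
    )
```

A driver would reject the write when `is_silent_edit` returns `True` and flag the lane when `rates_non_increasing` returns `False`, leaving the decision about a corrected tariff to a person.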

Integration Points

The output of this stage is a versioned Parquet partition plus a hashed manifest — a stable field contract that the validation tier consumes without ever touching a PDF or workbook again. Partition by carrier_id and effective_date for query pruning, and keep the manifest’s content_hash and record_count available for reconciliation.

Downstream, Rule-Based Rate Validation & Accessorial Auditing joins each billed LTL shipment against the class/weight cell in force on the service date, Lane Matching Algorithms confirm the billed origin/destination pair against the contracted lanes this stage emitted, and Weight & Zone Cross-Validation re-checks the billable weight that selects the weight-break column. Because digitization never prices freight itself, the integration boundary is exactly the canonical record: keep it clean, type-safe, and versioned, and every downstream tier stays format-agnostic. Invoices that reference these tariffs arrive through EDI 210/810 Processing and XML Freight Bill Ingestion, each stamped with the contract_version_id that pins it to the exact matrix snapshot produced here.
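
Point-in-time resolution on the consuming side can be sketched as picking the latest version whose effective date is on or before the service date. `resolve_version` is a hypothetical helper, and the version identifiers are placeholders for whatever the pipeline uses as contract_version_id.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def resolve_version(
    versions: List[Tuple[date, str]], service_date: date
) -> Optional[str]:
    """Return the id of the version in force on service_date, if any.

    `versions` pairs each effective date with a version identifier.
    """
    ordered = sorted(versions)
    effective_dates = [effective for effective, _ in ordered]
    # Position of the last effective date that is <= service_date.
    idx = bisect_right(effective_dates, service_date) - 1
    return ordered[idx][1] if idx >= 0 else None


versions = [(date(2026, 7, 1), "v2"), (date(2026, 1, 1), "v1")]
print(resolve_version(versions, date(2026, 3, 15)))  # v1
```

A service date earlier than every effective date returns `None`, which a validation tier would treat as "no contract in force" rather than falling back to the oldest matrix.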

In This Section

How to map LTL class rates to JSON schemas — the schema-design companion to this stage: expressing the canonical class/weight contract as a validatable JSON document, modelling nested zone matrices, reusing weight-break definitions with $ref, and wiring the schema into runtime validation so malformed rate rows fail fast.

Freight Contract Architecture & Rate Mapping — the parent architecture this stage feeds with versioned LTL rate matrices.
FTL Base Rate Extraction — the lane-driven sibling for full-truckload tariffs, with a different schema shape.
Accessorial Charge Taxonomy Mapping — normalizes the accessorial codes deliberately excluded from the base rate matrix.
Fuel Surcharge Formula Implementation — versions the FSC index independently of the base tariff.
PDF Invoice Parsing with Python — the coordinate-calibration and OCR techniques the ingestion gateway applies before this stage.
