FTL Base Rate Extraction

FTL base rate extraction is the deterministic ingestion and normalization stage that turns negotiated full-truckload agreements into version-controlled, lookup-ready rate tables for a freight audit pipeline. Where LTL Rate Sheet Digitization leans on NMFC class matrices and density calculations, FTL pricing is lane-driven, weight-tiered, or zone-matrix based, so the extraction logic and the canonical schema are different in kind, not just in degree. This stage parses raw carrier contracts, isolates the base rate table, enforces canonical data types, generates a deterministic lane key, and emits a versioned dataset. It is the FTL-specific entry into Freight Contract Architecture & Rate Mapping, and it deliberately stops short of rate validation, dispute routing, and compliance auditing — those responsibilities belong to later stages that read the dataset this stage produces.

The audience is Python ETL developers and freight auditors who own the rate store. The boundary here matters: everything downstream treats the extracted FTL base rates as a single source of truth, so a silently dropped weight tier or a mis-parsed effective date becomes an audit gap that surfaces months later during carrier reconciliation. The scope of this guide is the parse-to-dataset path for a single carrier contract; it ends at the field contract handed to validation and never prices a shipment itself.

Prerequisites & Input Contract

This stage assumes a working ingestion gateway upstream: carrier contracts land via SFTP, AS2, or a portal export, are persisted as raw bytes, and are hashed before this extractor ever runs. The extractor is a pure transform — given one raw contract artifact plus its source hash, it produces a set of normalized FTL base-rate records — so it carries no network or storage concerns of its own.
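
As an illustration of the provenance hash the extractor expects to receive, the following is a minimal sketch of streaming SHA-256 hashing with the standard library. The function name `sha256_of_file` and the chunk size are assumptions made for this example, not part of the gateway described above.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing in chunks keeps memory flat for large contracts, and the resulting 64-character hex digest is the value carried in `source_file_hash`.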

Dependency	Minimum version	Purpose in this stage
`python`	3.10+	`match` statements, modern typing, `dataclasses`
`pydantic`	2.5+	Strict schema enforcement and field validation
`pdfplumber`	0.11+	Coordinate-based table detection in carrier PDFs
`camelot-py`	0.11+	Lattice/stream table extraction for ruled grids
`decimal` (stdlib)	—	Exact monetary arithmetic for base rates
`hashlib` (stdlib)	—	SHA-256 source hashing for dataset provenance
`pyarrow`	14+	Versioned Parquet partition writes

The config keys this stage reads should come from the pipeline’s settings layer, never hard-coded per carrier:

Config key	Default	Meaning
`ftl.table_strategy`	`lattice`	`lattice` for ruled grids, `stream` for whitespace-delimited tables
`ftl.weight_unit`	`lb`	Source weight unit; non-`lb` values are converted before validation
`ftl.open_tier_sentinel`	`null`	Value that marks the final, open-ended weight tier
`ftl.dlq_target`	`s3://audit-dlq/ftl/`	Dead-letter prefix for records that fail schema validation
`ftl.partition_keys`	`carrier_scac,effective_date`	Parquet partition columns for downstream query pruning
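
One way to read these keys is through a small typed settings object. The sketch below assumes the settings layer hands over a flat mapping keyed by the `ftl.*` names; the `FtlSettings` class and its `from_mapping` constructor are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class FtlSettings:
    """Typed view over the ftl.* config keys, with the documented defaults."""
    table_strategy: str = "lattice"
    weight_unit: str = "lb"
    open_tier_sentinel: Optional[str] = None
    dlq_target: str = "s3://audit-dlq/ftl/"
    partition_keys: tuple[str, ...] = ("carrier_scac", "effective_date")

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "FtlSettings":
        defaults = cls()
        strategy = raw.get("ftl.table_strategy", defaults.table_strategy)
        if strategy not in ("lattice", "stream"):
            raise ValueError(f"Unknown ftl.table_strategy: {strategy!r}")
        # partition keys arrive as one comma-separated string
        keys = raw.get("ftl.partition_keys", ",".join(defaults.partition_keys))
        return cls(
            table_strategy=strategy,
            weight_unit=raw.get("ftl.weight_unit", defaults.weight_unit),
            open_tier_sentinel=raw.get("ftl.open_tier_sentinel", defaults.open_tier_sentinel),
            dlq_target=raw.get("ftl.dlq_target", defaults.dlq_target),
            partition_keys=tuple(k.strip() for k in keys.split(",") if k.strip()),
        )
```

Rejecting an unknown strategy at load time surfaces a configuration typo before any contract is parsed.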

The input contract is deliberately narrow. The extractor accepts one of four heterogeneous artifacts and rejects anything outside that set before parsing:

Inbound format	Typical source	What the extractor must isolate
PDF rate sheet	Carrier portal / email	Ruled base-rate grid, excluding accessorial and FSC footnotes
EDI 210/204	EDI VAN attachment	`B3`, `L1`, `R4` segments mapping lanes to base charges
XML contract payload	TMS or carrier API	Lane nodes, weight-break nodes, effective/expiry attributes
Legacy CSV export	Spreadsheet handoff	Header-mapped columns with explicit unit declarations
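
The reject-before-parse rule could be sketched as a small format guard. The sniffing heuristics here are assumptions for illustration (a `%PDF-` magic prefix, an X12 interchange starting with `ISA`, an XML payload starting with `<`, and a `.csv` extension); `InboundFormat` and `detect_format` are hypothetical names.

```python
from enum import Enum
from pathlib import Path


class InboundFormat(Enum):
    PDF = "pdf"
    EDI = "edi"
    XML = "xml"
    CSV = "csv"


def detect_format(filename: str, head: bytes) -> InboundFormat:
    """Classify an artifact from its first bytes, using the extension only for CSV."""
    stripped = head.lstrip()
    if stripped.startswith(b"%PDF-"):
        return InboundFormat.PDF
    if stripped.startswith(b"ISA"):
        return InboundFormat.EDI
    if stripped.startswith(b"<"):
        return InboundFormat.XML
    if Path(filename).suffix.lower() == ".csv":
        return InboundFormat.CSV
    # Anything else is outside the input contract and is refused up front.
    raise ValueError(f"Unsupported contract artifact: {filename!r}")
```

A production guard would likely need to handle byte-order marks and compressed attachments as well; this sketch only shows the shape of the check.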

Architecture Detail: Segment & Field Mapping

The architecture of this stage is a map from each inbound format onto one canonical record. For EDI 210/204 payloads the mapping is positional and segment-based; the table below is the contract the parser enforces, and it doubles as the reference for which elements are mandatory. The deeper coordinate-calibration work for PDF grids — grid-line detection, regex anchor patterns, and rasterized-page fallbacks — is covered in Extracting FTL zone-based pricing from carrier PDFs.

EDI Segment	Element	Internal Field	Data Type	Validation Rule
`B3`	`B302`	`shipment_reference`	VARCHAR(30)	Non-null, unique per carrier
`B3`	`B306`	`effective_date`	DATE	YYYYMMDD, not after `expiry_date`
`R4`	`R402` (origin)	`origin_zip`	CHAR(5)	5-digit US ZIP, resolved from city/state if absent
`R4`	`R402` (dest)	`destination_zip`	CHAR(5)	5-digit US ZIP
`L1`	`L104`	`base_rate`	DECIMAL(10,2)	Positive, currency-precise
`L1`	`L105`	`rate_basis`	CHAR(2)	`FR` flat, `PM` per-mile, `PH` per-cwt
`ISA`/`GS`	`ISA06`	`carrier_scac`	CHAR(4)	`[A-Z0-9]{4}`, joined to carrier registry

The geographic codes from R4 and the rate basis from L1 together determine whether a contract is lane-flat, mileage-band, or zone-matrix. Equipment-type modifiers (dry van, reefer, flatbed) ride alongside as a normalized attribute rather than multiplying the row count, which keeps the lane key stable across equipment classes.
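
To make the positional mapping concrete, here is a minimal sketch that splits a payload into segments and reads elements by the positions in the table above. It assumes `*` as the element separator and `~` as the segment terminator (real interchanges declare their delimiters in the ISA header), and the sample payload is synthetic. The helper names are hypothetical.

```python
from decimal import Decimal
from typing import Optional


def split_segments(payload: str, seg_term: str = "~", elem_sep: str = "*") -> list[list[str]]:
    """Split a payload into segments, each a list of elements (index 0 is the segment ID)."""
    return [seg.strip().split(elem_sep) for seg in payload.split(seg_term) if seg.strip()]


def element(segments: list[list[str]], seg_id: str, position: int) -> Optional[str]:
    """Return element `position` (1-based, so B302 is position 2) of the first
    segment named `seg_id`, or None when the segment or element is absent or empty."""
    for seg in segments:
        if seg[0] == seg_id:
            return seg[position] if position < len(seg) and seg[position] else None
    return None


def map_base_charge(segments: list[list[str]]) -> dict:
    """Apply the B3 and L1 rows of the mapping table to one parsed payload."""
    raw_rate = element(segments, "L1", 4)
    return {
        "shipment_reference": element(segments, "B3", 2),
        "effective_date": element(segments, "B3", 6),  # still a YYYYMMDD string here
        "base_rate": Decimal(raw_rate) if raw_rate else None,
        "rate_basis": element(segments, "L1", 5),
    }


# Synthetic payload shaped to match the table, not taken from a real carrier.
sample = "B3**INV1001****20260101~L1*1***950.00*FR~"
mapped = map_base_charge(split_segments(sample))
```

Missing mandatory elements come back as `None`, which lets the schema step reject the record rather than having the parser guess.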

Step-by-Step Implementation

Step 1 — Document parsing & table isolation

Carrier FTL rate sheets have no standard layout, so extraction must prioritize structural detection over heuristic text scraping. For PDFs, configure coordinate-based table detection with pdfplumber or camelot to identify grid lines, merged cells, and header boundaries. For EDI/X12 payloads, map the B3 (shipment information), L1 (rate and charge), and R4 (geographic location) segments to lane identifiers and base charges; the EDI 210/810 Processing stage already implements the segment-splitting primitives this reuses, and the official ASC X12 Standards documentation governs segment positioning and delimiters.

When a rate matrix spans multiple pages, a page-stitching routine must track header continuity across page breaks and prevent row duplication. The extraction must isolate the base rate table from surrounding contractual language, fuel surcharge schedules, and accessorial footnotes — separating accessorials here prevents downstream contamination.

import logging
import pdfplumber
from pathlib import Path
from typing import Iterator

logger = logging.getLogger(__name__)

def isolate_base_rate_rows(pdf_path: Path, settings: dict) -> Iterator[dict]:
    """Yield raw base-rate rows from a carrier PDF, skipping FSC/accessorial blocks.

    Header continuity is tracked across pages so a base-rate table that breaks
    across a page boundary is stitched into one logical table.
    """
    # "lattice"/"stream" are camelot flavor names; pdfplumber expects its own
    # strategy names, so translate the config value before building settings.
    strategy = {"lattice": "lines", "stream": "text"}[settings["ftl.table_strategy"]]
    table_settings = {"vertical_strategy": strategy, "horizontal_strategy": strategy}
    header: list[str] | None = None
    with pdfplumber.open(pdf_path) as pdf:  # release the file handle when done
        for page in pdf.pages:
            for table in page.extract_tables(table_settings):
                if not table:
                    continue
                first_row = [c.strip().lower() if c else "" for c in table[0]]
                # A page-break continuation repeats no header; reuse the prior one.
                if "origin" in first_row and "rate" in first_row:
                    header = first_row
                    body = table[1:]
                else:
                    body = table
                if header is None:
                    logger.warning("Table found before any header row; skipping")
                    continue
                for row in body:
                    cells = dict(zip(header, row))
                    # Footnote/accessorial blocks lack an origin cell; drop them.
                    if not cells.get("origin"):
                        continue
                    yield cells

Common mistake: running text extraction first and table detection second. Native text order does not respect column boundaries, so two adjacent lanes collapse into one string and the base rate of lane B is attributed to lane A. Detect tables structurally first; fall back to text only for rasterized pages.

Step 2 — Target schema & configuration mapping

Before extraction commits anything, define a strict target schema that enforces data types, required fields, and audit constraints. The Pydantic v2 model below is the canonical FTL base-rate structure: Decimal for currency precision, explicit validators for lane formatting, and tier-continuity checks that reject overlapping weight breaks.

from pydantic import BaseModel, Field, field_validator, ConfigDict
from typing import Optional, List
from datetime import date
from decimal import Decimal


class WeightTier(BaseModel):
    min_weight: int = Field(..., ge=0, description="Minimum weight in lbs")
    max_weight: Optional[int] = Field(
        None, ge=0, description="Maximum weight in lbs (None implies unlimited)"
    )
    base_rate: Decimal = Field(
        ..., gt=0, decimal_places=2, description="Base freight charge (must be positive)"
    )


class FTLBaseRate(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")

    carrier_scac: str = Field(..., min_length=4, max_length=4, pattern=r"^[A-Z0-9]{4}$")
    origin_zip: str = Field(..., pattern=r"^\d{5}$")
    destination_zip: str = Field(..., pattern=r"^\d{5}$")
    effective_date: date
    expiry_date: Optional[date] = None
    weight_tiers: List[WeightTier]
    rate_version: str = Field(..., pattern=r"^v\d+\.\d+$")
    source_file_hash: str = Field(
        ..., pattern=r"^[0-9a-fA-F]{64}$", description="SHA-256 hex digest of source document"
    )

    @field_validator("expiry_date")
    @classmethod
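    def validate_date_range(cls, v: Optional[date], info) -> Optional[date]:
        # effective_date is missing from info.data when it failed its own
        # validation; skip the comparison rather than raise a TypeError.
        effective = info.data.get("effective_date")
        if v is not None and effective is not None and v < effective:
            raise ValueError("expiry_date must be on or after effective_date")
        return v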

    @field_validator("weight_tiers")
    @classmethod
    def validate_tier_continuity(cls, v: List[WeightTier]) -> List[WeightTier]:
        if not v:
            raise ValueError("At least one weight tier is required")
        for i in range(len(v) - 1):
            if v[i].max_weight is None:
                raise ValueError("Only the final tier may have an undefined max_weight")
            if v[i].max_weight >= v[i + 1].min_weight:
                raise ValueError("Weight tiers must be strictly sequential without overlap")
        return v

Validation runs synchronously during extraction. Any record that fails FTLBaseRate.model_validate() is routed to a dead-letter queue with a structured error payload rather than silently dropped. For the full validator surface, consult the official Pydantic documentation.
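
A structured error payload might look like the following sketch. It assumes the error list has the shape returned by Pydantic's `ValidationError.errors()` (dicts with `type`, `loc`, and `msg` keys); the function name `build_dlq_payload` and the payload field names are illustrative choices, not a fixed contract.

```python
import json
from datetime import date, datetime, timezone
from decimal import Decimal
from typing import Any


def _jsonable(value: Any) -> Any:
    """Fallback encoder for values json cannot serialize natively."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)  # keep the exact decimal text, never a float
    return repr(value)


def build_dlq_payload(raw_row: dict, errors: list[dict],
                      source_file_hash: str, record_index: int) -> str:
    """Serialize one rejected row plus its validation errors for the dead-letter target."""
    payload = {
        "source_file_hash": source_file_hash,
        "record_index": record_index,
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "errors": [
            {"loc": list(e.get("loc", ())), "type": e.get("type"), "msg": e.get("msg")}
            for e in errors
        ],
        "raw_row": raw_row,
    }
    return json.dumps(payload, default=_jsonable, sort_keys=True)
```

Keeping the raw row alongside the error locations is what makes a later forensic replay possible without re-parsing the source document.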

Common mistake: typing base_rate as float. A 0.1 + 0.2 style rounding error in a per-mile rate compounds across a multi-thousand-mile lane and surfaces as a phantom variance during audit. Keep currency in Decimal from extraction through to the Parquet write.
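
The difference is easy to demonstrate. This short sketch contrasts binary floats with `Decimal` and shows a per-mile charge rounded to cents; the rate, the mileage, and the `linehaul_charge` helper are made-up values for illustration, and the rounding mode is an assumption that should follow the contract's own rounding clause.

```python
from decimal import Decimal, ROUND_HALF_UP

float_sum = 0.1 + 0.2                          # 0.30000000000000004 in binary floats
decimal_sum = Decimal("0.1") + Decimal("0.2")  # exactly Decimal('0.3')


def linehaul_charge(rate_per_mile: Decimal, miles: int) -> Decimal:
    """Per-mile charge rounded to cents with an explicit rounding mode."""
    return (rate_per_mile * miles).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


charge = linehaul_charge(Decimal("2.15"), 1847)  # Decimal('3971.05')
```

Constructing `Decimal` from strings matters: `Decimal(0.1)` would inherit the float's binary error.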

Step 3 — Normalization & deterministic key generation

Raw carrier data carries inconsistent geographic codes, mixed-case SCACs, and non-standard weight units. Normalize before validation so downstream joins are deterministic.

Geographic standardization. Convert city/state combinations to 5-digit ZIP codes against a maintained postal reference table. Reject partial or ambiguous locations rather than guessing.
Weight-tier alignment. Convert every weight to pounds, enforce ascending order, and require the final tier to use max_weight=None for open-ended capacity.
Composite key generation. Build a deterministic primary key, f"{carrier_scac}_{origin_zip}_{destination_zip}_{effective_date.isoformat()}", which enables idempotent upserts and prevents duplicate ingestion during contract renewals.
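
The geographic rule above can be sketched as a lookup that refuses to guess. The reference table shape (a mapping from upper-cased city and state to its candidate canonical ZIPs) and the `resolve_zip` name are assumptions for this example; the entries are placeholders, not real postal reference data.

```python
def resolve_zip(city: str, state: str,
                reference: dict[tuple[str, str], list[str]]) -> str:
    """Resolve a city/state pair to exactly one 5-digit ZIP, or raise."""
    key = (city.strip().upper(), state.strip().upper())
    candidates = reference.get(key, [])
    if len(candidates) != 1:
        # Zero candidates means unknown; several means ambiguous. Reject both.
        raise ValueError(
            f"Cannot resolve {city!r}, {state!r}: {len(candidates)} candidate ZIPs"
        )
    return candidates[0]


# Placeholder reference rows for illustration only.
reference = {
    ("CHICAGO", "IL"): ["60601"],
    ("SPRINGFIELD", "IL"): ["62701", "62702"],
}
```

With this table, `resolve_zip("chicago", "il", reference)` returns `"60601"`, while the Springfield entry raises because two candidates remain.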

from decimal import Decimal


def lane_key(record: "FTLBaseRate") -> str:
    """Deterministic, collision-free key for idempotent upserts."""
    return (
        f"{record.carrier_scac}_{record.origin_zip}_"
        f"{record.destination_zip}_{record.effective_date.isoformat()}"
    )


def to_pounds(value: Decimal, unit: str) -> Decimal:
    """Normalize a source weight to pounds before tier validation."""
    factors = {"lb": Decimal("1"), "kg": Decimal("2.20462"), "ton": Decimal("2000")}
    if unit not in factors:
        raise ValueError(f"Unsupported weight unit: {unit!r}")
    return (value * factors[unit]).quantize(Decimal("1"))

Accessorial charges, detention fees, and TONU rates are explicitly excluded here. They belong to Accessorial Charge Taxonomy Mapping and must be parsed into separate normalized tables to keep audit traceability. Fuel surcharges are likewise out of scope and are modelled independently in Fuel Surcharge Formula Implementation, so the base tariff can version separately from the index.

Common mistake: generating the lane key before normalizing the ZIP. A key built on Chicago, IL and another on 60601 point at the same lane but never deduplicate, so a renewal double-ingests. Normalize geography first, then key.
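
Weight-tier alignment can be sketched in the same spirit: sort the tiers and translate a carrier's open-tier sentinel into `None` before the schema sees them. The `align_tiers` helper and its dict-based tier shape are assumptions for illustration; the sentinel would come from `ftl.open_tier_sentinel`.

```python
from typing import Optional


def align_tiers(tiers: list[dict], open_sentinel: Optional[int] = None) -> list[dict]:
    """Return tiers sorted by min_weight, with a sentinel max on the last tier set to None.

    The input list and its dicts are left unmodified.
    """
    ordered = sorted((dict(t) for t in tiers), key=lambda t: t["min_weight"])
    if ordered and open_sentinel is not None and ordered[-1]["max_weight"] == open_sentinel:
        ordered[-1]["max_weight"] = None
    return ordered


raw_tiers = [
    {"min_weight": 10001, "max_weight": 999999, "base_rate": "1450.00"},
    {"min_weight": 0, "max_weight": 10000, "base_rate": "950.00"},
]
aligned = align_tiers(raw_tiers, open_sentinel=999999)
```

Only the final tier is rewritten, so a sentinel value appearing mid-table still reaches validation and fails there instead of being silently accepted.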

Step 4 — Error handling & operational resilience

Production extraction must fail gracefully, log deterministically, and preserve raw inputs for forensic replay.

Structured logging. Emit JSON logs at INFO, WARN, and ERROR with carrier_scac, source_file_hash, and record_index on every entry.
Transient-failure retries. Wrap network and database lookups in exponential backoff with jitter; cap at 3 attempts before routing to the dead-letter queue.
Schema-violation routing. On ValidationError, capture the raw row, attach the Pydantic error context, and publish to the dead-letter target. Never halt the batch for an isolated row.
Idempotent writes. Upsert on the composite lane key and stamp a processed_at timestamp to track execution windows.

from pydantic import ValidationError
from typing import Iterator

def extract_and_normalize(raw_records: Iterator[dict]) -> Iterator[FTLBaseRate]:
    for idx, record in enumerate(raw_records):
        try:
            yield FTLBaseRate(**record)
        except ValidationError as e:
            logger.error(
                "Schema validation failed",
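                extra={
                    "record_index": idx,
                    "carrier_scac": record.get("carrier_scac", "UNKNOWN"),
                    "source_file_hash": record.get("source_file_hash", "UNKNOWN"),
                    "validation_errors": e.errors(),
                },
            )
            # The dead-letter publish is omitted from this sketch: hand `record`
            # and e.errors() to the DLQ writer here, then keep going.
            continue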
        except Exception:
            logger.critical("Unrecoverable extraction error", exc_info=True)
            raise

Common mistake: letting one bad row abort the batch. A single carrier that ships a malformed expiry date should cost you one dead-lettered record, not an entire contract’s worth of lanes. Catch ValidationError per row and continue; reserve raise for genuinely unrecoverable faults.
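
The retry policy described above (exponential backoff with jitter, capped at three attempts) might be sketched as follows. The `with_retries` helper, its default delays, and the choice of which exceptions count as transient are assumptions for illustration; the `sleep` parameter exists only so the behaviour can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: the caller routes this to the dead-letter queue
            # Full jitter: a random wait up to the exponential ceiling for this attempt.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Only the listed transient exception types are retried; a `ValidationError` or any other exception propagates immediately, which keeps schema failures on the dead-letter path rather than in a retry loop.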

Validation & Testing

Extraction correctness is verified with table-driven unit tests over fixtures that mirror real carrier quirks, not idealized inputs. Build fixtures for: a clean multi-tier lane, an overlapping-tier defect, an open-ended final tier, a missing-ZIP row, and a mixed-unit weight column.

import pytest
from datetime import date
from decimal import Decimal
from pydantic import ValidationError


def _record(**overrides) -> dict:
    base = {
        "carrier_scac": "ABCD",
        "origin_zip": "60601",
        "destination_zip": "30301",
        "effective_date": date(2026, 1, 1),
        "weight_tiers": [
            {"min_weight": 0, "max_weight": 10000, "base_rate": Decimal("950.00")},
            {"min_weight": 10001, "max_weight": None, "base_rate": Decimal("1450.00")},
        ],
        "rate_version": "v1.0",
        "source_file_hash": "a" * 64,
    }
    base.update(overrides)
    return base


def test_valid_record_round_trips():
    rec = FTLBaseRate(**_record())
    assert lane_key(rec) == "ABCD_60601_30301_2026-01-01"


def test_overlapping_tiers_rejected():
    bad = _record(weight_tiers=[
        {"min_weight": 0, "max_weight": 10000, "base_rate": Decimal("950.00")},
        {"min_weight": 9000, "max_weight": None, "base_rate": Decimal("1450.00")},
    ])
    with pytest.raises(ValidationError, match="strictly sequential"):
        FTLBaseRate(**bad)


def test_expiry_before_effective_rejected():
    with pytest.raises(ValidationError, match="on or after"):
        FTLBaseRate(**_record(expiry_date=date(2025, 12, 31)))


def test_missing_zip_routes_to_dlq():
    rows = iter([_record(origin_zip="ILChi")])
    assert list(extract_and_normalize(rows)) == []  # bad row swallowed, not raised

The last test pins the resilience contract: a malformed row produces zero yielded records and zero exceptions, which is exactly the behaviour the batch driver depends on. Run the suite under pytest -q in CI and gate merges on it.

Performance & Tuning

Extraction time is dominated by PDF table parsing and schema validation rather than by the rate arithmetic, so tuning targets memory and batch shape instead of arithmetic micro-optimization.

Batch size. Process one carrier contract per worker task and 200–400 lanes per validation batch as a starting point. Larger batches amortize the Parquet write; smaller batches lower peak memory and shrink the blast radius of a poison file.
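Memory footprint. Stream pages with pdfplumber rather than loading a multi-hundred-page rate sheet into one DataFrame. pdfplumber caches each page's parsed objects, so release a page once its rows have been yielded (recent releases document page.close() for this); with that in place a streamed contract stays in single-digit megabytes, whereas a monolithic load of the same file can trigger garbage-collection thrashing and get CI runners OOM-killed.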
Concurrency. The extractor is stateless per contract, so it parallelizes cleanly across a worker pool. The heavier fan-out of many contracts is handled out-of-process by Async Batch Processing Workflows rather than threaded inside this function.
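
The batch-size guidance can be sketched with a small chunking generator. The `batched` helper and its default of 300 are illustrative; it is written with `islice` so it runs on Python 3.10, where `itertools.batched` is not yet available.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], size: int = 300) -> Iterator[list[T]]:
    """Yield lists of at most `size` records without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    # islice pulls the next `size` items; an empty list means the input is exhausted.
    while batch := list(islice(it, size)):
        yield batch
```

For example, `list(batched(range(7), 3))` yields `[[0, 1, 2], [3, 4, 5], [6]]`, so the final short batch is still written.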

Failure Modes

Five scenarios account for the overwhelming majority of production incidents in FTL base-rate extraction. Each has a stable signature and a known resolution path.

Merged header collapse. A carrier merges the Origin and Origin ZIP header cells; extract_tables returns one fused column and every lane key is malformed. Diagnostic: len(header) is one less than the data row width. Resolution: switch ftl.table_strategy to stream for that carrier and re-derive headers from column positions.
Rasterized rate grid. The PDF is a scanned image, so native text extraction yields empty tables and the contract silently produces zero rows. Diagnostic: page.extract_tables() returns [] on a page with a visible grid. Resolution: route the file to an OCR pre-pass before extraction; never treat zero rows as “no rates.”
Open-tier sentinel drift. A carrier represents the open-ended top tier as 999999 instead of an empty cell, so validate_tier_continuity accepts a finite max where infinity was intended. Diagnostic: a final tier with an implausibly round max_weight. Resolution: map the configured ftl.open_tier_sentinel to None during normalization, before validation.
Mid-cycle rate drift without a version bump. A carrier adjusts a lane’s base rate but reuses the prior rate_version, so the upsert overwrites the historical rate and breaks point-in-time resolution. Diagnostic: a lane key whose base_rate changed while rate_version and effective_date did not. Resolution: hash the rate body and reject any write where the hash changed but the version did; force a new version. This drift detection is detailed in Extracting FTL zone-based pricing from carrier PDFs.
Accessorial bleed. A fuel or detention footnote sits inside the table boundary and is parsed as a lane, injecting a phantom rate. Diagnostic: records with a non-ZIP origin or a rate_basis outside the allowed set. Resolution: enforce the origin_zip and rate_basis patterns at the schema boundary so the bleed dead-letters instead of contaminating the dataset.
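
The rate-drift resolution above (hash the rate body, reject a changed hash under an unchanged version) could be sketched like this. The helper names, the stored-record shape, and the canonicalization via sorted JSON are assumptions for illustration; values are assumed to be already normalized, since `950.0` and `950.00` would hash differently here.

```python
import hashlib
import json
from typing import Optional


def rate_body_hash(weight_tiers: list[dict]) -> str:
    """Hash the tier body canonically so identical content always hashes the same."""
    canonical = json.dumps(
        [{k: str(v) for k, v in sorted(t.items())} for t in weight_tiers],
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_drift(stored: Optional[dict], incoming: dict) -> None:
    """Raise if the rate body changed while rate_version stayed the same."""
    if stored is None:
        return  # first ingestion of this lane key, nothing to compare
    same_version = stored["rate_version"] == incoming["rate_version"]
    if same_version and stored["body_hash"] != incoming["body_hash"]:
        raise ValueError(
            f"Rate body changed under unchanged version {incoming['rate_version']}; "
            "a new rate_version is required"
        )
```

Running `check_drift` before the upsert turns a silent overwrite into an explicit rejection that can be dead-lettered and raised with the carrier.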

Integration Points

The output of this stage is a versioned Parquet partition plus a metadata manifest — a stable field contract that the validation tier consumes without ever touching a PDF or X12 stream again. Partition by carrier_scac and effective_date for query pruning, and write a manifest containing the pipeline execution ID, source file checksums, record counts (total, valid, rejected), and the schema version identifier.
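
A manifest carrying those fields might be modelled as below. This is a sketch only: the `Manifest` class, its field names, and the JSON layout are assumptions, and the Parquet write itself is left out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Manifest:
    """Metadata written next to each versioned Parquet partition."""
    execution_id: str
    schema_version: str
    source_checksums: list[str]
    total: int
    valid: int
    rejected: int

    def __post_init__(self) -> None:
        # Every input record must be accounted for as either valid or rejected.
        if self.valid + self.rejected != self.total:
            raise ValueError("valid + rejected must equal total")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

The count check in `__post_init__` is a cheap guard against the silent-drop failure this stage is meant to prevent: a manifest whose numbers do not add up cannot be constructed.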

Downstream, Rule-Based Rate Validation & Accessorial Auditing joins each billed shipment against the FTL base rate in force on the service date, and Lane Matching Algorithms confirm the billed origin/destination pair against the contracted lanes this stage emitted. Because extraction never prices freight itself, the integration boundary is exactly the canonical record: keep it clean, type-safe, and versioned, and every downstream tier stays format-agnostic.

In This Section

Extracting FTL zone-based pricing from carrier PDFs — a debugging-and-scaling walkthrough of truncated tables, merged cells, rasterized grids, memory pressure during bulk ingestion, and silent rate drift, with reproducible diagnostics and production-safe resolution paths for coordinate calibration and regex anchor patterns.

Freight Contract Architecture & Rate Mapping — the parent architecture this stage feeds with versioned FTL base rates.
LTL Rate Sheet Digitization — the class-based sibling for less-than-truckload networks.
Accessorial Charge Taxonomy Mapping — normalizes the accessorial codes deliberately excluded from base extraction.
Fuel Surcharge Formula Implementation — versions the FSC index independently of the base tariff.
Lane Matching Algorithms — matches billed lanes to the contracted rate tables produced here.
