FTL Base Rate Extraction

FTL base rate extraction functions as the deterministic ingestion and normalization layer within a freight audit pipeline. Unlike LTL Rate Sheet Digitization, which relies on NMFC class matrices and density calculations, FTL pricing is fundamentally lane-driven, weight-tiered, or zone-matrix based. This stage is strictly scoped to parsing raw carrier contracts, isolating base rate tables, enforcing canonical data types, and emitting version-controlled datasets. It does not perform rate validation, dispute routing, or compliance auditing; those responsibilities belong to downstream pipeline stages.

Pipeline Boundaries & Data Flow

This extraction stage operates immediately after raw contract ingestion and terminates before any validation or dispute logic executes. The boundary is explicitly defined:

  • Inbound: Heterogeneous carrier inputs including PDF rate sheets, EDI 210/204 attachments, XML contract payloads, and legacy CSV exports.
  • Processing: Coordinate-based table detection, segment mapping, schema enforcement, lane normalization, and deterministic key generation.
  • Outbound: Normalized, version-controlled Parquet/JSON datasets containing origin-destination pairs, base rate values, effective/expiry dates, weight breakpoints, and carrier metadata.
  • Downstream Handoff: The output feeds directly into Freight Contract Architecture & Rate Mapping frameworks, where downstream audit engines reference the extracted rates as a single source of truth. Validation engines consume this output to flag anomalies, while dispute routers consume validation results. Extraction itself remains stateless and idempotent.

Step 1: Document Parsing & Table Isolation

Carrier FTL rate sheets lack standardized layouts. Extraction requires a hybrid parsing strategy that prioritizes structural detection over heuristic text scraping. For PDFs, coordinate-based table detection using pdfplumber or camelot should be configured to identify grid lines, merged cells, and header boundaries. For EDI/X12 payloads, standard segment parsers must map B3 (shipment information), L1 (rate and charge), and R4 (geographic location) segments to lane identifiers and base charges. Refer to the official ASC X12 Standards documentation for segment positioning and delimiter handling.

When processing multi-page rate matrices, implement a page-stitching routine that tracks header continuity across page breaks and prevents row duplication. The extraction logic must strictly isolate the base rate table from surrounding contractual language, fuel surcharge schedules, and accessorial footnotes. Separating accessorials at this stage prevents downstream contamination; detailed methodologies for isolating base tables from complex carrier PDF layouts are documented in Extracting FTL zone-based pricing from carrier PDFs, which covers coordinate calibration and regex anchor patterns.

Step 2: Target Schema & Configuration Mapping

Before extraction executes, define a strict target schema that enforces data types, required fields, and audit constraints. The following Pydantic v2 model establishes the canonical FTL base rate structure, utilizing Decimal for currency precision and explicit validators for lane formatting:

from pydantic import BaseModel, Field, field_validator, ConfigDict
from typing import Optional, List
from datetime import date
from decimal import Decimal
import logging

logger = logging.getLogger(__name__)

class WeightTier(BaseModel):
    min_weight: int = Field(..., ge=0, description="Minimum weight in lbs")
    max_weight: Optional[int] = Field(None, ge=0, description="Maximum weight in lbs (None implies unlimited)")
    base_rate: Decimal = Field(..., ge=0, decimal_places=2, description="Base freight charge")

class FTLBaseRate(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    
    carrier_scac: str = Field(..., min_length=4, max_length=4, pattern=r"^[A-Z0-9]{4}$")
    origin_zip: str = Field(..., pattern=r"^\d{5}$")
    destination_zip: str = Field(..., pattern=r"^\d{5}$")
    effective_date: date
    expiry_date: Optional[date] = None
    weight_tiers: List[WeightTier]
    rate_version: str = Field(..., pattern=r"^v\d+\.\d+$")
    source_file_hash: str = Field(..., min_length=32, description="SHA-256 of source document")

    @field_validator("expiry_date")
    @classmethod
    def validate_date_range(cls, v, info):
        if v is not None and v < info.data.get("effective_date"):
            raise ValueError("expiry_date must be on or after effective_date")
        return v

    @field_validator("weight_tiers")
    @classmethod
    def validate_tier_continuity(cls, v):
        if not v:
            raise ValueError("At least one weight tier is required")
        for i in range(len(v) - 1):
            if v[i].max_weight is None:
                raise ValueError("Only the final tier may have an undefined max_weight")
            if v[i].max_weight >= v[i+1].min_weight:
                raise ValueError("Weight tiers must be strictly sequential without overlap")
        return v

Schema validation occurs synchronously during extraction. Any record failing FTLBaseRate.model_validate() is immediately routed to a dead-letter queue (DLQ) with structured error payloads. For comprehensive configuration patterns, consult the official Pydantic Documentation.

Step 3: Normalization & Deterministic Key Generation

Raw carrier data frequently contains inconsistent geographic codes, mixed case SCACs, or non-standard weight units. Normalization must occur before schema validation to guarantee deterministic downstream joins.

  1. Geographic Standardization: Convert city/state combinations to 5-digit ZIP codes using a maintained postal reference table. Reject partial or ambiguous locations.
  2. Weight Tier Alignment: Convert all weight values to pounds. Enforce ascending order and validate that the final tier uses max_weight=None to represent open-ended capacity.
  3. Composite Key Generation: Create a deterministic primary key using f"{carrier_scac}_{origin_zip}_{destination_zip}_{effective_date.isoformat()}". This key enables idempotent upserts and prevents duplicate ingestion during contract renewals.

Accessorial charges, detention fees, and TONU rates must be explicitly excluded from this stage. Those elements belong to Accessorial Charge Taxonomy Mapping and should be parsed into separate normalized tables to maintain audit traceability.

Step 4: Error Handling & Operational Resilience

Production extraction pipelines must fail gracefully, log deterministically, and preserve raw inputs for forensic analysis. Implement the following error-handling strategies:

  • Structured Logging: Emit JSON-formatted logs at INFO, WARN, and ERROR levels. Include carrier_scac, source_file_hash, and record_index in every log entry for traceability.
  • Transient Failure Retries: Wrap network calls or database lookups in exponential backoff with jitter. Limit retries to 3 attempts before escalating to DLQ.
  • Schema Violation Routing: When ValidationError occurs, capture the raw row, attach the Pydantic error context, and publish to a Kafka topic or S3 DLQ prefix. Do not halt batch processing for isolated row failures.
  • Idempotent Writes: Use UPSERT operations keyed on the composite lane key. Include a processed_at timestamp to track pipeline execution windows.
import hashlib
from pydantic import ValidationError
from typing import Iterator

def extract_and_normalize(raw_records: Iterator[dict]) -> Iterator[FTLBaseRate]:
    for idx, record in enumerate(raw_records):
        try:
            normalized = FTLBaseRate(**record)
            yield normalized
        except ValidationError as e:
            logger.error(
                "Schema validation failed",
                extra={
                    "record_index": idx,
                    "carrier_scac": record.get("carrier_scac", "UNKNOWN"),
                    "validation_errors": e.errors()
                }
            )
            # Route to DLQ implementation here
            continue
        except Exception as e:
            logger.critical("Unrecoverable extraction error", exc_info=True)
            raise

Step 5: Output Generation & Downstream Handoff

Upon successful extraction and normalization, write the dataset to a version-controlled Parquet partition. Partition by carrier_scac and effective_date to optimize downstream query performance. Include a metadata manifest containing:

  • Pipeline execution ID
  • Source file checksums
  • Record counts (total, valid, rejected)
  • Schema version identifier

The output dataset is immediately available to validation engines for threshold checking, historical variance analysis, and compliance reporting. By maintaining strict stage boundaries, FTL base rate extraction ensures that downstream audit logic operates on clean, type-safe, and versioned pricing data. This separation of concerns prevents validation rules from leaking into parsing logic and guarantees reproducible audit trails across contract lifecycles.