Matching Shipment Lanes to Contracted Rate Tables Using Python

This page resolves the three failures that break lane-to-rate matching in production freight audits: shipments that silently drop to zero matches, bulk joins that get OOMKilled mid-run, and the wrong contract version winning when effective windows overlap.

The Failure You Are Hitting

You wrote a join that links each billed shipment to its contracted rate and it passed every sample. In production it degrades in one of three observable ways, none of which raises an exception:

A naive pandas.merge() on origin_zip/dest_zip returns far fewer rows than you have shipments. Valid lanes never reach validation because "07114" does not exact-match the rate sheet’s "7114", so the records fall through and the audit under-recovers without ever flagging a miss.
A left join of 50M line items against an unindexed rate table inflates to a near cross-product, peak RSS blows past the worker limit, and the Airflow or Kubernetes pod is OOMKilled halfway through the batch — leaving the run half-committed.
The join succeeds, the payload validates, and a later reconciliation shows a 6% overbill because two contract versions had overlapping effective_start/effective_end dates and the row that survived was non-deterministic — a different version won on the re-run.

This stage sits inside Rule-Based Rate Validation & Accessorial Auditing, immediately downstream of Lane Matching Algorithms, and its whole job is to turn a normalized shipment into a single, defensible rate anchor. When it drops, over-bills, or crashes silently, every charge check that follows inherits the error.

Root Cause Analysis

These failures are rarely defects in pandas or Polars. They trace to four production conditions a single rate sheet never exercises:

Unnormalized lane keys. ZIPs arrive zero-stripped (7114), zero-padded (07114), ZIP+4 (07114-2207), or with embedded whitespace. Exact string equality treats every variant as a different lane, so legitimate shipments match nothing.
Unbounded join cardinality. A lane key that is not unique on the rate side — because effective windows and weight breaks were not collapsed first — turns a left join into a many-to-many explosion. The row count, and the memory, scale O(N×M) instead of O(N).
Overlapping contract windows with no tie-breaker. When two contract_version rows are both effective on the ship_date, the engine has no deterministic rule for which wins, so groupby().first() returns whichever row the sort happened to leave on top.
Hard-fail vs silent-fallthrough with no middle. A matcher that either drops unmatched rows or back-fills them with a national default cannot tell a genuine missing-rate from a normalization artefact, so teams quietly accept whatever number comes out.
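
The first condition is easy to measure before touching any join. The following is a minimal sketch, in plain Python, that labels the shape of each raw ZIP value so you can compare the format mix on the shipment side against the rate side. The function names and the shape labels are illustrative assumptions, not part of the pipeline described here.

```python
import re
from collections import Counter


def key_shape(raw: str) -> str:
    """Label the format of one raw ZIP value (labels are illustrative)."""
    if re.search(r"\s", raw):
        return "whitespace"          # leading, trailing, or embedded spaces
    if re.fullmatch(r"\d{5}-\d{4}", raw):
        return "zip_plus_4"
    if re.fullmatch(r"\d{5}", raw):
        return "five_digit"
    if re.fullmatch(r"\d{1,4}", raw):
        return "zero_stripped"       # leading zeros lost upstream
    return "other"


def shape_profile(values) -> Counter:
    """Count how many values fall into each shape."""
    return Counter(key_shape(v) for v in values)
```

If the two sides produce different profiles, for example mostly five_digit on the rate sheet and a large zero_stripped share on the shipments, exact equality will miss those lanes.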

Reproducible Diagnostic

Before changing any join code, confirm which failure you have. This snippet quantifies the match ratio and the key-shape mismatch that causes silent drops:

import polars as pl

shipments = pl.read_csv("shipments.csv", infer_schema_length=0)
rates = pl.read_csv("rate_table.csv", infer_schema_length=0)

joined = shipments.join(rates, on=["origin_zip", "dest_zip"], how="left")
total = shipments.height
matched = joined.filter(pl.col("rate_per_lb").is_not_null()).height

print(f"shipments={total} matched={matched} ratio={matched / total:.3f}")
print("ship zip sample:", shipments["origin_zip"].head(3).to_list())
print("rate zip sample:", rates["origin_zip"].head(3).to_list())
print("rows after join:", joined.height, "(>> shipments means cardinality blow-up)")

Read the output like a decision tree:

Signal	Likely cause	Where to fix
`ratio` well below 1.0, ZIP samples differ in width/format	unnormalized lane keys	canonicalize before join (Step 1)
`rows after join` far exceeds `shipments`	non-unique rate key	collapse versions/weight breaks first (Step 3)
`ratio` ≈ 1.0 but reconciliation shows drift	overlapping windows, no tie-break	deterministic version resolution (Step 3)
process killed before printing	full materialized join	switch to streaming lazy scan (Step 2)

A join that returns more rows than you have shipments is the early warning for the memory failure — fix the cardinality before you scale the input.
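
You can estimate that row count without running the join at all. This sketch, in plain Python, assumes you can extract just the lane-key tuples from each side; estimate_left_join_rows is a hypothetical helper, not a Polars API. A left join emits one row per matching rate row, or a single null row when nothing matches.

```python
from collections import Counter


def estimate_left_join_rows(shipment_keys, rate_keys) -> int:
    """Predict the output height of a left join from the key columns alone."""
    rate_rows_per_key = Counter(rate_keys)
    # Each shipment yields max(1, matches): unmatched rows still appear once.
    return sum(max(1, rate_rows_per_key[key]) for key in shipment_keys)


# Three shipments on a lane that has four rate rows (versions x weight breaks),
# plus one shipment on a lane with no rate at all.
shipment_keys = [("07114", "90210")] * 3 + [("10001", "60601")]
rate_keys = [("07114", "90210")] * 4
estimated = estimate_left_join_rows(shipment_keys, rate_keys)   # 3 * 4 + 1 = 13
blow_up = estimated / len(shipment_keys)                         # 3.25x
```

A blow-up factor well above 1.0 on a sample tells you the rate key is not unique before the full input ever reaches the worker.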

Resolution Path

The fix is a four-step matcher that canonicalizes lane keys, streams the join to bound memory, resolves contract versions deterministically, and routes the unmatched remainder instead of guessing. Pin dependencies so CI and production agree exactly:

# requirements.txt
polars==1.12.0
pydantic==2.10.6
structlog==24.4.0
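
Pins only help if the running environment actually matches them. As a rough sketch, a startup check could compare the pinned versions against what is installed. parse_pins and pin_mismatches are hypothetical helpers, and the parser assumes simple name==version lines like the ones above.

```python
from importlib import metadata


def parse_pins(text: str) -> dict:
    """Read 'name==version' lines, ignoring comments and blank lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def pin_mismatches(pins: dict, installed_version=metadata.version) -> dict:
    """Return {package: (pinned, installed_or_None)} for every disagreement."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None                     # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Failing the batch when pin_mismatches returns anything keeps a drifted worker image from producing results CI never tested.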

Step 1 — Canonicalize the lane key

Normalize both sides to the same key shape before any join. A ZIP truncated to five digits and stripped of non-numeric noise matches regardless of how the carrier formatted it:

import polars as pl

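def canon_zip(col: str) -> pl.Expr:
    """Five-digit, numeric-only ZIP so '07114-2207', '7114-2207' and ' 7114 ' match.

    Values with no leading digits become null, so they can never match a lane.
    """
    return (
        pl.col(col)
        .cast(pl.String)
        .str.replace_all(r"\s", "")         # drop leading, trailing, embedded whitespace
        .str.extract(r"^(\d+)", 1)          # first digit run only: cuts the ZIP+4 tail at the dash
        .str.zfill(5)                       # restore leading zeros: '7114' -> '07114'
        .str.slice(0, 5)                    # truncate an undashed nine-digit ZIP+4
        .alias(col)
    )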

def normalize_lanes(lf: pl.LazyFrame) -> pl.LazyFrame:
    return lf.with_columns([canon_zip("origin_zip"), canon_zip("dest_zip")])

The same canonicalization must run on the contracted rate table when it is loaded from Freight Contract Architecture & Rate Mapping — a key normalized on only one side still misses.
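
For loaders that handle rate rows one at a time rather than through Polars, a plain-Python mirror of the same rule keeps both sides aligned. This is a sketch under that assumption; canon_zip_str is a hypothetical name. It takes the first run of digits before padding, so a zero-stripped ZIP+4 such as '7114-2207' resolves to '07114'. Whichever rule you adopt, both sides of the join must apply the identical one.

```python
import re
from typing import Optional


def canon_zip_str(raw) -> Optional[str]:
    """Five-digit, numeric-only ZIP, or None when no leading digits exist."""
    if raw is None:
        return None
    compact = re.sub(r"\s", "", str(raw))    # remove all whitespace first
    match = re.match(r"\d+", compact)        # first digit run, stops at the dash
    if match is None:
        return None
    return match.group(0).zfill(5)[:5]       # pad leading zeros, then truncate
```

Returning None instead of a padded "00000" means an unparseable value cannot accidentally match a real lane.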

Step 2 — Stream the join to bound memory

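Never materialize a 50M-row merge eagerly. Polars’ lazy engine with collect(streaming=True) processes the scan and join in batches, so peak working memory tracks roughly a batch plus the rate table rather than both fully loaded frames. That spelling matches the pinned polars 1.12.0; newer Polars releases select the streaming engine with collect(engine="streaming"). Validate the schema at the boundary so a missing column fails fast instead of producing a silent null join: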

import structlog

logger = structlog.get_logger()

SHIPMENT_SCHEMA = {
    "shipment_id": pl.String, "origin_zip": pl.String, "dest_zip": pl.String,
    "weight_lbs": pl.Float64, "ship_date": pl.Date, "carrier_scac": pl.String,
}
RATE_SCHEMA = {
    "carrier_scac": pl.String, "origin_zip": pl.String, "dest_zip": pl.String,
    "min_weight": pl.Float64, "max_weight": pl.Float64, "rate_per_lb": pl.Float64,
    "effective_start": pl.Date, "effective_end": pl.Date, "contract_version": pl.Int32,
}

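def typed_scan(path: str, schema: dict) -> pl.LazyFrame:
    """Lazy CSV scan with fail-fast schema enforcement at the boundary."""
    # infer_schema_length=0 reads every column as String, so a ZIP column is never
    # inferred as an integer (losing leading zeros or failing later on '07114-2207').
    lf = pl.scan_csv(path, infer_schema_length=0)
    present = set(lf.collect_schema().names())
    missing = [c for c in schema if c not in present]
    if missing:
        raise ValueError(f"{path} missing required columns: {missing}")
    # Strict casts: an unparseable weight or date raises instead of becoming null.
    return lf.with_columns([pl.col(c).cast(t) for c, t in schema.items()])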

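def match_lanes(shipment_path: str, rate_path: str) -> pl.DataFrame:
    shipments = normalize_lanes(typed_scan(shipment_path, SHIPMENT_SCHEMA))
    rates = normalize_lanes(typed_scan(rate_path, RATE_SCHEMA))
    lane_keys = ["carrier_scac", "origin_zip", "dest_zip"]
    rate_cols = [c for c in RATE_SCHEMA if c not in lane_keys]

    # Candidate rates: lane match first, then the temporal and weight-break filters
    # AFTER the join, tested against each shipment's own ship_date and weight_lbs.
    candidates = shipments.join(rates, on=lane_keys, how="inner").filter(
        pl.col("ship_date").is_between(pl.col("effective_start"), pl.col("effective_end"))
        & pl.col("weight_lbs").is_between(pl.col("min_weight"), pl.col("max_weight"))
    )

    # Left-join the survivors back onto the shipments (assumes shipment_id is unique).
    # A shipment whose lane exists but whose date or weight falls outside every row
    # is kept with a null rate for routing instead of being filtered away.
    joined = shipments.join(
        candidates.select(["shipment_id", *rate_cols]), on="shipment_id", how="left"
    )

    logger.info("lane_join_complete", source=shipment_path)
    return joined.collect(streaming=True)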

The temporal filter runs after the join because the effective window must be tested against each shipment’s own ship_date. A filter applied to the rate table before the join has no ship_date to compare against, and a placeholder that compares effective_start to a literal None matches nothing, which is a classic source of a silent zero-match.
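
The predicate itself is small enough to state for a single shipment and a single rate row. The sketch below is plain Python with a hypothetical RateRow type. It assumes bounds are inclusive on both ends, which is also the default for Polars' is_between, and it treats any missing bound as "does not apply" rather than guessing.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RateRow:
    effective_start: Optional[date]
    effective_end: Optional[date]
    min_weight: Optional[float]
    max_weight: Optional[float]


def rate_applies(ship_date: date, weight_lbs: float, rate: RateRow) -> bool:
    """True only when the shipment sits inside both the date window and the weight break."""
    bounds = (rate.effective_start, rate.effective_end, rate.min_weight, rate.max_weight)
    if any(b is None for b in bounds):
        return False                 # a missing bound is never a confident match
    in_window = rate.effective_start <= ship_date <= rate.effective_end
    in_break = rate.min_weight <= weight_lbs <= rate.max_weight
    return in_window and in_break
```

Because both ends are inclusive, adjacent weight breaks that share a boundary value (0 to 500 and 500 to 1000) both apply to a 500 lb shipment, so the rate table should define breaks that do not touch.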

Step 3 — Resolve contract-version drift deterministically

After the join, more than one effective contract version can survive for a single shipment. Apply an explicit precedence — highest contract_version, then latest effective_start — and reduce to exactly one row per shipment so the result is reproducible across re-runs:

def resolve_version(df: pl.DataFrame) -> pl.DataFrame:
    """Deterministic single-rate selection when windows overlap."""
    return (
        df.sort(
            ["shipment_id", "contract_version", "effective_start"],
            descending=[False, True, True],
        )
        .group_by("shipment_id", maintain_order=True)
        .first()                              # highest version, latest start wins
    )

Never default a missing min_weight/max_weight to an arbitrary value. A row with no weight break is UNVERIFIED, not zero — flag it for review rather than letting it pass as a confident match.
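
The precedence rule is easier to reason about on a handful of rows. This plain-Python sketch applies the same ordering, highest contract_version and then latest effective_start, to the candidates for one shipment. The dictionaries and the pick_rate name are illustrative, and the rates are made-up sample values.

```python
from datetime import date


def pick_rate(candidates: list) -> dict:
    """Select one candidate by (contract_version, effective_start), both descending."""
    return max(candidates, key=lambda r: (r["contract_version"], r["effective_start"]))


CANDIDATES = [
    {"contract_version": 1, "effective_start": date(2024, 1, 1), "rate_per_lb": 0.42},
    {"contract_version": 2, "effective_start": date(2024, 3, 1), "rate_per_lb": 0.39},
    {"contract_version": 2, "effective_start": date(2024, 6, 1), "rate_per_lb": 0.40},
]

winner = pick_rate(CANDIDATES)   # version 2, effective 2024-06-01
```

The result does not depend on input order as long as no two candidates share both the version and the start date. A full tie on both fields points to duplicate rows in the rate table, which is better rejected at load time than resolved here.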

Step 4 — Route the unmatched remainder with a circuit breaker

Degrade in a controlled way: tag every unmatched row for review, and trip a circuit breaker when the match ratio collapses below half the threshold, before a bad rate sheet corrupts the whole batch:

def route_unmatched(df: pl.DataFrame, min_ratio: float = 0.95) -> tuple[pl.DataFrame, dict]:
    total = df.height
    matched = df.filter(pl.col("rate_per_lb").is_not_null()).height
    ratio = matched / total if total else 0.0
    stats = {"total": total, "matched": matched, "ratio": round(ratio, 4)}

    if ratio < min_ratio * 0.5:               # catastrophic miss: do not emit garbage
        raise RuntimeError(f"circuit breaker: match ratio {ratio:.2%} far below {min_ratio:.0%}")

    tagged = df.with_columns(
        pl.when(pl.col("rate_per_lb").is_null())
        .then(pl.lit("REQUIRES_REVIEW"))
        .otherwise(pl.lit("VALIDATED"))
        .alias("audit_flag")
    )
    return tagged, stats

Rows tagged REQUIRES_REVIEW carry forward to Accessorial Charge Scoring and Threshold Tuning & Alerting, which weigh exceptions; the matcher flags, it never adjudicates the dollar amount.
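
The thresholds imply three bands, which a short worked example makes concrete. With min_ratio at 0.95 the breaker trips below 0.475. The band labels below are hypothetical names for illustration; route_unmatched itself only raises or returns stats.

```python
def classify_run(matched: int, total: int, min_ratio: float = 0.95) -> str:
    """Map a batch's match ratio onto the three bands implied by the thresholds."""
    ratio = matched / total if total else 0.0
    if ratio < min_ratio * 0.5:
        return "TRIP"        # circuit breaker: do not emit the batch
    if ratio < min_ratio:
        return "DEGRADED"    # emit with REQUIRES_REVIEW rows, and alert
    return "HEALTHY"


bands = [
    classify_run(9_700, 10_000),   # ratio 0.97
    classify_run(9_000, 10_000),   # ratio 0.90
    classify_run(4_000, 10_000),   # ratio 0.40
]
```

Note that an empty batch has ratio 0.0 and lands in the trip band, which matches how route_unmatched treats total == 0.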

Verification

Confirm each failure is closed rather than hidden. These assertions belong in the integration suite that runs on every new rate sheet:

import polars as pl

# normalize_lanes, match_lanes and resolve_version are the functions defined in Steps 1-3.

def test_zip_normalization_matches_across_formats():
    s = pl.DataFrame({"origin_zip": ["7114"], "dest_zip": ["07666-1203"]}).lazy()
    out = normalize_lanes(s).collect()
    assert out["origin_zip"][0] == "07114"
    assert out["dest_zip"][0] == "07666"

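def test_join_does_not_explode_cardinality():
    df = match_lanes("fixtures/shipments_10k.csv", "fixtures/rates.csv")
    shipments = df["shipment_id"].n_unique()
    # a few overlapping versions are tolerated, a cross-product is not
    assert df.height <= shipments * 1.05
    # resolution must leave exactly one row per shipment
    assert resolve_version(df).height == shipments

def test_overlapping_windows_pick_highest_version():
    df = match_lanes("fixtures/overlap_window.csv", "fixtures/rates_v1_v2.csv")
    picked = resolve_version(df)
    # compare each picked version with the highest candidate for the same shipment
    expected = df.group_by("shipment_id").agg(
        pl.col("contract_version").max().alias("max_version")
    )
    check = picked.join(expected, on="shipment_id")
    assert (check["contract_version"] == check["max_version"]).all()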

In production the proof is in the telemetry: a healthy run logs one lane_join_complete per batch, a stable ratio near 1.0, and a flat memory profile. A sudden drop in ratio means a rate sheet changed key format or an effective window lapsed — investigate the contract load, do not lower the threshold.
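
The memory side of that telemetry can be sampled in-process. This is a minimal sketch using the standard library's resource module, which is Unix-only. It reports the peak resident set size for the lifetime of the process, in KiB on Linux and bytes on macOS. The helper names and the 80% default are assumptions that mirror the memory budget gate described under Preventive Configuration.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024   # bytes vs KiB
    return peak / divisor


def within_memory_budget(limit_mb: float, fraction: float = 0.80) -> bool:
    """True while peak RSS stays at or under the given share of the worker limit."""
    return peak_rss_mb() <= limit_mb * fraction
```

Logging peak_rss_mb() next to lane_join_complete gives the memory profile a number to trend, and a CI job can fail when within_memory_budget returns False for the configured worker allocation.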

Preventive Configuration

Stop the regression from returning by encoding these as configuration, not tribal knowledge:

lane_matcher:
  streaming: true
  zip_canonicalization: true        # both shipment and rate sides
  min_match_ratio: 0.95             # below half this, the circuit breaker trips
  version_precedence: [contract_version_desc, effective_start_desc]
  on_missing_weight_break: REQUIRES_REVIEW   # never silently default
  memory_limit_mb: 4096
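
Configuration only prevents regressions if invalid values are rejected at load. Here is one way to sketch that with a frozen dataclass, assuming the YAML above has already been parsed into a dict by whatever YAML library you use. LaneMatcherConfig and load_config are hypothetical names, and a pydantic model could play the same role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneMatcherConfig:
    streaming: bool
    zip_canonicalization: bool
    min_match_ratio: float
    version_precedence: tuple
    on_missing_weight_break: str
    memory_limit_mb: int

    def __post_init__(self):
        if not 0.0 < self.min_match_ratio <= 1.0:
            raise ValueError("min_match_ratio must be in (0, 1]")
        if self.on_missing_weight_break != "REQUIRES_REVIEW":
            raise ValueError("missing weight breaks must route to REQUIRES_REVIEW")
        if self.memory_limit_mb <= 0:
            raise ValueError("memory_limit_mb must be positive")

    @property
    def breaker_ratio(self) -> float:
        """Ratio below which the circuit breaker trips (half the minimum)."""
        return self.min_match_ratio * 0.5


def load_config(raw: dict) -> LaneMatcherConfig:
    """Build the config from the parsed 'lane_matcher' section; missing keys raise TypeError."""
    section = dict(raw["lane_matcher"])
    section["version_precedence"] = tuple(section["version_precedence"])
    return LaneMatcherConfig(**section)
```

Rejecting any other value for on_missing_weight_break makes "never silently default" a load-time failure instead of a convention.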

Cardinality test in CI. Assert joined.height <= shipments.height * 1.05 on a 10% sample; a blow-up means the rate key is not unique and an effective-window or weight-break filter is missing.
Memory budget gate. Run the streaming join on a sample in CI and fail if peak RSS exceeds 80% of the worker allocation.
Shared key contract. The same canon_zip lives in the loader that ingests rates from LTL Rate Sheet Digitization and FTL Base Rate Extraction so both sides of the join always agree.
Alert on match-ratio drift. Export the route_unmatched stats and alert when ratio falls below min_match_ratio for two consecutive runs — that is a layout or contract change, not a tolerance to loosen.
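
The two-consecutive-runs rule from the last item can be sketched as a pure function over the ratio history. should_alert is a hypothetical helper; it assumes you persist the ratio from each run's stats in run order, oldest first.

```python
def should_alert(ratios: list, min_match_ratio: float = 0.95, consecutive: int = 2) -> bool:
    """True when the most recent `consecutive` runs all fell below the minimum ratio."""
    recent = ratios[-consecutive:]
    return len(recent) == consecutive and all(r < min_match_ratio for r in recent)
```

A single low run does not fire, which filters out one-off partial batches, while a sustained drop does.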

FAQ

Why does my pandas merge drop shipments that clearly have a contracted rate?

Almost always a lane-key format mismatch. merge() uses exact equality, so "7114" (zero-stripped) never equals "07114" and ZIP+4 ("07114-2207") never equals the five-digit rate key. Canonicalize both sides to a five-digit, numeric-only ZIP before joining, as in Step 1, and the dropped rows reappear.

How do I stop the join from being OOMKilled on 50M line items?

Do not call pandas.merge() on fully materialized frames. Use pl.scan_csv() and collect(streaming=True) so Polars runs the join out-of-core, and make sure the rate key is unique per effective window so the join stays O(N) instead of exploding to a many-to-many cross-product (Steps 2 and 3).

Two contract versions are effective on the same ship date — which one should win?

Make it deterministic, never incidental. Sort by contract_version descending then effective_start descending and reduce to one row per shipment_id with group_by().first() (Step 3). The highest active version wins every time, so the same input always produces the same rate on a re-run.

Should the matcher back-fill a national default when no lane matches?

No. A missing match is signal, not noise. Tag the row REQUIRES_REVIEW and let the circuit breaker trip if the overall match ratio collapses. Silently defaulting to a national rate hides both normalization bugs and genuine missing contracts, and that is exactly how an audit under-recovers without anyone noticing.

Related

Lane Matching Algorithms — the parent stage that produces the resolved lane key this matcher anchors a rate to.
Cross-Checking Billable Weight Against Actual Weight Logs — the weight-break sibling that consumes this matcher’s output.
Extracting FTL Zone-Based Pricing from Carrier PDFs — where the contracted rate table this page joins against originates.
How to Map LTL Class Rates to JSON Schemas — the structured rate store that feeds the join’s right-hand side.
Accessorial Charge Scoring — where REQUIRES_REVIEW rows are weighed for dispute.
