Accessorial Charge Taxonomy Mapping: Normalizing Carrier Surcharges for Freight Audit

Accessorial charge taxonomy mapping is the deterministic normalization layer that sits between raw carrier invoice ingestion and the downstream audit validation engine, turning a carrier’s idiosyncratic surcharge codes into a canonical taxonomy the rest of the pipeline can reason about. Carrier billing formats diverge sharply: the same liftgate fee arrives as code LG from one SCAC, ACC-LIFT from another, and only as a free-text description ("Lift Gate Service - Delivery") from a third. Without a strict, configuration-driven mapping strategy, audit pipelines raise false-positive overcharges, silently drop legitimate fees, and misroute disputes to the wrong resolution queue. This page sits inside Freight Contract Architecture & Rate Mapping and owns one job in that architecture: consuming the normalized invoice payloads emitted by ingestion and producing standardized, billability-resolved accessorial records that the validation tier can audit against the contracted schedule.

The mapper is intentionally narrow. It does not parse EDI or PDF, it does not compute base freight, and it does not talk to carriers. It reads a canonical raw-accessorial row, resolves it to an internal category, enforces the contracted monetary and weight rules, and emits an audit verdict. Everything upstream of it is ingestion; everything downstream is validation and dispute routing. Holding that boundary is what keeps the stage stateless, testable, and safe to scale horizontally.

Pipeline scope and stage boundaries

To keep the mapping stage decoupled from the tiers on either side of it, the responsibilities must be drawn explicitly. The table below is the contract the rest of the pipeline relies on.

Concern	In scope for this stage	Owned elsewhere
Code/description → category resolution	Yes	—
Contract-specific billability rules	Yes	—
Monetary cap + weight-floor enforcement	Yes	—
Audit status emission (`MATCHED`/`FLAGGED`/`UNMAPPED`)	Yes	—
Raw EDI 210/810 + PDF parsing	No	EDI 210/810 Processing, PDF Invoice Parsing with Python
Base linehaul rate resolution	No	FTL Base Rate Extraction, LTL Rate Sheet Digitization
Fuel surcharge math	No	Fuel Surcharge Formula Implementation
Weighted penalty scoring of flagged charges	No	Accessorial Charge Scoring
Alerting on the decision stream	No	Threshold Tuning & Alerting

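Violating these boundaries introduces coupling that degrades throughput and complicates version control. The mapper is a pure transformation: the same input plus the same configuration always yields the same category, billability, cap, and status, which is what makes it replayable during backfills. The one exception in the code as written is internal_accessorial_id, which Stage 4 generates with uuid.uuid4() and which therefore differs on every run.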

Prerequisites

This stage assumes the upstream tier has already done its work. Concretely, it expects:

Upstream component: a normalizer that has parsed the carrier transmission and emitted one canonical InvoiceAccessorialRaw row per billed accessorial line, each stamped with the contract_version_id resolved during ingestion. If that stamp is missing, the mapper cannot select the correct rule set — point-in-time resolution belongs upstream, not here.
Python dependencies: pandas (batch frames), pydantic v2 (config validation), and PyYAML (rule loading). The standard-library re, decimal, logging, enum, and dataclasses modules carry the rest.
Config keys: a version-controlled carrier_mappings.yaml keyed by SCAC, each entry holding contract_id, effective_date, and an ordered list of rules. The file must be validated at load time, never trusted raw.
Input data contract: a strictly typed row that preserves shipment context alongside billing data, so threshold checks (weight, zone) have the fields they need.

The relational lookup layer that backs high-concurrency reads of this taxonomy — indexing, partitioning, and COPY-based bulk loads — is covered in depth in Building an accessorial charge lookup table in Postgres. This page treats that store as a given and focuses on the mapping logic that reads from it.

Input and output field contracts

The transformation is defined by two schemas. The input preserves enough shipment context to evaluate conditional rules; the output is the audit-ready record every downstream consumer keys off.

Input — InvoiceAccessorialRaw

Field	Type	Notes
`carrier_scac`	`str`	2 to 4 letter SCAC; selects the rule set (`SCAC_ABC` in the examples is a placeholder key)
`accessorial_code`	`str`	carrier-native code (may be blank)
`accessorial_desc`	`str | None`	carrier-provided free text
`billed_amt`	`Decimal`	must be ≥ 0
`shipment_id`	`str`	shipment correlation key
`pro_number`	`str`	carrier PRO for traceability
`weight_lbs`	`Decimal | None`	required for weight-gated accessorials
`zone`	`str | None`	used by zone-conditional rules
`contract_version_id`	`str`	FK to the active rate agreement
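
The input contract can be pinned down in code before any row reaches the mapper. The sketch below is one possible shape, using a frozen standard-library dataclass. The class body, the `__post_init__` checks, and the sample values `SHP-1` and `CV-1` are illustrative assumptions, not the pipeline's actual model.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class InvoiceAccessorialRaw:
    """Illustrative row type mirroring the input table above."""
    carrier_scac: str
    accessorial_code: str
    billed_amt: Decimal
    shipment_id: str
    pro_number: str
    contract_version_id: str
    accessorial_desc: Optional[str] = None
    weight_lbs: Optional[Decimal] = None
    zone: Optional[str] = None

    def __post_init__(self) -> None:
        # The contract requires a non-negative billed amount.
        if self.billed_amt < 0:
            raise ValueError("billed_amt must be >= 0")
        # Without the version stamp the mapper cannot pick a rule set.
        if not self.contract_version_id:
            raise ValueError("contract_version_id is required")


row = InvoiceAccessorialRaw(
    carrier_scac="SCAC_ABC", accessorial_code="LG",
    billed_amt=Decimal("60.00"), shipment_id="SHP-1",
    pro_number="PRO1", contract_version_id="CV-1",
)
```

Rejecting a bad row at construction time keeps the mapper itself free of defensive checks for fields the contract already guarantees.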

Output — AuditAccessorialNormalized

Field	Type	Notes
`internal_accessorial_id`	`UUID`	stable identity for the normalized line
`taxonomy_category`	`Enum`	`LIFTGATE`, `DETENTION`, `FUEL_SURCHARGE`, `REDELIVERY`, `INSIDE_DELIVERY`, `UNKNOWN`
`is_billable`	`bool`	per contracted schedule
`max_allowable_amt`	`Decimal | None`	contracted cap, `None` if uncapped
`audit_status`	`Enum`	`MATCHED`, `FLAGGED`, `UNMAPPED`
`mapping_rule_id`	`str | None`	traceability for audit logs

Note that FUEL_SURCHARGE appears in the taxonomy only so a fuel line is classified, not recomputed — the actual percentage math lives in Fuel Surcharge Formula Implementation. The mapper’s job is to tag the line and hand off.
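
The output table describes internal_accessorial_id as a stable identity. If that identity needs to survive a replay, one option is a name-based UUID derived from fields that identify the billed line. The sketch below assumes a hypothetical namespace string and a hypothetical `line_no` (the position of the line on the invoice), neither of which appears in the contract above.

```python
import uuid

# Hypothetical namespace for this pipeline; any fixed UUID would work.
ACCESSORIAL_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "freight-audit.example")


def stable_accessorial_id(shipment_id: str, pro_number: str,
                          carrier_scac: str, accessorial_code: str,
                          line_no: int) -> uuid.UUID:
    """Derive the same UUID for the same billed line on every replay."""
    name = "|".join([
        shipment_id,
        pro_number,
        carrier_scac.strip().upper(),
        accessorial_code.strip().upper(),
        str(line_no),  # assumed: position of the line on the invoice
    ])
    return uuid.uuid5(ACCESSORIAL_NS, name)


first = stable_accessorial_id("SHP-1", "PRO1", "scac_abc", "LG", 1)
replay = stable_accessorial_id("SHP-1", "PRO1", "SCAC_ABC", "lg", 1)
other = stable_accessorial_id("SHP-1", "PRO1", "SCAC_ABC", "LG", 2)
```

Here `first` and `replay` are equal because the key fields are normalized before hashing, while `other` differs because it names a different line.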

Step-by-step implementation

The mapper is built in four stages: validate and load the rule configuration, compile it into an in-memory match structure, resolve each row, then emit the verdict. Each stage is a separable unit so it can be unit-tested in isolation.

Stage 1 — Validate the rule configuration

A YAML-driven configuration lets transportation ops teams adjust carrier rules, effective dates, and thresholds without a code deploy. It must be validated at load time so a typo cannot reach the matching loop.

carrier_mappings:
  SCAC_ABC:
    contract_id: "CTR-2024-089"
    effective_date: "2024-01-01"
    rules:
      - carrier_code: "LG"
        carrier_desc_pattern: "(?i).*lift\\s*gate.*"
        internal_category: "LIFTGATE"
        billable: true
        max_amt: 75.00
        requires_weight_threshold: false
      - carrier_code: "DET"
        carrier_desc_pattern: "(?i).*detention.*"
        internal_category: "DETENTION"
        billable: true
        max_amt: 120.00
        requires_weight_threshold: true
        min_weight_lbs: 500
      - carrier_code: "FSC"
        carrier_desc_pattern: "(?i).*fuel\\s*surcharge.*"
        internal_category: "FUEL_SURCHARGE"
        billable: true
        max_amt: null
        requires_weight_threshold: false

Validate it with Pydantic so type coercion, regex readiness, and category membership are guaranteed before any ETL runs:

from decimal import Decimal
from datetime import date
from typing import List, Optional

from pydantic import BaseModel, field_validator
import re


class RuleConfig(BaseModel):
    carrier_code: str
    carrier_desc_pattern: str
    internal_category: str
    billable: bool
    max_amt: Optional[Decimal] = None
    requires_weight_threshold: bool = False
    min_weight_lbs: Optional[Decimal] = None

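    @field_validator("carrier_desc_pattern")
    @classmethod
    def pattern_must_compile(cls, v: str) -> str:
        # Reject un-compilable regex at load time, not mid-batch.
        try:
            re.compile(v)
        except re.error as exc:
            # Re-raise as ValueError so Pydantic reports a ValidationError
            # instead of letting a bare re.error escape the validator.
            raise ValueError(f"Invalid regex {v!r}: {exc}") from exc
        return v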

    @field_validator("internal_category")
    @classmethod
    def category_must_be_known(cls, v: str) -> str:
        allowed = {
            "LIFTGATE", "DETENTION", "FUEL_SURCHARGE",
            "REDELIVERY", "INSIDE_DELIVERY",
        }
        if v not in allowed:
            raise ValueError(f"Unknown taxonomy category: {v}")
        return v


class CarrierConfig(BaseModel):
    contract_id: str
    effective_date: date
    rules: List[RuleConfig]

Common mistake: validating only the YAML shape and not the regex. An un-compilable pattern that passes a shallow schema check will raise re.error deep inside the batch loop, killing a whole shipment’s worth of lines instead of one bad rule. Compile every pattern at load time.
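
Per-field validators cannot express one check the later stages depend on: a rule that sets requires_weight_threshold needs a min_weight_lbs to compare against. A minimal sketch of a load-time lint over the parsed YAML dictionary follows. The function name `lint_rules` and its message format are hypothetical, and it is meant to complement the Pydantic models rather than replace them.

```python
import re
from typing import Dict, List


def lint_rules(config: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the file is clean."""
    problems: List[str] = []
    for scac, carrier in config.get("carrier_mappings", {}).items():
        for idx, rule in enumerate(carrier.get("rules", [])):
            where = f"{scac} rule #{idx}"
            try:
                re.compile(rule.get("carrier_desc_pattern", ""))
            except re.error as exc:
                problems.append(f"{where}: bad regex ({exc})")
            # A weight-gated rule with no floor can never be evaluated.
            if (rule.get("requires_weight_threshold")
                    and rule.get("min_weight_lbs") is None):
                problems.append(f"{where}: min_weight_lbs missing")
    return problems


sample = {"carrier_mappings": {"SCAC_ABC": {"rules": [
    {"carrier_desc_pattern": r"(?i)lift\s*gate",
     "requires_weight_threshold": False},
    {"carrier_desc_pattern": "(?i)detention(",   # unbalanced parenthesis
     "requires_weight_threshold": True},         # and no weight floor
]}}}
issues = lint_rules(sample)
```

Collecting every problem before failing lets ops fix a bad file in one pass instead of one error per deploy attempt.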

Stage 2 — Compile rules into match structures

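Compiling each regex once per rule (not once per row) is usually the largest single lever on batch throughput. A naive implementation that calls re.search(pattern_string, desc) per row pays a lookup in the re module's internal pattern cache on every call, and recompiles the pattern whenever it has been evicted from that bounded cache.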

import logging
import re
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Dict, List, Optional

logger = logging.getLogger("freight_audit.taxonomy_mapper")


class TaxonomyCategory(str, Enum):
    LIFTGATE = "LIFTGATE"
    DETENTION = "DETENTION"
    FUEL_SURCHARGE = "FUEL_SURCHARGE"
    REDELIVERY = "REDELIVERY"
    INSIDE_DELIVERY = "INSIDE_DELIVERY"
    UNKNOWN = "UNKNOWN"


class AuditStatus(str, Enum):
    MATCHED = "MATCHED"
    FLAGGED = "FLAGGED"
    UNMAPPED = "UNMAPPED"


@dataclass(frozen=True)
class MappingRule:
    carrier_code: str
    compiled_pattern: re.Pattern
    internal_category: TaxonomyCategory
    billable: bool
    max_amt: Optional[Decimal]
    requires_weight_threshold: bool
    min_weight_lbs: Optional[Decimal] = None


def compile_ruleset(config: Dict) -> Dict[str, List[MappingRule]]:
    compiled: Dict[str, List[MappingRule]] = {}
    for scac, contract_data in config.get("carrier_mappings", {}).items():
        rules: List[MappingRule] = []
        for rule in contract_data.get("rules", []):
            try:
                rules.append(MappingRule(
                    carrier_code=rule["carrier_code"].strip().upper(),
                    compiled_pattern=re.compile(rule["carrier_desc_pattern"]),
                    internal_category=TaxonomyCategory(rule["internal_category"]),
                    billable=bool(rule["billable"]),
                    max_amt=(Decimal(str(rule["max_amt"]))
                             if rule.get("max_amt") is not None else None),
                    requires_weight_threshold=bool(
                        rule.get("requires_weight_threshold", False)),
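                    min_weight_lbs=(Decimal(str(rule["min_weight_lbs"]))
                                    if rule.get("min_weight_lbs") is not None
                                    else None),
                ))
            except (re.error, KeyError, ValueError, ArithmeticError) as exc:
                # Decimal() raises InvalidOperation (an ArithmeticError), not
                # ValueError, so it must be listed explicitly.
                # Skip the bad rule, keep the carrier; never fail the batch.
                logger.error("Invalid rule for %s: %s", scac, exc)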
        compiled[scac.strip().upper()] = rules
    return compiled

Common mistake: keying the compiled map on the raw SCAC string. Carrier feeds emit the same carrier as scac_abc, Scac_Abc, and SCAC_ABC with stray whitespace; normalize with .strip().upper() on both the config key and the lookup key, or a casing or padding mismatch silently routes every line to UNMAPPED.
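
The effect of the normalization is easy to see in isolation. In the sketch below, `normalize_scac` is a hypothetical helper name and the rule list is a stand-in string rather than real compiled rules.

```python
def normalize_scac(raw: object) -> str:
    """Canonical lookup key: trimmed and upper-cased; None becomes ''."""
    return "" if raw is None else str(raw).strip().upper()


# Config side: key the compiled map on the normalized SCAC.
rules_by_scac = {normalize_scac("SCAC_ABC"): ["<compiled rules>"]}

feed_values = ["scac_abc", " SCAC_ABC ", "Scac_Abc"]

# Lookup side: raw feed values miss, normalized values hit.
raw_hits = [value in rules_by_scac for value in feed_values]
normalized_hits = [normalize_scac(value) in rules_by_scac
                   for value in feed_values]
```

Every raw lookup misses and every normalized lookup hits, which is the difference between a carrier mapping cleanly and a carrier falling entirely into UNMAPPED.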

Stage 3 — Resolve each row

Resolution is code-first, description-second: prefer the carrier-native code when present, fall back to the compiled description pattern. The first rule that both matches and passes its threshold wins, so rule order in the YAML is meaningful.

import pandas as pd


def _passes_threshold(row: pd.Series, rule: MappingRule) -> bool:
    if not rule.requires_weight_threshold:
        return True
    weight = row.get("weight_lbs")
    if weight is None or pd.isna(weight) or rule.min_weight_lbs is None:
        # Missing weight on a weight-gated accessorial is NOT a pass.
        # Neither is a gated rule with no configured floor, which would
        # otherwise raise TypeError on the comparison below.
        return False
    return Decimal(str(weight)) >= rule.min_weight_lbs


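def _text(value: object) -> str:
    # None/NaN must become "", not the literal strings "None" or "nan".
    return "" if value is None or pd.isna(value) else str(value).strip()


def _find_rule(row: pd.Series,
               carrier_rules: List[MappingRule]) -> Optional[MappingRule]:
    raw_code = _text(row.get("accessorial_code")).upper()
    raw_desc = _text(row.get("accessorial_desc"))
    # Pass 1: the carrier-native code takes precedence when present.
    for rule in carrier_rules:
        code_hit = bool(raw_code) and rule.carrier_code == raw_code
        if code_hit and _passes_threshold(row, rule):
            return rule
    # Pass 2: fall back to the compiled description pattern.
    for rule in carrier_rules:
        desc_hit = bool(raw_desc) and rule.compiled_pattern.search(raw_desc)
        if desc_hit and _passes_threshold(row, rule):
            return rule
    return None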

Common mistake: treating an empty code as a wildcard. If a rule is saved with an empty carrier_code, rule.carrier_code == raw_code is true for every blank-code line unless the bool(raw_code) guard is in place, and unrelated charges get mapped to that rule. Always require a non-empty value before comparing.
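
A stripped-down illustration of the pitfall, using a hypothetical two-field `CodeRule` in place of the full MappingRule and a `guard` flag that exists only to show both behaviors side by side:

```python
from typing import NamedTuple, Optional, Sequence


class CodeRule(NamedTuple):
    carrier_code: str
    category: str


def match_by_code(raw_code: str, rules: Sequence[CodeRule],
                  guard: bool = True) -> Optional[str]:
    code = raw_code.strip().upper()
    if guard and not code:
        return None  # a blank code never matches anything
    for rule in rules:
        if rule.carrier_code == code:
            return rule.category
    return None


# A description-only rule that was saved with an empty carrier_code.
rules = [CodeRule("", "REDELIVERY"), CodeRule("LG", "LIFTGATE")]

unguarded = match_by_code("", rules, guard=False)  # blank line hits rule 0
guarded = match_by_code("", rules)                 # blank line matches nothing
by_code = match_by_code(" lg ", rules)             # real codes still resolve
```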

Stage 4 — Emit the audit verdict

Once a rule is found, the cap decides whether the line is clean (MATCHED) or over the contracted ceiling (FLAGGED). No match at all is UNMAPPED — never auto-approved.

import uuid
from typing import Any, Dict as TDict


def _verdict(row: pd.Series, rule: Optional[MappingRule],
             billed_amt: Decimal, scac: str) -> TDict[str, Any]:
    if rule is None:
        return {
            "internal_accessorial_id": uuid.uuid4(),
            "taxonomy_category": TaxonomyCategory.UNKNOWN.value,
            "is_billable": False,
            "max_allowable_amt": pd.NA,
            "audit_status": AuditStatus.UNMAPPED.value,
            "mapping_rule_id": None,
        }

    status = AuditStatus.MATCHED
    if rule.max_amt is not None and billed_amt > rule.max_amt:
        status = AuditStatus.FLAGGED
        logger.warning(
            "Overcharge flagged: scac=%s code=%s billed=%s cap=%s pro=%s",
            scac, rule.carrier_code, billed_amt, rule.max_amt,
            row.get("pro_number"),
        )

    return {
        "internal_accessorial_id": uuid.uuid4(),
        "taxonomy_category": rule.internal_category.value,
        "is_billable": rule.billable,
        "max_allowable_amt": rule.max_amt,
        "audit_status": status.value,
        "mapping_rule_id": f"{scac}_{rule.carrier_code}",
    }


class AccessorialTaxonomyMapper:
    def __init__(self, config: Dict):
        self._rules = compile_ruleset(config)

    def map_batch(self, df: pd.DataFrame) -> pd.DataFrame:
        if df.empty:
            return df.assign(
                internal_accessorial_id=pd.Series(dtype="object"),
                taxonomy_category=TaxonomyCategory.UNKNOWN.value,
                is_billable=False,
                max_allowable_amt=pd.NA,
                audit_status=AuditStatus.UNMAPPED.value,
                mapping_rule_id=pd.Series(dtype="object"),
            )

        records = []
        for _, row in df.iterrows():
            scac = str(row.get("carrier_scac", "")).strip().upper()
            try:
                billed_amt = Decimal(str(row.get("billed_amt", "0")))
            except (ValueError, ArithmeticError):
                billed_amt = Decimal("0")
            rule = _find_rule(row, self._rules.get(scac, []))
            records.append(_verdict(row, rule, billed_amt, scac))

        return pd.concat(
            [df.reset_index(drop=True), pd.DataFrame(records)], axis=1)

Common mistake: mapping the over-cap line to a zero or capped amount in place. Preserve the original billed_amt; only attach max_allowable_amt alongside it. The downstream dispute calculator needs both numbers to compute the recoverable variance — overwriting the billed value destroys the evidence trail.
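
With both numbers preserved, the variance is a single subtraction. The sketch below assumes the recoverable amount is simply the billed amount above the cap, floored at zero; the actual dispute calculator may apply further contract terms. `recoverable_variance` is a hypothetical name.

```python
from decimal import Decimal
from typing import Optional


def recoverable_variance(billed_amt: Decimal,
                         max_allowable_amt: Optional[Decimal]) -> Decimal:
    """Amount billed above the cap; zero when uncapped or within the cap."""
    if max_allowable_amt is None:
        return Decimal("0.00")
    return max(billed_amt - max_allowable_amt, Decimal("0.00"))


over_cap = recoverable_variance(Decimal("90.00"), Decimal("75.00"))
within_cap = recoverable_variance(Decimal("60.00"), Decimal("75.00"))
uncapped = recoverable_variance(Decimal("310.00"), None)
```

Had the mapper overwritten billed_amt with the cap, the first call would return zero and the overcharge would be unrecoverable from the record alone.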

Validation and testing

Because the mapper is pure, it is exhaustively unit-testable with small frames. The fixtures that matter are the awkward inputs: blank codes, description-only lines, sub-threshold weights, and over-cap amounts.

import pandas as pd
import pytest
from decimal import Decimal

CONFIG = {
    "carrier_mappings": {
        "SCAC_ABC": {
            "contract_id": "CTR-2024-089",
            "effective_date": "2024-01-01",
            "rules": [
                {"carrier_code": "LG",
                 "carrier_desc_pattern": r"(?i).*lift\s*gate.*",
                 "internal_category": "LIFTGATE", "billable": True,
                 "max_amt": 75.00, "requires_weight_threshold": False},
                {"carrier_code": "DET",
                 "carrier_desc_pattern": r"(?i).*detention.*",
                 "internal_category": "DETENTION", "billable": True,
                 "max_amt": 120.00, "requires_weight_threshold": True,
                 "min_weight_lbs": 500},
            ],
        }
    }
}


@pytest.fixture
def mapper():
    return AccessorialTaxonomyMapper(CONFIG)


def _row(**kw):
    base = {"carrier_scac": "SCAC_ABC", "accessorial_code": "",
            "accessorial_desc": "", "billed_amt": "0",
            "weight_lbs": None, "pro_number": "PRO1"}
    base.update(kw)
    return base


@pytest.mark.parametrize("row,expected", [
    (_row(accessorial_code="LG", billed_amt="60.00"), "MATCHED"),
    (_row(accessorial_code="LG", billed_amt="90.00"), "FLAGGED"),
    (_row(accessorial_desc="Lift Gate Delivery", billed_amt="50"), "MATCHED"),
    (_row(accessorial_code="DET", billed_amt="100",
          weight_lbs="600"), "MATCHED"),
    (_row(accessorial_code="DET", billed_amt="100",
          weight_lbs="300"), "UNMAPPED"),   # below weight floor
    (_row(accessorial_code="DET", billed_amt="100",
          weight_lbs=None), "UNMAPPED"),    # missing weight
    (_row(accessorial_code="ZZZ", billed_amt="40"), "UNMAPPED"),
])
def test_audit_status(mapper, row, expected):
    out = mapper.map_batch(pd.DataFrame([row]))
    assert out.iloc[0]["audit_status"] == expected


def test_empty_frame_is_safe(mapper):
    out = mapper.map_batch(pd.DataFrame())
    assert out.empty

The parametrized table is the heart of the suite: every branch of the decision tree (code match, description fallback, weight pass, weight fail, missing weight, cap breach, no match) gets one explicit row. When a carrier reports a new quirk, you add a row, watch it fail, then adjust the rule — never the matching code.

Performance and tuning

For the per-shipment batches typical of audit pipelines (tens to low thousands of accessorial lines), the iterrows loop above is more than fast enough once regex is precompiled. The tuning levers that matter at higher volume:

Group by SCAC before iterating. Most batches mix a handful of carriers. Sorting or grouping the frame on carrier_scac keeps the active rule list small and cache-warm rather than re-fetching self._rules.get(scac) per row.
Batch size. Process frames of 5k–20k rows. Below that, per-call overhead dominates; above it, the intermediate records list inflates peak memory without throughput gain. Tune against your worker’s memory ceiling, not a fixed number.
Memory footprint. The mapper holds two copies of the frame briefly during pd.concat. For very wide upstream frames, project down to only the nine input-contract columns before mapping and re-join the rest afterward.
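Regex cost. Anchor patterns where possible and avoid unbounded .*.* constructs; a greedy nested wildcard on a long free-text description is where catastrophic backtracking bites. Because the mapper calls search(), a leading or trailing .* adds scanning work without changing what matches, so it can be dropped. Keep patterns specific to the surcharge phrase.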
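
The grouping lever can be sketched without pandas. The helper below buckets rows by normalized SCAC and keeps each row's original position so results can be written back in input order. `group_by_scac` and the second carrier key `SCAC_XYZ` are hypothetical; on a DataFrame, a groupby on carrier_scac would play the same role.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_scac(rows: Iterable[dict]) -> Dict[str, List[Tuple[int, dict]]]:
    """Bucket rows by normalized SCAC, remembering each original position."""
    groups: Dict[str, List[Tuple[int, dict]]] = defaultdict(list)
    for pos, row in enumerate(rows):
        scac = str(row.get("carrier_scac", "")).strip().upper()
        groups[scac].append((pos, row))
    return dict(groups)


rows = [
    {"carrier_scac": "scac_abc", "accessorial_code": "LG"},
    {"carrier_scac": "SCAC_XYZ", "accessorial_code": "DET"},
    {"carrier_scac": "SCAC_ABC ", "accessorial_code": "FSC"},
]
groups = group_by_scac(rows)
positions = {scac: [pos for pos, _ in items] for scac, items in groups.items()}
```

Each carrier's rule list is then fetched once per group rather than once per row.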

This stage is CPU-light and embarrassingly parallel — partition the input frame by SCAC across a worker pool if you genuinely saturate a core. It rarely becomes the pipeline bottleneck; ingestion and the Postgres lookup almost always do first.

Failure modes

Production mapping fails quietly far more often than it crashes. The named scenarios below each carry a root cause, a diagnostic, and a resolution.

1. Description-only carrier floods UNMAPPED. A SCAC that bills no native codes (every line is free text) matches nothing if its rules only key on carrier_code.

# Diagnostic: what fraction of a carrier's lines fell through?
unmapped = out[out["audit_status"] == "UNMAPPED"]
print(unmapped.groupby("carrier_scac").size())

Resolution: ensure each rule for that carrier carries a carrier_desc_pattern, and confirm Stage 3’s desc_hit branch is reached (it requires a non-empty raw_desc).

2. Missing weight silently suppresses detention. A weight-gated accessorial whose row arrives with weight_lbs = None returns UNMAPPED, which can look like a config gap when it is really an upstream data gap.

# `out` already carries the input columns, so filter it directly instead of
# aligning against df.index (map_batch resets the index).
missing_weight = out[(out["audit_status"] == "UNMAPPED")
                     & out["weight_lbs"].isna()]
print(f"{len(missing_weight)} unmapped lines have no weight")

Resolution: fix the ingestion mapping so weight is populated; weight-floor logic is closely related to Weight & Zone Cross-Validation, which is the canonical place these checks are reconciled.

3. Stale contract_version_id mismatches the cap. If a row carries a version that predates a cap change, MATCHED/FLAGGED verdicts use the wrong ceiling. The mapper trusts the stamp; it does not re-resolve it.

Resolution: this is an upstream resolution bug, not a mapping bug — verify the ingestion tier pinned the correct snapshot before the row reached this stage.

4. Currency strings break Decimal coercion. A billed_amt of "$75.00" or "75,00" raises in Decimal(str(...)). The mapper defends with a fallback to Decimal("0"), but a zeroed amount can mask a real overcharge.

from decimal import Decimal
for bad in ["$75.00", "75,00", "  "]:
    try:
        Decimal(bad)
    except Exception as exc:
        print(repr(bad), "->", type(exc).__name__)

Resolution: normalize currency formatting in ingestion and route un-coercible amounts to a dead-letter table rather than zeroing them.
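
One way to implement that resolution is a parser that returns None instead of a zero, so the caller can send the row to the dead-letter table. This sketch assumes US-style amounts (optional leading "$", optional thousands commas) and deliberately refuses decimal-comma values such as "75,00" as ambiguous. `parse_amount` is a hypothetical helper.

```python
import re
from decimal import Decimal
from typing import Optional

# Plain digits, or US-style thousands groups, with an optional decimal part.
_AMOUNT_RE = re.compile(r"(?:\d+|\d{1,3}(?:,\d{3})+)(?:\.\d+)?")


def parse_amount(raw: object) -> Optional[Decimal]:
    """Return a Decimal, or None when the value belongs in the dead-letter table."""
    if raw is None:
        return None
    text = str(raw).strip().lstrip("$").strip()
    if not _AMOUNT_RE.fullmatch(text):
        return None
    return Decimal(text.replace(",", ""))


parsed = {raw: parse_amount(raw)
          for raw in ["$75.00", "1,250.50", "75", "75,00", "  ", "-5"]}
```

The first three inputs parse; the last three come back as None and never reach the cap comparison as a misleading zero.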

5. Rule-order shadowing. A broad pattern placed above a specific one (.*delivery.* before .*inside\s*delivery.*) captures lines the specific rule should own, since the first match wins.

Resolution: order rules most-specific-first within each carrier, and add a parametrized test asserting the specific category is returned for the ambiguous description.
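
The shadowing effect can be reproduced with two patterns and a first-match loop. In this sketch the broad pattern is assumed to map to REDELIVERY purely for illustration, and `first_match` is a simplified stand-in for the mapper's description pass.

```python
import re
from typing import List, Optional, Pattern, Tuple

Rule = Tuple[str, Pattern[str]]


def first_match(desc: str, rules: List[Rule]) -> Optional[str]:
    """Return the category of the first rule whose pattern matches."""
    for category, pattern in rules:
        if pattern.search(desc):
            return category
    return None


broad: Rule = ("REDELIVERY", re.compile(r"(?i)delivery"))
specific: Rule = ("INSIDE_DELIVERY", re.compile(r"(?i)inside\s*delivery"))

desc = "Inside Delivery - 2nd floor"
shadowed = first_match(desc, [broad, specific])  # broad rule wins
correct = first_match(desc, [specific, broad])   # specific rule wins
```

A regression test only needs to assert that the ambiguous description resolves to INSIDE_DELIVERY under the rule order that ships.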

Integration points

The mapper emits a single normalized frame; downstream consumers partition it by audit_status and route accordingly.

Output lane	Routed to	Field contract relied on
`MATCHED`	validation engine for line-item reconciliation	`taxonomy_category`, `is_billable`, `max_allowable_amt`
`FLAGGED`	dispute queue and Accessorial Charge Scoring for weighted penalty assignment	`max_allowable_amt`, original `billed_amt`, `mapping_rule_id`
`UNMAPPED`	config-gap alert via Threshold Tuning & Alerting	`carrier_scac`, `accessorial_code`, `accessorial_desc`
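
The routing in the table can be sketched as a partition over audit_status. The lane names below are hypothetical placeholders for the real consumers, and sending an unrecognized status to the UNMAPPED lane is an assumption chosen so that nothing is approved by default.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical lane names standing in for the downstream consumers.
LANES = {
    "MATCHED": "validation_engine",
    "FLAGGED": "dispute_queue",
    "UNMAPPED": "config_gap_alert",
}


def route(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Partition normalized records into one list per output lane."""
    lanes: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        # Assumed policy: an unexpected status goes to the UNMAPPED lane.
        lane = LANES.get(rec.get("audit_status"), LANES["UNMAPPED"])
        lanes[lane].append(rec)
    return dict(lanes)


routed = route([
    {"pro_number": "PRO1", "audit_status": "MATCHED"},
    {"pro_number": "PRO2", "audit_status": "FLAGGED"},
    {"pro_number": "PRO3", "audit_status": "UNMAPPED"},
    {"pro_number": "PRO4", "audit_status": None},
])
```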

The mapping_rule_id field is the load-bearing traceability link: every flagged dispute can be traced back to the exact YAML rule and contract that produced the verdict, which is what makes the audit defensible during carrier reconciliation. Records failing schema coercion are isolated to a dlq_accessorial_mapping table with the original payload and traceback, and every FLAGGED/UNMAPPED line emits a structured JSON log carrying carrier_scac, pro_number, mapping_rule_id, and the failure reason for ops triage.
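
A minimal sketch of such a log line, assuming one JSON object per record and a free-text `reason` supplied by the caller. `triage_payload` is a hypothetical helper, and the exact field set beyond the four named above is an assumption.

```python
import json
import logging

logger = logging.getLogger("freight_audit.taxonomy_mapper")


def triage_payload(record: dict, reason: str) -> str:
    """Serialize the triage fields as a single JSON line."""
    return json.dumps({
        "carrier_scac": record.get("carrier_scac"),
        "pro_number": record.get("pro_number"),
        "mapping_rule_id": record.get("mapping_rule_id"),
        "audit_status": record.get("audit_status"),
        "reason": reason,
    }, sort_keys=True)


line = triage_payload(
    {"carrier_scac": "SCAC_ABC", "pro_number": "PRO1",
     "mapping_rule_id": "SCAC_ABC_LG", "audit_status": "FLAGGED"},
    reason="billed amount exceeds contracted cap",
)
logger.warning(line)
```

Sorting the keys keeps the output byte-stable, which makes log lines easy to diff between a backfill and the original run.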

Topics in this section

Building an accessorial charge lookup table in Postgres — production hardening for the relational store behind this taxonomy: schema design, partitioning, COPY-based bulk loads with memory bounds, CI gating, and debugging silent code-mapping mismatches.

Freight Contract Architecture & Rate Mapping — the contract architecture that resolves the contract_version_id every accessorial verdict here depends on.
Fuel Surcharge Formula Implementation — recomputes the FUEL_SURCHARGE lines this stage only classifies.
FTL Base Rate Extraction — resolves the base linehaul that accessorials are billed on top of.
Accessorial Charge Scoring — assigns weighted penalties to the FLAGGED lines this stage emits.
Threshold Tuning & Alerting — turns the UNMAPPED and FLAGGED decision stream into actionable alerts.
