The Developer's Guide to Trademark Data: Fields, Formats, and Where to Get It

Trademark data spans 50+ fields per record across 200+ offices worldwide. Learn what records contain, where to source them, and how to query them via API.

A single trademark record can contain more than 50 fields. The USPTO alone holds 14.7 million of them. Globally, that number exceeds 150 million records spread across hundreds of offices, each with its own schema, status codes, and update cadence. This is one of the largest under-explored public datasets available, and most developers have never touched it.

This guide covers what trademark data contains, where it comes from, and how to work with it programmatically. If you're building legaltech, compliance tools, e-commerce platforms, or brand protection features, understanding its structure before you start will save months of rework.

What Trademark Data Actually Contains

Most developers expect trademark data to be simple: a name, a date, a status. In practice, a trademark record is a structured legal document with entity relationships, classification hierarchies, and a full prosecution timeline. Here is what a typical record looks like under the hood.

Core Fields

The basics include the mark text (the word or phrase being trademarked), the mark type (word mark, design mark, sound mark, or a combination), filing and registration dates, serial and registration numbers, and the current status. At the USPTO, a standard character mark and a word mark filed in special form (stylized or combined with a design) are not equivalent: a standard character claim protects the text in any stylization, while a special form mark protects only the specific visual presentation shown in the drawing. The distinction matters when you're building search logic.

Entity Data

Every trademark has an owner (called the "applicant" before registration), often an attorney of record, and a correspondent for office communications. Each entity includes name, address, entity type (individual, corporation, LLC, partnership), and for US filings, state or country of incorporation.

The entity resolution problem shows up immediately: the same company may appear as "Apple Inc." at the USPTO, "APPLE INC." at EUIPO, and "Apple Inc" (no period) at CIPO. Across offices, there is no shared identifier. If you're aggregating data, you need to resolve these yourself or use a provider that has already done the work.

Classification Data

Trademarks are organized by Nice classes, the international system for categorizing goods and services into 45 classes. Class 9 covers software and electronics. Class 25 covers clothing. Class 43 covers restaurant services. Every trademark registration specifies at least one class and includes a goods/services description detailing exactly what the mark covers within that class.

The specificity matters. The TMClass database and USPTO ID Manual together contain roughly 96,000 accepted goods/services terms. "Computer software" is not granular enough for a modern filing; the description might read "downloadable computer software for managing cryptocurrency wallets." When building clearance or monitoring tools, matching at the class level is a starting point. Matching at the goods/services description level is where the real signal is.
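To make the class-versus-description distinction concrete, here is a minimal sketch that compares two records first by shared Nice class and then by token overlap (Jaccard similarity) between their descriptions. The ClassEntry shape, the stop-word list, and every function name are assumptions made for illustration; a production matcher would more likely map descriptions onto accepted ID Manual or TMClass terms than compare raw tokens.

```typescript
// Hypothetical shape for one class entry on a trademark record.
interface ClassEntry {
  niceClass: number;
  description: string;
}

// Words too generic to carry signal in a description (illustrative list).
const STOP_WORDS: string[] = ["for", "of", "and", "the", "a", "in", "to", "namely"];

// Lowercase, strip punctuation, and drop stop words.
function tokenize(description: string): string[] {
  return description
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((word: string) => word.length > 0 && STOP_WORDS.indexOf(word) === -1);
}

function unique(items: string[]): string[] {
  return items.filter((item: string, index: number) => items.indexOf(item) === index);
}

// Jaccard similarity: shared tokens divided by all distinct tokens.
function descriptionSimilarity(a: string, b: string): number {
  const tokensA = unique(tokenize(a));
  const tokensB = unique(tokenize(b));
  const shared = tokensA.filter((token: string) => tokensB.indexOf(token) !== -1).length;
  const total = tokensA.length + tokensB.length - shared;
  return total === 0 ? 0 : shared / total;
}

// Class overlap first, then the best description score within shared classes.
function compareGoods(
  a: ClassEntry[],
  b: ClassEntry[]
): { sharedClasses: number[]; bestScore: number } {
  const sharedClasses: number[] = [];
  let bestScore = 0;
  for (const left of a) {
    for (const right of b) {
      if (left.niceClass !== right.niceClass) continue;
      if (sharedClasses.indexOf(left.niceClass) === -1) sharedClasses.push(left.niceClass);
      bestScore = Math.max(bestScore, descriptionSimilarity(left.description, right.description));
    }
  }
  return { sharedClasses, bestScore };
}
```

Two Class 9 marks, one for cryptocurrency wallet software and one for sunglasses, both report a shared class here, but their description score is 0. That gap is the signal a class-only match hides.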

Design marks add another classification layer: Vienna codes, a hierarchical system for categorizing visual elements. A logo containing a stylized apple might be coded as 05.07.13 (fruits, apples). These codes let you search for marks with similar figurative elements without computer vision.
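Because Vienna codes are hierarchical (category, then division, then section), a coarse design-similarity check can be sketched as prefix comparison. The function names below are hypothetical, and the codes in the usage note are placeholders chosen only to show the mechanics; look up real meanings in the published Vienna Classification.

```typescript
// Parse a Vienna code such as "26.04.02" into its numeric levels:
// [category, division, section].
function parseViennaCode(code: string): number[] {
  const parts = code.split(".").map((part: string) => parseInt(part, 10));
  if (parts.length < 1 || parts.length > 3 || parts.some((n: number) => isNaN(n))) {
    throw new Error("Invalid Vienna code: " + code);
  }
  return parts;
}

// Number of leading levels two codes share (0 to 3).
function sharedDepth(a: string, b: string): number {
  const left = parseViennaCode(a);
  const right = parseViennaCode(b);
  let depth = 0;
  while (depth < left.length && depth < right.length && left[depth] === right[depth]) {
    depth++;
  }
  return depth;
}

// Deepest overlap between any pair of codes on two marks.
// 3 = same section, 2 = same division, 1 = same category, 0 = unrelated.
function designOverlap(codesA: string[], codesB: string[]): number {
  let best = 0;
  for (const a of codesA) {
    for (const b of codesB) {
      best = Math.max(best, sharedDepth(a, b));
    }
  }
  return best;
}
```

With placeholder codes, sharedDepth("26.04.02", "26.04.05") returns 2 (same division, different section), while sharedDepth("26.04.02", "05.07.13") returns 0. Ranking candidates by this depth gives a rough first pass before any human review.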

Prosecution History

The prosecution history is the event timeline from application to registration (or abandonment). It includes every interaction between the applicant and the trademark office: filing receipts, examination results, office actions (requests for clarification or amendment), responses, publication for opposition, the opposition window (30 days at the USPTO, three months at the EUIPO), and finally registration or abandonment.

For developers, prosecution history is where you find actionable signals. A trademark that received an office action citing a likelihood of confusion with your client's mark is relevant intelligence. A mark that was abandoned during the opposition period tells a different story than one that was abandoned for failure to respond to an office action.
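As a rough illustration of reading those signals, the sketch below labels an abandonment by the event that came immediately before it. The event vocabulary is invented for this example (real offices publish their own event codes), and looking only at the preceding event is a deliberate simplification.

```typescript
// Hypothetical event vocabulary; real offices publish their own event codes.
type ProsecutionEventType =
  | "filed"
  | "office_action"
  | "response"
  | "published"
  | "opposition_filed"
  | "registered"
  | "abandoned";

interface ProsecutionEvent {
  date: string; // ISO 8601, e.g. "2023-06-01"
  type: ProsecutionEventType;
}

type AbandonmentContext =
  | "not_abandoned"
  | "after_office_action"
  | "after_opposition"
  | "other";

function classifyAbandonment(events: ProsecutionEvent[]): AbandonmentContext {
  // ISO 8601 dates sort correctly as plain strings.
  const ordered = events
    .slice()
    .sort((a, b) => (a.date < b.date ? -1 : a.date > b.date ? 1 : 0));
  const last = ordered[ordered.length - 1];
  if (last === undefined || last.type !== "abandoned") return "not_abandoned";

  // The event immediately before the abandonment supplies the context.
  const previous = ordered[ordered.length - 2];
  if (previous === undefined) return "other";
  if (previous.type === "office_action") return "after_office_action";
  if (previous.type === "opposition_filed") return "after_opposition";
  return "other";
}
```

A timeline of filed, office_action, abandoned comes back as "after_office_action", while filed, published, opposition_filed, abandoned comes back as "after_opposition". A monitoring tool could route the two cases to different alerts.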

Relationships

Trademarks are not isolated records. The Madrid Protocol (the international system that lets applicants file one application designating multiple countries) creates parent-child relationships between an international registration and its national designations. Divisional applications split one filing into multiple. Priority claims link filings across offices based on the Paris Convention's six-month priority window.

These relationships mean that a single brand protection event can touch records in half a dozen offices simultaneously. If you're building a monitoring system, you need to follow these links.
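Following those links amounts to a graph traversal. The sketch below assumes relationships have already been extracted into a flat list of links between prefixed record IDs; the RecordLink shape and the link type names are hypothetical, not a format any office publishes.

```typescript
type LinkType = "madrid_designation" | "divisional" | "priority_claim";

// Hypothetical link between two records, identified by prefixed IDs.
interface RecordLink {
  from: string;
  to: string;
  type: LinkType;
}

// Breadth-first search over the links, treated as undirected, so that starting
// from a national designation still reaches the international registration
// and its sibling designations.
function relatedRecords(startId: string, links: RecordLink[]): string[] {
  const visited: string[] = [startId];
  const queue: string[] = [startId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const link of links) {
      let neighbor: string | null = null;
      if (link.from === current) neighbor = link.to;
      else if (link.to === current) neighbor = link.from;
      if (neighbor !== null && visited.indexOf(neighbor) === -1) {
        visited.push(neighbor);
        queue.push(neighbor);
      }
    }
  }
  // Everything reachable except the starting record, in a stable order.
  return visited.filter((id: string) => id !== startId).sort();
}
```

Starting from an EU designation, the traversal walks up to the international registration and back down to its other designations and its priority filing, which is the set of records a single brand protection event can touch.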

Where Trademark Data Comes From

Seven major trademark data sources publish records in formats that developers can work with. The coverage, format, and freshness vary dramatically.

USPTO is the most accessible. USPTO trademark data is available in two forms: the Trademark Status and Document Retrieval (TSDR) system serves status and documents for individual records, and bulk XML files are published daily. Since TESS was retired, the bulk data and API endpoints at developer.uspto.gov are the primary programmatic access points. The dataset includes 14.7 million records with full prosecution histories.

EUIPO publishes trademark data through its Open Data Portal. The format is different from USPTO (different XML schema, different field naming), and the update cadence is less frequent. Coverage includes roughly 2 million EU trademark records.

WIPO provides two main data sources. The Global Brand Database aggregates records from participating national and regional offices alongside international registrations filed under the Madrid Protocol. Madrid Monitor tracks the lifecycle of international registrations specifically. WIPO's own Madrid data covers approximately 1.5 million records.

CIPO (Canada), NIPO (Norway), IP Australia, and IPOS (Singapore) each publish data in their own formats, with varying levels of completeness and different update schedules.

[Figure: Trademark Records by Office (Millions)]

The Trademark Data Format Problem

Every office uses its own schema. USPTO uses XML with one structure. EUIPO uses XML with a completely different structure. WIPO provides yet another format. Each trademark data format differs in field names, date conventions, and ID schemes (USPTO serial numbers look nothing like EUIPO application numbers).
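Date conventions are the easiest of these differences to show in code. The sketch below converts three input layouts to ISO 8601. The layouts are assumptions chosen for illustration, not taken from any office's published schema, and the function name is hypothetical.

```typescript
// Assumed input layouts (illustrative only):
//   "20230315"    compact YYYYMMDD
//   "15/03/2023"  DD/MM/YYYY
//   "2023-03-15"  already ISO 8601
function toIsoDate(raw: string): string {
  const value = raw.trim();
  let year: string;
  let month: string;
  let day: string;

  if (/^\d{8}$/.test(value)) {
    year = value.slice(0, 4);
    month = value.slice(4, 6);
    day = value.slice(6, 8);
  } else if (/^\d{2}\/\d{2}\/\d{4}$/.test(value)) {
    day = value.slice(0, 2);
    month = value.slice(3, 5);
    year = value.slice(6, 10);
  } else if (/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    year = value.slice(0, 4);
    month = value.slice(5, 7);
    day = value.slice(8, 10);
  } else {
    throw new Error("Unrecognized date format: " + raw);
  }

  // Round-trip through Date to reject impossible dates such as February 30.
  const y = Number(year);
  const m = Number(month);
  const d = Number(day);
  const check = new Date(Date.UTC(y, m - 1, d));
  if (check.getUTCFullYear() !== y || check.getUTCMonth() !== m - 1 || check.getUTCDate() !== d) {
    throw new Error("Impossible calendar date: " + raw);
  }
  return year + "-" + month + "-" + day;
}
```

One caveat: a slashed date such as 03/04/2023 cannot be disambiguated from the value alone, so the day-first reading here only holds if you know the source office writes dates that way. In practice the parser should be selected per office rather than guessed per value.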

Status codes illustrate this fragmentation. The USPTO defines 167 distinct status codes. CIPO uses 10. A "LIVE" status at one office and an "ACTIVE" status at another might mean the same thing, or they might not. "LIVE" at the USPTO includes marks under examination that have not yet registered, while "ACTIVE" at another office might refer only to registered marks. Cross-office search cannot treat status as a simple string comparison.
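One way to handle this is a per-office lookup table that maps each native code to a canonical status while keeping the original value. The sketch below is a minimal version under loud assumptions: only the USPTO "700" to registered pair comes from the sample record in this guide, and EXAMPLE_OFFICE with its codes is an invented placeholder. A real table would need an entry for every code each office defines.

```typescript
// Canonical statuses, following the normalized enum this guide uses.
type CanonicalStatus =
  | "filed"
  | "examined"
  | "published"
  | "registered"
  | "abandoned"
  | "cancelled"
  | "expired";

interface NormalizedStatus {
  code: CanonicalStatus | "unknown";
  office_code: string; // original value, preserved for traceability
}

// Only USPTO "700" is grounded in this guide's sample record.
// EXAMPLE_OFFICE and its codes are made-up placeholders.
const STATUS_TABLES: Record<string, Record<string, CanonicalStatus>> = {
  USPTO: { "700": "registered" },
  EXAMPLE_OFFICE: { ACTIVE: "registered", PENDING: "filed", LAPSED: "expired" },
};

function normalizeStatus(office: string, officeCode: string): NormalizedStatus {
  const hasOwn = Object.prototype.hasOwnProperty;
  let code: CanonicalStatus | "unknown" = "unknown";
  if (hasOwn.call(STATUS_TABLES, office)) {
    const table = STATUS_TABLES[office];
    if (hasOwn.call(table, officeCode)) code = table[officeCode];
  }
  // Unmapped codes surface as "unknown" instead of being guessed.
  return { code: code, office_code: officeCode };
}

// Cross-office comparison uses the canonical code, never the raw string.
function sameCanonicalStatus(a: NormalizedStatus, b: NormalizedStatus): boolean {
  return a.code !== "unknown" && a.code === b.code;
}
```

Returning "unknown" for unmapped codes is a deliberate choice in this sketch: a code that silently falls into the wrong bucket is harder to notice than one that fails a filter.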

[Figure: Number of Distinct Status Codes by Office]

Coverage Gaps

Not every office publishes bulk data. Many smaller offices provide only a web search interface with no downloadable dataset and no API. For some jurisdictions, the options are scraping (legally and technically fraught) or commercial data providers.

Trademark applications filed worldwide in 2023 covered more than 15 million classes (WIPO IP Facts and Figures). The publicly downloadable portion is a subset. Building truly global coverage requires combining public sources with commercial feeds and, in some cases, direct relationships with national offices.

The Trademark Data Normalization Problem

Raw office data cannot be merged by concatenation. The schemas differ, the entity naming is inconsistent, and the status taxonomies are incompatible. Normalization is the engineering problem between "I have data from multiple offices" and "I can query trademarks globally."

ID Schemes

USPTO identifies trademarks by serial number (an 8-digit number like 97123456) and registration number (a 7-digit number). EUIPO uses an application number with a different format. WIPO uses international registration numbers. If you're building a unified data store, you need a synthetic ID scheme that wraps office-specific identifiers. Prefixed IDs solve this cleanly: tm_us_97123456, tm_eu_018765432, tm_wo_1234567.
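A minimal sketch of building and parsing such IDs follows. It handles only the three prefixes shown above (us, eu, wo); the function and type names are hypothetical, and the numeric-only check is an assumption that may not hold for every office.

```typescript
type OfficeCode = "us" | "eu" | "wo";

interface ParsedTrademarkId {
  office: OfficeCode;
  nativeId: string; // kept as a string so leading zeros survive
}

function buildTrademarkId(office: OfficeCode, nativeId: string): string {
  // Assumption: native identifiers are purely numeric for these three offices.
  if (!/^[0-9]+$/.test(nativeId)) {
    throw new Error("Expected a numeric office identifier, got: " + nativeId);
  }
  return "tm_" + office + "_" + nativeId;
}

function parseTrademarkId(id: string): ParsedTrademarkId {
  if (!/^tm_(us|eu|wo)_[0-9]+$/.test(id)) {
    throw new Error("Not a prefixed trademark ID: " + id);
  }
  // The regex above guarantees a two-letter office code at positions 3 and 4.
  return { office: id.slice(3, 5) as OfficeCode, nativeId: id.slice(6) };
}
```

Storing the native identifier as a string matters: parsing 018765432 as a number would drop the leading zero, and the ID would no longer round-trip back to the office's own value.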

Entity Resolution

Matching the same entity across offices is a hard problem. "APPLE INC." and "Apple Inc." are obvious matches. "APPLE INC." and "Apple Computer, Inc." (the company's former name) require historical knowledge. "Nike, Inc." and "NIKE INNOVATE C.V." are both Nike, but one is a Dutch subsidiary. Naive string matching fails, and even fuzzy matching produces false positives at scale.

Production-grade entity resolution requires normalized name matching, address matching, and known corporate hierarchy data. Some providers augment this with SEC filings and corporate registry data to build a ground-truth entity graph.
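The first of those layers, normalized name matching, can be sketched in a few lines. The suffix list below is illustrative and far from exhaustive, and the function names are hypothetical. Address matching and corporate hierarchy lookups are out of scope for this example.

```typescript
// Legal-form suffixes to strip (illustrative, far from exhaustive).
const LEGAL_SUFFIXES: string[] = [
  "inc", "incorporated", "corp", "corporation", "llc", "ltd",
  "limited", "co", "company", "gmbh", "bv", "cv",
];

function normalizeOwnerName(name: string): string {
  const tokens: string[] = name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining diacritics
    .toLowerCase()
    .replace(/\./g, "")              // "C.V." -> "cv", "Inc." -> "inc"
    .replace(/[^a-z0-9]+/g, " ")     // remaining punctuation -> spaces
    .trim()
    .split(" ");

  // Drop trailing legal-form tokens, but never empty the name entirely.
  while (tokens.length > 1 && LEGAL_SUFFIXES.indexOf(tokens[tokens.length - 1]) !== -1) {
    tokens.pop();
  }
  return tokens.join(" ");
}

// Candidate match only: equal keys still need address or hierarchy checks.
function sameOwnerCandidate(a: string, b: string): boolean {
  return normalizeOwnerName(a) === normalizeOwnerName(b);
}
```

This collapses "Apple Inc.", "APPLE INC.", and "Apple Inc" to the same key. It does not link "Apple Computer, Inc." to "Apple Inc.", nor "Nike, Inc." to "NIKE INNOVATE C.V.", which is exactly where historical names and corporate hierarchy data have to take over.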

What Normalization Looks Like

A normalized trademark record uses consistent field names, ISO 8601 dates, resolved entity references, and a unified status enum that maps office-specific codes to a canonical set:

{
  "id": "tm_us_97654321",
  "office": "USPTO",
  "mark": {
    "text": "WAVELENGTH",
    "type": "standard_character"
  },
  "status": {
    "code": "registered",
    "office_code": "700",
    "office_description": "Registered"
  },
  "filing_date": "2023-03-15",
  "registration_date": "2024-01-22",
  "owner": {
    "id": "own_8f3k2m",
    "name": "Wavelength Technologies Inc.",
    "entity_type": "corporation",
    "jurisdiction": "US-DE"
  },
  "classes": [
    {
      "nice_class": 9,
      "description": "Downloadable software for audio signal processing"
    },
    {
      "nice_class": 42,
      "description": "Software as a service for audio engineering"
    }
  ],
  "attorneys": [
    {
      "name": "Sarah Chen",
      "firm": "Chen & Associates IP Law"
    }
  ]
}

The status.code field maps to a normalized enum (filed, examined, published, registered, abandoned, cancelled, expired). The status.office_code preserves the original office-specific code for traceability. Signa's data model uses this approach: prefixed IDs (tm_, own_, cls_), consistent enums, and bidirectional mapping to office-native values.

The key design choice is to normalize without discarding. The unified status code gives you cross-office querying. The preserved office code gives you an audit trail. You need both.

Working with Trademark Data via API

Raw bulk data works for one-off research and batch analysis. For production applications, you need a trademark database API.

What to Evaluate

Coverage: How many offices does the trademark database API aggregate? A single-office API (like the USPTO's own endpoint) solves a narrow problem. A multi-office API saves you from building normalization infrastructure. Look for providers covering 50+ offices for any cross-jurisdictional use case.

Freshness: How quickly do new filings and status changes appear? The USPTO publishes daily bulk updates. An API that ingests weekly is already stale for monitoring use cases. Ask about ingestion lag.

Search capabilities: Exact text match is table stakes. Phonetic matching (does "APLE" match "APPLE"?) catches conflicts that exact match misses. Fuzzy matching handles typos and transliterations. The best APIs offer all three as configurable strategies.

Entity resolution: Are owners linked across offices? Can you search by owner and get results from every jurisdiction? This is the difference between searching trademarks and searching a normalized trademark knowledge graph.

Fetching a Trademark Record

Retrieving a trademark by its prefixed ID:

curl https://api.signa.so/v1/trademarks/tm_us_97654321 \
  -H "Authorization: Bearer sig_live_..."

import Signa from '@signa-so/sdk';

const signa = new Signa({ apiKey: 'sig_live_...' });

const trademark = await signa.trademarks.retrieve('tm_us_97654321');
console.log(trademark.mark.text);       // "WAVELENGTH"
console.log(trademark.status.code);     // "registered"
console.log(trademark.classes[0].nice_class); // 9

The response returns the full normalized record: mark text, status, owner, classes, prosecution history, and relationships. Detail endpoints return everything; summary endpoints return a lightweight subset (ID, mark text, status, owner name, classes) for fast scanning when paginating through search results.

Searching Across Offices

Cross-office search with phonetic matching is the use case that makes an API worth using over raw data. It requires both normalization (to search multiple offices in one call) and phonetic indexing (to find similar-sounding marks). You can automate these searches as part of a clearance pipeline.

curl https://api.signa.so/v1/trademarks/search \
  -H "Authorization: Bearer sig_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "mark": {
      "text": "WAVELENGHT",
      "strategies": ["exact", "phonetic", "fuzzy"]
    },
    "filters": {
      "nice_classes": [9, 42],
      "status": ["registered", "filed"]
    }
  }'

const results = await signa.trademarks.search({
  mark: {
    text: 'WAVELENGHT',
    strategies: ['exact', 'phonetic', 'fuzzy']
  },
  filters: {
    nice_classes: [9, 42],
    status: ['registered', 'filed']
  }
});

for await (const trademark of results) {
  console.log(`${trademark.mark.text} (${trademark.office}) - ${trademark.status.code}`);
}
// WAVELENGTH (USPTO) - registered
// WAVELENGTH (EUIPO) - registered
// WAVE LENGTH (CIPO) - filed
// WAVELENGHT (IPOS) - filed

Note the deliberate misspelling in the query ("WAVELENGHT"). The phonetic strategy catches the correct spelling "WAVELENGTH" because they sound identical. The fuzzy strategy catches the split-word variant "WAVE LENGTH." This is why phonetic and fuzzy matching matter for trademark clearance: exact match alone would miss the most dangerous conflicts.
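For intuition about what the phonetic and fuzzy strategies do, here is a self-contained sketch using two textbook techniques: American Soundex for phonetic keys and Levenshtein edit distance for fuzzy matching. This is not a description of how Signa or any other provider implements search; the function names and the default distance threshold of 3 are assumptions, and production systems typically use more refined phonetic algorithms.

```typescript
const SOUNDEX_CODES: Record<string, string> = {
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3", l: "4", m: "5", n: "5", r: "6",
};

// American Soundex: first letter plus three digits.
function soundex(text: string): string {
  const letters = text.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let result = letters.charAt(0).toUpperCase();
  let previous = SOUNDEX_CODES[letters.charAt(0)] || "";
  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters.charAt(i);
    if (ch === "h" || ch === "w") continue; // h and w do not separate equal codes
    const code = SOUNDEX_CODES[ch] || "";   // vowels map to "" and reset the run
    if (code !== "" && code !== previous) result += code;
    previous = code;
  }
  return (result + "000").slice(0, 4);
}

// Levenshtein distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  let previousRow: number[] = [];
  for (let j = 0; j <= b.length; j++) previousRow.push(j);
  for (let i = 1; i <= a.length; i++) {
    const currentRow: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      currentRow.push(
        Math.min(previousRow[j] + 1, currentRow[j - 1] + 1, previousRow[j - 1] + cost)
      );
    }
    previousRow = currentRow;
  }
  return previousRow[b.length];
}

// Which of the three strategies would report a match for this pair?
function matchedStrategies(query: string, candidate: string, maxDistance: number = 3): string[] {
  const q = query.toUpperCase();
  const c = candidate.toUpperCase();
  const matched: string[] = [];
  if (q === c) matched.push("exact");
  if (soundex(q) !== "" && soundex(q) === soundex(c)) matched.push("phonetic");
  if (editDistance(q, c) <= maxDistance) matched.push("fuzzy");
  return matched;
}
```

Both "WAVELENGHT" and "WAVELENGTH" reduce to the Soundex key W145, and their edit distance is 2, so the pair matches on phonetic and fuzzy but not on exact. Soundex keeps only the first few consonant sounds, so it over-matches long marks; treat it as a recall tool that feeds a stricter ranking step.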

Pagination

Cursor-based pagination is the right pattern for trademark data. The dataset changes continuously as new filings arrive and statuses update. Offset-based pagination breaks when a record is inserted or removed between page fetches: every later offset shifts, so records are duplicated or skipped. Cursor-based pagination provides stable traversal regardless of concurrent changes.

const results = await signa.trademarks.search({
  mark: { text: 'WAVE', strategies: ['exact'] }
});

// Async iteration handles pagination automatically
for await (const trademark of results) {
  process.stdout.write(`${trademark.id} `);
}
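The failure mode is easy to reproduce in memory. The sketch below is a toy model, not any API's implementation: it assumes results are ordered newest first by a numeric ID that increases with each filing, and all names are hypothetical. It fetches one page, inserts a new filing, then fetches the second page both ways.

```typescript
interface Filing {
  id: number; // assumed to increase with each new filing
}

function newestFirst(rows: Filing[]): Filing[] {
  return rows.slice().sort((a, b) => b.id - a.id);
}

// Offset pagination: skip a fixed number of rows.
function pageByOffset(rows: Filing[], offset: number, limit: number): Filing[] {
  return newestFirst(rows).slice(offset, offset + limit);
}

// Cursor pagination: continue strictly after the last row already seen.
function pageByCursor(rows: Filing[], afterId: number | null, limit: number): Filing[] {
  const ordered = newestFirst(rows);
  const remaining = afterId === null ? ordered : ordered.filter((row) => row.id < afterId);
  return remaining.slice(0, limit);
}

function ids(rows: Filing[]): string {
  return rows.map((row) => String(row.id)).join(",");
}

// Simulate a filing arriving between two page fetches.
function simulate(): { offsetPage2: string; cursorPage2: string } {
  const table: Filing[] = [1, 2, 3, 4, 5, 6].map((id) => ({ id }));
  const firstPage = pageByOffset(table, 0, 3);         // 6,5,4
  const lastSeen = firstPage[firstPage.length - 1].id; // 4
  table.push({ id: 7 });                               // new filing lands
  return {
    offsetPage2: ids(pageByOffset(table, 3, 3)),       // "4,3,2": 4 is repeated
    cursorPage2: ids(pageByCursor(table, lastSeen, 3)) // "3,2,1": no repeat
  };
}
```

With offsets, the new filing pushes every row down one slot, so record 4 shows up on both pages. With a cursor, the second page starts from the last ID seen and is unaffected. A deletion between fetches produces the mirror-image bug, a skipped record.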

What Developers Build with Trademark Data

Brand clearance. Checking name availability before filing is the most common use case. Developers at naming agencies, startup tools, and domain registrars integrate trademark search to warn users of potential conflicts before they commit to a name. A comprehensive clearance search goes beyond exact matches to phonetic and fuzzy variants, and a brand name availability checker can be built with a single API integration. The search surfaces candidates; consult a trademark attorney for legal guidance on whether a specific name is safe to use.

Portfolio monitoring. Law firms managing hundreds or thousands of marks across jurisdictions need to track status changes, upcoming renewal deadlines, and new filings that might conflict with client marks. Trademark monitoring requires both fresh data and relationship tracking (Madrid designations, divisional applications).

Competitive intelligence. New trademark filings signal product launches, market entries, and rebrands before public announcements. A competitor filing in Class 9 (software) in a new jurisdiction signals expansion. Filing patterns reveal strategic intent months before press releases.

E-commerce compliance. Platforms like Amazon, Shopify, and Alibaba need to verify brand claims and detect potential counterfeit listings. Trademark data provides the verification layer: does this seller have a legitimate connection to the trademark they claim to own? Automated checks at listing time reduce infringement risk.

Data analysis and market research. Trademark filing data reveals macro trends in innovation and commerce. An analysis of 2 million USPTO filings surfaces which industries are growing, which geographies are expanding, and where competitive density is highest. Class distribution shifts over time signal structural changes in the economy.

Getting Started with Trademark Data

Two paths exist for working with trademark data: raw bulk data or API access.

Raw bulk data is free and available from several offices. The USPTO's bulk XML downloads give you access to 14.7 million records with full detail. WIPO's Global Brand Database offers web-based search across participating offices. TMview aggregates data from EU member-state offices and a number of participating offices outside Europe. These sources work well for one-off research, academic analysis, or if you are building your own normalization layer and have the engineering bandwidth to maintain it.

The limitations add up. Bulk data requires significant parsing infrastructure (USPTO XML files are large and nested). Each office publishes in a different format. You inherit the normalization problem entirely. Freshness depends on your own ingestion schedule. And some offices do not publish bulk data at all.

API access makes sense when you need multi-office search, real-time freshness, entity resolution, phonetic matching, or production-grade reliability. If you are building a feature that end users depend on, raw bulk data will not scale. The parsing, normalization, and infrastructure costs add up faster than API pricing.

Signa's free tier provides access to normalized trademark data across 200+ offices, with search, filtering, and entity resolution included. It is one option for getting started without building ingestion infrastructure from scratch. Start at signa.so.