Skip to main content

Runbook: JSON Schema

Choose the right schema approach for JSON data in ClickHouse — typed columns, hybrid, native JSON, or String storage
Note

The JSON column type is production-ready from ClickHouse 25.3+. Earlier versions are not recommended for production use.

Your data arrives as JSON. ClickHouse gives you several ways to store it, from fully typed columns to a raw String. The right choice depends on how predictable your schema is and whether you need field-level queries.

Scope: This page covers schema design decisions for storing JSON data. It does not cover JSON input/output formats, JSON functions, or query syntax. For background on the JSON column type itself, see Use JSON where appropriate.

Assumes: Familiarity with ClickHouse table creation, MergeTree basics, and column type syntax.

Quick decision

  • If every field has a known, stable type and the schema rarely changes Typed columns
  • If most fields are stable but some section is dynamic or unpredictable Hybrid (typed + JSON)
  • If the entire structure is dynamic, with keys that appear and disappear across records Native JSON column
  • If you only store and retrieve the JSON blob without field-level queries Opaque String storage
Note

Don't confuse the JSON format with the JSON column type. You can insert JSON-formatted data (via JSONEachRow, etc.) into typed columns without using the JSON column type at all. The decision here is about column types, not input formats.

Approach details

Typed columns

When to use: The JSON structure is fully known at design time. Fields and types don't change between records. Even complex nested structures (arrays of objects, nested maps) can be expressed with Array, Tuple, and Nested types.

Trade-offs: Schema changes require ALTER TABLE. Unexpected fields are silently dropped on insert unless the schema is updated.

Setup, verification, and gotchas

Setup

CREATE TABLE events
(
    `timestamp` DateTime,
    `service`   LowCardinality(String),
    `level`     Enum8('DEBUG' = 1, 'INFO' = 2, 'WARN' = 3, 'ERROR' = 4),
    `message`   String,
    `host`      LowCardinality(String),
    `duration_ms` UInt32
)
ENGINE = MergeTree
ORDER BY (service, timestamp)

Verification

-- Confirm column types match expectations
DESCRIBE TABLE events FORMAT Vertical

-- Insert and query to validate the schema handles your data
INSERT INTO events FORMAT JSONEachRow
{"timestamp":"2025-03-19 10:00:00","service":"api","level":"INFO","message":"request handled","host":"node-1","duration_ms":42}

SELECT service, level, duration_ms FROM events WHERE service = 'api'

Watch out for

  • If you insert JSON data with JSONEachRow and the JSON contains fields not in the schema, ClickHouse drops them silently by default. Set input_format_skip_unknown_fields to 0 if you want errors instead.

Hybrid (typed columns + JSON)

When to use: A core set of fields is stable (timestamps, IDs, status codes), but part of the payload is dynamic. Think user-defined attributes, tags, metadata, or extension fields that vary across records.

Trade-offs: Full performance on typed columns, flexibility on the JSON column. The JSON column still carries insert overhead and storage cost for its dynamic portion.

Setup, verification, and gotchas

Setup

CREATE TABLE events
(
    `timestamp`  DateTime,
    `service`    LowCardinality(String),
    `level`      Enum8('DEBUG' = 1, 'INFO' = 2, 'WARN' = 3, 'ERROR' = 4),
    `message`    String,
    `host`       LowCardinality(String),
    `duration_ms` UInt32,
    `attributes` JSON(
        max_dynamic_paths = 256,
        `http.status_code` UInt16,
        `http.method` LowCardinality(String),
        SKIP REGEXP 'debug\..*'
    )
)
ENGINE = MergeTree
ORDER BY (service, timestamp)

Verification

-- Insert sample data and inspect inferred paths
INSERT INTO events FORMAT JSONEachRow
{"timestamp":"2025-03-19 10:00:00","service":"api","level":"INFO","message":"request handled","host":"node-1","duration_ms":42,"attributes":{"http.status_code":200,"http.method":"GET","user.region":"eu-west","custom_tag":"abc"}}

SELECT JSONAllPathsWithTypes(attributes)
FROM events
FORMAT PrettyJSONEachRow

Watch out for

  • Use type hints on JSON paths you know ahead of time. Hints bypass the discriminator column and store the path like a regular typed column, with the same performance and no overhead.
  • Use SKIP or SKIP REGEXP for paths you never query (debug metadata, internal tracing IDs) to save storage and reduce subcolumn count.
  • Set max_dynamic_paths proportional to the number of distinct paths you actually query. The default (1024) works for most cases. Lower it if your dynamic section is narrow.
  • Don't set max_dynamic_paths above 10,000. High values increase resource consumption and reduce efficiency.
Dotted keys

Keys with dots (e.g., http.status_code) are treated as nested paths by default, so {"http.status_code": 200} is stored the same as {"http": {"status_code": 200}}. This is common with OTel attributes. Use type hints to control how dotted paths are stored, or enable json_type_escape_dots_in_keys (25.8+).


Native JSON column

When to use: The structure is genuinely unpredictable, with keys that appear and disappear across records. User-generated schemas, plugin systems, or data-lake ingestion where you don't control the upstream schema.

Trade-offs: Slower inserts than typed columns. Slower full-object reads than String. Storage overhead from subcolumn management. Works well for field-level queries on specific paths.

Setup, verification, and gotchas

Setup

CREATE TABLE dynamic_events
(
    `id`   UInt64,
    `ts`   DateTime DEFAULT now(),
    `data` JSON(
        max_dynamic_paths = 512,
        `event_type` LowCardinality(String),
        `version` UInt8
    )
)
ENGINE = MergeTree
ORDER BY (data.event_type, ts)

Use JSONAsObject format when inserting whole JSON documents into a JSON column. It treats each input line as a complete JSON object mapped to the column.

Verification

INSERT INTO dynamic_events (id, data) FORMAT JSONEachRow
{"id": 1, "data": {"event_type": "click", "version": 2, "page": "/home", "button_id": "cta-1"}}
{"id": 2, "data": {"event_type": "purchase", "version": 1, "item_id": "SKU-99", "amount": 49.99, "currency": "USD"}}

-- Check which paths ClickHouse detected and their types
SELECT JSONAllPathsWithTypes(data) FROM dynamic_events FORMAT PrettyJSONEachRow

-- Query a specific path
SELECT data.page FROM dynamic_events WHERE data.event_type = 'click'

Watch out for

  • Without type hints, ClickHouse infers types per-path from the first values it sees. If score arrives as "10" (string) in one record and 10 (integer) in another, the path gets a discriminator column and queries become slower. Add hints for paths with known types.
  • When path count exceeds max_dynamic_paths, overflow values move to a shared data structure with reduced query performance. Monitor with JSONDynamicPaths() and keep the limit below 10,000.
  • Each dynamic path supports up to max_dynamic_types (default 32) distinct data types. If a single path exceeds this, extra types fall back to shared variant storage. This rarely matters unless your data has highly inconsistent types for the same field.

Opaque String storage

When to use: JSON documents are stored and retrieved whole, then passed through to an application, archived, or forwarded downstream. No field-level filtering or aggregation inside ClickHouse.

Trade-offs: Fastest inserts and simplest schema. No field-level queries without runtime parsing (JSONExtract family), which is slow at scale.

Setup, verification, and gotchas

Setup

CREATE TABLE raw_events
(
    `id`        UInt64,
    `received`  DateTime DEFAULT now(),
    `payload`   String
)
ENGINE = MergeTree
ORDER BY (received)

Verification

INSERT INTO raw_events (id, payload) VALUES
(1, '{"type":"click","page":"/home"}'),
(2, '{"type":"purchase","item":"SKU-99","amount":49.99}')

-- Confirm data round-trips intact
SELECT payload FROM raw_events WHERE id = 1

-- Verify you can still parse fields ad-hoc when needed
SELECT JSONExtractString(payload, 'type') AS event_type FROM raw_events

Watch out for

  • If requirements change and you later need field-level queries, you'll need to create a new table with typed or JSON columns and backfill the data. If there's any chance you'll query individual fields, start with the hybrid approach instead.
  • JSONExtract functions parse the string on every query. Acceptable for ad-hoc exploration, not for production dashboards or high-QPS workloads.
  • Consider compression codecs (ZSTD) on the String column if the JSON payloads are large — it compresses well.

Comparison

DimensionTyped columnsHybridNative JSONString
Insert throughputFastestFastModerateFastest
Field-level queriesFastestFast (typed); good (hinted JSON)Good (hinted); slower (dynamic)Slow (runtime parsing)
Full-object readsFastModerateSlowFastest
Storage efficiencyBestGoodModerateGood (compresses well)
Schema flexibilityNone (ALTER TABLE)Partial (rigid core, flexible tail)FullFull
ComplexityLowMediumMedium–HighLow
· 7 min read