What it is and the real problem it solves

Every data platform eventually faces the same tension: structured schemas are efficient but rigid; semi-structured data is flexible but expensive to query. In practice, most lakehouse tables end up with at least one column that stores raw JSON as a STRING — event payloads, API responses, configuration metadata, or CDC change records with heterogeneous schemas.

The problem is not storing JSON. The problem is querying it.

When JSON lives in a STRING column, every field extraction requires the engine to:

  1. Deserialize the entire JSON string from UTF-8 bytes into an in-memory tree
  2. Navigate the tree to locate the requested field
  3. Cast the extracted value to the target type
  4. Repeat steps 1–3 for every row, every query, every field access

This “JSON tax” compounds quickly. A single SELECT that extracts three fields from a JSON column parses the same string three times per row. At lakehouse scale — billions of rows, terabytes of data — this becomes a dominant cost in both CPU time and I/O.

The Variant data type: a fundamentally different approach

The Variant data type, introduced in Apache Spark 4.0 and matured in Spark 4.1 with Delta Lake 4.x support, solves this by storing semi-structured data in a pre-parsed binary format rather than raw text. The binary encoding preserves the full flexibility of JSON (arbitrary nesting, mixed types, schema-per-row) while enabling:

  • Single-pass parsing at write time — JSON is parsed once during ingestion and stored as binary. Subsequent reads never re-parse.
  • Direct field access without deserialization — The binary format supports offset-based navigation, allowing the engine to extract a nested field without reading the entire value.
  • Columnar storage compatibility — Variant values are stored in Parquet using a specialized encoding that participates in column pruning (the engine can skip reading the Variant column entirely when it is not referenced). Note that predicate pushdown does not apply to fields extracted from within a Variant value — promote frequently-filtered fields to top-level typed columns.
  • Type preservation — Numeric types, booleans, and strings retain their original types in the binary encoding, avoiding the string-to-type casting overhead at query time.

In short: Variant moves the cost of parsing from read time to write time — a single write amortizes the cost across every future query.

Works seamlessly across Spark workloads

The Variant data type is not limited to batch SQL. It integrates naturally with:

  • Spark SQL — full function support (parse_json, variant_get, variant_explode, etc.)
  • DataFrame API — use the same functions in PySpark and Scala DataFrames
  • Structured Streaming — ingest and write Variant columns in streaming pipelines with no additional configuration (see the sketch below)
  • Delta Lake tables — Variant columns participate in standard Delta operations (MERGE, UPDATE, DELETE)

This provides a consistent programming model across batch and streaming — no need to switch approaches based on the execution mode.
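
For instance, a minimal Structured Streaming ingestion sketch might look like this (the input path, checkpoint location, and table name are all illustrative):

from pyspark.sql.functions import parse_json, col

# Read newline-delimited JSON as raw text and convert each line to a
# Variant at the ingestion boundary (illustrative paths and table name)
stream_df = (
    spark.readStream
    .format("text")
    .load("/data/streaming/events/")
    .select(parse_json(col("value")).alias("payload"))
)

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/streaming_events")
    .toTable("dbo.streaming_events")
)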

Before and after: the query cost difference

Before — JSON in a STRING column:

from pyspark.sql.functions import get_json_object, from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Every field extraction re-parses the entire JSON string
df = spark.table("dbo.raw_events")
result = df.select(
    get_json_object(col("payload"), "$.user_id").cast("int").alias("user_id"),
    get_json_object(col("payload"), "$.event_type").alias("event_type"),
    get_json_object(col("payload"), "$.timestamp").alias("event_ts")
)

After — Variant column:

from pyspark.sql.functions import variant_get, col

# Direct binary field extraction — no re-parsing
df = spark.table("dbo.raw_events_v2")
result = df.select(
    variant_get(col("payload"), "$.user_id", "INT").alias("user_id"),
    variant_get(col("payload"), "$.event_type", "STRING").alias("event_type"),
    variant_get(col("payload"), "$.timestamp", "STRING").alias("event_ts")
)

The API looks similar, but the execution is fundamentally different. With Variant, the binary value is navigated by offset — no JSON deserialization occurs at query time.

When NOT to use Variant

Variant is not a replacement for strongly-typed columns. If you know the schema at design time and it is stable, use StructType, ArrayType, or flat columns. Variant is specifically for the columns where you cannot define a schema in advance or where the schema evolves independently of your table DDL.
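
For contrast, here is a minimal sketch of the stable-schema case, where from_json() with an explicit StructType yields fully typed columns (the schema shown is illustrative):

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

# A known, stable payload shape: declare it once and work with typed columns
stable_schema = StructType([
    StructField("user_id", LongType()),
    StructField("event_type", StringType()),
])

typed_df = (
    spark.table("dbo.raw_events")
    .select(from_json(col("payload"), stable_schema).alias("p"))
    .select("p.user_id", "p.event_type")
)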


Real-world use cases

The Variant data type is designed for scenarios where the schema is unknown at design time, varies across rows, or evolves faster than your table schema can keep up.

Use case | Why Variant fits | Alternative (and its problem)
Event telemetry / clickstream | Event payloads vary by event type; new fields appear without warning | STRING + get_json_object() — repeated parsing per field
API response logging | Third-party APIs change response shapes without notice | Deeply nested StructType — breaks on schema changes
CDC with heterogeneous schemas | Change records from different source tables in a single stream | MapType — loses nesting and type fidelity
Configuration / metadata columns | Feature flags, experiment configs, user preferences | Multiple nullable columns — schema bloat
IoT sensor payloads | Different device types produce different field sets | Union of all possible fields — sparse, wasteful
ML feature stores | Feature definitions evolve rapidly during experimentation | Fixed schema — requires migrations on every change

How it works with examples

Creating a table with a Variant column

%%sql
CREATE TABLE dbo.raw_events_v2 (
    event_id BIGINT,
    source STRING,
    received_at TIMESTAMP,
    payload VARIANT
);

The VARIANT data type is a first-class Delta Lake column data type. No special configuration is required.

Note: The %%sql prefix in the SQL code blocks throughout this post is the Fabric notebook cell magic for running SQL. It is equivalent to wrapping the query in spark.sql("""...""") in PySpark.

Inserting data with parse_json()

The parse_json() function converts a JSON string into the binary Variant format at write time:

%%sql
INSERT INTO dbo.raw_events_v2 VALUES
(1, 'web', current_timestamp(),
 parse_json('{"user_id": 42, "event_type": "click", "page": "/products", "duration_ms": 320}')),
(2, 'mobile', current_timestamp(),
 parse_json('{"user_id": 87, "event_type": "purchase", "item_id": "SKU-9912", "amount": 59.99}')),
(3, 'api', current_timestamp(),
 parse_json('{"service": "payment-gateway", "status": "timeout", "retry_count": 3}'));

Notice that each row has a different JSON structure. This is the core value proposition of Variant — no need to define a union schema or force all rows into the same shape.

Extracting fields with variant_get()

variant_get(column, path, target_type) extracts a field from a Variant value using a JSONPath-like syntax:

%%sql
SELECT
    event_id,
    variant_get(payload, '$.user_id', 'INT') AS user_id,
    variant_get(payload, '$.event_type', 'STRING') AS event_type,
    variant_get(payload, '$.amount', 'DOUBLE') AS amount
FROM dbo.raw_events_v2 ORDER BY event_id;
event_id | user_id | event_type | amount
1        | 42      | click      | NULL
2        | 87      | purchase   | 59.99
3        | NULL    | NULL       | NULL

Fields that do not exist in a particular row return NULL — no error, no exception.

Safe extraction with try_variant_get()

If you need to handle type mismatches gracefully (e.g., a field that is sometimes a string and sometimes a number), use try_variant_get():

%%sql
SELECT
    event_id,
    try_variant_get(payload, '$.retry_count', 'INT') AS retry_count
FROM dbo.raw_events_v2 ORDER BY event_id;
event_id | retry_count
1        | NULL
2        | NULL
3        | 3

try_variant_get() returns NULL instead of raising an error when the target type does not match the stored value.

Exploring Variant structure with schema_of_variant()

When you do not know the structure of a Variant value, schema_of_variant() reveals its schema:

%%sql
SELECT
    event_id,
    schema_of_variant(payload) AS payload_schema
FROM dbo.raw_events_v2 ORDER BY event_id;
event_id | payload_schema
1        | OBJECT<duration_ms: BIGINT, event_type: STRING, page: STRING, user_id: BIGINT>
2        | OBJECT<amount: DECIMAL(4,2), event_type: STRING, item_id: STRING, user_id: BIGINT>
3        | OBJECT<retry_count: BIGINT, service: STRING, status: STRING>

Aggregating schemas with schema_of_variant_agg()

To discover the merged schema across all rows (useful for understanding what fields exist across an entire table):

%%sql
SELECT schema_of_variant_agg(payload) AS merged_schema
FROM dbo.raw_events_v2;

This returns the union of all schemas observed across all rows — invaluable for data discovery and documentation.
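
For the three rows inserted above, the merged schema would combine the per-row schemas shown earlier, along these lines:

OBJECT<amount: DECIMAL(4,2), duration_ms: BIGINT, event_type: STRING,
       item_id: STRING, page: STRING, retry_count: BIGINT,
       service: STRING, status: STRING, user_id: BIGINT>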

Exploding nested arrays with variant_explode()

For Variant values containing arrays or objects, variant_explode() flattens them into rows:

%%sql
-- First, insert a record with an array
INSERT INTO dbo.raw_events_v2 VALUES
(4, 'batch', current_timestamp(),
 parse_json('{"items": [{"sku": "A1", "qty": 2}, {"sku": "B3", "qty": 1}]}'));

-- Explode the items array into individual rows
SELECT
    event_id,
    pos,
    variant_get(value, '$.sku', 'STRING') AS sku,
    variant_get(value, '$.qty', 'INT') AS qty
FROM dbo.raw_events_v2,
LATERAL variant_explode(variant_get(payload, '$.items', 'VARIANT'))
WHERE event_id = 4;
event_id | pos | sku | qty
4        | 0   | A1  | 2
4        | 1   | B3  | 1

Checking for Variant NULL vs SQL NULL

Variant has its own concept of null (a JSON null value stored inside the binary) that is distinct from SQL NULL (the absence of a value). Use is_variant_null() to distinguish:

%%sql
SELECT
    parse_json('null') AS variant_null_value,
    is_variant_null(parse_json('null')) AS is_var_null,
    parse_json('null') IS NULL AS is_sql_null;
variant_null_value | is_var_null | is_sql_null
null               | true        | false

This distinction matters when processing data that uses JSON null to represent “field exists but has no value” versus missing fields.
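
A sketch of how you might tell the two cases apart on a specific field path, using the table and fields from the examples above:

spark.sql("""
    SELECT
        event_id,
        -- SQL NULL: the field is absent from this row's payload
        variant_get(payload, '$.amount') IS NULL AS amount_missing,
        -- true only when the stored value is an explicit JSON null
        is_variant_null(variant_get(payload, '$.amount')) AS amount_is_json_null
    FROM dbo.raw_events_v2
""").show()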

Using Variant in PySpark DataFrames

from pyspark.sql.functions import parse_json, variant_get, col

# Read raw JSON lines as text, then parse into a Variant column
raw_df = spark.read.text("/data/events/*.json")

events_df = raw_df.select(
    parse_json(col("value")).alias("payload")
)

# Extract typed fields from the Variant column
processed_df = events_df.select(
    variant_get(col("payload"), "$.user_id", "BIGINT").alias("user_id"),
    variant_get(col("payload"), "$.event_type", "STRING").alias("event_type"),
    variant_get(col("payload"), "$.timestamp", "TIMESTAMP").alias("event_ts"),
    col("payload")  # Keep the full Variant for downstream consumers
)

# Write to a destination whose schema matches the extracted columns
# (dbo.raw_events_v2 has a different column layout; table name illustrative)
processed_df.write.mode("append").saveAsTable("dbo.processed_events")

Migration guide: from JSON strings to Variant

If your existing tables store JSON in STRING columns, here is a step-by-step migration path.

Migration checklist

  • Identify all tables with STRING columns storing JSON
  • Audit queries using get_json_object(), from_json(), or json_tuple() against those columns
  • Create new tables (or add new Variant columns) for the target schema
  • Backfill using parse_json() or try_parse_json()
  • Update read queries to use variant_get() / try_variant_get()
  • Update write pipelines to call parse_json() at ingestion
  • Validate query results match between old and new patterns
  • Monitor for malformed JSON using try_parse_json() NULL checks
  • Deprecate and eventually drop the old STRING columns

Option A: In-place column addition (minimal disruption)

If you want to migrate without recreating the table, you can add a Variant column alongside the existing STRING column:

%%sql
ALTER TABLE dbo.raw_events ADD COLUMN payload_v VARIANT;

-- Backfill the new column from the existing STRING column
UPDATE dbo.raw_events
SET payload_v = parse_json(payload);

This approach lets you migrate reads incrementally while keeping the original column as a fallback. Once all consumers are updated, drop the old STRING column.
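
A sketch of that final cleanup step; note that, to the best of my knowledge, Delta Lake requires column mapping to be enabled before DROP COLUMN succeeds:

# Assumption: DROP COLUMN on Delta needs column mapping mode enabled
spark.sql("""
    ALTER TABLE dbo.raw_events
    SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
""")
spark.sql("ALTER TABLE dbo.raw_events DROP COLUMN payload")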

Option B: New table with full migration

For a clean-break migration, create a new table and backfill:

Step 1: Create the new table with a Variant column

%%sql
CREATE TABLE dbo.raw_events_v2 (
    event_id BIGINT,
    source STRING,
    received_at TIMESTAMP,
    payload VARIANT
)
USING DELTA;

Step 2: Backfill data from the old table

%%sql
INSERT INTO dbo.raw_events_v2
SELECT
    event_id,
    source,
    received_at,
    parse_json(payload_json) AS payload
FROM dbo.raw_events_legacy;

The parse_json() function converts each STRING value into the binary Variant format during the write. This is a one-time cost.

Step 3: Update your read queries

Before:

%%sql
-- Old pattern: repeated string parsing
SELECT
    event_id,
    get_json_object(payload_json, '$.user_id') AS user_id,
    get_json_object(payload_json, '$.event_type') AS event_type
FROM dbo.raw_events_legacy;

After:

%%sql
-- New pattern: binary field extraction
SELECT
    event_id,
    variant_get(payload, '$.user_id', 'INT') AS user_id,
    variant_get(payload, '$.event_type', 'STRING') AS event_type
FROM dbo.raw_events_v2;

Step 4: Update your write pipelines

Before:

# Old pattern: store raw JSON string
from pyspark.sql.functions import col

raw_df = spark.read.text("/data/incoming/*.json")
raw_df.select(
    col("value").alias("payload_json")
).write.mode("append").saveAsTable("dbo.raw_events_legacy")

After:

# New pattern: parse at write time, store as Variant
from pyspark.sql.functions import parse_json, col

raw_df = spark.read.text("/data/incoming/*.json")
raw_df.select(
    parse_json(col("value")).alias("payload")
).write.mode("append").saveAsTable("dbo.raw_events_v2")

Step 5: Handle malformed JSON with try_parse_json()

During migration, you may encounter malformed JSON strings. Use try_parse_json() to avoid pipeline failures:

%%sql
INSERT INTO dbo.raw_events_v2
SELECT
    event_id,
    source,
    received_at,
    try_parse_json(payload_json) AS payload
FROM dbo.raw_events_legacy;

try_parse_json() returns NULL for invalid JSON instead of throwing an error. You can then identify and handle the bad records separately:

%%sql
-- Find records that failed to parse
SELECT event_id, payload_json
FROM dbo.raw_events_legacy
WHERE try_parse_json(payload_json) IS NULL;
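
The same split works in PySpark; here is a sketch that routes unparseable records to a quarantine table (the quarantine table name is illustrative):

from pyspark.sql.functions import try_parse_json, col

legacy_df = spark.table("dbo.raw_events_legacy")
parsed_df = legacy_df.withColumn("payload", try_parse_json(col("payload_json")))

# Rows that parsed cleanly go to the new Variant table...
parsed_df.filter(col("payload").isNotNull()) \
    .drop("payload_json") \
    .write.mode("append").saveAsTable("dbo.raw_events_v2")

# ...and failures are set aside for inspection (illustrative table name)
parsed_df.filter(col("payload").isNull()) \
    .select("event_id", "payload_json") \
    .write.mode("append").saveAsTable("dbo.raw_events_quarantine")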

Best practices

1. Parse at the boundary, query from binary

Always convert JSON to Variant at the ingestion boundary — the point where data first enters your lakehouse. Every downstream consumer then benefits from the binary format without re-parsing.

# Good: parse once at ingestion
from pyspark.sql.functions import parse_json, col

incoming_df = spark.read.text("/data/stream/*.json")
incoming_df.select(
    parse_json(col("value")).alias("data")
).write.mode("append").saveAsTable("dbo.landing_events")

2. Use try_variant_get() for heterogeneous schemas

When the same field can have different types across rows (e.g., "price" is sometimes a string "N/A" and sometimes a number), use try_variant_get() to avoid runtime errors:

%%sql
SELECT
    event_id,
    COALESCE(
        try_variant_get(payload, '$.price', 'DOUBLE'),
        0.0
    ) AS price
FROM dbo.raw_events_v2;

3. Combine Variant with strongly-typed columns

The ideal table design uses Variant for flexible, low-query-frequency fields and strongly-typed columns for high-query-frequency, filterable fields:

%%sql
CREATE TABLE dbo.fact_user_events (
    event_id BIGINT,
    user_id BIGINT,           -- strongly typed: used in joins and filters
    event_type STRING,         -- strongly typed: used in GROUP BY
    event_ts TIMESTAMP,        -- strongly typed: used in time-based queries
    properties VARIANT         -- flexible: varies by event type
)
USING DELTA
PARTITIONED BY (event_type);

This pattern gives you the best of both worlds: fast, predicate-pushdown-enabled filtering on structured columns, and flexible field extraction on the Variant column.
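
A representative query against this layout filters on the typed columns (where pruning and pushdown apply) and only reaches into the Variant for the flexible fields; the seven-day window here is just an example:

spark.sql("""
    SELECT
        user_id,
        COUNT(*) AS purchases,
        SUM(variant_get(properties, '$.amount', 'DOUBLE')) AS total_amount
    FROM dbo.fact_user_events
    WHERE event_type = 'purchase'        -- typed partition column: prunable
      AND event_ts >= date_sub(current_date(), 7)
    GROUP BY user_id
""").show()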

4. Use schema_of_variant_agg() for data discovery

Before building downstream models, profile your Variant columns to understand what fields exist and their types:

%%sql
SELECT schema_of_variant_agg(properties) AS all_fields
FROM dbo.fact_user_events
WHERE event_type = 'purchase';

This helps you decide which fields to promote to strongly-typed columns over time.
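
Promoting a field once it proves hot is mechanical; a sketch, reusing the table from the previous example (the amount column is illustrative):

# Add a typed column, then backfill it from the Variant payload
spark.sql("ALTER TABLE dbo.fact_user_events ADD COLUMN amount DOUBLE")
spark.sql("""
    UPDATE dbo.fact_user_events
    SET amount = variant_get(properties, '$.amount', 'DOUBLE')
    WHERE event_type = 'purchase'
""")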

5. Avoid storing large arrays in Variant

While Variant supports arrays, very large arrays (thousands of elements) stored in a single Variant value can degrade extraction performance. If you have consistently large arrays, consider flattening them into separate rows at ingestion time:

from pyspark.sql.functions import parse_json, to_json, explode, col

# Instead of storing a large array inside a single Variant value,
# explode it into one row per element at ingestion
raw_df = spark.read.json("/data/bulk_orders/*.json")

flattened_df = raw_df.select(
    col("order_id"),
    explode(col("line_items")).alias("item")
).select(
    col("order_id"),
    # Serialize the struct to JSON before parsing: casting a struct to
    # STRING does not produce valid JSON (to_variant_object() also works)
    parse_json(to_json(col("item"))).alias("item_payload")
)

flattened_df.write.mode("append").saveAsTable("dbo.order_line_items")

6. Cast Variant to typed columns for analytics

When building aggregate tables or BI-facing views, extract and cast Variant fields into typed columns:

%%sql
CREATE OR REPLACE VIEW dbo.v_purchase_events AS
SELECT
    event_id,
    user_id,
    event_ts,
    variant_get(properties, '$.item_id', 'STRING') AS item_id,
    variant_get(properties, '$.amount', 'DOUBLE') AS amount,
    variant_get(properties, '$.currency', 'STRING') AS currency
FROM dbo.fact_user_events
WHERE event_type = 'purchase';

7. Align with the medallion architecture

The Variant data type fits naturally into a multi-layer lakehouse design:

Layer  | Pattern
Bronze | Store full payload as VARIANT — no schema enforcement, maximum fidelity
Silver | Extract frequently-used fields into typed columns alongside the raw VARIANT
Gold   | Fully structured tables with no VARIANT columns — optimized for BI and aggregation

This progression lets you ingest rapidly at the Bronze layer without worrying about schema, then progressively refine as you understand the data.
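
A Bronze-to-Silver refinement job under this pattern might look like the following sketch (table names are illustrative):

from pyspark.sql.functions import variant_get, col

# Promote hot fields to typed columns while keeping the raw Variant
# alongside them for questions you have not thought of yet
bronze_df = spark.table("dbo.bronze_events")

silver_df = bronze_df.select(
    variant_get(col("payload"), "$.user_id", "BIGINT").alias("user_id"),
    variant_get(col("payload"), "$.event_type", "STRING").alias("event_type"),
    variant_get(col("payload"), "$.timestamp", "TIMESTAMP").alias("event_ts"),
    col("payload"),
)

silver_df.write.mode("append").saveAsTable("dbo.silver_events")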


Limitations

Understanding what Variant cannot do is as important as knowing what it can.

Operations not supported on Variant columns

Limitation | Workaround
No comparison or ordering — cannot use ORDER BY, >, <, = directly | Extract with variant_get() and compare the typed result
No hashing — cannot use in GROUP BY, DISTINCT, joins, or window PARTITION BY | Extract and cast fields before grouping or joining
Cannot be used as map keys — MapType(VariantType, ...) is not supported | Use extracted STRING or INT values as keys
No predicate pushdown on internals — filters on extracted fields do not push down | Promote frequently-filtered fields to top-level typed columns
No set operations — cannot use UNION, INTERSECT, or EXCEPT directly | Extract fields before applying set operations
Cannot be used as partition or clustering key | Partition by a typed column; extract fields for clustering
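
In practice these workarounds all reduce to the same move: extract first, then operate on the typed result. For example, grouping by a field inside the Variant (using the events table from earlier):

spark.sql("""
    SELECT
        variant_get(payload, '$.event_type', 'STRING') AS event_type,
        COUNT(*) AS n
    FROM dbo.raw_events_v2
    GROUP BY variant_get(payload, '$.event_type', 'STRING')
""").show()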

Data handling constraints

Limitation | Workaround
Variant NULL vs SQL NULL distinction — is_variant_null(col) and col IS NULL mean different things | Use is_variant_null() explicitly when the distinction matters
Size limit: 128 MiB per value | Split very large documents into multiple rows or separate tables
parse_json() rejects invalid JSON — malformed strings cause write failures | Use try_parse_json() for fault-tolerant ingestion
No direct write from structured types — cannot cast StructType directly to Variant in all contexts | Use to_json() then parse_json(), or use to_variant_object()
Case-sensitive field access — unlike get_json_object(), Variant field names are case-sensitive | Ensure field name casing matches exactly in variant_get() paths
No Hive compatibility — Variant columns cannot be read by Hive-based engines | Use Spark or engines that support the Variant Parquet encoding

Despite these constraints, Variant covers the vast majority of semi-structured workloads without issue.


Note

The Variant data type is supported in Fabric Runtime 2.0 (Spark 4.x + Delta Lake 4.x). It is a first-class Delta Lake column type — no additional configuration or feature flags are required.

Variant columns are stored in Parquet using the Variant encoding specification. Older Parquet readers that do not support this encoding will not be able to read Variant columns. Ensure all consumers of your Delta tables support the Variant type before migrating.


Key takeaways

  • Parse once, query forever — parse_json() at ingestion, variant_get() at query time.
  • Use for unknown/evolving schemas only — stable schemas belong in strongly-typed columns.
  • Combine both — typed columns for filters/joins/GROUP BY; Variant for flexible, low-frequency fields.
  • Migrate incrementally — try_parse_json() for fault tolerance; add Variant columns alongside existing STRING columns.
  • Always extract before operating — Variant columns cannot be compared, hashed, grouped, or ordered directly.

Final thoughts

The Variant data type addresses one of the longest-standing compromises in lakehouse architecture: the choice between schema flexibility and query performance. With Fabric Runtime 2.0, you no longer need to accept the JSON parsing tax on every read or maintain brittle from_json() calls with hardcoded schema definitions that break when upstream sources change.

Note: This blog post was written with assistance from AI