Data Types in Spark SQL: The Translation Guide SQL Server DBAs Actually Need

Shannon Lowder

19 Apr 2018 — 4 min read

On a recent project, we had a pipeline that worked perfectly in SQL Server, imported cleanly into Databricks, and produced subtly wrong results for six weeks before anyone noticed. The culprit: a DATETIME column that became a TIMESTAMP in Spark SQL, and a timezone assumption that was implicit in our original SQL and explicit (and wrong) in the new environment.

Data type mismatches in Spark SQL don't always blow up loudly. Sometimes they truncate silently. Sometimes they cast implicitly in ways that change your arithmetic. Sometimes they give you NaN where you expected NULL, and your aggregate query returns a completely wrong number without an error message in sight.

Here's the translation guide I wish I'd had the first time through.

String Types: STRING Is Not VARCHAR

In T-SQL, strings have two dimensions: encoding (VARCHAR for single-byte, NVARCHAR for Unicode) and length. A VARCHAR(255) column refuses values longer than 255 characters. An NVARCHAR(MAX) stores up to 2 GB.

In Spark SQL, strings are just STRING. Always Unicode. No length limit. No VARCHAR(n). Spark will accept the VARCHAR syntax in a CREATE TABLE statement — it just treats it as STRING and silently ignores the length constraint at write time.

This is fine for reads. For writes, it means you have no length enforcement at the storage layer. If you need that constraint, put it in a Delta Lake CHECK constraint or in your application code — not in the type declaration.

-- T-SQL
CREATE TABLE customers (
  customer_name VARCHAR(100) NOT NULL,
  email NVARCHAR(255) NOT NULL
);

-- Spark SQL equivalent
CREATE TABLE customers (
  customer_name STRING NOT NULL,
  email STRING NOT NULL
);

Date and Time Types: The Timezone Problem

This is where SQL pros get burned most consistently. The type mapping:

T-SQL DATE → Spark DATE — same behavior, no issues
T-SQL DATETIME → Spark TIMESTAMP — with an important catch
T-SQL DATETIME2 → Spark TIMESTAMP — higher precision, same catch
T-SQL DATETIMEOFFSET → no direct equivalent; store as STRING or two columns

The catch: Spark's TIMESTAMP is always interpreted as UTC and displayed in the session timezone. SQL Server's DATETIME stores whatever you give it — no timezone awareness at all. If your SQL Server data was in local time (EST, CST, whatever) and you load it into Spark, Spark will interpret those values as UTC. Your "2:00 PM EST" timestamps become "2:00 PM UTC" — off by five hours for every East Coast user.

Handle it at ingestion time, explicitly:

-- Loading SQL Server datetime data that was stored in US Eastern time
SELECT
  to_utc_timestamp(
    to_timestamp(event_datetime_local, 'yyyy-MM-dd HH:mm:ss'),
    'America/New_York'
  ) AS event_timestamp_utc
FROM raw_events;

Get the timezone conversion explicit at the boundary. Don't assume. Trust me on this one — the bug it introduces is exactly subtle enough to survive six sprints of code review.

Numeric Types: Integer Literals Default to BIGINT

In T-SQL, integer literals are INT by default (4 bytes, max ~2.1 billion). In Spark SQL, they're BIGINT by default (8 bytes, max ~9.2 quintillion). This rarely causes bugs, but it can produce unexpected schema inference results and occasional type mismatch errors when mixing Spark-generated columns with external systems expecting INT.

More practically: Spark has no silent overflow behavior for integer arithmetic. If your math overflows, you get a runtime error, not silent wrapping. That's the right behavior, but it can surprise you in legacy pipelines that relied on SQL Server's overflow handling.

DECIMAL maps cleanly, with one gotcha: when Spark infers DECIMAL precision from literal arithmetic, it can choose lower precision than you expect. Cast explicitly for financial math:

-- Don't rely on inference for financial calculations
SELECT CAST(order_amount / 100.0 AS DECIMAL(18, 4)) AS adjusted_amount
FROM orders;

NULL vs NaN: Two Different Problems

T-SQL has NULL and nothing else. Aggregates skip NULLs. Comparisons use IS NULL / IS NOT NULL.

Spark has NULL and NaN (Not a Number) for floating-point columns. NaN is not NULL. COUNT(*) counts NaN rows. SUM() propagates NaN — a single NaN in the column makes the sum NaN, not NULL. MAX() and MIN() treat NaN as greater than any numeric value.

If your data came from a system that can generate NaN values — Python float division, certain CSV imports, some Parquet files from legacy tools — you need to handle them explicitly:

-- Check for NaN in a float column
SELECT COUNT(*) AS nan_count
FROM measurements
WHERE isnan(sensor_reading);

-- Treat NaN as NULL for aggregation
SELECT AVG(
  CASE WHEN isnan(sensor_reading) THEN NULL ELSE sensor_reading END
) AS clean_average
FROM measurements;

SQL Server doesn't have this problem — NaN doesn't exist in T-SQL's type system. Spark inherited it from IEEE 754 floating-point via Java and Scala. Know it's there.

BOOLEAN: Not BIT

T-SQL uses BIT for boolean-like values (0/1 integers, no true/false literal). Spark uses BOOLEAN with actual true/false literals.

-- T-SQL
WHERE is_active = 1

-- Spark SQL
WHERE is_active = true
-- or just
WHERE is_active

If you're loading SQL Server data where boolean columns came across as integers, a simple cast handles it:

SELECT CAST(is_active AS BOOLEAN) AS is_active
FROM legacy_customers;

Quick Reference

VARCHAR(n) / NVARCHAR(n) → STRING (no length enforcement at storage layer)
DATETIME / DATETIME2 → TIMESTAMP (SQL Server stores local time; Spark reads UTC — convert explicitly at ingestion)
DATE → DATE (same)
INT → INT (same); but integer literals in Spark default to BIGINT
DECIMAL(p,s) → DECIMAL(p,s) (same; cast explicitly in arithmetic)
BIT → BOOLEAN (cast when loading from SQL Server)
NULL → NULL (same semantics, but NaN also exists in float columns and is not NULL)
UNIQUEIDENTIFIER → STRING (UUID stored as string; compare carefully)

None of these are dealbreakers. They're places where assuming direct equivalence causes subtle bugs instead of obvious errors. Handle them at the ingestion boundary — explicit timezone conversion on datetime columns, NaN checks on floats from external sources, CAST on BIT columns — and you won't think about them again.

If you've run into a type mismatch I haven't covered here, I'd genuinely like to hear about it. The list grows with every migration project. As always, I'm here to help.

Data Types in Spark SQL: The Translation Guide SQL Server DBAs Actually Need

Shannon Lowder

String Types: STRING Is Not VARCHAR

Date and Time Types: The Timezone Problem

Numeric Types: Integer Literals Default to BIGINT

NULL vs NaN: Two Different Problems

BOOLEAN: Not BIT

Quick Reference

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving