U-SQL: When Microsoft Merged SQL and C# Into One Language

U-SQL is Microsoft's answer to a question Hive and Pig were already answering for the Hadoop ecosystem: how do you write analytical queries over big data that non-MapReduce programmers can understand? The Hive answer was "SQL-ish syntax that compiles to MapReduce." The Pig answer was "a dataflow language." Microsoft's answer was "actual SQL, extended with actual C#." I started using it in late 2014 and found it genuinely elegant — and genuinely sharp-edged.

The Basic Model

A U-SQL script looks like this:

// Define the input rowset from ADLS
@stormEvents =
    EXTRACT EventId        int,
            EventDate      DateTime,
            State          string,
            EventType      string,
            Injuries       int,
            Deaths         int,
            PropertyDamage decimal
    FROM "/raw/noaa/storm-events/2014/StormData.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Apply SQL-style transformation
@summary =
    SELECT State,
           EventType,
           SUM(Injuries) AS TotalInjuries,
           SUM(Deaths) AS TotalDeaths,
           SUM(PropertyDamage) AS TotalPropertyDamage,
           COUNT(*) AS EventCount
    FROM @stormEvents
    GROUP BY State, EventType;

// Output to ADLS
OUTPUT @summary
    TO "/curated/analytics/storm-summary-2014.csv"
    USING Outputters.Csv(outputHeader: true);

If you know SQL, you recognize almost all of this. EXTRACT is U-SQL's way of reading a file and giving it a schema — essentially a schema-on-read model. The SELECT/GROUP BY is standard SQL. The OUTPUT writes the result back to ADLS. The declarative structure is familiar; the file I/O is new.

Where C# Enters

U-SQL scripts can include C# code in two ways: inline expressions and custom functions/operators. The inline expressions were the first thing people used:

// Use C# methods directly in expressions
@cleaned =
    SELECT EventId,
           EventDate,
           State.ToUpper().Trim() AS State,
           String.IsNullOrEmpty(EventType) ? "Unknown" : EventType AS EventType,
           Injuries > 0 ? Injuries : 0 AS Injuries
    FROM @stormEvents;

// C# DateTime operations
@filtered =
    SELECT *
    FROM @stormEvents
    WHERE EventDate.Year == 2014
      AND EventDate.Month >= 6;

This was the elegance of U-SQL: you didn't need a UDF for simple string manipulation or date operations. You just called the C# method inline. State.ToUpper(), DateTime.Parse(), String.Format() — all available directly in the query expression.

Custom Extractors and Reducers

For cases where the built-in extractors didn't cover the file format — custom binary formats, complex multi-line records, malformed CSV with embedded newlines — U-SQL let you write a custom extractor in C#:

// Custom extractor for fixed-width format
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class FixedWidthExtractor : IExtractor
{
    private readonly int[] _fieldWidths;

    public FixedWidthExtractor(params int[] fieldWidths)
    {
        _fieldWidths = fieldWidths;
    }

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int offset = 0;
                for (int i = 0; i < _fieldWidths.Length; i++)
                {
                    output.Set(i, line.Substring(offset, _fieldWidths[i]).Trim());
                    offset += _fieldWidths[i];
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

The extractor registered as a U-SQL operator and could be used exactly like the built-in Extractors.Csv — just USING new FixedWidthExtractor(10, 20, 8, 50). This was the correct abstraction: complex parsing logic in typed C# code, clean declarative U-SQL orchestrating the result.

What Made It Work Well

The combination of SQL's declarative querying model and C#'s type system and library ecosystem was genuinely better than alternatives for the workloads I was using it for. Complex date arithmetic, geocoding operations, string normalization — things that required UDFs in Hive or ugly workarounds in Pig — were natural C# method calls in U-SQL. The type safety meant errors showed up at compile time (when you submitted the job, before it ran) rather than in the middle of a multi-hour execution.

The compile-time validation was something the Hadoop ecosystem didn't provide. A Pig script that had a typo in a field name would run for hours and then fail. A U-SQL script with the same error failed at submission. For long-running analytical jobs, that's the difference between a 10-minute wasted submission cycle and a 6-hour failed job. The next post covers the gotchas that made the 6-hour failure still possible even with U-SQL. As always, I'm here to help.

Read more