U-SQL: When Microsoft Merged SQL and C# Into One Language
U-SQL is Microsoft's answer to a question Hive and Pig were already answering for the Hadoop ecosystem: how do you write analytical queries over big data that non-MapReduce programmers can understand? The Hive answer was "SQL-ish syntax that compiles to MapReduce." The Pig answer was "a dataflow language." Microsoft's answer was "actual SQL, extended with actual C#." I started using it in late 2014 and found it genuinely elegant — and genuinely sharp-edged.
The Basic Model
A U-SQL script looks like this:
// Define the input rowset from ADLS
@stormEvents =
EXTRACT EventId int,
EventDate DateTime,
State string,
EventType string,
Injuries int,
Deaths int,
PropertyDamage decimal
FROM "/raw/noaa/storm-events/2014/StormData.csv"
USING Extractors.Csv(skipFirstNRows: 1);
// Apply SQL-style transformation
@summary =
SELECT State,
EventType,
SUM(Injuries) AS TotalInjuries,
SUM(Deaths) AS TotalDeaths,
SUM(PropertyDamage) AS TotalPropertyDamage,
COUNT(*) AS EventCount
FROM @stormEvents
GROUP BY State, EventType;
// Output to ADLS
OUTPUT @summary
TO "/curated/analytics/storm-summary-2014.csv"
USING Outputters.Csv(outputHeader: true);If you know SQL, you recognize almost all of this. EXTRACT is U-SQL's way of reading a file and giving it a schema — essentially a schema-on-read model. The SELECT/GROUP BY is standard SQL. The OUTPUT writes the result back to ADLS. The declarative structure is familiar; the file I/O is new.
Where C# Enters
U-SQL scripts can include C# code in two ways: inline expressions and custom functions/operators. The inline expressions were the first thing people used:
// Use C# methods directly in expressions
@cleaned =
SELECT EventId,
EventDate,
State.ToUpper().Trim() AS State,
String.IsNullOrEmpty(EventType) ? "Unknown" : EventType AS EventType,
Injuries > 0 ? Injuries : 0 AS Injuries
FROM @stormEvents;
// C# DateTime operations
@filtered =
SELECT *
FROM @stormEvents
WHERE EventDate.Year == 2014
AND EventDate.Month >= 6;This was the elegance of U-SQL: you didn't need a UDF for simple string manipulation or date operations. You just called the C# method inline. State.ToUpper(), DateTime.Parse(), String.Format() — all available directly in the query expression.
Custom Extractors and Reducers
For cases where the built-in extractors didn't cover the file format — custom binary formats, complex multi-line records, malformed CSV with embedded newlines — U-SQL let you write a custom extractor in C#:
// Custom extractor for fixed-width format
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class FixedWidthExtractor : IExtractor
{
private readonly int[] _fieldWidths;
public FixedWidthExtractor(params int[] fieldWidths)
{
_fieldWidths = fieldWidths;
}
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
using (var reader = new StreamReader(input.BaseStream))
{
string line;
while ((line = reader.ReadLine()) != null)
{
int offset = 0;
for (int i = 0; i < _fieldWidths.Length; i++)
{
output.Set(i, line.Substring(offset, _fieldWidths[i]).Trim());
offset += _fieldWidths[i];
}
yield return output.AsReadOnly();
}
}
}
}The extractor registered as a U-SQL operator and could be used exactly like the built-in Extractors.Csv — just USING new FixedWidthExtractor(10, 20, 8, 50). This was the correct abstraction: complex parsing logic in typed C# code, clean declarative U-SQL orchestrating the result.
What Made It Work Well
The combination of SQL's declarative querying model and C#'s type system and library ecosystem was genuinely better than alternatives for the workloads I was using it for. Complex date arithmetic, geocoding operations, string normalization — things that required UDFs in Hive or ugly workarounds in Pig — were natural C# method calls in U-SQL. The type safety meant errors showed up at compile time (when you submitted the job, before it ran) rather than in the middle of a multi-hour execution.
The compile-time validation was something the Hadoop ecosystem didn't provide. A Pig script that had a typo in a field name would run for hours and then fail. A U-SQL script with the same error failed at submission. For long-running analytical jobs, that's the difference between a 10-minute wasted submission cycle and a 6-hour failed job. The next post covers the gotchas that made the 6-hour failure still possible even with U-SQL. As always, I'm here to help.