U-SQL Hands-On: Loading and Transforming Data in Azure Data Lake Analytics
The U-SQL introduction post was conceptual. This one is practical: given raw CSV files in ADLS, here's how to go from nothing to a clean, queryable dataset in Azure Data Lake Analytics. I'll use the NOAA storm events data I was working with at the time — it's large enough to be interesting and structured enough to be tractable.
The Setup
You need: an Azure Data Lake Store account with data in it, an Azure Data Lake Analytics account linked to that store, and the ADLA Visual Studio Tools extension (or the Azure Portal's job editor if you're avoiding VS). Create a new U-SQL project in Visual Studio and you have a local development environment that validates syntax and lets you test against small samples before submitting to ADLA.
Step 1: Extract From Raw Files
NOAA storm events come as CSV files with inconsistent formatting across years. Early years had different column counts than later years. Start with a single year to understand the structure:
// Extract 2014 storm events
@raw =
EXTRACT BeginYearMonth int,
BeginDay int,
BeginTime string,
EndYearMonth int,
EndDay int,
EndTime string,
EpisodeId string?,
EventId int,
State string,
StateCode int,
EventType string,
CZType string,
CZCode int,
CZName string,
WFO string,
BeginDate_Time string,
CST int,
EndDate_Time string,
Injuries_Direct int,
Injuries_Indirect int,
Deaths_Direct int,
Deaths_Indirect int,
Damage_Property string,
Damage_Crops string,
Source string,
Magnitude decimal?,
MagnitudeType string?,
FloodCause string?,
Category int?,
TornadoFScale string?,
TornadoLength decimal?,
TornadoWidth int?
FROM "/raw/noaa/storm-events/StormEvents_details-ftp_v1.0_d2014_c20150326.csv"
USING Extractors.Csv(skipFirstNRows: 1, quoting: true);The ? suffix marks nullable columns — columns that might be empty in the source. U-SQL is strict about nullability; without the ?, an empty value throws an extraction error. When in doubt, mark columns nullable initially and add constraints after you understand the data quality.
Step 2: Clean and Normalize
NOAA's damage figures are stored as strings like "2.5K" or "1.2M" rather than numbers. This needs a custom parser:
// Parse NOAA damage string format into decimal
DECLARE @parseNOAADamage = (string s) =>
{
if (string.IsNullOrEmpty(s) || s == "0") return 0M;
string clean = s.Trim().ToUpper();
if (clean.EndsWith("K"))
return decimal.Parse(clean.Substring(0, clean.Length - 1)) * 1000M;
if (clean.EndsWith("M"))
return decimal.Parse(clean.Substring(0, clean.Length - 1)) * 1_000_000M;
if (clean.EndsWith("B"))
return decimal.Parse(clean.Substring(0, clean.Length - 1)) * 1_000_000_000M;
return decimal.TryParse(clean, out decimal result) ? result : 0M;
};
@cleaned =
SELECT EventId,
State.ToUpper().Trim() AS State,
EventType.Trim() AS EventType,
DateTime.Parse(BeginDate_Time) AS BeginDateTime,
DateTime.Parse(EndDate_Time) AS EndDateTime,
Injuries_Direct + Injuries_Indirect AS TotalInjuries,
Deaths_Direct + Deaths_Indirect AS TotalDeaths,
@parseNOAADamage(Damage_Property) AS PropertyDamageUSD,
@parseNOAADamage(Damage_Crops) AS CropDamageUSD
FROM @raw
WHERE EventType != null
AND BeginDate_Time != null;Step 3: Aggregate and Output
// Annual summary by state and event type
@summary =
SELECT State,
EventType,
BeginDateTime.Year AS Year,
COUNT(*) AS EventCount,
SUM(TotalInjuries) AS TotalInjuries,
SUM(TotalDeaths) AS TotalDeaths,
SUM(PropertyDamageUSD) AS PropertyDamageUSD,
SUM(CropDamageUSD) AS CropDamageUSD
FROM @cleaned
GROUP BY State, EventType, BeginDateTime.Year;
OUTPUT @summary
TO "/curated/analytics/storm-events-summary-2014.csv"
ORDER BY TotalDeaths DESC, PropertyDamageUSD DESC
USING Outputters.Csv(outputHeader: true);Submitting and Monitoring the Job
Submit from Visual Studio via the "Submit" button, or from the portal. You'll see the job graph — a directed acyclic graph of the compilation output showing how U-SQL distributed the work across vertices. Each vertex is a unit of parallel execution running on ADLA's compute.
The job view shows: vertex execution status, input/output bytes per vertex, and elapsed time per stage. This is where you catch the distribution problems — vertices that are doing all the work while others sit idle because the data isn't evenly distributed across the partition key you chose.
The Development Loop
The local development experience was the part that made ADLA practical. Visual Studio let you run U-SQL scripts against a local ADLS emulator with a small data sample. You could iterate on the script — fix the date parsing, adjust the null handling, debug the custom extractor — without submitting a job and waiting for ADLA compute. Once the script ran correctly locally, submit to ADLA with the full dataset. Most of the debugging happened locally in seconds; the ADLA submission was the final verification on real scale. As always, I'm here to help.