U-SQL Hands-On: Loading and Transforming Data in Azure Data Lake Analytics

The U-SQL introduction post was conceptual. This one is practical: given raw CSV files in ADLS, here's how to go from nothing to a clean, queryable dataset in Azure Data Lake Analytics. I'll use the NOAA storm events data I was working with at the time — it's large enough to be interesting and structured enough to be tractable.

The Setup

You need: an Azure Data Lake Store account with data in it, an Azure Data Lake Analytics account linked to that store, and the ADLA Visual Studio Tools extension (or the Azure Portal's job editor if you're avoiding VS). Create a new U-SQL project in Visual Studio and you have a local development environment that validates syntax and lets you test against small samples before submitting to ADLA.

Step 1: Extract From Raw Files

NOAA storm events come as CSV files with inconsistent formatting across years. Early years had different column counts than later years. Start with a single year to understand the structure:

// Extract 2014 storm events
@raw =
    EXTRACT BeginYearMonth  int,
            BeginDay        int,
            BeginTime       string,
            EndYearMonth    int,
            EndDay          int,
            EndTime         string,
            EpisodeId       string?,
            EventId         int,
            State           string,
            StateCode       int,
            EventType       string,
            CZType          string,
            CZCode          int,
            CZName          string,
            WFO             string,
            BeginDate_Time  string,
            CST             int,
            EndDate_Time    string,
            Injuries_Direct int,
            Injuries_Indirect int,
            Deaths_Direct   int,
            Deaths_Indirect int,
            Damage_Property string,
            Damage_Crops    string,
            Source          string,
            Magnitude       decimal?,
            MagnitudeType   string?,
            FloodCause      string?,
            Category        int?,
            TornadoFScale   string?,
            TornadoLength   decimal?,
            TornadoWidth    int?
    FROM "/raw/noaa/storm-events/StormEvents_details-ftp_v1.0_d2014_c20150326.csv"
    USING Extractors.Csv(skipFirstNRows: 1, quoting: true);

The ? suffix marks nullable columns — columns that might be empty in the source. U-SQL is strict about nullability; without the ?, an empty value throws an extraction error. When in doubt, mark columns nullable initially and add constraints after you understand the data quality.

Step 2: Clean and Normalize

NOAA's damage figures are stored as strings like "2.5K" or "1.2M" rather than numbers. This needs a custom parser:

// Parse NOAA damage string format into decimal
DECLARE @parseNOAADamage = (string s) =>
{
    if (string.IsNullOrEmpty(s) || s == "0") return 0M;
    string clean = s.Trim().ToUpper();
    if (clean.EndsWith("K"))
        return decimal.Parse(clean.Substring(0, clean.Length - 1)) * 1000M;
    if (clean.EndsWith("M"))
        return decimal.Parse(clean.Substring(0, clean.Length - 1)) * 1_000_000M;
    if (clean.EndsWith("B"))
        return decimal.Parse(clean.Substring(0, clean.Length - 1)) * 1_000_000_000M;
    return decimal.TryParse(clean, out decimal result) ? result : 0M;
};

@cleaned =
    SELECT EventId,
           State.ToUpper().Trim()                                   AS State,
           EventType.Trim()                                          AS EventType,
           DateTime.Parse(BeginDate_Time)                            AS BeginDateTime,
           DateTime.Parse(EndDate_Time)                              AS EndDateTime,
           Injuries_Direct + Injuries_Indirect                       AS TotalInjuries,
           Deaths_Direct + Deaths_Indirect                           AS TotalDeaths,
           @parseNOAADamage(Damage_Property)                         AS PropertyDamageUSD,
           @parseNOAADamage(Damage_Crops)                            AS CropDamageUSD
    FROM @raw
    WHERE EventType != null
      AND BeginDate_Time != null;

Step 3: Aggregate and Output

// Annual summary by state and event type
@summary =
    SELECT State,
           EventType,
           BeginDateTime.Year                AS Year,
           COUNT(*) AS EventCount,
           SUM(TotalInjuries)               AS TotalInjuries,
           SUM(TotalDeaths)                 AS TotalDeaths,
           SUM(PropertyDamageUSD)           AS PropertyDamageUSD,
           SUM(CropDamageUSD)               AS CropDamageUSD
    FROM @cleaned
    GROUP BY State, EventType, BeginDateTime.Year;

OUTPUT @summary
    TO "/curated/analytics/storm-events-summary-2014.csv"
    ORDER BY TotalDeaths DESC, PropertyDamageUSD DESC
    USING Outputters.Csv(outputHeader: true);

Submitting and Monitoring the Job

Submit from Visual Studio via the "Submit" button, or from the portal. You'll see the job graph — a directed acyclic graph of the compilation output showing how U-SQL distributed the work across vertices. Each vertex is a unit of parallel execution running on ADLA's compute.

The job view shows: vertex execution status, input/output bytes per vertex, and elapsed time per stage. This is where you catch the distribution problems — vertices that are doing all the work while others sit idle because the data isn't evenly distributed across the partition key you chose.

The Development Loop

The local development experience was the part that made ADLA practical. Visual Studio let you run U-SQL scripts against a local ADLS emulator with a small data sample. You could iterate on the script — fix the date parsing, adjust the null handling, debug the custom extractor — without submitting a job and waiting for ADLA compute. Once the script ran correctly locally, submit to ADLA with the full dataset. Most of the debugging happened locally in seconds; the ADLA submission was the final verification on real scale. As always, I'm here to help.

Read more