Azure Managed Disks: Moving to Premium SSD v2 - Part 3 | Hasan Gural

Hello Friends,

Welcome to the final part of this series. In Part 1 we covered why Premium SSD creates the oversizing problem in the first place, how the pricing model for Premium SSD v2 actually works, and how to identify which disks in your environment are worth migrating. In Part 2 we went through the snapshot-based migration itself, built a PowerShell script that handles the full workflow with a WhatIf mode, and looked at a Bicep pattern for new deployments with zone alignment done correctly from the start.

Now that the disks are running on Premium SSD v2, there is one more thing I want to address, because this is where I see most teams leave money on the table. When you create a Premium SSD v2 disk, you have to decide on IOPS and throughput values at that moment. Most people base that decision on what the old Premium SSD tier was delivering, which was itself a number that came from disk size, not from any measurement of what the workload actually needed. So you migrate, and you carry the same inflated numbers across to the new SKU.

In this part we will look at how to use real utilization data to right-size those values, how to write KQL queries that show you what the disk is actually doing, and how to set up Azure Monitor alerts that surface problems before they affect anything.

Your starting values are probably wrong

I want to be direct about this. When you migrated the disk in Part 2, the IOPS and throughput values you provisioned were inherited from a model that ties performance to storage size. A P40 disk gives you 7,500 IOPS because it is 2,048 GiB, not because your workload asks for 7,500 IOPS. Those are very different things.

Premium SSD v2 charges you for every IOPS above the free baseline of 3,000, and every MB/s above the free 125. So if you migrated a P40 and provisioned 7,500 IOPS to stay safe, you are paying for 4,500 IOPS above baseline every month. If the workload peaks at 2,500 IOPS on a busy day, that extra provisioning is pure waste.
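To make that arithmetic concrete, here is a minimal Python sketch of the "billable above baseline" calculation. The function name is my own for illustration, and it deliberately leaves out prices, which vary by region; it only counts the units you would be charged for.

```python
# Free baseline included with every Premium SSD v2 disk, as stated above.
BASELINE_IOPS = 3_000
BASELINE_MBPS = 125


def billable_above_baseline(provisioned: int, baseline: int) -> int:
    """Return the provisioned units that are charged, i.e. those above the free baseline."""
    return max(0, provisioned - baseline)


# A migrated P40 that kept its old 7,500 IOPS ceiling.
billable_iops = billable_above_baseline(7_500, BASELINE_IOPS)

# The example workload peaks at 2,500 IOPS, which fits inside the free baseline.
needed_billable_iops = billable_above_baseline(2_500, BASELINE_IOPS)

print(billable_iops)         # 4500
print(needed_billable_iops)  # 0
```

The gap between the two numbers is what the right-sizing exercise in this part is trying to recover.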

What I usually recommend is giving the workload one to two weeks to settle on the new disk, pulling the utilization data, and then adjusting. You do not need a maintenance window to change provisioned values on a Premium SSD v2 disk. The update applies immediately, the VM keeps running, and there is no disruption at all. That changes the calculus completely compared to the old model.

The three metrics that tell you what is actually happening

Azure Monitor exposes disk performance through the VM resource. The three metrics I look at first are:

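Data Disk IOPS Consumed Percentage: share of the provisioned IOPS the disk is using
Data Disk Bandwidth Consumed Percentage: share of the provisioned throughput the disk is using
Data Disk Read Latency E2E and Data Disk Write Latency E2E: how long reads and writes take, in milliseconds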

The percentage metrics are the most actionable ones. If Data Disk IOPS Consumed Percentage is sitting below 30 percent consistently, you have a lot more headroom than you need. If it is regularly pushing past 85 to 90 percent, you are close enough to the ceiling that spikes will hurt. The latency metrics tell you whether the disk is actually under pressure or just busy in a way the workload can tolerate.

KQL queries for disk utilization

The queries below run against AzureMetrics in a Log Analytics workspace. If you have diagnostic settings configured on the VM and forwarding to a workspace, this data will be there. Replace the VM name filter with whatever matches your environment.

IOPS consumed percentage, hourly over the past week:

AzureMetrics
| where TimeGenerated >= ago(7d)
| where MetricName == "Data Disk IOPS Consumed Percentage"
| where ResourceId contains "your-vm-name"
| summarize
    AvgConsumed = avg(Average),
    MaxConsumed = max(Maximum),
    P95Consumed = percentile(Average, 95)
    by bin(TimeGenerated, 1h), Resource
| order by TimeGenerated desc

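The P95 column is the number I base my decisions on. Note that the query calculates it per hourly bin; if you want a single P95 for the whole week, remove the bin from the by clause and summarize by Resource only. If P95 is at 40 percent, I know the provisioned ceiling is well above what normal operation requires. If P95 is at 85 percent and max is touching 100, the workload is regularly hitting the limit and provisioned IOPS should go up, not down.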
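If it is not obvious why P95 is a better basis than the average or the maximum, the following Python sketch may help. The sample values are made up, and the nearest-rank method used here is just one simple way to compute a percentile; KQL's percentile() may estimate it differently, so treat the exact figures as illustrative.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up hourly averages of IOPS consumed percentage: 19 ordinary hours and one spike.
samples = [20.0] * 10 + [35.0] * 9 + [98.0]

avg = sum(samples) / len(samples)
p95 = percentile_nearest_rank(samples, 95)
peak = max(samples)

print(f"avg={avg:.2f}")
print(p95)   # 35.0
print(peak)  # 98.0
```

The single spike dominates the maximum and barely moves the average, while P95 lands on the upper end of normal operation, which is the level you actually want to provision around.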

Throughput consumed percentage, daily over two weeks:

AzureMetrics
| where TimeGenerated >= ago(14d)
| where MetricName == "Data Disk Bandwidth Consumed Percentage"
| where ResourceId contains "your-vm-name"
| summarize
    AvgConsumed = avg(Average),
    MaxConsumed = max(Maximum),
    P95Consumed = percentile(Average, 95)
    by bin(TimeGenerated, 1d), Resource
| order by TimeGenerated desc

I use two weeks here rather than seven days because a lot of production workloads have weekly cycles. Batch jobs, reporting runs, end-of-week processing. If you only look at a week, you might catch one cycle. Two weeks gives you a more honest picture.

Read and write latency baseline:

AzureMetrics
| where TimeGenerated >= ago(7d)
| where MetricName in ("Data Disk Read Latency E2E", "Data Disk Write Latency E2E")
| where ResourceId contains "your-vm-name"
| summarize
    AvgLatencyMs = avg(Average),
    MaxLatencyMs = max(Maximum)
    by MetricName, bin(TimeGenerated, 1h)
| order by TimeGenerated desc

Premium SSD v2 is rated for sub-millisecond latency, but that is under controlled conditions. In practice, real workloads with mixed IO patterns will see higher numbers. What you are looking for is consistency, and whether any latency spikes line up with the moments when IOPS consumed percentage was climbing. If they do, that confirms you are provisioning-constrained. If latency spikes happen at low IOPS consumption, something else is going on.
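The correlation check described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the HourSample type, the sample rows, and the two thresholds (5 ms and 85 percent, borrowed from the figures used elsewhere in this article). In practice you would feed it the hourly rows exported from the queries.

```python
from dataclasses import dataclass


@dataclass
class HourSample:
    hour: str                 # label of the hourly bin
    iops_consumed_pct: float  # average IOPS consumed percentage in that hour
    latency_ms: float         # average latency in that hour, in milliseconds


def classify_latency_spikes(samples, latency_threshold_ms=5.0, iops_pressure_pct=85.0):
    """Split latency spikes into hours that coincide with IOPS pressure and hours that do not."""
    constrained, unexplained = [], []
    for s in samples:
        if s.latency_ms <= latency_threshold_ms:
            continue  # latency is within the threshold, so this hour is not a spike
        if s.iops_consumed_pct >= iops_pressure_pct:
            constrained.append(s.hour)
        else:
            unexplained.append(s.hour)
    return constrained, unexplained


# Made-up data: one quiet hour, one spike under IOPS pressure, one spike at low consumption.
rows = [
    HourSample("09:00", 42.0, 1.2),
    HourSample("10:00", 93.0, 7.5),
    HourSample("11:00", 25.0, 9.1),
]

constrained, unexplained = classify_latency_spikes(rows)
print(constrained)  # ['10:00']
print(unexplained)  # ['11:00']
```

Hours in the first list point at provisioning; hours in the second list are the ones where raising IOPS would not help and something else needs investigating.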

Collect these into an Azure Workbook

If you migrated more than a handful of disks, set up an Azure Workbook with these queries parameterised by VM name and time range. That gives you a single view across all your migrated disks without running queries individually for each one, which makes right-sizing decisions at scale much easier.

Adjusting provisioned IOPS and throughput

Once you have the data, updating the provisioned values on a running disk looks like this:

#Requires -Modules Az.Compute

[CmdletBinding()]
param (
    [Parameter(Mandatory = $true)]
    [string]$resourceGroupName,

    [Parameter(Mandatory = $true)]
    [string]$diskName,

    [Parameter(Mandatory = $true)]
    [int]$newIops,

    [Parameter(Mandatory = $true)]
    [int]$newThroughputMBps
)

# Stop on the first error so a failed lookup never leads to a blind update.
$ErrorActionPreference = 'Stop'

$disk = Get-AzDisk -ResourceGroupName $resourceGroupName -DiskName $diskName

# This script targets Premium SSD v2 only; bail out on any other SKU.
if ($disk.Sku.Name -ne 'PremiumV2_LRS') {
    throw "Disk '$diskName' is not a Premium SSD v2 disk. Current SKU: $($disk.Sku.Name)"
}

Write-Host 'Current provisioning:' -ForegroundColor Cyan
Write-Host "  IOPS       : $($disk.DiskIOPSReadWrite)" -ForegroundColor White
Write-Host "  Throughput : $($disk.DiskMBpsReadWrite) MB/s" -ForegroundColor White

# Build an update object that changes only the two performance values.
$diskUpdate = New-AzDiskUpdateConfig `
    -DiskIOPSReadWrite $newIops `
    -DiskMBpsReadWrite $newThroughputMBps

# Update-AzDisk returns the updated disk, so report what Azure actually applied
# rather than echoing the requested values.
$updatedDisk = Update-AzDisk -ResourceGroupName $resourceGroupName -DiskName $diskName -DiskUpdate $diskUpdate

Write-Host 'Updated provisioning:' -ForegroundColor Green
Write-Host "  IOPS       : $($updatedDisk.DiskIOPSReadWrite)" -ForegroundColor White
Write-Host "  Throughput : $($updatedDisk.DiskMBpsReadWrite) MB/s" -ForegroundColor White

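No deallocation, no detach, no maintenance window. This is one of the things that makes Premium SSD v2 fundamentally different from the old tier model, where changing performance meant changing disk size and potentially recreating the disk entirely. One limit to keep in mind: Azure only allows a small number of performance changes per disk within a 24-hour period (four, as far as I am aware at the time of writing), so decide on a target from the data and apply it once rather than tuning in many small steps.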

Do not cut to the average

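When you reduce provisioned IOPS, do not drop to your average consumed value. That leaves no room at all for spikes. What I do is provision at roughly 1.5 times the P95 value and revisit after another month. Because the queries report P95 as a percentage of the current ceiling, convert it to absolute IOPS first: a P95 of 40 percent on a disk provisioned at 7,500 IOPS is 3,000 IOPS, so the new target is about 4,500. You get meaningful savings and you still have headroom for the days when the workload runs harder than usual.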
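The 1.5 times P95 rule can be written down as a small Python helper. This is a sketch under my own assumptions: the function name is hypothetical, the P95 input is the consumed percentage reported by the queries, and the result is floored at the free 3,000 IOPS baseline because going lower would not save anything.

```python
import math

BASELINE_IOPS = 3_000  # free baseline; there is nothing to gain by targeting less


def recommended_iops(provisioned_iops: int, p95_consumed_pct: float, headroom: float = 1.5) -> int:
    """Convert a P95 consumed percentage into an absolute IOPS target with headroom."""
    p95_iops = provisioned_iops * p95_consumed_pct / 100
    target = math.ceil(p95_iops * headroom)
    return max(BASELINE_IOPS, target)


print(recommended_iops(7_500, 40))  # 4500
print(recommended_iops(7_500, 20))  # 3000
print(recommended_iops(7_500, 85))  # 9563
```

The third call shows the rule working in the other direction: when P95 is already high, the same formula tells you to provision more, not less. The same helper works for throughput if you swap in the 125 MB/s baseline.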

Two alerts worth having on every migrated disk

The first alert fires when the workload stays close to the IOPS ceiling for a sustained period. You want to know about this before it becomes a visible problem, not after.

IOPS consumed percentage alert:

Signal: Data Disk IOPS Consumed Percentage
Aggregation: Average
Operator: Greater than
Threshold: 80
Evaluation window: 15 minutes
Frequency: 5 minutes

A 15-minute window at 80 percent filters out brief spikes and catches sustained pressure. If this fires regularly, it is a signal that provisioned IOPS should go up.
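To see why averaging over a window filters out brief spikes, here is a simplified Python sketch of the evaluation. It is not how Azure Monitor is implemented: it assumes one sample per minute and checks every possible window position, whereas the real alert evaluates at the configured frequency. The function name and sample series are made up.

```python
def alert_fires(minute_values: list[float], window: int = 15, threshold: float = 80.0) -> bool:
    """Return True if the average of any `window` consecutive samples exceeds the threshold."""
    for start in range(0, len(minute_values) - window + 1):
        chunk = minute_values[start:start + window]
        if sum(chunk) / window > threshold:
            return True
    return False


# A two-minute spike to 100 percent inside an otherwise quiet 15 minutes.
brief_spike = [40.0] * 13 + [100.0] * 2

# Fifteen minutes held at 90 percent.
sustained = [90.0] * 15

print(alert_fires(brief_spike))  # False
print(alert_fires(sustained))    # True
```

The brief spike averages out to 48 percent across the window and stays silent, while the sustained load clears the 80 percent threshold and fires.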

The second alert watches for latency climbing above what Premium SSD v2 should be producing under normal conditions.

Write latency alert:

Signal: Data Disk Write Latency E2E
Aggregation: Average
Operator: Greater than
Threshold: 5 (milliseconds)
Evaluation window: 10 minutes
Frequency: 5 minutes

Sub-millisecond is the ideal, but a sustained average of 5 ms is the point where something needs a closer look. This alert gives you early warning before the application layer starts to feel it.

A cost governance view

After a few weeks of data, this query gives you a quick efficiency view across all monitored disks. It tells you which disks are underusing their provisioned IOPS and which are running close to the ceiling.

AzureMetrics
| where TimeGenerated >= ago(30d)
| where MetricName == "Data Disk IOPS Consumed Percentage"
| summarize
    AvgConsumedPct = avg(Average)
    by Resource
| extend
    Assessment = case(
        AvgConsumedPct < 30, "Over-provisioned - reduce IOPS",
        AvgConsumedPct < 60, "Monitor for another cycle",
        AvgConsumedPct < 85, "Provisioning looks healthy",
        "Near ceiling - consider increasing IOPS"
    )
| project Resource, AvgConsumedPct, Assessment
| order by AvgConsumedPct asc

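The disks in the first category are the ones to act on first. Treat the assessment as a triage list rather than a sizing input: it is based on the 30-day average, so check the P95 for each disk with the earlier queries before you change anything. In environments where you migrated ten or twenty disks, a few of those being significantly over-provisioned adds up quickly.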

Closing

This is the pattern I use in practice when I go through this migration with a team. The snapshot approach from Part 2 gets the disk onto the right SKU. The monitoring and right-sizing in this part is what makes the economics actually work out in your favour month over month.

Premium SSD v2 is not a set-it-and-forget-it decision. It rewards the kind of active management that the old size-based tier model never really allowed you to do. Now that you have the visibility tools in place, revisiting provisioned values periodically is a five-minute PowerShell operation rather than a major change.

If you run into something that does not fit the patterns described across these three parts, I would like to hear about it. These articles come from real environments and the edge cases are always the interesting ones.
