High-level Computation

Overview

In this tutorial, you learn:

  • High level Xarray computations using CuPy arrays.

  • Applying custom kernels to DataArray with CuPy and NumPy

Prerequisites

Concepts

Importance

Notes

Familiarity with NumPy

Necessary

Basics of Cupy

Necessary

Familiarity with Xarray

Necessary

  • Time to learn: 40 minutes

Introduction

In the previous tutorial, we introduced the powerful combination of Xarray and CuPy for handling multi-dimensional datasets and leveraging GPU acceleration to significantly improve performance.

In this tutorial, we are going to explore high-level Xarray functions such as groupby, rolling mean, and weighted mean, and compared their execution times with traditional NumPy-based implementations.

High-level Xarray Functions: CuPy vs. NumPy

In this tutorial, we’ll explore the performance differences between high-level Xarray functions using CuPy and NumPy. CuPy is a GPU-based NumPy-compatible library, while NumPy is the well-known CPU-based library for numerical computations. We’ll focus on three high-level functions: groupby, rolling mean, and weighted mean. We’ll also compare the time it takes to execute each function using both CuPy and NumPy. Let’s create some sample data to work with.

We’ll use a 3-dimensional dataset (time, latitude, longitude) with random values:

import time
import numpy as np
import pandas as pd
import xarray as xr
import cupy as cp
import cupy_xarray  # Adds .cupy to Xarray objects
np.random.seed(0)

# Create the time range.
date = pd.date_range("2010-01-01", "2020-12-31", freq="M")

# Create the latitude range.
lat = np.arange(-90, 90, 1)

# Create the longitude range.
lon = np.arange(-180, 180, 1)

# Create random data
data_np = np.random.rand(len(date), len(lat), len(lon))
data_cp = cp.array(data_np)

# -- Create DataArray with Numpy data
data_xr_np = xr.DataArray(
    data_np,
    dims=["time", "lat", "lon"],
    coords=[date, lat, lon],
)

# -- Create DataArray with CuPy data
data_xr_cp = xr.DataArray(
    data_cp,
    dims=["time", "lat", "lon"],
    coords=[date, lat, lon],
)

Groupby

The groupby function is used to group data based on one or more dimensions. Here, we’ll group our data by the season in the time dimension using both CuPy and NumPy:

start_time_np = time.time()

grouped_data_np = data_xr_np.groupby("time.season")
mean_np = grouped_data_np.mean()

end_time_np = time.time()
time_np = end_time_np - start_time_np

The data type of data in grouped_data_np is numpy.ndarray.

[type(arr.data) for group, arr in grouped_data_np]
[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]
start_time_cp = time.time()

grouped_data_cp = data_xr_cp.groupby("time.season")
mean_cp = grouped_data_cp.mean()

end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp

print(
    "GroupBy with Xarray DataArrays using CuPy provides a",
    round(time_np / time_cp, 2),
    "x speedup over NumPy.\n",
)
GroupBy with Xarray DataArrays using CuPy provides a 79.83 x speedup over NumPy.

What about the CuPy arrays? Does it preserve the array type?

[type(arr.data) for group, arr in grouped_data_cp]
[cupy.ndarray, cupy.ndarray, cupy.ndarray, cupy.ndarray]

What about different sizes of arrays? How does the performance compare then?

The example above showed a 1 degree DataArray. What if we increase the data size to 0.5 degree?

# Create the latitude range.
lat = np.arange(-90, 90, 0.5)

# Create the longitude range.
lon = np.arange(-180, 180, 0.5)

# Create random data
data_np = np.random.rand(len(date), len(lat), len(lon))
data_cp = cp.array(data_np)

# -- Create DataArray with Numpy data
data_xr_np = xr.DataArray(
    data_np,
    dims=["time", "lat", "lon"],
    coords=[date, lat, lon],
)

# -- Create DataArray with CuPy data
data_xr_cp = xr.DataArray(
    data_cp,
    dims=["time", "lat", "lon"],
    coords=[date, lat, lon],
)
start_time_np = time.time()

grouped_data_np = data_xr_np.groupby("time.season").mean()
mean_np = grouped_data_np.mean()

end_time_np = time.time()
time_np = end_time_np - start_time_np
start_time_cp = time.time()

grouped_data_cp = data_xr_cp.groupby("time.season").mean()
mean_cp = grouped_data_cp.mean()

end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp

print(
    "GroupBy with Xarray DataArrays using CuPy provides a",
    round(time_np / time_cp, 2),
    "x speedup over NumPy.\n",
)
GroupBy with Xarray DataArrays using CuPy provides a 89.87 x speedup over NumPy.

Attention

Is this consistent with what you have learned in the previous modules? What if we test a very low resolution dataset?

Rolling Mean

The rolling() method is available in DataArray objects, providing support for rolling window aggregation. This feature allows for the computation of aggregated values over a sliding window of data points within the array.

A rolling window refers to a fixed-size window that moves sequentially across the data, calculating aggregated statistics or applying functions to the values within each window. This capability is particularly useful for analyzing time series or spatial data, where examining data within a specific window can reveal patterns, trends, or relationships.

The rolling mean is a widely used technique for smoothing data over a specified window.

In the example below, we calculate the rolling mean along the 'time' dimension with a window size of 10:

xr.set_options(use_bottleneck=False)
<xarray.core.options.set_options at 0x2ad1d0ccdac0>
start_time_np = time.time()

rolling_mean_np = data_xr_np.rolling(time=10).mean()

end_time_np = time.time()
time_np = end_time_np - start_time_np
start_time_cp = time.time()

rolling_mean_cp = data_xr_cp.rolling(time=10).mean()

end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
    "Rolling mean with Xarray DataArrays using CuPy provides a",
    round(time_np / time_cp, 2),
    "x speedup over NumPy.\n",
)
Rolling mean with Xarray DataArrays using CuPy provides a 30.22 x speedup over NumPy.

Weighted Array Reductions

Weighted array reductions in Xarray empower users with the ability to perform aggregations on multidimensional arrays while considering the weights assigned to each element. They currently support weighted sum, mean, std, var and quantile. By default, aggregation results in Xarray’s rolling window operations are assigned the coordinate at the end of each window. However, it is possible to center the results by specifying center=True when creating the Rolling object.

For example, the weighted mean is another way to smooth data, taking into account the varying importance of each data point.

Here, we’ll use a uniform weight along the time dimension:

start_time_np = time.time()

weights_np = xr.DataArray(np.ones_like(data_np), dims=["time", "lat", "lon"])
weighted_mean_np = data_xr_np.weighted(weights_np).mean(dim="time")

end_time_np = time.time()
time_np = end_time_np - start_time_np
start_time_cp = time.time()

weights_cp = xr.DataArray(cp.ones_like(data_cp), dims=["time", "lat", "lon"])
weighted_mean_cp = data_xr_cp.weighted(weights_cp).mean(dim="time")

end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
    "Weighted mean with Xarray DataArrays using CuPy provides a",
    round(time_np / time_cp, 2),
    "x speedup over NumPy.\n",
)
Weighted mean with Xarray DataArrays using CuPy provides a 13.32 x speedup over NumPy.

Similarly we can calculate weighted sum or weighted quantile, etc. To learn more about weighted array reduction, please see the user guide.

Coarsen large arrays

In Xarray, the coarsen operation is a powerful tool for downsampling or reducing the size of large arrays. When dealing with large datasets, coarsening allows for efficient summarization of data by aggregating multiple values into a single value within a defined coarsening window. This process is particularly useful when working with high-resolution or fine-grained data, as it enables the transformation of large arrays into smaller ones while preserving the overall structure and key characteristics of the data.

In order to take a block mean for every 3 days along time dimension and every 2 points along lat and lon, we can use the following:

start_time_np = time.time()

coarsen_np = data_xr_np.coarsen(time=3, lat=2, lon=2).mean()

end_time_np = time.time()
time_np = end_time_np - start_time_np

coarsen also works in similar fashion when using CuPy:

start_time_cp = time.time()

coarsen_cp = data_xr_cp.coarsen(time=3, lat=2, lon=2).mean()

end_time_cp = time.time()
time_cp = end_time_cp - start_time_cp
print(
    "Coarsen with Xarray DataArrays using CuPy provides a",
    round(time_np / time_cp, 2),
    "x speedup over NumPy.\n",
)
Coarsen with Xarray DataArrays using CuPy provides a 443.79 x speedup over NumPy.

Summary

In this notebook, we have learned about:

  • High level Xarray computations using CuPy arrays.

  • Applying custom kernels to DataArray with CuPy and NumPy