cuObject: RDMA Acceleration for S3-Compatible Object Storage

Overview
--------

This document explains how cuObject enables high-performance, zero-copy data transfers between GPU and/or host memory and S3-compatible object storage by leveraging RDMA (Remote Direct Memory Access) over InfiniBand or RoCE. This approach significantly accelerates object storage operations in on-premise and cloud environments where both client and server support RDMA. Currently, only Mellanox NICs with DC (Dynamically Connected) transport are supported.

RDMA Acceleration for Object Storage
------------------------------------

What is cuObject RDMA Mode?
- cuObject enables direct, zero-copy movement of data between GPU memory, host pinned memory, and S3-compatible object storage systems using RDMA for the data path.
- By eliminating unnecessary CPU involvement and memory copies, cuObject minimizes the overhead and bottlenecks seen in traditional object storage workflows.

Key Benefits:
- Zero-copy transfers between memory and object storage—bypassing the CPU for maximum efficiency.
- Extremely high performance and scalability across multiple NICs.
- Low latency, ideal for demanding workloads such as AI or HPC pipelines.
- GPU Direct Storage support via NVIDIA GPUDirect.
- Full support for both GPU (device) and pinned host (system) memory.
- Simple integration into applications using the standard S3 API with minimal code changes.

How it works:
- cuObject leverages two channels for object storage access:
  1. Control Path: Standard HTTP/S3 compatible traffic for operations, metadata, and authentication.
  2. Data Path: Parallel RDMA channel delivering actual object data, orchestrated with special RDMA tokens.

Architecture Flow:

  Client Application
      |
      +--> HTTP API/S3 Command ----> S3-Compatible Server
      |                               |
      +--> RDMA Token Exchange -------+
      |                               |
      +--> RDMA Data Transfer ------> Object Storage (Zero-Copy GPU/Host Memory)

RDMA Token System
-----------------

RDMA tokens are opaque binary descriptors (approximately 80–120 bytes) conveying all the details needed for zero-copy RDMA transfers between client and server.

Each token conveys:
- The client buffer address and length (supporting both device/GPU and pinned host memory)
- RDMA access permissions (read or write)
- The InfiniBand or RoCE connection parameters needed to set up the link (GID, QPN, LID, RKEY, address handles, etc.)

This information enables both sides to execute the zero-copy RDMA operation directly, bypassing the kernel and minimizing CPU intervention. Consult cuObjRDMADescrProtocolFormat.pdf for the full token field layout.

Transmission:
- Tokens are exchanged out of band via HTTP headers:
    * x-amz-rdma-token (for client requests)
    * x-amz-rdma-reply (for server responses)
    * x-amz-rdma-bytes-transferred (for statistics)

With these headers, RDMA transfers move data directly between client and server memory, whether CPU or GPU, bypassing the kernel for maximum throughput.

cuObjClient - Client Library
----------------------------

Overview:

cuObjClient is a client library for high-speed object storage I/O via RDMA, supporting both GPU and host memory. It hides the complexity of buffer registration and RDMA management behind intuitive callbacks.

Key Features:
- Dynamic Connection (DC) transport support
- Automatic buffer registration/deregistration (host, CUDA, or managed memory)
- Multi-threaded operation with per-thread offset handling
- User-defined operation callbacks for GET/PUT
- Seamless zero-copy for both CUDA and system pinned buffers

Core APIs:

Constructor:
  cuObjClient(CUObjOps_t& ops, cuObjProto_t proto=CUOBJ_PROTO_RDMA_DC_V1);
    - ops: user-defined callbacks for GET/PUT
    - proto: typically CUOBJ_PROTO_RDMA_DC_V1

Buffer Registration:
  cuObjErr_t cuMemObjGetDescriptor(void *ptr, size_t size);
    - Register a buffer (host-pinned, CUDA device, or managed)
    - Maximum 4 GiB per registration

Query Maximum Callback Size:
  ssize_t cuMemObjGetMaxRequestCallbackSize(void *ptr);

Object Operations:
  ssize_t cuObjGet(void *ctx, void *ptr, size_t size, loff_t offset=0, loff_t buf_offset=0);
  ssize_t cuObjPut(void *ctx, void *ptr, size_t size, loff_t offset=0, loff_t buf_offset=0);

    - ctx: user context for callbacks
    - ptr: registered memory
    - size: transfer length
    - offset: object offset
    - buf_offset: per-thread buffer offset

Buffer Deregistration:
  cuObjErr_t cuMemObjPutDescriptor(void *ptr);

Callback Interface:

Provide two function implementations and fill a CUObjIOOps struct:

  typedef struct CUObjIOOps {
      ssize_t (*get)(const void *handle, char *ptr, size_t size, loff_t offset, const cufileRDMAInfo_t*);
      ssize_t (*put)(const void *handle, const char *ptr, size_t size, loff_t offset, const cufileRDMAInfo_t*);
  } CUObjOps_t;

Callback Responsibilities:
- Receive RDMA descriptor info (cufileRDMAInfo_t)
- Trigger required metadata/object key actions to the remote storage
- Await server acknowledgement as required
- Return the total bytes transferred

Transport Selection:

Dynamic Connection Transport (default):
  auto client = new cuObjClient(ops, CUOBJ_PROTO_RDMA_DC_V1);
    - Ideal for concurrent clients
    - Makes use of shared DCIs for low overhead
    - /etc/cufile.json must provide valid RDMA device or IP details, e.g.:
        "rdma_dev_addr_list": ["mlx5_0", "mlx5_1"],
      or
        "rdma_dev_addr_list": ["192.168.100.2", "192.168.101.2"],
      For optimal GPU/host zero-copy, set "rdma_peer_type" to "dmabuf":
            {
                "rdma_dev_addr_list": ["mlx5_0"],
                "rdma_peer_type": "dmabuf",
                "rdma_load_balancing_policy": "RoundRobin"
            }

cuObjServer - Server Library
----------------------------

Overview:

cuObjServer implements the RDMA-accelerated side of object storage for S3-compatible services, supporting concurrency, automatic connections, and DC transport.

Key Features:
- Multi-threaded, channel-based concurrency
- Scatter-gather I/O (up to 10 entries)
- Asynchronous operation and polling
- Telemetry & stats built-in
- Supports both GPU/device and host/pinned memory objects

Core APIs:

Constructor:
  cuObjServer(const char *ip, unsigned short port, unsigned proto, cuObjRDMATunable params);
    - ip: RDMA device/interface
    - port: server RDMA TCP/UDP port
    - proto: transport type
    - params: tuning parameters

Buffer Management:
  void* allocHostBuffer(size_t size);
  struct rdma_buffer* registerBuffer(void *ptr, size_t size);
  void deRegisterBuffer(struct rdma_buffer *rdma_buff);
    - Registers either pinned host or CUDA device buffers

Object Operations:
  ssize_t handleGetObject(...);
  ssize_t handlePutObject(...);

    - GET: RDMA_WRITE to a client buffer
    - PUT: RDMA_READ from client buffer
    - channel: concurrency ID [0–127]
    - async_handle: required for async

Async Operations:
  int poll(cuObjAsyncEvent_t* events, int max_events, uint16_t channel=0);

Channel Management:
  uint16_t allocateChannelId();
  void freeChannelId(uint16_t);

cuObjRDMATunable - Server Tuning Parameters
-------------------------------------------

Tune RDMA operation with cuObjRDMATunable:

Server Parameters (defaults):

  - cq_depth:           640         (completion queue depth)
  - service_level:      0           (QoS)
  - timeout:            16          (QP timeout: 4.096 * 2^timeout us)
  - hop_limit:          4           (hop count)
  - pkey_index:         0           (partition key)
  - max_sge:            10          (scatter-gather limit)
  - delay_interval:     5000        (poll delay, ns)
  - delay_mode:         BATCH       (polling mode: NONE/BATCH/ENTRY/ADAPTIVE)
  - qp_reset_on_failure: true       (reset QP on failure)
  - retry_cnt:          7           (retry count)
  - traffic_class:      96          (DSCP/ECN)

DC-Specific Parameters (defaults):

  - num_dcis:           128
  - dc_key:             0xffeeddcc

Example usage:

  cuObjRDMATunable params;
  params.setCqDepth(512);              // Deeper completion queue
  params.setServiceLevel(1);
  params.setDelayMode(CUOBJ_DELAY_NONE);
  auto server = new cuObjServer("192.168.100.1", 18515, CUOBJ_PROTO_RDMA_DC_V1, params);


Example: cuObjClient (Client Side) and cuObjServer (Server Side)
----------------------------------------------------------------

Client Example
--------------

Step 1: Implement I/O Callbacks

  ssize_t ObjectGetCallback(const void *handle, char* buf, size_t size, loff_t offset, const cufileRDMAInfo_t *infop) {
      // For a GET operation, issue an S3-compatible request to your backend including:
      //   - buf:     buffer to fill (host or device pointer)
      //   - size:    number of bytes to fetch
      //   - offset:  object offset
      //   - infop:   RDMA descriptor required for zero-copy
      // Backend should fill 'buf'.
      return size;
  }

  ssize_t ObjectPutCallback(const void *handle, const char* buf, size_t size, loff_t offset, const cufileRDMAInfo_t *infop) {
      // For a PUT operation, issue an S3-compatible upload request with:
      //   - buf:     buffer holding data to store (host or device memory)
      //   - size:    number of bytes to write
      //   - offset:  object offset
      //   - infop:   RDMA descriptor required for zero-copy
      // Backend should store the contents of 'buf'.
      return size;
  }

  CUObjIOOps ops = {
      .get = ObjectGetCallback,
      .put = ObjectPutCallback
  };

Step 2: Create the Client

  cuObjClient* client = new cuObjClient(ops, CUOBJ_PROTO_RDMA_DC_V1);

  // To use GPU memory:
  void *gpubuf;
  cudaMalloc(&gpubuf, 4 * 1024 * 1024);
  client->cuMemObjGetDescriptor(gpubuf, 4 * 1024 * 1024);

  // Or pinned host memory:
  void *hostbuf;
  cudaMallocHost(&hostbuf, 4 * 1024 * 1024); // allocates page-locked (pinned) host memory
  client->cuMemObjGetDescriptor(hostbuf, 4 * 1024 * 1024);

  my_ctx_t ctx = {}; // Fill as needed for context

  // PUT (works for any registered host or GPU memory):
  client->cuObjPut(&ctx, gpubuf, 4 * 1024 * 1024, 0, 0);
  client->cuObjPut(&ctx, hostbuf, 4 * 1024 * 1024, 0, 0);

  // GET:
  client->cuObjGet(&ctx, gpubuf, 4 * 1024 * 1024, 0, 0);
  client->cuObjGet(&ctx, hostbuf, 4 * 1024 * 1024, 0, 0);

  // When finished, deregister buffers and release resources:
  client->cuMemObjPutDescriptor(gpubuf);
  client->cuMemObjPutDescriptor(hostbuf);
  cudaFree(gpubuf);
  // (free hostbuf with the deallocator matching how it was allocated)
  delete client;

Server Example
--------------

Step 1: Launch the Server

  cuObjRDMATunable params;
  params.setNumDcis(128);
  params.setCqDepth(640);

  cuObjServer* server = new cuObjServer("0.0.0.0", 18515, CUOBJ_PROTO_RDMA_DC_V1, params);

  // Register a host memory buffer for use:
  void* buffer = server->allocHostBuffer(4 * 1024 * 1024); // Page-aligned, 4MB
  struct rdma_buffer* rdma_handle = server->registerBuffer(buffer, 4 * 1024 * 1024);
  // 'rdma_handle' can be used for zero-copy RDMA I/O

  // Server now processes requests.

Step 2: Handle Operations

  // Upon receiving a request with an RDMA token via S3-compatible HTTP API,
  // check the operation (CUOBJ_OP_PUT/CUOBJ_OP_GET):
  //
  // Pseudocode:
  // if (operation_type == CUOBJ_OP_PUT) {
  //     // Store client (host or GPU) data
  //     ssize_t bytes_written = handlePutObject(rdma_handle, buf, size, offset, infop);
  // }
  // else if (operation_type == CUOBJ_OP_GET) {
  //     // Fetch data to client (host or GPU) buffer
  //     ssize_t bytes_read = handleGetObject(rdma_handle, buf, size, offset, infop);
  // }
  //
  // No extra data motion logic needed. cuObjServer performs the RDMA transfer and returns operation status.
  // RDMA completions and notifications are handled internally.

Protocol Flow
-------------

1. Client invokes cuObjPut or cuObjGet, submitting an RDMA buffer descriptor (host or GPU).
2. Server receives the control request and calls your registered callbacks (get/put).
3. RDMA data movement is performed automatically by cuObjServer.
4. Results are reported through the cuObject API.

Tips
----
- Always register buffers before RDMA I/O; both host-pinned and GPU memory are supported.
- For parallel workloads, use separate buffer offsets or threads.
- Tune cuObjRDMATunable (CQ depth, DCIs, etc.) for best throughput and latency.
- For advanced/async use-cases, consult the full cuObject API reference.

