3.28
The latest AIStore release, version 3.28, arrives nearly three months after the previous one. As always, v3.28 maintains backward compatibility and is expected to upgrade cleanly from earlier versions.
This release delivers significantly improved ETL offload with a newly added WebSocket communicator and optimized data flows between ETL pods and AIS targets in a Kubernetes cluster.
For Python users, we added resilient retry logic that maintains seamless connectivity during lifecycle events, a capability that can be critical when running multi-hour training workloads. We've also improved the `JobStats` and `JobSnapshot` models, added `MultiJobSnapshot`, extended and fixed URL encoding, and added a `props` accessor method to the Object class.
The Python SDK's ETL support has been extended with a new ETL server framework that provides three Python-based web server implementations: `FastAPI`, `Flask`, and `HTTPMultiThreadedServer`.
Separately, 3.28 adds a dual-layer rate-limiting capability with configurable support for both frontend (client-facing) and backend (cloud-facing, adaptive) operation.
On the CLI side, there are multiple usability improvements, listed below. The list-objects operation (`ais ls`) has been further improved, and inline help and CLI documentation have been amended and expanded. The `ais show job` command now displays cluster-wide object and byte totals for distributed jobs.
Enhancements to observability are also detailed below and include new metrics to track rate-limited operations and extended job statistics. Most of the supported jobs will now report a j-w-f metric: number of mountpath joggers, number of (user-specified) workers, and a work-channel-full count.
Other improvements include a new (and faster) content checksum, fast URL parsing (for the Go API), optimized buffer allocation for multi-object operations and ETL, and support for Unicode and special characters in object names. We've refactored and micro-optimized numerous components and amended numerous docs, including the main readme and overview.
Last but not least, for better networking parallelism, we now support multiple long-lived peer-to-peer connections. The number of connections is configurable, and the supported batch jobs include distributed sort, erasure coding, multi-object and bucket-to-bucket copy, ETL, global rebalance, and more.
Table of Contents
- Configuration Changes
- New Default Checksum
- Rate Limiting
- ETL
- API Enhancements; Batch Jobs
- AWS S3
- CLI
- Observability
- Python SDK
- Benchmarking Tools
- Build and CI/CD
- Miscellaneous Improvements
Assorted commits for each section are also included below with detailed changelog available at this link.
Configuration Changes
We made several additions to global (cluster-wide) and bucket configuration settings.
Multiple xactions (jobs) now universally include a standard configuration triplet that provides for:
- In-flight compression
- Minimum size of work channel(s)
- Number of peer-to-peer TCP connections (referred to as the *stream bundle multiplier*)
The following jobs are now separately configurable at the cluster level:
- `EC` (Erasure Coding)
- `Dsort` (Distributed Shuffle)
- `Rebalance` (Global Rebalance)
- `TCB` (Bucket-to-Bucket Copy/Transform)
- `TCO` (Multi-Object Copy/Transform)
- `Archive` (Multi-Object Archiving or Sharding)
In addition, EC is also configurable on a per-bucket basis, allowing for further fine-tuning.
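For illustration, such a per-job triplet might appear in the cluster configuration roughly as follows. This is a sketch only: the key names (`compression`, `burst_buffer`, `sbundle_mult`) are assumptions inferred from the description above, not verified against the shipped schema.

```json
{
  "tcb": {
    "compression": "never",
    "burst_buffer": 512,
    "sbundle_mult": 4
  }
}
```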
Commit Highlights
- 15cf1ca: Add backward compatible config and BMD changes.
- fc3d8f3: Add cluster config (tco, arch) sections and tcb burst.
- 7b46f0c: Update configuration part two.
- 8c49b6b: Add rate-limiting sections to global configuration and BMD (bucket metadata).
- 15d4ed5: [config change] and [BMD change]: the following jobs now universally support the `XactConf` triplet.
New Default Checksum
AIStore 3.28 adds a new default content checksum. While still using xxhash, it uses a different implementation that delivers better performance in large-size streaming scenarios.
The system now makes a clear internal delineation between classic xxhash for system metadata (node IDs, bucket names, object metadata, etc.) and `cespare/xxhash` (designated as "xxhash2" in configuration) for user data. All newly created buckets now use "xxhash2" by default.
Benchmark tests show improved performance with the new implementation, especially for large objects and streaming operations.
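AIStore's xxhash2 implementation lives on the Go side; as a language-neutral illustration of why streaming-friendly checksums matter for large objects, here is a minimal Python sketch (using stdlib `hashlib` with blake2b, since xxhash is not in the standard library). Hashing chunk-by-chunk yields the same digest as hashing the whole payload, without ever buffering the full object:

```python
import hashlib

def stream_checksum(chunks, algo: str = "blake2b") -> str:
    """Hash a stream chunk-by-chunk so large objects are never fully buffered."""
    h = hashlib.new(algo)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

data = b"x" * (1 << 20)  # 1 MiB payload
chunked = stream_checksum(data[i:i + 65536] for i in range(0, len(data), 65536))
whole = hashlib.blake2b(data).hexdigest()
```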
Commit Highlights
- a045b21: Implement new content checksum.
- 7b69dc5: Update modules for new content checksum.
- 9fa9265: Refine cespare vs one-of-one implementation.
- d630c1f: Add cespare to hash micro-benchmark.
Rate Limiting
Version 3.28 introduces rate-limiting capability that operates at both frontend (client-facing) and backend (cloud-facing) layers.
On the frontend, each AIS proxy enforces configurable limits with burst capacity allowance. You can set different limits for each bucket based on its usage patterns, with separate configurations for GET, PUT, and DELETE operations.
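A token bucket is the classic way to combine a steady rate with a burst allowance. The sketch below illustrates the idea in stdlib Python; it is not AIStore's implementation, just the general mechanism behind per-bucket frontend limits:

```python
import time

class TokenBucket:
    """Minimal token bucket: a steady refill rate plus a burst allowance."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

tb = TokenBucket(rate_per_sec=1, burst=5)
results = [tb.allow() for _ in range(6)]  # burst of 5 admitted, 6th throttled
```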
For backend operations, the system implements an adaptive rate shaping mechanism that dynamically adjusts request rates based on cloud provider responses. This approach prevents hitting cloud provider limits proactively and implements exponential backoff when 429/503 responses are received. The implementation ensures zero overhead when rate limiting is disabled.
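Exponential backoff with jitter is the standard response to 429/503 throttling replies. As a hedged sketch of the general technique (not AIStore's exact schedule):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 0.5, cap: float = 30.0):
    """Exponential backoff with 'full jitter': each retry sleeps a random
    amount up to base * 2**attempt, capped at `cap` seconds."""
    for attempt in range(max_retries):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))

delays = list(backoff_delays())
```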
Configuration follows a hierarchical model with cluster-wide defaults that can be overridden per bucket. You can adjust intervals, token counts, burst sizes, and retry policies without service disruption.
Commit Highlights
- e71c2b1: Implemented frontend/backend dual-layer rate limiting system.
- 9f4d321: Added per-bucket overrides and exponential backoff for cloud 429 errors.
- 12e5787: Not rate-limiting remote bucket with no props.
- be309fd: docs: add rate limiting readme and blog.
- b011945: Rate-limited backend: complete transition.
- fcba62b: Rate-limit: add stats; prometheus metrics.
- 8ee8b44: Rate-limited backend; context to propagate vlabs; prefetch.
- c4b796a: Enable/disable rate-limited backends.
- 666796f: Core: rate-limited backends (major update).
ETL
ETL (Extract, Transform, Load) is a cornerstone feature designed to execute transformations close to the data with an extremely high level of node-level parallelism across all nodes in the AIS cluster.
WebSocket
Version 3.28 adds WebSocket (`ws://`) as yet another fundamental communication mechanism between AIS nodes and ETL containers, complementing the existing HTTP and IO (STDIN/STDOUT) communications.
The WebSocket implementation supports multiple concurrent connections per transform session, preserves message order and boundaries for reliable communication, and provides stateful session management for long-lived, per-xaction sessions.
Direct PUT
The release implements a new direct PUT capability for ETL transformations that optimizes the data flow between components. Traditionally, data would flow from a source AIS target to an ETL container, back to the source AIS target, and finally to the destination target. With direct PUT, data flows directly from the ETL container to the destination AIS target.
Stress tests show 3x to 5x performance improvement with direct PUT enabled. This capability is available across all communication mechanisms (HTTP, IO, WebSocket) and ETL containers can detect direct PUT capability through environment variables.
Pod Lifecycle
The ETL framework now implements structured lifecycle transitions between Initializing, Running, and Stopped states with automated cleanup of Kubernetes resources when ETL enters the Stopped state. We've enhanced error capture and reporting from pods, including initialization failures, and added the ability to restart previously stopped ETL instances without recreating them.
ETL Framework
Version 3.28 introduces a reusable server interface for implementing transformations in Go and adds extensible base classes for developing ETL servers in Python. These include a multi-threaded server based on BaseHTTPRequestHandler, a Flask-based implementation for synchronous processing, and a FastAPI-based implementation for asynchronous processing.
The release refactors ETL runtime logic from transaction handlers to xactions for better state control, adds detailed error information gathering from pod logs during failures, and includes configurable timeout options for ETL operations.
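The SDK's Python base classes ship with the SDK itself; as a dependency-free illustration of the `BaseHTTPRequestHandler`-style multi-threaded pattern mentioned above, the stdlib-only sketch below accepts a PUT, applies a toy transform, and returns the result. The upper-casing transform and the endpoint are illustrative assumptions, not the SDK's API:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def transform(data: bytes) -> bytes:
    # Illustrative transform: upper-case the payload.
    return data.upper()

class TransformHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        out = transform(body)
        self.send_response(200)
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):  # keep the sketch quiet
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), TransformHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/", data=b"hello", method="PUT")
with urllib.request.urlopen(req) as resp:
    result = resp.read()
server.shutdown()
```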
Commit Highlights
- 7c42a3d: Support inline transform for websocket communicator.
- d85f26c: Introduce generic Go ETL webserver framework.
- de0df5c: Support multiple websocket connections per tcb/tco job.
- e1e4935: Implement `FastAPIServer` base class for async ETL processing.
- ab99308: Implement `FlaskServer` for ETL transformations.
- af4718e: Enhance error handling for unexpected pod failure.
- fa04c09: ETL pod lifecycle: basic state transitions.
- e176c5e: Add stats for offline transform; refactor GET/PUT request flows
API Enhancements; Batch Jobs
AIStore 3.28 adds non-recursive (batch) operation capability and introduces a `num-workers` parallelism parameter for copy/transform/evict/delete bucket and multi-object operations. We've also added support for graceful error handling during batch operations with the `continue-on-error` parameter.
The release implements common logic to compute optimal worker counts based on current load and improves progress reporting with sentinel opcodes for intra-cluster synchronization during progress, finish, and abort operations. These enhancements have been standardized across all multi-object operations, including copy, prefetch, transform, delete, evict, and archive.
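The exact worker-count formula is internal to AIStore; the commit log does note that the count is weighed against the number of mountpaths (disks). A hypothetical sketch of such a clamp, purely for illustration:

```python
def effective_workers(requested: int, num_mountpaths: int, load_cap: int) -> int:
    """Hypothetical clamp: honor the requested `num-workers` where possible,
    but never run fewer workers than mountpaths (disks) nor exceed a load cap."""
    return min(max(requested, num_mountpaths), load_cap)
```

For example, 4 requested workers on a 10-disk target would be bumped to 10, while an over-ask gets trimmed to the cap.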
Commit Highlights
- 72c0b8c: Add non-recursive option to multi-object archive operations.
- fd3d6d8: Unify delete/evict with other multi-object APIs; support non-recursion.
- 657f42d: Add 'num-workers' parallelism to copy/transform bucket operations.
- 6c171cd: Add 'omitempty' to API JSON structures; bump versions.
- 750eb3a: Go-based API: add common 'get-node' function, reduce code.
- 373445c: Go-based API: add common 'membership' function, reduce code.
- a592f87: Multi-object copy/transform: when targets run different UUIDs (corner).
- 0b3b34a: [API change] copy, transform, prefetch jobs: non-recursive operation.
- a419869: Copy/transform bucket: add 'channel full' count and log.
- e5ec51a: Multi-object copy/transform: op-code 'abort'; refactoring.
- fa04c09: Copy/transform bucket: 'num-workers' vs number of mountpaths (disks).
- 717fd3a: Multi-object: archive, copy, and transform (major update).
- 35e5bbc: Copy/transform bucket: add 'num-workers' parallelism.
- 657f42d: [API change] copy/transform bucket: add 'num-workers' parallelism.
- 8434a74: Copy/transform: refactor control msg-s; `num-workers`, `continue-on-err`.
- ad14a2a: Multi-object copy/transform: more reasons to abort.
- 71becff: Copy/transform: sentinels to synchronize finishing and aborting.
AWS S3
The AWS S3 backend has been enhanced with a configurable multipart-upload threshold. The new `multipart_size` bucket property allows users to set custom thresholds for when to use multipart uploads. The property supports a value of `-1` to completely disable multipart uploads and accepts human-readable size formats (KB, MB, GB).
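A parser for such values is straightforward; the function name below is a hypothetical illustration of the accepted formats, not AIStore code:

```python
_UNITS = {"GB": 1024 ** 3, "MB": 1024 ** 2, "KB": 1024}

def parse_multipart_size(value: str) -> int:
    """Parse '-1' (disable multipart) or human-readable sizes into bytes."""
    value = value.strip().upper()
    if value == "-1":
        return -1
    for suffix, mult in _UNITS.items():
        if value.endswith(suffix):
            return int(float(value[:-len(suffix)]) * mult)
    return int(value)  # plain byte count
```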
We've extended the AWS S3 bucket configuration options to include:
- `extra.aws.cloud_region`
- `extra.aws.endpoint`
- `extra.aws.max_pagesize`
- `extra.aws.multipart_size`
- `extra.aws.profile`
The release adds support for the S3 `list-object-versions` feature and enhances classification and handling of S3-specific errors with the introduction of an `err-remote-retriable` type for better AWS error management. We've also optimized retry logic for transient AWS issues.
Performance optimizations include improved S3 connectivity and transfer efficiency, enhanced presigned HEAD request optimization in GET context, and improved AWS request checksum validation handling.
Commit Highlights
- 4f30cde: Enable custom multipart threshold via `multipart_size` bucket property.
- e6a1bd9: Introduce `err-remote-retriable` and optimized 503 retry logic.
- 91c86b6: S3 compatibility: add missing XML tags (for consistency).
CLI
The command-line interface has been enhanced in v3.28 with numerous usability improvements.
The `ais show job` command now displays cluster-wide object and byte totals for distributed jobs, including summaries with checkmarks (✓) for prefetch, copy, transform, EC, and mirror operations. We've added a command to list all supported job types and improved the display of joggers, workers, and parallelism in job outputs.
The `ais ls` command, when run with the `--paged` option, now displays page numbers and shows in-cluster ("cached") object counts.
In-cluster vs remote content comparison has been enhanced with improved `ais ls --diff`.
Administrative commands have also been improved and fixed, including the `log` and `cluster download-logs` commands, and `cluster set-primary`.
We've also added a new admin API and CLI command to drop (discard) the in-memory object metadata cache.
Commit Highlights
- 0433042: Enhance 'ais show job' to display cluster-wide totals.
- 31a869c: Update list-objects to show page number and cached object count.
- a58881a: Improve 'ais ls --help' documentation.
- 8b41992: Enhance 'log get' and 'download-log' commands with better help.
- 1170fd3: Improve 'ais scrub' to generate detailed logs with relevant columns.
- ee754f5: Add admin API and CLI to drop in-memory object metadata cache
- a7354f0: Fix 'set-primary NODE_NAME' (inconsistency with 'NODE_ID').
- fb07d4a: Fix listing EC (erasure-coding) jobs.
- 108930e: Command 'ais show job' to list all supported jobs (names).
Observability
AIStore 3.28 adds new metrics to track rate-limited operations, including `err.rate.retry.n` to count rate-limited errors and `rate.retry.ns.total` to track total delay time due to rate limiting.
We've enhanced monitoring of interactions with remote storage by adding metrics for remote GET operations including count, size, and latency.
Enhancements also include new metrics for the j-w-f triplet (number of mountpath joggers, number of (user-specified) workers, and a work-channel-full count) when running batch jobs.
We've improved performance monitoring with enhanced `ais performance latency` and `ais performance throughput` commands.
Commit Highlights
- 5c6b4c7: Running jobs to report j-w-f: number of mountpath joggers, number of workers, and channel-full counter.
- 1b0c39c: Implement j-w-f parallelism.
- 37c2434: Add reporting of number of joggers, workers, and channel-full metrics.
- dde4e8b: New feature flag to include (bucket, xaction) Prometheus variable labels with every GET and PUT transaction.
- 9794f4f: Track all remote read operations with improved metrics.
- 72275f7: Add 'xkind' variable label to remote GET metrics.
- 118a821: Remove 'xid' (job ID) from Prometheus labels.
Python SDK
AIStore 3.28 includes Python SDK v1.13.7, a substantial update with numerous enhancements.
The SDK now implements unified retry configuration for both HTTP and network failures, with separate handling for HTTP status-based retries vs. connection failures. We've updated the retry strategy for `ConnectTimeout`, `RequestsConnectionError`, and `ReadTimeout`, and implemented graceful recovery from transient network issues.
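The SDK's actual configuration knobs are documented with the SDK itself; as a stdlib-only sketch of the underlying distinction between connection-level failures (retry the transport) and status-based retries (retry on 429/5xx), consider the following. `call` here is a hypothetical stand-in for issuing one HTTP request:

```python
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def with_retries(call, max_attempts: int = 3, delay: float = 0.01):
    """Retry `call` on connection errors and retryable HTTP statuses.
    `call` returns an int status code or raises ConnectionError."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = call()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # connection failures exhausted: surface the error
        else:
            if status not in RETRYABLE_STATUSES:
                return status  # success (or non-retryable status): stop here
        time.sleep(delay * attempt)  # linear backoff, for the sketch
    return status  # retries exhausted; surface the last retryable status

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated transient network failure")
    return 200

status = with_retries(flaky)
```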
We've improved the `JobStats` and `JobSnapshot` models with additional fields, including a `glob_id` field and an `unpack()` method to decode worker metrics, and introduced a `MultiJobSnapshot` model for detailed job information.
URL handling has been fixed with object-name URL encoding in request URLs. Object properties have been enhanced with `props_cached` and `props` accessor methods on the Object class: `props` forces a refresh of object properties via HEAD request on every access, while `props_cached` returns cached properties without network calls.
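The cached-vs-refreshing accessor pattern is easy to see in isolation. The sketch below simulates the HEAD request with a supplied callable; `ObjectSketch` and `head` are illustrative names, not the SDK's API:

```python
class ObjectSketch:
    """Sketch of the `props` vs `props_cached` pattern: `props` re-fetches
    on every access, `props_cached` returns whatever is already on hand."""
    def __init__(self, fetch_head):
        self._fetch_head = fetch_head  # stand-in for the HEAD request
        self._props = None

    @property
    def props(self):
        self._props = self._fetch_head()  # refresh via (simulated) HEAD
        return self._props

    @property
    def props_cached(self):
        return self._props                # no network call

calls = {"n": 0}
def head():
    calls["n"] += 1
    return {"size": 1024, "fetches": calls["n"]}

obj = ObjectSketch(head)
first = obj.props           # triggers fetch #1
cached = obj.props_cached   # returns the cached copy, no new fetch
```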
The ETL support has been extended with an extensible ETL server framework offering multiple server implementations, direct PUT support in HTTP and WebSocket communicators, serialization of ETL arguments as query parameters, the ability to override configured timeouts for ETL requests, and improved WebSocket communication for high-throughput processing.
Commit Highlights
- 30cdc3f: Release Python SDK v1.13.6.
- 568e6f5: Implement unified retry configuration for HTTP and network failures.
- 2df6f4d: Improve RequestClient connection retry strategy.
- b1a8e29: Add `props_cached` and `props` accessor methods to Object class.
- 5b46ac3: Add URL encoding support for object names in request URLs.
- 55dd49a: Implement `details` method to retrieve detailed job information.
- ff78492: Add integration tests for 'Streaming-Cold-GET' feature
- bc6931f: Support 'num_workers' in bucket-to-bucket copy and transform.
- b4eed70: Support Direct PUT in etl http webserver.
- aaa45f1: Support Direct PUT in etl flask webserver.
- be8bc92: Update ObjectFile stress tests and examples.
- 950eede: Amend error handling (in re: for updated object naming).
- 1cc3df4: Support URL encoding of object names in request URLs.
Benchmarking Tools
AIStore 3.28 includes enhancements to both `aisloader` (Go-based) and `pyaisloader` (Python-based) benchmarking tools.
The `aisloader` now supports a `--pctupdate` option to simulate workloads with object updates by issuing a GET followed by a PUT on the same object. This allows testing the write-update patterns common in certain ML workflows. Remote bucket testing has been improved with enhanced operation against remote buckets via the `--s3endpoint` parameter and added latency measurement for combo operations. Performance monitoring has been improved with better tracking of individual operation latencies, a transition to monotonic time for more accurate measurements, and enhanced validation and error reporting.
The `pyaisloader` has been updated to use pickle-safe multiprocessing concurrency, with enhanced parallelism, improved process safety, and better compatibility with Python's multiprocessing constraints. We've improved performance-metrics gathering and reporting and enhanced integration with the Python SDK testing infrastructure.
Commit Highlights
- 066fe04: Update pyaisloader benchmark classes for pickle-safe multiprocessing.
- ace8608: Add `--pctupdate` option to aisloader.
- 216d4ad: Implement new update mode for aisloader to test versioning.
- 554f95a: Optimize aisloader work requests and field alignment.
- be8bc92: Update objectfile stress tests and examples for Python SDK.
Build and CI/CD
AIStore 3.28 has transitioned to Go 1.24 with various optimizations and upgraded multiple open-source dependencies for security, performance, and bug fixes. Specific upgrades include the JWT-Go package (to address security vulnerabilities), the LZ4 package (to v4), and Prometheus client library.
To improve GitLab and GitHub CI, we added Kustomize-based deployment configurations with development-focused overlays for easier setup.
We've added `ais-fs` volume hostpath support in Kubernetes configurations and enhanced the container image build process with cached dependencies for faster builds and improved dependency management. Deployment scripts have been streamlined for various environments.
Commit Highlights
- f30d327: Restore CI image tag to latest (after temporary revert).
- e917cb0: Cache KinD dependencies in image.
- da887f6: Include podman-docker install in CI image.
- 00dfd23: Refactor rules for k8s jobs.
- d10f6ea: Transition to rules + refactoring.
- 04c6fe1: Run csp tests via labels in MR.
- 645e8ee: Upgrade OSS packages.
- bf64360: Upgrade all OSS packages except AWS.
- cce5f41: Upgrade aws-sdk-go-v2 with disabled checksum validation warning.
- 0949d46: Transition to Go 1.24.
Miscellaneous Improvements
Some of the other notable improvements include:
- Optimized multi-object copying and transforming, new default content checksum, micro-optimized mountpath traversal, and improved logic to handle memory pressure.
- New build tag (`stdlibwalk`) to switch between the standard library's `filepath.WalkDir` and the default implementation.
- Fast URL parsing (Go API).
- Optimized buffer allocation for multi-object operations and ETL.
Batch jobs now support a `num-workers` parameter that allows users to further control (via API or CLI) parallelism when running I/O- and network-intensive workflows. Worker count is dynamically scaled based on system load and is never less than the number of mountpaths (disks).
The release also adds support for Unicode and special characters in object names, and improves error classification and management with a new `err-rate-limit` type for better handling of rate-limited operations, along with better retry logic for network failures.
Code quality has been generally improved:
- Transitioned to golangci-lint v2
- Enabled additional linters
- Fixed spelling throughout the entire codebase and documentation
- Amended and improved numerous docs, including the main readme and overview.
- Refactored and micro-optimized numerous components
Commit Highlights
- 12ea066: Replace FS Walk with WalkDir.
- 0b3b34a: Add 'num-workers' parallelism to copy/transform operations.
- 9749b4d: Improve memory pressure handling; increase `size-to-GC` (tunable and configurable).
- 7026201: Go API: fast URL parsing; cache.
- 2e38d81: Fix spelling across the board.
- a2f0dd2: FS Walk: inline visiting callback (micro-optimize).
- ad69848: Pkg 'keepalive': add common initialization, reduce code, refactor.
- a250bfd: Micro-optimize prefetch; simplify cold-get; add stats-updater (i/f); align fields.
- 5cf0b07: OCI backend: amend error handling (assorted status codes; formatting).