★ 7/10 · AI · 2026-02-09

Transformers.js v4: Now Available on NPM!

Summary

Transformers.js v4 introduces a new WebGPU runtime written in C++ and transitions the project to a pnpm monorepo structure. This update enables high-performance, hardware-accelerated AI model execution across various JavaScript environments, including browsers, Node.js, Bun, and Deno.

Key Points

  • Implementation of a new WebGPU runtime via ONNX Runtime, supporting advanced architectures such as Mamba (state-space models), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE).
  • Achieved a ~4x speedup for BERT-based embedding models through the adoption of the com.microsoft.MultiHeadAttention operator.
  • Migration from Webpack to esbuild reduced build times from 2 seconds to 200 milliseconds and decreased the transformers.web.js bundle size by 53%.
  • Release of @huggingface/tokenizers, a standalone, 8.8kB (gzipped) zero-dependency library for type-safe tokenization.
  • Support for models exceeding 8B parameters, with benchmarks demonstrating GPT-OSS 20B (q4f16) running at approximately 60 tokens per second on an Apple M4 Max.
  • Introduction of the ModelRegistry API for managing pipeline assets, including file metadata inspection, cache status checks, and download size calculation.
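The WebGPU runtime highlighted above is reached through the familiar pipeline API. A minimal sketch, assuming the `device` and `dtype` pipeline options documented for earlier releases carry over to v4; the model id is illustrative:

```javascript
import { pipeline } from '@huggingface/transformers';

// Create a feature-extraction pipeline on the WebGPU backend.
// 'Xenova/all-MiniLM-L6-v2' is an illustrative BERT-based embedding model,
// the class of model that benefits from the new MultiHeadAttention operator.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device: 'webgpu', // hardware-accelerated execution where WebGPU is available
  dtype: 'fp16',
});

// Compute a mean-pooled, normalized sentence embedding.
const embedding = await extractor('Transformers.js v4 runs on WebGPU.', {
  pooling: 'mean',
  normalize: true,
});
```

The same code runs unchanged in browsers, Node.js, Bun, and Deno, which is the unified-deployment story the release emphasizes.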

Technical Details

The v4 architecture leverages specialized ONNX Runtime Contrib Operators, such as com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, and com.microsoft.QMoE, to optimize the execution of large language models. The repository has been restructured into a pnpm workspace, allowing for modular sub-packages that depend on the @huggingface/transformers core. This modularity extends to the model definitions, which have been refactored from a single 8,000-line file into focused modules to improve maintainability and extensibility.

New environment controls allow for more granular runtime configuration. The env.useWasmCache setting enables caching of WASM runtime files to support offline functionality, while env.fetch allows developers to implement custom fetch logic for authenticated requests or custom headers. For production monitoring, the ModelRegistry API provides methods like get_pipeline_files and get_file_metadata to inspect assets before loading, and the progress_callback now includes a progress_total event for end-to-end loading tracking.
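As a concrete illustration of end-to-end loading tracking, the sketch below folds per-file progress events and the new `progress_total` event into a single percentage. The event field names (`status`, `loaded`, `total`) are assumptions about the callback payload, not confirmed v4 API:

```javascript
// Hypothetical helper for use with progress_callback: aggregates events
// into one overall percentage once `progress_total` reports total bytes.
function createProgressTracker() {
  let loaded = 0;
  let total = 0;
  return {
    // Returns the current overall percentage (0-100) after each event.
    update(event) {
      if (event.status === 'progress_total') total = event.total ?? total;
      if (event.status === 'progress') loaded = event.loaded ?? loaded;
      return total > 0 ? Math.min(100, Math.round((loaded / total) * 100)) : 0;
    },
  };
}

// Usage sketch with a pipeline (assumed option shape):
// const tracker = createProgressTracker();
// const pipe = await pipeline('text-generation', modelId, {
//   progress_callback: (e) => console.log(`${tracker.update(e)}%`),
// });
```

Pairing this with `ModelRegistry`'s `get_pipeline_files` and `get_file_metadata` would let an application show expected download size before loading begins.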

Impact / Why It Matters

Developers can now deploy unified, WebGPU-accelerated code across browsers, server-side runtimes, and desktop applications. The combination of reduced bundle sizes and optimized operator support enables the deployment of sophisticated, large-scale models in resource-constrained environments.

ai javascript transformers.js
