# 17–67× Faster Batch Operations, GPU Kernel Optimizations, and 463+ Security Fixes
A deep dive into the biggest performance and security release yet for the fastest secp256k1 library across CPU, CUDA, OpenCL, and Metal.
## TL;DR
UltrafastSecp256k1 v3.3 ships 61 commits spanning every backend — CPU, CUDA, OpenCL, and Metal. The headline numbers:
- Batch operations are 17–67× faster thanks to an all-affine fast path with Pippenger touched-bucket optimization
- OpenCL generator multiplication is ~10% faster via a precomputed affine table that eliminates per-thread table construction
- Schnorr batch verification got a full optimization pass — cached x-only pubkeys, reused scratch buffers, and a retuned crossover point
- 463+ code-scanning alerts resolved across the entire codebase
- Complete audit infrastructure — every P0, P1, and P2 audit item is now closed
This is an ABI-compatible drop-in upgrade from v3.21.1. No API changes, no breaking changes.
## The Big Win: 17–67× Faster Batch Operations
The single largest improvement in v3.3 is a rewrite of how batch scalar multiplications work internally.
### Before: Jacobian Coordinates All The Way Down
Previous versions performed multi-scalar multiplication using Jacobian coordinates throughout the pipeline. Every point addition required 16 field multiplications — the cost of a full Jacobian-to-Jacobian (J+J) addition. For a Pippenger-style multi-scalar multiplication with thousands of points, this adds up fast.
### After: All-Affine Fast Path
v3.3 introduces an all-affine accumulation strategy. Instead of keeping bucket accumulators in Jacobian form, we batch-convert points to affine coordinates using a single Montgomery batch inversion, then accumulate using mixed Jacobian+Affine additions that cost only 11 field multiplications each.
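The batch-inversion trick at the heart of this strategy is standard Montgomery batch inversion: n field inverses for the cost of a single modular inversion plus 3(n−1) multiplications. A minimal Python sketch of the idea (not the library's actual C++ code):

```python
# Montgomery batch inversion over the secp256k1 base field: invert n elements
# with one modular inversion plus 3*(n-1) multiplications.
P = 2**256 - 2**32 - 977                    # secp256k1 field prime

def batch_invert(values):
    """Invert every element of `values` mod P using a single field inversion."""
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):          # prefix[i+1] = v0 * v1 * ... * vi
        prefix[i + 1] = prefix[i] * v % P
    inv = pow(prefix[n], P - 2, P)          # the one real inversion (Fermat)
    out = [0] * n
    for i in range(n - 1, -1, -1):          # peel one factor off per step
        out[i] = inv * prefix[i] % P
        inv = inv * values[i] % P
    return out
```

Amortized over a large bucket set, the per-point cost of converting to affine becomes three multiplications, which is what makes the 11-mul mixed additions a net win.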
Combined with touched-bucket tracking (skip empty buckets entirely) and window size tuning for the Pippenger algorithm, the result is dramatic:
| Operation | v3.21.1 | v3.3 | Speedup |
|---|---|---|---|
| generator_mul_batch(64) | 1,090 μs | 63 μs | 17× |
| generator_mul_batch(256) | 4,200 μs | 112 μs | 37× |
| generator_mul_batch(1024) | 16,800 μs | 251 μs | 67× |
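The touched-bucket idea can be sketched over any additive group. In this hedged stand-in, plain integers play the role of curve points (group addition = integer addition) so the result is easy to check against the naive sum; the `running * gap` step abbreviates the repeated additions a real curve implementation performs across skipped digits:

```python
# Pippenger-style multi-scalar multiplication with touched-bucket tracking:
# only buckets that some scalar digit actually hits are created or visited.
def msm(scalars, points, window=4):
    """Return sum(k_i * P_i) using windowed buckets; integers model points."""
    maxbits = max(s.bit_length() for s in scalars)
    num_windows = (maxbits + window - 1) // window
    acc = 0
    for w in reversed(range(num_windows)):
        for _ in range(window):
            acc += acc                                  # "point doubling"
        buckets = {}                                    # touched buckets only
        for s, p in zip(scalars, points):
            d = (s >> (w * window)) & ((1 << window) - 1)
            if d:
                buckets[d] = buckets.get(d, 0) + p
        running = total = 0                             # running-sum trick,
        prev = None                                     # highest digit first
        for d in sorted(buckets, reverse=True):
            if prev is not None:
                total += running * (prev - d)           # bridge skipped digits
            running += buckets[d]
            prev = d
        if prev is not None:
            total += running * prev
        acc += total
    return acc
```

When most buckets in a window are empty, the inner loops touch only the occupied ones, which is where the batch-size scaling in the table above comes from.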
The larger the batch, the more dramatic the improvement — exactly the scaling behavior you want for wallet operations, BIP352 silent payment scanning, and Schnorr batch verification.
## OpenCL: ~10% Faster Generator Multiplication
On the GPU side, OpenCL's scalar_mul_generator got a targeted optimization that delivers a clean ~10% throughput improvement.
### The Problem
Every OpenCL work-item was constructing its own copy of the generator point multiplication table at kernel startup. This meant:
- 1 point doubling
- 13 mixed additions
- Conversion overhead
- All in Jacobian coordinates with 16 muls per add
For a 4-bit windowed scalar multiplication with ~64 window iterations in the hot loop, this per-thread setup cost was significant.
### The Fix
v3.3 hardcodes a precomputed constant affine table containing {0G, 1G, 2G, ..., 15G} directly in the kernel source. The hot loop now uses point_add_mixed_impl (Jacobian + Affine, 11 muls) instead of point_add_impl (Jacobian + Jacobian, 16 muls).
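The kernel's fixed-window structure looks roughly like this sketch, with an integer G standing in for the generator point (so "doubling" is just addition and the result is easy to verify). The table is built once and shared, mirroring the constant table now baked into the kernel source:

```python
# 4-bit fixed-window scalar multiplication against a precomputed {0G..15G}
# table: four doublings plus one table addition per window.
def windowed_mul(k, G, bits=256):
    table = [i * G for i in range(16)]      # {0G, 1G, ..., 15G}, built once
    acc = 0
    for w in reversed(range(bits // 4)):    # 64 window iterations for 256-bit k
        for _ in range(4):
            acc += acc                      # four doublings shift by one window
        acc += table[(k >> (4 * w)) & 0xF]  # one mixed addition per window
    return acc
```

Because the table entries are affine constants, each of those 64 hot-loop additions is the 11-mul mixed form rather than the 16-mul Jacobian-to-Jacobian form, on top of removing the per-thread table construction entirely.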
Result on NVIDIA RTX 5060 Ti:
| Mode | Before | After | Change |
|---|---|---|---|
| Windowed | 287.8 ns/op (3.47 M/s) | 258.9 ns/op (3.86 M/s) | −10% |
| LUT | 126.0 ns/op (7.93 M/s) | 96.0 ns/op (10.71 M/s) | No regression |
This brings OpenCL to parity with CUDA on the windowed path (previously 1.11× slower, now 0.99×).
Additionally, __NV_CL_C_VERSION is now force-defined, ensuring NVIDIA-specific optimizations are always active on NVIDIA hardware.
## CUDA: Precomputed Tweak Tables for BIP352
The CUDA backend received precomputed tweak tables for the BIP352 silent payment pipeline, eliminating redundant computation during the scan phase. A new BENCH_CLOCK_WARMUP mechanism also ensures benchmark results are stable from the first run.
## Schnorr Batch Verification: Full Optimization Pass
Schnorr batch verification got 8 separate optimizations that compound together:
- Cached x-only pubkey lifts — avoid redundant square-root computations when the same pubkey appears multiple times
- Reused scratch buffers — eliminate per-batch allocation overhead
- Retuned crossover point — the batch-vs-single verification threshold is now optimized for N=128
- Reduced setup passes — fewer iterations over the input array before verification begins
- Fast path through N=64 — small batches skip unnecessary bookkeeping
- Trimmed seed serialization — less overhead in the random weight generation
- Reused SHA-256 base state — the batch weight derivation reuses the midstate instead of rehashing from scratch
- Field batch inversion scratch trimmed — reduced temporary memory usage in the batch modular inversion
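The SHA-256 midstate reuse in the weight derivation follows a pattern that Python's hashlib can model directly: absorb the fixed prefix once, then clone the state per signature instead of rehashing from scratch. The tag string below is a made-up placeholder, not the library's actual domain separator:

```python
import hashlib

# Reused hash base state: the prefix is absorbed exactly once; each weight
# derivation clones the state rather than re-processing the prefix.
base = hashlib.sha256(b"demo/batch-weight")     # hypothetical tag

def weight(index, msg):
    h = base.copy()                  # clone midstate; prefix is not re-hashed
    h.update(index.to_bytes(4, "big"))
    h.update(msg)
    return h.digest()
```

For a batch of N signatures this saves N−1 compressions of the prefix block, which is small per call but measurable at the batch sizes Lightning nodes verify.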
These changes are especially impactful for Lightning Network nodes and other systems that verify many Schnorr signatures in bulk.
## Metal GPU: GLV + wNAF + LUT
Apple's Metal backend received three major feature additions:
- scalar_mul_glv — GLV endomorphism-accelerated scalar multiplication for batched operations, matching the CUDA and OpenCL backends
- wNAF w=4 — windowed Non-Adjacent Form with window size 4, providing better performance than simple binary scalar multiplication
- scalar_mul_generator_lut — lookup-table-based generator multiplication, the fastest path for fixed-base scalar mul
These additions bring Metal to feature parity with the CUDA and OpenCL backends for BIP352 silent payment scanning on Apple Silicon.
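For readers unfamiliar with GLV: secp256k1 admits cube roots of unity β (mod p) and λ (mod n) such that multiplying a point by λ equals the cheap map (x, y) → (β·x, y), which lets a 256-bit scalar be split into two ~128-bit halves. A textbook affine sketch of the endomorphism, using the standard published constants and definitely not the library's optimized code:

```python
# GLV endomorphism check on secp256k1: lambda * (x, y) == (beta * x, y).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
LAM  = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
BETA = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE

def add(p, q):
    """Affine addition on y^2 = x^3 + 7; None is the point at infinity."""
    if p is None: return q
    if q is None: return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p == q:
        m = 3 * x1 * x1 * pow(2 * y1, P - 2, P) % P     # tangent slope
    else:
        m = (y2 - y1) * pow(x2 - x1, P - 2, P) % P      # chord slope
    x3 = (m * m - x1 - x2) % P
    return x3, (m * (x1 - x3) - y1) % P

def mul(k, pt):
    """Plain double-and-add scalar multiplication (not constant-time)."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc
```

Running `mul(LAM, G)` and comparing against `(BETA * G[0] % P, G[1])` confirms the identity; the GPU backends exploit it by trading half the doublings for one field multiplication per point.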
## Security & Hardening: 5 Critical Improvements
### Solinas Reduction: Replacing a Broken Barrett Reduction
The previous Barrett reduction implementation had a subtle correctness bug. v3.3 replaces it entirely with a correct Solinas reduction, and adds previously missing shader header dependencies whose absence could cause compilation failures on some platforms.
### Constant-Time Message Signing (N-03)
The message signing path now uses a constant-time (CT) implementation, closing audit finding N-03. This prevents timing side-channel attacks during ECDSA message signing operations.
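The core discipline behind a constant-time path is replacing secret-dependent branches with mask arithmetic. A minimal model of the branchless-select idiom (Python integers are not actually constant-time; this only mirrors the C pattern):

```python
# Branchless select: choose between two field-sized values without a
# secret-dependent branch, the building block of a CT signing path.
MASK256 = (1 << 256) - 1

def ct_select(flag, a, b):
    """Return a if flag == 1 else b, for 0 <= a, b < 2**256, no branching."""
    mask = -flag & MASK256          # all-ones when flag == 1, zero when flag == 0
    return (a & mask) | (b & (mask ^ MASK256))
```

Because both operands are always read and combined, execution time and memory access patterns no longer depend on the secret `flag`.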
### Secret Cleanup Hardening
Three separate hardening patches ensure that private key material is properly zeroized:
- Wallet seed-to-address derivation — intermediate key material is now securely erased after use
- ABI secret cleanup paths — the C ABI boundary ensures callers cannot accidentally leak private keys
- ECIES zero-ephemeral cleanup — ephemeral keys used in ECIES encryption are zeroized immediately after the shared secret is derived
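The lifecycle these patches enforce can be modeled in a few lines: secrets live in a mutable buffer and are wiped in place before the buffer goes out of scope. The library does this in C++ with a secure memset the optimizer cannot elide; this sketch only shows the discipline, not the mechanism:

```python
# Zeroization discipline: keep key material in a mutable bytearray and
# erase it in place as soon as it is no longer needed.
def zeroize(buf: bytearray) -> None:
    for i in range(len(buf)):
        buf[i] = 0

seed = bytearray(b"\x42" * 32)      # stand-in for wallet seed material
# ... derive keys from seed here ...
zeroize(seed)                       # wipe before the buffer is released
```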
## 463+ Code-Scanning Alerts Resolved
Over four pull requests, 463+ static analysis findings were systematically resolved:
- Missing braces on single-line if/for bodies
- Missing const qualifiers on variables that are never modified
- Integer widening conversions that could lose precision
- Dead stores — assignments to variables that are never subsequently read
- Uninitialized variable warnings
- argumentSize mismatches in function calls
The codebase now passes CodeQL, clang-tidy, and SonarCloud with zero alerts.
## Audit Infrastructure: P0+P1+P2 Complete
v3.3 marks the completion of all planned audit infrastructure items:
- P0 (Critical): Core cryptographic correctness — constant-time verification, edge-case coverage, field arithmetic validation
- P1 (High): GPU backend audit runners — OpenCL and Metal now have full audit suites matching the CPU backend
- P2 (Medium): Extended coverage — CT PrivateKey overloads, FE52 conditional_negate, and cross-platform consistency checks
The audit framework now runs 27 modules across 8 sections on every backend (CPU, CUDA, OpenCL, Metal), ensuring that a correctness regression on any platform is caught before it reaches a release.
## Bug Fixes Worth Noting
### ARM64 SHA-256 Intrinsics Bug
A subtle bug in the ARM64 SHA-256 implementation: vsha256h2q_u32 was called with a register (abcd) that had already been modified by the preceding vsha256hq_u32 call. This produced incorrect hashes on some ARM64 platforms. Fixed by saving the original value before the first hash round.
### MSVC C2026 String Literal Limit
Microsoft's MSVC compiler has a 16,380-character limit on string literals (error C2026). The precomputed point tables exceeded this. v3.3 works around the limit while keeping the tables as compile-time constants.
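One common way around C2026 is to emit the table as several adjacent string literals, which the compiler concatenates back into a single compile-time constant. A hypothetical generator sketch (it assumes the table text needs no C escaping, e.g. hex digits only, and is not the library's actual build script):

```python
# Split one huge table literal into adjacent C string literals, each safely
# under the MSVC C2026 per-literal limit.
C2026_LIMIT = 16380

def to_adjacent_literals(data, chunk=C2026_LIMIT - 1):
    """Render `data` as newline-separated, quote-wrapped literal pieces."""
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    return "\n".join('"{}"'.format(p) for p in pieces)
```

Note that MSVC also caps the length of a fully concatenated literal, so very large tables may instead need to become byte-array initializers rather than strings.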
## What's Next
With v3.3 establishing feature parity across all four GPU backends and completing the audit infrastructure, the next focus areas are:
- BIP352 full-pipeline GPU acceleration — moving the entire silent payment scan to GPU with zero CPU round-trips
- Multi-GPU support — distributing batch operations across multiple GPUs
- RISC-V vector extension — leveraging RVV 1.0 for field arithmetic on next-generation hardware
## Try It
UltrafastSecp256k1 v3.3 is available now:
```sh
# C/C++
wget https://github.com/shrec/UltrafastSecp256k1/releases/tag/v3.3

# Python
pip install ufsecp==3.3

# Rust
cargo add ufsecp@3.3

# Node.js
npm install ufsecp@3.3
```

Binaries are available for Linux x64/ARM64, macOS ARM64, Windows x64, Android (ARM64/ARMv7/x64), iOS (xcframework), and WebAssembly.
All release artifacts are signed with Sigstore cosign and include an SBOM.
UltrafastSecp256k1 is an open-source, high-performance secp256k1 elliptic curve library optimized for Bitcoin, Lightning, and BIP352 silent payments. It targets CPU (x86-64, ARM64, RISC-V), CUDA, OpenCL, and Metal backends.
GitHub: github.com/shrec/UltrafastSecp256k1
Release: v3.3