# 17–67× Faster Batch Operations, GPU Kernel Optimizations, and 463+ Security Fixes
A deep dive into the biggest performance and security release yet for the fastest secp256k1 library across CPU, CUDA, OpenCL, and Metal.
## TL;DR
UltrafastSecp256k1 v3.3 ships 61 commits spanning every backend — CPU, CUDA, OpenCL, and Metal. The headline numbers:
- Batch operations are 17–67× faster thanks to an all-affine fast path with Pippenger touched-bucket optimization
- OpenCL generator multiplication is ~10% faster via a precomputed affine table that eliminates per-thread table construction
- Schnorr batch verification got a full optimization pass — cached x-only pubkeys, reused scratch buffers, and a retuned crossover point
- 463+ code-scanning alerts resolved across the entire codebase
- Complete audit infrastructure — every P0, P1, and P2 audit item is now closed
This is an ABI-compatible drop-in upgrade from v3.21.1. No API changes, no breaking changes.
## The Big Win: 17–67× Faster Batch Operations
The single largest improvement in v3.3 is a rewrite of how batch scalar multiplications work internally.
### Before: Jacobian Coordinates All The Way Down
Previous versions performed multi-scalar multiplication using Jacobian coordinates throughout the pipeline. Every point addition required 16 field multiplications — the cost of a full Jacobian-to-Jacobian (J+J) addition. For a Pippenger-style multi-scalar multiplication with thousands of points, this adds up fast.
### After: All-Affine Fast Path
v3.3 introduces an all-affine accumulation strategy. Instead of keeping bucket accumulators in Jacobian form, we batch-convert points to affine coordinates using a single Montgomery batch inversion, then accumulate using mixed Jacobian+Affine additions that cost only 11 field multiplications each.
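The batch-inversion trick at the heart of this strategy is standard Montgomery batch inversion: n field inverses for the cost of a single modular inversion plus 3(n−1) multiplications. A minimal Python sketch of the idea (not the library's actual C++ code):

```python
# Montgomery batch inversion over the secp256k1 base field: invert n elements
# with one modular inversion plus 3*(n-1) multiplications.
P = 2**256 - 2**32 - 977                    # secp256k1 field prime

def batch_invert(values):
    """Invert every element of `values` mod P using a single field inversion."""
    n = len(values)
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):          # prefix[i+1] = v0 * v1 * ... * vi
        prefix[i + 1] = prefix[i] * v % P
    inv = pow(prefix[n], P - 2, P)          # the one real inversion (Fermat)
    out = [0] * n
    for i in range(n - 1, -1, -1):          # peel one factor off per step
        out[i] = inv * prefix[i] % P
        inv = inv * values[i] % P
    return out
```

Amortized over a large bucket set, the per-point cost of converting to affine becomes three multiplications, which is what makes the 11-mul mixed additions a net win.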
Combined with touched-bucket tracking (skip empty buckets entirely) and window size tuning for the Pippenger algorithm, the result is dramatic:
| Operation | v3.21.1 | v3.3 | Speedup |
|---|---|---|---|
| generator_mul_batch(64) | 1,090 μs | 63 μs | 17× |
| generator_mul_batch(256) | 4,200 μs | 112 μs | 37× |
| generator_mul_batch(1024) | 16,800 μs | 251 μs | 67× |
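The touched-bucket idea can be sketched over any additive group. In this hedged stand-in, plain integers play the role of curve points (group addition = integer addition) so the result is easy to check against the naive sum; the `running * gap` step abbreviates the repeated additions a real curve implementation performs across skipped digits:

```python
# Pippenger-style multi-scalar multiplication with touched-bucket tracking:
# only buckets that some scalar digit actually hits are created or visited.
def msm(scalars, points, window=4):
    """Return sum(k_i * P_i) using windowed buckets; integers model points."""
    maxbits = max(s.bit_length() for s in scalars)
    num_windows = (maxbits + window - 1) // window
    acc = 0
    for w in reversed(range(num_windows)):
        for _ in range(window):
            acc += acc                                  # "point doubling"
        buckets = {}                                    # touched buckets only
        for s, p in zip(scalars, points):
            d = (s >> (w * window)) & ((1 << window) - 1)
            if d:
                buckets[d] = buckets.get(d, 0) + p
        running = total = 0                             # running-sum trick,
        prev = None                                     # highest digit first
        for d in sorted(buckets, reverse=True):
            if prev is not None:
                total += running * (prev - d)           # bridge skipped digits
            running += buckets[d]
            prev = d
        if prev is not None:
            total += running * prev
        acc += total
    return acc
```

When most buckets in a window are empty, the inner loops touch only the occupied ones, which is where the batch-size scaling in the table above comes from.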
The larger the batch, the more dramatic the improvement — exactly the scaling behavior you want for wallet operations, BIP352 silent payment scanning, and Schnorr batch verification.
## OpenCL: ~10% Faster Generator Multiplication
On the GPU side, OpenCL's scalar_mul_generator got a targeted optimization that delivers a clean ~10% throughput improvement.
### The Problem
Every OpenCL work-item was constructing its own copy of the generator point multiplication table at kernel startup. This meant:
- 1 point doubling
- 13 mixed additions
- Conversion overhead
- All in Jacobian coordinates with 16 muls per add
For a 4-bit windowed scalar multiplication with ~64 window iterations in the hot loop, this per-thread setup cost was significant.
### The Fix
v3.3 hardcodes a precomputed constant affine table containing {0G, 1G, 2G, ..., 15G} directly in the kernel source. The hot loop now uses point_add_mixed_impl (Jacobian + Affine, 11 muls) instead of point_add_impl (Jacobian + Jacobian, 16 muls).
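The kernel's fixed-window structure looks roughly like this sketch, with an integer G standing in for the generator point (so "doubling" is just addition and the result is easy to verify). The table is built once and shared, mirroring the constant table now baked into the kernel source:

```python
# 4-bit fixed-window scalar multiplication against a precomputed {0G..15G}
# table: four doublings plus one table addition per window.
def windowed_mul(k, G, bits=256):
    table = [i * G for i in range(16)]      # {0G, 1G, ..., 15G}, built once
    acc = 0
    for w in reversed(range(bits // 4)):    # 64 window iterations for 256-bit k
        for _ in range(4):
            acc += acc                      # four doublings shift by one window
        acc += table[(k >> (4 * w)) & 0xF]  # one mixed addition per window
    return acc
```

Because the table entries are affine constants, each of those 64 hot-loop additions is the 11-mul mixed form rather than the 16-mul Jacobian-to-Jacobian form, on top of removing the per-thread table construction entirely.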
Result on NVIDIA RTX 5060 Ti:
| Mode | Before | After | Change |
|---|---|---|---|
| Windowed | 287.8 ns/op (3.47 M/s) | 258.9 ns/op (3.86 M/s) | −10% |
| LUT | 126.0 ns/op (7.93 M/s) | 96.0 ns/op (10.71 M/s) | No regression |
This brings OpenCL to parity with CUDA on the windowed path (previously 1.11× slower, now 0.99×).
Additionally, __NV_CL_C_VERSION is now force-defined, ensuring NVIDIA-specific optimizations are always active on NVIDIA hardware.
## CUDA: Precomputed Tweak Tables for BIP352
The CUDA backend received precomputed tweak tables for the BIP352 silent payment pipeline, eliminating redundant computation during the scan phase. A new BENCH_CLOCK_WARMUP mechanism also ensures benchmark results are stable from the first run.
## Schnorr Batch Verification: Full Optimization Pass
Schnorr batch verification got 8 separate optimizations that compound together:
- Cached x-only pubkey lifts — avoid redundant square-root computations when the same pubkey appears multiple times
- Reused scratch buffers — eliminate per-batch allocation overhead
- Retuned crossover point — the batch-vs-single verification threshold is now optimized for N=128
- Reduced setup passes — fewer iterations over the input array before verification begins
- Fast path through N=64 — small batches skip unnecessary bookkeeping
- Trimmed seed serialization — less overhead in the random weight generation
- Reused SHA-256 base state — the batch weight derivation reuses the midstate instead of rehashing from scratch
- Field batch inversion scratch trimmed — reduced temporary memory usage in the batch modular inversion
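The SHA-256 midstate reuse in the weight derivation follows a pattern that Python's hashlib can model directly: absorb the fixed prefix once, then clone the state per signature instead of rehashing from scratch. The tag string below is a made-up placeholder, not the library's actual domain separator:

```python
import hashlib

# Reused hash base state: the prefix is absorbed exactly once; each weight
# derivation clones the state rather than re-processing the prefix.
base = hashlib.sha256(b"demo/batch-weight")     # hypothetical tag

def weight(index, msg):
    h = base.copy()                  # clone midstate; prefix is not re-hashed
    h.update(index.to_bytes(4, "big"))
    h.update(msg)
    return h.digest()
```

For a batch of N signatures this saves N−1 compressions of the prefix block, which is small per call but measurable at the batch sizes Lightning nodes verify.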
These changes are especially impactful for Lightning Network nodes and other systems that verify many Schnorr signatures in bulk.
## Metal GPU: GLV + wNAF + LUT
Apple's Metal backend received three major feature additions:
- scalar_mul_glv — GLV endomorphism-accelerated scalar multiplication for batched operations, matching the CUDA and OpenCL backends
- wNAF w=4 — windowed Non-Adjacent Form with window size 4, providing better performance than simple binary scalar multiplication
- scalar_mul_generator_lut — lookup-table-based generator multiplication, the fastest path for fixed-base scalar mul
These additions bring Metal to feature parity with the CUDA and OpenCL backends for BIP352 silent payment scanning on Apple Silicon.
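For readers unfamiliar with GLV: secp256k1 admits cube roots of unity β (mod p) and λ (mod n) such that multiplying a point by λ equals the cheap map (x, y) → (β·x, y), which lets a 256-bit scalar be split into two ~128-bit halves. A textbook affine sketch of the endomorphism, using the standard published constants and definitely not the library's optimized code:

```python
# GLV endomorphism check on secp256k1: lambda * (x, y) == (beta * x, y).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
LAM  = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
BETA = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE

def add(p, q):
    """Affine addition on y^2 = x^3 + 7; None is the point at infinity."""
    if p is None: return q
    if q is None: return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p == q:
        m = 3 * x1 * x1 * pow(2 * y1, P - 2, P) % P     # tangent slope
    else:
        m = (y2 - y1) * pow(x2 - x1, P - 2, P) % P      # chord slope
    x3 = (m * m - x1 - x2) % P
    return x3, (m * (x1 - x3) - y1) % P

def mul(k, pt):
    """Plain double-and-add scalar multiplication (not constant-time)."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc
```

Running `mul(LAM, G)` and comparing against `(BETA * G[0] % P, G[1])` confirms the identity; the GPU backends exploit it by trading half the doublings for one field multiplication per point.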
## Security & Hardening: 5 Critical Improvements
### Solinas Reduction: Replacing a Broken Barrett Reduction
The previous Barrett reduction implementation had a subtle correctness bug. v3.3 replaces it entirely with a correct Solinas reduction, and adds previously missing shader header dependencies whose absence could cause compilation failures on some platforms.
### Constant-Time Message Signing (N-03)
The message signing path now uses a constant-time (CT) implementation, closing audit finding N-03. This prevents timing side-channel attacks during ECDSA message signing operations.
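The core discipline behind a constant-time path is replacing secret-dependent branches with mask arithmetic. A minimal model of the branchless-select idiom (Python integers are not actually constant-time; this only mirrors the C pattern):

```python
# Branchless select: choose between two field-sized values without a
# secret-dependent branch, the building block of a CT signing path.
MASK256 = (1 << 256) - 1

def ct_select(flag, a, b):
    """Return a if flag == 1 else b, for 0 <= a, b < 2**256, no branching."""
    mask = -flag & MASK256          # all-ones when flag == 1, zero when flag == 0
    return (a & mask) | (b & (mask ^ MASK256))
```

Because both operands are always read and combined, execution time and memory access patterns no longer depend on the secret `flag`.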
### Secret Cleanup Hardening
Three separate hardening patches ensure that private key material is properly zeroized:
- Wallet seed-to-address derivation — intermediate key material is now securely erased after use
- ABI secret cleanup paths — the C ABI boundary ensures callers cannot accidentally leak private keys
- ECIES zero-ephemeral cleanup — ephemeral keys used in ECIES encryption are zeroized immediately after the shared secret is derived
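The lifecycle these patches enforce can be modeled in a few lines: secrets live in a mutable buffer and are wiped in place before the buffer goes out of scope. The library does this in C++ with a secure memset the optimizer cannot elide; this sketch only shows the discipline, not the mechanism:

```python
# Zeroization discipline: keep key material in a mutable bytearray and
# erase it in place as soon as it is no longer needed.
def zeroize(buf: bytearray) -> None:
    for i in range(len(buf)):
        buf[i] = 0

seed = bytearray(b"\x42" * 32)      # stand-in for wallet seed material
# ... derive keys from seed here ...
zeroize(seed)                       # wipe before the buffer is released
```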
## 463+ Code-Scanning Alerts Resolved
Over four pull requests, 463+ static analysis findings were systematically resolved:
- Missing braces on single-line if/for bodies
- Missing const qualifiers on variables that are never modified
- Integer widening conversions that could lose precision
- Dead stores — assignments to variables that are never subsequently read
- Uninitialized variable warnings
- argumentSize mismatches in function calls
The codebase now passes CodeQL, clang-tidy, and SonarCloud with zero alerts.
## Audit Infrastructure: P0+P1+P2 Complete
v3.3 marks the completion of all planned audit infrastructure items:
- P0 (Critical): Core cryptographic correctness — constant-time verification, edge-case coverage, field arithmetic validation
- P1 (High): GPU backend audit runners — OpenCL and Metal now have full audit suites matching the CPU backend
- P2 (Medium): Extended coverage — CT PrivateKey overloads, FE52 conditional_negate, and cross-platform consistency checks
The audit framework now runs 27 modules across 8 sections on every backend (CPU, CUDA, OpenCL, Metal), ensuring that a correctness regression on any platform is caught before it reaches a release.
## Bug Fixes Worth Noting
### ARM64 SHA-256 Intrinsics Bug
A subtle bug in the ARM64 SHA-256 implementation: vsha256h2q_u32 was called with a register (abcd) that had already been modified by the preceding vsha256hq_u32 call. This produced incorrect hashes on some ARM64 platforms. Fixed by saving the original value before the first hash round.
### MSVC C2026 String Literal Limit
Microsoft's MSVC compiler has a 16,380-character limit on string literals (error C2026). The precomputed point tables exceeded this. v3.3 works around the limit while keeping the tables as compile-time constants.
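One common way around C2026 is to emit the table as several adjacent string literals, which the compiler concatenates back into a single compile-time constant. A hypothetical generator sketch (it assumes the table text needs no C escaping, e.g. hex digits only, and is not the library's actual build script):

```python
# Split one huge table literal into adjacent C string literals, each safely
# under the MSVC C2026 per-literal limit.
C2026_LIMIT = 16380

def to_adjacent_literals(data, chunk=C2026_LIMIT - 1):
    """Render `data` as newline-separated, quote-wrapped literal pieces."""
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    return "\n".join('"{}"'.format(p) for p in pieces)
```

Note that MSVC also caps the length of a fully concatenated literal, so very large tables may instead need to become byte-array initializers rather than strings.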
## What's Next
With v3.3 establishing feature parity across all four GPU backends and completing the audit infrastructure, the next focus areas are:
- BIP352 full-pipeline GPU acceleration — moving the entire silent payment scan to GPU with zero CPU round-trips
- Multi-GPU support — distributing batch operations across multiple GPUs
- RISC-V vector extension — leveraging RVV 1.0 for field arithmetic on next-generation hardware
## Try It
UltrafastSecp256k1 v3.3 is available now:
```sh
# C/C++
wget https://github.com/shrec/UltrafastSecp256k1/releases/tag/v3.3

# Python
pip install ufsecp==3.3

# Rust
cargo add ufsecp@3.3

# Node.js
npm install ufsecp@3.3
```

Binaries are available for Linux x64/ARM64, macOS ARM64, Windows x64, Android (ARM64/ARMv7/x64), iOS (xcframework), and WebAssembly.
All release artifacts are signed with Sigstore cosign and include an SBOM.
UltrafastSecp256k1 is an open-source, high-performance secp256k1 elliptic curve library optimized for Bitcoin, Lightning, and BIP352 silent payments. It targets CPU (x86-64, ARM64, RISC-V), CUDA, OpenCL, and Metal backends.
GitHub: github.com/shrec/UltrafastSecp256k1
Release: v3.3