Hardware Implementation of Number Theoretic Transform
Published:
This task is designed to evaluate your ability, as a prospective PhD candidate, to translate mathematical concepts from modern cryptography into efficient hardware implementations.
You are required to implement a simplified Number Theoretic Transform (NTT) accelerator using a Hardware Description Language (HDL) such as Verilog or SystemVerilog, simulate it using an open-source tool, and present your design and results during the interview. You may also use higher-level HDL frameworks that generate Verilog/SystemVerilog (e.g., Chisel or similar), provided the final design can be simulated using standard open-source tools.
Modern computing workloads, specifically in cryptography and secure computation which are increasingly dominated by structured, large-scale arithmetic over polynomials. Whether in post-quantum cryptography (e.g., Kyber, Dilithium) or fully homomorphic encryption (FHE), efficient polynomial multiplication is a central bottleneck.
The Number Theoretic Transform is the key algorithm that makes this efficient. Conceptually similar to the Fast Fourier Transform (FFT), the NTT operates entirely over finite fields, making it highly suitable for hardware implementation.
Why NTT Matters in Hardware
Polynomial multiplication in its naive form takes \(O(n^2)\) operations. This quickly becomes impractical for large polynomials used in cryptographic systems.
The NTT reduces this complexity to:
- \(O(n \log n)\) operations
- Structured, repeatable computation patterns
- Fully integer-based arithmetic (no floating point)
From a hardware perspective, this is extremely important:
- Predictable control flow -> easier to design FSMs
- Regular computation pattern -> reusable datapath
- Integer arithmetic -> simpler and more energy-efficient hardware
Because of these properties, NTT is widely used in:
- Fully Homomorphic Encryption (FHE)
- Lattice-based cryptography
- Secure multiparty computation
From FFT to NTT: What Changes?
You may already be familiar with the FFT, which operates over complex numbers and relies on floating-point arithmetic. While powerful, this introduces rounding errors and adds significant hardware complexity.
The NTT replaces this with:
- Arithmetic over integers modulo a prime (\(q\))
- Special values called roots of unity in finite fields
This means all operations become:
- Integer addition
- Integer multiplication
- Modular reduction
Roots of Unity in Finite Fields
To perform an NTT, we need:
- A prime \(modulus (q)\)
- A primitive root \(\omega\) such that:
This allows us to evaluate polynomials at structured points, similar to FFT.
Example Parameter Set for the Implementation
- Transform size: (\(n = 8\))
- Modulus: (\(q = 17\))
- Root of unity: (\(\omega = 9\))
These values are small enough to debug easily, yet capture the full structure of the algorithm.
The Butterfly Operation
The entire NTT is built from a simple computation unit called a butterfly.
Given inputs (\(a\)), (\(b\)), and a twiddle factor (\(\omega\)):
\[u = a + b \cdot \omega \mod q\] \[v = a - b \cdot \omega \mod q\]For the hardware interpretation, each butterfly requires:
- One modular multiplier
- One modular adder
- One modular subtractor
This is the fundamental building block of your design. Instead of building many butterfly units, you can:
- Build one butterfly unit
- Reuse it across all stages
This leads to a compact, iterative architecture.
Multi-Stage Structure of NTT
The NTT executes in \(log_2(n)\) stages.
For (\(n = 8\)):
- Stage 1: distance = 4
- Stage 2: distance = 2
- Stage 3: distance = 1
Each stage performs multiple butterfly operations across the dataset.
Hardware Insight
- The same datapath is reused across stages
Only:
- Memory access patterns change
- Twiddle factors change
This makes NTT ideal for:
- FSM-controlled designs
- Resource-constrained implementations
Memory Access Patterns
Unlike simple algorithms, NTT does not access memory sequentially. This often the bottleneck in hardware designs.
Example (\(n = 8\)):
- Stage 1: \((0,4), (1,5), (2,6), (3,7)\)
- Stage 2: \((0,2), (1,3), (4,6), (5,7)\)
- Stage 3: \((0,1), (2,3), (4,5), (6,7)\)
You must design:
- Address generation logic
- Efficient memory access (possibly dual-port RAM)
- Careful read/write scheduling
Remeber, Twiddle factors are powers of (\(\omega\)):
\[\omega^0, \omega^1, \omega^2, ..., \omega^{n-1}\]Minimal Hardware Architecture
All arithmetic must be performed \(modulo (q)\). For Modular Addition, you can compute sum and if result \(\geq q\), subtract \(q\). For Modular Subtraction, first compute the difference and ff negative, add \(q\). For Modular Multiplication, fist multiply normally and reduce modulo \(q\).
Note: Since \(q = 17\) is small, you may use simple \(% q\) logic. However, look for better ways to implement this for large primes.
A basic NTT accelerator consists of:
- Datapath
- Butterfly unit
- Modular arithmetic units
- Twiddle ROM
- Register file or small RAM
- Control Unit
- Stage counter
- Butterfly counter
- Address generator
- Twiddle index generator
Task Requirements
- Read this document carefully
- Implement an 8-point NTT in Verilog with following things in mind
- Input: 8 coefficients
- Output: Transformed coefficients
- Parameters of \(n = 8\), \(q = 17\) and \(\omega = 9\)
- Single butterfly unit (iterative architecture)
- ROM-based twiddle storage
- FSM-based control
- Precomputed twiddle factors
- Write a testbench to verify correctness
The NTT is more than just an algorithm, it is a structured computation pattern that maps cleanly to hardware. Understanding it provides a strong foundation for building accelerators used in modern secure computing systems.
Design Freedom
A possible high-level block diagram is shown below. This is provided for reference only, you are not required to follow this architecture. You are encouraged to explore and implement alternative design strategies.
flowchart TD
A[Start] --> B[Initialize Stage Counter]
B --> C{More Stages?}
C -- No --> Z[Done]
C -- Yes --> D[Initialize Butterfly Counter]
D --> E{More Butterflies?}
E -- No --> F[Increment Stage]
F --> C
E -- Yes --> G[Read a, b from memory]
G --> H[Read twiddle factor]
H --> I[Compute u]
I --> J[Compute v]
J --> K[Write u, v back to memory]
K --> L[Increment Butterfly Counter]
L --> E
You are expected to make independent design decisions. For future implementations, think about what needs to be changed inorder to implement:
- Implementing inverse NTT (INTT)
- Adding pipelining
- Supporting larger sizes (e.g., 16 or 32)
- Optimizing modular multiplication
Simulation Requirements
You must use an open-source simulator, such as:
Your submission must include a working testbench that:
- Loads data from text files
- Executes the design
- Produces correct outputs
Public Artifact Repository
You must create a public repository containing your solution. Repository must include at least following components:
- Source Code
- Testbench
- Input Files
- README.md
Interview Expectation
During the interview, you will be expected to:
- Walk through your architecture
- Explain your design decisions
- Discuss trade-offs (precision, latency, complexity)
- Explain your softmax implementation
- Demonstrate your simulation setup
This task is intended as a concise illustration of the core skills expected for this position. It is not designed to have a single “correct” solution. Instead, you are encouraged to explore different design approaches, experiment with trade-offs, and justify the decisions you make.
Focus on demonstrating clarity of thought, sound engineering judgment, and the ability to connect mathematical concepts to practical hardware implementations.
Good luck and happy coding/designing!