Hardware Implementation of Number Theoretic Transform

Published: April 25, 2026

This task is designed to evaluate your ability, as a prospective PhD candidate, to translate mathematical concepts from modern cryptography into efficient hardware implementations.

You are required to implement a simplified Number Theoretic Transform (NTT) accelerator using a Hardware Description Language (HDL) such as Verilog or SystemVerilog, simulate it using an open-source tool, and present your design and results during the interview. You may also use higher-level HDL frameworks that generate Verilog/SystemVerilog (e.g., Chisel or similar), provided the final design can be simulated using standard open-source tools.

Modern computing workloads, specifically in cryptography and secure computation which are increasingly dominated by structured, large-scale arithmetic over polynomials. Whether in post-quantum cryptography (e.g., Kyber, Dilithium) or fully homomorphic encryption (FHE), efficient polynomial multiplication is a central bottleneck.

The Number Theoretic Transform is the key algorithm that makes this efficient. Conceptually similar to the Fast Fourier Transform (FFT), the NTT operates entirely over finite fields, making it highly suitable for hardware implementation.

Why NTT Matters in Hardware

Polynomial multiplication in its naive form takes \(O(n^2)\) operations. This quickly becomes impractical for large polynomials used in cryptographic systems.

The NTT reduces this complexity to:

\(O(n \log n)\) operations
Structured, repeatable computation patterns
Fully integer-based arithmetic (no floating point)

From a hardware perspective, this is extremely important:

Predictable control flow -> easier to design FSMs
Regular computation pattern -> reusable datapath
Integer arithmetic -> simpler and more energy-efficient hardware

Because of these properties, NTT is widely used in:

Fully Homomorphic Encryption (FHE)
Lattice-based cryptography
Secure multiparty computation

From FFT to NTT: What Changes?

You may already be familiar with the FFT, which operates over complex numbers and relies on floating-point arithmetic. While powerful, this introduces rounding errors and adds significant hardware complexity.

The NTT replaces this with:

Arithmetic over integers modulo a prime (\(q\))
Special values called roots of unity in finite fields

This means all operations become:

Integer addition
Integer multiplication
Modular reduction

Roots of Unity in Finite Fields

To perform an NTT, we need:

A prime \(modulus (q)\)
A primitive root \(\omega\) such that:

\[\omega^n \equiv 1 \pmod{q}\]

This allows us to evaluate polynomials at structured points, similar to FFT.

Example Parameter Set for the Implementation

Transform size: (\(n = 8\))
Modulus: (\(q = 17\))
Root of unity: (\(\omega = 9\))

These values are small enough to debug easily, yet capture the full structure of the algorithm.

The Butterfly Operation

The entire NTT is built from a simple computation unit called a butterfly.

Given inputs (\(a\)), (\(b\)), and a twiddle factor (\(\omega\)):

\[u = a + b \cdot \omega \mod q\] \[v = a - b \cdot \omega \mod q\]

For the hardware interpretation, each butterfly requires:

One modular multiplier
One modular adder
One modular subtractor

This is the fundamental building block of your design. Instead of building many butterfly units, you can:

Build one butterfly unit
Reuse it across all stages

This leads to a compact, iterative architecture.

Multi-Stage Structure of NTT

The NTT executes in \(log_2(n)\) stages.

For (\(n = 8\)):

Stage 1: distance = 4
Stage 2: distance = 2
Stage 3: distance = 1

Each stage performs multiple butterfly operations across the dataset.

Hardware Insight

The same datapath is reused across stages
Only:
- Memory access patterns change
- Twiddle factors change

This makes NTT ideal for:

FSM-controlled designs
Resource-constrained implementations

Memory Access Patterns

Unlike simple algorithms, NTT does not access memory sequentially. This often the bottleneck in hardware designs.

Example (\(n = 8\)):

Stage 1: \((0,4), (1,5), (2,6), (3,7)\)
Stage 2: \((0,2), (1,3), (4,6), (5,7)\)
Stage 3: \((0,1), (2,3), (4,5), (6,7)\)

You must design:

Address generation logic
Efficient memory access (possibly dual-port RAM)
Careful read/write scheduling

Remeber, Twiddle factors are powers of (\(\omega\)):

\[\omega^0, \omega^1, \omega^2, ..., \omega^{n-1}\]

Minimal Hardware Architecture

All arithmetic must be performed \(modulo (q)\). For Modular Addition, you can compute sum and if result \(\geq q\), subtract \(q\). For Modular Subtraction, first compute the difference and ff negative, add \(q\). For Modular Multiplication, fist multiply normally and reduce modulo \(q\).

Note: Since \(q = 17\) is small, you may use simple \(% q\) logic. However, look for better ways to implement this for large primes.

A basic NTT accelerator consists of:

Datapath
- Butterfly unit
- Modular arithmetic units
- Twiddle ROM
- Register file or small RAM
Control Unit
- Stage counter
- Butterfly counter
- Address generator
- Twiddle index generator

Task Requirements

Read this document carefully
Implement an 8-point NTT in Verilog with following things in mind
- Input: 8 coefficients
- Output: Transformed coefficients
- Parameters of \(n = 8\), \(q = 17\) and \(\omega = 9\)
- Single butterfly unit (iterative architecture)
- ROM-based twiddle storage
- FSM-based control
- Precomputed twiddle factors
Write a testbench to verify correctness

The NTT is more than just an algorithm, it is a structured computation pattern that maps cleanly to hardware. Understanding it provides a strong foundation for building accelerators used in modern secure computing systems.

Design Freedom

A possible high-level block diagram is shown below. This is provided for reference only, you are not required to follow this architecture. You are encouraged to explore and implement alternative design strategies.

flowchart TD
    A[Start] --> B[Initialize Stage Counter]
    B --> C{More Stages?}
    C -- No --> Z[Done]
    C -- Yes --> D[Initialize Butterfly Counter]

    D --> E{More Butterflies?}
    E -- No --> F[Increment Stage]
    F --> C

    E -- Yes --> G[Read a, b from memory]
    G --> H[Read twiddle factor]
    H --> I[Compute u]
    I --> J[Compute v]
    J --> K[Write u, v back to memory]
    K --> L[Increment Butterfly Counter]
    L --> E

You are expected to make independent design decisions. For future implementations, think about what needs to be changed inorder to implement:

Implementing inverse NTT (INTT)
Adding pipelining
Supporting larger sizes (e.g., 16 or 32)
Optimizing modular multiplication

Simulation Requirements

You must use an open-source simulator, such as:

Your submission must include a working testbench that:

Loads data from text files
Executes the design
Produces correct outputs

Public Artifact Repository

You must create a public repository containing your solution. Repository must include at least following components:

Source Code
Testbench
Input Files
README.md

Interview Expectation

During the interview, you will be expected to:

Walk through your architecture
Explain your design decisions
Discuss trade-offs (precision, latency, complexity)
Explain your softmax implementation
Demonstrate your simulation setup

This task is intended as a concise illustration of the core skills expected for this position. It is not designed to have a single “correct” solution. Instead, you are encouraged to explore different design approaches, experiment with trade-offs, and justify the decisions you make.

Focus on demonstrating clarity of thought, sound engineering judgment, and the ability to connect mathematical concepts to practical hardware implementations.

Good luck and happy coding/designing!