About

Hi, I'm KyungIn Nam.
I build hardware that thinks faster.

Electrical Engineering student at UC Irvine with a focus on computer architecture, RTL design, and hardware acceleration. I work at the intersection of chip design and machine learning, turning research ideas into silicon-level solutions. I co-authored a paper accepted at IEEE/ACM DATE 2026.

Currently getting into the ASIC back-end flow: synthesis, place and route (P&R), and timing closure. Still a lot to learn, but that's the direction.

Irvine, CA · BS EE @ UC Irvine · Open to opportunities

Experience

Nov 2024 – Present

Undergraduate Researcher

Bias Lab · UC Irvine

Wrote Verilog IP blocks for Q8 fixed-point hardware acceleration and validated them through simulation and FPGA testing — a lot of debugging, but learned how throughput and power actually trade off in practice.
Built Python tooling to automate AWQ quantization sweeps across multiple design variants, which made benchmarking a lot less painful.
Contributed to a paper accepted at IEEE/ACM DATE 2026 — my main role was designing the benchmarking methodology and running the evaluation experiments.

Oct 2024 – Mar 2025

Hackers Intern

Hackers Fab · Irvine, CA

Rotated through the full semiconductor design pipeline, including front-end design, mask preparation, and back-end process steps, gaining hands-on experience with real fabrication workflows.
Focused on lithography stepper systems; assisted in calibrating exposure parameters, alignment routines, and pattern transfer processes to improve resolution and yield.

Publications

IEEE/ACM DATE 2026

T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization

H. Oh, K. Nam, R. Bhattacharjya, H. Chen, T. Das, S. Yun, S. Jang, A. Ding, N. Dutt, M. Imani

IEEE/ACM Design, Automation and Test in Europe Conference (DATE 2026)

The problem: ternary LLMs are memory-efficient (8× smaller than FP16), but running them on CPUs is bottlenecked by lookup table (LUT) fetches — those LUTs occupy less than 0.01% of RAM yet account for 87.6% of all memory transactions and 91.6% of execution time. T-SAR fixes this by repurposing SIMD vector registers to generate LUTs on-the-fly instead of loading them from memory, turning a bandwidth-bound problem into a compute-bound one — with only minimal ISA extensions and no new ALUs.

Results: 5.6–24.5× GEMM latency reduction and 1.1–86.2× GEMV throughput improvement over state-of-the-art CPU baselines, across models from 125M to 100B parameters. Memory request volume drops by 8.7–13.8×. On a mobile CPU, a 7B model prefill goes from >20s to under 1.7s. Hardware cost is minimal: only +1.4% area and +3.2% power on a 256-bit SIMD slice (synthesized at TSMC 28nm). Energy efficiency is 2.5–4.9× better than NVIDIA Jetson AGX Orin on the same workloads.

My role: ran the gem5 simulations used to evaluate the ISA extensions, did the ALU-level ternary operator analysis, and built the performance–energy models that drove the final architecture decisions.

Projects

FPGA-based Accelerator Prototype

Synthesized RTL accelerator designs onto FPGA and worked through the timing and utilization issues that only show up once you're on real hardware. Learned a lot about the gap between simulation and actual deployment.

FPGA · RTL Synthesis · Timing Closure · Hardware Acceleration

PEANO-ViT Softmax Approximation RTL

Implemented a Verilog module for hardware-efficient softmax approximation in Q8 fixed-point. Spent a fair amount of time getting the value scaling right — small errors in fixed-point arithmetic compound quickly.

Verilog · Fixed-Point Arithmetic · RTL Design
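
To illustrate how small fixed-point errors compound, here is a minimal Python sketch (not the project's Verilog). It assumes "Q8" means 8 fractional bits, and the function names and the 0.9 scaling factor are hypothetical, chosen only for the demonstration. It compares a multiply that truncates the low bits against one that rounds before shifting.

```python
FRAC_BITS = 8             # assumption: "Q8" = 8 fractional bits
SCALE = 1 << FRAC_BITS    # 256 integer codes per unit


def to_fixed(x: float) -> int:
    """Quantize a real value to the nearest fixed-point integer code."""
    return int(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a fixed-point integer code back to a real value."""
    return q / SCALE


def mul_trunc(a: int, b: int) -> int:
    """Fixed-point multiply that simply drops the low bits (floor)."""
    return (a * b) >> FRAC_BITS


def mul_round(a: int, b: int) -> int:
    """Fixed-point multiply that adds half an LSB before the shift."""
    return (a * b + (SCALE >> 1)) >> FRAC_BITS


def repeated_scale(x: float, factor: float, steps: int, mul) -> float:
    """Multiply x by factor `steps` times using the given fixed-point multiply."""
    q, f = to_fixed(x), to_fixed(factor)
    for _ in range(steps):
        q = mul(q, f)
    return to_float(q)


if __name__ == "__main__":
    exact = 0.9 ** 10
    for name, mul in (("truncate", mul_trunc), ("round", mul_round)):
        approx = repeated_scale(1.0, 0.9, 10, mul)
        print(f"{name:8s} result={approx:.5f} error={abs(approx - exact):.5f}")
```

After ten chained multiplies, the truncating version drifts noticeably further from the exact value than the rounding one, even though each individual step loses less than one LSB.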

UART Transmitter / Receiver

Designed and verified UART TX/RX modules in SystemVerilog — baud-rate generation, control logic, and testbench simulation. A foundational project that made me much more careful about timing constraints.

SystemVerilog · Digital Design · Simulation
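
Baud-rate generation usually comes down to picking an integer clock divider and checking how far the resulting rate lands from the target. The Python sketch below shows that arithmetic. The 50 MHz clock and 115200 baud figures are assumptions for illustration, not values from the project, and the function names are hypothetical.

```python
def baud_divider(clk_hz: int, baud: int) -> int:
    """Clock cycles per bit, rounded to the nearest integer."""
    return (clk_hz + baud // 2) // baud


def baud_error_pct(clk_hz: int, baud: int) -> float:
    """Percent deviation of the achievable baud rate from the target."""
    div = baud_divider(clk_hz, baud)
    actual = clk_hz / div
    return 100.0 * (actual - baud) / baud


if __name__ == "__main__":
    clk_hz, baud = 50_000_000, 115_200  # assumed example values
    print("divider:", baud_divider(clk_hz, baud))
    print(f"baud error: {baud_error_pct(clk_hz, baud):+.4f}%")
```

A check like this is a quick way to confirm that the divider error stays well inside the tolerance a UART receiver can absorb over one frame.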

Skills

RTL / HDL

SystemVerilog
Verilog
VHDL

FPGA & Simulation

FPGA synthesis & implementation
Timing analysis
Functional simulation
gem5
Cadence
LTspice

Programming

Python
C / C++
PyTorch

Domains

Hardware Acceleration
Computer Architecture
Quantization & LLM Inference
Semiconductor Fabrication

Exploring

ASIC back-end flow
Synthesis (DC / Yosys)
Place & Route
