15 May 2025
TL;DR: Built an RTX 5070 + Ultra 7 workstation on a ~¥20,000 budget, boosting KataGo inference speed roughly a hundredfold over CPU-only, as an AI “divine assistant” gift for my amateur 6-dan uncle.
Ever since deciding to pursue my Master’s at CMU last year, I’ve been keenly interested in AI applications, especially local deployment. One day in July, while browsing r/localllama, it suddenly struck me: AlphaGo, which was all the rage a few years back, is also AI, right? I wondered if there was an open-source version. Wouldn’t it be fascinating to run it on my own computer? I could leverage the power of AI to relive the joy of victory I didn’t quite get enough of when learning Go in elementary school 😆. A quick search revealed that while AlphaGo itself isn’t open-source (the original system ran on racks of TPUs in Google’s data centers, so even if it were open-source, few could run it), there is a community-built open-source replication: KataGo. It’s also much smaller than the original, and thanks to advances in algorithms and hardware, KataGo now runs smoothly on consumer-grade hardware while surpassing top human players. I was pleasantly surprised, and a seed of an idea was planted in my mind.
In May 2025, I returned to Shanghai for two weeks and pondered what meaningful gift I could bring my uncle, an amateur 6-dan Go enthusiast whom I hadn’t seen in five years. I decided to build him a workstation capable of running KataGo: meaningful, and it would scratch my PC-building itch too :).
Since I decided to DIY, I was determined to achieve an effect distinct from pre-built brand machines or custom-assembled ones (otherwise, I’d risk getting an earful from my parents 🙃). The potential advantages of DIY are: 1) achieving higher specs than brand machines for the same budget; 2) using higher quality components than typical custom assemblies; 3) optimizing the configuration for specific use cases: paying only for the performance needed and leaving room for future upgrades. These “potential” advantages aren’t automatic; they require careful design to be fully realized. If the sole requirement is to get a certain configuration (like an i7+5080) at the lowest price, then a custom assembly from JD.com will undoubtedly be cheaper than DIY.
The Importance of Peripherals: My goal is to give my uncle the best user experience, not the highest benchmark scores. Peripherals are crucial for user experience, so the budget should be allocated accordingly.
Balance: This computer won’t just run KataGo; it also needs to handle various daily tasks. Therefore, a balanced configuration is needed to meet computing demands for the next ~5 years, rather than just buying the most powerful GPU and calling it a day.
System-Level Value: The most value-for-money computer isn’t made of the most value-for-money individual parts because when evaluating a component’s value, you need to consider the cost of the entire system. For example: an i9 is 20% faster than an i7 but 50% more expensive. If the i7 accounts for 25% of the total system cost, upgrading to an i9 increases the total system cost by 12.5% but boosts performance by 20%. For use cases that can fully utilize the i9, this upgrade is entirely worthwhile.
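To spell that rule out as a back-of-the-envelope criterion, using the numbers above: an upgrade is worth it when the system-level performance gain exceeds the system-level cost increase.

$$\Delta\text{cost}_{\text{system}} = 25\% \times 50\% = 12.5\% \quad < \quad \Delta\text{perf} = 20\%$$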
Stability: Since I’ll be returning to Pittsburgh after building this PC and won’t be able to handle after-sales issues, the components must be reliable, or at least easy to get warranty service for. Thus, I’ll lean towards mature products from major brands.
CPU
Ultra 5, 7, or 9? - The clock speed differences among the three most representative models, Ultra 5 245, Ultra 7 265, and Ultra 9 285, are negligible. The main difference lies in the core count.
CPU Model | P-Cores | E-Cores | Total Cores | Total Threads | MSRP |
---|---|---|---|---|---|
Ultra 5 245 | 6 | 8 | 14 | 14 | $329 |
Ultra 7 265 | 8 | 12 | 20 | 20 | $419 |
Ultra 9 285 | 8 | 16 | 24 | 24 | $589 |
Considering that KataGo’s inference relies primarily on the GPU rather than CPU multi-core performance, the Ultra 7 265 offers the best value. The Ultra 9 285’s four extra E-cores don’t justify its higher price for our use case. Meanwhile, the Ultra 5 gives up two P-cores, and since this architecture doesn’t support SMT, those physical cores may well be needed for everyday multitasking.
Motherboard
CPU overclocking offers very limited performance improvement (current-gen CPUs already ship with high clocks), so I chose to forgo it. However, many AI applications (like LLM inference) are bottlenecked by memory bandwidth, making memory overclocking more useful. H810 boards, being super budget-oriented, cut too many features (including memory overclocking), so the mid-range B860 platform is the sweet spot.
Graphics Card (GPU)
Running KataGo only uses about 1 GB of VRAM, so a 50-series card, or even an older generation, is plenty. With GPUs currently overpriced, I opted for the “good enough” ASUS Prime RTX 5070 12G.
Model | VRAM | FP16 Tensor TFLOPS (FP32 Accumulate) | MSRP (RMB) | Market Price (RMB) |
---|---|---|---|---|
RTX 5090 D | 32 GB GDDR7 | 419.2 TFLOPS (Wikipedia, NVIDIA) | From 16,499 RMB (ITHome, Sina Finance) | 28,000 – 39,000 RMB (36Kr, Sina Finance) |
RTX 5080 | 16 GB GDDR7 | 225.1 TFLOPS (Wikipedia, ZOL AI) | From 8,299 RMB (ITHome, ITHome) | ≈ 8,299 RMB (ITHome) |
RTX 5070 Ti | 16 GB GDDR7 | 177.4 TFLOPS (Wikipedia, Sina Finance) | From 6,299 RMB (Sina Finance, Sohu) | 7,000 – 8,000 RMB (Sohu) |
RTX 5070 | 12 GB GDDR7 | 123.9 TFLOPS (Wikipedia, Gamersky) | From 4,599 RMB (Gamersky) | ≈ 4,599 RMB (SMZDM) |
RTX 5060 Ti | 16 GB GDDR7 | 92.9 TFLOPS¹ | From 3,599 RMB (SMZDM Post) | ≈ 3,400 RMB (Zhihu) |
Case
RAM
SSD
Power Supply (PSU)
CPU Cooler
Monitor
After getting used to Apple products, my standards for screen quality have noticeably increased, hahaha. The contenders for this build were two new 32-inch 4K screens Dell launched at CES this year: U3225QE and S3225QC. My experience after trying them: both are fantastic! The U3225QE’s matte screen is very comfortable to look at, and its connectivity is powerful, perfect for connecting a laptop as a dock + 8 hours of office work a day. The S3225QC’s glossy screen + OLED colors are beautiful, and the spatial audio speakers are stunning, making it great as an AV screen connected to a desktop. The decision: I’m keeping the U3225QE for myself, and the S3225QC goes to my uncle. Here’s a parameter comparison:
Parameter | S3225QC | U3225QE |
---|---|---|
Screen Size | 31.6 inches | 31.5 inches |
Resolution | 3840×2160 | 3840×2160 |
Panel Type | QD-OLED | IPS Black |
Refresh Rate | 120 Hz | 120 Hz |
Contrast Ratio | Theoretically Infinite : 1 | 3,000 : 1 |
Response Time | 0.03 ms (GtG) | 5 ms (GtG) |
Color Gamut | 99 % DCI-P3 | DCI-P3 99 % / sRGB 100 % |
HDR Certification | VESA DisplayHDR True Black 400 | DisplayHDR 600 |
Speakers | Built-in 5×5 W | No built-in speakers |
USB-C Power Delivery | Up to 90 W | Up to 140 W |
Ports | 2×HDMI 2.1, 1×DisplayPort 1.4, 1×USB-C (DP+PD) | 1×HDMI 2.1, 2×DisplayPort 1.4 (Input), 1×DisplayPort 1.4 (Output), 2×Thunderbolt 4 (Up/Downstream), 1×USB-C (KVM Upstream), 4×USB-A, 1×2.5 GbE RJ45, 1×3.5 mm Audio Out |
Market Price (RMB) | ¥ 6,499 | ¥ 5,999 |
Budget: around ¥20,000 for the tower plus peripherals.
Component | Model Name | Budget % |
---|---|---|
Monitor | Dell S3225QC (31.6-inch 4K QD-OLED 120Hz) | 31.0% |
Graphics Card (GPU) | ASUS PRIME RTX 5070 12G | 24.1% |
CPU | Intel Core Ultra 7 265 | 12.3% |
Motherboard | ASUS ROG STRIX B860-F WIFI | 9.6% |
RAM | Crucial 64GB (2x32GB) DDR5 5600MHz | 6.2% |
SSD | Samsung 9100 PRO 1TB PCIe 5.0 NVMe | 5.3% |
Keyboard & Mouse | Logitech ALTO KEYS K98M + MX Master 3S | 4.8% |
PSU | Thermalright SP850W Platinum ATX 3.1 | 3.0% |
Case | ASUS ProArt PA401 (Wood & Metal Edition) | 2.9% |
CPU Cooler | Thermalright Peerless Assassin 120 (PA120) | 0.9% |
To use KataGo happily, you need three pieces: 1) the KataGo command-line engine, 2) KataGo model weights, and 3) the KaTrain graphical interface. (If performance isn’t a high priority, you can skip the engine setup and simply install KaTrain, which ships with a built-in OpenCL backend. If you’re willing to tinker, the TensorRT backend can be up to ~2.5× faster.)
Step by step:
First, download the latest KataGo release from GitHub. There are different download options depending on your operating system and the acceleration library/backend used. For NVIDIA GPUs, the highest-performing backend is TensorRT; on Windows, that means katago-v1.16.0-trt10.9.0-cuda12.8-windows-x64.zip. Here’s a comparison of the different backends:
Executable Name | Platform | Backend | Backend Version | Notes |
---|---|---|---|---|
katago-opencl | Linux, Windows | OpenCL | — | General GPU acceleration, no specific drivers needed (GitHub, Zhihu Column, CSDN Blog) |
katago-cuda12.5 | Linux, Windows | CUDA | 12.5 | Optimized for NVIDIA GPUs with CUDA 12.5 drivers installed (GitHub, Zhihu Col., CSDN Blog) |
katago-trt10.2.0 | Linux, Windows | TensorRT | 10.2.0 | Highest throughput on GPUs supporting TensorRT 10.2.0 (GitHub, Zhihu Col., CSDN Blog) |
katago-cpu | Linux, Windows | Eigen | — | Pure CPU fallback, runs without a GPU (GitHub, Zhihu Column, CSDN Blog) |
katago-cpu-avx2 | Linux, Windows | Eigen AVX2 | — | Optimized for CPUs supporting AVX2 instruction set (GitHub) |
From the name, you can tell it needs to be paired with TensorRT 10.9.0 and CUDA 12.8 (and in practice cuDNN as well, or it won’t run). This feels very much like doing AI work 😆.
Install Visual Studio; the Community edition is fine. NVIDIA’s libraries depend on runtime DLLs that ship with Visual Studio.
Download the CUDA Toolkit installer from the NVIDIA website. CUDA is backward compatible, so even if you install a version newer than CUDA 12.8, it’s okay.
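Once the toolkit is installed, it’s worth a quick sanity check from a command prompt before moving on; both commands below ship with the standard driver/toolkit install:

```
:: Confirm the CUDA compiler is on PATH and report its version
nvcc --version

:: Confirm the GPU driver is alive and see the max CUDA version it supports
nvidia-smi
```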
After successfully installing CUDA, install cuDNN: download the cuDNN ZIP from NVIDIA and copy its contents into the CUDA installation directory:

bin\cudnn*.dll (including cudnn64_9.dll) → C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\bin
include\cudnn*.h → C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\include
lib\cudnn*.lib → C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\lib

Then make sure the CUDA bin directory is on the system PATH:
1. Press Win+R, type sysdm.cpl, and press Enter to open System Properties.
2. Under Advanced → Environment Variables, select Path, then click “Edit.”
3. Add C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\bin, replacing vX.Y with your actual CUDA version.
4. Run where cudnn64_9.dll in a command prompt to ensure cuDNN can be found by other programs.

Install TensorRT. First, download it from here, then follow these steps (documentation):
1. Download the TensorRT ZIP package matching your CUDA version, e.g.:
   - TensorRT-10.x.x.x.Windows.win10.cuda-11.8.zip
   - TensorRT-10.x.x.x.Windows.win10.cuda-12.9.zip
   Extract it to TensorRT-10.x.x.x. (NVIDIA Docs)
   Note: 10.x.x.x represents the TensorRT version number; cuda-x.x represents the corresponding CUDA version, e.g., 11.8 or 12.9.
2. Add the TensorRT library files to the system PATH, in one of two ways:
   - Method 1: add the lib path from the extracted directory to PATH: <Installation_Directory>\TensorRT-10.x.x.x\lib
   - Method 2: copy the .dll files from <Installation_Directory>\TensorRT-10.x.x.x\lib into your CUDA installation directory, e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vX.Y\bin, where vX.Y is your CUDA version number, e.g., v11.8. This works because the CUDA installer already added its bin directory to the system PATH.
3. Install the TensorRT Python package:
   - Navigate to the <Installation_Directory>\TensorRT-10.x.x.x\python directory.
   - Install the .whl file for your Python version, for example:
     python.exe -m pip install tensorrt-*-cp3x-none-win_amd64.whl
     Replace cp3x with your Python version, e.g., cp310 for Python 3.10.
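Before moving on to KataGo itself, a quick check that the libraries resolve can save debugging later. A minimal sanity check; note that nvinfer_10.dll is my recollection of TensorRT 10’s versioned DLL name, so adjust it to whatever actually sits in your lib folder:

```
:: Check that Windows can locate the TensorRT runtime DLL via PATH
where nvinfer_10.dll

:: Check that the Python package imports, and print its version
python -c "import tensorrt; print(tensorrt.__version__)"
```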
At this point, KataGo is installed. You can now proceed to download model weights: https://katagotraining.org/networks/
Find the KataGo program downloaded in step 1 and the KataGo model weights downloaded in step 6. Run a performance test to verify successful installation:
.\katago.exe benchmark -model "path_to_weights"
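If the benchmark completes, KataGo’s README also offers genconfig, which interactively generates a GTP config tuned to your hardware; the output file name below is just a placeholder:

```
:: Interactively generate a tuned GTP config (rules, threads, memory, etc.)
.\katago.exe genconfig -model "path_to_weights" -output gtp_custom.cfg

:: Re-run the benchmark against the tuned config
.\katago.exe benchmark -model "path_to_weights" -config gtp_custom.cfg
```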
KaTrain is a graphical interface designed for KataGo. Download the installer, KaTrain.exe, from GitHub.
After successful installation, launch it. Click the menu button in the top left corner to open the “General and Engine Settings” interface. Update the “KataGo executable path” and “KataGo model file path” to the files downloaded in steps 1 and 6. In “KataGo Engine Settings,” the “Visits per move during analysis” can be increased from the default value. Return to the main interface, and all installations are complete!
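Under the hood, KaTrain drives KataGo through its JSON analysis engine, and you can script that interface yourself, say, to batch-review your uncle’s game records. A minimal sketch based on my reading of KataGo’s analysis-engine docs; analysis.cfg stands in for the analysis_example.cfg in KataGo’s repo:

```
:: Start the JSON analysis engine; it reads one JSON query per line on stdin
.\katago.exe analysis -model "path_to_weights" -config analysis.cfg

:: Example query to type into stdin: evaluate the position after Black Q16
{"id":"q1","moves":[["B","Q16"]],"rules":"japanese","komi":6.5,"boardXSize":19,"boardYSize":19,"analyzeTurns":[1]}
```

Each response comes back as one JSON line containing the win rate, score lead, and top candidate moves for the requested turn.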
KataGo has models of different architectures (sizes). The newest b28c512 series with the most parameters is the strongest. However, the previous generation b18c384 model is more efficient, able to test more variations in the same amount of time, making it suitable for comparing the pros and cons of different moves when hardware performance is limited.
Although the official README already spoils that TensorRT is currently the highest-performing backend, it doesn’t say how large its lead over OpenCL and CUDA is on 50-series GPUs. If the difference were small, one could skip the hassle of setting up TensorRT entirely. So: the generic backends are usable, but how much of a boost does NVIDIA’s TensorRT actually provide? This bears directly on the robustness of Jensen Huang’s software moat. Likewise, there’s no contest over whether CPUs or GPUs run CNNs faster, but is a CPU sufficient when the demands are modest? With these two questions in mind, I benchmarked every backend with both network architectures.
Another question is how many visits per move KataGo actually needs to be effective. I had ChatGPT summarize the evidence, and the results are quite astounding: KataGo doesn’t need much computing power to reach superhuman levels. The summary below is based on the 2020 version of the model; by 2024, some netizens were saying the latest model reaches 8-dan strength with just one visit. For learning Go, the main point of stacking more visits is to analyze the pros and cons of alternative moves on the board.
Dan Level | Suggested Visits/Move† | Primary Basis | Notes |
---|---|---|---|
4 Dan | ≈ 6 visits | portkata calibration formula (2020) (GitHub) | — |
5 Dan | ≈ 8 visits | Ibid. (GitHub) | Blogger often uses “8 visits” for 5d practice |
6 Dan | ≈ 10 visits | Ibid. (GitHub) | — |
7 Dan | ≈ 12 visits | Ibid. (GitHub) | — |
8 Dan | ≈ 14 visits | Ibid. (GitHub) | — |
9 Dan | ≈ 16 visits | Ibid.; tested to beat Zen7 9d (GitHub) | |
Top Amateur / Near Pro | ≈ 128 visits | OGS discussion: tens to hundreds already surpass strongest amateurs (Online Go Forum) | b28c512 adds +300 Elo (Reddit) |
Superhuman | ≥ 2,048 visits | OGS “Potential Rank Inflation” & Adversarial Policies paper (2022) (OGS, OpenReview) | Adversarial policies can still beat it ~72% of the time |
“Extreme Deduction” (Research / Mining) | 10,000 – 100,000 visits | Researcher & L19 discussion: 10k+ significantly reduces occasional blunders, stabilizes ko fights (L19, L19) | Diminishing returns, but good for long reads/flaw detection |
All results are from: .\katago.exe benchmark -model path_to_model
Because KataGo uses the MCTS algorithm, parallel search trades a little quality for throughput, so performance depends somewhat on thread count. The benchmarks therefore report visits per second at the recommended thread count alongside the maximum visits per second across all tested thread counts; the two numbers are generally close.
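All the tables come straight from the benchmark command above; if I remember KataGo’s CLI correctly, you can also pin exactly which thread counts it tests via -t (verify against .\katago.exe benchmark -help if in doubt):

```
:: Benchmark only the listed thread counts instead of auto-searching
.\katago.exe benchmark -model "path_to_weights" -t 16,24,32,48,64
```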
b18c384 network:
Backend | Device | Rec. Threads | Visits/Sec (Rec. Threads) | Max Visits/Sec (Any Threads) | Speedup |
---|---|---|---|---|---|
Eigen (CPU) | Ultra 7 265 | 20 | 37.63 | 37.63 | 1.00x |
AVX2 (CPU) | Ultra 7 265 | 20 | 51.66 | 51.66 | 1.37x |
Metal | Apple M3 Max | 12 | 348.28 | 348.28 | 9.26x |
OpenCL | RTX 5070 | 24 | 1250.27 | ~1339 | 33.24x |
CUDA | RTX 5070 | 48 | 2294.01 | ~2400 | 60.97x |
TensorRT | RTX 5070 | 64 | ~3262 | ~3299 | 86.72x |
b28c512 network:
Backend | Device | Rec. Threads | Visits/Sec (Rec. Threads) | Max Visits/Sec (Any Threads) | Speedup |
---|---|---|---|---|---|
Eigen (CPU) | Ultra 7 265 | 16 | 13.48 | ~15.13 | 1.00x |
AVX2 (CPU) | Ultra 7 265 | 20 | 22.05 | 22.05 | 1.64x |
Metal | Apple M3 Max | 8 | 135.27 | ~138.61 | 10.04x |
OpenCL | RTX 5070 | 24 | 580.03 | ~580 | 43.03x |
CUDA | RTX 5070 | 24 | 926.79 | ~962 | 68.76x |
TensorRT | RTX 5070 | 40 | 1397.10 | ~1424 | 103.66x |
From the benchmarks:
1) TensorRT leads CUDA by roughly 1.4–1.5× and OpenCL by 2.4–2.6× on the RTX 5070, so the setup hassle pays off and Jensen’s software moat looks solid.
2) GPUs win by about two orders of magnitude: on the larger b28c512 network, TensorRT on the RTX 5070 is ~104× the Eigen CPU baseline, and even the Apple M3 Max’s Metal backend, at ~10× CPU, remains far behind.
3) Even so, the CPU-only backends still manage 13–52 visits per second, which per the visits table above is already enough for high-dan play; a GPU is for deep analysis rather than basic strength.
Reflecting on the entire build and testing process, my biggest takeaway is how surprisingly well AI tools run on today’s consumer-grade hardware. The immense progress in algorithms and hardware since AlphaGo’s debut means a model that once required a data center can now be deployed on a personal computer in a single morning and run smoothly. And this process is now replaying itself with LLMs: the 175B-parameter GPT-3 from 2020 scored only 43.9% on MMLU, while the 4B-parameter Qwen 3 from 2025 achieves nearly 70% with roughly 1/44th the parameters, and it can run local inference on a single RTX 4090. That is the leap algorithms and hardware have made in five years. AI has already revolutionized the world of Go; former world champions have chosen to pursue MBAs at Tsinghua because they no longer find joy in playing. When everyone can run AI smarter than themselves on their own devices, what changes will it bring to the world?
Pondering this, while marveling at AI’s magic, I inevitably feel some anxiety. The simplest prediction is that tasks that can be programmatically verified for correctness (like games, multiple-choice questions, writing code from tests) will foreseeably be rapidly solved and surpassed by AI once the RL environment is established. And with advancements in multimodality and computing power, RL environments will accommodate an increasing variety of tasks, even introducing AI judges for self-iteration. Thinking to the extreme, aren’t humans just embodied intelligences trained through RL from birth? 😜 So, the bottlenecks for AI completely replacing humans are: (1) RL environments cannot yet simulate “Earth Online,” (2) AI models cannot replicate human sensory input, and (3) model training lacks the decades-long, long-form contextual data that humans possess.
Thus, rational short-term responses include: 1) Be the one training these AIs, 2) Be the one using these AIs, 3) Stay far away from industries about to be disrupted 😆.
Mid-to-long-term responses lie in developing one’s own cross-task/cross-industry/cross-domain experience, to take root in unique composite fields where AI lacks training data and environments.
Fear stems from the unknown – uncertainty about the future. In an era of great change brought by AI, the most reassuring thing is the ability to learn quickly. After all, in the foreseeable future, AI will still need people to train, maintain, and operate it.
If you have your own thoughts on AI development, or any questions or suggestions about the build plan in this article, feel free to leave a comment and discuss~
Looking forward to seeing you in the next blog post! The next one is planned to be about exploring Kaggle competitions in CMU’s Intro to Deep Learning course 🚀