Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX

Abstract

Since Ethereum’s transition to the Proof-of-Stake (PoS) mechanism in 2023, NVIDIA’s cryptocurrency mining-specific GPUs (e.g., the CMP series), designed with performance reductions in general computing and graphics rendering capabilities, have faced severe limitations in general-purpose computing scenarios, becoming typical electronic waste. This study systematically tests a computational power reuse scheme proposed by the open-source community—disabling specific instruction sets (e.g., Fused Multiply-Add instructions) through CUDA source code modifications—on the NVIDIA CMP 170HX platform. Experimental results validate the effectiveness of this approach, partially restoring the GPU’s computational capabilities in artificial intelligence (AI) tasks. Performance evaluations using open-source GPU benchmarks (OpenCL benchmark, mixbench) and AI benchmarks (LLAMA-benchmark) reveal that its FP32 floating-point performance exceeds 15 times the original capability, while inference performance for certain precision levels in large language models surpasses threefold improvements. Furthermore, based on hardware architecture analysis, this paper proposes theoretical conjectures for further improving computational utilization through alternative adaptation pathways. Combining energy efficiency ratios and cost models, the recycling value of such obsolete GPUs in edge computing and lightweight AI inference scenarios is evaluated. The findings demonstrate that rationally reusing residual computational power from mining GPUs can significantly mitigate the environmental burden of electronic waste while offering cost-effective hardware solutions for low-budget computing scenarios.

Introduction

Background

NVIDIA CMP GPUs and Their Sales Estimates

In a blog post dated February 18, 2021 ¹, NVIDIA stated, “GeForce is made for gaming, CMP is made for cryptocurrency mining,” announcing the launch of NVIDIA CMP (Cryptocurrency Mining Processor), a product line specifically designed to meet the unique demands of Ethereum mining. The company emphasized that “CMPs do no graphics, have no display outputs, and feature lower peak core voltage and frequency, optimized for maximum mining performance and efficiency. CMPs do not meet the specifications of GeForce GPUs, ensuring gamers retain access to GeForce products.” Subsequently, four CMP models—30HX, 40HX, 50HX, and 90HX—were showcased on NVIDIA’s CMP product page ². However, this was not NVIDIA’s first venture into mining-specific GPUs. According to GPU DATABASE ³, NVIDIA initially released the P106-100, a GPU dedicated to cryptocurrency mining, as early as June 19, 2017, followed by subsequent models such as the P106-090, P104-100, and P102-100.

NVIDIA has only disclosed CMP sales in select fiscal reports ⁴. Based on publicly available data, the company’s cryptocurrency-related revenue for its four fiscal quarters in 2022 was 155million,155million,266 million, 105 million, and 105million,and24 million, respectively. NVIDIA has historically obscured revenue generated from cryptocurrency mining, a practice that led to penalties from the U.S. Securities and Exchange Commission (SEC) ⁵. Consequently, the Manufacturer’s Suggested Retail Price (MSRP) for these CMP GPUs cannot be ascertained through official channels. According to third-party investigations ⁶, the approximate market prices of these CMP cards are summarized in the table below:

Model	Release Date	2021 Average Price (USD)	FP16 Computational Power (TFLOPS)
CMP 30HX	2021 Q1	~700-800	10.05
CMP 40HX	2021 Q1	~600-700	15.21
CMP 50HX	2021 Q2	~700-900	22.15
CMP 90HX	2021 Q2	~1400-1700	21.89
CMP 170HX	2021 Q3	~4000-5000	50.53

Table 1-1 Prices and Theoretical Performance of CMP GPUs in 2022

Based on the midpoint pricing, sales estimates for each GPU under different scenarios are derived as follows:

Model	Estimated ASP ($)	Scenario A (15/25/25/25/20/15) Estimated Sales	Scenario B (25/30/20/15/10) Estimated Sales	Scenario C (10/15/20/25/30) Estimated Sales
CMP 30HX	750	~110,000	~183,333	~73,333
CMP 40HX	650	~211,538	~253,846	~126,923
CMP 50HX	800	~171,875	~137,500	~137,500
CMP 90HX	1550	~70,968	~53,226	~88,710
CMP 170HX	4500	~18,333	~12,222	~36,667
Whole		~582,714	~640,127	~463,133

Table 1-2 Estimated Sales Volume of CMP GPUs

It is evident that the data derived from this inference are significantly lower than actual figures (excluding non-2022 sales and non-CMP series P106/P104/P102 models). Nonetheless, it is clear that hundreds of thousands of mining-specific GPUs have been either discarded or redirected to secondary markets following Ethereum’s transition to PoS.

Ethereum Mining Principles

Wang et al. (2020, Chapter 4, Section 4.1.1, pp. 88–91) ⁷ describe Ethereum’s GPU-dependent Proof-of-Work (PoW) mining process. In this mechanism, ETH selects the block with the highest difficulty as the valid block. Miner nodes generate blocks, while other nodes verify them. Miners attempt to find a suitable Nonce value through rapid computation to produce a result below the specified difficulty threshold. This process relies heavily on brute-force enumeration, whereas verifying the Nonce’s validity is straightforward.

The bottleneck of this algorithm lies in memory read/write performance, preventing miners from improving efficiency through faster hardware.

First, the network maintains a Directed Acyclic Graph (DAG) dataset occupying approximately 4GB of memory. In GPU mining, the DAG is loaded into VRAM and updated every 30,000 blocks. If the DAG is not pre-generated, mining nodes halt block production.

The Nonce-solving process proceeds as follows:

Perform SHA-3 computation on the block header input and a hypothetical Nonce value to generate a 128-byte Mix0.
Read page data from the DAG at a position linked to Mix0.
Mix Mix0 with the retrieved DAG page data to generate a 128-byte Mix1.
Repeat Steps 2 and 3 64 times, resulting in a 128-byte Mix64.
Process Mix64 into a 32-byte data digest.
If the digest is less than or equal to the difficulty threshold, the Nonce is valid; otherwise, iterate with the next Nonce.

GPU Export Restrictions to China

Since October 2018 ⁸, when the Trump administration severed ties between Chinese chipmaker Fujian Jinhua Integrated Circuit Company and U.S. suppliers, the U.S. government has imposed restrictions on semiconductor technology and high-performance GPU exports to mainland China. In August 2022 ⁹, NVIDIA disclosed in an SEC filing that it was prohibited from exporting A100 and H100 GPUs to mainland China. By October 2023 ¹⁰¹¹, as stated in two additional SEC reports, the U.S. government expanded the ban to include GPUs such as the 4090, A800, and H800. In April 2025 ¹², NVIDIA announced further restrictions blocking exports of products like the H20 to mainland China. Similarly, AMD ¹³ reported in SEC filings that its MI308 GPUs would be barred from export to the region.

Research Objectives

Given the aforementioned context, the following conclusions emerge:

Massive quantities (exceeding hundreds of thousands) of mining-specific GPUs face obsolescence. Ethereum’s Proof-of-Work (PoW) mining process relies heavily on GPU memory bandwidth and partial computational capabilities. Regions such as mainland China are subject to U.S.-imposed restrictions on high-performance GPU imports. Consequently, this study aims to validate whether these discarded mining-specific GPUs retain viable floating-point computational capabilities and AI computing capacity (for inference or training), thereby partially alleviating regional computational shortages caused by U.S. export controls in areas like mainland China.

Specific objectives of this study include:

Testing the floating-point performance of the CMP 170HX mining-specific GPU across half-precision (FP16), single-precision (FP32), and double-precision (FP64) formats. Evaluating the CMP 170HX’s performance under different GPU computing frameworks (CUDA/OpenCL). Assessing the AI inference capabilities of the CMP 170HX using the LLAMA-bench benchmark. Measuring the energy efficiency ratio (token/W) of the CMP 170HX during AI inference tasks.

Testing Tools and Benchmark Overview

MixBench

MixBench is a benchmarking tool ¹⁴ designed to evaluate the performance limits of GPUs (or CPUs) under mixed operational intensity kernels. The kernels executed are tailored for varying operational intensity values. Modern GPUs can hide memory latency by switching to threads capable of executing computational operations. This tool evaluates the actual optimal balance between memory and compute operations for a device. It supports implementations in CUDA, HIP, OpenCL, and SYCL for GPUs, and OpenMP for CPUs.

The tool conducts four types of experiments combined with global memory access:

Single-precision floating-point operations (Fused Multiply-Add, FMAs).
Double-precision floating-point operations (FMAs).
Half-precision floating-point operations (FMAs, GPU-only).
Integer multiply-add operations.

MixBench incrementally increases computational pressure and reports:

Compute iters (number of computational iterations per task).
Flops/byte (floating-point operations per byte).
Ex.time (execution time).
GFLOPS (giga floating-point operations per second; for integer operations, Iops/byte).
GB/sec (memory bandwidth in gigabytes per second).

OpenCL-Benchmark

OpenCL-Benchmark¹⁵ is a small OpenCL benchmarking tool designed to measure peak performance of GPUs and CPUs. It supports any GPU on Windows, Linux, macOS, and Android operating systems. The tool evaluates computational performance for FP64 (scalar), FP32 (scalar), FP16 (half2), INT64 (scalar), INT32 (scalar), INT16 (short2), INT8 (dp4a), memory bandwidth (coalesced/misaligned read/write), and PCIe bandwidth (send/receive/bidirectional). Specifically, the performance metrics for single-precision floating-point (FP32), double-precision floating-point (FP64), and half-precision floating-point (FP16) are measured in TFLOPs (trillion floating-point operations per second). For 64-bit integers (INT64), 16-bit integers (INT16), and 8-bit integers (INT8), performance is quantified using TIOPs (trillion integer operations per second).

GPU-Burn

GPU-Burn is a command-line-based, open-source stress-testing tool designed to maximally impose computational load on GPUs (Graphics Processing Units) ¹⁶. Its primary purpose is to evaluate the stability, thermal performance, and computational accuracy of GPUs under prolonged periods of intensive workloads.

The tool supports configuring parameters such as runtime duration, VRAM utilization, enabling/disabling Tensor Cores, and selecting computation modes (FP32, FP16, FP64).

In this study, GPU-Burn will serve as the control group without any modifications to its source code or compilation parameters, and will be directly compiled and used as-is.

PyTorch

PyTorch¹⁷ is an open-source deep learning framework developed by the Facebook Artificial Intelligence Research (FAIR) team. It serves as a tensor library optimized for deep learning, leveraging both GPUs and CPUs, and enables the construction of deep neural networks using a tape-based autograd system.

Typically, PyTorch is utilized for two primary purposes: a. As a replacement for NumPy, harnessing the computational power of GPUs; b. As a deep learning research platform that provides maximum flexibility and performance efficiency.

By leveraging PyTorch’s GPU support, we can develop scripts to evaluate the floating-point performance of the nVidia CMP 170HX.

llama.cpp

Llama.cpp¹⁸ is a tool implemented in pure C/C++ for inference of Meta’s LLaMA models (and other models). Its primary goal is to enable large language model inference with minimal setup and state-of-the-art performance on both local machines and in the cloud.

For this performance test, we will use llama-bench¹⁹, a performance testing tool for Llama.cpp. It allows configuration of parameters such as input prompt length, generated text length, batch size, and number of test repetitions. Users can also enable attention optimizations (e.g., Flash Attention) and specify offloading layers to the GPU. Additionally, test results can be exported in formats like Markdown, CSV, JSON, JSONL, or SQL for easier organization.

LLAMA-BENCH supports three test types:

Prompt Processing (pp): Batch process prompts (-p).
Text Generation (tg): Generate a sequence of tokens (-n).
Prompt Processing + Text Generation (pg): Process prompts then generate tokens (-pg).

All options can be specified multiple times to run multiple tests. Each pp and tg test executes all combinations of the specified options.

Each test can be repeated a specified number of times, with results averaged. Outputs include the average tokens per second (t/s) and standard deviation. Some formats (e.g., JSON) also retain individual results for each repetition.

Testing Environment and Hardware Specifications

Tested GPU Hardware Parameters

This experiment employs the NVIDIA CMP 170HX 8GB as the test subject. Since NVIDIA has not released a whitepaper for this product, specifications are derived from publicly available data on GPU DATABASE ²⁰ and inferred using NVIDIA A100 documentation ²¹²². The NVIDIA A100 and CMP 170HX share the same core architecture, boost clock frequency, thermal design power (TDP), and similar memory speeds. By scaling the CUDA core count ratio between the A100 and CMP 170HX, the theoretical performance ceiling of the CMP 170HX can be estimated.

General Specifications

Parameter	Value	Parameter	Value
GPU Name	NVIDIA CMP 170HX	Microarchitecture	Ampere
GPU Codename	GA100-105F-A1	Process Node	7 nm (TSMC)
Transistor Count	826 mm²	TDP	250 W
Bus Interface	PCIe x4 1.1

Table 2-1 General Specifications of the nVidia CMP 170HX GPU

Core Configuration

Parameter	Value	Parameter	Value
Base Clock	1140 MHz	Boost Clock	1410 MHz
CUDA Cores	4480	TMUs	280
Streaming Multiprocessors	70	Tensor Cores	280
L1 Cache	192 KB per SM	L2 Cache	8 MB

Table 2-2 Core Specifications of the nVidia CMP 170HX GPU

Memory Specifications

Parameter	Value	Parameter	Value
Memory Type	HBM2e	Memory Size	8 GB
Memory Clock	1458 MHz	Effective Memory Clock	2916 MHz
Memory Bus Width	4096-bit	Memory Bandwidth	1493 GB/s

Table 2-3 Memory Specifications of the nVidia CMP 170HX GPU

Theoretical Performance

Parameter	Value	Parameter	Value
Texture Fill Rate (Base)	319.2 GT/s	Boost Texture Fill Rate	394.8 GT/s
Boost FP16 Performance	50.53 TFLOPS	Boost FP32 Performance	12.63 TFLOPS
Boost FP64 Performance	6.317 TFLOPS	Ethereum Hash Rate (Boost)	164 MH/s
Table 2-4 Theoretical Performance of the nVidia CMP 170HX GPU

Supported Features

Parameter	Value	Parameter	Value
OpenCL	3.0	Vulkan	1.3
CUDA	8.0

Table 2-5 Feature Support of the nVidia CMP 170HX

Testing Platform

Due to experimental constraints, the GPU was tested via an Oculink-connected PCIe interface on a custom-built mini-PC platform. Oculink (Optical Link), developed by the PCI-SIG, is a cable-based internal/external connectivity standard based on the PCI Express (PCIe) protocol ²³²⁴. It provides a more flexible PCIe transmission method compared to traditional PCIe slots, particularly suited for servers, storage systems, and external PCIe devices (e.g., external GPU enclosures). Oculink employs specialized connectors, typically SFF-8611 (socket) and SFF-8612 (plug).

Test Platform Configuration

The hardware and software specifications of the test platform are as follows:

Parameter	Value	Parameter	Value
CPU	AMD Ryzen 7 7840HS	CPU Core Architecture/Codename	Zen4/Phoenix
CPU Cores/Threads	8C/16T	Base/Max Boost Clock	3.8 GHz / 5.1 GHz
CPU Instruction Sets	MMX, SSE, SSE2, SSE3, SSE4.1, SSE4.2, SSE4a, AES, AVX, AVX2, AVX-512, BMI1, BMI2, F16C, FMA3, BF16, SHA, ABM, AMD64	CPU TDP	54 W
Memory Bus Width	128-bit	Memory Frequency	5600 MT/s
Operating System	Debian GNU/Linux 12.9	Kernel Version	linux-kernel 6.1.0-31-amd64
NVIDIA Driver Version	nvidia-linux-driver-570.86.16	CUDA Version	12.8

Table 2-6 Test Platform Configuration

Test Tool Configuration

All testing tools were compiled from source code retrieved from their respective GitHub repositories as of March 20, 2025. By default, compilation parameters remained unmodified except for adjustments to CMakeLists.txt files. For certain tools, source code modifications were necessary to disable FMA (Fused Multiply-Add) instructions.

Mixbench

Following the developer’s compilation guidelines, an initial build was performed using default settings. To validate FMA disablement, CMakeLists.txt was modified as follows:

Configuration	Default	FMA Disabled
CMakeLists Modifications	None	`set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -fmad=false" CACHE STRING "nvcc flags" FORCE)` `string(APPEND CMAKE_CUDA_FLAGS " -fmad=false")`
Compilation Commands	`mkdir build` `cd build` `cmake ../mixbench-cuda` `cmake --build ./`	Same as default (only flags changed)

Table 2-7 MixBench Configuration

OpenCL-Benchmark

The developer-provided compilation script was used to build and execute the tool. To assess FMA’s impact on performance, the source code was modified as shown below:

Configuration	Default	FMA Disabled
Source Code Modifications	None	`diff` `diff --git a/src/lbm.cpp b/src/lbm.cpp` `index d99202f..28aeb25 100644` `--- a/src/lbm.cpp` `+++ b/src/lbm.cpp` `@@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar` `}` `string LBM_Domain::device_defines() const { return` `+` `+"\n #pragma OPENCL FP_CONTRACT OFF" // Disables implicit FMA optimizations` `+"\n #define fma(a, b, c) ((a) * (b) + (c))" // Overrides OpenCL's fma() function` `"\n #define def_Nx "+to_string(Nx)+"u"` `"\n #define def_Ny "+to_string(Ny)+"u"` `"\n #define def_Nz "+to_string(Nz)+"u"`
Compilation Commands	./OpenCL-Benchmark/make.sh

Table 2-8 OpenCL-Benchmark Configuration

GPU-Burn

GPU-Burn served as a control group for floating-point performance testing. No modifications were made to its source code or compilation parameters.

Compilation Commands:
`make ./gpu_burn -tc 3600`

Table 2-9 GPU-Burn Configuration

PyTorch

A custom Python script based on PyTorch was developed to evaluate GPU floating-point performance through large-scale matrix operations. The script will be open-sourced on GitHub.

llama.cpp

To ensure compatibility with the CMP 170HX’s CUDA compute capability (8.0), modifications were made to ggml-cuda/CMakeLists.txt to disable FMA and enforce architectural optimizations:

Configuration	Default	FMA Disabled
CMakeLists Modifications	None	`set(CUDA_FLAGS --fmad=false)`
Compilation Commands	`cmake -B build -A x64 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80"` `cmake --build build --config Release --parallel 16`	Same as left

Table 2-10 llama.cpp Configuration

Computational Performance Testing

Results of performance tests may exhibit deviations due to experimental constraints, and limitations persist in tools such as OpenCL-benchmark and Mixbench.

Single-Precision (FP32) Performance

FP32 , or Single-Precision Floating-Point , uses 32 bits to represent a floating-point number. Typically, this includes 1 sign bit, 8 exponent bits, and 23 mantissa bits, adhering to the IEEE 754 standard . It remains the most common and foundational floating-point format in GPU computing, offering a balanced precision and range. The peak floating-point performance of GPUs is generally measured in FP32 throughput (measured in TFLOPS , or trillion floating-point operations per second).

FP32 strikes a moderate precision level with broad applicability, serving as the standard precision for the majority of graphics rendering and general-purpose GPU (GPGPU) applications. It benefits from widespread hardware support and robust optimization. Rendering calculations in most games, professional visualization tools, and CAD/CAM applications rely on FP32 for processing vertex coordinates, texture coordinates, color values, lighting computations, and more. FP32 also forms the basis for many scientific computations, physics simulations, signal processing, and financial modeling tasks.

Graph 3-1 nVidia CMP 170HX Float32 Benchmark Resault

By default , the FP32 performance of the 170HX is remarkably low, at approximately 0.39 TFLOPS , representing roughly 1/32 of its theoretical FP32 performance . This performance surpasses only the NVIDIA Tesla C870 (released in 2007), which had a theoretical FP32 throughput of about 0.346 TFLOPS ²⁵.

However, when applying a suggestion from user niconini’s blog post ²⁶—specifically, disabling Fused Multiply-Add (FMA) instructions during compilation—the CMP 170HX can achieve approximately 6.2 TFLOPS in benchmarks, reaching half of its theoretical FP32 performance . This performance surpasses that of the NVIDIA Tesla P6 ²⁷.

Half-Precision (FP16) Performance

FP16 , or Half-Precision Floating-Point , uses 16 bits to represent a floating-point number. It typically includes 1 sign bit, 5 exponent bits, and 10 mantissa bits. Compared to FP32, FP16 has a narrower numerical range and lower precision. However, it offers faster computation speed (theoretically twice as fast as FP32) and reduced memory usage and bandwidth (half of FP32’s requirements).

FP16 is primarily applied in deep learning training (specifically mixed-precision training ) and deep learning inference .

Graph 3-2 nVidia CMP 170HX Float16 Benchmark Resault

The FP16 performance of the nVidia CMP 170HX remains unaffected regardless of FMA status (whether Fused Multiply-Add instructions are enabled or disabled). This performance level is approximately equivalent to that of the NVIDIA GeForce RTX 4080 desktop GPU’s FP16 performance ²⁸ (excluding contributions from Tensor Cores). Theoretically, this could enable the CMP 170HX to handle certain inference tasks that are not memory bandwidth-bound, such as inference for Stable Diffusion.

However, the FP16 performance reported by PyTorch and GPU-Burn is only around 6.3 TFLOPS, which is significantly lower than the results from OpenCL-Benchmark and Mixbench. The author speculates that this discrepancy likely arises from differences in how FP16 precision data is handled by PyTorch and GPU-Burn compared to the other tools.

Similar to the FP32 scenario, OpenCL-based benchmarks show slightly higher performance than CUDA-based ones**. This phenomenon may stem from the fact that the CUDA-based Mixbench benchmark(with 1,024 compute iterations and a Flops/Byte ratio of 512.250) does not fully stress the GPU, thereby failing to unlock its maximum potential.

Double-Precision (FP64) Performance

Double Precision Floating Point (FP64) uses 64 bits to represent a floating-point number. It typically consists of 1 sign bit, 11 exponent bits, and 52 mantissa bits, adhering to the IEEE 754 standard. FP64 offers extremely high numerical precision and an extensive range of representable values. Its exceptional precision effectively mitigates error accumulation, making it indispensable for calculations requiring rigorous numerical accuracy.

Scientific computing and engineering simulations are the core application domains of FP64. In fields demanding complex mathematical computations with highly precise results, such as computational fluid dynamics (CFD), finite element analysis (FEA), climate and meteorological modeling, astrophysics, molecular dynamics, and quantum chemistry, FP64 is an essential requirement.

Graph 3-3 nVidia CMP 170HX Float64 Benchmark Resault

The FP64 performance of the CMP 170HX is approximately 1/64 of its theoretical peak performance, and this drops to 1/128 when FMA (Fused Multiply-Add) is disabled. This performance is far behind that of mainstream scientific computing GPUs like the A100, rendering it barely capable of meeting the requirements of any scientific computing tasks.

32-Bit Integer (INT32) Performance

INT32 refers to a 32-bit integer. It can represent integers ranging from -2,147,483,647 to 2,147,483,647 (signed) or 0 to 4,294,967,295 (unsigned). Integer operations are generally faster than floating-point operations and incur no precision loss within their representable range. They are commonly used for address calculations, indexing, bitmasking, shifting, logical operations (AND, OR, XOR), and other bit-level manipulations. INT32 is frequently employed in graphics rendering for state management, data packing/unpacking, and bitwise operations in specific algorithms (e.g., hashing).

Graph 3-4 nVidia CMP 170HX Integer32 Benchmark Resault

The INT32 performance of the NVIDIA CMP 170HX is not significantly restricted. However, applications relying heavily on this precision are relatively rare.

Similar to FP32 and FP16, and even more noticeably, the CMP 170HX’s INT32 performance in OpenCL is slightly higher than in CUDA. This discrepancy likely stems from the CUDA-based mixbench benchmark tool’s computational pressure (1024 Compute iters, Flops/byte = 512.250), which may not fully leverage the GPU’s potential.

Memory Bandwidth Testing

GPU memory bandwidth is one of the key performance bottlenecks. Even if a GPU has strong computational power (high TFLOPS), if data cannot be delivered to compute units or retrieved from them in a timely manner, the compute units will idle and wait, resulting in actual performance far below the theoretical peak. Memory bandwidth is critical for all data-intensive tasks. During the decoding phase of LLMs (large language models), memory bandwidth becomes the primary bottleneck. For bandwidth-sensitive algorithms like Ethash, their performance is almost entirely determined by memory bandwidth.

Graph 3-5 nVidia CMP 170HX Memory Bandwidth Benchmark Resault

Since the Ethereum mining algorithm relies heavily on bandwidth ⁷, the memory bandwidth of the CMP 170HX is fully retained. However, compared to the standard NVIDIA A100/A800 GPUs with 40GB/80GB memory, its 8GB capacity renders it incapable of handling large-scale model loading.

Nevertheless, even with only 8GB of memory, the CMP 170HX can still gain some performance advantages in bandwidth-dependent applications (e.g., the decoding phase of large language models) due to its retained bandwidth efficiency.

Large Language Model Inference Performance Testing

Test Configuration

Considering the nVidia CMP 170HX has only 8GB HBM2e memory, we selected the Qwen2.5-1.5B model to ensure all layers of LLM weights could be loaded into GPU VRAM. We tested different quantization versions of this model under the ggml framework (f32, f16, q8_0, q6_k, q4_k_m, q2_k_m) across three scenarios: full model loading, prefilling 512 tokens, and decoding 128 tokens.

The 1.5B Qwen2.5²⁹ model is a transformer-based architecture developed by the Qwen team featuring RoPE, SwiGLU, RMSNorm, Attention QKV bias, and tied embeddings. It has a total of 1.54B parameters (1.31B excluding the embedding layer). The architecture includes 28 layers with 12 Q heads and 2 KV heads (GQA). The maximum context length supports 32,768 tokens.

Following the usage guidelines of llama-bench in llama.cpp¹⁹, the testing command is: ./llama-bench -m ./models/Qwen2.5-1.5B-f32.gguf -ngl 28

Prefill Speed

The prefill stage involves the model processing input prompts. During this phase, all tokens in the input prompt are processed in parallel. For each token in the prompt, the model calculates corresponding Query (Q), Key (K), and Value (V) vectors, and uses self-attention mechanisms to compute inter-token relationships. The model computes and stores K and V vectors for all tokens across each Transformer block layer, forming cached contextual memory.

Due to parallel processing of the entire input sequence, this stage is computationally intensive (compute-bound) especially with long prompts. After controlling for prompt length, model size, and batch size, hardware compute capability determines performance. Theoretical performance estimation is calculated as:

$$ u_d = \frac{u_o}{o_{sm}} \cdot d_{sm} $$

Where $u_d$ represents CMP 170HX′s theoretical performance, $u_o$ is A100′s measured performance, $o_{sm}$ is A100′s SM count (108), and $d_{sm}$ is CMP 170HX’s SM count (70).

Graph 4-1 nVidia CMP 170HX llama-bench prefill Speed

However, CMP 170HX’s actual prefill performance lagged significantly behind theoretical expectations. Even with FMA disabled, prefill speeds only reached 14-45% of theoretical limits. This discrepancy likely stems from the inability to utilize Tensor Cores for acceleration. Disabling FMA significantly boosted prefill speeds for quantized models (Q8/Q6_K/Q4_K_M/Q2_K_M), with Q2_K_M achieving the highest improvement (231% of original rate). In contrast, f32/f16 models showed no performance gains.

Decoding Speed

After the prefill stage completes and the model generates the initial KV cache from the input prompt, the decoding phase begins. This stage aims to sequentially generate output tokens to form the final response or continuation.

The decoding process is autoregressive:

The model uses the hidden state of the last input token from the prefill stage to predict the first output token.
After generating the first output token, the model calculates its corresponding Q, K, V vectors.
The new token’s Q vector computes attention with all previous tokens’ K and V vectors (stored in the KV cache), including both the original prompt and previously generated tokens.
The new token’s K and V vectors are added to the KV cache.
The next output token is predicted based on the attention results.
Steps 2-5 repeat until the end-of-text token (e.g., <|endoftext|>) is generated or the maximum output length is reached.

The decoding stage heavily depends on the KV cache. Each new token requires reading the full, growing KV cache from memory for attention calculations. While each step processes only one new token (resulting in lower computational load per step compared to prefill), the process is serial (token N cannot be generated until token N-1 is completed). Additionally, repeatedly accessing the large, linearly growing KV cache (its size increases with generated tokens) makes this stage memory-bound (bandwidth-limited).

After controlling for KV cache size, generated token count, model scale, and decoding strategies, memory bandwidth becomes the primary performance bottleneck. Theoretical performance estimation is calculated as:

$$ u_d = \frac{u_o}{o_{bw}} \cdot d_{bw} $$

$u_d$ represents CMP 170HX’s theoretical performance, $u_o$ is A100′s measured performance, $o_{bw}$ is A100′s memory bandwidth (1555 GB/s), and $d_{bw}$ is CMP 170HX’s memory bandwidth (1493 GB/s).

Graph 4-2 nVidia CMP 170HX llama-bench decode Speed

Similar to the prefill stage, CMP 170HX’s decoding speed also underperforms theoretical expectations. However, since memory bandwidth is not a bottleneck, it achieves 39-78% of theoretical performance. Disabling FMA further boosts decoding speeds for quantized models (Q8/Q6_K/Q4_K_M/Q2_K_M), reaching 50-78% of theoretical performance.

Energy Efficiency Analysis

Based on the charts above, we observe that the prefiling speed of large language models significantly exceeds their decoding speed. According to the data presented, the decoding phase dominates the inference latency in large language models. To simplify calculations, we directly use decoding speed per watt as the unit for energy efficiency measurement.

Power consumption was recorded using the NVIDIA-SMI tool during the inference process.

Graph 4-3 nVidia CMP 170HX llama-bench decode Power Efficiency

In the charts, the energy efficiency of the CMP 170HX outperforms its theoretical efficiency (comparable to the A100) in half of the scenarios (F32, F16, Q8). Disabling FMA (Fused Multiply-Add) improves decoding speed for quantized large language models at Q6, Q4_K_M, and Q2_K_M levels on the CMP 170HX during inference. However, this also leads to a decline in energy efficiency.

Overall, the NVIDIA CMP 170HX demonstrates energy efficiency levels consistent with those of the GA100 core architecture when deployed for large language model inference.

Comprehensive Analysis and Discussion

The experimental conclusions presented above demonstrate the following:

a. The floating-point performance of the NVIDIA CMP 170HX can be leveraged through specific methods.
b. The memory bandwidth of the CMP 170HX grants it an advantage in large language model (LLM) inference.
c. Under full utilization of its performance, the energy efficiency of the CMP 170HX matches that of the A100 with the same GA100 core architecture.

Analysis of Floating-Point Performance

NVIDIA has imposed strict limitations on the floating-point performance of the CMP 170HX, primarily affecting FP32 and FP64 operations.

FP32 : Performance can be restored to approximately 6 TFLOPS by disabling Fused Multiply-Add (FMA) during CUDA compilation.
FP64 : No known method exists to recover its performance.
FP16 : Without Tensor Cores, FP16 appears unrestricted and can theoretically reach ~50 TFLOPS . However, frameworks like PyTorch fail to utilize this capability due to their inherent FP16 computation methods.

Evaluation of AI Performance

The CMP 170HX achieves energy efficiency comparable to the A100 only when the bottleneck of an application (e.g., LLM inference) is memory bandwidth . Disabling FMA improves decoding speed for quantized models (e.g., Q6, Q4_K_M) but reduces energy efficiency. Notably, disabling FMA enhances FP32 performance, suggesting that low-precision quantized models in Llama.cpp may rely heavily on FP32. The CMP 170HX may thus be suitable for highly quantized model inference.

Additionally, its integer performance remains uncrippled, enabling potential use for pure integer-based inference tasks.

Current Limitations

The widely used deep learning framework PyTorch cannot exploit the CMP 170HX’s FP16 performance . While disabling FMA improves FP32 performance, efforts to modify PyTorch (e.g., altering static FMA code) face significant challenges:

PyTorch’s massive codebase complicates manual adjustments to its CUDA-level FMA dependencies. Unlike tools like MixBench, PyTorch’s FP16 limitations may stem from deeper architectural constraints, requiring even more extensive modifications to its CUDA source code. No viable solution exists for restoring FP64 performance.

Theoretical Pathways for Full Performance Recovery

The following methods could theoretically unlock the CMP 170HX’s full computational potential, though each poses significant challenges:

a. Cracking NVIDIA drivers : Legally risky and technically demanding due to hardware-level restrictions.
b. Open-source driver approaches : Using NVIDIA’s open-source kernel drivers and Vulkan-based user-space drivers might bypass closed-source limitations. However, stability, functionality, and compatibility issues persist, and restrictions may be embedded in the GPU System Processor (GSP).
c. Custom CUDA programming : Developing standalone CUDA programs (instead of relying on PyTorch) could fully utilize the CMP 170HX’s 50 TFLOPS FP16 and 6 TFLOPS FP32 performance by avoiding FMA in code/compiler settings. This approach avoids framework limitations but requires manual optimization.

Conclusions and Recommendations

Key Conclusions

The nVidia CMP 170HX is an Ethereum-mining specialized GPU with significantly reduced floating-point computing performance but retained massive memory bandwidth. Manually writing CUDA code to disable FMA optimization can partially bypass its computational limitations (FP32). When running applications heavily reliant on memory bandwidth with memory requirements below 8GB, the CMP 170HX demonstrates energy efficiency comparable to the A100.

Application Recommendations

Thanks to its energy efficiency, the nVidia CMP 170HX is most suitable for use in regions lacking access to datacenter-level GPUs, such as community edge nodes that prioritize cost and service latency and security while not requiring advanced computational power. It can be employed for inference of small-scale large language models or highly memory bandwidth-intensive applications in cost-sensitive industrial simulations (e.g., FluidX3D).

Using the CMP 170HX requires knowledge of CUDA programming and GPU-related expertise. Currently, it is not feasible for gaming purposes.

Acknowledgments

As an independent researcher, this research process encountered numerous setbacks. I am grateful to the predecessors in the Chinese internet community for guiding me toward a viable path. I thank developers worldwide for their contributions to open-source tools. Special thanks to Yuki (祈洱) for compiling and installing the Linux driver for the CMP 170HX.

As a visually impaired individual, I found myself caught in the economic downturn and trade wars following the pandemic after graduation. I have struggled to find suitable employment. I am deeply grateful to my parents for their care during this anxious period and their support for my passions. Without their encouragement, I might have chosen a dark and uncertain path long ago.

Appendix

Ex.1. Sales Estimation Methodology

Since NVIDIA’s financial reports do not directly provide sales data for specific CMP card models, we need to estimate their sales volumes by analyzing total revenue figures and estimated average selling prices. The 2022 fiscal year marked the peak in CMP sales revenue, with total income reaching $550 million. We assume the majority of these sales occurred in the first three quarters of the 2022 fiscal year. Using the average selling price estimates from Table 2, we can make rough estimates of the sales volumes for each model. However, due to the lack of detailed sales structure data, we adopt a simplified approach based on total revenue and a weighted average price to calculate overall sales volume. More precise estimates would require granular quarterly revenue data and information about the sales mix proportions of different models across various periods. For instance, if the CMP 170HX contributed a larger portion of revenue during its initial high-price period, its actual sales volume might have been relatively lower compared to lower-priced models that likely sold in greater quantities.

Ex.2. Additional Tests

Ex.2.1. INT8 Performance

Graph EX.1 nVidia CMP 170HX Integer8 Benchmark Resault

Ex.2.2. PCIe Bandwidth Analysis

The CMP 170HX natively supports PCIe 1.1 x4 bandwidth. By populating additional coupling capacitors on the PCB, the interface can theoretically operate at PCIe 1.1 x16, though this modification was not validated in this study.

Graph EX.2 nVidia CMP 170HX PCI-E Bandwidth Benchmark Resault

References

Wuebbling, M. (2021, February 18). GeForce is made for gaming, CMP is made to mine . NVIDIA Blog. https://blogs.nvidia.com/blog/geforce-cmp/ ↩︎
NVIDIA Corporation. (n.d.). A Crypto Mining GPU for Professionals. NVIDIA. https://www.nvidia.com/en-us/cmp/ ↩︎
TechPowerUp. (n.d.). NVIDIA P106-100 specs. TechPowerUp. https://www.techpowerup.com/gpu-specs/p106-100.c2980 ↩︎
NVIDIA Corporation. (2022). 2022 Annual Report. NVIDIA. https://www.nvidia.com/en-us/ir/annual-reports/ ↩︎
U.S. Securities and Exchange Commission (SEC). (2022, May 6). SEC charges NVIDIA Corporation with inadequate disclosures about impact of cryptomining [Press Release No. 2022-79]. SEC. https://www.sec.gov/newsroom/press-releases/2022-79 ↩︎
Wilson, J. R. (2021, October 25). NVIDIA’s flagship mining GPU, the CMP 170HX, being sold for over $4,400 US. WccTech. https://wccftech.com/nvidia-cmp-170hx-flagship-mining-gpu-listed-on-sale-for-4500-us/ ↩︎
王欣, 史钦锋, 程杰. (2020). 深入理解以太坊. (李艺, 编). 北京: 机械工业出版社. ↩︎ ↩︎
Shilov, A. (2018, November 1). U.S. Government Indicts Chinese DRAM Maker JHICC on Industrial Espionage; Bans Exports To Firm. AnandTech. Retrieved from https://www.anandtech.com/show/13537/us-doc-sides-with-micron-bans-exports-to-chinese-dram-maker ↩︎
NVIDIA Corporation. (2022, August 26). Form 8-K. U.S. Securities and Exchange Commission. Retrieved from https://www.sec.gov/Archives/edgar/data/1045810/000104581022000146/nvda-20220826.htm ↩︎
NVIDIA Corporation. (2023, October 17). Form 8-K: Current report (Filing No. 0001045810-23-000217). U.S. Securities and Exchange Commission. Retrieved from https://www.sec.gov/Archives/edgar/data/1045810/000104581023000217/nvda-20231017.htm ↩︎
NVIDIA Corporation. (2023, October 24). Form 8-K: Current report (Filing No. 0001045810-23-000221). U.S. Securities and Exchange Commission. Retrieved from https://www.sec.gov/Archives/edgar/data/1045810/000104581023000221/nvda-20231023.htm ↩︎
NVIDIA Corporation. (2025, April 9). Form 8-K: Current Report. U.S. Securities and Exchange Commission. Retrieved from https://www.sec.gov/Archives/edgar/data/1045810/000104581025000082/nvda-20250409.htm ↩︎
Advanced Micro Devices, Inc. (2025, April 16). Form 8-K. U.S. Securities and Exchange Commission. Retrieved from https://www.sec.gov/Archives/edgar/data/2488/000000248825000039/amd-20250415.htm ↩︎
Konstantinidis, E., & Cotronis, Y. (2017). A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling. Journal of Parallel and Distributed Computing, 107, 37–56. https://doi.org/10.1016/j.jpdc.2017.04.002 ↩︎
ProjectPhysX. (n.d.). README.md in OpenCL-Benchmark. GitHub. Retrieved from https://github.com/ProjectPhysX/OpenCL-Benchmark ↩︎
geohot. (n.d.). README.md [README file]. In gpu-burn. GitHub. https://www.google.com/search?q=https://github.com/geohot/gpu-burn/blob/master/README.md ↩︎
PyTorch. (2023). Install PyTorch. PyTorch. https://pytorch.org/get-started/locally/ ↩︎
ggml-org. (n.d.). llama.cpp: README.md [README file]. In llama.cpp. GitHub. Retrieved from https://github.com/ggml-org/llama.cpp/blob/master/README.md ↩︎
ggml-org. (n.d.). llama-bench: README.md [README file]. In llama.cpp. GitHub. Retrieved from https://github.com/ggml-org/llama.cpp/blob/master/examples/llama-bench/README.md ↩︎ ↩︎
TechPowerUp. (n.d.). NVIDIA CMP 170HX 8 GB Specs. TechPowerUp GPU Database. Retrieved from https://www.techpowerup.com/gpu-specs/cmp-170hx-8-gb.c3830 ↩︎
NVIDIA. (2021, June). NVIDIA A100 Tensor Core GPU [Data sheet]. Retrieved from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf ↩︎
NVIDIA. (2020, September). NVIDIA A100 40GB PCIe GPU Accelerator [Product brief]. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/A100-PCIE-Prduct-Brief.pdf ↩︎
PCI-SIG. (2017). PCI Express Base Specification Revision 4.0 Version 1.0. ↩︎
PCI-SIG. (2010). PCI Express Base Specification Revision 3.0. ↩︎
NVIDIA. (2008, April 14). NVIDIA Tesla C870 GPU Computing Processor Board [Product specification]. https://www.nvidia.com/docs/IO/43395/C870-BoardSpec_BD-03399-001_v04.pdf ↩︎
niconiconi. (2023, October 25). All GB/s without FLOPS - Nvidia CMP 170HX Review, Performance Lockdown Workaround, Teardown, Watercooling, and Repair. niconiconi’s blog. https://niconiconi.neocities.org/tech-notes/nvidia-cmp-170hx-review/ ↩︎
NVIDIA. (2017, August). Tesla-P6 product brief [Product brief]. https://images.nvidia.cn/content/pdf/tesla/Tesla-P6-Product-Brief.pdf ↩︎
TechPowerUp. (n.d.). NVIDIA GeForce RTX 4080 Specs. https://www.techpowerup.com/gpu-specs/geforce-rtx-4080.c3888 ↩︎
Qwen team. (Year, Month Day). Qwen 2.5 1.5B Instruct [Model card]. Hugging Face. https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct ↩︎