June 2024 Archives

Investigating SSMEC’s (State Micro) 486s with the UCA

Released in September 1989 by Intel, the legendary 486 CPU enjoyed widespread popularity in numerous PCs for many years before being gradually replaced by the Pentium and its successors. This era profoundly influenced the entire CPU industry for decades. Up until then, only Intel designed x86 microarchitectures, allowing third parties like AMD to produce Intel’s intellectual property in their own fabs. However, in the first half of the 1990s, new CPU manufacturers emerged with their own 486-compatible CPUs, designed through clean-room reverse engineering. As a result, the decade was marked by numerous lawsuits between Intel and its new competitors over patent infringements related to the x86 architecture.

By the time Intel discontinued the 486 in 2007, the definitive list of pin-compatible 486 CPU manufacturers was as follows:

- Intel – The original developer of the 486, Intel released the 486, followed by the 486DX and 486SX (with a disabled FPU), then the clock-doubled DX2 and SX2, and finally the clock-tripled Intel DX4, reaching speeds of up to 100 MHz.
- AMD – Biggest second source. Produced both 486-clone based on Intel’s IP and in-house tuned architecture like the AMD X5 / 5×86 up to 160 MHz.
- Cyrix – Short-lived but famous company that only produced CPUs based solely on their own original designs, such as the Cx486 and the Cyrix 5×86.
- ST Microelectronics – Only rebranded Cyrix CPUs
- Texas Instrument – Mostly rebranded Cyrix CPUs and a custom “barely compatible” 486 core (486SXL2)
- IBM – Mostly rebranded Cyrix CPUs, but also some Intel second source and even a custom 486 core internally used on IBM PCs (not on PGA)
- UMC – A Taiwanese company that produced some rare 486-compatible CPUs known as “Green CPU” using their in-house low-power microarchitecture.

Although it is extremely rare to discover an unknown manufacturer of a well-known CPU like the 486, this is precisely what happened a few years ago when pictures of a peculiar and unseen 486 marked “SM486” surfaced online. Recently, I managed to acquire a couple of these elusive CPUs (a DX33 and a DX2-66). The Universal Chip Analyzer is the perfect tool for an in-depth study of these rarities. Are they merely clones of an already known 486 architecture? Are they based on a brand-new design? Where do they come from?

The story behind State Microelectronics

The first step is to identify the company behind the laser-printed logo on the CPUs. Given that they originated from China, a quick search on the Chinese internet revealed another picture of the logo with the acronym “SSMEC,” which stands for “Shenzhen State Microelectronics Co. Ltd.” This company was initially established in 1993 under the name “Shenzhen State Micro Science and Technology Co. Ltd.” It was the first IC design company to be part of China’s “909 Project,” a national initiative aimed at developing China’s semiconductor chip industry. The goal was to establish China as a competitive player in the global semiconductor industry, reduce reliance on foreign technologies, foster in-house innovation, and acquire Chinese-controlled intellectual property.

Shenzhen State Microelectronics old and new logos

After numerous reorganizations over the years, SSMEC is now part of “Guoxin Microelectronics Co. Ltd.” a subsidiary of the state-owned giant Tsinghua Unigroup. Notably, Tsinghua Unigroup also owns “Yangtze Memory Technologies Corp” (better known as YMTC), the first Chinese-owned company to design and mass-produce the critical 3D NAND Flash used in smartphones and SSDs. I couldn’t find any reference to an x86 CPU developed by SSMEC on their (very undetailed) website. It’s difficult to determine exactly what this company is currently working on. The closest reference to a potential CPU is a 2011 award given by the Shenzhen Municipal Government for a “32-bit High-performance Integrated Communication Microprocessor.”

SM486DX33 Analysis

Physically, the SM486DX33 comes in a 168-pin PGA ceramic package. The printing is rotated 90° counterclockwise, but pin 1 is marked with a small square to ensure correct orientation. There are also two numbers laser-marked on top: “035” and “1650.” The latter appears to be a date code (Year 2016, Week 50), but 2016 seems quite late for a 486 CPU. Additionally, it appears that State Microelectronics changed its logo well before 2016. A 486-class CPU should have been produced in the 1990s, unless this chip was intended for a very specialized industry, such as aerospace or military. The back of the CPU has no markings, and the gold lid is slightly different from all other known Intel 486DX CPUs.

For further investigation, let’s insert the SM486DX33 into the Universal Chip Analyzer. The first step was to determine the correct voltage. At 3.45V, the SM486DX33 only booted up to 20 MHz, so 5V was clearly the correct voltage to achieve 33 MHz operation. Both benchmarks ran successfully, and the UCA concluded the test with a PASS status. Here is how it is detected:

As shown in the screenshots from the UCA, the CPU is detected as an Intel 486DX with a reset signature of 0x404. The CPUID instruction is not supported, and JTAG is not available. The 0x404 identifier matches that of an Intel 486DX with the D0 stepping (either SX419 or SX729). It is very common for other manufacturers to copy Intel’s reset signature to avoid issues with software detection. However, a look at the benchmark results leaves no room for doubt: with an INT Benchmark score of 126.5 and an FP Benchmark score of 69.4, the SM486DX33 delivers exactly the same results as an Intel 486DX-33. All the INT/FP instructions have the exact same latency and throughput, indicating that the microarchitecture of this CPU is a perfect clone of Intel’s 486.

Now, we need to determine whether this SSMEC CPU is simply an Intel-produced 486 die assembled into a custom ceramic package, or if it is a clone built using a custom foundry process (possibly using Intel’s wafer masks). Again, the Universal Chip Analyzer can assist with its power-consumption profiling features. I compared the SM486DX33 with four different Intel 486DX-33 CPUs based on the D0 stepping, as well as with an earlier Intel 486DX-33 C0-Stepping and a later one built on the aB0 Stepping.

The results are quite interesting. All early Intel 486DX-33 CPUs up to the D0-Stepping are based on Intel’s P648 process (also known as CHMOS IV) with a 1 µm gate length. Later 486 models, such as the SL-enhanced SX810, use the newer P650 process (CHMOS V – 0.8 µm). At 33 MHz, the power consumption of an Intel 486DX built on a 1 µm node is 3.00 Watts +/- 5% (2.85-3.15W). Therefore, a 486 CPU returning the 0x404 (D0) signature should fall within that range. However, the SM486DX33 has a power consumption of 2.37W, which does not align with a 1 µm process, despite its signature. This suggests that the chip is built on a 0.8 µm process like the SX810, which has a 0x415 reset signature and supports CPUID and JTAG.

SM486DX266 Analysis

Now let’s examine the clock-doubled 486DX2 at 66 MHz from SSMEC. The top markings are similar to those on the SM486DX33, with two numbers: “173” and “1425.” As before, this appears to be a date code (Year 2014, Week 25) but could be misleading for the reasons previously mentioned. There are still no markings on the bottom, but the lid is much smaller, resembling the one found on later Intel 486DX4 models.

The SM486DX266 also operates only at 5V to achieve its rated speed. The UCA was able to test it perfectly. Here is how it is detected:

This time, the CPU from State Microelectronics is detected as an Intel 486DX2-66 (aB0-Stepping) with the 0x435 reset identifier, matching Intel’s specification. More interestingly, the CPUID instruction is now supported, confirming the 0x435 identifier. The CPUID Vendor String is “GenuineIntel,” just like on Intel’s 486 DX2. JTAG is also supported and returns Intel’s Manufacturer ID and the same Product ID code as the original DX2s. With an INT Score of 218.0 and an FP Score of 135.4, the SM486DX266 achieves the exact same results as an Intel 486DX2-66, indicating they share the same microarchitecture. Now it’s time to compare the power profile of the State Micro 486DX2 with Intel’s DX2.

The results are surprising once again. The SM486DX266 requires half the power of any Intel 486DX2 based on the aB0-Stepping (SX807 and SX911). Intel’s aB0-Stepping (ID 0x435) is built on the P650 process (CHMOS V – 0.8 µm). However, the power consumption of the SSMEC CPU suggests it matches a more advanced process, such as the Intel P652 (0.6 µm) process used in later 486DX4 models. At 75 MHz, an Intel 486DX4 (0.6 µm) requires about 2.3W. When clocked down to 66 MHz, this almost perfectly matches the 1.95W of the SM486DX266. No genuine Intel DX2 CPUs have ever used a process more advanced than P648 (0.8 µm), making this finding quite interesting.

Conclusion

From a microarchitectural perspective, State Micro’s SM486s are clearly an exact replica of Intel’s 486 CPUs. The latency and throughput of instructions are identical between both CPUs. Even the JTAG identifier, which is the only way to distinguish between an Intel 486 and a third-party CPU like AMD using Intel’s masks, points to Intel. It remains unclear how State Micro obtained Intel’s 486 IPs and whether they had the legal rights to use them. Intel’s licensing of x86 products, especially during the 486 era, was extremely restricted. Only AMD and IBM had the legal right to produce 486s based on Intel’s IP, and that was granted after a long legal battle for AMD. It is highly unlikely that Intel would have granted such deep cloning rights to a state-owned company in China. Even if that were the case, they would likely have changed at least the JTAG identifier.

From a process standpoint, these SM486 CPUs reveal unexpected secrets. Both appear to be a node ahead of Intel’s genuine 486s (die-shrink), suggesting they likely did not come from an Intel fab. While the Intel 486DX-33 with D0-Stepping were built on a 1 µm process, the clone from State Micro seems to use a 0.8 µm (made-in-China) process. The same applies to the SM486DX2, which seems to be based on a 0.6 µm process, whereas Intel DX2s with aB0-Stepping were only based on the 0.8 µm process.

It’s possible that these CPUs were designed as test ICs to help establish a new Chinese foundry, which might explain their rarity and why they only surfaced in 2024. Another intriguing possibility is that they were produced to ensure long-term support for critical systems designed in the 90s, such as military applications, energy infrastructures (like nuclear or oil), or heavy civil technologies. For instance, many trains in France still use AMD 486DE2 processors as on-board computers.

CPU collectors have observed a high demand for aftermarket microprocessors from the 80s and 90s from Chinese buyers since 2010. The most sought-after CPUs include the 80C186 and the 486DX33 and DX2-66 models. CPUShack, one of the largest resellers in the USA, who shipped over 1000 of these 486s to China, noted that the SX419 & SX729 (for DX33s) and SX807 & SX911 (for DX2-66s) were by far the most popular among Chinese buyers. These specific models correspond exactly to the D0 and aB0 steppings that the SM486 CPUs replicate. The exact applications of these 486s in China remain a mystery, but it’s clear that there is a significant and ongoing demand for them.

If you have any additional information about these chips, please contact me or leave a comment below – I’d love to hear from you!

Designing a benchmark for the UCA – Part #1

A benchmark feature was planned very early in the development process of the Universal Chip Analyzer. Given the vast differences in microarchitecture between a 486 and an 8080, two benchmarks are necessary: one to compare older 8-bit CPUs from various manufacturers with incompatible instruction sets (Motorola 6800, Intel 8080, MOS 6502, etc.), and another for 16- and 32-bit CPUs based on Intel’s x86 ISA. Let’s begin with building an Integer/FP benchmark for “modern” CPUs like the 386 or 486. (I’ll cover how to evaluate older CPUs’ performance in a later post.)

So, what is a CPU benchmark? Essentially, it’s a score derived from the ratio between a piece of code designed to emulate real-world programs (using similar sets of instructions) and the time required to execute that code. While there is no debate about how to measure time, there are endless discussions about the best instructions to use for an effective benchmark. By profiling various common programs, benchmark writers determine how instructions are statistically used and then create synthetic code that mimics a similar instruction distribution.

Let’s see if and how this approach can apply to the UCA.

INT Performance Benchmark

In the 80s and 90s, the industry-standard for measuring integer performance was the “Dhrystone” benchmark, originally published in 1984 in Ada by Reinhold P. Weicker and later ported to C by Rick Richardson. Dhrystone was designed to evaluate overall system performance with a focus on integer operations, as floating-point instructions were rare at that time. The real-world programs used by Weicker to define the instruction distribution for Dhrystone were written in Fortran, Pascal, and long-obsolete languages like ALGOL. A complete description of the instruction statistics behind the C version of Dhrystone can be found in the dhry.h header file.

To implement a similar code in the Universal Chip Analyzer, I first needed to understand exactly what Dhrystone does. I began by compiling the original C code using GCC 4.9, targeting the “i386” architecture with the “-O1” optimization flag to avoid extreme optimization. Then, I used Intel’s Pin Architecture Analysis tool to log every instruction executed by the Dhrystone binary and sorted them. Finally, I grouped the instructions by family.

Key findings include:

- - MOV Instructions: About 30% of executed instructions are MOVs. Of these, 68.8% are used to read memory, 22.7% to write to memory (the usual 2:1 ratio), and 8.5% involve registers only.
  - Integer Operations: Integer ADDs account for 13.7% of total instructions executed, while Integer SUBs account for 3.1%.
  - Bit Operations: Bit operations (Boolean manipulation, shift, rotate, etc.) account for approximately 11% of total instructions, and bit-based comparisons account for 5.5%.
  - String Operations: About 20% of MOVs instructions are related to string operation (like movsd)
  - LEA Instructions: LEA instructions, a compiler trick to optimize basic arithmetic operations using a memory computation-related instruction, remain below 5%.
  - Program Flow Control: JUMPs are mainly conditional, with 70% being jnz (Jump if not zero). Stack operations (push/pop) and function control (call/ret) account for 16.3% of the total instructions.

As an arithmetic benchmark, Dhrystone also performs some multiplication (0.2% of the total) and division (also 0.2%). While this 0.4% may seem insignificant compared to the 16.7% for addition and subtraction, it’s important to understand that ADD and SUB instructions require only 2 cycles on an i486, while IDIV and IMULT instructions can require up to 43 cycles, making them 20 times slower. Consequently, these 0.4% of div/mult operations take as much time as 50% of all add/sub operations. This must be considered to avoid an issue where a single time-consuming instruction skews the final score.

Another crucial point is related to the memory access subsystem (including the cache, when available). A benchmark like Dhrystone doesn’t solely evaluate CPU performance, but rather how the CPU and memory perform together. To what extent? The instruction statistics show that approximately 30% of executed instructions reference a memory location, resulting in a read/write operation. With the slow memory used in the 1980s and 1990s, the final score could be directly linked to the performance of the memory (or the memory controller, or the internal cache). Should this be simulated by the UCA, given that the memory simulated by the UCA is extremely fast with zero wait state? I don’t think so. The goal here is to design a pure CPU benchmark, as independent of memory subsystem speed as possible. Nonetheless, some memory operations are still necessary to consider the latency and bandwidth of very common instructions referencing memory like MOVs.

With these results in mind, and taking into consideration the number of cycles required for instructions on CPUs ranging from the 8086 to the 80486, here is the instruction dispatch I selected for the Integer benchmark of the Universal Chip Analyzer:

- - 25% MOVs: 10% Direct (Reg/Reg), 10% Read (Reg/Mem), 5% Write (Mem/Reg)
  - 16% ADDs + 8% SUB: Basic arithmetic operation.
  - 5% MULT + 0.25% DIV: Same execution time than the 24% ADD/SUB
  - 12% Boolean operation: AND, OR, XOR, INC, DEC, …
  - 6% Rotation/Shift: ROR, ROL, SHL, SHR, …
  - 10% conditional JUMPs: JNZ, JZ, JNE, JE, …
  - ~20% for Flow control and stack management: PUSH, POP, CALL, RET, …

Of course, this code can be changed easily at any time to fit specific benchmarking needs.

FP Performance Benchmark

When considering floating-point benchmarks, the two clear choices were Whetstone and Linpack. I profiled both using various tools. Let’s begin with Linpack to understand why I ultimately preferred Whetstone.

As shown in the instruction statistics charts, the authors of Linpack recognized early on that the FMA (Floating-point Multiply-Add) would become the cornerstone of intensive compute activities for decades to come. Consequently, they fine-tuned Linpack to focus almost exclusively on FMA operations.

Linpack exhibits very few memory dependencies (less than 3%) and even fewer control flow and stack instructions (less than 2%), which could be great for the UCA. However, the floating-point instruction dispatch reveals that only four FP instructions are predominantly used: FADD, FMULT, and FLD/FSTP for loading static values and storing results in registers. Linpack essentially measures the FMA performance of an FP execution unit, which is insufficient for properly evaluating an entire FPU.

Now let’s profile Whetstone the same way, stating with all-instructions statistics:

Whetstone demonstrates a more balanced use of non-FP instructions, though this comes at the expense of a lower volume of FP instructions overall. Memory dependencies, stack usage, and function control are significantly higher compared to Linpack. Next, let’s examine the distribution of FP instructions:

Whetstone utilizes the complete set of instructions available on early x87 FPUs, maintaining a balance between very fast instructions like FADD and much slower functions like FSQRT (square root), as well as even more time-consuming logarithmic, exponential, or trigonometric instructions like FPATAN, FSIN, or FCOS (which can take hundreds of cycles on any 386 or 486!). This comprehensive range is exactly what we need for a thorough evaluation of FPU performance.

To summarize, for the Universal Chip Analyzer FP benchmark, we need the non-FP instruction dispatch statistics of Linpack (with very few memory dependencies and minimal stack/program control flow instructions) combined with the variety of FP operations found in Whetstone (not just FMA but the full suite of FP operations, from square roots to trigonometric functions). Of course, we must balance these instructions to ensure no single operation disproportionately affects the overall performance.

Here is my proposed dispatch that will be implemented as “V1” of the UCA FP Benchmark:

FMAs account for 45% of the total execution time, FDIV/FSQRT for 30% and log/exp/trigonometric functions for 17%. Memory dependencies are reduced to the bare minimum, as well as others program control flow instructions including stack operations.

Now it’s time to implement this!