## Computer Architecture / Security / Quantum / Where to Go Next

Yipeng Huang Rutgers University April 25, 2024

- Instruction level parallelism
- Limitations of instruction level parallelism
- Data level parallelism
- Limitations of data level parallelism
- Thread level parallelism
- Limitations of thread level parallelism
- Accelerators
- Quantum

## Synchronization. Clock signals.



(A AND NOT A) should always evaluate to FALSE, need a clock signal to denote when values are valid, and to ignore transients.

# The journey of an assembly instruction: pipeline



| IF | ID | ΕX | MEM | WB  |     |     |     |    |
|----|----|----|-----|-----|-----|-----|-----|----|
| i  | IF | ID | EX  | MEM | WB  |     |     |    |
| t  |    | IF | ID  | ΕX  | MEM | WB  |     |    |
|    |    |    | IF  | ID  | ΕX  | MEM | WB  |    |
|    |    |    |     | IF  | ID  | ΕX  | MEM | WB |

An assembly instruction needs multiple stages of combinational and sequential logic to execute.

At any given moment, several assembly instructions are in-flight.

Image sources: Wikimedia.

# The journey of an assembly instruction: pipeline



|   | IF  | ID | ΕX | MEM | WB  |     |     |     |    |
|---|-----|----|----|-----|-----|-----|-----|-----|----|
| Ļ | i   | IF | ID | EX  | MEM | WB  |     |     |    |
| _ | t → |    | IF | ID  | ΕX  | MEM | WB  |     |    |
|   |     |    |    | IF  | ID  | ΕX  | MEM | WB  |    |
|   |     |    |    |     | IF  | ID  | ΕX  | MEM | WB |

Where does the cache logic in setAssociative.v live?

Where does the logic in binSub.v live?

What is the length of the combinational logic of binSub.v?

What determines the clock speed?

## Microarchitectural support for Instruction Set Architectures



Figure 2.33 Floorplan of the Alpha 21264 [Kessler 1999].

Hennessy & Patterson. Computer Architecture: A Quantitative Approach

## Instruction Level Parallelism: modern trickery



**Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components.** The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.

- Superscalar instruction fetch
- Out-of-order instruction retirement
- Speculative execution

Hennessy & Patterson. Computer Architecture: A Quantitative Approach

- Instruction level parallelism
- Limitations of instruction level parallelism
- Data level parallelism
- Limitations of data level parallelism
- Thread level parallelism
- Limitations of thread level parallelism
- Accelerators
- Quantum

Limitations of microarchitecture trickery: security vulnerabilities

- See Prof. Mark Hill (U. Wisconsin) slides on Meltdown and Spectre
- Lipp, Moritz, et al. "Meltdown: Reading kernel memory from user space." 27th {USENIX} Security Symposium ({USENIX} Security 18). 2018.

- Instruction level parallelism
- Limitations of instruction level parallelism
- Data level parallelism
- Limitations of data level parallelism
- Thread level parallelism
- Limitations of thread level parallelism
- Accelerators
- Quantum

### Data Level Parallelism



**Figure 4.20** Block diagram of the multithreaded SIMD Processor of a Fermi GPU. Each SIMD Lane has a pipelined floating-point unit, a pipelined integer unit, some logic for dispatching instructions and operands to these units, and a queue for holding results. The four Special Function units (SFUs) calculate functions such as square roots, reciprocals, sines, and cosines.

Hennessy & Patterson. Computer Architecture: A Quantitative Approach

## Data Level Parallelism

| ZMM0  | Y            | MMO [ | XMM0  | ZMM1              | Y     | MM1        | XMM1  | ST(0)  | MM0   | ST(1)     | MM1     |    |           | AX RAX        | R8B R8W R8D    | R8            | R12D R12   | MSWC    | R0 | CR4  | l   |
|-------|--------------|-------|-------|-------------------|-------|------------|-------|--------|-------|-----------|---------|----|-----------|---------------|----------------|---------------|------------|---------|----|------|-----|
| ZMM2  | Y            | MM2   | XMM2  | ZMM3              | Y     | MM3        | XMM3  | ST(2)  | MM2   | ST(3)     | MM3     |    | вн ВХ ЕВ  | <b>SX RBX</b> | R9B R9W R9D    | R9            | R13D R13   | CR1     |    | CR5  |     |
| ZMM4  | Y            | MM4   | XMM4  | ZMM5              | YI    | MM5 [      | XMM5  | ST(4)  | MM4   | ST(5)     | MM5     |    | сценСХЕС  | XRCX          | R10B R10W R10D | R10           | N R14D R14 | CR2     | 2  | CR6  |     |
| ZMM6  | Y            | MM6   | XMM6  | ZMM7              | YI    | MM7 [      | XMM7  | ST(6)  | MM6   | ST(7)     | MM7     |    |           |               | R11B R11W R11D | R11 R15B R15V | R15D R15   | CR3     | 3  | CR7  |     |
| ZMM8  | Y            | MM8 [ | XMM8  | ZMM9              | Y     | <b>MM9</b> | XMM9  |        |       |           |         | В  | PLBPEBF   | RBP           |                |               | EIP RIP    | MXCS    | SR | CR8  |     |
| ZMM10 | ) <u>Y</u> I | MM10  | XMM10 | ZMM1 <sup>2</sup> | I YI  | MM11 [     | XMM11 | CW     | FP_IP | FP_DP     | FP_C    | S  | SIL SI ES | I RSI         |                | SP            |            |         |    | CR9  |     |
| ZMM12 | 2 YI         | MM12  | XMM12 | ZMM1:             | 3 YI  | MM13 [     | XMM13 | SW     |       |           |         |    |           |               |                |               |            |         |    | CR10 |     |
| ZMM14 | Y            | MM14  | XMM14 | ZMM1              | 5 YI  | MM15       | XMM15 | TW     |       | 8-bit reg | gister  |    | 32-bit r  | egister       | 80-bit r       | egister       | 256-bit re | egister |    | CR11 |     |
| ZMM16 | ZMM17        | ZMM18 | ZMM19 | ZMM20             | ZMM21 | ZMM22      | ZMM23 | FP_DS  |       |           | egister |    | 64-DIT I  | egister       |                | register      | 512-DIt re | egister |    | CR12 |     |
| ZMM24 | ZMM25        | ZMM26 | ZMM27 | ZMM28             | ZMM29 | ZMM30      | ZMM31 | FP_OPC | FP_DF | FP_IP     | (       | CS | SS        | DS            | GDTR           | IDTR          | DR0        | DR6     |    | CR13 |     |
|       |              |       |       |                   |       |            |       |        |       |           |         | ES | FS        | GS            | TR             | LDTR          | DR1        | DR7     |    | CR14 |     |
|       |              |       |       |                   |       |            |       |        |       |           |         |    |           |               |                |               | DR2        | DR8     |    | CR15 |     |
|       |              |       |       |                   |       |            |       |        |       |           |         |    |           |               |                |               | DR3        | DR9     |    |      |     |
|       |              |       |       |                   |       |            |       |        |       |           |         |    |           |               |                |               | DR4        | DR10    | DR | 12 D | R14 |
|       |              |       |       |                   |       |            |       |        |       |           |         |    |           |               |                |               | DR5        | DR11    | DR | 13 D | R15 |

- Instruction level parallelism
- Limitations of instruction level parallelism
- Data level parallelism
- Limitations of data level parallelism
- Thread level parallelism
- Limitations of thread level parallelism
- Accelerators
- Quantum

## Dennard Scaling, Moore's Scaling, Power Wall

For a few decades, shrinking transistors drove clock speeds higher.

Dennard Scaling: Shrinking transistors also use less power.

Win-win.

Circa 2005, Dennard Scaling hit physical limitations.



**FIGURE 1.15** Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years. The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler pipeline with lower clock rates and multiple processors per chip. Copyright © 2009 Elsevier, Inc. All rights reserved.

Patterson & Hennessy. Computer Organization and Design: The Hardware/Software Interface.

## Drivers of CPU performance: scaling and architecture



Hennessy & Patterson. Computer Architecture: A Quantitative Approach

## With end to Dennard's scaling, the continuation of Moore's law provided parallelism instead



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2015 by K. Rupp

- Instruction level parallelism
- Limitations of instruction level parallelism
- Data level parallelism
- Limitations of data level parallelism
- Thread level parallelism
- Limitations of thread level parallelism
- <u>Accelerators</u>
- Quantum

|                                                                                                                                               | 1940s                                                              | 1950s                                                     | 1960s                                                    | 1970s                                                    | 1980s                 | 1990s                      | 2000s                                                    | 2010s                                                |  |
|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|-----------------------------------------------------------|----------------------------------------------------------|----------------------------------------------------------|-----------------------|----------------------------|----------------------------------------------------------|------------------------------------------------------|--|
| Analog<br>continuous-<br>time<br>computing                                                                                                    | Analog<br>computers for<br>rocket and<br>artillery<br>controllers. | Analog<br>computers for<br>field problems.                | 1 <sup>st</sup><br>transistorized<br>analog<br>computer. | Analog-digital<br>hybrid<br>computers.                   |                       |                            |                                                          |                                                      |  |
| Digital<br>discrete-<br>time<br>computing                                                                                                     | Turing's<br>Bomba.                                                 | 1 <sup>st</sup><br>transistorized<br>digital<br>computer. | Moore's law<br>projection for<br>transistor<br>scaling.  | Dennard's<br>scaling for<br>transistor<br>power density. | VLSI<br>democratized. | FPGAs<br>introduced.       | End of<br>Dennard's<br>scaling.<br>CPUs go<br>multicore. | Cloud FPGAs:<br>Microsoft<br>Catapult.<br>Amazon F1. |  |
|                                                                                                                                               |                                                                    |                                                           |                                                          |                                                          |                       | Heterogenous architectures |                                                          |                                                      |  |
|                                                                                                                                               | Stored program computer.                                           | Microprogram<br>ming.                                     | Instruction set architecture.                            | Reduced<br>instruction set<br>computers.                 |                       | GPUs<br>introduced.        | Nvidia<br>introduces<br>CUDA.                            | ASICs: Google<br>TPUs. DE Shaw<br>Research<br>Anton. |  |
| Transistor scaling and architectural abstractions<br>drive digital revolution, make analog alternatives irrelevant heterogenous architectures |                                                                    |                                                           |                                                          |                                                          |                       |                            |                                                          |                                                      |  |

### Accelerators



- Instruction level parallelism
- Limitations of instruction level parallelism
- Data level parallelism
- Limitations of data level parallelism
- Thread level parallelism
- Limitations of thread level parallelism
- Accelerators
- <u>Quantum</u>





FIGURE 7.2 The number of qubits in superconductor (SC) and trapped ion (TI) quantum computers versus year; note the logarithmic scaling of the vertical axis. Data for trapped ions are shown as squares and for superconducting machines are shown as circles. Approximate average reported two-qubit gate error rates are indicated by color; points with the same color have similar error rates. The dashed gray lines show how the number of qubits would grow if they double every two years starting with one qubit in 2000 and 2009, respectively; the dashed black line indicates a doubling every year beginning with one qubit in 2014. Recent superconductor growth has been close to doubling every year. If this rate continued, 50 qubit machines with less than 5 percent error rates would be reported in 2019. SOURCE: Plotted data obtained from multiple sources [9].

Quantum Computing Progress and Prospects. National Academies Press.





Intel



FIGURE 7.2 The number of qubits in superconductor (SC) and trapped ion (TI) quantum computers versus year; note the logarithmic scaling of the vertical axis. Data for trapped ions are shown as squares and for superconducting machines are shown as circles. Approximate average reported two-qubit gate error rates are indicated by color; points with the same color have similar error rates. The dashed gray lines show how the number of qubits would grow if they double every two years starting with one qubit in 2000 and 2009, respectively; the dashed black line indicates a doubling every year beginning with one qubit in 2014. Recent superconductor growth has been close to doubling every year. If this rate continued, 50 qubit machines with less than 5 percent error rates would be reported in 2019. SOURCE: Plotted data obtained from multiple sources [9].

Quantum Computing Progress and Prospects. National Academies Press.













#### Moore's Law – The number of transistors on integrated circuit chips (1971-2018)

Moore's law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important as other aspects of technological progress – such as processing speed or the price of electronic products – are linked to Moore's law.



Data source: Wikipedia (https://en.wikipedia.org/wiki/Transistor\_count)

The data visualization is available at OurWorldinData.org. There you find more visualizations and research on this topic.

Licensed under CC-BY-SA by the author Max Roser.



Our World in Data How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits. Gidney and Ekerå. 2019.



**Fig. 2.** Performance space of quantum computers, measured by the error probability of each entangling gate in the horizontal axis (roughly inversely proportional to the total number of gates that can be executed on a NISQ machine), and the number of qubits in the system in the vertical axis. Blue dotted line approximately demarcates quantum systems that can be simulated using best classical computers, while the green colored region shows where the existing quantum computing systems with verified performance numbers lie (as of September 2018). Purple shaded region indicates computational tasks that accomplish the so-called "quantum supremacy," where the computation carried out by the quantum computer defies classical simulation regardless of its usefulness. The different shapes illustrate resource counts for solving various problems, with solid symbols corresponding to the exact entangling gate counts and number of qubits in NISQ machines, and shaded regions showing approximate gate error requirements and number of qubits for an FT implementation (not pictured are the regions where the error gets too close to the known fault-tolerance thresholds): cyan diamond and shaded region correspond to factoring a 1024-bit number using Shor's algorithm [14], magenta circle and shaded region represent simulation of a 72-spin Heisenberg model [20], and orange shaded region illustrates NF simulation [21].

#### An Outlook for Quantum Computing. Maslov et al.

## A guide to the rest of the CS:APP textbook

| CS:APP Chapter                    | Rutgers CS / ECE course in AY24-25 for further study                  |
|-----------------------------------|-----------------------------------------------------------------------|
| 4. Processor Architecture         | ECE 563 Computer Architecture I                                       |
| 5. Optimizing Program Performance | CS 214 Systems Programming / CS 415 Compilers                         |
| 7. Linking                        | CS 415 Compilers                                                      |
| 8. Exceptional Control Flow       | CS 416 Operating Systems Design                                       |
| 9. Virtual Memory                 | CS 214 Systems Programming / CS 416 Operating Systems Design          |
| 10. System-Level I/O              | CS 416 Operating Systems Design                                       |
| 11. Network Programming           | CS 352 Internet Technology                                            |
| 12. Concurrent Programming        | CS 214 Systems Programming / ECE 451 Parallel & Distributed Computing |

https://classes.rutgers.edu//soc/#keyword?keyword=QUANTUM&semester=92024&campus=NB&level=U,G

2024 Fall: ECE 493. Soljanin. Quantum Computing Algorithms. (seniors only). https://emina.flywheelsites.com/teaching/

2025 Spring: Physics 421. Roy. An Introduction to Quantum Computing. https://www.physics.rutgers.edu/ugrad/421/

## What my class is about

<u>Graduate seminar</u> on latest developments in <u>quantum computer engineering</u>

What is quantum computer engineering??

- realizing <u>quantum algorithms</u>
- on prototype quantum computers
- —a rapidly growing field!!

#### Goals of the course:

- explore open-source tools for using quantum computers
- read and discuss recent developments
- build foundation for you to pursue research or to be experts in industry



## Preview of the syllabus

- A systems view of quantum computer engineering
- Near-term intermediate-scale quantum algorithms
- Programming frameworks
- Emerging languages and representations
- Claims and counter claims for quantum advantage
- Extracting success
- Prototypes

## Very important to help develop next iteration of this course.

## <u>https://sirs.ctaar.rutgers.edu/blue</u>