# An Introduction to Practical 2

Lecture 4 for Advanced Deep Learning Systems

Aaron Zhao, Imperial College London, a.zhao@imperial.ac.uk

- 1. Introduction
- 2. Lab 3: A quantization search using MASE
- 3. Lab 4 (software stream): A toy Network Architecture Search (NAS) using MASE
- 4. Lab 4 (hardware stream): Writing and testing a fully-connected layer in SystemVerilog

# Introduction


Practical 2 contains two labs (Lab 4 has a software stream and a hardware stream):

- Lab 3: A quantization search using MASE.
- Lab 4 (software stream): A toy Network Architecture Search (NAS) using MASE.
- Lab 4 (hardware stream): Writing and testing a fully-connected layer in SystemVerilog.

Deliverable

- A Markdown file with all answers (plots, tables, ...) to the questions and optional questions.
- Corresponding code in your forked repository.

Examination (15%)

- Submission requires the Markdown files only.
- Lab oral to check on your code and Q&A.

# Lab 3: A quantization search using MASE

We allow multi-precision: different layers can use different precision setups. This is also known as mixed-precision quantization.

We would like to have at most X% accuracy degradation, and we focus on quantizing the computationally heavy layers (e.g. linear, convolution).

- Suppose the network has N layers.
- Each layer has M quantization choices.
- The search space then has M<sup>N</sup> configurations.
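To get a feel for how fast this grows, the sketch below enumerates every per-layer assignment for a small case. The bit widths and the layer count are hypothetical values chosen for illustration; they do not come from the lab.

```python
from itertools import product


def enumerate_search_space(num_layers, choices):
    """Yield one tuple of per-layer choices for every point in the search space."""
    return product(choices, repeat=num_layers)


choices = [4, 8, 16]  # hypothetical bit widths, so M = 3
num_layers = 4        # N = 4
configs = list(enumerate_search_space(num_layers, choices))
print(len(configs))   # 3 ** 4 = 81
```

Each extra layer multiplies the number of configurations by M, which is why exhaustive search quickly becomes impractical for deep networks.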

#### **Classic Approach**

```
import torch.nn as nn

# QuantizedLinear, info, search_space, model and evaluate are defined elsewhere.


class JSC_Tiny(nn.Module):
    def __init__(self, info, qparam):
        super(JSC_Tiny, self).__init__()
        self.seq_blocks = nn.Sequential(
            # 1st LogicNets Layer
            nn.BatchNorm1d(16),  # batch norm layer
            QuantizedLinear(16, 5, qparam),  # linear layer
        )

    def forward(self, x):
        return self.seq_blocks(x)


# Every quantization choice rebuilds the network from its definition.
for qparam in search_space:
    evaluate(model(info, qparam))  # model: the network class, e.g. JSC_Tiny
    ...
```

The classic approach does not scale well because it interleaves the network definition with the quantization optimization. What if:

- We have a new network
- Or we have a new optimization
- Or we want to use this optimization in conjunction with other optimizations

```
for i, config in enumerate(search_spaces):
    mg = quantize_transform_pass(ori_mg, config)
    evaluate(mg)
```
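The point of the loop above is that the network is defined once and the optimization is applied to it as a separate function. The sketch below mimics that separation with a plain dictionary standing in for the graph. `toy_quantize_pass`, `toy_evaluate`, and the layer names are hypothetical and are not part of MASE.

```python
import copy


def toy_quantize_pass(graph, config):
    """Return a copy of the graph with a bit width attached to each listed layer."""
    new_graph = copy.deepcopy(graph)  # the original graph stays untouched
    for name, bits in config.items():
        new_graph[name]["bits"] = bits
    return new_graph


def toy_evaluate(graph):
    """Stand-in metric: total number of weight bits (smaller is cheaper)."""
    return sum(layer["params"] * layer["bits"] for layer in graph.values())


# The "network" is defined once, independently of any optimization.
ori_graph = {
    "linear_0": {"params": 256, "bits": 32},
    "linear_1": {"params": 80, "bits": 32},
}
search_spaces = [
    {"linear_0": 8, "linear_1": 8},
    {"linear_0": 4, "linear_1": 8},
]

results = [toy_evaluate(toy_quantize_pass(ori_graph, c)) for c in search_spaces]
print(results)  # [2688, 1664]
```

Because the pass takes a graph and returns a graph, a new network or a second optimization can be added without editing either of the existing pieces.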


#### 'add\_common\_metadata\_analysis\_pass' uses the dummy input.

We have explained before that the fx\_graph is a skeleton that records minimal information. So how do we actually fetch input and model information from this skeleton to run an actual inference?

The MaseGraph implementation relies largely on torch fx\_graphs.

Traversing an fx\_graph requires two components: MaseGraph.fx\_graph itself and MaseGraph.modules. The fx\_graph is a skeleton that records only minimal information for each node:

- node.op
- node.target
- node.name
- node.args
- node.kwargs

node.op is "placeholder"

- node.name is set to the variable name for the input
- node.target not used
- node.args not used
- node.kwargs not used

node.op is "call\_function"

- node.name is function name
- node.target is the actual function
- node.args is the function arguments
- node.kwargs is the kwargs

node.op is "call\_module"

- node.name is module name
- node.target is also the module name, used to look up the module in MaseGraph.modules
- node.args is the module arguments
- node.kwargs is the kwargs

#### Tracing the information for these ops

```
# load_arg and env are defined in the elided parts of the pass.
for node in graph.fx_graph.nodes:
    args, kwargs = None, None
    if node.op == "placeholder":
        # Inputs are taken straight from the dummy input, keyed by name.
        result = dummy_in[node.name]
    # ...
    elif node.op == "call_function":
        args = load_arg(node.args, env)
        kwargs = load_arg(node.kwargs, env)
        # node.target is the function itself.
        result = node.target(*args, **kwargs)
    elif node.op == "call_module":
        args = load_arg(node.args, env)
        kwargs = load_arg(node.kwargs, env)
        # node.target is only a name; the module lives in graph.modules.
        result = graph.modules[node.target](*args, **kwargs)
    # ...
```

The full code is available in the implementation of add\_common\_metadata\_analysis\_pass.
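The traversal can be tried out without MASE or PyTorch using the self-contained sketch below. It is a simplification under stated assumptions: `Node` is a hypothetical stand-in carrying only the five fields listed earlier, nodes refer to earlier results by name rather than by node object, and `load_arg` and `env` are guesses at the role of the helpers in the listing, not the MASE implementation.

```python
import operator
from collections import namedtuple

# Hypothetical stand-in for an fx node with the five recorded fields.
Node = namedtuple("Node", ["op", "name", "target", "args", "kwargs"])


def load_arg(arg, env):
    """Replace references to earlier nodes (by name) with their computed values."""
    if isinstance(arg, tuple):
        return tuple(load_arg(a, env) for a in arg)
    if isinstance(arg, dict):
        return {k: load_arg(v, env) for k, v in arg.items()}
    if isinstance(arg, str) and arg in env:
        return env[arg]
    return arg


def run(nodes, modules, dummy_in):
    """Execute the skeleton node by node, storing every result in env."""
    env = {}
    result = None
    for node in nodes:
        if node.op == "placeholder":
            result = dummy_in[node.name]
        elif node.op == "call_function":
            args = load_arg(node.args, env)
            kwargs = load_arg(node.kwargs, env)
            result = node.target(*args, **kwargs)
        elif node.op == "call_module":
            args = load_arg(node.args, env)
            kwargs = load_arg(node.kwargs, env)
            result = modules[node.target](*args, **kwargs)
        else:
            raise ValueError(f"unsupported op: {node.op}")
        env[node.name] = result
    return result


modules = {"double": lambda x: 2 * x}  # plays the role of MaseGraph.modules
nodes = [
    Node("placeholder", "x", None, (), {}),
    Node("call_module", "double", "double", ("x",), {}),
    Node("call_function", "add", operator.add, ("double", 1), {}),
]
out = run(nodes, modules, {"x": 3})
print(out)  # 7
```

The skeleton alone cannot execute anything: the placeholder needs the dummy input and the call\_module node needs the module table, which is why both are passed in alongside the graph.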

# Lab 4 (software stream): A toy Network Architecture Search (NAS) using MASE

We want to pick the optimal architecture $a \in \mathcal{A}$ from a set of architectures $\mathcal{A}$.

At the same time, we want to pick the optimal parameters $w^*(a)$ for the architecture $a$.

$$\min_{a \in \mathcal{A}} \mathcal{L}_{val}(w^*(a), a) \quad \text{s.t.} \quad w^*(a) = \operatorname{argmin}_w \mathcal{L}_{train}(w, a) \tag{1}$$
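The nested structure of this problem can be illustrated with a deliberately tiny example. Here an "architecture" is the exponent $a$ in the model $y = w x^a$, the inner problem is solved in closed form by least squares on the training set, and the outer problem picks the $a$ with the lowest validation loss. The model family and the data are assumptions made for illustration only; they are not part of the lab.

```python
def fit_weight(a, data):
    """Inner problem: least-squares w for the model y = w * x**a (closed form)."""
    num = sum((x ** a) * y for x, y in data)
    den = sum((x ** a) ** 2 for x, _ in data)
    return num / den


def loss(w, a, data):
    """Mean squared error of the model y = w * x**a on the given data."""
    return sum((w * x ** a - y) ** 2 for x, y in data) / len(data)


# Synthetic data generated by y = 2 * x**2.
train = [(x, 2.0 * x ** 2) for x in (0.5, 1.0, 1.5, 2.0)]
val = [(x, 2.0 * x ** 2) for x in (0.75, 1.25, 1.75)]

architectures = [1, 2, 3]  # the candidate set A

# Outer problem: evaluate each architecture with its own optimal weight.
best_a = min(architectures, key=lambda a: loss(fit_weight(a, train), a, val))
best_w = fit_weight(best_a, train)
print(best_a, best_w)
```

In a real search the inner step is a full training run, so each candidate architecture is expensive to evaluate; that cost is what motivates smarter search strategies than brute force.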



### The idea of multiplied channels

```
import torch.nn as nn


class JSC_Three_Linear_Layers(nn.Module):
    def __init__(self):
        super(JSC_Three_Linear_Layers, self).__init__()
        self.seq_blocks = nn.Sequential(
            nn.BatchNorm1d(16),  # 0
            nn.ReLU(16),  # 1
            nn.Linear(16, 16),  # linear seq_2
            nn.ReLU(16),  # 3
            nn.Linear(16, 16),  # linear seq_4
            nn.ReLU(16),  # 5
            nn.Linear(16, 5),  # linear seq_6
            nn.ReLU(5),  # 7
        )

    def forward(self, x):
        return self.seq_blocks(x)
```

#### The idea of multiplied channels

```
import torch.nn as nn


class JSC_Three_Linear_Layers(nn.Module):
    def __init__(self):
        super(JSC_Three_Linear_Layers, self).__init__()
        self.seq_blocks = nn.Sequential(
            nn.BatchNorm1d(16),
            nn.ReLU(16),
            nn.Linear(16, 32),  # output scaled by 2
            nn.ReLU(32),  # scaled by 2
            nn.Linear(32, 64),  # input scaled by 2 but output scaled by 4
            nn.ReLU(64),  # scaled by 4
            nn.Linear(64, 5),  # input scaled by 4
            nn.ReLU(5),
        )

    def forward(self, x):
        return self.seq_blocks(x)
```

- The idea is to scale the input and output channels of the linear layer by a constant factor.
- In this lab, you will have to standardize this idea and implement it as a Transformation pass.
- Consecutive linear layers must agree: the output of one layer and the input of the next must be scaled by the same factor.
- Search through all the possible factors (brute force and Bayesian).
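The shape bookkeeping behind these bullets can be sketched in plain Python. The helper below is hypothetical (it is not the MASE transformation pass); it takes the baseline `(in, out)` shapes and one multiplier per boundary between consecutive linear layers, leaving the network input and the final output unscaled.

```python
def scale_linear_shapes(base_shapes, multipliers):
    """Scale hidden channel counts.

    base_shapes: list of (in_features, out_features), one per linear layer.
    multipliers: one factor per boundary between consecutive linear layers.
    """
    assert len(multipliers) == len(base_shapes) - 1
    last = len(base_shapes) - 1
    scaled = []
    for i, (fan_in, fan_out) in enumerate(base_shapes):
        # The input factor equals the output factor of the previous layer.
        in_factor = 1 if i == 0 else multipliers[i - 1]
        out_factor = 1 if i == last else multipliers[i]
        scaled.append((fan_in * in_factor, fan_out * out_factor))
    return scaled


base = [(16, 16), (16, 16), (16, 5)]
print(scale_linear_shapes(base, [2, 4]))  # [(16, 32), (32, 64), (64, 5)]
```

With multipliers 2 and 4 this reproduces the shapes in the second listing above; a search then amounts to trying different multiplier lists and evaluating each resulting network.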

# Lab 4 (hardware stream): Writing and testing a fully-connected layer in SystemVerilog

Automatically generate a fully-connected layer in SystemVerilog, and test it using Cocotb.

Cocotb is a COroutine based COsimulation TestBench environment for verifying VHDL and SystemVerilog RTL using Python.

We use Cocotb with the Verilator backend.

Cocotb: direct testing in Python, no need to write a testbench in SystemVerilog.

Verilator: 'up-compiles' SystemVerilog into multithreaded C++; it is lightning fast, and there is no need to open vendor tools for behavioral-level testing.

- Classic source to source generation
- Directly generate SystemVerilog from MaseGraph

```
from chop.passes.graph.transforms import (
    emit_verilog_top_transform_pass,
    emit_internal_rtl_transform_pass,
    emit_bram_transform_pass,
    emit_verilog_tb_transform_pass,
)
```

#### **EmitVerilog Pass in MASE**



### EmitVerilog generates dataflow designs

- Generate functional elements (RTL)
- Generate memory components (BRAM)
- Dataflow accelerator design without making use of the DRAM



#### Dataflow accelerator designs

- A homogeneous Big Compute Core (normal design, ASIC)
- A series of tailored small compute cores (dataflow design, FPGA)



#### Advantages

- No complex control flow (minimal or no ISA design)
- (Almost) no waste of resources
- (Almost) fixed memory access pattern
- Deep pipeline



#### Disadvantages

- Re-program hardware for each new network
- Scalability issues
- If DRAM is used, it is hard to keep all pipeline stages filled and achieve high performance



### The compute pattern

#### Simple blocking

```
import numpy as np

# a and b are length-n vectors and block_size is the hardware block size;
# all are assumed to be defined beforehand.
result = 0.0

# Breaking the vector into blocks
for i in range(0, n, block_size):
    # calculate end val considering the last block which can be smaller than block size
    end_val_i = min(i + block_size, n)

    # Retrieving block of a
    sub_a = a[i:end_val_i]
    # Retrieving corresponding elements from vector
    sub_b = b[i:end_val_i]

    # multiplication, actual hardware dimension is (block_size, 1)
    result += np.dot(sub_a, sub_b)
```




- N >> M (N is the problem size, M the block size): this lets you trade off resources against latency simply by changing M.
- Blocking can happen in a 2D shape!



M \* M multipliers in hardware; the computation finishes in N\*N/(M\*M) clock cycles.

- Parallel Multipliers (*M*<sup>2</sup>).
- M Adder Trees ($\log_2(M)$).
- *M* Accumulators (*M*).
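The cycle count and the three hardware components can be checked with a small software model. The sketch assumes the workload is an N x N matrix multiplied by a length-N vector and that N is a multiple of M; each M x M tile is treated as one clock cycle. The function name and the test data are hypothetical.

```python
def blocked_matvec(weights, x, m):
    """Multiply an n x n matrix by a length-n vector using m x m tiles.

    Each tile models one clock cycle: m * m parallel multipliers,
    m adder trees (one per tile row) and m accumulators.
    Returns the result vector and the number of cycles used.
    """
    n = len(x)
    assert n % m == 0, "this sketch assumes n is a multiple of m"
    acc = [0] * n  # one accumulator per output element
    cycles = 0
    for row0 in range(0, n, m):
        for col0 in range(0, n, m):
            for r in range(m):
                # m multipliers feed one adder tree for this tile row
                products = [weights[row0 + r][col0 + c] * x[col0 + c] for c in range(m)]
                acc[row0 + r] += sum(products)
            cycles += 1
    return acc, cycles


n, m = 8, 2
weights = [[r + c for c in range(n)] for r in range(n)]
x = list(range(n))

y, cycles = blocked_matvec(weights, x, m)
reference = [sum(weights[r][c] * x[c] for c in range(n)) for r in range(n)]
print(cycles)  # 8 * 8 / (2 * 2) = 16
```

Doubling `m` quadruples the number of multipliers and cuts the cycle count by four, which is the resource versus latency trade-off described in the bullets.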