The fusion of multi-level memory and analog in-memory computing solves the AI edge processing challenge

By Vipin Tiwari, Director, Embedded Memory Product Development, Microchip

Machine learning and deep learning have become an integral part of our lives. Artificial intelligence (AI) applications using natural language processing (NLP), image classification, and object detection are deeply embedded in many of the devices we use. Most AI applications are well served by cloud engines, such as word prediction when replying to emails in Gmail.

While we enjoy the benefits of these AI applications, this approach brings challenges in privacy, power consumption, latency, and cost. These problems can be solved with a local processing engine that performs some or all of the computation (inference) at the source of the data. The memory power consumption of traditional digital neural networks is a bottleneck that makes this difficult to achieve. To address it, multi-level memory can be combined with analog in-memory computing so that processing engines can meet milliwatt (mW) to microwatt (μW) power budgets while performing AI inference at the network edge.

Challenges for AI applications served through cloud engines

To serve AI applications through a cloud engine, users must upload data to the cloud, either actively or passively. A computing engine in the cloud processes the data, generates a prediction, and sends the result back to the downstream user. The challenges of this process are outlined below:

Figure 1: Data transfer from edge to cloud

1. Privacy concerns: For always-on, always-aware devices, personal data and/or confidential information is at risk of misuse during upload or during its retention period in the data center.

2. Unnecessary power consumption: If every bit of data is transmitted to the cloud, the hardware, the radios, the transmission itself, and unneeded computation in the cloud all consume power.

3. Latency of small-batch inference: When data originates at the edge, it can take a second or more to receive a response from the cloud system. Delays beyond 100 milliseconds are noticeable to people and result in a poor user experience.

4. The data economy needs to create value: Sensors are ubiquitous and inexpensive, but they generate massive amounts of data. It is not cost-effective to upload every bit of data to the cloud for processing.

To solve these challenges with a local processing engine, the neural network that will perform the inference operation must first be trained for the target use case with the appropriate dataset. This typically requires high-performance computing (and memory) resources and floating-point arithmetic, so the training part of a machine learning solution still needs to be implemented on a public or private cloud (or an on-premises GPU, CPU, or FPGA farm) together with the dataset to generate the best neural network model. Inference on a neural network model does not require backpropagation, so once the model is ready it can be deeply optimized for local hardware with a small computing engine. An inference engine typically requires a large multiply-accumulate (MAC) engine, followed by activation layers (such as the rectified linear unit (ReLU), sigmoid, or hyperbolic tangent, depending on the complexity of the neural network model) and pooling layers between layers.
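As a rough illustration of these stages, the sketch below chains a matrix-multiply MAC stage, a ReLU activation, and a simple max-pooling step in NumPy; the shapes and values are arbitrary and only show the data flow, not any particular model.

```python
import numpy as np

# Toy inference stages: MAC engine (matrix multiply), activation (ReLU), pooling (max).
# Shapes and values are arbitrary and only illustrate the data flow.

def mac_layer(x, W):
    return x @ W                          # each output is a sum of multiply-accumulates

def relu(x):
    return np.maximum(x, 0)

def max_pool_1d(x, window=2):
    return x.reshape(-1, window).max(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                # input neurons
W = rng.standard_normal((8, 4))           # layer weights

out = max_pool_1d(relu(mac_layer(x, W)))  # MAC -> activation -> pooling
print(out)
```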

Most neural network models require a large number of MAC operations. For example, even the relatively small “1.0 MobileNet-224” model, with 4.2 million parameters (weights), requires 569 million MAC operations to perform a single inference. Because these models are dominated by MAC operations, the focus here is on the computational part of machine learning, and on finding an opportunity to create a better solution. Figure 2 below shows a simple fully connected two-layer network. The input neurons (data) are processed by the first layer of weights; the output neurons of the first layer are then processed by the second layer of weights to produce the prediction (for example, whether the model can find a cat’s face in a given image). These neural network models use a “dot product” operation to compute each neuron in each layer, as shown in the following formula:

y_j = Σ_i (x_i × w_ij)

where x_i is input neuron i of the previous layer, w_ij is the weight connecting it to output neuron j, and y_j is the value of output neuron j. (The “bias” term is omitted from the formula for simplicity.)
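To make the formula concrete, here is a minimal NumPy sketch (illustrative sizes only) that computes every output neuron of a small two-layer fully connected network as a dot product and counts one MAC per weight, which is how totals like the 569 million MACs quoted above arise.

```python
import numpy as np

# Explicit dot-product form of the formula above (bias omitted), with a MAC count.
rng = np.random.default_rng(1)
x = rng.standard_normal(16)               # layer 1 input neurons (e.g., image pixels)
W1 = rng.standard_normal((16, 8))         # layer 1 weights
W2 = rng.standard_normal((8, 2))          # layer 2 weights

def dense(inputs, W):
    """Compute each output neuron as a dot product; return outputs and MAC count."""
    out = np.zeros(W.shape[1])
    for j in range(W.shape[1]):           # one output neuron per weight column
        for i in range(W.shape[0]):
            out[j] += inputs[i] * W[i, j] # one multiply-accumulate per weight
    return out, W.size

h, macs1 = dense(x, W1)                   # layer 1 outputs
y, macs2 = dense(h, W2)                   # layer 2 outputs (the prediction)
print(y, "total MACs:", macs1 + macs2)    # 16*8 + 8*2 = 144
```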

[Figure 2 labels: Layer 1; Layer 2; Layer 1 input neurons (e.g., image pixels); Layer 2 output neurons]

Figure 2: Fully connected two-layer neural network
In digital neural networks, weights and input data are stored in DRAM or SRAM and must be moved to a MAC engine for inference. As Figure 3 below shows, with this approach most of the power is consumed in fetching the model parameters and input data to the ALU where the actual MAC operation takes place. From an energy perspective, a typical MAC operation using digital logic gates consumes about 250 fJ, but the data transfer consumes two orders of magnitude more energy than the computation itself, in the range of 50 picojoules (pJ) to 100 pJ. To be fair, there are many design techniques that minimize data transfers from memory to the ALU, but the overall digital scheme is still limited by the von Neumann architecture, so there is plenty of opportunity to reduce wasted power. What if the energy required to perform a MAC operation could be reduced from about 100 pJ to a fraction of a picojoule?
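A quick back-of-envelope calculation, using the MobileNet MAC count and the per-MAC energy figures quoted above (the 0.5 pJ analog figure is an assumed "fraction of a picojoule"), shows how much is at stake per inference:

```python
# Back-of-envelope energy per inference for the "1.0 MobileNet-224" example above.
# 100 pJ/MAC represents the data-movement-dominated digital case; 0.5 pJ/MAC is an
# assumed "fraction of a picojoule" for an analog in-memory MAC.
MACS_PER_INFERENCE = 569e6

digital_pj_per_mac = 100.0
analog_pj_per_mac = 0.5

digital_mj = MACS_PER_INFERENCE * digital_pj_per_mac * 1e-12 * 1e3   # in millijoules
analog_mj = MACS_PER_INFERENCE * analog_pj_per_mac * 1e-12 * 1e3

print(f"Movement-dominated digital: {digital_mj:.1f} mJ per inference")   # ~56.9 mJ
print(f"Analog in-memory (assumed): {analog_mj:.2f} mJ per inference")    # ~0.28 mJ
# At 10 inferences per second this is roughly 570 mW versus ~3 mW of MAC energy alone.
```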

Eliminate memory bottlenecks while reducing power consumption

Performing inference at the edge becomes a viable option if the memory itself can be used to eliminate the memory bottleneck described above. Using an in-memory computing approach minimizes the amount of data that must be moved, which in turn eliminates the energy wasted on data transfer. Energy consumption is further reduced because the flash cells operate at low active power and consume almost no energy in standby mode.

[Figure 3 labels: normalized energy cost relative to a 1× reference; global buffer; NoC: 200-1000 PEs; buffer]

Figure 3: Memory bottlenecks in machine learning computations

Source: “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks” by Y.-H. Chen, J. Emer and V. Sze, 2016 International Symposium on Computer Architecture.

An example of this approach is the memBrain™ technology from Microchip subsidiary Silicon Storage Technology (SST). The solution is based on SST’s SuperFlash® memory technology, which has become the accepted standard for multi-level memory in microcontroller and smart-card applications. The solution has a built-in in-memory computing architecture that allows computation to be done where the weights are stored. The weights never move; only the input data needs to be moved from the input sensors (such as cameras and microphones) into the memory array, which eliminates the memory bottleneck in the MAC computation.

This memory concept is based on two fundamental principles: (a) the analog current response of a transistor, determined by its threshold voltage (Vt) and the input data, and (b) Kirchhoff’s current law, which states that the algebraic sum of the currents meeting at a node in a network of conductors is zero. It is also important to understand the basic non-volatile memory (NVM) bit cell in this multi-level memory architecture. Figure 4 below shows two ESF3 (Embedded SuperFlash Generation 3) bit cells with a shared erase gate (EG) and source line (SL). Each bit cell has five terminals: control gate (CG), word line (WL), erase gate (EG), source line (SL), and bit line (BL). The erase operation is performed by applying a high voltage to the EG. The program operation is performed by applying high/low voltage bias signals to the WL, CG, BL, and SL. The read operation is performed by applying low voltage bias signals to the WL, CG, BL, and SL.

Figure 4: SuperFlash ESF3 cell
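A minimal sketch of principle (b) above: if each cell is modeled as a programmable conductance and the input is applied as a voltage, the currents summed on each bit line form a dot product. The conductance and voltage values below are hypothetical, not device data.

```python
import numpy as np

# Each cell modeled as a programmable conductance G[i, j]; inputs applied as voltages V[i].
# Ohm's law gives each cell current I = G * V; Kirchhoff's current law sums the cell
# currents on a shared bit line, so each bit line carries one dot product.
rng = np.random.default_rng(2)
G = rng.uniform(0.0, 1e-6, size=(4, 3))   # hypothetical conductances (siemens)
V = np.array([0.2, 0.5, 0.1, 0.4])        # hypothetical input voltages (volts)

bitline_currents = V @ G                  # I_j = sum_i V_i * G[i, j]
print(bitline_currents)                   # one analog current per bit line / output neuron
```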

With this memory architecture, the user can program the memory bit cells to different Vt levels by fine-tuning the programming operation. The memory technology uses intelligent algorithms to adjust the floating gate (FG) voltage of a memory cell to obtain a specific current response for a given input voltage. Cells can be programmed in the linear region or in the subthreshold region, depending on the requirements of the end application.

Figure 5 illustrates how multiple levels are stored in a memory cell. Suppose, for example, that we want to store a 2-bit integer value in a memory cell. Each cell in the memory array must then be programmed with one of four 2-bit values (00, 01, 10, 11), which means programming each cell to one of four possible Vt values with sufficient spacing between them. The four I-V curves in Figure 5 correspond to the four possible states, and the current response of the cell depends on the voltage applied to the CG.

[Figure 5 labels: Level 1 (Voltage 1) through Level 4 (Voltage 4); axes: input voltage vs. subthreshold current]

Figure 5: Programming Vt voltage in ESF3 cells
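The sketch below illustrates the idea of mapping 2-bit weight codes to four well-separated target Vt values; the weight values, quantization scheme, and Vt targets are assumptions for illustration, not the actual programming algorithm.

```python
import numpy as np

# Map example weights to one of four 2-bit codes and then to well-separated target Vt
# values. The weights, quantization, and Vt targets are illustrative assumptions only.
weights = np.array([-0.8, -0.1, 0.3, 0.7])
vt_targets = np.array([1.2, 1.6, 2.0, 2.4])   # assumed target Vt levels (volts)

w_min, w_max = weights.min(), weights.max()
codes = np.clip(np.round((weights - w_min) / (w_max - w_min) * 3), 0, 3).astype(int)

for w, c in zip(weights, codes):
    print(f"weight {w:+.2f} -> code {int(c):02b} -> target Vt {vt_targets[c]:.1f} V")
```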
The weights of the trained model are programmed into the floating-gate Vt of the memory cells. All the weights of each layer of the trained model (e.g., a fully connected layer) can thus be programmed onto a matrix-like memory array, as shown in Figure 6. For an inference operation, a digital input (e.g., from a digital microphone) is first converted to an analog signal by a digital-to-analog converter (DAC) and then applied to the memory array. The array performs thousands of MAC operations in parallel on the given input vector; the resulting outputs pass through the activation stage of the corresponding neurons and are then converted back to digital signals by an analog-to-digital converter (ADC). These digital signals are then pooled before entering the next layer.

[Figure 6 labels: input data on WL/CG; dot product on the bit lines]

Figure 6: Weight matrix memory array for inference
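A conceptual simulation of this data path (DAC, analog vector-matrix multiply, activation, ADC, pooling) might look like the following; the resolutions, scales, and ReLU choice are assumptions, not the product's signal chain.

```python
import numpy as np

# Conceptual data path: digital input -> DAC -> analog vector-matrix multiply in the
# array -> activation -> ADC -> pooling -> next layer. Resolutions and scales are assumed.

def dac(x_digital, bits=8, v_ref=1.0):
    return x_digital / (2**bits - 1) * v_ref          # digital codes to input voltages

def adc(v_analog, bits=8, v_ref=1.0):
    return np.clip(np.round(v_analog / v_ref * (2**bits - 1)), 0, 2**bits - 1)

x_digital = np.array([12, 200, 75, 33])               # e.g., samples from a digital microphone
G = np.random.default_rng(3).uniform(0, 1, (4, 6))    # weights stored as cell conductances

v_in = dac(x_digital)                                 # DAC
analog_out = v_in @ G                                 # parallel MACs on the bit lines
activated = np.maximum(analog_out, 0)                 # activation (ReLU as an example)
digital_out = adc(activated, v_ref=activated.max())   # ADC
pooled = digital_out.reshape(3, 2).max(axis=1)        # pooling before the next layer
print(pooled)
```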
These multi-level memory architectures are very modular and flexible. Many memory slices can be tiled together to build large models that mix weight matrices and neurons, as shown in Figure 7. In this example, an M×N configuration of slices is connected through analog and digital interfaces between the slices.
Figure 7: Modular structure of memBrain™
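A rough sketch of the tiling idea: a large weight matrix is padded and split into fixed-size slices arranged in an M×N grid. The tile size here is illustrative; actual slice dimensions are product-specific.

```python
import numpy as np

# Pad a large weight matrix and split it into fixed-size slices arranged as an M x N grid.
# The 256 x 256 tile size is illustrative only.
TILE_ROWS, TILE_COLS = 256, 256

def split_into_tiles(W):
    rows = -(-W.shape[0] // TILE_ROWS) * TILE_ROWS    # round up to a whole number of tiles
    cols = -(-W.shape[1] // TILE_COLS) * TILE_COLS
    padded = np.zeros((rows, cols))
    padded[:W.shape[0], :W.shape[1]] = W
    return [[padded[r:r + TILE_ROWS, c:c + TILE_COLS]
             for c in range(0, cols, TILE_COLS)]
            for r in range(0, rows, TILE_ROWS)]

W = np.random.default_rng(4).standard_normal((1000, 600))
tiles = split_into_tiles(W)
print(len(tiles), "x", len(tiles[0]), "slice grid")   # 4 x 3 grid for this example
```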
So far, we have mainly discussed the silicon implementation of this architecture. A software development kit (SDK) is also available to help develop solutions (Figure 8). In addition to the silicon, the SDK aids in the development of the inference engine. The SDK flow is independent of the training framework: users can create neural network models using floating-point computation in any of the supported frameworks (such as TensorFlow or PyTorch). Once the model is created, the SDK helps quantize the trained neural network model and map it to the memory array, where vector-matrix multiplication can be performed with input vectors coming from a sensor or computer.

Figure 8: memBrain™ SDK process

[Figure 8 steps: use an existing framework; train the model in the available frameworks; augment with the memBrain™ SDK; model quantization, optimization, and mapping for in-memory computing; fine-tuned programming algorithm to load the optimized weights into SuperFlash memBrain™ memory]
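As a conceptual illustration of the quantize-and-map step in this flow, the sketch below uniformly quantizes a layer of float weights to a fixed number of levels; the function name and level count are hypothetical and are not the memBrain SDK API.

```python
import numpy as np

# Uniform post-training quantization of one layer's float weights to a fixed number of
# levels, plus the scale/offset needed to interpret the codes. Function name and level
# count are hypothetical; this is not the memBrain SDK API.

def quantize_layer(weights_fp32, n_levels=16):
    w_min, w_max = float(weights_fp32.min()), float(weights_fp32.max())
    scale = (w_max - w_min) / (n_levels - 1)
    codes = np.round((weights_fp32 - w_min) / scale).astype(np.int32)
    return codes, scale, w_min

trained = np.random.default_rng(5).standard_normal((128, 64)).astype(np.float32)
codes, scale, offset = quantize_layer(trained)
reconstructed = codes * scale + offset                 # what the array will represent
print("max quantization error:", float(np.abs(reconstructed - trained).max()))
```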


Advantages of the multi-level memory approach combined with in-memory computing capabilities include:

1. Ultra-low power consumption: The technology is designed for low-power applications. The first power advantage is that the solution computes in memory, so no energy is wasted transferring data and weights from SRAM/DRAM during computation. The second is that the flash cells operate at very low currents in subthreshold mode, so the active power consumption is very low. The third is that there is virtually no power consumption in standby mode, because the non-volatile memory cells need no power to retain the data of an always-on device. The approach is also well suited to exploiting sparsity in the weights and input data: if the input data or the weight is zero, the memory bit cell is not activated.

2. Smaller footprint: The technology uses a split-gate (1.5T) cell architecture, whereas the SRAM cells in a digital implementation are based on a 6T architecture, so each cell is much smaller than a 6T SRAM cell. In addition, a single cell can store a full 4-bit integer value, rather than the 4 × 6 = 24 transistors an SRAM implementation would need, substantially reducing the on-chip footprint.

3. Lower development cost: Because of memory performance bottlenecks and the limitations of the von Neumann architecture, many special-purpose devices (such as Nvidia’s Jetson or Google’s TPU) tend to improve performance per watt by shrinking process geometries, but this is an expensive way to solve the edge computing problem. By combining analog in-memory computing with multi-level memory, the computation is done on-chip in the flash cells, so larger geometries can be used, which reduces mask costs and shortens development cycles.

The prospects for edge computing applications are very broad, but the power and cost challenges must be solved before edge computing can take off. Using a memory approach that performs the computation on-chip in flash cells removes a major hurdle. This approach leverages a production-proven, de facto standard multi-level memory technology solution that has been optimized for machine learning applications.

About the Author

Vipin Tiwari has over 20 years of experience in product development, product marketing, business development, technology licensing, engineering management, and memory design. Currently, Mr. Tiwari is Director of Embedded Memory Product Development at Silicon Storage Technology, Inc., a Microchip subsidiary.
