Extracting subnets from the published SmolLM2 model for compute-efficient task performance on edge devices

#5
by MartialTerran - opened

I want to derive a subnet from the SmolLM2-135M model that is useful on a mid-sized Arduino board (assume sufficient memory and storage, perhaps via external storage).

Potential Tasks for Extracted Subnets on Arduino:

Given the resource constraints of even powerful Arduinos, extracted subnets will be best suited for simpler NLP tasks. Examples include:

Text Classification: Sentiment analysis, topic classification (with a limited number of classes).

Keyword Spotting: Detecting specific keywords, phrases, or equivalent subject matter in input text.

Short Text Generation: Generating very short responses or completions (highly constrained due to limited memory).

Code Completion (very limited): Suggesting very basic code completions based on a highly restricted vocabulary (more feasible with larger Arduinos and external storage).

Simple Question Answering: Answering basic factual questions based on a pre-defined knowledge base stored on the Arduino (requires careful memory management).

Arduino Tool Calling: Reading or activating pins of the Arduino based on text input.

Project Plan: Extracting and Running a SmolLM2 Subnet on Arduino
This plan outlines the process of extracting a smaller, more efficient subnet from the pre-trained SmolLM2 weights and deploying it on an Arduino for specific tasks. It builds on the previous plan but focuses on leveraging the capabilities of the existing pre-trained model.

Target Arduino Boards: (Same as, or smaller than, those in the previous plan) Arduino Portenta H7, Arduino Nano RP2040 Connect (with significant optimization and external storage), or custom ESP32 boards with ample PSRAM.

Phase 1: Subnet Extraction and Quantization

Task 1.1: Identify Target Subnet Architecture:

Description: Determine the desired size and architecture of the subnet. This involves choosing the number of layers, hidden dimensions, and attention heads.

Details: Analyze the available resources on the target Arduino board (RAM, flash memory, processing power) to guide the selection of an appropriate subnet size. Prioritize smaller architectures to ensure feasibility on the Arduino. Consider using the smallest published SmolLM2 checkpoint (134M parameters) as a starting point and further pruning it.

Deliverables: Definition of the target subnet architecture (number of layers, hidden size, attention heads).
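To make the sizing step concrete, here is a rough sketch that counts parameters for a Llama-style decoder and converts the count to bytes at a given weight precision. The SubnetConfig class and its field names are my own illustration, not a library API. The FULL shape is what I believe the smallest SmolLM2 checkpoint uses, so check it against the published config.json. The CANDIDATE shape is purely hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetConfig:
    """Shape of a Llama-style decoder (illustrative names, not a library API)."""
    vocab_size: int
    hidden_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    intermediate_size: int
    tied_embeddings: bool = True  # output projection reuses the embedding matrix


def count_parameters(cfg: SubnetConfig) -> int:
    """Count weights, assuming bias-free projections and a gated (3-matrix) MLP."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = head_dim * cfg.num_kv_heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_dim
    mlp = 3 * cfg.hidden_size * cfg.intermediate_size
    norms = 2 * cfg.hidden_size  # two normalization vectors per layer
    embeddings = cfg.vocab_size * cfg.hidden_size
    lm_head = 0 if cfg.tied_embeddings else embeddings
    final_norm = cfg.hidden_size
    return embeddings + lm_head + cfg.num_layers * (attention + mlp + norms) + final_norm


def weight_bytes(cfg: SubnetConfig, bits_per_weight: int = 8) -> int:
    """Storage for the weights alone (no activations, KV cache, or scales)."""
    return count_parameters(cfg) * bits_per_weight // 8


# ASSUMPTION: shape of the smallest SmolLM2 checkpoint; verify against config.json.
FULL = SubnetConfig(vocab_size=49152, hidden_size=576, num_layers=30,
                    num_heads=9, num_kv_heads=3, intermediate_size=1536)

# HYPOTHETICAL candidate: 4 of the 30 layers and a 4096-token vocabulary.
CANDIDATE = SubnetConfig(vocab_size=4096, hidden_size=576, num_layers=4,
                         num_heads=9, num_kv_heads=3, intermediate_size=1536)

if __name__ == "__main__":
    for name, cfg in (("full", FULL), ("candidate", CANDIDATE)):
        mib = weight_bytes(cfg, bits_per_weight=8) / 2**20
        print(f"{name}: {count_parameters(cfg):,} params, {mib:.1f} MiB at INT8")
```

Comparing the INT8 figure with the board's RAM and flash shows quickly whether a candidate shape is worth pursuing. Note that this counts weights only, so activations and any key/value cache need their own budget.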

Task 1.2: Extract Subnet Weights:

Description: Extract the corresponding weights for the chosen subnet architecture from the full SmolLM2 checkpoint.

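Details: Map the layers, attention heads, and other components of the target subnet to the corresponding tensors in the full model checkpoint. This requires understanding the internal structure of the SmolLM2 model and how its weights are organized. Note that, as far as I know, the published SmolLM2 checkpoints are separately trained models rather than nested (Matryoshka-style) subnets, so an extracted subnet is effectively a structurally pruned model and will likely need fine-tuning or distillation to recover accuracy.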

Deliverables: A set of weights specifically for the extracted subnet.
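As a minimal sketch of the simplest form of extraction (dropping whole decoder layers), the function below filters a name-to-tensor mapping and renumbers the layers that remain. It assumes Hugging Face Llama-style key names such as model.layers.3.self_attn.q_proj.weight, which should be checked against the actual checkpoint. extract_layers is a hypothetical helper, and plain Python lists stand in for tensors so the sketch runs without dependencies.

```python
import re

# ASSUMPTION: decoder-layer tensors are named "model.layers.<index>.<rest>".
LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.(.+)$")


def extract_layers(state_dict, keep_layers):
    """Return a new mapping with only the chosen layers, renumbered from 0.

    Tensors that do not belong to a decoder layer (embeddings, final norm)
    are copied through unchanged.
    """
    if len(set(keep_layers)) != len(keep_layers):
        raise ValueError("keep_layers must not contain duplicates")
    new_index = {old: new for new, old in enumerate(keep_layers)}
    subnet = {}
    for name, tensor in state_dict.items():
        match = LAYER_KEY.match(name)
        if match is None:
            subnet[name] = tensor
            continue
        old = int(match.group(1))
        if old in new_index:
            subnet[f"model.layers.{new_index[old]}.{match.group(2)}"] = tensor
    return subnet


# Toy checkpoint: six layers, one tensor each, with the layer index as the value.
toy = {"model.embed_tokens.weight": [[0.0]], "model.norm.weight": [1.0]}
for i in range(6):
    toy[f"model.layers.{i}.self_attn.q_proj.weight"] = [[float(i)]]

subnet = extract_layers(toy, keep_layers=[0, 2, 5])
```

Reducing attention heads or the hidden size is more involved, because the same rows and columns must be sliced consistently across every projection that touches them. Layer dropping is the easiest place to start.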

Task 1.3: Quantize the Subnet: (Same as Task 1.2 in the previous plan)

Description: Quantize the extracted subnet weights (e.g., to INT8) to reduce memory footprint and potentially improve performance.

Details: Use the quantization tools provided by the chosen inference framework (TensorFlow Lite for Microcontrollers, ONNX Runtime, etc.).

Deliverables: Quantized subnet weights.
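For intuition about what INT8 quantization does to a weight tensor, here is a small sketch of symmetric per-tensor quantization written in plain Python. It is not the scheme of any particular framework (real toolchains often use per-channel scales and calibration data), and the function names are my own.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # An all-zero tensor would give scale 0, so fall back to 1.0 in that case.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then takes one byte instead of four, plus one float scale per tensor. With round-to-nearest, the reconstruction error per weight is at most half a quantization step, which is scale / 2.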

Phase 2: Arduino Development Environment Setup (Same as Phase 2 in the previous plan, but using the quantized subnet instead of the full model)

Phase 3: Implementation and Integration (Mostly the same as Phase 3 in the previous plan)

Phase 4: Testing and Optimization (Same as Phase 4 in the previous plan)

Important Considerations:

Memory Management: Crucial throughout the process. On a microcontroller, prefer statically allocated buffers over frequent dynamic allocation to avoid heap fragmentation, and consider memory-mapping weights from flash and minimizing data copies.

Performance Optimization: Profile the code and optimize for speed. Techniques might include loop unrolling, minimizing function calls, and using efficient data structures.

External Storage: Likely required for even moderately sized subnets, especially for text generation tasks.

Limited Vocabulary: Use a smaller vocabulary to reduce memory requirements for the embedding matrix.
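One way the vocabulary reduction could work is to keep only the embedding rows for the tokens the task needs and remap token ids accordingly. The sketch below assumes the embedding matrix is a list of rows indexed by token id. prune_vocabulary and remap_tokens are hypothetical helpers, and tokens outside the kept set fall back to a designated unknown-token id.

```python
def prune_vocabulary(embedding, keep_ids, unk_id):
    """Keep only selected embedding rows; return the rows and an id remap."""
    kept = sorted(set(keep_ids) | {unk_id})  # always retain the fallback token
    old_to_new = {old: new for new, old in enumerate(kept)}
    pruned = [embedding[old] for old in kept]
    return pruned, old_to_new


def remap_tokens(token_ids, old_to_new, unk_id):
    """Translate original token ids to pruned ids, using unk_id as fallback."""
    fallback = old_to_new[unk_id]
    return [old_to_new.get(token, fallback) for token in token_ids]


# Toy embedding matrix: 10 tokens, 2 dimensions each.
embedding = [[float(i), float(i) * 2] for i in range(10)]
pruned, old_to_new = prune_vocabulary(embedding, keep_ids=[7, 3, 3], unk_id=0)
remapped = remap_tokens([3, 9, 7], old_to_new, unk_id=0)
```

The tokenizer has to be restricted to the same kept set. If the checkpoint ties its input embeddings to the output projection (which I believe the smallest SmolLM2 model does, but check the config), the pruned matrix shrinks both at once.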

This plan is more practical than trying to run the whole SmolLM2 model on the Arduino, because it extracts and deploys a much smaller subnet instead. Even seemingly simple tasks can be challenging on microcontrollers, so careful planning and optimization are essential.
