Introduction
As embedded applications become increasingly diverse, the modern Microcontroller Unit (MCU) has evolved rapidly from the early simple 8-bit chips to 32-bit products dominated by the ARM architecture, balancing high performance with low power consumption. Choosing the right MCU or System-on-Chip (SoC) is no longer just about comparing specs or prices; it requires a holistic consideration of the core architecture, memory subsystems, peripheral configurations, as well as the development ecosystem and long-term maintenance.
Beyond hardware specifications, modern embedded systems must also address multitasking, real-time performance, and security requirements. Real-Time Operating Systems (RTOS) have become a mainstream design choice, while for higher-end applications, some MCUs are capable of running Embedded Linux directly, further expanding system flexibility and functionality. With security mechanisms like TrustZone and Memory Protection Units (MPU) becoming standard equipment, the selection of a single MCU actually dictates the overall software architecture and future maintenance costs.
In this article, we will dive straight into the datasheets to explain the key specifications and common features of today’s mainstream MCUs. I will also highlight the critical points to watch out for during selection, sharing the judgment criteria and experiences I’ve accumulated in practical development.
The Current Mainstream
MCU Vendors
MCUs are primarily used for external control and communication. Currently, the global mainstream MCU market is dominated by the “Big Six”: Infineon, NXP, STMicroelectronics (ST), Texas Instruments (TI), Renesas, and Microchip. Recent studies show that the top five vendors combined account for approximately 55%–60% of the market share.
SoC Vendors
An SoC is an advanced chip that integrates specific functional modules on top of an MCU or Microprocessor Unit (MPU) foundation. For example, the ESP32 is an SoC because it integrates Wi-Fi and Bluetooth wireless modules onto an MCU base, while also building in SRAM, ROM, and rich peripherals like SPI, I2C, UART, ADC, and DAC.
Vendors are divided by vertical markets, integrating MCU/MPU cores, memory, dedicated accelerators, wireless modules, and interfaces to meet specific needs. Major application areas and representative vendors include:
Smartphones and Mobile Devices
- MediaTek: The Dimensity series covers the mid-to-high-end market.
- Apple: A-series and M-series SoCs, featuring deep hardware-software integration.
- Qualcomm: Dominates the mobile SoC market with the Snapdragon series.
Video and Imaging Processing
- Axis Communications: Uses their proprietary ARTPEC series SoCs.
- HiSilicon: A Huawei subsidiary that once led the surveillance camera SoC market in 2018; however, its global market share has declined rapidly since 2020 due to export restrictions.
- Novatek: Offers AI ISP IPCAM SoCs like the NT98539. Distributor technical data claims edge AI computing power of up to roughly 3 TOPS.
- Socionext: Provides imaging processing SoCs and AI integration solutions focused on surveillance.
- Ambarella: Focuses on AI computer vision SoCs for security surveillance, automotive ADAS, and consumer electronics.
- Rockchip: Known for visual processor SoCs like the RV1106.
Automotive Radar and ADAS
- Calterah: Offers 77 / 79 GHz mmWave radar SoCs supporting automotive functional safety ASIL-B.
- NXP: The SAF85xx / SAF86xx series RFCMOS radar SoCs.
- Texas Instruments: Single-chip 4D imaging radar SoCs and design resources.
- Infineon: 24 GHz and 76–81 GHz automotive radar SoCs.
Communications and Networking
- Broadcom: Trident / Tomahawk / Jericho series Ethernet switch chips.
- Marvell: Prestera / Teralynx series, focusing on data infrastructure SoCs.
- Realtek: Covers switching, PHY transceivers, and home multimedia SoCs.
Embedded and IoT
- Espressif: ESP32 series IoT SoCs integrating Wi-Fi / Bluetooth.
- Silicon Labs: EFR32 wireless SoCs covering Bluetooth, Zigbee, Thread / Matter, and sub-GHz.
- UNISOC: 5G / Cellular and Wide Area IoT chip solutions.
AI and High-Performance Computing (HPC)
- NVIDIA: GPU architecture-based SoCs.
- Intel: Expanding from microprocessors to AI and autonomous driving sectors.
- Kneron: Edge AI SoC processors.
MCU and SoC Specifications
Voltage and Power Management
Designing power for an MCU is far more complex than simply hooking up a 3.3 V line. To achieve extreme low power and high integration, modern MCUs typically feature multiple power rails and granular power modes.
Voltage Standards and Wide Voltage Design
While 5 V, 3.3 V, and 1.8 V are common standard voltages, modern MCUs often adopt a “Wide Voltage” design to accommodate different power sources:
- 1.71 V ~ 3.6 V: Common range for mainstream Cortex-M series—runs directly off two alkaline cells, or off a single Li-ion cell (nominal 3.7 V) after a slight step-down.
- 2.7 V ~ 5.5 V: Common in industrial grades or 5 V systems with high noise immunity requirements.
Power Rails
Different units inside an MCU have different power needs, leading to several common power pins:
- VDD / VSS: The main power supply for the digital core and I/O.
- VDDA / VSSA: Dedicated power for analog peripherals (ADC, DAC, PLL). To ensure measurement precision, it is usually recommended to isolate this from VDD via a filter circuit to avoid digital switching noise interfering with analog signals.
- VBAT: Backup battery power. When the main VDD supply is cut, the system automatically switches to VBAT to keep the Real-Time Clock (RTC) running and prevent data loss in Backup Registers.
Power Modes
The core of low-power design lies in “sleeping on demand.” Modern MCUs provide multiple low-power modes, requiring engineers to trade off between Power Consumption, Wake-up Time, and Data Retention.
Note that actual values vary significantly depending on the MCU series (e.g., High Performance vs. Ultra-Low Power), the type of voltage regulator (LDO vs. DC/DC), and Sub-mode settings. The table below serves as a reference for general-purpose MCUs:
| Mode | Typical Current | Wake-up Time | RAM Retention | Description |
|---|---|---|---|---|
| Run | 20 ~ 500 µA/MHz | N/A | Yes | Varies greatly by process and architecture. Ultra-low power series can go as low as 20 µA/MHz. |
| Sleep | Several µA ~ mA | < 1 µs | Yes | CPU stops, peripherals continue running. Current depends on the number of active peripherals. |
| Stop | < 1 µA ~ Tens of µA | µs ~ Tens of µs | Mode Dependent | Varies based on the size of retained SRAM and active wake-up sources. |
| Standby | < 1 µA ~ 5 µA | Hundreds of µs ~ ms | No (Backup RAM only) | Wake-up is equivalent to a system reset. Oscillator stabilization time must be considered. |
Wake-up Mechanisms and Power Efficiency
After entering sleep, the system needs specific events to wake up:
- External Events: GPIO voltage level changes, external interrupts.
- Communication Interfaces: Receiving UART data or CAN packets.
- Timers: RTC alarms or Low-Power Timer (LPTIM) overflows.
To further improve energy efficiency, some high-end MCUs (like the STM32U5) integrate a DC-DC Buck Converter to replace the traditional LDO. Because buck conversion efficiency can exceed 90%—whereas an LDO’s efficiency falls as the step-down ratio grows—it significantly extends battery life. However, its switching noise is higher, so its impact on analog circuits must be evaluated during design.
Case Study: NXP KW45 Power Architecture
The NXP KW45 is a wireless MCU with integrated Bluetooth. Its power system design offers extreme flexibility, primarily supporting two core power modes:
Bypass Mode: All power domains share the same external power source. This mode has the lowest hardware cost and fewest external circuits but consumes the most power.
DC-DC Buck Mode: This is the sweet spot for power and hardware cost. It uses the internal integrated DC-DC module for voltage regulation. When the RF is idle, the system can dynamically lower the DC-DC output voltage to save even more energy.
Furthermore, for more complex or ultra-low power needs, the chip supports PMIC Mode and Smart Power Switch Mode. Detailed electrical characteristics and configurations can be found in the official datasheet.
This flexible power architecture allows engineers to choose the most suitable power scheme based on the product’s power source (battery/mains), power budget, and cost considerations.
Practical Selection Checklist
When evaluating the power system, I recommend checking the following points:
- System Voltage Matching: Is the existing power supply within the MCU’s wide voltage range? Do you need extra Level Shifters?
- Analog Precision Needs: If high ADC accuracy is required, did you choose a package with a dedicated VREF+ pin?
- Low Power Strategy: What state is the system in most of the time? Does the current in Stop mode meet the battery life target?
- Wake-up Latency: Can the application tolerate the reset time when waking up from Standby mode?
CPU
In traditional MCUs, the CPU architecture had to be designed and developed by the vendors themselves. Modern MCUs utilize a more efficient business model: chip vendors (like NXP, TI) license processor core IP from suppliers, then integrate their own memory, peripherals, and other functional modules around this standard core.
Mainstream CPU IP suppliers include ARM (offering Cortex-M0/M3/M4/M7/M33 controller cores) and Synopsys (offering ultra-low power cores like the ARC EM series). This model allows chip vendors to focus on differentiating features while benefiting from a mature, stable processor core, a complete software toolchain, and broad ecosystem support.
The global CPU IP market is dominated by four major camps: ARM (~43.6% share), Synopsys (22.5%), Cadence (5.9%), and Alphawave (3.2%). Together, they hold over 75% of the market.
For instance, NXP’s S32K3XX MCU series uses the Arm Cortex-M7 core—a 32-bit CPU running up to 320 MHz, featuring built-in DSP capabilities and a single-precision Floating Point Unit (FPU).
On the other hand, Calterah’s mmWave radar chips, like the Alps series, use the Synopsys ARC EM6 processor core. The ARC EM6 is a 32-bit RISC core designed for ultra-low power embedded applications, also equipped with an FPU. This combination is particularly well-suited for FMCW radar signal processing, capable of handling complex FFT operations, Range-Doppler processing, and target tracking/detection algorithms at low power.
How to Measure CPU Performance
Simply asking “how fast is the CPU” is tricky. We use CoreMark / MHz as the primary performance indicator—the gold standard for evaluating modern MCU computational efficiency.
CoreMark, developed by EEMBC (Embedded Microprocessor Benchmark Consortium), is an open-source benchmark tool. You can download it for free from GitHub and run it on your target MCU to get a concrete score. Compared to the traditional Dhrystone MIPS (DMIPS), CoreMark / MHz better reflects the real-world performance of modern embedded processors across various scenarios.
If the official CPU performance figures are unavailable, I recommend running the CoreMark test yourself on a development board, asking the OEM for data, or consulting public academic/industry data for a preliminary assessment.
- Reference for CoreMark / MHz values across mainstream architectures: WikiChip CoreMark
- Official scores, certification labels, and download: EEMBC CoreMark® Scores
For CPUs without an FPU, processing complex signals or decimal calculations takes significantly more time. A practical trade-off is to use fixed-point arithmetic, sacrificing some precision for speed and cost.
CPU Performance Comparison Table
| MCU Model | CPU Core | Clock Freq | DMIPS/MHz | CoreMark/MHz | Total DMIPS | Total CoreMark | FPU |
|---|---|---|---|---|---|---|---|
| NXP S32K144 | ARM Cortex-M4F | 112 MHz | 1.25 | 3.27 | 140 | 366.6 | Single |
| TI TMS570LS0332 | ARM Cortex-R4 | 80 MHz | 1.66 | 3.47 | 132 | 277.6 | None |
| STM32U585 | ARM Cortex-M33 | 160 MHz | 1.5 | 4.07 | 240 | 651.2 | Single |
| ESP32-S3 (Single Core) | Xtensa LX7 | 240 MHz | 1.3 | 2.56 | 312 | 614.4 | Single |
The table above shows single-core performance. The ESP32-S3-WROOM uses a dual-core architecture (Xtensa LX7 × 2), and in dual-core mode, the total CoreMark measured value is around 1330.
Practical Evaluation: How Long for FFT and FIR?
In embedded systems, a common question during evaluation is: “How long does this MCU take to run a 256-point FFT?” This isn’t just a spec number; it determines if the system meets real-time requirements.
Basic Conversion Formula:
$$\text{Execution Time (µs)} = \frac{\text{Cycles}}{\text{Clock Frequency (MHz)}}$$
However, the actual cycle count is heavily influenced by the instruction set (FPU / DSP), memory latency (Flash Wait States), and compiler optimization. The data below is based on the CMSIS-DSP library under ideal memory conditions.
FFT Benchmark: Real FFT (RFFT) F32 vs. Q31
Many assume fixed-point (Q31) is always faster than floating-point (F32). However, on modern cores with an FPU (M4F / M7), F32 is often faster and much easier to develop with.
| Core Arch | Test Conditions | 256-pt RFFT (F32) | 1024-pt RFFT (F32) | Key Findings |
|---|---|---|---|---|
| Cortex-M7 | @ 320 MHz (e.g., S32K3) | ~8,726 cycles (27.3 µs) | ~36,337 cycles (113.6 µs) | Dual-issue pipeline advantage is significant; F32 beats Q31. |
| Cortex-M4F | @ 168 MHz (e.g., STM32F4) | ~14,285 cycles (85.0 µs) | ~55,538 cycles (330.6 µs) | High price/performance ratio, good for general audio. |
| Cortex-M4F | @ 112 MHz (e.g., S32K144) | ~14,285 cycles (127.6 µs) | ~55,538 cycles (495.9 µs) | Frequency directly impacts final timing. |
| Cortex-M3 | @ 72 MHz (No FPU) | N/A (Extremely Slow) | N/A | Requires software float emulation. Not recommended for real-time DSP. |

Note: If using a Complex FFT (CFFT), the computation load is roughly double that of RFFT.
FIR Filter Performance
Apart from FFT, Finite Impulse Response (FIR) is a common metric. According to ST’s tests on the STM32F7 (M7), the execution efficiency for a 29-tap low-pass filter is: F32 (66 µs) > Q31 (73 µs) > Q15 (99 µs). This further confirms that on high-performance cores, leveraging the FPU for floating-point math is actually superior to traditional 16-bit fixed-point math.
The Silent Killer of Performance: Memory Latency
The figures above are “core running at full speed” ideal values.
- The Problem: Modern MCU core frequencies (e.g., 320 MHz) are far higher than Flash read speeds (~50 MHz). If code runs directly from Flash, the CPU must insert massive Wait States, slowing performance by 3 ~ 7 times.
- The Solution: Crucial algorithms (FFT / FIR) must be moved to Tightly Coupled Memory (TCM) or SRAM to ensure zero-wait-state performance.
Hardware Accelerators
If CPU performance is still insufficient, consider MCUs with built-in hardware accelerators:
- NXP PowerQuad / TI LEA (Low-Energy Accelerator): Computation units independent of the CPU.
- Calterah Alps-Pro BBA (Baseband Accelerator): A radar-specific baseband accelerator optimized for data throughput; vendor material cites equivalent processing times of under 5 µs.
How to Measure It Yourself?
To verify specific timing, I suggest the following methods:
- DWT Cycle Count: ARM Cortex-M cores include a DWT (Data Watchpoint and Trace) unit whose CYCCNT register precisely counts CPU execution cycles. Results may be affected by cache hits and memory access latency, so interpret them based on the system architecture.
- GPIO Toggling: Set a specific GPIO pin (e.g., `GPIO_PIN_X`) high before calling the function and low immediately after, then measure the pulse width with an oscilloscope. This method reflects the true execution time, including all system latencies, and is closest to physical reality.
Memory
In MCU system design, memory architecture not only dictates hardware cost but is also critical to system performance. When selecting parts, you must clearly distinguish between “Volatile Memory (RAM)” and “Non-Volatile Memory (Flash / NVM).”
Simply put: Flash stores code and fixed data; RAM stores variables and temporary data during computation.
Non-Volatile Memory (NVM)
Flash is used for permanent firmware storage and constant data; data persists after power loss.
There are two main categories of Flash architecture:
- NOR Flash: Supports random access, allowing the CPU to read instructions and execute them directly (Execute-In-Place, XIP). This is the standard medium for MCU code storage.
- NAND Flash: High density, low cost, but cannot perform random addressing for code execution. Mainly used for file system storage of large data.
To balance cost and durability, some MCUs (like NXP S32K, Infineon Traveo series) partition internal Flash into:
- Program Flash (P-Flash): Main program storage. Optimized for code storage with fast reads but limited write endurance (~1k ~ 10k cycles).
- Data Flash (D-Flash): Data storage. Optimized for data logging with high write endurance (100k+ cycles), often used to simulate EEPROM for parameter storage.
Note: Not all MCUs have hardware D-Flash partitions. Many general-purpose MCUs only provide a single block of Flash, requiring specific drivers and reserved areas to simulate EEPROM. Also, endurance figures are just grade concepts; refer to the datasheet for actual specs.
How does the CPU execute code stored in Flash? There are two main modes:
XIP (Execute-In-Place):
- Mechanism: CPU reads instructions directly from internal Flash without copying to RAM.
- Status Quo: This is the default operating mode for the vast majority of modern MCUs.
- Pros/Cons: Saves a lot of RAM and achieves zero-copy startup; however, Flash read speeds are slower, usually requiring an Instruction Cache (I-Cache) to boost performance.
Code in RAM:
- Mechanism: Copying code from Flash to RAM for execution.
- Use Case: Typically optimized only for specific performance-sensitive functions (like Interrupt Service Routines (ISR), DSP loops) to utilize RAM’s zero-wait characteristics; or when the system uses external storage (NAND / NOR Flash) with SDRAM, requiring the whole program to be moved.
- Misconception: General MCUs don’t move the “entire” program to SRAM; that would waste precious SRAM resources.
Volatile Memory (RAM)
RAM is used for variables, stack, and heap. Data is lost when power is cut. Common RAM types include:
| Type | Full Name | Characteristics & Positioning | Suitable Scenarios |
|---|---|---|---|
| SRAM | Static RAM | Fastest, Highest Cost. Mainstream internal MCU memory. No refresh needed, syncs with CPU (zero wait). | Core computation, Stack, critical variables. |
| DRAM | Dynamic RAM | High Density, Low Cost. Needs periodic Refresh. Higher latency. Usually external expansion. | Embedded Linux, large video buffers. |
| PSRAM | Pseudo-Static RAM | The Middle Ground. Internal DRAM architecture (high density) but interface mimics SRAM (easy control). | AIoT model weights, audio buffers, GUI. |
- Lots of Code (Large Code Size) $\rightarrow$ Need larger Flash. As long as the MCU supports XIP, more code only consumes Flash, not SRAM. (Note: this refers to `.text` and `.rodata`, excluding `.data` and code deliberately moved to RAM.)
- Lots of Variables (Large Data Size) $\rightarrow$ Need larger SRAM. When internal SRAM isn’t enough for large arrays, AI models, or screen buffers, you need to attach PSRAM for expansion.
Practical Evaluation: Is the Memory Enough?
To accurately evaluate memory needs, analyze the compiled Memory Segments.
After compilation, the toolchain’s `size` command outputs a summary like this (figures are illustrative):

```
   text    data     bss     dec     hex filename
  84120    1504   12288   97912   17e78 firmware.elf
```
The meanings of the sections are:
- `.text` (Code): Code and read-only constants (`.rodata`). $\rightarrow$ Occupies Flash.
- `.data` (Init Data): Initialized global variables (e.g., `int a = 10;`). $\rightarrow$ Occupies Flash (initial values) + RAM (at runtime).
- `.bss` (Zero Data): Uninitialized or zero-valued variables (e.g., `int b;`). $\rightarrow$ Occupies RAM only.
The formulas for estimating Flash and SRAM needs are:
Flash Usage Estimation: $$ \text{Flash Usage} \approx \text{.text} + \text{.data} $$
- Recommendation: Reserve 20% space for future OTA updates or feature expansions.
SRAM Usage Estimation: $$ \text{SRAM Usage} = (\text{.data} + \text{.bss}) + \text{Stack} + \text{Heap} $$ SRAM usage consists of:
- Static Usage: `.data` + `.bss` (fixed after compilation).
- Dynamic Usage:
  - Stack: Function calls and local variables. Estimate the deepest call chain (worst case).
  - Heap: `malloc` dynamic allocation; minimize its use in embedded systems.
- Recommendation: Keep a 30% safety margin to avoid stack overflow.
When evaluating the impact of large arrays on SRAM, calculate their byte size directly:
Basic Example: Declaring a buffer for ADC sampling: `uint16_t raw_data[1024];`
- Calculation: $1024 \text{ (elements)} \times 2 \text{ (bytes/element)} = 2048 \text{ bytes} = 2 \text{ KB}$
- This array will directly consume 2 KB of SRAM.
Image Example: A 320 × 240 RGB565 image
- Calculation: $320 \times 240 \times 2 \text{ bytes} = 153,600 \text{ bytes} \approx 150 \text{ KB}$
- This is clearly too much for an MCU with only 128 KB SRAM; consider external PSRAM.
Advanced Example: SRAM Needs for a 4096-point FFT

In signal processing, FFT is a memory hog. For a 4096-point complex FFT with input type `complex float` (real `float` + imaginary `float` = 8 bytes):
- Basic Need (in-place buffer): If the algorithm supports in-place operation (output overwrites input), only one buffer is needed: $ 4096 \times 8 \text{ bytes} = 32 \text{ KB} $
- Extra Scratch Buffer: If the algorithm requires an extra work area of the same size, the total demand becomes: $ 2 \times 4096 \times 8 \text{ bytes} = 64 \text{ KB} $
Many FFT functions in the CMSIS-DSP library—notably the complex FFT (`arm_cfft_f32`)—compute in place (the buffer argument is marked `[in,out]`), meaning the result directly overwrites the input buffer. In that case, only 32 KB of SRAM is needed to complete a 4096-point floating-point FFT, with no extra output buffer required. (The real-FFT variants do take a separate output buffer, so check the signature of the function you actually use.)
Memory Optimization Strategies
When you find RAM is running low, consider these strategies:
- Use `const`: Declare lookup tables or fixed parameters as `const` to force them into Flash (`.rodata`), freeing up RAM.
- Time vs. Space: Don’t load large data into RAM all at once; read and process it from Flash in chunks.
- Optimize Data Types: If `uint8_t` works, don’t use `int16_t`. It adds up.
- Hardware Expansion: If software optimization hits a wall, choose an MCU that supports external PSRAM.
I/O and Communication Capabilities
With process miniaturization, the number of Peripherals integrated inside MCUs has grown exponentially, but the Pin Count is limited by physical size and cost. This has made “Pin Multiplexing (PinMux)” a standard feature of modern MCUs.
PinMux isn’t just software definition; it involves physical MOSFET switches. Each switch has On-Resistance ($R_{ON}$) and Parasitic Capacitance ($C_{PAR}$). The more multiplexed functions a pin has, the larger the capacitance on the node, which limits signal bandwidth. Therefore, when selecting parts, prioritize dedicated pins or those with fewer multiplexed functions for high-speed interfaces (like RGMII, High-Speed SPI) to ensure signal integrity.
Evolution of Configuration Methods
Faced with complex PinMux (electrical attributes, conflict detection), manually scouring thousands of pages of reference manuals to hand-write `pin_mux.c` is outdated and error-prone.
Traditional Method: Datasheet and Register Hell

Early developers had to manually look up the base address and offset for each pin, calculating the correct bit masks to write to registers. For example, setting a UART TX pin might require configuring the clock gate, MUX mode, drive strength, and pull-up resistors simultaneously. If a hardware revision changed a pin, every line of related code had to be checked and fixed.
Modern Method: GUI Config Tools

Modern MCU vendors provide powerful GUI configuration tools, such as TI SysConfig, NXP S32 Configuration Tools, or STM32CubeMX. Engineers simply specify the pin function in the GUI (e.g., mapping `PTA12` to `CAN0_TX`), and the tool automatically detects the available options and instantly blocks potential pin conflicts. Once confirmed, these configurations are saved as proprietary config files (like `.syscfg`, `.mex`, or `.ioc`) and automatically parsed by the toolchain during compilation to generate the corresponding `.c` and `.h` driver code (like `ti_drivers_config.c`). Ultimately, the developer only needs to call top-level initialization functions like `Board_init()` or `PINS_DRV_Init()` in the application layer to drive the hardware, completely bypassing the tedious process of manually modifying low-level registers. If the hardware changes later, just update the GUI settings and recompile.
Electrical and Peripheral Configuration
Beyond basic pin routing, the real power of modern GUI tools is visualizing complex “electrical characteristics” and “peripheral parameters,” allowing engineers to complete settings like filling out a form without flipping through register definitions.
Modern MCU I/O controllers (like the NXP PCR module) offer granular electrical control, typically including:
- Direction: Set pin as Input or Output.
- Resistors:
- Pull-up / Pull-down: Enable internal resistors to save external components.
- High-Z / Floating: High impedance mode, suitable for analog inputs or power saving.
- Drive Strength:
- High Drive: Drives large capacitive loads or long traces, but emits stronger EMI.
- Low Drive: Reduces EMI, suitable for short traces or low-speed signals.
- Slew Rate: Controls the steepness of signal edges. Unless running high frequencies, set to Slow to reduce ringing.
- Passive Filter: Enables hardware debounce, suitable for button inputs.
The tools not only configure pins but also generate initialization code for peripherals, such as:
- UART: Baud rate (e.g., 115200), Parity (None / Even / Odd), Stop bits.
- PWM: Frequency, Duty cycle, Dead-time.
- CAN / CAN-FD:
- FD Enable: One-click CAN-FD support.
- Bitrate: Separate settings for Arbitration rate (e.g., 500 kbps) and Data rate (e.g., 2 Mbps).
This “WYSIWYG” (What You See Is What You Get) configuration approach significantly lowers the barrier to entry for low-level driver development.
Linux vs. MCU Configuration Differences
When the system shifts from an MCU to an SoC running Linux, the logic of hardware configuration changes fundamentally.
MCU (Bare-metal / RTOS): Configuration is Static Compilation. Generated C code writes directly to registers. Drivers and hardware config are tightly coupled.
Linux Kernel (Device Tree & Pinctrl): Configuration is Declarative. Drivers contain no hardware location info; instead, the board’s hardware structure is described in the Device Tree (DT).
- Pinctrl Subsystem: The Linux kernel abstracts pin control. Drivers simply request a state (like “default” or “sleep”), and the Pinctrl subsystem operates the low-level registers based on the description in the Device Tree Source (DTS). This achieves complete decoupling of drivers and board-level configuration.
Linux CAN Bus in Action
On an MCU, we are used to directly manipulating CAN registers or calling SDK functions like `CAN_Send()`. But in Linux, CAN is treated as a network interface—this is the famous SocketCAN architecture.
Steps to use SocketCAN:
- DTS Configuration (`.dts`)

  First, define the CAN controller and its pins in the DTS. Node and pin-group names vary by SoC; a typical (illustrative) FlexCAN node looks like this:

  ```dts
  &flexcan1 {
      pinctrl-names = "default";
      pinctrl-0 = <&pinctrl_flexcan1>;
      status = "okay";
  };
  ```
- Enable Interface

  After Linux boots, you won’t see a function like `CAN_Init()`. Instead, use standard network commands:

  ```shell
  # Configure the bitrate and bring the interface up
  ip link set can0 type can bitrate 500000
  ip link set can0 up
  ```
- Send and Receive Data

  The application layer doesn’t need to know if the underlying hardware is SPI-to-CAN or native CAN; it uses the unified Socket API. Common debug tools come from `can-utils`:
  - Receive data (dump): `candump can0`
  - Send ID 123, data DEADBEEF: `cansend can0 123#DEADBEEF`

This abstraction gives applications extremely high portability—regardless of how the underlying hardware changes, the upper-level software requires almost no modification.
Conclusion
Choosing an MCU is not just a component selection; it is the strategic cornerstone of systems engineering. Success requires moving beyond high-level specs to a deep datasheet analysis of power architectures, memory hierarchies, and I/O constraints.
Engineers must balance cost against capability:
- Real-time Processing: Prioritize CPU throughput, zero-wait memory, and accelerators.
- Battery Power: Focus on granular sleep modes and wake-up latency.
- System Integration: Evaluate PinMux flexibility and ecosystem maturity.
Ultimately, don’t just pick a chip that works—select a foundation that secures your architecture, minimizes development risk, and ensures a smooth path from prototype to mass production.
References
- System On Chip (SoC) Companies - Top Company List
- Design IP Market Increased by All-time-high: 20% in 2024! - SemiWiki
- Calterah Semiconductor’s Automotive Radar SoC Enters Mass Production Using Synopsys DesignWare ARC Processor IP
- text, data and bss: Code and Data Size Explained | MCU on Eclipse
- The DSP capabilities of ARM® Cortex®-M4 and Cortex-M7 Processors DSP feature set and benchmarks
- S32K3xx Data Sheet
- Measuring Cortex-M4 instruction clock cycle counts
- Exclusive: Huawei unit ships Chinese-made surveillance chips in fresh comeback sign | Reuters
- KW45 Data Sheet
- KW45 Power Management Application Note
- K32W148-EVK User Manual
- Getting Started with the i.MX 8M Plus EVK
