With the rising complexity of today’s mobile devices and embedded systems, developers face an increasingly challenging task: designing efficient memory subsystems that maximize system performance. NOR Flash often holds the boot code, operating system kernel, device drivers, middleware, and other application-specific software, which can amount to megabytes of program code stored in non-volatile Flash memory.
Where performance is essential, these programs are copied from non-volatile memory to faster RAM for execution. However, where device size and cost are critical, an alternative approach known as Execute-in-Place (XiP), in which programs are executed directly from non-volatile memory, is becoming increasingly popular.
Executing Code from NOR Flash
When using XiP, the non-volatile memory subsystem is constantly accessed to retrieve program code and may introduce memory bottlenecks into the primary execution path. Understanding the system architecture is critical to identifying the factors that affect memory performance and, in turn, system performance.
System performance is often measured as the number of instructions per cycle (IPC). A CPU that requires 4 cycles to execute an instruction has an ideal IPC of 0.25, but many factors influence the actual IPC. One in particular, the cache miss, is critical for XiP: a cache miss stalls the CPU while the instruction is fetched from memory, resulting in a lower IPC. Fortunately, because program code exhibits “locality of reference,” systems with level 1 and level 2 caches can achieve cache hit rates over 99%.
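The relationship between cache misses and IPC can be sketched with a simple stall model. The Python snippet below is a minimal illustration, not the method behind the figures in this article. It assumes that every instruction needs one fetch, that each miss adds a fixed penalty, and that stalls do not overlap with execution. The function name, the 1% miss rate, and the 20-cycle penalty are hypothetical values chosen only to show the arithmetic.

```python
def effective_ipc(base_cpi, miss_rate, miss_penalty_cycles):
    """Estimate IPC under a simple, non-overlapping stall model.

    base_cpi            -- cycles per instruction when every fetch hits the cache
    miss_rate           -- fraction of instruction fetches that miss (0.0 to 1.0)
    miss_penalty_cycles -- CPU cycles stalled per miss (assumed constant)
    """
    effective_cpi = base_cpi + miss_rate * miss_penalty_cycles
    return 1.0 / effective_cpi


# A CPU that needs 4 cycles per instruction has an ideal IPC of 0.25.
ideal = effective_ipc(base_cpi=4, miss_rate=0.0, miss_penalty_cycles=0)

# Hypothetical case: 1% miss rate, 20-cycle penalty (illustrative values only).
with_misses = effective_ipc(base_cpi=4, miss_rate=0.01, miss_penalty_cycles=20)

print(f"ideal IPC:       {ideal:.4f}")
print(f"IPC with misses: {with_misses:.4f}")
print(f"IPC reduction:   {100 * (1 - with_misses / ideal):.1f}%")
```

Even in this simplified model, the miss rate and the miss penalty multiply, which is why a memory with a longer fill time can remain acceptable as long as the hit rate stays high.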
Because system performance is affected by the ability of the memory subsystem to fill the cache when there is a cache miss, there are several factors to consider:
- Read Bandwidth: A high-bandwidth bus is needed to minimize overall read latency, even though only a single cache line (typically 32 bytes) is read at a time. In addition, the nature of application programs requires the ability to make small, fast memory accesses throughout the entire code region with minimum latency. Read bandwidth varies across bus interfaces and operating frequencies and must be balanced against pin count. Consider the performance of a low-pin-count SPI-DDR NOR with an initial access time of 120 ns: it significantly outperforms Async Parallel NOR and is comparable to Page Mode NOR. While Burst Mode Parallel NOR has the highest bandwidth, its advantage over SPI-DDR is minimized in a cache-based system.
- Controller Latency: Initiating a read command incurs controller latency from address and protocol overhead, measured from the time the command is sent to the controller to the time the controller returns the first byte of data. Controller latency is higher for SPI-DDR NOR, primarily due to the serialization of the command and address information required at the beginning of an SPI transaction. This performance gap closes significantly as the memory bus frequency increases. In many mobile and embedded systems, a sub-200 ns controller latency provides adequate performance and allows SPI-DDR to be considered a viable alternative to Parallel NOR.
- Instant and Average CPU Stall Times: When the next instruction to execute is not available in the cache, it must be loaded from memory. The impact of the instant delay on system responsiveness depends on how often the cache misses; if the miss rate is very low, the system can usually tolerate a relatively high instant delay. The impact of stall time on system performance depends on the CPU clock frequency. For CPU operating frequencies from 100 MHz to 166 MHz, SPI-DDR also provides an acceptable stall response when compared with both Burst and Page NOR. When comparing SPI-DDR to Burst Mode devices, a system developer needs to consider whether the additional pins (30+) required for the higher-performance Burst Mode interface are a worthwhile tradeoff.
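The factors above can be combined into a rough estimate of how long one cache-line fill takes and how many CPU cycles it costs. The sketch below is a hedged, worst-case approximation and is not intended to reproduce the figures quoted in this article. It assumes that the whole line must arrive before the CPU resumes (no critical-word-first), that the initial access time covers all command and address overhead, and that the SPI-DDR device moves one byte per clock (4 I/O lines, both clock edges). The function names and the one-byte-per-clock figure are assumptions; the 120 ns, 32-byte, 66 MHz, and 166 MHz values come from the text.

```python
import math


def line_fill_ns(initial_latency_ns, line_bytes, bus_mhz, bytes_per_clock):
    """Time to fetch one cache line: initial access plus sequential transfer."""
    clock_ns = 1000.0 / bus_mhz
    transfer_clocks = math.ceil(line_bytes / bytes_per_clock)
    return initial_latency_ns + transfer_clocks * clock_ns


def stall_cycles(fill_ns, cpu_mhz):
    """CPU cycles lost while waiting fill_ns, rounded up to whole cycles."""
    return math.ceil(fill_ns * cpu_mhz / 1000.0)


# SPI-DDR example: 120 ns initial access, 32-byte line, 66 MHz memory bus.
# bytes_per_clock=1 is an assumption (quad I/O, double data rate).
fill = line_fill_ns(initial_latency_ns=120, line_bytes=32,
                    bus_mhz=66, bytes_per_clock=1)

# The same fill time costs more CPU cycles as the CPU clock rises.
for cpu_mhz in (100, 133, 166):
    print(f"{cpu_mhz} MHz CPU: {fill:.0f} ns fill, "
          f"{stall_cycles(fill, cpu_mhz)} stall cycles")
```

Raising `bus_mhz` shrinks the transfer term, while a wider interface raises `bytes_per_clock`; comparing the two is one way to weigh a faster serial bus against the extra pins of a parallel interface.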
So what is the overall effect these factors have on a system’s IPC?
A typical mobile or embedded system has a cache miss rate of less than 1%. In a system with a CPU operating at 166 MHz, a 66 MHz memory bus, and a cache miss rate of 0.5%, both Burst Parallel NOR and SPI-DDR NOR have a minimal impact on IPC of 1 to 2%. With a higher cache miss rate of 1%, Burst Parallel NOR has the advantage, reducing IPC by only 6% compared to 12% for SPI-DDR NOR.
In high-performance systems, Burst Parallel NOR will continue to be the preferred solution; for slightly lower-performance systems, however, SPI-DDR provides an attractive, low-pin-count alternative.