

[Review] G. Vavouliotis, Exploiting Page Table Locality for Agile TLB Prefetching, ISCA 2021

1 Brief Summary

1.1 Motivation

TLB prefetching has large room for performance improvement: it can ideally achieve more than a 1.2x speedup for several workloads (the QMM and BD workloads, for example).

Figure 3 from the paper

When a TLB miss occurs, the PTE is obtained through a page table walk. During the walk, the memory hierarchy operates at a 64B cache-line granularity, whereas a single PTE is only 8B; every walk therefore brings in the 7 neighboring PTEs of the demanded one for free. This is the PTE locality we can exploit: those 7 neighbors can be prefetched without consuming additional memory bandwidth. A naive approach would simply insert all 7 neighboring PTEs into the PQ (prefetch queue). However, TLB prefetching is fundamentally limited by the PQ size, so a smarter method of exploiting PTE locality within that constraint is required.
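To make the arithmetic concrete, here is a minimal C sketch (the names and example VPN are mine, not the paper's) that, given the virtual page number of a demand translation, enumerates the 7 neighboring PTEs sharing its 64B cache line and their free distances. Across the 8 possible slot positions, the free distance ranges over -7 to +7, which is where the 14 possible distances in Section 1.3.1 come from.

```c
#include <stdio.h>
#include <stdint.h>

#define PTE_SIZE       8                       /* bytes per PTE              */
#define LINE_SIZE      64                      /* bytes per cache line       */
#define PTES_PER_LINE  (LINE_SIZE / PTE_SIZE)  /* 8 PTEs arrive per walk     */

/* Print the same-cache-line neighbors of a demand PTE. The demand PTE
 * occupies slot (vpn % 8) in its line, so its 7 neighbors sit at "free
 * distances" within [-7, +7] excluding 0; across all 8 possible slots,
 * 14 distinct free distances are possible. */
static void free_neighbors(uint64_t demand_vpn)
{
    uint64_t base = demand_vpn - (demand_vpn % PTES_PER_LINE);
    for (int slot = 0; slot < PTES_PER_LINE; slot++) {
        uint64_t vpn = base + (uint64_t)slot;
        if (vpn == demand_vpn)
            continue;                          /* skip the demand PTE itself */
        long long dist = (long long)vpn - (long long)demand_vpn;
        printf("  VPN 0x%llx at free distance %+lld\n",
               (unsigned long long)vpn, dist);
    }
}

int main(void)
{
    free_neighbors(0x1234d);   /* slot 5 -> free distances -5 .. +2 */
    return 0;
}
```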

1.2 Key Idea

The solution consists of two main parts.

(1) The paper proposes a new TLB prefetching technique, SBFP (Sampling-Based Free TLB Prefetching), which can be combined with existing prefetchers.

(2) The paper proposes a dynamic TLB prefetching scheme, ATP (Agile TLB Prefetcher), which adapts to the program context. ATP combines several TLB prefetchers, including the new SBFP, and dynamically decides both whether to prefetch TLB entries at all and, if so, which TLB prefetcher to use.

1.3 Design Details

1.3.1 SBFP (Sampling-Based Free TLB Prefetching)

As shown in Figure 5, there are 14 possible neighbor positions (free distances) with respect to a demand TLB entry. The FDT (Free Distance Table) is a set of 14 counters, each of which learns how likely the corresponding neighbor is to be a useful prefetch.

Figure 5, 6 from the paper

On every page table walk triggered by a TLB miss, each neighbor of the demand TLB entry is stored in either the PQ or the Sampler based on its FDT counter. If the counter is larger than a threshold, the neighbor is inserted into the PQ as a free prefetch; otherwise it is stored in the Sampler, a new on-chip buffer that records only the free distance.

 

The FDT counter is incremented on both PQ hits and Sampler hits, so that it tracks how frequently each neighbor position has been accessed in the recent phase. To avoid saturation, there is a decay scheme, which is triggered when any FDT counter saturates.
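A minimal sketch of the FDT logic described above, assuming 3-bit saturating counters, a promotion threshold of 4, and halving on decay; these parameters and the helper names are my assumptions, not values from the paper.

```c
#include <stdint.h>

#define NUM_DISTANCES  14   /* free distances -7..-1 and +1..+7            */
#define CTR_MAX        7    /* assumed 3-bit saturating counters           */
#define THRESHOLD      4    /* assumed promotion threshold                 */

static uint8_t fdt[NUM_DISTANCES];  /* the Free Distance Table             */

static int fdt_index(int dist)      /* map -7..-1, +1..+7 onto 0..13       */
{
    return dist < 0 ? dist + 7 : dist + 6;
}

/* Stand-ins for the real PQ and Sampler; in hardware these would enqueue
 * into small on-chip buffers. */
static void pq_insert(uint64_t vpn, uint64_t pfn)  { (void)vpn; (void)pfn;  }
static void sampler_insert(uint64_t vpn, int dist) { (void)vpn; (void)dist; }

/* Placement: on each page table walk, a neighbor whose free distance has
 * a counter above the threshold becomes a free prefetch in the PQ; the
 * rest are only tracked in the Sampler. */
void sbfp_place(uint64_t vpn, uint64_t pfn, int dist)
{
    if (fdt[fdt_index(dist)] > THRESHOLD)
        pq_insert(vpn, pfn);
    else
        sampler_insert(vpn, dist);
}

/* Training: a hit in either the PQ or the Sampler credits the free
 * distance that produced the entry. When any counter saturates, all
 * counters decay (halved here) so the FDT reflects the recent phase. */
void sbfp_train(int dist)
{
    if (++fdt[fdt_index(dist)] >= CTR_MAX)
        for (int i = 0; i < NUM_DISTANCES; i++)
            fdt[i] >>= 1;
}
```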

 

SBFP can be combined with any existing prefetching mechanism by inserting the neighbors of each prefetched TLB entry into the PQ or the Sampler in the same way, as the sketch below illustrates.
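Continuing the sketch above, the integration amounts to invoking the same placement step on the neighbors of every prefetched entry; the on_prefetch_fill hook below is hypothetical and reuses sbfp_place from the previous sketch.

```c
/* Called when an underlying prefetcher completes a page table walk for a
 * prefetched VPN: treat the prefetched entry like a demand entry and
 * place its 7 same-line neighbors via SBFP. pfn_line[] holds the 8
 * translations brought in by the 64B walk. */
void on_prefetch_fill(uint64_t pf_vpn, const uint64_t pfn_line[8])
{
    uint64_t base = pf_vpn - (pf_vpn % 8);
    for (int slot = 0; slot < 8; slot++) {
        uint64_t vpn = base + (uint64_t)slot;
        if (vpn == pf_vpn)
            continue;
        sbfp_place(vpn, pfn_line[slot],
                   (int)((int64_t)vpn - (int64_t)pf_vpn));
    }
}
```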

1.3.2 ATP (Agile TLB Prefetching)

Figure 7 from the paper

ATP decides whether or not to issue a prefetch request and, when it does decide to prefetch, which prefetcher to use.

 

Each prefetcher has its own FPQ (Fake Prefetch Queue), which stores the virtual pages that prefetcher would prefetch if permitted; only virtual page numbers are kept, not physical ones, hence "fake". On each TLB miss, per-prefetcher counters are updated based on whether the missing page hits in each FPQ, and ATP uses these counters to decide on the prefetching mechanism, as sketched below.
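A rough sketch of how such FPQ-based selection could work, assuming one saturating accuracy counter per prefetcher and a simple argmax with a disable case; the concrete counter scheme and sizes below are my guess at the mechanism, not the paper's exact logic.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_PREFETCHERS 3   /* however many component prefetchers ATP has */
#define FPQ_SIZE        8   /* assumed fake-prefetch-queue depth          */
#define ACC_MAX         15  /* assumed 4-bit saturating accuracy counters */

/* Each component prefetcher keeps a Fake Prefetch Queue holding only the
 * virtual pages it *would* have prefetched; no page table walk is done
 * for them, hence "fake". */
struct fpq {
    uint64_t vpn[FPQ_SIZE];
};

static struct fpq fpqs[NUM_PREFETCHERS];
static uint8_t    accuracy[NUM_PREFETCHERS];

static bool fpq_contains(const struct fpq *q, uint64_t vpn)
{
    for (int i = 0; i < FPQ_SIZE; i++)
        if (q->vpn[i] == vpn)
            return true;
    return false;
}

/* On a TLB miss: credit every prefetcher whose FPQ predicted the missing
 * page, debit the rest, then choose the best-performing prefetcher.
 * Returns -1 when no prefetcher is doing well, i.e., do not prefetch. */
int atp_select(uint64_t miss_vpn)
{
    int best = -1;
    for (int p = 0; p < NUM_PREFETCHERS; p++) {
        if (fpq_contains(&fpqs[p], miss_vpn)) {
            if (accuracy[p] < ACC_MAX) accuracy[p]++;
        } else if (accuracy[p] > 0) {
            accuracy[p]--;
        }
        if (accuracy[p] > ACC_MAX / 2 &&
            (best < 0 || accuracy[p] > accuracy[best]))
            best = p;
    }
    return best;
}
```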

 

2 Strength

(1) The paper effectively shows that there is substantial headroom for performance improvement by exploiting PTE locality in TLB prefetching.

(2) It is a flexible solution in that it can be integrated into any existing prefetching scheme.

 

3 Weakness

Figure 3 and 8 from the paper

(1) Figure 3 shows the upper bound on the speedup achievable by exploiting PTE locality on top of an existing TLB prefetcher. Figure 8 shows the actual speedup achieved with SBFP, which prefetches only a portion of the demand PTE's neighbors due to the constraint on PQ size. With sampling, SBFP achieves nearly ideal speedup for the SPEC workloads, but it leaves a significant gap from the ideal speedup for the BD workloads; sampling is not always effective at approximating the full set of neighbors of the demand PTE.

(2) ATP achieves a higher speedup than any single prefetcher. However, this comes with a significant area overhead, because the metadata structures of the component prefetchers, such as history tables, accumulate. The speedup would be more compelling if the prefetchers' metadata structures were integrated into a more efficient, compact form.