Paper Reading (23)

[Micro '23] K. Kanellopoulos, Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources 1 Background 1.1 Virtual memory & Page table (PT) Virtual memory designs allow any mapping from a virtual page to a physical page. The OS keeps a PT, a per-process data structure that records the virtual-to-physical mappings of the process. The PT is organized as a 4-level radix tree as shown in Figure 1, and the system sequentially accesses each level to find the corresponding physical ..
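A minimal sketch of the 4-level walk described in this excerpt, assuming x86-64-style field widths (four 9-bit indices plus a 12-bit offset for 4KB pages) and modeling each level as a Python dict:

```python
def walk_page_table(root, vaddr):
    """Walk a 4-level radix page table; each level is a dict mapping
    a 9-bit index to the next-level node (or, at the leaf, to a PFN)."""
    offset = vaddr & 0xFFF                 # bits 0-11: offset in the 4KB page
    node = root
    for level in range(4):                 # one memory access per level
        shift = 39 - 9 * level             # index bits: 47-39, 38-30, 29-21, 20-12
        node = node.get((vaddr >> shift) & 0x1FF)
        if node is None:
            raise KeyError("page fault: mapping not present")
    return (node << 12) | offset           # leaf holds the physical frame number
```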
[ISCA '21] A. Naithani, Vector Runahead 1 Motivation & Key Idea 1.1 Background: Memory stalls in OoO processors When a load instruction misses in the last-level cache (LLC), it often reaches the head of the reorder buffer (ROB), causing the processor to stall once the instruction window fills up. This stall typically lasts for tens to hundreds of cycles and becomes one of the main performance bottlenecks. 1.2 Motivation: L..
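Back-of-envelope arithmetic for the stall in this excerpt (the numbers below are illustrative assumptions, not figures from the paper):

```python
ROB_SIZE    = 256   # reorder buffer entries (assumed)
ISSUE_WIDTH = 4     # instructions dispatched per cycle (assumed)
MISS_CYCLES = 300   # LLC-miss latency to DRAM in cycles (assumed)

fill_cycles  = ROB_SIZE // ISSUE_WIDTH    # cycles until the window is full
stall_cycles = MISS_CYCLES - fill_cycles  # cycles the core then sits idle
print(fill_cycles, stall_cycles)          # 64 236: most of the miss is pure stall
```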
[HPCA '03] O. Mutlu, Runahead Execution 1 Motivation & Key Idea 1.1 Instruction window in OoO processor Out-of-Order (OoO) execution can tolerate long-latency cache misses better than in-order execution by scheduling subsequent instructions that are independent of the miss. Still, a long-latency operation blocks the instruction window even if subsequent instructions have completed execution. This is because the instruction window should ensur..
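A toy model of why runahead-style execution helps (my assumptions, not the paper's numbers): without it, two independent LLC misses serialize once the window fills; with runahead, the second miss is discovered and overlapped during the first miss's stall.

```python
MISS = 300   # assumed DRAM latency in cycles
GAP  = 50    # cycles of independent work between the two missing loads

baseline = MISS + GAP + MISS   # second miss only issues after the first returns
runahead = GAP + MISS          # runahead issues it during the first stall
print(baseline, runahead)      # 650 vs. 350 cycles for the same two misses
```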
[NeurIPS '22] X. Wei, Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models 1 Introduction Quantization with low-precision arithmetic has been studied extensively to reduce the memory and computation overhead of Transformer-based models. Transformer-based models are known to contain outliers, and these outliers show structured patterns (for example, they cluster in particular embedding dimensions). The presence of outliers severely damages quantization performance; one existing approach is to adopt a finer quantization granularity, but this has the limitation of increasing computation cost instead ..
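A toy illustration (my example, not the paper's data) of why a few outliers wreck coarse-grained quantization: one large value stretches the quantization scale so that the remaining values collapse onto a handful of integer levels.

```python
import numpy as np

x = np.array([0.01, -0.02, 0.03, 0.015, 60.0])  # one outlier dominates
scale = np.abs(x).max() / 127                   # symmetric per-tensor INT8 scale
x_q = np.round(x / scale).clip(-127, 127)       # quantize
print(x_q)          # [  0.  -0.   0.   0. 127.]: non-outliers all round to 0
print(x_q * scale)  # dequantized: every non-outlier entry is lost
```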
[Review] G. Vavouliotis, "Page Size Aware Cache Prefetching", Micro 2022 1 Brief Summary 1.1 Motivation Existing cache prefetchers keep their metadata structures (for example, the history table) in units of 4KB pages and stop prefetching if the predicted delta crosses the 4KB page boundary. This is because going beyond the 4KB boundary does not guarantee physical contiguity. Cache prefetchers typically reside below the L2 hierarchy, where the address translation is alread..
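A minimal sketch of the boundary check this excerpt describes, with a hypothetical delta-prefetcher interface (byte-granularity deltas over physical addresses):

```python
PAGE_SHIFT = 12  # 4KB pages

def issue_prefetches(paddr, deltas):
    """Keep only prefetch targets inside the current 4KB physical page;
    beyond the boundary, physical contiguity is not guaranteed."""
    page = paddr >> PAGE_SHIFT
    return [paddr + d for d in deltas
            if (paddr + d) >> PAGE_SHIFT == page]
```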
[Review] G. Vavouliotis, Exploiting Page Table Locality for Agile TLB Prefetching, ISCA 2021 1 Brief Summary 1.1 Motivation TLB prefetching has large room for improving performance: it can ideally achieve more than 1.2x speedup for several workloads (QMM and BD workloads are examples). When a TLB miss occurs, the PTE is obtained with a page table walk. During the page table walk through the memory hierarchy, the granularity of a memory operation is 64B, which is the cache line size, whereas a ..
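The arithmetic behind that observation, assuming 8B PTEs as on x86-64 (the excerpt is cut off before stating the PTE size): each walk fetches a full 64B line and therefore implicitly brings in the translations of neighboring virtual pages.

```python
CACHE_LINE_BYTES = 64  # granularity of a memory operation during the walk
PTE_BYTES = 8          # x86-64 page table entry size (assumed)

ptes_per_line = CACHE_LINE_BYTES // PTE_BYTES
print(ptes_per_line, ptes_per_line - 1)  # 8 PTEs per line -> 7 free neighbors
```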
[Review] M. Jalili, Reducing Load Latency with Cache Level Prediction, HPCA 2022 1 Brief Summary 1.1 Problem/Motivation As the number of cache hierarchy levels increases, level-wise sequential cache lookup results in higher load latency. A good cache system should reduce the number of misses significantly from one level to the next, but their experiments show that, for many workloads, certain cache levels cannot reduce the number of misses effectively. (In figure..
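An illustrative latency model (my numbers, not the paper's) for why sequential lookup hurts loads that miss everywhere: each level's lookup latency is paid before the next level is probed.

```python
LATENCY = {"L1": 4, "L2": 14, "L3": 40, "DRAM": 200}  # cycles, assumed

def sequential_load(hit_level):
    """Accumulate lookup latency level by level until the hit."""
    total = 0
    for level in LATENCY:
        total += LATENCY[level]
        if level == hit_level:
            break
    return total

print(sequential_load("DRAM"))  # 258 cycles, vs. 204 if L2/L3 were skipped
```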
[HPCA '22] M. Jalili, Reducing Load Latency with Cache Level Prediction 2 Motivation Miss analysis on workloads If the cache system exploits locality well and the prefetcher reduces loads effectively, then the number of load misses should drop sharply at higher cache levels when an application runs. Figure 3(a) is an example where the conventional cache system works well: the miss count shrinks by 3x and then 2x going from L1 -> L2 -> L3. The remaining five workloads show various situations in which the conventional level-wise sequential cache access is inefficient. In case (b), the L2 cache fails to reduce misses and is therefore inefficient, and L3 does not reduce misses much compared to L2 either. In case (c), the L..
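A small sketch of the miss-reduction factors this excerpt reads off Figure 3(a); the absolute counts below are made up to mirror the described x3/x2 shape:

```python
misses = {"L1": 6_000_000, "L2": 2_000_000, "L3": 1_000_000}  # assumed counts

levels = list(misses)
for upper, lower in zip(levels, levels[1:]):
    print(f"{upper} -> {lower}: misses reduced x{misses[upper] / misses[lower]:.0f}")
# L1 -> L2: misses reduced x3
# L2 -> L3: misses reduced x2
```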