

Customizing tmux With tmux, you can create a session once and later reattach to it and continue working where you left off. You can also create multiple windows and switch conveniently between tasks such as editing in vim and running programs. Tutorial link for 'tmux' The link below gives a well-organized summary of tmux usage. Reference link: https://hbase.tistory.com/200 ([Linux] tmux installation, usage, and examples) How to ..

[Review] G. Vavouliotis, Exploiting Page Table Locality for Agile TLB Prefetching, ISCA 2021 1 Brief Summary 1.1 Motivation TLB prefetching leaves substantial room for performance improvement. Ideally, it can achieve more than a 1.2x speedup on several workloads (the QMM and BD workloads are examples). When a TLB miss occurs, the PTE is obtained through a page table walk. During the page table walk through the memory hierarchy, memory operations have a granularity of 64B, the cache line size, whereas a ..
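The 64B granularity mentioned in the excerpt above can be made concrete with a small sketch. It assumes 8-byte PTEs (as on x86-64), which the excerpt does not state, so one cache line fetched at the end of a page table walk would hold the PTEs of 8 contiguous virtual pages. The function name `neighbor_vpns` is hypothetical and does not come from the paper.

```python
CACHE_LINE_BYTES = 64  # cache line size stated in the excerpt
PTE_BYTES = 8          # assumption: 8-byte PTEs, as on x86-64

PTES_PER_LINE = CACHE_LINE_BYTES // PTE_BYTES


def neighbor_vpns(miss_vpn):
    """Return the other VPNs whose PTEs share a cache line with miss_vpn's PTE."""
    # Leaf page tables are page-aligned, so line boundaries fall on multiples
    # of PTES_PER_LINE in VPN space.
    base = (miss_vpn // PTES_PER_LINE) * PTES_PER_LINE
    return [vpn for vpn in range(base, base + PTES_PER_LINE) if vpn != miss_vpn]


if __name__ == "__main__":
    print(PTES_PER_LINE)
    print([hex(vpn) for vpn in neighbor_vpns(0x1003)])
```

Under this assumption, every demand page walk brings in up to 7 neighboring PTEs without extra memory traffic, which is the kind of locality the paper's title refers to.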

[Review] M. Jalili, Reducing Load Latency with Cache Level Prediction, HPCA 2022 1 Brief Summary 1.1 Problem/Motivation As the number of levels in the cache hierarchy increases, level-wise sequential cache lookup results in higher load latency. A good cache system should significantly reduce the number of misses from one level to the next, but the authors show experimentally that, for many workloads, certain cache levels fail to reduce the number of misses effectively. (In figure..
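The latency cost of level-wise sequential lookup can be illustrated with a rough model. This is a simplified sketch, not the paper's mechanism: the per-level latencies are hypothetical numbers, and `direct_latency` assumes a perfect predictor that is consulted after an L1 miss and sends the request straight to the level that holds the data.

```python
# Hypothetical per-level lookup latencies in cycles (illustrative only).
LEVELS = ["L1", "L2", "L3", "DRAM"]
LOOKUP_LATENCY = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 200}


def sequential_latency(hit_level):
    """Latency when every level up to and including hit_level is probed in order."""
    idx = LEVELS.index(hit_level)
    return sum(LOOKUP_LATENCY[level] for level in LEVELS[: idx + 1])


def direct_latency(hit_level):
    """Latency when L1 is probed first and a perfect predictor skips the
    intermediate levels (assumption: predictions are always correct)."""
    if hit_level == "L1":
        return LOOKUP_LATENCY["L1"]
    return LOOKUP_LATENCY["L1"] + LOOKUP_LATENCY[hit_level]


if __name__ == "__main__":
    for level in LEVELS:
        print(level, sequential_latency(level), direct_latency(level))
```

The gap between the two functions grows with the number of levels skipped, which is why ineffective intermediate levels make sequential lookup expensive.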

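[HPCA '22] M. Jalili, Reducing Load Latency with Cache Level Prediction 2 Motivation Miss analysis on workload If the cache system exploits locality well and the prefetcher effectively reduces loads, then the number of load misses should drop substantially at each successive cache level when an application runs. Figure 3(a) is an example where the conventional cache system works well: going from L1 to L2 to L3, the number of misses drops by 3x and then 2x. In contrast, the remaining five workloads show various situations in which conventional level-wise sequential cache access is inefficient. In (b), the L2 cache fails to reduce misses and is therefore inefficient, and L3 does not reduce misses much relative to L2 either. In (c), L..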

[HPCA '23] Y. Kim, NOMAD: Enabling Non-blocking OS-managed DRAM Cache via Tag-Data Decoupling 1 Introduction On-package DRAM is used as a cache, and there are two ways to implement it: HW-based and OS-managed. A HW-based design can operate as a non-blocking cache, so it can handle multiple misses concurrently, but it requires additional metadata accesses. An OS-managed design stores tags by leveraging the address translation mechanism, which eliminates the metadata overhead, but blocking makes the miss penalty high. The paper therefore proposes an OS-managed DRAM cache design that operates in a non-blocking manner. This..

[HPCA '20] T.J. Ham, A^3: Accelerating attention mechanisms in neural networks with approximation This post summarizes and reviews A3, a well-known paper on accelerating transformers and attention. 1. Introduction Brief Background There has been much prior work on FPGA/ASIC-based accelerators that support CNNs and RNNs, but HW accelerator support for neural networks that use attention mechanisms has been insufficient. (There is plenty now, of course, but the A3 paper was published in 2020.) The attention mechanism uses content-based similarity search to determine which information is most relevant to what is currently being processed. Thanks to this property, currently in CV, NLP, and other deep learni..
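The content-based similarity search described in the excerpt above can be sketched as plain dot-product attention for a single query. This is the generic, exact formulation (without the usual scaling factor), not the approximation that A3 proposes; the function name and the toy inputs are illustrative assumptions.

```python
import math


def attention(query, keys, values):
    """Single-query dot-product attention over small Python lists."""
    # Similarity search: score each key against the query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns scores into weights (subtract the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


if __name__ == "__main__":
    weights, output = attention([1.0, 0.0],
                                [[1.0, 0.0], [0.0, 1.0]],
                                [[10.0, 0.0], [0.0, 10.0]])
    print(weights, output)
```

The key most similar to the query receives the largest weight, so its value dominates the output.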

[HPCA '23] J. Stojkovic, Memory-Efficient Hashed Page Tables 1 Introduction The radix-tree page table in wide use today uses memory efficiently and is well suited to caching structures, but it scales poorly. Because it requires sequential memory accesses along the tree hierarchy, it cannot exploit memory-level parallelism. One alternative is the hashed page table (HPT). The hash of the VPN is used as the index into the table to access an entry, and, assuming an efficient collision-free hash, address translation ideally takes only a single memory access. However, for four main reasons, HPTs ..
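The single-access lookup described in the excerpt above can be sketched as a toy hashed page table. This is a minimal illustration under stated assumptions, not the design from the paper: the multiplicative hash and the linear probing used on collisions are hypothetical choices made only to keep the example short.

```python
class HashedPageTable:
    """Toy hashed page table: index = hash(VPN) mod table size."""

    def __init__(self, num_slots=1024):
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # each slot holds (vpn, ppn) or None

    def _index(self, vpn):
        # Hypothetical hash function (multiplicative hashing).
        return (vpn * 2654435761) % self.num_slots

    def insert(self, vpn, ppn):
        """Map vpn to ppn; return the number of slots touched."""
        idx = self._index(vpn)
        for probe in range(self.num_slots):
            slot = (idx + probe) % self.num_slots
            if self.slots[slot] is None or self.slots[slot][0] == vpn:
                self.slots[slot] = (vpn, ppn)
                return probe + 1
        raise RuntimeError("table full")

    def translate(self, vpn):
        """Return (ppn, slots_accessed); ppn is None if vpn is unmapped."""
        idx = self._index(vpn)
        for probe in range(self.num_slots):
            slot = (idx + probe) % self.num_slots
            entry = self.slots[slot]
            if entry is None:
                return None, probe + 1
            if entry[0] == vpn:
                return entry[1], probe + 1
        return None, self.num_slots


if __name__ == "__main__":
    hpt = HashedPageTable(num_slots=8)
    hpt.insert(1, 100)
    hpt.insert(9, 200)  # hashes to the same slot as VPN 1 in an 8-slot table
    print(hpt.translate(1), hpt.translate(9))
```

Without a collision, translation touches one slot; each collision adds an access, which is one reason the quality of the hash matters.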

[ISCA '23] Y. Qin, FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction 1. Introduction Transformer models achieve high performance in many areas of DL, such as NLP and computer vision, and their core is the attention mechanism, which lets the model learn contextual correlations among inputs. Compared with earlier deep learning models such as CNNs, they are more accurate but incur higher power and latency costs. Latency Component & Power Breakdown As shown in Figure 1(a), a transformer model consists of N blocks, and each block consists of three components: QKV generation, Attention, and FFN. For Figure 1(b), powe..

