
References

[1]

NVIDIA, NVIDIA Tesla V100 GPU Architecture, 2017.

[2]

J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: Enabling energy optimizations in GPGPUs,” ACM SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 487-498, 2013. [CrossRef]

[3]

V. Kandiah, S. Peverelle, M. Khairy, J. Pan, A. Manjunath, T. G. Rogers, T. M. Aamodt, and N. Hardavellas, “AccelWattch: A power modeling framework for modern GPUs,” Proc. of MICRO ’21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 738-753, 2021. [CrossRef]

[4]

N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A case for core-assisted bottleneck acceleration in GPUs: Enabling flexible data compression with assist warps,” Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 41-53, 2015. [CrossRef]

[5]

S. Lee, K. Kim, G. Koo, H. Jeon, W. W. Ro, and M. Annavaram, “Warped-compression: Enabling power efficient GPUs through register compression,” Proc. of the 42nd Annual International Symposium on Computer Architecture, pp. 502-514, 2015. [CrossRef]

[6]

AMD, The Polaris Architecture, 2016.

[7]

D. Wong, N. S. Kim, and M. Annavaram, “Approximating warps with intra-warp operand value similarity,” Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016. [CrossRef]

[8]

S. Sardashti and D. A. Wood, “Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching,” Proc. of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 62-73, 2013.

[9]

T. M. Nguyen and D. Wentzlaff, “MORC: A manycore-oriented compressed cache,” Proc. of the 48th International Symposium on Microarchitecture, pp. 76-88, 2015. [CrossRef]

[10]

S. Hong, B. Abali, A. Buyuktosunoglu, M. B. Healy, and P. J. Nair, “Touché: Towards ideal and efficient cache compression by mitigating tag area overheads,” Proc. of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 453-465, 2019. [CrossRef]

[11]

G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W. Keckler, “A case for toggle-aware compression for GPU systems,” Proc. of 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016. [CrossRef]

[12]

S. Lal, J. Lucas, and B. Juurlink, “E2MC: Entropy encoding based memory compression for GPUs,” Proc. of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017. [CrossRef]

[13]

G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Base-delta-immediate compression: Practical data compression for on-chip caches,” Proc. of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 377-388, 2012. [CrossRef]

[14]

G. Li, X. Chen, G. Sun, H. Hoffmann, Y. Liu, Y. Wang, and H. Yang, “A STT-RAM-based low-power hybrid register file for GPGPUs,” Proc. of the 52nd Annual Design Automation Conference, pp. 1-6, 2015. [CrossRef]

[15]

W. Jeon, J. H. Park, Y. Kim, G. Koo, and W. W. Ro, “Hi-End: Hierarchical, endurance-aware STT-MRAM-based register file for energy-efficient GPUs,” IEEE Access, vol. 8, pp. 127768-127780, 2020. [CrossRef]

[16]

C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, July 1948. [CrossRef]

[17]

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” Proc. of 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009. [CrossRef]

[18]

G. M. Amdahl, “Validity of the single-processor approach to achieving large scale computing capabilities,” Proc. of Spring Joint Computer Conference, pp. 483-485, 1967. [CrossRef]

[19]

J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, “Parboil: A revised benchmark suite for scientific and commercial throughput computing,” Center for Reliable and High-Performance Computing, vol. 127, 2012.

[20]

A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” Proc. of 2009 IEEE International Symposium on Performance Analysis of Systems and Software, 2009. [CrossRef]

[21]

NVIDIA, NVIDIA’s Fermi: The First Complete GPU Computing Architecture, 2009.