
  1. (School of Electrical Engineering, University of Ulsan, Ulsan 44610, Korea)
  2. (Department of Smart Systems Software, Soongsil University, Seoul 06978, Korea)
  3. (School of Computer and Communication Engineering/Computer & Communication Research Center, Daegu University, Gyeongsangbukdo 38453, Korea)




I. INTRODUCTION

Memory and storage have emerged as an important issue in accommodating the explosive data generation of innovative services such as the Internet of Things (IoT), social network services (SNS), and private internet broadcasting [1,2]. While the challenge for traditional storage systems has mainly been enlarging capacity, the above-mentioned applications also require fine-grained data management in which latency and random accessibility are equally important. To meet these requirements, many new storage and memory device technologies have been developed. Among them, phase change memory (PCM) has been actively researched to achieve both high performance and large capacity in a single memory/storage device. It is expected to replace conventional DRAM devices because it scales deep into the low-nanometer regime and offers low power consumption together with non-volatility [3]. However, there are two major drawbacks in adopting PCM technology in conventional memory/storage architectures: poor write performance and limited long-term endurance. Various architectural techniques have been proposed to mitigate these drawbacks while maximally exploiting the benefits of PCM [4].

Most previous approaches have tried to reduce the number of memory accesses or to optimize the internal architecture of the PCM cell array under the assumption that the internal architecture and the interface to the memory controller are very similar to those of DRAM [5,6]. Although these techniques have significantly improved the performance and energy consumption of PCM memory systems, most of them have paid little attention to the characteristics of industry-announced PCM devices. The internal architectures and interfaces of industrial PCMs differ significantly from those of conventional DRAMs. For example, the LPDDR2-NVM standard interface was announced to cope with the distinct characteristics of newly announced non-volatile memory devices, including PCMs [7], and major PCM manufacturers have announced PCM prototypes compatible with this standard interface [8,9]. Although the standard inherits many common features from the conventional double data rate (DDR) interfaces of DRAMs, it also includes many distinctive features such as a three-phase addressing mechanism, different row buffer and bank architectures, and asymmetric read and write operations through an overlay window. Detailed information about this standard interface is given in a later section.

Among the several distinctive features of LPDDR2-NVM-compliant PCMs, the row buffer architecture has been significantly revised from the conventional DRAM architecture in terms of the number of row buffers, the unit size of a single row buffer, and the buffer management policy. The LPDDR2-NVM interface defines 4 or 8 pairs of row buffers, where each pair consists of a row address buffer (RAB) and a row data buffer (RDB). These row buffers can be arbitrarily selected by the memory controller regardless of the physical memory address because they are not tightly coupled with the physical memory address in the LPDDR2-NVM interface. In addition, the unit size of a single RDB, typically 32 bytes, is much smaller than the unit size of a single row buffer in DRAM. All of these differences give us more flexibility in controlling the PCM's row buffers, and a sophisticated row buffer control mechanism is desirable to maximize the performance of PCM-based memory systems.

In this paper, we investigate how the row buffer architecture and its management policy affect the performance of memory systems built with LPDDR2-NVM compatible devices. To this end, we devise a proactive row buffer architecture that enhances the performance of the PCM memory system. The proposed scheme efficiently traces memory accesses and adaptively controls the number of prefetched rows depending on the real-time access characteristics. Our trace-driven simulations using real workloads and practical timing parameters extracted from industry-announced PCM prototypes demonstrate that the proposed LPDDR2-NVM-aware row buffer architecture improves system-level memory performance and energy consumption by 12.2% and 0.3% on average, respectively, compared to the conventional row buffer architecture under the same cost (area) restrictions.

The rest of this paper is organized as follows. Section II presents the background of this work, including a brief introduction to the LPDDR2-NVM industry standard interface and related work. Section III introduces the reconfiguration of the row buffer architecture together with a motivational example. Section IV describes the proactive row data buffer management. Section V evaluates the proposed scheme, and Section VI concludes this work.

II. BACKGROUND

1. LPDDR2-NVM Interface

The LPDDR2-NVM interface includes many features that differ from the conventional DDR interface in order to support the distinct behaviors of non-volatile memory devices, including asymmetric read and write operations. Its representative features, compared to the conventional DDR interface, are:

- Three-phase addressing mechanism for supporting large memory densities (up to 32 Gb).

- No multi-bank architecture.

- Multiple RABs and RDBs that are arbitrarily selected by the memory controller regardless of the physically accessed address.

- Smaller unit size of the RDB (typically 32 bytes).

- Indirect write operations via an overlay window.

- Dual operation, which enables a read operation in one partition while cell programming is performed in another partition.

Fig. 1 shows the internal structure of an LPDDR2-NVM compatible PCM device and its interface. In LPDDR2-NVM, addresses and commands are transferred through command/address (CA) pins, while conventional DRAMs have 12 to 16 dedicated pins for transferring the address and command separately. LPDDR2-NVM specifies 10 CA pins, and they are operated in a DDR fashion even during the address phases, as shown in Fig. 2. This means that the memory controller can transfer up to 20 bits of command and/or address information per memory clock cycle. In addition, a three-phase addressing mechanism is used to support larger device densities than conventional DRAMs, which use a two-phase addressing mechanism. As shown in Fig. 2(a), three-phase addressing consists of the preactive, activate, and read/write phases. In the preactive phase, only the upper 3 to 12 bits of the row address are transferred, and this partial row address is stored in the designated RAB. In the activate phase, the remaining row address is transferred; the complete row address, obtained by combining it with the upper row address stored in the RAB, is applied to the memory array, and the corresponding row data is transferred from the memory array to the designated RDB. Finally, in the read/write phase, the data is transferred from the RDB to the memory controller. The numbers of upper and lower row bits are determined by the device density and the unit size of the RDB, respectively.
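To make the address decomposition concrete, the following sketch splits a word address into the three per-phase fields. The field widths here are our own assumptions for a hypothetical 1-Gb, 16-bit-wide device with 32-byte RDBs; actual devices derive them from the density and the RDB size as stated above.

```python
def split_address(word_addr, upper_row_bits=9, lower_row_bits=13, column_bits=4):
    """Split a word address into the fields sent in each phase of three-phase addressing.

    The widths are illustrative assumptions (hypothetical 1-Gb x16 device, 32-byte RDB);
    they are not taken from the JEDEC tables.
    """
    column = word_addr & ((1 << column_bits) - 1)      # READ/WRITE phase
    row = word_addr >> column_bits
    lower_row = row & ((1 << lower_row_bits) - 1)      # ACTIVATE phase
    upper_row = row >> lower_row_bits                  # PREACTIVE phase, latched in the RAB
    assert upper_row < (1 << upper_row_bits), "address exceeds the assumed device density"
    return upper_row, lower_row, column

# The controller would then issue PREACTIVE(upper_row), ACTIVATE(lower_row), READ(column).
print(split_address(0x0123_4567))
```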

Fig. 1. Functional block diagram of JEDEC LPDDR2-NVM standard-compatible PCM device.


Fig. 2. Addressing comparison of LPDDR2-NVM and the conventional DRAM (1Gb device with 16-bit data width).


LPDDR2-NVM also supports multiple pairs (4 or 8) of row buffers, where each pair consists of an RAB and an RDB. Unlike the row buffers in traditional DRAMs, each row buffer can be arbitrarily selected by the memory controller. The BA signals, which are originally used to select a bank in conventional DRAM, are instead used to select a row buffer. Note that these BA signals select only a row buffer, not a physical bank address of the memory array [7]. In each phase, the memory controller selects the proper RAB and/or RDB by driving these BA signals, regardless of the physically accessed memory address.

PCM shows a relatively long program latency because of its operating principle. This long program latency may also degrade read performance if a read request arrives during a program operation. Similar to the multi-bank architecture in traditional DRAM devices, the multi-partition architecture and parallel operation in LPDDR2-NVM alleviate this read performance degradation: data in one partition can be read while another partition is being programmed. However, parallel program operations are not allowed.

Another distinctive feature of LPDDR2-NVM is its support for asymmetric read and write operations. The read operation is very similar to that of conventional DRAMs except for the three-phase addressing and the row buffer management. The write operation, strictly speaking non-volatile cell programming, is however completely different from that of conventional DRAMs. Writes are performed indirectly through a set of special registers, called the overlay window, similar to NOR flash. A single write operation requires several overlay window accesses to complete the non-volatile cell programming. The overlay window is 4 KB in size and consists of several memory-mapped registers, such as a command address register, a command code register, a command execution register, and program buffers, to properly control LPDDR2-NVM devices and write operations. Single-word overwrites, buffered overwrites, suspend, and other cell programming operations are supported through this overlay window.
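As a rough illustration of the indirect write path, the sketch below performs one buffered overwrite through the overlay window registers named above. The register offsets, the command code, and the function names are placeholders of our own choosing, not values taken from the JEDEC specification.

```python
# Hypothetical overlay-window layout; the offsets and command code are placeholders.
OW_BASE = 0x0000_0000            # address range the overlay window is mapped to
CMD_ADDR_REG = OW_BASE + 0x00    # command address register
CMD_CODE_REG = OW_BASE + 0x08    # command code register
CMD_EXEC_REG = OW_BASE + 0x10    # command execution register
PROGRAM_BUFFER = OW_BASE + 0x80  # start of the program buffers
BUFFERED_OVERWRITE = 0xE9        # placeholder command code

def pcm_write(mem_write, target_addr, data_words):
    """Program data_words at target_addr using only overlay-window accesses.

    mem_write(addr, value) stands for one regular LPDDR2-NVM write transaction,
    so every access below lands in the overlay window, never in the target row.
    """
    for i, word in enumerate(data_words):           # 1. fill the program buffer
        mem_write(PROGRAM_BUFFER + 2 * i, word)
    mem_write(CMD_ADDR_REG, target_addr)            # 2. tell the device which row to program
    mem_write(CMD_CODE_REG, BUFFERED_OVERWRITE)     # 3. select the operation
    mem_write(CMD_EXEC_REG, 1)                      # 4. trigger the cell programming
```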

2. Related Work

Compared with the many system-level approaches for enhancing the performance and energy consumption of PCM memory systems, relatively little research has been conducted on optimizing the row buffer architecture and its management.

Lee et al. analyzed the row buffer architecture under the assumption that the baseline buffer architecture of PCM is similar to that of conventional DRAMs [5]. Instead of using a single large 2-KB buffer, they reorganized the row buffer into multiple small row buffers, which significantly reduces the energy consumed in the row buffer by reducing the number of sense amplifiers. Performance is also enhanced by this reorganization. However, their architecture does not consider the asymmetric read and write characteristics of LPDDR2-NVM, where write operations in industrial PCM devices are performed only through overlay window accesses.

Yoon et al. treated hot rows separately from cold rows so that hot rows are cached in DRAM [10]. This approach decreases the number of hot-row misses, which improves both performance and energy. However, it is a system-level technique that does not address the row buffer architecture itself or its optimization; it only exploits the locality information of row buffers. Their analysis also assumes an internal buffer architecture similar to that of DRAM.

Li et al. considered the LPDDR2-NVM interface in their research, but they only utilized its channel and bus model to design a photonic-channel-based memory communication infrastructure for PCM [11]. Park et al. enhanced the performance of an LPDDR2-NVM memory system using an address phase skipping technique, but they only omit the address phase passively, similar to the open-row policy in conventional DRAM [12]. The configuration of the row buffer architecture affects the performance of LPDDR2-NVM [13]. A row buffer prefetch technique has also been proposed, but it does not consider write operations in LPDDR2-NVM [14].

III. RECONFIGURATION OF ROW BUFFER ARCHITECTURE

The LPDDR2-NVM standard provides more flexibility in designing and managing the row buffer architecture than the traditional LPDDR2 standard. For example, the memory controller can select a row buffer arbitrarily in LPDDR2-NVM, like a fully-associative cache, whereas in conventional DRAM the row buffer selection is fixed by the internal architecture, similar to a direct-mapped cache. This flexibility enables us to design various row buffer management schemes that consider the access patterns of applications. Different management and configuration policies lead to different RDB hit ratios for read and write operations, which in turn lead to performance variations of the memory system.

1. Motivational Example

In designing the row buffer architecture, determining the unit size of an RDB and the number of RDBs is as important as determining the total number of bytes dedicated to RDBs. Fig. 3 shows a motivational example of this work. We simply compare the number of RDB hits under three different configurations: (a) the largest-RDB configuration, (b) the highest-number-of-RDBs configuration, and (c) the adaptive RDB configuration. In the figure, a box with a thick solid line denotes one physical RDB, which consists of one or more basic units (boxes with dotted lines). The size of one basic unit is equal to the size of one cacheline in the microprocessor. The largest-RDB configuration has only one physical RDB consisting of 4 basic units, while the highest-number-of-RDBs configuration has four physical RDBs, each equal in size to one basic unit.

The example memory access patterns are presented at the top of the figure. The first half of the pattern is sequential, while the second half is random. A grayed box and a horizontally-lined box represent an RDB miss and an RDB hit, respectively. For a fair comparison, all configurations start from the same initial state: Cachelines 4, 5, 6, and 7 are stored in the RDBs.

In the largest-RDB configuration, the request for Cacheline 0 incurs an RDB miss at time $T0$. This RDB miss evicts all cachelines in the RDB, and Cachelines 0 to 3 are then fetched from the memory array, as shown in Fig. 3(a). Since the next three memory accesses are sequential, all three requests incur RDB hits. However, the remaining memory accesses from $T4$ to $T7$ incur consecutive RDB misses again because only one physical RDB is available. In total, 5 RDB misses and 3 RDB hits occur during the example memory accesses. In contrast, the highest-number-of-RDBs configuration handles a random memory access pattern efficiently because each RDB stores only one cacheline. However, this configuration is very weak against sequential memory access patterns. The requests for Cachelines 0 to 3 continuously incur RDB misses from $T0$ to $T3$, as shown in Fig. 3(b). In total, we observe 6 RDB misses and 2 RDB hits.

Fig. 3. RDB hit ratio varying on row buffer architecture and memory access patterns.


Based on the observation above, both the largest-RDB configuration and the highest-number-of-RDBs configuration provide limited capability for the given example memory access patterns. Each configuration has clear advantages but also clear disadvantages depending on the memory access pattern. The characteristics of memory access patterns may vary from application to application, and even within the same application they may vary over time. Thus, changing the RDB configuration dynamically, even within the same application, is desirable to increase the chance of RDB hits, as shown in Fig. 3(c). The number of RDB misses can be reduced by reconfiguring the row buffer from 3 RDBs to one RDB consisting of 4 basic units and then filling it with four consecutive cachelines at $T0$. This makes the next three memory accesses RDB hits. The adaptive RDB configuration again modifies its configuration at $T4$ to three RDBs: one RDB of two basic units and two RDBs of one basic unit each. This reconfiguration turns the remaining random memory accesses from $T5$ to $T7$ into RDB hits. In total, 2 RDB misses and 6 RDB hits occur in this adaptive configuration. The adaptive RDB reconfiguration clearly shows the best RDB hit ratio when the incoming memory access pattern is known. However, it is important to predict the characteristics of incoming memory accesses accurately, because adaptive RDB reconfiguration with inaccurate predictions may incur even more RDB misses.
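The trade-off can be reproduced with a small hit-counting sketch. The trace below is a hypothetical mix of a sequential and a random phase, not the exact trace of Fig. 3, and the buffers start empty rather than preloaded; it only shows how the same trace yields very different hit counts for a one-large-RDB and a many-small-RDBs organization.

```python
from collections import OrderedDict

def count_hits(trace, num_rdbs, units_per_rdb):
    """Count RDB hits/misses for num_rdbs buffers of units_per_rdb cachelines each (LRU)."""
    rdbs = OrderedDict()                        # aligned base line -> None, kept in LRU order
    hits = 0
    for line in trace:
        tag = line - (line % units_per_rdb)     # rows are fetched as aligned groups of units
        if tag in rdbs:
            hits += 1
            rdbs.move_to_end(tag)
        else:
            if len(rdbs) >= num_rdbs:
                rdbs.popitem(last=False)        # evict the least recently used RDB
            rdbs[tag] = None
    return hits, len(trace) - hits

trace = [0, 1, 2, 3, 9, 13, 9, 13]              # hypothetical: sequential half, then random half
print("largest RDB (1 x 4 units): hits/misses =", count_hits(trace, 1, 4))
print("most RDBs   (4 x 1 unit):  hits/misses =", count_hits(trace, 4, 1))
```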

IV. PROACTIVE ROW BUFFER CONTROL POLICY

A reconfigurable RDB architecture is an attractive way to increase RDB hits and, ultimately, to enhance the performance of the memory system. By exploiting the flexibility of the recently announced LPDDR2-NVM standard, which allows any RDB to be selected regardless of the requested physical memory address, we propose a proactive row buffer control method that enables dynamic reconfiguration of RDBs without requiring any hardware modification to the LPDDR2-NVM specification. The proposed method consists mainly of a row buffer prefetch technique and an overlay-window-aware address pinning technique.

1. Row Buffer Prefetch

In traditional cache memory architectures, prefetching has mainly been used to maximize the utilization of the limited cache capacity by proactively moving specific data from main memory to the cache before it is explicitly requested. Similarly, we propose a row buffer prefetch technique that moves specific data into a row buffer in advance while the memory device is idle. By doing this in an LPDDR2-NVM device, we realize the adaptive RDB architecture. Fig. 4 shows the key concept of the logical RDB reconfiguration using row buffer prefetch. The row buffer consists of 4 physical RDBs. When an RDB miss occurs, one physical RDB is allocated to serve the request. In addition to this basic operation, we allocate additional physical RDBs to prefetch the consecutive row data if the next request is expected to be sequential. This prefetch operation implicitly allocates two RDBs to one memory request, which creates an effect similar to increasing the size of a single RDB. Depending on the number of RDBs used for prefetching, the logical (not physical) size of a single RDB can be varied, and this is the basic principle of our dynamic RDB reconfiguration. The number of RDBs used for prefetching is increased when the access pattern is expected to be strongly sequential and decreased when a random access pattern is expected. No additional control logic is required in the memory device to implement these operations; the memory controller simply issues a row buffer prefetch to the command queue when it predicts that the incoming memory access pattern is sequential.

Fig. 4. Prefetch-based dynamic RDB reconfiguration.


Some commercial DRAM controllers offer a memory access reordering feature to increase the row buffer hit ratio. Such a controller changes the order of memory accesses using a reordering buffer and then returns the results to the processor in order. Reordering might also increase the RDB hit ratio of LPDDR2-NVM, but we do not consider it here because it is orthogonal to our proposed scheme. This paper focuses only on the relation between the characteristics of the memory access pattern and the RDB reconfiguration.

We give row buffer prefetch requests a higher priority than regular memory accesses from the processor when requests conflict, because a row buffer prefetch requires less time than a regular memory access: it skips the read/write phase (the last phase) of the three-phase addressing. Although this may slightly and temporarily increase the response time of an incoming memory request, we found that the long-term benefit of this policy outweighs the temporary response time degradation.
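The sketch below shows why a prefetch occupies the device for less time than a demand read: it stops after the activate phase, so the row data is only staged in the selected RDB. The command names and tuple encoding are our own shorthand for the LPDDR2-NVM phases.

```python
def issue_demand_read(cmd_queue, rdb, upper_row, lower_row, column):
    # Full three-phase access: the data is returned to the memory controller.
    cmd_queue.append(("PREACTIVE", rdb, upper_row))
    cmd_queue.append(("ACTIVATE", rdb, lower_row))
    cmd_queue.append(("READ", rdb, column))

def issue_row_buffer_prefetch(cmd_queue, rdb, upper_row, lower_row):
    # Only the first two phases: the row is staged in the RDB, no data transfer yet.
    cmd_queue.append(("PREACTIVE", rdb, upper_row))
    cmd_queue.append(("ACTIVATE", rdb, lower_row))

queue = []
issue_row_buffer_prefetch(queue, rdb=2, upper_row=0x05, lower_row=0x1A3)
```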

Since the number of row buffers is very limited in most memory devices, the performance enhancement depends heavily on the accuracy of the prediction. In the following subsections, we propose a simple but efficient system-level row buffer prediction and management policy for dynamic RDB reconfiguration.

2. Tagged Row Buffer Prefetch

In our design, the decision to prefetch depends mainly on detecting whether the current memory access pattern is sequential. To efficiently detect the characteristics of memory access patterns, we devise a tagged row buffer prefetch scheme, $TPRE$, similar to tagged prefetching in a cache [15]. As shown in Fig. 5(a), $TPRE$ uses one tag bit per entry of the RDB tracking table to decide when to issue a prefetch command. The tag bit is set when the corresponding RDB is first activated. The bit is cleared once the corresponding RDB is re-referenced, and a row buffer prefetch is then requested to move the consecutive data into another RDB, as shown in Fig. 5(b). The assumption behind $TPRE$ is that the memory access pattern is mostly sequential if an RDB is accessed more than once. This assumption is justified by the fact that the unit size of a single RDB is larger than the size of a cacheline.
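A minimal sketch of the $TPRE$ decision logic follows, assuming one tracking-table entry per RDB as described above; the class and field names are ours, and victim selection is left out.

```python
class TPREController:
    """Tagged row buffer prefetch: prefetch the next row when an RDB is first re-referenced."""

    def __init__(self, num_rdbs):
        # One tracking-table entry per RDB: the row it holds and its tag bit.
        self.table = [{"row": None, "tag": False} for _ in range(num_rdbs)]

    def on_access(self, rdb, row):
        entry = self.table[rdb]
        if entry["row"] == row:                  # RDB hit
            if entry["tag"]:                     # first re-reference: assume a sequential stream
                entry["tag"] = False
                return ("PREFETCH", row + 1)     # stage the consecutive row in another RDB
            return None
        entry["row"], entry["tag"] = row, True   # RDB miss: activate for the new row, set the tag
        return None

    def on_prefetch_fill(self, rdb, row):
        # A prefetched row is also tagged, so a hit on it can trigger the next prefetch.
        self.table[rdb] = {"row": row, "tag": True}
```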

Fig. 5. Operation of tagged row buffer prefetch, $TPRE$.


3. Multiple Row Buffer Prefetch

A row buffer prefetch is initiated by a prediction, and thus the accuracy of the prediction mainly determines the reduction in total execution time. Since $TPRE$ assumes that memory access patterns are sequential, it may incur unnecessary row buffer prefetches that are evicted without ever being referenced. To minimize these unnecessary row buffer prefetches, we propose a multiple row buffer prefetch technique, $MPRE$.

$MPRE$ uses a two-bit saturating counter to accurately predict the characteristics of the incoming memory access pattern. This saturating counter changes its mode between STRONG RANDOM and STRONG SEQUENTIAL according to the recent activity of each RDB, as shown in Fig. 6. When an RDB hit occurs, $MPRE$ decides that the incoming memory access will be sequential and promotes the mode of the saturating counter toward STRONG SEQUENTIAL. When an RDB miss occurs, $MPRE$ demotes the mode of the saturating counter toward STRONG RANDOM because the incoming memory access is highly likely to be random.

Fig. 6. Mode transition and initial allocation for predicting incoming memory accesses in $MPRE$.


A single global saturating counter for all RDBs would be a simple solution; however, it turns out that the accuracy of a single global saturating counter is very poor, especially for mixed sequential and random access patterns. In this case, an RDB miss caused by a random memory access demotes the mode of the global counter too quickly because the counter is driven by successive memory accesses. As a result, the sequential part of the memory access pattern is frequently mispredicted as random. To avoid this misprediction, $MPRE$ uses one saturating counter per RDB so that it can track multiple interleaved memory access streams. When an RDB hit occurs, $MPRE$ promotes the mode of the corresponding RDB while the modes of the other RDBs remain unchanged. In the opposite case, i.e., an RDB miss, $MPRE$ demotes the modes of all RDBs at once because the requested memory access is not part of any sequential stream tracked by the counters.

It is also important to decide the initial mode of the saturating counter when new data is allocated to an RDB. Since a row buffer prefetch is performed as a result of predicting a sequential access pattern, $MPRE$ assigns the WEAK SEQUENTIAL mode to an RDB filled by a row buffer prefetch, as shown in Fig. 6. In the other case, where the RDB allocation is caused by the processor after an RDB miss, $MPRE$ assigns WEAK RANDOM to the mode of the corresponding RDB.

The overhead of keeping and managing the two-bit saturating counters is not significant, because the memory controller must already keep the address of the data stored in each RDB, along with its validity, to check whether the addressing phases can be skipped in LPDDR2-NVM. For victim selection, $MPRE$ uses a simple least recently used (LRU) policy.
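A minimal sketch of the per-RDB saturating counters follows, assuming the four modes of Fig. 6 are encoded as the values 0 to 3; the encoding and method names are ours.

```python
from enum import IntEnum

class Mode(IntEnum):
    STRONG_RANDOM = 0
    WEAK_RANDOM = 1
    WEAK_SEQUENTIAL = 2
    STRONG_SEQUENTIAL = 3

class MPRECounters:
    """One two-bit saturating counter per RDB, as used by MPRE."""

    def __init__(self, num_rdbs):
        self.mode = [Mode.WEAK_RANDOM] * num_rdbs

    def on_hit(self, rdb):
        # Promote only the counter of the RDB that hit (saturating at STRONG SEQUENTIAL).
        self.mode[rdb] = Mode(min(self.mode[rdb] + 1, Mode.STRONG_SEQUENTIAL))
        # Algorithm 1: request a prefetch when the mode is higher than WEAK SEQUENTIAL.
        return self.mode[rdb] > Mode.WEAK_SEQUENTIAL

    def on_miss(self):
        # Demote every counter: the access belongs to none of the tracked streams.
        self.mode = [Mode(max(m - 1, Mode.STRONG_RANDOM)) for m in self.mode]

    def on_allocate(self, rdb, by_prefetch):
        # Initial mode depends on what caused the allocation (Fig. 6).
        self.mode[rdb] = Mode.WEAK_SEQUENTIAL if by_prefetch else Mode.WEAK_RANDOM
```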

4. RDB Pinning for Overlay Window Access

As described in Section II, a write operation in LPDDR2-NVM is translated into several overlay window accesses, and the address of the overlay window does not change regardless of the target address of the write operation. This means that write accesses cause intensive accesses to the overlay window, which occupies only a specific range of memory addresses. If there are multiple intensive write requests, the RDBs that contain the overlay window have a high chance of being referenced again before they are evicted. However, because the number of RDBs in LPDDR2-NVM is very limited, the RDBs that contain the overlay window can still be selected as victims by a conventional replacement policy such as LRU. To avoid this situation, we propose a simple overlay window pinning scheme, $MPRE+OW$, which reserves a certain number of RDBs only for overlay window accesses. This pinning method is implemented with negligible overhead because the address comparison is already required to decide an RDB hit in a conventional memory controller for LPDDR2-NVM.

In the proposed pinning scheme, we do not pin all RDBs that contain the overlay window. Among the several types of overlay window accesses, program buffer accesses show a low RDB hit ratio because the address of a program buffer access changes with the address of the write request. Therefore, the proposed scheme does not pin RDBs that contain the program buffer.

Algorithm 1. Proactive row buffer management in $MPRE+OW$.

Table 1. Simulated system configuration details

Number of cores: 4
Processor: UltraSPARC-III+, 2 GHz (OoO)
L1 cache (private): I/D-cache, 32 KB, 4-way, 64 B block
L2 cache (shared): 2 MB, 4-way, 64 B block
PCM main memory: 4 GB, LPDDR2-800, 64-bit wide
Preactive to Activate ($t_{RP}$): 3 $t_{CK}$$^1$
Activate to Read/Write ($t_{RCD}$): 120 ns
Read/write latency: 6 $t_{CK}$ / 3 $t_{CK}$$^1$
Cell program time ($t_{program}$): 150 ns
Number of partitions: 16

$^1$ $t_{CK}$ is one memory clock cycle (2.5 ns at LPDDR2-800).

Algorithm 1 describes how $MPRE+OW$ proactively manages the row buffers in combination with the previously proposed $MPRE$. When a new memory read request arrives, the controller first checks whether it is an RDB hit or a miss. If an RDB hit occurs, the mode of the corresponding RDB is promoted. After the promotion, if its mode is higher than WEAK SEQUENTIAL, a new row buffer prefetch is requested. If an RDB miss occurs, all RDBs are demoted and one of the unpinned RDBs is selected as a victim for the new request; the mode of the victim RDB is then set to WEAK RANDOM. If a prefetch is requested as a result of the RDB hit processing, a victim is selected from the unpinned RDBs for the prefetch request, and the mode of that victim RDB is set to WEAK SEQUENTIAL.

No prefetch requests are issued for the RDBs that contain the overlay window, so their modes are temporarily set to WEAK RANDOM and updated only when they are selected as victims for read memory accesses. A write memory access also does not change the modes of the other RDBs. The addressing of a write memory access starts from the READ/WRITE phase if the access hits a pinned RDB; otherwise, it starts from the PREACTIVE phase.
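The read-request path of Algorithm 1 can be restated on top of the counter sketch given earlier. The helper names, the use of an OrderedDict for LRU order, and the fill callback are our own; the overlay window (write) path and the pinned-RDB mode updates are omitted for brevity.

```python
from collections import OrderedDict

def lru_victim(rdbs, pinned):
    """Return the least recently used RDB index that is not pinned for the overlay window."""
    return next(i for i in rdbs if i not in pinned)

def handle_read(req_row, rdbs, counters, pinned, fill):
    """One read request under MPRE+OW (read path of Algorithm 1 only).

    rdbs: OrderedDict mapping RDB index -> row it holds, kept in LRU order.
    counters: the MPRECounters object sketched above.
    pinned: set of RDB indices reserved for the overlay window.
    fill(rdb, row): stands for issuing the PREACTIVE/ACTIVATE phases for that row.
    """
    hit = next((i for i, row in rdbs.items() if row == req_row), None)

    if hit is not None:
        rdbs.move_to_end(hit)                              # refresh LRU position
        if counters.on_hit(hit) and hit not in pinned:     # no prefetch from a pinned (OW) RDB
            victim = lru_victim(rdbs, pinned)
            del rdbs[victim]; rdbs[victim] = req_row + 1
            fill(victim, req_row + 1)                      # prefetch: READ/WRITE phase skipped
            counters.on_allocate(victim, by_prefetch=True)
        return "HIT"

    counters.on_miss()                                     # demote all RDBs
    victim = lru_victim(rdbs, pinned)
    del rdbs[victim]; rdbs[victim] = req_row
    fill(victim, req_row)                                  # demand fetch: full three-phase access
    counters.on_allocate(victim, by_prefetch=False)
    return "MISS"
```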

V. EXPERIMENTAL RESULTS

1. Evaluation Setup

We developed a cycle-accurate trace-driven simulator using SystemC to evaluate the total execution time and the total execution energy. The traces were extracted from the Simics full-system simulator [16] together with the processor clock cycle at which each memory access is issued. We calculate the total execution time of a trace from the idle time of the memory system, obtained from the processor clock cycles, and the simulated memory access latency.
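Under our reading of this description (an assumption, since the formula is not written out in the paper), the total execution time of a trace accumulates, per request, the idle gap recorded from the processor clock cycles plus the simulated access latency:

$$T_{total} = \sum_{i=1}^{N} \left( t_{idle,i} + t_{mem,i} \right),$$

where $t_{idle,i}$ is the idle time of the memory system before request $i$ is issued and $t_{mem,i}$ is the latency of request $i$ reported by the memory model.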

We simulate a 4-core out-of-order processor system operating at a 2 GHz clock frequency with a shared last-level cache. The main memory system has a 64-bit bus with four LPDDR2-NVM compatible PCM chips. The timing parameters of the non-volatile memory (phase-change memory in our experiments) are extracted from the JEDEC LPDDR2-NVM standard and an industrial prototype [17]. The details of the simulation setup are summarized in Table 1.

Ten multi-threaded benchmarks from the PARSEC benchmark suite [18] are selected. Table 2 summarizes the characteristics of each benchmark in terms of the ratio of read to write operations and the frequency of memory accesses. Based on this setup, we intensively evaluate the proposed prefetch-based proactive row buffer management schemes, $TPRE$, $MPRE$, and $MPRE+OW$.

Table 2. Memory access characteristics of the benchmarks

Applications | R/W ratio | Mem. accesses / 1K CPU cycles
blackscholes | 3.02 | 4.2
bodytrack | 2.80 | 1.2
facesim | 1.57 | 7.6
ferret | 2.71 | 6.3
freqmine | 2.20 | 4.7
raytrace | 1.73 | 2.5
streamcluster | 2.53 | 2.2
swaptions | 3.28 | 1.2
vips | 1.77 | 4.7
x264 | 2.87 | 3.7

We evaluate these schemes focusing on the total execution time and the total energy consumption. As the baseline, we use the static optimum RDB configuration, i.e., the configuration with the minimum execution time among all possible RDB configurations. Note that we assume the same row activation time regardless of the RDB size. From extensive design space exploration, the 8 × 128 bytes RDB configuration is selected as the static optimum for all benchmarks.

2. Performance Evaluations

Before evaluating the performance and energy consumption, we first analyze the RDB hit ratio and the prefetch ratio, which directly affect the latency and energy consumption of the memory devices. Table 3 compares the RDB hit ratios of $TPRE$, $MPRE$, and $MPRE+OW$. We separately present the RDB hit ratio of read accesses, $r_{RD}$, and of overlay window accesses, $r_{OW}$, to clearly show the effect of each row buffer management scheme.

As expected, $r_{OW}$ is generally higher than $r_{RD}$ in all applications, which means that overlay window accesses show higher spatial and temporal locality than the other types of memory accesses in LPDDR2-NVM. Compared with the baseline configuration, the most naive scheme, $TPRE$, shows a higher $r_{RD}$ because it prefetches row buffers aggressively. However, this aggressive prefetching also decreases $r_{OW}$, which negatively affects the total execution time. The $r_{RD}$ and $r_{OW}$ of $MPRE$ are improved in most applications except $bodytrack$. By exploiting the history of memory accesses, $MPRE$ efficiently reduces unnecessary prefetches and evictions of the RDBs that contain high-locality overlay window data. We observe a further improvement of $r_{OW}$ in $MPRE+OW$ because it tries to keep the RDBs containing the overlay window as long as possible when the memory controller predicts that there will be several write accesses among the upcoming memory requests. Overall, compared with the static optimum RDB configuration, $MPRE+OW$ enhances $r_{RD}$ and $r_{OW}$ by 16.0% and 3.0% on average, respectively.

Table 3. Comparison of the RDB hit ratio (%)

Applications | Static ($r_{RD}$ / $r_{OW}$) | $TPRE$ ($r_{RD}$ / $r_{OW}$) | $MPRE$ ($r_{RD}$ / $r_{OW}$) | $MPRE+OW$ ($r_{RD}$ / $r_{OW}$)
blackscholes | 27.0 / 73.8 | 39.2 / 64.4 | 42.2 / 71.7 | 42.1 / 79.0
bodytrack | 21.5 / 74.7 | 31.5 / 65.9 | 30.9 / 73.2 | 30.8 / 79.0
facesim | 41.7 / 83.7 | 56.1 / 81.5 | 69.9 / 83.0 | 69.8 / 84.1
ferret | 35.8 / 77.2 | 48.4 / 70.3 | 56.8 / 75.3 | 56.7 / 80.4
freqmine | 30.3 / 78.5 | 39.0 / 68.1 | 43.6 / 77.0 | 43.5 / 80.6
raytrace | 25.7 / 80.7 | 36.3 / 70.6 | 38.9 / 79.8 | 38.9 / 81.7
streamcluster | 36.7 / 79.0 | 50.0 / 74.8 | 56.9 / 77.5 | 56.8 / 81.6
swaptions | 28.7 / 73.9 | 44.4 / 63.9 | 49.0 / 71.6 | 48.9 / 79.2
vips | 14.0 / 86.3 | 19.2 / 71.9 | 21.8 / 85.7 | 21.8 / 87.6
x264 | 25.2 / 74.4 | 34.5 / 64.6 | 38.1 / 72.8 | 38.0 / 79.2
average | 28.7 / 78.2 | 39.9 / 69.6 | 44.8 / 76.8 | 44.7 / 81.2

To further analyze the effect of row buffer prefetching, we first define the row buffer prefetch ratio, $r_{PF}$, as the fraction of RDB allocations caused by row buffer prefetches over the total number of RDB allocations. We also define a good row buffer prefetch ratio to evaluate the prediction accuracy of each scheme. A good row buffer prefetch means that the prefetched row data in an RDB is referenced more than once before it is evicted; otherwise, we consider it a bad row buffer prefetch. We define the good row buffer prefetch ratio, $r_{G.PF}$, as the fraction of good row buffer prefetches over the total number of row buffer prefetches. The $r_{PF}$ and $r_{G.PF}$ values are good indicators of the prediction accuracy of the proposed schemes.
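In symbols (our notation), with $N_{PF}$ the number of RDB allocations caused by row buffer prefetches, $N_{alloc}$ the total number of RDB allocations, and $N_{G.PF}$ the number of good row buffer prefetches as defined above:

$$r_{PF} = \frac{N_{PF}}{N_{alloc}}, \qquad r_{G.PF} = \frac{N_{G.PF}}{N_{PF}}.$$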

Table 4. Comparison of the prefetch ratio and the good prefetch ratio (%)

Applications | $TPRE$ ($r_{PF}$ / $r_{G.PF}$) | $MPRE$ ($r_{PF}$ / $r_{G.PF}$) | $MPRE+OW$ ($r_{PF}$ / $r_{G.PF}$)
blackscholes | 34.5 / 24.9 | 18.3 / 73.1 | 20.0 / 73.2
bodytrack | 35.0 / 17.4 | 14.1 / 53.9 | 15.1 / 53.7
facesim | 30.9 / 40.6 | 31.5 / 85.4 | 32.3 / 85.4
ferret | 33.0 / 30.7 | 25.0 / 81.2 | 26.9 / 81.1
freqmine | 32.3 / 20.0 | 16.7 / 69.8 | 17.7 / 69.9
raytrace | 30.6 / 20.0 | 14.8 / 71.7 | 15.4 / 71.6
streamcluster | 32.8 / 32.3 | 25.8 / 74.3 | 27.5 / 74.3
swaptions | 33.8 / 32.5 | 23.5 / 78.0 | 25.6 / 78.0
vips | 35.6 / 8.3 | 9.4 / 70.5 | 9.7 / 70.6
x264 | 35.2 / 17.4 | 15.5 / 72.7 | 16.7 / 72.5
average | 33.4 / 24.4 | 19.5 / 73.1 | 20.7 / 73.0

Fig. 7 compares the total execution times of $TPRE$, $MPRE$, and $MPRE+OW$, each normalized to that of the static optimum RDB configuration. $TPRE$ mostly shows a longer execution time than the static optimum configuration. As shown in Table 3, $TPRE$ successfully increases the RDB hit ratio in all applications. However, the aggressive prefetches in $TPRE$ increase the number of unnecessary evictions of row data with high temporal and spatial locality. For example, $TPRE$ frequently evicts RDBs that contain high-locality overlay window data; as a result, the RDB hit ratio for overlay window accesses is degraded, as shown in Table 3. Only in $bodytrack$, $raytrace$, and $swaptions$, which show lower R/W ratios and fewer memory accesses per 1K CPU cycles than the other benchmarks, does $TPRE$ reduce the memory access time. Overall, $TPRE$ increases the total execution time by 5.2% on average.

Fig. 7. Comparison of the total execution time (normalized to the static optimum RDB configuration).


Compared with $TPRE$, $MPRE$ is designed to minimize unnecessary row buffer prefetches by exploiting the history of memory access patterns. As shown in Tables 3 and 4, $MPRE$ significantly enhances both the RDB hit ratio and the good prefetch ratio for all applications. These enhancements translate directly into 2.4% to 21.6% reductions of the total execution time across the applications. On average, $MPRE$ reduces the total execution time by 8.0%.

Finally, $MPRE+OW$ reduces the total execution time even further than $MPRE$ by pinning RDBs dedicated to overlay window accesses. Compared to $MPRE$, $MPRE+OW$ shows higher reduction ratios for the applications with high R/W ratios, such as $blackscholes$ and $swaptions$, than for the applications with low R/W ratios, such as $facesim$, $raytrace$, and $vips$. Our analysis shows that $MPRE$ frequently evicts the RDBs that contain overlay window data even when a write request will arrive soon. $MPRE+OW$ efficiently prevents these unnecessary evictions, which leads to performance enhancements from 4.7% to 23.2%. In summary, $MPRE+OW$ reduces the total execution time by 12.2% on average compared to the static optimum RDB configuration.

3. Energy Consumption Evaluations

We also analyze the energy consumption of the proposed row buffer management schemes. Although the proposed row buffer prefetch schemes reduce the total execution time by hiding the row buffer activation time, they may consume additional energy when a prefetch is not a good one. To analyze the energy consumption, we model the energy consumption of LPDDR2-NVM devices based on the manufacturer's datasheet. Our energy model mainly focuses on the differences in the energy consumed during row and row buffer activation, which are the dominant sources of memory energy consumption.

Fig. 8. Comparison of the total execution energy (normalized to the static optimum RDB configuration).


Fig. 9. Sensitivity analysis by changing the size of RDB (normalized to the static optimum RDB configuration).


Fig. 8 shows the energy consumption of $TPRE$, $MPRE$, and $MPRE+OW$, normalized to that of the static optimum RDB configuration. The energy consumption of $TPRE$ is higher than that of the static optimum RDB configuration for all applications. This is expected: the additional energy of many unnecessary RDB prefetches and the increased execution time both increase the energy consumption. As shown in Table 4, only 24.4% of the total prefetches in $TPRE$ are classified as good prefetches. $MPRE$ also shows higher energy consumption than the static optimum RDB configuration for all applications, even though the total execution time is reduced; similar to $TPRE$, 26.9% of its prefetches are still useless. We find that this additional energy consumption slightly exceeds the benefit of the reduced execution time in most applications. Finally, $MPRE+OW$ shows slightly lower or almost the same energy consumption as the baseline configuration in most applications. Unlike $MPRE$, in $MPRE+OW$ the energy benefit of reducing the total execution time exceeds the energy overhead of unnecessary prefetches. In summary, managing RDBs proactively can reduce the total execution energy as well as the total execution time.

4. Sensitivity Analysis

In the experiments above, the physical size of an RDB is fixed based on the LPDDR2-NVM specification. Since the physical RDB size may significantly affect the memory access time and energy consumption, we analyze the total execution time and energy of the proposed schemes as the physical RDB size increases. We change the physical size of a single RDB while fixing the number of RDBs to 8 in all configurations. As shown in Fig. 9, even the simple $TPRE$ reduces the total execution time when the RDB size grows beyond 8 × 512 bytes. The improvement in total execution time for $MPRE$ and $MPRE+OW$ increases significantly until the physical RDB size reaches 8 × 512 bytes; beyond that point, the improvement saturates or slightly decreases. This means that the 8 × 512 bytes configuration gives the best performance.

As described previously, the proposed prefetch-based RDB management affects the energy consumption of the memory devices both positively and negatively. As the physical RDB size increases, the positive effect of reducing the total execution time grows, while the negative effect of additional energy consumption due to unnecessary prefetches also worsens. We observe that the positive effect exceeds the negative effect only in the 8 × 256 bytes configuration. This means that 8 × 256 bytes is the best configuration from an energy perspective, which differs from the best configuration from a performance perspective.

VI. CONCLUSIONS

The memory interface significantly affects the performance of a memory system, but it has been less addressed or overlooked. This paper focused on the role of the memory interface of LPDDR2-NVM compatible non-volatile memory devices because its mechanisms differ considerably from those of conventional LPDDR interfaces. Based on our observations, we proposed a proactive row buffer management scheme that enables logical reconfiguration of the row buffer architecture at runtime using prefetch techniques. Extensive evaluations using traces from a full-system simulator demonstrate that the proposed method enhances the performance and energy consumption of the memory system by 12.2% and 0.3% on average, respectively, compared to a design-time optimization technique, without any memory device modification.

ACKNOWLEDGMENTS

This work was supported by the 2018 Research Fund of University of Ulsan.

REFERENCES

[1] IDC, The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, EMC Digital Universe with Research & Analysis, April 2014.
[2] Zypryme, Global Smart Meter Forecasts, 2012-2020, Smart Grid Insights, November 2013.
[3] S. Raoux et al., Phase-Change Random Access Memory: A Scalable Technology, IBM Journal of Research and Development, Vol. 52, pp. 465-479, 2008.
[4] O. Zilberberg, S. Weiss, and S. Toledo, Phase-Change Memory: An Architectural Perspective, ACM Computing Surveys, Vol. 45, 2013.
[5] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, Architecting Phase Change Memory as a Scalable DRAM Alternative, ISCA, 2009.
[6] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, Scalable High-Performance Main Memory System Using Phase-Change Memory Technology, ISCA, 2009.
[7] JEDEC, Low-Power Double Data Rate 2 Non-Volatile Memory, JESD209-F, 2013.
[8] P. Clarke, Samsung preps 8-Gbit phase-change memory, EE Times, Nov. 2011.
[9] Y. Choi et al., A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth, ISSCC.
[10] H. Yoon et al., DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories, SAFARI Technical Report No. 2011-005, 2011.
[11] Z. Li, R. Zhou, and T. Li, Exploring High-Performance and Energy Proportional Interface for Phase Change Memory Systems, HPCA, 2013.
[12] J. Park et al., Accelerating Memory Access with Address Phase Skipping in LPDDR2-NVM, JSTS, Vol. 14, No. 6, pp. 741-749, 2014.
[13] J. Park, D. Shin, and H. G. Lee, Design Space Exploration of Row Buffer Architecture for Phase Change Memory with LPDDR2-NVM Interface, VLSI-SoC, 2015.
[14] J. Park, D. Shin, and H. G. Lee, Prefetch-Based Dynamic Row Buffer Management for LPDDR2-NVM Devices, VLSI-SoC, 2015.
[15] V. Srinivasan, E. S. Davidson, and G. S. Tyson, A Prefetch Taxonomy, IEEE Transactions on Computers, Vol. 53, No. 2, pp. 126-140, Feb. 2004.
[16] P. S. Magnusson et al., Simics: A Full System Simulation Platform, Computer, Vol. 35, No. 2, pp. 50-58, 2002.
[17] C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC Benchmark Suite: Characterization and Architectural Implications, PACT, 2008.

Author

Jaehyun Park

Jaehyun Park received the B.S. degree in Electrical Engineering and the Ph.D. degree in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2006 and 2015, respectively.

From 2009 to 2010, he was a Visiting Scholar with the University of Southern California, Los Angeles, CA.

He was an Exchange Scholar with School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ from 2015 to 2018.

He is currently an Assistant Professor at the School of Electrical Engineering, University of Ulsan, Ulsan, Korea.

Dr. Park received the 2007 and 2012 ISLPED Low Power Design Contest Awards and the 2017 (13th) ACM/IEEE ESWEEK Best Paper Award.

His current research interests include energy harvesting and management, low-power IoT systems and nonvolatile memory systems.

Donghwa Shin

Donghwa Shin received the B.S. degree in computer engineering and the M.S. and Ph.D. degrees in computer science and electrical engineering from Seoul National University, Seoul, South Korea, in 2005, 2007, and 2012, respectively.

He joined the Dipartimento di Automatica e Informatica, Politecnico di Torino, Turin, Italy, as a Researcher.

He is currently an Assistant Professor at the Department of Smart Systems Software, Soongsil University, Seoul, South Korea.

His research interests have covered system-level low-power techniques, and he is currently focusing on energy-aware neuromorphic computing.

Dr. Shin serves (and has served) as a reviewer for the IEEE Transactions on Computers, TCAD, TVLSI, ACM TODAES, TECS, and other journals.

He serves on the Technical Program Committee of IEEE and ACM technical conferences, including DATE, ISLPED, and ASP-DAC.

Hyung Gyu Lee

Hyung Gyu Lee received the Ph.D. degree in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2007.

He was a senior engineer with Samsung Electronics from 2007 to 2010.

He also worked as a research faculty with the Georgia Institute of Technology, Atlanta, GA, from 2010 to 2012.

Currently he is an associate professor with the School of Computer and Communication Engineering, Daegu University, Korea.

His research interests include embedded system design, low power system, and memory system design focusing on emerging non-volatile memory and storage technologies.

Also, energy harvesting and wearable IoT applications are his current research interests.

He received three best paper awards, from HPCC 2011, ESWEEK 2017, and ESWEEK 2019, and one design contest award from ISLPED 2014.