Ms. Khushbu Lalwani
Department of VLSI Engineering
JIT College, Lonara, Nagpur, India.
Prof. Mayuri Chawla
Department of VLSI Engineering
JIT College, Lonara, Nagpur, India.
Abstract—As designers and researchers strive for higher performance, field-programmable gate arrays (FPGAs) become an increasingly attractive solution, since they can provide application-specific acceleration that processors cannot match. In this paper, we report on the design of an efficient cache controller suitable for use in FPGA-based processors. Cache memory is a common structure in computer systems and plays an important role in microprocessor performance. Cache design is an optimization problem, concerned mainly with maximizing the hit ratio and minimizing the access time; the cache size, the number of words per block, and the latency all affect cache performance. Although semiconductor memory that can operate at speeds comparable to the processor exists, it is not economical to build the entire main memory from such high-speed memory. This work proposes an FPGA-based cache controller. We believe our design achieves low circuit complexity, low power consumption, and high speed in terms of FPGA resource usage.
KEYWORDS—Cache memory, main memory, processor, cache controller, FPGA
I. INTRODUCTION
Any embedded system contains both on-chip and off-chip memory modules with different access times. During system integration, the decision to map critical data onto faster memories is crucial: to obtain good performance with a small memory budget, the data buffers of the application must be placed carefully across the different types of memory. Substantial research effort has gone into improving the performance of the memory hierarchy. Recent advancements in semiconductor technology have also made power consumption a limiting factor for embedded system design. Being faster than DRAM, the cache memory is placed between the CPU and the main memory, and the CPU can access the main memory only via the cache. Cache memories are employed alongside processors in virtually all computing applications. The size of cache that can be included on a chip is limited by the large physical size and power consumption of the memory cells used to build it. Hence, configuring the cache for small size and low power consumption is crucial in embedded system design.
We present an optimal cache configuration technique for effective size reduction and high performance. The proposed methodology was tested on real hardware using an FPGA and validated with a matrix multiplication algorithm over workloads of various sizes. Xilinx ISE 9.2i was used for simulation and synthesis, and the design was implemented in VHDL.
The cache memory allows faster access than DRAM, but at the expense of larger energy consumption per access. The cache is transparent to the application executing on the CPU and is included to exploit the spatial and temporal locality exhibited by the application's memory access behavior.
II. SYSTEM ARCHITECTURE OVERVIEW
Cache memory is fast but quite small; it is used to store small amounts of data that have been accessed recently and are likely to be accessed again soon in the future. Data is stored here in blocks, each containing a number of words. To keep track of which blocks are currently stored in Cache, and how they relate to the rest of the memory, the Cache Controller stores identifiers for the blocks currently stored in Cache. These include the index, tag, valid and dirty bits, associated with a whole block of data. To access an individual word inside a block of data, a block offset is used as an address into the block itself. Using these identifiers, the Cache Controller can respond to read and write requests issued by the CPU, by reading and writing data to specific blocks, or by fetching or writing out whole blocks to the larger, slower Main Memory. Figure 1 shows a block diagram for a simple memory hierarchy consisting of CPU, Cache (including the Cache controller and the small, fast memory used for data storage), Main Memory Controller and Main Memory proper.
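The lookup described above can be sketched as a small behavioral model. Python is used here purely for illustration; the class and function names are assumptions for the sketch, not the controller's actual signals:

```python
# Behavioral sketch of a cache lookup: each block stores a tag plus
# valid and dirty bits; a hit requires a valid block with a matching tag.

class CacheBlock:
    def __init__(self):
        self.tag = None
        self.valid = False
        self.dirty = False
        self.data = bytearray(32)  # 32 one-byte words per block

def lookup(blocks, index, tag):
    """Return True on a cache hit for the given index/tag pair."""
    block = blocks[index]
    return block.valid and block.tag == tag

blocks = [CacheBlock() for _ in range(8)]
blocks[3].tag, blocks[3].valid = 0xA5, True
print(lookup(blocks, 3, 0xA5))  # hit -> True
print(lookup(blocks, 3, 0x5A))  # tag mismatch -> False
```

The block offset plays no role in the hit decision itself; it only selects the word within the block once the lookup succeeds.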
III. RELATED WORK
The demand for energy-efficient computer architectures requires designers to tune processor parameters to avoid excessive energy wastage. Tuning on a per-application basis allows greater energy savings without noticeable performance degradation. On-chip caches often consume a significant fraction of the total energy budget and are therefore prime candidates for adaptation.
- Review on Performance of Static Random Access Memory
Santhiya V. and Mathan N. state that one of the most widely adopted methods is to lower the supply voltage, together with techniques based on replica circuits that minimize the effect of operating-condition variability on speed and power. In their paper, different static random access memories are designed to satisfy low-power, high-performance requirements, and an extensive survey of the features of various SRAM designs is reported.
- Impact of Design and Stability Parameters on Low Power Cache Memory Performance
Prashant Upadhyay and Ajay Kumar Yadav examine the effect of temperature and supply voltage (VDD) on the stability parameters of SRAM, namely the Static Noise Margin (SNM), Write Margin (WM), and Read Margin (RM). The effects were observed using Cadence PSpice version 16.2; temperature, along with VDD, has a significant effect on stability.
- Effective Cache Configuration for High Performance Embedded Systems
Srilatha, Guru Rao, and Prabhu G. Benakop describe a way-prediction scheme in which a mask with two ones indicates two possible hit ways, and similarly for the other cases. They observe that the 8-bit function is more accurate than the 4-bit function, and the accuracy of the miss prediction is noteworthy: in matrix multiplication the miss rate is 43.8% of 1 million accesses, and the predicted rates are 43.7% with the XOR function and 39.53% without it. They also find the 4-bit XOR function more efficient than the 8-bit simple-checksum function. Another observation concerns hits: the 8-bit XOR function predicts the single correct way (a one-hot mask) with high probability, giving a 4-way associative cache the power consumption of a direct-mapped cache without actually converting it to a phased cache, even though it behaves like one. These reasons led to the choice of the 8-bit XOR-based function for the implementation.
The aim of this project is to design and implement a simplified Cache Controller for a hypothetical computer system. The Cache for this computer system will store a total of 256 bytes, organized into blocks of 32 words each; each word will have 1 byte. Figure 2 shows the organization of the Cache.
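The stated geometry pins down the rest of the address arithmetic; a quick sanity check (Python used only for illustration):

```python
# Sanity check of the cache geometry: 256 bytes total, 32-byte blocks.
CACHE_BYTES = 256
BLOCK_BYTES = 32                      # 32 words of 1 byte each
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES
print(NUM_BLOCKS)                     # -> 8 blocks
print(NUM_BLOCKS.bit_length() - 1)   # -> 3 index bits
print(BLOCK_BYTES.bit_length() - 1)  # -> 5 offset bits
```

These widths (3-bit index, 5-bit offset) reappear in the Address Word Register layout described below in Section A.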
As part of its operation, the CPU will issue read and write requests to the memory system. As shown in Figure 1, the CPU only interacts directly with the Cache. The Cache Controller must determine what action needs to be performed based on the current contents of the Cache Memory, as well as the identifiers associated with each cache block.
The complete structure of the target system is shown in Figure 3 below. The system consists of the CPU, an SDRAM controller, and the cache acting as the intermediary between the two. The cache itself consists of a controller and a local SRAM block, connected as shown. The project aim is to design the controller and, using it, assemble a complete cache based on the Block RAM memory module.
A. CPU Address Word Organization
The CPU issues 16-bit address words. The address word is received by the Cache Controller and stored in the Address Word Register (AWR). The AWR consists of three fields matching the cache organization (see Figure 2): the cache memory stores 8 blocks, each containing 32 one-byte words. Therefore, the Index field consists of 3 bits and the Offset field of 5 bits; the upper 8 bits of the address word form the Tag. Figure 3 illustrates the AWR organization.
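The AWR field split can be expressed directly as bit operations (a minimal sketch; the function name is ours, not part of the design):

```python
# Decompose a 16-bit CPU address into the AWR fields described above:
# Tag = upper 8 bits, Index = next 3 bits, Offset = lower 5 bits.

def split_address(addr):
    assert 0 <= addr <= 0xFFFF, "address must fit in 16 bits"
    offset = addr & 0x1F          # bits [4:0], word within the block
    index = (addr >> 5) & 0x7     # bits [7:5], selects one of 8 blocks
    tag = (addr >> 8) & 0xFF      # bits [15:8], identity of the block
    return tag, index, offset

print(split_address(0xABCD))  # -> (171, 6, 13), i.e. (0xAB, 0b110, 0b01101)
```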
B. Cache Controller Behavior
The aim of the project is to design a logic circuit that responds appropriately to read and write requests made by the CPU. The actions available to the cache controller are reading or writing the local (cache) memory, fetching whole data blocks from main memory, or writing whole blocks out to main memory. The controller decides which action to take based on the indicators it stores internally, which describe the status of each block in the cache memory. Specifically, the controller must compare the tag portion of the address with the tags stored for each of the eight blocks, and check the status of the dirty and valid bits of the targeted block.
- Write a word to cache [hit]: The cache receives a write request from the CPU. In this situation, the request made by the CPU is found in the cache, thus being a cache hit. The index and offset portion of the address supplied by the CPU is to be sent to the local SRAM as the write address for the new data. The dirty and valid bits associated with the targeted block must be set to 1, and the data must be written to the local SRAM memory.
- Read a word from cache [hit]: The cache receives a read request from the CPU. In this situation, the request made by the CPU is found in the cache, thus being a cache hit. The index and offset portion of the address supplied by the CPU is sent to the local SRAM and the read data is routed back to the CPU.
- Read/Write from/to cache [miss] and dirty bit = 0: A read or write request is received, and the associated block is not found in the cache, requiring a block replacement. First, the dirty bit for the corresponding CPU address is examined. If the dirty bit is 0, the CPU address with its offset portion set to “00000” is passed to the SDRAM memory controller as the base address of the block to be read from memory; the full block (all 32 bytes) is read from the SDRAM memory controller and written to the local cache SRAM. The tag value from the CPU address replaces the value in the corresponding tag register, and the valid bit is set to 1. Finally, the requested read or write operation is performed following the procedure outlined for the hit cases above.
- Read/Write from/to cache [miss] and dirty bit = 1: A read or write request is received, and the associated block is not found in the cache (a miss), which leads to a block replacement. Again, the dirty bit for the corresponding CPU address is examined. If the dirty bit is 1, the currently resident (and soon to be replaced) block must first be written back to main memory: the data block in local SRAM is written out via the SDRAM memory controller to the base address [Tag & Index & 00000], using the tag stored for that block.
Next, the new block (determined by the address requested by the CPU) must be copied into the cache. This requires reading from the SDRAM memory controller at the address [Tag & Index & 00000], where the Tag is now the Tag field of the AWR, i.e., from the address most recently issued by the CPU. The full block read from the SDRAM memory controller is written to the local cache SRAM, and the original tag associated with this block is replaced with the new tag issued by the CPU. Once this process is complete, the transaction originally requested by the CPU can be completed, following the procedure outlined for the hit cases above.
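The four cases above can be condensed into one behavioral sketch. This is an illustrative software model under our own naming (Block, block_base, access), not the synthesized controller; main memory is modeled as a flat byte array:

```python
# Behavioral sketch of the hit/miss cases: on a miss, the victim block is
# written back if dirty, then the requested block is fetched and the
# original read or write completes as in the hit case.

class Block:
    def __init__(self):
        self.tag, self.valid, self.dirty = 0, False, False
        self.data = bytearray(32)

def block_base(tag, index):
    # Base address [Tag & Index & 00000] used for whole-block transfers.
    return (tag << 8) | (index << 5)

def access(blocks, main_mem, addr, write=False, data=0):
    tag, index, offset = (addr >> 8) & 0xFF, (addr >> 5) & 0x7, addr & 0x1F
    block = blocks[index]
    if not (block.valid and block.tag == tag):      # miss: block replacement
        if block.dirty:                             # dirty: write back first
            base = block_base(block.tag, index)
            main_mem[base:base + 32] = block.data
        base = block_base(tag, index)               # fetch the new block
        block.data = bytearray(main_mem[base:base + 32])
        block.tag, block.valid, block.dirty = tag, True, False
    if write:                                       # finish as a normal hit
        block.data[offset] = data
        block.dirty = True
    return block.data[offset]

main_mem = bytearray(range(256)) * 256              # 64 KiB main memory
blocks = [Block() for _ in range(8)]
access(blocks, main_mem, 0x0123, write=True, data=0x7E)  # miss, fill, write
print(access(blocks, main_mem, 0x0123))             # read hit -> 126
```

Note how an access to a different tag with the same index (e.g. 0x0223 after 0x0123) evicts the dirty block and writes it back to its old base address before the fill.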
C. Interface Specifications
Below we describe the three interfaces the cache controller interacts with: the CPU, the SDRAM controller, and the local SRAM (Block RAM). In each case, the type of information provided through the interface and the interface timing are described.
The CPU periodically issues read or write transaction requests. The CPU interface consists of a strobe CS, a read/write indicator WR/RD, a 16-bit address ADD, 8-bit data input and output ports DIN and DOUT, and a ready indicator input RDY. The CPU is assumed to be synchronous with the cache controller and shares the same clock signal, CLK. All interface ports are shown in Figure 5.
When the CPU issues a transaction, it first sets the appropriate address on the address bus and sets the read/write indicator to the correct value; if a read is being issued, the WR/RD signal is low (0), whereas if a write is issued, the signal is high (1). Finally, if a write is performed, the appropriate data is also set on the DOUT port. Once all transaction signals are stable, the strobe CS is asserted, and stays asserted for 4 clock cycles. Figure 6 below shows an example of a write request.
The ready indicator RDY is used by the Cache Controller to signal to the outside world when a transaction is complete. When idle, the cache controller should keep RDY asserted (indicating it is ready to accept transactions). Once a transaction is received, RDY should be de-asserted and kept low until the requested operation is complete.
E. Local RAM
The Block RAM memory used to implement the local memory for the cache has the interface shown in Figure 10 (based on Tutorial 3). It consists of an 8-bit address ADD, 8-bit data input and output DIN and DOUT, and a write enable signal WEN. Finally, the Block RAM shares the same clock CLK as the Cache Controller.
All operations issued to the Block RAM are synchronized to the rising edge of the clock. Read operations are performed by setting the appropriate address on the address bus; the addressed data propagates to the output DOUT after the next rising edge of the clock. Write operations are performed by setting an address and data on the appropriate ports and asserting the write enable signal WEN; on the next rising edge, the data is written to the specified address. Read and write operations are shown in Figures 11 and 12 below.
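The clocked behavior above can be modeled with a per-edge update function. This is a minimal sketch assuming write-first read-during-write behavior (the actual Block RAM mode depends on configuration):

```python
# Minimal clocked model of the synchronous Block RAM described above:
# both reads and writes take effect on the (simulated) rising clock edge.

class BlockRAM:
    def __init__(self, size=256):
        self.mem = bytearray(size)
        self.dout = 0

    def rising_edge(self, add, din=0, wen=False):
        # On each rising edge: write DIN when WEN is asserted, and register
        # the data at ADD onto DOUT for the following cycle (write-first).
        if wen:
            self.mem[add] = din
        self.dout = self.mem[add]

ram = BlockRAM()
ram.rising_edge(add=0x10, din=0x55, wen=True)  # write cycle
ram.rising_edge(add=0x10)                      # read cycle
print(hex(ram.dout))  # -> 0x55
```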
In this paper, we have presented the design of a cache memory with a cache controller on FPGA, including cache-miss detection. Such an approach is of great utility to many modern embedded applications, for which both high performance and low power are important. The cache memory and cache controller may be used in FPGA-based processors. The best way to improve performance and energy efficiency is to achieve fast, low-energy access at each level of the memory hierarchy and to concentrate memory accesses on the level closest to the processor. We have compared the new design to existing designs through software simulation of the VHDL implementation.
REFERENCES
- Santhiya V. and Mathan N., “Review on Performance of Static Random Access Memory,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 4, issue 2, February 2015.
- Vipin S. Bhure and Praveen R. Chakole, “Design of Cache Controller for Multi-core Processor System,” International Journal of Electronics and Computer Science Engineering, ISSN 2277-1956, November 2014.
- Yogesh Watile and A. S. Khobragade, “FPGA Implementation of Cache Memory,” International Journal of Engineering Research and Applications (IJERA), ISSN 2248-9622, vol. 3, issue 3, May–June 2013.
- Yogesh S. Watile and A. S. Khobragade, “Design of Cache Memory with Cache Controller Using VHDL,” International Journal of Innovative Research in Science, Engineering and Technology, vol. 2, issue 7, July 2013.
- Jianwei Dai and Lei Wang, “An Energy-Efficient L2 Cache Architecture Using Way Tag Information Under Write-Through Policy,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, January 2013.
- Vipin S. Bhure and Dinesh Padole, “Design of Cache Controller for Multi-core Systems Using Multilevel Scheduling Method,” Fifth International Conference on Emerging Trends in Engineering and Technology, 2012.
- Jongsok Choi, Kevin Nam, Andrew Canis, Jason Anderson, Stephen Brown, and Tomasz Czajkowski, “Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems,” 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.
- Jan C. Kleinsorge, Sascha Plazar, and Peter Marwedel, “WCET-aware static locking of instruction caches,” in Proceedings of the 2012 International Symposium on Code Generation and Optimization, pp. 44–52, 2012.
- Qutaiba Ibrahim, “Design & Implementation of High Speed Network Devices Using SRL16 Reconfigurable Content Addressable Memory (RCAM),” International Arab Journal of e-Technology, vol. 2, no. 2, June 2011.
- M. Arun and A. Krishnan, “Comparative Power Analysis of Pre-computation Based Content Addressable Memory,” Journal of Computer Science, vol. 7, no. 4, pp. 471–474, 2011.
- Nawaf Almoosa, Yorai Wardi, and Sudhakar Yalamanchili, “Controller Design for Tracking Induced Miss-Rates in Cache Memories,” 2010 8th IEEE International Conference on Control and Automation, Xiamen, China, June 9–11, 2010.
- A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers, “CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures,” in Field Programmable Logic and Applications, 2008.
- L. Chen, X. Zou, J. Lei, and Z. Liu, “Dynamically reconfigurable cache for low-power embedded system,” in ICNC, 2007.
- A. Gordon-Ross, F. Vahid, and N. Dutt, “Fast configurable-cache tuning with a unified second-level cache,” in Proceedings of the 2005 International Symposium on Low Power Electronics and Design (ISLPED ’05).