Abhijit Gadge#1, Prof. Supratim Saha*2, Prof. Parag Jawarkar#3
#Department of Electronics & Telecommunication Engineering,
Tulsiramji Gayakwad Patil College of Engineering
Network on chip (NoC) is a paradigm for enabling efficient communication between processing elements in next-generation systems on chip made up of tens to hundreds of intellectual-property components. The network is composed of routers and links; the router, in turn, is made up of packet queues, a crossbar, and a scheduler. In this brief, we propose a folded, pipelined architecture for the round-robin scheduler used in the NoC router.
Keywords— network on chip, router, scheduler, round robin, arbitration, pipelining
Crossbar switches are the common building blocks for Internet routers, data-center and HPC interconnects, and on-chip networks [1, 2, 3]. The core switching fabric often has no buffers, saving memory area and buffer speed. Arriving packets issue requests to a central scheduler and get switched upon scheduler grants; meanwhile, packets wait at input packet buffers in front of the crossbar. In order to isolate traffic flows and provide the basis for proper congestion management, these input buffers must be organized in per-flow queues, forming what is widely known as virtual-output-queuing (VOQ) [4, 5] crossbars, as shown in Fig. 1.
The speed/switching-efficiency tradeoff of a VOQ crossbar critically depends on the design and implementation of its crossbar scheduler. Most commercial crossbars today rely on independent per-input and per-output round-robin (RR) arbiters [6, 7], which yield maximal matchings after a few rounds of handshaking. The basic time complexity of these scheduling algorithms is approximately equal to that of two programmable priority arbiters, and it increases linearly with the number of iterations; hence, iterations normally come with a port-speed penalty. Furthermore, although iterations improve the delay performance (as measured under uniform traffic), they do not improve switch throughput under unfavorable, non-uniform traffic. In those cases, speedup is usually employed to recover the missing throughput; speedup, however, seriously affects the energy consumption and the effective capacity of switching systems.
The Scheduler acts as the central switch arbiter. It analyzes the occupied Virtual Output Queues of each input_block and configures the input_blocks and interconnect muxes to connect inputs to outputs and allow serial data transfer across the switch. The scheduling algorithm attempts to achieve a large number of simultaneous connections while avoiding conflicts in which multiple inputs connect to a single output or a single input connects to multiple outputs. The scheduling algorithm chosen is a modified i-round robin scheduler, whose base specification comes from earlier work on iterative round-robin scheduling. The i-round robin algorithm is derived from the round robin algorithm, which is in turn an improvement upon the Round-Robin Matching algorithm. Each of these algorithms assumes an N-input by N-output cell switch with input queuing. To alleviate head-of-line blocking at the input queues, each input maintains a separate queue for each possible output destination. As shown in Section 3.2, these queues are implemented as Virtual Output Queues utilizing a shared memory bank. The goal of the scheduling algorithm is to match input queues containing waiting packets with output queues so as to achieve maximum throughput while maintaining stability and eliminating starvation. The round robin algorithm matches inputs to outputs in a single iteration; after this iteration, however, several input and output ports may remain unutilized. The i-round robin algorithm uses multiple iterations to find paths that utilize as many input and output ports as possible (pseudo-maxsize matching), continuing until it converges and finds no more possible matches. The single-iteration round robin algorithm is thus a specialization of i-round robin and may be characterized as i-round robin with only a single iteration, or 1-round robin.
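The iterative matching described above can be sketched in software as follows. This is a behavioral model, not the paper's RTL; the names (requests, grant_ptr, accept_ptr) and the pointer-update rule (update only on first-iteration accepts, as in iSLIP-style schedulers) are illustrative assumptions.

```python
# Behavioral sketch of the i-round robin matching loop, assuming an
# N x N switch. One call models one time slot of arbitration.

def i_round_robin(requests, grant_ptr, accept_ptr, iterations=3):
    """requests[i][j] is True if input i has a packet for output j.
    Returns a dict mapping each matched input to its output."""
    n = len(requests)
    match = {}                      # input -> output
    matched_out = set()
    for it in range(iterations):
        # Grant phase: each unmatched output offers a grant to the first
        # requesting, unmatched input at or after its round-robin pointer.
        grants = {}                 # input -> list of granting outputs
        for j in range(n):
            if j in matched_out:
                continue
            for k in range(n):
                i = (grant_ptr[j] + k) % n
                if i not in match and requests[i][j]:
                    grants.setdefault(i, []).append(j)
                    break
        if not grants:
            break                   # converged: no further matches possible
        # Accept phase: each granted input accepts the output nearest
        # its own round-robin pointer.
        for i, outs in grants.items():
            j = min(outs, key=lambda o: (o - accept_ptr[i]) % n)
            match[i] = j
            matched_out.add(j)
            if it == 0:             # update pointers only on accepts made
                grant_ptr[j] = (i + 1) % n   # in the first iteration
                accept_ptr[i] = (j + 1) % n
    return match
```

A matched input never accepts a second output and a matched output never grants again, so the result is conflict-free by construction.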
Figure 2 shows the i-round robin scheduler implementation chosen for this design. The input to the scheduler is the occupancy vector from each of the input_blocks with packets waiting to be scheduled. There are 8 such vectors (one per crossbar input port), each with 8 bits (one per destination per input). This 8×8 state is used as the request vector into the iterative arbitration logic. The scheduler also contains 8 Grant Arbiters and 8 Accept Arbiters. The Grant and Accept Arbiters each consist of programmable priority encoders and a state pointer that records which input should have the highest priority on the next arbitration cycle. The 8-bit feedback signal from the Decision registers to the Grant Arbiters is an enable vector, which enables arbitration only for unmatched ports on each successive iteration. Finally, after a number of iterations, the scheduler arrives at a final scheduling solution, which it outputs to each of the input_blocks (indicating which destination the data block has been scheduled to transmit a packet for) and each interconnect mux (indicating which input_block it is receiving data from). As an enhancement to the original i-round robin algorithm, this scheduler also includes an 8-bit busy input from each of the switch outputs. A busy signal is asserted when that output does not have enough downstream credit to send another transfer; while asserted, that output port is disabled from Grant arbitration.
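The busy-output masking can be sketched as a bitwise operation on the request state; `mask_busy_outputs` and its arguments are illustrative names for this model, not signals from the paper's design.

```python
# Sketch of the busy masking: an output whose busy bit is set
# (insufficient downstream credit) is removed from Grant arbitration
# by clearing the corresponding bit in every input's request vector.

def mask_busy_outputs(requests, busy):
    """requests is a list of 8-bit occupancy vectors, one per input;
    bit j of `busy` disables output j for this arbitration cycle."""
    enable = (~busy) & 0xFF            # bit j set -> output j may grant
    return [r & enable for r in requests]

# e.g. output 0 is out of credit, so its column is cleared everywhere:
masked = mask_busy_outputs([0b11110001] * 8, busy=0b00000001)
print(bin(masked[0]))                  # -> 0b11110000
```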
The arbiters are an important piece of the scheduler design [10, 11, 12]. The Grant Arbiters and Accept Arbiters are identically designed with the exception of the rules determining when the priority state may be updated.
Figure 6 illustrates the arbiter design chosen for this implementation. The arbiter is based on a simple round-robin arbiter, with the exception that it also includes an update_enable signal that allows the i-round robin algorithm to update the priority only under certain circumstances. This limited updating desynchronizes the Grant and Accept arbiters, improving traffic fairness and reducing undesirable bursting.

The scheduler unit was verified using a combination of directed input stimulus and assertions. The inputs to the scheduler are the occupancy vectors of each of the input_blocks. For unit-level testing, short test cases that exercised the occupancy vectors were used to stimulate the scheduler inputs. The scheduling decisions were partially checked using assertion blocks to verify that they conformed to an all-0 or one-hot encoding. For the unit-level scheduler testbench, no checking was done to verify that a correct schedule was determined given the current occupancy vectors; that checking was deferred to the system-level simulations.
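A behavioral model of one such arbiter is sketched below, assuming the conventional round-robin semantics described above: the programmable priority pointer advances to one past the granted port, but only when update_enable is asserted. The class and signal names are illustrative.

```python
# Sketch of a round-robin arbiter with a programmable priority pointer
# and an update_enable input. Bit i of `req` is a request from port i;
# the return value is a one-hot grant, or 0 when there are no requests.

class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.ptr = 0                 # port currently holding highest priority

    def arbitrate(self, req, update_enable):
        for k in range(self.n):
            i = (self.ptr + k) % self.n
            if (req >> i) & 1:
                if update_enable:    # conditional update, per i-round robin
                    self.ptr = (i + 1) % self.n
                return 1 << i        # one-hot grant
        return 0                     # all-0: no request, no grant
```

With update_enable held low the pointer is frozen, so repeated arbitration of the same request vector keeps granting the same port; with it high, the grant rotates among requesters, which is the fairness property the scheduler relies on.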
We designed the scheduler to run at a clock speed of 175 MHz, and each time slot consists of nine clock cycles. The scheduler thus has approximately 51 ns to complete three iterations of round robin. A novel pipelining scheme allows the scheduler to overlap the accept phase of one iteration with the request-grant phase of the next. As a result, i iterations require only 2i + 2 clock cycles instead of the 4i clock cycles they would require without pipelining. With three iterations, this saves four clock cycles. Figure 2 shows the pipelining scheme. Each phase consumes two clock cycles, and the scheduler uses the last clock cycle to update the round-robin pointers and prepare for the next time slot. At first glance, it appears that the scheduler must carry out the three iterations of the scheduling algorithm sequentially. However, a simple observation allows us to start an iteration's grant phase before the previous iteration's accept phase completes.
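The cycle arithmetic above can be checked directly: each of the two phases per iteration costs two cycles, pipelining overlaps adjacent phases to give 2i + 2 cycles, and nine cycles at 175 MHz span roughly 51 ns.

```python
# Quick check of the scheduler's timing budget described in the text.

CLOCK_MHZ = 175
CYCLES_PER_SLOT = 9

def unpipelined_cycles(i):
    return 4 * i            # 2 cycles (request-grant) + 2 cycles (accept) per iteration

def pipelined_cycles(i):
    return 2 * i + 2        # accept of iteration k overlaps grant of iteration k+1

slot_ns = CYCLES_PER_SLOT * 1000 / CLOCK_MHZ
print(pipelined_cycles(3), unpipelined_cycles(3), round(slot_ns, 1))
# -> 8 12 51.4
```

Three iterations therefore fit in 8 of the 9 cycles per time slot, leaving the final cycle for pointer updates, as the text describes.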
Before an iteration's grant phase, the scheduler must mask off those requests accepted in the previous iteration. So we might expect that the scheduler would have to wait for completion of the accept phase to determine whether to consider a request in the next iteration. However, because we know that an input that receives at least one grant will definitely be matched, we can simply OR together all of its grants. This way, we can start the next iteration before the accept phase has completed.
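The masking trick above reduces to a single OR over the grant vectors; the representation here (bit i of grants[j] meaning output j granted input i) is an illustrative encoding, not the paper's exact signal layout.

```python
# An input holding at least one grant is certain to be matched, so the
# "already matched" mask for the next iteration is just the OR of all
# outputs' grant vectors -- available before the accept phase finishes.

def next_iteration_mask(grants):
    matched_inputs = 0
    for g in grants:          # OR together every output's grant vector
        matched_inputs |= g
    return matched_inputs     # bit i set -> input i need not be considered

# e.g. outputs 0 and 2 granted inputs 1 and 3 respectively:
mask = next_iteration_mask([0b0010, 0, 0b1000, 0])
print(bin(mask))              # -> 0b1010
```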
In this paper, we have presented a pipelined architecture for a round-robin scheduler to be used in the routing node of a network on chip. The pipelined implementation incorporates a folded architecture, which allows the scheduler to be constructed using only half the number of arbiters used in a non-pipelined implementation. The simulation results show that folding and pipelining incur a slight speed penalty, which can be overcome by running the scheduler at a higher clock rate.
- Abdelrazag Imbewa and Mohammed A. S. Khalid, "FLNR: A Fast Light-Weight NoC Router for FPGAs," IEEE International Symposium on System on Chip, 2013.
- N. McKeown, "Scheduling Algorithms for Input-Queued Cell Switches," Ph.D. thesis, University of California at Berkeley, 1995.
- N. McKeown, M. Izzard, A. Mekkittikul, B. Ellersick, and M. Horowitz, "The Tiny Tera: A Packet Switch Core," IEEE Micro, vol. 17, pp. 26-33, Jan.-Feb. 1997.
- P. Wijetunga, "High-performance crossbar design for system-on-chip," Proc. 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications, pp. 138-143, 30 June-2 July 2003.
- H. J. Shin and D. A. Hodges, "A 250-Mbit/s CMOS crosspoint switch," IEEE Journal of Solid-State Circuits, vol. 24, no. 2, pp. 478-486, Apr. 1989.
- P. Gupta and N. McKeown, "Designing and implementing a fast crossbar scheduler," IEEE Micro, vol. 19, no. 1, pp. 20-28, Jan./Feb. 1999.
- J. Pape, "Implementation of an On-chip Interconnect Using the i-SLIP Scheduling Algorithm: Intermediate Report – Specification and Timeline," Nov. 2006.
- N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison-Wesley, 2004.
- L. Mhamdi, C. Kachris, and S. Vassiliadis, "A reconfigurable hardware based embedded scheduler for buffered crossbar switches," Proc. 14th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '06), Monterey, CA, Feb. 2006, pp. 143-149.
- K.-Y. K. Chang, S.-T. Chuang, N. McKeown, and M. Horowitz, "A 50 Gb/s 32×32 CMOS crossbar chip using asymmetric serial links," Symposium on VLSI Circuits, Digest of Technical Papers, pp. 19-22, 1999.
- A. Brinkmann, J.-C. Niemann, I. Hehemann, D. Langen, M. Porrmann, and U. Ruckert, "On-chip interconnects for next generation system-on-chips," Proc. 15th Annual IEEE International ASIC/SOC Conference, pp. 211-215, 25-28 Sept. 2002.
- C. Partridge, P. P. Carvey, E. Burgess, I. Castineyra, T. Clarke, L. Graham, M. Hathaway, P. Herman, A. King, S. Kohalmi, T. Ma, J. Mcallen, T. Mendez, W. C. Milliken, R. Pettyjohn, J. Rokosz, J. Seeger, M. Sollins, S. Storch, B. Tober, and G. D. Troxel, "A 50-Gb/s IP router," IEEE/ACM Transactions on Networking, vol. 6, no. 3, pp. 237-248, Jun. 1998.
- P. Guerrier and A. Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections," Proc. Design Automation and Test in Europe (DATE), 2000.
- A. Lines, "Nexus: an asynchronous crossbar interconnect for synchronous system-on-chip designs," Proc. 11th Symposium on High Performance Interconnects, pp. 2-9, 20-22 Aug. 2003.
- S. Q. Zheng, M. Yang, and F. Masetti-Placci, "Constructing schedulers for high-speed, high-capacity switches/routers," International Journal of Computer Applications, vol. 25, no. 4, pp. 264-271, 2003.
- S.-Q. Zheng, M. Yang, and F. Masetti, "Hardware Scheduling in High-speed, High-capacity IP Routers," Proc. IASTED PDCS, pp. 631-636, 2002.