A Pipelined Scheduler Architecture for Network on Chip
Volumn 2

A Pipelined Scheduler Architecture for Network on Chip

Abhijit Gadge#1, Prof. Supratim Saha*2, Prof. Parag Jawarkar#3

#Electronics & Telecommunication Department Of Engineering,

Tulsiramji Gayakwad Patil College Of Engineering

RTMNU, Nagpur,India


2 jawarkar@gmail.com

3 supratimsaha2012@gmail.com


Network on chip is a paradigm for enabling efficient communication between processing elements in next generation systems on chip made up of tens and hundreds of intellectual property components.  The network is composed of routers and links.  The router in turn is made up of packet queues, crossbar, and scheduler.  In this brief, we propose a folded pipelined architecture to the round robin scheduler used in the NoC router.

Keywords— network on chip, router, scheduler, round robin, arbitration, pipelining


Crossbar switches are the common building blocks for In- ternet  routers,  data-center  and  HPC  interconnects  and  on- chip  networks [1, 2, 3]. The  core  switching  fabric  often  has  no buffers,  saving in this way memory area and buffer speed. Arriving packets issue requests to a central scheduler,  and get switched upon scheduler grants; meanwhile, packets wait at input packet buffers, in front of the crossbar.  In order to isolate traffic flows1 , and provide the basis for proper con- gestion management, these input buffers must be organized in per-flow queues, forming what is widely known as virtual- output-queuing (VOQ) [4, 5] crossbars, as shown in Fig. 1.

Figure 1: Crossbar switch based router

The speed/switching efficiency tradeoff of a VOQ cross- bar critically depends on the design and implementation of its crossbar scheduler. Most commercial crossbars today rely on independent, per-input and per-output round-robin (RR) arbiters [6, 7], that yield maximal matchings after a few rounds of handshaking.  The basic time complexity of these schedul- ing  algorithms  is  approximately  equal  to  that  of  two  pro- grammable priority arbiters [1], and increases linearly with the  number  of  iterations;  hence,  iterations  normally  come along with a port speed penalty.  Furthermore, although it- erations  improve  the  delay  performance  –as  measured  un- der  uniform  traffic–,  they  do  not  improve  switch  through- put under unfavorable, non-uniform traffic.  In those cases, speedup is usually employed to cover the missing through- put;  speedup however seriously affects the energy and the effective capacity of switching systems.


The Scheduler acts as the central switch arbiter.  It analyzes the occupied Virtual Output  Queues of each input_block and configures the input_blocks and interconnect muxes to  connect inputs to outputs and allow serial data transfer across the switch.  The scheduling  algorithm attempts to achieve a large number of simultaneous connections, but also  avoids conflicts of multiple inputs connecting to a single output or a single input  connecting to multiple outputs.  The scheduling algorithm chosen is a modified i-round robin  [1] scheduler. The base i-round robin specification comes from [1].  The i-round robin  algorithm is derived from the round robin algorithm[1], which is an improvement upon the  Round-Robin Matching algorithm.  Each of these algorithms assumes an N-input by N- output cell switch with input queuing.  To alleviate head-of-line blocking at the input  queues, each input maintains a separate queue for each possible output destination.  As  shown in Section 3.2, these queues are implemented as Virtual Output Queues utilizing a  shared memory bank.  The goals for the scheduling algorithm is to match input queues  containing waiting packets with output queues to achieve the maximum throughput while  maintaining stability and eliminating starvation.   The round robin algorithm matches inputs to outputs in a single iteration; however, after this  iteration, several possible input and output ports may remain unutilized.  The i-round robin  algorithm uses multiple iterations to find paths to utilize as many input and output ports  as possible (pseudo-maxsize matching) until it converges to finding no more possible  matches.  The single iteration round robin algorithm is a specialization of i-round robin and may be  characterized as i-round robin with only a single iteration, or 1-round robin.

Figure 2:  Round robin scheduler

Figure 2 shows the i-round robin scheduler implementation chosen for this design.  The input to  the scheduler is the occupancy vectors [8] from each of the input_blocks with packets  waiting to be scheduled.  There are such 8 vectors (1 per crossbar input port), each with 8  bits (1 per destination per input).  This 8×8 state is used as the request vector into the  iterative arbitration logic.   The scheduler also contains 8 Grant Arbiters and 8 Accept Arbiters. The Grant and  Accept Arbiters each consist of programmable priority encoders and a state pointer to  record which input should have the highest priority on the next arbitration cycle.  The 8- bit feedback signal from the Decision registers to the Grant arbiters is an enable vector,  which enables arbitration only for unmatched ports on each successive iterations. Finally, after a number of iterations, the scheduler arrives at a final scheduling solution  which it outputs to each of the input_blocks (indicating which destination the data block  has been scheduled to transmit a packet for) and each interconnect mux (indicating which  input_block it is receiving data from). As an enhancement to the original i-round robin algorithm proposed in [9], this scheduler also  includes an 8-bit busy input from each of the switch outputs.  These busy signals are  asserted if that output does not have enough downstream credit to send another transfer.   When the busy signal is asserted, that output port is disabled from the Grant Arbitration

A. Arbiters

The arbiters are an important piece of the scheduler design [10, 11, 12].  The Grant Arbiters and Accept Arbiters are identically designed with the exception of the rules determining when the priority state may be updated.

Figure 3: Arbiter

Figure 6 illustrates the arbiter design chosen for this implementation.  The arbiter is based  on a simple round-robin arbiter, with the exception that it also includes an update_enable  signal to allow the i-round robin algorithm to only update the priority under certain  circumstances [13].  This limited updating produces de- synchronizing behavior between the Grant and Accept arbiters, producing improved  traffic fairness and decreasing undesirable bursting characteristics. The scheduler unit verification was accomplished using a combination of directed input  stimulus and assertions.  The inputs to the scheduler are the occupancy vectors of each of  the input_blocks.  For unit-level testing, short testcases that exercised the occupancy  vectors were used to stimulate the scheduler inputs.  The scheduling decisions were  partially checked using assertion blocks to verify that the decisions conformed to all-0 or  1-hot encoding.  For the unit-level scheduler testbench, no checking was done to verify  that a correct schedule was determined given the current occupancy vectors.  That  checking was deferred to the system-level simulations.

Pipelined implementation

We designed the scheduler to run at a clock speed of 175 MHz, and each time slot con- sists of nine clock cycles. It thus has approxi- mately 51 ns to complete three iterations of round robin . A novel pipelining scheme allows the scheduler to overlap the accept phase of one iteration with the request-grant phase of the next. As a result, i iterations require only 2i + 2 clock cycles instead of the 4i clock cycles they would require without pipelining. With three iterations, this saves four clock cycles. Figure 2 shows the pipelining scheme. Each phase consumes two clock cycles. The sched- uler uses the last clock cycle to update round- robin pointers and to prepare for the next time slot.  At first glance, it appears that the scheduler must carry out the three iterations of the scheduling algorithm sequentially. However, a simple observation allows us to start an iter- ation’s grant phase before the previous itera- tion’s accept phase completes.

Figure 4: Pipelined scheduler

Before an iteration’s grant phase, the sched- uler must mask off those requests accepted in the previous iteration. So we might expect that the scheduler would have to wait for comple- tion of the accept phase to determine whether to consider a request in the next iteration. However, because we know that an input that receives at least one grant will definitely be matched, we can simply OR together all its grants. This way, we can start the next itera-tion before the accept phase has completed.

Figure 5: Simulation results


In this paper, we have presented a pipelined architecture for a round robin scheduler to be used in the routing node of network on chip.  The pipelined implementation incorporates a folded architecture which allows the scheduler to be constructed using only half the number of arbiters that would be used in the nonpipelined implementation.  The simulation results show that because of folding and pipelining, there is a slight speed penalty, which can be overcome by running the scheduler at a higher clock rate.


  1. Abdelrazag Imbewa and Mohammed A. S. Khalid, “FLNR: A Fast Light-Weight NoC Router for FPGAs”, IEEE International Symposium on System on Chip, 2013.
  2. McKeown, N. W. 1995 Scheduling Algorithms for Input-Queued Cell Switches. Doctoral Thesis. UMI Order Number: UMI Order No. GAX96-02658., University of California at Berkeley.
  3. N. McKeown, M. Izzard, A. Mekkittikul, B. Ellersick, and M.Horowitz, “The Tiny Tera: A Packet Switch Core,” IEEE Micro, vol. 17, pp.26-33, Jan.-Feb. 1997.
  4. Wijetunga, P., “High-performance crossbar design for system-on-chip,” System-on-Chip for Real-Time Applications, 2013. Proceedings. The 3rd IEEE International Workshop on , vol., no.pp. 138- 143, 30 June-2 July 2013.
  5. Shin, H.J.; Hodges, D.A., “A 250-Mbit/s CMOS crosspoint switch,” Solid-State Circuits, IEEE Journal of , vol.24, no.2pp.478-486, Apr 1989.
  6. Gupta, P.; McKeown, N., “Designing and implementing a fast crossbar scheduler,” Micro, IEEE , vol.19, no.1pp.20-28, Jan/Feb 1999.
  7. Pape, J; “Implementation of an On-chip Interconnect Using the i-SLIP Scheduling Algorithm: Intermediate Report – Specification and Timeline,” Nov 2006.
  8. N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Addison-Wesley, 2004.
  9. Mhamdi, L., Kachris, C., and Vassiliadis, S. 2006. A reconfigurable hardware based embedded scheduler for buffered crossbar switches. In Proceedings of the 2006 ACM/SIGDA 14th international Symposium on Field Programmable Gate Arrays (Monterey, California, USA, February 22 – 24, 2006). FPGA ’16. ACM Press, New York, NY, 143-149.
  10. Kun-Yung Ken Chang; Shang-Tse Chuang; McKeown, N.; Horowitz, M., “A 50 Gb/s 32×32 CMOS crossbar chip using asymmetric serial links,” VLSI Circuits, 1999. Digest of Technical Papers. 1999 Symposium on , vol., no.pp.19-22, 1999.
  11. Brinkmann, A.; Niemann, J.-C.; Hehemann, I.; Langen, D.; Porrmann, M.; Ruckert, U., “On-chip interconnects for next generation system-on-chips,” ASIC/SOC Conference, 2012. 15th Annual IEEE International , vol., no.pp. 211- 215, 25-28 Sept. 2012.
  12. Partridge, C., Carvey, P. P., Burgess, E., Castineyra, I., Clarke, T., Graham, L., Hathaway, M., Herman, P., King, A., Kohalmi, S., Ma, T., Mcallen, J., Mendez, T., Milliken, W. C., Pettyjohn, R., Rokosz, J., Seeger, J., Sollins, M., Storch, S., Tober, B., and Troxel, G. D. 1998. A 50-Gb/s IP router. IEEE/ACM Trans. Netw.6, 3 (Jun. 1998), 237-248.
  13. P. Guerrier and A. Greiner, “A Generic Architecture for On-Chip PacketSwitched Interconnections”, Proc. Design Automation and Test in Europe, 2010.
  14. Lines, A., “Nexus: an asynchronous crossbar interconnect for synchronous system-on-chip designs,” High Performance Interconnects, 2013. Proceedings. 11th Symposium on , vol., no.pp. 2- 9, 20-22 Aug. 2013.
  15. S.Q. Zheng, M. Yang, and F. Masetti-Placci, “Constructing schedulers for high-speed, high-capacity switches/routers,” Int. J. Comput. Appl., vol.25, no.4, pp.264–271, 2013.
  16. Si-Qing Zheng, Mei Yang, Francesco Masetti: Hardware Scheduling in High-speed, High-capacity IP Routers. IASTED PDCS 2012: 631-636

Related posts





Strategic Reward System- Analyzing the gap between “Best Fit” and “Best Practice”


Leave a Comment