Mr. Roshan Rakhunde#, Prof. Tushar Uplanchiwar#
#Department of Electronics and Communication Engineering
Tulsiramji Gaikwad-Patil College of Engineering, Nagpur, Maharashtra, India
Abstract—In this brief, we explain the concept of performance monitoring as it applies to embedded processor designs, and survey the main performance monitoring strategies. The Nios II soft-core reconfigurable embedded processor is then examined in detail. A hardware description language (HDL) implementation of a Nios II soft-core processor system is described, and simulation results obtained with the ModelSim HDL simulator are presented.
Keywords—embedded systems, reconfigurable processors, HDL simulation
Monitoring is the process of gathering information about a system [1, 2]. We gather information that normally cannot be obtained by studying the program code alone. The collected information may be used for program testing, debugging, task scheduling analysis, resource dimensioning, performance analysis, fine-tuning, and optimisation of algorithms. The applicability of monitoring is wide, and so is the spectrum of available monitoring techniques. In this section we give a general presentation of a monitor and describe different monitoring systems, the type of information collected by monitors, and the problems associated with monitoring.
In essence, a monitor works in two steps: detection (or triggering) and recording. The first operation refers to the process of detecting the object of interest. This is usually performed by a trigger object inserted in the system which, when executed or activated, indicates an event of interest for recording. The latter operation, recording, is the process of collecting events and saving them in buffer memory, or communicating them to external computer systems for further analysis or debugging. An event is a record of information which usually constitutes the object of interest together with some additional metadata regarding that object (e.g. the time when the object was recorded, the object's source address, task/process ID, CPU node, etc.). The type of monitored object depends on the level of abstraction in which the user is interested; the sections below describe the different abstraction levels associated with program execution. The trigger object may be an instruction, or a function, inserted in the software. It may also be a physical sensor, or probe, connected with physical wires to the hardware, such as the CPU address, data, and control buses.
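As a concrete illustration, the two monitor steps can be sketched in plain C. The event layout, field names, and buffer depth below are illustrative assumptions, not any particular monitor's format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An event couples the object of interest with metadata
 * (timestamp, producing task) -- the fields are illustrative. */
typedef struct {
    uint32_t timestamp;   /* when the object was recorded */
    uint16_t task_id;     /* task/process that produced it */
    uint16_t object_id;   /* which trigger fired */
    uint32_t value;       /* the monitored value itself */
} event_t;

#define TRACE_DEPTH 256
static event_t trace_buf[TRACE_DEPTH];
static size_t  trace_len = 0;

/* Recording step: save the event in the trace buffer. */
static void record(event_t ev) {
    if (trace_len < TRACE_DEPTH)
        trace_buf[trace_len++] = ev;
}

/* Detection step: a trigger object inserted in the system; when
 * activated, it packages the object of interest as an event. */
void trigger(uint32_t now, uint16_t task, uint16_t obj, uint32_t val) {
    event_t ev = { now, task, obj, val };
    record(ev);
}

size_t trace_length(void) { return trace_len; }
```

A real monitor would attach the remaining metadata mentioned above (source address, CPU node) and eventually drain the buffer to a host system.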
An important issue regarding the monitoring process is the amount of execution interference that may be introduced into the observed system by the operations of the monitor. This execution interference, or perturbation, is unwanted because it may alter the true behaviour of the observed system, particularly for systems that are inherently timing-sensitive, such as real-time and distributed systems.
Monitoring Abstraction Levels
Software execution may be monitored at different levels of abstraction, as the information of interest differs in level of detail. Higher-level information refers to events such as inter-process communication and synchronisation. In contrast, lower-level information refers to events such as the step-by-step execution trace of a process. The execution data collected at the process level includes the process state transitions, communication and synchronisation interactions among the software processes, and the interactions between the software processes and external processes. The execution data collected at the function level includes the interactions among the functions or procedures within a process. The user can isolate faults within functions using the function-level execution data. In this section, the different levels of abstraction in software execution are identified.
A. System Level
The system level may be seen as the user's, or the real-world, view of the computer system. It abstracts away all implementation details and only provides information that is relevant to the system's user (or to the real-world process). For instance, the press of a button on a car's instrument panel, and the resulting activation or deactivation of the car's Traction Control System (anti-spin) feature, would be considered system-level events. This level of information is normally useful for system-test engineers during the final steps of the development process.
B. Process and OS Level
To monitor program execution at the process level, we consider a process as a black box which can be in one of three states: running, ready, or waiting. A process changes its state depending on its current state and the current events in the system. These events include interactions among the processes and interactions between the software processes and the real world. The events that directly affect program execution at the process level are distinguished from those that affect execution at lower levels. Assigning a value to a variable, arithmetic operations, and procedure calls, for instance, are events that will not cause immediate state changes of the running process. Inter-process communication and synchronisation, in contrast, are events that may change a process's running status and affect its execution behaviour. The following events are typically considered process-level events:
- Process Creation
- Process Termination
- Process State Changes
- Process Synchronisation
- Inter-process Communication
- External Interrupts
- I/O Operations
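The three-state model behind these events can be captured as a small transition function in C. The event names chosen here (dispatch, preempt, block, unblock) are illustrative, as the text does not fix a naming scheme:

```c
#include <assert.h>

typedef enum { READY, RUNNING, WAITING } pstate_t;
typedef enum { DISPATCH, PREEMPT, BLOCK, UNBLOCK } pevent_t;

/* Transitions implied by the three-state process model: dispatch
 * moves a ready process onto the CPU, preemption moves it back,
 * blocking on synchronisation or I/O suspends it, and the
 * matching wake-up makes it ready again.  Irrelevant events
 * leave the state unchanged. */
pstate_t next_state(pstate_t s, pevent_t e) {
    switch (s) {
    case READY:   return (e == DISPATCH) ? RUNNING : s;
    case RUNNING: return (e == PREEMPT) ? READY
                       : (e == BLOCK)   ? WAITING : s;
    case WAITING: return (e == UNBLOCK) ? READY : s;
    }
    return s;
}
```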
C. Function Level
The goal of monitoring program execution at the function level is to localise faulty functions (or procedures) within a process. At this level of abstraction, functions are the basic units of the program model. Each function is viewed as a black box that interacts with others by calling them or being called by them with a set of parameters as arguments. So the events of interest are function calls and returns. The key values for these events are the parameters passed between functions.
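A minimal sketch of function-level instrumentation in C might look as follows; the probe names and log format are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Function-level events are calls and returns, with the passed
 * parameter or result as the key value. */
typedef enum { CALL, RET } fevent_t;
typedef struct { fevent_t kind; int func_id; int param; } frec_t;

static frec_t flog[64];
static size_t flen = 0;

static void flog_push(fevent_t k, int id, int p) {
    if (flen < 64) {
        flog[flen].kind = k;
        flog[flen].func_id = id;
        flog[flen].param = p;
        flen++;
    }
}

/* An instrumented function: entry and exit probes wrap the body. */
int square(int x) {
    flog_push(CALL, 1, x);      /* call event with the argument */
    int r = x * x;
    flog_push(RET, 1, r);       /* return event with the result */
    return r;
}
```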
D. Instruction Level
The instruction level of abstraction refers to the step-by-step execution of CPU instructions. From a software perspective, it is the lowest level of abstraction of a program on a modern CPU. Monitoring every executed instruction, however, places a heavy burden on any monitor, since it requires at least the CPU performance of the system being observed, and the collected event traces become too large to be of practical use. Instead, it is sufficient to monitor just those instructions that affect the execution path of a program, e.g. conditional branches, traps, and exceptions. Using this information in combination with the software's source or object code, it is possible to reconstruct the execution behaviour. For many programs, such a method reduces the amount of recorded data by several orders of magnitude.
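The idea of recording only path-affecting instructions can be illustrated with a branch-trace sketch in C, where each conditional branch contributes a single taken/not-taken bit; packing the bits into a 32-bit word is an illustrative choice:

```c
#include <assert.h>
#include <stdint.h>

/* Instead of tracing every instruction, record one bit per
 * conditional branch.  Combined with the object code, the full
 * execution path can be reconstructed offline. */
static uint32_t branch_bits = 0;
static int      branch_cnt  = 0;

static int trace_branch(int taken) {
    branch_bits |= (uint32_t)(taken != 0) << branch_cnt;
    branch_cnt++;
    return taken;
}

/* Example program: only its branch decisions are recorded. */
int count_odd(const int *a, int n) {
    int odd = 0;
    for (int i = 0; i < n; i++)
        if (trace_branch(a[i] & 1))  /* probe on the branch only */
            odd++;
    return odd;
}
```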
Types of Monitoring Systems
Monitoring systems for software or system-level analysis are typically classified into three types: 1) software monitoring systems, 2) hardware monitoring systems, and 3) hybrid monitoring systems. In the following, we describe each type of system.
A. Software Monitoring Systems
In this category of monitoring systems, only software is used to instrument, record, and collect information about software execution. Software monitoring systems offer the cheapest and most flexible solution; a common technique is to insert instrumentation code at interesting points in the target software. When the instrumentation code is executed, the monitoring process is triggered and the information of interest is captured into trace buffers in target system memory. The drawback of instrumentation is the consumption of target resources such as memory space and processor execution time.
B. Hardware Monitoring Systems
In this category of monitoring systems, only hardware (custom or general) is used to perform detection, recording, and collection of information regarding the software. For this to work, the target system must lend itself to observation by external means (the monitoring hardware).
The primary objective of hardware monitoring is to avoid, or at least minimise, interference with the execution of the target system. A hardware monitoring system is typically separate from the target system and thus does not use any of the target system's resources. Execution of the target software is monitored using passive hardware (probes) connected to the system buses and signals. In this manner, no instrumentation of the program code is necessary. Hardware monitoring is especially useful for monitoring real-time and distributed systems, since changes in the program execution time are avoided. In general, the operation of monitoring hardware can be described in three steps: event detection, event matching, and event collection. In the first step, detection, the hardware monitor listens continuously to the signals. In the second step, the signal samples are compared with a predefined pattern which defines what is to be considered an event. When a sample matches an event pattern, the final step, collection, is triggered: the sampled data is collected and saved. The saved samples may be stored locally in the monitoring hardware, or be transferred to a host computer system where more storage capacity is usually available.
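A software model of this three-step pipeline might look as follows; the (address, mask) pattern format and all widths are illustrative assumptions rather than a description of any specific monitor:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the detect/match/collect pipeline of a bus monitor.
 * The pattern is an (address, mask) pair -- a sample matches
 * when its masked address equals the masked pattern. */
typedef struct { uint32_t addr; uint32_t data; } sample_t;

static sample_t collected[32];
static size_t   ncollected = 0;

void monitor_sample(sample_t s, uint32_t pat, uint32_t mask) {
    /* detection: every bus sample passes through here */
    if ((s.addr & mask) == (pat & mask)) {        /* matching  */
        if (ncollected < 32)
            collected[ncollected++] = s;          /* collection */
    }
}
```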
A further advantage of hardware monitors, apart from avoiding target interference, is their typically high precision and accuracy. Since the sole duty of a hardware monitor is to perform monitoring activities (usually at equal or higher speed than the target system), the risk of losing samples is minimised. A disadvantage of hardware monitors is their dependency on the target's architecture: the hardware interfaces and the interpretation of the monitored data must be tailored to each target architecture in which the monitor is to be used. Thus, monitoring solutions using hardware are more expensive than software alternatives.
Moreover, a hardware monitor may not be available for a particular target, or may take time to customise, which can increase costs further in terms of delayed development time.
Another problem with hardware monitoring is the integration and miniaturisation of components and signals in today's chips, which makes it difficult to reach the information of interest, e.g. cache memory, internal registers and buses, and other on-chip logic. Routing all internal signals out from a chip may be impossible because of limited pin counts.
In general, hardware monitoring is used to monitor either hardware devices or software modules. Monitoring hardware devices can be useful for performance analysis and for finding bottlenecks in, e.g., caches (accesses/misses), memory latency, CPU execution time, I/O requests and responses, and interrupt latency. Software is generally monitored for debugging purposes or to examine bottlenecks, load balancing (degree of parallelism in concurrent and multiprocessor systems), and deadlocks.
C. Hybrid Monitoring Systems
Hybrid monitoring uses a combination of software and hardware monitoring and is typically used to reduce the impact of software instrumentation alone. A hardware monitor device is usually attached to the system in some way, e.g. to a processor's address/data bus or to a network, and is made accessible to instrumentation code inserted in the software. The instrumentation is typically realised as code that extracts the information of interest, e.g. variable data, function parameters, etc., which is then sent to the monitor hardware. For instance, if the monitor hardware has memory-mapped registers in the system, the instrumentation performs data store operations on the monitor's memory addresses. The hardware then proceeds with event processing, filtering, time-stamping, etc., and communicates the collected events to an external computer system. This latter part typically resembles the operation of a pure hardware monitor. The insertion of instrumentation code resembles the technique used in a software monitoring system; i.e. it can be done manually by the programmer, automated by a monitoring control application, or inserted by compiler directives.
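This division of labour can be sketched as follows, with the hardware side mocked in plain C; the FIFO depth, register layout, and timestamp format are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* In a hybrid monitor the instrumentation reduces to a single
 * store to the monitor's memory-mapped register; the hardware
 * time-stamps and queues the value.  Here the "hardware" side
 * is mocked in C. */
static uint32_t hw_fifo[16];
static int      hw_fill  = 0;
static uint32_t hw_clock = 0;

/* Mocked memory-mapped write; on a real system this would be
 * something like  *(volatile uint32_t *)MONITOR_BASE = value;
 * where MONITOR_BASE is a hypothetical register address. */
static void monitor_write(uint32_t value) {
    if (hw_fill < 16) {
        /* the hardware, not the software, attaches the timestamp */
        hw_fifo[hw_fill++] = (hw_clock++ << 16) | (value & 0xFFFF);
    }
}

/* Instrumented software: one store per event of interest. */
void task_step(uint16_t event_id) {
    monitor_write(event_id);
}
```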
D. The Probe Effect
Instrumentation of programs, also called "probing", is convenient because it is a general method that is technically applicable in many systems. For concurrent programs, however, the delay introduced by the insertion of additional instructions may alter the behaviour of the program. The probe effect, which originates from Heisenberg's Uncertainty Principle applied to programs [3, 4], may result in a non-functioning concurrent program that works when delays are inserted, or a functioning program that stops working when the inserted delays are removed. This can also be seen as a difference between the behaviour of a system being tested and the same system not being tested. Typical errors related to the probe effect are synchronisation errors in regions containing critical races for resources.
Not only concurrent programs suffer from the probe effect; real-time systems are also affected, since they are inherently sensitive to timing disturbances, especially if deadlines are set too tightly (i.e. with little or no slack in worst-case execution times). Consequently, distributed and parallel real-time systems are the most sensitive to probe effects. This is one important reason why testing and debugging (using monitoring) of real-time systems, particularly distributed real-time systems, is so difficult [6, 7, 8]. Hence, probe effects must be avoided in the development of real-time systems. There are basically three approaches to eliminating the probe effect:
- Leave the probes in the final system. In this approach, the probes that have been used during development are left in the final product. This way, behavioural changes due to the removal of probes are avoided. The disadvantage, of course, is that the final system may suffer from inferior performance.
- Include probe delays in schedulability analysis. In real-time systems design it is straightforward to include the probes in the execution time of the program, i.e. to dedicate resources (execution time, memory, etc.) to probes. However, this method does not guarantee the ordering of events; it only provides enough execution time to compensate for the inserted delays.
- Use non-intrusive hardware. Bus snoopers and logic analysers are typical examples of passive hardware that do not interfere with the system. Other techniques are the use of multi-port memories, reflective memory, and special-purpose hardware. There are also hybrid monitoring systems which utilise hardware support together with software instrumentation. The disadvantage of this solution may be higher development and product costs due to the extra hardware.
PERFORMANCE MONITOR UNIT
The performance counter core with Avalon® interface enables relatively unobtrusive, real-time profiling of software programs [9, 10]. With the performance counter, you can accurately measure execution time taken by multiple sections of code. You need only add a single instruction at the beginning and end of each section to be measured. The main benefit of using the performance counter core is the accuracy of the profiling results.
Alternatives include the following approaches:
- GNU profiler, gprof—gprof provides broad low-precision timing information about the entire software system. It uses a substantial amount of RAM, and degrades the real-time performance. For many embedded applications, gprof distorts real-time behaviour too much to be useful.
- Interval timer peripheral—The interval timer is less intrusive than gprof. It can provide good results for narrowly targeted sections of code.
The performance counter core is unobtrusive, requiring only a single instruction to start and stop profiling, and no RAM. It is appropriate for high-precision measurements of narrowly targeted sections of code.
The performance counter core is a set of counters which track clock cycles, timing multiple sections of your software. You can start and stop these counters in your software, individually or as a group. You can read cycle counts from hardware registers.
The core contains two counters for every section:
- Time: A 64-bit clock cycle counter
- Events: A 32-bit event counter
Section Counters
Each 64-bit time counter records the aggregate number of clock cycles spent in a section of code. The 32-bit event counter records the number of times the section executes. The performance counter core can have up to seven section counters.
Global Counter
The global counter controls all section counters. The section counters are enabled only when the global counter is running. The 64-bit global clock cycle counter tracks the aggregate time for which the counters were enabled. The 32-bit global event counter tracks the number of global events, that is, the number of times the performance counter core has been enabled.
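On a real target these counters are driven through Altera's altera_avalon_performance_counter.h macros (PERF_RESET, PERF_BEGIN, PERF_END, and so on). The plain-C mock below only illustrates the gating rule described above, i.e. that section counters advance only while the global counter is running; cycle values are supplied explicitly since there is no hardware clock here:

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the counter semantics: up to 7 section counters, each a
 * 64-bit cycle count plus a 32-bit event count, gated by the
 * global counter. */
typedef struct { uint64_t cycles; uint32_t events; } section_t;

static section_t sec[7];
static int       global_running = 0;
static uint32_t  global_events  = 0;   /* times the core was enabled */

void perf_start(void) { global_running = 1; global_events++; }
void perf_stop(void)  { global_running = 0; }

/* Section counters only advance while the global counter runs. */
void perf_section(int n, uint64_t elapsed_cycles) {
    if (global_running && n >= 0 && n < 7) {
        sec[n].cycles += elapsed_cycles;
        sec[n].events += 1;
    }
}
```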
Configurable Soft-Core Processor
The Nios II processor is a configurable soft-core processor, as opposed to a fixed, off-the-shelf microcontroller [11, 12, 13]. In this context, configurable means that you can add or remove features on a system-by-system basis to meet performance or price goals. Soft-core means the processor core is not fixed in silicon and can be targeted to any Altera FPGA family.
Configurability does not require you to create a new Nios II processor configuration for every new design. Altera provides ready-made Nios II system designs that you can use as is. If these designs meet your system requirements, there is no need to configure the design further. In addition, software designers can use the Nios II instruction set simulator to begin writing and debugging Nios II applications before the final hardware configuration is determined.
A. Flexible Peripheral Set and Address Map
A flexible peripheral set is one of the most notable differences between Nios II processor systems and fixed microcontrollers. Because the Nios II processor is implemented in programmable logic, you can easily build made-to-order Nios II processor systems with the exact peripheral set required for the target applications. A corollary of flexible peripherals is a flexible address map. Altera provides software constructs to access memory and peripherals generically, independently of address location. Therefore, the flexible peripheral set and address map do not affect application developers. There are two broad classes of peripherals: standard peripherals and custom peripherals.
B. Standard Peripherals
Altera provides a set of peripherals commonly used in microcontrollers, such as timers, serial communication interfaces, general-purpose I/O, SDRAM controllers, and other memory interfaces. The list of available peripherals continues to grow as Altera and third-party vendors release new peripherals.
C. Custom Peripherals
You can also create custom peripherals and integrate them in Nios II processor systems. For performance-critical systems that spend most CPU cycles executing a specific section of code, it is a common technique to create a custom peripheral that implements the same function in hardware. This approach offers a double performance benefit: the hardware implementation is faster than software; and the processor is free to perform other functions in parallel while the custom peripheral operates on data.
D. Custom Instructions
Like custom peripherals, custom instructions allow you to increase system performance by augmenting the processor with custom hardware. The custom logic is integrated into the Nios II processor’s arithmetic logic unit (ALU). Similar to native Nios II instructions, custom instruction logic can take values from up to two source registers and optionally write back a result to a destination register. Because the processor is implemented on reprogrammable Altera FPGAs, software and hardware engineers can work together to iteratively optimize the hardware and test the results of software running on hardware. From the software perspective, custom instructions appear as machine-generated assembly macros or C functions, so programmers do not need to understand assembly language to use custom instructions.
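As an illustration, a candidate custom instruction can first be modelled in C and later moved into the ALU. The byte-swap operation and the two-operand shape below are example choices, not part of the Nios II specification; on hardware the call would be replaced by the machine-generated intrinsic or macro mentioned above:

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a candidate custom instruction: a 32-bit
 * byte swap taking two source operands and producing one result,
 * matching the up-to-two-sources / one-destination shape
 * described in the text. */
uint32_t ci_bswap(uint32_t a, uint32_t unused) {
    (void)unused;  /* second source operand idle in this model */
    return ((a & 0x000000FFu) << 24) |
           ((a & 0x0000FF00u) <<  8) |
           ((a & 0x00FF0000u) >>  8) |
           ((a & 0xFF000000u) >> 24);
}
```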
E. Automated System Generation
Altera’s SOPC Builder design tool [14, 15] fully automates the process of configuring processor features and generating a hardware design that you program in an FPGA. The SOPC Builder graphical user interface (GUI) enables you to configure Nios II processor systems with any number of peripherals and memory interfaces. You can create entire processor systems without performing any schematic or HDL design entry. SOPC Builder can also import HDL design files, providing an easy mechanism to integrate custom logic in a Nios II processor system. After system generation, you can download the design onto a board, and debug software executing on the board. To the software developer, the processor architecture of the design is set. Software development proceeds in the same manner as for traditional, nonconfigurable processors.
Nios II Architecture
The functional units of the Nios II architecture form the foundation for the Nios II instruction set. However, this does not indicate that any unit is implemented in hardware. The Nios II architecture describes an instruction set, not a particular hardware implementation. A functional unit can be implemented in hardware, emulated in software, or omitted entirely. A Nios II implementation is a set of design choices embodied by a particular Nios II processor core. All implementations support the instruction set defined in the Instruction Set Reference chapter of the Nios II Processor Reference Handbook. Each implementation achieves specific objectives, such as smaller core size or higher performance. This allows the Nios II architecture to adapt to the needs of different target applications. Implementation variables generally fit one of three trade-off patterns: more or less of a feature; inclusion or exclusion of a feature; hardware implementation or software emulation of a feature. An example of each trade-off follows:
- More or less of a feature—For example, to fine-tune performance, you can increase or decrease the amount of instruction cache memory. A larger cache increases execution speed of large programs, while a smaller cache conserves on-chip memory resources.
- Inclusion or exclusion of a feature—For example, to reduce cost, you can choose to omit the JTAG debug module. This decision conserves on-chip logic and memory resources, but it eliminates the ability to use a software debugger to debug applications.
- Hardware implementation or software emulation—For example, in control applications that rarely perform complex arithmetic, you can choose for the division instruction to be emulated in software. Removing the divide hardware conserves on-chip resources but increases the execution time of division operations.
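As a sketch of such software emulation, the routine below performs unsigned division by shift-and-subtract, the kind of code an exception handler could invoke when the divide hardware is omitted. The divide-by-zero policy is a choice made here for illustration, not mandated by the architecture:

```c
#include <assert.h>
#include <stdint.h>

/* Restoring shift-and-subtract division (unsigned case only,
 * for brevity). */
uint32_t soft_udiv(uint32_t num, uint32_t den) {
    if (den == 0) return 0xFFFFFFFFu;    /* illustrative policy */
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1); /* bring down next bit */
        if (r >= den) {                  /* divisor fits: subtract */
            r -= den;
            q |= 1u << i;                /* set this quotient bit */
        }
    }
    return q;
}
```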
A. Register File
The Nios II architecture supports a flat register file, consisting of thirty-two 32-bit general-purpose integer registers and up to thirty-two 32-bit control registers. The architecture supports supervisor and user modes that allow system code to protect the control registers from errant applications. The Nios II processor can optionally have one or more shadow register sets. A shadow register set is a complete set of Nios II general-purpose registers. When shadow register sets are implemented, the CRS field of the status register indicates which register set is currently in use. An instruction access to a general-purpose register uses whichever register set is active. A typical use of shadow register sets is to accelerate context switching. When shadow register sets are implemented, the Nios II processor has two special instructions, rdprs and wrprs, for moving data between register sets. Shadow register sets are typically manipulated by an operating system kernel and are transparent to application code. A Nios II processor can have up to 63 shadow register sets.
B. Arithmetic Logic Unit
The Nios II ALU operates on data stored in general-purpose registers. ALU operations take one or two inputs from registers, and store a result back in a register.
Some Nios II processor core implementations do not provide hardware to support the entire Nios II instruction set. In such a core, instructions without hardware support are known as unimplemented instructions. The processor generates an exception whenever it issues an unimplemented instruction so your exception handler can call a routine that emulates the operation in software. Therefore, unimplemented instructions do not affect the programmer’s view of the processor.
The Nios II architecture supports user-defined custom instructions. The Nios II ALU connects directly to custom instruction logic, enabling you to implement in hardware operations that are accessed and used exactly like native instructions.
The Nios II architecture supports single precision floating-point instructions as specified by the IEEE Std 754-1985 [17, 18]. The basic set of floating-point custom instructions includes single precision floating-point addition, subtraction, and multiplication. Floating-point division is available as an extension to the basic instruction set. These floating-point instructions are implemented as custom instructions.
C. Memory and I/O Organization
This section explains hardware implementation details of the Nios II memory and I/O organization. The discussion covers both general concepts true of all Nios II processor systems and features that might change from system to system. The flexible nature of the Nios II memory and I/O organization is the most notable difference between Nios II processor systems and traditional microcontrollers. Because Nios II processor systems are configurable, the memories and peripherals vary from system to system. As a result, the memory and I/O organization varies from system to system.
A Nios II core uses one or more of the following to provide memory and I/O access:
- Instruction master port—An Avalon® Memory-Mapped (Avalon-MM) master port that connects to instruction memory via system interconnect fabric
- Instruction cache—Fast cache memory internal to the Nios II core
- Data master port—An Avalon-MM master port that connects to data memory and peripherals via system interconnect fabric
- Data cache—Fast cache memory internal to the Nios II core
- Tightly-coupled instruction or data memory port—Interface to fast on-chip memory outside the Nios II core
D. JTAG Debug Module
The Nios II architecture supports a JTAG debug module [19, 20] that provides on-chip emulation features to control the processor remotely from a host PC. PC-based software debugging tools communicate with the JTAG debug module and provide facilities such as the following:
- Downloading programs to memory
- Starting and stopping execution
- Setting breakpoints and watchpoints
- Analysing registers and memory
- Collecting real-time execution trace data
The debug module connects to the JTAG circuitry in an Altera FPGA. External debugging probes can then access the processor via the standard JTAG interface on the FPGA. On the processor side, the debug module connects to signals inside the processor core. The debug module has nonmaskable control over the processor, and does not require a software stub linked into the application under test. All system resources visible to the processor in supervisor mode are available to the debug module. For trace data collection, the debug module stores trace data in memory either on-chip or in the debug probe. The debug module gains control of the processor either by asserting a hardware break signal, or by writing a break instruction into program memory to be executed. In both cases, the processor transfers execution to the routine located at the break address. The break address is specified in SOPC Builder at system generation time. Soft-core processors such as the Nios II processor offer unique debug capabilities beyond the features of traditional, fixed processors. The soft-core nature of the Nios II processor allows you to debug a system in development using a full-featured debug core, and later remove the debug features to conserve logic resources. For the release version of a product, the JTAG debug module functionality can be reduced, or removed altogether.
Register Transfer Level (RTL) Simulation
RTL simulation is a powerful means of debugging the interaction between a processor and its peripheral set. When debugging a target board, it is often difficult to view signals buried deep in the system. RTL simulation alleviates this problem as it enables you to functionally probe every register and signal in the design. You can easily simulate Nios II-based systems in the ModelSim simulator with an automatically generated simulation environment that Qsys and the Nios II SBT for Eclipse create.
We have presented a detailed account of performance monitoring for soft-core embedded reconfigurable processors and surveyed the various techniques available in the literature. We have explored the architecture of the Nios II soft-core reconfigurable embedded processor and performed an RTL simulation of a Nios II system using the ModelSim HDL simulator.
- N. L. Binkert, B. M. Beckmann, G. Black, S. K. Reinhardt, A. G. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The Gem5 Simulator,” SIGARCH Computer Architecture News. ACM, vol. 39, no. 2, pp. 1–7, 2011.
- D. Zaparanuks, M. Jovic, and M. Hauswirth, “Accuracy of Performance Counter Measurements,” in Intl. Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, pp. 23–32, 2009.
- A. Gordon-Ross and F. Vahid, “A Self-Tuning Conﬁgurable Cache,” in Design Automation Conference (DAC). ACM, pp. 234–237, 2007.
- ARM, ARM Cortex-A9, Technical Reference Manual, ARM Inc., 2009.
- Analog Devices, ADSP-BF535 Blackfin Processor Hardware Reference, Analog Devices, Inc., 2004.
- Renesas, SuperHTM Family, User Manual, Renesas Inc., 2004.
- Xilinx, Zynq-7000 All Programmable SoC Technical Reference Manual, UG585 (v1.7) ed., Xilinx Inc., 2014.
- S. Koehler, J. Curreri, and A. D. George, “Performance Analysis Challenges and Framework for High-Performance Reconfigurable Computing,” Parallel Computing. Elsevier Science Publishers B. V., vol. 34, no. 4-5, pp. 217–230, 2008.
- A. G. Schmidt, N. Steiner, M. French, and R. Sass, “HwPMI: An Extensible Performance Monitoring Infrastructure for Improving Hardware Design and Productivity on FPGAs,” Intl. J. of Reconfigurable Computing. Hindawi Publishing Corp., vol. 2012, pp. 2:2–2:2, 2012.
- Aeroﬂex Gaisler, GRLIB IP Library User’s Manual, 2014.
- B. Sprunt, “The Basics of Performance Monitoring Hardware,” IEEE Micro, vol. 22, no. 4, pp. 64–71, 2002.
- SUN Microsystems, OpenSPARC T2 Core Microarchitecture Specification, Sun Microsystems, Inc., 2007.
- J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra, H. You, and M. Zhou, “Experiences and Lessons Learned With a Portable Interface to Hardware Performance Counters,” in Intl. Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, p. 289.2, 2003.
- V. M. Weaver, “Linux perf event Features and Overhead,” in FastPath Workshop, Intl. Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2013.
- S. Eranian, Perfmon2: A Flexible Performance Monitoring Interface for Linux, HP Labs, 2006.
- The SPARC Architecture Manual, Version 8, SPARC International, Inc, 1992.
- M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, “MiBench: A Free, Commercially Representative Embedded Benchmark Suite,” in Workload Characterization (WWC). IEEE Computer Society, pp. 3–14, 2001.