Efficient Integer DCT Architecture for mpeg-4 part 10/HEVC
Volumn 3

Efficient Integer DCT Architecture for mpeg-4 part 10/HEVC

Mrs Daksha Amol Taori                                                                                     

Department of VLSI Engineering                                                                     

JIT, Lonara,  Nagpur, India 



 Prof Nakul Nagpal

Department of VLSI Engineering

JIT, Lonara, Nagpur, India



In this project we have proposed a low-cost high throughput multistandard transform (MST) core, which can support MPEG 1/2/4 (8 × 8), H.264 (8 × 8 &4 × 4), and Video Codecs VC-1 (8 × 8, 8 × 4, 4×8 & 4×4) transforms. Common sharing distributed arithmetic (CSDA) algorithm combines factor sharing and distributed arithmetic sharing techniques, to exploit the available resources on FPGAs, which incorporates pipelining and parallel processing of the input samples. With the help of this architecture the throughput of the design is increased. The main strategy is to reduce the nonzero elements using CSDA algorithm. The reduction in adders in the proposed MST is achieved up to 44.5%, compared with the direct implementation method. The proposed MST core has an eightfold operation frequency throughput rate with eight parallel computation paths.

Keywords— Multistandard, common sharing distributed arithmetic (CSDA),integer transform,  discrete cosine transform (DCT), multistandard transform (MST), FPGA


 Different transforms are widely used in image and video compression applications. Various groups, such as The International Organization for Standardization (ISO), International Telecommunication Union Telecommunication Standardization Sector (ITU-T), and Microsoft Corporation, have developed different Transform dimensions and coefficients, corresponding to several applications.

GroupsVideo StandardsDimensions 
ISOMPEG-1/2/48 × 8 
ITU-TH.261, H.2638 × 8 
H.2648 × 8, 4 × 4 
MicrosoftVC-18 × 8, 4 × 8, 8 × 4, 4 × 4 

Fig 1 Various video standards diagram

Discrete Cosine Transform represents the finite sequences of data points in terms of sum of cosine term functions oscillating at different frequencies.  2D DCT is often used in signal and image processing because of its strong energy compaction property.

Number of researchers have worked on various transform core designs, that includes discrete cosine transform (DCT) as well as integer transform, which uses distributed arithmetic

(DA), factor sharing (FS) , and matrix decomposition methods for reducing hardware cost. The inner product can be implemented using ROMs and accumulators at place of multipliers that reduces the area cost. Yu and Swartzlander present very efficient method for reducing size of ROMs with recursive DCT algorithms. Delta matrix is used to share hardware resources using the FS method. They generates matrices for multi standards as linear combinations obtained from the same matrix and delta matrix, and show that by factorization the same matrix coefficients can share the same hardware resources.

Recently, reconfigurable architectures have been proposed as a solution that achieves very good flexibility of processors in field-programmable gate array (FPGA) platform or application-specific integrated circuit (ASIC).

In this paper we have proposed a MST core that supports MPEG-1/2/4 (8 × 8), H.264 (8 × 8 &4 × 4), and Video CodesVC-1 (8 × 8, 8 × 4, 4 × 8 & 4 × 4) transforms. However, the proposed MST core includes DA and FS schemes as common sharing distributed arithmetic (CSDA) that reduces the hardware cost. The main strategy is to reduce the nonzero elements using CSDA algorithm; which results in, few requirement of adders in the adder-tree circuit. According to the proposed designed strategy, the selected canonic signed digital (CSD) coefficients can achieve excellent sharing capability for hardware resources.

The paper is represented as follows. Section II shows the proposed plan of the work. Section III provides the mathematical derivation of the proposed CSDA algorithm. Section IV addresses the proposed 2-D CSDA-MST architecture, which includes derivation of eight-point and four point, transforms, choice of coefficients, and the proposed 2-D CSDA architecture. Section V presents comparisons and discussions. Finally, Section VI offers a detailed conclusion.


The proposed CSDA combines the FS and DA methods to gain better resource sharing for inner-product operation. The FS and DA method are described in the following.

Fig 2 Block diagram


A. Mathematical Derivation of Factor Sharing

In Factor Sharing method the same factor is shared in different coefficients among the same input. Consider two different elements S1 and S2 having the same input X as an example

S1 = C1X, S2 = C2X.                                          (1)

Assuming that thecoefficients C1 and C2 having same factor Fs can be found in the, (1) can be rewritten as follows:

S1 = (Fs2k1+ Fd1)X

S2 = (Fs2k2+ Fd2)X                                           (2)

where k1 and k2 specifies the weight position of the sharedfactor Fs in coefficientsC1 and C2, respectively. Theremainder coefficients are Fd1 and Fd2 after extracting the shared factor Fs forcoefficients C1 and C2, respectively

Fd1= C1− Fs2k1

Fd2= C2− Fs2k2.                                        (3)

B. Mathematical Derivation of CSD Format Distributed


For a general matrix multiplication-and accumulation the inner product can be written as follows:

where Ai is an N-bit CSD coefficient, and the inputdata is Xi. Equation (4) can be given as listed below

where Yj = LAi,j Xi, Ai,j∈ {−1, 0, 1}, and j= 0, . .(N − 1).

In (5), Yj can be computed by addingor subtracting Xi with Ai,j≠ 0. The resultant product Y can thenbe obtained by shifting and adding every nonzero Yj. Inthis manner, the inner product computation in (4) and can beimplemented using shifters and adders instead of multipliers.Therefore, CSD DA-based architecture can achieve a efficientarea optimization.

C. Proposed Common Sharing Distributed Arithmetic (CSDA) Algorithm

The proposed CSDA algorithm combines the Factor Sharing (FS) and Distributed Arithmetic (DA)methods. The coefficients matrix is expanded at the bit level thenin each coefficientthe same factors are shared bythe FS method; after thatthe DA method is applied to share the same input combination for each coefficient position.The proposed CSDA algorithm example in a matrix inner product is given asfollows:

WhereC11 andC22 are the coefficients of five-bit CSD numbers

Fig. 1 shows the flow of proposed CSDA algorithm. In these four coefficients the same shared factor Fs is [1 −1 ], andC11 ≈C22 can use Fs instead of [1 −1 ], with the correspondingposition under the FS method. To share the same position for the input,and the DA shared coefficient DA1which is equal to (X1+ X2) Fsthe DA method is subsequently applied.Finally, the matrix inner products in (6) can be implementedby shifting and adding every nonzero weight position.Fig. 2 shows the coefficient searching flow of the proposedCSDA algorithm. To obtain better resource sharing for inner product operation, the proposed CSDA combines the Factor Sharing andDistributed Arithmetic methods. The FS method is adopted first to findthe factors that have higher hardwareresource sharing capability, where the hardware resource in this paper isdefined as the use of number of adders. Next, the DA methodis used to search the shared coefficient based on the resultsof the Factor Sharing method. The adder-tree circuits will be followedby the designed CSDA circuit. Thus, the CSDA methodaims to reduce the nonzero elements. For estimating the number of adders in the CSDA loop the CSDA shared coefficient is used.Therefore, the iteration searching loop requires a large numberof iteration loops to determine the smallest hardware resource with the help of thesesteps, and the CSDA shared coefficient can be generated.Notice that the optimal factor or coefficient in only FS orDA methods is not conducted for the smallest resource in theproposed CSDA method. Thus, a number of iteration loops areneeded for determining the better CSDA shared coefficient. Anexample of the performance in FS, DA, and CSDA methodsis shown in a later section.Although the joint DA and sub-expression sharing method,which adopts DA method first for eliminating multipliers andthen to reduce the number of adders by the sub-expressionsharing method for sharing the same adder operations, ispresented in, the proposed CSDA method adopts theFS method first to share the same adder operations and thenadopts the DA method to reduce the nonzero term, which canreduce the operation of the adder tree. Therefore, the proposedCSDA improves the sharing capacity and is further adoptedto implement multiple block sizes and coefficients.

Fig 3 Example of a CSDA algorithm.
Fig.4 Architecture of the proposed 1-D CSDA-MST.

The 1-D eight-point MST corearchitecture is shown in Fig. 3, which consists of a selected butterfly (SBF) module, an even part CSDA (CSDA_E) and an odd part CSDA (CSDA_O), eight error-compensated error trees (ECATs)  and a permutation module.

Fig5 Architecture of the proposed 1-D CSDA-MST

The designed 2-D CSDA-MST core consists of two 1-D CSDA-MST core with Core-1 and Core-2 having a transposed memory (TMEM), as shown in Fig. 4. Core-1 and Core-2 have different word length for each arithmetic, register, and MUX. The TMEM is designed using sixty-four 12-bit registers, where the output data from Core-1 is transposed and fed into Core-2. Both coreshave four pipeline stages: two in the even and odd part CSDA circuit, and two in ECATs. Therefore, the proposed 2-D CSDA-MST core has a latency of 16 clock cycles (= 4 + 8 + 4), and the TMEM executes transposed operation after 12 clock cycles (= 4 + 8) when 8 pixels are input.


The CSDA-MST core can achieve high performance, with a high throughput rate and low-cost VLSI design, supporting MPEG-1/2/4, H.264, and VC-1 MSTs. By using the proposed CSDA method, the number of adders and MUXs in the MST core can be saved efficiently. Measured results show the CSDA-MST core with a throughput rate of 1.28 G-pels/s, which can support (4928 × 2048@24 Hz) digital cinema format with only 30 k logic gates. Because visual media technology has advanced rapidly, this approach will help meet the rising high-resolution specifications and future needs as well.


  1. Kuo-Huang Chang1, Yi-Cheng Chen2, Chung-Cheng Hsieh1, Chi-Wu Huang2 and Chi-Jeng Chang1
  2. *1 Institute of Applied Electronics Technology *2 Department of Industrial Education
  3. s/s AES Processors,” IEEE Transactions On Computers, vol. 55, pp. 366–372, April 2006.
  4.  A. Hodjat and I. Verbauwhede, “Interfacing a high speed crypto accelerator to an embedded cpu,” In Proc. 38th Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 488–492, November 2004.
  5.  Pawel Chodowiec and Kris Gaj, “Very Compact FPGA Implementation of the AES Algorithm”, Cryptographic Hardware and Embedded Systems 2003, vol. 2779, pp. 319–333, September 2003.
  6. G. Rouvroy, F.-X. Standaert, J.-J. Quisquater and J.-D. Legat, “Compact and efficient encryption/decryption module for FPGA implementation of the AES Rijndael very well suited for small embedded applications”, In Proc. IEEE Int. Conf. on Inf. Tech.: Coding and Computing, vol. 2, pp. 583–587, Las Vegas, NV, USA, April. 2004
  7. Satish N.chalurkari , Nilesh khochare ,B.B. mashram, “Survey on Modular Attack on RSA Algorithm”, International Journal of Computational Engineering & Management, ISSN: 22307893 “Comparison of Hardware Implementation of Stream Ciphers”
  8. Michalis Galanis, Paris Kitsos,Giorgos Kostopoulos, Nicolas SkI avos ,and Costas Goutis Electrical and Computer Engineering Department, University of Patras, Greese . The International Arab Journel of information Technology Oct 2005.
  9.  “Image Compression Using Discrete Wavelet Transform” M. Mozammel Hoque Chowdhury and Amina Khatun Department of Computer Science and Engineering Jahangirnagar University
  10.  Savar, Dhaka-1342, Bangladesh International Journal of Computer Science. July 2012

Related posts



“Site for JIT Management Resource”


Smart Medicine Box


Leave a Comment