KR100859185B1

KR100859185B1 - Multiplier on finite field FT (2m)

Info

Publication number: KR100859185B1
Application number: KR1020060044858A
Authority: KR
Inventors: 홍춘표; 김창훈; 권순학
Original assignee: 학교법인 영광학원; 성균관대학교산학협력단
Priority date: 2006-05-18
Filing date: 2006-05-18
Publication date: 2008-09-18
Also published as: KR20070111718A

Abstract

본 발명은 타원곡선 암호 프로세서를 위한 유한체 GF(2 ^m )상의 새로운 곱셈기에 관한 것이다. The present invention relates to a new multiplier on finite field GF (2 ^m ) for an elliptic curve cryptographic processor.

본 발명에 따른 실시예는 유한체 GF(2^m )상의 곱셈기에 있어서, 벡터 A를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 한 비트 쉬프트하는 제1 레지스터부(10)와, 상기 제1 레지스터부에 저장된 값을 사전에 설정된 알고리즘에 의해 배타적 논리합 연산을 수행하는 배타적 논리합 연산부(40)와, 벡터 B를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 한 비트 쉬프트하는 제2 레지스터부(20)와, 기준클럭 단위로 상기 제2 레지스터부(20)에 저장된 값 및 상기 배타적 논리합 연산부(40)를 통해 논리연산된 값을 곱셈 연산한 곱셈값과, 다수의 세부레지스터(R0 ~ R6) 각각에 저장되어 있던 각각의 값들과 XOR연산하여 저장 및 딜레이하는 제3 레지스터부(30);를 구비하여 구성되는 것을 특징으로 한다. According to an embodiment of the present invention, a multiplier on a finite field GF ( 2 ^m ) includes: a first register section 10 for receiving and storing a vector A and shifting a bit to an upper bit side by a reference clock; An exclusive OR operation unit 40 which performs an exclusive OR operation by a predetermined algorithm, and a second register unit which receives and stores a vector B and shifts a bit to an upper bit side by a reference clock. 20), a multiplication value obtained by multiplying a value stored in the second register unit 20 and a logical operation value through the exclusive OR operation unit 40 in reference clock units, and a plurality of detailed registers R0 to R6. And a third register unit 30 for performing XOR operation on each of the values stored in each of them, and storing and delaying them.

유한체, 곱셈기, 연산회로, 곱셈 알고리즘, 가우시안 정규기저 Finite Field, Multiplier, Arithmetic Circuit, Multiplication Algorithm, Gaussian Normal Basis

Description

Multiplier Over WF (2m) using Gaussian Normal Basis}

도 1은 본 발명에 따른 유한체 GF(2 ^m )상의 곱셈기 중 비트-레벨 곱셈기의 회로도.
도 2는 도 1에 도시한 제3 레지스터부의 상세 구성도.
도 3은 본 발명에 따른 유한체 GF(2 ^m )상의 곱셈기 중 워드-레벨 곱셈기의 회로도.
도 4는 도 3에 도시한 논리연산부의 상세 회로도.1 is a circuit diagram of a bit-level multiplier of multipliers on a finite field GF (2 ^m ) according to the present invention.
FIG. 2 is a detailed configuration diagram of the third register unit shown in FIG. 1. FIG.
3 is a circuit diagram of a word-level multiplier of multipliers on a finite field GF (2 ^m ) according to the present invention.
4 is a detailed circuit diagram of a logic operation unit shown in FIG. 3;

본 발명은 유한체 GF(2 ^m )상의 곱셈기로, 보다 상세하게는 타원곡선 암호시스템 (ECC : Elliptic Curve Cryptosystems)을 위한 곱셈기로서, 기존의 동일한 형태의 곱셈기에 비해 낮은 하드웨어 복잡도 및 최대 처리기 지연시간을 가진다. 또한 워드레벨로 설계되었기 때문에 계산지연시간 및 하드웨어 면적에 있어 상충 관계를 개선 할 수 있다.
1980년대 중반 Victor Miller와 Neal Kobliz에 의해 제안된 타원곡선 암호 시스템(Elliptic Curve Cryptosystem: ECC)는 최근 학계나 산업계로부터 많은 관심을 모으고 있다. ECC의 가장 주된 장점은 RSA나 ElGamal과 같은 다른 암호 시스템에 비해 현저히 작은 키를 사용하면서(약 1/6 정도) 동일한 안전도를 가진다.
작은 키를 사용한다는 것은 계산 시간, 전력 소모 그리고 저장 공간의 감소를 의미한다. 이러한 장점 때문에 최근 IEEE 1363 및 NIST은 공개키 암호 시스템을 위해 ECC에 기반한 타원곡선 전자서명 알고리즘(Elliptic Curve Digital Signature: ECDSA)를 표준으로 채택하였다. ECDSA를 위해 유한체는 GF(p)와 GF(2 ^m )을, GF(2 ^m )상의 원소 표기법으로는 가우시안 정규기저(Gaussian Normal Basis: GNB)와 다항식 기저(Polynomial Basis: PB) 표기법을 사용한다. 여기서 p는 소수이고 GNB는 정규 기저(Normal Basis: NB)의 특별한 경우로서 8로 나누어지지 않는 모든 양의 정수 m에 대해 존재한다.
상기 ECC를 GF(2 ^m )상에서 구현할 경우, GF(2 ^m )상의 덧셈, 곱셈, 역원(혹은 나눗셈) 연산이 필요하다. 여기서 덧셈은 비트별 XOR 연산으로, 기저 표기법에 상관없이 동일하다. 그러나 곱셈 및 역원 연산은 선택된 기저에 따라 서로 다른 하드웨어 구조를 갖는다. PB를 사용한 곱셈기 설계의 경우 m값과 원소 생성에 사용되는 기약다항식(Irreducible Polynomial)에 상관없이 동일한 하드웨어 구조 설계가 가능한 장점이 있다. NB를 사용할 경우 임의의 원소 A ∈ GF(2 ^m )의

(0 ≤ i ≤ m-1) 연산은 i-비트 순환 쉬프트로 Fermat의 이론을 이용하면 연속된 연산으로 역원 연산을 수행할 수 있다. 따라서 NB를 사용하여 ECC를 구현할 경우 곱셈기의 성능은 매우 중요하다.
지금까지 다양한 NB 곱셈기들이 제안되었다. 이러한 곱셈기들 중 Messey와 Omura 곱셈기는 패러럴 입력 시리얼 출력 구조를 가지지만 곱셈기의 최대 처리기 지연시간은

에 비례한다. 따라서 이 곱셈기는 ECC와 같이 매우 큰 m을 요구하는 응용에서는 높은 최대 처리기 지연시간을 보인다. Agnew등은 Messey와 Omura 곱셈 알고리즘을 이용하여 패러럴 입력 패러럴 출력 구조의 선형 곱셈기를 제안하였다. 이 곱셈기는 Messey와 Omura 곱셈기에 비해 m-비트 레지스터를 추가함으로써 최대 처리기 지연시간을 T_A + 2T_X 로 줄였다. 여기서 T_A 는 2-입력 AND 게이트 딜레이 시간이고 T_X 는 2-입력 XOR 게이트 딜레이 시간이다. 또한 최근 Reyhani-Masoleh와 Hasan은 정규기저 원소의 대칭성을 이용하여, T_A + 2T_X 의 최대 처리기 지연시간을 가지는 저면적 선형 곱셈기를 제안하였다.
그런데, 유한체 GF(2 ^m )상의 덧셈은 비트별 XOR 연산으로, 빠르고 간단하게 구현할 수 있지만, 다른 연산들은 매우 복잡하다. 특히 지수 및 역원 연산이 가장 복잡한데 이러한 연산들은 반복적인 곱셈 연산으로 이루어진다. 따라서 효율적인 곱셈 연산기의 구현은 필요하다.The present invention is a multiplier on a finite field GF (2 ^m ), more specifically a multiplier for Elliptic Curve Cryptosystems (ECC), which has lower hardware complexity and maximum processor latency than conventional multipliers of the same type. Has In addition, because it is designed at the word level, it is possible to improve the trade-off between the calculation delay time and the hardware area.
The Elliptic Curve Cryptosystem (ECC), proposed by Victor Miller and Neal Kobliz in the mid-1980s, has recently attracted much attention from academia and industry. The main advantage of ECC is that it has the same security, using significantly smaller keys (about 1/6) compared to other cryptographic systems such as RSA and ElGamal.
Using a smaller key means less computation time, less power, and less storage space. Because of these advantages, IEEE 1363 and NIST recently adopted the ECC-based Elliptic Curve Digital Signature (ECDSA) as a standard for public key cryptosystems. For ECDSA, finite bodies use GF ( p ) and GF (2 ^m ), and Gaussian Normal Basis (GNB) and Polynomial Basis (PB) notation for elemental notation on GF (2 ^m ). do. Where p is a prime number and GNB is a special case of Normal Basis (NB) and is present for all positive integers m not divided by 8.
When implementing the ECC over GF (2 ^m), requires addition, multiplication, inverse (or division) operation on GF (2 ^m). Here, addition is a bitwise XOR operation, which is the same regardless of the underlying notation. Multiplication and inverse operations, however, have different hardware structures depending on the basis chosen. The multiplier design using PB has the advantage that the same hardware structure can be designed regardless of the value of m and the irreducible polynomial used to generate the element. When using NB, any element A ∈ GF (2 ^m )

The operation (0 ≤ i ≤ m -1) is an i -bit cyclic shift, and using Fermat's theory, the inverse operation can be performed as a continuous operation. Therefore, when implementing ECC using NB, the performance of the multiplier is very important.
Various NB multipliers have been proposed. Of these multipliers, the Messey and Omura multipliers have a parallel input serial output structure, but the maximum processor latency of the multipliers is

Proportional to Thus, this multiplier has a high maximum processor latency for applications that require very large m such as ECC. Agnew et al. Proposed a linear multiplier of parallel input parallel output structure using Messey and Omura multiplication algorithms. The multiplier reduces the maximum processor latency to T _A + 2 T _X by adding m -bit registers over the Messey and Omura multipliers. Where T _A is the 2-input AND gate delay time and T _X is the 2-input XOR gate delay time. Also, Reyhani-Masoleh and Hasan recently proposed a low-area linear multiplier with a maximum processor delay of T _A + 2 T _X , using the symmetry of regular base elements.
However, addition on finite field GF (2 ^m ) is a bitwise XOR operation that can be implemented quickly and simply, but other operations are very complex. In particular, exponential and inverse operations are the most complex, and these operations consist of iterative multiplication operations. Therefore, the implementation of an efficient multiplication operator is necessary.

본 발명은 전술한 점을 감안하여 안출된 것으로서, 타원곡선 암호시스템을 위한 유한체 GF(2 ^m )상의 새로운 곱셈기를 제공하는 것에 그 목적이 있다.
또한, 본 발명은 정규기저 원소들의 대칭성을 이용할 뿐만 아니라 정규기저 원소 계수의 인덱스를 변형함으로써 기존에 제안된 곱셈기보다 낮은 하드웨어 복잡도를 가지지만 동일한 최대 처리기 지연시간을 가지는 유한체 GF(2^m)상의 곱셈기를 제공함에 다른 목적이 있다.
또한, 본 발명은 기존에 제안된 곱셈기 보다 낮은 하드웨어 복잡도 및 최대 처리기 지연시간을 보이는 유한체 GF(2^m)상의 곱셈기를 제공함에 다른 목적이 있다.
또한, 본 발명은 그 구조가 매우 규칙적이기 때문에 VLSI 구현에 매우 적합한 유한체 GF(2^m)상의 곱셈기를 제공함에 다른 목적이 있다. SUMMARY OF THE INVENTION The present invention has been made in view of the foregoing and an object thereof is to provide a new multiplier on a finite field GF (2 ^m ) for an elliptic curve cryptographic system.
In addition, the present invention not only exploits the symmetry of the normal basis elements, but also modifies the index of the normal basis element coefficients so that the finite field GF (2 ^m ) phase has lower hardware complexity than the conventional multiplier but has the same maximum processor delay. There is another purpose in providing a multiplier.
It is another object of the present invention to provide a multiplier on finite field GF (2 ^m ) which exhibits lower hardware complexity and maximum processor delay than the proposed multiplier.
It is another object of the present invention to provide a multiplier on the finite field GF (2 ^m ) which is very suitable for VLSI implementation since its structure is very regular.

전술한 목적을 달성하기 위한 본 발명에 따른 실시예는 유한체 GF(2^m )상의 곱셈기에 있어서, 벡터 A를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 한 비트 쉬프트하는 제1 레지스터부(10)와, 상기 제1 레지스터부에 저장된 값을 사전에 설정된 알고리즘에 의해 배타적 논리합 연산을 수행하는 배타적 논리합 연산부(40)와, 벡터 B를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 한 비트 쉬프트하는 제2 레지스터부(20)와, 기준클럭 단위로 상기 제2 레지스터부(20)에 저장된 값 및 상기 배타적 논리합 연산부(40)를 통해 논리연산된 값을 곱셈 연산한 곱셈값과, 다수의 세부레지스터(R0 ~ R6) 각각에 저장되어 있던 각각의 값들과 XOR연산하여 저장 및 딜레이하는 제3 레지스터부(30);를 구비하여 구성되는 것을 특징으로 한다.
여기서, 본 발명의 실시예는 비트연산이기 때문에 한 비트씩 논리연산 된다.
상기 제3 레지스터부(30)의 상기 다수의 세부레지스터(R0 ~ R6)는, 각각 상기 제2 레지스터부(20)에 저장된 값과 상기 배타적 논리합 연산부(40)에서 논리 연산된 값을 논리곱하는 앤드게이트(30a)와, 상기 앤드게이트(30a)에서 논리곱 된 값과 하위 세부레지스터로부터 쉬프트된 값을 배타적 논리합 연산하는 XOR 게이트(30b)와, 상기 XOR 게이트에서 논리 연산된 값을 딜레이시키는 딜레이부(30c);를 구비하여 구성되는 것을 특징으로 한다.
본 발명에 따른 다른 실시예는 유한체 GF(2^m)상의 곱셈기에 있어서, 벡터 A를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 워드크기 w만큼씩 쉬프트하는 A 레지스터부(100)와, 벡터 B를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 워드크기 w만큼씩 쉬프트하는 B 레지스터부(110)와, 기준클럭 단위로 상기 A 레지스터부에 저장된 값과 상기 B 레지스터부에 저장된 값을 사전에 설정된 알고리즘에 따라 각각 논리연산하는 F0 논리연산부(120a)와 F1 논리연산부(120b)를 구비하는 논리연산부(120)와, 상기 F1 논리연산부(120b)에서 출력되는 값과 외부에서 제공되는 제어값을 논리곱하는 논리곱 연산부(130)와, 상기 F0 논리연산부(120a)와 논리곱 연산부(130)에서 출력되는 값을 배타적 논리합 연산을 수행하는 제1 배타적 논리합 연산부(140)와, 상기 제1 배타적 논리합 연산부(140)에서 처리된 값과 하위레지스터(R0 ~ R6)에서 워드크기 w만큼 쉬프된 하위레지스터값을 다시 배타적 논리합 연산하는 제2 배타적 논리 연산부(160)와, 상기 제2 배타적 논리 연산부에서 출력되는 값을 딜레이시키는 딜레이부(150);를 구비하며,
상기 논리연산부(120)는 상기 A 레지스터부(100)에 저장된 값을 사전에 설정된 알고리즘에 의해 배타적 논리합 연산을 수행하는 GNB 배타적 논리합 연산부(121)와, 상기 GNB 배타적 논리합 연산부(121)를 통해 논리연산된 값과 상기 B 레지스터부(110)에 저장된 값을 곱셈연산 하는 엔드게이트들(123)로 이루어지는 것을 특징으로 한다.
여기서, 본 발명의 다른 실시예는 워드연산이기 때문에 한 워드크기 w만큼씩 논리연산 된다.
상기 제어값은, 첫번째 클럭에서 마지막 전 클럭까지는 상기 F1 논리연산부(120b)의 출력값을 상기 제1 배타적 논리합 연산부(140)에 전달하기 위해 제어값 1에 의해 상기 제1 배타적 논리합 연산부(140)에서 논리곱연산을 수행하고, 마지막 클럭에서 상기 F1 논리연산부(120b)의 출력값을 상기 제1 배타적 논리합 연산부(140)에 전달되지 않도록, 제어신호 0에 의해 상기 제1 배타적 논리합 연산부(140)에 논리곱연산 하는 것을 특징으로 한다.
이하에서는 첨부된 도면을 참조하여 본 발명에 따른 유한체 GF(2^m)상의 곱셈기를 보다 상세하게 설명한다.
도 1은 본 발명에 따른 유한체 GF(2^m)상의 곱셈기 중 비트-레벨 곱셈기의 회로도이고, 도 2는 도 1에 도시한 제3 레지스터부의 상세 구성도이며, 도 3은 본 발명에 따른 유한체 GF(2^m)상의 곱셈기 중 워드-레벨 곱셈기의 회로도이고, 도 4는 도 3에 도시한 기능블록의 상세 회로도이다.
본 발명에 따른 유한체 GF(2^m)상의 곱셈기는 비트 단위의 곱셈 연산을 수행하는 비트-레벨 곱셈기와 워드 단위의 곱셈 연산을 수행하는 워드-레벨 곱셈기로 크게 구분할 수 있다. 상기 비트-레벨 곱셈기와 워드-레벨 곱셈기는 가우시안 정규기저를 이용하여 구현된다.
이하, 도 1 및 도 2에 해당되는 본 발명의 실시예를 구체적으로 설명하면 아래와 같다.
먼저, 제 1 레지스터부(10)에 저장된 값으로부터 배타적 논리합 연산부(40)를 통해 논리 연산된 값과 제 2 레지스터부(20)에 저장된 값을 제 3 레지스터부(30)의 앤드게이트(30a)에서 곱셈하여 연산 결과를 출력하고, 초기 클럭에서는 딜레이부(30c)인 D _i 의 모든 값이 0이므로 셋팅되어 앤드게이트(30a)에서 곱셈 연산 결과값이 그대로 딜레이부(30c)에 저장되고 2번째 클럭부터는 앤드게이트(30a)에서 곱셈 연산 결과값과 이전 클럭에서 딜레이부(30c)에 저장되어 있는 값이 제 3 레지스터부(30)의 세부레지스터(R0 ~ R6)들에서 상위의 세부레지스터로 전달되어 XOR 게이트부(30b)에서 XOR 연산을 실행하여 제 3 레지스터부(30)의 각각 딜레이부(30c)에 저장된다.
그리고, 다수의 세부레지스터(R0 ~ R6)는, 제 1레지스터부(10)의 값이 배타적 논리합 연산부(40)의 논리 연산을 실행한 값과 제 2레지스터부(20)의 값, 이 2개의 입력값을 받아 논리곱하는 앤드게이트(30a)와, 이 앤드게이트(30a)의 각 값과 제 3레지스터부(30)의 하위 딜레이부(30c)의 출력값을 XOR 연산하는 XOR 게이트(30b), 이 값을 딜레이시키기 위한 딜레이부(30c)로 구성되며, 제 3레지스터부(30)인 R _i 의 출력값은 딜레이부(30c)인 D _i 의 출력값이므로 1비트 출력값이다.
상기한 비트-레벨 곱셈기를 구현하기 위하여 적용되는 알고리즘은 아래의 표 (1)과 같다.
GNB를 이용한 GF(2 ^m )상의 곱셈 알고리즘 Input :

Output :

,

for all

, where

. Initial :

. 1. For

to

2. For

to

3.

4. End for 5. End for 6. Return D
상기한 표 (1)과 같은 곱셈 알고리즘을 유도하는 과정을 설명한다.
먼저, 표 1은 도 1의 비트-레벨 곱셈기의 회로도에 대한 알고리즘으로 표현한 것이다. 그리고, 표 1에서 스탭 3은 도 2에 해당하는 과정을 보인 것이다.
그리고, 유한체 GF(2 ^m )은 GF(2)상의 m차원 벡터 공간으로 GF(2 ^m )상의 원소 A는 기저

에 대해

와 같이 표현할 수 있다.

이때, 수학식 1은 제1레지스터부(10)에 해당하는 것이고, 입력 벡터를 B로하면 제2레지스터부(20)에도 해당된다. 그리고,

형태의 basis를 NB(Normal Basis)라 한다. GF(2 ^m )상에서 ECC의 높은 안전성을 위해서 소수인 m을 요구한다. 이러한 조건은 Pohlig-Hellman 형태의 공격을 회피하기 위해 필요하다. 예를 들면, NIST와 IEEE 1363에서는 ECDSA(Elliptic Curve Digital Signature Algoritym)을 위해 권고하는 필드 사이즈 m=163, 233, 283, 409, 571으로서 m은 홀수인 소수이다. 따라서 본 발명에서는 m이 홀수인 소수에 대해서만 고려한다.
이하에서 설명되는 수학식 2 내지 수학식 16은 도 1에 도시된 배타적 논리합 연산부(40)에 대한 진행과정을 보인 것이다.
이어서, NB에 대해,

라 하면,

이고,

이다. 여기서 λ_ij ⁽ ^s ⁾를 GF(2)의 원소라 하자. 그러면, 임의의 t에 대해,

이고, λ의 위, 아래 첨자는 mod(modulation operation) m이다.

_s 의 계수를 비교하면, λ_ij ⁽ ^s ⁾= λ_i _- _t,j _- _t ⁽ ^s-t ⁾이고, λ_ij ⁽ ^s ⁾=λ_i _- _s,j _- _s ⁽⁰⁾임을 알 수 있다. GF(2 ^m )상의 원소 A, B의 곱 C = AB는

이고, C의 계수 c_s 는

이다.
여기서

_i

_j 대신

_j 를 사용하고,

_i 와

_m-i 의 대칭성을 이용하면 아래의 수학식 (6)을 얻을 수 있다.

위 수학식 (6)에서

=

이다.
이어서, 가우시안 정규기저를 이용한 유한체 GF(2 ^mk )상의 곱셈 알고리즘에 대해서 설명한다.
우선, m, k를 소수 p≠2에 대해, p=mk+1인 양의 정수라 하고, K=<τ>는 GF(p)^×에서 위수(order) k인 유일한 부분군이라 하자. β가 GF(2 ^mk )상의 단위원(unity)에 대한 p번째 원시근이라면, 다음 원소

를 GF(2)상의 (m, k)타입의 Gauss period라 한다. ord _p 2를 mod p에 대한 2의 위수라 하고, gcd(mk/ord _p 2, m)=1이라 가정하면,

는 GF(2 ^m )상에서 NB의 원소이고, 0≤i≤m-1에 대해,

라 놓으면, {

₀,

₁,

₂,...

_m-1}은 GF(2)상의 GF(2 ^m )에 대한 기저이며, 이것을 GF(2 ^m )상에서 m 또는 (m, k) 타입의 가우시안 정규기저라 부른다.
이어서, 가우시안 정규기저를 이용하여

_i 를 구하면,

를 얻을 수 있고, 1+τ^u 2 ^v =0∈GF(p)인 0≤u≤k-1과 0≤v≤m-1가 유일하게 존재한다. 만약, t≠u 또는 i≠v이면, t와 i에 의해 결정되는 임의의 0≤σ(t, i)≤m-1에 대하여, 1+τ^t 2 ⁱ ∈K_σ ₍ _t _, _i ₎을 얻는다. 따라서 임의의 t′에 대해, 1+τ^t 2 ⁱ =τ^t ^′2 ^σ ⁽ ^t ^, ⁱ ⁾와 같이 쓸 수 있다. 여기서 i≠v일 때,

이다. 또한, i=v일 때,

이다. 그러므로

는 i≠v에 대해,

에서 많아야 k개의 기저 원소의 합으로 계산되고,

는 많아야 k-1개의 기저 원소와 상수 부분 k=0,1∈GF(2)의 합으로 계산된다.

라 놓고, λ_ij 를 GF(2)의 원소라 하자. 양변에 2의 거듭 제곱을 하면, λ_ij ⁽ ^s ⁾=λ_i-j _, _s-j 이다.
상기한 점을 감안하였을 때 다음과 같은 정리가 성립된다.
(정리)
k가 짝수일 때,

이 타입 k GNB이면,
λ_ij ⁽⁰⁾=λ_ij
이다.
(증명) 이 증명은 λ_ij ⁽ ^s ⁾=λ_i-j _, _-j 임을 보이면 충분하다. 수학식 (9)와 (10)으로부터 λ_ij =1이면
1+τ^s 2 ⁱ =τ^s ^′2 ^j
를 만족하는 (s,s') (mod k)의 홀수 순서쌍이 존재함을 안다. 이때, S를 모든 순서쌍 (s,s') (mod k)의 집합이라 하고, T를 1+τ^t 2 ⁱ ^- ^j =τ^t ^′2 ^-j 를 만족하는 모든 순서쌍 (t,t') (mod m)의 집합이라 하자. 식 (13)의 양변을 τ^s ^′2 ^j 으로 나누면 τ^-s ^′2^- ^j +τ^s-s ^′2 ⁱ ^- ^j =1이고, τ의 위수가 k이고 k는 짝수이므로 -1=τ^k/ ²이고 τ^-s ^′2^- ^j =1+τ^k/ ²⁺ ^s-s ^′2 ⁱ ^- ^j 이다. 여기서, 사상 f_S : S → T는 f_S (s,s´)=(k/2+s-s´,-s´)으로 주고, 사상 f_T : T → S는 f_T (t,t´)=( k/2+t-t´,-t´)으로 주면 두 사상은 일대일 대응이므로 위 정리는 증명된다.
이어서, 상기한 정리를 바탕으로,

의 계수 c_s 는

와 같다.
대응되는 행렬 X=(x_st )에 대해 GF(2)상에서 원소 x_st , 0≤s,t≤m-1를

라 정의하자.
그러면 X의 t번째 열벡터 X_t =(x_0t , x_1t ,…, x_m-1 _, _t )^T이고, 이 때, (x_0t , x_1t ,…, x_m-1 _, _t )^T는 행벡터 (x_0t , x_1t ,…, x_m-1 _, _t )의 전치 행렬이다.
또한,

이기 때문에 모든 열벡터 X_t , t=0,1,…,m-1의 합은 (c ₀, c ₁,…, c_m _-1)^T이다. 그리하여 열벡터 X_t 를 재배열하고 계산 과정에서 부분합의 신호를 재사용함으로써 게이트 복잡도 및 임계 경로를 줄일 수 있다.
한편, m-1=2υ라 하고 다음과 같이 X의 열벡터의 치환에 의해 m×m 행렬 Y=(y_st )를 정의하자.
υ가 홀수일 때, Y를 정의하면
(X_υ,...,X₃, X₁, X_m-1,...,X_m-υ,X_υ-1,...,X₂,X₀,X_m-2,...X_m-υ+1 ) 이고,
υ가 짝수일 때, Y를 정의하면
(X_υ,...,X₂, X₀, X_m-2,...,X_m-υ,X_υ-1,...,X₃,X₁,X_m-1,...X_m-υ+1 ) 이다.
그러면 Y_t =(y_0t , y_1t ,…, y_m-1 _, _t )^T인 Y의 모든 열벡터 Y_t , t=0,1,…,m-1의 합은 (c ₀, c ₁,…, c_m _-1)^T인 X의 모든 열벡터 X_t , t=0,1,…,m-1의 합과 같다. 병렬-입력, 병렬-출력 곱셈 구조를 구현하기 위해 Y의 열벡터의 합을 계산하는 대신에 Y의 이동한 대각 벡터의 합으로 계산한다. 이것은 다음의 결과로부터 얻을 수 있다. 행렬 Y의 표현에서 벡터 X_t 의 X_m-t 사이에 정확히 t-1개의 열이 존재한다. 또한, X_t 의 s번째 원소와 X_m-t 의 s+t번째 원소는 그들의 가수(summand)에서 a_i 의 같은 항을 가진다.
또한,

를 얻는다. 이 때, 상기한 수학식 (16)에서 세 번째 식 표현은 아래 첨자 i에서 합의 재배열로 나오고 마지막 식 표현은 λ_ij =λ_i _- _j, _- _j 로부터 나온다. 그러므로 x_st 와 x_s ₊ _t _, _m _- _t 는 식의 표현에서 같은 항

를 가지고 AB를 계산하는 동안 XOR 게이트의 수를 절약할 수 있다.
따라서, 전술한 바와 같이 상기 표 1과 같은 새로운 GNB 곱셈 알고리즘을 얻을 수 있게 되는데, 도 1에 상기한 바와 같은 GNB 곱셈 알고리즘을 이용하여 구현되는 비트-레벨 곱셈기의 구성을 나타내었다.
상기한 비트-레벨 곱셈기는 벡터 A를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 쉬프트하는 제1 레지스터부(10)와 상기 제1 레지스터부에 저장된 값을 사전에 설정된 알고리즘에 의해 배타적 논리합 연산을 수행하는 배타적 논리합 연산부(40), 벡터 B를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 쉬프트하는 제2 레지스터부(20), 기준클럭 단위로 상기 제1 레지스터부에 저장된 값과 상기 제2 레지스터부에 저장된 값 및 상기 배타적 논리합 연산부를 통해 논리연산된 값을 곱셈 연산하는 제3 레지스터부(30)를 구비하여 구성되게 된다.
이하, 본 발명의 실시예의 과정을 상세히 설명하면 아래와 같다.
먼저, 수학식 14와 수학식 15에 보여지는 바와 같이,

의 계수 c_s 는 수학식 14와 수학식 15에 의해

임을 알 수 있습니다.
또한 X_t =(x_0t , x_1t ,…, x_m-1 _, _t )^T는 X의 t번째 열벡터이고 X의 열벡터의 치환에 의해 X_t 가 재배열되어 m×m 행렬 Y를 구할 수 있다.
예를 들면, 수학식 21에서 t가 0일 때, 즉 첫 번째 클럭에서 y₀₀=(a ₂+a ₅)b ₃, y₁₁=(a ₁+a ₃+a ₆+a ₀)b ₂, y₂₂=(a ₃+a ₆+a ₀+a ₁)b ₁, y₃₃=(a ₅+a ₂)b ₀, y₄₄=(a ₅+a ₀+a ₁+a ₂)b ₆, y₅₅=a ₆ b ₅, y₆₆=(a ₀+a ₁+a ₂+a ₅)b ₄이 계산된다.
여기서 y _s,s 의 a의 계수만를 살펴보면 y₀₀와 y₃₃, y₁₁와 y₂₂, y₄₄와 y₆₆이 같다는 것을 알 수 있다. 또한 y₀₀와 y₃₃에 사용된 (a ₂+a ₅)은 y₄₄와 y₆₆에 재사용되고 y₁₁와 y₂₂에서 사용된 (a ₀+a ₁)도 y₄₄와 y₆₆에 재사용된다. 이러한 성질을 이용하여 제 1레지스터부(10)에 저장된 A로부터 A 계수의 합으로 표현한 회로도로 구현한 부분이 배타적 논리합 연산부(40)이다. 이 배타적 논리합 연산부는 m에 의해 좌우된다.
배타적 논리합 연산부(40)에서의 결과 값인 각각의 A 계수의 합, (a ₂+a ₅), (a ₁+a ₃+a ₆+a ₀), (a ₃+a ₆+a ₀+a ₁), (a ₅+a ₂), (a ₅+a ₀+a ₁+a ₂), a _6,(a ₀+a ₁+a ₂+a ₅)와 제 2레지스터부(20)의 B의 계수, b ₃, b ₂, b ₁, b ₀, b ₆, b ₅, b ₄의 곱을 계산하기 위한 것이 제 3레지스터(30)의 R _i 에 해당하는 앤드게이트(30a)이다.
그리고, 초기에는 R _i 의 딜레이부(30c) D _i 들이 0으로 셋팅되어 있으므로 앤드게이트(30a)의 값과 제 3레지스터(30)의 R _i 의 딜레이부(30c)부터 전달된 D _i =0이 XOR 게이트(30b)에서 XOR 연산을 한 후, 각 D _i 에 저장되면 y₀₀=(a ₂+a ₅)b ₃, y₁₁=(a ₁+a ₃+a ₆+a ₀)b ₂, y₂₂=(a ₃+a ₆+a ₀+a ₁)b ₁, y₃₃=(a ₅+a ₂)b ₀, y₄₄=(a ₅+a ₀+a ₁+a ₂)b ₆, y₅₅=a ₆ b ₅, y₆₆=(a ₀+a ₁+a ₂+a ₅)b ₄을 얻게 된다.
또한, 아래의 식은 수학식 21에서 두 번째 클럭의 과정을 설명하기 위한 것이다.

먼저, t가 1일 때 즉 2번째 클럭에서는 제 1레지스터부(10)에 저장된 A값과 제 2레지스터부(20)에 저장된 B값이 각각 상위 비트측으로 쉬프트 된 후 y₆₀=(a ₁+a ₂+a ₃+a ₆)b ₅, y₀₁=(a ₃+a ₆)b ₄, y₁₂=(a ₂+a ₄+a ₀+a ₁)b ₃, y₂₃=(a ₄+a ₀+a ₁+a ₂)b ₂, y₃₄=(a ₆+a ₃)b ₁, y₄₅=(a ₆+a ₁+a ₂+a ₃)b ₀, y₅₆=a ₀ b ₆이 계산된다.
제 1레지스터부(10)의 값이 상위 비트측으로 쉬프트 된 후 의 배타적 논리합 연산부(40)를 실행한 값, (a ₁+a ₂+a ₃+a ₆), (a ₃+a ₆), (a ₂+a ₄+a ₀+a ₁), (a ₄+a ₀+a ₁+a ₂), (a ₆+a ₃), (a ₆+a ₁+a ₂+a ₃), a ₀과 제 2레지스터부(20)의 값이 상위 비트측으로 쉬프트 된 후의 B의 계수, b ₅, b ₄, b ₃, b ₂, b ₁, b ₀, b ₆의 곱을 제 3레지스터(30)의 R _i 에 해당하는 앤드게이트(30a)에서 계산한다.
이번 클럭에서는 첫 번째 클럭에서 R _i 의 딜레이부(30c) D _i 들에 저장된 y₀₀, y₁₁, y₂₂, y₃₃, y₄₄, y₅₅, y₆₆을 전달받아 XOR 게이트(30b)에서 XOR 연산을 한 후 각 D _i 에 저장된다.
이와 같이, 본 발명의 실시예에 따른 GF(2⁷)상의 비트-레벨 곱셈기는 7번의 클럭을 수행한 후에 최종 결과값을 얻을 수 있다. 그리고, 본 발명의 실시예의 AND 게이트와 XOR 게이트는 2-입력 1-출력 게이트이다. 따라서 곱과 XOR 연산은 단지 두 개의 입력값만 받아서 연산을 수행한다.
이어서, GNB를 이용한 유한체 GF(2 ^m )상의 워드-레벨 곱셈기에 대해서 설명한다.
본 발명에 따른 유한체상의 곱셈기에 적용되는 워드-레벨 구조는 데이터를 일정한 크기의 워드 단위로 나눈 다음, 워드 단위로 처리 및 전송한다. 데이터 크기가 m비트이고 워드의 크기가 w비트이면, 워드 개수

이 된다. 비트 시리얼 구조는 m클럭 사이클마다 결과를 출력하지만, 워드-레벨 구조는 L클럭 사이클마다 결과를 출력한다. 워드-레벨 구조는 워드의 크기가 커질수록 연산 시간을 단축할 수 있으나, 하드웨어 복잡도가 증가한다. 그러나 시간과 공간을 만족시키는 가장 적합한 워드 크기를 찾는다면 둘 사이의 상충 관계를 개선할 수 있다. 아래의 표 (2)는 GNB를 이용한 GF(2 ^m )상의 워드-레벨 곱셈 알고리즘이다.
이하, 도 3 및 도 4에 도시된 본 발명의 다른 실시예를 상세히 설명하면 아래와 같다.
먼저, A 레지스터부(100)는 A를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 워드크기 w만큼씩 쉬프트한다.
그리고, B 레지스터부(110)는 벡터 B를 입력받아 저장하고, 기준클럭에 의해 상위 비트측으로 워드크기 w만큼씩 쉬프트한다.
이때, 논리연산부(120)는 워드 크기 w에 좌우된다. w비트이면 w개의 세부적인 논리연산부 f₀, f₁, …, f _w _-1로 구성된다. 즉, 워드 크기가 2이므로 2개의 논리연산부로 구성되어 f₀ 논리연산부(120a)와 f₁ 논리연산부(120b)로 구성된다.
이러한 f₀ 논리연산부(120a)와 f₁ 논리연산부(120b)는 동일한 구조를 가진다. f _i 논리연산부(120)는 A 레지스터부(100)에 저장된 값을 사전에 설정된 알고리즘에 의해 배타적 논리합 연산을 수행하는 GNB 배타적 논리합 연산부(121)와, GNB 배타적 논리합 연산부(121)를 통해 논리연산된 값과 상기 B 레지스터부(110)에 저장된 값을 곱셈연산 하는 엔드게이트들(123)로 구성되어 있다.
또한, 논리곱 연산부(130)는 m과 w에 의해 좌우된다. L과 r은 L=

,

으로 정의되어 L클럭 후에 결과값을 얻게 된다. 또한 마지막 L번째 클럭에서는 w-r개의 원소가 첫 번째 클럭에서의 결과값과 중복되므로 f₀, f₁, …, f _r , f _r ₊₁, …, f _w _-1 논리연산부의 결과값 중에서 r개의 논리연산부 결과값 f₀, f₁, …, f _r 을 제외한 w-r개의 논리연산부 결과값은 제 1 배타적 논리합 연산부(140)에 영향을 미치지 않아야 한다.
따라서, 논리곱 연산부(130)는 m과 w에 따라 w-r개가 나타난다. 그리고 제어신호는 마지막 클럭에서 중복되는 부분을 피하기 위한 방법으로 0을 곱해준다. 그러면 제어신호는 (1,1,…,1,0)과 같이 L-1번째 클럭까지는 1을, 마지막 L번째 클럭에서만 0으로 제어한다.
이때, m=7, w=2이므로 L=4가 되어 4클럭 후에 결과값을 얻게 된다. r=1로서 마지막 4번째 클럭에서는 f₁ 논리연산부의 출력값이 1번째 클럭의 출력값과 중복되므로 f₁ 논리연산부의 출력값이 결과값에 영향을 미치지 않게 하기 위해 논리곱 연산부의 제어신호 0과 논리곱 연산을 수행하여 모든 값이 0으로 셋팅되고 출력값이 중복되지 않게 되는 것입니다.
제 1 배타적 논리합 연산부(140)는 수학식 20에서와 같이 동시에 논리연산부(120)의 w개 세부 논리연산부, f₀, f₁, …, f _w _-1 논리연산부의 대응되는 m비트 출력값을 XOR하기 위한 것이다. 따라서

개의 2-입력 XOR 게이트가 필요하게 됩니다.
따라서 제 1 배타적 논리합 연산부(140)는 w에 의해 좌우된다. 이때, m=7, w=2이므로 7개의 2-입력 XOR 게이트로 구성되며 워드 크기 w값이 증가하면 XOR 게이트도 증가하게 된다.
제 2 배타적 논리합 연산부(160)는 XOR 게이트와 같은 역할을 수행합니다. 각 클럭마다 출력되는 제 1 배타적 논리합 연산부(140)의 출력값과 딜레이부(150)의 레지스터(R1, R2, R3, R4, R5, R6, R0)의 XOR 연산을 수행한다. 제 2 배타적 논리합 연산부(160)는 워드 크기 w와 상관없이 m비트값을 출력하기 위한 것이므로 m개의 XOR 게이트로 구성된다.
딜레이부(150)는 단지 딜레이의 역할로서 임시로 출력값을 저장하였다가 다음 클럭에서 출력되는 값과 같이 제 2 배타적 논리합 연산부(160)에서 XOR 연산을 위해 딜레이시키기 위한 역할을 수행한다.
따라서, 딜레이부(150)는 도 1에 도시된 딜레이부(30)의 R _i 와 혼돈되지 않도록 모호성을 배제하기 위해서는 R _i 를 D _i 로 바꾸어 생각하여야 한다.
그리고, f _i 논리연산부(120)는 각각 m비트의 출력값을 가지게 됩니다. 또한 제어신호에 의해 1번째 클럭부터 L-1번째 클럭까지는 f _i 논리연산부는 각 m비트의 출력값이 제 1 배타적 논리합 연산부에 입력되어야 하므로, f _i 논리연산부는 각 m비트 출력값이 제어신호 1과 각각 곱해져서 제 1 배타적 논리합 연산부(140)에 전달e된다.
즉 제어신호 1과 f _i 논리연산부(120)는 각 m비트 출력값이 논리곱 연산부에서 연산 수행한 결과는 f _i 논리연산부의 출력값과 같다. 하지만 마지막 L클럭에서는 w-r개의 f _i 논리연산부의 각 m비트 출력값이 1번째 클럭의 출력값과 중복됨을 피하기 위해 각 m비트와 0을 곱하여 출력값이 제 1 배타적 논리합 연산부(140)에 전달되지 않게 하는 역할을 한다.
이 제어신호는 L에 의해 좌우되며 마지막 클럭에서만 0이 입력되어 중복을 회피할 수 있다. 이는 m/w가 정수가 아닌 경우에 발생하여 중복을 피하는 것이다. 만약 m/w가 정수인 경우는 논리곱 연산부(130)는 필요없으며 따라서 제어신호도 필요하지 않다.

알고리즘 4.2. GF(2 ^m )상의 GNB를 이용한 워드-레벨 곱셈 알고리즘 Input : A, B ∈ GF(2 ^m ) Output : D=(D ₀ , D ₁, …, D_m _-1), D_s =c_s for all 0≤s≤m-1, where AB=

. Initial : A ← (a ₀, a ₁, …, a_m _-1), B ← (b ₀, b ₁, …, b_m _-1), D ← (D ₀, D ₁, …, D_m _-1) ← (0, 0, …, 0). 1. For t=0 to L-2 2. For s=0 to m-1 3. D_s ₊₍ _t ₊₁₎ _w _- _r ← y_s _, _s ₊ _tw + y_s _, _s ₊ _tw ₊₁ + … + y_s _, _s ₊ _tw ₊₍ _w _-1) + D_s ₊ _tw _- _r 4. End for 5. End for 6. t=L-1 7. For s=0 to m-1 8. D_s ₊₍ _t ₊₁₎ _w _- _r ← y_s _, _s ₊ _tw + y_s _, _s ₊ _tw ₊₁ + … + y_s _, _s ₊ _tw ₊₍ _w _-1)- _r + D_s ₊ _tw _- _r 9. End for 10. Return D

,

먼저, 표 2는 도 3의 워드-레벨 곱셈기의 회로도에 대한 알고리즘으로 표현한 것이다.
여기서, 표 2의 알고리즘에서 0≤t≤L-2일 때, 모든 0≤s≤m-1에 대한 y_s _, _s ₊ _tw , y_s _, _s ₊ _tw ₊₁, …, y_s _, _s ₊ _tw ₊₍ _w _-1)을 계산하기 위해 w개의 블록이 필요하지만 t=L-1일 때는 r개의 블록이 t=0일 때의 원소와 중복되므로 w-r개의 블록 y_s _, _s ₊₍ _L _-1) _w , y_s _, _s ₊₍ _L _-1) _w ₊₁, …, y_s _, _s ₊₍ _L _-1) _w ₊₍ _w _- _r _)-1만 계산하면 된다. 그리고 y_s _, _s ₊₍ _L _-1) _w ₊₍ _w _-1)- _r =y_s _, _s ₊ _m _-1이다.
위 알고리즘을 상세하게 설명하면, 첫 번째 사이클(t=0)일 때, 모든 0≤s≤m-1에 대해, D_s ₊ _w _- _r = D_s _- _r + y_s _, _s + y_s _, _s ₊₁ + … + y_s _, _s ₊ _w _-1은 동시에 계산된다. 즉, D_w _- _r = y ₀ _, ₀ + y ₀ _, ₁ + … + y ₀ _,w _-1, D_w _- _r ₊₁ = y ₁ _, ₁ + y ₁ _, ₂ + … + y ₁ _,w , …, D_w _- _r ₊₍ _m _-1) = y_m _-1 _,m _-1 + y_m _-1 _, ₀ + … + y_m _-1 _,w _-2가 동시에 계산된다. 다시 말해, w개의 블록 - y_s _, _s 블록, y_s _, _s ₊₁ 블록, …, y_s _, _s ₊ _w _-1 블록이 동시에 계산되는 것이다. 또한, t=1일 때, 모든 0≤s≤m-1에 대해, D_s ₊₂ _w _- _r = D_s ₊ _w _- _r + y_s _, _s ₊ _w + y_s _, _s ₊ _w ₊₁ + … + y_s _, _s ₊₂ _w _-1은 동시에 계산된다. 즉, D ₂ _w _- _r = D_w _- _r + y ₀ _,w + y ₀ _,w ₊₁ + … + y ₀ _, ₂ _w _-1 = y ₀ _, ₀ + y ₀ _, ₁ + … + y ₀ _, ₂ _w _-1, D ₂ _w _- _r ₊₁ = D_w _- _r ₊₁ + y ₁ _,w ₊₁ + y ₁ _,w ₊₂ + … + y ₁ _, ₂ _w = y ₁ _, ₁ + y ₁ _, ₂ + … + y ₁ _, ₂ _w , …, D ₂ _w _- _r ₊₍ _m _-1) = D_w _- _r ₊₍ _m _-1) + y_m _-1 _,w _-1 + y_m _-1 _,w + … + y_m _-1 _, ₂ _w _-2 = y_m _-1 _,m _-1 + y_m _-1 _, ₀ + … + y_m _-1 _, ₂ _w _-2로 w개의 블록이 동시에 계산된다. 마지막으로 L-1번째 사이클(t=L-1)일 때, 모든 0≤s≤m-1에 대해, D_s ₊ _Lw _- _r = D_s ₊₍ _L _-1) _w _- _r + y_s _, _s ₊₍ _L _-1) _w + y_s _, _s ₊₍ _L _-1) _w ₊₁ + … + y_s _, _s ₊₍ _L _-1) _w ₊₍ _w _- _r _)-1 = D_s ₊₍ _L _-1) _w _- _r + y_s _, _s ₊₍ _L _-1) _w + y_s _, _s ₊₍ _L _-1) _w ₊₁ + … + y_s _, _s ₊ _m _-1은 동시에 계산된다.
이때, 이하에서 설명되는 수학식 17 내지 수학식 19는 도 3에 도시된 논리연산부(120)에 해당한다.
여기서 D_s ₊ _Lw _- _r = D_s ₊ _m = D_s . 즉,

D ₀ = D ₍ _L _-1) _w _- _r + y _0,( _L _-1) _w + y _0,( _L _-1) _w ₊₁ + … + y _0, _m _-1

= y ₀ _, ₀ + y ₀ _, ₁ + … + y ₀ _,m _-1 = c ₀

D ₁ = D ₁₊₍ _L _-1) _w _- _r + y _1,1+( _L _-1) _w + y _1,1+( _L _-1) _w ₊₁ + … + y _1,0

= y ₁ _, ₁ + y ₁ _, ₂ + … + y ₁ _, ₀ = c ₁
……
……

D_m _-1 = D_m _-1+( _L _-1) _w _- _r + y_m _-1, _m _-1+( _L _-1) _w + y_m _-1, _m _-1+( _L _-1) _w ₊₁ + … + y_m _-1, _m _-2

= y_m _-1 _,m _-1 + y_m _-1 _, ₀ + … + y_m _-1, _m _-2
이 동시에 계산된다. 다시 말해, 고정된 s에 대해, 마지막 출력값 D_s 는 다음과 같은 방법으로 연속적으로 계산된다.

이때, 수학식 20은 도 3에 도시된 논리연산부(120), 논리곱 연산부(130), 제1 배타적 논리연산부(140), 딜레이부(150) 및 제2 배타적 논리연산부(160)에 해당한다.
앞 절에서 x_s _-1, _s' (=y _s-1, _s )는 x_s _, _s' (=y _s, _s )의 벡터 a_i , b_i 들을 오른쪽으로 한 번 순환 쉬프트해서 얻어진다는 것을 알았다. 그러므로 w개의 블록을 동시에 계산하기 위해서 y_s _, _s 블록과 y_s _, _s 의 벡터 a_i , b_i 들을 오른쪽으로 각각 한 번, 두 번, …, w-1번 순환 쉬프트하여 얻게 되는 y_s _, _s ₊₁ 블록, y_s _, _s ₊₂ 블록, …, y_s _, _s ₊ _w _-1 블록을 동시에 계산하여야 하며 이때, 각 블록의 구조는 동일하다. 모든 0≤s≤m-1에 대해, y_s _, _s 를 실행하기 위해서는 m개의 AND 게이트와

개의 XOR 게이트가 필요함을 알 수 있다.
따라서 D_s ₊ _tw _- _r 을 계산하기 위해서는 w개의 블록을 동시에 계산해야 하므로 wm개의 AND 게이트와 많아야

개의 XOR 게이트가 필요하다. 또한, D_s ₊ _tw _- _r 는 D_s ₊₍ _t _-1) _w _- _r , y_s _, _s ₊₍ _t _-1) _w , y_s _, _s ₊₍ _t _-1) _w ₊₁, …, y_s _, _s ₊₍ _t _-1) _w ₊₍ _w _-1)의 합을 계산하기 위해 wm개의 XOR 게이트가 필요하다. 특히, t=L-1일 때 r개의 블록이 중복되므로 중복되는 r개의 블록 값을 회피하기 위해 제어 신호 '0'과 AND 연산을 하게 되므로 rw개의 AND 게이트가 필요하게 된다. 따라서 제안된 워드-레벨 곱셈기의 복잡도는 도 3에 도시한 바와 같이 wm+rm개의 AND 게이트와 많아야

개의 XOR 게이트로 이루어진다. 처리 지연 시간은 T _A와 많아야

이고, r개의 블록에서 나온 값과 제어 신호가 AND 연산을 한 후, D_s ₊₍ _t _-1) _w _- _r + y_s _, _s ₊₍ _t _-1) _w + y_s _, _s ₊₍ _t _-1) _w ₊₁ + … + y_s _, _s ₊₍ _t _-1) _w ₊₍ _w _-1)를 계산하므로 T _A와

이다.
따라서 처리 지연 시간은 많아야 2T _A +

이다. 특히 k=2이면 워드-레벨 곱셈기의 처리 지연 시간은 2T _A +

이고, wm+rm개의 AND 게이트와 많아야

개의 XOR 게이트가 필요하다.
전술한 바와 같은 본 발명에 따른 워드-레벨 곱셈기의 성능 분석을 한 결과 기존의 곱셈기보다 작은 하드웨어 복잡도와 훨씬 낮은 처리 지연 시간을 나타내었다.
전술한 바와 같은 내용을 바탕으로, w=2인 GF(2⁷)상의 GNB 타입 4를 사용한 워드-레벨 곱셈 C=AB=

는 아래의 수학식 (21)과 같이 얻을 수 있다. w=2이므로 아래의 밑줄 친 두 개의 원소를 XOR 연산하여 레지스터에 저장하게 되는데 밑줄 친 원소의 첫 번째 항은 a_i 들의 공통된 항을 가지는 주 대각 원소를 계산하는 f ₀ 블록이고, 두 번째 항은 주 대각 원소의 벡터 a_i , b_i 들을 한 번 순환 쉬프트하여 계산하는 f ₁ 블록이다.

여기서, 수학식 21은 논리 연산부(120)를 거쳐 제1 배타적 논리 연산부(140)까지의 과정을 보인 것으로 1 클럭에 해당한다. 이때, 밑줄친 부분에서 + 의 앞부분은 F0 논리연산부(120a)에 해당되고, + 의 뒷부분은 F1 논리연산부(120b)에 해당된다.
도 3은 w=2에 대한 GF(2⁷)상의 GNB 타입 4를 사용한 C=AB의 대응되는 쉬프트 레지스터 회로이고, 도 4는 도 3에서의 f_j 블록 구조를 나타낸다. 워드-레벨 구조에서는 한 클럭 사이클마다 w=2개의 원소를 연산하므로 A, B, R 레지스터는 w 크기만큼 순환 쉬프트하고 L=4 클럭 사이클 후에 모든 연산이 끝나고 결과가 출력된다. m이 홀수이기 때문에 두 개의 원소씩 처리하면 마지막 클럭 사이클에서는 중복되는 원소의 연산이 나타나므로 이 중복되는 연산을 회피해야 한다.
워드의 크기 w에 따라 중복되는 원소의 개수 w-r도 다르게 나타나므로 중복되는 원소의 개수만큼 마지막 클럭 사이클에서 제어 신호로 제거해 주어야 한다. 도 3에 나타나듯이 제안된 곱셈기는 (1, 1, 1, 0)의 제어 신호를 사용하여 마지막 제어 신호에서는 중복되는 값을 회피하였다. 또한, m/w가 정수가 아니므로 곱셈 결과 레지스터 R_i 는 정확한 c_i 를 가지지 않는다. 따라서 정확한 출력을 위해 R_i 의 위치를 변경해야만 하고 r만큼 순환 쉬프트하여 해결할 수 있다. 이와 같이 w=2에 대한 GF(2⁷)상의 GNB 타입 4에 대한 워드-레벨 곱셈기를 살펴보면, 공간 복잡도가 증가하였고 처리 지연 시간 또한 2T _A+4T _X이지만, 비트 시리얼 곱셈기가 m클럭 사이클 후에 출력되는데 반해, L클럭 사이클 후에 출력할 수 있다는 장점이 있다. 기존의 곱셈기는 선택된 원시 기약 다항식에 따라 서로 다른 하드웨어 구조를 가지지만 본 발명에서 제안된 곱셈기는 동일한 GNB 타입만 가지면 동일한 형태의 하드웨어 구조를 얻을 수 있다. 따라서 본 발명에 따른 유한체 GF(2^m)상의 곱셈기는 ECC의 곱셈기로 매우 적합하다고 할 수 있다.
전술한 바와 같은 본 발명에 따른 유한체상의 곱셈기는 GF(2^m)상의 곱셈기의 FPGA 구현 및 기능 검증을 위해 VHDL로 회로를 기술한 것이고, Xilinx사의 ISE 6.3i를 이용하여 회로를 합성한 후, Mento Graphics사의 Model Sim을 이용하여 그 기능을 검증하였다.
이상 설명한 내용을 통해 당업자라면 본 발명의 기술 사상을 일탈하지 아니하는 범위에서 다양한 변경 및 수정이 가능함을 알 수 있을 것이다.
따라서 본 발명의 기술적 범위는 실시 예에 기재된 내용으로 한정되는 것이 아니라 특허 청구의 범위 및 그와 균등한 것들에 의하여 정해져야 한다.Embodiment according to the present invention for achieving the above object is a finite bodyGF(2 ^m In the multiplier on?), The first register unit 10 which receives and stores the vector A and shifts one bit to the upper bit side by the reference clock, and the value stored in the first register unit are exclusively set by a predetermined algorithm. An exclusive OR operation unit 40 performing an OR operation, a second register unit 20 receiving and storing a vector B, and shifting a bit to an upper bit side by a reference clock, and the second register unit in units of a reference clock. A multiplication value obtained by multiplying the value stored in (20) and the logical operation value through the exclusive OR operation unit 40, and XOR operation with each of the values stored in each of the plurality of detailed registers R0 to R6 And a third register unit 30 for delaying.
Here, since the embodiment of the present invention is a bit operation, the logical operation is performed bit by bit.
Each of the plurality of detailed registers R0 to R6 of the third register unit 30 is an AND that logically multiplies a value stored in the second register unit 20 and a value logically calculated by the exclusive OR operation unit 40. A delay unit for delaying the OR operation of the gate 30a, the exclusive OR of the value multiplied by the AND gate 30a with the shifted value from the lower detail register, and a delay unit delaying the value logically calculated by the XOR gate. It is characterized by comprising a (30c);
Another embodiment according to the invention is a finite bodyGF (2 ^m )In multiplicator of phase, vector A is inputted and stored, and word size to upper bit side by reference clock.wA register section 100 that shifts by a number and vector B are received and stored, and the word size is shifted to the upper bit side by a reference clockwThe B register unit 110 shifts by a number, and the F0 logic operation unit 120a and F1 respectively perform a logical operation on a value stored in the A register unit and a value stored in the B register unit in a reference clock unit according to a preset algorithm. A logical operator 120 including a logical operator 120b, a logical AND operator 130 for logically multiplying a value output from the F1 logical operator 120b and a control value provided from the outside, and the F0 logical operator 120a. ) And a first exclusive OR operation unit 140 performing an exclusive OR operation on the value output from the AND product operation unit 130, and the values and lower registers R0 to R6 processed by the first exclusive OR operation unit 140. Word sizewAnd a second exclusive logic operation unit 160 for performing an exclusive OR operation on the shifted lower register values and a delay unit 150 for delaying a value output from the second exclusive logic operation unit.
The logic operation unit 120 performs a logical OR operation for performing an exclusive OR operation on a value stored in the A register unit 100 by a preset algorithm, and the logic through the GNB exclusive OR operation unit 121. And end gates 123 that multiply the calculated value with the value stored in the B register 110.
Here, another embodiment of the present invention is one word size because it is a word operationwLogical operations are performed by
The control value is controlled by the first exclusive logical OR operation unit 140 by the control value 1 to transfer the output value of the F1 logical operation unit 120b to the first exclusive logical OR operation unit 140 from the first clock to the last previous clock. Logic operation is performed on the first exclusive OR unit 140 by a control signal 0 so as to perform an AND operation and not transmit the output value of the F1 logical operation unit 120b to the first exclusive OR unit 140 at the last clock. It is characterized by multiplying.
Hereinafter, with reference to the accompanying drawings finite body according to the present inventionGF(2^mWill be described in more detail.
1 is a finite body according to the present inventionGF(2^mFig. 2 is a circuit diagram of a bit-level multiplier of Fig. 1, Fig. 2 is a detailed configuration diagram of the third register part shown in Fig. 1, and Fig. 3 is a finite body according to the present invention.GF(2^mFig. 4 is a circuit diagram of a word-level multiplier of Fig. 3, and Fig. 4 is a detailed circuit diagram of the functional block shown in Fig. 3.
Finite body according to the present inventionGF(2^mThe multiplier on) can be roughly divided into a bit-level multiplier performing a bit-wise multiplication operation and a word-level multiplier performing a word-based multiplication operation. The bit-level multiplier and word-level multiplier are implemented using a Gaussian normal basis.
Hereinafter, an embodiment of the present invention corresponding to FIGS. 1 and 2 will be described in detail.
First, an AND gate 30a of the third register unit 30 is converted into a value logically calculated through an exclusive OR operation unit 40 from a value stored in the first register unit 10 and a value stored in the second register unit 20. Multiply by to output the calculation result, and at the initial clock, the delay unit (30c) _i Since all values of 0 are set to 0, the multiplication result at the AND gate 30a is stored in the delay unit 30c as it is. ) Is transferred to the upper registers from the detailed registers R0 to R6 of the third register unit 30 to execute an XOR operation in the XOR gate unit 30b to perform the third register unit 30. Are stored in each of the delay units 30c.
The plurality of detailed registers R0 to R6 include two values, in which the value of the first register unit 10 performs the logical operation of the exclusive OR operation unit 40 and the value of the second register unit 20. An AND gate 30a for receiving and multiplying an input value, and an XOR gate 30b for performing an XOR operation on each value of the AND gate 30a and the output value of the lower delay unit 30c of the third register unit 30, R, which is composed of a delay unit 30c for delaying the value, is the third register unit 30. _i The output value of D is the delay unit 30c. _i Is a 1-bit output value.
The algorithm applied to implement the bit-level multiplier described above is shown in Table (1) below.

Multiplication Algorithm on GF (2 ^m ) Using GNB Input:

Output:

,

for all

, where

. Initial:

. 1. For

to

2. For

to

3.

4. End for 5. End for 6. Return D

A process of deriving a multiplication algorithm as shown in Table 1 will be described.
First, Table 1 represents the algorithm for the circuit diagram of the bit-level multiplier of FIG. In addition, in Table 1, step 3 shows a process corresponding to FIG.
And, finite bodyGF(2 ^m )silverGF(2) topm3d vector into spaceGF(2 ^m Element onABasal

About

It can be expressed as

In this case, Equation 1 corresponds to the first register unit 10, and when the input vector is B, the equation 1 also corresponds to the second register unit 20. And,

The basis of the form is called NB (Normal Basis).GF(2 ^m For high safety of ECCmRequires. This condition is necessary to avoid Pohlig-Hellman type of attack. For example, NIST and IEEE 1363 recommend field sizes for Elliptic Curve Digital Signature Algoritym (ECDSA).mAs 163, 233, 283, 409 and 571mIs an odd prime number. Therefore, in the present inventionmOnly those odd prime numbers are considered.
Equations 2 to 16 described below show the progress of the exclusive OR operation unit 40 shown in FIG. 1.
Then, for NB,

Say,

ego,

to be. hereλ _ij ⁽ ^s ⁾ToGFLet's call it an element of (2). Then, randomtAbout,

ego,λSubscripts above and below are mod (modulation operation)mto be.

_s If you compare the coefficients of,λ _ij ⁽ ^s ⁾=λ _i _- _{t, j} _- _t ⁽ ^st ⁾ego,λ _ij ⁽ ^s ⁾=λ _i _- _{s, j} _- _s ⁽⁰⁾It can be seen that.GF(2 ^m Element onA,BMultiply byC=ABIs

ego,CCoefficient ofc _s Is

to be.
here

_i

_j instead

_j Using,

_i Wow

_mi By using the symmetry of the following equation (6) can be obtained.

In equation (6) above

=

to be.
Then, finite field using Gaussian normal basisGF(2 ^mk Next, the multiplication algorithm of () will be described.
first,m,kDecimalpFor ≠ 2,p=mkIs a positive integer of +1,K= <τ>GF(p)^×Order inkLet's call it the only subgroup that is.βendGF(2 ^mk Unity on)pThe first primitive root, the next element

ToGF(2)m,kIt is called Gauss period of type. ord _p Mod 2pIs the power of 2 for gcd (mk/ ord _p 2,m) = 1

IsGF(2 ^m ) Is an element of NB and 0≤i≤mFor -1,

Lay, {

₀,

₁ ,

₂ , ...

_m-1}silverGF(2) topGF(2 ^m ) Is the basis forGF(2 ^m On)m or (m,kThis is called the Gaussian regular basis of type.
Then, using a Gaussian normal basis

_i If you find,

To get 1+τ ^u 2 ^v = 0GF(p)u≤k-1 and 0≤v≤m-1 is unique. if,t≠u ori≠vIf,tWowiAny 0 ≤ determined byσ(t,i) ≤mFor 1, 1+τ ^t 2 ⁱ ∈K _σ ₍ _t _, _i ₎Get Thus randomtFor ′, 1+τ ^t 2 ⁱ =τ ^t ^′2 ^σ ⁽ ^t ^, ⁱ ⁾Can be written as: herei≠vwhen,

to be. Also,i=vwhen,

to be. therefore

Isi≠vAbout,

At mostkIs calculated as the sum of the base elements,

Should be at mostk-1 base element and constant partk= 0,1∈GF(2) Is calculated as the sum of

Lala,λ _ij ToGFLet's call it an element of (2). If you multiply two by both sides,λ _ij ⁽ ^s ⁾=λ _ij _, _sj to be.
In view of the above, the following theorem holds.
(theorem)
kIs an even number,

This typekIf it is GNB,
λ _ij ⁽⁰⁾=λ _ij
to be.
(proof) This proofλ _ij ⁽ ^s ⁾=λ _ij _, _-j It is enough to show that From equations (9) and (10)λ _ij = 1
1+τ ^s 2 ⁱ =τ ^s ^′2 ^j
To satisfy (s,s') (modkNotice that there is an odd ordered pair of). At this time,SAll ordered pairs (s,s') (modkIs called a set,T1+τ ^t 2 ⁱ ^- ^j =τ ^t ^′2 ^-j Any ordered pair that satisfies (t,t ') (modmLet's call it a set. Both sides of equation (13)τ ^s ^′2 ^j Divided by τ ^-s ^′2^- ^j +τ ^ss ^′2 ⁱ ^- ^j = 1,τHead countkegokIs even, so -1 =τ ^{k /} ²egoτ ^-s ^′2^- ^j = 1 +τ ^{k /} ²⁺ ^ss ^′2 ⁱ ^- ^j to be. Wheref _S :S→TIsf _S (s,s´) = (k /2+ss´,-s´) give, thinkf _T :T→SIsf _T (t,t´) = (k /2+tt´,-t´), the two theories correspond one-to-one, so the theorem is proved.
Then, based on the above theorem,

Coefficient ofc _s Is

Same as
Corresponding matrixX= (x _st )AboutGFElement on (2)x _st , 0≤s,t≤m-1

Let's define
thenXoftVectorX _t = (x _0t ,x _1t ,… ,x _m-1 _, _t )^TAnd, at this time, (x _0t ,x _1t ,… ,x _m-1 _, _t )^TIs a row vector (x _0t ,x _1t ,… ,x _m-1 _, _t ) Is a transpose matrix.
Also,

Because all column vectorsX _t ,t= 0, 1,... ,mThe sum of -1 is (c ₀,c _One,… ,c _m _-One)^Tto be. Thus the column vectorX _t Gate complexity and critical paths can be reduced by rearranging and reusing the subsumed signals in the computation.
Meanwhile,mLet's say -1 = 2υXBy substitutingm × m processionY= (y _st Let's define
when υ is odd,YIf you define
(X _υ , ..., X ₃ , X ₁ , X _m-1 , ..., X _m-υ , X _υ-1 , ..., X ₂ , X ₀ , X _m-2 , ... X _{m-υ + 1} ) ego,
when υ is even,YIf you define
(X _υ , ..., X ₂ , X ₀ , X _m-2 , ..., X _m-υ , X _υ-1 , ..., X ₃ , X ₁ , X _m-1 , ... X _{m-υ + 1} ) to be.
thenY _t = (y _0t ,y _1t ,… ,y _m-1 _, _t )^TsignYAll columns ofY _t ,t= 0, 1,... ,mThe sum of -1 is (c ₀,c _One,… ,c _m _-One)^TsignXAll columns ofX _t ,t= 0, 1,... ,mIs equal to the sum of -1. To implement parallel-input, parallel-output multiplicationYInstead of calculating the sum of the column vectors of,YCompute the sum of the shifted diagonal vectors of. This can be obtained from the following result. processionYVector in representation ofX _t ofX _mt Exactly betweentThere is one column. Also,X _t ofsWith the first elementX _mt ofs+tThe first element is in their summanda _i Has the same term
Also,

Get At this time, the third expression in the above expression (16) is subscriptediIn consensus rearrangement and the final expressionλ _ij =λ _i _- _j, _- _j Comes from thereforex _st Wowx _s ₊ _t _, _m _- _t Is the same term in the expression of

HaveABThe number of XOR gates can be saved while calculating.
Accordingly, as described above, a new GNB multiplication algorithm as shown in Table 1 can be obtained. FIG. 1 shows a configuration of a bit-level multiplier implemented using the GNB multiplication algorithm as described above.
The bit-level multiplier receives and stores a vector A, and performs an exclusive OR operation on a first register 10 and a value stored in the first register by a reference clock. The exclusive OR operation unit 40 which performs the operation, the second register unit 20 which receives and stores the vector B, shifts to the upper bit side by the reference clock, the value stored in the first register unit in the unit of the reference clock, and the first register unit. And a third register unit 30 for multiplying the values stored in the two register units and the logically calculated values through the exclusive OR operation unit.
Hereinafter, the process of an embodiment of the present invention will be described in detail.
First, as shown in equations (14) and (15),

Coefficient ofc _s By Equation 14 and Equation 15

You can see that.
AlsoX _t = (x _0t ,x _1t ,… ,x _m-1 _, _t )^TIsXoftIs the tenth column vectorXBy substitutingX _t Is rearrangedm×m processionYCan be obtained.
For example, in Equation 21tIs 0, i.e. y at the first clock₀₀= (a ₂+a ₅)b ₃, y₁₁= (a _One+a ₃+a ₆+a ₀)b ₂, y₂₂= (a ₃+a ₆+a ₀+a _One)b _One, y₃₃= (a ₅+a ₂)b ₀, y₄₄= (a ₅+a ₀+a _One+a ₂)b ₆, y₅₅=a ₆ b ₅, y₆₆= (a ₀+a _One+a ₂+a ₅)b ₄This is calculated.
Where y _{s, s} ofaLooking only at the coefficient of, y₀₀And y₃₃, y₁₁And y₂₂, y₄₄And y₆₆It can be seen that this is the same. Also y₀₀And y₃₃Used in (a ₂+a ₅) Is y₄₄And y₆₆Reused for y₁₁And y₂₂Used in (a ₀+a _One) Degrees y₄₄And y₆₆Is reused. By using this property stored in the first register unit 10AfromA The part implemented with the circuit diagram expressed as the sum of the coefficients is the exclusive OR calculation unit 40. This exclusive OR operation unitmDepends on.
Each of the result values in the exclusive OR operation unit 40A Sum of coefficients, (a ₂+a ₅), (a _One+a ₃+a ₆+a ₀), (a ₃+a ₆+a ₀+a _One), (a ₅+a ₂), (a ₅+a ₀+a _One+a ₂),a _6,(a ₀+a _One+a ₂+a ₅) And the second register part 20BCoefficient of,b ₃,b ₂,b _One,b ₀,b ₆,b ₅,b ₄R of the third register 30 is for calculating the product of _i This corresponds to the end gate 30a.
And initially R _i Delay part (30c) D _i Are set to 0, the value of the AND gate 30a and the R of the third register 30 are zero. _i D delivered from the delay unit 30c of _i 0 = XOR operation on XOR gate 30b, then each D _i Y is stored in₀₀= (a ₂+a ₅)b ₃, y₁₁= (a _One+a ₃+a ₆+a ₀)b ₂, y₂₂= (a ₃+a ₆+a ₀+a _One)b _One, y₃₃= (a ₅+a ₂)b ₀, y₄₄= (a ₅+a ₀+a _One+a ₂)b ₆, y₅₅=a ₆ b ₅, y₆₆= (a ₀+a _One+a ₂+a ₅)b ₄You get
In addition, the following equation is for explaining the process of the second clock in (21).

first,tIs 1, that is, stored in the first register unit 10 at the second clock.AValue and stored in the second register section 20BY after each value is shifted to the upper bit side₆₀= (a _One+a ₂+a ₃+a ₆)b ₅, y₀₁= (a ₃+a ₆)b ₄, y₁₂= (a ₂+a ₄+a ₀+a _One)b ₃, y₂₃= (a ₄+a ₀+a _One+a ₂)b ₂, y₃₄= (a ₆+a ₃)b _One, y₄₅= (a ₆+a _One+a ₂+a ₃)b ₀, y₅₆=a ₀ b ₆This is calculated.
A value obtained by executing the exclusive OR operation unit 40 after the value of the first register unit 10 is shifted to the upper bit side, (a _One+a ₂+a ₃+a ₆), (a ₃+a ₆), (a ₂+a ₄+a ₀+a _One), (a ₄+a ₀+a _One+a ₂), (a ₆+a ₃), (a ₆+a _One+a ₂+a ₃),a ₀And the value of the second register section 20 after being shifted to the upper bit side.BCoefficient of,b ₅,b ₄,b ₃,b ₂,b _One,b ₀,b ₆The product of R in the third register 30 _i It calculates in the AND gate 30a corresponding to this.
In this clock, R is the first clock _i Delay part (30c) D _i Y stored in the field₀₀, y₁₁, y₂₂, y₃₃, y₄₄, y₅₅, y₆₆After receiving XOR operation at XOR gate 30b, each D _i Are stored in.
As such, GF (2) according to an embodiment of the present invention.⁷The bit-level multiplier on) can get the final result after seven clocks. And, the AND gate and the XOR gate of the embodiment of the present invention are two-input one-output gates. Thus, the product and XOR operations take only two inputs and perform the operation.
Then, finite body using GNBGF(2 ^m Will be described.
The word-level structure applied to the finite field multiplier according to the present invention divides data into units of a word having a certain size, and then processes and transmits the units of words. Data sizemBits and the size of the wordwBit, word count

Becomes Bit serial structuremOutput results every clock cycle, but the word-level structureLOutput results every clock cycle. The word-level structure can shorten the computation time as the word size increases, but the hardware complexity increases. However, finding the most appropriate word size to satisfy time and space can improve the tradeoff between the two. Table (2) below uses GNB.GF(2 ^m Is a word-level multiplication algorithm.
Hereinafter, another embodiment of the present invention shown in FIGS. 3 and 4 will be described in detail.
First, the A register unit 100 receives A and stores A, and a word size toward the upper bit side by a reference clock.wShift by
The B register 110 receives and stores the vector B, and the word size is increased to the upper bit side by the reference clock.wShift by
At this time, the logical operation unit 120 depends on the word size w. w bits, w detailed logical operations f₀, f_One,… , f _w _-OneIt consists of. In other words, since the word size is 2, it consists of two logical operations.₀ Logical operation unit 120a and f_One Logical operation unit 120b.
Such f₀ Logical operation unit 120a and f_One The logical operation unit 120b has the same structure. f _i The logic operation unit 120 performs a logical OR operation on the GNB exclusive OR operation 121 which performs an exclusive OR operation on a value stored in the A register unit 100 by a preset algorithm, and a value calculated through the GNB exclusive OR operation unit 121. And end gates 123 that multiply the values stored in the B register unit 110.
In addition, the logical AND operation unit 130mandwDepends on.LandrsilverL=

,

Is defined asLAfter the clock you get the result. Also lastLOn the first clockw-rElements overlap because of the result at the first clock₀, f_One,… , f _r , f _r ₊₁,… , f _w _-One Out of the result of logical operationrLogical operator result f₀, f_One,… , f _r excludingw-rThe result of the two logical operations must not affect the first exclusive OR operation 140.
Therefore, the logical product operation unit 130mandwDepending on thew-rThe dog appears. The control signal is then multiplied by zero to avoid overlapping parts of the last clock. Then the control signal is like (1,1,…, 1,0)L1 for the first clock, lastL0 is controlled only at the first clock.
At this time,m= 7,w= 2L= 4 and you get the result after 4 clocks.r= 1, f is the last 4th clock_One Since the output value of the logical operation part overlaps with the output value of the first clock, f_One In order to prevent the output value of the logical operation part from affecting the result value, all the values are set to 0 and the output values are not duplicated by performing the logical product operation with the control signal 0 of the logical product operation part.
The first exclusive OR operation unit 140 simultaneously performs the operation of the logical operation unit 120 as shown in Equation 20.wLogic units, f₀, f_One,… , f _w _-One Corresponding logic unitmTo XOR bit output value. therefore

You will need two 2-input XOR gates.
Therefore, the first exclusive logical OR operation unit 140wDepends on. At this time,m= 7,w= 2, consisting of seven 2-input XOR gates, word sizewIncreasing the value also increases the XOR gate.
The second exclusive OR operation 160 performs the same function as the XOR gate. An XOR operation of the output value of the first exclusive OR operation unit 140 outputted for each clock and the registers R1, R2, R3, R4, R5, R6, and R0 of the delay unit 150 are performed. The second exclusive OR operation 160 is a word sizewIrrespective ofmIs to output the bit valuemIt consists of two XOR gates.
The delay unit 150 merely serves as a delay, temporarily stores the output value, and serves to delay the XOR operation in the second exclusive OR operation unit 160 as a value output from the next clock.
Therefore, the delay unit 150 is the R of the delay unit 30 shown in FIG. _i To exclude ambiguity so as not to be confused with _i D _i Should be changed to
And f _i Logic operation unit 120 ismIt will have the output value of the bit. Also, from the first clock by the control signalLF until -1 clock _i Logic operation unitmSince the output value of the bit should be input to the first exclusive OR operation part, f _i Logic operation unitmThe bit output value is multiplied by the control signal 1, respectively, and transmitted to the first exclusive OR calculation unit 140.
Control signals 1 and f _i Logic operation unit 120 is eachmThe bit output value is calculated by the logical AND. _i It is the same as the output value of the logical operation unit. But lastLW on clock-r f _i Each m-bit output value of the logic operation unit is used to avoid overlapping the output value of the first clock.mBy multiplying the bit with 0, the output value is not transmitted to the first exclusive OR operation 140.
This control signalLIt is dependent on 0 and only 0 is inputted at the last clock to avoid duplication. This happens when m / w is not an integer to avoid duplication. If m / w is an integer, the logical AND operation unit 130 is not necessary, and thus no control signal is required.

Algorithm 4.2. Word-Level Multiplication Algorithm Using GNB on GF (2 ^m ) Input: A , B ∈ GF (2 ^m ) Output: D = ( D ₀ , D ₁ ,…, D _m _-1 ), D _s = c _s for all 0≤ s ≤ m -1, where AB =

. _{Initial: A ← (a 0,} a 1, ..., a m -1), B ← (b 0, b 1, ..., b m -1), D ← (D 0, D 1, ..., D m - ₁ ) ← (0, 0,…, 0). 1.For t = 0 to L -2 2.For s = 0 to m -1 3.D _s _{+ (} _t ₊₁₎ _w _- _r ← y _s _, _s ₊ _tw + y _s _, _s ₊ _tw ₊₁ + … + y _s _, _s ₊ _tw _{+ (} _w _-1) + D _s ₊ _tw _- _r 4. End for 5. End for 6. t = L -1 7.For s = 0 to m -1 8. D _s _{+ (} _t ₊₁₎ _w _- _r ← y _s _, _s ₊ _tw + y _s _, _s ₊ _tw ₊₁ +. + y _s _, _s ₊ _tw _{+ (} _w _-1) _-r + D _s ₊ _tw _- _r 9.End for 10. Return D

,

First, Table 2 is an algorithm of the circuit diagram of the word-level multiplier of FIG.
Where 0≤ in the algorithm of Table 2t≤LWhen -2, all 0≤s≤mFor -1y _s _, _s ₊ _tw ,y _s _, _s ₊ _tw ₊₁,… ,y _s _, _s ₊ _tw _{+ (} _w _-One)To calculatewBlocks requiredt=L-1rBlockstOverlaps with an element when 0w-rBlocksy _s _, _s _{+ (} _L _-One) _w ,y _s _, _s _{+ (} _L _-One) _w ₊₁,… ,y _s _, _s _{+ (} _L _-One) _w _{+ (} _w _- _r _)-OneYou only need to calculate. Andy _s _, _s _{+ (} _L _-One) _w _{+ (} _w _-One)- _r =y _s _, _s ₊ _m _-Oneto be.
The algorithm above is described in detail, the first cycle (t= 0), all 0≤s≤mFor -1,D _s ₊ _w _- _r =D _s _- _r +y _s _, _s +y _s _, _s ₊₁ +… +y _s _, _s ₊ _w _-OneIs calculated at the same time. In other words,D _w _- _r =y ₀ _, ₀ +y ₀ _, _One +… +y ₀ _{, w} _-One,D _w _- _r ₊₁ =y _One _, _One +y _One _, ₂ +… +y _One _{, w} ,… ,D _w _- _r _{+ (} _m _-One) =y _m _-One _{, m} _-One +y _m _-One _, ₀ +… +y _m _-One _{, w} _-2Is calculated at the same time. In other words,wBlocks-y _s _, _s block,y _s _, _s ₊₁ block, … ,y _s _, _s ₊ _w _-One The blocks are computed at the same time. Also,tWhen = 1, all 0≤s≤mFor -1,D _s ₊₂ _w _- _r =D _s ₊ _w _- _r +y _s _, _s ₊ _w +y _s _, _s ₊ _w ₊₁ +… +y _s _, _s ₊₂ _w _-OneIs calculated at the same time. In other words,D ₂ _w _- _r =D _w _- _r +y ₀ _{, w} +y ₀ _{, w} ₊₁ +… +y ₀ _, ₂ _w _-One =y ₀ _, ₀ +y ₀ _, _One +… +y ₀ _, ₂ _w _-One,D ₂ _w _- _r ₊₁ =D _w _- _r ₊₁ +y _One _{, w} ₊₁ +y _One _{, w} ₊₂ +… +y _One _, ₂ _w =y _One _, _One +y _One _, ₂ +… +y _One _, ₂ _w ,… ,D ₂ _w _- _r _{+ (} _m _-One) =D _w _- _r _{+ (} _m _-One) +y _m _-One _{, w} _-One +y _m _-One _{, w} +… +y _m _-One _, ₂ _w _-2 =y _m _-One _{, m} _-One +y _m _-One _, ₀ +… +y _m _-One _, ₂ _w _-2inwBlocks are computed simultaneously. FinallyL-1 cycle (t=LWhen -1), all 0s≤mFor -1,D _s ₊ _Lw _- _r =D _s _{+ (} _L _-One) _w _- _r +y _s _, _s _{+ (} _L _-One) _w +y _s _, _s _{+ (} _L _-One) _w ₊₁ +… +y _s _, _s _{+ (} _L _-One) _w _{+ (} _w _- _r _)-One =D _s _{+ (} _L _-One) _w _- _r +y _s _, _s _{+ (} _L _-One) _w +y _s _, _s _{+ (} _L _-One) _w ₊₁ +… +y _s _, _s ₊ _m _-OneIs calculated at the same time.
In this case, Equations 17 to 19 described below correspond to the logic operation unit 120 illustrated in FIG. 3.
hereD _s ₊ _Lw _- _r =D _s ₊ _m =D _s . In other words,

D ₀ = D ₍ _L _-1) _w _- _r + y _{0, (} _L _-1) _w + y _{0, (} _L _-1) _w _{+ 1} +. + y _0, _m _-1

=y ₀ _, ₀ +y ₀ _, _One +… +y ₀ _{, m} _-One =c ₀

D ₁ = D _{1+ (} _L ₋₁₎ _w ₋ _r + y _1,1 + ₍ _L ₋₁₎ _w + y _{1,1+ (} _L ₋₁₎ _w ₊₁ +. + y _1,0

=y _One _, _One +y _One _, ₂ +… +y _One _, ₀ =c _One
… …
… …

D _m _-1 = D _m _{-1+ (} _L _-1) _w _- _r + y _m _-1, _m _-1 _{+ (} _L _-1) _w + y _m _-1, _m _-1 _{+ (} _L _-1) _w ₊₁ +… + y _m _-1, _m _-2

=y _m _-One _{, m} _-One +y _m _-One _, ₀ +… +y _m _-One, _m _-2
This is calculated at the same time. In other words, fixedsFor the last outputD _s Is calculated continuously in the following way.

In this case, Equation 20 corresponds to the logical operation unit 120, the logical product operation unit 130, the first exclusive logical operation unit 140, the delay unit 150, and the second exclusive logical operation unit 160 illustrated in FIG. 3. .
In the previous sectionx _s _-One, _s' (=y _s-1, _s )x _s _, _s' (=y _s, _s ) 'S vectora _i ,b _i It was found that they were obtained by cyclically shifting them to the right once. thereforewTo compute two blocks at the same timey _s _, _s Block andy _s _, _s Vectora _i ,b _i To the right once, twice,… ,wThat we get by cyclic shifty _s _, _s ₊₁ block,y _s _, _s ₊₂ block, … ,y _s _, _s ₊ _w _-One The blocks must be calculated at the same time, and the structure of each block is the same. All 0≤s≤mFor -1,y _s _, _s In order to runmAND gates

It can be seen that XOR gates are required.
thereforeD _s ₊ _tw _- _r In order to calculatewBlocks must be computed simultaneouslywmAND gates and at most

Two XOR gates are required. Also,D _s ₊ _tw _- _r IsD _s _{+ (} _t _-One) _w _- _r ,y _s _, _s _{+ (} _t _-One) _w ,y _s _, _s _{+ (} _t _-One) _w ₊₁,… ,y _s _, _s _{+ (} _t _-One) _w _{+ (} _w _-One)To calculate the sum ofwmTwo XOR gates are required. Especially,t=L-1rBlocks are overlappingrAND operation is performed with the control signal '0' to avoid blocksrwAnd AND gates are required. Therefore, the complexity of the proposed word-level multiplier is shown in FIG.wm+rmAND gates and at most

It consists of two XOR gates. Processing delay timeT _AAnd at most

ego,rAfter the value from the two blocks and the control signal are ANDed,D _s _{+ (} _t _-One) _w _- _r +y _s _, _s _{+ (} _t _-One) _w +y _s _, _s _{+ (} _t _-One) _w ₊₁ +… +y _s _, _s _{+ (} _t _-One) _w _{+ (} _w _-One)So thatT _AWow

to be.
Therefore, processing latency should be at most 2T _A +

to be. EspeciallykIf = 2, the word-level multiplier's processing delay is 2T _A +

ego,wm+rmAND gates and at most

Two XOR gates are required.
As a result of the performance analysis of the word-level multiplier according to the present invention as described above, the hardware complexity and the processing delay time are much lower than those of the conventional multiplier.
Based on the content as described above,w= 2 peopleGF(2⁷-Level multiplication using GNB type 4 onC=AB=

Can be obtained as in Equation (21) below.wSince = 2, the two underlined elements are XORed and stored in the register. The first term of the underlined element isa _i To calculate the main diagonal elements with their common termsf ₀ Block, the second term is the vector of the main diagonal element.a _i ,b _i Cyclic shiftf _One Block.

Here, Equation 21 shows a process up to the first exclusive logical operation unit 140 through the logic operation unit 120 and corresponds to one clock. At this time, the front part of the underlined part corresponds to the F0 logical operation part 120a and the rear part of + corresponds to the F1 logical operation part 120b.
3 iswFor = 2GF(2⁷Using GNB type 4 onC=ABIs the corresponding shift register circuit of FIG.f _j Represents a block structure. In a word-level structure, every clock cyclewSince we compute = 2 elementsA,B,R Register isw Cyclic shift by sizeLAfter 4 = 4 cycles, all operations are complete and the result is output.mBecause this is an odd number, processing two elements by one results in the operation of overlapping elements in the last clock cycle, so this overlapping operation should be avoided.
The size of the wordwThe number of overlapping elementsw-rSince it appears differently, the number of overlapping elements should be removed as a control signal in the last clock cycle. As shown in FIG. 3, the proposed multiplier uses a control signal of (1, 1, 1, 0) to avoid overlapping values in the last control signal. Also,mOfwIs a non-integer multiplication result registerR _i Is correctc _i Does not have So for accurate outputR _i You have to change the position ofrThis can be solved by circular shifting. like thiswFor = 2GF(2⁷Looking at the word-level multiplier for GNB type 4 on c), the spatial complexity is increased and the processing latency is also 2T _A+4T _XIs a bit serial multipliermWhile output after a clock cycle,LThe advantage is that it can be output after a clock cycle. Existing multipliers have different hardware structures according to the selected primitive polynomial, but the multipliers proposed in the present invention can obtain the same hardware structure only if they have the same GNB type. Thus the finite body according to the inventionGF(2^m) Is a very suitable multiplier for ECC.
The finite field multiplier according to the present invention as described aboveGF (2 ^m )The circuit was described in VHDL for FPGA implementation and functional verification of the multiplier in phase. After synthesizing the circuit using Xilinx's ISE 6.3i, the function was verified using Mento Graphics's Model Sim.
Those skilled in the art will appreciate that various changes and modifications can be made without departing from the spirit of the present invention.
Therefore, the technical scope of the present invention should not be limited to the contents described in the embodiments, but should be defined by the claims and their equivalents.

전술한 바와 같은 본 발명은 기존의 곱셈기 보다 낮은 하드웨어 복잡도를 가지지만 훨씬 낮은 Critical Path Delay를 가지게 되는 유한체상의 곱셈기를 구현할 수 있게 된다.
또한, 기존의 곱셈기가 원시 기약 다항식에 따라 서로 다른 하드웨어 구조를 가지는 반면 본 발명에 따른 곱셈기는 동일한 GNB 타입만 가지면 동일한 형태의 하드웨어 구조를 얻을 수 있게 되고, 나아가 본 발명에 따른 곱셈기는 타원곡선 암호시스템을 위한 최적의 곱셈기를 구현할 수 있도록 하는 효과가 있다.
또한, 본 발명은 그 구조가 매우 규칙적이기 때문에 VLSI 구현에 매우 적합한 매우 유용한 발명이다.As described above, the present invention can implement a finite field multiplier having a lower hardware complexity than the existing multiplier but having a much lower critical path delay.
In addition, while a conventional multiplier has a different hardware structure according to the primitive weak polynomial, the multiplier according to the present invention can obtain a hardware structure of the same type only by having the same GNB type. This has the effect of enabling an optimal multiplier for the system.
In addition, the present invention is a very useful invention, which is very suitable for VLSI implementation because its structure is very regular.

Claims

In the multiplier on the finite field GF ( 2 ^m ),

A first register unit 10 which receives and stores the vector A and shifts one bit to an upper bit side by a reference clock;

An exclusive OR operation unit 40 for performing an exclusive OR operation on a value stored in the first register unit by a preset algorithm;

A second register unit 20 which receives and stores the vector B and shifts one bit to an upper bit side by a reference clock;

A multiplication value obtained by multiplying the value stored in the second register unit 20 by the reference clock unit and the logical operation value through the exclusive OR operation unit 40, and stored in each of the plurality of detailed registers R0 to R6. each of the values and to XOR computation and stores the third delay register section 30 for that; multipliers on the finite chesang GF (2 ^m) for being configured by comprising a.

The method of claim 1,

The plurality of detailed registers R0 to R6 of the third register unit 30 are respectively

An AND gate (30a) for performing an AND operation on the value stored in the second register unit (20) and the value logically calculated by the exclusive OR operation unit (40);

An XOR gate (30b) for performing an exclusive OR operation on the AND multiplied by the AND gate (30a) and the value shifted from the lower detail register;

A multiplier on the finite field GF (2 ^m) being configured to having a; the delay unit (30c) to delay the logic values calculated from the XOR gate.

In the multiplier on the finite field GF (2 ^m ) ,

An A register unit 100 which receives and stores the vector A and shifts the word size w by the reference clock bit by a word size w ;

A B register 110 which receives and stores the vector B and shifts the word size w by the reference clock bit by a word size w ;

Logic operation unit 120 including F0 logic operation unit 120a and F1 logic operation unit 120b for logically calculating the value stored in the A register unit and the value stored in the B register unit according to a preset algorithm in units of a reference clock. );

A logical product operator 130 for logically multiplying the value output from the F1 logical operator 120b with a control value provided from the outside;

A first exclusive OR operation unit 140 for performing an exclusive OR operation on the values output from the F0 logic operation unit 120a and the AND product operation unit 130;

A second exclusive logical operation unit 160 for performing an exclusive OR operation on the value processed by the first exclusive OR operation unit 140 and the lower register values shifted by the word size w in the lower registers R0 to R6;

And a delay unit 150 for delaying a value output from the second exclusive logic operation unit.

The logical operation unit 120 is

A GNB exclusive OR operation 121 for performing an exclusive OR operation on a value stored in the A register unit 100 by a preset algorithm;

A multiplier on a finite field GF (2 ^m ) comprising end gates 123 for multiplying a value computed through the GNB exclusive OR operation 121 and a value stored in the B register 110. .

The method of claim 3, wherein

The control value is

In order to transfer the output value of the F1 logic operation unit 120b to the first exclusive OR operation unit 140 from the first clock to the last previous clock, the OR operation is performed by the first exclusive OR operation unit 140 by the control value 1. and,

In order to prevent the output value of the F1 logic operation unit 120b from being transmitted to the first exclusive OR operation unit 140 at the last clock, a logical product operation is performed by the control signal 0 to the first exclusive OR operation unit 140. Multiplier on finite field GF (2 ^m ).