
1 Introduction

Two-factor authentication, which corroborates the identities of legitimate users based on two sources of secret information (called factors), generally provides a better security guarantee than its one-factor counterpart and is widely used in Internet services [1, 6, 7, 9, 12, 15–17, 23, 24, 26, 32–34], such as Google's 2-Step Verification [14]. Although entity authentication in the Internet of Things (IoT) is still largely achieved by one-factor authentication with a pre-shared secret key or password [30], as limited by resource and operational constraints, two-factor authentication is increasingly demanded in the IoT, especially in deployments within critical infrastructure. This paper proposes a novel authentication factor based on historical data and explores its use in the IoT. The new factor not only provides a more leakage-resilient alternative to passwords/secret keys, but also extends two-factor authentication to the challenging setting of resource-constrained IoT devices. While it is feasible to use historical data as the primary factor, this paper focuses on its use as the secondary factor.

There has been an ever-increasing need to improve the security of the IoT with two-factor authentication. On one hand, many of these systems are deployed for safety-critical applications and thus demand a higher level of security to match the high cost of failures. For instance, modern mass transportation systems usually consist of resource-constrained devices, whether in legacy sub-systems or in their contemporary counterparts, yet are relatively intolerant of breaches. Deploying two-factor authentication is often unavoidable to achieve the needed security assurance. On the other hand, a single level of security is usually insufficient to capture the requirements of the variety of operations and functions in a typical IoT system. To implement multi-level security, it is necessary to deploy two-factor authentication for more sensitive operations, such as firmware updates and configuration changes, for which one-factor authentication is deemed insufficient.

As an example, EPRI (the Electric Power Research Institute) has identified about 100 representative failure scenarios in the smart grid [19, 21, 22], of which about 30 % are attributed to the use of weak or one-factor authentication in security-sensitive operations. These EPRI reports also recommend multi-factor authentication as a mitigation for these scenarios. Most interesting is the case of meter disconnection or firmware update for smart meters [19], wherein entity authentication based on an established secret key between a meter and the utility company is deemed insufficient, because the secret key can be leaked in one shot during routine operations through a contractor or a disgruntled former employee. In fact, since the same secret key is typically used to establish a secure session for normal message exchange between the meter and the utility (to key the encryption and message authentication algorithms), there are various vectors through which this secret key can be leaked [11, 13]; assuming perfect security of a secret key is unrealistic. In particular, there is plenty of economic incentive for a firmware modification attack, say, to report a lower consumption. There is consensus that a second, more leakage-resilient authentication factor should be sought.

The key impediment to the adoption of two-factor authentication in the IoT is the resource constraints of typical devices (driven by the low-cost requirement). Direct application of existing mechanisms originally designed for online services [1, 6, 7, 9, 12, 15–17, 23, 24, 26, 32–34] to the IoT is practically infeasible. Adding to the complexity is that a resource-constrained IoT device has to act as the verifier of authentication, whereas existing schemes mostly assume a powerful machine as the verifier, for instance, a bank's backend servers. Despite a handful of proposals using a physical token [10, 31] for sensor network authentication, it is fair to say that two-factor authentication for resource-constrained IoT devices remains largely an open problem. To fill this gap, we propose to leverage the "big data" notion [35] and technologies to implement a novel, scalable authentication factor, namely, historical data, for IoT devices. A big data application is characterized by a large volume of data of a great variety generated at a great velocity [35].

1.1 Overview of Our Technique

The basic idea of the new authentication protocol is to use the whole history of the data exchanged between an IoT device V (the verifier) and a server P (the prover) as the second authentication factor, in addition to the conventional first factor of a shared secret key. To pass verification by the IoT device, the server must have full access to all the historical data sent from the IoT device. The secret key and the historical data are both needed to generate a correct response to a given challenge issued by the IoT device in a typical challenge-response protocol [4].

Roughly speaking (in the big data context), we can view the second factor as a very big secret key which an attacker needs significantly greater effort to fully compromise. The adversary has to eavesdrop on the communication between the IoT device and the server over an extended period of time and break the encryption (if one is used), or steal a considerable amount of data from the server, in order to have a good chance of passing verification. Besides, such actions could make the attacker more susceptible to detection. This model resembles the underlying assumption of the bounded storage model [2, 13] and the bounded retrieval model [11]: the secret information has sufficiently high entropy that leakage is only gradual (i.e. bounded) in practice. In addition, as data are constantly generated by the IoT device and sent to the server, it becomes increasingly difficult for an adversary to gain full knowledge of the growing dataset. This also improves the resilience of the second authentication factor: an adversary having captured a fixed amount of historical data, say, through a lunchtime attack [3], will find himself holding a diminishing portion of the second authentication factor as time goes by. As a result, the security requirements on the storage for this authentication factor are significantly relaxed compared to those for a conventional secret key. In fact, we assume no extra protection is needed for storing the historical data.

We consider the second authentication factor to be a big dataset. In general, retrieving selected pieces of data from such a large dataset (for generating the proof) incurs unreasonably long latency. Nevertheless, with big data platforms such as Hadoop running over parallel clusters, searching and retrieving the data can be achieved in close-to-constant time [5], which makes the proposed authentication protocol practical. Besides, despite its huge volume, corruption of historical data is generally not an issue for conventional big data platforms like Hadoop; HDFS (the Hadoop Distributed File System) is considered highly fault-tolerant [5, 25]. It should be noted that the proposed authentication based on historical data is scalable in the sense that a huge dataset is not a prerequisite for the protocol to operate correctly. While a larger historical dataset implies better security, a small number of data pieces (say, 5) suffices to bootstrap the protocol, and these data can be readily provided by the pre-set configuration data of the IoT device.

The idea of using historical data as an authentication factor is not entirely new. For instance, the "life questions" used in online banking to reset a password or when a token is lost are one embodiment. However, all such schemes require the verifier to keep a synchronized record of all historical data, the volume of which would overwhelm a typical IoT device. Relying on a small, selected set of historical data (as in the online banking scenario), though practically achievable, may not provide sufficient security. Intuitively, storing the whole dataset at the IoT device appears necessary for verifying the validity of a response generated from a subset of it. However, by using a proof-of-retrievability protocol [27] in a different manner, we achieve a security protocol which can utilize all the historical data for authentication while maintaining constant storage at the IoT device (verifier), more precisely, 512 bits of secret keys for 256-bit security. A proof-of-retrievability protocol is conventionally used by a user to verify whether a cloud storage provider keeps all his data. This paper presents a novel and non-trivial application of it in a very different context to achieve a very different, but commonly sought, security objective, namely, entity authentication.

1.2 Comparison with the Trivial Scheme Using Two Keys

The proposed scheme differs in at least two ways from the trivial extension that runs two instances of the one-factor authentication protocol with two distinct secret keys.

First, although the historical dataset as an authentication factor can be seen as a large secret key, the way to compromise it is rendered considerably different by its much larger size than a typical secret key (256 bits). While an adversary, after gaining access to a system, can easily steal or copy a 256-bit key unnoticed, it takes significantly more time (and operations by the system) to copy even a small portion of a dataset of a few hundred GB. To pass the authentication with reasonably high probability, an attacker has to capture most of the historical dataset. It is thus reasonable to assume that the amount of information about the second authentication factor leaked in any single (undetected) incident is too insignificant to help the adversary pass subsequent authentication. In other words, historical data as an authentication factor exhibits a form of leakage resilience which cannot be found in a typical secret key. For the same reason, the confidentiality requirements for storing the historical data are considerably relaxed compared to those for a secret key (which usually requires storage in a tamper-resistant token). In fact, we assume the data (but not the tags) can be shared in the proposed scheme.

Second, to achieve the same level of leakage resilience for the authentication factor, a trivial composition requires linear storage complexity at the IoT device (verifier), whereas only constant storage is needed in the proposed scheme.

1.3 Our Contributions

The key contribution of this paper is a novel authentication factor based on historical data, which is leakage-resilient, thereby lowering the confidentiality requirements on its storage and management. It is fair to say this is the first scheme in the literature to demonstrate the potential of big data for entity authentication. A lightweight two-factor authentication scheme suitable for the IoT is achieved as a result, requiring only constant storage at the IoT device acting as the verifier. The proposed construction is also highly scalable in the sense that a tradeoff between computation overhead and attainable security level is inherent in the protocol design, a much-needed property for the IoT with its diverse devices. By increasing the number of elements in the authentication challenge, the difficulty for an adversary increases exponentially while the computation overhead increases only linearly.

The organization of the paper is as follows. The next section presents related work, and Sect. 3 discusses the problem setting including the security model. Section 4 gives the protocol design details. The security and performance of the protocol are discussed in Sects. 5 and 6 respectively. Section 7 illustrates a mechanism to strengthen the first authentication factor's security. Finally, the conclusion is given in Sect. 8.

2 Related Work

Multi-factor authentication has been widely studied in the literature [1, 6, 7, 9, 12, 15–17, 23, 24, 26, 32–34]. In general, there are three common types of authentication factor, namely, "what you know," "what you have," and "what you are" [6]. Further types of authentication factor, such as "whom you know" [6] and "where you are" [9, 20], have also been proposed. This paper proposes another novel form of authentication factor, "what has been discussed," which can be seen as a parallel effort.

A two-factor authentication scheme using a physical token for sensor networks is proposed in [10] and subsequently corrected by [31]. Since the same token is used across all the sensor nodes [10, 31], these schemes are not resilient to node compromise. In contrast, the historical datasets for different devices are inherently different in our scheme. Biometrics [1, 16, 17] is a common form of the "what you are" factor, but is ruled out for IoT devices, given their resource constraints. The "what you have" factor is the most popular and takes a variety of forms, from a physical token [7, 10, 23, 24] and a dedicated physical channel [20, 29, 33] to a separate logical channel [12]. The need for multiple channels could be over-demanding for IoT devices; indeed, the desirable property of using a single channel for authentication has been sought [17]. Historical data as an authentication factor can also be considered a kind of "what you have" factor. While a physical token stores a secret key for authentication, in our scheme the secret information needed for authentication is embedded in a large body of historical data, which is too big for an adversary to collect beforehand.

This paper proposes using the past history of payload data between the verifier and prover as a new form of authentication factor. In [32], email messages are used as the second authentication factor. While email messages can be seen as a form of historical data, the email message transmission is initiated only after verification of the first factor succeeds, as an extra confirmation. More precisely, [32] should be seen as using a dedicated channel instead. Besides, only one email message is used in [32], in contrast to the random samples drawn from the whole historical dataset proposed in this paper. The closest existing proposal to this paper is [28], which uses continuous dynamic data generated by a smartphone owner as an authentication factor. The phone has to store the whole dataset for verification, whereas this paper only requires the tag generation key to be stored at the IoT device for verification. Some schemes [26, 34] are dedicated to combining two authentication factors, addressing a different, orthogonal perspective of the multi-factor authentication problem.

The security mechanism to use historical data as an authentication factor in this paper resembles those in the proof of receipt [8] and proof of retrievability protocols [27]. In fact, regardless of efficiency, it is possible to convert any proof of retrievability protocol into an authentication protocol using historical data. This paper extends the underlying mechanism for use in a different application to achieve a different but more popular security objective, namely, entity authentication. Unlike [27], which has to handle the data corruption issue through erasure code, this paper assumes that the issue has been adequately addressed by most big data platforms which are highly fault-tolerant by design. This paper achieves the leakage resilience property of the second authentication factor through the realization of the bounded-storage model [2, 13] or bounded-retrieval model [11] using historical data in the big data context.

3 Background and Model

3.1 Protocol Syntax

A two-factor authentication scheme involves two parties, a prover P and a verifier V. P is a server and V is an IoT device corroborating P’s identity. Corresponding to the two authentication factors are two secret keys, \(sk_1, sk_2\), but \(sk_2\) is not directly used. Instead, V generates the historical dataset \(\mathcal{D}\) and processes it with \(sk_2\) (by tagging each \(d_i \in \mathcal{D}\)) to form \(\mathcal{D}^*\) that P uses as the second authentication factor. The scheme defines four algorithms \(\textsf {Init}\), \(\textsf {Tag}\), \(\textsf {Auth}_P\), \(\textsf {Auth}_V\), and a two-party interactive protocol:

 

\({\mathbf {\mathsf{{Init}}}}(1^{\lambda }).\) :

This randomized algorithm generates a pair \((sk_1, sk_2)\) where \(sk_1\) is the secret key used as the first authentication factor shared between P and V, and \(sk_2\) is the secret key used by V to tag the historical data \(\mathcal{D}\) to generate the second authentication factor \(\mathcal{D}^*\) for P. While both P and V keep a copy of \(sk_1\), the key \(sk_2\) is known to nobody except V. Concretely, \(sk_1 = mk\) and \(sk_2 = (K, K')\) in this paper.

\({\mathbf {\mathsf{{Tag}}}}(sk_2, d_i).\) :

This data tagging algorithm is used by V with input \(sk_2\) to mark each piece of data \(d_i \in \mathcal{D}\) to generate a tag \(t_i\). \((d_i, t_i)\) is sent to P and forms part of \(\mathcal{D}^*\).

\(({\mathbf {\mathsf{{Auth}}}}_P (sk_1, \mathcal{D}^*)\), \({\mathbf {\mathsf{{Auth}}}}_V (sk_1, sk_2)).\) :

The randomized proving and verifying algorithms define a protocol for proving P’s identity to V. During protocol execution, P takes as input the two authentication factors — \(sk_1\) and \(\mathcal{D}^*\) — and V uses \((sk_1, sk_2)\) to verify. More specifically, V selects some random challenges and a set I of indices corresponding to t pieces of historical data in \(\mathcal{D}\) to request P to generate a response to prove his identity. P uses \(sk_1\) and the specified data tuples \(\{(d_i, t_i) \in \mathcal{D}^*: i \in I \}\) to generate the response. Finally, \(\textsf {Auth}_V\) outputs a bit b where \(b=1\) corresponds to a success of the authentication instance and \(b=0\) corresponds to a failure. We can denote a run of two machines executing the algorithms as: \(\{0, 1\} \leftarrow (\textsf {Auth}_V (sk_1, sk_2) \rightleftharpoons \textsf {Auth}_P (sk_1, \mathcal{D}^*))\).

 

The correctness requirement is that for all \(\lambda \) and any dataset \(\mathcal{D}\), if \((sk_1, sk_2) \leftarrow \textsf {Init}(1^{\lambda })\), and \(\mathcal{D}^* = \{(d_i, t_i): d_i \in \mathcal{D}\}\) where \(t_i \leftarrow \textsf {Tag}(sk_2, d_i), \forall d_i \in \mathcal{D}\), then the output b of the protocol \((\textsf {Auth}_V (sk_1, sk_2) \rightleftharpoons \textsf {Auth}_P (sk_1, \mathcal{D}^*))\), with all messages delivered correctly between P and V, will be 1 except with negligible probability.
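For readers who prefer code to notation, the syntax above can be rendered as a minimal Java interface (Java being the implementation language of Sect. 6). All type and method names below are illustrative, not from our implementation, and the interactive pair \((\textsf {Auth}_P, \textsf {Auth}_V)\) is flattened into a single challenge-response round trip:

```java
import java.util.Map;

// Minimal sketch of the scheme's syntax; names are illustrative only.
// sk1 is the shared key (first factor); sk2 = (K, K') stays with V;
// dStar maps an index i to the tuple (d_i, t_i) held by P.
public interface TwoFactorHistoricalAuth {
    byte[][] init(int lambda);                          // (sk1, sk2) <- Init(1^lambda)
    byte[] tag(byte[] sk2, byte[] dataPiece, long i);   // t_i <- Tag(sk2, d_i)
    byte[][] prove(byte[] sk1, Map<Long, byte[][]> dStar,
                   byte[] r, long[] challengeSet);      // Auth_P: returns (X, Y)
    boolean verify(byte[] sk1, byte[] sk2,
                   byte[] r, long[] challengeSet,
                   byte[][] response);                  // Auth_V: outputs b
}
```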

3.2 Adversarial Model

The goal of the adversary Adv is to carry out an authentication attack to impersonate P to V. A two-factor authentication scheme based on historical data is sound if any cheating prover \(P'\) which convinces the verifier V (running \(\textsf {Auth}_V\)) that he is the prover P actually possesses \(sk_1\) and \(\mathcal{D}^*\). That is, anyone passing the authentication must possess \(sk_1\) and \(\mathcal{D}^*\) (or \(sk_2\) which only V knows and P does not). We assume that Adv does not corrupt V to obtain \((sk_1, sk_2)\); if V is corrupted, there is no need for Adv to carry out an authentication attack to prove his identity to V. Similarly, if P is fully compromised, there is no need for Adv to impersonate P.

We assume that Adv, without all the authentication factors, has full control of the communication channel: it can eavesdrop on, inject, modify and delete messages exchanged between P and V, including initiating authentication sessions. Adv is also able to obtain some tuples of historical data \((d_i, t_i) \in \mathcal{D}^*\) through eavesdropping, temporarily breaking into P (akin to a lunchtime attack [3]), or other insider threats. However, we assume, as in the bounded-storage model [2, 13], that \(\mathcal{D}^*\) is sufficiently large that Adv is not able to compromise all data in \(\mathcal{D}^*\). This assumption is reasonable because, taking the smart meter scenario as an example, a former employee or contractor of the utility company could capture, over a finite period of time, the smart meter data during normal operations, which allows him to obtain a portion but not all of the historical dataset, whereas extensive data capture would be susceptible to detection. In addition, the data are usually sent over a secure channel in encrypted form, meaning eavesdropping alone might not capture a significant fraction of the historical dataset. The proposed authentication factor ensures that, unless a significant percentage of the historical data has been compromised, the captured data do not give the adversary any significant advantage in passing subsequent authentication sessions.

We consider the following authentication game denoted by \(\textsf {Auth}_{2FA}^{Adv}\), which takes a tuple \((\lambda , t, q_T, q_E, q_V, q_P)\) as input and executes between Adv and an environment:

1. The environment runs \(\textsf {Init}(1^{\lambda })\) to generate the key pair \((sk_1, sk_2)\) for the two authentication factors. It randomly picks L pieces of data to form the historical dataset \(\mathcal{D} = \{d_1, d_2, \ldots , d_L\}\), and runs \(\textsf {Tag}\) with \(sk_2\) to tag the \(d_i\)'s to form \(\mathcal{D}^* = \{ (d_1, t_1), (d_2, t_2), \ldots , (d_L, t_L)\}\). The two authentication factors are \(sk_1\) and \(\mathcal{D}^*\).

2. Adv can choose none or one of the two authentication factors, \(sk_1\) or \(\mathcal{D}^*\), to compromise. The environment gives the selected factor to Adv. If the second authentication factor is chosen, the environment also gives \(sk_2\) to Adv.

3. If Adv has not chosen the second authentication factor \(\mathcal{D}^*\) to compromise, it can make at most \(q_T < L\) queries to obtain \(q_T\) pairs \((d_i, t_i) \in \mathcal{D}^*\) of its choice.

4. Adv can observe the execution of \(q_E\) instances of the authentication protocol, that is, \(\textsf {Auth}_V(sk_1, sk_2) \rightleftharpoons \textsf {Auth}_P(sk_1, \mathcal{D}^*)\), and obtain their transcripts.

5. Adv can execute the authentication protocol as the verifier by interacting with \(q_P\) instances of \(\textsf {Auth}_P\), that is, \(Adv \rightleftharpoons \textsf {Auth}_P(sk_1, \mathcal{D}^*)\).

6. Adv can execute the authentication protocol as the prover by interacting with \(q_V\) instances of \(\textsf {Auth}_V\), that is, \(\textsf {Auth}_V(sk_1, sk_2) \rightleftharpoons Adv\).

7. Finally, Adv is challenged to authenticate himself to a new instance of \(\textsf {Auth}_V\), and no more queries can be made. Adv succeeds if this instance \(\textsf {Auth}_V(sk_1, sk_2) \rightleftharpoons Adv\) outputs \(b=1\). This event is denoted by \(Succ_{Adv}\).

Definition 1

(Authentication Attack Resistance). A two-factor authentication scheme based on historical data is \((\delta _{12}, \delta _1, \delta _2)\)-sound for parameters \((\lambda , t, q_T, q_E, q_V, q_P)\), where \(q_T, q_E, q_V, q_P\) are polynomial in \(\lambda \), if for any algorithm Adv whose running time is polynomially bounded, the following holds for a random execution of the authentication game \({{\textsf {\textit{Auth}}}}_{2FA}^{Adv}(\lambda , t, q_T, q_E, q_V, q_P)\):

1. \(Pr[Succ_{Adv}] \le \delta _{12}\) if Adv does not compromise any authentication factor.

2. \(Pr[Succ_{Adv}] \le \delta _{1}\) if Adv only compromises the second authentication factor \(\mathcal{D}^*\) but not the first authentication factor.

3. \(Pr[Succ_{Adv}] \le \delta _{2}\) if Adv only compromises the first authentication factor \(sk_1\) but not the second authentication factor.

We assume that \((d_i, t_i)\)-tuples are leaked out in all queries, not just in \(\textsf {Tag}\) queries, especially when \(sk_1\) has been compromised. Note that the need to authenticate to a new \(\textsf {Auth}_V\) instance in the challenge step prevents Adv from using concurrently running query instances of \(\textsf {Auth}_P(sk_1, \mathcal{D}^*)\) to win the challenge.

4 Two-Factor Authentication Using Historical Data

For the sake of generality, we present the scheme assuming all arithmetic is done in a certain finite field \(\mathbb {F}\), without discussing its parameters or using special notation. To help understand the design, one can simply take all the arithmetic to be in \(\mathbb {Z}_p\) (integers \(\text {mod } p\)) for a large prime p of length \(\lambda \) bits. There is flexibility in choosing the finite field arithmetic used for the protocol: \(\mathbb {Z}_p\) or \(GF(2^{\lambda })\) (the binary extension field of length \(\lambda \) bits). In the implementation in Sect. 6, we use \(GF(2^{256})\). It should be noted that addition in any finite field achieves perfect secrecy, as a one-time pad does, if the key is truly random, even though the operation is not necessarily exclusive-OR. In fact, for a binary extension field (which is used in this paper), addition is simply an exclusive-OR.

The design is a typical 2-step challenge-response protocol to prove possession of \(sk_1\) and \(\mathcal{D}^*\). The key building blocks are two pseudorandom functions (PRFs), f and E (instantiated by SHA256-HMAC in Sect. 6); more precisely, E is a pseudorandom permutation. A cryptographic hash function h is used simply for mapping an arbitrary data piece \(d_i \in \{0,1\}^*\) to an element in the finite field (a \(\lambda \)-bit string for \(GF(2^{\lambda })\)).

For the sake of clarity, we assume all data are tagged. Then L is the cardinality of the historical dataset \(\mathcal{D}\). The design uses an adjustable security parameter t, which is the size of the set I of indices of the samples selected from \(\mathcal{D}^*\) to generate the response.

The two-factor authentication protocol using big data runs as follows (the original protocol figure is omitted; the message flow it depicts is summarized below):

1. V picks a random challenge r and a random set I of t indices, and sends (r, I) to P.
2. Both parties derive \(r' = E_{mk}(r)\) from the shared first-factor key \(sk_1 = mk\).
3. P computes \(X = \sum _{i \in I} f_{r'}(i) \cdot h(d_i)\) and \(Y = \sum _{i \in I} f_{r'}(i) \cdot t_i\) from the requested tuples \(\{(d_i, t_i) \in \mathcal{D}^*: i \in I\}\), and returns (X, Y) to V.
4. V computes \(K_I = \sum _{i \in I} f_{K'}(i) \cdot f_{r'}(i)\) and accepts if and only if \(Y = K \cdot X + K_I\).
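To make the message flow concrete, the following self-contained Java sketch walks through one authentication instance over \(GF(2^{256})\) with the irreducible polynomial of Sect. 6, instantiating E and f with SHA256-HMAC and h with SHA-256 as in our implementation. It is an illustrative toy (three data pieces, \(t = 2\)), not our measured implementation, and all names are illustrative:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class HistoricalDataAuthDemo {
    // GF(2^256) reduction polynomial: x^256 + x^10 + x^5 + x^2 + 1
    static final BigInteger POLY = BigInteger.ONE.shiftLeft(256)
            .or(BigInteger.ONE.shiftLeft(10)).or(BigInteger.ONE.shiftLeft(5))
            .or(BigInteger.ONE.shiftLeft(2)).or(BigInteger.ONE);

    // Field addition in GF(2^256) is exclusive-OR.
    static BigInteger add(BigInteger a, BigInteger b) { return a.xor(b); }

    // Carry-less shift-and-xor multiplication, reduced modulo POLY.
    static BigInteger mul(BigInteger a, BigInteger b) {
        BigInteger r = BigInteger.ZERO;
        for (int i = b.bitLength() - 1; i >= 0; i--) {
            r = r.shiftLeft(1);
            if (r.testBit(256)) r = r.xor(POLY);
            if (b.testBit(i)) r = r.xor(a);
        }
        return r;
    }

    // E and f: SHA256-HMAC, interpreting the 256-bit output as a field element.
    static BigInteger prf(byte[] key, byte[] msg) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return new BigInteger(1, mac.doFinal(msg));
    }

    // h: SHA-256 maps an arbitrary data piece into GF(2^256).
    static BigInteger h(byte[] data) throws Exception {
        return new BigInteger(1, MessageDigest.getInstance("SHA-256").digest(data));
    }

    static byte[] idx(long i) { return Long.toString(i).getBytes(StandardCharsets.UTF_8); }

    static byte[] to32(BigInteger v) { // fixed-width 32-byte encoding for keying f
        byte[] raw = v.toByteArray(), out = new byte[32];
        int n = Math.min(raw.length, 32);
        System.arraycopy(raw, raw.length - n, out, 32 - n, n);
        return out;
    }

    public static void main(String[] args) throws Exception {
        SecureRandom rnd = new SecureRandom();
        byte[] mk = new byte[32], kBytes = new byte[32], kPrime = new byte[32];
        rnd.nextBytes(mk); rnd.nextBytes(kBytes); rnd.nextBytes(kPrime);
        BigInteger K = new BigInteger(1, kBytes); // sk2 = (K, K')

        // V tags each data piece: t_i = K * h(d_i) + f_{K'}(i); (d_i, t_i) goes to P.
        String[] d = { "cfg:volt=230", "cfg:freq=50", "meter:2016-01-01:42kWh" };
        BigInteger[] t = new BigInteger[d.length];
        for (int i = 0; i < d.length; i++)
            t[i] = add(mul(K, h(d[i].getBytes(StandardCharsets.UTF_8))), prf(kPrime, idx(i)));

        // V issues the challenge (r, I); both sides derive r' = E_mk(r).
        byte[] r = new byte[32]; rnd.nextBytes(r);
        long[] I = { 0, 2 };
        byte[] rPrime = to32(prf(mk, r));

        // P's response: X = sum f_{r'}(i) * h(d_i), Y = sum f_{r'}(i) * t_i.
        BigInteger X = BigInteger.ZERO, Y = BigInteger.ZERO;
        for (long i : I) {
            BigInteger c = prf(rPrime, idx(i));
            X = add(X, mul(c, h(d[(int) i].getBytes(StandardCharsets.UTF_8))));
            Y = add(Y, mul(c, t[(int) i]));
        }

        // V's check: K_I = sum f_{K'}(i) * f_{r'}(i); accept iff Y = K*X + K_I.
        BigInteger KI = BigInteger.ZERO;
        for (long i : I) KI = add(KI, mul(prf(kPrime, idx(i)), prf(rPrime, idx(i))));
        System.out.println("accept = " + Y.equals(add(mul(K, X), KI)));
    }
}
```

Running the sketch prints accept = true, which is the correctness argument of Sect. 4.1 in executable form.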

4.1 Correctness

The correctness of the proposed protocol can be shown as follows. For the first authentication factor, both P and V carry out the same computation: \(r' = E_{mk}(r)\). By the determinism of the pseudorandom permutation E, if P and V use the same mk and r, the values of \(r'\) computed at P and V are identical. Given the same value of \(r'\), the correctness of the second authentication factor follows by substituting \(t_i = K \cdot h(d_i) + f_{K'}(i)\) into Y:

$$ Y = \sum _{i\in I} f_{r'}(i) \cdot t_i = \sum _{i \in I} f_{r'}(i) \cdot (K \cdot h(d_i) + f_{K'}(i)) = K \cdot X + K_I. $$

Note that the last step results from substituting \(X = \sum _{i \in I} f_{r'}(i) \cdot h(d_i)\) and \(K_I=\sum _{i \in I} f_{K'}(i) \cdot f_{r'}(i)\) into the equation.

5 Security Analysis

Intuitively, for the first authentication factor \(sk_1\), without knowledge of \(sk_1\) it would be difficult to obtain the same value of \(r' = E_{sk_1}(r)\) as computed by V, since the security property of a PRF guarantees that, to anyone not knowing the key, the output of E is computationally indistinguishable from a truly random \(\lambda \)-bit string. For the second authentication factor \(\mathcal{D}^*\), it can be shown, as in [27], that either knowledge of \(sk_2 = (K, K')\) or knowledge of all tuples \((d_i, t_i) \in \mathcal{D}^*\) with \(i \in I\) is necessary to pass the verification by V with non-negligible probability of success.

Although there are many pairs (X, Y) which allow Adv to pass the authentication, the probability of finding such a pair is negligibly small in the security parameter \(\lambda \). Let the number of elements in the underlying finite field \(\mathbb {F}\) be q, and note that \(q \sim O(2^\lambda )\); for \(\mathbb {F} = \mathbb {Z}_p\), \(q = p\) where p is \(\lambda \) bits long. An adversary, without knowledge of \(\mathcal{D}^*\), has to find a pair (X, Y) such that \(Y = K_I + K \cdot X\) in order to pass the verification check by V. This pair is not unique for any given \(K_I\) and K: there are q possible (X, Y) pairs which satisfy the constraint. However, Adv has to find one of these q correct pairs out of all \(q^2\) possible combinations of X and Y. In other words, the probability of passing V's verification (say, by random guess) is about \(\frac{1}{q} \sim O(\frac{1}{2^\lambda })\), which is negligible in \(\lambda \). The security of the proposed authentication scheme is summarized as follows.

Theorem 1

(First Authentication Factor Security). If E is a secure PRF, then the proposed two-factor authentication protocol remains sound against an adversary Adv which compromises \(\mathcal{D}^*\) and \(sk_2\) only.

Proof

We prove by contradiction. Suppose E is a secure PRF with the advantage (to distinguish its output from a random number) bounded by \(\epsilon _E\). Assume there exists a PPT adversary \(Adv_{2FA}\) which can break the soundness of the two-factor authentication (2FA) after compromising the second factor. That is, \(Adv_{2FA}\) can win the \(\textsf {Auth}_{2FA}^{Adv}\) game with non-negligible probability of success \(p_{2FA}\). We can then construct another adversary \(Adv_{E}\) to break the indistinguishability property of the PRF E as follows:

Pick \(K, K'\) for the second authentication factor, and generate \(\mathcal{D}\) and \(\mathcal{D}^*\) accordingly. \(Adv_E\) gives \(\mathcal{D}^*\) and \((K, K')\) to \(Adv_{2FA}\), since the latter chooses to compromise the second authentication factor. Since \(Adv_{2FA}\) has compromised \(\mathcal{D}^*\), there is no need to answer any \(\textsf {Tag}\) queries. For the other three types of queries, \(Adv_E\) can make PRF queries and use the results to answer the execution, prover and verifier queries from \(Adv_{2FA}\). In the challenge phase, \(Adv_E\) picks a random \(r \in \{ 0,1 \}^*\) and sends it to the PRF challenger to receive a challenge \(r'\). \(Adv_E\) has to decide whether \(r' = E_k(r)\) for an unknown key k or \(r'\) is a random string. To respond, \(Adv_E\) picks a challenge set I for \(Adv_{2FA}\) and sends (r, I) as a challenge to \(Adv_{2FA}\), which outputs (X, Y). \(Adv_E\) uses \(K'\), \(r'\) (the challenge received from the PRF challenger) and \(i\in I\) to compute \(K_I\). \(Adv_E\) checks if \(Y = K_I + K \cdot X\). If yes, it returns 1 to its challenger to indicate that \(r' = E_k(r)\); otherwise, it returns 0.

The advantage of \(Adv_E\) is then \(p_{2FA}\), a contradiction. Hence, \(p_{2FA}\) must be negligible; in fact, \(p_{2FA} < \epsilon _E\).

Theorem 2

(Second Authentication Factor Security). If h is a collision resistant hash function and f is a secure PRF, then the proposed two-factor authentication protocol remains secure against an adversary Adv which compromises \(sk_1\) only.

Proof

We prove Theorem 2 through a sequence of games, starting with Game 0 being the adversary game defined for the soundness property of the two-factor authentication in Sect. 3.2. In Game 0, we assume the adversary Adv has compromised the first authentication factor \(sk_1\). In the following, we introduce one minor modification between successive games and interleave analyses between games.

Game 1. Game 1 is the same as Game 0 with only one difference: the tagging algorithm uses truly random numbers instead of PRF outputs to tag \(d_i\)’s. The challenger remembers all these random numbers in a table to compute \(K_I\) for verification later on.

Analysis. If there is a difference in the adversary's probability of success between Game 0 and Game 1, we can use the adversary to break the security of the PRF f. This reduction implies that, in subsequent analyses, we can treat \(t_i = K\cdot h(d_i)+f_{K'}(i)\) for any \(d_i \in \mathcal{D}\) as a truly random \(\lambda \)-bit string \(t_i'\), and \(K_I = \sum _{i \in I} f_{K'}(i) \cdot f_{r'}(i)\) also as a truly random \(\lambda \)-bit string. As a result, Y is also a random number.

Game 2. In Game 2, the challenger handles the tagging and authentication execution differently. It keeps a copy of all tuples \((d_i, t_i) \in \mathcal{D}^*\). To check the validity of a pair \((X', Y')\), it computes (X, Y) directly from its copy of the \((d_i, t_i)\)-tuples. The challenger accepts \((X', Y')\) only if \((X, Y) = (X', Y')\). That is, it may be possible that \(Y' = K_I + K \cdot X'\) for the given (r, I) but \((X', Y') \ne (X, Y)\); this case is rejected.

Analysis. To satisfy the second authentication factor, \(\textsf {Auth}_V\) checks whether \(Y' = K_I+ K \cdot X'\). Note that \(K_I\) cannot be distinguished from a truly random value and is not known to the adversary; the \(K_I\) values therefore appear independent across execution instances. Since we assume the \(d_i\)'s are publicly known, X is known to the adversary. Without knowing K, the adversary can guess Y and \(K_I\) correctly to satisfy the relation only if the challenge set I is covered by the union of the \((d_i, t_i)\)-tuples leaked through the execution, tagging and prover queries, or by a lucky random guess. The total number of distinct tuples leaked is \(< q_T + (q_P + q_E) \cdot t\), so the corresponding probability of success for the adversary is approximately \((\frac{q_T + (q_P + q_E) \cdot t}{L})^t\), while the probability of a correct random guess is \(\frac{1}{2^\lambda }\). The remaining possibility, that the challenge set coincides with a previous one, has negligibly small probability. The total probability of success for the adversary is therefore \(\frac{1}{2^\lambda } + (\frac{q_T + (q_P + q_E) \cdot t}{L})^t\).

Overall, the security of the two-factor authentication using historical data is summarized by the following theorem, whose proof follows directly from Theorems 1 and 2. The reason why \(\delta _{12}\) is much smaller than \(\delta _2\) is that when \(sk_1\) is in the adversary's hands, execution and prover queries leak the \((d_i, t_i)\)-tuples involved.

Theorem 3

If f and E are secure PRFs, the two-factor authentication using historical data is \((\delta _{12}, \delta _1, \delta _2)\)-sound, with \(\delta _{12} = (\frac{1}{2^{2 \lambda }} + \frac{1}{2^{\lambda }} \cdot (\frac{q_T}{L})^t)\), \(\delta _1 = 1/2^{\lambda }\) and \(\delta _2 = \frac{1}{2^\lambda } + (\frac{q_T + (q_P + q_E) \cdot t}{L})^t\), where \(\lambda \) is the security parameter of the PRFs, \(L = |\mathcal{D}^*|\), \(q_T\), \(q_E\) and \(q_P\) are respectively the number of tagging, execution and prover queries made by the adversary, and t is the number of indices in the challenge set I.

5.1 Resilience to Leakage of Historical Data

This section studies the resilience of the proposed two-factor authentication scheme to leakage of the historical dataset \(\mathcal{D}^*\). Suppose the first authentication factor \(sk_1\) has been compromised, all \(d_i\)'s are public, and the adversary has obtained a set \(A \subset \mathcal{D}^*\) containing S distinct tuples \((d_i, t_i) \in \mathcal{D}^*\), equivalent to a fraction p of the L tuples; that is, \(S/L = p\). Recall that t is the number of indices chosen in the challenge set I by design. The probability of success for the adversary is then given by:

$$ p_{success} = \frac{\left( {\begin{array}{c}S\\ t\end{array}}\right) }{\left( {\begin{array}{c}L\\ t\end{array}}\right) } = \frac{\left( {\begin{array}{c}\lceil pL\rceil \\ t\end{array}}\right) }{\left( {\begin{array}{c}L\\ t\end{array}}\right) } = \frac{\lceil pL \rceil \cdot (\lceil pL \rceil - 1) \cdot (\lceil pL \rceil - 2) \cdot \ldots \cdot (\lceil pL \rceil - t + 1)}{L \cdot (L - 1) \cdot (L - 2) \cdot \ldots \cdot (L - t + 1)}. $$

For \(pL \gg t\), \(p_{success} \approx p^{t}\). Since \(0< p <1\), \(p_{success}\) drops considerably as t increases. As an example, suppose \(p = 0.8\) (that is, the adversary has compromised 80 % of the tuples) and \(L = 10,000\); for \(t = 1\), \(p_{success} = 0.8\), while \(p_{success}\) drops to approximately 0.107 for \(t = 10\). As can be seen, the proposed protocol is scalable because a clear tradeoff between the security level and the computational overhead is obtained by adjusting the design parameter t.
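As a quick numeric check, the short Java snippet below (illustrative only) evaluates the exact \(p_{success}\) as the telescoping product above, avoiding huge binomial coefficients:

```java
public class LeakageResilienceCheck {
    // Exact p_success = C(S, t) / C(L, t), computed as a telescoping product
    // of (S - j) / (L - j) for j = 0 .. t-1 to avoid huge binomials.
    static double pSuccess(long L, long S, int t) {
        double p = 1.0;
        for (int j = 0; j < t; j++) p *= (double) (S - j) / (double) (L - j);
        return p;
    }

    public static void main(String[] args) {
        // p = 0.8, L = 10000: t = 1 gives 0.8; t = 10 gives ~0.107.
        System.out.println(pSuccess(10000, 8000, 1));
        System.out.println(pSuccess(10000, 8000, 10));
    }
}
```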

Figures 1 and 2 plot the actual values (not the approximation) of \(p_{success}\) for the adversary against different values of the design parameter \(t = |I|\), illustrating the resilience of the proposed authentication scheme with respect to the cardinality of the tagged historical dataset \(\mathcal{D}^*\) and the fraction of data compromised by the adversary. As shown in Figs. 1 and 2, the probability of success \(p_{success}\) for an adversary having compromised a fraction p of the L tuples of \(\mathcal{D}^*\) drops considerably as t increases. As we will see in the next section, the computational overhead for verifying a response is linearly proportional to the parameter t. As the IoT comprises a variety of devices with vastly different resources and capabilities, the proposed two-factor authentication scheme exhibits a practically desirable property, namely, a security-computation tradeoff tunable by adjusting the parameter t. Roughly, in practice, \(t \approx 5\) seems sufficient to cover most scenarios, as suggested by Fig. 1. In typical operating scenarios, it should be rare for an attacker to be able to compromise half of all the data tuples in \(\mathcal{D}^*\) without being detected and dealt with (note also that the \(t_i\)'s are not necessarily shared). For \(p < 0.5\), \(t = 5\) seems sufficient to achieve a reasonably small \(p_{success}\).

Besides, as can be seen from Fig. 2, \(p_{success}\) is largely (or solely) determined by the fraction p of compromised tuples, regardless of the size L of the historical dataset \(\mathcal{D}^*\); that is, \(p_{success}\) could be mono-variate in p. This interesting property means that, as data are continuously generated by the IoT device V and exchanged with P, an adversary has to compromise a continuously increasing number of historical data tuples in order to maintain the same level of advantage, in terms of \(p_{success}\), for breaking the two-factor authentication based on historical data. In other words, the overall security improves substantially as \(\mathcal{D}^*\) grows.

Fig. 1. Resilience of the proposed authentication with respect to different sizes L of the historical dataset: \(p_{success}\) for the attacker to break the scheme against different sizes of the challenge set I, for \(L = 1000\) and \(L=10000\).

Fig. 2. Resilience of the proposed authentication with respect to different fractions p of the data being compromised: \(p_{success}\) for the attacker to break the scheme against different sizes of the challenge set I, after the attacker has compromised \(p = 0.5\) and \(p = 0.7\) of the historical dataset.

This property arises mainly because, as the set of historical data grows, it becomes less likely that the challenge set I is a subset of the dataset A held by the adversary if |A| does not grow at the same time. That is, unless the adversary can keep up with the growth of the historical dataset \(\mathcal{D}^*\), it becomes increasingly difficult to break the protocol using the same set A. For the adversary to keep up with a growing \(\mathcal{D}^*\) while maintaining a constant p, he needs to capture proportionately more data tuples, which usually translates into a longer duration of eavesdropping or break-in at P, thus making his task more difficult. In fact, V could generate dummy data at random intervals to grow \(\mathcal{D}^*\) if necessary, to increase the difficulty of breaking the authentication.

5.2 Practical Considerations and Limitations

The security of the proposed authentication factor is wholly based on the bounded retrieval model [11], assuming that the adversary cannot obtain a significant fraction (say, \({>}50\) %) of the historical dataset, which is in the GB range and contains months or years of data. If this assumption is violated, the scheme becomes insecure. Nevertheless, the assumption can be reasonably fulfilled in practice. For instance, the \((d_i, t_i)\)-tuples are usually transmitted over a secured channel between V and P (which can be established through the master key of the first authentication factor or other means); hence, an adversary eavesdropping on and logging the communication between V and P might not be able to obtain these tuples. As discussed in Sect. 1, despite the established secure channel, a second authentication factor is still required for security-critical or sensitive usages. In general, the \((d_i, t_i)\)-tuples would be leaked in normal communication or authentication execution only when the adversary has compromised the secure channel. Even so, the secret of the second authentication factor would only be leaked gradually through the \((d_i, t_i)\)-tuples; it takes time for an eavesdropper to compromise the old tuples already stored at P. While an adversary could possibly obtain tuples through a lunchtime attack at P, compromising a significant fraction of the historical dataset without being detected is still challenging, as this paper considers tens of GB of data stored in protected servers in a big data cluster. Besides, it is not mandatory to store the \(d_i\)'s and \(t_i\)'s together. In particular, the \(t_i\)'s have no information value other than for authentication, and hence can be stored offline when not in use, to add extra protection. Similarly, it is not necessary to store the two authentication factors in the same server. The design goal of the two-factor authentication is to allow V to authenticate P; it is therefore outside the scope of this paper to consider attacks based on a compromised server P. After all, P normally has more resources than V to defend against the adversary.

Finally, to bootstrap the second authentication factor, the initial configuration settings of the IoT device V can be used as the first few pieces of historical data \(d_i\). This is possible since the protocol is designed to accept data of different formats and types. It is also practically reasonable to assume that these first few data tuples \((d_i, t_i)\) can be transported to P in proximity, prior to deployment, via a very secure channel. Roughly, 5 pieces of initial data suffice. If necessary, dummy data can also be generated by V and transported to P in the pre-deployment setup stage. Hence, bootstrapping the protocol safely should not be an issue.

6 Implementation and Performance

6.1 Parameter Selection

This section discusses the components of the proposed protocol and the considerations for choosing their parameters. First, the parameters of the underlying finite field \(\mathbb {F}\) should be fixed according to the security parameter \(\lambda \); all the components can then be selected accordingly. Both integers \(\text {mod } p\) and binary extension fields can be used. In our implementation, we use \(GF(2^{256})\), the 256-bit binary extension field (i.e. \(\lambda = 256\)), defined by the low-weight irreducible polynomial \(x^{256} + x^{10} + x^5 + x^2 + 1 \in \mathbb {F}_2[x]\). Parameter selection for the components is as follows:

1. A collision-resistant, length-matching hash function h() is used to map \(d_i\) (an arbitrary-length string) to an element of the underlying finite field; there is a wide range of choices for it. In our implementation, the arithmetic is in \(GF(2^{256})\), so SHA-256 is used to implement h.

2. The core of the proposed protocol includes two PRFs, E and f (more precisely, E is a pseudorandom permutation). The key size for E and f should be \(\lambda \). E maps a random challenge r to a key \(r'\) that keys the PRF f; hence, the output length of E should be \(\lambda \). SHA-HMAC [18], AES-OMAC, or any block cipher in general can be used to realize E. All these are standard primitives. The other PRF f maps an arbitrary-length string to an element in \(\mathbb {F}\). Ideally, f should be selected such that its output matches the element size of \(\mathbb {F}\), but in practice, truncation of the output of f is acceptable. Like E, there are many conjectured PRFs that can instantiate f.

For our case, \(\lambda = 256\); therefore, the key is 256 bits long and the arithmetic is in \(GF(2^{256})\). For the sake of simplicity, we use SHA256-HMAC (\(\{0,1\}^{256} \times \{0,1\}^* \rightarrow \{0,1\}^{256}\)) to instantiate both E and f. Note that the output of SHA256-HMAC can be interpreted both as an element in \(GF(2^{256})\) and as a 256-bit binary string. Besides, the same implementation of SHA-256 is used to instantiate h. While this implementation may not be optimal in computational resources, it minimizes the code size. For devices with a hardware implementation of AES, a more efficient implementation is attainable using AES-OMAC.
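The fragment below sketches how a single JDK SHA-256 implementation can back all three components; the method names are ours and the fragment is a sketch, not our measured code:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class Primitives {
    // h: SHA-256 maps an arbitrary-length string to a GF(2^256) element.
    static BigInteger h(byte[] data) throws Exception {
        return new BigInteger(1, MessageDigest.getInstance("SHA-256").digest(data));
    }

    // E and f: the same SHA256-HMAC construction, distinguished only by the
    // key in use (mk for E, r' for f); the 256-bit output doubles as a field element.
    static BigInteger hmac(byte[] key, byte[] msg) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return new BigInteger(1, mac.doFinal(msg));
    }
}
```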

Table 1. Computational performance of the authentication factor using historical data

6.2 Performance

The two-factor authentication scheme is implemented in Java using JDK 7u07. A finite field defined by the irreducible polynomial \(x^{256} + x^{10} + x^5 + x^2 + 1 \in \mathbb {F}_2[x]\) is used. In the experiment, an Intel Core 2 Quad 2.4 GHz processor with 3 GB RAM is used for the prover P, and a BeagleBone Black board with an ARM Cortex-A8 processor for the lightweight verifier V. Table 1 depicts the computational performance of the novel authentication factor. In Table 1, the bit complexity of one invocation of the cryptographic hash function h is denoted by \(l_h\), and the bit complexity of generating a \(\lambda \)-bit random number by \(O(\lambda )\). For \(\lambda \)-bit security, the finite field \(\mathbb {F}\) used consists of \(\lambda \)-bit elements, implying that the complexities of multiplication and addition in \(\mathbb {F}\) are \(O(\lambda ^2)\) and \(O(\lambda )\) respectively. Finally, \(O(l_D)\) denotes the complexity of searching for and retrieving random items from a big dataset, which depends on the platform being used.

Fig. 3. Computation time for proof generation by P (in \(\mu \)s) and proof verification by V (in ms) as the design parameter \( t = |I|\) (the number of indices in the challenge set I) varies.

Figure 3 shows the experimental results for the computation time needed for proof generation (in \(\mu \)s) and verification (in ms) on the two platforms (P and V respectively) for different challenge difficulties t (also denoted |I|, i.e. the number of \(d_i\)'s needed to generate the response). The running times of both proof generation and proof verification grow linearly with t.

Regarding the bandwidth used, each tag \(t_i\) is \(\lambda \) bits long. For SHA-256, \(\lambda = 256\) bits or 32 bytes, which seems practically reasonable. Depending on the tagging frequency, the actual communication overhead for the tags could be adjusted. As shown in Sect. 5.1, the proposed scheme is relatively resilient even with a small data set \(\mathcal{D}^*\), implying that a relatively low tagging frequency may suffice in practice. For an authentication session, the challenge size is \(2 \lambda \) plus the data size needed to specify I (which depends on how the elements of I are specified) and the response size is \(3 \lambda \). For SHA-256, the challenge and response are 64 bytes and 96 bytes respectively. If I is specified as a set of equally spaced integers, then the data size needed could be quite small.

7 Prover Delegation

In order to accommodate routine operations, say, in the power grid, which risk leaking the first authentication factor mk, a secure prover delegation mechanism can be constructed to allow the server P to delegate its prover capability to a user U (who can be an employee or a contractor). The basic idea is that, instead of using the master secret key mk directly in executing the protocol, mk is used to derive a user key \(k_U\), say, with a PRF by computing \(k_U = PRF_{mk}(id_U)\), where \(id_U\) is a bit string representing U's identity. In the protocol, \(k_U\) is used in place of mk for the first authentication factor. A secure (i.e. private and authenticated) channel between P and U is assumed. The security assurance of the second authentication factor is reduced to that of the first authentication factor. Yet, there is still considerable advantage over the case which concurrently executes the challenge-response protocol twice with distinct secret keys. First, P, as a verifier, has more resources than V to implement more sophisticated authentication methods, such as biometrics. Second, the leakage resilience property of the second authentication factor is inherent in the modification.
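A minimal sketch of the derivation step, assuming SHA256-HMAC as the PRF (the method name is illustrative):

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ProverDelegation {
    // k_U = PRF_mk(id_U): derive a per-user key from the master key mk,
    // so mk itself never leaves P during delegated operations.
    static byte[] deriveUserKey(byte[] mk, String idU) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(mk, "HmacSHA256"));
        return mac.doFinal(idU.getBytes(StandardCharsets.UTF_8));
    }
}
```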

8 Conclusion

This paper addresses the problem of two-factor authentication for resource-constrained IoT devices, which is largely an open problem, especially when a resource-limited device acts as the verifier. A novel authentication factor, using historical data as facilitated by big data platforms, is proposed to complement the conventional authentication factor in the form of a pre-shared secret key, achieving a reasonably lightweight scheme for the IoT. Inherent in the new authentication factor is the much-needed leakage resilience property, making it robust to compromise or leakage of secret authentication information. Besides, only constant storage is required at the verifier, regardless of the size of the historical data. Another interesting property is that, as the volume of the historical data increases, the security of the protocol improves accordingly. With the new authentication factor, a lightweight two-factor authentication protocol is designed and implemented. The design is scalable in the sense that the tradeoff between the computational overhead and the security level of the two-factor authentication protocol can be varied by adjusting the challenge set size t, thus allowing the protocol to be tuned for different IoT scenarios with devices of vastly different capabilities.