Open access

Securing the Ethereum from Smart Ponzi Schemes: Identification Using Static Features

Authors:

Yutong Lu

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5

Article No.: 130, Pages 1 - 28

https://doi.org/10.1145/3571847

Published: 22 July 2023

Abstract

Malware detection approaches have been extensively studied for traditional software systems. However, the development of blockchain technology has given rise to a new type of software system: decentralized applications. Composed of smart contracts, one type of application that implements Ponzi scheme logic (called smart Ponzi schemes) has caused irreversible losses and hindered the development of blockchain technology. These smart contracts generally have a short life but involve a large amount of money. Although identifying these Ponzi schemes before they cause financial loss is highly important, existing methods suffer from three main deficiencies: an insufficient dataset, reliance on transaction records, and low accuracy. In this study, we first build a larger dataset. Then, we extract a large number of features from multiple views, including bytecode, semantic, and developer views. These features are independent of the transaction records. Furthermore, we leverage machine learning methods to build our identification model, the Multi-view Cascade Ensemble model (MulCas). The experimental results show that MulCas achieves higher performance and robustness within the scope of our dataset. Most importantly, the proposed method can identify smart Ponzi schemes at creation time.

1 Introduction

1.1 Blockchain and Decentralized Application

In recent years, blockchain technology has received extensive attention from the industry and academia [28, 37]. Simply, a blockchain is a continuously growing chain of blocks (i.e., a ledger) maintained in a distributed network (i.e., peer-to-peer network), where each peer contains a complete copy of the chain [43]. Each block in the ledger contains a certain number of transactions and a corresponding timestamp, and is linked to its previous block by including a hash of it. This decentralized maintenance and hash connection make the blockchain almost immutable because modifying any one block requires simultaneously generating new hash values for all subsequent blocks and getting the approval of more than 50% of the peers.
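The hash linking described above can be illustrated with a toy chain. This is only a sketch under simplifying assumptions: it uses SHA-256 over a JSON dump of each block, omits timestamps and consensus entirely, and the helper names (`block_hash`, `build_chain`, `first_broken_link`) are hypothetical. It is not how Ethereum encodes or hashes blocks.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's contents, which include the hash of its predecessor."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_chain(batches: list) -> list:
    """Link each block to the previous one through its prev_hash field."""
    chain = []
    prev = "0" * 64  # placeholder predecessor for the first block
    for height, txs in enumerate(batches):
        block = {"height": height, "transactions": txs, "prev_hash": prev}
        chain.append(block)
        prev = block_hash(block)
    return chain


def first_broken_link(chain: list):
    """Return the height of the first block whose prev_hash no longer matches, or None."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return i
    return None


chain = build_chain([["a->b: 1"], ["b->c: 2"], ["c->a: 3"]])
intact = first_broken_link(chain)          # no mismatch in the untouched chain
chain[0]["transactions"] = ["a->b: 100"]   # tamper with the oldest block
broken_at = first_broken_link(chain)       # the very next block now disagrees
```

Repairing the mismatch would require recomputing `prev_hash` for every later block, which is the property the paragraph above relies on.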

Owing to its immutable nature, blockchain technology has become the cornerstone of many new forms of applications. Bitcoin and similar cryptocurrencies are an important class of applications based on it. As research and discussion on blockchain technology have deepened, various blockchain-empowered applications have emerged. Due to the decentralized nature of blockchain, these new types of applications are called decentralized applications (DApps) [10, 49]. Composed of smart contracts [53, 59], which are executable programs written into the blocks, DApps have many properties that differ from traditional software (i.e., centralized or distributed software systems [49]), such as open-source licensing, internal cryptocurrency support, decentralized consensus, and no central point of failure [10]. According to the statistics of dapp.review, about 5,000 DApps are currently running on blockchain platforms, with applications ranging from gambling and lottery to social and financial fields, involving 12,750 smart contracts and more than 250,000 active users. These facts show that decentralized applications have become a new kind of software system that cannot be ignored.

1.2 Smart Ponzi Schemes and Identification

Running on an immutable blockchain, DApps provide users with a sense of trustworthiness. Yet this feature has also been exploited by criminals to create a new type of scam application. These applications do not create any value themselves; their operators mainly rely on constantly attracting new users in order to obtain fees and other income. Early players may be able to reap some of the benefits. However, this mechanism of wealth redistribution will inevitably cause irreversible losses to most users and hinder the development of the blockchain ecosystem. These applications are called smart Ponzi schemes [2, 17].

Ponzi schemes are a classic type of scam whose core mechanism is to use the investments of new investors to pay earlier investors. Operators keep the scam alive by continuously attracting new investors (and charging fees) with promises of high profits. A Ponzi scheme will eventually collapse because it is hard to keep attracting new investors. By building on blockchain and smart contracts, smart Ponzi schemes exhibit many new characteristics. On the one hand, the proceeds of the fraud are some kind of cryptocurrency, and the anonymous and immutable nature of the blockchain makes many investors’ losses irrecoverable. On the other hand, a large number of ordinary investors are vulnerable to fraud under the veil of the emerging technology. For example, Fomo3D,¹ a game known as a Ponzi scheme, soon surpassed Cryptokitties² and became one of the most popular games on the Ethereum platform at the time. It is reported that, by the end of the first round of the game, the final winner was awarded more than $3 million, while most ordinary users lost their investment [45].

Smart Ponzi schemes cause a large number of participants to lose cryptocurrency (and money), which harms the development of the blockchain ecosystem. Thus, just as it is extremely important to detect malicious apps in the Android markets [65], smart Ponzi scheme detection is an important measure for maintaining the DApp ecosystem. Note that “DApp” is more of a user-oriented concept. A DApp is commonly composed of basic smart contracts that run the application and a web client that the user interacts with. Since smart contracts are the building blocks of a DApp, the identification of smart Ponzi schemes mainly targets individual smart contracts that apply the logic of Ponzi schemes. Therefore, when Ponzi logic is detected in a specific smart contract, we infer only that this specific part of the DApp adopts the Ponzi scheme. The detection method can provide a warning message to participants if and only if their transactions interact with this specific address. In this way, it can protect users from becoming involved in Ponzi schemes while minimizing the impact on the other, benign functionalities of the DApp.

While many of the participants in the previous example (i.e., Fomo3D) may be aware of the potential risks, many other smart Ponzi schemes lure ordinary investors under the guise of an investment plan with high profits (see [17] for an example). In addition, some smart Ponzi schemes do not provide source code (i.e., hidden smart Ponzi schemes in [17]); in this case, even professionals cannot judge whether a contract is a smart Ponzi scheme. Considering the rapid development of blockchain technology, the lack of professional knowledge among the majority of users, and the relatively weak supervision, it is urgent to study identification methods for smart Ponzi schemes.

1.3 Current Methods and Limitations

Fortunately, there have been studies on the problem of smart Ponzi schemes. Bartoletti et al. discuss the classification of smart Ponzi schemes and their influence [1, 2]. They collected samples of smart Ponzi schemes in two ways: (1) by reading source code to collect some samples; (2) by evaluating the bytecode similarity of contracts to obtain a small number of samples of hidden smart Ponzi schemes. Furthermore, Chen et al. formulated the identification of smart Ponzi schemes as a classification problem, and achieved automatic recognition by adopting a machine learning method based on extracted bytecode and account behavior features [17, 18]. For a decentralized ecosystem, it makes more sense to establish automatic identification of smart Ponzi schemes. However, the current approaches have four main limitations as follows:

Insufficient samples: Sufficient samples are key to building an automatic identification model. In previous studies (i.e., [2, 17, 18]), fewer than 200 smart Ponzi scheme samples were used, and the number of non-Ponzi samples was also small. This is a complete mismatch with the current number of smart contracts. (Currently, there are millions of smart contracts on the Ethereum platform, including more than 100,000 open-source contracts.)

Reliance on transaction records: Features extracted from the transaction records of contracts contribute most of the accuracy of existing methods. However, when investigating Ponzi schemes on Ethereum, we noticed that most of the Ponzi schemes we collected have a short lifetime. The median lifetime is only about 2.5 days for contracts with non-zero transactions. This short lifetime weakens transaction-based methods: by the time sufficient transactions have been collected, most Ponzi contracts have already completed their lifecycle (in this paper, we use the terms “Ponzi contract” and “smart Ponzi scheme” interchangeably).

Flawed model evaluation method: In [17] and [18], the model is trained and tested mainly on randomly divided sample sets. However, as the technology for developing smart contracts iterates over time [66], later smart Ponzi schemes are often more complex in their technical approach and logic. Randomly dividing the samples may cause the model to be trained on information from newer contracts and then tested on earlier contracts, which inflates the measured performance. When such models are applied to future contracts, they may perform worse than expected.

Low accuracy: For machine learning based detectors, the single feature construction method and simple models used in [17, 18] are insufficient to distinguish smart Ponzi schemes. At the same time, the flawed evaluation method exaggerates the real performance of the model. In addition, smart Ponzi schemes are always a minority among a large number of applications, but the class imbalance problem was not considered in previous studies. On the other hand, detectors based on symbolic execution [15] rely on expert rules to recognize Ponzi schemes. However, with the explosive growth in the number of smart contracts, it is hard for experts to examine every emerging Ponzi scheme and keep updating detection rules for evolving attack strategies. The scalability of detailed rules is potentially limited by the proliferation of smart contracts, while general rules may lead to a high false positive rate and a high false negative rate in practice [31].
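The evaluation concern raised under “Flawed model evaluation method” can be made concrete with a chronological split, in which the model is trained only on contracts created before those it is tested on. The following is a minimal sketch, not the evaluation protocol of this paper: the `Contract` record, its field names, and the 60/40 split are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    address: str          # hypothetical identifier
    creation_height: int  # block height of the creation transaction
    label: int            # 1 = Ponzi, -1 = non-Ponzi


def temporal_split(contracts, train_fraction=0.8):
    """Train on the earliest contracts and test on the later ones."""
    ordered = sorted(contracts, key=lambda c: c.creation_height)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


samples = [
    Contract("0xa", 4_100_000, -1),
    Contract("0xb", 6_500_000, 1),
    Contract("0xc", 5_200_000, 1),
    Contract("0xd", 7_000_000, -1),
    Contract("0xe", 3_900_000, -1),
]
train, test = temporal_split(samples, train_fraction=0.6)
```

Unlike a random split, every training contract here precedes every test contract, so no information from newer contracts leaks into the prediction of older ones.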

1.4 Our Methods and Contributions

To tackle these challenges, we collect more samples and propose a Multi-view Cascade Ensemble method named MulCas for automatic smart Ponzi scheme identification, with the following salient features:

High accuracy: To achieve high accuracy of smart Ponzi scheme detection, we extracted multiple layers of rich features from the bytecode of the contract, including bytecode features and semantic features. Similar to [17, 18], we do not use any source code information from the contract, but only the bytecode deployed on the blockchain. In addition, we extract the developer features by parsing the contract creation records on the blockchain. Based on these features, we obtain a better detection effect by using the proposed multi-view cascade ensemble method.

Identification at creation time: Our model uses only the information that must be provided when the contract is created (i.e., bytecode and a deployment transaction record), rather than relying on the interaction information after the contract is created. Thus, the model can make a decision on whether it is a smart Ponzi scheme at its creation time.

Better robustness: By combining the discriminant results from different perspectives, our model has better robustness. Compared with the baseline model, the effectiveness of our model declines less as the class imbalance increases, and its prediction performance is relatively stable over time.

Figure 1 shows the framework of our study, which consists of four steps. Firstly, after collecting and verifying more samples, we convert the bytecode of each contract into opcodes, which is an important source of subsequent features. In addition, by parsing the blockchain data, we extracted all the contract creation transactions to form the creation graph, which is another source of our features. Secondly, through the above two sources, we extract three types of features, namely bytecode features, semantic features, and developer features. Next, we train identification models (i.e., viewers) based on different features. Finally, we present the Multi-view Cascade Ensemble method. The final result of the model includes two parts. The first part is the result with enough confidence in the two kinds of samples (i.e., smart Ponzi scheme and non-Ponzi scheme contract) by cascading different viewers. The other part is the voting result of all viewers for the unconfident samples.

Fig. 1.
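The two-part decision described for the final step (confident results from cascaded viewers, then a vote for the remaining samples) can be sketched as follows. This is an illustrative reading of the description above, not the MulCas implementation: the confidence thresholds `low` and `high`, the viewer interface, and the toy viewers are all assumptions.

```python
from typing import Callable, Sequence

# A viewer maps a sample to an estimated probability that it is a Ponzi contract.
Viewer = Callable[[dict], float]


def cascade_predict(sample: dict, viewers: Sequence[Viewer],
                    low: float = 0.1, high: float = 0.9) -> int:
    """Return 1 (Ponzi) or -1 (non-Ponzi) for one sample."""
    scores = []
    for viewer in viewers:
        p = viewer(sample)
        scores.append(p)
        if p >= high:   # confident positive: stop the cascade here
            return 1
        if p <= low:    # confident negative: stop the cascade here
            return -1
    # No viewer was confident: fall back to a majority vote over all viewers.
    votes = sum(1 if p >= 0.5 else -1 for p in scores)
    return 1 if votes > 0 else -1


# Toy viewers that simply read precomputed scores from the sample (hypothetical keys).
viewers = [
    lambda s: s["bytecode"],
    lambda s: s["semantic"],
    lambda s: s["developer"],
]

confident_pos = cascade_predict({"bytecode": 0.95, "semantic": 0.2, "developer": 0.3}, viewers)
confident_neg = cascade_predict({"bytecode": 0.5, "semantic": 0.05, "developer": 0.9}, viewers)
voted_pos = cascade_predict({"bytecode": 0.6, "semantic": 0.4, "developer": 0.7}, viewers)
voted_neg = cascade_predict({"bytecode": 0.4, "semantic": 0.45, "developer": 0.6}, viewers)
```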

By applying MulCas to the constructed dataset, we find that it exhibits an impressive performance of smart Ponzi scheme detection. In summary, our major contributions include:

• By reading source code (similar to [2, 17]), we collect more samples of smart Ponzi schemes and construct a new dataset, which will promote the study of smart Ponzi scheme identification. The dataset and associated analysis code are available at http://xblock.pro/#/dataset/PonziContractDataset.

• We extract rich features from multiple views. These features do not depend on the runtime behavior of the smart contract, enabling identification at the moment of creation. Among these features, the extracted developer features have a significant recognition effect.

• We propose and develop MulCas, a model for smart Ponzi scheme detection that uses supervised learning methods and the extracted features.

• We conduct extensive experiments to evaluate MulCas on the new dataset. The results show that, compared with the baseline results in [17], MulCas greatly improves the recall and F1 values. In addition, MulCas performs better in terms of robustness.

The rest of this paper is organized as follows. After providing some background in Section 2, we introduce the dataset construction method and problem definition in Section 3. Section 4 details the proposed method and Section 5 reports the experimental results. After introducing the related work in Section 7, we conclude the paper in Section 8.

2 Background

Similar to [2, 17, 18], our study is also based on Ethereum. Below is a brief introduction of Ethereum and smart contracts.

Ethereum. Ethereum is the second-largest cryptocurrency platform (in terms of market capitalization) based on blockchain technology. Compared with Bitcoin, Ethereum introduced the Ethereum Virtual Machine (EVM) to support smart contracts [53]. EVM is a stack-based virtual machine that can execute scripts composed of instructions in its own bytecode instruction set [7].

Smart contract. A smart contract is essentially a piece of executable code deployed on Ethereum. Generally, the creation of a smart contract takes several steps: (1) write smart contracts in high-level languages such as Solidity;³ (2) compile source code into bytecode; (3) release the bytecode to Ethereum by sending a transaction. Once a smart contract is deployed to the blockchain, the corresponding bytecode and the creation transaction are permanently stored on the blockchain.

Account. There are two types of accounts on Ethereum: Externally Owned Accounts (EOAs) and smart contract accounts. The EOAs are controlled by private keys and smart contract accounts contain the associated bytecode.

Transaction. Technically, a transaction on Ethereum is a cryptographically signed instruction. There are two types of transactions on Ethereum [59]: (1) those resulting in message calls; (2) those resulting in the creation of new smart contract accounts. The first can be used for transferring Ether or calling specific functions of a smart contract. The latter can be used to deploy a smart contract.

Gas. Gas is the execution fee that accounts need to pay for running transactions on Ethereum. To deploy a smart contract, the creator needs to pay a deployment fee related to the length of the contract and the storage space it occupies. To call a smart contract, the caller needs to pay gas related to the number of opcodes executed and the storage space modified by the function call.

EVM bytecode and opcode. Once a smart contract is created, the EVM bytecode becomes the only code information of the contract stored on blockchain. EVM bytecode consists of three EVM-code fragments: initialization code, contract code, and auxiliary data.

• Initialization code: the EVM-code fragment for the account initialization procedure. This part of EVM bytecode is executed only once when the contract is being created.

• Contract code: the main fragment of code which is executed whenever the contract account receives a message call.

• Auxiliary data: an optional fragment that plays the role of a cryptographic fingerprint of the source code and is never executed.

However, bytecode in binary form is hard for developers to analyze. To make the binary values interpretable, Ethereum provides an instruction set that maps each bytecode instruction to a mnemonic form called an “opcode” [59]. Figure 2 illustrates an example of an opcode file converted from the bytecode of a smart contract.

Fig. 2.

3 Dataset Building and Problem Definition

3.1 Ponzi Contract: A Case

One of the main contributions of our work is the manual inspection of source code to determine whether a smart contract implements Ponzi scheme logic. Recall that the key to Ponzi scheme logic is to use the investments of new investors to compensate earlier investors. Hence, the source code of a Ponzi contract must reflect the same logic. In practice, Ponzi scheme logic can be implemented through various approaches, such as the array-based and tree-based schemes introduced in [1].

The following example illustrates how we recognize Ponzi contracts, in other words, how we determine from source code whether a smart contract is a Ponzi contract. The example contract, named Daily12, is a verified Ponzi smart contract with source code available on etherscan.io (one of the most famous Ethereum explorers).

We first introduce some interesting characteristics of Daily12. As shown in Figure 3, Daily12 is swift: the lifecycle of the contract is only 10 days. The balance of Daily12 rapidly increased to about 146.59 ETH in 3 days but dropped to 0 in the following 7 days. We then downloaded all the transactions of the contract to gain insight into this Ponzi contract. According to our statistics, 171.04 ETH was involved in Daily12 (now about $219,782.98), which is a considerable amount of money. These characteristics support our motivation to identify Ponzi contracts at creation time. To the best of our knowledge, the key component of existing methods is the transaction record of a contract. However, the time needed to collect sufficient records may be long enough for a swift Ponzi contract like Daily12 to complete its lifecycle. Therefore, a detection result based on transactions is not as meaningful as it seems.

Fig. 3.

The contract has made a seductive claim to attract participants. It promotes itself by claiming that the participants would gain 12% of their investments every day and they could withdraw the profits at any time:

Propaganda

Easy investment contract

- GAIN 12% PER 24 HOURS (every 5900 blocks)

- NO COMMISSION on your investment (every ether stays on contract’s balance)

- NO FEES are collected by the owner, in fact, there is no owner at all (just look at the code)

How to use:

1. Send any amount of ether to make an investment

2a. Claim your profit by sending 0 ether transaction (every day, every week, i don’t care unless you’re spending too much on GAS)

2b. Send more ether to reinvest AND get your profit at the same time

However, the propaganda does not reflect the nature of Daily12. The contract has made its source code available on etherscan.io to win the trust of the participants. We now show how the Ponzi logic is implemented in the Daily12 source code shown in Figure 4.

Fig. 4.

The source code of Daily12 is brief. Like most traditional Ponzi schemes, the contract records the investments of participants (line 2). The mapping variable in line 3 records the time at which each participant last retrieved their profits. Next, the function in lines 4–11 reflects the critical logic of the Ponzi contract. Once the contract receives a transaction, it computes 12% of the sender’s investment scaled by the time interval (line 6), and transfers the profit back to the sender (line 7), where the so-called profit comes from the investments of the participants.

It can be seen that the payback window is about 8.33 days, since 100%/12% ≈ 8.33. A participant can earn the investment back after 8.33 days, as long as the contract has enough balance. If a balance still remains, i.e., there are later investments, the investor can profit from those investments.
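The payout rule and the payback window can be checked with a few lines of arithmetic. The sketch below assumes Solidity-style integer division and the figure of 5900 blocks per day quoted in the contract's own description; the function name `payout` and the exact order of operations are illustrative assumptions rather than a transcription of Figure 4.

```python
BLOCKS_PER_DAY = 5900      # taken from the contract's own description
DAILY_RATE_PERCENT = 12    # "GAIN 12% PER 24 HOURS"
ONE_ETH = 10 ** 18         # 1 ether expressed in wei


def payout(invested_wei: int, blocks_elapsed: int) -> int:
    """Amount returned to an investor after blocks_elapsed blocks (integer math)."""
    return invested_wei * DAILY_RATE_PERCENT // 100 * blocks_elapsed // BLOCKS_PER_DAY


# Payback window in days: the time needed for the payouts to equal the deposit.
payback_days = 100 / DAILY_RATE_PERCENT

one_day = payout(ONE_ETH, BLOCKS_PER_DAY)   # 0.12 ETH after one day
just_before = payout(ONE_ETH, 49_166)       # still slightly below the deposit
just_after = payout(ONE_ETH, 49_167)        # deposit fully paid back
```

Under these assumptions the deposit is recovered after roughly 49,167 blocks, i.e., about 8.33 days; anything paid out beyond that point can only come from other participants' deposits.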

Winning or losing in the Ponzi contract depends on whether there are later investors. The key is to continuously attract new investors. Daily12 adopted the “12%” trick. In the first 8.33 days, Daily12 pays each participant back 12% of their own investment per day, creating the illusion that the payback is stable and continuous. Owing to this seemingly long life cycle, the contract even received about 0.11 ETH on the last day. However, once a participant’s own investment has been paid back, the contract has to use the investments of later participants to pay any further profit.

This mechanism means that only the earlier participants can win in the Ponzi contract. Figure 5 shows the ether flow of the smart contract. The ether flow graph visualizes the transactions involved in a smart contract. Three types of transactions are encoded in the graph: investment, payback, and profit, denoted by blue circles, green squares, and orange triangles, respectively. Only those who earned from the contract could receive profit transactions (i.e., orange triangles in the graph). The amount of ether involved in each transaction is reflected by the size of the circles. The x-axis represents the timeline, while the y-axis represents individual participants. There are also two lines in the figure. The solid line is the regression line of investments and the dotted line is the regression line of the first profit transaction for each investor. Several insights come from the two lines: (1) The dotted line roughly splits the payback area (green) and the profit area (orange). Investors can only get profit from the orange area below the dotted line. A number of investors have no transactions below the dotted line, meaning that they are victims of the Ponzi contract; (2) The two lines are parallel, indicating the payback window of 8.33 days, and the left margin of the orange area forms another line parallel to them. Notably, most of the participants who earned had made their first investment in the early stage of the contract. In fact, only investments made before block height 6547454 (the contract lifetime minus the payback window size) had a chance to earn from the contract. Unfortunately, some early investors still lost their money: they reinvested in the contract several times after their first investment. Apparently, they did not understand the degenerating nature behind the seemingly high and stable profit.

Fig. 5.

The Daily12 example shows that a Ponzi contract can be swift yet involve a considerable amount of money. Detection methods based on transaction records offer little protection against such contracts, since the contracts may have completed their life cycle before sufficient transaction records are collected. Therefore, it is important to develop a detection method that is independent of transaction records.

3.2 Dataset Building

As is presented in Figure 6, the dataset is built in the following steps:

Fig. 6.

• Crawl the contracts with accessible source code from etherscan.io. Smart contracts are not required to provide source code on Ethereum. There are millions of smart contracts deployed on Ethereum, but only a small fraction of them have verified source code on etherscan.io. A previous study [44] found that only 0.05% of the smart contracts are the target of over 80% of the transactions that are sent to contracts, and that 73.1% of them have available source code. Therefore, these open-source contracts account for a major share of the transactions on Ethereum.

• Manually check the source code. Similar to the source code inspection in Section 3.1, we manually read the source code of these smart contracts and check whether each smart contract reflects Ponzi scheme logic. Once we find code that uses the investments of new investors to compensate earlier investors, we label the contract as a Ponzi scheme. Next, we cross-check our results to make them more precise.

• Extract the contract bytecode and developer information from the blockchain. We use OpenEthereum⁴ to download the bytecode and developer information. The auxiliary bytecode fragment, introduced in Section 2, is discarded since it is never executed by the EVM.

We make several notes for our dataset collecting procedure:

• Although we collect our dataset from open-source smart contracts, and the majority of transactions come from or target these contracts, we develop our Ponzi contract detector at the bytecode level. This design reflects the fact that a considerable number of latent Ponzi schemes are running on Ethereum (as found in the studies of Bartoletti et al. [2] and Chen et al. [17]); a bytecode-level detector can extend its application scenarios to all smart contracts deployed on Ethereum. Therefore, a warning can be provided to users before they interact with bytecode-only smart contracts, or to e-wallets when adversaries try to deploy Ponzi contracts through them. In Section 6.1, we discuss the potential threats to validity introduced by the biased distribution of our dataset compared with the entire population of smart contracts on Ethereum.

• To ensure the validity of our dataset, a cross-check procedure is adopted when we manually classify the source code of smart contracts. Seven researchers are involved in this process. In the first step, five researchers label the smart contracts. Generally speaking, recognizing Ponzi contracts from source code is relatively easy: we only need to recognize the ether flow relationship and decide whether the relationship falls into Ponzi logic. Each smart contract is read by two researchers. A label is considered reliable only if both researchers assign the same label. If a discrepancy occurs between the two labels, the smart contract is passed to a more experienced researcher for discussion. If no agreement is reached, the smart contract is discarded. In the second step, one researcher randomly challenges the labeled smart contracts to strengthen the reliability of the dataset.
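The first step of the cross-check procedure can be summarized as a small reconciliation rule. This is a sketch of the procedure as described above; the function names and the tuple layout of `annotations` are hypothetical, and the random challenge in the second step is not modeled.

```python
PONZI, NON_PONZI = 1, -1


def reconcile(label_a, label_b, senior_label=None):
    """Return the agreed label, or None if the contract should be discarded.

    senior_label is the outcome of the discussion with the more experienced
    researcher; None means that no agreement was reached.
    """
    if label_a == label_b:
        return label_a
    return senior_label


def build_labels(annotations):
    """Keep only contracts with a reliable label.

    annotations maps a contract address to (label_a, label_b, senior_label).
    """
    kept = {}
    for address, (a, b, senior) in annotations.items():
        final = reconcile(a, b, senior)
        if final is not None:
            kept[address] = final
    return kept


labels = build_labels({
    "0x1": (PONZI, PONZI, None),           # both annotators agree
    "0x2": (PONZI, NON_PONZI, NON_PONZI),  # resolved after discussion
    "0x3": (PONZI, NON_PONZI, None),       # no agreement: discarded
})
```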

After collecting and checking the smart contracts with open source code, we obtain 6,498 smart contracts, among which there are 314 smart Ponzi schemes (i.e., Ponzi contracts). These smart contracts range from height 0 to height 7,500,000. The statistics of the lifetime of these Ponzi contracts are shown in Table 1. It is not surprising that Ponzi contracts usually have a short life since they do not create any value. As explained in Section 3.1, this phenomenon supports our motivation to identify Ponzi contracts at creation time. Ponzi contracts usually do not last long, so transaction-based methods cannot alleviate the damage caused by Ponzi contracts on Ethereum.

Lifetime (days)	mean	median	max
without zero transaction contracts	34.484	2.549	384.795
with zero transaction contracts	21.688	0.027	384.795

Table 1. Statistics of the Collected Ponzi Contracts’ Lifetime

3.3 Problem Definition

Since smart contracts are the building blocks of DApps, identifying smart Ponzi schemes essentially means determining whether each contract that a DApp contains is a Ponzi scheme. Let $C = \lbrace c_1, c_2, \ldots, c_n\rbrace$ be a set of smart contracts with labels $L = \lbrace l_1, l_2, \ldots, l_n\rbrace$, where $n$ is the number of contracts. The contracts labeled $-1$ (i.e., negative) are non-Ponzi contracts, while contracts labeled $1$ (i.e., positive) are Ponzi contracts. Let $D = \lbrace d_1, d_2, \ldots, d_n\rbrace$ be the set of developer information of the corresponding contracts, where the developer information of a contract is composed of its creator address and the height of the creation block. Let $B = \lbrace b_1, b_2, \ldots, b_n\rbrace$ be the set of bytecodes of the corresponding contracts. The task of our work is to construct a high-performance classification model that classifies given contracts into non-Ponzi contracts or Ponzi contracts, where only the bytecode $B$ and the developer information $D$ are used for classification.

4 Methodology

Figure 1 illustrates the overall framework of our analysis, which consists of four main stages: preprocessing, feature extraction, viewer training, and viewer ensembling.

4.1 Preprocessing

With the dataset built in Section 3.2, we have the following information: (1) labeled bytecodes of contracts; (2) contract creation records. In this stage, bytecodes are converted to opcodes and the developer features are constructed.

For bytecode, the Ethereum yellow paper [59] provides a complete table for converting binary instructions to mnemonic forms, i.e., opcodes. Therefore, the conversion is done in the following steps: (1) split the bytecode into tokens; (2) convert these tokens into opcodes according to the instruction set. A disassembler⁵ can be used to convert bytecode to operation code. The operation code is composed of two parts: opcodes and operands. For example, the mnemonic form “PUSH1” is an opcode and the hexadecimal number “0x80” following this opcode is an operand. In our preprocessing stage, the operands are discarded because they are pure hexadecimal numbers that are highly dependent on individual smart contracts, and including them would increase the model complexity dramatically. Analysis of these operands is left for future research.
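The two conversion steps, together with the removal of operands, can be sketched with a minimal disassembler. This is an illustration rather than the tool used in this study: the opcode table below is deliberately partial (a real disassembler needs the full instruction set), and the function name `disassemble` is hypothetical.

```python
# Partial opcode table for illustration only.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x02: "MUL", 0x03: "SUB", 0x04: "DIV",
    0x10: "LT", 0x11: "GT", 0x14: "EQ", 0x15: "ISZERO", 0x16: "AND",
    0x34: "CALLVALUE", 0x35: "CALLDATALOAD", 0x36: "CALLDATASIZE",
    0x50: "POP", 0x51: "MLOAD", 0x52: "MSTORE", 0x54: "SLOAD", 0x55: "SSTORE",
    0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST", 0xF3: "RETURN", 0xFD: "REVERT",
}
for i in range(32):
    OPCODES[0x60 + i] = f"PUSH{i + 1}"   # PUSH1 .. PUSH32
for i in range(16):
    OPCODES[0x80 + i] = f"DUP{i + 1}"    # DUP1 .. DUP16
    OPCODES[0x90 + i] = f"SWAP{i + 1}"   # SWAP1 .. SWAP16


def disassemble(bytecode_hex: str) -> list:
    """Convert hex bytecode into a list of opcode mnemonics, dropping operands."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    opcodes = []
    pc = 0
    while pc < len(code):
        byte = code[pc]
        opcodes.append(OPCODES.get(byte, f"UNKNOWN_0x{byte:02x}"))
        if 0x60 <= byte <= 0x7F:
            pc += byte - 0x5F  # skip the 1..32 operand bytes of PUSHn
        pc += 1
    return opcodes


tokens = disassemble("0x6080604052348015600f57")
```

Only the PUSH instructions carry inline operands, so skipping their trailing bytes is enough to discard all operands and to keep the program counter aligned with instruction boundaries.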

For contract creation records, we construct a creator vector as the developer feature. The detailed construction method will be introduced in Section 4.2.

4.2 Feature Extraction

In this stage, we will use the opcodes and contract developer information as inputs to extract features from different views. Note that the opcodes are series of tokenized mnemonic words, so they can be viewed as documents containing tokenized opcodes as words. Thus, feature extraction methods in the scope of natural language processing can be fit into the opcode data.

To investigate which features could be leveraged to build an effective model, we first conduct statistical analyses before selecting the features to use. We then apply several feature extraction methods from different views. First, the term frequency count and TF-IDF value are calculated based on the Bag-of-Words model [47]. These two methods view a document as an unordered set of words. In addition, N-grams and TF-IDF for N-grams are calculated to capture the locality of words in contiguous sequences. Moreover, the word2vec [36, 40] algorithm is used to embed opcodes into vectors from a semantic view. Finally, we present a novel feature extracted from developer information, called the creator vector, and demonstrate its effectiveness.

4.2.1 Term Count and N-gram Count.

The term count feature is computed based on the Bag-of-Words model [47], and is the same feature (i.e., word frequency) used in [17, 18]. From the view of this model, a document is considered to be a bag of the words in it, discarding the order of the words but keeping the information on multiplicity.

In contrast, the N-gram model is based on an $(N-1)$-order Markov model, which supposes that the occurrence of a word depends only on the preceding $N-1$ words. Generally speaking, an N-gram is a contiguous sequence of n items from a given document. When n is 1, 2, and 3, the N-gram is called a unigram, bigram, and trigram, respectively.
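The two counting schemes can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `ngram_counts` and the convention of joining opcodes with a space are assumptions, not the implementation used in this study. With `max_n=1` the result reduces to the plain term count.

```python
from collections import Counter


def ngram_counts(opcodes, max_n=3):
    """Count all contiguous n-grams (n = 1..max_n) in one opcode sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(opcodes) - n + 1):
            # An n-gram is stored as its opcodes joined by a space.
            counts[" ".join(opcodes[i:i + n])] += 1
    return counts


doc = ["PUSH1", "PUSH1", "MSTORE", "PUSH1"]
counts = ngram_counts(doc)
print(counts["PUSH1"])               # unigram count
print(counts["PUSH1 PUSH1"])         # bigram count
print(counts["PUSH1 PUSH1 MSTORE"])  # trigram count
```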

To learn whether the two features are valuable, we counted the top 10 most frequent words, bigrams, and trigrams for both Ponzi contracts and non-Ponzi contracts. The result is shown in Table 2. It can be seen that stack operations, such as PUSH and POP, dominate the top 10 statistics in both kinds of contracts. On the other hand, it seems that the Ponzi contracts favor jump operations more and use fewer algorithmic opcodes such as ADD and AND. This result comes from the fact that deploying a Ponzi scheme is not difficult and does not necessarily need many algorithmic operations. Other contracts implementing relatively complex applications rely more on algorithmic operations.

Table 2.

Ponzi Contracts	non-Ponzi Contracts
PUSH1	PUSH1
SWAP1	PUSH2
PUSH2	SWAP1
JUMPDEST	POP
DUP1	JUMPDEST
POP	DUP1
DUP2	DUP2
ADD	ADD
PUSH1 PUSH1	JUMPI
AND	PUSH2 JUMPI

Table 2. Top 10 Occurrence of Words, Bigrams, and Trigrams

Therefore, we take (1) term count; (2) the count of unigram, bigram, and trigram, as the features.

The term count of a word w in document d is calculated by counting its occurrences in that document. Without a normalization step, words in long documents tend to have more occurrences than those in shorter documents. However, when we count the length of opcodes for the two kinds of contracts, we find that the length information of a contract is also potentially beneficial to our classification. As shown in Figure 7, the length of opcodes in Ponzi contracts is generally smaller than that in non-Ponzi contracts, due to the simplicity of implementing a Ponzi scheme. In general, the term count feature under this model provides a view of quantity information while keeping the length of the document as potential information.

Fig. 7.

Similarly, the N-gram feature is extracted by counting the occurrences of all three kinds of N-grams in a document. However, as n grows larger, the number of distinct N-gram sequences grows exponentially. To control the complexity of our model, N-gram sequences with a document frequency of less than 0.1 are discarded, where the document frequency denotes the reciprocal of IDF before taking the logarithm, i.e., the fraction of documents that contain the sequence. Table 3 shows the number of features before and after this selection method.

Table 3.

Category	Before Selection	After Selection
unigram	141	77
unigram+bigram	7211	767
unigram+bigram+trigram	52532	2292

Table 3. Number of Features Before and After Selection
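The selection rule can be sketched as follows. This is a minimal sketch under the assumption that the document frequency of a feature is the fraction of documents in which it occurs at least once; the function name `select_by_document_frequency` is hypothetical.

```python
from collections import Counter


def select_by_document_frequency(docs, min_df=0.1):
    """Keep features whose document frequency is at least min_df.

    docs is a list of per-document dicts mapping a feature (an n-gram
    string) to its count in that document.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document contributes at most 1 per feature
    return {feature for feature, c in df.items() if c / n_docs >= min_df}


# Toy corpus: "PUSH1" occurs in all 20 documents, "ADD MUL" in only one.
docs = [{"PUSH1": 3, "ADD MUL": 1}] + [{"PUSH1": 2}] * 19
print(select_by_document_frequency(docs))  # {'PUSH1'}
```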

Compared with feature extraction based on the Bag-of-Words model, the N-gram model introduces locality to the feature, since it views the opcode sequences as blocks containing adjacent opcodes. The same consideration on document length in extracting the term count feature also exists in the extraction of the N-gram count feature. In general, the N-gram count feature provides the view of the dataset from information on quantity and locality while potentially keeping the document length as extra information.

4.2.2 TF-IDF and TF-IDF for N-gram.

While the two kinds of smart contracts differ in choosing jump operations and algorithmic operations, they agree on the massive use of stack operations. All smart contracts must use stack operations because EVM is stack-based. There are similar situations in the field of natural language processing: some items, such as “of”, have dramatically high occurrence in documents but they usually carry little information. To handle this problem, the TF-IDF method is introduced.

TF-IDF, short for Term Frequency-Inverse Document Frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection of documents [48]. The calculation of TF-IDF is composed of two parts: (1) calculating the TF value; (2) calculating the IDF value. Similar to term count, the term frequency calculation aims to capture the multiplicity information of a word in a particular document. The difference is that term frequency introduces a normalization step to exclude the influence of document size. The IDF value describes how common a word is across all documents. The more documents a word appears in, the lower its IDF value is. This negative correlation is introduced to reduce the weights of meaningless but frequent words such as “of” in sentences. In our study, the negative correlation may reduce the weights of operations such as “PUSH” and “POP”.

With the term frequency and inverse document frequency calculated in the above steps, the TF-IDF value is obtained by multiplying the two. The TF-IDF value evaluates the importance of a certain word to a given document within a collection of documents. TF-IDF takes quantity information into account while ignoring order information.
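The calculation can be sketched as follows. The exact TF-IDF variant is not spelled out in the text, so this sketch assumes one common textbook form: TF is the count of a word divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in df}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({w: (c / len(doc)) * idf[w] for w, c in counts.items()})
    return weights


docs = [["PUSH1", "PUSH1", "JUMPI"], ["PUSH1", "ADD"]]
for w in tf_idf(docs):
    print(w)
```

In this toy corpus PUSH1 appears in every document, so its IDF, and therefore its weight, is zero, whereas JUMPI and ADD keep a positive weight.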

The extraction of TF-IDF values for N-gram sequences is similar to that for opcode sequences, except that in this method we regard N-gram sequences as word sequences. According to the idea of N-gram and TF-IDF calculation, this feature provides a combined view of the two methods and evaluates the importance of an item by weighted frequency from a macro view while keeping locality information in the document.

4.2.3 Word2vec Embedding.

Word2vec [27] is a more sophisticated technique for producing word embeddings in the field of natural language processing. The algorithm uses a neural network model to embed words from a given corpus of text into low-dimensional vectors, where the dimension is set to 100 by default in our study. These vectors are learned to represent the semantic information of the corresponding words. The cosine similarity between vectors indicates the level of semantic similarity between the words. Given the word2vec embedding of each word, a document is represented by the weighted sum of all the word vectors in the document, with the IDF value as the weight of each word. In this sense, the word2vec embedding provides a semantic view of the opcodes.
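The document representation described above can be sketched as follows. The word vectors and IDF values here are toy values chosen for illustration (two dimensions instead of 100), not learned embeddings, and the function name `document_vector` is hypothetical. Words without an embedding are skipped, which is an assumption of this sketch.

```python
def document_vector(opcodes, embeddings, idf):
    """IDF-weighted sum of the word vectors of all opcodes in a document."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for w in opcodes:
        if w not in embeddings:
            continue  # assumption: out-of-vocabulary opcodes are ignored
        weight = idf.get(w, 0.0)
        for k in range(dim):
            vec[k] += weight * embeddings[w][k]
    return vec


# Toy 2-dimensional embeddings and IDF weights (illustrative values only).
embeddings = {"PUSH1": [1.0, 0.0], "JUMPI": [0.0, 1.0]}
idf = {"PUSH1": 0.5, "JUMPI": 2.0}
print(document_vector(["PUSH1", "PUSH1", "JUMPI"], embeddings, idf))  # [1.0, 2.0]
```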

4.2.4 Developer Feature.

A novel developer feature called “creator vector” is introduced in this paper. The main idea of using developer information is based on the following assumption:

Assumption 4.1.

The information that whether a smart contract is created by a tainted creator can be useful in the Ponzi scheme detection task.

In this work, we try to explain this assumption from two different views: human behavior and code logic.

From the view of human behavior, the idea is inspired by the discovery in phishing scam detection research that it is common for a phishing scam creator to create multiple phishing scams [14]. Therefore, we assume that someone who has already created a Ponzi contract is more likely to create further Ponzi contracts. To explore the assumption, we perform a statistical analysis on our dataset, and the results are shown in Figures 8 and 9.

Fig. 8.

Fig. 9.

As is illustrated in Figure 8, although the majority of Ponzi contract creators have created exactly one Ponzi contract, there is still a small proportion of Ponzi contract creators who have created more than one Ponzi contract. Considering the high contract deployment fee on Ethereum, we can suppose that this group of creators has profited from their previous Ponzi contracts. When we look into these creators, we find in Figure 9 that the Ponzi contracts created by them account for a considerable share. This is a reasonable observation because the number of contracts equals the number of creators multiplied by the number of contracts each of them created. As a result, 61% of all Ponzi contracts come from these experienced creators. Therefore, the assumption is reasonable from the statistical results in the scope of our dataset.

From the view of code logic, the developer information has been found to have a positive correlation with code similarity. Huang et al. [63] and Chen et al. [19] have found that smart contract developers tend to reuse their code when deploying new smart contracts with similar functionality. The code similarity between smart contracts from the same developer is significantly higher than the code similarity between independent smart contracts [63]. In this sense, Ponzi contracts from the same creator tend to have higher similarity in code logic, which can be beneficial in the Ponzi scheme detection task.

In summary, the developer information plays an auxiliary role by amplifying two signals: the code similarity among contracts from the same developer and the natural motivation of a malicious developer. It can help the model identify Ponzi contracts created by a tainted developer, but developer information alone is not able to recognize Ponzi logic.

To embed this new-found information into our model, we propose an algorithm to construct a structured feature from given contract information, as is shown in Algorithm 1.

The construction aims to convert the developer information to an $(n \times 1)$ boolean vector, where n refers to the number of contracts. The vector is constructed in two steps. First, Algorithm 1 collects all creators who have created a Ponzi contract (lines 8–10). Second, when a new contract arrives, the algorithm checks whether its creator is in the collection and sets the corresponding value of CreatorVector to True or False depending on the result (lines 5–7). In other words, if the element of this vector corresponding to a contract is True, the contract’s creator has created a Ponzi contract before.
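The construction can be sketched in Python as follows. This is an interpretation of the description above, not a transcription of Algorithm 1: it assumes that the contracts are processed in creation order and that each contract is checked against the collection before its own label is used to update it. The function name `build_creator_vector` and the example addresses are hypothetical.

```python
def build_creator_vector(contracts):
    """Build the boolean creator vector for contracts given in creation order.

    contracts is an iterable of (creator_address, is_ponzi) pairs.
    """
    tainted = set()  # creators known to have created a Ponzi contract
    vector = []
    for creator, is_ponzi in contracts:
        # Check first, so a contract never sees its own label.
        vector.append(creator in tainted)
        if is_ponzi:
            tainted.add(creator)
    return vector


contracts = [("0xA", True), ("0xB", False), ("0xA", False), ("0xA", True)]
print(build_creator_vector(contracts))  # [False, False, True, True]
```

Checking before updating means that the first Ponzi contract of a creator is never flagged by this feature, which is consistent with the statement that developer information alone cannot recognize Ponzi logic.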

When the creator vector is retrieved by the algorithm, the feature will be concatenated to other opcode features presented above so as to add a view of the dataset from developer information. The effectiveness of the feature will be evaluated by experiments in Section 5. It is worth noting that the creator vector is the collection of Ponzi contracts’ creator information, meaning that the creator vector can be expanded with upcoming labeled Ponzi contracts. In other words, the model can improve itself with new contracts appended to the dataset.

4.3 Viewer Training

In this stage, the input consists of five types of features, each concatenated with the developer feature and each providing a distinctive view of the dataset. This stage mainly aims to train models based on each of the features. Here, the term “viewer” is used to refer to a machine learning model fitted with the features from one of these views.

A number of classification models have proved useful in many fields, including linear models such as SVM [52] and ridge regression [34], and tree-based models such as the random forest classifier [38] and XGBoost [13]. The five features retrieved in the previous stage are fitted to the above models and the performances of these models are grouped by their training features. We selected the models with the best performance on the testset for each kind of feature. At last, five final models are kept, each of which comprehends the data from a distinctive view according to the features fitted to the model. These models include XGBoost trained with term count feature, N-gram count feature, N-gram TF-IDF feature, and ridge regression trained with TF-IDF feature.

4.4 Ensemble the Viewers

To achieve a comprehensive view of the data, model ensemble techniques are effective to merge the models trained in the previous stage. Given the models with distinctive views of the dataset, we propose a Cascade Multi-view Ensemble algorithm to achieve a comprehensive model. The pseudocode is shown in Algorithm 2.

The ensemble is achieved in two main steps: a cascade ensemble in the training step and a vote-based ensemble in the prediction step.

Algorithm 2 takes as input a collection of viewers V, two threshold values $\theta$ and $\lambda$, a complexity control parameter N, the trainset, and the testset. Among the inputs, the complexity control parameter N restricts the maximum number of iterations in the cascade ensemble step. The threshold $\theta$ denotes the confidence threshold of predictions. In other words, if the predicted probability is larger than $1-\theta$ or smaller than $\theta$, the model is considered to be confident in this prediction. The other parameter $\lambda$ serves the voting mechanism, where a sample receiving more than $\lambda$ votes is considered to be positive.
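The roles of the two thresholds can be made concrete with a short sketch. The function names are hypothetical, and the cutoff of 0.5 used to turn a viewer's probability into a vote is an assumption of this sketch rather than a value stated for Algorithm 2.

```python
def is_confident(prob, theta=0.05):
    """A prediction is confident if it is close enough to 0 or to 1."""
    return prob > 1 - theta or prob < theta


def vote_positive(probs, lam=0, cutoff=0.5):
    """Label a sample positive if more than lam viewers vote positive.

    probs holds one predicted probability per viewer; cutoff is an
    assumed per-viewer decision threshold.
    """
    votes = sum(1 for p in probs if p > cutoff)
    return votes > lam


print(is_confident(0.97), is_confident(0.50), is_confident(0.02))
print(vote_positive([0.2, 0.7, 0.1], lam=0))  # one positive vote suffices
print(vote_positive([0.2, 0.7, 0.1], lam=1))  # more than one vote is needed
```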

At the beginning of Algorithm 2, a cascade ensemble is performed in the training step (lines 3–10). Specifically, the first viewer is fitted on the trainset and then makes predictions on both the trainset and the testset. According to the prediction results, the trainset and testset are each split into a confidently-predicted subset, where the predicted probability is larger than $1 - \theta$ or smaller than $\theta$, and an unconfidently-predicted subset containing the remaining samples. The threshold is commonly chosen as 0.05, a commonly used significance level in statistics. This threshold can be set to other values for certain purposes as well. For example, the threshold measuring confident positive predictions could be set to 0.51 so as to trade precision for recall, where precision and recall are two commonly used metrics in classification. The trade is reasonable considering that missing a Ponzi contract is a more serious problem than misclassifying a normal contract. Secondly, the unconfident trainset and unconfident testset are passed to the second viewer, and a similar procedure is carried out $N-1$ times.

In the end, the testset is split into two subsets: a confidently-predicted subset and an unconfidently-predicted subset. Predictions for the confidently-predicted subset are those made by the cascade-ensemble predictors. For the unconfidently-predicted subset, a voting mechanism is introduced to make predictions (lines 12–25).

The voting mechanism uses all the viewers as voters. In this step, the viewers are trained with all data in the trainset, because the remaining unconfidently-predicted trainset is relatively small and may carry too little information for predicting the remaining testset.
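The two steps can be sketched together as follows. This is a simplified sketch, not a transcription of Algorithm 2: a viewer is modelled as a hypothetical training function `fit(X, y)` that returns a function mapping a sample to a probability, a single $\theta$ is used for all iterations, and a cutoff of 0.5 is assumed when converting probabilities to labels or votes.

```python
def is_confident(prob, theta):
    return prob > 1 - theta or prob < theta


def cascade(trainers, theta, n_max, train_X, train_y, test_X):
    """Cascade step: return (decided, unsure).

    decided maps a test index to its confident label; unsure lists the
    test indices that no viewer predicted confidently.
    """
    train_idx = list(range(len(train_X)))
    test_idx = list(range(len(test_X)))
    decided = {}
    for fit in trainers[:n_max]:
        if not train_idx or not test_idx:
            break
        model = fit([train_X[i] for i in train_idx],
                    [train_y[i] for i in train_idx])
        still_unsure = []
        for i in test_idx:
            p = model(test_X[i])
            if is_confident(p, theta):
                decided[i] = p > 0.5
            else:
                still_unsure.append(i)
        test_idx = still_unsure
        # Only unconfidently-predicted training samples go to the next viewer.
        train_idx = [i for i in train_idx
                     if not is_confident(model(train_X[i]), theta)]
    return decided, test_idx


def vote(trainers, lam, train_X, train_y, x):
    """Voting step: every viewer is retrained on the full trainset."""
    models = [fit(train_X, train_y) for fit in trainers]
    return sum(1 for m in models if m(x) > 0.5) > lam
```

A caller would first run `cascade` and then apply `vote` to each index in the returned `unsure` list.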

From the above steps, we have built a Ponzi scheme detection model based on smart contract bytecode and developer information. Since the detection procedure does not need any human intervention, the model could be integrated into a local platform or local client to monitor the creation of smart Ponzi schemes. The detection pipeline could be divided into two modules: data collection and detection. For the data collection task, the same procedure as in Section 3.2 could be leveraged, i.e., using an Ethereum client such as OpenEthereum⁶ to collect smart contract bytecode and extract developer information from historical transactions. Once a new smart contract is deployed, the corresponding bytecode and developer information could be passed to the detection module. In the detection module, the features in Section 4.2 could be automatically extracted for MulCas to classify the target smart contract. Once a smart contract is detected to be a Ponzi contract, the monitor client could label the contract address and update the developer information. In this way, the client could provide a warning message to users if any of their transactions would involve a smart Ponzi scheme.

5 Evaluations

5.1 Study Setup

5.1.1 Dataset.

We evaluate the Multi-view Cascade Ensemble model on the dataset collected in Section 3.2. As is described in the section, the dataset consists of 6,498 smart contracts in chronological order, among which there are 314 Ponzi contracts.

5.1.2 Metrics.

The popular Recall, Precision, and F1 metrics for classification are used in our work. However, in contrast to traditional supervised learning problems, where these metrics are calculated by cross-validation, there are chronological relationships between contracts in our case study.

In this sense, cross-validation does not work for our dataset. This is because cross-validation assumes that the data points are independent. However, the assumption should be dropped in our case study. For example, given that a large proportion of Ponzi contracts are created by a small group of creators, the coding style or techniques for deploying a Ponzi scheme may vary over time. Thus, using information from the future to predict contracts in the past seems to be easier than using information from the past to predict contracts in the future.

Therefore, we split the dataset in chronological order. By default, the first 80% of the data is used as train data and the rest is taken as test data. Moreover, to avoid the contingency caused by sampling, the performances with different train-test ratios are evaluated in our experiments. Baseline models and the Multi-view Cascade Ensemble model are evaluated on these splits.
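The evaluation protocol can be sketched as follows: the split is made on creation order rather than at random, and the three metrics are computed from the counts of true positives, false positives, and false negatives. The function names and the `block` field are hypothetical.

```python
def chronological_split(contracts, train_ratio=0.8):
    """Split contracts by creation order; the earliest ones form the trainset."""
    ordered = sorted(contracts, key=lambda c: c["block"])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (Ponzi) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


contracts = [{"block": b} for b in (9, 3, 7, 1, 5, 2, 8, 4, 10, 6)]
train, test = chronological_split(contracts)
print([c["block"] for c in test])  # the two most recent contracts
print(precision_recall_f1([1, 1, 1, 0, 0], [1, 0, 0, 1, 0]))
```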

5.1.3 Parameters.

Three parameters play significant roles in our model, i.e., the complexity control parameter N, the confidence threshold $\theta$, and the vote mechanism parameter $\lambda$. In the evaluation step, N is set to 2, so only two viewers are used in the cascade ensemble step to ensure low complexity. $\lambda$ is set to 0 in the experiment, so that a positive prediction from any single viewer is accepted. This is for the reason discussed in Section 4.4, which forms our motivation to trade precision for recall. For convenience, such trades are called recall-emphasis conditions in this paper, meaning that we apply the condition to gain a higher recall score at an acceptable cost in precision. For $\theta$, we set the value to 0.05 in the first iteration of the cascade ensemble, which is a commonly used significance level in statistics. On the other hand, the threshold measuring confident positive predictions is set to 0.51 in the second iteration, again to apply the recall-emphasis condition.

5.1.4 Research Questions.

We designed the experiments to answer the following research questions:

RQ1: How does MulCas perform in terms of sustainability? (Section 5.2.1)

RQ2: Does MulCas outperform baseline approaches? (Section 5.2.2)

RQ3: Are the newly-introduced developer features effective? (Section 5.3.1)

RQ4: How does the model perform against different data imbalance ratios? (Section 5.4.1)

RQs 1 and 2 examine the performance of MulCas, while RQ 3 examines the features extracted from the dataset. The robustness of the model is investigated in RQ 4.

5.2 Performance of MulCas

5.2.1 RQ1: How Does MulCas Perform in Terms of Sustainability?.

While ML-based methods have been widely applied to malware detection tasks, an emerging problem known as model aging has attracted extensive attention in recent years [8, 32, 46, 64]. The performance of these detectors may significantly degrade over time due to the rapid evolution of malware. Malware may rapidly improve its attack strategies, and a detection model built from early data may not be sustainable. Consequently, malware can intentionally or unintentionally avoid being detected by earlier malware detectors. Studies have found that the performance of early Android malware detectors may degrade to 75% after 2 years and below 50% after 3 years [8]. A similar problem also exists in the context of smart Ponzi scheme detection. For example, early Ponzi contracts are implemented based on four Ponzi schemes [1], while a new mixed scheme has been reported in a recent study [15]. Fortunately, machine learning-based classifiers can refresh aged models by retraining. In this section, we evaluate the sustainability of MulCas by training and predicting smart contracts in incremental time intervals.

The distribution of the Ponzi contracts we collected is highly non-uniform. For example, there are 121 Ponzi contracts whose creation block heights are from 5,000,000 to 5,500,000 but only 9 Ponzi contracts whose creation block heights are from 3,500,000 to 4,000,000. For the evaluation metrics used in this study, the result better reflects the performance of detection tools if each prediction interval contains at least a minimum number of positive samples. Therefore, we slice the dataset according to the occurrence of Ponzi contracts rather than slicing it into equal-length time intervals. Specifically, we split the dataset into 6 parts (P0 to P5) based on the creation block heights of Ponzi contracts. P0 consists of the first 50 Ponzi contracts and the benign contracts whose creation block heights are lower than that of the 50th Ponzi contract. P1 contains the 51st to 100th Ponzi contracts and the benign contracts whose creation block heights are higher than that of the 51st Ponzi contract and lower than that of the 100th Ponzi contract. P2, P3, and P4 are obtained by a similar procedure, where the Ponzi contracts are the 101st to 150th, 151st to 200th, and 201st to 250th in the order of their creation block heights. Finally, P5 is composed of the 251st to the last (314th) Ponzi contracts and the benign contracts whose creation block heights lie among those of these Ponzi contracts.
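The slicing can be sketched as follows. The description leaves open where benign contracts that fall between two parts belong (for example, those created after the 50th but before the 51st Ponzi contract); this sketch assumes they go to the later part. The function name `slice_by_ponzi_count` is hypothetical.

```python
def slice_by_ponzi_count(contracts, per_part=50, n_parts=6):
    """Slice (block_height, is_ponzi) pairs into parts by Ponzi occurrence.

    Each part receives per_part Ponzi contracts, except the last part,
    which absorbs all remaining contracts.
    """
    parts = [[] for _ in range(n_parts)]
    seen = 0  # number of Ponzi contracts encountered so far
    for contract in sorted(contracts, key=lambda c: c[0]):
        index = min(seen // per_part, n_parts - 1)
        parts[index].append(contract)
        if contract[1]:
            seen += 1
    return parts


# Toy example with 2 Ponzi contracts per part and 3 parts.
toy = [(1, True), (2, False), (3, True), (4, False), (5, True),
       (6, True), (7, False), (8, True), (9, True), (10, True)]
for part in slice_by_ponzi_count(toy, per_part=2, n_parts=3):
    print(part)
```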

In this section, we first use P0 and P1 to train the detection model and use the model to predict the smart contracts in P2. Then we retrain the model with P0, P1, and P2 (to simulate the retraining procedure in practice) and use the retrained model to predict the smart contracts in P3. Similar retraining and prediction procedures are carried out for P4 and P5. In this way, we collect the performances of the tested models with respect to incremental time intervals. The result is shown in Figure 10. Due to space limitations, only the performances of four typical tools are shown in the figure, i.e., MulCas, XGBoost trained with the Term Count feature (TCXGB) [17], XGBoost trained with the Term Count feature and the Developer Feature (TCXGB-DF), and a Ridge classifier trained with TF-IDF for N-gram and the Developer Feature (TINRIDGE-DF). From the result, we can see that the performance of each detection tool tends to increase as the number of training samples increases before predicting P5. However, the recall score of the tested tools shows a significant decrease when predicting smart contracts in P5. Considering the definition of recall, the drop in recall score indicates that more positive samples are misclassified by the detection model. We found that the drop in performance is mainly caused by the prevalence of a new type of Ponzi contract derived from a new Ponzi contract demo⁷ after block height 5,000,000. Baseline tools perform poorly in predicting this type of Ponzi contract, but models trained with developer features perform more stably on newly emerging Ponzi contracts. Of the 34 new-type Ponzi contracts, 15 are created by tainted creators. MulCas shows the best performance, and its performance is rather robust in predicting smart contracts in incremental time intervals.
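The retraining schedule can be written down compactly. In this sketch, `parts` stands for the list P0 to P5 and each round pairs the accumulated earlier parts (the trainset) with the next part (the testset); the function name `incremental_rounds` is hypothetical.

```python
def incremental_rounds(parts, first_test=2):
    """Return (trainset, testset) pairs for incremental retraining.

    Round k trains on all parts before index first_test + k and tests
    on the part at that index.
    """
    rounds = []
    for t in range(first_test, len(parts)):
        train = [c for part in parts[:t] for c in part]
        rounds.append((train, parts[t]))
    return rounds


parts = [["P0"], ["P1"], ["P2"], ["P3"], ["P4"], ["P5"]]
for train, test in incremental_rounds(parts):
    print(train, "->", test)
```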

Fig. 10.

For comparison with program analysis methods, we have also tested SadPonzi [15] on our dataset. The comparison results are shown in Table 4. Since SadPonzi is built on symbolic execution, it can perform Ponzi contract detection without a prior learning procedure. We have tested SadPonzi on all 6,498 smart contracts in the dataset, of which 4,881 are successfully analyzed. In the failure cases, the tool mostly encountered timeouts due to the path explosion problem of symbolic execution, or references to precompiled contracts that are not implemented. The performance of SadPonzi on all successfully analyzed cases is shown in the “all” column. The performance of MulCas in the same column is N/A because learning-based methods need a trainset to train the detection model. We have compared the performance of SadPonzi and MulCas on P2, P3, P4, and P5. It can be seen that MulCas shows comparable performance on the early sets P2 and P3, where SadPonzi tends to have fewer false positives and MulCas tends to have fewer false negatives. However, the performance of SadPonzi drops in P4 and P5. The reason is that SadPonzi proposes four expert rules to capture the Ether transfer and redistribution behaviors of Ponzi schemes, but these rules may not adapt to newly emerging scam patterns. As discussed in the previous evaluation, a new type of Ponzi contract has recently appeared. This new type of Ponzi scheme is deployed based on an ERC-20 token trading contract, where the Ponzi reward is manifested in the increasing value of the Ponzi token. Therefore, the expert rules proposed by SadPonzi fail to detect this new type of Ponzi contract.

Table 4.

Method	Metric	all	P2	P3	P4	P5
	Precision	0.51	0.33	0.42	0.18	0.24
SadPonzi	Recall	0.59	1.0	0.71	0.25	0.18
	F1	0.55	0.5	0.53	0.21	0.20
	Precision	N/A	0.88	0.96	0.81	0.95
MulCas	Recall	N/A	0.38	0.32	0.94	0.67
	F1	N/A	0.53	0.48	0.87	0.79

Table 4. Comparison between MulCas and Symbolic Execution based Method SadPonzi [15]

Answer to RQ1: MulCas performs better than baseline models in terms of sustainability.

5.2.2 RQ2: Does MulCas Outperform Baseline Approaches?.

To answer this question, we take the model described in [17] and derive several other baseline models based on the extracted features. First, we use an XGBoost model trained with term count features as presented in [17]. The remaining baseline models are constructed by fitting the features in Section 4.2 to linear classification models, i.e., SVM and the ridge classifier, and to tree-based classification models, i.e., XGBoost. For simplicity, we name these models in the form “Feature-Classifier” in the following sections. For example, “NCXGB”, “TCXGB”, “W2VXGB”, “TINXGB”, and “TIXGB” are short for XGBoost models trained with N-gram Count, Term Count, Word2Vec embedding, TF-IDF for N-gram, and TF-IDF features, respectively. In addition, the symbolic execution method for detecting Ponzi contracts, SadPonzi [15], is also considered. Comparisons are made between these models and our Multi-view Cascade Ensemble model.

For the evaluation of machine learning based detectors, we use the dataset splitting method described in the last section. Specifically, the trainset consists of the 1st to 250th Ponzi contracts and the benign contracts in the corresponding range of creation block heights. The testset consists of the remaining smart contracts. As a result, the trainset contains 5,990 smart contracts and the testset contains 508 smart contracts. For the symbolic execution based detector SadPonzi [15], we show its performance on all successfully analyzed smart contracts. The results are shown in Table 5. The following conclusions can be drawn.

Table 5.

Model	Feature	Recall	Precision	F1
	Ngram Count	0.453	0.829	0.586
	Term Count	0.406	0.839	0.547
Ridge	Word2Vec	0.000	0.000	0.000
	TF-IDF for Ngram	0.328	0.913	0.483
	TF-IDF	0.250	0.889	0.390
	Ngram Count	0.375	0.923	0.533
	Term Count	0.328	0.913	0.483
SVM	Word2Vec	0.266	0.895	0.410
	TF-IDF for Ngram	0.281	0.900	0.429
	TF-IDF	0.281	0.900	0.429
	Ngram Count	0.219	0.875	0.350
	Term Count [17]	0.219	0.875	0.350
XGBoost	Word2Vec	0.219	0.875	0.350
	TF-IDF for Ngram	0.234	0.882	0.370
	TF-IDF	0.234	0.882	0.370
SadPonzi	N/A	0.52	0.59	0.55
MulCas	hybrid	0.674	0.951	0.789

Table 5. Overall Evaluation Results on the Benchmark

•

MulCas achieves a considerably higher recall score than the baseline models. The recall metric has increased by about 0.3 compared with the average recall score of the baseline models. Moreover, MulCas has achieved the best F1 score of 0.789, meaning that the model performs best considering both recall and precision.

•

The precision score of all models is considerably higher than the recall score due to the class imbalance problem. Referring to the definitions of precision and recall, the result indicates that the detection models tend to classify positive samples into the negative class rather than classify negative samples into the positive class. This phenomenon is reasonable considering the class imbalance problem in the malware detection task. In the real-world scenario, there are many more benign smart contracts than malicious Ponzi contracts. In the scope of our dataset, the class imbalance ratio is about 1:20. Machine learning based models trained on such imbalanced datasets tend to classify minority-class samples into the majority class. As a result, the precision scores are significantly higher than the recall scores.

Answer to RQ 2: MulCas can achieve a significantly higher recall score while maintaining precision score and thus improve the F1 score of the result. It outperforms all the baseline models.

5.3 Feature Importance

5.3.1 RQ3: Are the Newly-introduced Developer Features Effective?.

The effectiveness of the developer features is evaluated on the baseline models, since they are the most basic models and do not introduce the additional variables of ensemble methods. The results are presented in Table 6.

Table 6.

		TFXGB	TCXGB	W2VSVM	TINRIDGE
Without	Recall	0.234	0.219	0.266	0.328
Developer	Precision	0.882	0.875	0.895	0.913
Features	F1	0.370	0.350	0.410	0.483
With	Recall	0.547	0.563	0.266	0.531
Developer	Precision	0.946	0.947	0.895	0.971
Features	F1	0.693	0.706	0.410	0.596

Table 6. The Effectiveness of Developer Features

As is shown in the table, all XGBoost and Ridge models perform better on F1 score after training with the developer feature. As discussed in RQ1, 34 of the 64 Ponzi contracts in the testset are variants of a newly proposed Ponzi token demo⁸. From the developer information, we found that 14 of these new-type Ponzi contracts are from tainted creators. Therefore, the developer information has significantly improved the performance of the detection models on the Ponzi contracts created by tainted creators, resulting in the increase in recall score.

Answer to RQ 3: the developer features improve the performance when classifying the new type Ponzi contracts.

5.4 Robustness of MulCas

5.4.1 RQ4: How Does the Model Perform against Different Data Imbalance Ratios?.

Classification on imbalanced datasets has been an issue of interest in many research fields. In our case study, the imbalance ratio is about 1:20 (positive sample size vs. negative sample size), indicating that this is a highly imbalanced dataset. To illustrate how MulCas performs on datasets with different imbalance ratios, we apply random under-sampling to our dataset with ratios in {1:1, 1:2, 1:5, 1:10, 1:20}. Comparisons are made between MulCas and the same baseline models as in RQ1, i.e., TCXGB, TCXGB-DF, and TINRIDGE-DF. The results are presented in Figure 11. Besides, the recall-emphasis condition is not applied throughout this experiment because this condition mainly works for a highly imbalanced dataset.
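Random under-sampling to a target ratio can be sketched as follows. This sketch assumes that all positive samples are kept and that negatives are drawn without replacement until there are `ratio` negatives per positive; the function name `undersample` and the fixed seed are illustrative choices.

```python
import random


def undersample(samples, ratio, seed=0):
    """Keep all positives and at most ratio negatives per positive.

    samples is a list of (features, is_ponzi) pairs; a ratio of 5
    corresponds to an imbalance ratio of 1:5.
    """
    positives = [s for s in samples if s[1]]
    negatives = [s for s in samples if not s[1]]
    k = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed for reproducibility
    return positives + rng.sample(negatives, k)


samples = [(i, True) for i in range(3)] + [(i, False) for i in range(60)]
for ratio in (1, 2, 5, 10, 20):
    subset = undersample(samples, ratio)
    print(ratio, len(subset))
```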

Fig. 11.

As is illustrated in the figure, the performances of all the models tend to decrease as the dataset becomes more and more imbalanced. A sharp decrease can be seen in both metrics for the baseline models. In contrast, though a similar decrease can be found in the performance of MulCas, it is worth noting that the decreasing trend is much smoother. Besides, MulCas has achieved the best recall scores and F1 scores throughout the experiment.

Answer to RQ4: MulCas has shown relatively robust performance while the dataset is getting more and more imbalanced.

6 Threats to Validity

6.1 Dataset Distribution

Our results face common validity threats due to the biased distribution of the dataset. The dataset in Section 3.2 is built from open-source smart contracts because of two main concerns: feasibility and reliability. Since Ponzi logic is a high-level semantic property that lies behind the direct control flow or data flow of a contract, manually recognizing Ponzi logic from bytecode is neither feasible nor reliable. Decompilation tools [25] can mitigate the difficulty, but the errors introduced by these tools and by recognizing Ponzi logic from the decompiled code can still add imprecision to our dataset. Thus, while our detection model operates at the bytecode level so that it can analyze all smart contracts deployed on Ethereum, our evaluation is based purely on smart contracts with available source code. The distribution of open-source smart contracts may differ from that of the entire smart contract population on Ethereum. Thus, our results are best interpreted with respect to the benchmarks we analyzed.

From a practical view, however, building an automatic detection tool for Ponzi schemes is necessary for two reasons: (1) Blockchain users may not be proficient in programming and smart contracts. They may interact with some DApps through a Web interface. Thus, users may not be experienced enough to tell whether a smart contract implements Ponzi logic by reading the contract code. (2) As described in the Introduction section, one of the main motivations of the proposed work is to detect latent Ponzi contracts based only on bytecode. It may be unrealistic to identify these latent Ponzi contracts by reading the bytecode or opcodes. Our methods can provide a warning for users before they interact with bytecode-only contracts. Considering that plenty of latent Ponzi contracts are running on Ethereum (as found in the previous studies by Bartoletti et al. [1], Chen et al. [17], and SadPonzi [15]), our methods can benefit Ponzi scheme detection across the whole Ethereum ecosystem. Therefore, although ML methods have weaknesses such as bias toward the dataset, using them for Ponzi scheme detection has practical value.

6.2 Robustness against Adversarial Attacks

While the experiments have studied the robustness of Ponzi contract detection methods, we essentially treated adversarial attacks as an orthogonal problem. However, robustness against adversarial attacks remains a practical concern. Since our method mainly relies on treating the bytecode as natural language tokens, adversaries may deceive the model through deliberate code injection. Most of our captured features (i.e., word frequency, TF-IDF, N-gram) focus on the quantitative information of opcode tokens. One possible adversarial attack is to inject dead code to distort this quantitative information. However, junk code injection in the context of Ethereum incurs the overhead of extra gas. Gas is a mechanism introduced by Ethereum to measure the computational cost of operations on the Ethereum network. A smart contract costs gas in two phases: (1) contract deployment; (2) contract calls. Generally speaking, the longer a smart contract is, the more gas it costs [12, 26]. Therefore, this economic concern may hinder effective junk code injection.
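The trade-off described above can be made concrete with a small sketch. It assumes a toy opcode sequence with immediate operands omitted, a hypothetical junk pattern appended after `STOP`, and it counts only the code-deposit charge of 200 gas per byte of deployed code from the Ethereum fee schedule, so the reported cost is a lower bound on the real deployment overhead. The helper names `term_frequency`, `ngram_counts`, and `byte_size` are illustrative and are not part of MulCas.

```python
from collections import Counter

# Code-deposit charge per byte of deployed runtime code (Ethereum fee schedule).
CODE_DEPOSIT_GAS_PER_BYTE = 200


def term_frequency(opcodes):
    """Relative frequency of each opcode token in one contract."""
    counts = Counter(opcodes)
    return {op: c / len(opcodes) for op, c in counts.items()}


def ngram_counts(opcodes, n):
    """Count contiguous opcode n-grams."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def byte_size(opcodes):
    """PUSH1 carries one immediate byte; every other opcode used here is one byte."""
    return sum(2 if op == "PUSH1" else 1 for op in opcodes)


# Toy contract body (hypothetical, not taken from the dataset).
original = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO", "STOP"]
# Hypothetical junk appended after STOP; no jump targets it, so it never executes.
junk = ["JUMPDEST", "PUSH1", "POP"] * 4
padded = original + junk

extra_gas = (byte_size(padded) - byte_size(original)) * CODE_DEPOSIT_GAS_PER_BYTE
print("CALLVALUE frequency:", term_frequency(original)["CALLVALUE"],
      "->", term_frequency(padded)["CALLVALUE"])
print("new bigram (PUSH1, POP):", ngram_counts(padded, 2)[("PUSH1", "POP")])
print("extra code-deposit gas:", extra_gas)
```

In this toy case, twelve junk tokens dilute the frequency of `CALLVALUE` and introduce bigrams that were absent from the original contract, at the price of sixteen additional bytes of deployed code. Shifting the features of a realistic contract would require proportionally more padding and therefore more gas.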

On the other hand, we note that the Ponzi scheme detection problem on Ethereum is still at an early stage for two reasons: (1) most recently reported Ponzi contracts⁹ have not yet adopted any schemes to evade existing detection methods, such as those proposed in [12]; (2) to the best of our knowledge, no platforms integrate existing Ponzi contract detection tools, while plenty of smart contract auction platforms have adopted tools developed in the field of smart contract vulnerability detection [6, 54]. Therefore, we leave developing a Ponzi scheme detector that is robust against adversarial attacks for future work.

6.3 Update of the Solidity Compiler Version

An additional limitation of MulCas comes from the rapid evolution of Solidity compilers. Solidity is a newly emerging programming language, and its compiler has gone through several important updates to fix vulnerabilities and introduce compile options such as dead code elimination. Due to these rapid changes, the same smart contract may be compiled to slightly different bytecode under different compiler versions. A more detailed study of the differences introduced by compiler versions can be found in the work of Huang et al. [29]. Along with Solidity compiler evolution, unpredictable contract evolution may be another challenge for our code-based detector. For example, as the DAO attack brought the reentrancy vulnerability to public attention, developers have now adopted various approaches to exclude the reentrancy vulnerability from their smart contracts. Such continuous and unpredictable contract evolution may be another trigger for retraining the MulCas detector.

6.4 Detection for Inter-contract Ponzi Schemes

The proposed detection method targets individual smart contracts. The model ignores the interactions between multiple smart contracts, even when they belong to the same DApp. Therefore, the detection model may miss Ponzi schemes that are deployed across multiple smart contracts collaborating toward malicious behavior. Smart contracts deployed on Ethereum do not necessarily indicate which DApp they belong to. There has been research on modeling inter-contract interactions by performing a backward data-flow analysis to recover the contract invocation relationship [5]. Through contract invocation recovery, multiple smart contracts could be manually incorporated into an integrated smart contract for analysis.

7 Related Work

In recent years, blockchain technology has begun to receive extensive attention from researchers. On the one hand, the identification of smart Ponzi schemes belongs to the field of general analysis for smart contracts; on the other hand, it can also be regarded as the detection of malicious software. Next, we summarize relevant research results from two aspects: smart contract analysis and malware detection.

7.1 Smart Contract Analysis

Smart contracts are a key component of blockchain 2.0 and the building blocks of DApps. Since smart contracts are essentially a collection of bytecode running on Ethereum, vulnerability concerns also exist in the programming of smart contracts. Vulnerabilities have been extensively studied in traditional software. Combinatorial testing [24] and a combination of dynamic and static features [35] are used for fault localization. Niu et al. [42] further studied Minimal Failure-Causing Schema with respect to multiple faults. However, when it comes to smart contracts, the decentralized nature has brought new vulnerability concerns. To ensure the security of contracts, various vulnerability identification tools have been proposed [4, 33, 39]. Yu et al. [62] proposed a general-purpose approach to search for and repair bugs in smart contracts. Besides, security properties have been studied as reachability properties based on an abstraction of EVM bytecode [51]. To understand the potential impact of smart contract vulnerabilities, Sayeed et al. [50] analyzed the 7 most important attack techniques and revealed that, even when the 10 most widely used tools are adopted to detect smart contract vulnerabilities, known vulnerabilities still exist. This fact makes the development of secure smart contracts from a software engineering perspective an important research direction [20, 58].

7.2 Malware Detection

Plenty of studies have addressed malware detection in traditional software. Some of these studies leverage machine learning technologies to build classification models. Cai et al. [9] used a diverse set of dynamic features for Android app classification. Di Xue et al. [61] deployed convolutional neural networks to analyze static features and variable n-grams for dynamic features. Other works studied the sustainability of these learning-based detectors and proposed classification systems based on a new behavioral profile for Android apps [8]. Besides, Java bytecode has also been shown to be effective for detecting Android malware families [11]. On the other hand, malicious scams are also prevalent in the blockchain ecosystem due to its decentralized nature and the lack of regulation, for example, money laundering [22, 41, 55], Ponzi schemes [3, 56], and market manipulation [16, 23]. A summary of this problem can be found in [57]. To identify these scams, identification models based on machine learning and data mining are widely used. For Ponzi schemes in the Bitcoin ecosystem, Bartoletti et al. built a machine learning model by extracting transaction records and constructing features [3]. As for smart Ponzi schemes on Ethereum, Chen et al. [17, 18] first proposed a machine learning based identification framework, which is also the baseline of this study. Ibba et al. [30] and Fan et al. [21] explored Ponzi scheme detection using other machine learning models such as decision trees, support vector machines, and gradient boosting algorithms. Apart from machine learning based methods, static program analysis methods such as symbolic execution have also been leveraged to detect Ponzi schemes on Ethereum [15]. For phishing scam identification, graph-based methods are widely used in recent studies [14, 60].

8 Conclusion

We construct a new dataset for smart Ponzi scheme detection, which is larger than the dataset used in [17], and extract many features from different perspectives. Based on this, we propose MulCas, a Multi-view Cascade Ensemble method for detecting smart Ponzi schemes on Ethereum at creation time. Our extensive evaluation results show that MulCas outperforms the state-of-the-art approaches and many other baseline methods in terms of F1 and robustness. Because it does not rely on the interaction information of contracts, our model can give identification results at creation time. Given the short lifetime of most Ponzi contracts, identification using static features can significantly alleviate the damage caused by Ponzi contracts on Ethereum.

Footnotes

¹

https://fomo3d.hostedwiki.co/.

²

http://www.cryptokitties.co/.

³

https://solidity.readthedocs.io/en/latest/.

⁴

https://github.com/openethereum/openethereum.

⁵

https://github.com/crytic/pyevmasm.

⁶

https://github.com/openethereum/openethereum.

⁷

https://test.jochen-hoenicke.de/crypto/ponzitoken/.

⁸

https://test.jochen-hoenicke.de/crypto/ponzitoken/.

⁹

https://etherscan.io/accounts/label/ponzi.

References

[1]

Massimo Bartoletti, Salvatore Carta, Tiziana Cimoli, and Roberto Saia. 2017. Dissecting Ponzi schemes on Ethereum: Identification, analysis, and impact. arXiv:1703.03779
