In this section, we describe the steps and phases required to break CaptchaStar. Note that only two of the three steps are described, as the last one was not needed: the CAPTCHA was already broken in step 2.
4.1 Step 1: Black-Box Analysis
BASECASS recommends creating a way to interact automatically with the CAPTCHA under analysis. In this case, this was done through a Python program that was able to download a challenge, send an answer to the CaptchaStar server, grab the response, and record a log of it. The answers were initially provided by humans through a replica of the CaptchaStar interface. Later, we found that CaptchaStar allows requesting the validity of different answers for the same challenge, which allowed us to use it as an oracle to also find the corresponding solution.
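A minimal sketch of such an interaction client is shown below. It assumes a simple HTTP interface; the base URL, endpoint paths, and parameter names are hypothetical placeholders, as they depend on the concrete CaptchaStar demo deployment.

```python
# Minimal sketch of an automated CaptchaStar client. Endpoint URLs and
# parameter names are hypothetical placeholders, not the real API.
import json
import requests

BASE_URL = "https://captchastar.example.org"  # placeholder for the demo server

def download_challenge(session):
    """Fetch a new challenge and return its identifier and star data."""
    resp = session.get(f"{BASE_URL}/challenge")
    resp.raise_for_status()
    return resp.json()  # e.g., {"id": ..., "stars": [...]}

def submit_answer(session, challenge_id, x, y):
    """Send the chosen cursor coordinates and return the server's verdict."""
    resp = session.post(f"{BASE_URL}/verify",
                        data={"id": challenge_id, "x": x, "y": y})
    resp.raise_for_status()
    return resp.json().get("passed", False)

if __name__ == "__main__":
    with requests.Session() as s, open("captchastar_log.jsonl", "a") as log:
        challenge = download_challenge(s)
        passed = submit_answer(s, challenge["id"], 150, 150)
        log.write(json.dumps({"id": challenge["id"], "answer": (150, 150),
                              "passed": passed}) + "\n")
```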
BASECASS recommends comparing the theoretical size of the base problem with the actual size of the challenges proposed by the CAPTCHA, as a way to compare its strength to that of the base AI problem. In this case, the size of P can be very roughly estimated by counting how many different black-and-white images of \(300 \times 300\) pixels there can be, if the pixel size for the image is indeed 4 pixels and if we restrict ourselves to no more than \(80\%\) of the pixels in white. This leads to \(2^{0.8 \times \frac{300 \times 300}{4 \times 4}} = 2^{4500} \approx 4 \times 10^{1354}\). This is just an estimation, as many of these possible images would be nonsense and thus could not be used as solutions. To estimate the size of H, we downloaded \(2,\!000\) images and checked that they used \(1,\!631\) different base images. Note that we have counted the number of images, but not the transformations performed on them. When CaptchaStar presented the same image to the user, the transformation on its pixels was different, so there is theoretically no way for an attacker to reuse a previously solved challenge to pass a new one. Even though the challenge space H in CaptchaStar is smaller than P, thanks to the number of possible transformations it is big enough to prevent a brute-force attack based on repeated challenges.
During our interactions with CaptchaStar, we were able to verify that solutions that were not optimal were still accepted if they were not too far from the optimal one. We determined that any solution in a \(12 \times 12\) pixel square around the optimal solution would be accepted too, reducing the answer space that needs to be explored to \(\frac{300 \times 300}{12 \times 12} = 625\) regions, with a single brute-force guess success rate of roughly \(0.16\%\).
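The back-of-the-envelope figures used above can be reproduced with a few lines of Python; the constants (canvas size, pixel size, 80% white-pixel cap, and 12-pixel tolerance) are the ones measured or assumed in this step.

```python
from math import log10

# Rough upper bound on the size of the problem space P: black-and-white images
# of 300x300 pixels with an effective pixel size of 4x4 and at most 80% white.
effective_pixels = (300 * 300) / (4 * 4)        # 5,625 effective pixels
log10_P = 0.8 * effective_pixels * log10(2)     # approx. 1354.6
print(f"|P| ~ 10^{log10_P:.1f}")                # about 4 x 10^1354

# Answer space after accounting for the 12x12-pixel acceptance square, and the
# success rate of a single brute-force (random) guess.
regions = (300 * 300) / (12 * 12)               # 625 distinguishable regions
print(f"{regions:.0f} regions -> single-guess success of {100 / regions:.2f}%")
```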
We found that the demo implementation of CaptchaStar allows testing several solutions for a single challenge. This allowed us to estimate the distribution of correct answers and compare it to a uniform distribution. After solving \(5,\!451\) challenges, we produced a heat map (here plotted using Gaussian smoothing) of the centers of correct answers, which can be seen in Figure 2. As can be seen, CaptchaStar seldom produces challenges whose answers lie close to the borders of the \(300 \times 300\) pixel canvas. Interestingly, the peaks of the distribution are more accentuated in the case of a pseudo-random uniform distribution than in CaptchaStar. Pearson's \(\chi ^2\) statistic is \(92,\!474.15\), indicating a p-value \(\lt 0.00001\), which is significant for a distribution with \(89,\!999\) degrees of freedom (at a significance level of 0.05).
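As an illustration, a goodness-of-fit comparison of the observed answer centers against a uniform distribution could be sketched as follows; the variable centers is a placeholder for the list of (x, y) centers collected from the solved challenges.

```python
import numpy as np
from scipy.stats import chi2

def uniformity_test(centers, canvas=300):
    """Pearson chi-squared goodness-of-fit test of the answer centers
    against a uniform distribution over the canvas."""
    # Observed counts over every pixel of the 300x300 canvas.
    observed, _, _ = np.histogram2d(
        [c[0] for c in centers], [c[1] for c in centers],
        bins=canvas, range=[[0, canvas], [0, canvas]])
    observed = observed.ravel()
    expected = np.full_like(observed, len(centers) / observed.size)
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = observed.size - 1          # 90,000 cells -> 89,999 degrees of freedom
    p_value = chi2.sf(stat, dof)
    return stat, dof, p_value
```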
Even though CaptchaStar allows a margin of error of 12 pixels in the answer, and even though the number of images used is limited, the transformations performed on them and the semi-random choice of the center of correct answers make it resilient enough against a brute-force attack. The fact that several answers can be tested using the CaptchaStar demo implementation allows for an oracle attack, but this can easily be fixed by the designers of CaptchaStar.
4.2 Step 2: Statistical/ML Analysis
Having completed the first step of BASECASS, we proceed to the second: the S/ML analysis. To do so, we have to determine which metrics to use and whether any de-noising, pre-processing, or transformation would be beneficial. The images were not altered in a way that is easy to undo; this alteration is precisely what embeds the problem on which the CAPTCHA is based. Thus, we have to process the images as they are.
To define which metrics could be of interest, we picked some well-known metrics that capture basic information from each image:
• General purpose metrics:
– Results of the ENT test of randomness [16]. A brief summary of this test battery follows:
* Entropy. This test computes the entropy [5] of the sequence under examination, as defined in classical information theory. A random sequence is rich in entropy.
* Chi-squared (\(\chi ^{2}\)) test. This is a widely used test described by Knuth in the classical book The Art of Computer Programming [9]. The chi-squared test computes the frequency of the symbols and compares it with the theoretical frequency in a random sequence. To perform this comparison, the \(\chi ^{2}\) statistic is computed.
* Arithmetic mean. As the name suggests, this is just the arithmetic mean of the symbols in the sequence. The expected value for a truly random sequence is 0.5 in binary mode and 127.5 in byte mode.
* Monte Carlo value for \(\pi\). This test interprets the sequence as coordinates in a square and counts the number of points that fall into a circle inscribed within the square. This number is used to estimate the value of \(\pi\). A truly random sequence will approximate \(\pi\), whereas a non-random sequence is not expected to approximate \(\pi\) well. The test reports the statistic \(\pi _{error}\), the error between \(\pi\) and its estimation \(\hat{\pi }\).
* Serial correlation [9]. Serial correlation computes the correlation of each symbol in the sequence with its previous symbol. A good random sequence gives a correlation close to 0, whereas a bad random sequence gives a correlation close to 1.
– Size after compression: This gives an estimate of the amount of information contained in the image. We used the JPEG compression algorithm as implemented by the PILLOW Python library, with qualities of 1 and 95 (the lowest and highest recommended). A sketch of how these per-image metrics can be computed is shown after this list.
• Ad hoc metrics: We did not use any ad hoc metric, as we did not find one that we thought could be relevant to CaptchaStar.
• Comparative metrics: The answer space is quite large, so we decided to follow the BASECASS recommendation and create comparative metrics within the same challenge for all numerical results of the ENT test and for the size results. We did so by normalizing all numeric answer ranges within the same challenge.
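As an illustration of the per-image metrics described above, the following sketch computes a rough Python approximation of some of the ENT statistics together with the compression-size metric for a single rendered image. The actual analysis relied on the ENT tool and the PILLOW library; this code only mirrors their definitions and is not the exact implementation used.

```python
import io
from collections import Counter
from math import log2, pi

import numpy as np
from PIL import Image

def image_metrics(img: Image.Image) -> dict:
    """Approximate ENT-style statistics plus JPEG compression sizes
    for one rendered challenge image."""
    data = np.asarray(img.convert("L")).ravel()      # bytes of the grayscale image

    # Shannon entropy of the byte sequence (bits per byte).
    counts = Counter(data.tolist())
    probs = [c / len(data) for c in counts.values()]
    entropy = -sum(p * log2(p) for p in probs)

    # Arithmetic mean of the bytes (127.5 expected for a random byte sequence).
    mean = float(data.mean())

    # Serial correlation between consecutive bytes.
    serial_corr = float(np.corrcoef(data[:-1], data[1:])[0, 1])

    # Monte Carlo estimate of pi: byte pairs as (x, y) points in the unit square.
    xs, ys = data[0::2][: len(data) // 2], data[1::2][: len(data) // 2]
    inside = ((xs / 255.0) ** 2 + (ys / 255.0) ** 2 <= 1.0).sum()
    pi_error = abs(pi - 4.0 * inside / len(xs))

    # Size after JPEG compression at the lowest and highest recommended quality.
    sizes = {}
    for q in (1, 95):
        buf = io.BytesIO()
        img.convert("L").save(buf, format="JPEG", quality=q)
        sizes[q] = buf.tell()

    return {"entropy": entropy, "mean": mean, "serial_corr": serial_corr,
            "pi_error": pi_error, "jpeg_q1": sizes[1], "jpeg_q95": sizes[95]}
```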
We decided to run the tests with these simple metrics and see whether they would allow for proper classification. A summary with a brief description of the metrics used in our BASECASS attack on CaptchaStar is presented in Table 1.
We then had to determine the training and test sets to use for ML. We used the 5,451 challenges downloaded and answered in the previous step. To create a training/test set, we applied these metrics to the images that result from placing the cursor at different positions. In particular, we created two sets:
(1) The first one contained the images resulting from dividing each axis of the answer space into three parts, that is, the candidate coordinates \((0,0), (150,0), (300,0), (0,150), (150,150), \ldots, (300,300)\). We included another \(3 \times 3\) coordinates derived from offsets of \((-10, 0, +10)\) pixels around the center of the correct answer (a sketch of how these coordinates can be generated is shown after this list). This produced a maximum total of 18 images per challenge. To create the training/test files, we applied the aforementioned metrics to these images. We will call this the simple dataset.
(2) The second one contains images resulting from dividing each axis of the answer space into five parts. Similarly, it contains another \(5 \times 5\) coordinates derived from dividing the \([-10 \ldots 10]\) offset range into five parts. In addition, we added coordinates at offsets of \([-1,1]\) from the center of the correct answer. Note that even though these coordinates are marked as correct by CaptchaStar, we labeled them as wrong to see whether some ML algorithms are able to differentiate between almost-perfect and perfect solutions. In total, this produced a maximum of 59 images per challenge, to which we applied the aforementioned metrics to produce the corresponding file. We will call this the detailed dataset.
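The following sketch illustrates how the candidate cursor positions for the simple dataset could be generated for one challenge; cx and cy denote the center of the correct answer obtained through the oracle in step 1, and the function name is only illustrative.

```python
from itertools import product

def simple_dataset_coordinates(cx, cy, canvas=300):
    """Candidate cursor positions for one challenge of the 'simple' dataset:
    a 3x3 grid over the whole canvas plus a 3x3 grid of (-10, 0, +10) pixel
    offsets around the correct center (cx, cy) -- up to 18 positions."""
    coarse = list(product((0, canvas // 2, canvas), repeat=2))
    around = [(cx + dx, cy + dy) for dx, dy in product((-10, 0, 10), repeat=2)]
    # Discard positions falling outside the canvas (can happen near its borders).
    return [(x, y) for x, y in coarse + around
            if 0 <= x <= canvas and 0 <= y <= canvas]
```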
We used the ML framework Weka, as it includes several classifiers that can be run with default parameters out of the box. This allowed us to create a single dataset (in ARFF format) and use it with all of the classifiers, comparing results. To test the performance of each classifier, we have to settle on a metric. When dealing with balanced datasets, a simple metric like the accuracy or the area under the ROC curve can be adequate. In contrast, for heavily imbalanced datasets, that is, those in which one class is much less represented than another, other metrics like the \(f_1\) score, the area under the precision/recall curve, or the \(\kappa\) metric are better.
In the case of CaptchaStar, most examples are negative (do not pass the CAPTCHA) and only a few are positive, so the dataset is heavily imbalanced. Therefore, to determine which classifier performed best, we decided to use the \(\kappa\) metric, as it is more meaningful than the accuracy and similar metrics when dealing with imbalanced training and test sets like ours, with only \(1/18\) and \(1/59\) correct answers, respectively.
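For reference, the \(\kappa\) (Cohen's kappa) statistic can be computed from a binary confusion matrix as follows; the counts in the usage example are illustrative and not taken from our experiments.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement between predictions and labels,
    corrected for the agreement expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal frequencies of each class.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative example on an imbalanced dataset (1 positive in 18): a classifier
# that labels everything negative gets a high accuracy but a kappa of 0.
print(cohen_kappa(tp=0, fp=0, fn=100, tn=1700))   # 0.0
```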
There are different ways to create the training and test sets from the data. A typical approach, applicable in most cases, is cross-validation, in which testing is repeated with a different test set each time and the results are averaged. In this case, we decided to use threefold cross-validation.
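As an example, a single Weka classifier can be evaluated with threefold cross-validation from the command line (invoked here through Python). The ARFF file name and the path to weka.jar are placeholders; -t and -x are Weka's standard options for the training file and the number of cross-validation folds.

```python
import subprocess

# Evaluate one Weka classifier with threefold cross-validation on the simple
# dataset. File names and the weka.jar path are placeholders.
subprocess.run(
    ["java", "-cp", "weka.jar",
     "weka.classifiers.trees.J48",
     "-t", "captchastar_simple.arff",   # training file
     "-x", "3"],                        # number of cross-validation folds
    check=True,
)
```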
We tested both datasets with different ML algorithms to determine which were more successful. Tables 2 and 3 show the top classifiers by their \(\kappa\) metric for the simple and detailed datasets, respectively. Of a total of 163 classifiers in Weka, only 37 and 34, respectively, were able to load the data and present a solution within the time-out (5 minutes).
Many ML algorithms are able to classify the simple dataset with a high \(\kappa\) value. In particular, the \(meta.RandomCommittee\) (an ensemble of randomizable base classifiers), the \(functions.Logistic\) (a multinomial logistic regression model with a ridge estimator), and two tree-based classifiers (\(trees.RandomTree\) and \(trees.J48\)) obtained the best results. They all obtain a \(\kappa\) of 0.99 and a perfect accuracy. This implies that an attack using any of them might be feasible.
For the second training/test set, the detailed dataset, we got worse results, as expected. At the top of the scale there is again a meta classifier (an ensemble). The first pure classifier that reaches a decent solution is J48, with a \(\kappa\) of 0.36. Even though this is not very high, it should be enough to perform an attack.