1 Introduction

Likelihood-based Imprecise Regression (LIR) is a recently introduced approach to regression for imprecisely observed quantities (see Cattaneo and Wiencierz 2012, 2011). In this approach, it is assumed that the available data are coarse in the sense of Heitjan and Rubin (1991). That is, precise values of the quantities of interest exist, but we cannot observe them directly. Instead, we have only imprecise observations: these are subsets of the sample space, which we know to contain the precise values of the quantities of interest.

At the two extremes of the range of possible imprecise observations are the precise observations and the missing data, respectively. We have a precise observation when the imprecise observation contains a single value, which we then know to be the precise value of the quantity of interest (which in this case is thus indirectly observed). At the other extreme we have the missing data, which occur when the imprecise observation is the whole sample space, since in this case we learn nothing about the precise value of the quantity of interest.

Between these two extremes lies the whole range of possible imprecise observations, which can be any subset of the sample space. In particular, it can be argued that continuous quantities are always imprecisely observed, since no measuring device can be infinitely precise. Therefore, regression for imprecisely observed quantities is certainly an important topic in statistics. In fact, various regression methods have been proposed in several special cases (see for example Beaton et al. 1976; Buckley and James 1979; Dempster and Rubin 1983; Li and Zhang 1998; Pötter 2000; Manski and Tamer 2002; Marino and Palumbo 2002; Gioia and Lauro 2005; Ferson et al. 2007; Chen and Van Keilegom 2009; Utkin and Coolen 2011). In contrast to most of these proposals, LIR approaches the problem of regression with imprecisely observed quantities from a very general perspective.

The imprecise observations induce a likelihood function on the joint probability distributions of the random variables and random sets representing the precise values and imprecise observations, respectively. The result of a LIR analysis consists of all regression functions that cannot be excluded on the basis of likelihood inference. Hence, the result of a LIR analysis is in general set-valued (set-valued results are obtained for instance also by Manski and Tamer 2002; Marino and Palumbo 2002; Gioia and Lauro 2005; Vansteelandt et al. 2006; Ferson et al. 2007). The extent of the set-valued result of a LIR analysis reflects the whole uncertainty in the regression problem with imprecisely observed quantities. That is, it encompasses the statistical uncertainty due to the finite sample as well as the indetermination related to the fact that the quantities are only imprecisely observed (these two kinds of uncertainty in the set-valued results are discerned for example also by Manski and Tamer 2002; Vansteelandt et al. 2006).

In the present paper we consider a robust LIR method, in which the residuals’ quantiles are used to compare the possible regression functions (see Cattaneo and Wiencierz 2012, 2011). This method is closely related to the least median (or more generally, quantile) of squares regression, which is a very robust regression method for precisely observed quantities (see for example Rousseeuw 1984; Hampel 1975; Hampel et al. 1986; Rousseeuw and Leroy 1987; Maronna et al. 2006; Huber and Ronchetti 2009). Besides being a virtue by itself, the robustness of the regression method is practically necessary when dealing with possibly unbounded imprecise observations, because an unbounded imprecise observation means that the precise value can be arbitrarily far away. In practical applications, an unbounded imprecise observation can usually be replaced by a bounded (but very wide) one: the advantage of robust methods is that they do not depend (much) on the choice of the replacing imprecise observation.

In this paper we focus on the case of simple linear regression with interval data. That is, there are two variables of interest, which are real-valued and interval-censored (i.e., the imprecise observations are possibly unbounded intervals). For this situation, we develop the first exact algorithm to determine the set-valued result of the robust LIR method (see Wiencierz and Cattaneo 2012 for some preliminary ideas). The first part of this algorithm is related to the first exact algorithm for least median of squares regression, proposed by Steele and Steiger (1986) (see also Rousseeuw and Leroy 1987, Chapter 5), which was also the basis of many other developments (see for example Souvaine and Steele 1987; Edelsbrunner and Souvaine 1990; Stromberg 1993; Hawkins 1993; Carrizosa and Plastria 1995; Watson 1998; Bernholt 2005; Mount et al. 2007). Here, we develop the algorithm for the robust LIR method in full detail and generality. In particular, we do not assume that the data are “in general position”, since this assumption (which is usual in the context of least median of squares regression) would be too restrictive for interval-censored data.

The paper is organized as follows. In the next section, we briefly present the robust LIR method in the framework of simple linear regression with interval data. Section 3 contains the main results of the paper, expressed as two theorems, whose proofs are in the appendix. These results give us an exact algorithm for the robust LIR method. The computational complexity of the algorithm is then studied in Sect. 3.3. We have implemented the algorithm as part of an R package, which is briefly introduced in Sect. 3.4, and used to analyze data from the European Social Survey (ESS) in Sect. 4. The final section is devoted to conclusions and directions for further research.

2 LIR in the case of simple linear regression with interval data

In the case of simple linear regression, the relation between two real-valued variables, \(X\) and \(Y\), shall be described by means of a linear function. Hence, the set of all possible regression functions is \({\fancyscript{F}} :=\{f_{a,b}:a,b\in {\mathbb {R}}\}\), where the functions \(f_{a,b}:{\mathbb {R}} \rightarrow {\mathbb {R}}\) are defined by \(f_{a,b}(x)=a+b\,x\) for all \(x\in {\mathbb {R}}\). We consider here the case of imprecisely observed quantities, and in particular of interval data. That is, instead of directly observing the realizations of the variables \(X\) and \(Y\), we can only observe the realizations of the extended real-valued variables \({\underline{X}},\, {\overline{X}}, \,{\underline{Y}}\), and \({\overline{Y}}\), which are the endpoints of the interval data \([{\underline{X}},{\overline{X}}]\) and \([{\underline{Y}},{\overline{Y}}]\). Throughout the paper, \([{\underline{w}},{\overline{w}}]\) denotes the closed interval consisting of all real numbers \(w\) such that \({\underline{w}}\le w\le {\overline{w}}\). This notation is used for all \({\underline{w}},{\overline{w}}\in {\overline{{\mathbb {R}}}}\), so that the interval \([{\underline{w}},{\overline{w}}]\) is empty when \({\underline{w}}>{\overline{w}}\), and does not contain its endpoints when these are infinite.

2.1 The probability model

The only assumption about the joint distribution of the six random variables \(X, \,Y, \,{\underline{X}}, \,{\overline{X}}, \,{\underline{Y}}\), and \({\overline{Y}}\) is the following:

$$\begin{aligned} P({\underline{X}}\le X\le {\overline{X}}\,\text { and }\,{\underline{Y}}\le Y\le {\overline{Y}})\ge 1-\varepsilon , \end{aligned}$$
(1)

for some . That is, apart for the choice of \(\varepsilon \), the probability model is fully nonparametric: it is only assumed that the (possibly unbounded) rectangle \([{\underline{X}} ,{\overline{X}}]\times [{\underline{Y}},{\overline{Y}}]\) contains the pair \((X,Y)\) with probability at least \(1-\varepsilon \). In other words, an imprecise observation may not cover the precise data point with probability at most \(\varepsilon \). The usual choice of \(\varepsilon \) is 0 (see for instance Heitjan and Rubin 1991), but sometimes it can be useful to allow the imprecise data to be incorrect with a positive probability, and is then an upper bound on this probability. Apart from this assumption, there is no restriction on the set of possible distributions of the precise and imprecise data. In particular, nothing is assumed about the joint distribution of the quantities of interest, \(X\) and \(Y\).

The relation between \(X\) and \(Y\) shall be described by a linear function \(f\in {\fancyscript{F}}\). For each \(f\in {\fancyscript{F}}\), the quality of the description depends on the marginal distribution of the (absolute) residual

$$\begin{aligned} R_{f}:=\left| Y-f(X)\right| . \end{aligned}$$

The more this distribution is concentrated near \(0\), the better is the description of the relation between \(X\) and \(Y\). In the robust LIR method that we consider in this paper, the concentration near \(0\) of the distribution of the residual \(R_{f}\) is evaluated by its median, or more generally by its \(p\)-quantile, with \(p\in \;]\varepsilon ,{1-\varepsilon }[\). The closer to \(0\) the \(p\)-quantile is, the better \(f\) describes the relation between \(X\) and \(Y\). In particular, the best description of the relation of interest is a linear function for which the \(p\)-quantile of the residual’s distribution is minimal.

Assuming for simplicity that the \(p\)-quantiles of the distribution of \(R_{f}\) are unique for all \(f\in {\fancyscript{F}}\), and that there is a unique \(f_{0} \in {\fancyscript{F}}\) such that the corresponding \(p\)-quantile \(q_{0}\in {\mathbb {R}}_{\ge 0}\) is minimal, we can characterize geometrically the best description \(f_{0}\) as follows. For each \(f\in {\fancyscript{F}}\) and each \(q\in {\mathbb {R}}_{\ge 0}\), let

$$\begin{aligned} {\overline{B}}_{f,q}:=\left\{ (x,y)\in {\mathbb {R}}^{2}:\left| y-f(x)\right| \le q\right\} \end{aligned}$$

be the closed band of (vertical) width \(2\,q\) around the graph of \(f\). Then \({\overline{B}}_{f_{0},q_{0}}\) is the thinnest band of the form \({\overline{B}}_{f,q}\) containing \((X,Y)\) with probability at least \(p\). This is in particular the case when \(Y\) has for each \(x\in {\mathbb {R}}\) a conditional distribution given \(X=x\) that is strictly unimodal and symmetric around \(f_{0}(x)\) (see also Tasche 2003). That is, in the linear model \(Y=a_{0}+b_{0}\,X+E\), the best description in the above sense is \(f_{0}=f_{a_{0},b_{0}}\), when the conditional distribution of the error term \(E\,|\,X=x\) is strictly unimodal and symmetric (around \(0\)) for all \(x\in {\mathbb {R}}\) (e.g., when the error term \(E\) is independent of \(X\) and normally distributed with mean \(0\)).

2.2 The LIR analysis

Let the nonempty (possibly unbounded) rectangles \([{\underline{x}}_{1} ,{\overline{x}}_{1}]\times [{\underline{{y}}}_{1},{\overline{y}}_{1} ],\ldots ,[{\underline{x}}_{n},{\overline{x}}_{n}]\times [{\underline{{y}}}_{n},{\overline{y}}_{n}]\subseteq {\mathbb {R}}^{2}\) be \(n\) independent realizations of the random set \([{\underline{X}},{\overline{X}}]\times [{\underline{Y}},{\overline{Y}}]\). The LIR analysis consists in using likelihood inference to identify a set of plausible regression functions. The imprecise data induce a (nonparametric) likelihood function on the set of all joint probability distributions (of \(X, \,Y,\,{\underline{X}},\,{\overline{X}}, \,{\underline{Y}}\), and \({\overline{Y}}\)) satisfying condition (1). For each \(f\in {\fancyscript{F}}\), let \({\fancyscript{C}}_{f}\) be the likelihood-based confidence region with cutoff point \(\beta \) for the \(p\)-quantile of the distribution of \(R_{f}\), where \(\beta \in [(\max \{p,1-p\}+\varepsilon )^{n},1 [\). That is, \({\fancyscript{C}}_{f}\) consists of all possible values of the \(p\)-quantile of the distribution of \(R_{f}\), for all probability distributions whose likelihood exceeds \(\beta \) times the maximum of the likelihood function.

If the quantities of interest were precisely observed, \({\fancyscript{C}}_{f}\) would be the empirical likelihood confidence interval obtained by thresholding the empirical likelihood ratio at level \(\beta \). The fact that the quantities of interest are only imprecisely observed entails an enlargement of this interval. Therefore, \({\fancyscript{C}}_{f}\) is asymptotically a (conservative) confidence region of level \(F_{\chi ^{2}}(-2\,\log \beta )\) for the \(p\)-quantile of the distribution of the (absolute) residual \(R_{f}\), where \(F_{\chi ^{2}}\) is the cumulative distribution function of the chi-square distribution with \(1\) degree of freedom (see Owen 2001; Cattaneo and Wiencierz 2012 for more details).

In order to obtain an explicit formula for the confidence regions \({\fancyscript{C}}_{f}\), we define

$$\begin{aligned} {\underline{k}}&:= \max \left( \left\{ k\in \left\{ 1,\ldots ,{\underline{i}}-1\right\} :\left( \frac{p-\varepsilon }{k}\right) ^{k}\left( \frac{1-p+\varepsilon }{n-k}\right) ^{n-k}\le \frac{\beta }{n^{n}}\right\} \cup \{0\}\right) ,\\ {\overline{k}}&:= \min \left( \left\{ k\in \left\{ {\overline{i}} ,\ldots ,n-1\right\} :\left( \frac{p+\varepsilon }{k}\right) ^{k}\left( \frac{1-p-\varepsilon }{n-k}\right) ^{n-k}\le \frac{\beta }{n^{n}}\right\} \cup \{n\}\right) , \end{aligned}$$

where \({\underline{i}}:=\left\lceil (p-\varepsilon )\,n\right\rceil \) and \({\overline{i}}:=\left\lfloor (p+\varepsilon )\,n\right\rfloor +1\) (i.e., \({\underline{i}}-1\) is the largest integer smaller than \((p-\varepsilon )\,n\), while \({\overline{i}}\) is the smallest integer larger than \((p+\varepsilon )\,n\)). Clearly, the two integers \({\underline{k}}\) and \({\overline{k}}\) depend on \(\varepsilon ,\,p,\,n\), and \(\beta \), and satisfy

$$\begin{aligned} 0\le {\underline{k}}\le {\underline{i}}-1<(p-\varepsilon )\,n\le p\,n\le (p+\varepsilon )\,n<{\overline{i}}\le {\overline{k}}\le n. \end{aligned}$$

Moreover, when \(\varepsilon ,\,p\), and \(n\) are fixed, \({\underline{k}}\) and \({\overline{k}}\) are an increasing and a decreasing function of \(\beta \), respectively, and in particular, if \(\beta \) is sufficiently large, then \({\underline{k}}={\underline{i}}-1\) and \({\overline{k}}={\overline{i}}\).

Now, for each function \(f\in {\fancyscript{F}}\) and each imprecise observation \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i} ,{\overline{y}}_{i}]\), we define the lower and upper (absolute) residuals

$$\begin{aligned} {\underline{r}}_{f,i}&:= \min _{(x,y)\in [{\scriptstyle \underline{x}}_{i},{\scriptstyle \overline{x}}_{i}]\times [{\underline{{\scriptstyle y}}}_{i},{\scriptstyle \overline{y}}_{i}]}\left| y-f(x)\right| ,\\ {\overline{r}}_{f,i}&:= \sup _{(x,y)\in [{\scriptstyle \underline{x}}_{i},{\scriptstyle \overline{x}}_{i}]\times [{\underline{{\scriptstyle y}}}_{i},{\scriptstyle \overline{y}}_{i}]}\left| y-f(x)\right| . \end{aligned}$$

Obviously, \({\underline{r}}_{f,i}\le {\overline{r}}_{f,i}\), and \({\underline{r}}_{f,i}\in {\mathbb {R}}_{\ge 0}\), while \({\overline{r}}_{f,i}\in {\overline{{\mathbb {R}}}}_{\ge 0}\). In particular, \({\overline{r}}_{f,i}=+\infty \) if and only if either the linear function \(f\) is not constant and the rectangle \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i} ,{\overline{y}}_{i}]\) is unbounded, or \(f\) is constant and the interval \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is unbounded.

As usual in statistics, \({\underline{r}}_{f,(i)}\) and \({\overline{r}}_{f,(i)}\) denote then the \(i\)th smallest lower and upper residuals, respectively. That is, \({\underline{r}}_{f,(1)}\le \cdots \le {\underline{r}}_{f,(n)}\) are the ordered lower residuals and \({\overline{r}}_{f,(1)}\le \cdots \le {\overline{r}}_{f,(n)}\) are the ordered upper residuals. Then Corollary 2 of Cattaneo and Wiencierz (2012) implies that

$$\begin{aligned} {\fancyscript{C}}_{f}=[{\underline{r}}_{f,({\scriptstyle \underline{k}}+1)},{\overline{r}} _{f,({\scriptstyle \overline{k}})}] \end{aligned}$$

for all \(f\in {\fancyscript{F}}\). That is, the likelihood-based confidence region \({\fancyscript{C}}_{f}\subseteq {\mathbb {R}}_{\ge 0}\) is a nonempty closed interval, which is bounded if and only if either \(f\) is not constant and there are at least \({\overline{k}}\) bounded imprecise observations, or \(f\) is constant and there are at least \({\overline{k}}\) imprecise observations \([{\underline{x}} _{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i},{\overline{y}}_{i}]\) such that the interval \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is bounded.

It is important to note that in general the interval \({\fancyscript{C}}_{f}\) is proper (i.e., it contains more than one value), even when \(\beta \) is so large that \({\underline{k}}={\underline{i}}-1\) and \({\overline{k}}={\overline{i}}\). In this case, \({\fancyscript{C}}_{f}\) represents the maximum likelihood estimate of the \(p\)-quantile of the distribution of \(R_{f}\), which in general is not a single value because the data are imprecise and quantiles of a distribution are not necessarily unique. For example, if \(n\) is even, \(\varepsilon =0\), and , and thus the maximum likelihood estimate of the \(p\)-quantile (i.e., the median) of the distribution of \(R_{f}\) is .

Hence, for each linear function \(f\in {\fancyscript{F}}\), we have an interval estimate \({\fancyscript{C}}_{f}\) for the \(p\)-quantile of the distribution of the (absolute) residual \(R_{f}\). As in least quantile of squares regression, we would like to select the regression function \(f\in {\fancyscript{F}}\) by minimizing the estimate of the residual’s \(p\)-quantile, but comparing the intervals \({\fancyscript{C}}_{f}\) gives us only a partial order on \({\fancyscript{F}}\). The linear functions \(f\in {\fancyscript{F}}\) that are minimal according to this partial order are said to be undominated. That is, \(f\) is undominated if and only if there is no \(f^{\prime }\in {\fancyscript{F}}\) such that \({\overline{r}}_{f^{\prime },({\scriptstyle \overline{k}})}<{\underline{r}}_{f,({\scriptstyle \underline{k}}+1)}\). In order to simplify the description of the undominated functions, define

$$\begin{aligned} {\overline{q}}_{LRM}:=\inf _{f\in {\fancyscript{F}}}{\overline{r}}_{f,({\scriptstyle \overline{k}})} \end{aligned}$$

(the name \({\overline{q}}_{LRM}\) shall be clarified in Sect. 3.1). The set of all undominated regression functions

$$\begin{aligned} {\fancyscript{U}}:=\{f\in {\fancyscript{F}}:{\underline{r}}_{f,({\scriptstyle \underline{k}}+1)} \le {\overline{q}}_{LRM}\} \end{aligned}$$

is the result of the robust LIR method considered in this paper. It represents the whole uncertainty about the linear function that best describes the relation between \(X\) and \(Y\), including the statistical uncertainty due to the finite sample as well as the indetermination related to the fact that the quantities are only imprecisely observed.

3 An exact algorithm for LIR

We now present an exact algorithm for determining the result of the robust LIR analysis described in Sect. 2. That is, an exact algorithm for calculating the set \({\fancyscript{U}}\) of all undominated regression functions, given \(n\) nonempty (possibly unbounded) rectangles \([{\underline{x}} _{1},{\overline{x}}_{1}]\times [{\underline{{y}}}_{1},{\overline{y}}_{1} ],\ldots ,[{\underline{x}}_{n},{\overline{x}}_{n}]\times [{\underline{{y}}}_{n},{\overline{y}}_{n}]\subseteq {\mathbb {R}}^{2}\) and the two integers \({\underline{k}}\) and \({\overline{k}}\) with \(0\le {\underline{k}}<{\overline{k}}\le n\). The algorithm consists of two parts: in the first one, we determine the bound \({\overline{q}}_{LRM}\), which is then used in the second part to identify the set \({\fancyscript{U}}\). As regards the first part, we will show that in order to determine \({\overline{q}}_{LRM}\), it suffices to minimize \({\overline{r}}_{f,({\scriptstyle \overline{k}})}\) over a finite subset of \({\fancyscript{F}}\). For the second part of the algorithm, we will see that the set of all the intercept-slope pairs corresponding to the undominated regression functions is the union of finitely many polygons. As a by-product, we obtain a representation of the result of the least quantile of squares regression in the case of precise data, without any assumption about the data being “in general position”. We will show that the computational complexity of our exact algorithm is \(O(n^{3}\log n)\). We have implemented this algorithm as part of an R package, which we will briefly introduce at the end of the present section.

3.1 Part 1: Determining the bound \({\overline{q}}_{LRM}\)

Let \({\fancyscript{D}}\) be the set of all \(i\in \{1,\ldots ,n\}\) such that the rectangle \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is bounded. Then define \({\fancyscript{B}}:=\{0\}\) if there are less than \({\overline{k}}\) bounded imprecise observations (i.e., if \(|{\fancyscript{D}}|<{\overline{k}}\), where \(|{\fancyscript{D}}|\) denotes the cardinality of the set \({\fancyscript{D}}\)), and

$$\begin{aligned} {\fancyscript{B}}&:= \left\{ \frac{{\overline{y}}_{i}-{\overline{y}}_{j}}{{\underline{x}}_{i}-{\underline{x}}_{j}}:(i,j)\in {\fancyscript{D}}^{2}\,\text { and }\,{\underline{x}}_{i}>{\underline{x}}_{j}\,\text { and }\,{\overline{y}} _{i}>{\overline{y}}_{j}\right\} \\&\quad \cup \left\{ \frac{{\underline{{y}}}_{i} -{\underline{{y}}}_{j}}{{\underline{x}}_{i}-{\underline{x}}_{j}}:(i,j)\in {\fancyscript{D}}^{2}\,\text { and }\,{\underline{x}}_{i}>{\underline{x}}_{j}\,\text { and }\,{\underline{{y}}}_{i}<{\underline{{y}}}_{j}\right\} \\&\quad \cup \left\{ \frac{{\overline{y}}_{i}-{\overline{y}}_{j}}{{\overline{x}} _{i}-{\overline{x}}_{j}}:(i,j)\in {\fancyscript{D}}^{2}\,\text { and }\,{\overline{x}} _{i}>{\overline{x}}_{j}\,\text { and }\,{\overline{y}}_{i}<{\overline{y}} _{j}\right\} \\&\quad \cup \left\{ \frac{{\underline{{y}}}_{i}-{\underline{{y}}}_{j} }{{\overline{x}}_{i}-{\overline{x}}_{j}}:(i,j)\in {\fancyscript{D}}^{2}\,\text { and }\,{\overline{x}}_{i}>{\overline{x}}_{j}\,\text { and }\,{\underline{{y}}}_{i}>{\underline{{y}}}_{j}\right\} \cup \{0\} \end{aligned}$$

otherwise (i.e., if \(|{\fancyscript{D}}|\ge {\overline{k}}\)). The central ideas of the first part of the algorithm are that in order to obtain \({\overline{q}}_{LRM}\) it suffices to consider the linear functions \(f_{a,b}\) with slope \(b\in {\fancyscript{B}}\), and that for each slope \(b\) the intercept \(a\in {\mathbb {R}}\) minimizing \({\overline{r}}_{f_{a,b},({\scriptstyle \overline{k}})}\) can be easily calculated, since the problem becomes one-dimensional. These ideas are formalized in the following theorem, but first we need some additional definitions. For each \(b\in {\mathbb {R}}\) and each \(i\in \{1,\ldots ,n\}\), define

$$\begin{aligned} {\underline{z}}_{b,i}&= \left\{ \begin{array}{ll} {\underline{{y}}}_{i}-b\,{\underline{x}}_{i} &{} \quad \text {if} \,\,\, b<0,\\ {\underline{{y}}}_{i} &{} \quad \text {if} \,\,\, b=0,\\ {\underline{{y}}}_{i}-b\,{\overline{x}}_{i} &{} \quad \text {if} \,\,\, b>0, \end{array} \right. \\ {\overline{z}}_{b,i}&= \left\{ \begin{array} {ll} {\overline{y}}_{i}-b\,{\overline{x}}_{i} &{} \quad \text {if} \,\,\, b<0,\\ {\overline{y}}_{i} &{} \quad \text {if} \,\,\, b=0,\\ {\overline{y}}_{i}-b\,{\underline{x}}_{i} &{} \quad \text {if} \,\,\, b>0. \end{array} \right. \end{aligned}$$

For each \(b\in {\mathbb {R}}\) and each \(j\in \{1,\ldots ,n\}\), as usual, \({\underline{z}}_{b,(j)}\) and \({\overline{z}}_{b,(j)}\) denote then the \(j\)th smallest value among the \({\underline{z}}_{b,i}\) and among the \({\overline{z}}_{b,i}\), respectively. Furthermore, for each \(b\in {\mathbb {R}}\) and each \(j\in \{1,\ldots ,n-{\overline{k}}+1\}\), let \({\overline{z}}_{b,[j]}\) denote the \({\overline{k}}\)th smallest value among the \({\overline{z}}_{b,i}\) such that \({\underline{z}}_{b,i}\ge {\underline{z}}_{b,(j)}\).

Theorem 1

If there are less than \({\overline{k}}\) imprecise observations \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i} ,{\overline{y}}_{i}]\) such that the interval \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is bounded, then

$$\begin{aligned}&{\overline{q}}_{LRM} =+\infty ,\\&\{f\in {\fancyscript{F}}:{\overline{r}}_{f,({\scriptstyle \overline{k}})}={\overline{q}}_{LRM}\} ={\fancyscript{F}}. \end{aligned}$$

Otherwise (i.e., when there are at least \({\overline{k}}\) imprecise observations \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i} ,{\overline{y}}_{i}]\) such that the interval \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is bounded),

$$\begin{aligned}&{\overline{q}}_{LRM} =\tfrac{1}{2}\min _{(b,j)\in {\fancyscript{B}}\times \{1,\ldots ,n-{\scriptstyle \overline{k}}+1\}}({\overline{z}}_{b,[j]} -{\underline{z}}_{b,(j)}),\\&\{f\!\in \!{\fancyscript{F}}\!:\!{\overline{r}}_{f,({\scriptstyle \overline{k}})}\!=\!{\overline{q}}_{LRM}\}\\&\quad \supseteq \left\{ f_{a^{\prime },b^{\prime }}:(b^{\prime },j^{\prime } )\in \mathop {{{\mathrm{arg min}}}}\limits _{(b,j)\in {\fancyscript{B}}\times \{1,\ldots ,n-{\scriptstyle \overline{k}}+1\}} ({\overline{z}}_{b,[j]}\!-\!{\underline{z}}_{b,(j)})\,\text { and }\right. \\&\qquad \quad \;\left. a^{\prime }=\tfrac{1}{2}\,({\underline{z}}_{b^{\prime },(j^{\prime })} +{\overline{z}}_{b^{\prime },[j^{\prime }]})\right\} , \end{aligned}$$

where the set on the left-hand side is infinite when the inclusion is strict. However, the inclusion is certainly an equality when the following condition is satisfied: if there is a pair \((i,j)\in {\fancyscript{D}}^{2}\) such that \({\underline{x}}_{i}={\overline{x}}_{j}\) and \(\max \{{\overline{y}}_{i},{\overline{y}}_{j}\}-\min \{{\underline{{y}}}_{i},{\underline{{y}}}_{j}\}=2\,{\overline{q}}_{LRM}\), then \(i\ne j\) and the two intervals \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\) and \([{\underline{{y}}}_{j},{\overline{y}}_{j}]\) are nested (i.e., either \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\subseteq [{\underline{{y}}}_{j},{\overline{y}}_{j}]\), or \([{\underline{{y}}}_{j},{\overline{y}}_{j} ]\subseteq [{\underline{{y}}}_{i},{\overline{y}}_{i}]\)).

Some further explanations are needed to fully understand the results in Theorem 1. As seen in Sect. 2.2, for each linear function \(f\in {\fancyscript{F}}\), we have a likelihood-based confidence region \([{\underline{r}}_{f,({\scriptstyle \underline{k}}+1)},{\overline{r}}_{f,({\scriptstyle \overline{k}})}]\) for the \(p\)-quantile of the residual’s distribution. Hence, the functions \(f\in {\fancyscript{F}}\) minimizing \({\overline{r}}_{f,({\scriptstyle \overline{k}})}\) can be interpreted as the results of a minimax approach to our regression problem: they are called Likelihood-based Region Minimax (LRM) regression functions (see Cattaneo 2007). For these functions, the upper endpoint of the interval estimate of the \(p\)-quantile of the residual’s distribution is \({\overline{q}}_{LRM}\), which explains its name.

Theorem 1 implies in particular that an LRM regression function always exists, though it is not necessarily unique. When it is unique, it is denoted by \(f_{LRM}\). In this case, \({\overline{B}}_{f_{LRM},{\scriptstyle \overline{q}}_{LRM}}\) is the thinnest band of the form \({\overline{B}}_{f,q}\) containing at least \({\overline{k}}\) imprecise observations, for all \(f\in {\fancyscript{F}}\) and all \(q\in {\mathbb {R}}_{\ge 0}\). More generally, if there are at least \({\overline{k}}\) imprecise observations \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i},{\overline{y}}_{i}]\) such that the interval \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is bounded, then \(2\,{\overline{q}} _{LRM}\) is the (vertical) width of the thinnest bands of the form \({\overline{B}}_{f,q}\) containing at least \({\overline{k}}\) imprecise observations (there can be more than one such bands, but only finitely many when the condition at the end of Theorem 1 is satisfied).

If all interval data are degenerate: \({\underline{x}}_{i}={\overline{x}}_{i}\) and \({\underline{{y}}}_{i}={\overline{y}}_{i}\) for all \(i\in \{1,\ldots ,n\}\) (i.e., the imprecise data are in fact precise), then the LRM regression functions correspond to the least quantile of squares (or absolute residuals) regression functions \(f\in {\fancyscript{F}}\) minimizing the (square of the) \({\overline{k}}\)th smallest absolute residual \({\underline{r}}_{f,({\scriptstyle \overline{k}})}={\overline{r}}_{f,({\scriptstyle \overline{k}})}\) (see Rousseeuw and Leroy 1987). That is, the LRM regression functions can be interpreted as the results of a generalization of the least quantile of squares regression to the case of imprecise data. The first part of our algorithm corresponds to a generalization (to the case of general quantiles and imprecise data) of the first exact algorithm for least median of squares regression, proposed by Steele and Steiger (1986) (see also Rousseeuw and Leroy 1987, Chapter 5).

The key result behind Theorem 1 is that (when the condition at the end of the theorem is satisfied) if \({\overline{B}}_{f^{\prime },q^{\prime }}\) is one of the thinnest bands of the form \({\overline{B}}_{f,q}\) containing at least \({\overline{k}}\) imprecise observations, then the union of these imprecise observations touches one of the two borders of \({\overline{B}}_{f^{\prime },q^{\prime }}\) in at least two different points. This is a simple consequence of general results by Cheney (1982, Chapters 1 and 2), as suggested by Stromberg (1993). From this property it follows that one of the two borders of \({\overline{B}}_{f^{\prime },q^{\prime }}\) (which obviously have the same slope as \(f^{\prime }\)) is the line determined by two points on the borders of the imprecise observations contained in \({\overline{B}}_{f^{\prime },q^{\prime }}\). Hence, either the slope of \(f^{\prime }\) is \(0\), or it is determined by two vertices of a pair of bounded imprecise observations contained in \({\overline{B}}_{f^{\prime },q^{\prime }}\). The set \({\fancyscript{B}}\) consists of all the possible slopes that can be obtained in this way: they are at most \(4\,{\genfrac(){0.0pt}1{n}{2}}+1\). For each possible slope \(b\in {\fancyscript{B}}\), finding the thinnest bands of the form \({\overline{B}}_{f_{a,b},q}\) containing at least \({\overline{k}}\) imprecise observations (for all \(a\in {\mathbb {R}}\) and all \(q\in {\mathbb {R}}_{\ge 0}\)) corresponds to finding the shortest intervals (of the form \([a-q,\,a+q]\)) containing at least \({\overline{k}}\) of the \(n\) intervals \([{\underline{z}}_{b,1},{\overline{z}}_{b,1}],\ldots ,[{\underline{z}}_{b,n},{\overline{z}}_{b,n}]\). This is a finite problem: it suffices to consider the intervals \([{\underline{z}}_{b,(j)},{\overline{z}}_{b,[j]}]\) with \(j\in \{1,\ldots ,n-{\overline{k}}+1\}\).

Therefore, Theorem 1 gives us an algorithm for determining the bound \({\overline{q}}_{LRM}\), by reducing the minimization of \({\overline{r}}_{f,({\scriptstyle \overline{k}})}\) on the infinite set \({\fancyscript{F}}\) to a minimization problem on the finite set \({\fancyscript{B}}\times \{1,\ldots ,n-{\overline{k}}+1\}\). This constitutes the first part of our algorithm for the robust LIR analysis.

Besides that, Theorem 1 gives us also an algorithm for finding all LRM regression functions, when the condition at the end of the theorem is satisfied. An explicit formula for the set of all LRM regression functions in the general case (i.e., also when this condition is not satisfied) can be easily obtained, but requires several case distinctions and goes beyond the scope of the present paper. However, a brief comment on the condition at the end of Theorem 1 can be helpful in understanding the theorem. This condition is sufficient (but not necessary) for excluding the cases in which the set of all LRM regression functions is an infinite, proper subset of \({\fancyscript{F}}\). That is, for excluding the situations in which there is a thinnest band of the form \({\overline{B}}_{f,q}\) containing at least \({\overline{k}}\) imprecise observations, but the union of these imprecise observations touches each border of the band in only one point. In fact, in such situations these contact points have the same \(x\)-coordinate, and can thus be written as \((x,y+{\overline{q}}_{LRM})\) and \((x,y-{\overline{q}}_{LRM})\), for some \(x,y\in {\mathbb {R}}\). In this case, there are infinitely many linear functions \(f\in {\fancyscript{F}}\) going through the point \((x,y)\) and such that the band \({\overline{B}}_{f,{\scriptstyle \overline{q}}_{LRM}}\) contains at least \({\overline{k}}\) imprecise observations. All these functions \(f\) are LRM regression functions.

3.2 Part 2: Identifying the set \({\fancyscript{U}}\)

After having determined the bound \({\overline{q}}_{LRM}\), in the second part of the algorithm we identify the set \({\fancyscript{U}}\) of all undominated regression functions (i.e., the result of the robust LIR analysis described in Sect. 2).

Theorem 2

$$\begin{aligned} {\fancyscript{U}}=\left\{ f_{a,b}:b\in {\mathbb {R}}\,\text { and }\,a\in \bigcup _{j=1}^{n-{\scriptstyle \underline{k}}}[{\underline{z}}_{b,({\scriptstyle \underline{k}}+j)}-{\overline{q}}_{LRM},\,{\overline{z}}_{b,(j)}+{\overline{q}}_{LRM}]\right\} \text {.} \end{aligned}$$

A linear function \(f\in {\fancyscript{F}}\) is undominated if and only if \({\underline{r}}_{f,({\scriptstyle \underline{k}}+1)}\le {\overline{q}}_{LRM}\). That is, if and only if the band \({\overline{B}}_{f,{\scriptstyle \overline{q}}_{LRM}}\) intersects at least \({\underline{k}}+1\) imprecise observations. For each possible slope \(b\in {\mathbb {R}}\), finding all the bands of the form \({\overline{B}} _{f_{a,b},{\scriptstyle \overline{q}}_{LRM}}\) intersecting at least \({\underline{k}}+1\) imprecise observations (for all \(a\in {\mathbb {R}}\)) corresponds to finding all the intervals of the form \([a-{\overline{q}}_{LRM},\,a+{\overline{q}}_{LRM}]\) intersecting at least \({\underline{k}}+1\) of the \(n\) intervals \([{\underline{z}}_{b,1},{\overline{z}}_{b,1}],\ldots ,[{\underline{z}}_{b,n},{\overline{z}}_{b,n}]\). For each \(b\in {\mathbb {R}}\) and each nonempty set \({\fancyscript{I}}\subseteq \{1,\ldots ,n\}\), the interval \([a-{\overline{q}} _{LRM},\,a+{\overline{q}}_{LRM}]\) (with \(a\in {\mathbb {R}}\)) intersects all the intervals \([{\underline{z}}_{b,i},{\overline{z}}_{b,i}]\) with \(i\in {\fancyscript{I}}\) if and only if \(a\in [\max _{i\in {\fancyscript{I}}}{\underline{z}}_{b,i}-{\overline{q}}_{LRM},\,\min _{i\in {\fancyscript{I}}}{\overline{z}}_{b,i}+{\overline{q}}_{LRM}]\). Therefore,

$$\begin{aligned} {\fancyscript{U}}=\left\{ f_{a,b}:b\in {\mathbb {R}}\,\text { and }\,a\in \bigcup _{{\fancyscript{I}}\subseteq \{1,\ldots ,n\}\,:\,|{\fancyscript{I}}|={\scriptstyle \underline{k}}+1}\left[ \max _{i\in {\fancyscript{I}}}{\underline{z}}_{b,i}-{\overline{q}}_{LRM},\,\min _{i\in {\fancyscript{I}}}{\overline{z}}_{b,i}+{\overline{q}}_{LRM}\right] \right\} . \end{aligned}$$

Theorem 2 gives a simpler expression for \({\fancyscript{U}}\), in which the number of intervals in the union is reduced from \({\genfrac(){0.0pt}1{n}{{\scriptstyle \underline{k}}+1}}\) to \(n-{\underline{k}}\).

Hence, Theorem 2 gives us an algorithm for identifying, for each possible slope \(b\in {\mathbb {R}}\), the set of all intercepts \(a\in {\mathbb {R}}\) such that the linear function \(f_{a,b}\) is undominated. This suffices for most practical purposes, but Theorem 2 also enables us to precisely describe as union of finitely many (possibly unbounded) polygons the set

$$\begin{aligned} {\fancyscript{U}}^{\prime }:=\left\{ (a,b)\in {\mathbb {R}}^{2}:f_{a,b}\in {\fancyscript{U}}\right\} \end{aligned}$$

of all the intercept-slope pairs corresponding to the undominated regression functions. More precisely, \({\fancyscript{U}}^{\prime }\) is a subset of the plane \({\mathbb {R}}^{2}\) bounded by finitely many line segments and half-lines. However, \({\fancyscript{U}}^{\prime }\) is not necessarily convex nor connected, and if there are imprecise observations \([{\underline{x}}_{i},{\overline{x}} _{i}]\times [{\underline{{y}}}_{i},{\overline{y}}_{i}]\) such that the interval \([{\underline{x}}_{i},{\overline{x}}_{i}]\) is unbounded and \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\ne {\mathbb {R}}\), then \({\fancyscript{U}}^{\prime }\) is not even necessarily closed.

Consider first the case with no imprecise observations \([{\underline{x}} _{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i},{\overline{y}}_{i}]\) such that the interval \([{\underline{x}}_{i},{\overline{x}}_{i}]\) is unbounded and \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\ne {\mathbb {R}}\). In this case, for each \(i\in \{1,\ldots ,n\}\), the function \(b\mapsto {\underline{z}}_{b,i}\) on \({\mathbb {R}}\) is either continuous and piecewise linear, or constant equal \(-\infty \), while the function \(b\mapsto {\overline{z}}_{b,i}\) on \({\mathbb {R}}\) is either continuous and piecewise linear, or constant equal \(+\infty \). Therefore, for each \(j\in \{1,\ldots ,n-{\underline{k}}\}\), the function \(b\mapsto {\underline{z}}_{b,({\scriptstyle \underline{k}}+j)}-{\overline{q}}_{LRM}\) on \({\mathbb {R}}\) is either continuous and piecewise linear, or constant equal \(-\infty \), while the function \(b\mapsto {\overline{z}}_{b,(j)}+{\overline{q}} _{LRM}\) on \({\mathbb {R}}\) is either continuous and piecewise linear, or constant equal \(+\infty \). Thus, Theorem 2 implies that \({\fancyscript{U}}^{\prime }\) is a closed subset of the plane \({\mathbb {R}}^{2}\) bounded by finitely many line segments and half-lines. That is, \({\fancyscript{U}}^{\prime }\) is the union of finitely many (possibly unbounded) polygons (see for example Alexandrov 2005, Subsection 1.1.1).

If \([{\underline{x}}_{i},{\overline{x}}_{i}]\times [{\underline{{y}}}_{i},{\overline{y}}_{i}]\) is an imprecise observation such that the interval \([{\underline{x}}_{i},{\overline{x}}_{i}]\) is unbounded and \([{\underline{{y}}}_{i},{\overline{y}}_{i}]\ne {\mathbb {R}}\), then at least one of the two functions \(b\mapsto {\underline{z}}_{b,i}\) and \(b\mapsto {\overline{z}}_{b,i}\) on \({\mathbb {R}}\) has a discontinuity at \(b=0\). Therefore, in this case, the functions \(b\mapsto {\underline{z}}_{b,({\scriptstyle \underline{k}}+j)}-{\overline{q}}_{LRM}\) and \(b\mapsto {\overline{z}}_{b,(j)}+{\overline{q}}_{LRM}\) on \({\mathbb {R}}\) (with \(j\in \{1,\ldots ,n-{\underline{k}}\}\)) can be discontinuous at \(b=0\). As a consequence, Theorem 2 implies that \({\fancyscript{U}}^{\prime }\) is a subset of the plane \({\mathbb {R}}^{2}\) bounded by finitely many line segments and half-lines, but \({\fancyscript{U}}^{\prime }\) is not necessarily closed. However, the two parts \({\fancyscript{U}}^{\prime }\cap ({\mathbb {R}}\times \{0\})\) and \({\fancyscript{U}} ^{\prime }\cap ({\mathbb {R}}\times {\mathbb {R}}_{\ne 0})\) are relatively closed in \({\mathbb {R}}\times \{0\}\) and \({\mathbb {R}}\times {\mathbb {R}}_{\ne 0}\), respectively.

If the imprecise data are in fact precise (i.e., all interval data are degenerate: \({\underline{x}}_{i}={\overline{x}}_{i}\) and \({\underline{{y}}}_{i}={\overline{y}}_{i}\) for all \(i\in \{1,\ldots ,n\}\)) and \({\underline{k}}={\overline{k}}-1\), then \({\fancyscript{U}}\) is the set of all least quantile of squares regression functions \(f\in {\fancyscript{F}}\) minimizing the \({\overline{k}}\)th smallest absolute residual \({\underline{r}}_{f,({\scriptstyle \overline{k}})}={\overline{r}}_{f,({\scriptstyle \overline{k}})}\). That is, Theorem 2 gives us in particular an algorithm for calculating the result of the least quantile of squares regression, without any assumption about the data being “in general position”.

3.3 Computational complexity

The algorithm consisting of the two parts presented in Sects. 3.1 and 3.2 is the first exact algorithm to determine the result of the robust LIR analysis in the case of simple linear regression with interval data. It has worst-case time complexity \(O(n^{3}\log n)\), exactly as the first exact algorithm for least median of squares regression (see Steele and Steiger 1986).

In the first part of the algorithm, described in Sect. 3.1, for each possible slope \(b\in {\fancyscript{B}}\), we must determine the pair \(({\underline{z}}_{b,(j)},{\overline{z}}_{b,[j]})\) (with \(j\in \{1,\ldots ,n-{\overline{k}}+1\}\)) such that the difference \({\overline{z}}_{b,[j]}-{\underline{z}}_{b,(j)}\) is minimal. We can do this as follows: after having calculated the values \({\underline{z}}_{b,1} ,\ldots ,{\underline{z}}_{b,n}\) and \({\overline{z}}_{b,1},\ldots ,{\overline{z}} _{b,n}\), we sort the two lists, obtaining \({\underline{z}}_{b,i_{1}} ,\ldots ,{\underline{z}}_{b,i_{n}}\) (with \({\underline{z}}_{b,i_{j}}={\underline{z}}_{b,(j)}\)) and \({\overline{z}}_{b,(1)},\ldots ,{\overline{z}}_{b,(n)}\). Then, for each \(j\) from \(1\) to \(n-{\overline{k}}+1\), we retrieve the pair consisting of the \(j\)th entry (i.e., \({\underline{z}}_{b,i_{j}}\)) in the first list and of the \({\overline{k}}\)th entry in the second one, and after that we remove the value \({\overline{z}}_{b,i_{j}}\) from the second list. In this way, the pairs of values that we have retrieved include all the pairs \(({\underline{z}} _{b,(j)},{\overline{z}}_{b,[j]})\) with \(j\in \{1,\ldots ,n-{\overline{k}}+1\}\) (and possibly some irrelevant additional pairs with larger differences, if some of the \({\underline{z}}_{b,i}\) are equal), and we did not have to calculate a new list of \({\overline{z}}_{b,i}\) for each \(j\) in order to determine \({\overline{z}}_{b,[j]}\).

Hence, for each possible slope, we have to calculate and sort two lists of length \(n\), which can be done in time \(O(n\,\log n)\), and then for each \(j\in \{1,\ldots ,n-{\overline{k}}+1\}\), we have to search and remove a value from the second list, which can be done in time \(O(\log n)\) using balanced trees (see for example Knuth 1998, Subsection 6.2.3). Therefore, since there are at most \(4\,{\genfrac(){0.0pt}1{n}{2}}+1\) possible slopes, the worst-case time complexity of the first part of the algorithm is \(O(n^{3}\log n)\).

In the second part of the algorithm, described in Sect. 3.2, for a given slope \(b\in {\mathbb {R}}\), we must determine the pairs \(({\underline{z}}_{b,({\scriptstyle \underline{k}}+j)},{\overline{z}}_{b,(j)})\) for all \(j\in \{1,\ldots ,n-{\underline{k}}\}\). This can be done in time \(O(n\,\log n)\), since it suffices to calculate and sort the two lists \({\underline{z}}_{b,1} ,\ldots ,{\underline{z}}_{b,n}\) and \({\overline{z}}_{b,1},\ldots ,{\overline{z}} _{b,n}\), and then, for each \(j\) from \(1\) to \(n-{\underline{k}}\), retrieve the pair consisting of the \(({\underline{k}}+j)\)th entry in the first list and of the \(j\)th entry in the second one.

For example, if we want to graphically represent the set \({\fancyscript{U}}^{\prime }\) of all the intercept-slope pairs \((a,b)\in {\mathbb {R}}^{2}\) corresponding to the undominated regression functions \(f_{a,b}\), then it suffices to consider a finite number of possible values for the slope \(b\), resulting in a worst-case time complexity of \(O(n\,\log n)\) for the second part of the algorithm. However, if the goal is to precisely describe the set \({\fancyscript{U}}^{\prime }\) as union of finitely many (possibly unbounded) polygons, then the (worst-case) number of values \(b\in {\mathbb {R}}\) that must be considered depends on \(n\). In this case, it suffices to consider all values \(b\in {\mathbb {R}}\) such that some of the \(2\,n\) graphs of the functions \(b\mapsto {\underline{z}}_{b,i}-{\overline{q}}_{LRM}\) and \(b\mapsto {\overline{z}}_{b,i}+{\overline{q}}_{LRM}\) cross each other, and five additional values for the slope \(b\). More precisely, these additional values are \(0\), a positive and a negative value sufficiently near \(0\) (in order to clarify what happens in the limits \(b\downarrow 0\) and \(b\uparrow 0\)), and finally a positive and a negative value sufficiently far from \(0\) (in order to clarify what happens in the limits \(b\uparrow +\infty \) and \(b\downarrow -\infty \)). Therefore, the worst-case number of values \(b\in {\mathbb {R}}\) that must be considered is \(2\,{\genfrac(){0.0pt}1{2\,n}{2}}+5\), and so the worst-case time complexity of the second part of the algorithm is \(O(n^{3}\log n)\), when the goal is to precisely describe the set \({\fancyscript{U}}^{\prime }\) as union of finitely many (possibly unbounded) polygons.

Altogether, the worst-case time complexity of the whole algorithm for the robust LIR analysis is thus \(O(n^{3}\log n)\).

3.4 R package

We have implemented the presented algorithm in the statistical software environment (R Development Core Team 2012). It is part of the package linLIR (Wiencierz 2013), designed for the implementation of LIR methods for the case of linear regression with interval data. The available version of the linLIR package includes a function to create a particular data object for interval-valued observations (idf.create), the function s.linlir to perform the robust LIR analysis for two variables out of the data object, as well as associated methods for the generic functions print, summary, and plot. Both parts of the algorithm are incorporated in the s.linlir function. The corresponding plot method provides tools to visualize the LIR results including, e.g., the set \({\fancyscript{U}}^{\prime }\).

The linLIR package provides a ready-to-use first implementation of the robust LIR method for linear regression with interval data, although the current version of the s.linlir function is not optimized for computational speed yet. In the following section, we illustrate the implementation by means of an application example.

4 Analysis of ESS data using the linLIR package

In recent years, there has been a lively interest in analyzing subjective well-being in various disciplines of the social and behavioral sciences. In this context, one important question is how an increase in income translates to subjective well-being (see, e.g., Deaton 2012; Clark et al. 2008; Diener and Biswas-Diener 2002). Empirical studies in this field often use global measures of subjective well-being, which are obtained from a single survey question about the overall satisfaction with life. These global measures are indicators of the state of an individual’s well-being, and therefore, it is sensible to use them to analyze subjective well-being (Deaton 2008), although, of course, they do not capture the entire complexity of the concept of well-being (Huppert et al. 2009). As single-item measures are usually measured on a discrete scale, they can be considered as coarse observations of the latent, continuous variable of interest degree of subjective well-being. The coarseness of the discrete values can be represented by intervals, thus, the LIR approach is suitable to analyze this kind of data. Moreover, when investigating the relation between income and subjective well-being, sometimes also the income data are only available as classes, which represent in fact intervals that form a partition of the associated observation space \({\mathbb {R}}_{\ge 0}\). Finally, as the relation between income and subjective well-being is usually assumed to be log-linear (see, e.g., Deaton 2012; Diener and Biswas-Diener 2002), we can conduct a linear LIR analysis with the logarithm of income as independent variable \(X\) and subjective well-being as dependent variable \(Y\) to analyze the relation of interest, accounting for the imprecision of the data.

In this section, we analyze data from the fifth round of the ESS (Norwegian Social Science Data Services 2010) to illustrate the implementation of the linear LIR analysis. The ESS is a biennial multi-country survey established to monitor changing attitudes and behavior of people in Europe. The data collected for the ESS are available free of charge on the ESS website www.europeansocialsurvey.org.

Previous empirical studies indicated that the relation between income and subjective well-being on the individual level is not the same in rich countries as in poor countries, and furthermore, that there may be different relations for men than for women (see, e.g., Clark et al. 2005; Diener and Biswas-Diener 2002). For these reasons, we choose Finland and Bulgaria as representatives for the groups of rich and poor European countries, respectively, and we will analyze only the corresponding subsets of the ESS data set. Furthermore, for each country we will perform separate LIR analyses for the subpopulations of women and men. From the variables included in the ESS data set we retrieve the following ones: household income (net per month, in categories corresponding to the decile classes of the income distribution in each country) and overall satisfaction with life (on a discrete scale from 0—extremely dissatisfied to 10—extremely satisfied). In a data preprocessing step, the income classes are replaced by the corresponding intervals, then the interval endpoints are divided by the household size and, finally, the logarithmic transformation is made. The data on subjective well-being are changed from discrete values \(0,1, \ldots , 9,10\) to intervals \([0,0.5], [0.5,1.5], \ldots , [8.5,9.5], [9.5,10]\). Hence, the independent and dependent precise quantities whose relation is investigated by the linear LIR analysis are the logarithm of monthly net household income per capita in euros and the subjective well-being on a latent, continuous scale from \(0\) to \(10\).

The resulting data frames contain each four columns: two for each of the analyzed variables, one column for the lower interval endpoint and one for the upper endpoint, which is the required data format for the linLIR package. Applying the function idf.create to these data frames, we create so-called interval data frame (idf) objects, which consist of a list of data frames, each containing the corresponding two columns of interval endpoints of one variable. For these idf-objects, the linLIR package provides a summary method as well as a plot method with two options. Figures 1 and 2 show the data plots of the four data sets we will analyze. As the data sets consist of roughly 1,000 observations each, we used the two-dimensional histogram plot by choosing the option typ=“hist” in the plot function. As expected, we notice that the marginal distribution of subjective well-being is concentrated at a higher level in Finland compared to Bulgaria, but there appear to be no big differences between men and women within the countries. Moreover, we can see that there are many observations that are unbounded with respect to \(X\). This is partly caused by the high number of observations in the lowest and highest income classes. In addition to this, there is a significant percentage of completely missing income values (Finland 5–10  %, Bulgaria 15–20  %), which are represented in the data set as intervals \([{\underline{x}}_{i},{\overline{x}}_{i}]=[-\infty , +\infty ]\). Given the high degree of data imprecision, we can expect to obtain rather uninformative results from the LIR analyses, reflecting the high uncertainty induced by the interval data. One could argue that using \(-\infty \) as lower endpoint of the range of the logarithmic income (instead of using, e.g., zero) entails too much unnecessary data uncertainty. However, the results of the LIR analyses are affected only a little by this, because the LIR method is very robust.

Fig. 1
figure 1

Histogram plots of the Finnish data sets: women above (\(n=967\)), men below (\(n=911\)). The darker a rectangle the more observations overlap this rectangle

Fig. 2
figure 2

Histogram plots of the Bulgarian data sets: women above (\(n=1,\!370\)), men below (\(n=1,\!064\)). The darker a rectangle the more observations overlap this rectangle

Before conducting the linear LIR analyses, we have to set up the probability model by selecting the only model parameter \(\varepsilon \), and furthermore, we need to choose the quantile to be considered and the cutoff point \(\beta \). For simplicity, we here assume that the imprecise data are correct in the sense that the observed rectangles contain the correct precise values with probability one, i.e., we assume \(\varepsilon =0\). If we had concerns about the data quality or if we wanted to account for possibly wrong coarsening, a positive \(\varepsilon \) could be considered in the probability model. This would lead to more imprecise results of the LIR analyses, reflecting the fact that there is additional uncertainty. As the residual’s quantile to be minimized we consider the median (i.e., \(p=0.5\)) because it can be shown that the robust LIR method yields the most robust results in this case. Finally, we choose \(\beta =0.8\) as cutoff point for the likelihood-based confidence regions \({\fancyscript{C}}_{f}\) with \(f \in {\fancyscript{F}}\). This choice of \(\beta \) corresponds to an asymptotic confidence level of (approximately) \(50\) % for each \({\fancyscript{C}}_{f}\) (see Sect. 2.2).

The model parameter \(\varepsilon \), the LIR settings \(p\) and \(\beta \), as well as the idf-object to be analyzed are handed over to the s.linlir function, which determines the set \({\fancyscript{U}}^{\prime }\) by the introduced algorithm. In the current version of the linLIR package, the first part of the s.linlir function determines \({\overline{q}}_{LRM}\). After this, the range \([{\underline{b}},{\overline{b}}]\) of slope values for which there may be undominated functions is identified. This is done by exploiting the representation of the set \({\fancyscript{U}}^{\prime }\) as the union of finitely many polygons. As described in Sect. 3.3, the possible vertices of the polygons are situated at those values \(b \in {\mathbb {R}}\) at which the graphs of some of the functions \(b\mapsto {\underline{z}}_{b,i}-{\overline{q}}_{LRM}\) and \(b\mapsto {\overline{z}}_{b,i}+{\overline{q}}_{LRM}\) cross each other. The set of all these intersection points can be formulated similarly to the set \({\fancyscript{B}}\) in terms of the endpoints of the interval data, and thus, it can easily be determined. Considering these values together with the slopes \(b=0\), the smallest slope minus 100, and the maximum plus 100, ordering them by their size and starting from the smallest value, one can find \({\underline{b}}\) as the first of these slopes for which the corresponding set \(\bigcup _{j=1}^{n-{\scriptstyle \underline{k}}}[{\underline{z}}_{b,({\scriptstyle \underline{k}}+j)}-{\overline{q}}_{LRM},\,{\overline{z}}_{b,(j)}+{\overline{q}}_{LRM}]\) is not empty. Analogously, starting from the highest values and descending, the upper endpoint \({\overline{b}}\) is identified. If \({\underline{b}}\) corresponds to the smallest or \({\overline{b}}\) to the highest of the considered slopes, respectively, the set \({\fancyscript{U}}^{\prime }\) is unbounded. In this case, in the final part of the s.linlir function, the set of undominated functions is approximated only over a coarse grid of slope values ranging at most from \(-10^9\) to \(10^9\) (if unbounded on both sides). Otherwise, \({\fancyscript{U}}^{\prime }\) is approximated by determining the corresponding intercept values over a fine grid across the identified range of slope values. As already mentioned at the end of Sect. 3, the current version of the function s.linlir is not optimized for speed. The computations for the present analysis took about 2–10 min on a standard desktop computer, most of the time is needed for the first part of the algorithm, where \({\overline{q}}_{LRM}\) is determined.

The s.linlir function returns an object of the class “s.linlir”, a list object whose elements include the ranges of slope and intercept values in \({\fancyscript{U}}^{\prime }\), a data frame containing the intercept-slope combinations that represent the approximation of the set \({\fancyscript{U}}^{\prime }\), the bound \({\overline{q}}_{LRM}\), the analyzed data set, the used LIR settings, \({\underline{k}}\) and \({\overline{k}}\), etc. The linLIR package provides a print method and a summary method for these s.linlir-objects. To visualize the results, there is furthermore an associated plot method with three options, which are to plot only the LRM regression functions (typ=“lrm”), to plot a random selection of functions out of the set \({\fancyscript{U}}\) (typ=“func”), or to plot the entire set \({\fancyscript{U}}^{\prime }\) (typ=“para”). For Figs. 3 and 4 we used the latter plot type with the default option para.typ=“polygon” to display the results of the conducted linear LIR analyses, the black points indicate the LRM regression functions.

Fig. 3
figure 3

Sets \({\fancyscript{U}}^{\prime }\) for Finland: women on the left (\(n=967\)), men on the right (\(n=911\))

Fig. 4
figure 4

Sets \({\fancyscript{U}}^{\prime }\) for Bulgaria: women on the left (\(n=1,\!370\)), men on the right (\(n=1,\!064\))

The sets \({\fancyscript{U}}^{\prime }\) resulting from the LIR analyses of the data sets of women and men in Finland are displayed in Fig. 3. Both sets of parameter values are bounded and have a similar shape, admitting both lines with positive and negative slopes ranging approximately from \(-9.5\) to \(12\). For the sample of Bulgarian women, the shape of the obtained set \({\fancyscript{U}}^{\prime }\) is much different, as shown in the left part of Fig. 4. In this particular data set, there are \(687\) observations \([{\underline{x}}_i, {\overline{x}}_i] \times [{\underline{{y}}}_i, {\overline{y}}_i]\) such that \({\underline{x}}_i = -\infty \) and \([{\underline{{y}}}_i, {\overline{y}}_i] \ne {\mathbb {R}}\). A line with an arbitrarily high slope will always go through these observations at the lower end of the income range as long as the intercept is not too low, and conversely, a line with a negative slope will always intersect these observations if the intercept is not too high. As here \({\underline{k}}+1 = 673\), all lines intersecting these \(687\) observations are undominated. Therefore, the obtained set of undominated functions is unbounded, reflecting the high degree of imprecision inherent in this data set. Furthermore, we here observe the particular data situation discussed at the end of Sect. 3.2, where the set \({\fancyscript{U}}^{\prime }\) is not closed. (The borders at \(b=0\) are not included.) In the LIR results for the sample of men in Bulgaria, \({\fancyscript{U}}^{\prime }\) is not unbounded, but large, which is to some extent due to the almost 20  % of missing income values. In the right part of Fig. 4, we displayed only the middle section of \({\fancyscript{U}}^{\prime }\). Interestingly, in this LIR analysis we find three LRM regression lines. These lines can be characterized geometrically by the fact that the closed bands of width \(2\,{\overline{q}}_{LRM} = 4\) around them completely include at least \({\overline{k}}=543\) observations. In the present data set, there are only \(500\) observations bounded with respect to \(X\), therefore, only the band around a horizontal line can contain at least \(543\) observations. Hence, each of the three functions has slope \(0\).

The results of the LIR analyses do not give a clear answer to the question how an increase in income translates to subjective well-being. However, the obtained results are more or less in line with current research in this field, as there is no clear evidence about the direct relationship between these two variables. Some empirical studies in rich countries found only very weak positive effects of income on subjective well-being, while others even suggested a negative effect at the upper end of the income distribution (Diener and Biswas-Diener 2002). These two possibilities are also admitted by the LIR results for the Finnish data sets, admitting increasing and decreasing functions. In poorer countries, several studies found a strong positive effect, reflecting the fact that in these countries an increase in income is more often used to fulfill basic material needs, clearly improving the individual living standard (Diener and Biswas-Diener 2002). The LIR result for the sample of Bulgarian men admits more extreme slope and intercept values, while the data of the sample of Bulgarian women are too imprecise to obtain informative results.

5 Conclusion

In the present work, we considered the LIR approach to regression for imprecisely observed quantities (see Cattaneo and Wiencierz 2012, 2011). The result of a LIR analysis is in general set-valued: it consists of all regression functions that cannot be excluded on the basis of likelihood inference. These regression functions are said to be undominated. In this paper, we studied in particular the robust LIR method based on the residuals’ quantiles, in the special case of simple linear regression with interval data. For this situation, we proved that the set of all the intercept-slope pairs corresponding to the undominated regression functions is the union of finitely many polygons, and we gave an exact algorithm for determining this set (i.e., for determining the set-valued result of the robust LIR method). In particular, when the data are precise, the algorithm can calculate the (possibly infinite) set of all least median of squares regression functions, without any assumption about the data being “in general position”.

We have implemented this exact algorithm as part of the R package linLIR (Wiencierz 2013). In the present paper, we analyzed data of the fifth round of the ESS (Norwegian Social Science Data Services 2010) to illustrate the implementation of the robust LIR method in the linLIR package. The obtained results are in line with current research in the field. In addition to that, we showed that the algorithm has worst-case time complexity \(O(n^{3}\log n)\). In fact, the first part of the algorithm is related to the first exact algorithm for least median of squares regression, which has the same (asymptotic) worst-case time complexity (see Steele and Steiger 1986; Rousseeuw and Leroy 1987). This algorithm for least median of squares regression was then improved (see for example Souvaine and Steele 1987; Edelsbrunner and Souvaine 1990; Carrizosa and Plastria 1995; Mount et al. 2007) and extended to multiple linear regression (see for instance Stromberg 1993; Hawkins 1993; Watson 1998; Bernholt 2005). In future work, we intend to do the same with the algorithm for the robust LIR method (which can also be generalized to imprecise data other than intervals). In particular, the first part of our algorithm can be easily extended to the problem of multiple linear regression by adapting the ideas of Stromberg (1993) to the case of interval data.