Mining Interpretable Regional Co-location Patterns Based on Urban Functional Region Division

Yuqing Liu¹,
Lizhen Wang ORCID: orcid.org/0000-0003-2214-2299^1,2,
Peizhong Yang¹ &
…
Lihua Zhou¹

868 Accesses
Explore all metrics

Abstract

Discovering the correlation relationships of spatial facilities in urban functional regions is of great significance in analyzing urban planning and promoting urban development. However, existing regional co-location pattern mining methods do not fully consider the attributes of urban functional regions. Consequently, they fail to effectively capture the relationships between spatial facilities and regions, resulting in patterns lacking practical application value and interpretability. To address this issue, this paper proposes a novel regional co-location pattern (RCP) mining method based on urban functional region division. First, an ontology-based method for urban functional region division is proposed. On the basis of dividing urban functional regions, the expected mean distance is generated to adaptively calculate the neighbor relationships between instances in each region, and a parallel algorithm is designed to quickly mine RCPs in urban functional regions. Then, the correlation coefficient ($CC_{RP}$) measures the association strength between RCP and function type to provide interpretability for RCPs. Finally, extensive experiments on real-world datasets are conducted to verify the effectiveness of urban functional region partition and to demonstrate the superiority of the proposed RCP mining method.

Mining Regional High Utility Co-location Pattern

Ontology-Based Interactive Post-mining of Interesting Co-location Patterns

Discovering Interesting Co-location Patterns Interactively Using Ontologies

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Spatial facilities within urban ensure the ease of human production, circulation, and consumption, concurrently offering diverse services at a regional scale grounded in socio-economic activities [1,2,3]. As urban development continues, there is a continual increase in both the quantity and diversity of these spatial facilities. There is numerous valuable knowledge hidden in urban spatial data, and capturing this concealed knowledge for many applications is important [4]. As a vital branch of spatial data mining, spatial regional co-location pattern (RCP) refers to a subset of spatial features (spatial facilities), where instances of different features within the pattern frequently occur adjacent to each other in a local region. Mining RCPs aims to uncover patterns existing in a particular region, revealing information that may not be apparent at a global scale. For example, regions proximate to bars may display elevated crime rates compared to the entire city. Mining RCPs can unveil the correlation between spatial facilities in local regions [5, 6], facilitating the exploration of fine-grained urban spatial layouts and the optimization of urban development.

Although mining RCPs proves beneficial in elucidating the regularities within the internal spatial structure of urban. However, existing methods for RCP mining often fall short of adequately incorporating the influence of urban regional attributes. Typically, these methods rely on clustering or gridding to directly divide regions, ignoring the attributes of regions. Different regional division methods may lead to differences in the discovered RCPs, and the practical significance of the RCPs may vary with the region. The results of region division without considering regional attributes are depicted in Fig. 1a and b. Due to the clustering method only considering the spatial locations of the spatial facilities, it leads to the unreasonable partitioning results, as seen in Fig. 1a with the red encircled parts. These parts show distinct spatial characteristics from their surroundings but are grouped into a single region due to their density and spatial proximity. Such outcomes overlook interesting patterns within these regions, such as {Community, Pharmacy, Restaurant, Shop}. Similarly, grid partitioning methods encounter similar issues, as shown in Fig. 1b with the red encircled part. Moreover, grid partitioning methods typically divide the urban map into regular grids, but the boundaries of these grids may not be consistent with the actual geographical divisions (such as roads, administrative divisions, etc.). Furthermore, the lack of consideration for regional background information limits the interpretation and application of the mined RCPs. Functional attributes are the predominant regional characteristics in urban. These attributes delineate specific regions within urban areas, categorized by diverse human activities and uses [7]. Considering the functional attributes of regions can enhance the rationality of regional partitioning. At the same time, it avoids overlooking interesting patterns and increases the practical application value. As illustrated in Fig. 1c, the green parts represent mix-functional regions of residence and commerce, where the pattern C={Community, Pharmacy, Restaurant, Shop} prevalently occurs. Such regions cater to residents’ daily needs while providing convenient commercial services. The interaction and socialization between residents and commercial activities lead to the occurrence of pattern C. In terms of applications, it can help users in choosing a place of residence. Due to the different distribution of spatial facilities in different residential regions, the choice of the corresponding residents will also be different. The green regions in Fig. 1c are more oriented towards the needs of daily life, while the yellow region is more oriented towards educational and professional development needs. Therefore, to further improve the application value of RCPs within urban, it is necessary to deeply study the correlation between the attributes of urban functional regions and spatial facilities.

Existing methods of urban functional region division usually use blocks divided by user-specified grids [8, 9] or road networks as spatial units [10,11,12,13]. However, (1) the uniform grid size ignores the heterogeneity of spatial data distribution, and the division results produce large errors with the actual functional regions. (2) Using the spatial units divided by the road networks, it is easy to ignore the existence of multiple functional regions in the same block. To address these problems in functional region division, we propose an ontology-based functional region division method, which integrates domain knowledge (ontology) of urban POI data into the functional region division. Based on urban ontology, the semantic similarity of different spatial facilities in urban is measured, and a bottom-up adaptive segmentation of spatial units at the granularity of instances is proposed.

In the RCP mining phase, due to the density of instances that differ in different functional regions, such as the urban centers and suburbs, the measures of the neighbor relationships between instances within a functional region or between different functional regions become a top priority. To solve this issue, based on the expected mean distance in each functional region, an intra-regional neighbor relationship between instances in a region is defined, which can automatically adapt the data distribution and density changes in different functional regions. Meanwhile, to ensure the completeness of the neighbor relationship between instances, we propose an inter-regional neighbor relationship between instances in different functional regions. It helps to prevent the loss of neighbor relationships caused by functional region division.

Finally, we define the correlation coefficient between the discovered RCPs and their prevalent regions to further analyze the degree of support of the functional regions for the corresponding RCPs. Furthermore, to enhance mining efficiency, this paper introduces a parallel mining framework based on participating instances. To the best of our knowledge, this is the first to combine the functional attributes in urban to mine RCPs. Meanwhile, the first to consider intra-regional and inter-regional neighbor relationships between instances in the RCP mining.

The principal contributions of this paper are as follows.

First, we propose a method for identifying urban functional regions based on the ontology, and define the intra-regional neighbor relationship between instances within each region and the inter-regional neighbor relationship between instances of different regions, which helps to prevent the loss of neighboring relationships caused by the functional region division.
Second, a general mining framework for RCP mining in urban functional regions is presented. Within this framework, a parallel algorithm (PRCPM-FR) based on participating instances is designed. Additionally, the correlation coefficient between a regional type and a specific RCP is defined to measure the correlation of an RCP with different types of urban functional regions.
Finally, a case study demonstrates the effect of ontology-based urban functional region division, while conducting comprehensive experiments on real-world datasets to demonstrate the performance of the proposed PRCPM-FR algorithm.

The remainder of this paper is organized as follows. In Sect. 2, we review the related work of mining RCPs and the urban functional region division. Section 3 details the proposed concepts and the designed ontology-based urban functional region division method. Section 4 presents and analyzes the PRCPM-FR algorithm. Section 5 gives the experimental evaluation and Sect. 6 concludes this paper.

2 Related Work

Spatial co-location pattern (SCP) mining was first proposed by [14], and they suggested the participation index to measure the prevalence of SCPs. Based on the antimonotone property of the participation index, an Apriori-like mining approach was proposed, namely the join-based algorithm [15]. Subsequently, some approaches to improve the efficiency of SCP mining were proposed. For example, the partial-join algorithm [16] reduces the join operations by clique instance division, but this approach requires additional space to store the cut-severed neighbor relationships. There was also the join-less algorithm [17], which exploits the star instances to materialize neighbor relationships between instances and avoid the join operations, to further improve the mining efficiency. Because the SCP mining and the cliques search are inextricably linked, a series of clique-based SCP mining methods were proposed. [18] optimized the BK algorithm and introduced a hash table structure for storing clique instances to improve the SCP mining efficiency. [19] presents a highly effective solution for solving the problem of mining maximal co-location patterns using the Fraction-Score method. To make the neighbor relationships of spatial facilities more realistic, [20] proposes utilizing road networks of urban as constraints to define the neighbor relationship between instances using the shortest path algorithm. [21] introduces fuzzy theory to define the neighbor relationship between instances, thereby avoiding the issues associated with the traditional binary measures in neighbor measures. All the above methods used a globally uniform distance threshold to identify neighbor relationships between instances. However, due to the heterogeneity of the spatial data distribution, a fixed global neighbor distance threshold is difficult to effectively capture the neighbor relationships between instances. Therefore, a series of methods were proposed to avoid setting a uniform distance threshold. For example, Barua et al. [22] proposed a statistical validation-based method that employs multi-distance thresholds to mine SCPs throughout the study space. Yoo [23] proposed a nearest neighbor-based approach and a spatial autocorrelation method based on the theoretical K function. However, these statistical methods have high time complexity, and the neighbor relationships are only considered with a global perspective which ignores the heterogeneity in data. Qian et al. [24] proposed a k-nearest neighbor-based SCP mining method, where k neighbors replace the distance threshold, but determining the appropriate k value is also challenging. The adaptive generation methods based on the Delaunay Triangulation [25] use the geometric theory to determine neighbor relationships between instances. However, the triangular net generated by Delaunay Triangulation is locally optimal, which may lost some closer neighbor relationships to maintain the stability of the triangular net.

These traditional SCP mining approaches are designed to discover patterns that are prevalent in the global study area. Due to spatial heterogeneity, some patterns are prevalent only in local regions but not in the global study area. The traditional SCP mining methods fail to find regional co-location patterns (RCPs) that are only prevalent in local regions. Therefore, RCP mining has attracted a lot of attention, which aims to find the prevalent RCPs in local regions. The existing RCP mining methods are mainly divided into the following two categories: clustering-based methods and division-based methods.

Clustering-based approaches. The clustering-based approaches identify regions by clustering spatial instances or co-location pattern instances and calculate the participation index of patterns within regions to evaluate their prevalence. Mohan proposed a neighborhood graph-based method to discover regions for RCP mining [26]. Deng et al. [27] proposed a multi-level method to use global non-prevalent patterns as candidate RCPs, and then identify regions of RCPs by using the Delaunay Triangular adaptive clustering method. Liu et al. [28] proposed a multi-level co-location pattern mining approach based on natural neighbors to create a new local adaptive neighbor relationship, but this method is sensitive to the number of features. In [29], the road constraint was added to the k-nearest neighbors, and a heuristic two-stage approach was proposed to identify candidate regions of each RCP and determine whether it is a prevalent RCP by evaluating the statistical significance of patterns through a Monte Carlo simulation method. In [30], Jiang proposed to use a modified peak density clustering algorithm to determine the cluster center instances, and then use fuzzy set theory to generate multiple extremely fuzzy clusters as regions. Since the clustering-based approach only considers the concentration of instances, it may ignore some real regions that are not clustered, resulting in incomplete identification of prevalent regions.

Division-based approaches. The division-based approaches divide the whole study area into multiple regions, and then mine the RCPs within these regions separately. The main challenge for the division-based approaches is to seek out the spatial division schemes for regions, and existing studies proposed different regional division approaches. [31] used a supervised clustering algorithm to partition the geographic space into grid regions of arbitrary shape. Qian et al. [24] proposed a division method based on k-nearest neighbors. The original regions are initialized with $k=1$, and then it uses an iterative method to merge regions with similar characteristics to obtain the target region. But this method is not only computationally complex but also may ignore potential RCPs. Wang et al. [32] proposed a statistical framework (Frequentist and Bayesian) for mining RCPs and developed a heuristic method based on the probabilistic expansion to find arbitrarily shaped regions. To explore the regional co-location patterns with core features as well as their prevalent regions, [33] presents a regional partition method based on the influence of core features. Li et al. [34]suggests using the relative distribution coefficient to identify RCPs, and a quarter-iterative and regional combination strategy generates the regions where RCPs are prevalent. However, regional division may result in the loss of neighbor relationships among several regional instances.

The identification of urban functional regions is of great significance for urban planning and site selection. POI (point-of-interest) data reflects the spatial distribution of urban facilities and is widely used for identifying urban functional regions. Existing research on identifying urban functional regions based on POI data mainly includes: Hu et al. [9] proposed the POI frequency density index and the ratio analysis of functional types to analyze the functions of the grid units. [10] proposed a method to identify urban functional regions by coupling Open Street Map (OSM) and POI data. [11] proposed a new framework for identifying urban functional regions by constructing a Place2vec model. [12] used POI, OSM and other datasets to identify mix-functional regions and predict urban layout by the CA-Markov model. Huang et al. [13] established a comprehensive measure of functional intensity by using a combination of the POI data density index and average nearest neighbor index (ANNI). However, these methods of urban functional region division usually use OSM division or user-specified grids as spatial units. OSM division methods may overlook the inclusion of multiple functional regions within a block, which may lead to under-segmentation and cannot determine the functional type. Grid-based division methods employ single-sized grids as spatial units, which is inappropriate face to the spatial heterogeneity of data.

Therefore, aiming at the urban POI data, this paper proposes an RCP mining method in functional regions by combining the inherent background of functional regions with RCP mining. In this paper, we mine RCPs in functional regions and gain insights into the interrelationships among various spatial facilities within functional regions. We specifically propose an effective ontology-based functional region division method. For the neighbor relationships between instances in RCP mining, it is proposed to adaptively generate intra- and inter-regional neighbors with regional characteristics, which not only avoids the difficulty of setting appropriate distance thresholds, but also solves the problem of neighbor relationship loss caused by regional division.

3 Definitions for Mining RCPs

3.1 Notations

Let $F=\{f_1,f_2,...,f_n\}$ be a set of spatial features which represent different types of spatial facilities. The set of urban functional labels is denoted as $L=\{l_1, l_2,..., l_m\}$. A functional label can encompass multiple features, but each feature can only belong to one functional label. This characteristic ensures that each spatial feature is uniquely associated with a functional label, representing the primary function or purpose of the spatial feature. The right-hand table of Fig. 2 shows each spatial feature corresponding to a functional label, e.g., shop and restaurant are categorized as commercial functional label. Given an urban dataset O, for each spatial instance $o\in O$, o has a feature type, and instances with the same feature are assigned a unique id for distinction, so each instance is represented by a vector $\{f,id,location\{x,y\}\}$ ($f\in F$). For a functional region $FR_i$ with a functional type $L_i$ ($Li\subseteq L$), represent it as $FR_i$ = ($O_i, L_i$), where $O_i$ ($O_i\subseteq O$) is a set of regional instances. Given two instances o and $o'$, if the Euclidean distance between them is no larger than a given distance threshold d, a neighbor relationship $R(o, o')$ exists between the two instances. A spatial co-location pattern $C=\{f_1, f_2,..., f_k\} (C\in F)$, where the features are different and their instances frequently occur adjacent to each other in space. The size of C is the number of features in C. An instance set $I=\{o_1, o_2,..., o_k\}$ is called a row instance of C, if it satisfies such two constraints: (1) the set of features of all instances in I is C and the size of I is the same as C, (2) all instances of I have a neighbor relationship with each other. All instances in I are called participating instances.

In this paper, the calculation formula for the participation index of traditional global co-location patterns [15, 35] is modified to mining RCPs as follows:

Given an RCP C in region $\alpha$, the regional participation ratio (RPr) of feature f in C and the regional participation index (RPI) of C in the region $\alpha$ are defined as

$$\begin{aligned} RPr(\alpha ,C,f) =\frac{|Obj(\alpha ,C,f)|}{{RI(\alpha },f)}. \end{aligned}$$

(1)

$$\begin{aligned} RPI(\alpha ,C) =min^k_{i=1} (RPr(\alpha ,C,f_i )). \end{aligned}$$

(2)

Where the $|Obj(\alpha ,C,f)|$ denotes the number of participating instances of feature f in C within region $\alpha$, and the $RI(\alpha ,f)$ represents the number of instances of feature f in region $\alpha$. The RPI of C in this region is the minimum of the RPr of all features in C. Given a predefined participation index threshold $\theta$, if $RPI(\alpha ,C)\ge \theta$, we say the co-location pattern C is prevalent in the region α. Figure 2 shows the instance distribution of 6 features (i.e., Community, Shop, Restaurant, Company, Pharmacy, School) in the two regions. Each feature corresponds to a functional label. In Fig. 2, two instances connected by a solid line have a neighbor relationship. For example, $C_1=\{Community, Restaurant, Pharmacy\}$ is a size-3 RCP, and $RPI(Residence\ region\ 1,C_1 )=0.67$; $C_2=\{Community, School, Company\}$ and $RPI(Residence\ region\ 2,C_2 )=0.33$.

3.2 Identification of Functional Regions Based on the Ontology

3.2.1 The Concepts of the Ontology

The ontology is typically an explicit specification of a shared conceptual model, which aims at articulating knowledge of specific domains, encompassing concepts, entities, and their interrelations [36]. In this paper, the ontology is represented by a quaternion $ OR, OI, OH\}$, where OC is a set of ontology concepts, OR is a set of relationships defined over ontology concepts, OI is an instance set of ontology concepts, and OH is a directed acyclic graph (DAG) defined by the inclusion relation (is-a relation) between the ontology concepts. If the ontology concept $OC_2$ is contained in $OC_1$, then $OC_2$ is-a $OC_1$, and in the OH it is represented as a directed edge pointing $OC_1$ to $OC_2$. We utilize two types of ontology concepts: leaf concept and generalized concept [37]. Leaf concepts are leaf nodes in OH that do not contain other ontology concepts, while generalized concepts are described as ontology concepts that include other ontology concepts. The functional types of instances are categorized according to their respective features, while a functional label generally includes multiple spatial features, so the spatial features serve as leaf nodes in the ontology. Both intermediate categories and functional labels are generalized concepts. For example, Fig. 3 illustrates a construction of the POI data ontology, and the arrow represents the relationship of is-a, the two types of ontology concepts are leaf concepts: $\{A1, A2,..., Q2\}$, and generalized concepts: $\{POI, OC_1,..., OC_6, 1-1,..., 6-2\}$.

Utilizing the guidelines outlined in the Urban Land Classification and Planning Construction Land Standard (GB50137-2011),^{Footnote 1} this standard declares that urban functional types should be divided and categorized according to the actual use of spatial facilities, and provides a brief explanation of the meaning of each functional type. Based on this standard, we categorize POI data into six major functional types, as illustrated in Fig. 3, and provide a detailed explanation of each functional type in Table 1. These six functional types ensure comprehensive coverage of all spatial features in urban POIs. The specific mapping steps are as follows: First, collect urban POI data and map the spatial features representing the basic abstract concepts to the corresponding leaf concepts in the ontology. Second, analyzing similarities and common purposes between spatial features, and incorporate spatial features with similar purposes into the intermediate category. For example, the spatial features of museums and schools are grouped into the intermediate category of Techno-Culture, because of their educational and cultural role. Finally, based on the meanings of the functional types, assign the intermediate categories to their respective functional types. As illustrated in Fig. 3, according to the meaning of Green Space and Squares, by collecting their pertinent spatial features and analyzing their commonalities, they can be classified into the intermediate category of Parks and Scenic Spots. Parks typically encompass green areas within urban settings, offering citizens leisure and recreational spaces, exemplified by Park Square, Botanical Garden, and Zoo. Scenic Spots include natural landmarks of historical and tourist significance, such as Beaches, Temples, etc. The ontology displays the semantic links between diverse spatial features and functional types, allowing them to be classified and organized systematically. Furthermore, we use ontology to quantify the similarity of spatial units and propose an adaptive quadrant division based on ontology.

Table 1 The meaning of six major functional types

Full size table

3.2.2 Adaptively Quadrant Division of Spatial Units Based on Ontology

The accurate identification of urban functional regions relies on determining the spatial unit of functional regions, which poses a challenge in practice. To address this issue, this study treats heterogeneous cells as spatial units and proposes an adaptive quadrant division method based on the ontology. Based on the ontology concept of geographic domain, we incorporate geographic semantic information in the dividing process.

In real life, if there are similarities between different things, they must have some of the same characteristics or properties. Similarly in the spatial domain, the similarity of the four pre-quartered cells needs to be considered during the quadrant division to determine whether further division is required. Ontology can effectively express the relationship between different concepts. Therefore, by combining ontology and Jaccard distance [38], this paper proposes a metric (called ontological similarity) to measure the similarities of instance types for pre-quartered cells. Given the four pre-quartered cells c1, c2, c3, c4, their ontology itemsets are $I_{c1}$, $I_{c2}$, $I_{c3}$, and $I_{c4}$, respectively. The itemsets of these cells can be chosen at various levels within the ontology hierarchy based on user domain knowledge. As depicted in Fig. 3, ontology itemsets of feature or function labels can be selected. Herein, the focus is placed on the ontology itemsets of function labels. The ontological similarity of pre-quartered cells is defined as

$$\begin{aligned} OS = \frac{|I_{ c1}\cap I_{c2}\cap I_{c3}\cap I_{c4}|}{|I_{{c1}}\cup I_{c2}\cup I_{c3}\cup I_{c4}|}. \end{aligned}$$

(3)

A larger ontological similarity indicates that the functional label types of the instances within pre-quartered cells are more consistent. On the contrary, it indicates that the functional label types of the instances in pre-quartered cells vary differently, which requires further division. As shown in Fig. 4 (the dashed lines in the figure indicate pre-division, and the solid lines indicate the division), the ontology itemsets (functional labels itemsets) of the first pre-quartered cells are $I_{c1}=\{l_1, l_2, l_4\}$, $I_{c2}=\{l_1, l_3\}$, $I_{c3}=\{l_1, l_2, l_3, l_4\}$, $I_{c4}=\{l_1, l_2, l_3\}$, so the ontological similarity of this cell is 0.25.

Information entropy [39] is a fundamental concept of information theory. It describes the uncertainty of the occurrence of each possible event of the information source. When analyzing urban functional structures, the use of information entropy allows for the measurement of complexity and balance. It can quantitatively reflect the degree of mixture of functional attributes in a cell. In this paper, the information entropy gap of a pre-quartered cell is used to measure whether the pre-quartered cell needs to be divided. The calculation of the information entropy gap for each pre-partitioned cell c depends on the information within its four sub-cells $\{c1, c2, c3, c4\}$. In the case that the ontology similarity satisfies the threshold, if the information entropy gap of the current cell c reaches the threshold then there is no need to consider the division, otherwise it is divided into $\{c1, c2, c3, c4\}$, and then the information entropy gap of each cell is to determine whether it needs to be further divided. Information Entropy Gap of a pre-quartered cell c is formulated as follows:

$$\begin{aligned} H_{cj} = -{\sum \limits _{i=1}^{n}{P(OC_i)ln(P(OC_i))}}, j \in \{1,2,3,4\}. \end{aligned}$$

(4)

$$\begin{aligned} HD_{c}=max\{H_{c1},H_{c2},H_{c3},H_{c4}\}-min\{H_{c1},H_{c2},H_{c3},H_{c4}\}. \end{aligned}$$

(5)

Where $H_{c_j}$ is the information entropy of the cell $c_j$, n is the total number of the ontology types of functional labels in the cell $c_j$, $P(OC_i)$ is the probability of the ontology $OC_i$ (functional label, as shown in Fig. 3) in the cell $c_j$, and $P(OC_i)=\frac{OC_i}{\sum \nolimits _j^n{OC_j}}$, $\sum \nolimits _{i=1}^n{P(OC_i)}=1$; and $HD_c$ is the information entropy gap of the pre-quartered redcell c, which contains c1, c2, c3, c4. A larger information entropy gap indicates a larger mixing degree of instances within cell c, i.e., the more inconsistent the functional types of instances in c; while the smaller information entropy gap indicates that the functional mixing of instances of the pre-quartered cell c is consistent and requires no division. By introducing the ontology of urban, and measuring the semantic similarity of different spatial facilities in urban, we have achieved a bottom-up adaptive segmentation of spatial units at the granularity of instances. In this study, the ontological similarity threshold was set to 1, due to the relatively limited number of ontology categories for functional labels. Additionally, the information entropy gap threshold was set to 0.1 (Experimental results indicate that setting the information entropy gap threshold to 0.1 can effectively discern variations within the cell resulting from the adaptive partitioning of the entire area).

The ontological similarity measures the mixing degree of instance functional label types within a cell, while the information entropy gap measures the difference in the number of instance distributions of different functional types within the cell as a further evaluation of the mixing degree. The two metrics are complementary to determine whether further cell partitioning is required.

3.2.3 Identify Functional Types for Cells

Let $Cells=\{c_1, c_2,..., c_s\}$ be the set of all divided cells (spatial units) in the whole study area. To identify the main functional types of each cell, we need to consider not only the distribution of instances with functional labels in the entire area, but also to measure the influence between the different functional label instances within the cell. We utilize the expected mean distance [40], which represents the average distance between data points under the assumption of a random distribution. Instances tend to cluster when their distance is smaller than this expected mean distance. This clustering indicates stronger mutual influence and correlation, aligned with the first law of geography, that is, closer objects are more correlated. Hence, we use the expected mean distance to detect potential mutual influence between instances of distinct functional labels within a cell.

Definition 1

Given a cell $c\ (c\in Cells)$, the influence radius dc(c) of the cell c is defined as:

$$\begin{aligned} dc(c)=\frac{0.5}{\sqrt{\frac{n(c)}{A(c)}}}. \end{aligned}$$

(6)

The formula is derived from the expected mean distance [40]. Where n(c) is the number of instances in the cell c, A(c) is the acreage of the cell c. The dc(c) reflects the data distribution density of the cell.

Definition 2

Given a cell $c\ (c\in Cells)$ and a functional label l in this cell, if the Euclidean distance between an instance o with the functional label l and an instance $o'$ with other reference functional labels ${l'}\ (l\ne l')$ is not greater than dc(c), i.e., $d(o,o')\le dc(c)$, we say that they satisfy the functional neighbor relationship. Thus, the functional neighborhood size of the functional label l in the cell c is defined as

$$\begin{aligned} FNS(c,l)=\sum \limits _{o\in Obj(l)}{|\{o'|d(o,o')\le dc(c)\}|}. \end{aligned}$$

(7)

The larger the functional neighborhood size of a certain type of functional label, the stronger the impact of the instances with this functional label, and the greater the contribution of that functional type in the cell. Thus, instances with the functional type l are dominant (highly influential) in the cell, i.e., the functional property of the region is dominated by this functional type. For the example with four functional labels shown in Fig. 5, the influence radius for each cell is calculated. In cell B, the number of instances of functional labels $l_1$ and $l_3$ are both 10, but their influence on other instances is different, the centers of the black circles are the instances with functional label $l_1$, while the centers of the green circles are the instances with functional label $l_3$, the radius of these circles in cell B is dc(B), so the $FNS(B,l_1)=19$, $FNS(B,l_3)=3$. That is, the instances with the functional label $l_1$ have a greater influence on the instances of other functional labels in cell B than the instances with the functional label $l_3$.

Definition 3

Given a cell $c\ (c\in Cells)$ and a functional label l in this cell, the functional degree of the functional label l in the cell c is formulated as:

$$\begin{aligned} FD(c,l)=0.5\times \frac{FNS(c,l)}{N(l)}\sqrt{\frac{N(l)}{A(c)}}+0.5\times \frac{m(l,c)}{A(c)}. \end{aligned}$$

(8)

Where N(l) denotes the total number of instances with functional label l in the whole study area, and A(c) is the acreage of the cell c, m(l, c) denotes the number of instances with functional label l in the cell c. $\frac{FNS(c,l)}{N(l)}\sqrt{\frac{(N(l)}{A(c)}}$ is the ratio of the functional neighborhood size of instances with functional label l in the cell c to all instances with functional label l in the entire area, and influenced by the acreage of the cell c and all instances with functional label l. It provides insights into the spatial distribution and density of instances with the same functional label, highlighting their degree of aggregation within a specific cell; $\frac{m(l,c)}{A(c)}$ is the cell density of instances with functional label l in cell c and has often been applied to urban function assessment [13]. In summary, the FD index is influenced by two factors: Firstly, the density of functional label instances within the cell. Secondly, the presence of functional neighbors indicates the extent of its distribution influence. The sum of the two coefficients is set to 1. Through repeated experiments, it has been found that setting both coefficients to 0.5 with the highest accuracy.

Definition 4

Given a cell $c\ (c\in Cells)$ and a functional label l in this cell, the functional ratio of the functional label l in the cell c is defined as

$$\begin{aligned} FR(c,l)=\frac{FD(c,l)}{\sum \nolimits _{i=1}^n{FD(c,l_i)}}. \end{aligned}$$

(9)

Where $\sum \nolimits _{i=1}^n{FD(c,l_i)}$ is the sum of the functional degree of all functional types in the cell c.

In each cell, there is a set of instances with different functional labels, and the functional ratio is used to determine the functional property of each cell. If there is a certain type of functional label significantly higher than other functional labels in a cell, we consider this cell as a single-functional cell. Otherwise, the cell is regarded as a mix-functional cell. Specifically, taking 50% as the threshold, if the functional ratio of a certain type of functional label is greater than 50%, this cell is considered as a single-functional cell, and the functional type is the functional label with the largest proportion. If the functional ratio of all functional labels is less than 50%, the cell is considered as a mix-functional cell, and the first two functions with the highest functional ratio are taken as its mixed function. For example, in Fig. 5, there are 15 instances with functional label $l_1$, 11 instances with functional label $l_2$, 14 instances with functional label $l_3$, 17 instances with functional label $l_4$, and we assume the acreage of these four cells is 4. In cell B, $FNS(B,l_1)=19$, $FNS(B,l_2)=10$, $FNS(B,l_3)=3$, $FNS(B,l_4)=6$. The functional degree of each functional label is calculated by Formula (8), i.e., $FD(B,l_1 )=0.5\times \frac{19}{15}\times \sqrt{\frac{15}{2^2}}+0.5\times \frac{10}{4}=2.4764$, $FD(B,l_2)=1.379$, $FD(B,l_3)=1.45$, $FD(B,l_4)=0.988$. The final functional ratio for each functional label is obtained, i.e., $FR(B,l_1)=39.34\%$, $FR(B,l_2)=21.9\%$, $FR(B,l_3 )=23.04\%$, $FR(B,l_4 )=15.7\%$. All functional ratios of functional labels are less than 50% in cell B, so cell B is a mix-functional cell and its functional type is $l_1l_3$.

3.2.4 Merging Functional Cells

A reasonable functional region should demonstrate a relatively consistent functional type. Once the functional type of each cell is determined, it is crucial to investigate whether adjacent cells share the same functional type. Subsequently, we merge the adjacent cells that match the same functional types to obtain the final functional regions.

The flowchart of identifying functional regions based on the ontology is shown in Fig. 6. In the first step, an ontology of the POI data is constructed to classify all spatial features into functional labels. The second step is to adaptively divide the whole study area into heterogeneous cells based on the ontology. Third, the functional ratio of each functional label in each cell is calculated according to Definition 4, and then determined whether the cell is a single-functional cell or a mix-functional cell. Finally, the cells with the same functional types and adjacent to each other are merged to form the final functional regions.

3.3 Measure of RCPs

In this section, we present our measure method for prevalent RCP mining in the divided urban functional regions. One of the important concepts in RCP mining is the spatial neighbor relationship between instances, and the traditional methods require users to specify a global distance threshold. However, setting an appropriate distance threshold is difficult. If the distance threshold is too large, too many worthless RCPs will be mined. If the distance threshold is too small, some neighbor relationships will be lost, resulting in the loss of useful RCPs. Due to the heterogeneity and autocorrelation of spatial data distribution, different functional regions exhibit varying data distributions. Consequently, the neighbor distance thresholds should be adjusted accordingly across different regions. Thus, we propose the intra-regional neighbor distance and the inter-regional neighbor distance to adaptively calculate the neighbor distance thresholds between instances in regions and between different regions.

Definition 5

Given a functional region $\alpha$, the intra-regional neighbor distance of the functional region $\alpha$ is defined as:

$$\begin{aligned} Dintra(\alpha )=\frac{0.5}{\sqrt{\frac{n(\alpha )}{A(\alpha )}}}. \end{aligned}$$

(10)

Where $n(\alpha )$ is the number of instances in the functional region $\alpha$, and $A(\alpha )$ is the acreage of this functional region. This formula provides a way to automatically calculate the neighbor distance threshold in the region by considering the instance density, so that the threshold can be adjusted according to the internal characteristics of the region, which is more in line with the actual situation.

The $Dintra(\alpha )$ denotes the average distance of the hypothetical randomly distributed data. If the Euclidean distance between instances with different features located in the functional region $\alpha$ is less than or equal to $Dintra(\alpha )$, we consider that the two instances have an intra-regional spatial neighbor relationship, i.e., given two instances o, $o'$ located in region $\alpha$, if $d(o,o')\le Dintra(a)$, o and $o'$ are intra-regional neighbors to each other.

Definition 6

Given two adjacent functional regions $\alpha$ and $\beta$, the inter-regional neighbor distance between two regions is defined as:

$$\begin{aligned} Dinter(\alpha ,\beta )=min\{{Dintra(\alpha ), Dintra(\beta )}\}. \end{aligned}$$

(11)

To ensure the completeness of neighbor relationships, it is necessary to consider whether neighbor relationships exist between instances of two adjacent regions. For the instances in two different regions, since the distance threshold within each region is different and to ensure the reflexive and symmetric of the neighbor relationship, the inter-regional neighbor relationships need to satisfy the intra-regional neighbor relationships of different regions at the same time, i.e., if $d(o,o')\le Dinter(\alpha ,\beta )$, where instance o is located in region $\alpha$ and instance $o'$ is located in region $\beta$, we say that o has an inter-regional neighbor relationship with $o'$. As shown in Fig. 7, there are two different functional regions Region a, Region b. The intra-regional neighbor distance of Region a and Region b is $d_1$, $d_2$, respectively. The inter-regional neighbor distance between Region a and Region b is $d_1$. The intra-regional neighbors of instance A.6 are {B.2, B.3, C.2}, and its inter-regional neighbors are {C.3, C.4}.

Definition 7

Given a functional region $\alpha$, all instances in $\alpha$ and the instances having inter-regional neighbor relationships with the instance in $\alpha$ are called regional instances of region a.

For example, in Fig. 7 the regional instance set of Region a is {A.1, A.2, A.3, A.4, A.5, A.6, B.1, B.2, B.3, C.1, C.2, C.3, C.4}.

Different RCPs often occur in different functional regions. Therefore, we define the correlation coefficient between region and pattern ($CC_{RP}$) to measure the strong correlation of RCP in different types of functional regions as follows.

$$\begin{aligned} CC_{RP}(C,L_i)=\frac{PI(C,L_i)}{SumPI(C)}. \end{aligned}$$

(12)

$$\begin{aligned} PI(C,L_i)=\sum \limits _{r\in Region(L_i)}{RPI(r,C)}. \end{aligned}$$

(13)

$$\begin{aligned} SumPI(C)=\sum \limits _{i=1}^n{PI(C,L_i)}. \end{aligned}$$

(14)

The $Region(L_i)$ represents the set of all regions with the functional type $L_i$. $PI(C,L_i)$ is the sum of the regional participation index of an RCP C in all prevalent regions with functional type $L_i$. SumPI(C) refers to the sum of the regional participation index of RCP C in all types of functional regions, and $0\le CC_{RP}(C,L_i)\le 1$.

Definition 8

Given a strong correlation threshold $\mu$, if $CC_{RP}(C,L_i)\ge \mu$, we say C is a strong regional correlation RCP (SRCP) with functional regions of type $L_i$.

The value of $CC_{RP} (C,L_i)$ can reflect the correlation between the RCP C and the regions of functional type $L_i$. When the $CC_{RP} (C,L_i)$ is larger, it indicates that the functional type $L_i$ has a high correlation with the corresponding RCP C. In other words, this type of functional region supports the RCP C. In this way, we can use the function region types to analyze the reasons why RCPs are prevalent and the regularity of the distribution of spatial instances within the functional region.

4 Algorithm of PRCPM-FR

4.1 Mining Algorithm

To mine all RCPs in urban efficiently, we adapt the CPM-Col algorithm [35] to fit the measure of the regional participation index and perform the mining task in each functional region. The CPM-Col algorithm is one of the most efficient algorithms in mining SCPs. Similar to the Apriori-based method, the CPM-Col algorithm searches for SCPs size-by-size and calculates the participation index of candidates by the participating instance search. CPM-Col algorithm saves time and space for SCP mining to a great extent, since it avoids the generation and storage of pattern instances. To further enhance the efficiency of the PRCPM-FR algorithm, we implemented a multi-threaded parallel mining algorithm. First, both the neighbor relationship calculation and the CPM-Col algorithm can be performed independently in each functional region, which allows for the parallel acceleration of RCP mining at the granularity of functional regions. Parallelization requires consideration of load balancing among threads. Since the main memory consumption and running time of the CPM-Col algorithm are generally positively correlated with the number of instances, the thread allocation problem is converted as follows:

For the x functional regions of the whole study area $FRs =\{FR_1, FR_2,..., FR_x\}$, they are partitioned into t independent subsets, i.e., $\{FRs_1, FRs_2,..., FRs_t\}$ (Suppose there are t threads). In order to achieve load balancing of threads, the objective function J needs to be minimized, where the objective function J is defined as follows:

$$\begin{aligned} J=\sum \limits _{i=1}^t{\sum \limits _{j=i+1}^t}{|\Gamma (|FRs_i|)-\Gamma (|FRs_j|)}|. \end{aligned}$$

(15)

where $\Gamma$ is a mapping function from the task volume (the number of instances) to the CPM-Col running time. The optimization problem of objective function J has been proven to be NP-hard. The main objective of this paper is to balance the thread workload as far as possible to improve the efficiency of the algorithm. Therefore, this paper uses the task volume as the only factor in memory and running time consumption, and the task volume is allocated to the thread with the lightest amount of current tasks to achieve a greedy strategy. The specific parallel algorithm is as follows:

The pseud-code of RCP parallel mining in functional regions (PRCPM-FR) is presented in Algorithm 1. First, an ontology of the urban data is constructed to classify all spatial features into functional labels, and the whole study area is adaptively divided into cells based on the ontology. To obtain the functional type of each cell, it calculates the functional ratio of each functional label in each cell by Formula (9) (lines 1–2). Line 3 merges adjacent cells that have the same functional type to compose functional regions with arbitrary shapes. Second, the neighbor relationships are calculated for all instances within and between regions according to Formulas (10) and (11), and the regional instances are obtained by Definition 7 (line 4). In line 5, the mining task is partitioned by using a greedy strategy, as detailed in Algorithm 2. Then, the CPM-Col algorithm is executed in parallel for each task (lines 6–9). Finally, calculating $CC_{RP}(C,L_i)$ according to Definition 8 to identify SRCPs (lines 12–16).

Algorithm 2 illustrates the details of partitioning thread tasks. In line 1, initialize an array B. Subsequently, the functional regions of the whole study area are grouped for task allocation. Each functional region is assigned to the thread with the lightest current workload, following a greedy strategy. Finally, the thread-grouped tasks are returned.

4.2 Algorithm Analysis

Time complexity analysis. In the algorithm PRCPM-FD, the time complexity is analyzed concerning various parameters. These include the total number of instances n, the number of cells m, the average number of instances in each cell n/m, and the number of regions $r\ (r\ll m)$. Firstly, in steps 1–3 of algorithm 1, identifying functional regions incurs a time complexity of $O(m\times (n/m)^2)$. Secondly, when computing neighbor relationships in step 4, considering neighbor relationships for each region is $O((n/r)^2)$, though practical techniques like plane scanning reduce this significantly. Thirdly, assuming the time complexity of CPM-Col for each region is O(T), influenced by the number of features |F|, the time complexity in steps 6–10 is $O(T\times r)$. Finally, for identifying SRCPs in steps 12–16. Assuming the number of final discovered RCPs is denoted as |RC|, due to the prevalence threshold setting, $|RC|\ll 2^{|F|}$. |L| represents the number of functional region types, and the time complexity is $O(|RC|\times |L|)$. Consequently, the overall time complexity of algorithm PRCPM-FD is $O(n^2/m+n^2/r+T\times r+|RC|\times |L|)$.

Lemma 1

The PRCPM-FR algorithm is complete.

Proof

The completeness of the PRCPM-FR can be guaranteed by the following three parts. (1) In Algorithm 1, steps 1–3 of identifying functional regions do not ignore any space where containing instances. (2) Step 4 guarantees a complete neighbor relationship, i.e., the intra-regional neighbors and inter-regional neighbors guarantee the completeness of neighbor relationships in each region. (3) In steps 5–9, the CPM-Col algorithm can discover all prevalent RCPs in each functional region.

Lemma 2

The PRCPM-FR algorithm is correct.

Proof

In steps 5–9 of Algorithm 1, it is correct to mine prevalent RCPs by CPM-Col in each functional region, i.e., the prevalent RCPs are correctly found in each functional region and the correct regional participation index of each prevalent pattern is saved. Subsequently, in steps 12-16, the PRCPM-FR algorithm selects only those RCPs with the $CC_{RP}$ exceeding the user-specified threshold in certain types of functional regions, thus identifying them as SRCPs. In summary, the PRCPM-FR algorithm is correct.

5 Experimental Evaluation

In this section, extensive experiments are conducted on real-world datasets to verify the accuracy of identifying urban functional regions and the effectiveness and efficiency of mining prevalent RCPs. In the experimental part, substitute the function type with its first word. The algorithms in the experiments are implemented by C++. The hardware environment is Intel Core i7 3.70GHz CPU, 16GB RAM. The running environment is Microsoft Windows 10, Clion2021.

The real-world dataset^{Footnote 2} 1 is the POIs of Wuhua District in Kunming city, which includes 17 features and 24,145 instances, and the number of feature instances ranges from 19 to 1,3055. Another dataset 2 is from Baoan District, Shenzhen, consisting of 20 features and 161,571 instances.

5.1 Effectiveness of the Urban Functional Region Division

5.1.1 Accuracy Evaluation of the Urban Functional Region Division

To assess the accuracy of the urban functional regions identified by our proposed method, we conducted a comprehensive comparison between these regions and the real attributes extracted from the Gaode Map. Because our real datasets are sourced from Gaode Maps, and due to the large number of regions in the entire city, directly determining their functional attributes would require a significant amount of manual effort. Therefore, we combined sampled regions from Gaode Maps with urban planning maps as the ground truth to validate the effectiveness of our method. A total of 100 functional regions were randomly selected, and the accuracy of the identification results of urban functional regions was evaluated by conformity scoring [10, 41]. The full score is 3, that is, complete compliance, and 0 is complete noncompliance. If the single-functional region is identified as a mix-functional region of one type, it is 2, and the rest is 1. If the mix-functional region is identified as a single-functional region that contains one function type of this mix-functional region or a mix-functional region, that contains one function type of this mix-functional region, it is 2, and the rest is 1. The accuracy of the identified functional regions is calculated as $accuracy=\sum \nolimits _{i=1}^n{\frac{x_i}{X_i}}$,where n is the sampling number, $x_i$ is the conformity score for each urban functional region, and $X_i$ is the full score for each urban functional region.

Table 2 Accuracy comparison of functional region division methods

Full size table

To validate the accuracy of the proposed functional division method, the experiments were conducted with Method 1 [9] and Method 2 [10] as comparison approaches. Method 1 requires the user to specify the basic cell size. The selection of the basic cell size should be based on the actual city. So we re-experiment the basic cell according to the actual urban, and select the most suitable cell size from 100 m to 500 m with 50 m as the increase. After repeated experiments, we selected a 250 m grid as the basic cell in Wuhua District, Kunming, and a 100 m basic cell in Bao’an District, Shenzhen. On the other hand, Method 2 divides the basic spatial units based on road networks. The accuracy assessment of the three methods, as shown in Table 2, shows that the region division method proposed in this paper achieves the highest accuracy on both real-world datasets. Method 1 using a uniform cell size may ignore the spatial heterogeneity distribution of the data. If the cell size is too large, the single-functional regions may be identified as mixed ones, resulting in an insufficient number of identified regions and coarse precision in delineating boundaries. Conversely, if the cell size is too small, it may mistakenly classify mix-functional regions as single-functional regions, resulting in an excessive number of regions and making it challenging for users to determine an appropriate uniform cell size. Method 2 based on road networks primarily assumes that regions within a block are of the same functional type. However, in reality, a block may contain multiple types of functional regions, leading to reduced accuracy. In contrast, the ontology-based adaptive cell partitioning proposed in this paper takes into account both the semantic correlation and the spatial heterogeneity distribution of the urban data, ensuring appropriate unit sizes and considering the influence of different functional instances within different cells, thereby enhancing the accuracy of functional region identification.

Table 3 Results of functional region division (/mix-functional type)

Full size table

After evaluating the accuracy of the functional region division method, we use these results as a basis for an in-depth analysis of the layout of the functional regions of these urban to provide a deeper understanding of urban planning and development. Table 3 shows the results of their divided functional regions and shows their maps in Fig. 8. It can be seen from Table 3 that there are 21 types of functional regions (6 single-functional types and 15 mix-functional types). In the Wuhua District of Kunming, the single functional type with the largest region is the industrial region, followed by the commercial region. It is worth noting that the industrial and commercial regions are often adjacent, and this spatial layout is designed to promote the co-development of industry and commerce. The existence of industrial regions provides raw materials and products for commercial regions, while commercial regions provide sales and distribution channels for industrial parks, forming an organic connection of industrial chains. In addition, the largest mix-functional type is the industry/transport regions designed to facilitate commodity mobility. This proximity layout helps optimize resource allocation and increase economic efficiency. In Baoan District, Shenzhen, the largest single functional type is the commercial region, and the distribution of land for business facilities is relatively concentrated, which reflects the vitality of Shenzhen’s commercial development as an important economic center. The centralized layout of business districts may provide enterprises with a better business environment and resource support, attract more merchants and customers, and promote the prosperity of business activities. On the other hand, the largest mix-functional type is the administration/commercial region. Such functional regions combine management institutions and commercial facilities, contributing to the tightened connection between administrative and commercial activities. This close integration can improve administrative efficiency, promote collaboration between government and business, and benefit the development and management of the urban. On the whole, the layout of Baoan District reflects the vitality of urban economic development and the pursuit of efficient management.

5.1.2 Scalability Analysis of the Urban Functional Region Division

In the stage of identifying functional regions, we conducted scalability experiments on the variation of the number of instances. The experimental data was sourced from a random sampling of data from Bao’an District, Shenzhen, with the number of instances ranging from 20,000 to 100,000. As depicted in Table 4, although all algorithms exhibit an increase in time consumption with the number of instances, Method 1 proves to be the most time-efficient. This may be attributed to its straightforward cell-based approach, although it demonstrates the lowest accuracy as shown in Table 2. Method 2 employs road network segmentation, which offers higher accuracy compared to Method 1, albeit at the expense of increased time consumption due to road network analysis. In contrast, our method achieves the best accuracy while maintaining good scalability.

Table 4 Impact of number of instances on the running time(s)

Full size table

5.2 Analysis of Mined RCPs

5.2.1 Analysis of SRCPs

Table 5 SRCPs correlated with it functional region type

Full size table

The presence of RCPs is closely related to their regional environment. In this paper, SRCPs refer to RCPs strongly correlated with a particular functional region type. Experiments were conducted on the two real-world datasets with a strong correlation threshold set to 0.5 and the partially mined SRCPs are listed in Table 5. It can be observed that these results fit well with the actual scenario, providing evidence for reasonable interpretations of the mined patterns. For example, {Hotel, Chinese Restaurant, Supermarket} indicates the tightened connection between business residential districts and catering service facilities in the residence regions. This association reflects the demand for catering services in residential regions.{School, Community, Hotel, Chinese Restaurant, Foreign Restaurant} further enrich the specific pattern in the residence regions. Schools and Community are usually important parts of the residence regions, interwoven with the business district and catering service facilities to form a comprehensive residential region. This pattern provides urban planners with layout recommendations for educational and living facilities in residence regions. Simultaneously, the functional type of regions provides an interpretability for the occurrence of the identified RCPs. The RCPs discovered in this study exhibit a strong correlation with the background of the corresponding functional regions. By analyzing the correlation between patterns and functional regions, we can gain insight into the causes of pattern formation. This helps to reveal the correlation mechanisms and spatial interactions between different RCPs and functional regions.

5.2.2 Differential Layout of the Same Type of Functional Regions

Under the premise of considering different functional attributes of urban, we analyzed the spatial facilities of the same type of functional regions in Wuhua District, Kunming and Baoan District, Shenzhen. Taking the type of residential region as an example, the experimental results reveal that even for functional regions of the same type, their internal spatial facilities present a rich variety of characteristics. First, for residential regions that favor convenient services, the surrounding facilities mainly cover chinese restaurant, shopping malls. This layout focuses on providing the necessities of life, offering residents convenient shopping and eating options. The residential region 2 in Fig. 9a is biased towards convenience services. Secondly, the layout of the residential regions in favor of education and employment development presents a more diversified character. Surrounding facilities include educational and cultural services, companies, and living services. Just like the residential region 1 in Fig. 9a. The residential region has schools and scientific research units, as well as a concentration of companies, providing residents with a wealth of educational and employment resources. Moreover, the assessment of comprehensive liveability can also take into account the richness of spatial facilities within the residential region. Taking a residential region 1 in Baoan District of Shenzhen as an example, compared with the other three residential regions, its surrounding facilities show more diversified characteristics, which not only satisfy the shopping needs of residents, but also provide comprehensive medical and educational services. This diversified layout of spatial facilities helps to enhance the comprehensive liveability of the neighborhood and provides residents with a more comprehensive and convenient living experience. Overall, considering the analysis of RCPs within the functional regions can help users make more reasonable regional selection decisions, trip planning, and so on. By understanding the layout characteristics of different functional regions, users can more accurately identify locations that meet their individual needs, thereby improving their quality of life. At the same time, a comprehensive understanding of the characteristics of urban development facilitates effective urban planning and sustainable development.

5.2.3 Comparison with other RCP Mining Methods

To verify the effectiveness of the proposed method for mining RCPs in different functional regions, we conduct experiments on the two real urban POI datasets. We select these real urban datasets and verify whether different methods of mining RCPs can accurately discover the known RCPs. KNNG [24] and FDPC-PRCPM [30] are used as the comparison methods. The two methods were both aimed at mining RCPs, but they did not take into account the functional attributes of the regions. Prior to comparing the results, a post-processing step was performed on the outcomes obtained from these two methods. The comparison methods employed different approaches for region division, namely partitioning based on similar RCPs and clustering. Subsequently, functional attributes were assigned to these regions using the method proposed in this paper, allowing for a comparative analysis. In terms of experimental evaluation, we introduce the F1-score as the main evaluation index [42]. The F1-score is a harmonic average of precision and recall, integrating the accuracy and recall of the results to fairly compare the performance between different methods. The specific calculation formula is as $F1\_score=\frac{2\times (precision\times recall)}{(precision+recall)}$.

To comprehensively validate the discovered RCPs, we conducted experiments with a prevalence threshold set to 0.2. The results are depicted in Fig. 10, and it is noted that the unrecognized region types from both comparison methods are not displayed. It can be observed that these two comparison methods are unable to effectively identify single-functional regions. This limitation arises from their lack of consideration for the intrinsic attributes of the regions, focusing solely on the spatial characteristics of instances, leading to coarse regional divisions. Therefore, we compared the identified mix-functional types of regions. Notably, our proposed method consistently achieves the highest F1 score among all types of functional regions. In contrast, the KNNG method employs constraints to extend regions with the same type of co-location candidates to ultimately obtain the target region. Therefore, the regions formed by this method may be different from the functional regions, and the method does not consider the neighboring relationships between instances of different regions. Consequently, the KNNG method cannot fully mine the RCPs associated with the functional type. Similarly, FDPC-PRCPM method utilizes a clustering approach to form regions, focusing mainly on the distribution and density of the data, ignoring the semantic information of instances within the region. Therefore, the preset RCPs cannot be mined correctly and completely by FDPC-PRCPM. In summary, the outstanding performance of our method in mining RCPs is attributed to its ability to efficiently identify the functional regions and capture their underlying semantic information.

Table 6 Accuracy comparison of functional region division methods

Full size table

Compared with the comparison method, this method can discover more fresh RCPs, and some of them are shown in Table 6. By mining these fresh patterns of RCPs, we can better understand the correlations between different functional regions, providing valuable insights and guidance for urban planning. For example, an important finding in the mix-functional region of Residence/Commercial is that there is a strong correlation between community, shopping malls, hospital, elderly care facility. This means that when planning and laying out commercial regions, close connection with community and healthcare facilities should be fully considered to meet the growing needs of residents and businesses. To achieve people’s convenience, the design of urban regions should focus on the interaction between commercial and residential regions, and ensure the reasonable distribution of medical care resources to meet residents’ needs for health services.

5.3 The Scalability Analysis of PRCPM-FR Algorithm

Next, the scalability analysis of the proposed algorithm is explored. Although the mining idea of PRCPM-FR is adapted from CPM-Col [35], PRCPM-FR is used to discover RCPs within functional regions and CPM-Col is to mine global co-location patterns. Therefore, instead of comparing the mining efficiency with CPM-Col, we compare it with other RCP mining methods [30, 43]. The effects of the number of instances, the number of features, and the number of threads on the running time are considered respectively. The experimental datasets are obtained by randomly sampling POI data from the dataset of Baoan District, Shenzhen. When testing the number of instances, we selected datasets comprising 20 features, with the number of instances ranging from 20,000 to 100,000. Regarding the analysis of feature impact, datasets with 20,000 instances were chosen, with the number of features varying from 5 to 20.

Table 7 The running time (s) of different methods for mining RCPs on real datasets

Full size table

In this experiment, the prevalence threshold was set to 0.4. Figure 11a illustrates the impact of the number of instances on the overall runtime of different algorithms. With an increase in the number of instances, more neighbor relationships are generated, leading to an increase in algorithmic time overhead. As observed in Fig. 11a, our proposed method demonstrates superior time efficiency compared to the other two methods. Meanwhile, we also tested the algorithm efficiency of different methods for mining RCPs on two real datasets, as shown in Table 7. From Table 7, it can be observed that our method still exhibits optimal time efficiency even under single-threaded conditions. This is attributed to the fact that the other two methods, with the increase of neighbor relationships, incur more time-consuming for table instances generation, and require the exhaustive verification of all instances for each feature in candidate patterns. In contrast, our approach to mining RCPs based on participating instances eliminates the need to generate and maintain table instances, significantly reducing the search space. Overall, the efficiency of our algorithm is validated on both sampled data and real datasets.

Figure 11b shows the effect of the number of features on the algorithm. An increase in the number of features leads to a larger size of the mined RCPs, which also increases the time consumption in each region, leading to an increase in the overall algorithm time. The PRCPM-FR also shows good scalability compared to the other two methods.

In this experiment, the prevalence threshold was set to 0.4, and the number of threads was set to 2, 4, 8, and 16 respectively. Figure 12a shows the effect of the number of instances on the overall running time of the algorithm. As the number of instances increases, more neighbor relationships are created, leading to an increased algorithmic time overhead. Figure 12b shows the effect of the number of features on the algorithm. An increase in the number of features leads to a larger size of the mined RCPs, which also increases the time consumption of CPM-Col in each region, leading to an increase in the overall algorithm time. Both experiments show that the parallel algorithm has an advantage over the non-parallel algorithm in terms of time efficiency. However, despite the exponential increase in the number of threads, the running time of the algorithm decreases linearly. There may be several reasons for this: (1) the method proposed in this paper for dividing the thread tasks is based on a greedy strategy and is not an optimal solution; (2) the longest thread execution time determines the overall running time of the algorithm; (3) the same number of instances within a region does not necessarily result in the same number of neighbor relationships.

6 Conclusions and Future Work

In this paper, we proposed a method for mining RCPs based on urban functional region division. Our method takes into account the functional attributes of regions throughout the RCP mining process, contributing to the application value and interpretability of mining results. Furthermore, we proposed an adaptive method for intra- and inter-regional neighbor relationship identification. This method not only overcomes the challenge of setting appropriate distance thresholds but also resolves the issue of losing neighbor relationships due to regional division. Extensive experiments on real-world datasets demonstrated the effectiveness of our approach in accurately identifying urban functional regions. Comprehensive comparisons with existing RCP mining algorithms validated the efficiency of the PRCPM-FR algorithm in mining RCPs. Through analyses of the relationship between mined RCPs and functional region types, we discover the association relationships of spatial facilities within functional regions and further support the practical application value and interpretability of RCPs within functional regions. Overall, this study enhances our understanding of the relevance of RCPs and urban functional regions, and provides potential applications to urban planning and management.

This study still has some limitations and requires further refinement. Specifically, in the identification of functional regions, traditional methods tend to only select the first two functional types in mixed-functional regions. When the proportions of the second and third functional type are similar, the third functional type may be overlooked. In future work, it is crucial to further consider the appropriate number of functional types in mixed-functional regions. Additionally, our paper lacks depth in considering parallelization. There is a need for an in-depth study of load balancing of algorithms. In the future, we plan to address this limitation by integrating the effects of the number of instance neighbors, RCP size, and number of instances on load balancing. This will involve further evaluating the actual task allocation expectation of each thread to reach the optimal algorithm. Finally, in this study, we mined RCPs that cover all functional types, which may overlook users’ specific interests. Future research directions could consider mining user-preferred RCPs to meet their personalized needs. We anticipate that these improvements will make the study more comprehensive and in-depth, providing more practical and feasible guidance for mining RCPs.

Availability of Data and Materials

The data that support the findings of this study are available at the following websites: https://opendata.pku.edu.cn/dataset.xhtml?persistentId=doi:10.18170/DVN/WSXCNM.

Notes

https://www.planning.org.cn/law/uploads/2013/1383993139.pdf.
The real datasets is derived from https://opendata.pku.edu.cn/dataset.xhtml?persistentId=doi:10.18170/DVN/WSXCNM.

References

Xu R, Yue W, Wei F et al (2022) Density pattern of functional facilities and its responses to urban development, especially in polycentric cities. Sustain Cities Soc 76:103526. https://doi.org/10.1016/j.scs.2021.103526
Article Google Scholar
Sun T, Xu J, Hu C (2022) An efficient algorithm of star subgraph queries on urban traffic knowledge graph. Data Sci Eng 7(4):383–401. https://doi.org/10.1007/s41019-022-00198-0
Article Google Scholar
Dai S, Yu Y, Fan H et al (2022) Spatio-temporal representation learning with social tie for personalized poi recommendation. Data Sci Eng 7:44–56. https://doi.org/10.1007/s41019-022-00180-w
Article Google Scholar
Wang L, Fang Y, Zhou L (2022) Preference-based spatial co-location pattern mining. Springer, Singapore. https://doi.org/10.1007/978-981-16-7566-9
Book Google Scholar
Chen Y, Chen X, Liu Z et al (2020) Understanding the spatial organization of urban functions based on co-location patterns mining: a comparative analysis for 25 chinese cities. Cities 97:102563. https://doi.org/10.1016/j.cities.2019.102563
Article Google Scholar
Fan W, Hu C (2017) Big graph analyses: from queries to dependencies and association rules. Data Sci Eng 2:36–55. https://doi.org/10.1007/s41019-016-0025-x
Article Google Scholar
Jiang Y, Han Y, Liu M et al (2022) Street vitality and built environment features: a data-informed approach from fourteen Chinese cities. Sustain Cities Soc 79:103724. https://doi.org/10.1016/j.scs.2022.103724
Article Google Scholar
Luo S, Liu Y, Du M et al (2021) The influence of spatial grid division on the layout analysis of urban functional areas. ISPRS Int J Geo-Inf 10(3):189. https://doi.org/10.3390/ijgi10030189
Article Google Scholar
Hu Y, Han Y (2019) Identification of urban functional areas based on poi data: a case study of the guangzhou economic and technological development zone. Sustainability 11(5):1385. https://doi.org/10.3390/su11051385
Article Google Scholar
Wang Z, Ma D, Sun D et al (2021) Identification and analysis of urban functional area in Hangzhou based on OSM and POI data. PLoS one. https://doi.org/10.1371/journal.pone.0251988
Article Google Scholar
Zhai W, Bai X, Shi Y et al (2019) Beyond Word2vec: an approach for urban functional region extraction and identification by combining Place2vec and POIs. Comput Environ Urban Syst 74:1–12. https://doi.org/10.1016/j.compenvurbsys.2018.11.008
Article Google Scholar
Zheng M, Wang H, Shang Y et al (2023) Identification and prediction of mixed-use functional areas supported by POI data in Jinan City of China. Sci Rep 13:2913. https://doi.org/10.1038/s41598-023-30140-x
Article Google Scholar
Huang C, Xiao C, Rong L (2022) Integrating point-of-interest density and spatial heterogeneity to identify urban functional areas. Remote Sens 14(17):4201. https://doi.org/10.3390/rs14174201
Article Google Scholar
Shekhar S, Huang Y (2001) Discovering spatial co-location patterns: a summary of results. In: International symposium on spatial and temporal databases. Springer, pp 236–256
Huang Y, Shekhar S, Xiong H (2004) Discovering colocation patterns from spatial data sets: a general approach. IEEE Trans Knowl Data Eng 16(12):1472–1485. https://doi.org/10.1109/TKDE.2004.90
Article Google Scholar
Yoo JS, Shekhar S, Smith J et al (2004). A partial join approach for mining co-location patterns. In: Proceedings of the 12th annual ACM international workshop on geographic information systems, pp 241–249. https://doi.org/10.1145/1032222.1032258
Yoo JS, Shekhar S (2006) A joinless approach for mining spatial colocation patterns. IEEE Trans Knowl Data Eng 18(10):1323–1337. https://doi.org/10.1109/TKDE.2006.150
Article Google Scholar
Tran V, Wang L, Chen H et al (2021) MCHT: a maximal clique and hash table-based maximal prevalent co-location pattern mining algorithm. https://doi.org/10.1016/j.eswa.2021.114830
Chan HK-H, Long C, Yan D et al (2024) Fraction-score: a generalized support measure for weighted and maximal co-location pattern mining. IEEE Trans Knowl Data Eng 36(4):1582–1596. https://doi.org/10.1109/TKDE.2023.3304365
Article Google Scholar
Govan R, Selmaoui-Folcher N, Giannakos A et al (2023) Co-location pattern mining under the spatial structure constraint. International conference on database and expert systems applications, pp 186–193
Hu Z, Wang L, Tran V et al (2022) Efficiently mining spatial co-location patterns utilizing fuzzy grid cliques. Inf Sci 592:361–388. https://doi.org/10.1016/j.ins.2022.01.059
Article Google Scholar
Barua S, Sander J (2014) Mining statistically sound co-location patterns at multiple distances. In: Proceedings of the 26th international conference on scientific and statistical database management, pp 1–12. https://doi.org/10.1145/2618243.2618261 .
Yoo JS, Bow M (2012) Mining spatial colocation patterns: a different framework. Data Min Knowl Disc 24:159–194. https://doi.org/10.1007/s10618-011-0223-0
Article MathSciNet Google Scholar
Qian F, Chiew K, He Q et al (2014) Mining regional co-location patterns with kNNG. J Intell Inf Syst 42:485–505. https://doi.org/10.1007/s10844-013-0280-5
Article Google Scholar
Tran V, Wang L (2020) Delaunay triangulation-based spatial colocation pattern mining without distance thresholds. Stat Anal Data Min ASA Data Sci J 13(3):282–304. https://doi.org/10.1002/sam.11457
Article MathSciNet Google Scholar
Mohan P, Shekhar S, Shine JA et al (2011) A neighborhood graph based approach to regional co-location pattern discovery: a summary of results. In: Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems—GIS ’11, pp 122–132. https://doi.org/10.1145/2093973.2093991 .
Deng M, Cai J, Liu Q et al (2017) Multi-level method for discovery of regional co-location patterns. Int J Geogr Inf Sci 31(9):1846–1870. https://doi.org/10.1080/13658816.2017.1334890
Article Google Scholar
Liu Q, Liu W, Deng M et al (2021) An adaptive detection of multilevel co-location patterns based on natural neighborhoods. Int J Geogr Inf Sci 35(3):556–581. https://doi.org/10.1080/13658816.2020.1775235
Article Google Scholar
Liu W, Liu Q, Deng M et al (2022) Discovery of statistically significant regional co-location patterns on urban road networks. Int J Geogr Inf Sci 36(4):749–772. https://doi.org/10.1080/13658816.2021.1981335
Article Google Scholar
Jiang X, Wang L, Tran V (2023) A parallel algorithm for regional co-location mining based on fuzzy density peak clustering. SCI SIN Inf 53(7):1281. https://doi.org/10.1360/SSI-2022-0004
Article Google Scholar
Ding W, Eick CF, Yuan X et al (2011) A framework for regional association rule mining and scoping in spatial datasets. GeoInformatica 15:1–28. https://doi.org/10.1007/s10707-010-0111-6
Article Google Scholar
Wang S, Huang Y, Wang XS (2013) Regional co-locations of arbitrary shapes. In: International symposium on spatial and temporal databases. Springer, pp 19–37
Wang D, Wang L, Jiang X et al (2024) RCPM_CFI: a regional core pattern mining method based on core feature influence. Inf Sci 658:119895. https://doi.org/10.1016/j.ins.2023.119895
Article Google Scholar
Li J, Wang L, Yang P et al (2024) A novel algorithm for efficiently mining spatial multi-level co-location patterns. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2024.3381178
Article Google Scholar
Yang P, Wang L, Wang X et al (2022) A spatial co-location pattern mining approach based on column calculation. SCI SIN Inform 52(6):1053. https://doi.org/10.1360/SSI-2020-0384
Article Google Scholar
Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220. https://doi.org/10.1006/knac.1993.1008
Article Google Scholar
Bao X, Gu T, Chang L et al (2022) Knowledge-based interactive postmining of user-preferred co-location patterns using ontologies. IEEE Trans Cybern 52(9):9467–9480. https://doi.org/10.1109/TCYB.2021.3054923
Article Google Scholar
Tanimoto TT (1958) An elementary mathematical theory of classification and prediction
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Article MathSciNet Google Scholar
Tobler WR (1970) A computer movie simulating urban growth in the Detroit region. Econ Geogr 46:234–240
Article Google Scholar
Kang Y, Wang Y, Xia Z et al (2018) Identification and classification of Wuhan urban districts based on poi. J Geomat 43(1):81–85. https://doi.org/10.14188/j.2095-6045.2016352
Article Google Scholar
Cai J, Kwan M-P (2022) Discovering co-location patterns in multivariate spatial flow data. Int J Geogr Inf Sci 36(4):720–748. https://doi.org/10.1080/13658816.2021.1980217
Article Google Scholar
Li Y, Shashi S (2018) Local co-location pattern detection: a summary of results. In: 10th international conference on geographic information science (GIScience 2018), vol 114, pp 10–11015. https://doi.org/10.4230/LIPIcs.GISCIENCE.2018.10

Download references

Acknowledgements

This work is supported by the National Natural Science Foundation of China (62276227, 62306266), the Program of Yunnan Key Laboratory of Intelligent Systems and Computing (202205AG070003), the Project of Innovative Research Team of Yunnan Province (2018HC019), and the Yunnan Fundamental Research Projects (202201AS070015, 202401AT070450).

Author information

Authors and Affiliations

School of Information Science and Engineering, Yunnan University, Kunming, 650091, China
Yuqing Liu, Lizhen Wang, Peizhong Yang & Lihua Zhou
Dianchi College of Yunnan University, Kunming, 650228, China
Lizhen Wang

Authors

Yuqing Liu
View author publications
You can also search for this author in PubMed Google Scholar
Lizhen Wang
View author publications
You can also search for this author in PubMed Google Scholar
Peizhong Yang
View author publications
You can also search for this author in PubMed Google Scholar
Lihua Zhou
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

The 1st author proposed the method and algorithm, completed the experiments and wrote the paper. The 2nd author optimized the method and revised the paper. The 3rd optimized the method and revised the paper. The 4th author revised the paper.

Corresponding author

Correspondence to Lizhen Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing Financial interests or personal relationships that could have appeared to infuence the work reported in this paper.

Consent for Publication

All authors have read and agreed to the published version of the manuscript.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Liu, Y., Wang, L., Yang, P. et al. Mining Interpretable Regional Co-location Patterns Based on Urban Functional Region Division. Data Sci. Eng. 9, 464–485 (2024). https://doi.org/10.1007/s41019-024-00256-9

Download citation

Received: 09 January 2024
Revised: 31 May 2024
Accepted: 01 July 2024
Published: 29 August 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s41019-024-00256-9

Keywords

Mathematics Subject Classification

68Uxx

Mining Interpretable Regional Co-location Patterns Based on Urban Functional Region Division

Abstract

Similar content being viewed by others

Mining Regional High Utility Co-location Pattern

Ontology-Based Interactive Post-mining of Interesting Co-location Patterns

Discovering Interesting Co-location Patterns Interactively Using Ontologies

Explore related subjects

1 Introduction

2 Related Work

3 Definitions for Mining RCPs

3.1 Notations

3.2 Identification of Functional Regions Based on the Ontology

3.2.1 The Concepts of the Ontology

3.2.2 Adaptively Quadrant Division of Spatial Units Based on Ontology

3.2.3 Identify Functional Types for Cells

Definition 1

Definition 2

Definition 3

Definition 4

3.2.4 Merging Functional Cells

3.3 Measure of RCPs

Definition 5

Definition 6

Definition 7

Definition 8

4 Algorithm of PRCPM-FR

4.1 Mining Algorithm

4.2 Algorithm Analysis

Lemma 1

Proof

Lemma 2

Proof

5 Experimental Evaluation

5.1 Effectiveness of the Urban Functional Region Division

5.1.1 Accuracy Evaluation of the Urban Functional Region Division

5.1.2 Scalability Analysis of the Urban Functional Region Division

5.2 Analysis of Mined RCPs

5.2.1 Analysis of SRCPs

5.2.2 Differential Layout of the Same Type of Functional Regions

5.2.3 Comparison with other RCP Mining Methods

5.3 The Scalability Analysis of PRCPM-FR Algorithm

6 Conclusions and Future Work

Availability of Data and Materials

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Consent for Publication

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation