Article

Motion Target Localization Method for Step Vibration Signals Based on Deep Learning

1 School of Microelectronics and Control Engineering, Changzhou University, Yan zheng West 2468#, Changzhou 213164, China
2 School of Computer Science and Artificial Intelligence, Changzhou University, Yan zheng West 2468#, Changzhou 213164, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9361; https://doi.org/10.3390/app14209361
Submission received: 25 August 2024 / Revised: 5 October 2024 / Accepted: 10 October 2024 / Published: 14 October 2024
Figure 1. Location scenario. This figure shows the experiment scene of vibration signal acquisition of a moving target, including the sensor configuration, single moving target, and one camera.
Figure 2. Overview of our method. This paper proposes a deep learning-based method for locating moving targets using vibration signals. The approach begins by collecting vibration signal data sequences and recording the actual trajectories of pedestrians on the ground using cameras. Subsequently, supervised learning is employed, leveraging deep learning techniques to extract signal features for training the positioning model. Finally, the two-dimensional velocity vectors of pedestrians are output to estimate their motion trajectories.
Figure 3. Architectures of our method.
Figure 4. The real experimental conditions.
Figure 5. Comparison of the predicted trajectory and the real trajectory.
Figure 6. Velocity vector comparison. (a) Comparison of real velocity and prediction in the X-axis direction. (b) Comparison of real velocity and prediction in the Y-axis direction.
Figure 7. Part of the linear path test results. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 8. Part of the test result graph around the figure-eight path. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 9. Part of the loop path test results. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 10. Part of the L-path test results. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 11. Test set experimental results.
Figure 12. Comparison of positioning results of different methods.

Abstract

To address the limitations of traditional footstep vibration signal localization algorithms, such as limited accuracy, single feature extraction, and cumbersome parameter adjustment, a motion target localization method for step vibration signals based on deep learning is proposed. Velocity vectors are used to describe human motion and adapt it to the nonlinear motion and complex interactions of moving targets. In the feature extraction stage, a one-dimensional residual convolutional neural network is constructed to extract the time–frequency domain features of the signals, and a channel attention mechanism is introduced to enhance the model’s focus on different vibration sensor signal features. Furthermore, a bidirectional long short-term memory network is built to learn the temporal relationships between the extracted signal features of the convolution operation. Finally, regression operations are performed through fully connected layers to estimate the position and velocity vectors of the moving target. The dataset consists of footstep vibration signal data from six experimental subjects walking on four different paths and the actual motion trajectories of the moving targets obtained using a visual tracking system. Experimental results show that compared to WT-TDOA and SAE-BPNN, the positioning accuracy of our method has been improved by 37.9% and 24.8%, respectively, with a system average positioning error reduced to 0.376 m.

1. Introduction

In recent years, positioning technology has become a focus in both industry and academia, and it has become a core component of numerous personnel positioning services. These personnel positioning data have a wide range of applications in various scenarios, including disaster relief, situational awareness, security defense, and intrusion detection. However, at the same time, there is increasing concern about the potential invasion of personal privacy when obtaining location information. It is worth noting that many research works related to surveillance and situational awareness often do not adequately consider privacy issues, including typical positioning technologies, such as GPS [1], computer vision [2], Wi-Fi [3], Bluetooth [4], and inertial sensors [5], which may, to some extent, leak privacy information of located individuals. Compared to other positioning algorithms, vibration signal-based positioning algorithms do not require individuals to wear transmitters or receivers, thus protecting individual privacy, and they perform well in complex electromagnetic environments or low visibility conditions [6].
Vibration signal positioning technology has been widely applied in various fields, such as structural health monitoring [7], indoor positioning [8], fault diagnosis [9], earthquake monitoring [10], traffic management, and mine safety [11]. A footstep vibration signal originates from structural vibrations of the floor induced by footsteps during walking, which are detected and converted into voltage signals by vibration sensors. Since walking is a continuous and dynamic process, footstep vibration signals exhibit multidimensional temporal characteristics. In particular, during the process from the heel contacting the ground to the toe leaving, the generated vibration signals form wide pulse widths and multiple peak values, which distinguishes them from other types of vibration signals. Additionally, footstep vibration signals mainly concentrate in the frequency band below 100 Hz, and there is a fixed time interval between adjacent footsteps [12]. However, because of the non-uniformity of ground structures and the influence of construction factors, ground materials exhibit anisotropy. In indoor environments, the propagation of ground waves is affected by various obstacles, resulting in multipath effects, which leads to a low signal-to-noise ratio in the collected vibration signals and large measurement errors [13].
Differences in propagation time occur when footstep vibration signals reach different sensors, providing the theoretical basis for vibration signal personnel positioning algorithms based on Time Difference of Arrival (TDOA) [7,14]. Such algorithms typically use signal processing methods, such as wavelet transformation, to mitigate the dispersion effects in signal propagation and then manually extract the propagation time delay characteristics of multi-channel vibration signals to achieve higher accuracy. However, these methods have some limitations. Firstly, they are extremely vulnerable to interference from environmental background noise, which reduces the stability and accuracy of localization. Although signal processing techniques such as wavelet transformation can be employed, they inevitably increase algorithmic complexity and require meticulous parameter adjustment. Furthermore, because of the uniformity of the feature extraction approach, it becomes challenging to adequately leverage the multidimensional temporal characteristics within vibration signals, thereby limiting further improvements in localization precision.
Neural networks excel at handling complex nonlinear relationships. Convolutional neural networks are widely used in extracting the frequency domain features of vibration signals, avoiding the need for complex signal processing and addressing the limitations of manual feature extraction [15,16]. In the field of fault diagnosis, to cope with network degradation, researchers have proposed a model that combines residual connections [17] and one-dimensional convolutional neural networks (1DCNNs) [18]. The aim is to improve diagnostic accuracy by preserving key parameters and enhancing feature representation capability. However, the time scale may be neglected, leading to information loss. To address this issue, researchers have introduced long short-term memory (LSTM) [19] networks based on deep residual networks (ResNets). LSTM can capture the temporal information in the features extracted by a ResNet, effectively enhancing diagnostic performance. In indoor positioning research, researchers have also attempted to analyze vibration signals using neural networks. For example, using ResNets to extract vibration signal features and determining the position of individuals through classification methods has resulted in higher classification accuracy [20]. Other studies have utilized Sparse Autoencoders (SAEs) to extract signal features and combined them with backpropagation neural networks (BPNNs) to build indoor positioning models, achieving precise output of two-dimensional coordinates [21].
However, these methods uniformly treat localization as a classification task, dividing the localization scene into several regions and assigning individuals to the center of a particular region based on classification outcomes. Nevertheless, given that walking is inherently a continuous dynamic process, the discrete nature of classification methods struggles to capture the dynamic trends and subtle variations in human walking poses while overlooking the temporal correlations between successive footsteps. Consequently, in practical scenarios demanding higher localization accuracy, relying solely on the region assignment of individual footprints fails to provide continuous and precise estimates of the moving target’s position. Currently, a considerable number of researchers have transformed the problem of continuous target pose estimation into a data sequence learning task. For example, researchers [22] utilized 2D pose sequences as input to their models, encoding the extracted time-series features through LSTM to output predicted 3D poses. RoNIN [23] constructed multiple neural network models, including ResNets, using IMU data to regress walking speed and direction changes, achieving high-precision trajectory prediction.
Based on the above research, this paper proposes a motion target localization method for step vibration signals based on deep learning. The main work includes the following:
  • Introducing the Channel Attention Mechanism (CAM) [24], one-dimensional residual neural network (1D-ResNet), and bidirectional long short-term memory network (Bi-LSTM) in building a combined network for the task of feature extraction of multidimensional footstep vibration signals. This aims to fully explore the potential pedestrian motion information in the footstep signals. Additionally, using velocity vectors to describe human motion better adapts to nonlinear motion and complex interaction situations.
  • Collecting footstep vibration signal data from multiple subjects walking on different paths and combining it with a visual tracking system to obtain real trajectory information of individuals. The actual walking speed of pedestrians is used as the signal label to construct a footstep vibration signal dataset, which is made publicly available for use by other researchers.
  • Through experimental verification, the proposed method is shown to achieve higher positioning accuracy than traditional methods, such as WT-TDOA and SAE-BPNN, providing continuous and accurate estimates of moving-target positions.
The remainder of the paper is structured as follows: In Section 2, we describe the location scenario and sensor arrangement. In Section 3, we describe our detailed method of localizing motion targets using footstep-induced floor vibration, and in Section 4, we describe the experiments we conducted, including test set, ablation, and comparative experiments. Finally, we present future work and conclusions in Section 5.

2. Location Scenario Description

In a 2D space of size m × n, the floor is made of tile material. M seismic geophone nodes S = {s_1, …, s_i, …, s_M} are uniformly deployed in the corners to establish a vibration sensing network, with the coordinates of node i being L_i = [x_i, y_i]^T. This configuration ensures that the sensing network’s detection range covers the observation area and captures effective signals, as shown in Figure 1.
To simplify the problem and validate the effectiveness of the localization algorithm, this paper focuses on positioning a single moving target (as shown in Figure 1). When the moving target generates vibrations, the working principle of the geophones involves the relative movement of the internal coil and permanent magnet due to the ground vibrations, which cuts through the magnetic field lines and induces an electromotive force in the wire. This induced voltage is proportional to the vibration acceleration. Thus, the vibration detected by the i-th node is converted into a voltage signal, which is then amplified by an operational amplifier. The amplified signal is subsequently digitized by an analog-to-digital converter to yield the vibration signal, denoted as e_i(t). The entire vibration sensor network can collect M channels of vibration signals e(t) = {e_1(t), …, e_i(t), …, e_M(t)}.
To train the localization model, supervised learning is employed. The actual trajectory of the moving target is recorded by a camera and used as supervised information to ensure the accuracy of the supervision data, as shown in Figure 2.

3. Description of Our Method

This section provides a detailed explanation of the design of the method proposed in this paper. Firstly, the vibration signal is preprocessed to eliminate noise and normalize it. Then, different modules in the feature extraction stage are described in detail. Finally, the pedestrian speed vector is regressed to obtain position information. The entire process is illustrated in Figure 3.

3.1. Vibration Signal Preprocessing Module

Floor vibration signals, induced by footsteps, typically encompass frequency components below 100 Hz. A geophone inherently acts as a second-order high-pass filter with a corner frequency below 10 Hz (implying that as the excitation frequency decreases from 10 Hz to 1 Hz, the geophone’s response may attenuate to roughly one percent of its original value). Artifacts within the signal primarily originate from environmental high-frequency noise exceeding 100 Hz. To mitigate this, a second-order Butterworth low-pass filter with a cutoff frequency of 100 Hz [25] is first used to filter the signal. Then, normalization is performed to unify the amplitude scale of the channels. The normalized vibration signal is denoted as a(t):
a(t) = {a_1(t), …, a_i(t), …, a_M(t)}
where a_i(t) (−1 ≤ a_i(t) ≤ 1, 1 ≤ i ≤ M) represents the result of e_i(t) after low-pass filtering and normalization.
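As a concrete illustration of this preprocessing step, the following sketch applies a second-order Butterworth low-pass filter at 100 Hz and per-channel peak normalization. The 3000 Hz sampling rate comes from Section 4.1; the use of zero-phase `filtfilt` (rather than a causal filter) is an assumption for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(e, fs=3000.0, cutoff=100.0):
    """Low-pass filter each channel of e (M x N samples) with a 2nd-order
    Butterworth filter, then normalize each channel into [-1, 1]."""
    b, a = butter(2, cutoff / (fs / 2.0), btype="low")
    filtered = filtfilt(b, a, e, axis=1)           # zero-phase filtering
    peak = np.max(np.abs(filtered), axis=1, keepdims=True)
    peak[peak == 0] = 1.0                          # guard all-zero channels
    return filtered / peak

# Example: M = 6 channels, 1 s of synthetic signal
rng = np.random.default_rng(0)
e = rng.standard_normal((6, 3000))
a_t = preprocess(e)
print(a_t.shape)   # (6, 3000)
```

After this step, every channel of a(t) lies within [−1, 1] as required by the definition above.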
Considering that the velocity of individuals is not constant during walking, a sliding window approach [26] is adopted to partition the normalized vibration signal a(t) in order to more accurately capture velocity changes within each time window. The specific operations are as follows: starting from the beginning of the data stream at t = 0 s, a window length of L = 1 s (based on the consideration of a normal human gait cycle) is set, and the number of samples contained in each window, N_s = f_s × L, is determined from the sampling rate f_s. Then, the window is slid sequentially according to the set moving step size d to segment the entire signal. If the total length of the signal is N, a signal matrix of k windows, each of length N_s, is obtained:
k = ⌊(N − N_s) / d⌋ + 1
When the window moves to the j-th sample point, the r-th signal matrix is represented as U_j^r ∈ R^(M × N_s):
U_j^r =
[ u_{1, j−N_s}  u_{1, j−N_s+1}  …  u_{1, j} ]
[ u_{2, j−N_s}  u_{2, j−N_s+1}  …  u_{2, j} ]
[ …                                          ]
[ u_{M, j−N_s}  u_{M, j−N_s+1}  …  u_{M, j} ]
where i represents a particular channel signal.
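The windowing above can be sketched as follows. The window length and step size are examples (d = 300 samples, i.e. 0.1 s at 3000 Hz, is an assumption, not a value stated in the paper).

```python
import numpy as np

def segment(a, fs=3000, L=1.0, d=300):
    """Split an M x N signal into overlapping windows of N_s = fs * L
    samples, sliding by d samples; returns an array of shape (k, M, N_s)
    with k = (N - N_s) // d + 1, matching the formula in the text."""
    M, N = a.shape
    Ns = int(fs * L)
    k = (N - Ns) // d + 1
    return np.stack([a[:, r * d : r * d + Ns] for r in range(k)])

# M = 6 channels, 2 s of data -> k = (6000 - 3000)//300 + 1 = 11 windows
a = np.arange(6 * 6000, dtype=float).reshape(6, 6000)
U = segment(a)
print(U.shape)   # (11, 6, 3000)
```

Each slice `U[r]` corresponds to one signal matrix U_j^r fed to the feature extractor.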

3.2. Feature Extraction

Pedestrian gait exhibits dual characteristics of continuity and periodicity, with the contact between feet and ground during walking generating vibration signals that embed dynamic information about the walker’s speed. The smooth continuity and periodic repetition of gait impart a regular pattern to the walking state. Furthermore, variations in walking speed elicit adjustments in the frequency and intensity of foot-ground contact, which are manifested through the temporal features of these vibration signals. For instance, during accelerated walking, the frequency of foot contacts increases, accompanied by a concurrent amplification in the amplitude and frequency of the vibration signals. These integrated features constitute the complex motion patterns of pedestrian gait, holding significant research value for pedestrian position tracking and speed analysis. Building upon this analysis, the present study leverages the nonlinear learning capabilities of neural networks to conduct an in-depth investigation into the influence of factors such as step frequency, step length, gait cycle, symmetry, and foot landing patterns on the variation patterns of vibration signals.
To establish a mapping relationship between vibration signal features and motion features, this paper employs a combination of global channel attention, local feature extraction, and temporal sequence feature extraction modules. Specifically, the global channel attention module uses CAM to learn the correlations between multiple footstep vibration signals by adjusting the weights of different channels in the signal matrix, focusing on channels with higher signal strengths and signal-to-noise ratios to extract global features and achieve effective information compensation. For local feature extraction, a 1D-ResNet structure is used to capture local time–frequency domain features of the signals, mining rich fine-grained motion information. The temporal feature extraction module employs Bi-LSTM to further extract and optimize the time dependencies of the features obtained from convolutional operations, providing a comprehensive understanding of the walking dynamics.

3.2.1. Global Channel Attention Module

During the propagation of vibration waves, they are often affected by attenuation and multipath effects, leading to signal amplitude reduction, energy loss, and temporal domain distortion. Consequently, in deep learning tasks for vibration signal feature representation, learning features from individual channels alone often fails to capture the multidimensional characteristics of footstep vibration signals comprehensively. Since the signals from different channels originate from the same footstep-induced structural vibrations, they exhibit spatial correlation. Therefore, this paper introduces the CAM mechanism to construct a global channel attention module, which enhances attention to channels that are closer and less affected by interference by extracting inter-channel correlation information and adjusting weights, thereby achieving effective complementary information among channels.
Given the spatial correlation between signal channels, this paper introduces the CAM mechanism to enhance the model’s ability to identify key channels by dynamically adjusting weights, thereby effectively complementing information between channels and obtaining global signal features. Generally, vibration sensors closer to the moving target collect more accurate and reliable signals because they contain significant peaks and valleys with less energy attenuation. To this end, the CAM module first performs global max pooling and average pooling operations on the segmented signal matrix U_j^r to extract maximum-value statistical features f_Max^r ∈ R^(M × 1) and average-value statistical features f_Avg^r ∈ R^(M × 1) for each sensor data channel. The max pooling focuses on the vibration peaks associated with events such as foot contact with the ground and lift-off, capturing the extrema characteristics of gait. Meanwhile, the average pooling emphasizes the overall trend and steady state of the signal, providing information on the stability and rhythm of the gait. These statistical features are then fed into a shared multilayer perceptron (MLP), and the outputs of the shared network are summed element-wise to merge the advantages of the two pooling strategies. The weight vector w^r ∈ R^(M × 1), representing the importance of the different channels, is generated by passing the sum through the sigmoid function. Equation (4) shows this process:
w^r = σ(MLP(f_Max^r) + MLP(f_Avg^r))
where σ represents the sigmoid function. Finally, the weighted feature matrix Ū^r ∈ R^(M × N_s) is obtained by multiplying the weight vector w^r channel-wise with the signal matrix U_j^r:
Ū^r = w^r · U_j^r
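A minimal PyTorch sketch of this channel attention module follows. The shared-MLP reduction ratio is an assumption (the paper does not specify it); the shapes match the setup above, with M = 6 sensor channels and N_s = 3000 samples per window.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Max- and average-pooled per-channel statistics pass through a
    shared MLP; the sigmoid output w re-weights each of the M channels
    of the window U (batch, M, N_s), as in w = sigma(MLP(f_max) + MLP(f_avg))."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, U):
        f_max = U.amax(dim=2)                         # (batch, M)
        f_avg = U.mean(dim=2)                         # (batch, M)
        w = torch.sigmoid(self.mlp(f_max) + self.mlp(f_avg))
        return U * w.unsqueeze(2)                     # broadcast over time

cam = ChannelAttention(channels=6)
U = torch.randn(4, 6, 3000)
out = cam(U)
print(out.shape)   # torch.Size([4, 6, 3000])
```

Because the weights multiply whole channels, the output keeps the shape of U while emphasizing high-SNR sensors.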

3.2.2. Local Feature Extraction Module

A complete footstep typically includes the processes of heel strike, forefoot contact, and toe-off, which manifest as distinct signal waveforms in the time domain. These waveforms contain local features, such as shape, duration, and amplitude variations, which are crucial for distinguishing different walking patterns and gait information and for precise motion localization. Therefore, following global feature extraction, this section utilizes a 1D-ResNet to thoroughly extract the local time–frequency domain characteristics of the signal and addresses the performance degradation issue associated with increasing network depth.
A 1D-ResNet consists of an initialization module and four structurally similar residual modules. The feature matrix Ū^r first passes through the initialization module, with a convolutional kernel size of 1 × 7 and a stride of 2, which preliminarily identifies key shapes in the signal, such as rising and falling edges, sharp changes, and flat sections, to obtain an initial feature representation D_0^r. Subsequently, the signal enters the four stacked residual modules, whose output channel counts are 64, 128, 256, and 512. Each residual module contains two residual units, each consisting of two convolutional layers and a residual connection. To fully capture local time–frequency characteristics, small 1 × 3 convolutional kernels are used in each convolutional layer [25]. Unlike traditional networks that directly learn input–output mappings, this study adopts the residual structure: the l-th residual unit learns shallow features D_l^r capturing local motion trends, which are used as input to the (l + 1)-th residual unit to further learn subtle changes in pedestrian movement, yielding local fine-grained residual features ΔD_l^r. When the sizes of D_l^r and ΔD_l^r are mismatched, D_l^r is dimensionally aligned through a 1 × 1 convolution operation in the branch, followed by addition with ΔD_l^r; otherwise, they are added directly. Finally, to enhance the model’s nonlinear expressive power, the sum is passed through the ReLU function to obtain the output D_{l+1}^r of the (l + 1)-th residual unit:
D_{l+1}^r = ReLU(D_l^r + F(D_l^r, W_1, W_2))
Here, F denotes the residual function, and W_1 and W_2 are the weight coefficient matrices of the two convolutional layers. After computation through the four residual modules, the output is the spatiotemporal feature D_8^r.
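One residual unit of this kind can be sketched in PyTorch as follows. The BatchNorm layers are an assumption (common in ResNet-style blocks but not stated in the text); the 1 × 3 kernels, the 1 × 1 alignment branch, and the final ReLU follow the description above.

```python
import torch
import torch.nn as nn

class ResidualUnit1D(nn.Module):
    """Two 1x3 convolutions plus a skip connection; a 1x1 convolution
    aligns dimensions when channels or stride change, implementing
    D_{l+1} = ReLU(D_l + F(D_l, W1, W2))."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm1d(out_ch),
        )
        self.align = (nn.Identity() if in_ch == out_ch and stride == 1
                      else nn.Conv1d(in_ch, out_ch, 1, stride=stride))

    def forward(self, x):
        return torch.relu(self.align(x) + self.f(x))

unit = ResidualUnit1D(64, 128, stride=2)
x = torch.randn(4, 64, 750)
print(unit(x).shape)   # torch.Size([4, 128, 375])
```

Stacking such units with channel counts 64, 128, 256, and 512 reproduces the four residual modules described above.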

3.2.3. Time Sequence Feature Extraction Module

During pedestrian movement, footsteps are typically smooth and continuous, with subtle changes in walking states between adjacent moments and continuous speed variations. Additionally, the gait pattern exhibits periodicity, with footstep vibration signals showing periodic rises and falls. However, a single convolutional network cannot capture the complete feature representation of footstep vibration signals. Considering that the convolution operations in the local feature extraction module compress the feature matrix Ū^r along the time axis, while temporal correlation still exists within D_8^r, this paper cascades a Bi-LSTM to further extract temporal features of the signal while reducing computational complexity.
To fully grasp the periodic nature of the signal, the local spatiotemporal feature D_8^r, segmented into T time steps, is used as input to the Bi-LSTM. By utilizing the gating units of the LSTM, the forget gate filters out noise, retaining the motion state from the previous time step and passing it to the next time step. The bidirectional structure of the Bi-LSTM separately computes forward hidden states h_{f,t_j}^r (past context) and backward hidden states h_{b,t_j}^r (future context), then concatenates them to generate the output D_9^r. The specific computation process is as follows:
h_{f,t_j}^r = LSTM(h_{f,t_j−1}^r, D_8^r(t_j), c_{f,t_j−1}^r),  t_j ∈ [1, T]
h_{b,t_j}^r = LSTM(h_{b,t_j+1}^r, D_8^r(t_j), c_{b,t_j+1}^r),  t_j ∈ [T, 1]
D_9^r = σ(W_h [h_{f,t_j}^r, h_{b,t_j}^r] + b_h)
In the above, D_8^r(t_j) represents the input vector at time t_j; h_{f,t_j−1}^r and c_{f,t_j−1}^r denote the hidden state and cell state at time t_j − 1 in the forward LSTM layer; h_{b,t_j+1}^r and c_{b,t_j+1}^r correspond to the states at time t_j + 1 in the backward LSTM layer; and W_h and b_h represent the weights and biases, respectively, between the input layer and the hidden layer.
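A minimal sketch of this temporal stage in PyTorch is shown below. The hidden size (128) and the choice of the last time step as the summary feature are assumptions for illustration; the paper specifies only that forward and backward hidden states are concatenated.

```python
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    """Bi-LSTM over the T time steps of the convolutional feature map
    D_8 of shape (batch, C, T); forward and backward hidden states are
    concatenated by PyTorch's bidirectional LSTM to form D_9."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, d8):
        seq = d8.transpose(1, 2)     # (batch, T, C): one vector per step
        out, _ = self.lstm(seq)      # (batch, T, 2 * hidden)
        return out[:, -1, :]         # last-step summary feature

mod = TemporalModule(feat_dim=512)
d8 = torch.randn(4, 512, 12)         # 512 channels from the 1D-ResNet, T = 12
print(mod(d8).shape)   # torch.Size([4, 256])
```

The 256-dimensional output is the concatenation of the 128-dimensional forward and backward hidden states.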

3.2.4. Velocity Vector Regression Module

The velocity vector regression module consists of two fully connected layers used to integrate abstract features extracted from the vibration signal matrix and output continuous velocity information. To mitigate the risk of overfitting, a dropout function is introduced in the fully connected layers, enhancing the model’s generalization capability by randomly dropping neurons. Here, the dropout parameter [25] is set to 0.5.
In the velocity vector regression process, the feature representation D_9^r is first flattened and used as input to the fully connected layers. The entire feature extraction module acts as a function f_θ, whose network parameters are optimized through end-to-end training, nonlinearly mapping the segmented vibration signal matrix U_j^r to the velocity vector (ṽ_x^r, ṽ_y^r), as shown in Equation (10):
(ṽ_x^r, ṽ_y^r) = f_θ(U_j^r)
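The regression head can be sketched as follows. The input dimension (256, matching a Bi-LSTM with hidden size 128) and the intermediate width of 64 are assumptions; the two fully connected layers, the dropout probability of 0.5, and the two-dimensional velocity output follow the text.

```python
import torch
import torch.nn as nn

# Two fully connected layers with dropout map the flattened feature D_9
# to the 2-D velocity vector (v_x, v_y).
head = nn.Sequential(
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout probability 0.5, as in the paper
    nn.Linear(64, 2),    # outputs (v_x, v_y)
)

d9 = torch.randn(4, 256)
v = head(d9)
print(v.shape)   # torch.Size([4, 2])
```

Integrating the velocity vectors over the window timestamps then yields the estimated trajectory.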

4. Experimental Results and Analysis

4.1. Experimental Dataset and Description

In this experiment, six sensor nodes were deployed on a 3 × 5 m² tiled floor for measuring vertical vibrations, as shown in Figure 4. These nodes consisted of LGT-20D10 seismic detectors and AD620 voltage amplifiers connected to an MCC118 data acquisition board on a Raspberry Pi 4B. The experimental setup, detailed in Table 1, involved individuals wearing athletic shoes repeating four types of path experiments in normal walking postures: straight line, figure eight, circular, and L-shaped paths. The sampling rate was set at 3000 Hz, and after amplification and sampling, the vibration signals were transmitted in real time via LAN from the Raspberry Pi to a host PC for processing. A total of 8000 sets of raw data were collected to build the Footstep Vibration Signal (FVS) dataset. (The dataset used in this study has been publicly available since 9 August 2024 at https://github.com/RCHEN1220/FVS-DATA).
To label velocity information in the signals, a visual tracking system was designed. Initially, the OpenPose system [27] was used to construct real-time human skeleton diagrams and calculate the individuals’ foot positions accurately. Subsequently, the pixel coordinates of foot locations between frames were connected to obtain the image trajectory of the moving target. The parameter information of the camera was obtained by camera calibration, as shown in Table 2. Through coordinate transformation techniques [28], the image trajectory was converted into the actual ground trajectory, and velocities for each frame were calculated based on timestamps.
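The per-frame velocity labels can be derived from the tracked ground trajectory by finite differencing over the frame timestamps. A minimal sketch (the 30 fps frame rate is an assumption for the example; the actual camera parameters are in Table 2):

```python
import numpy as np

def velocity_labels(traj, timestamps):
    """Finite-difference velocities from a tracked ground trajectory:
    traj is (k, 2) metric positions, timestamps is (k,) seconds;
    returns (k-1, 2) velocity vectors (v_x, v_y)."""
    dt = np.diff(timestamps)[:, None]     # (k-1, 1) frame intervals
    return np.diff(traj, axis=0) / dt     # displacement / time

# A target moving at 1 m/s along x, sampled at an assumed 30 fps
t = np.arange(5) / 30.0
traj = np.stack([t * 1.0, np.zeros_like(t)], axis=1)
v = velocity_labels(traj, t)
print(np.round(v[0], 3))   # [1. 0.]
```

These (v_x, v_y) pairs, aligned with the corresponding signal windows, serve as the supervision labels for training.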

4.2. Experimental Details

The method in this paper was implemented using the PyTorch framework. The experimental computing platform comprised an Intel E5-2630 2.20 GHz quad-core CPU, 47 GB of RAM, and an NVIDIA Corporation GM200 graphics card with 12 GB of VRAM. The network model was trained, validated, and tested using the FVS dataset, with a ratio of 8:1:1 for the training, validation, and testing sets, respectively. Samples from each path type were evenly distributed. During training, the initial learning rate was set at 0.001, the batch size was 128, the mean square error (MSE) was chosen as the loss function, and the Adam optimizer with parameters β_1 = 0.9, β_2 = 0.999, and ε = 10^−8 was used. Typically, the model converged after around 100 epochs, with the entire training process taking approximately 6 h.
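The training configuration above translates directly into PyTorch. The tiny stand-in model and random batch below are placeholders for the full network and the FVS data; the loss function, batch size, and Adam hyperparameters are the ones stated in the text.

```python
import torch
import torch.nn as nn

# Stand-in model: the real network is CAM + 1D-ResNet + Bi-LSTM + FC head.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

# Adam with lr = 0.001, betas = (0.9, 0.999), eps = 1e-8, as in Section 4.2.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.MSELoss()

x = torch.randn(128, 16)      # batch size 128, as in the paper
y = torch.randn(128, 2)       # velocity-vector targets
for _ in range(5):            # a few illustrative optimization steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(round(loss.item(), 4))
```

In the actual experiments, this loop runs over the FVS training split for roughly 100 epochs until convergence.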

4.3. Positioning Performance Evaluation Metrics

When evaluating the positioning performance, we assumed that the personnel speed predicted by the network model was ṽ(ṽ_x, ṽ_y), and the estimated position p̃(x̃, ỹ) was obtained through calculation. The root mean square error (RMSE) and the relative trajectory error (RTE) were chosen as evaluation metrics. RMSE quantifies the overall deviation between the estimated trajectory and the true trajectory, with the following calculation formula:
RMSE = sqrt( (1/k) Σ_{t=1}^{k} [ (x_t − x̃_t)² + (y_t − ỹ_t)² ] )
Here, p̃_t(x̃_t, ỹ_t) represents the coordinates of the position point at time t in the estimated trajectory, p_t(x_t, y_t) represents the coordinates of the position point at time t in the true trajectory, and k is the total number of points in the trajectory. RTE reflects the overall positioning accuracy of the estimated trajectory relative to its length, with the following calculation formula:
RTE = ( Σ_{t=1}^{k} sqrt( (x_t − x̃_t)² + (y_t − ỹ_t)² ) ) / (k · I)
Here, I represents the length of the true trajectory.
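Both metrics are straightforward to compute from paired trajectories; a short sketch:

```python
import numpy as np

def rmse(true, est):
    """Root mean square error between trajectories of shape (k, 2)."""
    return np.sqrt(np.mean(np.sum((true - est) ** 2, axis=1)))

def rte(true, est, length):
    """Mean per-point Euclidean error divided by the true trajectory
    length I, giving a dimensionless relative error."""
    err = np.sqrt(np.sum((true - est) ** 2, axis=1))
    return err.mean() / length

true = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
est = true + np.array([0.0, 0.3])      # constant 0.3 m lateral offset
print(round(rmse(true, est), 3))           # 0.3
print(round(rte(true, est, length=2.0), 3))  # 0.15
```

With a constant 0.3 m offset, every per-point error is 0.3 m, so the RMSE equals the offset and the RTE is the offset divided by the 2 m trajectory length.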

4.4. Analysis of Test Results

To illustrate the positioning effect, the experimental results of one group of data from the test set were randomly selected for detailed analysis. The recorded trajectory length of this data is 26.206 m over 34.068 s, and the complete trajectory comparison is shown in Figure 5. The trajectories in Figure 5 share a common starting point at (1.508 m, 1.927 m); the estimated trajectory endpoint and the true trajectory endpoint are (1.865 m, −2.494 m) and (1.637 m, −2.326 m), respectively, with an error between them of only 0.28 m. Furthermore, the total estimated trajectory length of this experiment is 24.052 m, compared to the actual value of 26.209 m, a difference of only 2.15 m.
In further analysis, special attention was given to the network model’s prediction of the speed vector, which was compared with the actual speed, as shown in Figure 6. Figure 6a and Figure 6b show the comparisons between ṽ_x and v_x and between ṽ_y and v_y, respectively. It is clearly observable from the figures that the predicted results remain consistent with the dynamic trend of the actual speed.
To more accurately evaluate the model’s average performance, the 6 best and 6 worst experiments in the test set were excluded, and 16 experiments were randomly selected from the remaining test set for metric evaluation. These experiments covered various types of paths, including straight paths (Figure 7), figure-eight paths (Figure 8), circular paths (Figure 9), and L-shaped paths (Figure 10). The total length of the 16 experiment trajectories reached 285 m, with average RMSE and RTE values of 0.374 m and 1.7%, respectively, as shown in Figure 11. Among them, the design of the straight paths was simple and the walking pattern easy to learn, yielding the best experimental results, with an average RMSE of only 0.175 m. Although some results from the circular and L-shaped paths, such as those depicted in Figure 9c and Figure 10c, failed to accurately track the trajectory, the average RMSE across multiple experiments is only 0.357 m, which still outperforms the comparative algorithms and demonstrates high positioning accuracy. In comparison, the experimental results for the figure-eight path show less-than-ideal performance, as illustrated in Figure 8, although the average RMSE of 0.529 m still constitutes a reasonable positioning outcome. The reason lies in the design intent of the figure-eight path, which tests the positioning capabilities of the model under extreme conditions: the continuous, sharply curved turns executed by the pedestrian cause rapid velocity changes, which are prone to inducing significant trajectory deviations.
From the above analysis, it can be concluded that this positioning model exhibits good performance across various types of paths, with trajectory length estimation and speed vector prediction closely matching the actual values, demonstrating high accuracy and reliability.
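The RMSE and RTE metrics used throughout this evaluation can be sketched as follows. The text does not spell out its exact RTE formula, so normalizing the trajectory RMSE by the ground-truth path length is an assumption here, offered only to make the metrics concrete.

```python
import math

def rmse(pred, true):
    """Root mean square error between predicted and ground-truth 2D positions."""
    se = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred, true)]
    return math.sqrt(sum(se) / len(se))

def rte(pred, true):
    """Relative trajectory error: here, RMSE normalized by the total
    ground-truth path length (one common definition; an assumption)."""
    length = sum(math.dist(a, b) for a, b in zip(true, true[1:]))
    return rmse(pred, true) / length

# Toy example: predicted track offset 0.3 m from a 2 m straight walk.
true = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.3), (1.0, 0.3), (2.0, 0.3)]
print(rmse(pred, true))  # ≈ 0.3 m
print(rte(pred, true))   # ≈ 0.15, i.e., 15%
```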

4.5. Ablation Study

To validate the effectiveness of the components of the network model, this study conducted ablation experiments in which the CAM, 1D-ResNet, and Bi-LSTM modules were removed in turn to train different models. To assess the models comprehensively, the number of multiply–add operations was also reported as a measure of computational complexity and efficiency. As shown in Table 3, compared to the baseline ResNet18 model, ResNet50 slightly improved system performance by adding convolutional layers, but its computational complexity increased significantly and its localization error remained higher than that of the other models. This indicates that convolutional networks alone struggle to extract rich features from vibration signals. The proposed model achieved superior positioning performance on the FVS dataset compared to CAM + ResNet18 and ResNet18 + Bi-LSTM, with a positioning error (RMSE) of only 0.441 m and an RTE of 1.3%. When CAM was removed, with the other modules unchanged, the positioning error of ResNet18 + Bi-LSTM increased by 0.098 m relative to the proposed model and the RTE increased by 0.9%, mainly because the spatial correlation between the data of different sensors was no longer exploited, degrading positioning performance. When the Bi-LSTM module was removed, the positioning error and RTE of CAM + ResNet18 increased by 0.169 m and 1.2%, respectively, because the temporal correlation of the signal was not considered during feature extraction. Compared to the basic ResNet18 network, ResNet18 + Bi-LSTM reduced the positioning error by 0.222 m and the RTE by 1.078%, while CAM + ResNet18 reduced them by 0.151 m and 0.778%, respectively.
This result demonstrates that by introducing CAM and Bi-LSTM, the model can simultaneously consider the spatial correlation between different sensor data and the temporal correlation of single sensor data in the time dimension, thereby capturing more crucial information and achieving more precise ground personnel trajectory prediction.
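The role of CAM described above, reweighting sensor channels so that the more informative ones dominate, can be illustrated with a parameter-free sketch. The real module learns its gating weights during training; the sigmoid-gated average pooling below is only a stand-in for the idea, not the paper's implementation.

```python
import math

def channel_attention(windows):
    """Toy channel-attention pass over per-sensor signal windows.

    Each sensor channel is summarized by global average pooling of its
    magnitude, squashed through a sigmoid gate, and the resulting weight
    rescales that channel. Channels carrying more vibration energy are
    attenuated less than near-idle ones.
    """
    weights = []
    for ch in windows:
        pooled = sum(abs(v) for v in ch) / len(ch)       # global average pooling
        weights.append(1.0 / (1.0 + math.exp(-pooled)))  # sigmoid gate
    return [[w * v for v in ch] for w, ch in zip(weights, windows)]

# Two hypothetical sensor channels: one energetic, one nearly idle.
out = channel_attention([[2.0, -2.0, 2.0], [0.01, 0.0, -0.01]])
# The energetic channel keeps most of its amplitude; the idle one is damped.
```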

4.6. Comparative Experiment

Currently, there are relatively few explorations of using neural networks to automatically extract footstep vibration features for localization. To comprehensively evaluate the potential advantages of deep learning techniques in this domain, this study selected the classical WT-TDOA algorithm as the benchmark and also introduced the SAE-BPNN localization method for a comparative analysis of overall performance. This comprehensive comparison was intended to further illustrate the performance advantages of the proposed localization method and to provide a useful reference for research in this field. Additionally, to assess the method's robustness, the maximum, minimum, and median errors between the actual and estimated positions along each path were calculated.
In this experiment, 200 sets of path data were randomly selected from the test set, and the localization errors of the different algorithms were analyzed in detail. As shown in Figure 12, in terms of positioning accuracy, the WT-TDOA algorithm has the largest error, reaching 0.62 m. This is mainly attributable to the limitations of traditional algorithms in feature extraction, which make it difficult to fully capture the key information in vibration signals. SAE-BPNN improves accuracy and stability, reaching 0.494 m, but its static model has difficulty extracting the temporal features of vibration signals, so its overall positioning performance remains below that of our method. Compared with these two algorithms, our method's positioning error is only 0.376 m. Our method not only improves the average positioning accuracy but also better reflects pedestrian movement trends by outputting continuous trajectories through regression, demonstrating its unique advantages in practical applications.
Furthermore, the positioning of moving targets in practical applications is often constrained by real-time requirements. Therefore, the average computation time for each path predicted by the three methods was also analyzed, as shown in Table 4. The WT-TDOA algorithm took the longest, mainly because its data preprocessing stage involves a large number of wavelet decomposition computations, which significantly increases the complexity and time cost of the calculation. In contrast, our method's average computation time was the shortest at 16.58 s, mainly owing to its efficient feature extraction and temporal learning capabilities.
In conclusion, compared to the traditional WT-TDOA and SAE-BPNN algorithms, our method not only demonstrated advantages in localization accuracy and stability but also required less computation time, making it better suited to real-time requirements in practical applications.

5. Conclusions

Addressing the issues of limited accuracy, uniform feature extraction methods, cumbersome parameter adjustments, and the inability to output continuous dynamic trajectories in existing indoor positioning methods based on vibration signals, this paper proposes a deep learning-based approach for locating moving targets using vibration signals. This method involves the construction of a neural network model aimed at establishing a mapping relationship between footstep vibration signals and pedestrian speed. To validate the effectiveness and accuracy of the proposed method, ablation experiments and comparative experiments were designed. The experimental results demonstrate that the network model, which incorporates the multivariate time-series characteristics of footstep vibration signals, achieves high precision. On the FVS dataset, the proposed method significantly improves positioning accuracy, reducing the average system positioning error to 0.376 m. Compared to WT-TDOA and SAE-BPNN, the positioning accuracy is enhanced by 37.9% and 24.8%, respectively.
However, we acknowledge the inherent limitations of the proposed method when dealing with complex scenarios, particularly in multi-person walking situations where the signals captured by the vibration sensor network are confounded by footstep vibration information from different individuals. The current study is not yet capable of accurately identifying the specific number of people within the active area, nor can it effectively distinguish these mixed vibration signals, thereby constraining the widespread application of vibration signal localization technology.
In light of this, our team’s future research will concentrate on leveraging the unique differences in footstep characteristics among individuals and integrating deep learning techniques to extract crucial features of footstep variability, with the objective of achieving precise estimation of the number of vibration sources and effective separation of the source signals.

Author Contributions

Conceptualization, R.C.; methodology, R.C. and Y.Z.; software, R.C.; validation, R.C., Y.Z. and Q.C.; formal analysis, Y.Z.; investigation, R.C., Y.Z. and C.Z.; resources, Y.Z.; data curation, R.C.; writing—original draft preparation, R.C.; writing—review and editing, R.C.; visualization, C.Z.; supervision, Y.Z.; project administration, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jiangsu Graduate Research Fund Innovation Project, grant number KYCX23_3059. The APC was funded by Changzhou University.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to considerations of privacy protection and ethical principles.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhu, H.Z.; Zheng, R.W.; Zhang, K. The method of BDS/GPS real-time kinematic non-difference error correction. Sci. Surv. Mapp. 2018, 43, 1–6. [Google Scholar]
  2. Naggar, Y.N.; Kassem, A.H.; Bayoumi, M.S. A low cost indoor positioning system using computer vision. Int. J. Image Graph. Signal Process. (IJIGSP) 2019, 11, 8–25. [Google Scholar] [CrossRef]
  3. Bi, J.; Zhao, M.; Yao, G.; Cao, H.; Feng, Y.; Jiang, H.; Chai, D. PSOSVRPos: WiFi indoor positioning using SVR optimized by PSO. Expert Syst. Appl. 2023, 222, 119778. [Google Scholar] [CrossRef]
  4. Szyc, K.; Nikodem, M.; Zdunek, M. Bluetooth low energy indoor localization for large industrial areas and limited infrastructure. Ad Hoc Netw. 2023, 139, 103024. [Google Scholar] [CrossRef]
  5. Feng, D.; Peng, J.; Zhuang, Y.; Guo, C.; Zhang, T.; Chu, Y.; Zhou, X.; Xia, X.-G. An adaptive IMU/UWB fusion method for NLOS indoor positioning and navigation. IEEE Internet Things J. 2023, 10, 11414–11428. [Google Scholar] [CrossRef]
  6. Zhu, C.; Wang, Q.; Xie, Y.; Xu, S. Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation. Inf. Sci. 2024, 2024, 120223. [Google Scholar] [CrossRef]
  7. Chen, W.Q. Research on Intelligent Sensing Technology Based on Vibration Signal. Master’s Thesis, Shenzhen University, Shenzhen, China, 2019. [Google Scholar]
  8. Mirshekari, M.; Pan, S.; Fagert, J.; Schooler, E.M.; Zhang, P.; Noh, H.Y. Occupant localization using footstep-induced structural vibration. Mech. Syst. Signal Process. 2018, 112, 77–97. [Google Scholar] [CrossRef]
  9. Xu, C.F.; Wang, S.; Jing, Y.P.; Hu, H.Z.; Wu, J. Damage localization of steel plate based on vibration signal denoising. Sci. Technol. Eng. 2021, 21, 9440–9446. [Google Scholar]
  10. Xin, W.Y.; Li, J.; Wang, X.L.; Li, Y.J. Method for positioning of underground shallow hypocenter based on deep learning. Comput. Eng. 2020, 46, 292–297. [Google Scholar]
  11. Lin, Z.H.; Huang, L.Q.; Li, H.; Wang, B.B.; Wang, Z.X.; Chen, X.Y. Study of underground locating and tracking algorithm in mines based on ultra-wideband asymmetric bilateral two-way ranging. Gold 2023, 44, 1001–1277. [Google Scholar]
  12. Sabatier, J.M.; Ekimov, A.E. A review of human signatures in urban environments using seismic and acoustic methods. In Proceedings of the 2008 IEEE Conference on Technologies for Homeland Security, Waltham, MA, USA, 12–13 May 2008; pp. 215–220. [Google Scholar]
  13. Mirshekari, M.; Pan, S.; Zhang, P.; Noh, H.Y. Characterizing wave propagation to improve indoor step-level person localization using floor vibration. In Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2016; SPIE: Bellingham, WA, USA, 2016; Volume 9803, pp. 30–40. [Google Scholar]
  14. Bahroun, R.; Michel, O.; Frassati, F.; Carmona, M.; Lacoume, J.-L. New algorithm for footstep localization using seismic sensors in an indoor environment. J. Sound Vib. 2014, 333, 1046–1066. [Google Scholar] [CrossRef]
  15. Ince, T. Real-time broken rotor bar fault detection and classification by shallow 1D convolutional neural networks. Electr. Eng. 2019, 101, 599–608. [Google Scholar] [CrossRef]
  16. Liang, M.; Cao, P.; Tang, J. Rolling bearing fault diagnosis based on feature fusion with parallel convolutional neural network. Int. J. Adv. Manuf. Technol. 2021, 112, 819–831. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Li, X.; Li, J.; Zhao, C.; Qu, Y.; He, D. Gear pitting fault diagnosis with mixed operating conditions based on adaptive 1D separable convolution with residual connection. Mech. Syst. Signal Process. 2020, 142, 106740. [Google Scholar] [CrossRef]
  19. Hou, Z.; Wang, H.W.; Zhou, L. Fault diagnosis of rotating machinery based on improved deep residual network. Syst. Eng. Electron. 2022, 44, 2051–2059. [Google Scholar]
  20. Yu, Y.; Waltereit, M.; Matkovic, V.; Hou, W.; Weis, T. Deep learning-based vibration signal personnel positioning system. IEEE Access 2020, 8, 226108–226118. [Google Scholar] [CrossRef]
  21. Tan, Q.Z. Research on Moving Target Location Algorithm Based on SAE-BPNN. Master’s Thesis, Changzhou University, Changzhou, China, 2022. [Google Scholar]
  22. Pavllo, D.; Feichtenhofer, C.; Grangier, D. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7745–7754. [Google Scholar]
  23. Herath, S.; Yan, H.; Furukawa, Y. RoNIN: Robust neural inertial navigation in the wild: Benchmark, evaluations, & new methods. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3146–3152. [Google Scholar]
  24. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  25. Selesnick, I.W.; Burrus, C.S. Generalized digital Butterworth filter design. IEEE Trans. Signal Process. 1998, 46, 1688–1694. [Google Scholar] [CrossRef]
  26. He, J.; Guo, Z.L.; Liu, L.Y.; Su, Y.H. Wearable human activity recognition technology based on sliding window and convolutional neural network. J. Electron. Inf. Technol. 2022, 44, 168–177. [Google Scholar]
  27. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  28. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
Figure 1. Location scenario. This figure shows the experiment scene of vibration signal acquisition of a moving target, including the sensor configuration, single moving target, and one camera.
Figure 2. Overview of our method. This paper proposes a deep learning-based method for locating moving targets using vibration signals. The approach begins by collecting vibration signal data sequences and recording the actual trajectories of pedestrians on the ground using cameras. Subsequently, supervised learning is employed, leveraging deep learning techniques to extract signal features for training the positioning model. Finally, the two-dimensional velocity vectors of pedestrians are output to estimate their motion trajectories.
Figure 3. Architectures of our method.
Figure 4. The real experimental conditions.
Figure 5. Comparison of the predicted trajectory and the real trajectory.
Figure 6. Velocity vector comparison. (a) Comparison of real velocity and prediction in the X-axis direction. (b) Comparison of real velocity and prediction in the Y-axis direction.
Figure 7. Part of the linear path test results. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 8. Part of the test result graph around the figure-eight path. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 9. Part of the loop path test results. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 10. Part of the L-path test results. (a) The test set results of experimenter A. (b) The test set results of experimenter B. (c) The test set results of experimenter C.
Figure 11. Test set experimental results.
Figure 12. Comparison of positioning results of different methods.
Table 1. Experimenter information sheet.
Experimenter    Height (cm)    Weight (kg)
A               171            82
B               178            84
C               158            49
D               183            80
E               175            75
F               166            55
Table 2. Camera parameter.
Parameter                               Value
Camera internal parameter matrix        [654.45291494    0               644.9591719 ]
                                        [0               712.02933536    405.31468142]
                                        [0               0               1           ]
Camera distortion coefficient matrix    [0.2486    0.7481    0.0127    0.0044    0.7095]
Rotation matrix                         [0.40250731    1.36022442    2.09998569]
Translation matrix                      [1.611763    3.62288654    9.40672029]
Table 3. Ablation experiment results.
Model                 ATE (m)    RTE     Multiply–Add Operations
ResNet18              0.761      3.3%    565.407 M
ResNet50              0.623      2.4%    1127.54 M
CAM + ResNet18        0.610      2.5%    565.408 M
ResNet18 + Bi-LSTM    0.539      2.2%    615.044 M
Our method            0.441      1.3%    615.044 M
Table 4. Mean computation time of different methods.
Method        Mean Computation Time (s)
WT-TDOA       27.31
SAE-BPNN      20.56
Our method    16.58
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
