1 Introduction

Orthopaedic diseases affect the musculoskeletal system, including bones, cartilage, ligaments, tendons, and connective tissues, leading to problems such as chronic pain, reduced mobility, and long-term disability [1]. According to the Global Burden of Disease Study, 1.71 billion people globally have musculoskeletal disorders such as osteoarthritis, rheumatoid arthritis, and low back pain, highlighting the need for better diagnosis and treatment [2]. The major disease types include fractures, bone tumours, arthritis (e.g., osteoarthritis and rheumatoid arthritis), degenerative diseases such as osteoporosis, and musculoskeletal injuries [3]. Recent advancements in artificial intelligence (AI) have had a tremendous impact on orthopaedic disease diagnosis and paved the way for the implementation of intelligent systems in diagnostics [4]. Machine learning (ML), deep learning (DL), and multicriteria decision-making (MCDM) techniques have been applied in different branches of orthopaedics to assist specialists in handling complex radiological images, including magnetic resonance imaging (MRI), X-ray, and computed tomography (CT) [5]. To illustrate, ML/DL computational models have been effectively deployed to enable the quick and efficient detection of pathologies such as complex fractures, mild forms of osteoarthritis, and even early-stage bone tumours [6, 7]. Furthermore, clinicians use MCDM and fuzzy logic methods in orthopaedics to support sound decision-making and to weigh several medical criteria, promoting accuracy in measurements as well as tailored treatment recommendations [8,9,10]. These results indicate that the evolution and ongoing advancement of modern mechanisms will be vital for the operative management of contemporary orthopaedic disorders.

While computational AI systems have transformed orthopaedic disease detection, they also pose considerable challenges regarding their trustworthiness and interpretability. A key drawback is the complexity of the detection models, specifically the ML/DL algorithms in intelligent detection systems, which commonly results in a loss of transparency regarding the decision-making process and may make clinicians reluctant to fully trust such systems [11]. This is exactly where trustworthy AI comes into play. Trustworthy AI systems are designed with fundamental principles such as fairness, transparency, privacy, and accountability, thereby enabling their safe inclusion in clinical practice [12, 13]. Several disciplines have explored the area of trustworthy AI, for example, disaster management, where AI plays a pivotal role in early warning and decision-making regarding real-time interventions, and healthcare, where AI is utilized for diagnosis and treatment planning [14, 15, 16]. In orthopaedic disease detection, trustworthy AI must not only provide highly accurate diagnoses but also offer explainable and interpretable outcomes so that clinicians can understand and validate the AI decision-making process [17, 18]. The AI trustworthiness guidelines of the European Union emphasize legality, ethics, and sustainability to promote robust and fair AI solutions for orthopaedic disease detection and stakeholder trust on the basis of seven key components [19]. The first component, human agency and oversight, upholds human rights and involves humans in critical decisions [20]. Technical robustness and safety are emphasized through security measures, backup systems, and ensuring the accuracy, reproducibility, and reliability of the AI-based computational detection system. Privacy and data governance prioritize maintaining data quality, protecting privacy, and ensuring data availability. Transparency involves clearly explaining decision-making processes to stakeholders. Diversity, nondiscrimination, and fairness aim to eliminate bias, ensure accessibility, adopt a universal design, and advocate for all users, including those with disabilities [21]. Environmental and societal well-being considers environmental, sustainability, and social impacts and promotes democracy. Finally, accountability minimizes and reports negative impacts, manages trade-offs, and provides redress when necessary [22].

Several attempts have been made to review AI-based orthopaedic disorder diagnoses in the literature. The survey [23] provided a structured review of existing studies from 2017 to 2021 in the PubMed, MEDLINE, and Embase databases that applied the DL model to various orthopaedic conditions. This study highlighted significant advancements and performance metrics in fracture detection, osteoarthritis diagnosis, and soft tissue disease classification. While it reviewed various studies, it failed to organize them into a detailed taxonomy. This would help in understanding the different types of computational AI systems more comprehensively. Moreover, a review [24] provided a comprehensive overview of the articles that use AI systems in musculoskeletal imaging, covering trauma, bone age estimation, osteoarthritis, tumours, and orthopaedic implants. This highlights the importance of AI for assisting radiologists in optimizing workflows, improving diagnostic accuracy, and handling increased workloads. However, this survey lacked a systematic method for literature evaluation, which may have biased the study selection and findings. There is little discussion regarding AI research dataset availability and quality, which is essential for developing robust and generalizable AI computational models. A recent review [25] suggested two ways to improve AI fracture diagnosis via orthopaedic X-rays. First, the training dataset quantity and quality should be increased, and more advanced deep learning algorithms should be used. To obtain a more complete diagnosis, the second technique integrates AI algorithms with CT and MRI. Table 1 shows the differences and coverage between the other reviews and the present study. It illustrates the coverage of multiple aspects and directions integrated with the context of computational AI systems that are used in orthopaedics, such as developed taxonomy, discussion analysis, and trustworthiness.

Table 1 Comparisons between previous relevant reviews and the current systematic review

Despite these prior efforts, the existing reviews lack a defined classification, which makes it difficult to compare how intelligent systems have been adopted. Many reviews do not analyze the literature systematically, which leaves them open to selection bias and unreliable conclusions. These studies focus on improvements and performance indicators, but they often disregard the reasons for incorporating AI systems into orthopaedics, such as diagnostic efficiency, and neglect the dataset availability and quality needed for robust AI models. Additionally, the absence of detailed recommendations for future research and practical implementation leaves a gap in guiding the development and deployment of AI technologies in clinical applications. The literature review framework for measuring the trustworthiness of intelligent systems in orthopaedic disease detection is presented in Fig. 1. The primary contributions of the presented systematic review are as follows:

  • Providing a structured and detailed taxonomy of the AI computational systems that are used in orthopaedic disease detection.

  • Elucidating the crucial AI models and datasets used in orthopaedic disease detection to address key challenges in AI research in this field.

  • Identifying and organizing the key findings into motivations, challenges, and recommendations, making it easier for readers to understand the current obstacles that need to be addressed for future advancements.

  • Providing a broad picture of the trustworthiness measurement requirements for orthopaedic studies based on trustworthy AI components, helping scholars identify current gaps and potential solutions.

Fig. 1 Literature review framework

The paper is structured as follows: Sect. 2 outlines the methodology employed for the systematic literature review, detailing the search strategy, eligibility criteria, and study selection process. Section 3 provides a comprehensive review of previous studies, with a focus on studies that used intelligent systems in orthopaedic disease detection. Section 4 discusses the key findings from the systematic review, highlighting the motivations for AI adoption, the challenges encountered, and future research opportunities in this field. Section 5 shows a classification of the AI methods and the dataset types that are identified in the literature, organizing them by functionality in orthopaedic diagnostics. Section 6 concludes the paper by summarizing the insights gained from the study, including recommendations for future research and the critical need for trustworthy AI systems in clinical practice.

2 Methodology

This systematic literature review was conducted in accordance with the "Preferred Reporting Items for Systematic Reviews and Meta-Analyses" (PRISMA) criteria and aligned with previous research on quality databases [27,28,29,30]. By broadening its scope beyond the trustworthiness of AI systems, this investigation aims to systematically assess the contributions of AI to orthopaedic disease detection and the extent to which these contributions span the entire AI life cycle, from development through deployment to assessment. Although trustworthiness is an important dimension, this review provides a broader analysis of the implications for orthopaedic diagnostics, addressing the overarching research question described in the introduction. The IEEE Xplore, PubMed, Web of Science, Science Direct, and Scopus databases were searched; they were chosen for their relevance and coverage in the healthcare, AI, and orthopaedics domains. PubMed was included because it covers the biomedical literature, including high-quality research on musculoskeletal diseases. Scopus and Web of Science were selected on the basis of their extensive multidisciplinary coverage, which offers a wide collection of peer-reviewed studies across clinically related domains of AI. IEEE Xplore was selected because of its large store of state-of-the-art research in AI methodologies, which are foundational to this review. Finally, Science Direct was included because of its high representation of technical and applied research on health status and health quality data in AI. Collectively, these five databases provide a comprehensive platform to capture the breadth of studies at the intersection of AI and orthopaedic disease detection. In parallel, where relevant, the review covers trustworthiness, which is consistent with the high-level aim of assessing the feasibility of intelligent systems and their responsible assimilation into clinical practice.

2.1 Search Strategy

For this investigation, five databases were thoroughly searched for English-language scholarly literature. The search ranged from January 2019 to 2024, when the detection of musculoskeletal diseases via AI methods increased due to breakthroughs in methods, technologies, and knowledge. The "OR" operator was used to connect "Musculoskeletal disease," "Orthopaedic," and "Bone disease," and the "AND" operator linked these phrases to "Artificial intelligence," as shown at the top of Fig. 2. This search approach was used to find relevant scholarly literature.
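As a minimal sketch of the Boolean logic described above, the following Python snippet assembles the search string from the stated phrases. The helper name `build_query` and the use of quotation marks and parentheses are assumptions made for illustration; each database has its own query syntax, and the exact strings submitted to the five databases are not given in the text.

```python
# Illustrative only: the phrases come from the text, the helper is hypothetical.
disease_terms = ["Musculoskeletal disease", "Orthopaedic", "Bone disease"]
ai_term = "Artificial intelligence"


def build_query(disease_terms, ai_term):
    """Join the disease phrases with OR, then link that group to the AI phrase with AND."""
    disease_group = " OR ".join(f'"{term}"' for term in disease_terms)
    return f'({disease_group}) AND "{ai_term}"'


query = build_query(disease_terms, ai_term)
print(query)
```

Grouping the disease phrases in parentheses before applying "AND" keeps the operator precedence explicit, which matters because databases differ in how they evaluate mixed operators.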

Fig. 2 An outline of the approach used to identify, select, and include relevant contributions

2.2 Eligibility Criteria

The systematic literature review employed a predetermined set of inclusion and exclusion criteria to guide the selection of pertinent contributions. The criteria were implemented to guarantee that the chosen literature was in accordance with the particular study aims and that a rigorous degree of methodology was used. The inclusion or selection of papers was determined on the basis of the following criteria:

  • The manuscripts must be composed in the English language and disseminated either through a scholarly journal or included in the official records of a prestigious conference.

  • These studies should focus primarily on AI methodologies and techniques for the detection of orthopaedic diseases in humans.

The exclusion criteria for the present study were as follows:

  • Any study focusing on orthopaedic disease detection outside the realm of AI applications was excluded.

  • Study types that are not pertinent to the subject matter, such as animal studies, letters to the editor, and case reports, were excluded.

  • Studies published in languages other than English.

These criteria ensure that the focus remains tightly bound to the intersection of AI and orthopaedic disease detection, enhancing the relevance and quality of the findings.

2.3 Study Selection

Two reviewers independently extracted data from the chosen studies to reduce bias and ensure accuracy. Standardized data extraction forms were employed by the reviewers to ensure uniformity throughout the gathered data. A third reviewer was consulted to settle any disputes that arose between the reviewers. To improve the dependability of the data, the reviewers also corresponded by email with the study investigators to clarify any unclear or missing material in the reports. Furthermore, no automation tools were used in the initial retrieval of studies from the five databases; all data were manually retrieved and confirmed by the reviewers. This manual method further supports the integrity of the data used in the research by enabling careful evaluation of the context and subtleties within each report. The methodology involved several stages, as summarized in Fig. 2, starting with the removal of duplicate studies via Mendeley software. Titles and abstracts were reviewed to eliminate unrelated works, with discrepancies resolved by the corresponding author. A comprehensive examination of the full texts of the articles was then performed according to the predefined inclusion criteria. Initially, 1,657 entries were identified from the databases. After 344 duplicates were removed, 1,313 papers remained. Title and abstract evaluations excluded 628 articles, leaving 685 for detailed examination. Finally, 600 studies were excluded at the full-text stage, resulting in 85 relevant studies included for in-depth analysis. In the present systematic review, essential information was extracted through in-depth analysis of the included studies on several important variables to support subgroup analyses, including the type of AI technology used, the specific orthopaedic disease type targeted, the performance metrics applied to assess the AI models, dataset availability, and the primary findings reported.
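The selection counts reported above can be checked with a few lines of arithmetic. The following Python sketch uses only the figures stated in the text; the variable names are illustrative.

```python
# Counts as reported in the study selection paragraph.
identified = 1657
duplicates_removed = 344
excluded_title_abstract = 628
excluded_full_text = 600

# Each stage subtracts its exclusions from the previous stage's total.
after_deduplication = identified - duplicates_removed
after_screening = after_deduplication - excluded_title_abstract
included = after_screening - excluded_full_text

print(after_deduplication, after_screening, included)
```

The three intermediate totals agree with the 1,313, 685, and 85 studies reported at the corresponding stages.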

3 Orthopaedic Disease Detection-Based AI: Taxonomy

The findings from the selected articles are detailed in this section, with each article analyzed and categorized on the basis of its objectives and contributions. The 85 articles that met the predefined criteria were organized into six primary categories, as shown in Fig. 3. This structured approach ensures systematic analysis on the basis of objective evidence from relevant studies. To provide a more detailed examination, subcategories were established within the major categories, further structuring and presenting the findings. These divisions allow for a wide-ranging overview of AI techniques in orthopaedic disease detection, highlighting advancements and challenges in the field. The categories, encompassing 85 articles, are as follows:

  • Arthritis: 4 out of 85 contributions (4.71%).

  • Tumours: 15 out of 85 contributions (17.65%).

  • Deformities: 10 out of 85 contributions (11.76%).

  • Fractures: 45 out of 85 contributions (52.94%).

  • Osteoporosis: 2 out of 85 contributions (2.35%).

  • General bone abnormalities: 9 out of 85 contributions (10.59%).
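The shares listed above follow directly from the category counts. The short Python sketch below recomputes them; the dictionary simply restates the six counts from the list, and the rounding to two decimals mirrors the precision used in the text.

```python
# Category counts as listed in the taxonomy.
counts = {
    "Arthritis": 4,
    "Tumours": 15,
    "Deformities": 10,
    "Fractures": 45,
    "Osteoporosis": 2,
    "General bone abnormalities": 9,
}

total = sum(counts.values())

# Percentage share of each category, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} of {total} ({share}%)")
```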

Fig. 3 Taxonomy of the use of AI computational systems in orthopaedics

3.1 Arthritis

The arthritis category included four of the 85 selected articles that focused on the role of AI in diagnosing conditions such as hip–knee osteoarthritis and rheumatoid arthritis, aiming to enhance patient care methodologies.

Reliance on gait data is affected by environmental conditions, marker placement consistency, and patient-specific gait patterns. Additionally, the relatively small size of the dataset introduces sampling bias, potentially limiting the model's ability to generalize to larger populations. To address these challenges, the study [31] developed a robust vision-based dataset using passive marker-based techniques to ensure consistent data collection. It optimized feature extraction via the Fractional Order Darwinian Particle Swarm Optimization (FODPSO) algorithm, enabling the precise identification of key regions of interest (ROIs) and the classification of abnormal gaits in knee osteoarthritis (KOA) patients via the k-nearest neighbor (KNN) algorithm, achieving high accuracy across different severity levels. Moreover, a previous study [32] demonstrated the use of DL models, specifically EfficientNetb7, along with XAI techniques such as Grad-CAM, to diagnose KOA from 8,260 knee X-ray images, offering valuable insight into the model's decision-making process. A significant disparity was observed between the number of normal (Grade 0) and severe osteoarthritis cases (Grade 4), creating challenges in training models to effectively recognize the underrepresented categories. To address the class imbalance issue, data augmentation techniques, such as histogram equalization and contrast enhancement, were applied, increasing the representation of severe cases from 51 to 357 samples. As a result, the model achieved an accuracy of 99.13% and excelled in distinguishing normal and severe cases. Furthermore, the study [8] introduced a framework using MCDM methods, including the Analytic Hierarchy Process (AHP) and fuzzy multichoice goal programming (FMCGP), to enhance shared decision-making for patients with KOA.

While the dataset in [8] captures diverse patient goals, the lack of external validation datasets limits the generalizability of the model. Despite these limitations, the system has the potential to enhance healthcare quality by addressing common challenges in decision-making, and it achieved 94.74% adherence to international patient decision-aid standards. For rheumatoid arthritis, which can lead to severe joint damage if untreated, one study [33] introduced a technique called automated rheumatoid arthritis classification via an arithmetic optimization algorithm with deep learning (ARAC-AOADL) using biomechanical images. The dataset in this study comprised 310 samples divided into three classes: Hernia (60), Spondylolisthesis (150), and Normal (100). The class imbalance and reliance on biomechanical features may not capture all underlying conditions influencing rheumatoid arthritis. A shared limitation of the mentioned studies is that none of them considered adversarial attacks during model development, which is crucial for ensuring model robustness. Adversarial attacks, which involve introducing imperceptible perturbations to input data to mislead models, pose significant risks to the reliability and robustness of AI systems, especially in critical applications such as medical imaging.
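To make the notion of an adversarial perturbation concrete, the following Python sketch applies the fast gradient sign method (one well-known attack, not one used or evaluated in the reviewed studies) to a toy linear classifier. The weights, bias, input features, and step size `eps` are all invented for illustration, and the perturbation is deliberately large so that the effect is visible; on real images the per-pixel change would be far smaller.

```python
import math

# Toy logistic classifier; all values below are made up for illustration.
weights = [0.8, -0.5, 0.3]
bias = -0.1


def score(x):
    """Linear score w.x + b."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def predict(x):
    """Predict class 1 when the score is positive, otherwise class 0."""
    return 1 if score(x) > 0 else 0


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(x, true_label, eps):
    """Move each feature by eps in the direction that increases the logistic loss."""
    prob = 1.0 / (1.0 + math.exp(-score(x)))
    # Gradient of the logistic loss with respect to the input features.
    grad = [(prob - true_label) * w for w in weights]
    return [v + eps * sign(g) for v, g in zip(x, grad)]


x = [0.5, 0.2, 0.4]          # classified as 1 by the toy model
x_adv = fgsm(x, true_label=1, eps=0.25)

print(predict(x), predict(x_adv))
```

Each feature moves by at most `eps`, yet the predicted class flips, which is the failure mode that robustness testing is meant to expose.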

With respect to AI studies for arthritis detection, ensuring the trustworthiness of AI systems is paramount for their effective integration into clinical practice. Trustworthy AI involves seven components, as outlined by the European Union's guidelines: human agency and oversight; technical robustness and safety; privacy and data governance; transparency; diversity, nondiscrimination, and fairness; societal and environmental well-being; and accountability. These components help ensure that AI models operate in a reliable, transparent, and ethical manner. To evaluate the level of trustworthiness in the studies collected for this review, the authors categorized them on the basis of these trustworthiness components and rated their adherence to these criteria as very low (V-L), low (L), medium (M), high (H), or very high (V-H), highlighting both the strengths and areas that require significant improvement. The analysis of the collected studies, as shown in Table 2, reveals that technical robustness and safety are the most addressed components, with several studies achieving H or V-H ratings. However, other critical areas, such as human agency, privacy, and transparency, consistently received L or V-L ratings [31, 32]. This suggests that while AI models for arthritis detection demonstrate considerable focus on technical performance, they frequently overlook key trustworthiness principles necessary for ethical and reliable clinical application. Moreover, the consistently low ratings of privacy, fairness, and accountability highlight a significant gap in ensuring that these AI systems adhere to broader ethical standards [32, 33]. This indicates a pressing need for future research to not only improve technical outcomes but also address the essential trustworthiness criteria that underpin the safe and responsible integration of AI in healthcare settings.

Table 2 The current frequency of trustworthy AI requirements in the arthritis detection literature

3.2 Tumours

In this review, the 15 tumour-related contributions (of the 85 collected) are categorized into primary, metastatic, and unspecified-origin tumours, and the computational intelligence techniques used for determining the prognosis of this bone disease are scrutinized.

  • Primary tumours

Primary tumours arise at the site where the cancer originates [34], and the disease is further divided into three types, each linked to the affected area: Ewing sarcoma, chondrosarcoma, and osteosarcoma. Ewing sarcoma is a rare type of malignancy that predominantly affects pediatric and adolescent populations [35]. One study [36] leveraged the transfer learning approach, using a dataset of 182 radiographs, to develop a DL framework capable of distinguishing between osteomyelitis and Ewing sarcoma, achieving accuracies of 94.4% and 90.6% on validation and test data, respectively. Despite its high accuracy, the study does not address the model’s interpretability or how it handles various imaging conditions, both of which are critical for clinical adoption. Chondrosarcoma, in contrast, is a type of cancer that typically starts in cells that produce cartilage, the tough, flexible tissue that cushions bones. Owing to the difficulty in accurately diagnosing atypical cartilaginous tumours and appendicular chondrosarcomas via traditional diagnostic methods, the authors of one study [37] used the LogitBoost ML technique to address this challenge. The study involved 120 patients with confirmed lesions, and the classifier achieved an accuracy of 81% in the training group and 75% in the external test group, with a good intraclass correlation coefficient (ICC). Osteosarcoma, also known as osteogenic sarcoma, is the most common type of primary bone cancer. The study [38] proposed the use of the Honey Badger Optimization with Deep Learning Automated Osteosarcoma Classification (HBODL-AOC) model to identify the existence of osteosarcoma via fuzzy logic and medical images. However, the study could be improved by providing a clearer strategy for model validation beyond accuracy, including metrics such as sensitivity and specificity.

  • Metastatic Tumours

Numerous investigations have explored the utilization of AI in detecting metastatic tumours, that is, tumours that have spread from their site of origin to other parts of the body [39]. Within the field of ML, one study [40] aimed to assess the effectiveness of support vector machine (SVM) and random forest (RF) classifiers in identifying patients with incidental osteoblastic metastases of the spine by screening 200 dual-energy X-ray absorptiometry (DEXA) studies. This study's dependence on a relatively small dataset may limit its robustness and scalability. Furthermore, multi-input Convolutional Neural Network (CNN) models and the Adaptive Moment Estimation (Adam) optimizer with two evaluation strategies were used to diagnose bone metastasis and metabolic bone diseases [41]. One study [42] proposed a CNN-based pipeline that incorporated an ML/DL component to predict the development of bone metastasis. The pipeline was used to construct several ML detection models on the basis of gene expression data. The best results were achieved with InceptionResNet-v2 for normal-abnormal differentiation and Inception-v3 for metabolic-metastatic differentiation. However, the study does not consider the adversarial robustness of the model, which could be critical in real-world applications. In the same DL field, the study [43] developed an automatic image interpretation system using the ResNet-50 model to assist physicians in diagnosing cancer bone metastasis through bone scintigraphy, but the authors did not elaborate on the interpretability of the model, which is critical for clinical validation.

  • Unspecified Tumour Origin

Several studies do not specify whether the tumours under consideration are primary or metastatic and cover a wider scope of topics on the detection of tumours via AI. One study [44] concentrated on the morphological characteristics of cancerous versus healthy bones, employing edge detection algorithms and histogram of oriented gradients (HOG) features, which resulted in an F1 score of 0.92 with the SVM model. Despite its effectiveness, the study did not examine the performance of the model under conditions of noise or other imaging artifacts, which are frequently encountered in practical scenarios. Another study [45] implemented bilateral filtering for noise reduction, adaptive histogram equalization for segmentation, and SVM for classification. In the same area, [46] employed a U-Net architecture with ResNet34 as a training baseline for the neural network, utilizing 12 data augmentation techniques. The study achieved an accuracy of 99.72% and an intersection over union (IoU) of 87.43%. Despite these impressive results, a more in-depth analysis of how different data augmentation techniques impact the model's robustness and performance could further enhance the study. In the study [47], three DL models based on contrast-enhanced MR images were developed to improve the diagnostic efficacy for musculoskeletal tumours. These models significantly enhanced the diagnostic sensitivities of oncologists and orthopaedists without impairing their specificities. However, the study focused primarily on sensitivity and specificity, neglecting other important evaluation metrics such as precision. Studies [48,49,50] have explored advanced DL techniques for the detection and classification of bone tumours, each highlighting unique approaches and results. In the study [48], a new approach for detecting bone tumour necrosis rates was presented by combining generative adversarial networks (GANs) and CNNs to simulate biopsy-based necrosis rate results. The study [49] developed a DL computational model for bone tumour assessment on radiographs by segmentation, bounding box placement, and classification via a mask region-based CNN (Mask-RCNN). Although the model's performance was comparable to that of musculoskeletal fellowship-trained radiologists, the study did not elaborate on the interpretability of the model’s predictions or its robustness in the presence of noisy or incomplete data. Furthermore, the study [50] aimed to assist physicians in detecting and classifying knee bone tumours via the Seg-Unet model with global and patch-based approaches. The model achieved an accuracy of 99.05% for classification and a mean IoU of 84.84% for segmentation. The patch-based approach improved malignant tumour detection. Femoral bone tumours grow abnormally in the femur, which is the largest bone in the human body and extends from the hip to the knee. One study showed that various CNN algorithms used to detect and classify lesions in the proximal femur achieved better performance than practising orthopaedic surgeons with varying experience levels [51]. The reviewed studies reveal that many datasets used in tumour detection AI lack sufficient representation of diverse patient populations. For instance, datasets such as the CNUH dataset [50] and TCIA dataset [38, 47] often provide limited information on ethnicity or geographic variability. Although some datasets include gender distributions and an extensive age range, they do not assess how these factors influence model performance, leaving gaps in understanding demographic-specific biases [44, 48]. Furthermore, models trained on homogeneous datasets risk suboptimal performance in underrepresented populations, as evidenced in the study, which struggled to detect osteosarcoma features in non-Caucasian populations [40]. Future research should prioritize multicenter collaborations to develop datasets that reflect global diversity.
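The evaluation metrics mentioned in this subsection can be written down compactly. The following Python sketch computes sensitivity (also called recall), specificity, and precision from binary labels, and the IoU of two binary masks. The label and mask values are invented toy data, and the function names are illustrative; none of this reproduces the evaluation code of the reviewed studies.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Sensitivity (recall), specificity, and precision from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def iou(mask_true, mask_pred):
    """Intersection over union of two flattened binary masks."""
    intersection = sum(1 for t, p in zip(mask_true, mask_pred) if t and p)
    union = sum(1 for t, p in zip(mask_true, mask_pred) if t or p)
    return intersection / union


# Toy data: 4 diseased and 6 healthy cases, and a 6-pixel mask.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
overlap = iou([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1])

print(metrics, overlap)
```

Because precision depends on the false positives while sensitivity depends on the false negatives, reporting both gives a fuller picture than sensitivity and specificity alone when classes are imbalanced.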

With respect to the trustworthiness measurement for the studies related to AI-based tumour detection, the results in Table 3 reveal a clear focus on technical robustness and safety, with several studies achieving high ratings [37, 44]. However, critical trustworthiness components such as human agency, privacy, and transparency are consistently rated L or V-L [36, 38, 40]. This imbalance raises concerns about the ethical and practical integration of these AI models into clinical settings, where human oversight and data protection are essential. Additionally, the lack of attention given to diversity and fairness risks introducing biases in AI-based tumour detection [37, 43, 44]. While technical performance is prioritized, addressing these gaps is crucial for ensuring both the safety and ethical use of AI in orthopaedic healthcare centers.

Table 3 The current frequency of trustworthy AI requirements in the tumour detection literature

3.3 Deformities

Among the 85 related studies, 10 were specifically dedicated to deformities and were divided into four significant subcategories: Kashin-Beck disease (KBD), with one contribution; developmental dysplasia of the hip (DDH), with three contributions; knee malalignment syndrome, with one contribution; and spine deformities, with five contributions.

  • Kashin-Beck Disease (KBD)


KBD is a chronic, endemic osteoarthropathy marked by deformity of the joints, especially in the hands and knees [52]. The early detection of KBD is particularly challenging due to the subtle radiographic changes in the metaphyseal zones and carpals during the initial stages. For this purpose, the sole study [53] on KBD focused on developing an algorithm to automatically screen for KBD using hand X-ray images. This method employs a CNN to extract both global and local features from images, which are then fed into a neural network for classification. The study achieved an accuracy of 98.5% and a sensitivity rate of 97.6%, outperforming methods that use only global features.

  • Developmental dysplasia of the hip (DDH)


DDH is a condition observed in infants and young children where the ball and socket joints of the hip do not form properly. The manual process of diagnosing DDH is time intensive, requiring clinicians to spend 150–200 s per case on tasks such as classification, angle measurements, and landmark detection. This lengthy process limits efficiency, especially in high-volume clinical settings, and increases the potential for fatigue-related errors [54]. Thus, the authors of one study [55] developed a pyramid nonlocal UNet (PN-UNet) to accurately detect and identify misshapen anatomical landmarks, which are crucial for diagnosing DDH. It reduced the diagnosis time to only 1.21 s per case while maintaining comparable diagnostic accuracy (86–95%) and reliability, demonstrating its potential to streamline workflows and improve productivity in clinical practice. Furthermore, the authors of [56] modified the U-net and ResNet architectures for radiographic measurements of the hip in adults, which could improve the efficiency and accuracy of diagnosing hip conditions. However, the mentioned studies could be criticized for not addressing the robustness of the models under both white-box and black-box adversarial examples, which is critical for ensuring reliable real-world applications.

  • Knee Malalignment Syndrome


Knee malalignment syndrome is a physiological condition in which the kneecap (patella) does not move normally in the cavity of the thigh bone [57]. This typically occurs due to abnormalities in the structure or mechanics of the lower body, leading to an imbalance or misalignment at the knee joint. One study [58] described the YOLO and ResNet landmark regression algorithm (YARLA) for the fully automated assessment of knee alignment from a full-leg X-ray dataset.

  • Spine Deformities


Spine deformities are a vital area in orthopaedics and involve abnormalities that affect the structure and function of the spine [59]. One of the main challenges of scoliosis diagnosis is that orthopaedic surgeons' manual measurement of the Cobb angle often leads to inconsistent and subjective diagnoses [60]. The study [61] adopted an automated system based on four Faster R-CNNs and ResNets, trained with the Stochastic Gradient Descent (SGD) optimizer on X-ray images, to automate Cobb angle measurement and classify scoliosis. This significantly reduced the reliance on manual input, ensuring consistency and objectivity in diagnosis. Moreover, identifying early-stage scoliosis with mild Cobb angle deformities poses challenges because subtle radiographic changes are difficult for both clinicians and models to detect accurately [62, 63]. The study [64] employed U-Net, which is specifically designed for biomedical image segmentation, to automatically segment vertebrae, locate endplates, and calculate Cobb angles with a prediction accuracy of 94.42%. Furthermore, employing DeepLabV3 and EfficientNet-B4 with optimization for segmenting and classifying lordosis on cervical X-rays achieved better accuracy than two surgeons [65].
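To make the final step of such a pipeline concrete, the following is a minimal sketch of how a Cobb angle could be derived once two endplates have been located. It is not the method of [61] or [64]; the function names and the landmark coordinates are hypothetical, and each endplate is assumed to be given as two (x, y) points. Because the lines are treated as undirected, the result is folded into the range 0–90°, so very severe curves would need additional handling.

```python
import math


def endplate_tilt_deg(left, right):
    """Tilt of the line through two endplate landmarks, in degrees."""
    return math.degrees(math.atan2(right[1] - left[1], right[0] - left[0]))


def cobb_angle_deg(upper_endplate, lower_endplate):
    """Angle between two endplate lines, each given as ((x, y), (x, y)).

    Lines are undirected, so the result is folded into the range [0, 90].
    """
    diff = abs(endplate_tilt_deg(*upper_endplate) - endplate_tilt_deg(*lower_endplate))
    diff %= 180.0
    return min(diff, 180.0 - diff)


# Hypothetical landmarks: upper endplate tilted by +10 degrees, lower by -20 degrees.
upper = ((0.0, 0.0), (10.0, 10.0 * math.tan(math.radians(10))))
lower = ((0.0, 0.0), (10.0, 10.0 * math.tan(math.radians(-20))))
angle = cobb_angle_deg(upper, lower)
print(f"Cobb angle: {angle:.1f} degrees")
```

In an automated system, the two endplates would come from the detection or segmentation model rather than from hand-entered coordinates.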

The trustworthiness analysis of the studies on deformity detection, as shown in Table 4, highlights technical robustness and safety as the most emphasized component, with several studies achieving M- or V-H ratings [53, 55, 61]. However, other essential areas, including human agency, privacy, and transparency, consistently receive L or V-L ratings [53, 54, 58]. This pattern indicates that while the technical aspects of AI models for deformity detection are prioritized, the ethical dimensions are frequently overlooked. Furthermore, the consistently low ratings of diversity and fairness [60, 61] and accountability [63, 65] suggest that many studies lack adequate consideration of ethical inclusivity and the responsibility for potential negative outcomes.

Table 4 The current frequency of trustworthy AI requirements in the deformity detection literature

3.4 Fractures

According to the taxonomy analysis of our systematic review, 45 out of 85 significant contributions were made within the fracture category. These contributions reflect the substantial focus on fractures in orthopaedic research, highlighting the importance of developing accurate and reliable AI models for diagnosing and managing various types of fractures. The high number of studies indicates a strong interest in leveraging AI to improve fracture detection, classification, and treatment, which is crucial for streamlining clinical workflows. For general fracture detection, traditional classifiers, such as KNN and SVM, often underperform because of incomplete feature extraction and their inability to handle high-dimensional image data effectively. Additionally, relying solely on traditional features, such as texture and shape, limits the depth of information used for classification, reducing overall accuracy. Therefore, the study [66] utilized AlexNet for deep feature extraction and integrated learning to train classifiers, assigning different weight values on the basis of each classifier's contribution, which increased the accuracy of the clinical diagnosis. Furthermore, researchers have focused on adopting different natural language processing (NLP) methods to classify radiology reports in orthopaedic trauma, comparing different ML approaches, including a DL-based BERT model [67].

  • Femoral Fractures


Many studies have shown that AI methods can help with femoral fracture diagnosis. Notably, advanced CNN architectures such as VGG19, InceptionV3, and ResNet50 have been employed to improve the radiographic diagnosis of atypical femoral fractures, achieving an average accuracy rate of 81.93% [68]. Accurately diagnosing fractures can be challenging due to variations in imaging conditions. To address this, a novel encoder-decoder neural network was proposed that incorporates radiology reports as additional information during training [69]. Conventional feature extraction procedures, such as Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP), delivered limited performance due to their inability to fully capture the complex features required for accurate fracture detection [70]. The features extracted from the CNN layers were further refined using Bidirectional LSTM (BiLSTM) and Long Short-Term Memory (LSTM) architectures, which enhanced the classification performance for femoral fractures [71]. However, these studies often lack selection and benchmarking techniques for choosing the optimal detection model. Without these techniques, ascertaining the most effective model for clinical application is challenging, as there is no standardized method for comparing model performance across different datasets and conditions.

  • Ankle and Foot Fractures


The detection of occult ankle fractures presents several significant challenges. These fractures are subtle and not easily visible in standard radiographs, often leading to a greater risk of misdiagnosis. Additionally, less obvious fractures are underrepresented relative to easily identifiable fractures, creating a class imbalance that can bias AI models during training. Therefore, multiple Deep Convolutional Neural Networks (DCNNs) have been employed to adapt features learned from general image recognition tasks (e.g., ImageNet) to the specific task of fracture detection [72, 73]. Furthermore, data augmentation techniques, including flipping and rotation (± 10°), were applied to increase the diversity of training samples, improving model robustness and generalizability. Moreover, single-view radiographs (e.g., anteroposterior) often fail to capture all relevant features of the ankle structure, making it difficult to detect fractures comprehensively. The study [74] addressed this issue by employing three-view radiographs with Inception-V3 and ResNet-50 models, which demonstrated superior performance compared with single-view radiographs, significantly enhancing the diagnostic process. A foot fracture diagnosis assistance system developed with the Gradient-weighted Class Activation Mapping (Grad-CAM) XAI technique and ensemble learning techniques was tested across various proficiency levels and showed marked improvements in diagnostic accuracy [75]. The identification of the optimal model is a critical step in ensuring the success and applicability of intelligent systems for detecting ankle and foot fractures. While the reviewed studies demonstrate significant advancements in developing AI models, they often fail to prioritize or identify the most effective detection model from the set of developed approaches. This oversight limits the potential to achieve the highest possible performance in terms of accuracy, sensitivity, and specificity.
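The flipping and rotation (± 10°) augmentation mentioned above can be illustrated with a small, dependency-free sketch. It is only an illustration under stated assumptions: the image is a hypothetical grayscale grid stored as a list of rows, rotation uses nearest-neighbour sampling about the image centre, and the function names are not taken from the reviewed studies, which would typically rely on an imaging library instead.

```python
import math
import random


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate(img, degrees, fill=0):
    """Rotate about the image centre using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def augment(img, rng, max_deg=10.0):
    """Random horizontal flip followed by a random rotation within +/- max_deg."""
    if rng.random() < 0.5:
        img = hflip(img)
    return rotate(img, rng.uniform(-max_deg, max_deg))


# Hypothetical 4x4 "radiograph" with distinct pixel values.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = augment(image, random.Random(0))
```

Applying `augment` with a fresh random draw at every training epoch yields a different variant of each radiograph, which is how augmentation increases sample diversity without new acquisitions.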

  • Calcaneus and Vertebral Fractures


High-energy events often cause calcaneal fractures, which can severely impact mobility. Studies exploring the use of augmented and nonaugmented images with DL techniques have shown significant potential in classifying and detecting these fractures in CT images [76, 77]. Furthermore, an automated system utilizing a U-Net model was employed for anatomical angle measurement, fracture identification, and segmentation in radiographs, with high accuracy [78]. Vertebral fractures involving the collapse or breakage of spinal bones are particularly challenging. The primary challenge in vertebral fracture detection is class imbalance in the dataset, with a significantly greater number of normal vertebrae than fractured vertebrae [79]. This imbalance introduces bias in model training, making it difficult for AI algorithms to detect fractures accurately, particularly subtle or severe cases [80]. To address this, the study [80] balanced the dataset by undersampling normal vertebrae and employed YOLOv3 for precise vertebra localization and classification, combined with ensemble learning to improve accuracy and interpretability. Moreover, the study [81] applied ensemble learning to the VGG16, VGG19, DenseNet201, and ResNet50 architectures with MRI data for vertebral fractures and achieved superior diagnostic efficiency compared with that of spine surgeons. However, these studies did not consider adversarial examples, which can significantly impact the robustness and reliability of DL models.
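As an illustration of the undersampling idea, the sketch below randomly discards samples from the larger classes until every class matches the size of the smallest one. It is a generic sketch, not the procedure of [80]; the function name, the label encoding (0 for normal, 1 for fractured), and the sample counts are hypothetical.

```python
import random


def undersample(samples, labels, rng):
    """Return (sample, label) pairs with every class reduced to the minority size."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Draw without replacement so no sample is duplicated.
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Hypothetical dataset: 90 normal vertebrae (label 0) and 10 fractured (label 1).
vertebra_ids = list(range(100))
vertebra_labels = [0] * 90 + [1] * 10
balanced_set = undersample(vertebra_ids, vertebra_labels, random.Random(42))
```

The trade-off is that undersampling throws away majority-class data, so it is usually weighed against alternatives such as oversampling or class-weighted losses.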

Furthermore, the CNN tends to generate false positives, often misinterpreting degenerative changes, nutrient foramina, and ligament ossifications as fractures. Grad-CAM heatmaps, combined with a two-stage detection process, were used to visually highlight fracture regions, assisting radiologists in verifying AI predictions and reducing diagnostic errors [82]. A novel DL model combining YOLOv4 and ResUNet has been introduced to detect vertebral fractures from X-ray images [83]. It achieved high precision rates of 99% for healthy vertebrae, 74% for compression fractures, and 94% for burst fractures. However, these studies focused primarily on X-ray images, which are less detailed than CT or MR images. This might limit the model's applicability to more complex or subtle fracture cases.

  • Pelvic, Wrist, and Hand Fractures


According to the literature, many studies have explored different pelvic regions, including the sacral, hip, and acetabulum areas, each showing that AI methods can optimize current detection methodologies [84,85,86]. Pelvic and hip fractures involve multiple anatomical sites with complex morphologies, making it difficult to define fracture boundaries and accurately classify fracture types [87]. To address this issue, the study [88] developed PelviXNet, a DL algorithm trained with point-based annotations instead of traditional bounding boxes. This method efficiently captures regional information, enhancing the model's ability to detect fractures across diverse categories. One critical aspect of acetabular fracture classification is detecting iliopectineal line fractures, which can appear as minor disruptions that are difficult to identify via conventional radiographs, especially when noise, poor image quality, or complex anatomical structures obscure fracture visibility. To address this, Gaussian filtering was used to denoise radiographs while preserving essential edge details, and the Derivative of Gaussian (DoG) method was applied to highlight the iliopectineal line while minimizing false edges caused by noise or overlapping anatomical structures [89].
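The principle behind Gaussian smoothing followed by a derivative-of-Gaussian edge response can be shown on a one-dimensional intensity profile. The sketch below is a simplified illustration only: the study [89] operates on two-dimensional radiographs, and the kernel width, the radius, and the synthetic step profile used here are assumptions chosen for readability.

```python
import math


def gaussian_kernel(sigma, radius):
    """Sampled Gaussian, normalised so that the weights sum to one."""
    raw = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [v / total for v in raw]


def derivative_of_gaussian_kernel(sigma, radius):
    """First derivative of the Gaussian: G'(x) = -(x / sigma^2) * G(x)."""
    g = gaussian_kernel(sigma, radius)
    return [-(i / (sigma * sigma)) * g[i + radius] for i in range(-radius, radius + 1)]


def convolve_1d(signal, kernel):
    """Discrete convolution with replicated borders."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            src = min(max(i - (j - radius), 0), last)
            acc += weight * signal[src]
        out.append(acc)
    return out


# Synthetic intensity profile: dark region followed by a bright region.
profile = [0.0] * 10 + [1.0] * 10
smoothed = convolve_1d(profile, gaussian_kernel(1.5, 4))
edge_response = convolve_1d(profile, derivative_of_gaussian_kernel(1.5, 4))
edge_index = max(range(len(edge_response)), key=lambda i: abs(edge_response[i]))
```

The response is close to zero in flat regions and peaks at the intensity step, which is why this type of filter highlights line-like boundaries while suppressing isolated noise.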

Detecting distal radius (wrist) fractures typically requires a large dataset (thousands of images) to train DL models effectively [90]. Nevertheless, studies [91, 92] have reported that the accuracy of fracture detection via CNNs is comparable to published values despite small training datasets. Similarly, the study [93] aimed to develop an intelligent system capable of accurately diagnosing distal radius fractures from a small biplane plain X-ray dataset. The system was built on VGG16 pretrained on the ImageNet dataset to compensate for the small dataset size, and its performance was evaluated via several metrics.

A major challenge in AI-based fracture detection is the black-box nature of models, which limits clinician trust due to the lack of transparency in decision-making [94]. To address this, Grad-CAM was applied to visualize key areas influencing predictions, making the EfficientNet-B4 model more interpretable and fostering trust among clinicians [95]. The study [96] evaluated the effectiveness of YOLO single-stage DL models (YOLOv5, YOLOv6, YOLOv7, and YOLOv8) for detecting wrist fractures in pediatric X-rays and reported that they were superior to the Faster R-CNN model. YOLOv8m showed the highest sensitivity (0.92) and a mean average precision (mAP) of 0.95. This research underscores the potential of YOLO models to increase the accuracy and efficiency of pediatric wrist fracture diagnosis. Another study evaluated the ability of a CNN to diagnose distal radius fractures via frontal and lateral wrist radiographs, including 503 patients with wrist fractures diagnosed via plain radiographs and 289 patients without fractures. However, the dataset used may not fully represent the diversity and variability observed in clinical practice. The relatively small and specific sample could limit the generalizability of the CNN model to broader populations or different clinical environments. Furthermore, the manual cropping of radiographs used in that study might have introduced biases or inconsistencies.

Scaphoid fracture detection poses significant challenges due to the unique structure and imaging characteristics of this small bone. A major difficulty lies in its small ROI, as the scaphoid occupies a tiny area in hand images. This results in a pronounced class imbalance during model training, with a significantly greater number of negative samples (nonfracture areas) than positive samples (fracture areas), making accurate detection more difficult [97,98,99]. To address this, the study [97] introduced a specialized network architecture called CSR-Net, which leverages cross-scale residual connections. These connections enable the model to effectively integrate features from different layers and scales, allowing it to focus on the small ROI of the scaphoid and improving its ability to detect fractures despite their small size. Additionally, the structural similarity between the scaphoid and surrounding carpal bones further complicates fracture identification, as these bones share similar radiographic features, making it challenging for models to distinguish subtle fractures from adjacent structures. Employing data augmentation techniques, such as rotation and contrast enhancement, alongside CNN models provides a more balanced training set and improved generalizability [98, 99]. However, no study has yet developed a model capable of efficiently detecting these types of fractures while accounting for diverse patient demographic information.

  • Supracondylar and Tibial Plateau Fractures


Supracondylar fractures, which commonly occur in children, present unique challenges due to their anatomical complexity, requiring a high level of expertise for accurate diagnosis. The variability in ossification centers and the presence of incomplete fractures in pediatric patients further complicate detection. Therefore, a dual-input CNN-based algorithm utilizing two identical ResNet-50 models for anteroposterior and lateral elbow radiographs demonstrated substantial efficacy in the automated detection of pediatric supracondylar fractures via conventional radiography [100]. Tibial plateau fractures, which occur at the top of the tibia, are critical because of their impact on knee stability and mobility. AI has made advances in this area with the development of a RetinaNet model trained on a substantial dataset of 542 X-rays [101]. This model performed robustly, indicating the ability of DL to handle the complexities of diagnosing fractures in load-bearing joints.

  • Rib Fractures


Rib fractures, especially incomplete and subtle fractures, are difficult to identify because of their small size and overlapping structures in chest CT images [102]. One study [103] employed a 3D object detection model and analyzed high-energy trauma patients’ CT scans via a CNN, which proved more effective than radiologist evaluations. The model retained 3D spatial continuity between CT slices, improving the detection of subtle and incomplete fractures. However, the study lacked a comparison of the transfer learning models with a standardized benchmark, making it difficult to objectively assess their relative performance. Moreover, the CCE-Net model, which incorporates contralateral, contextual, and edge-enhanced modules, uses a large dataset of 1639 digital radiography images for training, achieving high accuracy and demonstrating the potential of DL models to handle large and complex datasets [104].

  • Knee Fractures and Meniscus Tears


Knee fractures, particularly those around complex knee joints, require precise classification to guide treatment strategies. The high granularity of the classification system requires the model to distinguish subtle differences between fracture types, increasing the risk of misclassification. Additionally, the lack of transparency in AI decision-making makes it challenging for clinicians to trust the model’s predictions, particularly for complex or ambiguous fracture types. To address this, a modified 26-layer ResNet-based CNN architecture was employed, enabling the model to extract detailed features and improve accuracy in distinguishing subtle fracture types [105]. Furthermore, Grad-CAM XAI techniques were utilized to generate heatmaps highlighting fracture regions. This approach enhances model interpretability and fosters clinician trust by providing visual explanations for AI decisions.

According to the literature, MRI plays a crucial role in diagnosing meniscus tears, which are common knee injuries, by providing detailed visualization of cartilage structure, edema, and subchondral bone damage [106]. However, its effectiveness is often hindered by image degradation and blurred boundaries, posing significant challenges to accurate lesion detection and classification. The Dragonfly Optimization and Regional Similarity Transformation Algorithm (DO-RSTA) was employed to address this issue by correcting noise and uneven illumination, along with a modified version of the AlexNet model that facilitated the classification of different lesion levels [107]. DL models have demonstrated lower sensitivity in detecting lateral meniscus tears than medial tears, highlighting the need for further optimization in identifying these more subtle or less apparent disruptions [108]. One study utilized a multistep CNN architecture comprising coronal and sagittal convolutional blocks, batch normalization layers, and inception modules. These layers effectively extract subtle features, enhancing classification accuracy when trained on over 18,500 MRI scans from multiple institutions [109].

In measuring the effects of trustworthiness components on fracture detection studies, the technical robustness and safety component is notably well covered, with several studies achieving M–H ratings [66, 68, 79], as shown in Table 5. However, other critical aspects, such as human oversight, data governance, and clarity in decision-making, are consistently underrepresented, with most studies receiving V-L scores [66, 67, 71]. This finding points to a persistent gap in ensuring that AI models for fracture detection uphold essential ethical standards. The low focus on fairness and responsibility [70, 71] further highlights the need for a more balanced approach that prioritizes both the technical and ethical dimensions required for trustworthy AI in the medical field.

Table 5 The current frequency of trustworthy AI requirements in the fracture detection literature

3.5 Osteoporosis

Osteoporosis is often characterized by a gradual loss of bone mineral density, leading to subtle fractures or deformities that are difficult to detect on conventional radiographs. Two of the 85 total collected contributions discussed the effectiveness of AI models in osteoporosis detection. Osteoporosis detection via dual-energy X-ray absorptiometry (DXA) faces several significant challenges. Inconsistent ROI selection, often due to operator-dependent errors in identifying areas such as the lumbar spine or femur, affects the precision of bone mineral density (BMD) calculations and leads to variability in measurements. To address these issues, the authors of [110] proposed a detection system based on a multilayer perceptron neural network using digital images that accurately measured BMD via an automated model for ROI selection (e.g., the lumbar spine and femur), which minimizes operator dependency and ensures consistent and precise ROI localization. Moreover, variability in human measurements of the second metacarpal cortical percentage (2MCP), often due to manual annotation errors, reduces the reliability of osteoporosis diagnosis. Therefore, the authors of [111] implemented automated laterality correction and vertical alignment normalization to standardize radiographs, which were integrated into a fully convolutional network (FCN-8) for segmenting the second metacarpal, enabling precise ROI extraction while minimizing human intervention. However, both studies lack a clear outline of the methodological phases, making it difficult to follow the step-by-step process and reproduce the experiments. A transparent and detailed methodology is crucial for other researchers to validate and build upon the findings.

The evaluation of trustworthiness in AI models for osteoporosis detection studies reveals significant issues, particularly with respect to human agency, privacy, and accountability, with both studies reporting V-L in these areas [110, 111]. This underscores a lack of clinician oversight and poor handling of sensitive patient data, which are critical for building trust in healthcare computer-aided detection systems.

3.6 General Bone Abnormalities

Many studies in the literature have highlighted issues related to the datasets used in abnormality detection. High image variability is a common challenge, as datasets often include diverse lower extremity radiographs (e.g., foot, ankle, knee, and hip) captured under varying imaging conditions, such as resolution, contrast, and noise, making uniform feature extraction difficult [112]. The study utilized DenseNet-161, which effectively captured complex features from lower extremity radiographs and visualized them via the Grad-CAM technique [113]. The model was pretrained on ImageNet and MURA datasets to leverage prior knowledge, enhancing performance on small datasets and reducing the need for extensive labeled training data [114]. Additionally, class imbalance poses a significant problem, with a disproportionate number of normal images compared with abnormal images, which biases model predictions and limits the detection of rare abnormalities. Addressing these issues is essential for improving model performance and reliability. Moreover, the authors of [115] proposed a new model based on DenseNet-169, DenseNet-201, and InceptionResNetV2 to enhance the detection of upper extremity abnormalities using limited data. However, the absence of selection and benchmarking techniques for choosing the most suitable detection model hinders the application of those models in healthcare centers. Furthermore, owing to the limited number of abnormal cases in most of the bone diseases studied, many models face the risk of overfitting, particularly with DL architectures [116]. To address this issue, studies have combined multiple models, such as VGG-19 and ResNet, using an ensemble technique that assigns weights to each model's predictions [117, 118]. This approach has improved overall accuracy and mitigated biases from individual models.
Additionally, cross-validation was employed during training to evaluate the model’s robustness and stability, ensuring that the ensemble system generalized well across different data subsets [119, 120].
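A weighted ensemble of this kind can be sketched as a weighted average of the class probabilities produced by each model. The example below is illustrative only: the weights and probabilities are made-up values, the function name is hypothetical, and the reviewed studies [117, 118] may derive their weights differently (for example, from validation accuracy).

```python
def weighted_ensemble(prob_lists, weights):
    """Combine per-model class probabilities using normalised model weights."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    combined = [0.0] * n_classes
    for probs, weight in zip(prob_lists, weights):
        for cls, p in enumerate(probs):
            combined[cls] += (weight / total) * p
    return combined


# Hypothetical outputs for one radiograph: [P(normal), P(abnormal)] per model.
model_a_probs = [0.6, 0.4]
model_b_probs = [0.2, 0.8]
# Hypothetical weights, e.g. proportional to each model's validation performance.
combined_probs = weighted_ensemble([model_a_probs, model_b_probs], weights=[1.0, 3.0])
predicted_class = max(range(len(combined_probs)), key=lambda c: combined_probs[c])
```

Because the more heavily weighted model dominates the average, the ensemble here follows the second model and predicts the abnormal class, while still being tempered by the first model's disagreement.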

The trustworthiness evaluation of studies on general bone abnormality detection shows notable deficiencies, particularly in terms of human agency, privacy, and accountability, where most studies have received V-L ratings [112, 113, 115, 117], as shown in Table 6. These issues highlight the need for improved clinician involvement and better data governance practices to ensure secure and ethical AI implementation. While some studies have performed moderately in terms of transparency and fairness, receiving M and H ratings [112, 114], the overall trustworthiness of these models is compromised by a lack of societal and environmental considerations.

Table 6 The current frequency of trustworthy AI requirements in the bone abnormality detection literature

4 Discussion

This section presents a detailed overview of the motivations, challenges, and prospects of using AI in orthopaedic disease detection studies. It contributes to the assessment of the available literature by identifying the major reasons in favor of using AI in orthopaedics, the technical and ethical barriers that impede such usage, and the directions for further research aimed at producing more effective, safe, and trustworthy AI as an augmentative tool in orthopaedics.

4.1 Motivations

Many reviewed studies have shown that computational AI models in orthopaedic diagnostic systems have been developed to accurately identify complicated disorders such as acetabular fractures, scaphoid fractures, and bone cancers in a manner similar to that used by emergency room clinicians, improving doctors' diagnoses and patient outcomes [97]. Additionally, AI simplifies orthopaedic diagnostic workflows in healthcare centers, which can reduce diagnostic errors among doctors and optimize clinical outcomes [47]. Compared with traditional methods, automated AI systems detect vertebral and knee bone tumours more accurately and quickly [75]. AI models also improve decision-making, prevent misdiagnoses in critical conditions such as malignant tumours and fractures, and automate measurements, enhancing consistency and reducing subjectivity. Furthermore, many AI research initiatives are motivated by the significant impact that early diagnosis and preventive healthcare can have on treatment outcomes and quality of life [60, 107]. Intelligent computational models are being developed to reliably detect early signs of conditions such as wrist fractures and osteoarthritis [61]. Additionally, improvements in diagnostic procedures for atypical cartilaginous tumours and pelvic radiography and in the prediction of high-risk bone metastasis demonstrate the ability of AI to provide targeted diagnostic support, which is essential for timely interventions [37]. In addition to improving diagnostic accuracy and clinical outcomes, the integration of trustworthy AI in orthopaedic diagnostics is vital for building clinician and patient confidence [8]. AI systems designed with transparency, explainability, and fairness are more likely to gain acceptance in healthcare centers [32].
By ensuring that AI algorithms are interpretable and provide a clear rationale behind their decisions, healthcare professionals can trust the system's output, which is crucial when dealing with complex and high-stakes diagnoses, such as bone tumours or fractures [50, 95]. Figure 4 illustrates the three key reasons in the literature for integrating AI into orthopaedics.

Fig. 4 Motivations of AI computational systems in orthopaedics

4.2 Challenges

According to the literature, acquiring large, well-labeled datasets for effective AI training in orthopaedics is time-consuming and labor-intensive, leading to data quality and availability challenges [37]. Small dataset sizes, high-resolution requirements, and the unique appearance of medical images complicate AI model training. Studies highlight these challenges, particularly in knee bone tumour detection, owing to the uncommon appearance and variety of poses in X-ray images [36, 50]. Without enough data, distinguishing between similar conditions such as Ewing sarcoma and acute osteomyelitis becomes difficult, leading to potential misdiagnoses and reduced accuracy. Many studies have reported that diverse datasets representing various patient demographics and medical conditions are crucial for the good generalization of computational models across populations [38, 51]. Strategic collaboration, multiple imaging modalities such as X-rays, CT scans, and MRIs, and high-quality data annotation are essential for increasing dataset size and diversity for complete bone disease diagnosis [70]. Many studies have highlighted the need for precise and reliable integrated diagnostic tools to improve patient healthcare and clinical adoption, especially for bone tumour diagnosis and osteosarcoma classification [46, 86]. By showing AI's ability to match or exceed human diagnostic accuracy in orthopaedic illnesses, especially femoral intertrochanteric fractures, AI technologies gain credibility and practical use [107]. Current orthopaedic diagnosis methods are challenging because they rely on the interpreter's competence, and errors can arise from differences in skill level, fatigue, or easily missed minor symptoms. This issue is crucial for diagnosing proximal femur bone cancers and scaphoid fractures, where misdiagnosis can cause prolonged pain and disability [113].
Even with two- and three-dimensional CT scans, the pelvic anatomy and unusual fracture forms, such as acetabulum fractures, make diagnosis problematic [66]. More efficient orthopaedic AI diagnosis systems are needed due to these limitations. Figure 5 presents the challenges in integrating AI in the context of orthopaedic disease detection according to three key directions.

Fig. 5 Challenges of AI computational systems in orthopaedics

4.3 Future Research Avenues

According to the literature, expanding and diversifying orthopaedic AI datasets improves performance and generalizability, which can improve the clinical feasibility and statistical reliability of these models, increasing their adaptability across healthcare institutions [107]. Multimodal imaging with detailed labeling and varied angles can increase diagnostic accuracy, especially for challenging classifications such as knee cartilage lesions [44, 75]. These recommendations emphasize the importance of diverse and comprehensive datasets for developing reliable AI diagnostic tools for orthopaedic diseases. Additionally, improving diagnostic tools involves optimizing features and expanding landmark identification for better accuracy and adaptability [66]. Key strategies include hybrid computational models that merge traditional and deep features, mobile apps for remote diagnostics, and diverse datasets with various fractures and imaging modalities [53]. Exploring different AI architectures and using metaheuristic algorithms such as ant colony and gray wolf methods can also enhance fracture detection [71]. These methods collectively emphasize the need for robust AI optimization to advance orthopaedic diagnostics. Clinical validation is another area that orthopaedic diagnostic AI needs to address. External validation and clinical data integration are needed for meniscus tears, ankle fractures, and vertebral fractures [72, 109]. Accurate testing against established benchmarks is essential for ensuring reliability and effectiveness, enhancing diagnostic accuracy, and supporting clinical decision-making [81]. These steps are critical for successful intelligent system deployment in healthcare. Additionally, ethical deployment, regulatory approval, and the clinical advantages of trustworthy AI over traditional approaches should be addressed to enable the models to be applied in healthcare centers [88].
Validation and adherence to ethical guidelines are necessary to build trust and protect patient privacy [84]. Future work should explore the implications of computational detection models based on AI in healthcare, including potential biases and the need for transparency in AI algorithms, to promote ethical practices in the medical field. The four key directions of the literature review recommendations for integrating AI techniques with bone disease diagnosis are shown in Fig. 6.

Fig. 6 Recommendations of AI computational systems in orthopaedi

5 AI Techniques and Datasets for Orthopaedic Disease Detection

This section aims to provide a comprehensive overview of the AI techniques and datasets used in the orthopaedic disease detection literature, highlighting the advancements and challenges in this field. SubSect. 5.1 focuses on categorizing the key AI methods and demonstrating their applications and effectiveness in orthopaedic diagnosis. SubSect. 5.2 discusses the availability and utilization of medical datasets, including X-ray, MRI, and CT scans, and explores how these datasets support diagnostics in orthopaedic research.

5.1 AI Directions

This subsection focuses on detailing the five main AI techniques (DL, ML, XAI, fuzzy logic, and MCDM) and their significance in orthopaedic disease detection. This study aims to categorize and analyze how these techniques contribute to improving diagnostic accuracy and decision-making in orthopaedic healthcare. In evaluating the effectiveness of AI techniques for orthopaedic disease detection, several key performance metrics are commonly used to assess model efficiency. Metrics such as accuracy, precision, recall, F1 score, IoU, confidence interval (CI), and AUC (Area Under the Curve) are frequently applied across studies. By employing these methods, researchers can address various challenges, such as image interpretation, complex disease classification, and personalized care.
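As a reference for how the metrics listed above relate to one another, the following minimal sketch derives accuracy, precision, recall, and F1 score from the four cells of a binary confusion matrix, and IoU from two binary masks. The names `ConfusionCounts`, `classification_metrics`, and `iou` are illustrative assumptions, not taken from any reviewed study, and the counts in the usage note are invented for demonstration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # abnormal cases correctly flagged
    fp: int  # normal cases wrongly flagged
    fn: int  # abnormal cases missed
    tn: int  # normal cases correctly cleared


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(c):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def iou(pred_mask, true_mask):
    """Intersection over union of two same-shaped binary masks (lists of rows)."""
    pairs = [(bool(p), bool(t))
             for pred_row, true_row in zip(pred_mask, true_mask)
             for p, t in zip(pred_row, true_row)]
    intersection = sum(p and t for p, t in pairs)
    union = sum(p or t for p, t in pairs)
    return _ratio(intersection, union)
```

For example, `classification_metrics(ConfusionCounts(tp=40, fp=10, fn=5, tn=45))` gives an accuracy of 0.85 and a precision of 0.8 on these made-up counts, which shows how a model can look strong on accuracy while still missing one in nine abnormal cases.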

5.1.1 Deep Learning (DL)

DL has emerged as a cornerstone in medical imaging, particularly for musculoskeletal disease detection [70]. Its ability to automatically extract complex features and learn hierarchical representations from imaging data surpasses traditional ML methods, making it indispensable for tasks such as fracture classification, osteoarthritis grading, and tumour detection [109, 114]. This section explores the role of DL techniques in detecting orthopaedic diseases and, in turn, in improving patient outcomes [72, 90]. Tables 7, 8 and 9 illustrate the various applications of DL models in orthopaedic research, highlighting their effectiveness in identifying pathological changes and facilitating early diagnosis.

Table 7 CNN techniques used for orthopaedic disease detection
Table 8 Pretrained CNN techniques used for orthopaedic disease detection
Table 9 Optimization algorithms with DL techniques used for orthopaedic disease detection

CNN-based models have shown high accuracy in diagnosing various conditions, such as vertebral fractures, with an accuracy of 86% [79], and intertrochanteric fractures, with an accuracy of 88% [70], aiding clinicians in detecting subtle patterns in medical images. Moreover, advanced models such as Mask R-CNN for ankle fractures achieve 89% accuracy, emphasizing the potential of CNNs in automating fracture detection and assisting clinicians in making faster, more reliable decisions, as shown in Table 7 [72].

However, the reviewed models face several significant challenges that hinder their broader clinical adoption. They rely heavily on large, annotated datasets such as the MURA dataset, which are often scarce in orthopaedic diseases because of the expertise-driven nature of annotation. This limitation restricts their training on diverse and comprehensive datasets such as the Chinese triple-A grade hospital dataset [70], resulting in potential biases and overfitting. Moreover, variations in demographics, scanner types, and imaging protocols across institutions can lead to suboptimal performance when models are applied outside their training environment, further limiting their safe adoption in healthcare settings.

Pretrained CNN models have shown significant importance in orthopaedic disease detection, as highlighted in Table 8. By leveraging architectures like ResNet, EfficientNet, and DenseNet, pretrained models achieve higher accuracy in detecting conditions such as fractures, osteoarthritis, and bone metastases, with accuracies greater than 98%. These models, pretrained on large datasets such as ImageNet, transfer learned features to medical imaging tasks, enabling effective classification and segmentation even with limited annotated datasets such as the Pusan National University Hospital dataset [75].

For example, EfficientNet demonstrated exceptional performance in fracture detection, with accuracies exceeding 90%, whereas DenseNet showed its ability to perform tumour classification by reducing the number of false negatives. This transfer learning approach not only mitigates the need for extensive training data but also enhances generalizability, making pretrained CNNs indispensable for advancing orthopaedic AI systems. The reviewed studies employ various performance metrics, including accuracy, precision, recall, F1 score, and AUC, to evaluate the effectiveness of DL models for orthopaedic disease detection. However, a critical gap lies in the inconsistency of reporting metrics such as the False Positive Rate (FPR), which is crucial for understanding a model's propensity to generate false alarms. Additionally, some studies fail to report confidence intervals (CI) or standard deviations, both of which are essential for assessing the statistical reliability and robustness of the results.
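To make the reporting gap described above concrete, the sketch below computes the False Positive Rate and a 95% confidence interval for a reported proportion such as accuracy. The Wilson score interval is used here as one common choice; the reviewed studies do not prescribe a particular interval, so this choice and the function names are assumptions for illustration only.

```python
import math


def false_positive_rate(fp, tn):
    """Share of truly normal cases that the model flags as abnormal."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

With a hypothetical test set of 100 images and 85 correct predictions, `wilson_interval(85, 100)` spans roughly 0.77 to 0.91, which illustrates why a bare accuracy figure from a small test set is hard to compare across studies.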

Optimization algorithms, such as SGD and Adam, are essential for fine-tuning DL models, enabling them to learn more effectively from complex medical datasets, including radiographic images and patient data [84]. By refining the learning process, these algorithms help DL models converge on optimal solutions, improving diagnostic accuracy and reducing the likelihood of overfitting [51]. This is especially beneficial in detecting conditions such as fractures, musculoskeletal disorders, scoliosis, and bone metastasis, where precision is critical for treatment planning [78]. According to the literature, many studies have shown that the SGD algorithm enhances the diagnosis of various bone diseases [53, 88], as outlined in Table 9. Moreover, the Adam optimizer has proven effective in other studies for improving fracture detection, scoliosis, and bone metastasis [113]. Additionally, seven studies have combined both the SGD and Adam methods to achieve more efficient diagnostics [50, 51].
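The difference between the two optimizers can be seen in their update rules. The following sketch applies plain SGD and Adam to a toy one-parameter loss, (w - 3)^2, rather than to a real network; the learning rates, the toy loss, and all names are assumptions chosen for illustration, while the Adam defaults for beta1, beta2, and eps are the commonly used values.

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain gradient descent: move against the gradient by a fixed step."""
    return w - lr * grad


class Adam:
    """Adam keeps running averages of the gradient and its square."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first-moment (mean) estimate
        self.v = 0.0  # second-moment (uncentred variance) estimate
        self.t = 0    # step counter used for bias correction

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def toy_gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def minimise(step_fn, w0=0.0, iters=500):
    """Repeatedly apply an update rule starting from w0."""
    w = w0
    for _ in range(iters):
        w = step_fn(w, toy_gradient(w))
    return w


w_sgd = minimise(sgd_step)
w_adam = minimise(Adam().step)
```

Both runs end close to w = 3. Adam rescales each step by the recent gradient magnitude, which is why it is often less sensitive to the learning rate than SGD, at the cost of storing two extra values per parameter.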

One challenge is that optimization algorithms require substantial computational resources, making real-time diagnostics difficult in resource-constrained environments, such as smaller healthcare facilities. Moreover, while methods such as SGD and Adam have shown promise, they are not immune to challenges such as local minima or convergence issues, particularly when dealing with highly complex or noisy data. As a result, further research is needed to explore more efficient optimization techniques and establish standardized practices that can enhance the robustness and scalability of DL-based diagnostic systems in orthopaedics.

While various DL architectures, such as ResNet-50, PelviXNet, DenseNet, and YOLOv8, have been employed to detect conditions such as fractures, scoliosis, and musculoskeletal disorders, there is no standardized approach for selecting the best model. This gap in the literature presents a significant challenge, as choosing the best model is crucial for ensuring the reliability of diagnostic systems. Without a clear methodology for model benchmarking, it becomes difficult to compare the performance of different models or determine which one is best suited for clinical applications in orthopaedics.

A review of related studies reveals that DL models for orthopaedic disease detection are often deployed online or integrated into cloud-based systems. While this integration enhances accessibility and scalability, it also potentially exposes these models to adversarial attacks, which can severely compromise the accuracy and reliability of diagnostic outcomes [121]. Despite this significant vulnerability, there is a notable absence of studies addressing this issue in the literature. No research has specifically explored the integration of AI-based orthopaedic detection models with considerations for adversarial attack mitigation, leaving a critical gap in ensuring the robustness and security of these systems in real-world applications.
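To illustrate what such an adversarial attack can look like, the sketch below applies the fast gradient sign method (FGSM) to a two-feature logistic classifier. FGSM is chosen here only as a well-known example; the reviewed studies do not name a specific attack, and the weights, input, and perturbation size are invented for demonstration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fgsm(weights, bias, x, y, eps):
    """Shift every feature by eps in the direction that increases the loss.

    For a logistic model with cross-entropy loss, the gradient of the
    loss with respect to feature i is (p - y) * w_i.
    """
    p = predict(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]


# Hypothetical model and input, for illustration only.
weights, bias = [2.0, -1.0], 0.0
x_clean = [0.5, 0.2]
x_adv = fgsm(weights, bias, x_clean, y=1, eps=0.5)
```

In this toy case the clean input is classified as positive, while the perturbed input falls below the 0.5 decision threshold, even though each feature moved by at most 0.5. Deep models on images can be affected by far smaller per-pixel changes, which is why mitigation deserves attention before cloud deployment.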

5.1.2 Machine Learning (ML)

ML technologies like Linear Regression (LR) [89], SVM [45], and RF [40] have significantly improved the detection of various bone diseases, with a focus on their use in conditions such as bone cancer, fractures, and osteoporosis, as mentioned in Table 10. Notably, five studies have demonstrated the effectiveness of SVM alone or in combination with RF, particularly for complex cases such as osseous metastases, where some studies reported near-perfect sensitivity and specificity [40, 44]. Additionally, LogitBoost was specifically applied in one study for detecting chondrosarcomas, highlighting its potential in specialized diagnostic settings [37].

Table 10 ML techniques used for orthopaedic disease detection

Moreover, Table 11 provides a detailed overview of the integration of ML/DL techniques for bone disease detection, highlighting various AI models applied to different types of orthopaedic diseases such as musculoskeletal disorders, fractures, and tumours. The models include advanced architectures such as ResNet, DenseNet, and ensemble learning approaches, with accuracy and other performance metrics such as precision, recall, and AUC reported across studies [36, 42, 81]. For example, CNN ensemble learning models for Ewing sarcoma achieve an accuracy of 94.4%, whereas ensemble learning models for musculoskeletal disorders report accuracies of approximately 83%. A key insight from this table is the broad adoption of ensemble learning combined with DL techniques, which consistently demonstrates strong performance in diverse applications, such as vertebral fracture and bone metastasis detection. However, some studies do not describe clear methodological development phases for their models, making it difficult to replicate results or understand the processes involved. In addition, several studies do not provide practical results, limiting their applicability in orthopaedic health centers. Finally, the lack of standardized benchmarks makes comparing AI models difficult.

Table 11 ML integrated with DL techniques used for orthopaedic disease detection

5.1.3 Explainable AI (XAI)

Despite the transformative potential of XAI in enhancing the transparency and trustworthiness of AI models for musculoskeletal disease detection, its application remains limited, with only six studies addressing this critical aspect. A notable limitation is the exclusive reliance on Grad-CAM, leaving other advanced techniques, such as SHAP, LIME, or counterfactual explanations, unexplored [82, 95, 113]. These alternative methods could provide deeper insights into model decision-making and foster greater trust in AI systems. While integrating Grad-CAM with pretrained models offers a promising research direction, particularly in bone metastasis detection [32, 75, 105], reliance on a single XAI approach limits opportunities to improve interpretability and clinical adoption. Expanding the application of diverse XAI techniques is essential for developing robust and trusted AI systems.
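For readers unfamiliar with Grad-CAM, its core computation is short: each feature map of a chosen convolutional layer is weighted by the spatial average of the class-score gradient flowing into it, the weighted maps are summed, and negative values are clipped with a ReLU. The sketch below shows that step on plain nested lists, assuming the activations and gradients have already been extracted from a network; the function name and the tiny 2 x 2 inputs are illustrative only.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps into one heatmap, Grad-CAM style.

    activations[k] and gradients[k] are H x W nested lists holding the
    k-th feature map and the gradient of the class score with respect
    to it.
    """
    height = len(activations[0])
    width = len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for a_k, g_k in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        alpha = sum(sum(row) for row in g_k) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * a_k[i][j]
    # ReLU keeps only regions that push the class score upward.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Two made-up 2 x 2 feature maps with their gradients.
acts = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
grads = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
cam = grad_cam(acts, grads)
```

In practice the heatmap is then upsampled to the input resolution and overlaid on the radiograph. Methods such as SHAP or LIME attribute importance in different ways, which is why they can complement rather than duplicate a Grad-CAM overlay.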

5.1.4 Fuzzy Logic

Despite their limited application, fuzzy logic studies have shown significant potential for enhancing orthopaedic disease detection by addressing the inherent uncertainty in medical data [8, 38]. In particular, its integration with other AI methods could enable more precise diagnostic outcomes in complex conditions such as bone tumours and fractures. The limited use of fuzzy logic in orthopaedics suggests the need for more research exploring its potential in other bone conditions, where its ability to handle uncertain information could significantly increase diagnostic accuracy.
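The way fuzzy logic represents uncertainty can be sketched with triangular membership functions, which let a single measurement belong partly to several categories instead of exactly one. The three fuzzy sets and the normalised abnormality score below are hypothetical and are not drawn from the cited studies.

```python
def triangular(x, a, b, c):
    """Membership that rises from a to a peak of 1 at b, then falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets over a normalised abnormality score in [0, 1].
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def fuzzify(score):
    """Degree of membership of a score in each fuzzy set."""
    return {name: triangular(score, *abc) for name, abc in FUZZY_SETS.items()}


memberships = fuzzify(0.6)
```

A score of 0.6 is mostly "medium" and partly "high" rather than being forced into one bin, and a rule base can then combine such degrees with other findings before a final decision is made.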

5.1.5 MultiCriteria Decision-Making (MCDM)

By systematically evaluating multiple criteria, MCDM provides a structured framework that enhances the objectivity and comprehensiveness of decision-making, ultimately leading to more personalized and effective patient care. However, only one study has shown how the integration of benchmarking methods with orthopaedic disease detection can significantly improve shared decision-making [8].
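As an illustration of how an MCDM method could benchmark candidate detection models on several criteria at once, the sketch below implements TOPSIS, one widely used MCDM technique. It is not claimed to be the method used in [8]; the criteria, weights, and model scores are invented for demonstration.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) over criteria (columns) with TOPSIS.

    benefit[j] is True when a higher value of criterion j is better.
    Returns one closeness score in [0, 1] per row; higher is better.
    """
    n_crit = len(weights)
    # Vector-normalise each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col)
             for j, col in enumerate(columns)]
    anti_ideal = [min(col) if benefit[j] else max(col)
                  for j, col in enumerate(columns)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, anti_ideal)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Hypothetical models scored on accuracy, false positive rate, latency (ms).
models = [
    [0.95, 0.05, 10.0],  # model A
    [0.80, 0.20, 50.0],  # model B
    [0.90, 0.10, 20.0],  # model C
]
scores = topsis(models, weights=[0.5, 0.3, 0.2], benefit=[True, False, False])
```

Here model A is best on every criterion and model B worst, so they receive the extreme scores and model C falls in between. The weights encode clinical priorities and would normally be elicited from experts rather than fixed by the developer.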

5.2 Dataset Availability

The purpose of this section is to underscore the importance of medical datasets in the development of AI and its application in the diagnosis of orthopaedic diseases. Data such as CT scans, MRIs, and X-rays serve to improve AI detection models for fracture, tumour, and osteoporosis detection. According to the literature, several key datasets have been widely used as sources of raw data for AI models, enabling them to learn and distinguish fine features in medical images, thereby enhancing diagnostic accuracy and treatment options. One notable dataset is the Osteoarthritis Initiative (OAI), which contains longitudinal imaging data from 4796 subjects [31]. Its standardized imaging protocols and rich temporal data support disease progression modeling and risk prediction [58]. However, a significant limitation of the OAI is its lack of diversity in patient ethnicity and geographical representation, which restricts its applicability to global populations [122]. Among the studies reviewed in this paper, the MURA dataset developed by the Stanford ML Group has emerged as one of the most frequently used datasets for developing and validating AI models in orthopaedic applications [66]. It comprises over 40,000 radiographs across seven musculoskeletal regions and supports general-purpose model training [113]. Despite its extensive coverage, MURA is limited by its binary labels (normal or abnormal), which restrict the nuanced understanding of disease severity [118]. Additionally, it lacks contextual patient data, such as demographics or clinical history, which could improve model personalization [119]. Another frequently used resource is the Knee Osteoarthritis Severity Grading Dataset, which contains 8260 X-ray images graded via the Kellgren-Lawrence scale [32]. This dataset’s large sample size and well-annotated severity labels make it ideal for training models focused on severity classification [123].
However, it faces significant class imbalance, with an underrepresentation of Grade 4 cases, which can bias model predictions toward less severe conditions. To address these limitations, augmentation techniques have been applied to improve data representation for underrepresented categories.
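Besides augmentation, a common complementary remedy for class imbalance is to weight the training loss by inverse class frequency, so that errors on rare grades count for more. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count); the label counts are invented and do not reflect the actual grade distribution of any dataset discussed here.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical label list: six Grade 0 images and two Grade 4 images.
example_labels = [0] * 6 + [4] * 2
class_weights = inverse_frequency_weights(example_labels)
```

In this toy case the rare Grade 4 class receives a weight of 2.0 and the common Grade 0 class about 0.67, so each class contributes equally to the total loss. Weighting changes the training objective but adds no new information, so it does not replace collecting more severe cases.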

According to the literature, the issue of bias in medical datasets is a critical factor affecting the performance and generalizability of AI models [80]. Biases can emerge due to unequal representations of patient demographics, imaging protocols, and disease severity levels [78, 80]. For example, datasets skewed toward specific populations, such as predominantly Caucasian or male participants, can lead to AI models that perform suboptimally in underrepresented groups, such as individuals from diverse ethnic backgrounds or female patients [40]. This bias in representation not only limits the clinical applicability of the models but also poses ethical challenges in ensuring equitable healthcare access. Additionally, imaging biases, including variations in scanner types, imaging settings, and data preprocessing techniques such as manual cropping, further compound the problem [44, 48]. For example, models trained predominantly on high-quality imaging data may struggle to perform effectively on noisy or low-resolution images commonly encountered in resource-limited settings [93]. The lack of external validation datasets to evaluate the performance of AI models across diverse clinical environments exacerbates these limitations. Addressing these biases is essential for developing robust and trustworthy AI systems. Data augmentation and undersampling of the normal class can be utilized alongside AI models to provide a more balanced training set and improved generalizability [98]. Furthermore, transfer learning techniques and synthetic data generation can mitigate class imbalance challenges, but they are not substitutes for inherently diverse and well-curated datasets [89]. Additionally, three-view radiographs can be used to overcome issues related to the absence of relevant features required for the detection process [74]. 
Future studies should prioritize multicenter collaborations to collect balanced datasets reflecting global diversity in demographics, imaging conditions, and disease presentations. Furthermore, integrating fairness metrics into model evaluation pipelines can help quantify and address biases, ensuring that AI models provide reliable outcomes for all patient groups. Consequently, broadening access to such datasets, in both public and private projects, is necessary to extend the reach of orthopaedic AI applications.
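One simple way to integrate a fairness metric into an evaluation pipeline is to compute sensitivity (recall) separately for each patient group and report the largest gap between groups. The sketch below does this for binary labels; the group names and predictions are invented, and the recall gap is only one of several possible fairness measures.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Sensitivity computed separately for each group label."""
    positives = defaultdict(int)
    true_positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            true_positives[group] += int(pred == 1)
    return {g: true_positives[g] / positives[g] for g in positives}


def recall_gap(y_true, y_pred, groups):
    """Largest difference in sensitivity between any two groups."""
    recalls = recall_by_group(y_true, y_pred, groups)
    return max(recalls.values()) - min(recalls.values())


# Made-up predictions for two hypothetical patient groups, "a" and "b".
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
groups = ["a", "a", "b", "b", "a", "b"]
gap = recall_gap(y_true, y_pred, groups)
```

A gap of 0.5, as in this toy example, would mean that the model misses abnormal findings far more often in one group than in the other, a disparity that an aggregate accuracy figure would hide.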

5.2.1 Orthopaedic via X-Rays

X-rays are essential tools in orthopaedic diagnostics because they allow for the evaluation of bone structure, density, alignment, and any trauma or morphological disease, including fractures and osteoporosis [44]. They remain among the simplest and least expensive means of assessment, helping medical practitioners manage disorders of the musculoskeletal system [45, 68]. Table 12 details the X-ray datasets used for disease detection in the reviewed studies. The datasets differ in scale and in whether their source is stated [85]. Public datasets, one containing 40,561 samples [66] and another with 14,863 images [112, 113], are the most desirable because they support the development and testing of AI models in orthopaedics. These large, publicly available databases have contributed to increased research activity and to the development of more generalizable AI diagnostic tools. In contrast, private datasets contain as few as 100 [71] or as many as 10,000 samples [54]. Such datasets are difficult to access and their results difficult to reproduce, because they are usually held within specific research teams or institutes. Despite these constraints, the inclusion of legally collected datasets ensures that ethical standards are maintained, which is crucial for the clinical application and regulatory approval of AI systems.

Table 12 Orthopaedic disease X-ray datasets

5.2.2 Orthopaedic via MRI

In contrast, MRI is recognized for its superior soft tissue contrast resolution, making it an ideal choice when disorders involve the surrounding marrow, cartilage, ligaments, or muscles; it can also visualize early changes that may not yet be visible via X-rays [47, 51]. A range of MRI datasets utilized in the detection of orthopaedic diseases is shown in Table 13, which specifies their sizes and sources. Public datasets, e.g., 2253 patients [41] or 1144 samples [38], provide a solid basis for research into AI models that work regardless of the clinical setting. Although private sets are useful for research, their availability is usually limited, which inhibits their widespread adoption, coordination, and consistent validation across studies [107, 120]. Notably, MRI datasets are generally smaller than X-ray datasets, because MRI is more expensive and used for more specific applications. Some entries in the table also lack clarity about legal collection or dataset size, which can be a limitation for research transparency and ethical standards.

Table 13 Orthopaedic disease MRI datasets

5.2.3 Orthopaedic via CT Scanning

The medical literature has frequently highlighted the benefits of CT scans, which provide detailed cross-sectional images of bone structures and complex anatomical regions, making them critical tools not only for the initial diagnosis of bone conditions but also for guiding surgical planning and assessing postoperative outcomes [97, 102]. Compared with X-rays, which provide two-dimensional images, CT scans provide three-dimensional images, making it easier to evaluate bone fractures and malformations during presurgical assessment [37]. Furthermore, CT scans provide a much better view of bones that are hard to evaluate on MRI scans, although MRI is better at soft tissue imaging [82]. Hence, CT scans become essential, particularly in the assessment and planning of surgeries where it is critical to understand the bony anatomy fully and how it may change after the procedure [99]. The datasets in Table 14 range from smaller private collections (e.g., 14 or 65 patients) [76, 103] to larger public collections (e.g., 2340 samples) [91], reflecting both focused research opportunities and broad-scale model training. Private datasets are beneficial for developing treatments for uncommon bone diseases, but their scale is often insufficient for AI models to generalize. Public datasets, on the other hand, contain a much wider array of information and thus strengthen the validation of AI models in different clinical settings.

Table 14 Orthopaedic disease CT datasets

5.2.4 Orthopaedic via Uncommon Types of Medical Images

Biomedical images, DEXA, and digital radiography play significant roles in enhancing the diagnosis of bone disorders. These imaging techniques add information that goes beyond the more common approaches, including X-rays, MRI, or CT imaging. For example, DEXA scans are particularly useful for accurate measurement of the amount of mineralized bone, which helps in the diagnosis of osteoporosis [40]. Biomedical imaging, especially with the use of nuclear medicine techniques, helps in understanding bone dynamics and pathology [33], whereas digital radiography improves image quality and resolution [104]. However, only two studies have explored the use of DEXA images for detecting bone disease, with 200 samples from patients with metastases of the spine and 615 samples from patients with osteoporosis [110]. Moreover, biomedical images and digital radiography have each been used in only a single study for detecting bone abnormalities. Developing more such datasets for training AI models could lead to substantial advances in diagnosis, especially for bone densitometry and metabolic bone disorders, whose assessment is often limited by the available imaging techniques.

6 Conclusions

The present systematic review aims to minimize potential biases and highlight how computational AI systems enhance diagnostic accuracy, decision-making, and patient care. Additionally, it evaluates the reliability of intelligent systems on the basis of ethical, legal, and technical norms through a critical analysis of prior research. This review thoroughly examines the use of AI models in detecting orthopaedic diseases via a structured and systematic approach. It highlights gaps in current research, focusing on advancements, challenges, and future directions. Despite the efforts of previous studies, many reviews lack systematic evaluation and structured taxonomy, leading to potential biases and unreliable conclusions. They often fail to address the motivations behind integrating intelligent systems in orthopaedics, such as the need for improved diagnostic efficiency. Moreover, the absence of detailed recommendations for future research and practical implementation leaves a gap in guiding the development and deployment of computational AI technologies in orthopaedic healthcare centers. Adhering to trustworthy AI principles is crucial for creating reliable, safe, and ethical healthcare solutions. Our study evaluated the contribution of the 85 reviewed studies in using intelligent detection models for orthopaedic diseases against the components of trustworthy AI, which emphasizes the need for ethical considerations in those models. While some studies met high standards, many fell into the V-L category for several criteria, indicating a need for higher-quality research. The primary goal was often to improve model efficiency rather than address trustworthiness issues. Researchers and practitioners should prioritize developing intelligent detection systems that emphasize explainability, fairness, privacy preservation, causality, and robustness, which are aligned with trustworthy AI principles in orthopaedic healthcare.
The presented review also underscores the need for more research on reproducibility, model interpretability, and human oversight in computational AI detection techniques. Future AI-based orthopaedic disease detection applications may become more advanced and accessible, but AI is not a straightforward solution. The success of intelligent detection systems will depend on advances in computational models by exploring different architectural models and evaluating their performance, the availability of large and diverse datasets by utilizing multicenter studies and strategic collaborations, and the ability of detection tools to handle complex orthopaedic problems.