Abstract
Lack of reliable measures of cutaneous chronic graft-versus-host disease (cGVHD) remains a significant challenge. Non-expert assistance in marking photographs of active disease could aid the development of automated segmentation algorithms, but validated metrics to evaluate training effects are lacking. We studied absolute and relative error of marked body surface area (BSA), redness, and the Dice index as potential metrics of non-expert improvement. Three non-experts underwent an extensive training program led by a board-certified dermatologist to mark cGVHD in photographs. At the end of the 4-month training, the dermatologist confirmed that each trainee had learned to accurately mark cGVHD. The trainees’ inter- and intra-rater intraclass correlation coefficient estimates were “substantial” to “almost perfect” for both BSA and total redness. For fifteen 3D photos of patients with cGVHD, the trainees’ median absolute (relative) BSA error compared to expert marking dropped from 20 cm2 (29%) pre-training to 14 cm2 (24%) post-training. Total redness error decreased from 122 a*·cm2 (26%) to 95 a*·cm2 (21%). By contrast, median Dice index did not reflect improvement (0.76 to 0.75). Both absolute and relative BSA and redness errors similarly and stably reflected improvements from this training program, which the Dice index failed to capture.
Supplementary Information
The online version contains supplementary material available at 10.1007/s10278-022-00730-8.
Keywords: Erythema, Skin imaging, Body surface area, Dice index, Error, Image annotation
Introduction
Chronic graft-versus-host disease (cGVHD) is a common complication of hematopoietic cell transplantation (HCT) and is a major cause of morbidity and mortality [1]. Cutaneous involvement occurs in half of incident cGVHD cases, and erythema is the most common finding [2]. The current standard measure of cGVHD erythema is clinician estimation of body surface area (BSA) and conversion to a 0–3 NIH skin score. Higher skin scores have been associated with poor survival [3], whereas reversal of erythema has been associated with improved survival [4, 5]. However, the lack of reliable measures of the extent of cutaneous cGVHD remains a significant challenge. Mitchell et al. demonstrated a wide range of reliability among clinicians and experts in assessing BSA of erythema involvement [6]. In the final two of four trials, clinician-expert pairs achieved only moderate reliability [7]. Objective, reproducible measures of erythema are needed.
A promising avenue for objective measures of erythema in inflammatory conditions such as GVHD is artificial intelligence (AI) algorithms that segment the affected area of skin [8, 9]. In contrast to classification algorithms applied to the diagnosis of individual neoplasms [10, 11], segmentation algorithms assign each pixel in an image to the disease or non-disease class. 3D photography with standard distance, color, and lighting calibration allows more accurate measurement of BSA and redness than routine 2D photography [12]. However, the development of AI algorithms to automatically analyze such data is limited by the large amount of time needed to delineate erythema in 3D photos. To acquire sufficient numbers of accurate markings (ground truth) for the development of reliable algorithms, training of non-experts is required.
There is currently no standard metric to evaluate the effect of training on accuracy and reliability in marking dermatologic images. On teams like our own, a trainee's annotation skill is typically estimated from the amount of training completed and perhaps an expert's subjective evaluation of that skill. We trained three non-experts to mark cGVHD in photographs, and a board-certified dermatologist confirmed that each trainee had learned to accurately mark active disease. We aimed to assess the ability of different metrics to measure the training effect and to determine the inter- and intra-rater reliability of non-experts after training. A prior case report suggested that markings of erythema in 3D photographs may agree more on redness intensity than on affected BSA [13]. Therefore, we compared the ability of BSA, redness, and the traditional computer vision metric of the Dice index to capture trainee improvement after a training program.
Materials and Methods
Patients
To train non-experts, we acquired fifteen 3D photos from eight patients with cutaneous cGVHD (Table S1). All patients provided informed consent for an IRB-approved imaging study. The photo set was captured using a Canfield Vectra H1 3D camera and included six body sites: chest, back, neck, axilla, and upper and lower extremities.
Training of Non-Experts
Three non-physician trainees (medical student KP, post-doc XL, and PhD student TR) underwent a thorough training program led by ERT, a board-certified dermatologist with more than 5 years of focus on cGVHD (the expert).
Trainees marked the same set of fifteen photos on three occasions: (1) before live training, (2) after 4 months of live training by the expert, and (3) after an additional 3-month washout period with no training. The expert marked the same fifteen photos on two occasions more than 3 months apart. This resulted in a total of 11 markings (9 trainee and 2 expert annotations) for each photo. All markings were completed on individual laptop or desktop computer monitors, which were not calibrated to each other. Trainees and the expert confirmed that they could not remember the individual photos or the anticipated correct markings from one session to the next, indicating that any learning effects reflected general cGVHD knowledge rather than memory of individual photographs.
Before the first annotation assignment, each trainee completed extensive background reading, which included 7 articles and a cGVHD textbook [3, 5, 14–19]. Before the next annotation assignment, trainees completed the following training program. First, trainees reviewed a 70-min photo-annotation consensus call among nine expert cGVHD clinicians, including six dermatologists and three HCT physicians. Second, trainees studied the resulting expert guideline for marking active cGVHD in photos. Third, trainees underwent live training, consisting of four 60-min interactive sessions in which the expert demonstrated, on volunteer patients (not photographed), active cGVHD that should be marked and skin changes unrelated to cGVHD that should not be marked. Finally, trainees completed twelve in-person photography teaching sessions. In the first session, trainees marked photos as a group and then reviewed the results with the expert; in the remaining eleven sessions, trainees marked photos independently before reviewing their results with the expert. All trainees finished the training. At the end of the training, the expert confirmed, both in the clinic exam room and through direct observation of computer-based marking, that each trainee had learned to accurately recognize and mark cGVHD.
Five Potential Metrics to Evaluate Training of Non-Experts
Markings were evaluated in both physical and color spaces. Physical space was measured as BSA (cm2). Color was measured in the Commission Internationale de l'Éclairage L*a*b* (CIELAB) color space, where positive a* values represent shades of red and negative a* values represent shades of green [20]. The Vectra software provided the BSA and a* value of each marked area. Total redness (a*·cm2) was calculated as the product of these two values [13]. As each of the fifteen photos had eleven markings, each parameter was obtained for 165 unique markings in total.
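As a concrete illustration, the sketch below computes total redness from a marking's BSA and its a* value. It is a minimal sketch only: it assumes the imaging software reports a single (mean) a* value per marked area, and the function and variable names are hypothetical rather than the actual Vectra interface.

```python
# Minimal sketch of the total redness calculation, assuming the imaging
# software reports, for each marked area, its BSA in cm^2 and a single
# (mean) CIELAB a* value. Names are illustrative, not the actual Vectra API.

def total_redness(bsa_cm2: float, a_star: float) -> float:
    """Total redness (a*.cm^2) = a* of the marked area x its BSA."""
    return a_star * bsa_cm2

# Example: a 40 cm^2 marking with a* = 15 yields a total redness of 600 a*.cm^2.
print(total_redness(40.0, 15.0))  # 600.0
```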
We defined five potential metrics to capture the effect of live training as follows. First, we calculated the absolute and relative error of each trainee's BSA measurement compared to the BSA measured from the expert annotations (ground truth). We then repeated these two calculations for total redness. Finally, we calculated the Dice index between each trainee's annotation and the ground truth, a generally accepted measure of image segmentation accuracy in computer vision [21]:
Absolute error $= |V_{\mathrm{trainee}} - V_{\mathrm{expert}}|$

Relative error $= \frac{|V_{\mathrm{trainee}} - V_{\mathrm{expert}}|}{V_{\mathrm{expert}}} \times 100\%$

Dice index $= \frac{2\,|X \cap Y|}{|X| + |Y|}$

where $V$ denotes the measured value (BSA or total redness), $|X|$ is the number of pixels assigned to the expert's marking, $|Y|$ is the number of pixels assigned to the trainee's marking, and $|X \cap Y|$ is the number of pixels that both selected.
When possible, we used the average of the expert's (ERT's) two markings as ground truth. Because there is no straightforward way to obtain an average expert annotation for Dice calculations, we used ERT's second marking session as ground truth when calculating Dice indices between the expert and a trainee.
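The sketch below illustrates how these metrics could be computed from pixel masks. It is a sketch under stated assumptions, not the study's actual pipeline: each marking is taken to be a boolean mask over the photo's pixels, per-pixel surface area (cm2) and a* values are assumed available, and all names and the toy data are hypothetical.

```python
import numpy as np

def measurements(mask, pixel_area_cm2, a_star):
    """Return (BSA, total redness) for one marking given per-pixel data."""
    bsa = float(np.sum(pixel_area_cm2[mask]))
    redness = float(np.sum(a_star[mask] * pixel_area_cm2[mask]))
    return bsa, redness

def absolute_error(trainee_value, expert_value):
    return abs(trainee_value - expert_value)

def relative_error(trainee_value, expert_value):
    """Relative error in percent of the expert (ground truth) value."""
    return abs(trainee_value - expert_value) / expert_value * 100.0

def dice(expert_mask, trainee_mask):
    """Dice = 2|X n Y| / (|X| + |Y|), computed on binary pixel masks."""
    intersection = np.logical_and(expert_mask, trainee_mask).sum()
    return 2.0 * intersection / (expert_mask.sum() + trainee_mask.sum())

# Toy 2x2 "photo": 1 cm^2 pixels with a uniform a* of 12 (synthetic example).
pixel_area = np.ones((2, 2))
a_star = np.full((2, 2), 12.0)
expert = np.array([[True, True], [False, False]])
trainee = np.array([[True, False], [False, False]])

bsa_e, red_e = measurements(expert, pixel_area, a_star)   # 2.0 cm^2, 24.0 a*.cm^2
bsa_t, red_t = measurements(trainee, pixel_area, a_star)  # 1.0 cm^2, 12.0 a*.cm^2
print(absolute_error(bsa_t, bsa_e))   # 1.0 (cm^2)
print(relative_error(bsa_t, bsa_e))   # 50.0 (%)
print(dice(expert, trainee))          # ~0.67
```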
To measure reliability, we calculated the concurrent inter- and intra-rater intraclass correlation coefficients (ICCs) of trainees after live training, using Eliasziw's simultaneous random-effects, absolute-agreement, single-measure model [22]. One-sided 95% lower confidence bounds were calculated with the relInterIntra function of the irr package in R. Because Bland–Altman analysis showed heteroscedasticity for some of the metrics, values were log-transformed prior to the ICC calculations [23].
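For readers without access to the R routine, the sketch below shows the general idea on simulated data: log-transform the ratings and compute a two-way random-effects, absolute-agreement, single-measure ICC (Shrout–Fleiss ICC(2,1)). This is a simplification for illustration, not the Eliasziw concurrent inter/intra-rater model used in the study, and the data are synthetic.

```python
import numpy as np

def icc_2_1(x):
    """Two-way random-effects, absolute-agreement, single-measure ICC
    (Shrout & Fleiss ICC(2,1)); x has shape (n targets, k raters)."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between photos
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Simulated post-training BSA (cm^2) for 15 photos x 3 trainees: a "true" BSA
# per photo with multiplicative rater noise, which is what motivates the
# log transform used to alleviate heteroscedasticity.
rng = np.random.default_rng(0)
true_bsa = rng.uniform(5, 80, size=(15, 1))
bsa = true_bsa * rng.lognormal(mean=0.0, sigma=0.15, size=(15, 3))
print(round(icc_2_1(np.log(bsa)), 2))  # inter-rater ICC on the log scale
```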
Results
To reflect the training of non-experts in marking active cutaneous cGVHD in photos, we report five potential metrics: absolute and relative BSA error, absolute and relative redness error, and the Dice index. After the training, the expert (ERT) was confident that each of the trainees had improved and was accurately marking cGVHD in photos. Compared to markings before the training, trainees' markings after the training were more similar to those of the expert (example markings in one of the fifteen 3D photos in Fig. 1). To summarize the overall training effect, the median of the 45 error values at each time point (one error for each of the three trainees for each of the fifteen images) is reported in Table 1, and reliability estimates are reported in Table 2.
Table 1. Trainee marking error compared to expert ground truth; values are median (IQR)

| Metric | Measure | Pre-training | Post-training day 1 | Post-training day 114 |
|---|---|---|---|---|
| Absolute error | BSA [cm2] | 20 (9–55) | 14 (4–49) | 12 (5–41) |
| Absolute error | Total redness [a*·cm2] | 122 (48–505) | 95 (30–200) | 103 (52–169) |
| Relative error | BSA | 29% (12–61) | 24% (10–47) | 25% (9–53) |
| Relative error | Total redness | 26% (11–55) | 21% (8–41) | 20% (9–41) |
| Dice index | | 0.76 (0.52–0.84) | 0.75 (0.68–0.84) | 0.80 (0.65–0.84) |
Table 2. Inter- and intra-rater reliability after training; values are ICC (one-sided 95% lower confidence bound)

| | Trainees only: inter-rater | Trainees only: intra-rater | Trainees + expert: inter-rater | Trainees + expert: intra-rater |
|---|---|---|---|---|
| BSA | 0.86 (0.73) | 0.93 (0.82) | 0.84 (0.72) | 0.93 (0.83) |
| Total redness | 0.81 (0.68) | 0.90 (0.76) | 0.79 (0.66) | 0.91 (0.77) |
Following live training, median absolute BSA error improved from 20 to 14 cm2, and median absolute total redness error improved from 122 to 95 a*·cm2. After training, BSA relative error (24%) was higher than total redness relative error (21%); in fact, post-training BSA relative error was close to the pre-training total redness relative error (26%). Both relative errors improved comparably (approximately a 5 percentage point decrease) with the training program. Compared to immediate post-training errors, no significant differences in absolute or relative error were observed after the additional 3-month washout period (final two columns of Table 1).
We did not find a significant difference between BSA and total redness inter-rater ICCs (BSA: 0.86, one-sided 95% confidence lower bound 0.73; total redness: 0.81, lower bound 0.68) or between BSA and total redness intra-rater ICCs (BSA: 0.93, lower bound 0.82; total redness: 0.90, lower bound 0.76).
Discussion
Of the five metrics evaluated, absolute and relative BSA and redness (a*) errors best reflected the training of non-experts in marking active cutaneous cGVHD in 3D photographs. Extensive training significantly improved the expert dermatologist's confidence in trainees' markings. The ability to acquire reliable non-expert markings improves the prospects for developing reliable artificial intelligence algorithms.
In both the physical space of BSA (cm2) and the color space of total redness (a*·cm2), we found that the absolute and relative error of trainees relative to an expert captured the expected training effects. In contrast to surface area, which has intuitive and familiar physical dimensions (e.g., a 1 cm2 lesion), color is more typically described in subjective shades (e.g., pink, salmon-colored, violaceous). The use of relative error places both surface area and color on a common normalized scale, allowing comparison across different measurements such as BSA and total redness. Unlike absolute and relative error, the Dice index did not capture improvement and is likely not a valuable metric for measuring training effect when marking inflammatory conditions.
Progress in algorithm development to detect abnormal skin has been driven overwhelmingly by neoplastic conditions. In 2017, the International Skin Imaging Collaboration held its second challenge to encourage the development of algorithms for melanoma detection. In the segmentation portion of this challenge, participating algorithms predicted which pixels were affected by melanoma in dermoscopic images. Compared to expert-marked ground truth, the top-ranked algorithm achieved an average Dice of 0.85 [24, 25]. By contrast, our post-training day 1 median Dice for marking the inflammatory condition of cGVHD was 0.75 (IQR: 0.68–0.84). Despite the focus on neoplastic conditions, non-specialists more commonly see inflammatory conditions [26]. It is therefore important to consider the limitations of the Dice metric when extending the neoplasm-dominated field of dermatologic machine vision to inflammatory skin conditions.
In clinical assessment of the inflammatory condition of cGVHD, Mitchell et al. reported a wide range of inter-rater agreement in the assessed BSA of erythema (median inter-rater ICC range: 0.07–0.88) [6]. They interpreted these values by the Landis and Koch criteria, which classify 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement [7]. In the final two of four trials, cGVHD-experienced clinicians achieved "moderate" inter-rater reliability (median ICCs: 0.47 and 0.57). In comparison, our point estimates for inter-rater ICCs were "almost perfect" for both BSA (0.86) and total redness (0.81). Our study likely achieved higher ICCs because our trainees received months of training, compared with the 2.5-h training session in Mitchell's study. Often in engineering and quantitative sciences, however, it is not the point estimate of reliability that matters but rather confidence that some minimum threshold has been reached. Accordingly, we note that the lower bounds of the one-sided 95% confidence intervals of the inter-rater ICCs were "substantial" for both BSA (0.73) and total redness (0.68).
As expected, we found a trend for higher intra-rater than inter-rater reliability in our ICCs for both BSA and total redness. If the ground truth for algorithm development relies on the markings of a single rater, good intra-rater reliability is adequate, and the lower inter-rater ICC need not be considered. However, in a situation where one expert might train multiple individuals to divide an image set to complete the annotation task, the inter-rater reliability becomes critical.
A significant caveat is that although trainee reliability was similar for BSA and total redness, this does not imply similar clinical utility, which the current study does not evaluate. Our prior publication suggests that, compared to the gold standard of BSA, total redness is much more sensitive to a patient's clinical change over time [13]. We also emphasize that trainee relative errors were lower for total redness than for BSA. In fact, even without instruction, the distribution of the trainees' total redness relative error was similar to the post-training distribution of the BSA relative error (median ~25%).
Our study is limited by the sample size of 15 three-dimensional images from 8 patients and by the lack of calibration of computer monitors. Without calibration, the color displayed on each team member's monitor could vary slightly. In addition, because of the trainees' interest and background reading, their pre-training markings were likely much better than those of the general population. Limitations aside, our trainees were highly reliable after training (as measured by both inter- and intra-rater ICCs). We found that absolute and relative error, with respect to BSA and redness, are better than the Dice index at capturing trainee improvement in marking cGVHD erythema. These metrics should be considered in the development of dermatologic machine vision for inflammatory skin conditions.
The metrics of non-expert training tested here will likely apply to a broad range of inflammatory dermatologic diseases in which assessment of affected BSA is important, such as psoriasis and atopic dermatitis. However, each condition should have a specific training program that adopts the most appropriate methods. For example, compared to cGVHD, a photo-marking training program for a common condition like psoriasis would likely involve less computer- and literature-based training and more in-person clinical teaching.
Conclusion
We explored potential metrics to evaluate non-expert improvement in marking skin affected by cutaneous cGVHD after 4 months of extensive dermatologist-led training. Absolute and relative BSA errors, along with absolute and relative redness errors, similarly and stably reflected improvements, which the Dice index failed to capture.
Author Contribution
Eric R. Tkaczyk conceptualized and designed the study. Data collection and analysis were performed by Kelsey Parks, Xiaoqi Liu, Tahsin Reasat, Zain Khera, and Laura X. Baker. Statistical analyses were done by Heidi Chen. The manuscript was written and revised by Kelsey Parks, Inga Saknite, and Eric R. Tkaczyk. All authors read and approved the final manuscript.
Funding
This work was supported by Career Development Award Number IK2 CX001785 from the United States Department of Veterans Affairs Clinical Sciences R&D (CSRD) Service to ERT, the National Institutes of Health Grants K12 CA090625 and R21 AR074589, and the European Regional Development Fund (1.1.1.2/VIAA/4/20/665) to IS.
Declarations
Ethics Approval
This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the Vanderbilt Institutional Review Board (Date: 4/22/2022 / #170456).
Consent to Participate
Informed consent was obtained from all individual participants included in the study.
Conflict of Interest
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Lee SJ, Klein JP, Barrett AJ, Ringden O, Antin JH, Cahn JY, Carabasi MH, Gale RP, Giralt S, Hale GA, Ilhan O, McCarthy PL, Socie G, Verdonck LF, Weisdorf DJ, Horowitz MM. Severity of chronic graft-versus-host disease: Association with treatment-related mortality and relapse. Blood. 2002;100(2):406–414. doi: 10.1182/blood.V100.2.406.
2. Gandelman JS, Zic J, Dewan AK, Lee SJ, Flowers M, Cutler C, Pidala J, Chen H, Jagasia MH, Tkaczyk ER. The Anatomic Distribution of Skin Involvement in Patients with Incident Chronic Graft-versus-Host Disease. Biol Blood Marrow Transplant. 2019;25(2):279–286. doi: 10.1016/j.bbmt.2018.09.007.
3. Jacobsohn DA, Kurland BF, Pidala J, Inamoto Y, Chai X, Palmer JM, Arai S, Arora M, Jagasia M, Cutler C, Weisdorf D, Martin PJ, Pavletic SZ, Vogelsang G, Lee SJ, Flowers MED. Correlation between NIH composite skin score, patient-reported skin score, and outcome: Results from the chronic GVHD Consortium. Blood. 2012;120(13):2545–2552.
4. Baker LX, Byrne M, Martin PJ, Lee S, Chen H, Jagasia M, Tkaczyk E. Association of skin response in erythema and sclerosis with survival in chronic graft-versus-host disease [abstract]. J Invest Dermatol. 2020;140(7):S57. doi: 10.1016/j.jid.2020.03.442.
5. Curtis LM, Grkovic L, Mitchell SA, Steinberg SM, Cowen EW, Datiles MB, Mays J, Bassim C, Joe G, Comis LE, Berger A, Avila D, Taylor T, Pulanic D, Cole K, Baruffaldi J, Fowler DH, Gress RE, Pavletic SZ. NIH response criteria measures are associated with important parameters of disease severity in patients with chronic GVHD. Bone Marrow Transplant. 2014;49(12):1513–1520.
6. Mitchell SA, Jacobsohn D, Thormann Powers KE, Carpenter PA, Flowers MED, Cowen EW, Schubert M, Turner ML, Lee SJ, Martin P, Bishop MR, Baird K, Bolaños-Meade J, Boyd K, Fall-Dickson JM, Gerber LH, Guadagnini JP, Imanguli M, Krumlauf MC, Lawley L, Li L, Reeve BB, Clayton JA, Vogelsang GB, Pavletic SZ. A multicenter pilot evaluation of the National Institutes of Health chronic graft-versus-host disease (cGVHD) therapeutic response measures: Feasibility, interrater reliability, and minimum detectable change. Biol Blood Marrow Transplant. 2011;17(11):1619–1629. doi: 10.1016/j.bbmt.2011.04.002.
7. Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics. 1977;33(1):159. doi: 10.2307/2529310.
8. Wang J, Chen F, Dellalana LE, Jagasia MH, Dawant BM, Tkaczyk ER. Segmentation of skin lesions in chronic graft versus host disease photographs with fully convolutional networks. In: Medical Imaging 2018: Computer-Aided Diagnosis. Proc SPIE 10575; 2018. p. 105750N-1–105750N-7.
9. Liu X, Parks K, Saknite I, Reasat T, Cronin AD, Wheless LE, Dawant BM, Tkaczyk ER. Baseline Photos and Confident Annotation Improve Automated Detection of Cutaneous Graft-Versus-Host Disease. Clin Hematol Int. 2021;3(3):108–115. doi: 10.2991/chi.k.210704.001.
10. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–118. doi: 10.1038/nature21056.
11. Tschandl P, Codella N, Akay BN, Argenziano G, Braun RP, Cabo H, Gutman D, Halpern A, Helba B, Hofmann-Wellenhof R, Lallas A, Lapins J, Longo C, Malvehy J, Marchetti MA, Marghoob A, Menzies S, Oakley A, Paoli J, Puig S, Rinner C, Rosendahl C, Scope A, Sinz C, Soyer HP, Thomas L, Zalaudek I, Kittler H. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study. Lancet Oncol. 2019;20(7):938–947. doi: 10.1016/S1470-2045(19)30333-X.
12. McNeil A, Parks K, Liu X, Saknite I, Chen F, Reasat T, Cronin A, Wheless L, Dawant BM, Tkaczyk ER. Artificial intelligence recognition of cutaneous chronic graft-versus-host disease by a deep learning neural network. Br J Haematol. 2022. doi: 10.1111/bjh.18141.
13. Tkaczyk ER, Chen F, Wang J, Gandelman JS, Saknite I, Dellalana LE, Jagasia MH, Dawant BM. Overcoming human disagreement assessing erythematous lesion severity on 3D photos of chronic graft-versus-host disease. Bone Marrow Transplant. 2018;53:1356–1358.
14. Cotliar JA. Clinical Presentation of Acute Cutaneous Graft-Versus-Host Disease. In: Atlas of Graft-versus-Host Disease. Springer; 2017. p. 21–28.
15. Carpenter PA. How I conduct a comprehensive chronic graft-versus-host disease assessment. Blood. 2011;118(10):2679–2687. doi: 10.1182/blood-2011-04-314815.
16. Lee SJ, Cook EF, Soiffer R, Antin JH. Development and validation of a scale to measure symptoms of chronic graft-versus-host disease. Biol Blood Marrow Transplant. 2002;8(8):444–452. doi: 10.1053/bbmt.2002.v8.pm12234170.
17. Lee SJ, Wolff D, Kitko C, Koreth J, Inamoto Y, Jagasia M, Pidala J, Olivieri A, Martin PJ, Przepiorka D, Pusic I, Dignan F, Mitchell SA, Lawitschka A, Jacobsohn D, Hall AM, Flowers MED, Schultz KR, Vogelsang G, Pavletic S. Measuring Therapeutic Response in Chronic Graft-versus-Host Disease. National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: IV. The 2014 Response Criteria Working Group Report. Biol Blood Marrow Transplant. 2015;21(6):984–999.
18. Jagasia MH, Greinix HT, Arora M, Williams KM, Wolff D, Cowen EW, Palmer J, Weisdorf D, Treister NS, Cheng GS, Kerr H, Stratton P, Duarte RF, McDonald GB, Inamoto Y, Vigorito A, Arai S, Datiles MB, Jacobsohn D, Heller T, Kitko CL, Mitchell SA, Martin PJ, Shulman H, Wu RS, Cutler CS, Vogelsang GB, Lee SJ, Pavletic SZ, Flowers MED. National Institutes of Health Consensus Development Project on Criteria for Clinical Trials in Chronic Graft-versus-Host Disease: I. The 2014 Diagnosis and Staging Working Group Report. Biol Blood Marrow Transplant. 2015;21(3):389–401.e1. doi: 10.1016/j.bbmt.2014.12.001.
19. Hymes SR, Alousi AM, Cowen EW. Graft-versus-host disease: Part I. Pathogenesis and clinical manifestations of graft-versus-host disease. J Am Acad Dermatol. 2012;66(4):515.e1–515.e18.
20. Matias AR, Ferreira M, Costa P, Neto P. Skin colour, skin redness and melanin biometric measurements: comparison study between Antera 3D, Mexameter and Colorimeter. Skin Res Technol. 2015;21(3):346–362. doi: 10.1111/srt.12199.
21. Zijdenbos AP, Dawant BM, Margolin RA, Palmer AC. Morphometric Analysis of White Matter Lesions in MR Images: Method and Validation. IEEE Trans Med Imaging. 1994;13(4):716–724. doi: 10.1109/42.363096.
22. Eliasziw M, Young SL, Woodbury MG, Fryday-Field K. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys Ther. 1994;74(8):777–788.
23. Bland JM, Altman DG. Transformations, means, and confidence intervals. BMJ. 1996;312(7038):1079. doi: 10.1136/bmj.312.7038.1079.
24. Codella NCF, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, Kalloo A, Liopyris K, Mishra N, Kittler H, Halpern A. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). Proc Int Symp Biomed Imaging. 2018:168–172.
25. Yuan Y, Chao M, Lo Y-C. Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance. IEEE Trans Med Imaging. 2017;36(9):1876–1886. doi: 10.1109/TMI.2017.2695227.
26. Wilmer EN, Gustafson CJ, Davis SA, Feldman SR, Huang WW. Most common dermatologic conditions encountered by dermatologists and nondermatologists. Cutis. 2014;94(6).