Abstract
The HathiTrust Digital Library (HTDL) is one of the largest digital libraries in the world, containing seventeen million volumes from the collections of major academic and research libraries. In this paper, we discuss the HTDL’s potential for musicology research by providing a bibliometric analysis of the collection as a whole, and of the music materials in particular. A series of case studies illustrates the kinds of musicological research that may be conducted using the HTDL. We highlight several opportunities for improvement and discuss promising future directions for new knowledge creation through the processing and analysis of large amounts of retrospective data. The HTDL presents significant new opportunities for the study of music that will continue to expand as data, metadata and collection enhancements are introduced.
Notes
For example, the Bach digital database portal, designed to provide Bach researchers with “solid information on the works of Johann Sebastian Bach and other composers from the Bach family and their whereabouts,” http://www.bach-digital.de/content/infos.xml.
For example, Alexander Street’s Classical Scores Library, http://alexanderstreet.com/products/classical-scores-library-package.
For example, a virtual library of nineteenth-century California sheet music (from between the years 1852 and 1900), http://people.ischool.berkeley.edu/~mkduggan/neh.html.
See “What are the different copyright statuses of items in HathiTrust, and what do they mean?” https://www.hathitrust.org/help_copyright#RightsCodes.
We first converted the records we obtained from the HathiTrust into MARCXML (see: http://www.loc.gov/standards/marcxml/) using a Perl module (see: http://search.cpan.org/~gmcharlt/MARC-File-MiJ/lib/MARC/File/MiJ.pm). Then we used a Library of Congress XSLT stylesheet to convert the records to the Metadata Object Description Schema (MODS) format. We enhanced the stylesheet locally so that it consolidates information encoded in multiple MARC data and control fields, which reduces data loss and retains more detail about the conceptual characterizations of the items. Finally, a locally developed XSLT stylesheet was used to transform the records into Structured Query Language (SQL) insert statements for populating the customized MODS database tables.
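The final step of this pipeline, turning a MODS record into database rows, can be illustrated with a minimal sketch. It is not the XSLT-based implementation described above: it uses Python's standard library instead, and the table name `mods_item`, its columns, and the sample record are all hypothetical, chosen only to show the shape of the transformation.

```python
import sqlite3
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record; the values are placeholders, not HTDL metadata.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example score</title></titleInfo>
  <originInfo><dateIssued>18uu</dateIssued></originInfo>
  <language><languageTerm type="code">ger</languageTerm></language>
  <classification authority="lcc">M452</classification>
</mods>"""


def text_of(root, path):
    """Return the stripped text of the first element matching path, or None."""
    node = root.find(path, MODS_NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def mods_to_row(xml_text):
    """Flatten one MODS record into a tuple for the hypothetical mods_item table."""
    root = ET.fromstring(xml_text)
    return (
        text_of(root, "mods:titleInfo/mods:title"),
        text_of(root, "mods:originInfo/mods:dateIssued"),
        text_of(root, "mods:language/mods:languageTerm"),
        text_of(root, "mods:classification[@authority='lcc']"),
    )


def load_rows(rows):
    """Create an in-memory database and insert the rows with bound parameters."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mods_item "
        "(title TEXT, date_issued TEXT, language TEXT, lcc TEXT)"
    )
    conn.executemany("INSERT INTO mods_item VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    connection = load_rows([mods_to_row(SAMPLE_MODS)])
    print(connection.execute("SELECT * FROM mods_item").fetchall())
```

Binding values as parameters, rather than pasting them into SQL text, avoids quoting problems with titles that contain apostrophes; a pipeline that emits literal insert statements would need to escape such values itself.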
Of the 14.6 million item records that we examined, 6,672,311 (46%) had Library of Congress Classification numbers and 2,461,361 (16.8%) had Dewey Decimal Classification numbers. Of these, 2,165,140 had both, leaving only 296,221 (2% of the total items) that had a Dewey Decimal number but no Library of Congress Classification number. Of the volumes with a recorded classification authority (approximately 50% of the total records), only a very small number had any authority other than Library of Congress or Dewey Decimal. Classification authorities could not be ascertained for the remaining volumes, primarily because the HTDL does not retain the local call numbers that could help in determining classification information. The local call number for a volume can nevertheless be retrieved if needed, via the holding record for the volume, using the contributing library’s local system number, which the HTDL does store.
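The counts in this note can be cross-checked with a few lines of arithmetic. The sketch below assumes a total of exactly 14,600,000 records, which is only the rounded figure given in the text, so the computed percentages may differ from the reported ones in the last digit; the variable names are illustrative.

```python
# Rounded total from the note; the exact record count is not stated.
TOTAL_RECORDS = 14_600_000

lcc = 6_672_311    # records with a Library of Congress Classification number
ddc = 2_461_361    # records with a Dewey Decimal Classification number
both = 2_165_140   # records with both

# Inclusion-exclusion: records carrying only one of the two schemes.
ddc_only = ddc - both
lcc_only = lcc - both
either = lcc + ddc - both


def share(count, total=TOTAL_RECORDS):
    """Percentage of the (rounded) total, to one decimal place."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for label, count in [
        ("LCC", lcc),
        ("DDC", ddc),
        ("DDC only", ddc_only),
        ("LCC or DDC", either),
    ]:
        print(f"{label}: {count:,} ({share(count)}% of the rounded total)")
```

The Dewey-only figure of 296,221 follows exactly from the two reported counts, and the union of the two schemes comes to 6,968,532 records, which is in line with the statement that roughly half of the records carry a recorded classification authority.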
An important source of inconsistency is the significant variation in the format of the date field across records, which originates in the different local standards that contributing institutions use for their own bibliographic records. For example, some bibliographic records use wildcard characters in the date field, and the wildcard characters themselves differ from record to record. Ranges of years often appear, and their formats also vary (for example, “1904–1924” or “between 1920 and 1950”). Some records use the character ‘u’, for ‘unknown’, in place of a digit, as in ‘18uu’ to denote a year that is not precisely known but falls in the nineteenth century, while other records use a dash (for example, ‘198–’).
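One way to reconcile such variants is to map every date string to the earliest and latest year it could denote. The following is a minimal sketch covering only the formats quoted in this note (explicit ranges, ‘u’ placeholders, and trailing dashes); the function name `year_range` is hypothetical, and real records would need many more cases than these.

```python
import re

# Explicit ranges such as "1904-1924", "1904–1924", "between 1920 and 1950".
RANGE_RE = re.compile(r"(?:between\s+)?(\d{4})\s*(?:-|–|and)\s*(\d{4})")

# A year whose trailing digits are replaced by 'u' or a dash, e.g. "18uu", "198–".
PARTIAL_RE = re.compile(r"(\d{1,4})([u\-–]{0,3})")


def year_range(raw):
    """Return (earliest, latest) possible year for a date string, or None."""
    text = raw.strip().lower()

    match = RANGE_RE.fullmatch(text)
    if match:
        return int(match.group(1)), int(match.group(2))

    match = PARTIAL_RE.fullmatch(text)
    if match and len(match.group(1)) + len(match.group(2)) == 4:
        digits, unknown = match.group(1), len(match.group(2))
        # Unknown positions span 0..9, so pad with zeros and nines.
        return int(digits + "0" * unknown), int(digits + "9" * unknown)

    # Anything else is left unresolved rather than guessed at.
    return None


if __name__ == "__main__":
    for sample in ["18uu", "198–", "1904–1924", "between 1920 and 1950", "1895"]:
        print(sample, "->", year_range(sample))
```

Returning an interval rather than a single year keeps the uncertainty visible, so that a later analysis can decide whether to drop imprecise records or to count them in every decade they might belong to.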
See https://www.hathitrust.org/visualizations_dates for up-to-date chronological information.
The footprints of languages in the overall HTDL collection are shown in Fig. 3.
Consistent with the relative decline of German as a language of international scholarship starting in the 1920s, the proportion of German-language material in the music class in the HTDL collection declines from about 61% for texts that are considered to be in the public domain for researchers in the USA (which are mostly pre-1923 publications) to about 30% for texts that are in copyright (which are mostly post-1923).
Each distinct Library of Congress Subject Heading in the HathiTrust metadata explored in this paper is counted separately, so that “Piano Music,” “Vocal Music,” etc. are categories distinct from the category “Music.”
Tune books constituted a genre of early American music publication that had a pedagogical aim, and they usually contained an instructional preface. That so few tune books are classified as such by subject in the HTDL is probably due to their having often been cataloged by libraries as ‘hymns.’
‘Music’ and ‘Musical Scores’ are overlapping categories for a small number of items in the HTDL, which is an artifact of how these metadata are generated from the MARC bibliographic records.
Important early histories and reference works related to music and contained in the HTDL collection include Jean-Jacques Rousseau’s Dictionnaire de Musique (1768), Sir John Hawkins’s General History of the Science and Practice of Music (1776), and Charles Burney’s A General History of Music: From the Earliest Ages to the Present Period (1789).
This was the focus of a collaborative project carried out under the auspices of the HTRC in 2015; details can be found at: https://www.hathitrust.org/htrc_acs_awards_spring2015.
For simple examples of how comparison and contrast between two corpora created from the HTDL collection can be performed by using the algorithmic tools provided by the HathiTrust Research Center, see: ‘Workset Builder and Portal of the HathiTrust Research Center’. HathiTrust Research Center UnCamp. Ann Arbor, Michigan. 30–31 March 2015, http://bit.ly/1NF7QLi.
Sag [28] notes: ‘The HathiTrust aims to develop and facilitate the development of data mining and analysis of its digital collection. This activity would have qualified as “non-consumptive research” under the now defunct Amended Settlement Agreement [ASA]. “Non-consumptive research” as defined in the ASA is a form of non-expressive use...’
References
Beers, S., Parker, B.: HathiTrust and the challenge of digital audio. IASA J. 36, 38–46 (2011)
Biggers, K., Audenaert, N., Houston, N.M.: VisualPage: Workset creation through image analysis of document pages. Tech. rep., University of Illinois at Urbana-Champaign. http://hdl.handle.net/2142/79020 (2015)
Brahms, J.: Sonata, Op. 120, No. 1, F Moll = Fa Mineur = F Minor: Clarinetta & Piano: Viola & Piano. N. Simrock. https://hdl.handle.net/2027/mdp.39015070677896 (1895)
Brylawski, S., Lerman, M., Pike, R., Smith, K.: ARSC guide to audio preservation. Tech. Rep. 164, Council on Library and Information Resources. https://www.clir.org/pubs/reports/pub164/ (2015)
Burton-West, T.: Challenges for HathiTrust full-text search. https://www.hathitrust.org/blogs/large-scale-search/challenges (2018)
Centivany, A.: The Dark History of HathiTrust. In: Proceedings of the 50th Hawaii International Conference on System Sciences, Scholarship@Western, pp. 1–10. https://ir.lib.uwo.ca/fimspub/120 (2017)
Christenson, H.: HathiTrust: a research library at web scale. Library Resour. Tech. Serv. 55(2), 93–102 (2011). https://doi.org/10.5860/lrts.55n2.93
Dougan, K.: Music to our eyes: Google Books, Google Scholar, and the Open Content Alliance. portal: Libraries Acad. 10(1), 75–93 (2010). https://doi.org/10.1353/pla.0.0088
Downie, J.S., Furlough, M., McDonald, R.H., Namachchivaya, B., Plale, B.A., Unsworth, J.: The HathiTrust Research Center: exploring the full-text frontier. EDUCAUSE Rev. 51(3). http://er.educause.edu/articles/2016/5/the-hathitrust-research-center-exploring-the-full-text-frontier (2016)
Duffy, E.P.: Searching HathiTrust: old concepts in a new context. Partnersh. Can. J. Library Inf. Pract. Res. 8(1). https://doi.org/10.21083/partnership.v8i1.2503 (2013)
Dvořák, A.: Quartett, Op. 34. N. Simrock?. https://hdl.handle.net/2027/uc1.31175005524569 (1890)
Fujinaga, I., Hankinson, A., Cumming, J.E.: Introduction to SIMSSA (single interface for music score searching and analysis). In: Proceedings of the 1st International Workshop on Digital Libraries for Musicology, pp. 1–3. ACM. https://doi.org/10.1145/2660168.2660184 (2014)
Hagedorn, K., Kargela, M., Noh, Y., Newman, D.: A new way to find: testing the use of clustering topics in digital libraries. D-Lib Mag. 17(9/10). http://www.dlib.org/dlib/september11/hagedorn/09hagedorn.html (2011)
IASA Technical Committee: Guidelines on the Production and Preservation of Digital Audio Objects. International Association of Sound and Audiovisual Archives. https://www.iasa-web.org/audio-preservation-tc04 (2009)
Jett, J.: Modeling Worksets in the HathiTrust Research Center. Tech. Rep. CIRSS Technical Report WCSA0715, University of Illinois at Urbana-Champaign. http://hdl.handle.net/2142/78149 (2015)
Malipiero, G.F.: Rispetti e Strambotti per Quartetto d’archi. J. & W. Chester. https://hdl.handle.net/2027/mdp.39015007983367 (1921)
Moretti, F.: Distant Reading. Verso, London (2013)
Motuz, C.: CIRMMT Workshop, September 7th, 2013, Part I: Introduction. https://ddmal.github.io/simssa-website/blog/cirmmt-workshop-september-7th-2013-part-i-introduction/ (2013)
Newcomer, N.L., Belford, R., Kulczak, D., Szeto, K., Matthews, J., Shaw, M.: Music discovery requirements: a guide to optimizing interfaces. Notes 69(3), 494–524 (2013)
November, N.: Editing Beethoven’s middle-period quartets: performers, scholars and sources in dialogue. Ad Parnassum J. Eighteenth- Nineteenth-century Instrum. Music 12(24), 31–53 (2004)
Peng, Z., Chen, M., Plale, B., Kowalczyk, S.: Author gender metadata augmentation of HathiTrust digital library. Proc. Am. Soc. Inf. Sci. Technol. 51(1), 1–4 (2014). https://doi.org/10.1002/meet.2014.14505101098
Rathey, M.: Bach’s Major Vocal Works: Music, Drama, Liturgy. Yale University Press, New Haven (2016)
Ratliff, B.: Every Song Ever: Twenty Ways to Listen in an Age of Musical Plenty. Farrar, Straus and Giroux, New York (2016)
Riley, J., Fujinaga, I.: Recommended best practices for digital image capture of musical scores. OCLC Syst. Serv. 19(2), 62–69 (2003)
Romani, F.: I Due Figaro, Ossia Il Soggetto Di Una Commedia: Da Rappresentarsi Nel Ducale Teatro Di Parma La Primavera Del 1840. Filippo Carmignani (1840)
Root, G.F.: Our National War Songs; A Complete Collection of Grand Old War Songs, Battle Songs, National Hymns, Memorial Hymns, Decoration Day Songs, Quartettes, Etc., with Accompaniment for Piano or Organ. S. Brainard’s Sons (1892)
Rumsey, A.S.: When We Are No More: How Digital Memory Is Shaping Our Future. Bloomsbury, London, UK (2016)
Sag, M.: Orphan works as grist for the data mill. Berkeley Technol. Law J. 27(3), 1503–1550 (2012)
Schumann, R.: Drei Quartette Für 2 Violinen, Viola, Violoncell: Op. 41. C.F. Peters. https://hdl.handle.net/2027/uc1.31175007325064 (1900)
Sheer, M.: Dynamics in Beethoven’s late instrumental works: a new profile. J. Musicol. 16(3), 358–378 (1998)
Solomon, M.: Reason and imagination: Beethoven’s aesthetic evolution. In: Historical Musicology: Sources, Methods, Interpretations, pp. 188–203. University of Rochester Press (2008)
Tillett, B.B.: Authority control: state of the art and new perspectives. Cataloging Classif. Q. 38(3/4), 23–41 (2004)
Tillett, B.B.: What is FRBR? A conceptual model for the bibliographic universe. Aust. Library J. 54(1), 24–30 (2005). https://doi.org/10.1080/00049670.2005.10721710
Underwood, T., Sellers, J.: How quickly do literary standards change? figshare pp 1–37. https://doi.org/10.6084/m9.figshare.1418394.v1 (2015)
Wilkin, J.: HathiTrust and Print Storage: Building around a digital core. https://www.hathitrust.org/documents/HathiTrust-CIC-201105.ppt (2011)
York, J.: HathiTrust Solr Benchmarking. Tech. rep., HathiTrust Digital Library. https://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf (2008)
York, J.: Building A Future By Preserving Our Past: The Preservation Infrastructure of HathiTrust Digital Library. IFLA, pp 1–11. https://www.ifla.org/past-wlic/2010/157-york-en.pdf (2010)
York, J., Hagedorn, K.: Quality in HathiTrust. https://www.hathitrust.org/quality-in-hathitrust (2015)
Zeng, J., Ruan, G., Crowell, A., Prakash, A., Plale, B.: Cloud Computing Data Capsules for Non-consumptive Use of Texts. In: Proceedings of the 5th ACM Workshop on Scientific Cloud Computing, pp 9–16. ACM. https://doi.org/10.1145/2608029.2608031 (2014)
Acknowledgements
We gratefully acknowledge the contributions of Kirstin (Dougan) Johnson and Colleen Fallaw, who were co-authors of a preliminary version of this paper, which appeared in the Proceedings of the 1st International Workshop on Digital Libraries for Musicology. We wish to thank Janina Sarol for her help in generating the data for this paper. Xiao Hu provided valuable suggestions and critique. Mike Furlough of the HathiTrust, as well as Tim Cole of the HathiTrust Research Center, helpfully answered our queries. We are grateful to the anonymous reviewers who provided detailed and generous commentary. This work was made possible with generous support from the HathiTrust Foundation.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Downie, J.S., Bhattacharyya, S., Giannetti, F. et al. The HathiTrust Digital Library’s potential for musicology research. Int J Digit Libr 21, 343–358 (2020). https://doi.org/10.1007/s00799-020-00283-7
DOI: https://doi.org/10.1007/s00799-020-00283-7