Usage Analysis for the Identification of Research Trends in Digital Libraries


D-Lib Magazine
May 2003

Volume 9 Number 5

ISSN 1082-9873

Usage Analysis for the Identification of Research Trends in Digital Libraries

Johan Bollen
Computer Science Department, Old Dominion University
Norfolk, Virginia
<jbollen@cs.odu.edu>

Rick Luce
Research Library, Los Alamos National Laboratory
Los Alamos, NM 87511
<rick.luce@lanl.gov>

Soma Sekhara Vemulapalli
Computer Science Department, Old Dominion University
Norfolk, Virginia
<svemulap@cs.odu.edu>

Weining Xu
Computer Science Department, Old Dominion University
Norfolk, Virginia
<wxu@cs.odu.edu>

Abstract

The analysis of user logs from large-scale digital libraries offers new opportunities to assess research trends in an institution's user communities. We describe the application of a methodology to derive weighted journal relationship networks from reader logs recorded at the Los Alamos National Laboratory's Research Library in 1998 and 2001. A journal impact metric is defined that derives journal impact from the structural features of the generated journal relationship networks, in much the same manner as Google's PageRank evaluates the impact of web pages for a given subject on the basis of their context of hyperlinks to other pages. A comparison of this reader impact metric to the ISI Impact Factor values for the same journals in 1998 and 2001 allows us to detect and interpret community-specific research trends where the LANL community deviates from more general trends, as indicated by changes in the Institute for Scientific Information (ISI) Impact Factors during those same years. Such analysis helps digital library managers evaluate which parts of the collection are most highly valued by their local community, and it also detects research trends in user communities as they evolve over time.

Introduction

Effective libraries anticipate and serve multiple constituencies of users. Library collection managers, charged with making significant financial acquisition decisions in anticipation of user needs, historically used a combination of professional judgment and random user feedback coupled with generalized resource tools to make such decisions. This approach suffers both from a lack of precision and from the lack of a fast user feedback loop. One of the promises of digital library (DL) development has been the prospect of deriving usage indicators to better illuminate the question of effectiveness. During the first-generation growth stages of DL development, usage indicators have focused on the relative, comparative quantity of electronic downloads vs. paper usage. Such comparisons probably reflect keen interest in the experimental stage of transitioning from paper to electronic, but they miss the mark in terms of a dynamic analysis of what the activity logs suggest for fine-tuning collection development. It could be expected that digital libraries would develop techniques to forecast and respond dynamically to changing user preferences over time; however, evaluating how well a DL responds to the needs and preferences of its user community requires that such preferences can be accurately and, ultimately, dynamically determined. Although the evaluation of DL collections can be approached from a number of perspectives, an assessment of the nature of DL user communities should be an essential part of any DL evaluation strategy.

Our approach to determine the characteristics of user communities operates by proxy of full-text document download patterns. It is assumed that DL users express their preferences through their usage of DL resources, most specifically by downloading and reading articles [Kaplan 2000, Darmoni 2002]. When the same user retrieves two documents during the same session, this serves as an indication that both documents may be related to the same information need. This assumption is often referred to as the "retrieval coherence assumption". A machine learning mechanism then scans DL download logs for pairs of documents downloaded in close temporal proximity by the same user and updates document relationship weights accordingly. After sufficient numbers of document downloads have been read from a DL download log, the resulting document and journal networks have been shaped by statistical patterns in document downloads as they occurred over a given time period for a community of readers or DL users. This methodology is explained in full detail in the June 2002 D-Lib Magazine article "Evaluation of Digital Library Impact and User Communities by Analysis of Usage Patterns" [Bollen 2002].

Networks that have thus been generated represent the relationships between documents and journals as implicitly indicated by large groups of DL readers. These networks can be analyzed to assess the characteristics of a given DL community similar to the analysis of citation networks [Everett 1991, Braam 1991, Gibson 1998]. Thus far we have focused specifically on networks of journal relationships derived from article download patterns in a DL.

A number of graph-theoretical measures can be applied to these networks to determine the nature of the DL community that shaped the network:

Impact ranking: determine a document's or journal's impact or prestige by its context of relationships to other documents and journals, similar to how social network analysis identifies nodes of high prestige.
Structural analysis: examine the structural properties of the generated networks to determine the characteristics of DL user communities and how they relate to certain network features.
Temporal analysis: examine the differences between networks generated from download patterns registered at different times to reveal research trends.

We extended the methodology outlined in [Bollen 2002] to detect research trends by generating journal relationship networks from article downloads during 1998 and again in 2001 at the Los Alamos National Laboratory (LANL) Research Library (RL). The structure of these two networks can be compared to yield measures of how the needs of the underlying user community have changed over the course of 3 years, and may point to specific research trends. We calculated measures of journal impact from both networks using a graph-theoretical measure that determines a journal's centrality or impact from its degree of connectedness within a network of journal relationships, a process conceptually similar to Google's PageRank [Brin 1998]. These measures of journal impact, when calculated over both the 1998 and 2001 journal networks, reflect how important specific journals were in the LANL RL user community in 1998 and 2001. A comparison of these values will indicate which journals have lost or gained importance and, consequently, how the needs of the LANL user community have changed over this 3-year period.

The generated measures of journal impact, however, confound LANL-specific impact trends with trends in the general scientific community. If a specific journal has gained impact over the period 1998 to 2001 in the general scientific community, it can be expected to do so at the LANL RL as well. However, we are interested in those trends that occur specifically in the LANL RL community and wish to separate general journal impact trends from those specific to the LANL RL. We therefore calculated the ratio of reader journal impact to the Institute for Scientific Information (ISI) Impact Factor (IF) to determine where a journal impact trend in the LANL RL deviated the most from trends in the ISI IF over the same period.

Formal Framework

Networks of Journal Relationships

Although a more formal definition of the methodology to create networks of journal relationships from DL download logs is provided in the June 2002 D-Lib article [Bollen 2002], we provide a brief overview below.

Assume we have acquired a DL log in which each line registers data pertaining to the download of a full-text article. Such data will generally include the originating IP number, the date and time of the download, and a document identifier. The LANL RL logs to which we applied this analysis, for example, contained the IP number from which the download request originated, the time and date on which the request occurred, and a document identifier including the ISSN for the journal in which the article was published.

We generally normalize such logs as follows:

anonymization: to protect the privacy of LANL RL patrons, we replace each IP number with a unique, but randomized user identification number

pruning: removal of all requests not pertaining to a document download

sorting: we sort all log lines according to user ID and time and date of the request

The resulting log file thus has all download requests issued by the same user, grouped and sorted according to the date and time on which they occurred.
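The three normalization steps can be sketched as follows. This is a minimal illustration, not the code used for the LANL RL logs: the record layout `(ip, timestamp, doc_id, is_download)`, the `"ISSN:article"` identifier format, and the function name `normalize_log` are all assumptions made for the example.

```python
import random
from datetime import datetime

def normalize_log(records, seed=0):
    """Normalize raw records given as (ip, timestamp, doc_id, is_download) tuples.

    The tuple layout is hypothetical; real log formats will differ.
    """
    rng = random.Random(seed)
    # Pruning: keep only requests that are full-text document downloads.
    downloads = [r for r in records if r[3]]
    # Anonymization: give each IP a unique, randomly assigned user ID.
    ips = sorted({r[0] for r in downloads})
    ids = list(range(1, len(ips) + 1))
    rng.shuffle(ids)
    user_of = dict(zip(ips, ids))
    anonymized = [(user_of[ip], ts, doc) for ip, ts, doc, _ in downloads]
    # Sorting: group by user ID, then order each user's requests by time.
    return sorted(anonymized, key=lambda r: (r[0], r[1]))

# Made-up records for illustration only.
records = [
    ("10.0.0.2", datetime(1998, 4, 1, 9, 30), "0031-9007:a1", True),
    ("10.0.0.1", datetime(1998, 4, 1, 9, 0), "0036-8075:b7", True),
    ("10.0.0.2", datetime(1998, 4, 1, 9, 5), "search-page", False),
    ("10.0.0.2", datetime(1998, 4, 1, 9, 10), "0163-1829:c3", True),
]
log = normalize_log(records)
```

After normalization the IP numbers no longer appear in the log, and each user's downloads form one contiguous, chronologically ordered run.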

The algorithm to generate a network of journal relationships examines the log at this point and extracts a list of journals from the article IDs occurring in the download log. It then proceeds to generate a network of journal relationships where each pair of journals is connected via a weighted link. This weight indicates the strength of the relationship between the two journals. In graph-theoretical terms, such a network is referred to as a directed, weighted graph. All weights are initialized to zero.

The algorithm then sequentially scans the log lines and examines each pair of downloads. In case the download requests were issued by the same user on the same day within a given period of time, the algorithm marks the two article downloads and extracts the journals in which the articles were published. It then increases the weight of the relationship between these two journals. In other words, if two articles were downloaded by the same user within a given period of time, the algorithm will slightly increase the link weight between the two journals in which the articles were published. The amount by which a journal relationship weight is increased can be varied according to the time passed between the two downloads or any other function [1].

As such, the algorithm will gradually adapt the weights of journal relationships in the network according to how often users have downloaded articles published in these journals within a given period of time or within the same session. Link weights will thus come to represent the degree of relationship between journals for a given community of DL users.
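A minimal sketch of this weight-update scan is given below. Several details are assumptions rather than the published procedure: only consecutive downloads by the same user are paired, the time window is one hour, the increment is a constant, links run from the earlier to the later download, and two articles from the same journal do not create a self-link. The `"ISSN:article"` identifier format and the names `journal_of` and `build_rgn` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def journal_of(doc_id):
    # Hypothetical identifier layout "ISSN:article"; real logs will differ.
    return doc_id.split(":", 1)[0]

def build_rgn(log, window=timedelta(hours=1), increment=1.0):
    """Build directed journal link weights from a normalized log.

    `log` holds (user_id, timestamp, doc_id) tuples sorted by user and time.
    """
    weights = defaultdict(float)  # every link weight implicitly starts at zero
    for (u1, t1, d1), (u2, t2, d2) in zip(log, log[1:]):
        same_session = u1 == u2 and t1.date() == t2.date() and t2 - t1 <= window
        j1, j2 = journal_of(d1), journal_of(d2)
        if same_session and j1 != j2:
            # Reinforce the link from the earlier to the later journal.
            weights[(j1, j2)] += increment
    return dict(weights)

# Made-up log: two users both download from journal A, then journal B.
log = [
    (1, datetime(1998, 4, 1, 9, 0), "A:1"),
    (1, datetime(1998, 4, 1, 9, 10), "B:7"),
    (1, datetime(1998, 4, 1, 13, 0), "C:2"),
    (2, datetime(1998, 4, 2, 9, 0), "A:4"),
    (2, datetime(1998, 4, 2, 9, 5), "B:9"),
]
rgn = build_rgn(log)
```

In this toy log only the link from A to B is reinforced, once per user; the download from C falls outside the time window and leaves the network unchanged.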

The structural features of such networks reveal what the collective perception of journal relationships is within a community of users and thus the particular views and perspectives of such a community.

We will refer to the thus generated networks as Reader Generated Networks (RGN) since they have been generated from download patterns as they were recorded for a community of article readers for a given DL. These networks can be contrasted to Author Generated Networks (AGN) whose structure has been defined by the explicit actions of individual authors, e.g. the World Wide Web (WWW) in which hyperlinks are manually inserted by web page authors, or the citation graph in which article authors explicitly reference material.

Measures of reader journal impact

Once a network of weighted journal relationships, the RGN, has been generated on the basis of a large number of article downloads in a DL log, we can extract measures of journal impact from the structure of such networks. Rather than rely on download frequency, we wish to establish the impact of a journal from the context of its relationship to other journals. A journal that has many connections to other journals in a network of journal relationships established from structural patterns in document downloads can be considered "central" to the reading behavior of DL users. In other words, such a journal is not simply read often, it has obtained a central position in readers' retrieval patterns. We refer to metrics that capture this aspect of a journal's impact as "structural" journal impact metrics, which can be juxtaposed to metrics based on download frequency.

The domain of social network analysis has defined numerous metrics to determine how central certain actors are in social networks, and how to interpret actor centrality as an indication of their rank, power or prestige within a social group. Generally such metrics are based on the assumption that an actor who relates to many other actors, and simultaneously is related to by many actors, is highly central to the structure of social relationships and can therefore be regarded as an actor of high power or prestige. Bonacich discusses a set of common power metrics in social network analysis, namely degree centrality, closeness centrality, "betweenness" centrality and the Bonacich or eigenvector centrality [Bonacich 1987]. In spite of their different definitions, all centrality measures determine an actor's prestige or power in a social network by examining the context of its relationships to other actors.

Although the discussed centrality metrics originate in social network analysis, they have found applications in a wide range of datasets such as bibliometric analysis and the analysis of document networks. Indeed, given that we have a network of document relationships, documents can be ranked according to their prestige by calculating their centrality scores on the basis of their relationships to other documents [Kleinberg 1999]. The Google PageRank algorithm [Brin 1998] demonstrates how eigenvector or Bonacich centrality can be successfully applied to determine the ranking of WWW pages according to their standing in a hyperlinked network.

In our research we have applied the Bonacich centrality, also referred to as eigenvector centrality, to the generated RGN. Centrality values calculated this way will indicate the prestige or impact of specific journals within the community of DL users whose download patterns shaped the network.

We denote the calculated eigenvector centrality of every journal i in the RGN as P(i).
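As an illustration of how P(i) might be obtained, the sketch below approximates eigenvector centrality by power iteration over a dictionary of weighted, directed links. It is an assumption-laden example rather than the implementation used in this study: a journal is taken to inherit prestige from the journals linking to it, scores are normalized to sum to 1, and each journal's previous score is carried over at every step (which leaves the eigenvectors unchanged but prevents oscillation). The function name and the toy network are hypothetical.

```python
def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality for {(source, target): weight} links."""
    nodes = sorted({n for edge in weights for n in edge})
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Carry over the old score, then add prestige from incoming links.
        new = dict(score)
        for (src, dst), w in weights.items():
            new[dst] += w * score[src]
        total = sum(new.values())
        new = {n: v / total for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score

# Toy network: hub journal H linked in both directions with A, B and C.
star = {
    ("H", "A"): 1.0, ("A", "H"): 1.0,
    ("H", "B"): 1.0, ("B", "H"): 1.0,
    ("H", "C"): 1.0, ("C", "H"): 1.0,
}
P = eigenvector_centrality(star)
```

The hub receives the highest score because it is connected to every other journal, while the three peripheral journals, which occupy equivalent positions, receive equal scores.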

Journal Impact Discrepancy Ranking

The journal impact values determined from the RGN networks correspond to a community-specific measure of journal impact. However, this impact confounds two specific factors: community-specific journal impact and general journal impact. Clearly, the impact of a journal within a specific community, such as the research community of the Los Alamos National Laboratory, is determined by factors relating to the specific characteristics of that community as well as general trends in the scientific community. Since we are most interested in what differentiates the LANL community from the general scientific community, we would like to contrast community-specific journal impact with journal impact in the wider scientific community. We have done so by contrasting the centrality values, P(i), calculated from the RGN networks with the ISI Impact Factor (IF) [Garfield 1979].

The ISI IF for a given journal i for a given year x is determined as the ratio of the number of citations in year x to articles published in journal i in the two years preceding x over the total number of articles published in journal i in those two years. For example, assume we would like to determine the impact of journal i for the year 2001. We refer to the number of citations in 2001 to articles published in journal i in the years 1999 and 2000 by the quantity A. The total number of articles published in journal i during 1999 and 2000 is denoted B. The ISI IF for this journal in 2001 would therefore be given by:

$IF = \frac{A}{B}$
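As a worked illustration with hypothetical counts (not taken from any real journal): if articles published in journal i during 1999 and 2000 received A = 150 citations in 2001, and the journal published B = 60 articles in those two years, then

```latex
\[
  \mathrm{IF}(i, 2001) = \frac{A}{B} = \frac{150}{60} = 2.5
\]
```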

We can now see how the ISI IF corresponds to the definition of degree centrality in social network analysis. Degree centrality is defined as the sum of the in and out degree of a network node, normalized by the total number of edges in the network. When applied to the citation graph, the ISI IF counts only the in degree of nodes (citations to the journal), and normalizes by the number of articles published in the journal over 2 years. As such the ISI IF can be considered a form of degree centrality, which will correspond to our general notions of node centrality and, thus, of prestige and impact in a social network. Applied to the citation graph, the ISI IF is therefore another instantiation of the concept of node centrality or prestige, which in this case is referred to as journal impact.

Since the ISI IF is defined over citation data collected for the entire scientific community, it can be used as a baseline for comparison to RGN centrality values: it expresses impact over the entire scientific community while the RGN centrality values correspond to community specific impact. Where the RGN centrality (i.e. the eigenvector centrality) P(i) and the ISI IF differ, they indicate discrepancies between the impact of a journal for the LANL community and the general science community.

We have calculated the eigenvector centrality of every journal in the RGN. Since a separate RGN is generated from the DL download logs registered in each year x, we denote the eigenvector centrality of journal i in year x as P(i,x). We have retrieved the ISI IF for the same year x, for the same set of journals. The ISI IF for each journal i in year x will be denoted IF(i,x).

We then define the Impact Discrepancy Ratio (IDR) for a given journal i for a given year x as:

$P'(i,x) = \frac{P(i,x)}{\mathrm{IF}(i,x)}$

The values of P'(i,x) indicate the degree to which LANL specific journal impact deviates from the IF for the same journal in the general scientific community in year x. A high IDR value indicates the calculated LANL impact, P(i,x), was higher than could be expected from its ISI IF, IF(i,x), value. A low IDR value indicates a lower LANL impact, P(i,x), than could be expected from the ISI IF, IF(i,x). Therefore, by ranking a collection of journals according to their IDR values, we can determine for a given year x which journals most specifically had a high LANL impact compared to their ISI IF in the general scientific community. This will allow us to pinpoint those journals that are particularly indicative of the interests of the LANL community regardless of their general ISI IF ranking.
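The IDR calculation and ranking can be sketched in a few lines. The journal names and numbers below are made up for illustration, and skipping journals without a retrievable (or with a zero) ISI IF is an assumption of this sketch.

```python
def impact_discrepancy_ratio(centrality, impact_factor):
    """IDR per journal: local centrality P(i,x) divided by ISI IF(i,x).

    Journals with a missing or zero IF are skipped (an assumption of this sketch).
    """
    return {
        journal: p / impact_factor[journal]
        for journal, p in centrality.items()
        if impact_factor.get(journal)
    }

# Hypothetical values for a single year x.
centrality = {"Journal A": 4.0, "Journal B": 4.0, "Journal C": 1.0}
impact_factor = {"Journal A": 0.5, "Journal B": 8.0}

idr = impact_discrepancy_ratio(centrality, impact_factor)
ranking = sorted(idr, key=idr.get, reverse=True)
```

Journal A and Journal B have the same local centrality in this example, but Journal A ranks first because its local impact is high relative to a low ISI IF. Journal C is left out because no IF is available for it.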

Detection of Research Trends

We further refine the IDR metric to detect journal impact discrepancy changes over time. We are particularly interested in changes in the IDR over time, as these changes indicate how the impact of certain journals has deviated more or less from the ISI IF, whether by increasing or by decreasing their local LANL impact compared to the ISI IF.

We therefore define the Impact Discrepancy Ratio Trend (IDRT) as follows:

Let P'(i,x) be the IDR value calculated for year x, and assume IDR values are also recorded for years x + 1, x + 2, etc. The IDRT value P'' for a given journal i over years x and x + n is given by:

$P''(i,x,x+n) = \frac{P'(i,x+n)}{P'(i,x)}$

or the ratio of the IDR value calculated for year x + n over that calculated for year x, so that values above 1 indicate an IDR that has grown between the two years.

Figure 1. Two scenarios demonstrating how, for a journal i, the local impact P(i) can deviate from the ISI IF, IF(i), over time.

Figure 1 provides a graphical representation of the rationale underlying the definition of the IDRT metric. As shown in the figure, IDRT values are shaped by how strongly a journal's local impact relative to the ISI IF for the same year (its IDR value) evolves over time. Journals with an IDRT score much larger or smaller than 1 will be those journals that, over a period of n years, have deviated most from the trend found in the general scientific community as represented by the ISI IF. Therefore, high or low IDRT-scoring journals will point to research trends observed uniquely in a specific user community. IDRT scores may need to be normalized by the value of n to allow comparisons between different samples, but since we intend to compare IDR values over only two data points (2 log samples), this issue will not affect our ability to detect specific research trends.
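The IDRT calculation can be illustrated with a short sketch that uses three pairs of IDR values taken from Tables 2 and 3. The function name is hypothetical. The ratio is computed here as the later-year IDR divided by the earlier-year IDR, which is the orientation that reproduces the P'' column of those tables.

```python
def impact_discrepancy_trend(idr_start, idr_end):
    """IDRT per journal: IDR in the later year divided by IDR in the earlier year.

    Only journals present in both years, with a positive starting IDR, are kept.
    """
    return {
        journal: idr_end[journal] / start
        for journal, start in idr_start.items()
        if journal in idr_end and start > 0
    }

# P'(i,1998) and P'(i,2001) as listed in Tables 2 and 3.
idr_1998 = {
    "Journal of Arid Environments": 5.764,
    "Planetary and Space Science": 13.862,
    "Journal of Electrostatics": 37.580,
}
idr_2001 = {
    "Journal of Arid Environments": 16.457,
    "Planetary and Space Science": 18.773,
    "Journal of Electrostatics": 1.516,
}

trend = impact_discrepancy_trend(idr_1998, idr_2001)
for journal, value in sorted(trend.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:.3f}  {journal}")
```

The resulting values agree with the tabulated P'' scores to within rounding of the published figures: the first two journals score well above 1, the third far below 1.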

Results for the Los Alamos National Laboratory

We have applied the analysis described in the above sections to two comparable DL download logs registered by the Los Alamos National Laboratory Research Library in 1998 and 2001. The 1998 download log covered the period from April to December 1998, while the 2001 log reflects activity from June to November of that year. The 1998 log contained 31,991 download requests issued by 1,941 unique users for articles published in a total of 472 journals. The 2001 log contained 40,847 download requests issued by 1,858 unique users for articles published in 1,829 journals. Two RGNs were generated, one from each of the 1998 and 2001 LANL RL download logs.

We calculated eigenvector centrality values, and subsequently IDR and IDRT values, for a set of 292 journals in which readers had downloaded articles in both 1998 and 2001 LANL RL logs, and for which 1998 and 2001 ISI IF could be retrieved. The 292 journals to which we applied our analysis represent a subset of the total number of journals from which articles were downloaded in 1998 and 2001. Articles were downloaded from 472 journals in the 1998 log, and from 1,829 journals in the 2001 logs. However, since our IDRT metric implies a comparison of 1998 and 2001 IDR values, IDRT values could only be generated for journals occurring in both 1998 and 2001 logs and for which ISI IF values were available.

Stability

Local LANL journal impact — as measured by a journal's eigenvector centrality value derived from the RGN generated — is expected to be less stable than the ISI IF since the ISI IF is calculated from the citations of a large group of authors and represents journal impact for the entire scientific community. This community will be less subject to rapid changes in interests and focus than smaller, directed research communities such as those of LANL. This hypothesis underlies our decision to use the ISI IF as a baseline measurement of journal impact in the general science community.

To verify whether the ISI IF is indeed more stable over time than the LANL journal impact values, we calculated the Spearman correlation coefficient [2] between the 1998 and 2001 LANL journal impact values (P(i,x)), and between the ISI IF values (IF(i,x)) in 1998 and 2001 for the same set of journals. The correlation between the 1998 and 2001 eigenvector centrality values was found to be 0.738 (p<0.001), indicating a high stability over time. The correlation between the ISI IF values in 1998 and 2001 for the same set of journals was found to be 0.877 (p<0.001). The ISI IF is thus indeed more stable over time than the LANL journal impact values: as expected, journal impact at LANL changes more rapidly over time than the ISI IF.
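For readers who wish to reproduce this kind of stability check on their own data, the following sketch computes a Spearman rank correlation in plain Python, as the Pearson correlation of the two rank vectors with tied values sharing their average rank. The function names and the sample values are illustrative assumptions. Significance testing is not covered.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical impact values for five journals in two different years.
impact_year_1 = [0.9, 0.4, 2.1, 1.5, 3.0]
impact_year_2 = [0.5, 0.7, 1.9, 2.4, 3.2]
rho = spearman(impact_year_1, impact_year_2)
```

A coefficient close to 1 means that the journals are ordered almost identically in both years, while lower values indicate that the ranking has been reshuffled.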

Figures 2 and 3 plot the 1998 versus 2001 eigenvector centrality values and the 1998 and 2001 ISI IF values for this set of journals.


Figure 2. Scatterplot of 1998 vs. 2001 LANL journal impact values for 292 journals.

Figure 3. Scatterplot of 1998 vs. 2001 ISI IF values for 292 journals.

LANL journal impact ranking

Table 1: Twenty highest ranked journals by 1998 and 2001 eigenvector centrality and ISI IF values.

	Eigenvector centrality		ISI Impact Factor
rank	P(i,1998)	P(i,2001)	IF(i,1998)	IF(i,2001)
1	PHYS REV LETT	NUCL INSTRUM METH A	TRENDS NEUROSCI	TRENDS NEUROSCI
2	SCIENCE	PHYSICA B	TRENDS BIOCHEM SCI	TRENDS BIOCHEM SCI
3	PHYS REV B	MAT SCI ENG A-STRUCT	IMMUNOL TODAY	SURF SCI REP
4	PHYS REV E	J ALLOY COMPD	TRENDS GENET	PROG MATER SCI
5	NUCL INSTRUM METH A	IEEE T NUCL SCI	SURF SCI REP	TRENDS GENET
6	BIOCHEM BIOPH RES CO	J RADIOANAL NUCL CH	DEV BIOL	IMMUNOL TODAY
7	NUCL INSTRUM METH B	J NUCL MATER	J MOL BIOL	PHYS REP
8	J COMPUT PHYS	THIN SOLID FILMS	PHYS REP	PROG BIOPHYS MOL BIO
9	CHEM PHYS LETT	J HYDROL	MECH DEVELOP	PROG SURF SCI
10	P NATL ACAD SCI USA	LANCET	PROG BIOPHYS MOL BIO	PROG NUCL MAG RES SP
11	J BIOL CHEM	CHEM PHYS LETT	PROG NUCL MAG RES SP	NUCL PHYS B
12	ANAL CHIM ACTA	J COMPUT PHYS	MOL PHYLOGENET EVOL	J MOL BIOL
13	J MOL BIOL	APPL SURF SCI	COGNITIVE PSYCHOL	DEV BIOL
14	APPL SURF SCI	SURF SCI	EARTH-SCI REV	COORDIN CHEM REV
15	THIN SOLID FILMS	J MOL BIOL	PHYS LETT B	EXP CELL RES
16	APPL PHYS LETT	INT J SOLIDS STRUCT	VIROLOGY	PHYS LETT B
17	PHYS LETT A	J CATAL	J AM SOC MASS SPECTR	ASTROPART PHYS
18	PHYSICA B	CATAL TODAY	GENOMICS	PROG POLYM SCI
19	SYNTHETIC MET	J CHROMATOGR A	NUCL PHYS B	COGNITIVE PSYCHOL
20	J ELECTROANAL CHEM	ADV SPACE RES	APPL CATAL B-ENVIRON	MECH DEVELOP

To demonstrate how the calculated eigenvector centrality values can indicate a journal's impact local to LANL, Table 1 lists journals ranked according to their eigenvector centrality calculated from the 1998 and 2001 RGN, and according to their 1998 and 2001 ISI IF. We see how the ISI IF ranking of journals in 1998 almost exclusively focuses on journals in the domains of biology, biochemistry, and genetics, with the exception of a small number of journals relating to the mission of LANL such as Physics Letters B, Progress in Nuclear Magnetic Resonance Spectroscopy, and Nuclear Physics B. The highest ranked journals in 1998 are Trends in Neurosciences and Trends in Biochemical Sciences. In 2001 we see that the ranking of some journals has changed, but overall we find the same focus on biology and biochemistry. With the exception of the journal Immunology Today, we find that the set of the three highest ranked journals has not changed. Both lists overlap by more than 10 journals. This picture confirms the high correlation reported in the section on Stability.

The 1998 and 2001 eigenvector centrality rankings, i.e. P(i,1998) and P(i,2001), reveal a strikingly different ranking of journals from both the 1998 and 2001 ISI IF rankings. For 1998, we find a strong focus on physics, nuclear science and molecular biology, in addition to material sciences. LANL RL management confirms that this list does in fact strongly correspond to the interests of the LANL community and the local impact or prestige of journals. The 2001 ranking of journals has changed drastically since 1998. We find the same focus on physics and nuclear science, but less so on molecular biology, with the exception of the Journal of Molecular Biology, which has approximately maintained its 1998 ranking. We find journals relating to hydrology (Journal of Hydrology) and space research (Advances in Space Research) that did not occur in the 1998 ranking, indicating that the LANL community may have started to explore new research domains. These data are consistent with the lower correlation coefficient between the 1998 and 2001 journal eigenvector centrality values than between the 1998 and 2001 ISI IF values.

The data suggest that the eigenvector centrality calculated from the RGN networks indeed captures the impact of journals for a community of DL users, and to a larger degree than a ranking according to the ISI IF does. This effect is even more striking when one considers that the listed ISI IF ranking has in fact been determined for a set of journals from which articles have been downloaded in the 1998 and 2001 logs. In other words, even if one were to register which journals have been frequently read in the LANL RL, and produce a ranking according to the ISI IF, one would not be able to produce a ranking of journals that correctly represented their true local impact.

IDR and IDRT results

We then proceeded to calculate IDR and IDRT values to produce a ranking of journals that may reveal particular research trends in the LANL community. High scoring IDRT journals are those whose local LANL impact has increased most strongly over the period of 1998 to 2001, while their ISI IF has remained stable or decreased. In other words, high IDRT values correspond to journals whose local impacts have deviated most from changes in the ISI IF over the same period and point to where the interests of a local user community have shifted. Low IDRT values point to journals whose local interest has sharply decreased while their general ISI IF has remained stable or increased.

IDRT values, together with IDR values and ISI IF values, for the ten highest and ten lowest scoring journals are listed in Tables 2 and 3. The ten highest IDRT scoring journals cover a variety of subjects. We find, for example, the Journal of Arid Environments, Journal of Atmospheric and Solar-Terrestrial Physics, Remote Sensing of Environment, Planetary and Space Science, and Trends in Biochemical Sciences. The set of high-ranking IDRT journals thus seems to correspond to space-based climatological observations relating to arid environments. The presence of the journal Planetary and Space Science seems to suggest an interrelationship with planetary studies. Indeed, LANL RL management validated these results by pointing to recent, directed efforts at LANL to support this type of research, which recently culminated in the production of the first map of hydrogen distribution on Mars. In other words, using the IDRT metric, we have been able to pinpoint local research developments that identify the agenda of a specific research community separate from the general scientific community.

Table 2: Ten highest IDRT scoring journals over 1998 and 2001 LANL RL download logs.

P''	P'(i,1998)	P'(i,2001)	IF(i,98)	IF(i,01)	Journal Title
2.854	5.764	16.457	0.597	0.640	Journal of Arid Environments
2.216	3.097	6.866	1.111	1.003	Journal of Volcanology and Geothermal Research
2.097	2.486	5.213	1.384	1.352	Annals of Botany - London
1.664	7.058	11.744	0.937	1.044	Journal of Atmospheric and Solar-Terrestrial Physics
1.478	4.84	7.157	1.410	1.697	Remote Sensing of Environment
1.354	13.862	18.773	0.826	1.246	Planetary and Space Science
1.294	12.105	15.669	0.902	0.318	Journal of Geochemical Exploration
1.279	4.304	5.504	3.133	3.643	Applied Catalysis B: Environmental
1.262	18.000	22.709	0.286	0.636	Non-destructive Testing and Evaluation International
1.256	1.562	1.963	17.085	14.329	Trends in Biochemical Sciences

Table 3: Ten lowest IDRT scoring journals over 1998 and 2001 LANL RL download logs.

P''	P'(i,1998)	P'(i,2001)	IF(i,98)	IF(i,01)	Journal Title
0.079	31.004	2.440	0.731	1.252	Ecotoxicology and Environmental Safety
0.077	7.834	0.605	1.363	1.741	Cancer Letters
0.074	27.742	2.050	0.333	0.514	Optics and Lasers in Engineering
0.066	10.604	0.706	0.978	1.493	Regulatory Toxicology and Pharmacology
0.060	6.597	0.401	1.516	2.631	Physiological and Molecular Plant Pathology
0.057	282.171	16.083	0.042	0.284	Journal of the Franklin Institute
0.050	183.419	9.195	0.105	0.391	Computers and Industrial Engineering
0.042	28.989	1.227	0.831	0.859	Superlattices and Microstructures
0.040	37.580	1.516	0.272	0.695	Journal of Electrostatics
0.035	10.636	0.370	0.559	14.000	Progress in Materials Science

Structural Trends

We have identified a set of journals whose local LANL impact has deviated most from the ISI IF over a period of three years, and ranked journals according to their degree of deviation, namely their IDRT score. We will examine how the structural characteristics of the journal relationships of a number of high IDRT journals changed over time.

Given the nature of the set of high IDRT scoring journals, and their relationships to planetary science and geology, we produced a map of the relationships to and from the Journal of Atmospheric and Solar-Terrestrial Physics to investigate which structural changes had occurred from 1998 to 2001.

Figure 4. Neighborhood of the Journal of Atmospheric and Solar-Terrestrial Physics plotted for 1998 data.


Figure 5. Neighborhood of the Journal of Atmospheric and Solar-Terrestrial Physics plotted for 2001 data.


Figure 4 displays the connections among all nodes related to the Journal of Atmospheric and Solar-Terrestrial Physics as they exist in the RGN generated from the 1998 LANL RL log data. Among the journals in the 1998 neighborhood, we find Atmospheric Environment, Advances in Space Research, Physical Review Letters and Physics of the Earth and Planetary Interiors, pointing to a set of journals characterized by their relationships to subjects of earth science and climatology.

Figure 5 displays the relationships among the journals in this neighborhood as they exist in the graph generated from the 2001 LANL RL log data. Compared to 1998, the Journal of Atmospheric and Solar-Terrestrial Physics has developed a larger number of relationships to other journals, and the character of the journals to which it is connected has changed. One striking example is Quaternary International, a journal directed at quaternary geologists, physical geographers, paleontologists, geomorphologists, archaeologists and soil scientists. Others are the IEEE Transactions on Power Delivery, the Journal of Hydrology, and Global and Planetary Change. Together these point to a deliberate research effort at LANL to advance planetary science, and more particularly the investigation of hydrogen, or water, levels in the geology of Mars [3].
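
The comparison of Figures 4 and 5 amounts to extracting the set of journals linked to or from a focal journal in each year's network and comparing the two sets. The sketch below illustrates this under stated assumptions: the network is held as a dictionary mapping (source, target) pairs to weights, the journal labels "J" and "A" to "D" and all weights are placeholders rather than values from the LANL logs, and `neighborhood` and `neighborhood_change` are hypothetical helper names.

```python
def neighborhood(edges, journal):
    """Journals linked to or from `journal` by an edge with positive weight."""
    linked = set()
    for (source, target), weight in edges.items():
        if weight <= 0:
            continue  # ignore absent or zero-weight relationships
        if source == journal:
            linked.add(target)
        elif target == journal:
            linked.add(source)
    linked.discard(journal)  # a self-loop does not make a journal its own neighbor
    return linked


def neighborhood_change(edges_before, edges_after, journal):
    """Split a journal's neighbors into those gained, lost and kept."""
    before = neighborhood(edges_before, journal)
    after = neighborhood(edges_after, journal)
    return {"gained": after - before, "lost": before - after, "kept": before & after}


# Placeholder networks for two years; labels and weights are illustrative only.
EDGES_EARLY = {("J", "A"): 0.4, ("B", "J"): 0.2, ("A", "B"): 0.9}
EDGES_LATE = {("J", "A"): 0.5, ("J", "C"): 0.3, ("D", "J"): 0.1, ("B", "J"): 0.0}

if __name__ == "__main__":
    print(neighborhood_change(EDGES_EARLY, EDGES_LATE, "J"))
```

Treating links in both directions as part of the neighborhood mirrors the phrase "relationships to and from" used above. A stricter analysis might keep incoming and outgoing links apart.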

An examination of the relationship contexts of other journals in 1998 and 2001 indicated similar developments. We examined the relationships among the journals connecting to and from the Journal of Geochemical Exploration as they occurred in the RGNs generated from the 1998 and 2001 log data (see Figures 6 and 7).

Figure 6. Neighborhood of the Journal of Geochemical Exploration plotted for 1998 data.

Figure 7. Neighborhood of the Journal of Geochemical Exploration plotted for 2001 data.

The 1998 context of this journal includes, among others, Contributions to Mineralogy and Petrology, Sedimentary Geology and Chemical Geology, indicating that its readership at LANL focused primarily on issues of petrology and sedimentary geology. The 2001 context, however, points to changes similar to those that took place for the Journal of Atmospheric and Solar-Terrestrial Physics. We find connections to the Journal of Hydrology; to the Journal of Contaminant Hydrology, which relates to issues of groundwater pollution; and, most significantly, to Geochimica et Cosmochimica Acta, which relates to geochemistry and meteorites. The journal's context in the RGNs has thus shifted from geology research in the framework of petrology to one that supports our initial notion of research directed at Mars exploration. In fact, we may speculate that the link to the Journal of Contaminant Hydrology is related to research on a possible mission to sample Martian soil, and to efforts to avoid contaminating Martian soil and groundwater.

Conclusion

We have demonstrated a methodology to detect and interpret the nature of localized research trends as they occur in an institution by analyzing reader usage patterns derived from the institution's DL usage logs and comparing these to the ISI IF. This methodology successfully identified a planned and deliberate research project at the Los Alamos National Laboratory that caused the local impact of journals to deviate strongly from the ISI IF over a period of three years.

Implications

This research has some noteworthy implications for library managers related to the acquisition of digital content. Collection managers who evaluate and make tough content acquisition and retention decisions based solely on the ISI IF run the risk of missing the focused and more specialized needs of their user communities. Conversely, collection managers now have a methodology to dynamically assess usage trends over time, which may lead to a more proactive role in managing both DL content and related associative linking resources and needs that can be fine-tuned for a given institution. The research results reported herein should also stimulate librarians to take proactive measures to ensure they have access to their institution's activity log data, as it will likely prove to be a fertile source of insight into the needs of their DL user communities. As libraries continue to collaborate with one another in the DL arena, sharing such insight among multiple institutions will be required to identify and appropriately support the increasing geographical distribution of work that characterizes today's scientific research.

As demand to provide increased levels of DL capabilities continues to grow, managers will be confronted with difficult trade-offs between ramping up technical staffing capabilities and continuing to direct limited staff resources to dialog with users and forecast their needs. Given constrained financial resources, those making content acquisition decisions can ill afford generalized decisions that do not reflect dynamically changing user community needs. The use of new tools that automatically collect and yield targeted indicators of the changing needs of an institution's user communities is likely to become an increasingly strategic asset of successful digital libraries.

We believe the present results potentially foreshadow the initial steps toward a science of DL evaluation that does not merely take into account the preferences of users, but acknowledges the relationships and semantics underlying such preferences and how they change over time.

Future Research

We can identify a number of issues with our present methodology that may affect the validity of our data.

First, it is questionable whether the ISI IF and the eigenvector centrality calculated over the 1998 and 2001 RGNs yield comparable journal impact metrics. The two are based on data of different origin, citation data and download data respectively, and are calculated in different ways. Our use of the ISI IF must, however, be understood in light of the absence of any other data set that expresses journal impact in the general scientific community. If DL download logs were aggregated across a variety of institutions so that a representative sample was achieved, general journal impact data could be derived from them and used, in turn, as a baseline against which to compare journal impact derived from local DL download logs.
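
For readers less familiar with the second of these quantities, eigenvector centrality on a weighted, directed network can be approximated by power iteration: each node's score is repeatedly replaced by the weighted sum of the scores of the nodes linking to it, then renormalized. The sketch below is a generic illustration and not necessarily the exact computation or normalization used for the RGNs. The 3-node weight matrix is a made-up example, `weights[i][j]` is assumed to hold the weight of the link from node i to node j, and the function name is hypothetical.

```python
def eigenvector_centrality(weights, tol=1e-12, max_iter=1000):
    """Approximate eigenvector centrality by power iteration.

    weights[i][j] is assumed to be the weight of the link from node i to node j.
    Scores are normalized to sum to 1.
    """
    n = len(weights)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # A node's new score is the weighted sum of the scores of nodes linking to it.
        updated = [sum(scores[i] * weights[i][j] for i in range(n)) for j in range(n)]
        total = sum(updated)
        if total == 0:
            raise ValueError("network has no links")
        updated = [value / total for value in updated]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


# Made-up 3-node network; the weights are illustrative, not taken from any log data.
EXAMPLE_WEIGHTS = [
    [0, 2, 1],
    [1, 0, 1],
    [3, 1, 0],
]

if __name__ == "__main__":
    print(eigenvector_centrality(EXAMPLE_WEIGHTS))
```

Power iteration of this kind settles on a single score vector only when the network is sufficiently connected. The example matrix was chosen so that every node can reach every other node.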

Second, at present the IDRT metric is not grounded in a thorough theoretical model of what we understand by journal impact and how it relates to the roles and behavior of its stakeholders, such as authors, readers, DL managers and domain experts. Ideally, we would cast the developed metrics in a more general understanding of how the semantics of a citation relate to sequential patterns in reading, and of how the generated metrics can be normalized to enable comparisons over a wide range of time periods and DL download logs. We believe it would be useful to investigate further whether the proposed metrics can be applied to different data sets, e.g., video and audio impact ranking. A recent collaboration with the Open Video project [Slaughter 2000, Geisler 2001] has indicated that the developed network methodology is equally effective in generating document networks for multimedia archives, provided that adequate user download logs are available. A prototype recommender service has recently been developed that makes use of document networks generated from DL download logs:
<http://isis.cs.odu.edu:4224/openvideo.html>.

Third, the presented results, though compelling, need to be investigated more extensively. Future research will focus on more advanced metrics that not only detect past research trends but also extrapolate them into the future, so that DL evaluation can proactively indicate shifts and trends in the interests of user communities. In particular, we are interested in extending these results to the automated detection of user communities within a given DL and across multiple DLs spanning institutional boundaries, and in enabling the evaluation of such user communities over time.

We see the present results as the initial steps to a science of DL evaluation that does not merely take into account the preferences of users, but acknowledges the relationships and semantics that underlie such preferences and how they change over time.

Notes

[1] We could generate a network of article relationships in this manner, but our objective is to investigate journal impact.

[2] The Spearman correlation coefficient expresses how strongly the rankings of two variables are related, e.g., the height and weight of adults. We find a strong correlation when increases or decreases in the value of one variable correspond to similar changes in the other. A correlation coefficient of "1" indicates a perfect relationship, while a correlation coefficient of "0" indicates the absence of a relationship. However, a correlation does not prove a causal relationship; it merely indicates that two variables co-vary. The p-value associated with a correlation coefficient indicates its statistical significance, or in other words, the probability that a correlation at least this strong would have been found by chance alone. A low p-value, usually less than 0.05, indicates a statistically significant result.

[3] See <http://www.lanl.gov/worldview/news/releases/archive/03-019.shtml>.
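
As an illustration of note [2], the Spearman coefficient can be computed by replacing each variable's values with their ranks (tied values share the average rank) and then taking the ordinary Pearson correlation of the two rank lists. The sketch below does this in plain Python. The height and weight figures are invented for the example, and `average_ranks` and `spearman` are hypothetical helper names.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the two rank lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


# Invented heights (cm) and weights (kg) for four adults, for illustration only.
HEIGHTS = [160, 170, 180, 190]
WEIGHTS = [55, 68, 66, 90]

if __name__ == "__main__":
    print(spearman(HEIGHTS, WEIGHTS))
```

This sketch returns only the coefficient. It also yields negative values, down to -1, when one variable's ranks fall as the other's rise. The associated p-value requires a further significance test that is not shown here.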

Bibliography

Bollen, J., Luce, R., (2002). "Evaluation of digital library impact and user communities by analysis of usage patterns." D-Lib Magazine, 8(6). Available at <doi:10.1045/june2002-bollen>

Bonacich, P., (1987). "Power and centrality: A family of measures." American Journal of Sociology, 92(5), 1170-1182.

Braam, R.R., Moed, H.F., Raan, A.F.J. van., (1991). "Mapping of science by combined co-citation and word analysis. I. Structural aspects." Journal of the American Society for Information Science, 42(4), 233-251.

Brin, S., Page, L., (1998). "The anatomy of a large-scale hypertextual web search engine." Computer Networks and ISDN Systems, 30(1-7), 107-117.

Darmoni, S.J., Roussel, F., Benichou, J., Thirion, B., Pinhas, N., (2002). "Reading factor: a new bibliometric criterion for managing digital libraries." Journal of the Medical Library Association, 90(3), 323-327.

Everett, J.E., Pecotich, A., (1991). "A combined loglinear/MDS model for mapping journals by citation analysis." Journal of the American Society for Information Science, 42(6), 405-413.

Garfield, E., (1979). Citation indexing: Its theory and application in science, technology, and humanities. New York: John Wiley and Sons.

Geisler, G., Marchionini, G., Nelson, M., Spinks, R., Yang, M., (2001). "Interface concepts for the open video project." In ASIST 2001: Proceedings of the 64th ASIST annual meetings, v38 (58-75), ASIST.

Gibson, D., Kleinberg, J., Raghavan, P., (1998). "Inferring web communities from link topology." In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: links, objects, time and space - structure in hypermedia systems (225-234). ACM Press.

Haykin, S., (1999). Neural networks: A comprehensive foundation. New Jersey, USA: Prentice Hall.

Hebb, D.O., (1949). The organization of behavior. New York: John Wiley.

Kaplan, N.R., Nelson, M.L., (2000). "Determining the publication impact of a digital library." Journal of the American Society for Information Science, 51(4), 324-339.

Kleinberg, J., Kumar, S., Raghavan, P., Rajagopalan, S., Tomkins, A., (1999). "The web as a graph: Measurements, models and methods." In T. Asano, H. Imai, D.T. Lee, S.-I. Nakano, T. Tokuyama (Eds.), Computing and Combinatorics, 5th Annual International Conference, COCOON'99, (1627, 1-17), Tokyo, Japan: Springer.

Kleinberg, J.M., (1999). "Hubs, authorities, and communities." ACM Computing Surveys (CSUR), 31(4es), 5.

Slaughter, L., Marchionini, G., Geisler, G., (2000). "Open video: A framework for a test collection." Journal of Network and Computer Applications, 23(3), 219-245.

DOI: 10.1045/may2003-bollen

Copyright © Johan Bollen, Rick Luce, Soma Sekhara Vemulapalli, and Weining Xu