Gilbert: A Simulation of the Structure of Academic Science


Gilbert, N. (1997) 'A Simulation of the Structure of Academic Science'
Sociological Research Online, vol. 2, no. 2, <http://www.socresonline.org.uk/2/2/3.html>

To cite articles published in Sociological Research Online, please reference the above information and include paragraph numbers if necessary

Received: 11/2/97 Accepted: 19/5/97 Published: 30/6/97

Abstract

The contemporary structure of scientific activity, including the publication of papers in academic journals, citation behaviour, the clustering of research into specialties and so on has been intensively studied over the last fifty years. A number of quantitative relationships between aspects of the system have been observed.

This paper reports on a simulation designed to see whether it is possible to reproduce the form of these observed relationships using a small number of simple assumptions. The simulation succeeds in generating a specialty structure with 'areas' of science displaying growth and decline. It also reproduces Lotka's Law concerning the distribution of citations among authors.

The simulation suggests that it is possible to generate many of the quantitative features of the present structure of science and that one way of looking at scientific activity is as a system in which scientific papers generate further papers, with authors (scientists) playing a necessary but incidental role. The theoretical implications of these suggestions are briefly explored.

Keywords:

Lotka's Law; Social Simulation; Sociology of Science; Sociometrics; Specialties

Introduction

1.1: As several of the pioneers of the sociology of science realized, science makes an excellent target for sociological examination. Compared with many other institutions such as crime or the family, science is reasonably clearly bounded in both time and location, and it leaves many traces behind for the analyst to examine: most obviously, the scientific literature, but also scientific biographies, financial records, and increasingly nowadays, administrative statistics.
1.2: The relative ease of access to such data was itself an inspiration to early commentators on science. A major concern during the early days of social studies of science was measurement of the extent and the growth of science (e.g. Hagstrom, 1965; Crane, 1972; Gilbert and Woolgar, 1974). There was also a keen interest in discovering 'laws' which related various quantities. More recently, such quantitative studies of science have become unfashionable, at least among those who see themselves as trying to understand science and scientific knowledge. Although there remains a substantial body of researchers who continue to engage in quantitative studies, the motivation for this work is almost always policy-related. Resisting fashion, this paper will take a quantitative approach to studying the structure of contemporary science, but do so from a new, but increasingly influential methodological perspective. It will use simulation to posit a generating structure for science, showing that this yields many of the quantitative relationships which were observed by the pioneers.
1.3: Computer simulation is still rather unusual as a method of investigation within sociology, although for some purposes it has advantages over other, more traditional methods. Most of the sociological research which has used simulation has seen it as a way of making predictions, for example about fiscal transfers in ten or twenty years' time, or about the state of the world system (see for example the microsimulation and system dynamics approaches reviewed briefly below). In this paper and in an increasing amount of current research, however, simulation is used in a quite different way: to explore theoretical possibilities, to undertake the equivalent of 'thought experiments' and to understand the limits of and constraints on social life by constructing 'artificial' societies. Simulation itself imposes no particular theoretical approach on the researcher: simulations may illuminate phenomenological as well as systems thinking; realist and relativist epistemologies; and fundamental issues as well as questions of policy significance. It does, however, require the researcher to specify his or her ideas with sufficient precision to enable a computer model to be constructed and this discipline can itself often yield unexpected insights.
1.4: The paper is concerned with the simulation of aspects of science in what might be considered to be its 'golden age', following the development of modern patterns of publication and citation and before the influence of 'Big Science' had come to have a major effect. If the simulation is successful for this ideal typical case, it may be extended to deal with other cases.
1.5: The next section outlines what simulation involves as a method of investigation. Simulation is then used to investigate a mechanism which might underlie 'Lotka's Law', one of the first and most consistent of the quantitative relationships to be observed in science. The fourth section of the paper describes a simulation of scientific publication which reproduces some of the observable features of science. The final section uses the simulation to suggest some conclusions about science as an institutional structure.

Simulation as a Method

2.1: Simulation consists of the construction of (computational) models of social phenomena. Like statistical models, a simulation model is designed to be an abstraction from and simplification of the 'target' system being modelled, but one which nevertheless reproduces significant features of the target (Gilbert and Doran, 1993). Most simulations in the social sciences have in practice been developed primarily for the purposes of prediction. For example, microsimulation models apply a set of conditional transition probability matrices to data on a sample of the population, in order to predict how the status (employment, income, and demographic characteristics) of individuals will have changed after one year, and then by repeating the simulation, how the individuals will have changed after several years. By aggregation, predictions about changes in the population as a whole may be made. The utility of this kind of prediction has mainly been in the field of social and fiscal policy, where it has been possible to draw conclusions about the long term effects of tax and welfare benefit policy changes (Harding, 1990; Hancock and Sutherland, 1992).
2.2: While microsimulation is certainly the most widely practised form of simulation in the social sciences, it is by no means the only one. System dynamics has spawned another research tradition in which the major focus has been on the prediction of global changes to the world system, taking account of environmental and population pressures (Forrester, 1971; Meadows et al, 1972; Meadows, 1992). It has also been applied more recently to other macro-level phenomena such as changes in attitudes and levels of deviance (Hanneman, 1988; Patrick et al, 1995; Jacobsen and Vanki, 1996). Over the last few years, there has been increasing interest in other approaches to simulation, especially as a result of developments in physics, mathematics and artificial intelligence. These differ from the earlier forms of simulation by modelling the mechanisms of social phenomena as well as their observable features, and by being much more concerned with theoretical development and explanation than with prediction.
2.3: Three strands of this new approach to simulation can be distinguished. First, there are simulations based on cellular automata (CA) models (Hegselmann, 1996). A CA consists of a grid of cells, each of which is in one of a small number of distinct states (e.g. 'on' or 'off', or 'dead' or 'alive'). Cells change state through the action of simple condition-action rules which depend only on the previous state of the cell in question and that of its immediate neighbours. CAs have been shown to be useful models of turbulent fluids, behaviour at the atomic level and some biological phenomena (Wolfram, 1986). They can also be used to model simple social behaviour. One of the earliest and best known examples is Schelling's (1971) work on segregation. He used a grid of cells coloured white or black at random, with a proportion of locations left vacant. Each cell was supposed to be satisfied only if it was surrounded by more than a minimum proportion of cells of the same colour as itself. He showed that cells of the same colour formed clusters on the grid even when these were content to be surrounded by much less than a majority of cells of their own colour. The simulation showed how 'ghettos' could develop even if no individual wanted to be among a majority of their own colour, and in particular how macro-level features (the clusters of one colour) could emerge as an unintended consequence of the actions of individuals. The simulation is also interesting as an example of modelling where the emphasis is not on precise quantitative prediction, but on the explanation of general processes in an essentially qualitative way.
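A minimal sketch of a Schelling-style segregation model may make the mechanism concrete. Several details are assumptions made for illustration and are not taken from the text: the grid wraps at its edges, the eight surrounding cells count as neighbours, a cell with no occupied neighbours counts as satisfied, and a dissatisfied cell moves to a vacant location chosen at random. The names `make_grid`, `satisfied` and `step` are hypothetical.

```python
import random


def make_grid(n, vacancy, rng):
    """Build an n x n grid of 'W', 'B' or None (vacant), chosen at random."""
    return [[None if rng.random() < vacancy else rng.choice("WB")
             for _ in range(n)] for _ in range(n)]


def neighbours(grid, r, c):
    """Return the eight surrounding cells (assumed: the grid wraps at the edges)."""
    n = len(grid)
    return [grid[(r + dr) % n][(c + dc) % n]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def satisfied(grid, r, c, threshold):
    """A cell is satisfied if more than `threshold` of its occupied neighbours share its colour."""
    occupied = [v for v in neighbours(grid, r, c) if v is not None]
    if not occupied:
        return True  # assumption: an isolated cell has nothing to object to
    same = sum(1 for v in occupied if v == grid[r][c])
    return same / len(occupied) > threshold


def step(grid, threshold, rng):
    """Sweep the grid once, moving each dissatisfied cell to a random vacant location."""
    n = len(grid)
    moved = 0
    for r in range(n):
        for c in range(n):
            if grid[r][c] is None or satisfied(grid, r, c, threshold):
                continue
            vacant = [(i, j) for i in range(n) for j in range(n) if grid[i][j] is None]
            if not vacant:
                return moved
            i, j = rng.choice(vacant)
            grid[i][j], grid[r][c] = grid[r][c], None
            moved += 1
    return moved
```

Repeated calls to `step` with a threshold well below one half are the setting in which Schelling reported clustering. The sketch only implements the local rule and leaves measurement of the clusters to the reader.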
2.4: The interest in emergence and process is also apparent in the second recent strand of simulation research. This borrows models and techniques from the field of distributed artificial intelligence (DAI). DAI is concerned with the construction of systems in which there are a number of AI programs ('agents') interacting with each other (O'Hare and Jennings, 1996). For example, DAI systems have been built to achieve load balancing and scheduling of work between networked computers. In actual implementations, rather than there being a number of distinct programs, agents are simulated by running one program which cycles through the code for each agent in turn. While DAI was originally most concerned with developing systems to handle complex engineering problems in which agents with different bodies of knowledge could work co-operatively together, the analogy between agents and people, and between systems of interacting agents and social groups was quickly made. Sociologists have been consulted for ideas about how society 'worked' and some sociologists in turn have seen DAI as offering a potentially powerful tool for examining social theories.
2.5: While CA models are capable of simulating only the most simple interactions and basic changes of state, DAI models are able to incorporate much more complex models of cognitive processes and communication. For example, agents may be constructed which develop 'internalized' models of other agents. The effects of 'false' beliefs about their world may be investigated and other issues related to the consequences of acquiring knowledge and belief may be studied (e.g. Doran et al, 1994; Doran and Palmer, 1995). However, the agents in most of these models do not incorporate any ability to learn. Learning, adaptation or evolution has been the central theme of the third strand of recent simulation research, based on models that use either artificial neural nets or genetic algorithms. These two techniques, although completely different, are both capable of simulating learning in its most general sense.
2.6: Artificial neural nets are founded on an analogy with the functioning of neurons in animal brains (Rumelhart and McClelland, 1986). Given a set of training data applied as input to the net and a corresponding set of target data, artificial neural nets can learn to associate given outputs with specific inputs. Following training, a properly set up neural net can identify the significant features of an input and use those to generate its output, thus allowing it to be used as a powerful and flexible pattern matcher. Neural nets have been linked together to form 'groups' or 'societies' and phenomena such as the development of a common lexicon for interaction between the nets (Hutchins and Hazlehurst, 1995) and the development of 'altruistic' behaviour have been demonstrated (Parisi et al, 1995).
2.7: Genetic algorithms (GA) are also loosely based on a biological analogy. A large population of individuals is generated, each endowed with a 'gene' assigned at random, and then sorted according to their 'fitness' using a measure of fitness specific to the problem. The fitter individuals are then used to 'breed' offspring, through both asexual and sexual reproduction, the latter involving the mixing of the parents' genes to form the child's gene. An element of mutation (random alteration of a small proportion of the genes in the gene pool) is also usually applied. The new generation thus created is then assessed against the fitness measure and the process continued through many generations. Normally, the average fitness of successive generations increases until some optimum is achieved (Davis, 1991). While genetic algorithms are most often used as a general purpose optimization technique, they can also be used for the simulation of some kinds of social learning or evolution. Some studies, especially in fields such as ecology, take the analogy with evolution very seriously; in others the GA model is treated as a 'black box' capable of developing individuals which optimize some externally given criterion, the fitness measure, and the analogy with evolution is of less interest (Chattoe, 1997).
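The breeding cycle just described can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the fitness measure is a toy one (the number of 1 bits in the gene), the fitter half of the population supplies the parents, sexual reproduction is single-point crossover, and asexual reproduction is reduced to copying the single fittest individual unchanged into the next generation. The names `fitness` and `next_generation` are hypothetical.

```python
import random


def fitness(gene):
    """Toy fitness measure (an assumption): the number of 1 bits in the gene."""
    return sum(gene)


def next_generation(population, rng, mutation_rate=0.01):
    """Breed a new population of the same size from the fitter half of the old one."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: len(ranked) // 2]
    # Asexual reproduction, simplified: the fittest individual is copied unchanged.
    children = [ranked[0][:]]
    while len(children) < len(population):
        mother, father = rng.choice(parents), rng.choice(parents)
        # Sexual reproduction: mix the parents' genes at a random cut point.
        cut = rng.randrange(1, len(mother))
        child = mother[:cut] + father[cut:]
        # Mutation: flip a small proportion of bits at random.
        child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
        children.append(child)
    return children
```

Because the fittest individual is always carried over, the best fitness in this sketch can never fall from one generation to the next. The sketch expects at least two individuals and genes at least two bits long.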

Artificial Societies

3.1: Simulations based on any of these three approaches have tended to eschew mimicking the target social phenomenon in favour of simulating what are judged to be the most significant underlying processes at work. For example, the simulations developed by Latane and colleagues (Nowak and Latane, 1993) have helped him to develop, refine and extend his theory of dynamic social impact. There has been no attempt to simulate any actual instance of the impact of others' opinions on individual attitudes, although Latane has also carried out social psychological experiments in several domains which have tended to confirm the conclusions he has drawn from the simulations (Latane, 1996). Thus one way of regarding recent simulation work is as a type of theory development, in which the simulation is treated as a method for the formalization of theory. Once the theory has been refined, it can then be tested deductively in the usual way. This approach stands in contrast with the use of simulation for prediction mentioned above. The criteria which should be used to judge the work differ considerably for the two cases: essentially, simulation for theory development is effective if it is productive of new insights, while simulation for prediction is of course most useful if it gives accurate forecasts. In work aiming at theoretical development, the focus is not on 'what might happen', but rather 'what are the necessary and sufficient conditions for a given result to be obtained?'. In practice, this question is enlarged to consider not just the conditions required for an outcome, but the processes which could lead to that outcome (Gilbert, forthcoming).
3.2: A further extension of the move to use simulation for theory development is research on what have come to be called 'artificial societies' (Conte and Gilbert, 1995). Although still in its infancy, this mode of research aims not at simulating a natural society, but rather an artificial society which may exist only in the mind of the researcher. The objective is to understand the implications of social processes in possible worlds, thus illuminating the effects of the actual processes in our own world. Using artificial societies one can envisage social research at last gaining some of the benefits of experimentation, as one can build societies designed to test to the limits the implications of a theory.
3.3: While it has generally been acknowledged that social theory should be centrally concerned with process, theorists have often found it difficult to develop a suitable vocabulary for expressing process-orientated concepts, while process-relevant empirical data has proved to be even harder to obtain. As a result, much of modern sociology pays mere lip service to a processual outlook. One of the major benefits of computational simulation is that it offers a formal lexicon and conceptual system for thinking about process, and a method for experimenting with processual theories. As a first stage in this way of thinking about societies, some theorists have begun to identify characteristic or fundamental processes which can be seen to be manifested in a wide range of different social domains (Kontopoulos, 1993). Examples include the process underlying Schelling's segregation model mentioned above (which may also be important in many instances of social sorting, from ethnic segregation to maintaining class divisions), the Matthew Effect (Merton, 1968), and the process generating the distribution variously known as Lotka's Law, the Zipf distribution or the rank-size rule.
3.4: In the next section we shall illustrate the way that a simulation approach allows one to theorize about process by offering a simple interpretation of the basis of Lotka's Law in computational terms. In the following sections, we shall extend this simulation to consider further aspects of the structure of contemporary science.

Simulating Lotka's Law

4.1

It has long been known that the number of authors contributing a given number of papers to a scientific journal^[1] follows the Zipf distribution (e.g. Davis, 1941; Zipf, 1949). As Simon (1957) noted in an influential paper, this distribution is also common to a number of other phenomena, including word frequencies in large samples of prose, city sizes and income distributions. In all these examples, the distribution is highly skewed, with the frequencies of occurrence following an inverse power law. Lotka (1926) demonstrated that for scientists publishing in journals, the number of authors is inversely proportional to the square of the number of papers published by those authors. Simon shows that the Zipf distribution differs from the better known negative binomial and log series distributions and derives a stochastic process which could lead to the empirical distribution.

4.2

Simon's stochastic process is encapsulated by two propositions (Simon, 1957: p. 148), here expressed in terms of published papers: first, that the probability that the next paper is a paper by a given author who has already published i times is proportional to the number of authors that have contributed exactly i papers to the journal; and secondly, that there is a constant probability, alpha, that the next author is an author who has not previously published in the journal. From these two propositions, Simon is able to derive the formula for the Zipf distribution:

$$f(i) = a\,i^{-k}$$

where f(i) is the number of authors who have each contributed i papers and a is a constant of proportionality. For authors of scientific papers, k is approximately equal to 2. Also, if n is the number of authors and p is the number of papers,

$$\alpha = n/p$$
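As a quick illustration of what the inverse-square form implies, the following worked ratios take k to be exactly 2 (the text says only that it is approximately 2) and write f(i) for the number of authors with i papers, a notation assumed here. The observed ratios use the 'Actual' counts for Chemical Abstracts reported in Table 1.

```latex
\frac{f(i)}{f(1)} = i^{-2}
\quad\Longrightarrow\quad
\frac{f(2)}{f(1)} = \frac{1}{4} = 0.25,
\qquad
\frac{f(3)}{f(1)} = \frac{1}{9} \approx 0.111

\text{Observed (Chemical Abstracts):}\quad
\frac{1059}{3991} \approx 0.265,
\qquad
\frac{493}{3991} \approx 0.124
```

The observed ratios lie a little above the pure inverse-square values, which is in keeping with k being only approximately 2.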

4.3

Simon's algebraic derivation still leaves open the question of the mechanism by which the observed distribution is generated. What, we might ask, is the process by which the number of authors publishing a given number of papers in a journal is found so regularly to be a Zipf distribution?

4.4

The process required to generate the distribution is in fact rather simple. We set up a model to generate new 'publications' and assign authorship to each publication using the following rules which follow directly from Simon's propositions:

Select a random number from a uniform distribution from 0 to 1. If this number is less than alpha, give the publication a new (i.e. previously unpublished) author.
If the publication is not from a new author, select a paper randomly from those previously published and give the new publication the same author as the one so selected.
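The two rules above can be sketched as follows. This is an illustrative reading of the rules, not the code from the Appendix. The function names and the use of integer author identifiers are assumptions, as is the handling of the very first publication, which has to take a new author because there is no earlier paper to copy from.

```python
import random
from collections import Counter


def simulate_authorship(num_papers, alpha, seed=None):
    """Return a list giving the author id assigned to each publication in turn."""
    rng = random.Random(seed)
    paper_authors = []  # paper_authors[j] is the author of the j-th publication
    next_author = 0
    for _ in range(num_papers):
        # Rule 1: with probability alpha the publication gets a new author
        # (assumed: the first publication always does).
        if not paper_authors or rng.random() < alpha:
            author = next_author
            next_author += 1
        else:
            # Rule 2: pick an earlier paper at random and reuse its author.
            author = rng.choice(paper_authors)
        paper_authors.append(author)
    return paper_authors


def authors_by_output(paper_authors):
    """Map i to the number of authors who have exactly i publications."""
    papers_per_author = Counter(paper_authors)
    return Counter(papers_per_author.values())
```

Because rule 2 samples papers rather than authors, an author with many papers is proportionally more likely to be chosen again, which is what produces the skewed distribution. Calling `authors_by_output(simulate_authorship(2000, 0.41, seed=1))` gives one such distribution, and results vary from seed to seed.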

4.5

That these two rules give the desired result can be seen by comparing data reported by Simon for authors contributing to Chemical Abstracts and Econometrica with the results obtained from the simulation (see the Appendix). Table 1 shows the data, the estimates from Simon's formulae, and the mean of 10 runs of the simulation (because the simulation selects authors using a random process, the simulation distributions vary slightly from run to run).

Table 1: Number of Authors Contributing to Two Journals, by Number of Papers each has Published: Empirical, Analytical and Simulation Results

	*Chemical Abstracts*			*Econometrica*
Number of contributions	Actual	Simon's estimate	Simulation	Actual	Simon's estimate	Simulation
1	3991	4050	4066	436	453	458
2	1059	1160	1175	107	119	120
3	493	522	526	61	51	51
4	287	288	302	40	27	27
5	184	179	176	14	16	17
6	131	120	122	23	11	9
7	113	86	93	6	7	7
8	85	64	63	11	5	6
9	64	49	50	1	4	4
10	65	38	45	0	3	2
11 or more	419	335	273	22	25	18
alpha	0.30			0.41

4.6

One of the interesting features of this process is that it makes clear that the generation of the distribution is focused on papers, not authors. That is, rather than the simulation modelling authors contributing papers to journals, the simulation models a paper selecting its own author. This is not the way we naturally think of science, where the actors are considered to be the scientists, not the scientific papers. Yet, as a model for science, conceiving of scientists as merely the tools by which scientific papers get generated could have some heuristic value, an idea we shall pursue in the next section where we consider a more complex simulation of science.

Science as an Evolutionary Process

5.1: As we noted in the introduction, contemporary science exhibits a number of regularities in the relationships between quantitative indicators of its growth. The classic source for these relationships is de Solla Price's (1963) lectures on Little Science, Big Science, in which he argues that there is evidence of a qualitative change in science from traditional 'little' science to the big science of large research teams and expensive research equipment. In making this argument, de Solla Price summarizes well what was then known about the structure of 'little' science. In the following thirty years, the dramatic changes which de Solla Price envisaged have generally not occurred (with some exceptions) and his summary is therefore still useful.
5.2: The central theme of de Solla Price's book is that science is growing exponentially, with a doubling time of between 10 and 20 years, depending on the indicator. For him, the fundamental characteristic of science is the publication of research papers in academic journals. He notes that papers always include references to other papers in the scientific literature, with a mean of 10 references per paper. The number of journals has followed an exponential growth curve with a doubling every 15 years since the mid-eighteenth century. There is approximately one journal for every 100 scientists (where a scientist is someone who has published a scientific paper) and scientists divide themselves up into 'invisible colleges' of roughly this size.
5.3: References tend to be to the most recent literature. Half of the references made in a large sample of papers would be to other papers published not more than 9 to 15 years previously. However, because the number of papers is growing exponentially, every paper has on average an approximately equal chance of being cited, although there are large variations in the number of citations different papers receive.
5.4: These observed regularities constitute the criteria with which to judge the simulation. The task is to develop a model which will reproduce these regularities from a small set of plausible assumptions.

The Simulation

6.1

At the heart of the model is the idea that science as an institution can be characterized by 'papers', each proposing a new quantum of knowledge, and 'authors' who write papers. The simulation will model the generation of papers by authors.

6.2

The first assumption we make is that the simulation may proceed without reference to any external 'objective reality'. We shall simulate scientific papers each of which will capture some quantum of 'knowledge', but the constraints on this knowledge will be entirely internal to the model. To represent a quantum of knowledge, we shall use a sequence of bits. The bit sequences representing quanta of knowledge will be called 'kenes', a neologism intentionally similar to 'genes'.

6.3

Kenes could in principle consist of arbitrary bit sequences of indefinite length. However, we shall want to portray 'science' graphically in a convenient fashion and this means locating kenes in space. Since arbitrary bit sequences can only be mapped into spaces of indefinite dimensionality, we impose a rather strict limit on which sequences are allowable, purely for the purposes of permitting graphical displays. We require that each kene is composed of two sub-sequences of equal length, and we treat each sub-sequence as a representation of a coordinate on a plane. This restriction on kenes is substantial, but does not affect the logic of the simulation while making it much easier to see what is going on. It should be emphasized that the requirement that kenes can be mapped into a plane is not part of the model of the structure of science and could be relaxed in further work.

6.4

As a consequence of the fact that kenes can be decomposed into two coordinates, every kene can be assigned a position on the plane. Since each paper contains knowledge represented by one kene, that kene can stand for the paper and in particular, papers can also be located on the plane. In the simulation, each kene is composed of two coordinates, each 16 bits in length, giving a total 'scientific universe' of (2^16)^2 = 4,294,967,296 potential kenes, that is, an essentially infinite number compared with the number of papers generated during one run of the simulation. Authors can also be positioned on the plane according to the location of their latest paper.
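A small sketch of this representation, assuming the kene is held as a single 32-bit integer whose upper 16 bits give the x coordinate and whose lower 16 bits give the y coordinate. The text does not say which half is which, so that ordering, like the function names, is an assumption.

```python
BITS = 16                  # bits per coordinate, as stated in the text
MASK = (1 << BITS) - 1     # 0xFFFF, the largest value a coordinate can take


def kene_to_xy(kene):
    """Split a 32-bit kene into its (x, y) coordinates, each in 0..65535."""
    return (kene >> BITS) & MASK, kene & MASK


def xy_to_kene(x, y):
    """Pack two 16-bit coordinates back into a single kene."""
    return ((x & MASK) << BITS) | (y & MASK)


# Size of the 'scientific universe': every distinct pair of coordinates.
UNIVERSE = 1 << (2 * BITS)
```

Here `UNIVERSE` evaluates to 4,294,967,296, the figure quoted above.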

6.5

One of the principal constraints on publication in science is that no two papers may be published which contain the same or similar knowledge. This amounts to the requirement that no two papers have identical kenes. In the model we extend this to require that no two papers have kenes which are 'similar', where similarity is measured by the distance between the kenes (a paper is deemed original if it lies more than delta coordinate units away from any other paper, where delta and the other Greek symbols below are numerical parameters set at the start of the simulation). Since distance is a well defined notion even in multi-dimensional space, the idea that kenes and thus papers can be close does not depend on the requirement that kenes must be located on a plane.
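The originality test can be sketched as a distance check against every published paper. The text speaks only of 'distance', so the use of Euclidean distance here is an assumption, as are the function name and the simple linear scan over all published papers.

```python
import math


def is_original(candidate, published, delta):
    """True if candidate (x, y) lies more than delta from every published (x, y)."""
    cx, cy = candidate
    return all(math.hypot(cx - px, cy - py) > delta for px, py in published)
```

For example, a candidate at (0, 0) is 500 units from a published paper at (300, 400), so it counts as original when delta is 400 but not when delta is 500.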

6.6

So far, we have defined the three essential entities in the model: papers, authors and kenes. Next we need to consider the basic processes which give rise to these entities. In line with the simulation of Lotka's Law described in the previous section, we propose that it is papers which give rise to further papers, with authors adopting only an incidental role in the process. A 'generator' paper is selected at random from those papers already published whose authors are still active in science. This spawns a new potential paper as a copy of itself, with the same kene. The new paper then selects a set of other papers to cite by randomly choosing papers located within the region of its generator paper (the 'region' is defined as the area within a circle whose radius is one of the numerical parameters of the simulation). Each of the cited papers modifies the generator kene to some extent. The first such paper has the most influence on the paper's kene, with successive citations having a decreasing effect. A spatial way of thinking about the process is that each cited paper 'pulls' the kene from its original location some way towards the location of the cited kene.

6.7

More precisely, the x coordinate of the new paper (p) is affected by the cited paper (c) thus:

where m is a value between zero and one which increases randomly but monotonically for each successive citation. A similar equation determines the new y coordinate.
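A minimal sketch of one update rule consistent with this description follows. It assumes a linear pull in which each cited paper moves the new kene a fraction (1 - m) of the way towards itself, so that the pull weakens as m rises. Both that weighting and the way m is increased are assumptions made for illustration. They are not the paper's own equation, and the name `apply_citations` is hypothetical.

```python
import random


def apply_citations(generator_xy, cited_xys, rng):
    """Pull a copy of the generator's coordinates towards each cited paper in turn."""
    x, y = generator_xy
    m = 0.0
    for cx, cy in cited_xys:
        # Assumed: m rises by a random share of its remaining distance to 1,
        # so it stays between zero and one and never decreases.
        m += (1.0 - m) * rng.random()
        weight = 1.0 - m  # assumed form: earlier citations pull harder
        x += weight * (cx - x)
        y += weight * (cy - y)
    return x, y
```

With this form a citation can never move the kene past the cited paper, and a generator that cites only papers at its own location is left unchanged.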

6.8

The result is a kene which is somewhat changed compared with the generator kene. If the changes are sufficient, the new kene will no longer be close to the generator kene. If the new kene is also not close to the kene of any previously published paper, it can be considered to be original and can be 'published'. If, however, the new kene is similar to a previous paper, the publication is abandoned.

6.9

Thus papers generate new papers which combine the influence of the generator paper with the papers it cites. Finally, publishable papers choose an author, following the procedure outlined above for the simulation of Lotka's Law. A proportion (alpha) of papers choose a new, previously unpublished author and the rest are assigned the author of the generator paper.

6.10

An increasing number of papers are generated at each time step since there is a small constant probability of each published paper acting as a generator for a further paper at the next time step.
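The growth this rule implies can be approximated with a short calculation. The sketch assumes that every spawned paper is published and that no author retires, so it gives an upper bound on growth rather than a prediction of the simulation's output. The function names are hypothetical.

```python
import math


def expected_papers(initial, prob, steps):
    """Expected number of papers after `steps` time steps if each spawns one with probability `prob`."""
    return initial * (1.0 + prob) ** steps


def doubling_time(prob):
    """Number of time steps for the expected paper count to double."""
    return math.log(2.0) / math.log(1.0 + prob)
```

Taking a per-step probability of 0.0025 purely as an illustrative value, `doubling_time(0.0025)` comes to roughly 278 time steps. Abandoned papers and retirements would lengthen this.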

6.11

The rules concerning authors are much simpler. Authors remain in science from the time they publish their first paper until retirement. They are modelled as retiring when the duration of their time in science exceeds a value drawn from a uniform distribution between 0 and a maximum number of time units, which is one of the numerical parameters of the simulation.
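The retirement rule can be sketched as below. It assumes that each author's career length is drawn once, when the author first publishes, and then compared with the elapsed time at each step. The text does not say when the draw is made, and the names `draw_career_length` and `is_active` are hypothetical.

```python
import random


def draw_career_length(max_career, rng):
    """Draw a career length uniformly between 0 and max_career time units."""
    return rng.uniform(0.0, max_career)


def is_active(first_paper_time, now, career_length):
    """An author retires once time in science exceeds the drawn career length."""
    return (now - first_paper_time) <= career_length
```

Only papers whose authors are still active would then be eligible to act as generators.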

6.12

This section has described the rules which determine the generation of papers and authors in the simulation. It may be noticed that the rules are local and at the 'micro' level. That is, they make no reference to the overall state of the simulation and do not refer to aggregate properties. Papers, for example, cite other papers in their region without reference to whether that locality is relatively dense or thinly spread, or to the positions of papers outside the neighbourhood.

6.13

The question which now arises is whether this micro-level behaviour can give rise to macro-level properties corresponding to the regularities which de Solla Price noted and which were summarized above.

Results

7.1

Figure 1 shows an animation of the three-dimensional display from one run of the simulation. The plane marked out by the square is the surface on which the kenes are projected, while the third dimension, orthogonal to the square, represents time. The simulation has been run for 1000 time steps. Each white dot represents one paper. The position of the dot is given by the x and y coordinates of the paper's kene and the z coordinate is determined by the time of publication.

Figure 1: The simulation output

7.2

The most evident feature of the display is the clustering of papers in the plane. While a few of these clusters are present at time zero, most develop gradually and some then fade away again with the passing of time.

7.3

The parameters listed in Table 2 were used for the simulation run in Figure 1.

Table 2: Parameters for the Simulation Run

Parameter	Value
alpha	0.41
	400
	7000
	480
	0.0025

7.4

A sensitivity analysis shows that variations in the values of these parameters of less than a factor of 2 make little difference to the form of the output. Although the precise number and location of the clusters vary from run to run, every run develops such clusters.

7.5

More papers are published per unit time towards the end of the run than at the beginning, because of the propensity of the increasing number of papers to spawn further papers. The rate of growth of papers is plotted in Figure 2 and can be seen clearly in Figure 1 where the density of dots on the plane at the end of the simulation period is considerably greater than at the beginning.

Figure 2: Cumulative Papers per Time Unit

7.6

Figures 3, 4 and 5 show some aspects of the aggregate features of the results of the simulation. Figure 3 indicates the distribution of the number of references per published paper. The distribution has a mean of 11.2, indicating that the average paper includes about 11 references to other papers.

Figure 3: References per Paper

7.7

Figure 4 displays the distribution of the number of papers 'written' by each of the 1,539 authors. The distribution is highly skewed, with the majority of authors publishing but one paper. We expect this distribution to follow Lotka's Law, and Table 3 compares the distribution obtained from the simulation and plotted in Figure 4 with the distribution predicted from Simon's formula for the Zipf distribution.

Figure 4: Papers per Author

Table 3: Papers per Author from the Simulation, Compared with the Zipf Distribution

Number of Papers	Simulation	Zipf Distribution
1	667	717
2	218	193
3	93	82
4	56	43
5	29	26
6	20	17
7	11	12
8	9	9
9	12	7
10	3	5
11 or more	15	24

(n, the number of authors, is 1,539; p, the number of papers, is 3,703; and alpha is 0.41)
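The theoretical column can be approximated with a short calculation. The sketch below assumes that Simon's formula takes the Yule-Simon form f(i) = n * rho * B(i, rho + 1), where rho = 1 / (1 - alpha) and B is the Beta function. The function names are illustrative, and this is not the calculation that produced Table 3.

```python
from math import exp, lgamma


def beta(a, b):
    """Beta function B(a, b), computed through log-gamma for stability."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))


def simon_expected(n_authors, alpha, max_papers=10):
    """Expected number of authors with 1..max_papers papers, followed by
    one final entry for authors with more than max_papers papers.

    Assumes f(i) = n * rho * B(i, rho + 1) with rho = 1 / (1 - alpha).
    """
    rho = 1.0 / (1.0 - alpha)
    counts = [n_authors * rho * beta(i, rho + 1.0)
              for i in range(1, max_papers + 1)]
    # The distribution sums to n over all i, so the tail is what remains.
    counts.append(n_authors - sum(counts))
    return counts


# n_authors=1000 is an arbitrary illustrative total, not a value from the paper.
for papers, authors in enumerate(simon_expected(1000, 0.41), start=1):
    print(papers, round(authors, 1))
```

Under this form the ratio of two-paper authors to one-paper authors is 1 / (rho + 2), which is about 0.27 when alpha is 0.41, close to the ratio 193/717 in the Zipf column.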

7.8

The correspondence between the distribution from the simulation and that expected from Lotka's Law is quite close, but with somewhat fewer authors publishing one paper, and more publishing more than one paper, than the law predicts. This matches the empirical distributions (see Table 1), where there is a similar slight lack of fit between the theoretical and actual distributions. It is noteworthy that the differences are found in the same places: for both the simulation and the empirical data, the theoretical distribution is too high for the number of authors contributing just one paper.

Figure 5: Citations per Author

7.9

Finally, Figure 5 shows the distribution of citations received by each author. The distribution is again highly skewed, but does not follow a Zipf distribution. While most papers received no citations at all, one paper received 191 (the median number of citations received per author is 7).

Discussion

8.1

The primary aim of the simulation has been to see whether, given some simple, localized rules about the creation of scientific papers, one could construct a model which reproduced aspects of the observed structure of academic science, viewed at the macro scale. The previous section has shown that the main institutional regularities summarized by de Solla Price are indeed visible in the simulation.

8.2

The model is based on the principle that a paper can be considered to be the encapsulation of a quantum of knowledge, a kene, and that every paper corresponds to a different kene. Papers are generated from other papers: each new paper starts from the kene of its generator, which is then modified according to the kenes of the papers it cites. Nearby papers are cited more often than ones distant on the 'map' of knowledge. Papers select authors with constant probability. From these few principles, we have developed a model which gives plausible distributions for the number of papers per author, citations per author and references per paper, and an exponential growth rate for papers and authors.
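These principles can be made concrete with a rough sketch of a single generation step. It is an illustration under stated assumptions, not the program used for the results above. The hard citation `radius`, the `pull` fraction, the small `jitter` and the `Paper` class are all hypothetical, and the author rule is borrowed from the Appendix program.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Paper:
    x: float                  # first kene coordinate
    y: float                  # second kene coordinate
    author: int               # index of the author who 'wrote' the paper
    refs: list = field(default_factory=list)  # indices of cited papers


def spawn_paper(papers, n_authors, rng, alpha=0.41, radius=0.1,
                pull=0.1, jitter=0.001):
    """Append one new paper to `papers` and return the updated author count.

    All rules here are illustrative assumptions, not the paper's own code.
    """
    gen_index = rng.randrange(len(papers))
    gen = papers[gen_index]

    # Cite nearby papers only: a hard radius is a crude stand-in for
    # "nearby papers are cited more often than distant ones".
    refs = [i for i, p in enumerate(papers)
            if i != gen_index
            and math.hypot(p.x - gen.x, p.y - gen.y) <= radius]

    # Start from the generator's kene and shift it towards each cited kene.
    x, y = gen.x, gen.y
    for i in refs:
        x += pull * (papers[i].x - x)
        y += pull * (papers[i].y - y)
    # A small random offset so the new kene differs from its generator's.
    x += rng.uniform(-jitter, jitter)
    y += rng.uniform(-jitter, jitter)

    # A new author with probability alpha, otherwise the author of a
    # randomly chosen existing paper (the rule in the Appendix program).
    if rng.random() < alpha:
        author = n_authors
        n_authors += 1
    else:
        author = rng.choice(papers).author

    papers.append(Paper(x, y, author, refs))
    return n_authors


rng = random.Random(1)
papers = [Paper(rng.random(), rng.random(), author=i) for i in range(5)]
n_authors = 5
for _ in range(200):
    n_authors = spawn_paper(papers, n_authors, rng)
print(len(papers), n_authors)
```

Since each new kene is a weighted average of its generator and the papers it cites, plus a small offset, offspring stay close to their parents, which is the property that lets clusters form.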

8.3

An obvious characteristic of the results of the simulation, clearly visible in Figure 1, is the clustering of papers. One can readily identify such clusters with scientific specialties or what de Solla Price called 'invisible colleges'. The clusters are an example of the emergence of higher-level structures from lower-level behaviour. The correspondence with specialties extends to the following qualitative observations:

A cluster starts with a few early papers from which it develops.
The cluster grows at an increasing rate until a time when the area appears to become 'exhausted' and the rate of publication falls away.
Although clusters are visible, it is hard to see any definite boundary delimiting a cluster. While one can imagine devising a rule based on the local density of papers in the space in order to identify the boundary of a cluster, any such rule is bound to be arbitrary. This is also a characteristic of specialties, where scientists are often able to say that a paper is within a specialty but unable to specify rules to separate those within from those outside. As a result, quantitative conclusions about the growth of clusters or specialties are difficult to formulate.
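The arbitrariness of a density-based boundary rule can be seen in a small sketch. The rule below is only one possible choice: the `radius` and `min_neighbours` thresholds and the example points are hypothetical, not values from the simulation.

```python
import math


def local_density(points, index, radius):
    """Number of other points lying within `radius` of points[index]."""
    px, py = points[index]
    return sum(1 for j, (qx, qy) in enumerate(points)
               if j != index and math.hypot(qx - px, qy - py) <= radius)


def in_cluster(points, radius, min_neighbours):
    """Flag each point as inside a cluster if it has enough close neighbours."""
    return [local_density(points, i, radius) >= min_neighbours
            for i in range(len(points))]


# Three tightly grouped kenes, one on the fringe, and one isolated.
kenes = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.3, 0.0), (1.0, 1.0)]
print(in_cluster(kenes, radius=0.15, min_neighbours=2))
print(in_cluster(kenes, radius=0.35, min_neighbours=2))
```

With the smaller radius the fringe point falls outside the cluster, and with the larger radius it falls inside, although nothing about the papers themselves has changed.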

8.4

In a wider perspective, the simulation is interesting as a thought experiment: through constructing an artificial society with known rules of interaction, it becomes possible to demonstrate a possible 'logic' (Kontopoulos, 1993) or process which could give rise to the differentiation of knowledge products. Processes similar to the one demonstrated in the simulation could underlie the generation of other knowledge based and cultural products, from music to brands of toothpaste. The critical assumptions which apply as much to these fields as to science are that (a) there is no constraint on the composition of the artefacts (e.g. papers) from some 'external reality' and (b) that at least local judgements of similarity and difference between artefacts can be made.

8.5

In order to explore the model of science more easily, it was assumed that kenes could be mapped onto two-dimensional space. It was this that allowed the display of the simulation as a three-dimensional picture in Figure 1 and in turn allowed the visual identification of clusters. But as was pointed out in introducing the model, this assumption is too strong; there is no theoretical reason why kenes should be constrained to be those which can be mapped into two dimensions. The implications of relaxing this constraint, treating kenes as arbitrary bit strings, need to be examined. This would allow the combination rule for evolving kenes from a generator paper and its citations to be altered to resemble more closely the crossover rules used in genetic algorithms. A genetic algorithm crossover rule for the 'breeding' of an offspring gene from two parents cuts both parents' genes at a randomly chosen point and then assembles the offspring gene from the 'left' section from one parent and the 'right' section from the other parent. A rule of this nature could not be used in the simulation reported here because the result of such a combination would have meant that the offspring kene could be located anywhere within the space, while the model requires that an offspring should be close to its parents. However, in multi-dimensional space, this no longer applies: an offspring born of crossover remains close to its parents, provided that the notion of distance used is appropriate to the multi-dimensional space. The question which then arises is whether the behaviour of such a multi-dimensional model is qualitatively different from that of the two-dimensional model explored in this paper.
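The crossover rule described here can be sketched for kenes represented as bit strings. The sketch assumes Hamming distance as the 'appropriate' notion of distance, which is a choice made for this illustration; the text does not name a specific metric. The function names and the 32-bit length are likewise hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: left section from parent_a, right from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    cut = rng.randrange(1, len(parent_a))  # cut point strictly inside the string
    return parent_a[:cut] + parent_b[cut:]


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


rng = random.Random(0)
mother = "".join(rng.choice("01") for _ in range(32))
father = "".join(rng.choice("01") for _ in range(32))
child = crossover(mother, father, rng)

# The child differs from each parent only where the parents differ from
# each other, so it can never be further from a parent than they are apart.
print(hamming(mother, father), hamming(child, mother), hamming(child, father))
```

Under this metric the child's distances to its two parents always add up to the distance between the parents, so the offspring lies 'between' them in the bit-string space.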

8.6

The model of science we have developed is of course an abstraction, as with any model, whether intended for simulation or not. The model succeeds to the extent that the processes underlying the institution of science are illuminated, not because it manages to mimic the precise quantitative patterns of science. Differences between the simulation and any empirical example of science may be attributed to the effects of abstraction, the actual variation from one area of science to another, and to random differences between instances of the simulation, as well as to errors in the model. In the case of the model presented here, the abstraction has been extreme and it may be thought that as a result the significant and interesting aspects of science have been left out. For example, the model entirely neglects the construction which scientists put on their own and their fellow scientists' actions. Specialties are not just clusters of similar papers, but also sources of pride and belonging, battlegrounds and resources for scientists. These and similar features of science remain as challenges for the future.

Notes

¹ To obtain this distribution for a particular journal, one collects a list of all the papers published in the journal since it was founded, and counts the number of authors who wrote (or co-wrote) one paper in the list, the number who wrote two, and so on.

Acknowledgements

I am grateful to Diana Hicks and Sylvan Katz for the discussion which was the initial impetus for this simulation, to Edmund Chattoe for continuing conversations about computational simulation, and to all my previous colleagues in SSK for whom this may all be heresy or worse. A version of the paper was presented to the EASST-4S Joint Conference, Bielefeld, October 10 - 13, 1996.

References

CHATTOE, E. (1997) 'Modeling Economic Interaction Using a Genetic Algorithm', Handbook of Evolutionary Computation. Oxford: Oxford University Press.

CONTE, R. and G. N. GILBERT (1995) 'Introduction' in G. N. Gilbert and R. Conte (editors) Artificial Societies: The Computer Simulation of Social Life. London: UCL.

CRANE, D. (1972) Invisible Colleges. Chicago: University of Chicago Press.

DAVIS, H. T. (1941) The Analysis of Economic Time Series. Principia Press.

DAVIS, L. (1991) Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold.

de SOLLA PRICE, D. (1963) Little Science, Big Science. New York: Columbia University Press.

DORAN, J. and M. PALMER (1995) 'The EOS Project: Integrating Two Models of Palaeolithic Social Change' in N. Gilbert and R. Conte (editors) Artificial Societies. London: UCL Press.

DORAN, J., M. PALMER, G. N. GILBERT, et al (1994) 'The EOS Project: modelling Upper Palaeolithic social change' Simulating Societies: The Computer Simulation of Social Phenomena. London: UCL Press.

FORRESTER, J. W. (1971) World Dynamics. Cambridge, MA: Wright-Allen.

GILBERT, G.N. (forthcoming) 'The Simulation of Social Processes' in Coppock, J.T. (editor) (1997) Information Technology and Scholarly Disciplines. London: British Academy.

GILBERT, G. N. and J. DORAN (editors) (1993) Simulating Societies: The Computer Simulation of Social Phenomena. London: UCL Press.

GILBERT, G. N. and S. WOOLGAR (1974) 'The Quantitative Study of Science', Science Studies, vol. 4, pp. 279 - 294.

HAGSTROM, W. O. (1965) The Scientific Community. New York: Basic Books.

HANCOCK, R. and H. SUTHERLAND (1992) Microsimulation Models for Public Policy Analysis: New Frontiers. London: Suntory-Toyota International Centre for Economics and Related Disciplines.

HANNEMAN, R. (1988) Computer Aided Theory Construction. Beverly Hills: Sage.

HARDING, A. (1990) Dynamic Microsimulation Models: Problems and Prospects. Discussion Paper 48, London School of Economics Welfare State Programme.

HEGSELMANN, R. (1996) 'Cellular Automata in the Social Sciences: Perspectives, Restrictions and Artefacts' in R. Hegselmann, U. Mueller and K. G. Troitzsch (editors) Modelling and Simulation in the Social Sciences from the Philosophy of Science Point of View. Berlin: Springer-Verlag.

HUTCHINS, E. and B. HAZLEHURST (1995) 'How to Invent a Lexicon: The Development of Shared Symbols in Interaction' in G. N. Gilbert and R. Conte (editors) Artificial Societies. London: UCL Press.

JACOBSEN, C. and T. VANKI (1996) 'Violating an Occupational Sex-Stereotype: Israeli Women Earning Engineering Degrees', Sociological Research Online, vol. 1, no. 4, <http://www.socresonline.org.uk/socresonline/1/4/3.html>.

KONTOPOULOS, K. M. (1993) The Logics of Social Structure. Cambridge: Cambridge University Press.

LATANÉ, B. (1996) 'Dynamic Social Impact' in R. Hegselmann, U. Mueller and K. G. Troitzsch (editors) Modelling and Simulation in the Social Sciences from the Philosophy of Science Point of View. Berlin: Springer-Verlag.

LOTKA, A. J. (1926) 'The Frequency Distribution of Scientific Productivity', Journal of the Washington Academy of Sciences, vol. 16, p. 317.

MEADOWS, D. H. (1992) Beyond the Limits: Global Collapse or a Sustainable Future. London: Earthscan.

MEADOWS, D. H., D. L. MEADOWS, J. RANDERS, et al (1972) The Limits to Growth. London: Earth Island.

MERTON, R. K. (1968) 'The Matthew Effect in Science', Science, vol. 159(3810), pp. 56 - 63.

NOWAK, A. and B. LATANÉ (1993) 'Simulating the Emergence of Social Order from Individual Behaviour' in N. Gilbert and J. Doran (editors) Simulating Societies: The Computer Simulation of Social Phenomena. London: UCL Press.

O'HARE, G. and N. JENNINGS (1996) Foundations of Distributed Artificial Intelligence. London: Wiley and Sons.

PARISI, D., F. CECCONI and A. CERINI (1995) 'Kin-Directed Altruism and Attachment Behaviour in an Evolving Population of Neural Networks' in N. Gilbert and R. Conte (editors) Artificial Societies. London: UCL Press.

PATRICK, S., P. M. DORMAN and R. L. MARSH (1995) 'Simulating Correctional Disturbances: The Application of Organization Control Theory to Correctional Organizations via Computer Simulation', Simulating Societies, '95, Boca Raton, Florida.

RUMELHART, D. and J. McCLELLAND (1986) Parallel Distributed Processing, vols. I and II. Cambridge, MA: MIT Press.

SCHELLING, T. C. (1971) 'Dynamic Models of Segregation', Journal of Mathematical Sociology, vol. 1, pp. 143 - 186.

SIMON, H. A. (1957) Models of Man, Social and Rational. New York: Wiley.

WOLFRAM, S. (1986) Theory and Applications of Cellular Automata. Singapore: World Scientific.

ZIPF, G. K. (1949) Human Behaviour and the Principle of Least Effort. New York: Hafner.

Appendix

The following program, written in Common Lisp, was used to simulate the Zipf distribution and supplied the data shown in the fourth and seventh columns of Table 1.

;;;; Simulation of Lotka's Law
;;;
;;; Based on the exposition by H.A. Simon, On a class of skew distribution functions,
;;; in Models of Man (Wiley, 1957), Chapter 9, pp. 145-164
;;; (originally Biometrika, vol. 42, December 1955)
;;;
;;; Constants author-total and alpha are set to reproduce estimates for contributions to
;;; Econometrica in Table 3, column 9 of Simon (1957)
;;;
;;; Written in Common Lisp
;;;
;;; Nigel Gilbert January 14, 1996


(defparameter author-total 721) ;total number of authors
(defparameter alpha 41) ;percentage probability that a paper will 
 ; be published by a new author 


(defun lotka (bins)
  "Function to simulate the distribution of authors of scientific papers who publish
different numbers of papers in a journal over some period of time.
Args: BINS is a vector to hold the number of authors publishing 0 ... 11 or more
papers"

  (let ((published (make-array author-total :initial-element 0))
                            ; element i of the array holds the
                            ; number of papers published by author i
        (papers '())        ; list of published papers (actually a list of the
                            ; index numbers of the authors who wrote each paper)
        (npapers 0)         ; number of papers published so far
        (nauthors 0)        ; number of authors who have published at least one paper so far
        new                 ; index number of author of the next paper to be published
        bin)                ; index of the vector collecting the publication distribution

    (do ()
        ((= author-total nauthors)) ; go round the loop until we have
                                    ; created author-total authors

      ;; decide who will be the author of the next new paper
      (cond
        ;; it's a new author with probability alpha
        ;; or it is always a new author if this is the first ever paper
        ((or (< (random 100) alpha) (= npapers 0))
         ;; create new author
         (setq new nauthors)
         (incf nauthors))
        (t
         ;; old author
         ;; select a paper at random from those already published and set the author
         ;; to the author of that paper
         (setq new (nth (random npapers) papers))))

      ;; 'publish' this new paper (add it to the list of published papers)
      (setq papers (cons new papers))

      ;; and increment the number of published papers
      (incf npapers)
      ;; note that this author has published another paper. This is the end of the loop
      (incf (aref published new)))

    ;; obtain the distribution of the numbers of authors who have published x papers
    ;; any who have published 11 or more are put into the top bin

    (dotimes (a author-total)          ; for each author...
      (setq bin (aref published a))    ; get the number of papers this author
                                       ; has published
      (when (> bin 11) (setq bin 11))  ; if more than 11, set to 11
      ;; finally, increment the count of the number of authors
      ;; who have published this many papers
      (incf (aref bins bin)))))


(defun run (&optional (trials 10))
  "Run the simulated distribution TRIALS times and print out the average over the trials"

  (let ((bins (make-array 12 :initial-element 0)))

    (dotimes (i trials)   ; execute the lotka function TRIALS
      (lotka bins))       ; times, accumulating the results

    ;; print out the results, after dividing each by the number of trials to get the mean
    (format t "Averages of ~D trials: " trials)
    (dotimes (b 12)
      (format t "~D " (round (/ (aref bins b) trials))))))