Storing Knowledge
by Doug Carlston
I. Summary
II. Why Do We Preserve Knowledge?
III. The Problems with Preserving Knowledge
IV. Why We Digitize Anyway
V. Where Are the Real Problems?
I. Summary
Content, data, information -- all are different terms for knowledge. Humans store knowledge because we benefit as a species from the ability to communicate information over time and space with accuracy. As our dependence on communicated knowledge has grown, certain problems, such as developing uniform standards for conveying knowledge, maintaining the integrity of information, and retrieving specific knowledge from a storage or transmission system, have become more intractable.
Digital devices have offered the promise of greatly facilitating both the maintenance of informational integrity and the capacity for information retrieval. However, digital storage is new. As a result, bit interpretation standards are not yet fully evolved. Most media sacrifice physical longevity for capacity. And in the area of process knowledge, such as computer programs and algorithms, interpretive hardware devices are also largely unstandardized. In short, we have three problems to solve:
- the longevity of the medium on which information is to be stored;
- the longevity of the hardware systems that permit us to perceive the information on the medium; and
- the longevity of the systems that let us interpret the information that we perceive.
There is actually a fourth problem that may prove even bigger than these, and that is simply one of data management: we are creating so much archival content that the task of sorting, collating, renewing and preserving the data is proving increasingly problematic, as long as human intervention is required for these activities. It is also expensive, and any solution to the problem must take into account the questions of who pays and who benefits.
II. Why Do We Preserve Knowledge?
A part of the human competitive advantage lies in our ability to communicate. That skill permits us to act cooperatively, with far greater effectiveness than any individual could achieve acting alone. We seek to pass on our knowledge, too, by teaching our young so that they can benefit from our experience. As our notions of community expand, it becomes increasingly necessary to find ways to communicate over ever-increasing gaps of distance and time.
Writing and drawing permitted information to be forwarded over great distances through chains of people with less degradation in accuracy. In fact, most information that has been preserved over long periods of time and space survives because the knowledge is recorded in some fashion, copied and broadly disseminated. Copying introduces its own problems, of course. Degradation or "improvement" of the original content is common, and the original meaning may frequently be changed or lost altogether.
One could argue that, although there might be some benefit to the individual in passing information over expanses of space, passing information far into the future conveys no possible benefit to him and is, therefore, unlikely behavior. It sounds too altruistic. However, William Hamilton's work on a genetic basis for altruism illustrates that altruistic behavior makes sense to the extent that it supports descendants and community and thus increases the survivability of one's DNA. Knowledge sharing is thus genetically selected for. It is cooperative behavior which "rewards" one's DNA, at least to the extent that it is shared within one's kin group. As we talk about individuals and groups which "benefit" from activities in this paper, we will include this kind of genetically-motivated altruism in the notion of benefit.
This argues that the value of information sharing is greatest over the shortest extents of space and time. However, we have seen repeated instances where human cultures have suffered severely from a failure to acquire or preserve knowledge at a farther reach. Chinese emperors who burned their navies to prevent the inflow of external ideas saw their societies suffer in the long run as a result. The loss by various human societies of their hard-acquired special knowledge, from Damascus steel to Mayan astronomy, arguably factored into their failure to recover their previous vibrancy and strength.
It is important to consider whether we are more concerned with forms of information preservation that depend upon continuous human involvement in the preservation process or with forms that explicitly anticipate a cultural discontinuity that makes such dependency unlikely, for the solutions are likely to be quite dissimilar. It is equally important to consider the economics of preservation, for economic considerations make explicit the patterns of effort and benefit that are all too frequently buried in popular rhetoric. A process that depends upon continuous human effort must also entail continuous perceived benefit.
III. The Problems with Preserving Knowledge
The ability to communicate over large expanses of space and time gradually altered the human condition. Governance of great empires became possible. Task specialization became increasingly possible, as people became able to depend upon the knowledge-building capacities of others. We came, in time, to depend upon external knowledge, and this created, in turn, a set of needs tied directly to that dependence.
Recording systems, such as alphabets and numerical conventions, emerged and began to compete. These systems made possible currencies and non-barter trade, codification of laws and rules, literature and the arts - in short, most of the elements of complex society. Empires depended upon their ability to administer huge territories, to react swiftly to threats, and to encourage commerce. Record keeping became integral to their economic and political systems.
Just as the printing press made possible widespread literacy, new technologies have made it increasingly possible to create and store data. This process continues at an exponential rate of expansion. It is estimated that we have created and stored since 1945 one hundred times as much information as we did in all of human history up until that time! Of course, in order for that data to have any value, it must be possible to search and sift through it with relative ease, to show it to intended parties and secure it from unintended ones, and to save it for as long as it is useful without degradation in either its legibility or comprehensibility. These have always been the concerns with information -- the problems, however, demand new solutions as the scale of the operation magnifies beyond what was previously conceived.
What was previously done by hand must be automated. Searching for data, re-recording data to preserve or transfer it, encrypting information to protect it -- all these processes have come to depend upon automation because of the volume of information traffic. But in order to automate these processes, information must be in a form that machines can handle. In the Fifties and Sixties researchers experimented with both analog and digital forms of information processors. For reasons that lie beyond the scope of this paper, the digital devices largely won out, and today most data must be encoded in digital form to be acted upon by information processing devices.
IV. Why We Digitize Anyway
Machines can copy and transmit images and sounds in many different ways. Cameras are over 150 years old. Voice recordings date from the turn of the century, as do telephonic and radio transmissions. None of these innovations depended upon prior conversion of content into a digital form. However, all suffer from similar limitations. Repeated copying of the transmission quickly renders the content incomprehensible. The content is largely unsearchable (would that, even today, we could search voicemail for that message we really want to listen to!). And, in general, the information is hard to compress without losing quality, so it tends to take up a great deal of space (at least relative to storage of the same information in digital format on the same medium).
Enter the IBM punch card. Suddenly we have a format we can search. Copies are exact and never degrade (although the paper medium of the card certainly does!). The key to it all is that the information has to be encoded in a form that a machine can deal with -- a digital format. However, that encoding introduces some new problems even as it makes some of our original concerns more tractable.
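The claim that digital copies are exact is easy to verify mechanically. Here is a minimal sketch (in Python, standing in for any digital copying process; an illustration only, not part of any archival system) showing that a tenth-generation copy remains bit-for-bit identical to the original:

    import hashlib

    original = b"Some archival record worth keeping."

    # Make a chain of ten successive copies, each taken from the previous
    # copy rather than from the original -- the digital counterpart of
    # dubbing a tape from a tape from a tape.
    generation = original
    for _ in range(10):
        generation = bytes(bytearray(generation))  # a fresh byte-for-byte copy

    # Unlike a tenth-generation analog dub, the result is bit-identical.
    assert hashlib.sha256(generation).digest() == hashlib.sha256(original).digest()
    print("10th-generation copy is exact")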
Digital content is easier to store, search, encrypt, and compress than its analog equivalent, at least if you have a computer. It is harder for a human to interpret, for it introduces several additional steps into the interpretation process. Now it is not sufficient to: 1) recognize the alphabet used to convey information and 2) speak the language encoded in those alphabetical characters. One must also: 3) be able to discern, or possess a device that is able to discern, the 1s and 0s on the storage medium, and 4) know, or possess a device that "knows," the encoding algorithm that translates the series of 1s and 0s on the medium into a taxonomy that one can interpret. In other words, one is now dependent upon a machine to perceive and interpret the stored information.
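A small sketch in Python makes the dependency in steps 3 and 4 concrete. The byte values below are arbitrary examples, not drawn from any real document; the point is that the same perceptible bits yield different text depending on which encoding convention the interpreting device assumes:

    # The same nine bytes, interpreted under three different conventions.
    data = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x2C, 0x20, 0x93, 0x94])

    # Step 3: perceiving the 1s and 0s is easy once a device can read them.
    print(data.hex())                        # 48656c6c6f2c209394

    # Step 4: interpreting them requires knowing the encoding algorithm.
    print(data.decode("cp1252"))             # Hello, “”  (Windows curly quotes)
    print(data.decode("mac_roman"))          # Hello, ìî  (same bits, other glyphs)
    print(data.decode("ascii", errors="replace"))  # the high bytes are unmappable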
Now, in addition to previous concerns, one has to worry about whether that device will remain in proper working order into the future or whether similarly purposed devices in the future will be able to read and interpret the same data upon the same media as the current device does. In short, we now have three important and interconnected classes of concerns connected to the long-term storage and retrieval of information: 1) the longevity of the medium on which information is to be stored; 2) the longevity of the systems that permit us to perceive the information on the medium; and 3) the longevity of the systems that permit us to interpret the information that we perceive.
V. Where Are the Real Problems?
A) The Longevity of Media
Let's look at each of these groups of problems separately first. The first group centers on the media upon which or within which we store information, and at first glance it would appear that this is an area in which we are doing a far worse job of preserving our intellectual heritage than did our forebears.
As we move from clay tablets to papyrus to paper to cellulose to magnetic tape to optical plastic, we move to increasingly ephemeral and fragile forms of storage. Many films and most magnetic tapes that are more than about 25 years old are too degraded to be viewed or heard. Most computer code that is ten years old is on unreadably degraded media (and, moreover, usually recorded in a manner unintelligible to currently available computer devices). Libraries whose old paper card catalogues served them for 100 years find that microfiche catalogues wear out in a fraction of that time. We seem in many ways to be moving backwards, toward increasingly temporary and fragile systems of information preservation.
On the other hand, modern technology has made it increasingly easy and inexpensive to copy information. Instead of a monk devoting his life to the careful recopying of a valued manuscript, a Xerox machine permits one to achieve the same result with a fraction of the time and effort. As a result, the natural culling process by which a work was carefully evaluated before being deemed worthy of copying (thus forcing it to be subject to occasional analysis of its perceived value) is partly eliminated. Instead, the information is increasingly as likely to become lost in a blizzard of content as it is to become lost through the deterioration of the media itself.
Add to this the further innovation of digitization, which permits vastly improved searching; link that with a global information network that permits individuals worldwide to search for, retrieve, and copy information they deem to be of value; and the ephemeral quality of the media seems of less concern -- at least as long as the information is likely to be of continuing benefit to some group of people in this network, as long as that group is permitted to maintain the content, and as long as some society to which some of these people belong is able to maintain technical continuity. After all, earlier forms of storage, such as analog tape and paper, do continue to exist -- it's simply a matter of cost and time to archive information both ways.
The three conditions above are not insignificant, however. It is not at all uncommon for that which was believed worthless to be greatly valued at a point in the future. With respect to the second condition, history is replete with instances of intentional destruction of information which is deemed inconsistent with or inimical to a particular religious, cultural or ideological viewpoint. The heterogeneity of a global networked society may be helpful in some respects, but that same global society may fail to protect the knowledge of groups whose cultural or linguistic traditions are not part of the emerging global standards. Nor is it at all certain that such groups would belong to, or be able to draw on the skills of, whatever source of technical continuity is needed to preserve their information. In other words, these three conditions make it almost inevitable that some valued information will be lost. Creating and storing content on a more permanent storage medium such as paper may be a necessary condition for the long-term survival of most content, although it is certainly not a sufficient condition by itself.
B) The Longevity of Perception Systems
As long as humans were able to use their built-in perception systems to observe stored information, no issue of an appropriate standard was ever raised -- our standards were hard-wired into our physical selves. But the moment Thomas Alva Edison created his cylindrical audio recording, we began to become dependent upon mechanical devices to transmute information into auditory or visual signals that we were capable of hearing or seeing (whether we could understand those perceptible signals or not).
This created a real problem for future receipt of all kinds of stored content, not just digital content. Finding devices to play 33 1/3 rpm vinyl records is becoming increasingly difficult. Old film formats are equally hard to view. Just as 5 1/4 inch floppy disks are now archaisms of digital content, many analog storage media suffer from incompatibility with modern electronic devices.
In most cases, it is possible, as long as the original media are still functional, to have the content copied from the archaic media format onto new media formatted for a modern device. This is often an expensive proposition, however, and is usually performed by specialized agencies that maintain legacy systems and that have built highly original conversion facilities. That these kinds of operations are still largely small, specialized businesses suggests that many people feel that the historical content in their possession is not worth the cost of keeping currently readable. It might be worthwhile investigating the cost of developing hardware devices capable of reading (if not interpreting) the broadest range of media, if only to permit more widespread conversion of data in archaic formats.
The further constraints discussed at the end of the previous section apply equally here, with the additional caveat that a perception device will not be appreciated in the abstract, but only for the kinds and quantity of data whose perception it makes possible. In brief, no data means no value, and, shortly thereafter, no such device. Of course, generalized analytical tools should always be capable of detecting most forms of inscription, so there will always remain the possibility of re-inventing the appropriate device, should the need for it become evident only in the future. The issue is one of cost: the absence of a readily available device requires the reinvention and construction of a one-off, which creates a higher threshold of needed value in the data to justify the expense.
C) The Longevity of Interpretation Systems
Even where the storage media have endured and the information has been recorded in a perceptible manner, all too often the meaning of the data escapes us. Whether we are talking about Incan knot records (quipus), ancient cuneiform writings, or computer programs written in IBM 1401 Autocoder, we frequently fail to possess the keys to unlock a real understanding of the meaning of stored data.
However, the picture is not uniformly bleak. Many interpretation systems are quite new and are an integral part of the explosion of stored data which we are experiencing. It will take time to discover which will become standards, widely comprehensible over the eons. Yet, in at least one area, great progress has been made.
In the case of human language, the picture is bright. We are well on our way toward a global standard: English. Not only is it now spoken by a higher percentage of the globe's population than has ever spoken a single language before; it also has the virtue of being highly stable, having remained consistently comprehensible for 500 years (which is unusual among languages). Change in language slows as the means of fixing it increase (i.e., books, films, recordings), and the dominance of English in global film and song is probably as important a factor in its standardization as the breadth of its use and the depth of its literature.
But the problem takes on a whole new dimension when we talk about process information - content that consists of a series of steps or instructions for a particular device to enact. Computer programs for obsolete computers may or may not be written in archaic computer languages - in either event, reproducing what the obsolete machine would have done when fed the content is a challenge of sometimes daunting proportions.
Yet with increasing frequency our content consists of or contains process information. If one viewed this paper as a word processor file, not in its ones-and-zeros form, but already transliterated into its English alphanumeric form (but before a computer had processed the file), one would see, before the reasonably comprehensible words which you are now reading, a great deal of apparent gibberish. That gibberish consists of a series of instructions to the word processing program telling it which type fonts and sizes are being used, what margin settings to use, kerning settings, paragraph standards, and so forth. If this word processor program were lost or modified (as it has been almost every year), it would not be easily possible to deduce the exact form and layout of this paper, even if most of the thoughts could still be fairly readily perceived. This is a fairly trivial example, since the layout of this paper is not of great importance, but the point should be clear: process information is everywhere and, with increasing frequency, it will not be possible to perceive the full expression of the content-creator's intent if the ability to interpret the process information is lost.
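By way of illustration, here is a rough sketch, in Python, of what salvage looks like once the interpreting program is gone: scan the opaque file for runs of printable characters, much as the Unix strings utility does (the filename below is hypothetical). The prose fragments survive; the process information around them is discarded as gibberish.

    import string

    # Bytes we are willing to call "text" (printable ASCII, minus form feeds).
    PRINTABLE = set(string.printable.encode()) - set(b"\x0b\x0c")

    def extract_text(path, min_run=4):
        """Recover runs of printable characters from an opaque binary file,
        much as the Unix `strings` utility does. The words survive; the
        font, margin, and kerning instructions around them do not."""
        with open(path, "rb") as f:
            blob = f.read()
        runs, current = [], bytearray()
        for byte in blob:
            if byte in PRINTABLE:
                current.append(byte)
            else:
                if len(current) >= min_run:
                    runs.append(current.decode("ascii"))
                current = bytearray()
        if len(current) >= min_run:
            runs.append(current.decode("ascii"))
        return runs

    # "paper.doc" is a hypothetical file in an obsolete word-processor format.
    for fragment in extract_text("paper.doc"):
        print(fragment)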
Imagine, if you will, that we are talking about process content that represents the instructions for building a virtual space and populating it with still and animated images tied to sounds. Even if one could disambiguate the various data forms and figure out what was image, what was sound, and what was descriptive code, the author's expression is virtually impossible to deduce absent its interpretation via his original processing device. If in the future it becomes common to create digital wire models of complex inventions and other devices in lieu of written words, we will have an entire body of obviously important process data held hostage to its original interpretation device.
Perhaps, in these areas, we just have to give it time. We do seem to have some movement toward standards: numerical bits have been translated in a reasonably consistent way into numerals and letters of the Roman alphabet (and others), a necessary first step toward a process Rosetta Stone. And there appears to be a compelling universal interest in standardizing the operating systems and chief applications of commonly available computers, although these standards themselves continue to evolve at a hazardous rate. Perhaps this process will not continue indefinitely, in which case we are confronting merely an interim problem while the universal standards are finally worked out.
D) The Management Issue
One of the reasons that valued information is increasingly converted to digital form is simply so that it can be treated mechanically rather than manually. As we mentioned earlier, digital data is far easier to search, compress, encrypt, copy, and disseminate by automated means. Without automated data processors, the explosion of data in the last 50 years would have created an insupportable burden on the human population; we would necessarily have had to abandon some of it simply because the cost of managing the content was beyond our administrative capacity.
However, automation does not really solve the administration problem. It just moves the yardsticks, permitting a single person to handle a larger territory (and, incidentally, changing popular expectations about what kinds of information can and should be preserved). Even if all information were intended by its creators and by authorities to be publicly and freely available (which is patently not the case), the task of creating an environment that permits an at-will link between any record and any individual is non-trivial. That may, indeed, be the ultimate result of the Internet revolution, and as search engines and intelligent agents are rapidly refined, the realization of this vision becomes plausible. After all, we are talking about a process in its infancy.
But most information is not available for free, at least at inception. It is often kept in confidence until that point in time when its owner perceives that it has no further proprietary value. What happens to it then is the subject of the next section, but let's focus for the time being simply on the question of the management of proprietary knowledge.
Understand that our beginning hypothesis was that humans preserved information in order to gain a personal benefit. There are forms of personal benefit that would describe behaviors in which one gave away proprietary knowledge (either to be seen as generous or out of a personal perception that the general public benefit would also be beneficial to oneself and one's descendants). Assuming, however, that there are many instances in which individuals do not see benefit in giving away the information they possess (banned books, personal tax records, next quarter's earnings, secret police files, what really happened in Nanking, etc.), the question of how to preserve information even for the limited purposes of the current possessor poses administrative headaches that have already produced calamities of the first order.
Critical IRS and Social Security records have been lost or destroyed, valued books have been lost within libraries due to misshelving, and commercially valuable film archives have been damaged beyond repair by the simple passage of time. Even if valued content is stored in non-digital, longer-term formats, indices into that content must be kept in digital form to aid retrieval and multi-level searches. These indices must not only be constantly renewed; they must also be "exercised" (i.e., used, with the results compared against the request, to ensure continued integrity of both content and index). In the long run, however, much data will either be "put out to pasture," without review or renewal, or else it will be editorially sorted, with the valued preserved and the unvalued discarded.
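What "exercising" might look like in practice is sketched below, again in Python. The manifest format and filenames here are assumptions for illustration, not an established archival standard: each stored object is re-read on a schedule, its checksum is recomputed and compared with the value recorded at ingest, and missing or corrupted items are flagged for renewal.

    import hashlib, json

    def fixity(path):
        """Recompute a SHA-256 checksum by actually re-reading the stored object."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def exercise_index(manifest_path):
        """'Exercise' an index: re-read every object it points to and compare
        against the checksum recorded at ingest. The manifest format used
        here ({path: sha256, ...}) is hypothetical."""
        with open(manifest_path) as f:
            manifest = json.load(f)
        for path, recorded in manifest.items():
            try:
                ok = fixity(path) == recorded
            except FileNotFoundError:
                print("MISSING  " + path)   # the index points at lost content
                continue
            print(("OK       " if ok else "CORRUPT  ") + path)

    # exercise_index("archive-manifest.json")   # hypothetical manifest file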
E) The Cost of Data
The real problem of long-term information storage, whether intermediated or not, however, is one of cost. The future cannot make payments to the past (if it could, I have no doubt that some scoundrel would abscond with the net present value of all future sources of revenue), so the value of content to individuals in the future is irrelevant; what matters is the value people currently attach to the preservation of information for the future.
This is not a hopeless proposition. Proprietary information, as we noted above, will be preserved for as long as the proprietor considers that it has value to him. After that, if it is in his exclusive possession, he will put it into the public domain provided he does not consider that doing so harms him, and either he is required to do so or he believes that the information will be publicly valued and that he will garner some benefit from that value. If it is not in his exclusive possession, upon the expiration of his ownership, it will automatically become part of the public domain.
Cultures that pride themselves upon their accumulation of knowledge from the past should be equally assiduous about preserving new knowledge for the future. It is part of the culture's statement of worth, a kind of conspicuous consumption of financial resources intended to display to all the cultural value of the society vis-à-vis the rest of the world. (The pyramids, which seem similar in intent on the surface, differ in that they were primarily a statement to the world of the dead.) Further, to the extent that a society has grown and prospered due to its reliance on and use of knowledge and the information sciences, knowledge and its preservation might be generically protected out of a general sense that information is good stuff and just might come in useful in some unknown way in the future.
The question is for how long. Hamilton's work would suggest that the value of altruistic acts declines with each succeeding generation, so we should be far more interested in short-term than long-term information use and preservation. Short of the survivalist's instinct to bury away important information for the rainy day of Armageddon, it will be easiest to find resources to pay for the indefinite preservation of content that represents what people currently believe to be defining works of human greatness. What about the rest, however? What about the Van Goghs or Franz Kafkas, whose genius was unrecognized in their own time? And what about more mundane information, such as genome records for existing species, rainfall records, or popular music?
This is where the economics of storage get important. One solution would be to "tag" content with the financial resources to preserve and maintain it indefinitely into the future. In other words, find a vehicle for individuals to donate money to preserve particular content, and develop a methodology that uses the best of non-digital backup techniques along with digital preservation and indexing, and a program for refreshing and "testing" the content on a regular basis into the future. This is not the lowest-cost road, and we may, in the end, have to make value judgments about the level of protection we can afford to give much content. Multiple tracks with separate pricing structures might permit preservation donors to make choices that fit their pocketbooks.
Regardless of the exact approach that is eventually chosen, we should not lose sight of the fact that maintaining continuity of content is the ultimate goal, not just maintaining digital continuity, and best practices in most cases will involve preservation across a variety of formats, including hard copy or analog magnetic storage as well as digital.