Preservation of Digital Information

A new approach

Information may be seen as a valuable commodity. Where information is considered valuable, there will always be a need to preserve it.

For example, the British Library has undertaken a project to preserve up to 100 Web sites of historical and social importance. The Library intends to expand this to 10,000 sites and to take a half-yearly snapshot of the entire .uk domain, which accounted for nearly 25 million web pages as of 2002. Why? According to the Digital Preservation Society, the average web page disappears after approximately 60 days. Admittedly most of the information is junk, but some of it might be valuable.

Thus the need to preserve information is becoming an increasingly important issue for the librarian.

What is preservation?

In the old days, the word preservation was generally taken to mean the process of repairing "crumbling books in musty old libraries..." (1)

While this was once a common task of the librarian and archivist, say 50 years ago, we now define preservation as ensuring continuing access to information stored on any suitable storage media for as far into the future as possible.

As Patricia Battin, president of the Commission on Preservation and Access in Washington, defined it:

"...the strategies and actions necessary to provide access to the accumulated human record as far into the future as possible." (2)

There are currently at least five ways to store and preserve digital information.

1. The Human Brain

This may sound a little funny, but perhaps this is exactly what we need to do. After all, what happens if the unthinkable occurs?

For example, what if there is a power failure and people cannot access the digital information on computers? We could rely on books. But given the current state of world affairs, even books are not entirely safe. Suppose someone destroys all the books by fire? It would help society if people knew something of what was recorded in digital and book formats. In that way, information can be resurrected and preserved for future generations.

With this in mind, the idea of preserving information using the human brain does start to sound attractive.

There is one other benefit of converting digital information into analog form and storing it in the human brain. The brain has a remarkable ability to simplify the sheer volume of information society has created for itself into compact, easy-to-digest information.

Just think of the benefits in simplifying and storing essential information in the brain. It would reduce the demand for computers and paper to store and preserve any old information in any quantity, not to mention alleviate the massive physical storage problems associated with information stored in the paper format.

However, the brain does have a few disadvantages. Firstly, the brain is usually not that quick at storing a huge amount of digital information. Secondly, we cannot expect the brain to recall all that information accurately in the original form in which it was stored (the ability known as photographic memory). Why? Because the brain has a habit of selecting information, both under the influence of other information already in memory, called beliefs, and through inherent "blocks" in the nervous system that filter and simplify the information. And to make things just that little bit more difficult, the creative right side of the brain is constantly trying to find simpler and easier ways to understand the original information by reorganising it.

Unless we can develop powerful accelerated learning techniques to help the brain learn quickly, easily and accurately (and so preserve) the original information without the use of any other storage media, we are forever reliant on other technology for the solution to preserving and storing information in its exact form.

2. Paper

Paper is one of the oldest methods of preserving information. The method has been around for thousands of years and it is unlikely that it will change in the foreseeable future.

The advantage of paper over other storage systems is clear. Firstly, it is easy to present and read the information on paper without the need for special tools—unless you include your brain and eyes! Furthermore, you have a reasonably quick and accurate way of gathering the original information from paper without worrying about the creative mind distorting the information in tiny increments over time.

The disadvantages of paper, however, are that:

There is also one other disadvantage: updating information on paper is usually a laborious and expensive process, taking nearly 12 months to revise, print and distribute to the global market. Publishing on paper is only worthwhile if the information is of exceptionally high quality and stable enough for use in libraries over many years or decades.

Then there is the problem with today's mass-produced paper as a preservation medium. Paper that has gone through the manufacturing process often retains some acidic chemicals. Acids destroy paper, so it deteriorates over time until, after around 100 years, it becomes too fragile to handle.

3. Floppy disk

Floppy disks, on the other hand, have the advantages of being:

The disadvantages of floppy disks are:

4. CD-ROM

This has roughly the same sorts of advantages and disadvantages as a floppy disk, except that a CD-ROM can store the contents of approximately 440 floppy disks (between 650MB and 700MB of information), making it much more suitable for multimedia applications (i.e. storing high-quality pictures, sounds and movies).

CD-ROMs also have the advantage over floppy disks of not being easily damaged by normal magnetic fluctuations because information is stored as microscopic "depressions" or "holes" etched onto the reflective metal media.

Recently, DVD-format optical disks have superseded the CD-ROM. These disks can store between 2GB and 9.6GB of very high quality images, sounds and movies or, of greater benefit to everyone, fit the published contents of many more books and floppy disks onto a single disk.

Again the problem of any digital storage technology is the need for special tools (e.g. the computer) beyond the standard brain and eyes to read and update the digital information. Nevertheless once you have the computer, information on a CD can be easily accessed with great speed and accuracy in its original form and updated as required (by burning onto a new CD or using a CD-RW disk).

5. The Internet

The Internet is a bunch of computers all linked together via the telephone network, with each one storing information on magnetic (floppy and internal hard disk) and optical (CD) storage media. Hence the advantages and disadvantages of the Internet are essentially the same as those of floppy disks and CD-ROMs. The one slight additional advantage is that the same or similar information can be distributed and stored on many different computers around the world.

In other words, in the event of a disaster where part of the information on the Internet is lost, there is an excellent chance the information will be preserved on a different part of the network.
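The benefit of this redundancy can be made concrete with a little arithmetic. As a rough sketch only, assume each copy is lost independently of the others and with the same probability over some period. Both the independence and the 10% loss figure used below are illustrative assumptions, not measurements. The chance that at least one copy survives is then one minus the chance that every copy is lost.

```python
def survival_probability(loss_probability: float, copies: int) -> float:
    """Chance that at least one of `copies` independent copies survives.

    Assumes every copy is lost independently with the same probability,
    which is a simplification made for this sketch.
    """
    if not 0.0 <= loss_probability <= 1.0:
        raise ValueError("loss_probability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies are lost with probability p ** n, so at least one
    # survives with probability 1 - p ** n.
    return 1.0 - loss_probability ** copies


if __name__ == "__main__":
    # Hypothetical 10% chance of losing any single copy.
    for n in (1, 2, 5):
        print(n, survival_probability(0.10, n))
```

Under these assumptions each extra copy multiplies the chance of total loss by the same factor, which is why even a handful of widely scattered copies helps so much. In practice losses are rarely fully independent, so the real benefit will be smaller than this simple model suggests.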

Although the Internet can store literally thousands of gigabytes of information on any type of storage media in existence compared to the miserly 650MB on a CD, it does suffer one main problem: access to multimedia information can be painfully slow at certain times of the day and on different parts of the Internet network due to limited bandwidth.

Apart from that minor inconvenience, and if the quality of information on the Internet can be dramatically improved and easily located using quality search engine technology and accelerated learning techniques, the Internet has the potential of permanently storing and preserving valuable human knowledge in the digital format.

What is the best way to store digital information today?

Unless the bandwidth of the Internet improves and the difficulty of finding quality information is reduced, the best way of storing and preserving digital information today is still the CD.

There is an economic advantage as well: CDs are the most cost-efficient storage media available today.

Can digital information be preserved forever?

Preserving information on CDs on a permanent basis is, however, another question altogether. The likelihood of preserving digital information permanently depends on (i) the quality of the materials used to build a CD; (ii) whether the technology to read CDs will be available 100 years from now; and (iii) how regularly we duplicate CDs over time.

In the early days (i.e. the mid-1980s), the construction of CDs suffered from poor quality plastics that slowly reacted with the reflective (aluminium) media inside. This meant only one thing: it greatly reduced the lifespan of CDs for storing and preserving digital information, to around 10 years.

Nowadays, the plastics have been improved considerably together with built-in chemical dyes designed to reduce this problem. The best we can hope to achieve today when preserving digital information on CD is roughly 200 years (e.g. using the professional Kodak CD-R products).

However, the plastics still have one other inherent problem: they are too soft and scratch easily through normal use. Scratches degrade the light passing through the plastic to and from the reflective media, and thus reduce the lifespan of the CD. Unless plastics improve significantly in the coming years, we can expect digital information to be permanently preserved only by copying the data onto a new CD every 50 to 150 years or so. For regularly used CDs in a library, this may have to be done every 12 months.

But for information to be preserved forever, there must be enough CD copies of the original stored away for comparison and recreation of a new CD copy at a later date.

Nature does it too, and why shouldn't we do the same with CDs?

This copying process may look like a pain in the arse for some librarians and archivists, but this is exactly what goes on in the natural world and is something that we must all do.

For example, when we look inside our living cells, we notice the existence of a special macromolecule called deoxyribonucleic acid (DNA). This molecule is vital to human life for it is designed to store genetic information. Furthermore, it has also "learnt" to replicate itself regularly and incessantly in order to preserve its genetic information.

So we should not be surprised by the fact that CDs will probably have to be regularly copied onto new CDs to ensure digital information is permanently preserved.

But technology changes too quickly

There is a concern among librarians and archivists that preserving information in digital form using CDs will be a complete waste of time because technology changes too quickly. Who knows what we might be using tomorrow? Today we use CDs, but tomorrow we may be using something else.

As Battin said:

"Perhaps the most sobering consideration of all for contemporary and future librarians and archivists is the extraordinarily short life cycles of the new technologies. We will not have the luxury enjoyed by our predecessors of benign neglect made possible by the fact that the life cycle of acid paper, despite its fragility, outlasted the career of the individual librarian or archivist." (3)

Is this concern properly justified?

As nature has discovered, no physical media can remain in its original state forever because of the constant wear n' tear imposed on it from the environment. Even when we do not do anything to the media, the universe is constantly bombarding every physical material with radiation, bacteria and various chemicals in the air and elsewhere.

While we try our hardest to minimise this wear n' tear through effective preservation techniques, the ultimate aim for the librarian and archivist is to ensure the information is accessible to everyone.

This is the purpose of information; it has to be used (i.e. accessed). And that means the media for storing that information will always be subject to some kind of wear n' tear. Yes, new materials will become available in the future to extend the lifespan of these storage media. In addition, how well the media is protected when not in use is just as critical to maximising the lifecycle of the media as is allowing people to access the media for its information. Yet, the time will come when the information must be copied onto a new high-quality media.

Now when the time comes to copy the information onto a new storage disk, it should not be difficult to transfer the information across. It is not like taking text and pictures from a printed page and digitising them onto a disk. Once the information is in digital format, and so long as librarians and archivists still have a machine to read the CDs, there will be programs available to duplicate the CD contents easily and transfer them onto the new storage disk of the future.

It won't matter what kind of technology comes along in the future. It will still perform the same simple process of storing and retrieving digital information. The only difference will be in the speed and capacity of the new technology, which will be of tremendous benefit to the librarian/archivist.

And given that storage media will improve dramatically over time, with much greater storage capacities, copying all the digital information at a later date will become easier and easier, even though the quantity of information being stored is itself increasing dramatically.

Thus it is immaterial what kind of high-quality digital technology we may use today or in the near future. What is more important is the ability to copy easily and quickly the information in an accurate way (i.e. without loss of data) onto a new storage media (and thus a new technology).

This is the crucial point. If information is stored on paper, it takes too long to copy, and there is a risk of losing information with each copy made. But if information is in digital form and is copied early enough, before the media is irreparably damaged, it can be transferred very quickly and easily with virtually no loss of the original data, especially if you follow a few simple rules when copying the data.
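The rules themselves are not spelled out here, but one widely used safeguard is to check every copy against the original before trusting it. The following is a minimal sketch of that idea in Python, assuming the data has already been read off the disk into an ordinary file. The function names `sha256_of` and `copy_and_verify` are hypothetical names chosen for this illustration, and SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, destination: Path) -> str:
    """Copy `source` to `destination` and confirm the bytes match.

    Returns the digest so it can be stored alongside the copy and
    checked again at the next migration.
    """
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise IOError(f"copy of {source} does not match the original")
    return expected
```

Keeping the returned digest with each archived copy means that a future librarian can tell whether a disk has silently degraded without needing the original to hand.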

It takes too long to copy digital information

Some librarians and archivists may believe it takes too long to copy today's digital information onto the new storage media of the future. We should remember that future storage media will carry a lot more information. But more information does not necessarily translate into more time spent copying. As technology and storage media improve, the copying process will get easier and quicker. Eventually a time will come when all the digital information in the world can be copied and stored onto a single storage media.

For example, the task of copying say 1000 different CD titles in a library may take a month using present-day technologies. But what happens when a new type of disk becomes available that can store all of these CDs on a single disk? Copying that disk onto another of the same type will be a whole lot easier for everyone, perhaps taking only 5 minutes to complete using the new technology!

Consider the next frontier in information storage systems, the holographic disk. This is a disk that can store literally hundreds of gigabytes of information. Now imagine how long it would take to duplicate one or two of these disks (without compression) in order to preserve all the information held on 1000 CDs?
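The "one or two disks" figure is easy to check with some back-of-the-envelope arithmetic. The sketch below takes the 650MB per CD quoted earlier and treats the holographic disk capacity and the transfer rate as hypothetical parameters. The 400GB and 100MB per second values are illustrative assumptions only, since no particular product is specified.

```python
import math

CD_CAPACITY_MB = 650  # per-disk figure quoted earlier in the article


def disks_needed(cd_count: int, disk_capacity_gb: float) -> int:
    """Number of larger disks needed to hold `cd_count` full CDs."""
    total_mb = cd_count * CD_CAPACITY_MB
    # 1 GB is taken as 1000 MB here.
    return math.ceil(total_mb / (disk_capacity_gb * 1000))


def copy_hours(cd_count: int, rate_mb_per_s: float) -> float:
    """Hours needed to stream the contents of `cd_count` full CDs."""
    total_mb = cd_count * CD_CAPACITY_MB
    return total_mb / rate_mb_per_s / 3600


if __name__ == "__main__":
    # 400 GB per disk and 100 MB/s are hypothetical figures.
    print(disks_needed(1000, 400))
    print(copy_hours(1000, 100))
```

A thousand full CDs come to about 650GB, so the number of large disks needed depends only on their capacity, and the copying time depends only on how fast the drive can move the data. Changing the two hypothetical parameters shows how quickly the task shrinks as technology improves.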

The critical thing to remember is that once the information is in the digital form, it should be quick and painless to replicate the information. And it should get easier to copy and preserve that information over time as new copying technologies become available and new storage media become able to store a much greater amount of digital information than is possible today.

How should digital information on CDs be preserved?

Using today's CD technology, the best technique for preserving digital information is as follows:

  1. Make as many duplicate CD copies of the original CD as is considered reasonable for preserving the digital information;
  2. Distribute and archive the duplicate CD copies in different libraries throughout the world;
  3. Have each library make as many further copies as it considers reasonable to give the general public access to the information;
  4. When the time comes to make a brand new original CD after 50 to 150 years (depending on the level of wear n' tear and the quality of the materials used by manufacturers to construct the original CD), bring together all the archived duplicate CDs from around the world into one central location (this could be done electronically and automatically over the Internet with the help of software and computers);
  5. Rebuild the digital information onto a brand new original CD by having software compare the old original and all the old duplicate CDs and choose the most common 0 or 1 at each position for storing onto the new CD. If necessary, the software should read the 0's and 1's on each CD at least 30 times and choose the best information. This will greatly improve the chances of preserving the original digital information as it was recorded 50 to 150 years earlier;
  6. Once we have "recreated" our brand new CD with the "rebuilt" digital information stored on it, repeat from step 1 to give the original digital information the best chance of being preserved for at least another 50 to 150 years (or possibly longer given the likelihood of having much higher quality digital storage disks in the future).
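The comparison in step 5 amounts to a majority vote on every bit. The following is a minimal sketch of that idea in Python, assuming the contents of each CD have already been read into memory as byte strings of equal length. The function name `majority_rebuild` is hypothetical, and the rule for breaking ties (a tie becomes 0) is an arbitrary choice made for this illustration.

```python
from typing import Sequence


def majority_rebuild(copies: Sequence[bytes]) -> bytes:
    """Rebuild data by keeping the most common bit at each position.

    All copies must be the same length. A tie between 0 and 1 is
    resolved as 0, which is an arbitrary choice for this sketch.
    """
    if not copies:
        raise ValueError("at least one copy is required")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("all copies must be the same length")

    rebuilt = bytearray(length)
    for i in range(length):
        byte = 0
        for bit in range(8):
            mask = 1 << bit
            # Count how many copies have this bit set to 1.
            ones = sum(1 for copy in copies if copy[i] & mask)
            if ones * 2 > len(copies):
                byte |= mask
        rebuilt[i] = byte
    return bytes(rebuilt)


if __name__ == "__main__":
    original = b"archive"
    # Two copies, each damaged in a different place.
    damaged_a = bytes([original[0] ^ 0x01]) + original[1:]
    damaged_b = original[:3] + bytes([original[3] ^ 0x80]) + original[4:]
    print(majority_rebuild([original, damaged_a, damaged_b]) == original)
```

Repeated reads of the same CD can be fed in as extra entries in `copies`, which is one way to handle the suggestion of reading each disk many times. Using an odd number of copies avoids ties altogether. The vote only recovers the original where damage to different copies falls in different places, which is exactly why the copies should be stored as far apart as possible.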

To understand this copying process better, think of a human being as being like a CD on which the person's genetic code is recorded millions of times. If enough scratches and other natural changes accumulate as the CD ages, you cannot expect to reproduce the genetic code accurately from one single copy of the DNA stored on the CD, because that particular part of the CD could be damaged. You have to compare it with the genetic code in all the other copies of the DNA stored on different parts of the CD to have a hope of accurately "recreating" the original genetic code as it was when the CD was produced and recorded. And for even better accuracy, a number of CDs should be reproduced and distributed as widely as possible so that, when they are brought together, a new CD can be recreated with the best and most accurate information obtained by comparing all of them. It is this new "recreated" information that gets recorded onto a brand new CD of the future.

The same principle applies in biology. For example, if you try to clone a living organism using any old DNA molecule contained in just one of its own cells (i.e. inject the cell into a genetically-sterile ovum supplied from the same or a different organism and pass electricity through the ovum to "kick-start" the process of getting the nucleus to split apart and produce a foetus), there would be no other DNA to compare it with to ensure that the genetic code is accurate and that the best information is transferred to the cloned organism.

Who knows? Perhaps the DNA may already have acquired significant damage over the lifetime of the living organism. Without "checks and balances" in place to ensure the DNA is of a high quality, it would be difficult to be sure the cloning process is successful based on this one and potentially "old" DNA. In other words, there is a risk of genetic mutations being carried over to the offspring to produce abnormalities and health problems in this cloned version of the organism.

However, if the DNA from two relatively different organisms of the same or a very similar species is mixed, there will be enough genetic information to compare in order to create a high-quality offspring. In fact, with DNA from two such organisms, not only will the essential genes be reproduced accurately to make a healthy organism of at least the same quality as the parents, but there is an excellent chance the offspring will carry some favourable improvements in the genes from one or both parents.

Digital information will not guarantee the preservation of all recorded information available today

Despite the great promise of extremely long digital preservation using the above method, digital information and the technology for creating it today will not be able to preserve everything we have in its exact form for all eternity.

Why? Simply because most of the information available today is not in the digital form and there is too much of it to be converted into digital information. Then there is the scarcity of resources to handle the task of preserving digitally all this information.

In essence, the high costs, the limited time to convert non-digital information to digital form, the insufficient manpower, and the sheer volume of non-digital information currently stored in libraries around the world will put an end to this all-out and encompassing preservation strategy.

The only effective preservation strategy for the 21st century will be the selective preservation strategy. Selective preservation strategy simply means asking ourselves the important question, "What can we save?"

As Battin said:

"It was apparent from the beginning that, since we could not save everything, the critical issue in preservation strategy would be selection." (4)

From this, an effective preservation strategy will be developed whereby we know precisely which information is most important to preserve, and for how long, before embarking on the time-consuming and expensive task of preserving it, perhaps in digital form.